<a href="https://colab.research.google.com/github/uditmanav17/CoreySchafer/blob/master/Pandas/Pandas_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Index
* [Pandas Coding Snippets](#Pandas_Coding_Snippets)
    * [Part 1: Loading Data](#video-01)
    * [Part 2: DataFrame and Series Objects](#video-02)
    * [Part 3: Indexes](#video-03)
    * [Part 4: Filtering data from Dataframe and series objects](#video-04)
    * [Part 5: Updating rows and columns, modifying data within Dataframe](#video-05)
        * [Apply](#apply)
        * [ApplyMap](#applymap)
        * [Map](#map)
        * [Replace](#replace)
    * [Part 6: Adding and Removing Rows and Columns from DataFrame](#video-06)
    * [Part 7: Sorting Data](#video-07)
    * [Part 8: Grouping, Aggregating, Analysing and Exploring Data](#video-08)
    * [Part 9: Cleaning Data - Casting Data Types and Handling Missing Values](#video-09)
    * [Part 10: Working with Dates and Time Series Data](#video-10)
    * [Part 11: Reading/Writing Data to Different Sources - Excel, JSON, SQL, Etc](#video-11)


# Pandas Coding Snippets <a name='Pandas_Coding_Snippets'></a>
This section contains all of the code snippets we need to understand the basics of __pandas__.

In [0]:
import pandas as pd

## Part 1: Loading Data <a name='video-01'></a>
Check out the real-world examples notebook.


## Part 2: DataFrame and Series Objects <a name='video-02'></a>
df.value_counts(), df.loc, df.iloc

In [0]:
# Creating Data Frame from dict of list
people = {
    'first' : ['A', 'B', 'C', 'Udit'], 
    'last' : ['X', 'Y', 'Z', 'Manav'],
    'mail' : ['a.x@mail.com', 'b.y@mail.com', 'c.z@mail.com', 'udit.manav@gmail.com'],
    'response' : ['Yes', 'No', 'No', 'Yes']
}
df = pd.DataFrame(people)
df

Unnamed: 0,first,last,mail,response
0,A,X,a.x@mail.com,Yes
1,B,Y,b.y@mail.com,No
2,C,Z,c.z@mail.com,No
3,Udit,Manav,udit.manav@gmail.com,Yes


In [0]:
# count the number of responses 
df['response'].value_counts()

No     2
Yes    2
Name: response, dtype: int64

In [0]:
print(df['mail'])
print(type(df['mail']))  # Every column of a DataFrame is a Series object

0            a.x@mail.com
1            b.y@mail.com
2            c.z@mail.com
3    udit.manav@gmail.com
Name: mail, dtype: object
<class 'pandas.core.series.Series'>


In [0]:
# columns can also be accessed by dot(.)
df.mail

0            a.x@mail.com
1            b.y@mail.com
2            c.z@mail.com
3    udit.manav@gmail.com
Name: mail, dtype: object

In [0]:
# to access multiple columns use list
df[['first', 'last']]

Unnamed: 0,first,last
0,A,X
1,B,Y
2,C,Z
3,Udit,Manav


To fetch rows from a DataFrame we use loc (location) and iloc (integer location). These return a Series or a DataFrame, depending on the arguments supplied. To access data by integer position we use iloc, and to access data by row/column label we use loc.

In [0]:
df.iloc[0]  # returns the 0th row of the DataFrame as a Series

first                  A
last                   X
mail        a.x@mail.com
response             Yes
Name: 0, dtype: object

In [0]:
df.iloc[[0, 1], 
        [0, 2]]  # returns 0th and 1st rows, 0th and 2nd columns

Unnamed: 0,first,mail
0,A,a.x@mail.com
1,B,b.y@mail.com


In [0]:
df.loc[[0, 1], 
       ['first', 'last']]

Unnamed: 0,first,last
0,A,X
1,B,Y
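
The difference between label-based and position-based lookup is easiest to see when the index is not the default 0, 1, 2, ... The following is a minimal sketch on a hypothetical frame `demo` (not part of the original notebook) whose row labels are 10, 20 and 30.

```python
import pandas as pd

# Hypothetical frame with non-default integer labels, for illustration only
demo = pd.DataFrame(
    {"first": ["A", "B", "C"], "last": ["X", "Y", "Z"]},
    index=[10, 20, 30],
)

by_label = demo.loc[10, "first"]   # row whose label is 10
by_position = demo.iloc[0, 0]      # first row, first column

# Slices differ too: loc includes the end label, iloc excludes the end position
loc_slice = demo.loc[10:20]        # rows labelled 10 and 20
iloc_slice = demo.iloc[0:1]        # only the first row

print(by_label, by_position, len(loc_slice), len(iloc_slice))
```

Here `demo.loc[0]` would raise a `KeyError`, because no row carries the label 0, while `demo.iloc[0]` still returns the first row.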


## Part 3: Indexes <a name='video-03'></a>
df.set_index, df.sort_index, df.reset_index

In [0]:
# setting a column as an index
df.set_index('mail')  # this is not an in-place function
# to apply this change to DataFrame we need to pass inplace=True

Unnamed: 0_level_0,first,last,response
mail,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
a.x@mail.com,A,X,Yes
b.y@mail.com,B,Y,No
c.z@mail.com,C,Z,No
udit.manav@gmail.com,Udit,Manav,Yes


In [0]:
df.set_index('mail', inplace=True)
df

Unnamed: 0_level_0,first,last,response
mail,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
a.x@mail.com,A,X,Yes
b.y@mail.com,B,Y,No
c.z@mail.com,C,Z,No
udit.manav@gmail.com,Udit,Manav,Yes


In [0]:
# to get the index
df.index

Index(['a.x@mail.com', 'b.y@mail.com', 'c.z@mail.com', 'udit.manav@gmail.com'], dtype='object', name='mail')

In [0]:
# now we can use index to fetch a record
df.loc['a.x@mail.com', 'last']

'X'

In [0]:
# sort the rows via index
df.sort_index(ascending=False)

Unnamed: 0_level_0,first,last,response
mail,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
udit.manav@gmail.com,Udit,Manav,Yes
c.z@mail.com,C,Z,No
b.y@mail.com,B,Y,No
a.x@mail.com,A,X,Yes


In [0]:
# to reset the index 
df.reset_index(inplace=True)
df

Unnamed: 0,mail,first,last,response
0,a.x@mail.com,A,X,Yes
1,b.y@mail.com,B,Y,No
2,c.z@mail.com,C,Z,No
3,udit.manav@gmail.com,Udit,Manav,Yes
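
Since set_index and reset_index return new DataFrames by default, an alternative to inplace=True is to reassign the result. A small sketch, assuming a hypothetical two-row frame `demo` rather than the `df` used above:

```python
import pandas as pd

# Hypothetical data, for illustration only
demo = pd.DataFrame({"mail": ["a.x@mail.com", "b.y@mail.com"], "first": ["A", "B"]})

indexed = demo.set_index("mail")          # demo itself is unchanged
restored = indexed.reset_index()          # 'mail' becomes a regular column again
dropped = indexed.reset_index(drop=True)  # discard the index instead of restoring it

print(list(demo.columns), list(restored.columns), list(dropped.columns))
```

Reassigning keeps the original frame available, which can be handy while experimenting in a notebook.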


## Part 4: Filtering data from Dataframe and series objects <a name='video-04'></a>

In [0]:
# get a series indicating which rows to select
df['last'] == 'Z'

0    False
1    False
2     True
3    False
Name: last, dtype: bool

In [0]:
# use the above series as filter
filtr = df['last'] == 'Z'

# use the filter to get the required rows of DataFrame
display(df[filtr])
display(df.loc[filtr])

Unnamed: 0,mail,first,last,response
2,c.z@mail.com,C,Z,No


Unnamed: 0,mail,first,last,response
2,c.z@mail.com,C,Z,No


In [0]:
# Combining Filters using & and |
display(df.loc[(df['last'] == 'Z') & (df['first'] == 'C')])
display(df.loc[(df['last'] == 'Z') | (df['first'] == 'A')])

Unnamed: 0,mail,first,last,response
2,c.z@mail.com,C,Z,No


Unnamed: 0,mail,first,last,response
0,a.x@mail.com,A,X,Yes
2,c.z@mail.com,C,Z,No


In [0]:
# using not in a filter
df.loc[~((df['last'] == 'Z') | 
         (df['first'] == 'A'))]

Unnamed: 0,mail,first,last,response
1,b.y@mail.com,B,Y,No
3,udit.manav@gmail.com,Udit,Manav,Yes


In [0]:
# another neat way of filtering is using df.isin()
first_names = ['A', 'Udit']
filtr = df['first'].isin(first_names)
df[filtr]

Unnamed: 0,mail,first,last,response
0,a.x@mail.com,A,X,Yes
3,udit.manav@gmail.com,Udit,Manav,Yes


In [0]:
# filtering based on substring 
filtr = df['mail'].str.contains('@mail')
df[filtr]

Unnamed: 0,mail,first,last,response
0,a.x@mail.com,A,X,Yes
1,b.y@mail.com,B,Y,No
2,c.z@mail.com,C,Z,No
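
One caveat with substring filters: if the column contains missing values, str.contains may return a missing value for those rows (depending on the pandas version), and such a result cannot be used directly as a boolean filter. Passing na=False treats missing entries as non-matches. A minimal sketch on a hypothetical frame `demo`:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing email, for illustration only
demo = pd.DataFrame({"mail": ["a.x@mail.com", np.nan, "udit.manav@gmail.com"]})

# na=False makes missing entries count as "no match", so the mask is purely boolean
mask = demo["mail"].str.contains("@mail", na=False)
matches = demo[mask]
print(matches)
```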


In [0]:
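# reorder columns back to what they were (this also drops 'response');
# .copy() makes df an independent DataFrame rather than a slice of the old one
cols = ['first', 'last', 'mail']
df = df[cols].copy()
df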

Unnamed: 0,first,last,mail
0,A,X,a.x@mail.com
1,B,Y,b.y@mail.com
2,C,Z,c.z@mail.com
3,Udit,Manav,udit.manav@gmail.com


## Part 5: Updating rows and columns, modifying data within Dataframe <a name='video-05'></a>
df.rename, df.apply, df.applymap, df.map, df.replace

In [0]:
# Changing Columns name
df.columns = ['First Name', 'Last Name', 'Email']
df

Unnamed: 0,First Name,Last Name,Email
0,A,X,a.x@mail.com
1,B,Y,b.y@mail.com
2,C,Z,c.z@mail.com
3,Udit,Manav,udit.manav@gmail.com


In [0]:
# remove space and convert to lower case all column names
df.columns = [x.lower() for x in df.columns]
df.columns = df.columns.str.replace(' ', '_') # this also works with columns
df

Unnamed: 0,first_name,last_name,email
0,A,X,a.x@mail.com
1,B,Y,b.y@mail.com
2,C,Z,c.z@mail.com
3,Udit,Manav,udit.manav@gmail.com


In [0]:
# rename only few columns
df.rename(columns={'first_name':'first', 'last_name':'last'}, inplace=True)
df

Unnamed: 0,first,last,email
0,A,X,a.x@mail.com
1,B,Y,b.y@mail.com
2,C,Z,c.z@mail.com
3,Udit,Manav,udit.manav@gmail.com


In [0]:
# updating values of a row
df.loc[2, ['first', 'last']] = ['M', 'N']
df.loc[3, 'first'] = 'Aadi'
# a single value can also be changed using df.at[3, 'first'] = 'Aadi'
df

Unnamed: 0,first,last,email
0,A,X,a.x@mail.com
1,B,Y,b.y@mail.com
2,M,N,c.z@mail.com
3,Aadi,Manav,udit.manav@gmail.com


In [0]:
# updating a complete row
df.loc[0] = ['Jane', 'Doe', 'jane.doe@mail.com']
# change all lastnames to lower case
df.loc[:, 'last'] = df.loc[:, 'last'].str.lower() 
df

Unnamed: 0,first,last,email
0,Jane,doe,jane.doe@mail.com
1,B,y,b.y@mail.com
2,M,n,c.z@mail.com
3,Aadi,manav,udit.manav@gmail.com


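Note - Here is something that doesn't work. Try this.
```Python
filtr = (df['email'] == 'c.z@mail.com')
df[filtr]['last']
df[filtr]['last'] = 'Smith'
```
The 2nd statement fetches the expected result, but the assignment in the 3rd line does not change `df`: `df[filtr]` returns a temporary copy, so the new value is written to that copy and then discarded. To update the original, pass the filter and the column label to a single `df.loc[filtr, 'last']` call.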
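
For comparison, here is a minimal sketch of the single `.loc` form of that update. It uses a hypothetical frame `demo` so that it can be run on its own; the column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical data, for illustration only
demo = pd.DataFrame({
    "last": ["X", "Y", "Z"],
    "email": ["a.x@mail.com", "b.y@mail.com", "c.z@mail.com"],
})

filtr = demo["email"] == "c.z@mail.com"
# Row mask and column label go in one .loc call, so the write hits demo itself
demo.loc[filtr, "last"] = "Smith"
print(demo)
```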

So now we are done with the basics; let's move on to something more advanced: four common methods used to apply functions on DataFrame and Series objects.

### Apply Method <a name='apply'></a>
It is used for calling a function on values, it works with series and DataFrame. Let's use this to get the length of all the email addresses.

In [0]:
# apply length (len) to DataFrame series
df['email'].apply(len)

0    17
1    12
2    12
3    20
Name: email, dtype: int64

In [0]:
# applying a user defined function on a DataFrame series 
def upper_mail(email):
    return email.upper()

df['email'] = df['email'].apply(upper_mail)
df

Unnamed: 0,first,last,email
0,Jane,doe,JANE.DOE@MAIL.COM
1,B,y,B.Y@MAIL.COM
2,M,n,C.Z@MAIL.COM
3,Aadi,manav,UDIT.MANAV@GMAIL.COM


In [0]:
# using inline lambda function
df['email'] = df['email'].apply(lambda x:x.lower())
df

Unnamed: 0,first,last,email
0,Jane,doe,jane.doe@mail.com
1,B,y,b.y@mail.com
2,M,n,c.z@mail.com
3,Aadi,manav,udit.manav@gmail.com


In [0]:
# using apply on DataFrame
df.apply(len)

first    4
last     4
email    4
dtype: int64

In the above cell, len is applied to each Series (column) of the DataFrame, so it's telling us that each column contains 4 elements. It is similar to ***len(df['email'])***, but for every column.

Now let's say we want to grab the min value from each column. For strings, that is the first one in alphabetical order.

In [0]:
# min value from each series
df.apply(pd.Series.min)

first            Aadi
last              doe
email    b.y@mail.com
dtype: object

In [0]:
# another way to get min value from each series
df.apply(lambda x: x.min())

first            Aadi
last              doe
email    b.y@mail.com
dtype: object

Note - The best way to get min value from df is to use df.min(). The functions used here are merely for illustration purposes.

In [0]:
df.min()

first            Aadi
last              doe
email    b.y@mail.com
dtype: object

So, running apply on a Series applies the function to each value in the Series, and running it on a DataFrame applies the function to every Series (column).

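
The three cases can be put side by side. This is a minimal sketch on a hypothetical frame `demo`; the `axis="columns"` variant is not used elsewhere in this notebook and is shown here only to illustrate that apply can also hand the function one row at a time.

```python
import pandas as pd

# Hypothetical data, for illustration only
demo = pd.DataFrame({"first": ["Jane", "B"], "last": ["doe", "y"]})

# Series.apply: the function receives one value at a time
lengths = demo["first"].apply(len)

# DataFrame.apply (default axis): the function receives one column at a time
per_column = demo.apply(len)

# axis="columns": the function receives one row at a time
full_names = demo.apply(lambda row: row["first"] + " " + row["last"], axis="columns")

print(lengths.tolist(), per_column.tolist(), full_names.tolist())
```
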
### ApplyMap Method <a name='applymap'></a>
To apply a particular function on every cell we use applymap. It only works with DataFrames.

In [0]:
# get length of string in every cell
df.applymap(len)

Unnamed: 0,first,last,email
0,4,3,17
1,1,1,12
2,1,1,12
3,4,5,20


In [0]:
# convert every string to lowercase
df.applymap(str.lower)

Unnamed: 0,first,last,email
0,jane,doe,jane.doe@mail.com
1,b,y,b.y@mail.com
2,m,n,c.z@mail.com
3,aadi,manav,udit.manav@gmail.com
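
On newer pandas releases (2.1 and later), `DataFrame.applymap` is deprecated in favor of `DataFrame.map`, which does the same cell-by-cell work. The sketch below uses a hypothetical helper named `elementwise` that picks whichever method the installed version offers; the helper and the frame `demo` are illustrations, not part of pandas or of the original notebook.

```python
import pandas as pd

# Hypothetical data, for illustration only
demo = pd.DataFrame({"first": ["Jane", "B"], "last": ["doe", "y"]})

def elementwise(frame, func):
    """Apply func to every cell, using whichever method this pandas version has."""
    if hasattr(frame, "map"):      # DataFrame.map exists from pandas 2.1 onward
        return frame.map(func)
    return frame.applymap(func)    # older pandas

cell_lengths = elementwise(demo, len)
print(cell_lengths)
```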


That's all for applymap.

### Map Method <a name='map'></a>
It is used for mapping values of DataFrame column to some other values.

In [0]:
# Mapping via dictionary key to value
df['first'].map({'Aadi':'Udit', 'B':'c'})

0     NaN
1       c
2     NaN
3    Udit
Name: first, dtype: object

Note - the values that we have not provided a substitute for, while using map, will be replaced by NaN. If we don't want that to happen we can use the replace method. All these methods are ***NOT inplace*** methods.

### Replace Method <a name='replace'></a>
Similar to map, except it does not change values that have not been supplied in the dict.

In [0]:
# replacing via dictionary key to value
df['first'].replace({'Udit':'A', 'B':'c'})

0    Jane
1       c
2       M
3    Aadi
Name: first, dtype: object
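
To see map and replace side by side, here is a minimal sketch on a standalone Series that mirrors the `first` column above. The last line shows one possible way to get replace-like behaviour out of map, by falling back to the original values wherever map produced NaN.

```python
import pandas as pd

# Standalone copy of the 'first' column, for illustration only
first = pd.Series(["Jane", "B", "M", "Aadi"], name="first")
mapping = {"Aadi": "Udit", "B": "c"}

mapped = first.map(mapping)                      # unmapped values become NaN
replaced = first.replace(mapping)                # unmapped values are kept
mapped_keep = first.map(mapping).fillna(first)   # map, then fall back to the original

print(mapped.tolist(), replaced.tolist(), mapped_keep.tolist())
```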

## Part 6: Adding and Removing Rows and Columns from DataFrame <a name='video-06'></a>
df.drop, df.append


In [0]:
# adding a column to DataFrame
df['full_name'] = df['first'] + ' ' + df['last']
df

Unnamed: 0,first,last,email,full_name
0,Jane,doe,jane.doe@mail.com,Jane doe
1,B,y,b.y@mail.com,B y
2,M,n,c.z@mail.com,M n
3,Aadi,manav,udit.manav@gmail.com,Aadi manav


Note - Another neat way to add new columns ***temporarily*** is the df.assign() function. It returns a new DataFrame with the extra column and does not modify the original one. It has no `inplace` option: any keyword argument is treated as a new column. Try this.
```Python
temps = pd.DataFrame({'temp_c': [17.0, 25.0]},
                     index=['Portland', 'Berkeley'])
temps.assign(temp_f=lambda x: x.temp_c * 9 / 5 + 32)
```

In [0]:
# remove columns
df.drop(columns=['first', 'last'], inplace=True)
df

Unnamed: 0,email,full_name
0,jane.doe@mail.com,Jane doe
1,b.y@mail.com,B y
2,c.z@mail.com,M n
3,udit.manav@gmail.com,Aadi manav


In [0]:
# add back first and last columns using full_name column
df['full_name'].str.split(' ')

0      [Jane, doe]
1           [B, y]
2           [M, n]
3    [Aadi, manav]
Name: full_name, dtype: object

In [0]:
# to get result of above cell in DataFrame use expand=True
df[['first', 'last']] = df['full_name'].str.split(' ', expand=True)
df

Unnamed: 0,email,full_name,first,last
0,jane.doe@mail.com,Jane doe,Jane,doe
1,b.y@mail.com,B y,B,y
2,c.z@mail.com,M n,M,n
3,udit.manav@gmail.com,Aadi manav,Aadi,manav


In [0]:
# adding a single row, single value, NOT inplace
df.append({'first':'Tony'}, ignore_index=True)  # try removing ignore_index=True

Unnamed: 0,email,full_name,first,last
0,jane.doe@mail.com,Jane doe,Jane,doe
1,b.y@mail.com,B y,B,y
2,c.z@mail.com,M n,M,n
3,udit.manav@gmail.com,Aadi manav,Aadi,manav
4,,,Tony,


In [0]:
# adding 2 DataFrames together
people2 = {
    'first' : ['P', 'Q'], 
    'last' : ['R', 'S'],
    'email' : ['p.q@mail.com', 'r.s@mail.com'], 
}
df2 = pd.DataFrame(people2)
df2

Unnamed: 0,first,last,email
0,P,R,p.q@mail.com
1,Q,S,r.s@mail.com


In [0]:
# ignore_index=True renumbers the rows 0..n-1; columns are matched by name, NOT inplace
df.append(df2, ignore_index=True)

Unnamed: 0,email,full_name,first,last
0,jane.doe@mail.com,Jane doe,Jane,doe
1,b.y@mail.com,B y,B,y
2,c.z@mail.com,M n,M,n
3,udit.manav@gmail.com,Aadi manav,Aadi,manav
4,p.q@mail.com,,P,R
5,r.s@mail.com,,Q,S


In [0]:
df = df.append(df2, ignore_index=True)
df

Unnamed: 0,email,full_name,first,last
0,jane.doe@mail.com,Jane doe,Jane,doe
1,b.y@mail.com,B y,B,y
2,c.z@mail.com,M n,M,n
3,udit.manav@gmail.com,Aadi manav,Aadi,manav
4,p.q@mail.com,,P,R
5,r.s@mail.com,,Q,S
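
Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the cells above need an older pandas to run as written. `pd.concat` gives the same result on both old and new versions. A minimal sketch, using hypothetical frames `left` and `right` in place of `df` and `df2`:

```python
import pandas as pd

# Hypothetical frames standing in for df and df2, for illustration only
left = pd.DataFrame({"email": ["a.x@mail.com"], "full_name": ["A X"],
                     "first": ["A"], "last": ["X"]})
right = pd.DataFrame({"first": ["P", "Q"], "last": ["R", "S"],
                      "email": ["p.q@mail.com", "r.s@mail.com"]})

# Columns are aligned by name; columns missing from one frame are filled with NaN
combined = pd.concat([left, right], ignore_index=True)

# A single row given as a dict can be wrapped in a one-row DataFrame first
with_tony = pd.concat([combined, pd.DataFrame([{"first": "Tony"}])], ignore_index=True)
print(with_tony)
```

Like append, concat returns a new DataFrame, so the result has to be assigned to keep it.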


In [0]:
# removing rows from DataFrame, NOT inplace
df.drop(index=4)

Unnamed: 0,email,full_name,first,last
0,jane.doe@mail.com,Jane doe,Jane,doe
1,b.y@mail.com,B y,B,y
2,c.z@mail.com,M n,M,n
3,udit.manav@gmail.com,Aadi manav,Aadi,manav
5,r.s@mail.com,,Q,S


In [0]:
# removing rows based on some condition
filtr = ((df['last']=='n')|(df['first']=='Q'))
df.drop(index=df[filtr].index)

Unnamed: 0,email,full_name,first,last
0,jane.doe@mail.com,Jane doe,Jane,doe
1,b.y@mail.com,B y,B,y
3,udit.manav@gmail.com,Aadi manav,Aadi,manav
4,p.q@mail.com,,P,R


In [0]:
# concat different series object to a DataFrame
df3 = pd.concat([df['first'], df['last'], df['email']], axis='columns')
df3

Unnamed: 0,first,last,email
0,Jane,doe,jane.doe@mail.com
1,B,y,b.y@mail.com
2,M,n,c.z@mail.com
3,Aadi,manav,udit.manav@gmail.com
4,P,R,p.q@mail.com
5,Q,S,r.s@mail.com


In [0]:
df

Unnamed: 0,email,full_name,first,last
0,jane.doe@mail.com,Jane doe,Jane,doe
1,b.y@mail.com,B y,B,y
2,c.z@mail.com,M n,M,n
3,udit.manav@gmail.com,Aadi manav,Aadi,manav
4,p.q@mail.com,,P,R
5,r.s@mail.com,,Q,S


## Part 7: Sorting Data <a name='video-07'></a>
sort columns, sort multiple columns, get max and min from different rows

In [0]:
people = {
    'first' : ['A', 'B', 'C', 'Udit'], 
    'last' : ['Y', 'Y', 'Z', 'Manav'],
    'mail' : ['a.y@mail.com', 'b.y@mail.com', 'c.z@mail.com', 'udit.manav@gmail.com'], 
}
df = pd.DataFrame(people)
df

Unnamed: 0,first,last,mail
0,A,Y,a.y@mail.com
1,B,Y,b.y@mail.com
2,C,Z,c.z@mail.com
3,Udit,Manav,udit.manav@gmail.com


In [0]:
# sort by last name
df.sort_values(by='last', ascending=False)

Unnamed: 0,first,last,mail
2,C,Z,c.z@mail.com
0,A,Y,a.y@mail.com
1,B,Y,b.y@mail.com
3,Udit,Manav,udit.manav@gmail.com


In [0]:
# sorting on multiple columns
df.sort_values(by=['last', 'first'], 
               ascending=False)

Unnamed: 0,first,last,mail
2,C,Z,c.z@mail.com
1,B,Y,b.y@mail.com
0,A,Y,a.y@mail.com
3,Udit,Manav,udit.manav@gmail.com


In [0]:
# sorting last in ascending, first in descending; inplace=True modifies df itself
df.sort_values(by=['last', 'first'], 
               ascending=[True, False], 
               inplace=True)
df

Unnamed: 0,first,last,mail
3,Udit,Manav,udit.manav@gmail.com
1,B,Y,b.y@mail.com
0,A,Y,a.y@mail.com
2,C,Z,c.z@mail.com


In [0]:
# sorting via indexes
df.sort_index()

Unnamed: 0,first,last,mail
0,A,Y,a.y@mail.com
1,B,Y,b.y@mail.com
2,C,Z,c.z@mail.com
3,Udit,Manav,udit.manav@gmail.com


In [0]:
# sorting series 
df['last'].sort_values(ascending=False)

2        Z
0        Y
1        Y
3    Manav
Name: last, dtype: object

In [0]:
# add another column and find top n 
df['salary'] = [50, 40, 99, 45]
df

Unnamed: 0,first,last,mail,salary
3,Udit,Manav,udit.manav@gmail.com,50
1,B,Y,b.y@mail.com,40
0,A,Y,a.y@mail.com,99
2,C,Z,c.z@mail.com,45


In [0]:
# get top 2 salaries
df['salary'].nlargest(2)

0    99
3    50
Name: salary, dtype: int64

In [0]:
# get all details of top n salaries
df.nlargest(2, 'salary')

Unnamed: 0,first,last,mail,salary
0,A,Y,a.y@mail.com,99
3,Udit,Manav,udit.manav@gmail.com,50


In [0]:
# get lowest 2 salaries
df['salary'].nsmallest(2)

1    40
2    45
Name: salary, dtype: int64

In [0]:
# get all details of lowest n salaries
df.nsmallest(2, 'salary')

Unnamed: 0,first,last,mail,salary
1,B,Y,b.y@mail.com,40
2,C,Z,c.z@mail.com,45


## Part 8: Grouping, Aggregating, Analysing and Exploring Data <a name='video-08'></a>
This section is completely hands-on. Go [here](#part-08).

## Part 9: Cleaning Data - Casting Data Types and Handling Missing Values <a name='video-09'></a>

In [0]:
import pandas as pd
import numpy as np
people = {
    'first' : ['Corey', 'Jane', 'John', 'Chris', np.nan, None, 'NA'],
    'last' : ['Schafer', 'Doe', 'Doe', 'Schafer', np.nan, np.nan, 'Missing'],
    'email' : ['CoreySchefer@gmail.com', 'JaneDoe@emai;.com', "JohnDoe@gmail.com",
               None, np.nan, 'anon@gmail.com', 'NA'],
    'age' : ['33', '55', '63', '36', None, None, 'Missing']
}

In [0]:
# create DataFrame with missing values
df = pd.DataFrame(people)
df

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreySchefer@gmail.com,33
1,Jane,Doe,JaneDoe@emai;.com,55
2,John,Doe,JohnDoe@gmail.com,63
3,Chris,Schafer,,36
4,,,,
5,,,anon@gmail.com,
6,,Missing,,Missing


In [0]:
# drop rows that contain missing (None / np.nan) values
# default args: df.dropna(axis='index', how='any')
df.dropna()

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreySchefer@gmail.com,33
1,Jane,Doe,JaneDoe@emai;.com,55
2,John,Doe,JohnDoe@gmail.com,63
6,,Missing,,Missing


Note - <br>
axis='index'/'columns' : decides whether rows or columns are dropped.<br>
how='any' : drop rows with any missing value. With how='all', drop only rows in which every value is missing.
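
The combinations of axis and how can be compared on a small frame. This is a minimal sketch with hypothetical data (`demo` and its `empty` column are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 and column 'empty' are entirely missing
demo = pd.DataFrame({
    "first": ["Corey", np.nan, np.nan],
    "age": ["33", np.nan, "55"],
    "empty": [np.nan, np.nan, np.nan],
})

rows_all = demo.dropna(axis="index", how="all")    # drops only the fully empty row
cols_all = demo.dropna(axis="columns", how="all")  # drops only the fully empty column
cols_any = demo.dropna(axis="columns", how="any")  # drops every column that has a gap

print(list(rows_all.index), list(cols_all.columns), list(cols_any.columns))
```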

In [0]:
# drop rows that have a missing value in ANY of the listed columns
df.dropna(axis='index', how='any', subset=['email', 'age'])
# row 6 is kept: 'NA' and 'Missing' are plain strings, not recognised missing values

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreySchefer@gmail.com,33
1,Jane,Doe,JaneDoe@emai;.com,55
2,John,Doe,JohnDoe@gmail.com,63
6,,Missing,,Missing


In [0]:
# drop rows that have missing values in ALL of the listed columns
df.dropna(axis='index', how='all', subset=['email', 'age'])

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreySchefer@gmail.com,33
1,Jane,Doe,JaneDoe@emai;.com,55
2,John,Doe,JohnDoe@gmail.com,63
3,Chris,Schafer,,36
5,,,anon@gmail.com,
6,,Missing,,Missing


In [0]:
# treating custom missing values
df.replace('NA', np.nan, inplace=True)
df.replace('Missing', np.nan, inplace=True)
df

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreySchefer@gmail.com,33.0
1,Jane,Doe,JaneDoe@emai;.com,55.0
2,John,Doe,JohnDoe@gmail.com,63.0
3,Chris,Schafer,,36.0
4,,,,
5,,,anon@gmail.com,
6,,,,
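
The two replace calls can also be combined by passing a list of placeholder strings. A minimal sketch on a hypothetical frame `demo`, returning a new frame instead of using inplace=True:

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
demo = pd.DataFrame({"first": ["Corey", "NA"], "age": ["33", "Missing"]})

# One call handles several placeholder strings
cleaned = demo.replace(["NA", "Missing"], np.nan)
print(cleaned.isna())
```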


In [0]:
df.dropna()

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreySchefer@gmail.com,33
1,Jane,Doe,JaneDoe@emai;.com,55
2,John,Doe,JohnDoe@gmail.com,63


In [0]:
# check whether a value in df is classified as na
df.isna()

Unnamed: 0,first,last,email,age
0,False,False,False,False
1,False,False,False,False
2,False,False,False,False
3,False,False,True,False
4,True,True,True,True
5,True,True,False,True
6,True,True,True,True


In [0]:
# fill na with some Value
df.fillna('MISSING')

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreySchefer@gmail.com,33
1,Jane,Doe,JaneDoe@emai;.com,55
2,John,Doe,JohnDoe@gmail.com,63
3,Chris,Schafer,MISSING,36
4,MISSING,MISSING,MISSING,MISSING
5,MISSING,MISSING,anon@gmail.com,MISSING
6,MISSING,MISSING,MISSING,MISSING


In [0]:
# casting DataTypes
print(df.dtypes)

first    object
last     object
email    object
age      object
dtype: object


In [0]:
# convert age (string) to number
df.age = df.age.astype(np.float32)
print(df.dtypes)
print(f'Mean Age- {df.age.mean()}')
df

first     object
last      object
email     object
age      float32
dtype: object
Mean Age- 46.75


Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreySchefer@gmail.com,33.0
1,Jane,Doe,JaneDoe@emai;.com,55.0
2,John,Doe,JohnDoe@gmail.com,63.0
3,Chris,Schafer,,36.0
4,,,,
5,,,anon@gmail.com,
6,,,,
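
The astype call above works because the custom placeholders were already replaced with NaN; a leftover string such as 'Missing' would make it raise an error. One alternative is `pd.to_numeric` with errors='coerce', which converts anything it cannot parse into NaN. A minimal sketch on a standalone Series `ages` (hypothetical, not the `df` above):

```python
import pandas as pd

# Hypothetical data that still contains a placeholder string
ages = pd.Series(["33", "55", None, "Missing"], name="age")

# errors="coerce" turns anything unparseable into NaN instead of raising
numeric_ages = pd.to_numeric(ages, errors="coerce")
mean_age = numeric_ages.mean()   # NaN entries are skipped by default

print(numeric_ages.tolist(), mean_age)
```

The trade-off is that coercion hides bad values silently, so it is worth checking how many entries became NaN.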


## Part 10: Working with Dates and Time Series Data <a name='video-10'></a>
Data and notebook can be downloaded from [here](https://github.com/CoreyMSchafer/code_snippets/tree/master/Python/Pandas/10-Datetime-Timeseries).<br> Done in other notebook.

## Part 11: Reading/Writing Data to Different Sources - Excel, JSON, SQL, Etc <a name='video-11'></a>
Go [here](https://www.youtube.com/watch?v=N6hyN6BW6ao&list=PL-osiE80TeTsWmV9i9c58mdDCSskIFdDS&index=11). This section is under development.