
# Agenda

## Missing Data

- Keep the data

- Drop the data

- Fill the data

## Missing Data

Real-world data is often incomplete. Handling missing values carefully lets us draw reliable insights from the data that remains.

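pandas uses three markers for missing values:

- np.nan - Not a Number, a float used for missing numeric data

- pd.NaT - Not a Time, used for missing datetime values

- pd.NA - a general missing-value marker (in newer pandas versions)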

In [1]:
import pandas as pd
import numpy as np

In [2]:
#Missing Data Pandas Operations
np.nan

nan

In [3]:
pd.NA

<NA>

In [4]:
pd.NaT

NaT

In [5]:
np.nan == np.nan

False

In [6]:
np.nan is np.nan

True

In [7]:
a = np.nan
a

nan

In [8]:
print(a is np.nan)
print(a == np.nan)

True
False
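Since `==` is always False for NaN, and `is` only succeeds when a value happens to be the very same object as `np.nan`, neither is a dependable test for missing data. A minimal sketch of the usual alternative, `pd.isna`, follows; the variable names `b`, `same_object` and `checks` are illustrative only.

```python
import numpy as np
import pandas as pd

a = np.nan
b = float("nan")  # a NaN value held in a different object than np.nan

# Identity only succeeds when both names refer to the same object.
same_object = b is np.nan

# pd.isna recognises each missing-value marker regardless of object identity.
checks = [pd.isna(a), pd.isna(b), pd.isna(pd.NaT), pd.isna(pd.NA), pd.isna(None)]

print(same_object)  # False
print(checks)       # every entry is True
```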


Pros and Cons of keeping the missing data

Pros:

- Does not change or manipulate the original data

Cons:

- Many methods and models do not accept NaN

- A reasonable estimate of the missing value is often available, so leaving the gap wastes usable information
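To illustrate the point about NaN support, the sketch below compares a plain NumPy reduction with the pandas equivalent on the same values. The sample numbers are made up for the demonstration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1390.0, np.nan, 20.0])

# A plain NumPy reduction propagates NaN to the result.
numpy_total = np.sum(s.to_numpy())

# pandas reductions skip NaN by default (skipna=True).
pandas_total = s.sum()

# Passing skipna=False makes pandas propagate NaN as well.
strict_total = s.sum(skipna=False)

print(numpy_total, pandas_total, strict_total)
```

Whether a silent skip or a propagated NaN is preferable depends on the analysis, which is one reason kept NaN values need attention.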

Pros and Cons of removing/dropping the missing data

Pros:

- Simple to apply, and can follow explicit rules (for example, drop rows that are missing too many values)

Cons:

- Might lose lots of useful information.

- A model trained only on complete rows may not cope with test data that contains missing values.

In [9]:
myindex = ['India', 'USA', 'Canada']
mydata = [[1947,1390, 10], [1776, None, None], [1867, 20, 12]]
mycolumns = ['Independence', 'Population', 'GDP']
df = pd.DataFrame(data = mydata, index = myindex, columns = mycolumns)
df

Unnamed: 0,Independence,Population,GDP
India,1947,1390.0,10.0
USA,1776,,
Canada,1867,20.0,12.0


In [10]:
# Drop every row that contains at least one missing value
df.dropna()

Unnamed: 0,Independence,Population,GDP
India,1947,1390.0,10.0
Canada,1867,20.0,12.0


In [11]:
myindex = ['India', 'USA', 'Canada']
mydata = [[1947,1390, None], [1776, 33, None], [1867, 20, 12]]
mycolumns = ['Independence', 'Population', 'GDP']
df = pd.DataFrame(data = mydata, index = myindex, columns = mycolumns)
df

Unnamed: 0,Independence,Population,GDP
India,1947,1390,
USA,1776,33,
Canada,1867,20,12.0


In [12]:
# Drop a column by name
df.drop('GDP', axis = 1)

Unnamed: 0,Independence,Population
India,1947,1390
USA,1776,33
Canada,1867,20


Pros and Cons of filling in the missing data

Pros:

- Potential to save a lot of data for ML model

Cons:

- Hard to do well, as the choice of fill value is often arbitrary

- Potential to lead to false conclusions
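As a small illustration of how a fill value can distort results, the sketch below fills a GDP-like column with 0 and compares the means. The values mirror the example frame above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

gdp = pd.Series([np.nan, np.nan, 12.0], index=["India", "USA", "Canada"])

# mean() skips missing values, so only Canada contributes.
mean_ignoring_missing = gdp.mean()

# After filling with 0, the two invented zeros pull the mean down.
mean_after_zero_fill = gdp.fillna(0).mean()

print(mean_ignoring_missing)  # 12.0
print(mean_after_zero_fill)   # 4.0
```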

In [13]:
df

Unnamed: 0,Independence,Population,GDP
India,1947,1390,
USA,1776,33,
Canada,1867,20,12.0


In [14]:
df.at['India', 'GDP']

nan

In [15]:
df['GDP'].iloc[0]

nan

In [16]:
df['GDP'].iloc[2]

12.0

In [17]:
# No effect: 'NaN' here is a string, while the column stores the float np.nan
df['GDP'].replace('NaN', 0)

India      NaN
USA        NaN
Canada    12.0
Name: GDP, dtype: float64
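The call above leaves the column unchanged because the string `'NaN'` never matches the float `np.nan`. A minimal sketch of two calls that do replace the missing entries is shown below, using a stand-alone Series with the same values (the names `unchanged`, `replaced` and `filled` are illustrative).

```python
import numpy as np
import pandas as pd

gdp = pd.Series([np.nan, np.nan, 12.0],
                index=["India", "USA", "Canada"], name="GDP")

# Matching on the string 'NaN' finds nothing to replace.
unchanged = gdp.replace("NaN", 0)

# Matching on np.nan itself, or using fillna, replaces the missing entries.
replaced = gdp.replace(np.nan, 0)
filled = gdp.fillna(0)

print(unchanged.isna().sum())  # 2
print(filled)
```

Both calls return a new Series; the original `gdp` still holds its NaN values unless the result is assigned back.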

In [18]:
df

Unnamed: 0,Independence,Population,GDP
India,1947,1390,
USA,1776,33,
Canada,1867,20,12.0


In [19]:
import warnings
warnings.filterwarnings('ignore')

In [20]:
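# Assign through .loc so the change is written to df itself rather than to a temporary copy
df.loc['India', 'GDP'] = 0
df.loc['USA', 'GDP'] = 0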
df

Unnamed: 0,Independence,Population,GDP
India,1947,1390,0.0
USA,1776,33,0.0
Canada,1867,20,12.0


In [21]:
# Fill a gap with an interpolated or estimated value
df['Perc'] = ['75%', np.nan, '25%']
df

Unnamed: 0,Independence,Population,GDP,Perc
India,1947,1390,0.0,75%
USA,1776,33,0.0,
Canada,1867,20,12.0,25%


In [22]:
# 50% lies midway between the neighbouring values 75% and 25%
df.loc['USA', 'Perc'] = '50%'
df

Unnamed: 0,Independence,Population,GDP,Perc
India,1947,1390,0.0,75%
USA,1776,33,0.0,50%
Canada,1867,20,12.0,25%
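The 50% estimate can also be computed rather than typed in by hand. The sketch below is one possible approach, assuming the percentages are stored as strings with a trailing `%`: it converts them to numbers and lets pandas interpolate linearly between the neighbours.

```python
import numpy as np
import pandas as pd

perc = pd.Series(["75%", np.nan, "25%"], index=["India", "USA", "Canada"])

# Strip the percent sign and convert to float; the missing entry stays NaN.
numeric = perc.str.rstrip("%").astype(float)

# Linear interpolation places the gap midway between its two neighbours.
estimated = numeric.interpolate()

print(estimated)
```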


In [23]:
df = pd.read_csv('movie_scores.csv')
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [24]:
df.isnull()

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,False,False,False,False,False,False
1,True,True,True,True,True,True
2,False,False,False,False,True,True
3,False,False,False,False,False,False
4,False,False,False,False,False,False
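A boolean mask like this is often summarised by counting the True values. The sketch below rebuilds the table shown above by hand (an assumption, since `movie_scores.csv` itself is not included here) under the hypothetical name `movies`, then counts missing entries per column and per row.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the displayed table, standing in for movie_scores.csv.
movies = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63, np.nan, 51, 66, 31],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8, np.nan, np.nan, 6, 7],
    "post_movie_score": [10, np.nan, np.nan, 8, 9],
})

# True counts as 1, so summing the mask counts the missing entries.
missing_per_column = movies.isna().sum()
missing_per_row = movies.isna().sum(axis=1)

print(missing_per_column)
print(missing_per_row)
```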


In [25]:
df.notnull()

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,True,True,True,True,True,True
1,False,False,False,False,False,False
2,True,True,True,True,False,False
3,True,True,True,True,True,True
4,True,True,True,True,True,True


In [26]:
df['pre_movie_score'].notnull()

0     True
1    False
2    False
3     True
4     True
Name: pre_movie_score, dtype: bool

In [27]:
df[df['pre_movie_score'].notnull()]

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [28]:
df[df['pre_movie_score'].isnull()]

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
1,,,,,,
2,Hugh,Jackman,51.0,m,,


In [29]:
df[(df['pre_movie_score'].isnull()) & (df['first_name'].notnull())]

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
2,Hugh,Jackman,51.0,m,,


In [30]:
#Keep Data
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [31]:
#Drop Data
#help(df.dropna)

In [32]:
df.dropna()

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [33]:
# Keep only rows that have at least 1 non-missing value
df.dropna(thresh=1)

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [34]:
# Keep only rows that have at least 5 non-missing values
df.dropna(thresh=5)

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0
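The `thresh` argument counts non-missing values per row. A related option is `how='all'`, which drops only rows that are entirely empty. The sketch below compares the two on a hand-built copy of the table (the frame name `movies` and the result names are illustrative, and the data is retyped from the output above rather than read from the CSV).

```python
import numpy as np
import pandas as pd

# Hand-built copy of the displayed table, standing in for movie_scores.csv.
movies = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63, np.nan, 51, 66, 31],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8, np.nan, np.nan, 6, 7],
    "post_movie_score": [10, np.nan, np.nan, 8, 9],
})

# how='all' removes a row only when every value in it is missing (row 1).
only_empty_rows_dropped = movies.dropna(how="all")

# thresh=5 also removes row 2, which has just 4 non-missing values.
at_least_five = movies.dropna(thresh=5)

print(list(only_empty_rows_dropped.index))  # [0, 2, 3, 4]
print(list(at_least_five.index))            # [0, 3, 4]
```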


In [35]:
# Drop every column that contains a missing value; here that removes all columns
df.dropna(axis=1)

0
1
2
3
4


In [36]:
# Drop rows only when the listed column(s) are missing
df.dropna(subset=['last_name'])

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [37]:
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [38]:
#Filling the data
#help(df.fillna)

In [39]:
df.fillna('new')

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63,m,8,10
1,new,new,new,new,new,new
2,Hugh,Jackman,51,m,new,new
3,Oprah,Winfrey,66,f,6,8
4,Emma,Stone,31,f,7,9


In [40]:
df['pre_movie_score'].fillna(0)

0    8.0
1    0.0
2    0.0
3    6.0
4    7.0
Name: pre_movie_score, dtype: float64

In [41]:
df['sex'].fillna('m')

0    m
1    m
2    m
3    f
4    f
Name: sex, dtype: object
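Column-by-column fills like the two above can also be combined into one call by passing a dictionary that maps column names to fill values. The sketch below uses a hand-built copy of the table under the illustrative name `movies`; the fill values `0` and `'unknown'` are arbitrary choices for the demonstration.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the displayed table, standing in for movie_scores.csv.
movies = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63, np.nan, 51, 66, 31],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8, np.nan, np.nan, 6, 7],
    "post_movie_score": [10, np.nan, np.nan, 8, 9],
})

# Each key names a column and each value is the fill used for that column only.
filled = movies.fillna({"pre_movie_score": 0, "sex": "unknown"})

print(filled)
```

Columns not named in the dictionary, such as `post_movie_score`, keep their missing values.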

In [42]:
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [43]:
# Replace missing values with the average pre_movie_score
df['pre_movie_score'].fillna(df['pre_movie_score'].mean())

0    8.0
1    7.0
2    7.0
3    6.0
4    7.0
Name: pre_movie_score, dtype: float64

In [44]:
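# numeric_only=True restricts the mean to numeric columns; text columns stay unfilled
df.fillna(df.mean(numeric_only=True))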

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,52.75,,7.0,9.0
2,Hugh,Jackman,51.0,m,7.0,9.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [45]:
#Interpolation
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [46]:
# Sort by age so that neighbouring rows are comparable before interpolating
df = df.sort_values(by = 'age')
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
4,Emma,Stone,31.0,f,7.0,9.0
2,Hugh,Jackman,51.0,m,,
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
1,,,,,,


In [47]:
# Linear interpolation between neighbouring rows; trailing gaps take the last valid value
df.interpolate()

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
4,Emma,Stone,31.0,f,7.0,9.0
2,Hugh,Jackman,51.0,m,7.5,9.5
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
1,,,66.0,,6.0,8.0
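Calling `interpolate` on a frame that also holds text columns may produce a warning or an error in recent pandas releases, so one cautious variant is to interpolate the numeric columns only. The sketch below does that on a hand-built copy of the table (the names `movies`, `ordered` and `interpolated` are illustrative, and the data is retyped from the output above).

```python
import numpy as np
import pandas as pd

# Hand-built copy of the displayed table, standing in for movie_scores.csv.
movies = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63, np.nan, 51, 66, 31],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8, np.nan, np.nan, 6, 7],
    "post_movie_score": [10, np.nan, np.nan, 8, 9],
})

# Rows with a missing age are placed last by sort_values.
ordered = movies.sort_values(by="age")

# Interpolate numeric columns only and leave the text columns untouched.
numeric_cols = ordered.select_dtypes(include="number").columns
interpolated = ordered.copy()
interpolated[numeric_cols] = ordered[numeric_cols].interpolate()

print(interpolated)
```

The default linear method treats rows as equally spaced, so Hugh Jackman's scores land midway between the rows above and below. The trailing empty row is filled with the last valid values; passing `limit_area='inside'` to `interpolate` would leave it missing instead.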


End of code