### Handling Missing Data
1. Keep it
2. Remove it
3. Replace it

* [isnull()](#isnull)
* [notnull()](#notnull)
* [dropna()](#dropna)
* [fillna()](#fillna)
* [interpolate()](#interpolate)

In [2]:
import numpy as np
import pandas as pd


### What Null/NA/nan objects look like:
Source: https://github.com/pandas-dev/pandas/issues/28095
A new pd.NA value (singleton) is introduced to represent scalar missing values. Up to now, pandas used several values to represent missing data: np.nan is used for this for float data, np.nan or None for object-dtype data and pd.NaT for datetime-like data. The goal of pd.NA is to provide a “missing” indicator that can be used consistently across data types. pd.NA is currently used by the nullable integer and boolean data types and the new string data type


In [9]:
np.nan

nan

In [7]:
pd.NA

<NA>

In [15]:
pd.NaT

NaT

In [17]:
np.nan == np.nan

False

In [19]:
np.nan in [np.nan]

True

In [21]:
np.nan is np.nan

True

In [23]:
pd.NA == pd.NA

<NA>
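
The comparisons above show why `==` is not a reliable test for missing values: `np.nan == np.nan` is `False`, and `pd.NA == pd.NA` evaluates to `<NA>` rather than a boolean. A minimal sketch of a safer check using `pd.isna()`, which accepts scalars as well as array-like input (the sample values are made up for illustration):

```python
import numpy as np
import pandas as pd

# pd.isna() recognises each missing-value marker shown above, plus None.
markers = [np.nan, pd.NA, pd.NaT, None]
scalar_checks = [bool(pd.isna(m)) for m in markers]
print(scalar_checks)  # [True, True, True, True]

# It also works element-wise on a Series (illustrative data).
values = pd.Series([1.0, np.nan, 3.0])
mask = pd.isna(values)
print(mask.tolist())  # [False, True, False]
```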

In [3]:
df = pd.read_csv("..\\0.datasets\\movie_scores.csv")

In [31]:
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


<a id = 'isnull'></a>
### isnull()
Returns a DataFrame of the same shape as `df`, filled with boolean values:  
`True` where the corresponding value in `df` is null,  
`False` otherwise.

In [4]:
df.isnull()

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,False,False,False,False,False,False
1,True,True,True,True,True,True
2,False,False,False,False,True,True
3,False,False,False,False,False,False
4,False,False,False,False,False,False
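
A common follow-up to `isnull()` is to count the missing values, since `True` counts as 1 when summed. The sketch below assumes an inline stand-in for `movie_scores.csv`, rebuilt from the table printed earlier so that the example runs without the file:

```python
import numpy as np
import pandas as pd

# Inline stand-in for movie_scores.csv, copied from the table shown above.
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

# Summing the boolean mask gives the number of missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Flag the rows in which every column is missing.
fully_empty_rows = df.isnull().all(axis=1)
print(fully_empty_rows.tolist())  # [False, True, False, False, False]
```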


<a id = 'notnull'></a>
### notnull()
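Returns a DataFrame of the same shape as `df`, filled with boolean values:  
`True` where the corresponding value in `df` is not null,  
`False` otherwise. It is the exact inverse of `isnull()`.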

In [5]:
df.notnull()

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,True,True,True,True,True,True
1,False,False,False,False,False,False
2,True,True,True,True,False,False
3,True,True,True,True,True,True
4,True,True,True,True,True,True


In [7]:
df[(df['pre_movie_score'].isnull()) & (df['first_name'].notnull())]

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
2,Hugh,Jackman,51.0,m,,


<a id = 'dropna'></a>
### dropna()
Run `help(df.dropna)` to print the docstring, or see the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html).


Help on method dropna in module pandas.core.frame:

dropna(axis=0, how='any', thresh=None, subset=None, inplace=False) method of pandas.core.frame.DataFrame instance
    Remove missing values.
    
    See the :ref:`User Guide <missing_data>` for more on which values are
    considered missing, and how to work with missing data.
    
    Parameters
    ----------
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Determine if rows or columns which contain missing values are
        removed.
    
        * 0, or 'index' : Drop rows which contain missing values.
        * 1, or 'columns' : Drop columns which contain missing value.
    
        .. versionchanged:: 1.0.0
    
           Pass tuple or list to drop on multiple axes.
           Only a single axis is allowed.
    
    how : {'any', 'all'}, default 'any'
        Determine if row or column is removed from DataFrame, when we have
        at least one NA or all NA.
    
        * 'any' : If any NA values are present, dro

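By default, `dropna()` drops every row of `df` that contains at least one missing value (`None`, `NaN`, `NA` or `NaT`).  
Parameters:

* `axis`: `0` or `'index'` drops rows that contain missing values (default); `1` or `'columns'` drops columns instead.
* `how`: `'any'` (default) drops a row/column if it contains any missing value; `'all'` drops it only if all of its values are missing.
* `thresh`: keeps only the rows/columns that have at least `thresh` non-null values; anything with fewer is dropped.
* `subset`: restricts the check to the given labels along the other axis, e.g. specific columns when dropping rows.
* `inplace`: if `True`, modifies `df` directly instead of returning a new DataFrame.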



In [14]:

df.dropna()

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0
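
The `how`, `thresh` and `subset` parameters can be compared side by side. This is a rough sketch on an inline stand-in for `movie_scores.csv` (an assumption, rebuilt from the table printed earlier); the comments list the row labels that survive each call:

```python
import numpy as np
import pandas as pd

# Inline stand-in for movie_scores.csv, copied from the table shown above.
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

# how='all': drop a row only if every value in it is missing.
only_empty_dropped = df.dropna(how="all")
print(only_empty_dropped.index.tolist())  # [0, 2, 3, 4]

# thresh=5: keep rows with at least 5 non-null values (row 2 has only 4).
at_least_five = df.dropna(thresh=5)
print(at_least_five.index.tolist())  # [0, 3, 4]

# subset: look for missing values in the listed columns only.
scored_only = df.dropna(subset=["pre_movie_score"])
print(scored_only.index.tolist())  # [0, 3, 4]
```

None of these calls modify `df`; each returns a new DataFrame.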


## Filling the Null Values


<a id = 'fillna'></a>
### fillna()


In [16]:
df.fillna("new_value")

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63,m,8,10
1,new_value,new_value,new_value,new_value,new_value,new_value
2,Hugh,Jackman,51,m,new_value,new_value
3,Oprah,Winfrey,66,f,6,8
4,Emma,Stone,31,f,7,9


In [17]:
df['pre_movie_score'] = df['pre_movie_score'].fillna(0)

In [18]:
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,0.0,
2,Hugh,Jackman,51.0,m,0.0,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0
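
Besides a single constant, `fillna()` also accepts a computed value or a dict that maps column names to fill values. A small sketch of both, assuming an inline stand-in for the original `movie_scores.csv` data (before any filling) and using the placeholder string `"unknown"` purely for illustration:

```python
import numpy as np
import pandas as pd

# Inline stand-in for movie_scores.csv, copied from the table shown above.
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

# Fill a numeric column with its own mean; mean() skips missing values.
filled_mean = df["pre_movie_score"].fillna(df["pre_movie_score"].mean())
print(filled_mean.tolist())  # [8.0, 7.0, 7.0, 6.0, 7.0]

# A dict fills each listed column with its own value; other columns are untouched.
filled = df.fillna({"first_name": "unknown", "post_movie_score": 0})
print(filled["first_name"].tolist())
print(filled["post_movie_score"].tolist())  # [10.0, 0.0, 0.0, 8.0, 9.0]
```

Whether a constant, a mean, or something else is appropriate depends on what the missing value means in the data; filling scores with 0, for example, changes any later average.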


<a id = 'interpolate'></a>
### interpolate()


In [19]:
airline_tix = {'first':100,'business':np.nan,'economy-plus':50,'economy':30}

In [20]:
ser = pd.Series(airline_tix)

In [21]:
ser

first           100.0
business          NaN
economy-plus     50.0
economy          30.0
dtype: float64

In [22]:
ser.interpolate()

first           100.0
business         75.0
economy-plus     50.0
economy          30.0
dtype: float64
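
The result above (75.0, halfway between 100 and 50) reflects the default `method='linear'`, which ignores the index and treats the values as equally spaced. When the index is numeric and unevenly spaced, `method='index'` uses the index values as positions instead. A minimal sketch with a hypothetical series, not the airline data:

```python
import numpy as np
import pandas as pd

# Hypothetical series whose index positions are unevenly spaced.
ser = pd.Series([100.0, np.nan, 30.0], index=[0, 1, 10])

# Default 'linear' ignores the index: the gap gets the midpoint of 100 and 30.
linear = ser.interpolate()
print(linear.tolist())  # [100.0, 65.0, 30.0]

# 'index' weights by position: 1 lies one tenth of the way from 0 to 10.
by_index = ser.interpolate(method="index")
print(by_index.tolist())  # [100.0, 93.0, 30.0]
```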
