* [np.nan, pd.NA, pd.NaT](#nan)
* [.isnull() method](#isnull)
* [.notnull() method](#notnull)
* [.dropna() method](#drop)
* [.fillna() method](#fill)
* [.interpolate() method](#fill1)

---


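__NaN__ - not a number (np.nan)

__NA__ - pandas' missing-value marker for nullable dtypes (pd.NA)

__NaT__ - not a time, the datetime equivalent of NaN (pd.NaT)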

<u> Options for Missing Data</u>
 - Keep it
 - Remove it
 - Replace it

___

- __<u>Keeping the missing data</u>__

    __PROS__: Doesn't manipulate or change the true data
    <br> __CONS__: Many ML methods do NOT support NaN
    
    
- __<u>Dropping or Removing the missing data</u>__

    __PROS__: Easy to do and can be based on rules (with pandas you can, for example, drop only the rows that are missing two or three data points)
    <br>__CONS__: Potential to lose a lot of data or useful information.
    

- __<u>Filling in the missing data</u>__

    __PROS__: Potential to save a lot of data for use in training a model
     <br>__CONS__: - Hardest to do and somewhat arbitrary<br>- Potential to lead to false conclusions

___

In [3]:
import numpy as np
import pandas as pd

<a id='nan'></a>

In [4]:
np.nan # missing value

nan

In [3]:
pd.NA

<NA>

In [5]:
pd.NaT

NaT
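As a rough illustration of where each marker tends to show up, the sketch below builds three small Series. The variable names (`floats`, `ints`, `dates`) are made up for this example; the point is only that `.isna()` treats all three markers the same way.

```python
import numpy as np
import pandas as pd

# None in a float column is stored as np.nan
floats = pd.Series([1.0, None])

# The nullable integer dtype "Int64" stores missing entries as pd.NA
ints = pd.Series([1, None], dtype="Int64")

# A missing datetime is stored as pd.NaT
dates = pd.Series(pd.to_datetime(["2020-01-01", None]))

# .isna() flags the missing entry in every case
for s in (floats, ints, dates):
    print(s.dtype, s.isna().tolist())
```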

---


#### !! Avoid ordinary comparisons (`==`) with missing values: NaN is not equal to anything, including itself.

In [6]:
np.nan == np.nan

False

In [7]:
np.nan is np.nan

True

In [8]:
myvar = np.nan
myvar is np.nan # True only because myvar is the np.nan object itself; pd.isna(myvar) is the more reliable check

True
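The identity check works above because `myvar` is literally the `np.nan` object. A minimal sketch of where it breaks down, and why `pd.isna()` is usually the safer test (the names `other_nan` and `arr_nan` are just for this example):

```python
import numpy as np
import pandas as pd

other_nan = float("nan")          # a NaN that is a different object from np.nan
arr_nan = np.array([np.nan])[0]   # a NaN read back out of a NumPy array

# Identity fails for both, even though both values are NaN
print(other_nan is np.nan)
print(arr_nan is np.nan)

# pd.isna() checks the value, not the object identity
print(pd.isna(other_nan), pd.isna(arr_nan), pd.isna(np.nan))
```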

___

In [4]:
df = pd.read_csv('movie_scores.csv')

In [5]:
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


<a id='isnull'></a>

__`.isnull() method`__

Returns an object of the same shape filled with booleans: `True` where the value is missing, `False` otherwise.

In [6]:
df.isnull()

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,False,False,False,False,False,False
1,True,True,True,True,True,True
2,False,False,False,False,True,True
3,False,False,False,False,False,False
4,False,False,False,False,False,False
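Because `True` counts as 1, the boolean mask can be summed to count missing values. The sketch below uses a hand-built copy of the table shown above rather than reading `movie_scores.csv`, so the DataFrame construction is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the displayed table (the CSV file itself is not assumed to be available)
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63, np.nan, 51, 66, 31],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8, np.nan, np.nan, 6, 7],
    "post_movie_score": [10, np.nan, np.nan, 8, 9],
})

missing_per_column = df.isnull().sum()           # missing values in each column
missing_per_row = df.isnull().sum(axis=1)        # missing values in each row
rows_with_any_missing = df.isnull().any(axis=1)  # True for rows with at least one gap

print(missing_per_column)
print(missing_per_row.tolist())
```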


<a id='notnull'></a>

__`.notnull() method`__

In [7]:
df.notnull()

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,True,True,True,True,True,True
1,False,False,False,False,False,False
2,True,True,True,True,False,False
3,True,True,True,True,True,True
4,True,True,True,True,True,True


In [8]:
df['pre_movie_score'].notnull()

0     True
1    False
2    False
3     True
4     True
Name: pre_movie_score, dtype: bool

In [9]:
df[df['pre_movie_score'].notnull()]

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [10]:
df[(df['pre_movie_score'].isnull()) & (df['first_name'].notnull())]

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
2,Hugh,Jackman,51.0,m,,


___

#### Dropping data

In [11]:
help(df.dropna)

Help on method dropna in module pandas.core.frame:

dropna(axis: 'Axis' = 0, how: 'str' = 'any', thresh=None, subset=None, inplace: 'bool' = False) method of pandas.core.frame.DataFrame instance
    Remove missing values.
    
    See the :ref:`User Guide <missing_data>` for more on which values are
    considered missing, and how to work with missing data.
    
    Parameters
    ----------
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Determine if rows or columns which contain missing values are
        removed.
    
        * 0, or 'index' : Drop rows which contain missing values.
        * 1, or 'columns' : Drop columns which contain missing value.
    
        .. versionchanged:: 1.0.0
    
           Pass tuple or list to drop on multiple axes.
           Only a single axis is allowed.
    
    how : {'any', 'all'}, default 'any'
        Determine if row or column is removed from DataFrame, when we have
        at least one NA or all NA.
    
        * 'any' : If a

<a id='drop'></a>

__`.dropna() method`__

In [14]:
df.dropna() # drop any rows that have any missing values

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0
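The default shown above corresponds to `how='any'`. As a sketch of the other option, `how='all'` drops a row only when every value in it is missing. The DataFrame here is a hand-built copy of the displayed table, which is an assumption made so the example runs without the CSV file.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the displayed table (the CSV file itself is not assumed to be available)
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63, np.nan, 51, 66, 31],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8, np.nan, np.nan, 6, 7],
    "post_movie_score": [10, np.nan, np.nan, 8, 9],
})

dropped_any = df.dropna(how="any")  # same as df.dropna(): drops rows 1 and 2
dropped_all = df.dropna(how="all")  # drops only row 1, which is entirely empty

print(dropped_any.index.tolist())
print(dropped_all.index.tolist())
```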


- __<u>thresh argument</u>__ (a row is kept only if it has at least that many non-NA values)

In [22]:
df.dropna(thresh=1)

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [23]:
df.dropna(thresh=4)

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [24]:
df.dropna(thresh=5) # Hugh Jackman row is dropped because it doesn't have at least five non null values.

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0
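One way to see why `thresh=4` keeps the Hugh Jackman row and `thresh=5` drops it is to count the non-null values per row. This is a sketch on a hand-built copy of the displayed table (an assumption, so the example does not depend on the CSV file).

```python
import numpy as np
import pandas as pd

# Hand-built copy of the displayed table (the CSV file itself is not assumed to be available)
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63, np.nan, 51, 66, 31],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8, np.nan, np.nan, 6, 7],
    "post_movie_score": [10, np.nan, np.nan, 8, 9],
})

# Number of non-null values in each row; row 2 (Hugh Jackman) has 4
non_null_counts = df.notnull().sum(axis=1)
print(non_null_counts.tolist())

# thresh=5 keeps exactly the rows with at least 5 non-null values
kept_by_mask = df[non_null_counts >= 5]
kept_by_thresh = df.dropna(thresh=5)
print(kept_by_mask.equals(kept_by_thresh))
```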


- __<u>axis argument</u>__

In [25]:
df.dropna(axis=1) # every column has at least one missing value, so all columns are dropped and only the index remains

0
1
2
3
4


- __<u>subset argument</u>__ (restricts the check to the listed columns; missing values elsewhere are ignored)

In [27]:
df.dropna(subset=['last_name']) # only considers last_name column

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0
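`subset` also accepts several columns. The sketch below checks only the two score columns, again on a hand-built copy of the displayed table (an assumption made so the example is self-contained).

```python
import numpy as np
import pandas as pd

# Hand-built copy of the displayed table (the CSV file itself is not assumed to be available)
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63, np.nan, 51, 66, 31],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8, np.nan, np.nan, 6, 7],
    "post_movie_score": [10, np.nan, np.nan, 8, 9],
})

# Only the score columns are inspected; rows 1 and 2 lack both scores
scored = df.dropna(subset=["pre_movie_score", "post_movie_score"])

# Only last_name is inspected; row 2 survives despite its missing scores
named = df.dropna(subset=["last_name"])

print(scored.index.tolist())
print(named.index.tolist())
```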


____

#### Filling in the data

<a id='fill'></a>

__`.fillna() method`__

In [29]:
df.fillna('New Value!')

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,New Value!,New Value!,New Value!,New Value!,New Value!,New Value!
2,Hugh,Jackman,51.0,m,New Value!,New Value!
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [30]:
# grab the column you're interested in

df['pre_movie_score'].fillna(0)

0    8.0
1    0.0
2    0.0
3    6.0
4    7.0
Name: pre_movie_score, dtype: float64

`.fillna()` returns a new object and leaves `df` unchanged. To make the change permanent, assign the result back to the column.

In [31]:
df['pre_movie_score'] = df['pre_movie_score'].fillna(0)

In [32]:
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,0.0,
2,Hugh,Jackman,51.0,m,0.0,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [33]:
df = pd.read_csv('movie_scores.csv')

In [34]:
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


Let's say I want to fill in these null values with the average pre movie score.

In [37]:
df['pre_movie_score'].mean() 

# .mean() skips missing values by default, so the average is computed from the existing values only

7.0

In [38]:
df['pre_movie_score'].fillna(df['pre_movie_score'].mean())

0    8.0
1    7.0
2    7.0
3    6.0
4    7.0
Name: pre_movie_score, dtype: float64

In [40]:
df.fillna(df.mean(numeric_only=True)) # fills only the numeric columns, each with its own mean (not always reasonable)


Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,52.75,,7.0,9.0
2,Hugh,Jackman,51.0,m,7.0,9.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0
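When different columns need different fill values, `.fillna()` also accepts a dict mapping column names to values. A minimal sketch on a hand-built copy of the displayed table; the `'unknown'` placeholder is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the displayed table (the CSV file itself is not assumed to be available)
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63, np.nan, 51, 66, 31],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8, np.nan, np.nan, 6, 7],
    "post_movie_score": [10, np.nan, np.nan, 8, 9],
})

# Columns not named in the dict are left untouched
filled = df.fillna({
    "pre_movie_score": df["pre_movie_score"].mean(),  # mean of the existing scores
    "first_name": "unknown",                          # arbitrary placeholder text
})

print(filled[["first_name", "pre_movie_score", "post_movie_score"]])
```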


___

<a id='fill1'></a>

__`.interpolate() method`__

In [41]:
airline_tix = {
    'first': 100,
    'business': np.nan,
    'economy_plus': 50,
    'economy': 30
}

In [42]:
ser = pd.Series(airline_tix)
ser

first           100.0
business          NaN
economy_plus     50.0
economy          30.0
dtype: float64

#### !!! big assumption here

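It might make sense to interpolate this, assuming __it's already in the correct order__. By default, `.interpolate()` is linear and treats the values as equally spaced, so the missing value becomes the midpoint of its two neighbours.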

In [43]:
ser.interpolate()

first           100.0
business         75.0
economy_plus     50.0
economy          30.0
dtype: float64
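To make the ordering assumption concrete, the sketch below interpolates the same ticket prices in two different orders and also fills a run of two consecutive gaps. The `reordered` and `gaps` Series are made up for this example.

```python
import numpy as np
import pandas as pd

ser = pd.Series({"first": 100, "business": np.nan, "economy_plus": 50, "economy": 30})

# Same data in a different order: business now sits between economy and economy_plus
reordered = ser.reindex(["first", "economy", "business", "economy_plus"])

print(ser.interpolate()["business"])        # midpoint of 100 and 50
print(reordered.interpolate()["business"])  # midpoint of 30 and 50

# Consecutive gaps are filled in equal steps between the known neighbours
gaps = pd.Series([100, np.nan, np.nan, 40])
print(gaps.interpolate().tolist())
```

The result depends entirely on which values happen to be adjacent, which is why the order of the data has to be meaningful before interpolating.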