### Dealing with Missing Data

Pandas provides many flexible ways to handle missing data.

In [1]:
import io
import pandas as pd


data = '''Name|Age|Color
          Fred|22|Red
          Sally|29|Blue
          George|24|
          Fido||Black'''

df = pd.read_table(io.StringIO(data), sep='|')

df

Unnamed: 0,Name,Age,Color
0,Fred,22.0,Red
1,Sally,29.0,Blue
2,George,24.0,
3,Fido,,Black


### Finding Missing Data

In data science we often work with datasets that are too large to check for missing data by eye. Pandas provides several convenient ways to check for it. The `.isnull()` method returns a DataFrame of Boolean values that are `True` wherever a value is missing:

In [2]:
df.isnull()

Unnamed: 0,Name,Age,Color
0,False,False,False
1,False,False,False
2,False,False,True
3,False,True,False


Combining this with the `.any()` method, we can check whether each column contains any missing data:

In [3]:
df.isnull().any()

Name     False
Age       True
Color     True
dtype: bool
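
The same idea extends to rows and to counts. The sketch below rebuilds the example table by hand (an assumption, so the block runs on its own) and uses `axis=1` to flag rows with missing values and `.sum()` to count missing values per column:

```python
import numpy as np
import pandas as pd

# Hand-built copy of the example table so this block is self-contained.
df = pd.DataFrame({
    'Name': ['Fred', 'Sally', 'George', 'Fido'],
    'Age': [22, 29, 24, np.nan],
    'Color': ['Red', 'Blue', np.nan, 'Black'],
})

# axis=1 reduces across the columns, giving one Boolean per row.
rows_with_missing = df.isnull().any(axis=1)

# True counts as 1, so summing the mask counts missing values per column.
missing_per_column = df.isnull().sum()

print(rows_with_missing)
print(missing_per_column)
```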

### Dropping Missing Data

The `.dropna()` method drops every row that contains at least one NaN entry:

In [5]:
df.dropna()  # Returns a new DataFrame; df itself is not modified

Unnamed: 0,Name,Age,Color
0,Fred,22.0,Red
1,Sally,29.0,Blue
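
`.dropna()` also accepts parameters that narrow what gets dropped. The following is a minimal sketch, again on a hand-built copy of the example table (an assumption for self-containment), showing `subset=`, `axis=1`, and `thresh=`:

```python
import numpy as np
import pandas as pd

# Hand-built copy of the example table so this block is self-contained.
df = pd.DataFrame({
    'Name': ['Fred', 'Sally', 'George', 'Fido'],
    'Age': [22, 29, 24, np.nan],
    'Color': ['Red', 'Blue', np.nan, 'Black'],
})

# Only drop rows where 'Age' is missing; a missing 'Color' is tolerated.
age_known = df.dropna(subset=['Age'])

# Drop columns (rather than rows) that contain any missing value.
complete_columns = df.dropna(axis=1)

# Keep only rows that have at least 3 non-missing values.
at_least_three = df.dropna(thresh=3)

print(age_known)
print(complete_columns)
print(at_least_three)
```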


Alternatively, we can first use `.notnull()` (the opposite of `.isnull()`) to generate a Boolean mask, then use the mask for a particular column to select rows of the DataFrame:

In [7]:
valid = df.notnull()
df[valid.loc[:, 'Age']]

Unnamed: 0,Name,Age,Color
0,Fred,22.0,Red
1,Sally,29.0,Blue
2,George,24.0,


In [8]:
df[valid.loc[:, 'Color']]

Unnamed: 0,Name,Age,Color
0,Fred,22.0,Red
1,Sally,29.0,Blue
3,Fido,,Black


Masks can be combined with Boolean operators such as `&` to build more complex conditions:

In [10]:
mask = valid.loc[:, 'Age'] & valid.loc[:, 'Color']

mask

0     True
1     True
2    False
3    False
dtype: bool

In [12]:
df[mask]

Unnamed: 0,Name,Age,Color
0,Fred,22.0,Red
1,Sally,29.0,Blue
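
Masks can also be combined with `|` (or) and inverted with `~` (not). As a small sketch on a hand-built copy of the example table (an assumption for self-containment), this selects the rows that are missing an age or a color, and then the complement:

```python
import numpy as np
import pandas as pd

# Hand-built copy of the example table so this block is self-contained.
df = pd.DataFrame({
    'Name': ['Fred', 'Sally', 'George', 'Fido'],
    'Age': [22, 29, 24, np.nan],
    'Color': ['Red', 'Blue', np.nan, 'Black'],
})

# True where either column is missing.
missing_any = df['Age'].isnull() | df['Color'].isnull()

incomplete_rows = df[missing_any]   # rows with a gap in Age or Color
complete_rows = df[~missing_any]    # inverted mask: rows with both present

print(incomplete_rows)
print(complete_rows)
```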


### Inserting Data for Missing Data

Pandas provides the `.fillna()` method to fill NaN values with a value of our choosing:

In [13]:
df.fillna('missing')  # Returns a new DataFrame; df itself is not modified

Unnamed: 0,Name,Age,Color
0,Fred,22,Red
1,Sally,29,Blue
2,George,24,missing
3,Fido,missing,Black


We can pass a dictionary to specify fill values on a per-column basis:

In [14]:
df.fillna({'Age': df.loc[:, 'Age'].median(),
           'Color': 'Pink'})

Unnamed: 0,Name,Age,Color
0,Fred,22.0,Red
1,Sally,29.0,Blue
2,George,24.0,Pink
3,Fido,24.0,Black
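
Because `.fillna()` returns a new object, the result has to be assigned back to be kept. One possible pattern, sketched on a hand-built copy of the example table (an assumption for self-containment), fills a single column with its mean:

```python
import numpy as np
import pandas as pd

# Hand-built copy of the example table so this block is self-contained.
df = pd.DataFrame({
    'Name': ['Fred', 'Sally', 'George', 'Fido'],
    'Age': [22, 29, 24, np.nan],
    'Color': ['Red', 'Blue', np.nan, 'Black'],
})

# Work on a copy so the original DataFrame stays untouched.
filled = df.copy()

# .mean() skips NaN by default, so this is the mean of 22, 29 and 24.
filled['Age'] = filled['Age'].fillna(filled['Age'].mean())

print(filled)
```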


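The `.fillna()` method has a `method=` parameter, to which we can pass `ffill` for forward filling (use the value before the missing value) or `bfill` for backward filling (use the value after the missing value). Recent pandas releases deprecate `method=` in favor of the dedicated `.ffill()` and `.bfill()` methods, but the behavior shown here is the same: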

In [15]:
df.fillna(method='ffill')

Unnamed: 0,Name,Age,Color
0,Fred,22.0,Red
1,Sally,29.0,Blue
2,George,24.0,Blue
3,Fido,24.0,Black


In [16]:
df.fillna(method='bfill')

Unnamed: 0,Name,Age,Color
0,Fred,22.0,Red
1,Sally,29.0,Blue
2,George,24.0,Black
3,Fido,,Black
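
The same fills can be written with the dedicated `.ffill()` and `.bfill()` methods. This sketch uses a hand-built copy of the example table (an assumption for self-containment):

```python
import numpy as np
import pandas as pd

# Hand-built copy of the example table so this block is self-contained.
df = pd.DataFrame({
    'Name': ['Fred', 'Sally', 'George', 'Fido'],
    'Age': [22, 29, 24, np.nan],
    'Color': ['Red', 'Blue', np.nan, 'Black'],
})

# Forward fill: each gap takes the last valid value above it.
filled_forward = df.ffill()

# Backward fill: each gap takes the next valid value below it.
# The last 'Age' has nothing below it, so it stays missing.
filled_backward = df.bfill()

print(filled_forward)
print(filled_backward)
```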


These can also be applied along each row by passing `axis=1`, in which case a forward fill takes the value from the column to the left:

In [18]:
df.fillna(method='ffill', axis=1)

Unnamed: 0,Name,Age,Color
0,Fred,22,Red
1,Sally,29,Blue
2,George,24,24
3,Fido,Fido,Black


With numerical data, we can use `.interpolate()` to fill in values by linear interpolation:

In [19]:
df.interpolate()

Unnamed: 0,Name,Age,Color
0,Fred,22.0,Red
1,Sally,29.0,Blue
2,George,24.0,
3,Fido,24.0,Black
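
To make the behavior easier to see, here is a small sketch on a purely numeric Series (the values are made up for illustration). An interior gap is filled with the midpoint of its neighbors, while a trailing gap has no right-hand neighbor and repeats the last valid value, which is why Fido's age comes out as 24.0 above:

```python
import numpy as np
import pandas as pd

# Illustrative values: one interior gap and one trailing gap.
s = pd.Series([1.0, np.nan, 3.0, np.nan])

# Default method is linear interpolation over the positions.
interpolated = s.interpolate()

print(interpolated)
```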


The SciPy package provides additional interpolation options.

Finally, the `.replace()` method fills in missing values just like any other replacement operation:

In [20]:
import numpy as np


df.replace(np.nan, value=-1)

Unnamed: 0,Name,Age,Color
0,Fred,22.0,Red
1,Sally,29.0,Blue
2,George,24.0,-1
3,Fido,-1.0,Black
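
`.replace()` works in the other direction too. As a hedged sketch, suppose a dataset encodes missing ages with a sentinel value of `-1` (a hypothetical convention, not something in the table above); replacing the sentinel with NaN lets the methods described earlier detect it:

```python
import numpy as np
import pandas as pd

# Hypothetical data in which -1 stands for "age unknown".
ages = pd.Series([22, -1, 24])

# Swap the sentinel for NaN so pandas treats it as missing.
cleaned = ages.replace(-1, np.nan)

print(cleaned.isnull())
```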
