# 7.1 Handling Missing Data

1. [General Info](#general)
2. [Filtering Out Missing Data](#filter)
3. [Filling In Missing Data](#fill)

<a name="general"></a>
# General Info

Remember that all descriptive statistics on pandas objects exclude missing data by default.  
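
As a quick check of that default, the sketch below compares `mean` with and without skipping NAs. The three-element Series and the variable names are made up for illustration; `skipna` is the argument that controls the behavior, and `count` likewise tallies only non-missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: one missing value out of three
s = pd.Series([1.0, np.nan, 3.0])

mean_default = s.mean()             # NaN is skipped, so only 1.0 and 3.0 are averaged
mean_strict = s.mean(skipna=False)  # NaN propagates when skipping is turned off
n_present = s.count()               # counts only the non-missing entries

print(mean_default, mean_strict, n_present)
```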

The term *sentinel value* refers to a value that indicates a null/missing value. In pandas, `NaN` (not a number) marks missing data in the `float64` dtype, while base Python uses `None`.  

Both of these are analogous to R's `NA` (not available).

Both `NaN` and `None` are treated as missing in pandas (see the examples below).

<img src="./myImages/table7.1_naHandlingMethods.png" width = 600>


In [42]:
import pandas as pd
import numpy as np

In [43]:
# Make a pandas Series with float data
float_data = pd.Series([1.2, -3.5, np.nan, 0])
float_data


0    1.2
1   -3.5
2    NaN
3    0.0
dtype: float64

In [44]:
# Determine the NA-ness of each value with the isna method
float_data.isna()

0    False
1    False
2     True
3    False
dtype: bool

In [45]:
# New series with string data
string_data = pd.Series(["aardvark", np.nan, None, "avocado"])
string_data

0    aardvark
1         NaN
2        None
3     avocado
dtype: object

In [46]:
# NaN and None are both NAs
string_data.isna()

0    False
1     True
2     True
3    False
dtype: bool

In [47]:
# Python None is converted to pandas NaN if data are float
float_data = pd.Series([1, 2, None], dtype='float64')
print(float_data)
float_data.isna()

0    1.0
1    2.0
2    NaN
dtype: float64


0    False
1    False
2     True
dtype: bool

<a name="filter"></a>
# Filtering Out Missing Data

The table above lists the `dropna` method, which simplifies the process of removing missing entries.  

For a Series the process is straightforward: `dropna` returns a smaller Series with the missing items removed.

In [48]:
data = pd.Series([1, np.nan, 3.5, np.nan, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [49]:
# Remove NAs with dropna
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [50]:
# Equivalent using notna
data[data.notna()]

0    1.0
2    3.5
4    7.0
dtype: float64

DataFrame objects have more options - should rows/columns with ALL NA be dropped, or those with ANY NA? 

The default behavior of `dropna` for a DataFrame is to drop ANY ROW containing a missing value.  

Include the `how` argument to change this. (`how="all"` will change to only dropping rows with ALL NAs).  

Instead of `how` (which accepts only "all" and "any"), you can use `thresh` to set the minimum number of non-NA values a row/column must contain in order to be kept; rows/columns with fewer non-NA values are dropped.

Just like other DataFrame methods, use `axis="columns"` to change the behavior to work in the other direction.  
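
Because `thresh` counts non-NA values rather than NAs, it is easy to misread. The sketch below, on a small made-up frame (`frame`, `non_na_per_row`, `kept` and `manual` are illustrative names), compares `dropna(thresh=2)` with a hand-built Boolean mask on the per-row non-NA count.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame: rows hold 3, 1, 0 and 2 non-NA values
frame = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                      [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

# Number of non-NA values in each row
non_na_per_row = frame.notna().sum(axis="columns")

# thresh=2 keeps rows with at least 2 non-NA values
kept = frame.dropna(thresh=2)

# The same rows selected by hand with a Boolean mask
manual = frame[non_na_per_row >= 2]

print(kept.equals(manual))
```

Only rows 0 and 3 survive here, since rows 1 and 2 have fewer than two non-NA values.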

In [51]:
# Make a DataFrame with NAs
data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [52]:
# Default - drop rows with ANY NA
data.dropna()

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [53]:
# Specify to drop rows with ALL instead
data.dropna(how="all")

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [54]:
# Add a new column of all NAs
data[4] = np.nan
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [55]:
# Use axis arg to drop columns
data.dropna(axis="columns", how="all")

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [56]:
# Another DataFrame
df = pd.DataFrame(np.random.standard_normal((7, 3)))
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan
df

Unnamed: 0,0,1,2
0,-0.055999,,
1,0.160595,,
2,1.938864,,0.792714
3,-0.109892,,-0.347517
4,-0.354505,1.184194,0.231361
5,0.357778,0.330267,-1.423421
6,-1.502563,0.326452,-0.088034


In [57]:
# Default - drop rows with ANY NA
df.dropna()

Unnamed: 0,0,1,2
4,-0.354505,1.184194,0.231361
5,0.357778,0.330267,-1.423421
6,-1.502563,0.326452,-0.088034


In [58]:
# Keep only rows with at least 2 non-NA values
df.dropna(thresh=2)

Unnamed: 0,0,1,2
2,1.938864,,0.792714
3,-0.109892,,-0.347517
4,-0.354505,1.184194,0.231361
5,0.357778,0.330267,-1.423421
6,-1.502563,0.326452,-0.088034


In [59]:
# thresh also works with columns; here every column has at least 3 non-NA values, so none is dropped
df.dropna(axis="columns", thresh=3)

Unnamed: 0,0,1,2
0,-0.055999,,
1,0.160595,,
2,1.938864,,0.792714
3,-0.109892,,-0.347517
4,-0.354505,1.184194,0.231361
5,0.357778,0.330267,-1.423421
6,-1.502563,0.326452,-0.088034
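
To see `thresh` actually remove a column, the threshold has to exceed that column's non-NA count. The sketch below rebuilds the same NA pattern with fixed values instead of random draws (the frame and variable names are assumptions for illustration) and compares `thresh=3` with `thresh=4`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same NA pattern as above, using fixed values:
# columns 0, 1 and 2 hold 7, 3 and 5 non-NA values respectively
frame = pd.DataFrame(np.arange(21, dtype="float64").reshape(7, 3))
frame.iloc[:4, 1] = np.nan
frame.iloc[:2, 2] = np.nan

# Number of non-NA values in each column
non_na_per_col = frame.notna().sum()

# Every column has at least 3 non-NA values, so nothing is dropped
kept_3 = frame.dropna(axis="columns", thresh=3)

# Column 1 has only 3 non-NA values, so it falls below the threshold of 4
kept_4 = frame.dropna(axis="columns", thresh=4)

print(list(kept_3.columns), list(kept_4.columns))
```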


<a name="fill"></a>
# Filling In Missing Data

Just as `dropna` drops NAs quickly (as opposed to subsetting with the Boolean DataFrame returned by a call to `isna`), `fillna` fills in NAs easily:

1. Use a single value to replace all missing values with it
2. Provide a dictionary to replace missing values in each column with a different value (dictionary keys correspond to column keys)

Originally, `fillna` accepted the same interpolation methods as `reindex`, filling values either forward (`ffill`) or backward (`bfill`) from the nearest existing value. Now, use the methods `ffill` and `bfill` themselves (see below).
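
For reference, the `reindex` interpolation mentioned here can be sketched as follows. The Series and its index are made up for illustration; `method="ffill"` is the `reindex` argument, and the second route reindexes first and then calls `ffill`.

```python
import pandas as pd

# Hypothetical Series with gaps in its (sorted) index
obj = pd.Series([10.0, 20.0, 30.0], index=[0, 2, 4])

# Fill while reindexing: each new label takes the last preceding value
via_reindex = obj.reindex(range(6), method="ffill")

# Reindex first (which introduces NAs), then forward fill them
via_method = obj.reindex(range(6)).ffill()

print(via_reindex.tolist())
print(via_method.tolist())
```

Both routes give the same values in this example; the second one is the pattern used in the rest of this section.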

You can even do imputations, like replacing missing values with the median of the row/column.
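
A minimal sketch of median imputation on a DataFrame, using a made-up two-column frame. Passing the Series returned by `median` to `fillna` fills column by column; for the row-wise case this sketch falls back on `apply`, which is one possible approach rather than the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame with one NA per column
frame = pd.DataFrame({"a": [1.0, np.nan, 3.0, 8.0],
                      "b": [np.nan, 2.0, 4.0, 6.0]})

# Column-wise: fillna matches the Series of medians to the column labels
col_filled = frame.fillna(frame.median())

# Row-wise: fill each row's NAs with that row's own median
row_filled = frame.apply(lambda row: row.fillna(row.median()), axis="columns")

print(col_filled)
print(row_filled)
```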

<img src="./myImages/table7.2_fillnaArgs.png">

In [60]:
df

Unnamed: 0,0,1,2
0,-0.055999,,
1,0.160595,,
2,1.938864,,0.792714
3,-0.109892,,-0.347517
4,-0.354505,1.184194,0.231361
5,0.357778,0.330267,-1.423421
6,-1.502563,0.326452,-0.088034


In [61]:
# Replace all NA with 0
df.fillna(0)

Unnamed: 0,0,1,2
0,-0.055999,0.0,0.0
1,0.160595,0.0,0.0
2,1.938864,0.0,0.792714
3,-0.109892,0.0,-0.347517
4,-0.354505,1.184194,0.231361
5,0.357778,0.330267,-1.423421
6,-1.502563,0.326452,-0.088034


In [63]:
# Replace NAs in column 1 with 0.5 and NAs in column 2 with 0
df.fillna({1: 0.5, 2: 0})

Unnamed: 0,0,1,2
0,-0.055999,0.5,0.0
1,0.160595,0.5,0.0
2,1.938864,0.5,0.792714
3,-0.109892,0.5,-0.347517
4,-0.354505,1.184194,0.231361
5,0.357778,0.330267,-1.423421
6,-1.502563,0.326452,-0.088034


In [64]:
# Backfill: each NA takes the next valid value below it
df.bfill()

Unnamed: 0,0,1,2
0,-0.055999,1.184194,0.792714
1,0.160595,1.184194,0.792714
2,1.938864,1.184194,0.792714
3,-0.109892,1.184194,-0.347517
4,-0.354505,1.184194,0.231361
5,0.357778,0.330267,-1.423421
6,-1.502563,0.326452,-0.088034


In [65]:
# With limit: fill at most 2 consecutive NAs
df.bfill(limit=2)

Unnamed: 0,0,1,2
0,-0.055999,,0.792714
1,0.160595,,0.792714
2,1.938864,1.184194,0.792714
3,-0.109892,1.184194,-0.347517
4,-0.354505,1.184194,0.231361
5,0.357778,0.330267,-1.423421
6,-1.502563,0.326452,-0.088034


In [66]:
# New data for forward fill
df = pd.DataFrame(np.random.standard_normal((6, 3)))
df.iloc[2:, 1] = np.nan
df.iloc[4:, 2] = np.nan
df

Unnamed: 0,0,1,2
0,0.874104,1.076601,0.872578
1,0.876667,0.229609,-0.019523
2,-1.385373,,0.598241
3,1.303023,,-1.045214
4,-1.308482,,
5,-0.352827,,


In [67]:
# Forward fill all
df.ffill()

Unnamed: 0,0,1,2
0,0.874104,1.076601,0.872578
1,0.876667,0.229609,-0.019523
2,-1.385373,0.229609,0.598241
3,1.303023,0.229609,-1.045214
4,-1.308482,0.229609,-1.045214
5,-0.352827,0.229609,-1.045214


In [68]:
# Forward fill some
df.ffill(limit=2)

Unnamed: 0,0,1,2
0,0.874104,1.076601,0.872578
1,0.876667,0.229609,-0.019523
2,-1.385373,0.229609,0.598241
3,1.303023,0.229609,-1.045214
4,-1.308482,,-1.045214
5,-0.352827,,-1.045214


In [69]:
# New data for imputation example
data = pd.Series([1., np.nan, 3.5, np.nan, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [70]:
# Impute NAs with the mean of the non-missing values
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64

In [71]:
# Impute NAs with the median of the non-missing values
data.fillna(data.median())

0    1.0
1    3.5
2    3.5
3    3.5
4    7.0
dtype: float64