# Pandas Topic: Filling in Missing Data


Xinhe Wang

xinhew@umich.edu

## Fill in Missing Data

- I will introduce some ways of using ```fillna()``` to fill in missing 
data (```NaN``` values) in a DataFrame.
- One of the easiest ways to handle missing values is to drop the rows
that contain them.
- However, data is generally expensive to collect, and dropping a row also
discards the valid values in its other columns.
- There are many ways to fill in the missing values:
    - Treat the ```NaN``` value as a feature -> fill in with 0;
    - Use statistics -> fill in with column mean/median/percentile/a
    random value;
    - Use the "neighbors" -> fill in with the last or next values;
    - Prediction methods -> use regression/machine learning models to 
    predict the missing value.
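
To make the trade-off between dropping and filling concrete, here is a minimal sketch. The frame `toy` and its columns `x` and `y` are hypothetical and are not the dataset used in the rest of this tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (an assumption for illustration only).
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 4.0],
    "y": [np.nan, 2.0, 2.0, 8.0],
})

# Count the missing values per column before choosing a strategy.
missing_per_column = toy.isna().sum()

# Option 1: drop every row that contains at least one NaN.
dropped = toy.dropna()

# Option 2: keep all rows and fill each NaN with its column median.
filled = toy.fillna(toy.median())

print(missing_per_column)
print(len(toy), len(dropped), len(filled))
```

Dropping keeps only the two complete rows here, while filling keeps all four, which is why filling is usually preferred when rows are scarce.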

## Example Data
- Here we generate a small example dataset with missing values.

- Suppose column "f" should indicate whether the value in column "b" is
larger than 0. For the missing value in column "b", 
```df['b'] > 0``` returns ```False``` instead of a ```NaN``` value, so the
missing entry is silently treated as "not larger than 0".
Therefore, ```NaN``` values need to be dealt with before further steps.
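
The comparison behavior described above can be checked on a small standalone Series. The sketch below also shows one possible way to keep the missingness visible, using the nullable `"boolean"` dtype. The names `naive_flag` and `safe_flag` are hypothetical, and this is only one option, not the approach used later in this tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for column "b".
b = pd.Series([-0.5, 1.2, np.nan, 0.7], name="b")

# A plain comparison silently maps the NaN entry to False.
naive_flag = b > 0

# Convert to the nullable boolean dtype, then mark the entries whose
# input was missing as pd.NA instead of False.
safe_flag = (b > 0).astype("boolean")
safe_flag[b.isna()] = pd.NA

print(naive_flag.tolist())
print(safe_flag.isna().tolist())
```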

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.DataFrame(np.random.randn(5, 4),
                  columns=['a', 'b', 'c', 'd'])
df.iloc[2, 1] = np.nan
df.iloc[3:5, 0] = np.nan
df['e'] = [0, np.nan, 0, 0, 0]
df['f'] = df['b']  > 0
df

          a          b          c          d    e      f
0  -0.24766  -0.059548   0.134923   0.696526  0.0  False
1   0.69001   1.830447    0.97568  -1.869097  NaN   True
2   0.31765        NaN   0.915959   1.935111  0.0  False
3       NaN   0.619568   1.714032   -0.63752  0.0   True
4       NaN   0.391781  -1.681306  -0.247382  0.0   True


## Fill in with a scalar value
- We can fill in ```NaN``` values with a designated value using 
```fillna()```.

In [3]:
df['e'].fillna(0)

0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: e, dtype: float64

In [4]:
df['e'].fillna("missing")

0        0.0
1    missing
2        0.0
3        0.0
4        0.0
Name: e, dtype: object
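
`fillna()` also accepts a dict that maps column names to fill values, so several columns can be filled with different scalars in one call. The sketch below uses a hypothetical frame `toy`, and the fill values are arbitrary. It also shows that `fillna()` returns a new object and leaves the original unchanged unless the result is assigned.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "e": [0.0, np.nan, 0.0]})

# Fill each column with its own scalar; columns not in the dict are untouched.
filled = toy.fillna({"a": -1.0, "e": 0.0})

# The original frame still contains its NaN values.
print(toy.isna().sum().tolist())
print(filled)
```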

## Fill in with statistics (median, mean, ...)
- One of the most commonly used techniques is to fill in missing values
with column median or mean.
- The example below fills the missing value in column "b" with the 
column mean.

In [5]:
df['b'].fillna(df['b'].mean())

0   -0.059548
1    1.830447
2    0.695562
3    0.619568
4    0.391781
Name: b, dtype: float64
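
The same pattern works for the median, and passing a Series of per-column statistics fills every column at once. The sketch below uses a hypothetical frame `toy` with one large value in column `a`, to show how the mean and the median can give quite different fill values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; 100.0 acts as an outlier in column "a".
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 100.0],
    "b": [2.0, 4.0, np.nan, 6.0],
})

# Fill a single column with its median (less sensitive to the outlier).
a_median = toy["a"].fillna(toy["a"].median())

# Fill every column with its own mean: toy.mean() is a Series indexed
# by column name, and fillna() matches it against the columns.
all_mean = toy.fillna(toy.mean())

print(a_median.tolist())
print(all_mean)
```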

## Fill in with forward or backward values
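- We can also fill in a missing value from its "neighbor" (the previous 
or next value in the column) using ```fillna()```.
- This is useful when the rows are ordered, for example in a time series.
- When the ```method``` argument of ```fillna()``` is set to ```pad``` 
or ```ffill```, values are filled forward; when ```method``` is set to
```bfill``` or ```backfill```, values are filled backward.
- The ```limit``` argument of ```fillna()``` sets the maximum number 
of consecutive ```NaN``` values it is allowed to fill.
- Note that the ```method``` argument is deprecated as of pandas 2.1; in
newer versions, use the ```ffill()``` and ```bfill()``` methods instead.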

In [6]:
df['a'].fillna(method='pad', limit=1)

0   -0.24766
1    0.69001
2    0.31765
3    0.31765
4        NaN
Name: a, dtype: float64

In [7]:
df['a'].fillna(method='bfill', limit=1)

0   -0.24766
1    0.69001
2    0.31765
3        NaN
4        NaN
Name: a, dtype: float64
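
Recent pandas versions deprecate the `method` argument of `fillna()` and provide the dedicated `ffill()` and `bfill()` methods, which accept the same `limit` argument. The sketch below uses a hypothetical Series `s` rather than column "a", so that both directions have something to fill.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a gap of two NaNs and a trailing NaN.
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill, at most one consecutive NaN per gap.
forward = s.ffill(limit=1)

# Backward fill, at most one consecutive NaN per gap; the trailing NaN
# has no later value to copy from and therefore stays missing.
backward = s.bfill(limit=1)

print(forward.tolist())
print(backward.tolist())
```

With `limit=1`, each gap of two `NaN` values is only half filled, which mirrors the `limit` behavior in the `fillna()` calls above.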