# Building Good Training Sets – Data Preprocessing

Most of the content in these notes comes from [*Python Machine Learning, 2nd Edition*](https://github.com/rasbt/python-machine-learning-book-2nd-edition).

## Dealing with missing data

### Identifying missing values in tabular data

First, create a small sample dataset from a CSV-formatted string:

In [13]:
import pandas as pd
from io import StringIO
import sys

# note: do not add extra spaces inside csv_data
csv_data = \
'''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''

df = pd.read_csv(StringIO(csv_data))
df

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.0 | 2.0 | 3.0 | 4.0 |
| 1 | 5.0 | 6.0 | NaN | 8.0 |
| 2 | 10.0 | 11.0 | 12.0 | NaN |


In [14]:
df.isnull().sum()

A    0
B    0
C    1
D    1
dtype: int64

In [15]:
df.values

array([[ 1.,  2.,  3.,  4.],
       [ 5.,  6., nan,  8.],
       [10., 11., 12., nan]])
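
To see where the missing values are, rather than only how many there are per column, the boolean mask returned by `isnull()` can be inspected directly. The following is a minimal sketch that rebuilds the same sample data so it runs on its own; the names `mask`, `nan_per_row`, and `rows_with_nan` are illustrative and do not appear in the original notes.

```python
import pandas as pd
from io import StringIO

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
df = pd.read_csv(StringIO(csv_data))

# isnull() returns a DataFrame of booleans: True where a cell is NaN
mask = df.isnull()

# summing booleans counts the True values; axis=1 counts per row
nan_per_row = mask.sum(axis=1)

# keep only the rows that contain at least one missing value
rows_with_nan = df[mask.any(axis=1)]

print(nan_per_row.tolist())          # [0, 1, 1]
print(rows_with_nan.index.tolist())  # [1, 2]
```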

### Eliminating samples or features with missing values

In [19]:
# remove rows that contain missing values
df.dropna(axis=0)

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.0 | 2.0 | 3.0 | 4.0 |


In [20]:
# remove columns that contain missing values
df.dropna(axis=1)

|   | A | B |
|---|---|---|
| 0 | 1.0 | 2.0 |
| 1 | 5.0 | 6.0 |
| 2 | 10.0 | 11.0 |


In [23]:
# only drop rows where all columns are NaN
df.dropna(axis=0,how='all')
# or just:
# df.dropna(how='all')

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.0 | 2.0 | 3.0 | 4.0 |
| 1 | 5.0 | 6.0 | NaN | 8.0 |
| 2 | 10.0 | 11.0 | 12.0 | NaN |


In [24]:
# drop rows that have fewer than 4 non-NaN values
df.dropna(thresh=4)

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.0 | 2.0 | 3.0 | 4.0 |


In [25]:
# only drop rows where NaN appears in specific columns (here: 'C')
df.dropna(subset=['C'])

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.0 | 2.0 | 3.0 | 4.0 |
| 2 | 10.0 | 11.0 | 12.0 | NaN |
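
Note that `dropna` returns a new DataFrame and leaves `df` itself unchanged unless the result is assigned back. A short sketch of this behavior, rebuilding the sample data so that the block is self-contained (the name `df_clean` is illustrative):

```python
import pandas as pd
from io import StringIO

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
df = pd.read_csv(StringIO(csv_data))

# dropna() returns a new DataFrame; the original is not modified
df_clean = df.dropna(axis=0)
print(df.shape)        # (3, 4): the original still has all rows
print(df_clean.shape)  # (1, 4): only the complete row remains

# thresh=N keeps rows with at least N non-NaN values;
# every row here has at least 3, so nothing is dropped
df_thresh3 = df.dropna(thresh=3)
print(df_thresh3.shape)  # (3, 4)
```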


### Imputing missing values
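Dropping rows or columns is simple, but it can discard too much useful data. An alternative is to estimate the missing values from the other samples with an interpolation technique.  
One of the most common methods is mean imputation, which replaces each missing value with the mean of its feature column. `scikit-learn` provides it as `SimpleImputer` in `sklearn.impute`; the older `sklearn.preprocessing.Imputer` class was removed in version 0.22.

In [30]:
# impute missing values with the column mean
import numpy as np
from sklearn.impute import SimpleImputer

# SimpleImputer always works column-wise, so the old axis=0 argument is gone
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)
imputed_data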

array([[ 1. ,  2. ,  3. ,  4. ],
       [ 5. ,  6. ,  7.5,  8. ],
       [10. , 11. , 12. ,  6. ]])
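
The same column-mean imputation can also be done directly in pandas with `fillna`, which keeps the column labels instead of returning a bare NumPy array. This is a sketch of an alternative, not the approach used in the source; it rebuilds the sample data so the block runs on its own, and the name `df_imputed` is illustrative.

```python
import pandas as pd
from io import StringIO

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
df = pd.read_csv(StringIO(csv_data))

# df.mean() computes per-column means and skips NaN by default;
# fillna() aligns that Series with the columns by label
df_imputed = df.fillna(df.mean())
print(df_imputed)
```

Besides `'mean'`, the scikit-learn imputer also accepts `strategy='median'` and `strategy='most_frequent'`; the latter is useful for categorical features.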