# Chapter 4: Building Good Training Data Sets - Data Preprocessing

# 4-1: Dealing with Missing Data

Most computational models cannot handle missing values in a data set and will produce unpredictable results if those values are not addressed beforehand.

In [4]:
import pandas as pd
from io import StringIO
csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
# Python 3 strings are already unicode, so no conversion is needed
df = pd.read_csv(StringIO(csv_data))
print(df)

df.isnull().sum()

      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN


A    0
B    0
C    1
D    1
dtype: int64

### Eliminating samples or features with missing values

In [5]:
# drop observations with Null values
null_obs=df.dropna()

# drop columns with any null values
null_col=df.dropna(axis=1)

# only drop rows where all columns are NaN
null_col_all=df.dropna(how='all')

# drop rows that do not have at least 4 non-NaN values
at_least = df.dropna(thresh=4)

# only drop rows where NaN appears in specific columns (here: 'C')
drop_specific = df.dropna(subset=['C'])
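
To see what each call above keeps, the following minimal sketch rebuilds the same three-row frame and compares the resulting shapes. It assumes Python 3 and a recent pandas; the `results` and `shapes` dictionaries are illustrative names, not part of the original notebook.

```python
import pandas as pd
from io import StringIO

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
df = pd.read_csv(StringIO(csv_data))

# Apply each dropna variant and keep the result under a descriptive label
results = {
    "dropna()": df.dropna(),
    "dropna(axis=1)": df.dropna(axis=1),
    "dropna(how='all')": df.dropna(how='all'),
    "dropna(thresh=4)": df.dropna(thresh=4),
    "dropna(subset=['C'])": df.dropna(subset=['C']),
}

# (rows, columns) that survive each call
shapes = {name: frame.shape for name, frame in results.items()}
for name, shape in shapes.items():
    print(name, shape)
```

Only the first row is complete, so the default call and `thresh=4` both keep a single row, while `axis=1` keeps the two columns (`A` and `B`) that contain no NaN.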

### Imputing missing values

When dropping NaNs is not feasible, you can use interpolation techniques to estimate the missing values from the other training samples in your dataset. The example below uses mean imputation, which replaces each missing value with the mean of its column.

In [9]:
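import numpy as np
from sklearn.impute import SimpleImputer
# SimpleImputer supersedes preprocessing.Imputer (removed in scikit-learn 0.22);
# it always imputes column-wise, so the old axis=0 argument is not needed
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)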
imputed_data

array([[  1. ,   2. ,   3. ,   4. ],
       [  5. ,   6. ,   7.5,   8. ],
       [ 10. ,  11. ,  12. ,   6. ]])
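
The imputed array above can be cross-checked with plain pandas. This is a sketch only: it rebuilds the example frame by hand and fills each NaN with its column mean, which should reproduce the 7.5 and 6.0 seen in the output. The names `column_means` and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

# Same values as the example frame, built directly instead of from CSV
df = pd.DataFrame({
    "A": [1.0, 5.0, 10.0],
    "B": [2.0, 6.0, 11.0],
    "C": [3.0, np.nan, 12.0],
    "D": [4.0, 8.0, np.nan],
})

# DataFrame.mean skips NaN by default, so each mean uses observed values only
column_means = df.mean()

# fillna with a Series fills each column with the matching entry
filled = df.fillna(column_means)
print(filled)
```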

### Understanding scikit-learn *transformer* classes

The imputer used above is one of scikit-learn's transformer classes. The two essential methods of these classes are:

**(1) fit**: Used to learn the parameters from the training data

-and- 

**(2) transform**: Uses those parameters to transform the data

*Note: the classifiers from Chapter 3 belong to the **estimator** class within scikit-learn, and have a **predict** method*
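
The split between the two methods can be illustrated with a hand-written class. This is a hypothetical sketch of the pattern, not scikit-learn's implementation: `MeanImputerSketch`, `means_`, and the sample arrays are invented for illustration. `fit` stores the column means of the training data, and `transform` reuses them on data it has not seen before.

```python
import numpy as np


class MeanImputerSketch:
    """Hypothetical illustration of the fit/transform pattern (not scikit-learn code)."""

    def fit(self, X):
        # Learn one parameter per column: the mean of the observed values
        self.means_ = np.nanmean(np.asarray(X, dtype=float), axis=0)
        return self

    def transform(self, X):
        # Copy the input, then replace each NaN with the mean learned in fit
        X = np.array(X, dtype=float)
        rows, cols = np.where(np.isnan(X))
        X[rows, cols] = self.means_[cols]
        return X


X_train = np.array([[1.0, 2.0], [5.0, np.nan], [10.0, 11.0]])
X_new = np.array([[np.nan, 4.0], [7.0, np.nan]])

imputer = MeanImputerSketch().fit(X_train)   # parameters come from X_train only
X_new_filled = imputer.transform(X_new)      # and are applied to new data
print(imputer.means_)
print(X_new_filled)
```

The missing entries in `X_new` are filled with means computed from `X_train`, which is the reason the two steps are kept separate.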

# 4-2: Handling categorical data

First, understand **nominal** vs. **ordinal** features. Ordinal features can be sorted (e.g. t-shirt sizes), while nominal features do not imply an order (e.g. t-shirt colors).

We'll create a new df to illustrate:

In [12]:
import pandas as pd
df = pd.DataFrame([
    ['green','M',10.1,'class1'],
    ['red','L',13.5,'class2'],
    ['blue','XL',15.3,'class1']])
df.columns = ['color','size','price','classlabel']
df

   color size  price classlabel
0  green    M   10.1     class1
1    red    L   13.5     class2
2   blue   XL   15.3     class1


### Mapping ordinal features

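Mapping an ordinal feature to an integer is a manual process, because the correct order of the labels cannot be inferred automatically. The integers we choose should reflect the assumed differences between the levels (here, XL = L + 1 = M + 2).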

In [13]:
size_mapping = {
    'XL':3,
    'L':2,
    'M':1}
df['size'] = df['size'].map(size_mapping)
df

   color  size  price classlabel
0  green     1   10.1     class1
1    red     2   13.5     class2
2   blue     3   15.3     class1
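
If the original labels are needed again later, the mapping can be reversed. The sketch below assumes the same `size_mapping` dictionary; `inv_size_mapping`, `encoded`, and `decoded` are illustrative names. It works on a standalone Series so that it runs independently of the frame above.

```python
import pandas as pd

size_mapping = {'XL': 3, 'L': 2, 'M': 1}
sizes = pd.Series(['M', 'L', 'XL'], name='size')

# Forward mapping: label -> integer
encoded = sizes.map(size_mapping)

# Reverse the dictionary to map integer -> label
inv_size_mapping = {v: k for k, v in size_mapping.items()}
decoded = encoded.map(inv_size_mapping)

print(encoded.tolist())
print(decoded.tolist())
```

Reversing the dictionary is only safe because every integer in `size_mapping` is unique; duplicate values would collapse into a single key.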
