# Building Good Training Sets - Data Preprocessing

This notebook covers the following topics:

- Removing and imputing missing values from the dataset
- Getting categorical data into shape for machine learning algorithms
- Selecting relevant features for the model construction

In [1]:
import numpy as np
from io import StringIO
import pandas as pd

## Dealing With Missing Data

In [2]:
csv_data = \
'''
A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,

'''

In [3]:
df = pd.read_csv(StringIO(csv_data))

In [4]:
df.head()

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,12.0,


In [5]:
df.isnull().sum()

A    0
B    0
C    1
D    1
dtype: int64

In [6]:
# Eliminating samples or features with missing values

df.dropna(axis=0)  # axis=0 drops rows (samples) that contain any NaN

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0


In [7]:
df.dropna(axis=1)  # axis=1 drops columns (features) that contain any NaN

Unnamed: 0,A,B
0,1.0,2.0
1,5.0,6.0
2,10.0,11.0


In [8]:
df.dropna(how='all')  # drop only rows in which every value is NaN

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,12.0,


In [9]:
# thresh: keep only rows that have at least `thresh` non-NaN values
df.dropna(thresh=4)

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
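
To make the `thresh` semantics concrete, the following sketch rebuilds the toy frame from above under the hypothetical name `df_demo` and compares two thresholds. It only counts the non-NaN values per row and the rows that survive; it is an illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Rebuild the toy frame from above; df_demo is a hypothetical name
df_demo = pd.DataFrame({
    'A': [1.0, 5.0, 10.0],
    'B': [2.0, 6.0, 11.0],
    'C': [3.0, np.nan, 12.0],
    'D': [4.0, 8.0, np.nan],
})

# Number of non-NaN values in each row
non_null_per_row = df_demo.notnull().sum(axis=1)

# thresh=N keeps the rows that have at least N non-NaN values
kept_thresh_3 = df_demo.dropna(thresh=3)
kept_thresh_4 = df_demo.dropna(thresh=4)

print(non_null_per_row.tolist())
print(len(kept_thresh_3), len(kept_thresh_4))
```

With `thresh=3` every row is kept, because each row has at least three valid entries; with `thresh=4` only the complete first row remains.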


In [10]:
# only drop rows where NaN appears in specific columns (here: 'C')
df.dropna(subset=['C'])

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
2,10.0,11.0,12.0,


## Imputing missing values

In [11]:
# mean imputation
# Imputer was removed in scikit-learn 0.22; SimpleImputer is its replacement
from sklearn.impute import SimpleImputer

In [12]:
imr = SimpleImputer(missing_values=np.nan, strategy='mean')  # imputes column-wise



In [13]:
imr = imr.fit(df.values)

In [14]:
imputed_data = imr.transform(df.values)

In [15]:
imputed_data

array([[ 1. ,  2. ,  3. ,  4. ],
       [ 5. ,  6. ,  7.5,  8. ],
       [10. , 11. , 12. ,  6. ]])
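
As a cross-check on the imputed array above, the same column-wise mean imputation can be sketched with plain pandas. This swaps in `DataFrame.fillna` for the scikit-learn imputer, and `df_demo` is a hypothetical name for a rebuilt copy of the toy frame.

```python
import numpy as np
import pandas as pd

# Rebuild the toy frame with missing values; df_demo is a hypothetical name
df_demo = pd.DataFrame({
    'A': [1.0, 5.0, 10.0],
    'B': [2.0, 6.0, 11.0],
    'C': [3.0, np.nan, 12.0],
    'D': [4.0, 8.0, np.nan],
})

# DataFrame.mean() skips NaN by default, so each mean uses only observed values
col_means = df_demo.mean()

# Fill each missing entry with the mean of its own column
filled = df_demo.fillna(col_means)
print(filled.values)
```

The missing entry in column `C` becomes (3 + 12) / 2 = 7.5 and the one in column `D` becomes (4 + 8) / 2 = 6, matching the array above.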

## Handling categorical data

In [16]:
df = pd.DataFrame([
    ['green','M',10.1,'class1'],
    ['red','L',13.5,'class2'],
    ['blue','XL',15.3,'class1']
])

In [17]:
df.columns = ['color','size','price','classlabel']

In [18]:
df

Unnamed: 0,color,size,price,classlabel
0,green,M,10.1,class1
1,red,L,13.5,class2
2,blue,XL,15.3,class1


In [19]:
#mapping ordinal features
size_mapping = {
    'XL': 3,
    'L': 2,
    'M': 1
}

In [20]:
df['size'] = df['size'].map(size_mapping)

In [21]:
df

Unnamed: 0,color,size,price,classlabel
0,green,1,10.1,class1
1,red,2,13.5,class2
2,blue,3,15.3,class1
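
If the integer codes need to be turned back into the original size strings later, a reverse dictionary is enough. The sketch below is self-contained and uses the hypothetical names `sizes` and `inv_size_mapping`.

```python
import pandas as pd

size_mapping = {'XL': 3, 'L': 2, 'M': 1}

# A small stand-in for the 'size' column
sizes = pd.Series(['M', 'L', 'XL'])
encoded = sizes.map(size_mapping)

# Invert the dictionary so that integer codes map back to size strings
inv_size_mapping = {v: k for k, v in size_mapping.items()}
decoded = encoded.map(inv_size_mapping)

print(encoded.tolist())
print(decoded.tolist())
```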


In [22]:
# encoding class labels
# many machine learning libraries require class labels to be encoded as integer values.
# remember that class labels are not ordinal, so it doesn't matter which integer is assigned to which label.

class_mapping = {label:idx for idx, label in enumerate(np.unique(df.classlabel))}

In [23]:
class_mapping

{'class1': 0, 'class2': 1}

In [24]:
df['classlabel'] = df.classlabel.map(class_mapping)

In [25]:
df

Unnamed: 0,color,size,price,classlabel
0,green,1,10.1,0
1,red,2,13.5,1
2,blue,3,15.3,0
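
An alternative to the hand-written dictionary is scikit-learn's `LabelEncoder`, which also provides the reverse transformation. The sketch below uses a hypothetical `labels` array that mirrors the `classlabel` column before encoding.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Stand-in for the original string class labels
labels = np.array(['class1', 'class2', 'class1'])

class_le = LabelEncoder()
# fit_transform learns the sorted set of labels and returns integer codes
y_encoded = class_le.fit_transform(labels)
# inverse_transform maps the integer codes back to the original strings
y_decoded = class_le.inverse_transform(y_encoded)

print(y_encoded)
print(y_decoded)
```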


In [26]:
#encoding nominal categorical variables
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

In [27]:
X = df[['color','size','price']].values
color_le = LabelEncoder()
X[:,0] = color_le.fit_transform(X[:,0])
X

array([[1, 1, 10.1],
       [2, 2, 13.5],
       [0, 3, 15.3]], dtype=object)

In [28]:
from sklearn.compose import ColumnTransformer
# categorical_features was removed in scikit-learn 0.22; ColumnTransformer selects column 0 instead
ohe = ColumnTransformer([('color', OneHotEncoder(), [0])], remainder='passthrough')
ohe.fit_transform(X).astype(float)


array([[ 0. ,  1. ,  0. ,  1. , 10.1],
       [ 0. ,  0. ,  1. ,  2. , 13.5],
       [ 1. ,  0. ,  0. ,  3. , 15.3]])

In [29]:
ohe = OneHotEncoder()
temp = ohe.fit_transform(df['color'].values.reshape(-1,1))

In [30]:
temp.toarray()

array([[0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

In [31]:
# another way to one-hot encode is pd.get_dummies() in pandas, which converts only the string columns
pd.get_dummies(df)

Unnamed: 0,size,price,classlabel,color_blue,color_green,color_red
0,1,10.1,0,0,1,0
1,2,13.5,1,0,0,1
2,3,15.3,0,1,0,0


When we use one-hot encoded datasets, we have to keep in mind that the encoding introduces **multicollinearity**, which can be an issue for certain methods (for instance, methods that require matrix inversion). If features are highly correlated, matrices are computationally difficult to invert, which can lead to numerically unstable estimates.

To reduce the correlation among variables, we can simply remove one feature column from the one-hot encoded array. We don't lose any information by doing so: if a sample is neither `color_green` nor `color_red`, it must be `color_blue`.

In [32]:
pd.get_dummies(df, drop_first=True)

Unnamed: 0,size,price,classlabel,color_green,color_red
0,1,10.1,0,1,0
1,2,13.5,1,0,1
2,3,15.3,0,0,0
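
The redundancy in a full one-hot encoding can be checked numerically. The sketch below assumes a model with an intercept column (an assumption added here, not stated in the text above) and uses six hypothetical samples so that the matrix has more rows than columns. With all three dummy columns the design matrix is rank deficient; after dropping one dummy it has full column rank.

```python
import numpy as np

# Hypothetical color codes for six samples: 0 = blue, 1 = green, 2 = red
color_idx = np.array([1, 2, 0, 1, 0, 2])
onehot = np.eye(3)[color_idx]          # full one-hot encoding, shape (6, 3)
intercept = np.ones((6, 1))

# Full encoding plus intercept: the three dummies always sum to the intercept
X_full = np.hstack([intercept, onehot])
# Dropping the first dummy column removes that linear dependency
X_reduced = np.hstack([intercept, onehot[:, 1:]])

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)        # number of columns vs. rank
print(X_reduced.shape[1], rank_reduced)
```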


## Partitioning a dataset into separate training and test sets

In [33]:
from io import StringIO

In [34]:
col_names = '''Class Label
Alcohol
Malic acid
Ash
Alcalinity of ash
Magnesium
Total phenols
Flavanoids
Nonflavanoid phenols
Proanthocyanins
Color intensity
Hue
OD280/OD315 of diluted wines
Proline'''

In [35]:
col_names = col_names.split('\n')

In [36]:
col_names

['Class Label',
 'Alcohol',
 'Malic acid',
 'Ash',
 'Alcalinity of ash',
 'Magnesium',
 'Total phenols',
 'Flavanoids',
 'Nonflavanoid phenols',
 'Proanthocyanins',
 'Color intensity',
 'Hue',
 'OD280/OD315 of diluted wines',
 'Proline']

In [37]:
df = pd.read_csv('wine.data', header=None, names=col_names)  # Wine dataset: chemical analyses used to determine the origin of wines

In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
Class Label                     178 non-null int64
Alcohol                         178 non-null float64
Malic acid                      178 non-null float64
Ash                             178 non-null float64
Alcalinity of ash               178 non-null float64
Magnesium                       178 non-null int64
Total phenols                   178 non-null float64
Flavanoids                      178 non-null float64
Nonflavanoid phenols            178 non-null float64
Proanthocyanins                 178 non-null float64
Color intensity                 178 non-null float64
Hue                             178 non-null float64
OD280/OD315 of diluted wines    178 non-null float64
Proline                         178 non-null int64
dtypes: float64(11), int64(3)
memory usage: 19.5 KB


In [39]:
df.head()

Unnamed: 0,Class Label,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


In [40]:
#distribution of our classes
np.bincount(df['Class Label'].values)

array([ 0, 59, 71, 48], dtype=int64)

In [41]:
# alternative: value_counts() on the label column
df['Class Label'].value_counts()

2    71
1    59
3    48
Name: Class Label, dtype: int64

In [42]:
#splitting the dataset
from sklearn.model_selection import train_test_split

In [43]:
X, y = df.iloc[:,1:].values, df.iloc[:,0].values

In [44]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, random_state = 0,stratify = y) #stratification sampling

In [45]:
np.bincount(y_train), np.bincount(y_test)  # class proportions should be nearly the same in both splits

(array([ 0, 41, 50, 33], dtype=int64), array([ 0, 18, 21, 15], dtype=int64))
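
To see what stratification achieves, the class counts printed above can be converted into proportions. The sketch below copies the counts from the outputs (dropping the unused index 0) rather than recomputing the split, so it runs without `wine.data`.

```python
import numpy as np

# Class counts copied from the outputs above; labels are 1, 2, 3
full_counts = np.array([59, 71, 48])
train_counts = np.array([41, 50, 33])
test_counts = np.array([18, 21, 15])

# Convert counts to class proportions within each set
full_prop = full_counts / full_counts.sum()
train_prop = train_counts / train_counts.sum()
test_prop = test_counts / test_counts.sum()

print(np.round(full_prop, 3))
print(np.round(train_prop, 3))
print(np.round(test_prop, 3))
```

The three rows of proportions agree to within about one percentage point, which is as close as whole-sample counts allow.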

## Bringing features onto the same scale

Some models are scale invariant (for example, decision trees and random forests) and do not need feature scaling. For models that are sensitive to feature scales (such as linear regression, Adaline, and the perceptron), we have to scale the features, as it improves their performance.

There are two common approaches to bringing different features onto the same scale: **normalization** and **standardization**.

Normalization most often refers to rescaling the features to the range [0, 1], which is a special case of min-max scaling.

$ x ^{(i)} _{norm} = \frac {x ^{(i)} - x _{min}} {x _{max} - x _{min}} $

Here,

$ x ^{(i)} $ is the value of a particular sample,<br>
$ x _{min} $ is the smallest value in the feature column,<br>
$ x _{max} $ is the largest value in the feature column.
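
The min-max formula above can be applied by hand with NumPy. The sketch below uses a small hypothetical feature column `ex` and also rescales one unseen value with the minimum and maximum learned from `ex`, which mirrors fitting on the training set and transforming the test set.

```python
import numpy as np

# Hypothetical feature column standing in for one training feature
ex = np.array([0., 1., 2., 3., 4., 5.])

x_min, x_max = ex.min(), ex.max()
ex_norm = (ex - x_min) / (x_max - x_min)

# A new value is scaled with the training min and max, so it can leave [0, 1]
new_value = 6.0
new_norm = (new_value - x_min) / (x_max - x_min)

print(ex_norm)
print(new_norm)
```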

In [46]:
from sklearn.preprocessing import MinMaxScaler

In [47]:
mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)

Normalization bounds the training values to the interval [0, 1]; test values that are transformed with the training minimum and maximum can fall slightly outside it.

Standardization can be more practical for many machine learning algorithms, especially for optimization algorithms such as gradient descent.

Using standardization, we center the feature columns at mean 0 with standard deviation 1, so that they have the same mean and variance as a standard normal distribution (the shape of the distribution itself is not changed), which makes it easier to learn the weights. Furthermore, standardization maintains useful information about outliers and makes the algorithm less sensitive to them than min-max scaling, which squeezes the data into a limited range of values.

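$ x ^{(i)} _{std} = \frac {x ^{(i)} - \mu _x} {\sigma _x} $

Here, $ \mu _x $ is the sample mean of the feature column and $ \sigma _x $ is its standard deviation.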
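
Standardization (subtract the column mean, divide by the column standard deviation) can also be written out in NumPy. The sketch below uses a hypothetical feature column `ex`; `np.std` defaults to the population standard deviation (`ddof=0`), which is the estimator `StandardScaler` uses as well.

```python
import numpy as np

# Hypothetical feature column
ex = np.array([0., 1., 2., 3., 4., 5.])

# Subtract the mean and divide by the (population) standard deviation
ex_std = (ex - ex.mean()) / ex.std()

print(ex_std)
print(ex_std.mean(), ex_std.std())
```

After the transformation the column has mean 0 and standard deviation 1 up to floating-point error, while the relative spacing of the values is unchanged.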

In [48]:
from sklearn.preprocessing import StandardScaler

In [49]:
std = StandardScaler()

In [50]:
X_train_std = std.fit_transform(X_train)
X_test_std = std.transform(X_test)

## Selecting meaningful features

If our model performs much better on the training dataset than on the test dataset, we can say that it overfits, that is, it has high variance. A common reason for overfitting is that the model is too complex for the given training data.

Solutions to reduce the generalization error are as follows:
- Collect more training data
- Introduce a penalty for complexity via regularization
- Choose a simpler model with fewer parameters
- Reduce the dimensionality of the data
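
As one illustration of the regularization option in the list above, the sketch below fits an L2-penalized logistic regression with weak and strong regularization and compares the size of the learned weights. Several things here are assumptions rather than part of this notebook: the choice of `LogisticRegression`, the two values of `C`, and the use of scikit-learn's bundled `load_wine` copy of the Wine data in place of `wine.data`.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Bundled copy of the Wine data (assumed stand-in for wine.data)
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0, stratify=y)

# Standardize using statistics from the training set only
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

weight_norms = {}
test_scores = {}
# C is the inverse regularization strength: smaller C means a stronger penalty
for C in (100.0, 0.01):
    lr = LogisticRegression(C=C, max_iter=1000)  # L2 penalty by default
    lr.fit(X_train_std, y_train)
    weight_norms[C] = np.linalg.norm(lr.coef_)
    test_scores[C] = lr.score(X_test_std, y_test)

print(weight_norms)
print(test_scores)
```

A stronger penalty shrinks the weight vector, which is the sense in which regularization limits model complexity; which value of `C` generalizes best has to be determined empirically.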