# Data Preprocessing Template

This notebook explains the basic steps in preprocessing the dataset before applying any models.

## Import the required libraries

In [1]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Import the dataset

In this case we have a dummy dataset which has a few entries.

In [2]:
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [3]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [4]:
y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)
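If `Data.csv` is not available, the arrays printed above can be rebuilt in memory. The following is a minimal sketch; the column names (`Country`, `Age`, `Salary`, `Purchased`) are assumptions, since the header of the original file is not shown in this notebook.

```python
import numpy as np
import pandas as pd

# Values copied from the X and y arrays printed above.
# Column names are hypothetical; the real header of Data.csv is not shown.
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

X = dataset.iloc[:, :-1].values   # every column except the last
y = dataset.iloc[:, -1].values    # the last column only

# Count missing entries per column before deciding how to handle them.
missing_per_column = dataset.isna().sum()
print(missing_per_column)
```

Counting the missing entries up front shows which columns need attention before any model is applied.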

## Take care of missing data
Some entries in the dataset are missing. To handle them, we replace each missing value with the mean of its column. The right replacement depends entirely on the situation and the dataset; for this example we use the mean.

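For this purpose we import the SimpleImputer class from sklearn.impute. (Older scikit-learn releases provided this as Imputer in sklearn.preprocessing, which has since been removed.)

In [5]:
# Taking care of missing data
from sklearn.impute import SimpleImputer
# Replace each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])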

In [6]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

There, it's done. The missing values have been replaced by the column means.
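As a cross-check on the imputed values, the column means can be computed by hand with NumPy. This is only a sketch of what mean imputation does, not the library's implementation; the variable names `age` and `salary` are assumptions about what the two numeric columns represent.

```python
import numpy as np

# The two numeric columns as printed before imputation (nan = missing).
# The names "age" and "salary" are hypothetical labels for these columns.
age = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float)
salary = np.array([72000, 48000, 54000, 61000, np.nan,
                   58000, 52000, 79000, 83000, 67000], dtype=float)

# np.nanmean ignores nan entries when averaging.
age_mean = np.nanmean(age)
salary_mean = np.nanmean(salary)

# Replace each nan with the mean of its own column.
age_filled = np.where(np.isnan(age), age_mean, age)
salary_filled = np.where(np.isnan(salary), salary_mean, salary)

print(age_mean, salary_mean)
```

The two means should agree with the filled-in entries in the array above (38.77... and 63777.77...).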

## Encoding categorical data
The names 'France', 'Spain' and 'Germany' won't make any sense to our model, so we need to encode this categorical data as numbers. For this purpose we import LabelEncoder from sklearn.preprocessing. LabelEncoder encodes the labels as integers.

For example, it gives the value 0 to 'France', 1 to 'Germany' and 2 to 'Spain' (the classes are numbered in sorted order).
While this serves the purpose of encoding them, it may not work well for our modelling, because this type of encoding suggests an ordering among the categories: it implies that 'Spain' is greater than 'Germany', which is in turn greater than 'France'. This is not correct, and to deal with it we use OneHotEncoder: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

OneHotEncoder splits the column into one new column per distinct value in the original column. In each row it encodes a 1 in the column corresponding to the actual value and 0 in the others.
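The difference between the two encodings can be sketched with plain NumPy. This illustrates the idea only and is not how scikit-learn implements it; the short `countries` array is a hypothetical sample.

```python
import numpy as np

# Hypothetical sample of the categorical column.
countries = np.array(["France", "Spain", "Germany", "Spain"])

# np.unique returns the sorted categories and, with return_inverse=True,
# the integer code of each entry (this is the label-encoding step).
categories, codes = np.unique(countries, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for code i.
one_hot = np.eye(len(categories))[codes]

print(categories)
print(codes)
print(one_hot)
```

Each row of `one_hot` contains exactly one 1, so no category appears larger or smaller than another.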

In [7]:
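# Encoding categorical data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
# Current scikit-learn has no categorical_features argument on OneHotEncoder;
# a ColumnTransformer applies the encoder to column 0 and passes the rest through
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
X = ct.fit_transform(X).astype(float)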

In [8]:
X

array([[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.40000000e+01,
        7.20000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 2.70000000e+01,
        4.80000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01,
        5.40000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,
        6.10000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
        6.37777778e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.50000000e+01,
        5.80000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.87777778e+01,
        5.20000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
        7.90000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 5.00000000e+01,
        8.30000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
        6.70000000e+04]])

Let's do the same for the labels. Here we do not need OneHotEncoder, since the two classes map directly to 0 and 1.

In [9]:
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

In [10]:
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
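To see which integer stands for which label, the fitted encoder can be inspected. A minimal sketch, using a fresh encoder on the label values printed earlier (the names `y_raw` and `encoder` are chosen here for illustration):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Label values copied from the y array printed earlier.
y_raw = np.array(["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"])

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_raw)

# classes_ lists the original labels; the position of each label is its code.
print(encoder.classes_)

# inverse_transform maps the integer codes back to the original labels.
y_restored = encoder.inverse_transform(y_encoded)
print(y_restored)
```

Keeping the fitted encoder around makes it possible to turn model predictions back into readable labels later.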

## Creating the training and the test set

Now let's split the dataset into a training set and a test set.
For this purpose sklearn provides train_test_split in its model_selection module (older releases exposed it in cross_validation, which has since been removed).

In [11]:
# Create the training set and the test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)



In [12]:
X_train

array([[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
        6.37777778e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
        6.70000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 2.70000000e+01,
        4.80000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.87777778e+01,
        5.20000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
        7.90000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,
        6.10000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.40000000e+01,
        7.20000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.50000000e+01,
        5.80000000e+04]])

In [13]:
X_test

array([[0.0e+00, 1.0e+00, 0.0e+00, 3.0e+01, 5.4e+04],
       [0.0e+00, 1.0e+00, 0.0e+00, 5.0e+01, 8.3e+04]])
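The effect of `test_size` and `random_state` can be checked on stand-in data. This sketch assumes a scikit-learn version in which `train_test_split` lives in `sklearn.model_selection`; `X_demo` and `y_demo` are hypothetical arrays, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 rows, 2 features, binary labels.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# test_size=0.2 holds out 20% of the rows: 2 of 10 here.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
print(X_train_a.shape, X_test_a.shape)

# The same random_state yields the same partition on every call.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```

Fixing `random_state` makes the split reproducible, which is why the training and test rows shown above stay the same from run to run.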

## Feature Scaling

There is one last step to be done: feature scaling. This is a very important step in data preprocessing. If feature scaling is not performed, a feature with a much larger range than the others can dominate the model and cause the features to be weighted incorrectly.

For example, in our feature set the last feature is on a much larger scale than any other feature. Because of this, any Euclidean distance calculation, or any other operation that combines the features, is skewed towards it, and the other features are effectively not taken into consideration.

To avoid this, we perform feature scaling.
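To make the scale problem concrete, the sketch below measures how much of the squared Euclidean distance between two rows comes from the last feature. The two rows are taken from the dataset shown earlier, restricted to their two numeric columns; the variable names are chosen here for illustration.

```python
import numpy as np

# Two rows of the dataset, numeric columns only (unscaled).
row_a = np.array([27.0, 48000.0])
row_b = np.array([48.0, 79000.0])

diff = row_a - row_b

# Euclidean distance between the two rows.
distance = np.linalg.norm(diff)

# Fraction of the squared distance contributed by the last feature.
last_feature_share = diff[1] ** 2 / np.sum(diff ** 2)

print(distance, last_feature_share)
```

Nearly all of the distance comes from the last column, so a difference of 21 in the first column has almost no influence on it.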

In [14]:
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

Notice here that the StandardScaler object was fit on the training set alone, and that the same fit was used to transform the test set as well. This keeps information from the test set out of the scaling parameters.
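What "fit on the training set, then transform the test set" means can be spelled out by hand. The following is a minimal sketch on a few illustrative rows (the `train` and `test` arrays are hypothetical subsets, not the notebook's actual split): standardization subtracts the training mean and divides by the training standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test rows (two numeric features).
train = np.array([[27.0, 48000.0],
                  [38.0, 61000.0],
                  [44.0, 72000.0],
                  [35.0, 58000.0]])
test = np.array([[30.0, 54000.0],
                 [50.0, 83000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)   # learns mean and std from train
test_scaled = scaler.transform(test)         # reuses the training statistics

# Manual equivalent: (x - training mean) / training std (population std).
manual_test = (test - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(test_scaled, manual_test))
```

The scaled training columns have mean 0 and standard deviation 1; the scaled test columns generally do not, because they are measured against the training statistics.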

## That's it!

Those are all the steps in data preprocessing. Use this as a template.