In [67]:
## Required libraries

## libraries for data handling
import numpy as np
import pandas as pd

## library for plotting static graphs
import matplotlib.pyplot as plt

## library for plotting interactive graphs 
from bokeh.plotting import figure, show
from bokeh.io import output_notebook

## Classes used from scikit-learn for data preparation
from sklearn.preprocessing import Imputer, LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split

output_notebook()

# Machine Learning A-Z™: Data Preprocessing

Personal course notes for [Machine Learning A-Z™: Hands-On Python & R In Data Science](https://www.udemy.com/machinelearning/) created by Kirill Eremenko, Hadelin de Ponteves and SuperDataScience Team, and offered by Udemy.

Data for this course can be downloaded from this [link](https://www.superdatascience.com/pages/machine-learning).

This notebook covers Part 1 of the course, i.e., Data Preprocessing.

Data preprocessing is one of the most important steps in any data science or machine learning project. In my experience, if you are building the entire application or system from scratch, you have the freedom to decide which features (a feature is an individual measurable property of the data) the algorithm you intend to use requires. If not, data preprocessing is often estimated to take around 80% of the project duration, especially since real-world data tends to be complex, unstructured and messy.


## 1. Importing Data

Data can be imported from sources of any size: a data lake, a data warehouse, a database, or files in various formats (CSV and Excel being the most popular). Extracting data from a database requires a database connector, while pandas can read data directly from files.
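Both kinds of source can be sketched with pandas. This is a minimal sketch, assuming an in-memory CSV string and an in-memory SQLite database stand in for a real file and a real database server. The table name `customers` is a hypothetical choice, and the rows are the first five rows of `Data.csv` as they appear in this notebook.

```python
import io
import sqlite3

import pandas as pd

# The first five rows of Data.csv, held in memory so the sketch
# runs without the file being present.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
"""

# Reading from a file-like object works the same way as reading from a path.
from_csv = pd.read_csv(io.StringIO(csv_text))

# A database source needs a connection; sqlite3 ships with Python.
# The table name "customers" is a hypothetical choice for this sketch.
conn = sqlite3.connect(":memory:")
from_csv.to_sql("customers", conn, index=False)
from_db = pd.read_sql("SELECT * FROM customers", conn)
conn.close()

# Count the missing values per column and confirm the round trip.
print(from_csv.isna().sum())
print(from_db.shape)
```

For other databases the same `pd.read_sql` call applies, with the connection coming from the connector of that database.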

In [40]:
## import the data
dataset = pd.read_csv("Data.csv")

## view data
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
Country      10 non-null object
Age          9 non-null float64
Salary       9 non-null float64
Purchased    10 non-null object
dtypes: float64(2), object(2)
memory usage: 400.0+ bytes


In [41]:
dataset.head()

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes


In [49]:
## Features of the data, i.e., independent variables
X = dataset.iloc[:,:-1].values
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [50]:
## Target variable, i.e., the dependent variable
y = dataset.iloc[:,-1].values
y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)

## 2. Missing Data

There are several ways to handle missing values. The standard techniques impute the mean, median or mode of the affected feature. Another, often more effective, technique is predictive imputation, where an algorithm such as k-NN predicts the missing values from the other features.

Because the values of a feature can span a wide range and include outliers, my personal choice is to impute the median or use predictive imputation. In my experience, however, predictive imputation may not fill every missing value, so check for remaining missing values afterwards and replace them using one of the three standard techniques.

In [51]:
## define the imputer and its parameters
imputer = Imputer(missing_values='NaN', strategy='median', axis=0)
## fit the imputer on the numeric columns (Age and Salary)
imputer = imputer.fit(X[:,1:3])
## replace the missing values with the medians learned by the imputer
X[:,1:3] = imputer.transform(X[:,1:3])
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 61000.0],
       ['France', 35.0, 58000.0],
       ['Spain', 38.0, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
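The options described in this section can also be sketched with the current scikit-learn imputation classes. This is a minimal sketch, assuming scikit-learn 0.22 or later, where `SimpleImputer` and `KNNImputer` live in `sklearn.impute`. The value `n_neighbors=3` is an arbitrary choice for illustration and does not come from the course.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Age and Salary columns of Data.csv, including the two missing values.
numeric = np.array([
    [44, 72000], [27, 48000], [30, 54000], [38, 61000], [40, np.nan],
    [35, 58000], [np.nan, 52000], [48, 79000], [50, 83000], [37, 67000],
], dtype=float)

# Median imputation, computed column by column.
median_filled = SimpleImputer(strategy="median").fit_transform(numeric)

# Predictive imputation: each gap is filled with the mean of that feature
# over the k nearest rows (k=3 is an arbitrary choice for this sketch).
knn_filled = KNNImputer(n_neighbors=3).fit_transform(numeric)

# Cross-check that no missing values remain after either method.
print(np.isnan(median_filled).any(), np.isnan(knn_filled).any())

# Compare what each method put into the two rows that had gaps.
print(median_filled[4], median_filled[6])
print(knn_filled[4], knn_filled[6])
```

The median result should agree with the array above (38 for the missing age, 61000 for the missing salary), while the k-NN values depend on the neighbours chosen.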

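## 3. Categorical Data

Most machine learning algorithms expect numeric input, so the categorical feature `Country` and the target `Purchased` are encoded as numbers. `LabelEncoder` maps each category to an integer. `OneHotEncoder` then expands the integer country codes into one binary column per country, so that the model does not read an order into the codes.
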
In [52]:
## encode the country names (column 0) as integers
labelenc_X = LabelEncoder()
X[:,0] = labelenc_X.fit_transform(X[:,0])
X

array([[0, 44.0, 72000.0],
       [2, 27.0, 48000.0],
       [1, 30.0, 54000.0],
       [2, 38.0, 61000.0],
       [1, 40.0, 61000.0],
       [0, 35.0, 58000.0],
       [2, 38.0, 52000.0],
       [0, 48.0, 79000.0],
       [1, 50.0, 83000.0],
       [0, 37.0, 67000.0]], dtype=object)

In [55]:
## expand the integer country codes (column 0) into one binary column per country
enc_X = OneHotEncoder(categorical_features=[0])
X = enc_X.fit_transform(X).toarray()
X

In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.


array([[1.0e+00, 0.0e+00, 0.0e+00, 4.4e+01, 7.2e+04],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.7e+01, 4.8e+04],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.0e+01, 5.4e+04],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.8e+01, 6.1e+04],
       [0.0e+00, 1.0e+00, 0.0e+00, 4.0e+01, 6.1e+04],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.5e+01, 5.8e+04],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.8e+01, 5.2e+04],
       [1.0e+00, 0.0e+00, 0.0e+00, 4.8e+01, 7.9e+04],
       [0.0e+00, 1.0e+00, 0.0e+00, 5.0e+01, 8.3e+04],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.7e+01, 6.7e+04]])
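As the note above says, `OneHotEncoder` can be applied to the categories directly, without a preceding `LabelEncoder`. A possible sketch, assuming scikit-learn 0.20 or later and using `ColumnTransformer` (which is not part of the original notebook) to encode only the `Country` column:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# The feature columns after median imputation, as shown in this notebook.
features = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44.0, 27.0, 30.0, 38.0, 40.0, 35.0, 38.0, 48.0, 50.0, 37.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0, 61000.0,
               58000.0, 52000.0, 79000.0, 83000.0, 67000.0],
})

# One-hot encode Country and pass the numeric columns through unchanged.
# sparse_threshold=0 asks for a dense array regardless of sparsity.
transformer = ColumnTransformer(
    transformers=[("country", OneHotEncoder(), ["Country"])],
    remainder="passthrough",
    sparse_threshold=0,
)
encoded = transformer.fit_transform(features)

# The learned categories give the order of the one-hot columns.
print(transformer.named_transformers_["country"].categories_)
print(encoded[:3])
```

The one-hot columns follow the sorted category order (France, Germany, Spain), so the rows should match the array above.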

In [57]:
## encode the target labels as integers (No -> 0, Yes -> 1)
labelenc_y = LabelEncoder()
y = labelenc_y.fit_transform(y)
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
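Once the target is encoded, it helps to know which integer stands for which label. The sketch below, which uses the variable names `y_raw` and `encoder` purely for illustration, shows how a fitted `LabelEncoder` exposes the mapping and converts integer codes back to labels.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# The Purchased column of Data.csv.
y_raw = np.array(["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"])

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_raw)

# classes_ lists the labels in the order of their integer codes.
print(encoder.classes_)

# inverse_transform maps integer codes (e.g. model predictions) back to labels.
print(encoder.inverse_transform([1, 0]))
```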

## 4. Split Data

Before the dataset is used to train a model, it is split into two parts: training data and test data. Evaluating the model on data it has not been trained on shows how well it generalises and makes overfitting visible. If you have little data and want to train the model on all of it, you can use cross-validation instead.

In [70]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [71]:
X_train

array([[0.0e+00, 1.0e+00, 0.0e+00, 3.0e+01, 5.4e+04],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.8e+01, 6.1e+04],
       [1.0e+00, 0.0e+00, 0.0e+00, 4.4e+01, 7.2e+04],
       [0.0e+00, 1.0e+00, 0.0e+00, 4.0e+01, 6.1e+04],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.7e+01, 6.7e+04],
       [1.0e+00, 0.0e+00, 0.0e+00, 4.8e+01, 7.9e+04],
       [0.0e+00, 1.0e+00, 0.0e+00, 5.0e+01, 8.3e+04],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.7e+01, 4.8e+04]])

In [72]:
X_test

array([[1.0e+00, 0.0e+00, 0.0e+00, 3.5e+01, 5.8e+04],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.8e+01, 5.2e+04]])

In [73]:
y_train

array([0, 0, 0, 1, 1, 1, 0, 1])

In [74]:
y_test

array([1, 0])
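The split above changes on every run because no seed is set. The following sketch, assuming a stand-in feature matrix rather than the real one, shows a reproducible, stratified split and the cross-validation alternative mentioned earlier. The values `random_state=0` and `n_splits=5` are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Stand-in data: 10 samples with 5 features (placeholder values),
# plus the encoded target from this notebook.
X = np.arange(50, dtype=float).reshape(10, 5)
y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# random_state fixes the shuffle so the split is reproducible;
# stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(X_train.shape, X_test.shape)

# Cross-validation: the data is cut into 5 folds and every sample
# lands in the test fold exactly once.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kfold.split(X)):
    print(fold, train_idx, test_idx)
```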

## 5. Feature Scaling

Feature scaling is important when building machine learning models because it puts all features on the same, comparable scale. Many algorithms are based on Euclidean distance, so features on different scales bias the model towards the features with the largest values (k-NN is one example). Decision trees are not distance-based and are largely insensitive to feature scale; the faster convergence that scaling brings applies mainly to models trained by gradient descent.
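The bias towards large-valued features is easy to see on two rows of this dataset. A small sketch, using only the Age and Salary values of the first two rows:

```python
import numpy as np

# Age and Salary of the first two rows of the dataset.
a = np.array([44.0, 72000.0])
b = np.array([27.0, 48000.0])

# Euclidean distance and the share of it that comes from Salary.
squared_diff = (a - b) ** 2
distance = np.sqrt(squared_diff.sum())
salary_share = squared_diff[1] / squared_diff.sum()

print(distance)
print(salary_share)
```

Salary accounts for almost all of the squared distance, so without scaling Age would barely influence which neighbours k-NN selects.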

This step is not always required, since some ML libraries scale the features internally.

There are two ways to do this:

1. Standardisation
    \begin{equation*}
    x_{\text{stand}} = \frac{x - \operatorname{mean}(x)}{\operatorname{SD}(x)}
    \end{equation*}
2. Normalisation
    \begin{equation*}
    x_{\text{norm}} = \frac{x - \min(x)}{\max(x) - \min(x)}
    \end{equation*}
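Both formulas can be written out with NumPy and compared with the corresponding scikit-learn scalers. This is a minimal sketch on the Age and Salary columns of the training split shown in this notebook; `MinMaxScaler` is not used in the original notebook and appears here only for comparison.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Age and Salary columns of the training split shown in this notebook.
x = np.array([
    [30, 54000], [38, 61000], [44, 72000], [40, 61000],
    [37, 67000], [48, 79000], [50, 83000], [27, 48000],
], dtype=float)

# Standardisation: subtract the column mean, divide by the column SD.
# np.std defaults to the population SD (ddof=0), as StandardScaler does.
x_stand = (x - x.mean(axis=0)) / x.std(axis=0)

# Normalisation: rescale each column to the range [0, 1].
x_norm = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# Both should agree with the scikit-learn scalers.
print(np.allclose(x_stand, StandardScaler().fit_transform(x)))
print(np.allclose(x_norm, MinMaxScaler().fit_transform(x)))
```

After standardisation each column has mean 0 and standard deviation 1; after normalisation each column runs from 0 to 1.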

In [75]:
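scaler_X = StandardScaler()
## fit the scaler on the training data only
X_train = scaler_X.fit_transform(X_train)
## apply the scaling learned from the training data to the test data
X_test = scaler_X.transform(X_test)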
X_train

array([[-0.77459667,  1.29099445, -0.57735027, -1.22318227, -1.03365241],
       [-0.77459667, -0.77459667,  1.73205081, -0.1652949 , -0.41123806],
       [ 1.29099445, -0.77459667, -0.57735027,  0.62812062,  0.56684164],
       [-0.77459667,  1.29099445, -0.57735027,  0.09917694, -0.41123806],
       [ 1.29099445, -0.77459667, -0.57735027, -0.29753082,  0.12225996],
       [ 1.29099445, -0.77459667, -0.57735027,  1.15706431,  1.189256  ],
       [-0.77459667,  1.29099445, -0.57735027,  1.42153615,  1.54492135],
       [-0.77459667, -0.77459667,  1.73205081, -1.61989003, -1.56715043]])

In [76]:
X_test

array([[ 1.29099445, -0.77459667, -0.57735027, -0.56200266, -0.67798707],
       [-0.77459667, -0.77459667,  1.73205081, -0.1652949 , -1.21148508]])