# Data Preprocessing Tools

<div class="alert alert-block alert-info">
    <b>Refresher on object-oriented programming</b>
    <ul>
        <li>A class is a model or blueprint of something we want to build. </li>
        <li>An object is an instance of a class, and follows the class' blueprint.</li>
        <li>A method is a tool/function we can use on an object.</li>
    </ul><br>
    <b>Further thoughts</b>
    <ul>
        <li> It's probably best practice to perform all your module/library imports at the start of each script; however for the purposes of this course, imports will generally be done just before their contents are used. This is to illustrate exactly where each class/function we're using originates from.</li>
    </ul>
</div>
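
The three terms in the refresher can be seen together in a few lines of Python. This is only an illustrative sketch: `Scaler`, `factor`, and `transform` are hypothetical names made up for this example and are not scikit-learn classes or methods.

```python
class Scaler:
    """A class: the blueprint. Hypothetical toy example, not part of scikit-learn."""

    def __init__(self, factor):
        self.factor = factor  # state stored on each individual object

    def transform(self, values):
        """A method: a function we can call on an object of this class."""
        return [v * self.factor for v in values]


halver = Scaler(factor=0.5)  # an object: one instance built from the blueprint
doubler = Scaler(factor=2)   # a second instance with its own state

halved = halver.transform([2, 4])
doubled = doubler.transform([2, 4])
print(halved, doubled)
```

The scikit-learn tools used later follow the same pattern: instantiate an object from a class, then call its methods (such as `fit` and `transform`) on your data.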

## Importing the libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [2]:
dataset = pd.read_csv('./data/Data.csv')
dataset

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [3]:
X = dataset.iloc[:, :-1].values # .values converts to a NumPy array, which the positional slicing used later (e.g. X[:, 1:3]) relies on
y = dataset.iloc[:, -1].values

X, y

(array([['France', 44.0, 72000.0],
        ['Spain', 27.0, 48000.0],
        ['Germany', 30.0, 54000.0],
        ['Spain', 38.0, 61000.0],
        ['Germany', 40.0, nan],
        ['France', 35.0, 58000.0],
        ['Spain', nan, 52000.0],
        ['France', 48.0, 79000.0],
        ['Germany', 50.0, 83000.0],
        ['France', 37.0, 67000.0]], dtype=object),
 array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
       dtype=object))

## Taking care of missing data

Missing values can cause issues in many machine learning algorithms. There are a few strategies to handle them:
* Ignore (drop) the observation. This can be acceptable as long as the values are missing at random and your dataset is large enough.
* Impute the missing value. In this course, we'll start by filling in these values with the average of the column. We'll use `scikit-learn`'s `SimpleImputer` to do this.
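
As a rough side-by-side of the two strategies, here is a minimal pandas sketch. It assumes a stand-in frame `df` holding the Age and Salary values from the table above; `dropped` and `imputed` are illustrative names, not part of the course code.

```python
import numpy as np
import pandas as pd

# Stand-in for the Age and Salary columns shown above
df = pd.DataFrame({
    "Age": [44.0, 27.0, 30.0, 38.0, 40.0, 35.0, np.nan, 48.0, 50.0, 37.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0, np.nan,
               58000.0, 52000.0, 79000.0, 83000.0, 67000.0],
})

dropped = df.dropna()           # Strategy 1: discard rows with any missing value
imputed = df.fillna(df.mean())  # Strategy 2: fill each gap with its column mean

print(len(df), len(dropped))                            # rows before and after dropping
print(imputed.loc[6, "Age"], imputed.loc[4, "Salary"])  # the two filled-in values
```

Dropping costs two of the ten rows here, which is a large share of such a small dataset, so imputation is the more attractive option in this case.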

In [7]:
from sklearn.impute import SimpleImputer

# Instantiate imputer object
imputer = SimpleImputer(
    missing_values = np.nan, # We want to replace all np.nans
    strategy       = 'mean'  # Replace with the mean
)

# Fit the imputer to the data - note that strategy='mean' only works on numeric columns, so we pass Age and Salary only
imputer.fit(X[:, 1:3])

# Apply transformation and assign the result to the original columns
X[:, 1:3] = imputer.transform(X[:, 1:3])

X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

## Encoding categorical data

How do we encode categorical values in a way that machine learning algorithms will be able to work with them?
* We could give them a sequence of numbers; e.g. 1: France, 2: Spain, and so on - but the ML algorithm might then incorrectly assume that there is a numerical order in this variable.
* The answer to this is one-hot encoding, where one binary column is created per category ($n$ columns, $n$ being the number of unique categories in the variable) and each row has a 1 in the column for its category and 0s elsewhere.
    * In regression models with an intercept, keeping all $n$ columns leads to the ['dummy variable trap'](https://www.learndatasci.com/glossary/dummy-variable-trap/), wherein any one column is perfectly determined by the combination of the others (perfect multicollinearity), causing issues when interpreting coefficients. Dropping one column, leaving $n - 1$, avoids this.
    * As a result of the above reasoning, this is only a problem for models where collinear features are an issue. If they are, we can make use of the `drop` argument in `OneHotEncoder()`. This isn't done in this course, so we'll continue with the default value for `drop` (which is `None`) and keep all $n$ columns.
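
To make the difference between $n$ and $n - 1$ columns concrete, the sketch below encodes a small made-up `countries` array twice: once with the default `drop=None` and once with `drop='first'`. The array and the variable names `full` and `reduced` are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up single-column input with three unique categories
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

# Default (drop=None): one column per category, in sorted category order
full = OneHotEncoder().fit_transform(countries).toarray()

# drop='first': the first category (France) becomes the all-zeros baseline
reduced = OneHotEncoder(drop="first").fit_transform(countries).toarray()

print(full.shape, reduced.shape)
print(full.sum(axis=1))  # every row of the full encoding sums to 1
```

Because each row of `full` sums to 1, its columns add up to a constant column, which is exactly what makes them collinear with a regression intercept.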

### Encoding the Independent Variable

In [8]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# First, create an object from the ColumnTransformer class
ct = ColumnTransformer(
    # Specify: kind of transformation (encoding), what kind of encoding, and the columns we want to encode
    transformers = [('encoder', OneHotEncoder(), [0])], 
    # Specifies that we want to keep all other columns that we haven't encoded with this object (everything but Country)
    remainder    = 'passthrough'
)

# Connect this to our dataset; 
#  There is a fit_transform method that handles both fitting and transforming for ColumnTransformer
#  Here ct.fit_transform already returns a dense NumPy array; for sparser data it can return a SciPy sparse matrix, which would need .toarray() rather than np.array
X = np.array(ct.fit_transform(X))
X

array([[1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 35.0, 58000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)

### Encoding the Dependent Variable

Similarly, we'll need our dependent variable to be numeric. Since it's a binary (Yes/No) variable, converting this to 0s and 1s will suffice. We can use `LabelEncoder()` for this.

In [9]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y  = le.fit_transform(y) # NB fit_transform takes a 1-D array-like of labels and returns a NumPy array of integer codes
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
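
To check which label was assigned to which integer, a fitted `LabelEncoder` exposes `classes_` and `inverse_transform`. The sketch below uses a short made-up `labels` array rather than the course's `y`; `mapping` and `decoded` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["No", "Yes", "No", "Yes"])  # made-up labels

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

# classes_ holds the sorted labels; each label's position is its integer code
mapping = {str(label): code for code, label in enumerate(le_demo.classes_)}

# inverse_transform maps integer codes back to the original labels
decoded = le_demo.inverse_transform(encoded)

print(mapping)
print(decoded)
```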

## Splitting the dataset into the Training set and Test set

This step splits the dataset into one set that you train your model on and another that you then test the trained model on. The idea can be taken further, and you'll often create a validation set as well.

The reason for this is as follows:
1. The fundamental issue that causes you to need a training and test set to begin with is **overfitting**; i.e. fitting your model not only to the signal in your dataset, but also to the noise and peculiarities that come with it. An overfitted model performs very well when predicting values within your training set, but falls over when faced with data that it hasn't seen before.
2. To minimise the risk of this occurring, you split your dataset into a training and test set, then train your model on the training set and evaluate its performance on the test set (i.e. data the model hasn't seen before).
    * This is all well and good, but after a large number of iterations, you will likely end up overfitting to the test set as well, since you are tuning against and measuring performance on the same set repeatedly.
    * The generally accepted solution is to create three sets: training, validation, and test. You iterate using the training and validation sets and, once you're confident in the model you've built, evaluate it once on the test set, which has been left untouched until then, to confirm that overfitting hasn't substantially impacted your model's ability to predict on data it hasn't seen before.


* Note overfitting is a fundamental problem to avoid in all forms of predictive modelling, and there are many other techniques that aim to reduce the chance of this occurring (including regularization).
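
One way to obtain three sets is to call `train_test_split` twice. The sketch below uses made-up data, and the names `X_dev` (the set you score against repeatedly while iterating) and `X_holdout` (the set checked once at the end) are hypothetical labels chosen for this example; the 60/20/20 proportions are likewise just an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(200).reshape(100, 2)  # 100 made-up observations, 2 features
y_demo = np.arange(100) % 2              # made-up binary labels

# First split: set aside 20% as the final holdout set
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# Second split: 25% of the remaining 80% (20% overall) is used while iterating
X_train_demo, X_dev, y_train_demo, y_dev = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1
)

print(len(X_train_demo), len(X_dev), len(X_holdout))
```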

`sklearn` provides us with a function to split our dataset into train and test sets; this function returns four datasets: train set of features, test set of features, train set of labels, and test set of labels.

In [10]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

## Feature Scaling

This consists of scaling all your features so that they all take values on the same scale. We do this to prevent one feature from dominating another during the model training process. Note though that this doesn't have to be done for every type of model; and also doesn't need to be done on dummy variables.

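Consider multiple linear regression that takes the form $y = b_0 + b_1x_1 + \dots + b_nx_n$; if some $x_i$ takes large values, the coefficient for that feature will compensate by becoming smaller, so coefficient magnitudes stop being comparable across features. In models that rely on distances, gradient descent, or regularisation, features with large values can also dominate the others. This is what we're trying to avoid when we scale our features.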
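
The compensation effect can be seen with a small least-squares fit in NumPy. Everything here is made up for illustration: the `age` and `salary` arrays, the noise-free target, and the helper `fit_ols`, which is a hypothetical name and not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)
age = rng.uniform(20, 60, size=50)             # made-up feature on a small scale
salary = rng.uniform(40_000, 90_000, size=50)  # made-up feature on a large scale
y_demo = 5.0 + 2.0 * age + 0.001 * salary      # noise-free target for clarity


def fit_ols(features, target):
    """Ordinary least squares with an intercept; returns [b0, b1, b2]."""
    design = np.column_stack([np.ones(len(target)), features])
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coefs


raw = fit_ols(np.column_stack([age, salary]), y_demo)
rescaled = fit_ols(np.column_stack([age, salary / 1000]), y_demo)

print(raw)       # salary coefficient is tiny because salary values are large
print(rescaled)  # same fit, but salary in thousands gives a coefficient 1000x larger
```

The predictions are identical in both fits; only the size of the salary coefficient changes with the units of the feature.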

### A note on sequencing
You have to apply feature scaling **after** splitting the dataset into the training and test set. The simple reason for this is that the test set is meant to be a **brand new set** on which you evaluate your ML model. 

Performing feature scaling on the whole dataset before splitting it into train/test would cause information leakage: the scaling statistics (e.g. the mean and standard deviation) would be computed partly from the test observations, so information from the test set would influence how the training data is prepared.

As a consequence, we'll also need to use the mean and standard deviation (or min and max) of $X_{train}$ to scale $X_{test}$.
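
The sketch below illustrates that a scaler fitted on the training data reuses the training mean and standard deviation when transforming the test data. The arrays `X_train_demo` and `X_test_demo` are made-up single-feature examples, not the course's split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train_demo = np.array([[27.0], [38.0], [44.0], [50.0]])  # made-up training ages
X_test_demo = np.array([[30.0], [48.0]])                   # made-up test ages

# fit() learns the mean and standard deviation from the training data only
sc_demo = StandardScaler().fit(X_train_demo)

# transform() applies those stored training statistics to the test data
scaled_test = sc_demo.transform(X_test_demo)

# The same calculation by hand, using training statistics only
manual_test = (X_test_demo - X_train_demo.mean(axis=0)) / X_train_demo.std(axis=0)

print(sc_demo.mean_, sc_demo.scale_)
print(scaled_test.ravel(), manual_test.ravel())
```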

### Types of feature scaling
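* Standardisation: $x_{stand} = \frac{x - mean(x)}{standard\;deviation(x)}$; this gives each feature a mean of 0 and a standard deviation of 1. Values are not strictly bounded, but for roughly normally distributed features almost all of them (about 99.7%) fall between -3 and +3.
* Normalisation: $x_{norm} = \frac{x - min(x)}{max(x)-min(x)}$; this bounds features between 0 and 1 (on the data used to compute the min and max).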

Which one is better? 
* Normalisation suits features with known bounds and few outliers, since a single extreme value shifts the min or max and compresses the remaining values.
* Standardisation works well in most cases. Hence, we'll go with this for the most part.
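
Both formulas are short enough to apply directly in NumPy. The sketch below uses the nine non-missing ages from the dataset as the example feature `x`; the variable names are illustrative.

```python
import numpy as np

# The nine non-missing Age values from the dataset, sorted for readability
x = np.array([27.0, 30.0, 35.0, 37.0, 38.0, 40.0, 44.0, 48.0, 50.0])

x_stand = (x - x.mean()) / x.std()            # standardisation
x_norm = (x - x.min()) / (x.max() - x.min())  # normalisation (min-max)

print(x_stand.round(2))  # centred on 0 with unit standard deviation
print(x_norm.round(2))   # runs from 0 (youngest) to 1 (oldest)
```

In scikit-learn these correspond to `StandardScaler` and `MinMaxScaler` respectively, both in `sklearn.preprocessing`.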

In [11]:
from sklearn.preprocessing import StandardScaler

# Instantiate a StandardScaler object
sc = StandardScaler()

# Fit this object to X_train and transform, skipping the first three columns (the dummy variables)
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])

# Use this fitted object to transform X_test as well (note, don't fit it on X_test)
X_test[:, 3:] = sc.transform(X_test[:, 3:])

X_train, X_test

(array([[0.0, 0.0, 1.0, -0.19159184384578545, -1.0781259408412425],
        [0.0, 1.0, 0.0, -0.014117293757057777, -0.07013167641635372],
        [1.0, 0.0, 0.0, 0.566708506533324, 0.633562432710455],
        [0.0, 0.0, 1.0, -0.30453019390224867, -0.30786617274297867],
        [0.0, 0.0, 1.0, -1.9018011447007988, -1.420463615551582],
        [1.0, 0.0, 0.0, 1.1475343068237058, 1.232653363453549],
        [0.0, 1.0, 0.0, 1.4379472069688968, 1.5749910381638885],
        [1.0, 0.0, 0.0, -0.7401495441200351, -0.5646194287757332]],
       dtype=object),
 array([[0.0, 1.0, 0.0, -1.4661817944830124, -0.9069571034860727],
        [1.0, 0.0, 0.0, -0.44973664397484414, 0.2056403393225306]],
       dtype=object))