# Data Pre-Processing

After data preparation is done, you should have a clean dataset. But before applying machine learning algorithms to the dataset, some data pre-processing is needed.

The following two types of data pre-processing are discussed below:

- Encoding categorical variables
- Scaling and standardizing data

In [1]:
# Import libraries
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns

# Suppress all warnings to keep the notebook output readable
import warnings
warnings.filterwarnings("ignore")

### Encoding Categorical Variables

Almost all practical datasets contain categorical variables. These variables are normally stored as text values. Some examples include Gender ("Male" or "Female"), Level ("Low", "Medium" or "High"), or geographic designations (State, Region or Country). Some machine learning algorithms support categorical values without further manipulation, but many more do not. Therefore, we need to turn these text categorical attributes into numerical values for further processing.

There are many ways to approach this problem. In this notebook, we use the pandas and scikit-learn modules to transform categorical data into suitable numeric values.

Categorical features can take several forms. For example, a categorical feature can be nominal or ordinal (other types are also possible).

**Nominal feature**: A nominal feature is either in a category or it isn't, and there are no relationships between the different categories. For example, the gender category is nominal since there is no numerical relation or ordering among the possible values, male and female.  
    
**Ordinal feature**: An ordinal feature is a categorical feature whose possible values have an intrinsic order. For example, if we encode the results of a race as first, second, and third, these values are related in that first comes before second and second comes before third.

The process to convert categorical features to numerical values is generally known as encoding, and the scikit-learn library provides several different encodings in the preprocessing module.

To begin, we create a fictitious dataset that contains the categorical features gender, affluency and region.

In [2]:
# Create a simple dataframe
customer_demo = pd.DataFrame({'customerID':['A021', 'B341', 'C006', 'D122', 'E874', 'F442', 'G433', 'H343', 'I532', 'J451'],
                              'gender':['F', 'M', 'M', 'M', 'F',  'M', 'F', 'M', 'F', 'M'],
                              'affluency':['Low', 'High', 'Medium', 'Low', 'Medium', 'High', 'High', 'High', 'Low', 'Low'],
                              'region':['West', 'Central', 'East', 'East', 'Central', 'East', 'West', 'Central', 'Central', 'West']
                            })
customer_demo                            

Unnamed: 0,customerID,gender,affluency,region
0,A021,F,Low,West
1,B341,M,High,Central
2,C006,M,Medium,East
3,D122,M,Low,East
4,E874,F,Medium,Central
5,F442,M,High,East
6,G433,F,High,West
7,H343,M,High,Central
8,I532,F,Low,Central
9,J451,M,Low,West


#### Label Encoding
The simplest approach to encoding categorical values is a technique called Label Encoding, which converts each value in a column to a number. In the following Code cell, we create a new column, gender_cat, to hold the encoded gender. 'F' is encoded as 0 and 'M' as 1, because LabelEncoder assigns codes in sorted (here alphabetical) order.

In [3]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
customer_demo['gender_cat'] = le.fit_transform(customer_demo.gender)
customer_demo

Unnamed: 0,customerID,gender,affluency,region,gender_cat
0,A021,F,Low,West,0
1,B341,M,High,Central,1
2,C006,M,Medium,East,1
3,D122,M,Low,East,1
4,E874,F,Medium,Central,0
5,F442,M,High,East,1
6,G433,F,High,West,0
7,H343,M,High,Central,1
8,I532,F,Low,Central,0
9,J451,M,Low,West,1
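
To check which integer was assigned to which category, the fitted encoder can be inspected. The following is a minimal sketch on a standalone Series (the values are copied from the gender column above, and the names `label_mapping` and `recovered` are only illustrative); `classes_` and `inverse_transform` are part of scikit-learn's `LabelEncoder`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Standalone copy of the gender column from customer_demo
gender = pd.Series(['F', 'M', 'M', 'M', 'F', 'M', 'F', 'M', 'F', 'M'])

le = LabelEncoder()
gender_cat = le.fit_transform(gender)

# classes_ holds the sorted unique values; the position of each value is its code
label_mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(label_mapping)

# inverse_transform recovers the original labels from the integer codes
recovered = le.inverse_transform(gender_cat)
print(list(recovered[:3]))
```

Keeping the fitted encoder around is useful when the numeric codes later need to be translated back into readable labels.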


#### Ordinal Encoding

LabelEncoder finds the unique values present in a column and maps them to the range [0, n-1], n being the number of unique values in the column. The values are mapped in alphabetical order. Thus, in the previous case, 'F' is mapped to 0 and 'M' is mapped to 1.

If we use the same approach to encode the affluency column, the mapping will be:  

High: 0  
Low: 1  
Medium: 2  

This mapping is not ideal since affluency is an ordinal categorical feature. The three categories, Low, Medium and High, have an order associated with them. We would like to have this mapping instead:

Low: 0  
Medium: 1  
High: 2  

There are multiple ways to achieve this. One of the simplest is to use the pandas Series map() function, as shown below. First, we find all unique values in the column, then define a mapping dictionary, and finally create a new column with the mapped numeric values.

In [4]:
# Unique values in affluency
customer_demo.affluency.unique()

array(['Low', 'High', 'Medium'], dtype=object)

In [5]:
# Define mapping dictionary
mapping_dict = {'Low':0, 'Medium':1, 'High':2}

# Encode affluency column
customer_demo['affluency_cat'] = customer_demo.affluency.map(mapping_dict)
customer_demo

Unnamed: 0,customerID,gender,affluency,region,gender_cat,affluency_cat
0,A021,F,Low,West,0,0
1,B341,M,High,Central,1,2
2,C006,M,Medium,East,1,1
3,D122,M,Low,East,1,0
4,E874,F,Medium,Central,0,1
5,F442,M,High,East,1,2
6,G433,F,High,West,0,2
7,H343,M,High,Central,1,2
8,I532,F,Low,Central,0,0
9,J451,M,Low,West,1,0
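
As an alternative to a hand-written dictionary, scikit-learn's `OrdinalEncoder` accepts an explicit category order. The following is a minimal sketch, assuming the order Low < Medium < High and using a standalone copy of the affluency column (the variable names are illustrative).

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Standalone copy of the affluency column from customer_demo
affluency = pd.DataFrame({'affluency': ['Low', 'High', 'Medium', 'Low', 'Medium',
                                        'High', 'High', 'High', 'Low', 'Low']})

# Pass the categories in the desired order: Low -> 0, Medium -> 1, High -> 2
oe = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])

# fit_transform expects 2-D input and returns a 2-D float array,
# so we flatten it and cast to int for readability
affluency_codes = oe.fit_transform(affluency[['affluency']]).ravel().astype(int)
print(affluency_codes)
```

Unlike `map()`, which silently produces NaN for a value missing from the dictionary, `OrdinalEncoder` by default raises an error when it meets a category it was not given.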


#### One Hot Encoding
Label Encoding is straightforward, but it has the disadvantage that the numeric values can be “misinterpreted” by the algorithms. For example, the value 0 is obviously less than the value 1, but does that ordering really correspond to anything in the data? Consider the region column in our customer_demo dataset. If we use label encoding, Central is mapped to 0, East to 1 and West to 2, but Central is not supposed to be "smaller" than West. Ordinal encoding doesn't help in this case either, since the regions have no natural order.

A common alternative approach is called One Hot Encoding. The basic strategy is to convert each category value into a new column and assign a 1 or 0 (True/False) value to that column. This has the benefit of not weighting a value improperly, but it has the downside of adding more columns to the data set.

Again, there are multiple ways to do One Hot Encoding. Here we use the pandas `get_dummies` function, although other techniques exist. We encode region in the following Code cell. Three extra columns are created, one for each unique value in region: region_Central, region_East and region_West. For each row, exactly one of the three dummy columns has the value 1, depending on the value of region.

In [6]:
# duplicate region column to keep original values
customer_demo['region_cat'] = customer_demo.region

# convert region_cat to dummy variables.
customer_demo_ohe = pd.get_dummies(customer_demo, columns=["region_cat"], prefix=["region"])
customer_demo_ohe

Unnamed: 0,customerID,gender,affluency,region,gender_cat,affluency_cat,region_Central,region_East,region_West
0,A021,F,Low,West,0,0,0,0,1
1,B341,M,High,Central,1,2,1,0,0
2,C006,M,Medium,East,1,1,0,1,0
3,D122,M,Low,East,1,0,0,1,0
4,E874,F,Medium,Central,0,1,1,0,0
5,F442,M,High,East,1,2,0,1,0
6,G433,F,High,West,0,2,0,0,1
7,H343,M,High,Central,1,2,1,0,0
8,I532,F,Low,Central,0,0,1,0,0
9,J451,M,Low,West,1,0,0,0,1
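
One of the other techniques is scikit-learn's `OneHotEncoder`. The following is a minimal sketch on a standalone copy of the region column (variable names are illustrative). It relies on the encoder's default sparse output and converts it to a dense array only for display.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Standalone copy of the region column from customer_demo
region = pd.DataFrame({'region': ['West', 'Central', 'East', 'East', 'Central',
                                  'East', 'West', 'Central', 'Central', 'West']})

ohe = OneHotEncoder()

# The default output is a SciPy sparse matrix; toarray() makes it dense
region_ohe = ohe.fit_transform(region[['region']]).toarray()

# categories_ lists the category order used for the output columns
region_categories = list(ohe.categories_[0])
print(region_categories)
print(region_ohe[:3])
```

A fitted encoder can be reused via `transform` on new data, which keeps the column layout identical between training and testing sets.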


### Data Scaling

Many machine learning estimators in the scikit-learn library are sensitive to variations in the spread of features within a data set. For example, if all features but one span similar ranges (e.g., zero through one) and one feature spans a much larger range (e.g., zero through one hundred), an algorithm might focus on the one feature with a larger spread, even if this produces a sub-optimal result. To prevent this, we generally scale the features to improve the performance of a given scikit-learn estimator.

The following are two of several forms of data scaling:  
    
**Standardizing**: the data are scaled to have zero mean and unit (i.e., one) variance.  
    
**Normalizing**: the data are scaled to span a defined range, such as $[0, 1]$.
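
Written as formulas, standardizing computes $(x - \text{mean}) / \text{std}$ and normalizing to $[0, 1]$ computes $(x - \text{min}) / (\text{max} - \text{min})$. The following sketch applies both by hand with NumPy to a made-up single feature and checks the results against the scikit-learn scalers; the array values are an assumption chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up single feature, shaped (n_samples, 1) as scikit-learn expects
x = np.array([[1.0], [2.0], [3.0], [4.0], [10.0]])

# Standardizing: subtract the mean, divide by the (population) standard deviation
x_std = (x - x.mean(axis=0)) / x.std(axis=0)

# Normalizing: shift by the minimum, divide by the range
x_norm = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# Compare the hand-computed values with the scikit-learn scalers
same_std = np.allclose(x_std, StandardScaler().fit_transform(x))
same_norm = np.allclose(x_norm, MinMaxScaler().fit_transform(x))
print(same_std, same_norm)
```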

One important caveat is that any scaling technique should be _trained_ via the fit method on the training data only. Once trained, the same scaling can be applied to both the training and testing data. In this manner, the testing data are always transformed into the space defined by the training data, which is what is used to generate the predictive model, and no information from the testing data leaks into the pre-processing.

We demonstrate this approach in the following Code cells, where we load the Iris data, split it into training and testing sets, and then compute the scaling from the training data only. The fitted transformation is applied to both the training and testing data. We first demonstrate standardizing with the scikit-learn StandardScaler, then normalizing with the scikit-learn MinMaxScaler.

In [7]:
# Load the Iris Data
iris = pd.read_csv("iris.csv")
print(iris.shape)
iris.head()

(150, 5)


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


In [8]:
# Define features and label
features = iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
label = iris['class']

In [9]:
from sklearn.model_selection import train_test_split

# Split data into training and testing
X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=0.25, random_state=0)
X_train.shape, X_test.shape

((112, 4), (38, 4))

In [10]:
from sklearn.preprocessing import StandardScaler

# Create and fit scaler
ss = StandardScaler()

# Don't cheat, fit only the train data
ss.fit(X_train)

ss_X_train = ss.transform(X_train)
ss_X_test = ss.transform(X_test)

ss_X_train[:5], ss_X_test[:5]

(array([[ 0.01543995, -0.11925475,  0.22512685,  0.35579762],
        [-0.09984503, -1.04039491,  0.11355956, -0.02984109],
        [ 1.05300481, -0.11925475,  0.95031423,  1.12707506],
        [-1.36797986,  0.34131533, -1.39259884, -1.31530348],
        [ 1.1682898 ,  0.11103029,  0.72717965,  1.38416753]]),
 array([[-0.09984503, -0.57982483,  0.72717965,  1.51271377],
        [ 0.13072494, -1.96153508,  0.11355956, -0.28693357],
        [-0.44569998,  2.64416573, -1.33681519, -1.31530348],
        [ 1.62942973, -0.34953979,  1.39658338,  0.74143634],
        [-1.0221249 ,  0.80188541, -1.28103155, -1.31530348]]))

In [11]:
from sklearn.preprocessing import MinMaxScaler

# Create and fit scaler
mms = MinMaxScaler()

mms.fit(X_train)

mms_X_train = mms.transform(X_train)
mms_X_test = mms.transform(X_test)

mms_X_train[:5], mms_X_test[:5]

(array([[0.44444444, 0.41666667, 0.53448276, 0.58333333],
        [0.41666667, 0.25      , 0.5       , 0.45833333],
        [0.69444444, 0.41666667, 0.75862069, 0.83333333],
        [0.11111111, 0.5       , 0.03448276, 0.04166667],
        [0.72222222, 0.45833333, 0.68965517, 0.91666667]]),
 array([[0.41666667, 0.33333333, 0.68965517, 0.95833333],
        [0.47222222, 0.08333333, 0.5       , 0.375     ],
        [0.33333333, 0.91666667, 0.05172414, 0.04166667],
        [0.83333333, 0.375     , 0.89655172, 0.70833333],
        [0.19444444, 0.58333333, 0.06896552, 0.04166667]]))
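
Because the scaler is fit on the training data only, transformed testing values are not guaranteed to fall inside $[0, 1]$. The following is a minimal sketch with made-up numbers (not the Iris data) that shows this effect; `data_min_` and `data_max_` are attributes of a fitted `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up training and testing values for a single feature
train = np.array([[2.0], [4.0], [6.0]])
test = np.array([[1.0], [5.0], [8.0]])

mms = MinMaxScaler()
mms.fit(train)  # learns min=2 and max=6 from the training data only

# Testing values below 2 map to negatives, values above 6 map beyond 1
test_scaled = mms.transform(test)
print(mms.data_min_, mms.data_max_)
print(test_scaled.ravel())
```

This is expected behavior rather than an error: the testing data are expressed in the coordinate system learned from the training data.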

#### Standardizing or Normalizing?
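Normalizing is a reasonable default when transforming a feature: it is a linear rescaling, so it does not distort the shape of the feature's distribution. If the dataset contains outliers, however, normalizing can be problematic, because the extreme values define the range and squeeze the remaining values into a narrow interval. You might be better off removing the outliers before normalizing, or using a scaling method that handles outliers better, such as scikit-learn's `RobustScaler`.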
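
To illustrate the effect of an outlier, the following sketch compares `MinMaxScaler` with `RobustScaler` on a made-up feature containing one extreme value. The data are an assumption chosen for illustration; `RobustScaler` with its default settings centers on the median and scales by the interquartile range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Made-up feature with one obvious outlier (100)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Min-max scaling: the outlier defines the range, squeezing the other values near 0
x_minmax = MinMaxScaler().fit_transform(x).ravel()

# Robust scaling: center on the median (3) and divide by the interquartile range (2)
x_robust = RobustScaler().fit_transform(x).ravel()

print(x_minmax.round(3))
print(x_robust)
```

With min-max scaling the four ordinary values end up within about 0.03 of each other, while robust scaling keeps them evenly spread and leaves the outlier far away.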

If a feature is roughly normally distributed, consider standardizing. Outliers tend to have less impact on standardizing than on normalizing. If the feature is far from normally distributed, however, standardizing is less effective than normalizing.

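Not all machine learning algorithms require data scaling. For example, scaling is not necessary for decision trees or random forests, which split on one feature at a time and are therefore unaffected by rescaling individual features.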
