<a href="https://colab.research.google.com/github/rs2pydev/pythonic_topics/blob/main/Annotated_Tutorial__NominalEncoding_ColumnTransformer_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1><center><b>Annotated tutorial on using the Scikit-Learn library for data preprocessing: Encoding categorical features and building pipelines</b></center></h1>

## **Table of Contents**

1. **[Section-1: Tutorial overview](#1)**
2. **[Section-2: Imports and other associated operations](#2)**
3. **[Section-3: Load and prepare a data set](#3)**
4. **[Section-4: Cross-validating a simple model](#4)**
5. **[Section-5: Encoding categorical features and setting up a pipeline](#5)**

## **<a name="1">Section-1: Tutorial overview</a>**

Before we can build, train, and deploy a machine learning model, we need to preprocess the data. One such preprocessing step is encoding categorical features (text data) numerically.

Here, in this tutorial, we'll learn the following: 

* First, we'll learn how to use the `OneHotEncoder` and `ColumnTransformer` classes of the Scikit-Learn library to encode categorical features and to prepare the feature matrix in a single step, respectively.

* Then we'll learn how to include these steps within a Scikit-Learn `Pipeline` so that we can cross-validate our model and our preprocessing steps, simultaneously. 

* Finally, we'll learn why we should use the Scikit-Learn library, rather than the Pandas library, for data preprocessing.

This tutorial is based on the following videos from the [Data School](https://www.youtube.com/dataschool) YouTube channel:

* [How do I encode categorical features using scikit-learn?](https://www.youtube.com/watch?v=irHhDMbw3xo)

* [Encode categorical features using `OneHotEncoder` or `OrdinalEncoder`](https://www.youtube.com/watch?v=0w78CHM_ubM&list=PL5-da3qGB5ID7YYAqireYEew2mWVvgmj6&index=6)

* [Seven ways to select columns using ColumnTransformer](https://www.youtube.com/watch?v=sCt4LVD5hPc&list=PL5-da3qGB5ID7YYAqireYEew2mWVvgmj6&index=3&t=155s)

## **<a name="2">Section-2: Imports and other associated operations</a>**

In [33]:
import pandas as pd
import sklearn as skl

In [34]:
print(f'pandas version: {pd.__version__:s}')
print(f'sklearn version: {skl.__version__:s}')

pandas version: 1.3.5
sklearn version: 1.0.2


In [35]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.pipeline import make_pipeline

## **<a name="3">Section-3: Load and prepare a data set</a>**

In [36]:
# Titanic - Train data set
df_train = pd.read_csv('http://bit.ly/kaggletrain')

In [37]:
# Titanic - Test data set
df_test = pd.read_csv('http://bit.ly/kaggletest')

In [38]:
print(f'Shape - Train data set: {df_train.shape}\n')
print(f'Shape - Test data set: {df_test.shape}\n')

Shape - Train data set: (891, 12)

Shape - Test data set: (418, 11)



In this tutorial we will be working with only the train data set!

In [39]:
df = df_train.copy(deep=True)

In [40]:
def draw_line():
    print('****' * 20)

In [41]:
cols = df.columns.to_list()

draw_line()
print('Columns:')
draw_line()
print(*cols, sep=", ", end="\n")

********************************************************************************
Columns:
********************************************************************************
PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked


In [42]:
draw_line()
print('Null values:')
draw_line()
print(df.isna().sum())

********************************************************************************
Null values:
********************************************************************************
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


In [43]:
# For this tutorial we will select 4 columns
select_cols = ['Survived', 'Pclass', 'Sex', 'Embarked']

# Keep only the rows where `Embarked` is not null
df = df.loc[df.Embarked.notna(), select_cols]

In [44]:
print(f'Shape - New dataframe: {df.shape}')

Shape - New dataframe: (889, 4)


In [45]:
df.head()

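|   | Survived | Pclass | Sex | Embarked |
|---|----------|--------|-----|----------|
| 0 | 0 | 3 | male | S |
| 1 | 1 | 1 | female | C |
| 2 | 1 | 3 | female | S |
| 3 | 1 | 1 | female | S |
| 4 | 0 | 3 | male | S |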


## **<a name="4">Section-4: Cross-validating a simple model</a>**

In [46]:
X = df.loc[:, ['Pclass']]
y = df['Survived']

In [47]:
X.shape

(889, 1)

In [48]:
y.shape

(889,)

In [49]:
logReg = LogisticRegression(solver='lbfgs')

In [50]:
avg_accur = cross_val_score(logReg, X, y, scoring='accuracy',
                            cv=5, n_jobs=-1).mean()
print(f'Average accuracy of logistic regression model: {avg_accur: 0.5f}')

Average accuracy of logistic regression model:  0.67834


* We will compare the above average accuracy with the **null accuracy**.

* **Null accuracy** is the accuracy obtained by always predicting the most frequent class, so it equals the largest class share in the output below.

In [51]:
null_accur = y.value_counts(normalize=True)
print(f'Null accuracy:\n{null_accur}')

Null accuracy:
0    0.617548
1    0.382452
Name: Survived, dtype: float64
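The cell above prints the share of every class rather than a single number. A minimal sketch of how the null accuracy itself could be extracted is shown below; it uses a small hypothetical target vector (`y_toy`), not the Titanic data, so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy target (not the Titanic data): three 0s and two 1s
y_toy = pd.Series([0, 0, 0, 1, 1], name="Survived")

# Share of each class in the target
class_shares = y_toy.value_counts(normalize=True)

# Null accuracy is the share of the most frequent class
null_accuracy = class_shares.max()
majority_class = class_shares.idxmax()

print(f"Majority class: {majority_class}")
print(f"Null accuracy: {null_accuracy:.3f}")
```

Applied to `y` from this tutorial, the same expression would pick the share of class `0`, roughly 0.618, so the cross-validated accuracy of about 0.678 is only a modest improvement over always predicting "did not survive".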


**Finally! The motivation for this tutorial - How can I add more features to my model and cross-validate it?**

## **<a name="5">Section-5: Encoding categorical features and setting up a pipeline</a>**

Categorical (text) features are of two types:

* **Nominal:** Classes are intrinsically unordered. For example, the `Sex` variable with classes "male" and "female" is a **nominal** variable because there is no natural/logical order among its classes. The same is true for the categorical variable `Embarked`.

* **Ordinal:** Classes have an inherent (natural) order. For example, a categorical variable with the classes "Low", "High", "Medium" is **ordinal** because there is a natural (and also logical) order among the classes, namely, "Low" < "Medium" < "High".

* Encoding **nominal** categorical variables is known as "dummy" or "one-hot" encoding. The former term is used by the Pandas library while the latter term is used by the Scikit-Learn library. 


* Encoding of **ordinal** features is performed with the `OrdinalEncoder` class from the Scikit-Learn library.


* To gain a quick but effective understanding of the `OneHotEncoder` and the `OrdinalEncoder`, refer to [this](https://www.youtube.com/watch?v=0w78CHM_ubM&list=PL5-da3qGB5ID7YYAqireYEew2mWVvgmj6&index=6) YouTube tutorial by Data School.
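As a rough illustration of ordinal encoding, the sketch below encodes a hypothetical `Priority` column (not part of the Titanic data) with `OrdinalEncoder`. The `categories` argument is given explicitly because, by default, the encoder sorts the classes alphabetically, which would not match the logical order "Low" < "Medium" < "High".

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature, used only for illustration
df_toy = pd.DataFrame({"Priority": ["Low", "High", "Medium", "Low"]})

# One list of categories per feature, given in their logical order
ord_enc = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
encoded = ord_enc.fit_transform(df_toy[["Priority"]])

# Low -> 0, Medium -> 1, High -> 2
print(encoded.ravel())
```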

Let us now see how we can use the `OneHotEncoder` for encoding the **nominal** categorical variables present in our dataframe, namely, `Sex` and `Embarked`.

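* The `OneHotEncoder` class is instantiated below with `sparse=False`, for illustration purposes only, so that it returns a dense array. The default is `sparse=True`, which returns a sparse matrix and is usually the better choice in real-life scenarios. (In scikit-learn 1.2 and later, this parameter is named `sparse_output`.)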


* The `drop` parameter allows us to avoid the dummy variable trap.
    * `drop=None`: The default; all categories are retained.
    * `drop='first'`: Drop the first category in each feature. If only one category is present, the feature will be dropped entirely.
    * `drop='if_binary'`: Drop the first category in each feature with two categories. Features with one category or with more than two categories are left intact.
    

* The `handle_unknown='ignore'` setting tells the encoder to ignore categories in new (incoming) data that were not seen when the encoder was fitted: such values are encoded as all zeros instead of raising an error. The default, `handle_unknown='error'`, works with any value of the `drop` argument. Older scikit-learn releases required `drop=None` for `handle_unknown='ignore'` to work; this restriction appears to have been relaxed around version 1.0, so check the documentation for the version you have installed.
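The effect of `handle_unknown` can be sketched on toy data. The frames `train_toy` and `new_toy` below are hypothetical, and the category `"X"` is invented to stand for a value never seen during fitting. The encoders are left at their default sparse output and converted with `.toarray()` for printing, so the example does not depend on the `sparse` argument.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "X" does not occur in the training frame
train_toy = pd.DataFrame({"Embarked": ["C", "Q", "S", "S"]})
new_toy = pd.DataFrame({"Embarked": ["S", "X"]})

# handle_unknown='ignore': the unseen category becomes a row of zeros
ohe_ignore = OneHotEncoder(handle_unknown="ignore")
ohe_ignore.fit(train_toy)
encoded_new = ohe_ignore.transform(new_toy).toarray()
print(encoded_new)

# handle_unknown='error' (the default): transform raises a ValueError
ohe_error = OneHotEncoder(handle_unknown="error")
ohe_error.fit(train_toy)
try:
    ohe_error.transform(new_toy)
    raised = False
except ValueError as exc:
    raised = True
    print(f"ValueError: {exc}")
```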

In [52]:
ohe = OneHotEncoder(drop='if_binary', sparse=False)  
ohe.fit_transform(df.loc[:, ['Sex']]);

In [53]:
ohe.categories_

[array(['female', 'male'], dtype=object)]

In [54]:
ohe = OneHotEncoder(drop='first', sparse=False)
ohe.fit_transform(df.loc[:, ['Embarked']]);

In [55]:
ohe.categories_

[array(['C', 'Q', 'S'], dtype=object)]
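The cells above end with a semicolon, so the encoded arrays are not displayed. The sketch below, on a hypothetical three-row frame, shows how the number of output columns changes with the `drop` argument. It uses the default sparse output with `.toarray()` rather than `sparse=False`, purely to stay independent of the installed scikit-learn version.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame: `Sex` has 2 categories, `Embarked` has 3
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

shapes = {}
for drop in (None, "first", "if_binary"):
    enc = OneHotEncoder(drop=drop)
    out = enc.fit_transform(toy).toarray()
    shapes[drop] = out.shape
    print(f"drop={drop!r}: {out.shape[1]} columns")
```

With `drop=None` both features keep all their categories (2 + 3 columns), `drop='first'` removes one column per feature (1 + 2), and `drop='if_binary'` removes a column only from the two-category feature `Sex` (1 + 3).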

**We use a column transformer whenever the features in our dataset need different preprocessing. Let us now learn how to use Scikit-Learn's column transformer!**

In [56]:
# Approach 1

cat_var_preprocess_1 = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), ['Sex', 'Embarked']), 
    remainder='passthrough') 

In [57]:
# Approach 2

cat_var_preprocess_2 = make_column_transformer(
    (OneHotEncoder(drop='if_binary') , ['Sex']), 
    (OneHotEncoder(drop='first'), ['Embarked']), 
    remainder='passthrough') 

In the above examples, `cat_var_preprocess_1` and `cat_var_preprocess_2` are our column transformers; we create two of them to illustrate a specific point, as we will see below. Both one-hot encode the **nominal** categorical variables, `Sex` and `Embarked`, and pass the remaining columns through unchanged (`remainder='passthrough'`).
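To make the output of a column transformer more concrete, here is a small sketch on a hypothetical three-row frame that mirrors approach 2. The argument `sparse_threshold=0` is an addition for this illustration only: it asks the transformer to always return a dense array so that the result is easy to print.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame with the same column names as the tutorial's X
toy = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

ct = make_column_transformer(
    (OneHotEncoder(drop="if_binary"), ["Sex"]),
    (OneHotEncoder(drop="first"), ["Embarked"]),
    remainder="passthrough",
    sparse_threshold=0,  # illustration only: force a dense result
)
transformed = ct.fit_transform(toy)

# Column order: encoded `Sex`, encoded `Embarked`, then the passthrough `Pclass`
print(transformed)
```

The encoded columns come first, in the order in which the transformers are listed, and the passthrough columns are appended at the end.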

Instead of listing the columns to preprocess by name, we can make use of another helper from the `sklearn.compose` module, namely, `make_column_selector()`. To understand the use of this function, refer to the following YouTube video by Data School:
[Seven ways to select columns using ColumnTransformer](https://www.youtube.com/watch?v=sCt4LVD5hPc&list=PL5-da3qGB5ID7YYAqireYEew2mWVvgmj6&index=3&t=155s)

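Let us illustrate one of the three ways of using `make_column_selector()` inside `make_column_transformer()`: selecting columns with a regular expression. The other two ways select columns by data type, via `dtype_include` and `dtype_exclude`.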

In [58]:
# Same encoder as in approach 1 above, but the columns are now
# selected by `make_column_selector` with a regex: 'S|E' matches
# column names that contain an uppercase 'S' or 'E'

cat_var_preprocess_3 = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), make_column_selector('S|E')),   
    remainder='passthrough') 
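A selector can also be called directly on a dataframe to check which columns it picks. The sketch below does this on a hypothetical two-row frame with the same column names as `X`; it shows the regex used above next to the two dtype-based options.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical toy frame with the same column names as the tutorial's X
toy = pd.DataFrame({
    "Pclass": [3, 1],
    "Sex": ["male", "female"],
    "Embarked": ["S", "C"],
})

# Regex search on the column names (case-sensitive, matches anywhere in the name)
by_regex = make_column_selector(pattern="S|E")(toy)

# Selection by data type
numeric_cols = make_column_selector(dtype_include="number")(toy)
non_numeric_cols = make_column_selector(dtype_exclude="number")(toy)

print(by_regex)
print(numeric_cols)
print(non_numeric_cols)
```

`Pclass` is not matched by `'S|E'` because it contains only a lowercase "s"; a looser pattern could easily select more columns than intended, so it is worth checking the selection like this before building the transformer.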

Now, let us use this column transformer and the logistic regression model to build a `sklearn` pipeline which will be eventually fed into the `cross_val_score()` function.

In [59]:
################################################################
# Recreate the feature matrix with all the necessary columns
################################################################

X = df.loc[:, ['Pclass', 'Sex', 'Embarked']]
X.head()

|   | Pclass | Sex | Embarked |
|---|--------|-----|----------|
| 0 | 3 | male | S |
| 1 | 1 | female | C |
| 2 | 3 | female | S |
| 3 | 1 | female | S |
| 4 | 3 | male | S |


In [60]:
pipe1 = make_pipeline(cat_var_preprocess_1, logReg)
pipe2 = make_pipeline(cat_var_preprocess_2, logReg)
pipe3 = make_pipeline(cat_var_preprocess_3, logReg)

In [61]:
avg_accur1 = cross_val_score(pipe1, X, y, cv=5, scoring='accuracy').mean()
print(f'Average accuracy using `pipe1`: {avg_accur1: 0.5f}\n')
avg_accur2 = cross_val_score(pipe2, X, y, cv=5, scoring='accuracy').mean()
print(f'Average accuracy using `pipe2`: {avg_accur2: 0.5f}\n')
avg_accur3 = cross_val_score(pipe3, X, y, cv=5, scoring='accuracy').mean()
print(f'Average accuracy using `pipe3`: {avg_accur3: 0.5f}')

Average accuracy using `pipe1`:  0.77279

Average accuracy using `pipe2`:  0.77279

Average accuracy using `pipe3`:  0.77279


We thus find that the average accuracy of our logistic regression model increases, from about 0.678 to about 0.773, when the `Sex` and `Embarked` features are included on top of the existing feature, `Pclass`.

__NOTE:__ All the pipelines have the same average accuracy! This illustrates that, at least for this data set and model:
* Using `make_column_selector()` inside `make_column_transformer()` is optional; listing the column names gives the same result.
* Setting the `drop` argument of `OneHotEncoder` inside a pipeline is also optional; dropping the redundant dummy columns did not change the accuracy here.

In [62]:
################################################
# Quick generation of some new data through
# random selection of a few rows from the train
# data
################################################

X_new = df.sample(n=10, random_state=101).drop(columns=['Survived'])
X_new.head()

|     | Pclass | Sex | Embarked |
|-----|--------|-----|----------|
| 511 | 3 | male | S |
| 613 | 3 | male | Q |
| 615 | 2 | female | S |
| 337 | 1 | female | C |
| 718 | 3 | male | Q |


In [63]:
# Fit pipeline on train data

pipe1.fit(X=X, y=y)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('onehotencoder',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['Sex', 'Embarked'])])),
                ('logisticregression', LogisticRegression())])

In [64]:
# Make predictions on new data with pipeline

pipe1.predict(X=X_new)

array([0, 0, 1, 1, 0, 0, 0, 0, 0, 1])
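The overview lists the choice of Scikit-Learn over Pandas for preprocessing as one of the goals. One commonly cited reason can be sketched on hypothetical toy data: `pd.get_dummies` derives its columns from whatever categories happen to be present in each dataframe, whereas a fitted `OneHotEncoder` reuses the categories it learned during `fit`. The frames below are invented for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: the new frame contains only one of the three categories
train_toy = pd.DataFrame({"Embarked": ["C", "Q", "S"]})
new_toy = pd.DataFrame({"Embarked": ["S", "S"]})

# pandas: the number of dummy columns depends on the data passed in
dummies_train = pd.get_dummies(train_toy)
dummies_new = pd.get_dummies(new_toy)
print(dummies_train.shape, dummies_new.shape)

# scikit-learn: the categories learned in `fit` are reused in `transform`
ohe = OneHotEncoder(handle_unknown="ignore").fit(train_toy)
encoded_new = ohe.transform(new_toy).toarray()
print(encoded_new.shape)
```

Because the encoder inside `pipe1` is fitted together with the model, new data passed to `pipe1.predict()` always receives the same column layout that the model was trained on.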