# Feature Engine - Unit 03 - Handle Categorical Variable Encoding

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Learn how to Handle Categorical Variable Encoding, using One Hot Encoder, Ordinal Encoder and Rare Label Encoder



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Package for Learning

Load our typical packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
from sklearn.pipeline import Pipeline

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Handle Categorical Variable Encoding

A categorical encoder replaces variable labels with a calculated or arbitrary number. We will study:
* One Hot Encoder
* Ordinal Encoder
* Rare Label Encoder

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">  One Hot Encoder

This technique replaces the categorical variable with a set of binary variables (each taking the value 0 or 1), where each new binary variable corresponds to one label of the categorical variable. The function is called `OneHotEncoder()` and its documentation is found [here](https://feature-engine.readthedocs.io/en/1.1.x/encoding/OneHotEncoder.html)
* For example, imagine if our variable is `Color`, and has 3 labels: Yellow, Blue and Green
* When you One Hot Encode (OHE) `Color`, it is replaced by 3 binary variables `Color_Yellow`, `Color_Blue` and `Color_Green`
* Imagine if a given row of Color is Yellow. Once One Hot Encoded, this row will be transformed to  Color_Yellow = 1, Color_Blue = 0 and Color_Green = 0.
* This leads to the concept of a redundant feature. You may think for a moment: do I need 3 binary variables to represent the variable `Color`? 
  * The answer is no. If you have 2 binary variables for Color, say Color_Yellow and Color_Blue, you can still represent all possibilities since: 
    * Color_Yellow = 1 and Color_Blue = 0, means yellow
    * Color_Yellow = 0  and Color_Blue = 1, means blue
    * Color_Yellow = 0  and Color_Blue = 0, means green
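The redundancy argument above can be checked with a small pandas-only sketch. The `Color` data here is invented for illustration, and `pd.get_dummies` stands in for the encoder purely to build the binary columns; it is not the feature-engine transformer used in this unit.

```python
import pandas as pd

# Toy data, invented for illustration only.
colors = pd.DataFrame({"Color": ["Yellow", "Blue", "Green", "Yellow"]})

# Full one hot encoding: one binary column per label.
full = pd.get_dummies(colors["Color"], prefix="Color").astype(int)

# Keep only two of the three binary columns.
reduced = full[["Color_Yellow", "Color_Blue"]]

# A row is Green exactly when both remaining columns are 0,
# so the third column can be rebuilt from the other two.
recovered_green = 1 - reduced.sum(axis=1)

print(full)
print((recovered_green == full["Color_Green"]).all())
```

Since the third column can always be rebuilt from the other two, it adds no extra information.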

from feature_engine.encoding import OneHotEncoder

Let's consider only categorical variables from the penguin dataset

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv"
df = pd.read_csv(url).filter(['species', 'island', 'sex'])
# df = sns.load_dataset('penguins').filter(['species', 'island', 'sex'])
df.head()

Let's create the pipeline with 2 steps (handle missing data and categorical encoding), and then use `.fit_transform()`
* Note: we can't encode a categorical variable that has missing data. For the exercise, we drop the missing data using the transformer from the previous unit (DropMissingData)
* To `OneHotEncoder()` we pass the list of variables that we want to one hot encode.

from feature_engine.imputation import DropMissingData
pipeline = Pipeline([
      ('drop_na', DropMissingData() ),
      ('ohe', OneHotEncoder(variables=['species', 'island', 'sex']) )
])


df = pipeline.fit_transform(df)
df

But what about the redundant feature?
* You just have to pass `drop_last=True` to `OneHotEncoder()`
* But first we reload the dataset

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv"
df = pd.read_csv(url).filter(['species', 'island', 'sex'])
# df = sns.load_dataset('penguins').filter(['species', 'island', 'sex'])
df.head()

Then set the same pipeline, but now adding `drop_last=True`. Compare with the previous transformation and check which binary variables were removed
* Note there are now only 2 binary variables for species and 2 for island, and only one binary variable for sex. This reduced set of variables carries the same amount of information as the previous OHE transformation.
* You probably noticed that this transformation can generate a lot of new columns. That increases the feature space and may increase the chance of overfitting your model. To manage that, you may add, when possible, a feature selection step to your pipeline to keep the most relevant features in your dataset. Don't worry, this topic will be covered in the next lesson.

pipeline = Pipeline([
      ( 'drop_na', DropMissingData() ),
      ('ohe', OneHotEncoder(variables=['species', 'island', 'sex'], drop_last=True) )
])


df = pipeline.fit_transform(df)
df
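To gauge how many columns one hot encoding would create before running it, one option is to count the unique labels per variable. The sketch below uses a small invented DataFrame and plain pandas; the variable names `columns_full_ohe` and `columns_drop_last` are hypothetical helpers, not part of feature-engine.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "sex": ["MALE", "FEMALE", "FEMALE", "MALE"],
})

# Number of distinct labels in each categorical variable.
labels_per_variable = toy.nunique()

# Full OHE creates one column per label; dropping one label
# per variable creates one column fewer for each variable.
columns_full_ohe = int(labels_per_variable.sum())
columns_drop_last = int((labels_per_variable - 1).sum())

print(labels_per_variable)
print(columns_full_ohe, columns_drop_last)
```

A variable with many distinct labels dominates this count, which is one reason to consider grouping infrequent labels before one hot encoding.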

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Ordinal Encoder

It replaces categories with ordinal numbers, like 0, 1, 2, 3 etc.  
* The numbers can be assigned arbitrarily, for example in the order in which the labels are first seen.
* You can pass a list of variables to encode, or it will encode all categorical variables.

The function is `OrdinalEncoder()` and its documentation is found [here](https://feature-engine.readthedocs.io/en/1.1.x/encoding/OrdinalEncoder.html)
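As a rough illustration of the idea, the sketch below assigns integer codes in order of first appearance with `pd.factorize`. This is a pandas-only analogy on invented data, not the feature-engine implementation, and the codes it produces need not match what `OrdinalEncoder()` assigns.

```python
import pandas as pd

# Toy data, invented for illustration only.
species = pd.Series(["Adelie", "Gentoo", "Adelie", "Chinstrap"])

# factorize returns one integer code per row and the unique labels,
# numbered in the order in which they first appear.
codes, uniques = pd.factorize(species)

# Build a label -> number dictionary from the unique labels.
mapping = {label: number for number, label in enumerate(uniques)}

print(codes)
print(mapping)
```

Note that the numbers carry no meaning beyond identifying the label; they do not imply that one category is "greater" than another.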

from feature_engine.encoding import OrdinalEncoder

Let's consider categorical variables from the penguin dataset

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv"
df = pd.read_csv(url).filter(['species', 'island', 'sex'])
# df = sns.load_dataset('penguins').filter(['species', 'island', 'sex'])
df.head()

Let's create the pipeline with 2 steps (handle missing data and ordinal encoding), and then use `.fit_transform()`
* We will not pass the variables argument to `OrdinalEncoder()`, which means it will encode all categorical variables. We set `encoding_method='arbitrary'`

from feature_engine.imputation import DropMissingData
pipeline = Pipeline([
      ( 'drop_na', DropMissingData() ),
      ('ordinal_encoder', OrdinalEncoder(encoding_method='arbitrary') )
])

df = pipeline.fit_transform(df)
df

Let's check the frequencies and label names.
* We use a for loop on the DataFrame columns and print the variable name and the value counts for that variable
* Note the labels were replaced by numbers. For example, MALE and FEMALE were replaced by 0 and 1

for col in df.columns.to_list():
  print(f"{col} \n{df[col].value_counts()} \n\n")

Let's check the encoder dictionary, to see how the transformer mapped the labels to numbers.

pipeline['ordinal_encoder'].encoder_dict_
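A dictionary of this kind can also be reversed to turn numbers back into labels. The sketch below is pandas-only and assumes a hand-written mapping for `species`; the exact numbers in your own `encoder_dict_` may differ, so treat the values as an example.

```python
import pandas as pd

# Example mapping written by hand; your encoder dictionary may differ.
species_mapping = {"Adelie": 0, "Chinstrap": 1, "Gentoo": 2}

# Swap keys and values to obtain a number -> label dictionary.
inverse_mapping = {code: label for label, code in species_mapping.items()}

# Toy encoded column, invented for illustration only.
encoded = pd.Series([0, 2, 1, 0], name="species")

# Replace each number by its original label.
decoded = encoded.map(inverse_mapping)

print(decoded)
```

This is handy when you need to report results using the original label names rather than the encoded numbers.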

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Rare Label Encoder

This encoder groups infrequent categories into a new category called 'Rare' (or another name you define)
* For example, if your variable is Fruit, and the labels banana, grape and apple each account for less than 6% of the rows, all these labels will be replaced by 'Rare'. That helps to decrease the chance of a model overfitting.
* The function is `RareLabelEncoder()` and its documentation is found [here](https://feature-engine.readthedocs.io/en/1.1.x/encoding/RareLabelEncoder.html). The arguments are:
  * tol, which is the tolerance, or the minimum frequency a label should have to be considered frequent. Categories with frequencies lower than tol will be replaced by 'Rare'.
  * n_categories: The minimum number of categories a variable should have for the encoder to find frequent labels. If the variable contains fewer categories, all of them will be considered frequent.
  * variables: list of variables that you would like to apply this transformation to. If you don't pass anything, it will select all categorical variables
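The grouping rule described above can be sketched in plain pandas. The helper `group_rare_labels` is hypothetical, written only to illustrate the role of `tol`; it ignores `n_categories` and is not how feature-engine implements `RareLabelEncoder()`. The Fruit data is invented to match the 6% example.

```python
import pandas as pd

def group_rare_labels(series, tol=0.05, rare_label="Rare"):
    """Replace labels whose relative frequency is below tol (hypothetical helper)."""
    # Relative frequency of each label, between 0 and 1.
    frequencies = series.value_counts(normalize=True)
    # Labels that meet the tolerance are kept as they are.
    frequent_labels = frequencies[frequencies >= tol].index
    # Everything else is replaced by the rare label.
    return series.where(series.isin(frequent_labels), rare_label)

# Toy data: banana, grape and apple are each below 6% of the rows.
fruit = pd.Series(
    ["orange"] * 50 + ["pear"] * 44 + ["banana"] * 3 + ["grape"] * 2 + ["apple"] * 1
)

grouped = group_rare_labels(fruit, tol=0.06)
print(grouped.value_counts(normalize=True))
```

The three infrequent labels end up in a single 'Rare' group that covers 6% of the rows, so the variable goes from 5 labels to 3.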

from feature_engine.encoding import RareLabelEncoder

Let's consider a few variables from the Titanic dataset. It holds passenger records from the Titanic's final voyage
* Note we are converting the variables to 'object' with `.astype()` since they are stored as numbers even though they represent categories.

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
df = pd.read_csv(url).filter(['parch', 'sibsp']).astype('object')
# df = sns.load_dataset('titanic').filter(['parch', 'sibsp']).astype('object')
print(df.shape)
df.head()

Let's assess missing data levels

df.isnull().sum()

Now let's check the label frequencies for each variable
* We loop over each variable and count its label frequencies using `.value_counts(normalize=True)`
* We note that there are some labels which are infrequent, like 6 for parch

for col in df.columns.to_list():
  print(f"{col} \n{df[col].value_counts(normalize=True)} \n\n")

Let's create the pipeline with 3 steps (drop missing data and two rare label encoders), and then use `.fit_transform()`. We show here the use case where we apply rare label encoding more than once, with a different tolerance per variable
* The first RareLabelEncoder deals with parch and sets the tolerance to 10% (this is an arbitrary number used to explain the concept). In the end, any parch label that is less frequent than 10% will be replaced by 'Rare'
* The second RareLabelEncoder deals with sibsp and sets the tolerance to 8% (again, an arbitrary number to illustrate the concept). In the end, any sibsp label that is less frequent than 8% will be replaced by 'Rare'
* Note: you can also apply this technique to a set of variables in one step. We created the example with single variables and different tolerances to illustrate the concept. In the workplace, the tol level will be selected based on the business context.
* We set ``n_categories=2``, a low threshold, so that the encoder searches for rare labels in both variables (each has 7 distinct labels).

from feature_engine.imputation import DropMissingData
pipeline = Pipeline([
      ( 'drop_na', DropMissingData() ),
      ('rle_parch', RareLabelEncoder(tol=0.1,
                                     n_categories=2,
                                     variables=['parch']) ), 
      ('rle_sibsp', RareLabelEncoder(tol=0.08,
                                     n_categories=2,
                                     variables=['sibsp']) )
])

df = pipeline.fit_transform(df)
df.head()

Now let's check the label frequencies for each variable again
* Note the labels were grouped into a label called 'Rare' according to the rules defined in the pipeline.

for col in df.columns.to_list():
  print(f"{col} \n{df[col].value_counts(normalize=True)} \n\n")

You may think for a moment: but my variable is still a category, what should I do?
* The answer is to add an Ordinal Encoder or OHE step after the rare label encoder, so your categorical variables can be properly encoded
* Just as an example, let's reload the data and inspect the label frequencies

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
df = pd.read_csv(url).filter(['parch', 'sibsp']).astype('object')
# df = sns.load_dataset('titanic').filter(['parch', 'sibsp']).astype('object')
for col in df.columns.to_list():
  print(f"{col} \n{df[col].value_counts(normalize=True)} \n\n")

In one cell, we will do the following tasks:
* create a pipeline with 4 steps: drop missing data, 2 rare label encoders and ordinal encoder
* then we fit and transform the data
* finally, we loop over the variables to check labels frequencies 

from feature_engine.imputation import DropMissingData
pipeline = Pipeline([
      ( 'drop_na', DropMissingData() ),
      ('rle_parch', RareLabelEncoder(tol=0.1,
                                     n_categories=2,
                                     variables=['parch']) ), 
      ('rle_sibsp', RareLabelEncoder(tol=0.08,
                                     n_categories=2,
                                     variables=['sibsp']) ),
      ('ordinal_encoder', OrdinalEncoder(encoding_method='arbitrary',
                                         variables= ['parch', 'sibsp']) )
])

df = pipeline.fit_transform(df)

for col in df.columns.to_list():
  print(f"{col} \n{df[col].value_counts(normalize=True)} \n\n")



# Importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import Pipeline
from feature_engine.imputation import DropMissingData
from feature_engine.encoding import OneHotEncoder, OrdinalEncoder, RareLabelEncoder

# Setting style for seaborn
sns.set_style('whitegrid')

# Reload the penguins dataset
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv"
df = pd.read_csv(url).filter(['species', 'island', 'sex'])

# Creating a pipeline with OneHotEncoder
pipeline = Pipeline([
    ('drop_na', DropMissingData()),
    ('ohe', OneHotEncoder(variables=['species', 'island', 'sex']))
])

# Fitting and transforming the pipeline
df_ohe = pipeline.fit_transform(df)
df_ohe

# Creating a pipeline with OneHotEncoder and drop_last=True
pipeline_drop_last = Pipeline([
    ('drop_na', DropMissingData()),
    ('ohe', OneHotEncoder(variables=['species', 'island', 'sex'], drop_last=True))
])

# Fitting and transforming the pipeline with drop_last=True
df_ohe_drop_last = pipeline_drop_last.fit_transform(df)
df_ohe_drop_last

# Reload the penguins dataset
df = pd.read_csv(url).filter(['species', 'island', 'sex'])

# Creating a pipeline with OrdinalEncoder
pipeline_ordinal = Pipeline([
    ('drop_na', DropMissingData()),
    ('ordinal_encoder', OrdinalEncoder(encoding_method='arbitrary'))
])

# Fitting and transforming the pipeline
df_ordinal = pipeline_ordinal.fit_transform(df)
df_ordinal

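|     | species | island | sex |
|-----|---------|--------|-----|
| 0   | 0       | 0      | 0   |
| 1   | 0       | 0      | 1   |
| 2   | 0       | 0      | 1   |
| 4   | 0       | 0      | 1   |
| 5   | 0       | 0      | 0   |
| ... | ...     | ...    | ... |
| 338 | 2       | 1      | 1   |
| 340 | 2       | 1      | 1   |
| 341 | 2       | 1      | 0   |
| 342 | 2       | 1      | 1   |
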

# Checking the encoder dictionary for ordinal encoder
print(pipeline_ordinal['ordinal_encoder'].encoder_dict_)

# Reload the titanic dataset
url_titanic = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
df_titanic = pd.read_csv(url_titanic).filter(['parch', 'sibsp']).astype('object')

# Checking missing levels
print(df_titanic.isnull().sum())

# Checking label frequencies for each variable
for col in df_titanic.columns.to_list():
    print(f"{col} \n{df_titanic[col].value_counts(normalize=True)} \n\n")

{'species': {'Adelie': 0, 'Chinstrap': 1, 'Gentoo': 2}, 'island': {'Torgersen': 0, 'Biscoe': 1, 'Dream': 2}, 'sex': {'MALE': 0, 'FEMALE': 1}}
parch    0
sibsp    0
dtype: int64
parch 
0    0.760943
1    0.132435
2    0.089787
5    0.005612
3    0.005612
4    0.004489
6    0.001122
Name: parch, dtype: float64 


sibsp 
0    0.682379
1    0.234568
2    0.031425
4    0.020202
3    0.017957
8    0.007856
5    0.005612
Name: sibsp, dtype: float64 




# Creating a pipeline with RareLabelEncoder for parch and sibsp
pipeline_rare_label = Pipeline([
    ('drop_na', DropMissingData()),
    ('rle_parch', RareLabelEncoder(tol=0.1, n_categories=2, variables=['parch'])),
    ('rle_sibsp', RareLabelEncoder(tol=0.08, n_categories=2, variables=['sibsp']))
])

# Fitting and transforming the pipeline
df_rle = pipeline_rare_label.fit_transform(df_titanic)
df_rle

# Checking label frequencies for each variable after RareLabelEncoder
for col in df_rle.columns.to_list():
    print(f"{col} \n{df_rle[col].value_counts(normalize=True)} \n\n")

parch 
0       0.760943
1       0.132435
Rare    0.106622
Name: parch, dtype: float64 


sibsp 
0       0.682379
1       0.234568
Rare    0.083053
Name: sibsp, dtype: float64 




# Creating a pipeline with RareLabelEncoder and OrdinalEncoder for parch and sibsp
pipeline_rare_label_ordinal = Pipeline([
    ('drop_na', DropMissingData()),
    ('rle_parch', RareLabelEncoder(tol=0.1, n_categories=2, variables=['parch'])),
    ('rle_sibsp', RareLabelEncoder(tol=0.08, n_categories=2, variables=['sibsp'])),
    ('ordinal_encoder', OrdinalEncoder(encoding_method='arbitrary', variables=['parch', 'sibsp']))
])

# Fitting and transforming the pipeline
df_rle_ordinal = pipeline_rare_label_ordinal.fit_transform(df_titanic)

# Checking label frequencies for each variable after RareLabelEncoder and OrdinalEncoder
for col in df_rle_ordinal.columns.to_list():
    print(f"{col} \n{df_rle_ordinal[col].value_counts(normalize=True)} \n\n")

parch 
0    0.760943
1    0.132435
2    0.106622
Name: parch, dtype: float64 


sibsp 
1    0.682379
0    0.234568
2    0.083053
Name: sibsp, dtype: float64 


