# One Hot Encoding

In [None]:
# Import pandas library and disable warnings
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter("ignore")
# Import train_test_split to separate train and test set
from sklearn.model_selection import train_test_split
# Import OneHotEncoder for one hot encoding with sklearn
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

### Encoding - "Usable and Useful ML Product"

Just like imputation, all methods of categorical encoding should be performed over the training set, and then propagated to the test set. 

Why? 

Because these methods will "learn" patterns from the train data, and therefore you want to avoid leaking information and overfitting. But more importantly, because we don't know whether future / live data will contain all the categories present in the train data, or whether there will be more or fewer categories. Therefore, we want to anticipate this uncertainty by setting up the right processes from the start. We want to create transformers that learn the categories from the train set, and use those learned categories to create the dummy variables in both train and test sets.

In this notebook we'll be using the Titanic dataset.

In [None]:
# Load Titanic dataset using columns 'Survived','Sex','Embarked','Cabin'
data = pd.read_csv('Data/titanic_data.csv', usecols = ['Survived','Sex','Embarked','Cabin'])
data.head()

In [None]:
# Use str[] to capture only first letter of Cabin
data['Cabin'] = data['Cabin'].str[0]
data.head()

In [None]:
# Separate the DataFrame into training and testing set
X_train, X_test, y_train, y_test = train_test_split(
    data[['Sex', 'Embarked','Cabin']],  
    data['Survived'],  
    test_size=0.3,  
    random_state=0)

X_train.shape, X_test.shape

### Let's explore the cardinality

In [None]:
# Print unique values of columns in data
for col in list(data):
    print(col)
    print(data[col].unique())


# 1. One-Hot Encoding with Pandas

**Advantages:**
- quick
- returns Pandas DataFrame 
- returns feature names for the dummy variables
- accepts missing values

**Limitations:**
- it does not preserve information from train data to propagate to test data
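
To make the "accepts missing values" point concrete, here is a minimal sketch on a hypothetical toy Series (not the Titanic data). By default `get_dummies()` encodes a missing value as a row of zeros; the optional `dummy_na=True` argument adds a separate indicator column for it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value
embarked = pd.Series(["S", np.nan, "C"], name="Embarked")

# Default behaviour: the missing value becomes a row of zeros
default_dummies = pd.get_dummies(embarked, dtype=int)

# dummy_na=True adds one extra indicator column for missing values
with_indicator = pd.get_dummies(embarked, dummy_na=True, dtype=int)

print(default_dummies)
print(with_indicator)
```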

-----

The pandas method `get_dummies()`, will create as many binary variables as categories in the variable:

If the variable colour has 3 categories in the train data, it will create 3 dummy variables. However, if the variable colour has 5 categories in the test data, it will create 5 binary variables. The train and test sets will therefore end up with a different number of features and will be incompatible with training and scoring using Scikit-learn.

In practice, we shouldn't be using `get_dummies()` in our machine learning pipelines. It is, however, useful for quick data exploration. Let's look at this with examples.
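
The colour scenario described above can be reproduced with a small sketch. The two toy DataFrames are hypothetical and only illustrate how the number of dummy columns follows the categories present in each set.

```python
import pandas as pd

# Hypothetical toy data: 3 colours in train, 5 colours in test
train = pd.DataFrame({"colour": ["red", "blue", "green", "red"]})
test = pd.DataFrame({"colour": ["red", "blue", "green", "black", "white"]})

# Each call only sees the categories present in the data it is given
train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)

print(train_dummies.shape)
print(test_dummies.shape)

# Dummy columns that exist in test but were never created for train
extra_columns = sorted(set(test_dummies.columns) - set(train_dummies.columns))
print(extra_columns)
```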

### Encoding into k dummy variables

In [None]:
dummies = pd.get_dummies(X_train['Sex'])
dummies.head()

In [None]:
# Concat the original 'Sex' Series with the created dummy variables to visualize what happened
result = pd.concat([X_train['Sex'], pd.get_dummies(X_train['Sex'])], axis = 1)
result

In [None]:
# TASK 1 >>>> Get dummy variables for column 'Embarked'
#             Concat original 'Embarked' Series and store it in variable result_2

We can get dummy variables for all variables at once.

In [None]:
# Get dummy variable for entire train set
dummy_data = pd.get_dummies(X_train)
dummy_data

In [None]:
# TASK 2 >>>> Get dummy variable for entire test set
dummy = pd.get_dummies(X_test)
dummy

The resulting DataFrames have feature names, which is an advantage of Pandas `get_dummies()`. However, there is an issue: the train set contains more features than the test set. The reason is that the test set does not contain the feature 'Cabin_T', so the train and test sets do not have the same shape.
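
If `get_dummies()` is still used for exploration, one possible stopgap is to reindex the test dummies on the train columns. This is a sketch on hypothetical toy data, not part of the original workflow: columns missing from the test set are added and filled with 0, and columns for categories unseen in train are dropped.

```python
import pandas as pd

# Hypothetical toy data: 'T' only appears in train, 'G' only in test
train = pd.DataFrame({"Cabin": ["A", "B", "T", "B"]})
test = pd.DataFrame({"Cabin": ["A", "B", "G"]})

train_dummies = pd.get_dummies(train, dtype=int)
test_dummies = pd.get_dummies(test, dtype=int)

# Force the test set onto the train columns:
# 'Cabin_T' is added as zeros, 'Cabin_G' is dropped
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_aligned.columns))
print(test_aligned)
```

This keeps the shapes compatible, but the column list has to be stored and reapplied by hand, which is what the Scikit-learn encoder in the next section does for us.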

# 2. One-Hot Encoding with Scikit-learn

### Advantages

- quick
- creates the same number of features in train and test set
- by default, the encoder derives the categories based on the unique values in each feature

### Limitations

- it returns a numpy array instead of a pandas dataframe if we do not specify otherwise
- it does not return the variable names, therefore inconvenient for variable exploration
- it does not accept missing values (Pandas `.get_dummies()` does)

You can find more information about One-Hot Encoder [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)
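
Because the encoder learns its categories during `fit`, a category that first appears at transform time needs a policy. The sketch below uses hypothetical toy data to contrast `handle_unknown='error'` with `handle_unknown='ignore'`. The output is left sparse (the default) and converted with `.toarray()`, so the sketch does not depend on how a given scikit-learn version names the sparse parameter.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: 'G' is never seen during fit
train = pd.DataFrame({"Cabin": ["A", "B", "T"]})
test = pd.DataFrame({"Cabin": ["A", "G"]})

# handle_unknown='error': transform fails on the unseen category
strict = OneHotEncoder(handle_unknown="error").fit(train)
try:
    strict.transform(test)
    raised = False
except ValueError:
    raised = True
print("strict encoder raised:", raised)

# handle_unknown='ignore': the unseen category is encoded as all zeros
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded = lenient.transform(test).toarray()
print(encoded)
```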

In [None]:
# Create the encoder
# Set sparse_output=False to return a dense NumPy array instead of a sparse matrix
# (scikit-learn versions before 1.2 call this parameter `sparse`)
# Set handle_unknown='error' to raise an error if an unknown category is present during transform
encoder = OneHotEncoder(categories='auto', sparse_output=False, handle_unknown='error')

# Fit the encoder, replacing missing values with the string 'Missing'
encoder.fit(X_train.fillna('Missing'))

In [None]:
# We can get categories with the .categories_ attribute
encoder.categories_

In [None]:
# Transform X_train using one-hot encoding
training_set = encoder.transform(X_train.fillna('Missing'))
pd.DataFrame(training_set).head()

As we can see, the feature names are not returned after transforming the data. We can retrieve them using `.get_feature_names_out()` (called `.get_feature_names()` in scikit-learn versions before 1.0), so we'll repeat the entire transformation and attach the names.

In [None]:
# Transform X_train using one-hot encoding and return feature names for output features 
# Convert it to DataFrame
training_set = encoder.transform(X_train.fillna('Missing'))
training_set = pd.DataFrame(training_set)
training_set.columns = encoder.get_feature_names_out()
training_set.head()

In [None]:
# Transform X_test using one-hot encoding and return feature names for output features 
# Convert it to DataFrame
testing_set = encoder.transform(X_test.fillna('Missing'))
testing_set = pd.DataFrame(testing_set)
testing_set.columns = encoder.get_feature_names_out()
testing_set.head()

We can see that the training set and testing set contain the same number of features.
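
The imports at the top of the notebook also include `SimpleImputer`, `Pipeline` and `ColumnTransformer`. One way they could be combined, sketched here on hypothetical toy data rather than the Titanic set, is to chain the 'Missing' imputation and the encoder so that both are fitted on the train set only. The choice of `handle_unknown='ignore'` and the step names are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for X_train / X_test
toy_train = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Cabin": ["C", np.nan, "E", "C"],
})
toy_test = pd.DataFrame({
    "Sex": ["female", "male"],
    "Cabin": [np.nan, "G"],  # 'G' is unseen in toy_train
})

categorical_columns = ["Sex", "Cabin"]

# Step 1 replaces missing values with the string 'Missing',
# step 2 one-hot encodes using the categories learned during fit
categorical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="Missing")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(
    transformers=[("categorical", categorical_pipeline, categorical_columns)],
    sparse_threshold=0,  # always return a dense array
)

# Fit on train only, then reuse the learned categories on test
train_encoded = preprocessor.fit_transform(toy_train)
test_encoded = preprocessor.transform(toy_test)

print(train_encoded.shape, test_encoded.shape)
print(test_encoded)
```

Both outputs have the same number of columns, and the unseen cabin 'G' is encoded as zeros instead of creating a new feature.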