# One Hot Encoding

In [382]:
# Import pandas library and disable warnings
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter("ignore")
# Import train_test_split to separate train and test set
from sklearn.model_selection import train_test_split
# Import OneHotEncoder for one hot encoding with sklearn
from sklearn.preprocessing import OneHotEncoder

### Encoding - "Usable and Useful ML Product"

Just like imputation, all methods of categorical encoding should be fit on the training set and then applied to the test set.

Why? 

Because these methods "learn" patterns from the train data, fitting them on anything else would leak information and encourage overfitting. More importantly, we do not know whether future (live) data will contain all the categories present in the train data, or whether it will contain more or fewer. We want to anticipate this uncertainty by setting up the right process from the start: create transformers that learn the categories from the train set, and use those learned categories to create the dummy variables in both the train and test sets.
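
To make the "learn on train, reuse on test" idea concrete, here is a minimal sketch in plain pandas. The class `SimpleOneHot` and the toy `Embarked` values are hypothetical and exist only for illustration; they are not part of pandas or Scikit-learn.

```python
import pandas as pd


class SimpleOneHot:
    """Toy encoder (hypothetical): learns categories in fit, reuses them in transform."""

    def fit(self, series):
        # Remember the categories seen in the training data only
        self.categories_ = sorted(series.dropna().unique())
        return self

    def transform(self, series):
        # Build one 0/1 column per learned category, whatever the input contains
        return pd.DataFrame(
            {f"{series.name}_{cat}": (series == cat).astype(int) for cat in self.categories_},
            index=series.index,
        )


train = pd.Series(["S", "C", "Q", "S"], name="Embarked")
test = pd.Series(["C", "S", "X"], name="Embarked")  # "X" never appears in train

enc = SimpleOneHot().fit(train)
train_encoded = enc.transform(train)
test_encoded = enc.transform(test)

print(list(train_encoded.columns))
print(list(test_encoded.columns))
```

Both outputs have the same columns because the categories come from `fit`, not from the data being transformed. In this sketch the unseen category "X" simply becomes a row of zeros.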

In this notebook we'll be using the Titanic dataset.

In [383]:
# Load Titanic dataset
data = pd.read_csv('Data/titanic_data.csv', usecols = ['Survived','Sex','Embarked','Cabin'])
data.head()

Unnamed: 0,Survived,Sex,Cabin,Embarked
0,0,male,,S
1,1,female,C85,C
2,1,female,,S
3,1,female,C123,S
4,0,male,,S


In [384]:
# Use str[] to capture only first letter of Cabin
data['Cabin'] = data['Cabin'].str[0]
data.head()

Unnamed: 0,Survived,Sex,Cabin,Embarked
0,0,male,,S
1,1,female,C,C
2,1,female,,S
3,1,female,C,S
4,0,male,,S


In [385]:
# Separate the DataFrame into training and testing set
X_train, X_test, y_train, y_test = train_test_split(
    data[['Sex', 'Embarked','Cabin']],  
    data['Survived'],  
    test_size=0.3,  
    random_state=0)

X_train.shape, X_test.shape

((623, 3), (268, 3))

### Let's explore the cardinality

In [386]:
# Print unique values of columns in data
for col in list(data):
    print(col)
    print(data[col].unique())

Survived
[0 1]
Sex
['male' 'female']
Cabin
[nan 'C' 'E' 'G' 'D' 'A' 'B' 'F' 'T']
Embarked
['S' 'C' 'Q' nan]



# 1. One-Hot Encoding with Pandas

**Advantages:**
- quick
- returns Pandas DataFrame 
- returns feature names for the dummy variables 

**Limitations:**
- it does not preserve information from train data to propagate to test data

-----

The pandas function `get_dummies()` creates as many binary variables as there are categories in the variable.

If the variable colour has 3 categories in the train data, it will create 3 dummy variables. However, if the variable colour has 5 categories in the test data, it will create 5 binary variables. The train and test sets then end up with a different number of features and are incompatible with training and scoring using Scikit-learn.

In practice, we shouldn't use `get_dummies()` in our machine learning pipelines. It is, however, useful for quick data exploration. Let's look at this with examples.
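
The colour scenario can be reproduced with a few lines. The two small DataFrames below are made-up data, used only to show how the column counts diverge.

```python
import pandas as pd

# Made-up data: 3 colours in train, 5 colours in test
train = pd.DataFrame({"colour": ["red", "blue", "green", "red"]})
test = pd.DataFrame({"colour": ["red", "blue", "green", "yellow", "black"]})

# get_dummies only looks at the data it is given
train_dummies = pd.get_dummies(train["colour"])
test_dummies = pd.get_dummies(test["colour"])

print(train_dummies.shape[1])  # number of dummy columns in train
print(test_dummies.shape[1])   # number of dummy columns in test

# Columns that exist in test but not in train
extra_columns = sorted(set(test_dummies.columns) - set(train_dummies.columns))
print(extra_columns)
```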

### Encoding into k dummy variables

In [387]:
dummies = pd.get_dummies(X_train['Sex'])
dummies.head()

Unnamed: 0,female,male
857,0,1
52,1,0
386,0,1
124,0,1
578,1,0


In [388]:
# Concatenate the original 'Sex' Series with the created dummy variables to visualize what happened
result = pd.concat([X_train['Sex'], pd.get_dummies(X_train['Sex'])], axis = 1)
result

Unnamed: 0,Sex,female,male
857,male,0,1
52,female,1,0
386,male,0,1
124,male,0,1
578,female,1,0
...,...,...,...
835,female,1,0
192,female,1,0
629,male,0,1
559,female,1,0


In [389]:
# TASK 1 >>>> Get dummy variables for column 'Embarked'
#             Concatenate them with the original 'Embarked' Series and store the result in variable result_2
result_2 = pd.concat([X_train['Embarked'], pd.get_dummies(X_train['Embarked'])], axis = 1)
result_2

Unnamed: 0,Embarked,C,Q,S
857,S,0,0,1
52,C,1,0,0
386,S,0,0,1
124,S,0,0,1
578,C,1,0,0
...,...,...,...,...
835,C,1,0,0
192,S,0,0,1
629,Q,0,1,0
559,S,0,0,1


In [390]:
# Get dummy variable for entire train set
dummy_data = pd.get_dummies(X_train)
dummy_data

Unnamed: 0,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S,Cabin_A,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T
857,0,1,0,0,1,0,0,0,0,1,0,0,0
52,1,0,1,0,0,0,0,0,1,0,0,0,0
386,0,1,0,0,1,0,0,0,0,0,0,0,0
124,0,1,0,0,1,0,0,0,1,0,0,0,0
578,1,0,1,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
835,1,0,1,0,0,0,0,0,0,1,0,0,0
192,1,0,0,0,1,0,0,0,0,0,0,0,0
629,0,1,0,1,0,0,0,0,0,0,0,0,0
559,1,0,0,0,1,0,0,0,0,0,0,0,0


In [391]:
# TASK 2 >>>> Get dummy variable for entire test set
dummy = pd.get_dummies(X_test)
dummy

Unnamed: 0,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S,Cabin_A,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G
495,0,1,1,0,0,0,0,0,0,0,0,0
648,0,1,0,0,1,0,0,0,0,0,0,0
278,0,1,0,1,0,0,0,0,0,0,0,0
31,1,0,1,0,0,0,1,0,0,0,0,0
255,1,0,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
263,0,1,0,0,1,0,1,0,0,0,0,0
718,0,1,0,1,0,0,0,0,0,0,0,0
620,0,1,1,0,0,0,0,0,0,0,0,0
786,1,0,0,0,1,0,0,0,0,0,0,0


The resulting DataFrames keep the feature names, which is the advantage of pandas `get_dummies()`. However, the train set now contains more features than the test set: 13 columns versus 12. The cause is that the category 'T' of `Cabin` never occurs in the test set, so the column 'Cabin_T' is missing there and the two sets no longer have the same number of columns.
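
If you still want to compare such DataFrames during exploration, one possible workaround (not used in this notebook) is to align the test columns to the train columns with `DataFrame.reindex`. The sketch below uses small made-up `Cabin` values rather than the Titanic data.

```python
import pandas as pd

# Made-up data: category 'T' appears only in train
train = pd.DataFrame({"Cabin": ["A", "B", "T", "B"]})
test = pd.DataFrame({"Cabin": ["A", "B", "A"]})

train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)

# Force the test set to have exactly the train columns;
# columns missing from test (here 'Cabin_T') are filled with 0
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_dummies.columns))
print(list(test_aligned.columns))
```

Note that `reindex` also drops any test column that is absent from the train set, so categories unseen in training are silently discarded.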

# 2. One-Hot Encoding with Scikit-learn

### Advantages

- quick
- creates the same number of features in train and test set
- by default, the encoder derives the categories based on the unique values in each feature

### Limitations

- it returns a NumPy array (or, by default, a sparse matrix) instead of a pandas DataFrame
- the transformed output carries no variable names, which is inconvenient for variable exploration
- it does not accept missing values (pandas `get_dummies()` does)

In [392]:
# Create the Encoder
# Set parameter sparse = False to return a dense NumPy array instead of a sparse matrix
# Set parameter handle_unknown = 'error' to raise an error if an unknown category is present during transform
encoder = OneHotEncoder(categories = 'auto', sparse = False, handle_unknown = 'error')

# Fill in missing values with forward fill (ffill), then fit the Encoder
encoder.fit(X_train.fillna(method = 'ffill'))

OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='error', sparse=False)
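
To see what `handle_unknown = 'error'` means in practice, the sketch below fits an encoder on a made-up `Cabin` column and then transforms data containing a category it has never seen. The `'ignore'` option shown for comparison is not used in this notebook; the sparse result is converted with `.toarray()` so the example does not depend on the name of the sparse parameter.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up data: category 'T' is not present when fitting
train = pd.DataFrame({"Cabin": ["A", "B", "C"]})
test = pd.DataFrame({"Cabin": ["A", "T"]})

# handle_unknown='error': transform fails on the unseen category
strict = OneHotEncoder(handle_unknown="error").fit(train)
try:
    strict.transform(test)
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)

# handle_unknown='ignore': the unseen category becomes a row of zeros
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded = lenient.transform(test).toarray()
print(encoded)
```

Which setting is preferable depends on the use case: `'error'` surfaces unexpected categories immediately, while `'ignore'` keeps the pipeline running at the cost of silently encoding them as all zeros.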

In [393]:
# Get categories with the .categories_ attribute
encoder.categories_

[array(['female', 'male'], dtype=object),
 array(['C', 'Q', 'S'], dtype=object),
 array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'], dtype=object)]

In [394]:
training_set = encoder.transform(X_train.fillna(method= 'ffill'))
pd.DataFrame(training_set).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


As we can see, the feature names are not returned after transforming the data. We can retrieve them using `.get_feature_names()` (in newer Scikit-learn releases this method is named `.get_feature_names_out()`).

In [395]:
training_set = encoder.transform(X_train.fillna(method= 'ffill'))
training_set = pd.DataFrame(training_set)
training_set.columns = encoder.get_feature_names()
training_set.head()

Unnamed: 0,x0_female,x0_male,x1_C,x1_Q,x1_S,x2_A,x2_B,x2_C,x2_D,x2_E,x2_F,x2_G,x2_T
0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
