# OneHotEncoder
Performs One Hot Encoding.

The encoder lets you choose how many categories per variable to encode into binary variables. When top_categories is set to None, every category is transformed into a binary variable.

However, when top_categories is set to an integer, for example 10, only the 10 most frequent categories are transformed into binary variables, and the remaining categories get no binary variable.

The encoder can also create binary variables for all categories (drop_last = False), or omit the binary variable for the last category (drop_last = True), which is useful for linear models.

Finally, the encoder can drop the second dummy variable for binary variables. That is, if a categorical variable has 2 unique values, for example colour = ['black', 'white'], setting drop_last_binary=True creates only 1 binary variable for it, for example colour_black.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from feature_engine.encoding import OneHotEncoder

In [2]:
# Load the Titanic dataset from a local csv file
# (the OpenML URL is kept below as a commented-out alternative)

def load_titanic(filepath='titanic.csv'):
    # data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = pd.read_csv(filepath)
    data = data.replace('?', np.nan)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    data['age'] = data['age'].astype('float').fillna(data.age.median())
    data['fare'] = data['fare'].astype('float').fillna(data.fare.median())
    # assign the result back: in-place fillna on a selected column is unreliable in recent pandas
    data['embarked'] = data['embarked'].fillna('C')
    # data.drop(labels=['boat', 'body', 'home.dest', 'name', 'ticket'], axis=1, inplace=True)
    return data

In [3]:
# data = load_titanic("../data/titanic.csv")
data = load_titanic("../data/titanic-2/Titanic-Dataset.csv")
data.head()

Unnamed: 0,passengerid,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,n,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,n,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,n,S


In [4]:
X = data.drop(['survived', 'name', 'ticket'], axis=1)
y = data.survived

In [5]:
# we will encode the variables below; they have no missing values
X[['cabin', 'pclass', 'embarked']].isnull().sum()

cabin       0
pclass      0
embarked    0
dtype: int64

In [6]:
''' Make sure that the variables are of type object.
If not, cast them as object; otherwise the transformer will either raise an error
(if we pass them in the variables argument) or not pick them up (if we leave variables=None). '''

X[['cabin', 'pclass', 'embarked']].dtypes

cabin       object
pclass      object
embarked    object
dtype: object

In [7]:
# let's separate into training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

X_train.shape, X_test.shape

((623, 9), (268, 9))

One hot encoding consists of replacing a categorical variable with a
combination of binary variables that take the value 0 or 1 to indicate whether
a certain category is present in an observation.

Each of these binary variables is also known as a dummy variable. For
example, from the categorical variable "Gender" with categories 'female'
and 'male', we can generate the boolean variable "female", which takes 1
if the person is female and 0 otherwise. We can also generate the variable
"male", which takes 1 if the person is male and 0 otherwise.
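
The "Gender" example can be sketched with plain pandas comparisons. This is only an illustration on made-up data (the values in the toy DataFrame are assumptions), not the encoder's internal implementation.

```python
import pandas as pd

# Toy data; the values are illustrative only.
df = pd.DataFrame({"Gender": ["female", "male", "male", "female"]})

# One dummy per category: 1 if the observation shows the category, 0 otherwise.
df["female"] = (df["Gender"] == "female").astype(int)
df["male"] = (df["Gender"] == "male").astype(int)

print(df)
```

Each observation shows exactly one category, so the two dummies always add up to 1.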

The encoder can generate one dummy variable per category, or create dummy
variables only for the top n most frequent categories, that is, the categories
shown by the largest number of observations.
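
A minimal pandas sketch of the "top n" idea follows. The helper top_n_dummies and the toy colour data are hypothetical names introduced for illustration; the encoder may resolve ties between equally frequent categories differently.

```python
import pandas as pd

# Toy data: "red" appears 3 times, "blue" twice, the rest once.
colour = pd.Series(
    ["red", "blue", "red", "green", "red", "blue", "black"], name="colour"
)

def top_n_dummies(series, n):
    """Create dummies only for the n most frequent categories of `series`."""
    # value_counts sorts categories from most to least frequent
    top = series.value_counts().index[:n].tolist()
    return pd.DataFrame(
        {f"{series.name}_{cat}": (series == cat).astype(int) for cat in top},
        index=series.index,
    )

dummies = top_n_dummies(colour, n=2)
print(dummies)
```

Observations that show a less frequent category ("green" or "black" here) get 0 in every dummy.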

If dummy variables are created for all the categories of a variable, you can
drop one category to avoid redundant information. That is, you encode into
k-1 variables, where k is the number of unique categories.
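
The redundancy can be checked directly: with k dummies, the last one is fully determined by the other k-1. The sketch below uses made-up data and plain pandas, and is not taken from the encoder's code.

```python
import pandas as pd

embarked = pd.Series(["S", "C", "Q", "S", "C"], name="embarked")
categories = ["S", "C", "Q"]  # k = 3

# k dummies, one per category
full = pd.DataFrame(
    {f"embarked_{cat}": (embarked == cat).astype(int) for cat in categories}
)

# k-1 dummies: drop the dummy of the last category
reduced = full.iloc[:, :-1]

# The dropped dummy is 1 exactly when all remaining dummies are 0
recovered = 1 - reduced.sum(axis=1)
print((recovered == full["embarked_Q"]).all())
```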

The encoder encodes only categorical variables (type 'object'). A list
of variables can be passed as an argument. If no variables are passed,
the encoder finds and encodes all categorical variables (object type).
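
To see which columns count as object type, and how to cast a numeric column such as pclass so that it is treated as categorical, a pandas-only sketch could look like this. The toy DataFrame is an assumption, and select_dtypes stands in for whatever selection logic the encoder uses internally.

```python
import pandas as pd

df = pd.DataFrame({
    "pclass": [1, 3, 2],                                 # integers by default
    "cabin": pd.Series(["C", "n", "E"], dtype=object),   # explicit object dtype
    "fare": [71.28, 7.25, 13.0],
})

# Before casting, only cabin is of type object
before = df.select_dtypes(include="object").columns.tolist()

# Cast pclass to object so that it is treated as categorical
df["pclass"] = df["pclass"].astype("O")
after = df.select_dtypes(include="object").columns.tolist()

print(before, after)
```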


#### Note:
New categories in the data to transform, that is, those that did not appear
in the training set, will be ignored (no binary variable will be created for them).
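
The behaviour described in the note can be sketched as a two-step process: learn the categories from the training set, then create dummies only for those categories in any other data set. The encode helper and the toy data are hypothetical, and the sketch is not the encoder's actual code.

```python
import pandas as pd

train = pd.DataFrame({"embarked": ["S", "C", "Q", "S"]})
test = pd.DataFrame({"embarked": ["S", "X", "C"]})  # "X" never appears in train

# "fit": learn the categories from the training set only
learned = train["embarked"].unique().tolist()

def encode(df, variable, categories):
    """Replace `variable` with one dummy per learned category."""
    out = df.drop(columns=[variable])
    for cat in categories:
        out[f"{variable}_{cat}"] = (df[variable] == cat).astype(int)
    return out

# "transform": the unseen category "X" gets no dummy of its own
test_t = encode(test, "embarked", learned)
print(test_t)
```

The observation with the unseen category ends up with 0 in all dummy variables.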


### Create all k dummy variables, top_categories=None

In [8]:
'''
Parameters
----------

top_categories: int, default=None
    If None, a dummy variable will be created for each category of the variable.
    Alternatively, top_categories indicates the number of most frequent categories
    to encode. Dummy variables will be created only for those popular categories
    and the rest will be ignored. Note that this is equivalent to grouping all the
    remaining categories in one group.
    
variables : list
    The list of categorical variables that will be encoded. If None, the  
    encoder will find and select all object type variables.
    
drop_last: boolean, default=False
    Only used if top_categories = None. It indicates whether to create dummy
    variables for all the categories (k dummies) or, if set to True, to
    omit the dummy for the last category (k-1 dummies).
'''

ohe_enc = OneHotEncoder(top_categories=None,
                        variables=['pclass', 'cabin', 'embarked'],
                        drop_last=False)
ohe_enc.fit(X_train)

In [9]:
ohe_enc.encoder_dict_

{'pclass': [1, 3, 2],
 'cabin': ['E', 'D', 'n', 'B', 'C', 'A', 'F', 'G', 'T'],
 'embarked': ['S', 'C', 'Q']}

In [10]:
train_t = ohe_enc.transform(X_train)
test_t = ohe_enc.transform(X_test)

# the rows shown below belong to the training set
train_t.head()

Unnamed: 0,passengerid,sex,age,sibsp,parch,fare,pclass_1,pclass_3,pclass_2,cabin_E,...,cabin_n,cabin_B,cabin_C,cabin_A,cabin_F,cabin_G,cabin_T,embarked_S,embarked_C,embarked_Q
857,858,male,51.0,0,0,26.55,1,0,0,1,...,0,0,0,0,0,0,0,1,0,0
52,53,female,49.0,1,0,76.7292,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
386,387,male,1.0,5,2,46.9,0,1,0,0,...,1,0,0,0,0,0,0,1,0,0
124,125,male,54.0,0,1,77.2875,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
578,579,female,28.0,1,0,14.4583,0,1,0,0,...,1,0,0,0,0,0,0,0,1,0


### Selecting top_categories to encode

In [11]:
ohe_enc = OneHotEncoder(top_categories=2,
                        variables=['pclass', 'cabin', 'embarked'],
                        drop_last=False)
ohe_enc.fit(X_train)

ohe_enc.encoder_dict_

{'pclass': [3, 1], 'cabin': ['n', 'C'], 'embarked': ['S', 'C']}

In [12]:
train_t = ohe_enc.transform(X_train)
test_t = ohe_enc.transform(X_test)

# the rows shown below belong to the training set
train_t.head()

Unnamed: 0,passengerid,sex,age,sibsp,parch,fare,pclass_3,pclass_1,cabin_n,cabin_C,embarked_S,embarked_C
857,858,male,51.0,0,0,26.55,0,1,0,0,1,0
52,53,female,49.0,1,0,76.7292,0,1,0,0,0,1
386,387,male,1.0,5,2,46.9,1,0,1,0,1,0
124,125,male,54.0,0,1,77.2875,0,1,0,0,1,0
578,579,female,28.0,1,0,14.4583,1,0,1,0,0,1


### Dropping the last category for linear models

In [13]:
ohe_enc = OneHotEncoder(top_categories=None,
                        variables=['pclass', 'cabin', 'embarked'],
                        drop_last=True)

ohe_enc.fit(X_train)

ohe_enc.encoder_dict_

{'pclass': [1, 3],
 'cabin': ['E', 'D', 'n', 'B', 'C', 'A', 'F', 'G'],
 'embarked': ['S', 'C']}

In [14]:
train_t = ohe_enc.transform(X_train)
test_t = ohe_enc.transform(X_test)

# the rows shown below belong to the training set
train_t.head()

Unnamed: 0,passengerid,sex,age,sibsp,parch,fare,pclass_1,pclass_3,cabin_E,cabin_D,cabin_n,cabin_B,cabin_C,cabin_A,cabin_F,cabin_G,embarked_S,embarked_C
857,858,male,51.0,0,0,26.55,1,0,1,0,0,0,0,0,0,0,1,0
52,53,female,49.0,1,0,76.7292,1,0,0,1,0,0,0,0,0,0,0,1
386,387,male,1.0,5,2,46.9,0,1,0,0,1,0,0,0,0,0,1,0
124,125,male,54.0,0,1,77.2875,1,0,0,1,0,0,0,0,0,0,1,0
578,579,female,28.0,1,0,14.4583,0,1,0,0,1,0,0,0,0,0,0,1
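
Why dropping one dummy matters for linear models can be illustrated with NumPy: together with an intercept column, the full set of k dummies is linearly dependent, while k-1 dummies are not. The small matrix below is made up for illustration.

```python
import numpy as np

# Five observations of a variable with k = 3 categories (illustrative values)
k_dummies = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
])
intercept = np.ones((k_dummies.shape[0], 1))

# Intercept + k dummies: 4 columns, but the dummies sum to the intercept
X_full = np.hstack([intercept, k_dummies])
# Intercept + (k-1) dummies: 3 columns
X_reduced = np.hstack([intercept, k_dummies[:, :-1]])

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)
print(X_full.shape[1], rank_full)        # more columns than rank
print(X_reduced.shape[1], rank_reduced)  # full column rank
```

With a rank-deficient design matrix, the coefficients of an unregularised linear model are not uniquely determined, which is the usual motivation for encoding into k-1 dummies.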


### Automatically select categorical variables

The encoder selects all the categorical variables automatically if None is passed to the variables argument when setting up the encoder.

In [15]:
ohe_enc = OneHotEncoder(top_categories=None,
                        drop_last=True)

ohe_enc.fit(X_train)

In [16]:
# the parameter variables is None
ohe_enc.variables

In [17]:
# but the attribute variables_ has the categorical variables 
# that will be encoded

ohe_enc.variables_

['pclass', 'sex', 'cabin', 'embarked']

In [18]:
# and we can also find which variables from those
# are binary

ohe_enc.variables_binary_

['sex']

In [19]:
train_t = ohe_enc.transform(X_train)
test_t = ohe_enc.transform(X_test)

# the rows shown below belong to the training set
train_t.head()

Unnamed: 0,passengerid,age,sibsp,parch,fare,pclass_1,pclass_3,sex_male,cabin_E,cabin_D,cabin_n,cabin_B,cabin_C,cabin_A,cabin_F,cabin_G,embarked_S,embarked_C
857,858,51.0,0,0,26.55,1,0,1,1,0,0,0,0,0,0,0,1,0
52,53,49.0,1,0,76.7292,1,0,0,0,1,0,0,0,0,0,0,0,1
386,387,1.0,5,2,46.9,0,1,1,0,0,1,0,0,0,0,0,1,0
124,125,54.0,0,1,77.2875,1,0,1,0,1,0,0,0,0,0,0,1,0
578,579,28.0,1,0,14.4583,0,1,0,0,0,1,0,0,0,0,0,0,1


### Automatically create 1 dummy from binary variables (sex)

We can encode categorical variables that have more than 2 categories into k dummies and, at the same time, encode categorical variables that have only 2 categories into 1 dummy. For a binary variable, the second dummy is completely redundant.

We do so as follows:

In [20]:
ohe_enc = OneHotEncoder(top_categories=None,
                        drop_last=False,
                        drop_last_binary=True,
                        )

ohe_enc.fit(X_train)

In [21]:
# the encoder dictionary
ohe_enc.encoder_dict_

{'pclass': [1, 3, 2],
 'sex': ['male'],
 'cabin': ['E', 'D', 'n', 'B', 'C', 'A', 'F', 'G', 'T'],
 'embarked': ['S', 'C', 'Q']}

In [22]:
# and we can also find which variables from those
# are binary

ohe_enc.variables_binary_

['sex']
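
How binary variables could be detected and encoded into a single dummy can be sketched with pandas as follows. The toy data is made up, and keeping the first category seen for binary variables is an assumption of this sketch rather than a documented rule of the encoder.

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["male", "female", "male", "male"],
    "embarked": ["S", "C", "Q", "S"],
})

# A variable is binary if it shows exactly 2 unique categories
binary = [col for col in df.columns if df[col].nunique() == 2]

encoded = pd.DataFrame(index=df.index)
for col in df.columns:
    cats = df[col].unique().tolist()
    if col in binary:
        cats = cats[:1]  # assumption: keep only the first category seen
    for cat in cats:
        encoded[f"{col}_{cat}"] = (df[col] == cat).astype(int)

print(binary)
print(encoded.columns.tolist())
```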

In [23]:
train_t = ohe_enc.transform(X_train)
test_t = ohe_enc.transform(X_test)

# the rows shown below belong to the training set
train_t.head()

Unnamed: 0,passengerid,age,sibsp,parch,fare,pclass_1,pclass_3,pclass_2,sex_male,cabin_E,...,cabin_n,cabin_B,cabin_C,cabin_A,cabin_F,cabin_G,cabin_T,embarked_S,embarked_C,embarked_Q
857,858,51.0,0,0,26.55,1,0,0,1,1,...,0,0,0,0,0,0,0,1,0,0
52,53,49.0,1,0,76.7292,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
386,387,1.0,5,2,46.9,0,1,0,1,0,...,1,0,0,0,0,0,0,1,0,0
124,125,54.0,0,1,77.2875,1,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
578,579,28.0,1,0,14.4583,0,1,0,0,0,...,1,0,0,0,0,0,0,0,1,0
