#### Feature Engineering - One Hot Encoding

What is one hot encoding?
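- It is the conversion of a categorical column/feature into a binary vector
- Let's say we have an attribute called city with 4 possible values: 'A', 'B', 'C' and 'D'. We create four binary columns, one per value, and exactly one of them is 1 in each row:
    - 'A' as 1000
    - 'B' as 0100
    - 'C' as 0010
    - 'D' as 0001
- One of these columns is redundant (if the first three are all 0, the city must be 'D'), so a common variant keeps only three columns and represents 'D' as 000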
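
The city example can be reproduced with pandas. This is a minimal sketch: the `cities` frame is made-up illustration data, and `pd.get_dummies` is used only for brevity (the notebook itself uses scikit-learn's `OneHotEncoder` further down). A full encoding has one column per category; with `drop_first=True` pandas drops the first category's column, so 'A' rather than 'D' becomes the all-zeros category.

```python
import pandas as pd

# Hypothetical toy data: one row per city value
cities = pd.DataFrame({"city": ["A", "B", "C", "D"]})

# Full one-hot encoding: one binary column per category (4 columns)
full = pd.get_dummies(cities["city"], prefix="city").astype(int)

# Reduced variant: drop one redundant column (3 columns); the dropped
# category ('A', the first in sorted order) is encoded as all zeros
reduced = pd.get_dummies(cities["city"], prefix="city", drop_first=True).astype(int)

print(full)
print(reduced)
```
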

Why do we need one hot encoding?
- When a categorical column has an ordinal relationship, such as an age category with the values child, teenager and adult, converting the values to integers ('1' for child, '2' for teenager and '3' for adult) makes sense because the integers preserve the order
- For a feature with no ordinal relationship, like cities, converting the values directly to integers imposes an order and a distance between categories that do not exist, and a model may treat that artificial ordering as meaningful
- One-hot encoding gives each category its own binary column, so the model can learn a separate effect for each category without assuming any order
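
The effect of integer codes on unordered categories can be checked numerically. The sketch below uses three hypothetical cities and compares distances between categories under integer coding and under one-hot coding; the names `int_codes` and `one_hot` are illustrative only.

```python
import numpy as np

# Hypothetical integer codes for three unordered cities
int_codes = {"A": 0, "B": 1, "C": 2}

# One-hot vectors for the same cities (rows of the identity matrix)
one_hot = {name: np.eye(3)[i] for name, i in int_codes.items()}

# With integer codes, 'A' looks twice as far from 'C' as from 'B'
dist_int_ab = abs(int_codes["A"] - int_codes["B"])
dist_int_ac = abs(int_codes["A"] - int_codes["C"])

# With one-hot vectors, every pair of cities is equally far apart
dist_oh_ab = np.linalg.norm(one_hot["A"] - one_hot["B"])
dist_oh_ac = np.linalg.norm(one_hot["A"] - one_hot["C"])

print(dist_int_ab, dist_int_ac)
print(dist_oh_ab, dist_oh_ac)
```

Models that rely on distances or on linear weights are the ones most likely to be affected by the artificial spacing of integer codes.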

In [1]:
import numpy as np 
import pandas as pd

In [2]:
train = pd.read_csv("../Data/Titanic/train.csv")
test = pd.read_csv("../Data/Titanic/test.csv")
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


- Categorical attributes that are not ordinal:
    - Survived, Sex, and Embarked
<br>
- Categorical attributes that are ordinal:
    - Pclass
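
One way to make the ordinal versus non-ordinal distinction explicit in pandas is the categorical dtype with and without `ordered=True`. This is a sketch on a made-up four-row sample, not on the Titanic files; the order 1 < 2 < 3 only describes how the labels sort.

```python
import pandas as pd

# Hypothetical sample of the two kinds of attribute
df = pd.DataFrame({"Pclass": [3, 1, 3, 2], "Embarked": ["S", "C", "S", "Q"]})

# Ordinal: declare an explicit order so comparisons are meaningful
df["Pclass"] = pd.Categorical(df["Pclass"], categories=[1, 2, 3], ordered=True)

# Non-ordinal: categories without any order
df["Embarked"] = pd.Categorical(df["Embarked"])

print(df["Pclass"].min(), df["Pclass"].max())
print(df["Embarked"].cat.ordered)
```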

In [3]:
# Finding the most frequently occurring port
frequent_port = train['Embarked'].dropna().mode()[0]
frequent_port

'S'

In [4]:
# Replacing the missing value with the frequent value
train['Embarked'] = train['Embarked'].fillna(frequent_port)

In [5]:
# Converting to numerical category
embark = {"S": 0, "C": 1, "Q": 2}
data = [train, test]

for dataset in data:
    dataset['Embarked'] = dataset['Embarked'].map(embark).astype(int)
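
`Series.map` with a dict returns NaN for any value that is not a key, and `astype(int)` then fails on that NaN. The sketch below, on a made-up column, shows how to count such values before converting; the fill value 'S' stands in for whatever value is chosen for imputation.

```python
import pandas as pd

embark = {"S": 0, "C": 1, "Q": 2}

# Hypothetical column with one missing value
ports = pd.Series(["S", "C", None, "Q"])

mapped = ports.map(embark)

# Values absent from the dict (including missing ones) come back as NaN
n_unmapped = int(mapped.isna().sum())
print(n_unmapped)

# Fill first, then the integer conversion is safe
filled = ports.fillna("S").map(embark).astype(int)
print(filled.tolist())
```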

In [6]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,0


In [7]:
from sklearn.preprocessing import OneHotEncoder

In [8]:
copy_train = train.copy()
copy_test = test.copy()

In [9]:
train_Embarked = copy_train["Embarked"].values.reshape(-1,1)
test_Embarked = copy_test["Embarked"].values.reshape(-1,1)

In [10]:
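# sparse_output=False returns a dense NumPy array (scikit-learn >= 1.2;
# older versions use sparse=False instead)
onehot_encoder = OneHotEncoder(sparse_output=False)
# Fit on the training data only, then reuse the fitted encoder on the test data
train_OneHotEncoded = onehot_encoder.fit_transform(train_Embarked)
test_OneHotEncoded = onehot_encoder.transform(test_Embarked)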
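
A common practice is to let the encoder learn its categories from the training data only and to call `transform` on the test data, so that both sets end up with the same columns. The sketch below uses made-up arrays in which the test data lacks one category. `handle_unknown="ignore"` is an optional setting chosen here so that a category unseen during fitting encodes to all zeros instead of raising an error, and `.toarray()` converts the default sparse output to a dense array.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical integer-coded port columns, shaped (n_samples, 1)
train_col = np.array([0, 1, 2, 0]).reshape(-1, 1)
test_col = np.array([2, 0]).reshape(-1, 1)  # category 1 never appears here

encoder = OneHotEncoder(handle_unknown="ignore")

# Learn the categories from the training data and encode it
train_enc = encoder.fit_transform(train_col).toarray()

# Reuse the learned categories for the test data (no refit)
test_enc = encoder.transform(test_col).toarray()

print(encoder.categories_)
print(test_enc)
```

Because the encoder was not refitted, `test_enc` still has three columns even though only two categories occur in `test_col`.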

In [11]:
# Output columns follow the sorted integer codes 0, 1, 2, i.e. S, C, Q
copy_train["EmbarkedS"] = train_OneHotEncoded[:,0]
copy_train["EmbarkedC"] = train_OneHotEncoded[:,1]
copy_train["EmbarkedQ"] = train_OneHotEncoded[:,2]
copy_test["EmbarkedS"] = test_OneHotEncoded[:,0]
copy_test["EmbarkedC"] = test_OneHotEncoded[:,1]
copy_test["EmbarkedQ"] = test_OneHotEncoded[:,2]
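
The column names in this cell depend on the encoder's output order matching the S, C, Q mapping. That order can be read from the fitted encoder's `categories_` attribute rather than assumed. A minimal sketch on made-up data follows; `code_to_port` is a hypothetical helper that inverts the `embark` dict.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

embark = {"S": 0, "C": 1, "Q": 2}
# Hypothetical inverse lookup: integer code -> port letter
code_to_port = {code: port for port, code in embark.items()}

codes = np.array([0, 1, 2, 0]).reshape(-1, 1)
encoder = OneHotEncoder()
encoded = encoder.fit_transform(codes).toarray()

# categories_[0] holds the sorted category values of the first (only) feature
names = ["Embarked" + code_to_port[int(c)] for c in encoder.categories_[0]]
encoded_df = pd.DataFrame(encoded, columns=names)
print(encoded_df)
```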

In [12]:
copy_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,EmbarkedS,EmbarkedC,EmbarkedQ
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,0,1.0,0.0,0.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,1,0.0,1.0,0.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,0,1.0,0.0,0.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,0,1.0,0.0,0.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,0,1.0,0.0,0.0
