# Feature Encoding

Feature encoding is the process of transforming `categorical features` into `numeric features`. This is necessary because most machine learning algorithms require numeric input. There are many ways to encode categorical features, and each method has its own advantages and disadvantages. In this notebook we will explore some of the most popular methods for encoding categorical features:
1. Label encoding
2. One-hot encoding
3. Ordinal encoding
4. Binary encoding

In [2]:
# import libraries
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns


In [6]:
df = sns.load_dataset('titanic')
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [7]:
df['sex'].value_counts()

sex
male      577
female    314
Name: count, dtype: int64

In [8]:
df['embark_town'].value_counts()

embark_town
Southampton    644
Cherbourg      168
Queenstown      77
Name: count, dtype: int64

In [11]:
# label-encode the sex column with scikit-learn's LabelEncoder
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

le = LabelEncoder()

df['update_sex'] = le.fit_transform(df['sex'])
df.head(10)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,update_sex
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False,1
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,0
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True,0
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False,0
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True,1
5,0,3,male,,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True,1
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True,1
7,0,3,male,2.0,3,1,21.075,S,Third,child,False,,Southampton,no,False,1
8,1,3,female,27.0,0,2,11.1333,S,Third,woman,False,,Southampton,yes,False,0
9,1,2,female,14.0,1,0,30.0708,C,Second,child,False,,Cherbourg,yes,False,0
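
In the output above, `female` became 0 and `male` became 1. This is because the codes follow the sorted order of the labels. The sketch below reproduces that mapping in plain Python on a small hand-written list (the list and the variable names are assumptions for illustration). It shows the idea only and is not scikit-learn's actual implementation.

```python
# Sketch of the idea behind label encoding (not scikit-learn's implementation).
sex = ["male", "female", "female", "female", "male"]

# Sort the unique labels, then number them 0, 1, 2, ...
classes = sorted(set(sex))
label_to_code = {label: code for code, label in enumerate(classes)}

encoded = [label_to_code[label] for label in sex]
print(label_to_code)  # {'female': 0, 'male': 1}
print(encoded)        # [1, 0, 0, 0, 1]
```

Because the numbers come only from alphabetical order, they carry no real ranking, which is worth keeping in mind for columns with more than two categories.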


## Ordinal encoding the embark_town column using a specific order

In [27]:
df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
update_sex       0
dtype: int64

In [35]:
df['embark_town'] = df['embark_town'].fillna(df['embark_town'].mode()[0])

In [36]:
df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      0
alive            0
alone            0
update_sex       0
dtype: int64

In [37]:
# pass the category order explicitly: Southampton -> 0, Cherbourg -> 1, Queenstown -> 2
oe = OrdinalEncoder(categories=[['Southampton', 'Cherbourg', 'Queenstown']])

In [39]:
df['town_update'] = oe.fit_transform(df[['embark_town']])
df['town_update'].head()

0    0.0
1    1.0
2    0.0
3    0.0
4    0.0
Name: town_update, dtype: float64

In [41]:
df['embark_town'].value_counts()

embark_town
Southampton    646
Cherbourg      168
Queenstown      77
Name: count, dtype: int64

In [40]:
df['town_update'].value_counts()

town_update
0.0    646
1.0    168
2.0     77
Name: count, dtype: int64
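
The same explicit ordering can also be expressed with pandas alone. The sketch below uses `pd.Categorical` on a small hand-written Series (the toy data and variable names are assumptions, not part of the notebook). Unlike the `OrdinalEncoder` output above, the codes come back as integers, and any value missing from the category list is coded as -1.

```python
import pandas as pd

# Toy data standing in for the embark_town column.
towns = pd.Series(["Southampton", "Cherbourg", "Queenstown", "Southampton"])

# The position in this list becomes the integer code.
order = ["Southampton", "Cherbourg", "Queenstown"]
codes = pd.Categorical(towns, categories=order, ordered=True).codes

print(list(codes))  # [0, 1, 2, 0]
```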

## One-hot encoding

In [53]:
ohe = OneHotEncoder()
ohe.fit_transform(df[['alive']]).toarray()

array([[1., 0.],
       [0., 1.],
       [0., 1.],
       ...,
       [1., 0.],
       [0., 1.],
       [1., 0.]])

## Binary Encoder

In [None]:
# BinaryEncoder lives in the category_encoders package, which has to be installed first
!pip install category_encoders

In [None]:
from category_encoders import BinaryEncoder
binary_encoder = BinaryEncoder()
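
To make the idea of binary encoding concrete, here is a minimal plain-Python sketch. It gives each category an integer and then spreads the binary digits of that integer over separate columns. The helper `binary_encode` and the toy list are hypothetical. This illustrates the technique only and is not how `category_encoders` implements `BinaryEncoder`.

```python
def binary_encode(values):
    """Return one list of 0/1 bits per input value (illustrative helper)."""
    # Unique categories in order of first appearance, numbered from 1.
    categories = list(dict.fromkeys(values))
    codes = {cat: i for i, cat in enumerate(categories, start=1)}
    # Number of bit columns needed to hold the largest code.
    width = max(codes.values()).bit_length()
    return [[int(bit) for bit in format(codes[v], f"0{width}b")] for v in values]


towns = ["Southampton", "Cherbourg", "Queenstown", "Southampton"]
encoded = binary_encode(towns)
print(encoded)  # [[0, 1], [1, 0], [1, 1], [0, 1]]
```

Three categories fit in two bit columns here, whereas one-hot encoding would need three. The saving grows with the number of categories.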

## Assignment
* How many types of feature encoding are there?
* When should each type of feature encoding be used?

## Assignment No 2
Visit the scikit-learn documentation, then read and try to understand section 6.3, "Preprocessing data".

_____________________________________________________________________

### Use pandas for feature encoding

In [55]:
df = sns.load_dataset("titanic")
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [59]:
# one-hot encode embark_town with pandas get_dummies
get_dum = pd.get_dummies(df, columns=['embark_town'])
get_dum.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,alive,alone,embark_town_Cherbourg,embark_town_Queenstown,embark_town_Southampton
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,no,False,False,False,True
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,yes,False,True,False,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,yes,True,False,False,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,yes,False,False,False,True
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,no,True,False,False,True
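
By default `get_dummies` creates one boolean column per category, as in the output above. The sketch below, on a small hand-written DataFrame (the toy data is an assumption), shows two optional parameters that are often useful: `drop_first=True` drops the first category so the remaining columns are not redundant, and `dtype=int` returns 0/1 instead of True/False.

```python
import pandas as pd

# Toy data standing in for two Titanic columns.
toy = pd.DataFrame({
    "fare": [7.25, 71.28, 8.46, 8.05],
    "embark_town": ["Southampton", "Cherbourg", "Queenstown", "Southampton"],
})

# Cherbourg (first in sorted order) is dropped and becomes the all-zeros case.
dummies = pd.get_dummies(toy, columns=["embark_town"], drop_first=True, dtype=int)
print(dummies.columns.tolist())
# ['fare', 'embark_town_Queenstown', 'embark_town_Southampton']
```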
