## Categorical Data Encoding Techniques

Categorical variables come in three types:

 1. **Binary variable** - a variable with only two possible values,
    e.g. Male/Female, True/False, Y/N.

 2. **Ordinal variable** - a variable whose values have an inherent order or sequence,
    e.g. grades (A, B, C) or difficulty level (simple, average, difficult).

 3. **Nominal variable** - a variable whose values have no inherent order or numerical meaning,
    e.g. city, department, species.

In [126]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelEncoder,OneHotEncoder,OrdinalEncoder

import category_encoders as ce

### Binary Feature Encoding

In [127]:
data=pd.DataFrame({'Col':['Y','N','Y','N','N','Y','Y','N','Y']})

In [128]:
data

|   | Col |
|---|-----|
| 0 | Y |
| 1 | N |
| 2 | Y |
| 3 | N |
| 4 | N |
| 5 | Y |
| 6 | Y |
| 7 | N |
| 8 | Y |


##### 1. Using Map

In [129]:
data['Col_map'] = data['Col'].map({'Y':1,'N':0})

##### 2. Using Replace

In [130]:
data['Col_replace'] = data['Col'].replace({'Y':1,'N':0})
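
The two methods differ when a value is missing from the mapping. The sketch below uses a hypothetical column containing an extra value `'U'` (not part of the original data) to show the difference: `map` turns unmapped values into `NaN`, while `replace` leaves them unchanged.

```python
import pandas as pd

# Hypothetical column with a value ('U') that is absent from the mapping.
s = pd.Series(['Y', 'N', 'U'], dtype=object)
mapping = {'Y': 1, 'N': 0}

mapped = s.map(mapping)        # unmapped 'U' becomes NaN
replaced = s.replace(mapping)  # unmapped 'U' is left as-is

print(mapped.tolist())
print(replaced.tolist())
```

For a strictly binary column both give the same result, as the table that follows shows. `map` is arguably the safer choice when unexpected values should surface as missing rather than pass through silently.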

In [131]:
data

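|   | Col | Col_map | Col_replace |
|---|-----|---------|-------------|
| 0 | Y | 1 | 1 |
| 1 | N | 0 | 0 |
| 2 | Y | 1 | 1 |
| 3 | N | 0 | 0 |
| 4 | N | 0 | 0 |
| 5 | Y | 1 | 1 |
| 6 | Y | 1 | 1 |
| 7 | N | 0 | 0 |
| 8 | Y | 1 | 1 |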


### Ordinal Feature Encoding

In [132]:
data=pd.DataFrame({'Degree':['High school','Masters','Diploma','Bachelors','Bachelors','Masters','Phd','High school','High school']})

##### 1. Manual encoding (using map/replace)

In [133]:
dict_encode = {'None':0,'High school':1,'Diploma':2,'Bachelors':3,'Masters':4,'Phd':5}

data['Degree_map'] =data['Degree'].map(dict_encode)
data['Degree_replace'] =data['Degree'].replace(dict_encode)
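
As an alternative to a hand-written dictionary, the same ranking can be expressed with an ordered pandas categorical. This is a minimal sketch on a shortened copy of the column; the names `levels` and `Degree_cat` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

data = pd.DataFrame({'Degree': ['High school', 'Masters', 'Diploma', 'Bachelors', 'Phd']})

# Lowest to highest; the position in this list defines the integer code.
levels = ['None', 'High school', 'Diploma', 'Bachelors', 'Masters', 'Phd']
degree_cat = pd.Categorical(data['Degree'], categories=levels, ordered=True)

# .codes is the index of each value in `levels`; values outside `levels` get -1.
data['Degree_cat'] = degree_cat.codes
print(data)
```

With this list the codes match the `dict_encode` values above, because each level's list position equals its dictionary value.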

##### 2. Using Label Encoder

In [134]:
lblencoder = LabelEncoder()
lblencoder.fit(data['Degree'])
data['Degree_lbl'] = lblencoder.transform(data['Degree'])

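###### Drawbacks

1. `LabelEncoder` is intended for encoding the target variable, not input features.
2. Missing values need to be treated before encoding.
3. Codes are assigned in sorted (alphabetical) order, so check that they match the intended ranking. Here `Bachelors` gets 0 and `High school` gets 2, which does not follow the education level.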
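
The ordering issue can be checked directly by inspecting the fitted encoder. The sketch below uses a shortened list of degrees (an assumption for brevity) and prints the learned classes.

```python
from sklearn.preprocessing import LabelEncoder

degrees = ['High school', 'Masters', 'Diploma', 'Bachelors', 'Phd']

encoder = LabelEncoder().fit(degrees)

# classes_ is sorted, so the codes follow alphabetical order, not education level.
print(list(encoder.classes_))
print(encoder.transform(['Bachelors', 'High school', 'Phd']))
```

Each value's code is its position in `classes_`, which is why `High school` ends up ranked above `Bachelors`.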

##### 3. Using ordinal encoder

In [135]:
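# Without an explicit categories argument, OrdinalEncoder also assigns codes in sorted (alphabetical) order
ord_encoder = OrdinalEncoder(dtype='int64')
ord_encoder.fit(data[['Degree']])
# transform returns a 2-D array of shape (n_rows, 1); ravel() flattens it to one column
data['Degree_ord'] = ord_encoder.transform(data[['Degree']]).ravel()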
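
To make the encoder respect the education ranking, the category order can be passed explicitly through the `categories` parameter. This is a sketch on a shortened copy of the column; the `levels` list is an assumption that mirrors the keys of `dict_encode`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

data = pd.DataFrame({'Degree': ['High school', 'Masters', 'Diploma', 'Bachelors', 'Phd']})

# One inner list per encoded column, ordered from lowest to highest level.
levels = [['None', 'High school', 'Diploma', 'Bachelors', 'Masters', 'Phd']]

ord_encoder = OrdinalEncoder(categories=levels, dtype='int64')
# fit_transform returns a 2-D array of shape (n_rows, 1); ravel() flattens it.
data['Degree_ord'] = ord_encoder.fit_transform(data[['Degree']]).ravel()
print(data)
```

Encoded this way, the codes agree with the manual `map`/`replace` columns instead of the alphabetical codes.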

In [136]:
data

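|   | Degree | Degree_map | Degree_replace | Degree_lbl | Degree_ord |
|---|--------|------------|----------------|------------|------------|
| 0 | High school | 1 | 1 | 2 | 2 |
| 1 | Masters | 4 | 4 | 3 | 3 |
| 2 | Diploma | 2 | 2 | 1 | 1 |
| 3 | Bachelors | 3 | 3 | 0 | 0 |
| 4 | Bachelors | 3 | 3 | 0 | 0 |
| 5 | Masters | 4 | 4 | 3 | 3 |
| 6 | Phd | 5 | 5 | 4 | 4 |
| 7 | High school | 1 | 1 | 2 | 2 |
| 8 | High school | 1 | 1 | 2 | 2 |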


### Nominal Feature Encoding

In [137]:
data=pd.DataFrame({'City':['Delhi','Mumbai','Hyderabad','Chennai','Bangalore','Delhi','Hyderabad','Mumbai','Agra']})

In [138]:
n=data['City'].nunique()
n

6

In [139]:
#data['City'].value_counts().index.tolist()

data['City'].value_counts()

Hyderabad    2
Delhi        2
Mumbai       2
Chennai      1
Agra         1
Bangalore    1
Name: City, dtype: int64

##### 1. Using One-hot encoder

In [140]:
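# scikit-learn >= 1.2 names this parameter sparse_output; older releases call it sparse
ohe = OneHotEncoder(drop='first',sparse_output=False,dtype='int64')
ohe.fit(data[['City']])

data_ohe = pd.DataFrame(ohe.transform(data[['City']]))

# drop='first' removes the first category in sorted order, leaving n-1 indicator columns
data_ohe.columns = ['City_'+str(i) for i in range(n-1)]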

data = pd.concat([data,data_ohe],axis=1)

data.drop(['City'],axis=1,inplace=True)

data

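|   | City_0 | City_1 | City_2 | City_3 | City_4 |
|---|--------|--------|--------|--------|--------|
| 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 1 | 0 | 0 |
| 6 | 0 | 0 | 0 | 1 | 0 |
| 7 | 0 | 0 | 0 | 0 | 1 |
| 8 | 0 | 0 | 0 | 0 | 0 |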
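
The positional names `City_0` to `City_4` hide which city each column stands for. One way to get readable names is `pd.get_dummies`, sketched below on a shortened, hypothetical copy of the column. It is a pandas-only alternative, not the encoder used above.

```python
import pandas as pd

data = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Agra', 'Delhi']})

# drop_first=True drops the first category in sorted order ('Agra' here),
# mirroring drop='first' in OneHotEncoder.
dummies = pd.get_dummies(data['City'], prefix='City', drop_first=True, dtype='int64')
print(dummies)
```

Unlike a fitted `OneHotEncoder`, `get_dummies` does not remember the categories it saw, so new data with different cities would yield different columns. When staying with scikit-learn, `ohe.get_feature_names_out()` should give comparable names for the fitted encoder.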
