# Encoding Categorical Data before Running an ML Model

### TASK : Experiment with and implement different types of encoding to deal with categorical data

### In this blog we will explore and implement:

- One-Hot Encoding using:
    - Python's category_encoders library
    - Sklearn Preprocessing
    - pandas' get_dummies
- Binary Encoding
- Frequency Encoding
- Label Encoding
- Ordinal Encoding

### What is Categorical data?

Categorical data groups observations into a limited set of labels that share a characteristic, whereas numerical data expresses information in the form of numbers.

Example : Gender (Male, Female)

### Why do we need Encoding?

- Most machine learning algorithms cannot handle categorical variables unless we convert them to numerical values.

- The performance of many algorithms also varies depending on how the categorical variables are encoded.

### Categorical variables can be divided into two categories:

- Nominal (no particular order, e.g. city)
- Ordinal (a meaningful order, e.g. temperature from very cold to very hot)
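
The distinction can also be made explicit in pandas. The sketch below is only an illustration: the `size` values are hypothetical and do not come from the dataset used in this post.

```python
import pandas as pd

# Nominal: the categories are just labels, with no order between them
city = pd.Categorical(["Delhi", "Gurugram", "Delhi"])

# Ordinal: the order is declared explicitly (hypothetical example values)
size = pd.Categorical(
    ["small", "large", "medium"],
    categories=["small", "medium", "large"],
    ordered=True,
)

print(city.ordered)         # False
print(size.ordered)         # True
print(size.codes.tolist())  # [0, 2, 1]
```

For the ordered variable, the integer codes follow the declared order, which is the property ordinal encoding relies on.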

We will also refer to a cheat sheet that shows when to use which type of encoding.

## Method 1 : Using Python's category_encoders library

category_encoders is a Python library that provides a wide range of encoding schemes behind a scikit-learn-style fit/transform interface.

### Here is a list of 15 common types of encoding :
    
- One Hot Encoding
- Label Encoding
- Ordinal Encoding
- Helmert Encoding
- Binary Encoding
- Frequency Encoding
- Mean Encoding
- Weight of Evidence Encoding
- Probability Ratio Encoding
- Hashing Encoding
- Backward Difference Encoding
- Leave One Out Encoding
- James-Stein Encoding
- M-estimator Encoding
- Thermometer Encoder

### Importing Libraries

In [1]:
import pandas as pd 
import sklearn

In [18]:
#pip install category_encoders

In [3]:
import category_encoders as ce

#### Creating Dataframe 

In [129]:
data = pd.DataFrame({ 'gender' : ['Male', 'Female', 'Male', 'Female', 'Female'],
                      'class' : ['A','B','C','D','A'],
                      'city' : ['Delhi','Gurugram','Delhi','Delhi','Gurugram'] }) 

In [5]:
data.head()

Unnamed: 0,gender,class,city
0,Male,A,Delhi
1,Female,B,Gurugram
2,Male,C,Delhi
3,Female,D,Delhi
4,Female,A,Gurugram


#### Implementing One-Hot Encoding through category_encoders

In this method, each category is mapped to a vector of 1s and 0s denoting the presence or absence of that category. The number of new columns equals the number of distinct categories in the feature.

In [130]:
# create an object of the One Hot Encoder 

ce_OHE = ce.OneHotEncoder(cols=['gender','city']) 

# transform the data 
data1 = ce_OHE.fit_transform(data)
data1.head()


Unnamed: 0,gender_1,gender_2,class,city_1,city_2
0,1,0,A,1,0
1,0,1,B,0,1
2,1,0,C,1,0
3,0,1,D,1,0
4,0,1,A,0,1
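
To see what the encoder is doing, here is a minimal hand-written sketch of one-hot encoding in plain pandas. The helper name `one_hot` is hypothetical, and ordering the columns by first appearance is an assumption chosen to mirror the output above (where `gender_1` lines up with Male); it is not the library's actual implementation.

```python
import pandas as pd

data = pd.DataFrame({"gender": ["Male", "Female", "Male", "Female", "Female"]})


def one_hot(series):
    """Hypothetical helper: one 0/1 indicator column per distinct category."""
    categories = series.unique()  # distinct values, in order of first appearance
    return pd.DataFrame(
        {f"{series.name}_{cat}": (series == cat).astype(int) for cat in categories}
    )


encoded = one_hot(data["gender"])
print(encoded.columns.tolist())  # ['gender_Male', 'gender_Female']
print(encoded["gender_Male"].tolist())  # [1, 0, 1, 0, 0]
```

Each row has exactly one 1 across the indicator columns of a feature, which is where the name "one-hot" comes from.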


## Binary Encoding

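Binary encoding first assigns each category an integer and then writes that integer in binary digits. Each binary digit becomes one feature column, so N categories need only about log2(N) columns instead of the N columns that one-hot encoding creates.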

In [88]:
ce_be = ce.BinaryEncoder(cols=['class'])

# transform the data 
data_binary = ce_be.fit_transform(data["class"])


In [91]:
print(data["class"])
data_binary

0    A
1    B
2    C
3    D
4    A
Name: class, dtype: object


Unnamed: 0,class_0,class_1,class_2
0,0,0,1
1,0,1,0
2,0,1,1
3,1,0,0
4,0,0,1
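
The table above can be reproduced by hand, which makes the two steps of binary encoding visible. This is a sketch under one assumption: categories are numbered from 1 in order of first appearance. That assumption matches the output shown here, but it is not taken from the library's source.

```python
import pandas as pd

classes = pd.Series(["A", "B", "C", "D", "A"], name="class")

# Step 1: give each category an integer, starting at 1 (assumed numbering)
codes = {cat: i for i, cat in enumerate(classes.unique(), start=1)}

# Step 2: find how many binary digits the largest integer needs
n_bits = max(codes.values()).bit_length()

# Step 3: one column per binary digit, most significant digit first
binary = pd.DataFrame(
    {
        f"class_{b}": classes.map(lambda c: (codes[c] >> (n_bits - 1 - b)) & 1)
        for b in range(n_bits)
    }
)

print(binary.values.tolist())
# [[0, 0, 1], [0, 1, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1]]
```

Four categories fit into three columns here, whereas one-hot encoding of the same column needs four.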


The library provides many other encoders that follow the same fit_transform pattern.

## Method 2 : Using pandas' get_dummies

In [14]:
pd.get_dummies(data,columns=["gender","city"])

Unnamed: 0,class,gender_Female,gender_Male,city_Delhi,city_Gurugram
0,A,0,1,1,0
1,B,1,0,0,1
2,C,0,1,1,0
3,D,1,0,1,0
4,A,1,0,0,1


We can assign custom prefixes if we want to. If we don't, the original column names are used as the default prefixes.

In [17]:
pd.get_dummies(data,prefix=["gen","city"],columns=["gender","city"])  

Unnamed: 0,class,gen_Female,gen_Male,city_Delhi,city_Gurugram
0,A,0,1,1,0
1,B,1,0,0,1
2,C,0,1,1,0
3,D,1,0,1,0
4,A,1,0,0,1
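
Two optional arguments of `pd.get_dummies` are worth knowing. Recent pandas versions return boolean columns by default, so `dtype=int` is one way to get the 0/1 output shown above. `drop_first=True` drops the first level of each feature, which is sometimes preferred for linear models; whether it helps depends on the model, so treat this as a sketch rather than a recommendation.

```python
import pandas as pd

data = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female", "Female"],
    "class": ["A", "B", "C", "D", "A"],
    "city": ["Delhi", "Gurugram", "Delhi", "Delhi", "Gurugram"],
})

# Force integer 0/1 columns instead of booleans
dummies = pd.get_dummies(data, columns=["gender", "city"], dtype=int)

# Additionally drop the first level of every encoded feature
reduced = pd.get_dummies(data, columns=["gender", "city"], drop_first=True, dtype=int)

print(dummies.columns.tolist())
# ['class', 'gender_Female', 'gender_Male', 'city_Delhi', 'city_Gurugram']
print(reduced.columns.tolist())
# ['class', 'gender_Male', 'city_Gurugram']
```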


## Method 3 : Using sklearn

sklearn also has several built-in encoders, which can be accessed from sklearn.preprocessing.

### SKLEARN ONE HOT ENCODING

#### Let's first get the list of categorical variables from our data

In [9]:
s = (data.dtypes == 'object')
cols = list(s[s].index)
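
An alternative sketch for the same step uses `select_dtypes`. It assumes that every non-numeric column in the frame should be treated as categorical, which holds for this small dataset but may not hold for yours (dates, free text and so on). The name `cat_cols` is just an illustrative variable.

```python
import pandas as pd

data = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female", "Female"],
    "class": ["A", "B", "C", "D", "A"],
    "city": ["Delhi", "Gurugram", "Delhi", "Delhi", "Gurugram"],
})

# Keep every column that is not numeric (assumption: these are the categorical ones)
cat_cols = data.select_dtypes(exclude="number").columns.tolist()

print(cat_cols)  # ['gender', 'class', 'city']
```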


In [133]:
from sklearn.preprocessing import OneHotEncoder

# dense output; in scikit-learn versions before 1.2 this argument is named sparse=False
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

#### Applying on gender column

In [132]:
data_gender = pd.DataFrame(ohe.fit_transform(data[["gender"]]))

data_gender

Unnamed: 0,0,1
0,0.0,1.0
1,1.0,0.0
2,0.0,1.0
3,1.0,0.0
4,1.0,0.0


#### Applying on City Column

In [131]:
data_city = pd.DataFrame(ohe.fit_transform(data[["city"]]))

data_city

Unnamed: 0,0,1
0,1.0,0.0
1,0.0,1.0
2,1.0,0.0
3,1.0,0.0
4,0.0,1.0


#### Applying on class column

In [30]:
data_class = pd.DataFrame(ohe.fit_transform(data[["class"]]))

data_class

Unnamed: 0,0,1,2,3
0,1.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0
2,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,1.0
4,1.0,0.0,0.0,0.0


The output has 4 columns because the class column has 4 unique values.

#### Applying on list of categorical variables:

In [27]:
data_cols = pd.DataFrame(ohe.fit_transform(data[cols]))

data_cols

Unnamed: 0,0,1,2,3,4,5,6,7
0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
1,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
2,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
3,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
4,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0


Here the first 2 columns represent gender, the next 4 columns represent class, and the remaining 2 represent city.

### SKLEARN Label Encoding

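In label encoding, each category is assigned an integer from 0 through N-1, where N is the number of categories for the feature. sklearn's LabelEncoder numbers the categories in sorted order.
The integers are arbitrary identifiers: there is no relation or order between these assignments.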

In [135]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()     #Takes no arguments

In [134]:
# LabelEncoder expects a 1-D input, so pass the column as a Series
le_class = le.fit_transform(data["class"])

le_class


array([0, 1, 2, 3, 0])
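
To check which integer stands for which category, the fitted encoder exposes `classes_`, and `inverse_transform` maps the integers back. A short sketch on the same class values:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

classes = pd.Series(["A", "B", "C", "D", "A"], name="class")

le = LabelEncoder()
codes = le.fit_transform(classes)

# Position i in classes_ is the category encoded as integer i
print(le.classes_.tolist())                  # ['A', 'B', 'C', 'D']
print(codes.tolist())                        # [0, 1, 2, 3, 0]
print(le.inverse_transform(codes).tolist())  # ['A', 'B', 'C', 'D', 'A']
```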

#### Comparing with one hot encoding

In [39]:
data_class

Unnamed: 0,0,1,2,3
0,1.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0
2,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,1.0
4,1.0,0.0,0.0,0.0


### ORDINAL ENCODING

Ordinal encoding retains the ordinal (ordered) nature of the variable in the encoded values.
It looks similar to label encoding. The only difference is that label encoding does not consider whether the variable is ordinal or not; it assigns a sequence of integers regardless.

#### Example : Ordinal encoding will assign values as Very Good(1) < Good(2) < Bad(3) < Worse(4)

First we need to specify the order of the categories through a dictionary.

In [77]:
temp = {'temperature' :['very cold', 'cold', 'warm', 'hot', 'very hot']}

df=pd.DataFrame(temp,columns=["temperature"])

temp_dict = {
    'very cold': 1,
    'cold': 2,
    'warm': 3,
    'hot': 4,
    "very hot": 5
}

In [80]:
temp_dict

{'very cold': 1, 'cold': 2, 'warm': 3, 'hot': 4, 'very hot': 5}

In [78]:
temp

{'temperature': ['very cold', 'cold', 'warm', 'hot', 'very hot']}

In [81]:
df

Unnamed: 0,temperature
0,very cold
1,cold
2,warm
3,hot
4,very hot


Then we can map each row of the variable according to the dictionary.

In [84]:
df["temp_ordinal"] = df.temperature.map(temp_dict)

In [85]:
df

Unnamed: 0,temperature,temp_ordinal
0,very cold,1
1,cold,2
2,warm,3
3,hot,4
4,very hot,5
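
Since this section sits under the sklearn method, here is a sketch of the same idea with sklearn's `OrdinalEncoder`, where the order is passed through the `categories` argument instead of a dictionary. Note two differences from the mapping above: the codes start at 0 rather than 1, and they come back as floats. The rows are shuffled here (an assumption made for illustration) so that the effect of the declared order is visible.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Same temperature labels as above, in shuffled row order for illustration
df = pd.DataFrame({"temperature": ["warm", "very cold", "hot", "cold", "very hot"]})

# The order of this list defines the order of the integer codes
order = ["very cold", "cold", "warm", "hot", "very hot"]
oe = OrdinalEncoder(categories=[order])

df["temp_ordinal"] = oe.fit_transform(df[["temperature"]]).ravel()

print(df["temp_ordinal"].tolist())  # [2.0, 0.0, 3.0, 1.0, 4.0]
```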


## Frequency Encoding

Each category is replaced by its relative frequency, i.e. the share of all rows in which that category appears.

In [122]:
data_freq = pd.DataFrame({'class' : ['A','B','C','D','A',"B","E","E","D","C","C","C","E","A","A"]})

In [123]:
fe = data_freq.groupby("class").size()
fe

class
A    4
B    2
C    4
D    2
E    3
dtype: int64

In [124]:
len(data_freq)

15

In [125]:
fe_ = fe/len(data_freq)

In [126]:
data_freq["data_fe"] = data_freq["class"].map(fe_).round(2)

In [127]:
data_freq

Unnamed: 0,class,data_fe
0,A,0.27
1,B,0.13
2,C,0.27
3,D,0.13
4,A,0.27
5,B,0.13
6,E,0.2
7,E,0.2
8,D,0.13
9,C,0.27
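
The same result can be obtained more compactly with `value_counts(normalize=True)`, which returns the share of each category directly. This is an equivalent sketch of the steps above, not a different technique.

```python
import pandas as pd

data_freq = pd.DataFrame({
    "class": ["A", "B", "C", "D", "A", "B", "E", "E", "D", "C", "C", "C", "E", "A", "A"]
})

# Share of rows per category, e.g. A appears in 4 of 15 rows
freq = data_freq["class"].value_counts(normalize=True)

# Replace each category by its share; a category missing from freq would become NaN
data_freq["data_fe"] = data_freq["class"].map(freq).round(2)

print(data_freq["data_fe"].head().tolist())  # [0.27, 0.13, 0.27, 0.13, 0.27]
```

One thing to keep in mind is that categories with the same frequency, such as A and C here, receive the same encoded value and can no longer be told apart.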


We saw 5 types of encoding schemes. There are 10 other types of encoding in the list given earlier:

- Helmert Encoding
- Mean Encoding
- Weight of Evidence Encoding
- Probability Ratio Encoding
- Hashing Encoding
- Backward Difference Encoding
- Leave One Out Encoding
- James-Stein Encoding
- M-estimator Encoding
- Thermometer Encoder

### Which One is Best then?

There is no single method that works best for every problem or dataset. I personally think that the get_dummies method has an advantage because it is very easy to apply.

### Read about all 15 types of encoding in detail

If you want to read about all 15 types of encoding, here is a very good article to refer to: https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02

I am also attaching a cheat sheet on when to use what type of encoding. 