# Categorical Variable Encoding

#### Categorical variables can be divided into two types:

  >  Nominal (no particular order), and

  >  Ordinal (the categories have a natural order).

#### There are many ways we can encode these categorical variables as numbers and use them in an algorithm.
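
As a small illustration of the nominal/ordinal distinction, the sketch below uses `pd.Categorical` with and without an explicit order. The variable names (`color`, `temperature`) are chosen here for illustration and are not part of the notebook's later cells.

```python
import pandas as pd

# Nominal: the categories have no inherent order.
color = pd.Categorical(['Red', 'Yellow', 'Blue'], ordered=False)

# Ordinal: the categories carry an explicit order (Cold < Warm < Hot < Very Hot).
temperature = pd.Categorical(
    ['Hot', 'Cold', 'Very Hot', 'Warm'],
    categories=['Cold', 'Warm', 'Hot', 'Very Hot'],
    ordered=True,
)

print(color.ordered)        # False
print(temperature.ordered)  # True
# min/max are only meaningful (and only allowed) for ordered categoricals.
print(temperature.min(), temperature.max())
```

Declaring the order up front is one way to keep it attached to the data, so later encoding steps do not have to guess it.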

####  1) One Hot Encoding
####  2) Label Encoding
####  3) Ordinal Encoding
####  4) Helmert Encoding
####  5) Binary Encoding
####  6) Frequency Encoding
####  7) Mean Encoding
####  8) Weight of Evidence Encoding
####  9) Probability Ratio Encoding
#### 10) Hashing Encoding
#### 11) Backward Difference Encoding
#### 12) Leave One Out Encoding
#### 13) James-Stein Encoding
#### 14) M-estimator Encoding
#### 15) Thermometer Encoder

In [63]:
import numpy as np
import pandas as pd

In [100]:
data = {'Temperature' : ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot', 'Warm', 'Warm','Hot', 'Hot', 'Cold'],
'Color' : ['Red', 'Yellow', 'Blue', 'Blue', 'Red', 'Yellow', 'Red', 'Yellow', 'Yellow', 'Yellow'],
'Target' : [1, 1, 1, 0, 1, 0, 1, 0, 1, 1]}

In [101]:
data = pd.DataFrame(data)
data

Unnamed: 0,Temperature,Color,Target
0,Hot,Red,1
1,Cold,Yellow,1
2,Very Hot,Blue,1
3,Warm,Blue,0
4,Hot,Red,1
5,Warm,Yellow,0
6,Warm,Red,1
7,Hot,Yellow,0
8,Hot,Yellow,1
9,Cold,Yellow,1


#### We will use Pandas, Scikit-learn, and category_encoders (a scikit-learn-contrib library) to show different encoding methods in Python.

## 1. One Hot Encoding

In this method, we map each category to a vector of 1s and 0s denoting the presence or absence of that category. The number of new columns equals the number of categories in the feature. This method produces a lot of columns, which can slow down learning significantly if the feature has a very large number of categories.

Pandas has a get_dummies function, which is quite easy to use. For the sample data-frame, the code would be as below:

In [66]:
#For all columns
data1 = data.copy()
mydata = pd.get_dummies(data1)
mydata.head()

Unnamed: 0,Target,Temperature_Cold,Temperature_Hot,Temperature_Very Hot,Temperature_Warm,Color_Blue,Color_Red,Color_Yellow
0,1,0,1,0,0,0,1,0
1,1,1,0,0,0,0,0,1
2,1,0,0,1,0,1,0,0
3,0,0,0,0,1,1,0,0
4,1,0,1,0,0,0,1,0


In [23]:
#For Specific Columns
# prefix is given as a list with one entry per encoded column
mydata1 = pd.get_dummies(data1, prefix = ['Temp'], columns = ['Temperature'])
mydata1.head()

Unnamed: 0,Color,Target,Temp_Cold,Temp_Hot,Temp_Very Hot,Temp_Warm
0,Red,1,0,1,0,0
1,Yellow,1,1,0,0,0
2,Blue,1,0,0,1,0
3,Blue,0,0,0,0,1
4,Red,1,0,1,0,0


In [24]:
# Drop the first dummy column of each feature (N-1 encoding)
mydata2 = pd.get_dummies(data1, drop_first = True)
mydata2.head()

Unnamed: 0,Target,Temperature_Hot,Temperature_Very Hot,Temperature_Warm,Color_Red,Color_Yellow
0,1,1,0,0,1,0
1,1,0,0,0,0,1
2,1,0,1,0,0,0
3,0,0,0,1,0,0
4,1,1,0,0,1,0


### 2nd Method

In [102]:
from sklearn.preprocessing import OneHotEncoder

ohc = OneHotEncoder()

# Pass both columns as a 2-D frame so that each feature is encoded separately.
# (Reshaping to (-1, 1) would merge the two columns into a single 20-row feature.)
ohe = ohc.fit_transform(data1[['Temperature', 'Color']]).toarray()

# Categories are sorted within each feature, so the columns come out in this order.
ohe_df = pd.DataFrame(ohe, columns = 'Temp_Cold,Temp_Hot,Temp_VeryHot,Temp_Warm,Color_Blue,Color_Red,Color_Yellow'.split(',') )
ohe_df.head()

In [106]:
dfh = pd.concat([data1, ohe_df], axis = 1)
dfh.head()

Unnamed: 0,Temperature,Color,Target,Temp_Cold,Temp_Hot,Temp_VeryHot,Temp_Warm,Color_Blue,Color_Red,Color_Yellow
0,Hot,Red,1,0.0,1.0,0.0,0.0,0.0,1.0,0.0
1,Cold,Yellow,1,1.0,0.0,0.0,0.0,0.0,0.0,1.0
2,Very Hot,Blue,1,0.0,0.0,1.0,0.0,1.0,0.0,0.0
3,Warm,Blue,0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
4,Hot,Red,1,0.0,1.0,0.0,0.0,0.0,1.0,0.0


One Hot Encoding is very popular. We can represent all N categories with N-1 columns (N = number of categories), because the omitted category is implied whenever all the other columns are 0. Usually, for regression, we use N-1 columns (dropping the first or last column of the one-hot encoded feature), but for classification, the recommendation is to use all N columns, as most tree-based algorithms build a tree based on all available features.

One hot encoding with N-1 binary variables should be used in linear regression to ensure the correct number of degrees of freedom (N-1). Linear regression has access to all of the features while it is being trained, and therefore examines the whole set of dummy variables together. This means that N-1 binary variables give complete information about the original categorical variable. This approach can be adopted for any machine learning algorithm that looks at all the features at the same time during training, for example support vector machines, neural networks, and clustering algorithms.

Tree-based methods evaluate one feature at a time when choosing a split, so if we drop a column they will never consider that label directly. Thus, if we use the categorical variable in a tree-based learning algorithm, it is good practice to encode it into N binary variables and not drop any.
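
To make the "N-1 columns are sufficient" argument concrete, here is a minimal sketch on a shortened copy of the Temperature column. The names `temps`, `full`, `reduced` and `recovered_cold` are illustrative and do not appear elsewhere in the notebook.

```python
import pandas as pd

temps = pd.Series(['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot'], name='Temperature')

# N = 4 dummy columns versus N - 1 = 3 columns.
full = pd.get_dummies(temps).astype(int)
reduced = pd.get_dummies(temps, drop_first=True).astype(int)

# The dropped category ('Cold', the first in sorted order) is exactly
# the set of rows where every remaining dummy column is 0.
recovered_cold = (reduced.sum(axis=1) == 0).astype(int)

print(full.shape[1], reduced.shape[1])
print((recovered_cold == full['Cold']).all())
```

A model that sees all columns at once can therefore reconstruct the dropped category, while a tree that splits on one column at a time has no single column to split on for it.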

## 2. Label Encoding

In this encoding, each category is assigned an integer; scikit-learn uses 0 through N-1, where N is the number of categories for the feature. One major issue with this approach is that there is no relation or order between these classes, but the algorithm might treat the integers as ordered or otherwise related. In the example below it may look like Cold < Hot < Very Hot < Warm (0 < 1 < 2 < 3). Scikit-learn code for the data-frame is as follows:

In [107]:
from sklearn.preprocessing import LabelEncoder

In [108]:
data2 = data.copy()
data2['Temp_Label_Encode'] = LabelEncoder().fit_transform(data2['Temperature'])
data2

Unnamed: 0,Temperature,Color,Target,Temp_Label_Encode
0,Hot,Red,1,1
1,Cold,Yellow,1,0
2,Very Hot,Blue,1,2
3,Warm,Blue,0,3
4,Hot,Red,1,1
5,Warm,Yellow,0,3
6,Warm,Red,1,3
7,Hot,Yellow,0,1
8,Hot,Yellow,1,1
9,Cold,Yellow,1,0
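
To see where these integers come from, the sketch below inspects the fitted encoder on a shortened copy of the column. The names `temps`, `le` and `mapping` are illustrative; `classes_` and `inverse_transform` are standard `LabelEncoder` members.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

temps = pd.Series(['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot'])

le = LabelEncoder()
codes = le.fit_transform(temps)

# classes_ holds the sorted unique labels; each label's position is its code.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform maps the integer codes back to the original labels.
print(list(le.inverse_transform(codes)))
```

The mapping is purely alphabetical, which is why Warm ends up with the largest code even though it is not the hottest value.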


### 2nd Method

#### Pandas factorize also performs the same function.

In [109]:
data3 = data.copy()
# factorize returns (codes, uniques); codes follow the order of first appearance
data3['Temp_Label_Encode'] = pd.factorize(data3['Temperature'])[0]
data3

Unnamed: 0,Temperature,Color,Target,Temp_Label_Encode
0,Hot,Red,1,0
1,Cold,Yellow,1,1
2,Very Hot,Blue,1,2
3,Warm,Blue,0,3
4,Hot,Red,1,0
5,Warm,Yellow,0,3
6,Warm,Red,1,3
7,Hot,Yellow,0,0
8,Hot,Yellow,1,0
9,Cold,Yellow,1,1


## 3. Ordinal Encoding

We use ordinal encoding to ensure that the encoding retains the ordinal nature of the variable. This is reasonable only for ordinal variables. It looks similar to Label Encoding, but Label Encoding does not consider whether the variable is ordinal; it simply assigns a sequence of integers

- as per the order of appearance in the data (Pandas assigned Hot (0), Cold (1), “Very Hot” (2) and Warm (3)), or

- as per alphabetically sorted order (scikit-learn assigned Cold (0), Hot (1), “Very Hot” (2) and Warm (3)).

Ordinal encoding instead assigns values that follow the real order: Cold (1) < Warm (2) < Hot (3) < “Very Hot” (4). Usually, ordinal encoding is done starting from 1.

In [111]:
Temp_dict = {'Cold':1, 'Warm':2, 'Hot':3, 'Very Hot':4}

In [112]:
data4 = data.copy()
data4['Temp_Ordinal'] = data4['Temperature'].map(Temp_dict)
data4

Unnamed: 0,Temperature,Color,Target,Temp_Ordinal
0,Hot,Red,1,3
1,Cold,Yellow,1,1
2,Very Hot,Blue,1,4
3,Warm,Blue,0,2
4,Hot,Red,1,3
5,Warm,Yellow,0,2
6,Warm,Red,1,2
7,Hot,Yellow,0,3
8,Hot,Yellow,1,3
9,Cold,Yellow,1,1
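
An alternative to a hand-written dictionary is an ordered `pd.Categorical`. This is only a sketch on a shortened copy of the column; `temps`, `order` and `temp_ordinal` are illustrative names.

```python
import pandas as pd

temps = pd.Series(['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot'])
order = ['Cold', 'Warm', 'Hot', 'Very Hot']   # lowest to highest

cat = pd.Categorical(temps, categories=order, ordered=True)

# Categorical codes start at 0 (and are -1 for values missing from `order`),
# so add 1 to get the same 1..4 scale as the dictionary mapping.
temp_ordinal = pd.Series(cat.codes, index=temps.index).astype(int) + 1
print(temp_ordinal.tolist())
```

Note that a value absent from `order` would come out as 0 here, whereas `map` with a dictionary would produce NaN; which behaviour is preferable depends on how unseen categories should be handled.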


## 4. Binary Encoding

Binary encoding converts a category into binary digits. Each binary digit creates one feature column. If there are n unique categories, binary encoding results in only about log₂(n) features. In this example, we have four categories; because the integer codes start at 1, the largest code is 4 (binary 100), so the total number of binary encoded features will be three. Compared to One Hot Encoding, this requires fewer feature columns (for 100 categories, One Hot Encoding will have 100 features, while Binary encoding needs just seven).
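
The column counts quoted above can be checked with a few lines of arithmetic. This sketch assumes the integer codes run from 1 to n, as in the steps described in this section; the helper names are hypothetical, and a given library version may produce a slightly different number of columns.

```python
def one_hot_columns(n_categories: int) -> int:
    # One column per category.
    return n_categories


def binary_columns(n_categories: int) -> int:
    # Codes run 1..n, so we need enough bits to write n in binary.
    return n_categories.bit_length()


for n in (4, 100, 1000):
    print(n, one_hot_columns(n), binary_columns(n))
```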

For Binary encoding, one has to follow the following steps:
    
- The categories are first converted to integers starting from 1 (the order follows the categories' appearance in the dataset and does not imply any ordinal nature)

- Then those integers are converted into binary code, so for example 3 becomes 011, 4 becomes 100

- Then the digits of the binary number form separate columns.
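
The three steps above can be sketched by hand with pandas and NumPy bit operations. This is an illustration of the idea, not the internals of category_encoders; the names `temps`, `codes`, `n_bits` and `manual_bin` are assumptions made for this example.

```python
import pandas as pd

temps = pd.Series(['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot'], name='Temperature')

# Step 1: integer codes in order of appearance, starting from 1.
codes = pd.factorize(temps)[0] + 1        # Hot=1, Cold=2, Very Hot=3, Warm=4

# Step 2: number of binary digits needed for the largest code.
n_bits = int(codes.max()).bit_length()    # 4 is '100' in binary, so 3 bits

# Step 3: one column per binary digit, most significant bit first.
bits = {
    f'Temperature_{i}': (codes >> (n_bits - 1 - i)) & 1
    for i in range(n_bits)
}
manual_bin = pd.DataFrame(bits, index=temps.index)
print(manual_bin)
```

Each row is simply the binary representation of the category's integer code, spread across the `Temperature_0` to `Temperature_2` columns.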

In [170]:
data5 = data.copy()
import category_encoders as ce


In [171]:
# Pass keyword arguments only: the first positional parameter is `verbose`, not the data,
# so passing data5 positionally would store the whole data-frame as `verbose`.
encoder = ce.BinaryEncoder(cols = ['Temperature'])

In [172]:
X = data5[['Temperature']]   # feature to encode
y = data5['Target']          # target; BinaryEncoder is unsupervised, so y is optional


In [173]:
dfbin = encoder.fit_transform(X, y)
dfbin

Unnamed: 0,Temperature_0,Temperature_1,Temperature_2
0,0,0,1
1,0,1,0
2,0,1,1
3,1,0,0
4,0,0,1
5,1,0,0
6,1,0,0
7,0,0,1
8,0,0,1
9,0,1,0


In [174]:
data_new = pd.concat([data5, dfbin], axis = 1)
data_new

Unnamed: 0,Temperature,Color,Target,Temperature_0,Temperature_1,Temperature_2
0,Hot,Red,1,0,0,1
1,Cold,Yellow,1,0,1,0
2,Very Hot,Blue,1,0,1,1
3,Warm,Blue,0,1,0,0
4,Hot,Red,1,0,0,1
5,Warm,Yellow,0,1,0,0
6,Warm,Red,1,1,0,0
7,Hot,Yellow,0,0,0,1
8,Hot,Yellow,1,0,0,1
9,Cold,Yellow,1,0,1,0


In [None]:
For more methods, refer to the link below.
Thank you

In [None]:
https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02