### All about Categorical Variable Encoding
#### Link: https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02

1. One Hot Encoding
2. Label Encoding
3. Ordinal Encoding
4. Helmert Encoding
5. Binary Encoding
6. Frequency Encoding
7. Mean Encoding
8. Weight of Evidence Encoding
9. Probability Ratio Encoding
10. Hashing Encoding
11. Backward Difference Encoding
12. Leave One Out Encoding
13. James-Stein Encoding
14. M-estimator Encoding

In [1]:
import pandas as pd

In [2]:
# Loading data (raw string, so the backslashes in the Windows path are not read as escape sequences)
df = pd.read_csv(r'D:\Projects\ML Basics with Python\Categorical Variable\Data/train.csv')

In [3]:
df

Unnamed: 0,id,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,...,nom_9,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month,target
0,0,0,0,0,T,Y,Green,Triangle,Snake,Finland,...,2f4cb3d51,2,Grandmaster,Cold,h,D,kr,2,2,0
1,1,0,1,0,T,Y,Green,Trapezoid,Hamster,Russia,...,f83c56c21,1,Grandmaster,Hot,a,A,bF,7,8,0
2,2,0,0,0,F,Y,Blue,Trapezoid,Lion,Russia,...,ae6800dd0,1,Expert,Lava Hot,h,R,Jc,7,2,0
3,3,0,1,0,F,Y,Red,Trapezoid,Snake,Canada,...,8270f0d71,1,Grandmaster,Boiling Hot,i,D,kW,2,1,1
4,4,0,0,0,F,N,Red,Trapezoid,Lion,Canada,...,b164b72a7,1,Grandmaster,Freezing,a,R,qP,7,8,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
299995,299995,0,0,0,T,N,Red,Trapezoid,Snake,India,...,e027decef,1,Contributor,Freezing,k,K,dh,3,8,0
299996,299996,0,0,0,F,Y,Green,Trapezoid,Lion,Russia,...,80f1411c8,2,Novice,Freezing,h,W,MO,3,2,0
299997,299997,0,0,0,F,Y,Blue,Star,Axolotl,Russia,...,314dcc15b,3,Novice,Boiling Hot,o,A,Bn,7,9,1
299998,299998,0,1,0,F,Y,Green,Square,Axolotl,Costa Rica,...,ab0ce192b,1,Master,Boiling Hot,h,W,uJ,3,8,1


### One Hot Encoding
In this method, we map each category to a vector of 1s and 0s denoting the presence or absence of that category. The number of new columns equals the number of categories in the feature. This method produces many columns, which slows down learning significantly when the feature has a large number of categories. Pandas has a get_dummies function, which is quite easy to use. For the sample data-frame, the code is as below:

In [5]:
df_1 = df[['nom_1', 'ord_1', 'month']].head(1000)

In [6]:
df_1

Unnamed: 0,nom_1,ord_1,month
0,Triangle,Grandmaster,2
1,Trapezoid,Grandmaster,8
2,Trapezoid,Expert,2
3,Trapezoid,Grandmaster,1
4,Trapezoid,Grandmaster,8
...,...,...,...
995,Square,Novice,7
996,Square,Grandmaster,1
997,Polygon,Novice,4
998,Trapezoid,Grandmaster,5


In [9]:
print(df_1['nom_1'].unique())
print(df_1['ord_1'].unique())
print(df_1['month'].unique())

['Triangle' 'Trapezoid' 'Polygon' 'Square' 'Star' 'Circle']
['Grandmaster' 'Expert' 'Novice' 'Contributor' 'Master']
[ 2  8  1  4 10  3  7  9 12 11  5  6]


#### Using Pandas get_dummies function

In [20]:
df_ohe_1 = pd.get_dummies(df_1)
df_ohe_1

Unnamed: 0,month,nom_1_Circle,nom_1_Polygon,nom_1_Square,nom_1_Star,nom_1_Trapezoid,nom_1_Triangle,ord_1_Contributor,ord_1_Expert,ord_1_Grandmaster,ord_1_Master,ord_1_Novice
0,2,0,0,0,0,0,1,0,0,1,0,0
1,8,0,0,0,0,1,0,0,0,1,0,0
2,2,0,0,0,0,1,0,0,1,0,0,0
3,1,0,0,0,0,1,0,0,0,1,0,0
4,8,0,0,0,0,1,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
995,7,0,0,1,0,0,0,0,0,0,0,1
996,1,0,0,1,0,0,0,0,0,1,0,0
997,4,0,1,0,0,0,0,0,0,0,0,1
998,5,0,0,0,0,1,0,0,0,1,0,0


In [19]:
df_ohe_2 = pd.get_dummies(df_1, prefix=['nom'], columns=['nom_1'])
df_ohe_2

Unnamed: 0,ord_1,month,nom_Circle,nom_Polygon,nom_Square,nom_Star,nom_Trapezoid,nom_Triangle
0,Grandmaster,2,0,0,0,0,0,1
1,Grandmaster,8,0,0,0,0,1,0
2,Expert,2,0,0,0,0,1,0
3,Grandmaster,1,0,0,0,0,1,0
4,Grandmaster,8,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...
995,Novice,7,0,0,1,0,0,0
996,Grandmaster,1,0,0,1,0,0,0
997,Novice,4,0,1,0,0,0,0
998,Grandmaster,5,0,0,0,0,1,0


#### Using Scikit-learn OneHotEncoder
Scikit-learn has OneHotEncoder for this purpose, but it returns an array rather than adding named feature columns to the data-frame, so some extra code is needed, as shown in the code sample below.

In [25]:
from sklearn.preprocessing import OneHotEncoder
ohc = OneHotEncoder()
ohe = ohc.fit_transform(df_1.nom_1.values.reshape(-1,1)).toarray()

In [26]:
ohe

array([[0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1., 0.],
       ...,
       [0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0., 0.]])

In [36]:
df_ohe_tmp = pd.DataFrame(ohe, columns=['nom_1_' + str(cat) for cat in ohc.categories_[0]])

In [37]:
df_ohe_tmp

Unnamed: 0,nom_1_Circle,nom_1_Polygon,nom_1_Square,nom_1_Star,nom_1_Trapezoid,nom_1_Triangle
0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.0,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...
995,0.0,0.0,1.0,0.0,0.0,0.0
996,0.0,0.0,1.0,0.0,0.0,0.0
997,0.0,1.0,0.0,0.0,0.0,0.0
998,0.0,0.0,0.0,0.0,1.0,0.0


In [38]:
df_ohe_3 = pd.concat([df_1, df_ohe_tmp], axis=1)

In [39]:
df_ohe_3

Unnamed: 0,nom_1,ord_1,month,nom_1_Circle,nom_1_Polygon,nom_1_Square,nom_1_Star,nom_1_Trapezoid,nom_1_Triangle
0,Triangle,Grandmaster,2,0.0,0.0,0.0,0.0,0.0,1.0
1,Trapezoid,Grandmaster,8,0.0,0.0,0.0,0.0,1.0,0.0
2,Trapezoid,Expert,2,0.0,0.0,0.0,0.0,1.0,0.0
3,Trapezoid,Grandmaster,1,0.0,0.0,0.0,0.0,1.0,0.0
4,Trapezoid,Grandmaster,8,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...
995,Square,Novice,7,0.0,0.0,1.0,0.0,0.0,0.0
996,Square,Grandmaster,1,0.0,0.0,1.0,0.0,0.0,0.0
997,Polygon,Novice,4,0.0,1.0,0.0,0.0,0.0,0.0
998,Trapezoid,Grandmaster,5,0.0,0.0,0.0,0.0,1.0,0.0


One Hot Encoding is very popular. We can represent all categories with N-1 columns (N = number of categories), as that is sufficient to identify the category that is not included: it is the row in which all N-1 columns are 0. Usually, for regression, we use N-1 columns (dropping the first or last column of the one-hot encoded feature), but for classification the recommendation is to use all N columns without dropping any, as most tree-based algorithms build a tree based on all available variables. One hot encoding with N-1 binary variables should be used in linear regression to ensure the correct number of degrees of freedom (N-1). Linear regression has access to all of the features while it is being trained, and therefore examines the whole set of dummy variables together. This means that N-1 binary variables give complete information about (represent completely) the original categorical variable to the linear regression. This approach can be adopted for any machine learning algorithm that looks at ALL the features at the same time during training, for example support vector machines, neural networks, and clustering algorithms.
In tree-based methods, a dropped category is never considered for a split, because the tree does not look at all the dummy variables together. Thus, if we use categorical variables in a tree-based learning algorithm, it is good practice to encode them into N binary variables and not drop any.
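
As a minimal sketch of the N versus N-1 choice, `pd.get_dummies` accepts a `drop_first` argument. The toy data-frame below is made up for illustration, and `dtype=int` is passed so the output is 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy data with N = 3 categories
toy = pd.DataFrame({'nom_1': ['Circle', 'Square', 'Star', 'Square']})

# All N columns, as suggested above for tree-based models
full = pd.get_dummies(toy, columns=['nom_1'], dtype=int)

# N-1 columns, as suggested above for linear regression
reduced = pd.get_dummies(toy, columns=['nom_1'], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
# 'Circle' (first in sorted order) was dropped, so its row is all zeros
print(reduced.iloc[0].tolist())
```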

### Label Encoding
In this encoding, each category is assigned a value from 1 through N (here N is the number of categories for the feature); the pandas and scikit-learn functions used in this section number the categories from 0 through N-1 instead. One major issue with this approach is that there is no relation or order between these classes, but the algorithm might treat the integers as ordered or otherwise related. In a temperature example it may look like (Cold < Hot < Very Hot < Warm, i.e. 0 < 1 < 2 < 3). Pandas and scikit-learn code for the data-frame is as follows:

#### Using Pandas factorize

In [48]:
df_2 = df_1.head(25).copy()  # copy, so the new column is not written to a slice of df_1
df_2['ord_1_label_encode'] = pd.factorize(df_2['ord_1'])[0]  # codes follow order of first appearance

In [50]:
df_2

Unnamed: 0,nom_1,ord_1,month,ord_1_label_encode
0,Triangle,Grandmaster,2,0
1,Trapezoid,Grandmaster,8,0
2,Trapezoid,Expert,2,1
3,Trapezoid,Grandmaster,1,0
4,Trapezoid,Grandmaster,8,0
5,Polygon,Novice,2,2
6,Trapezoid,Grandmaster,4,0
7,Triangle,Novice,2,2
8,Square,Novice,4,2
9,Trapezoid,Expert,2,1


#### Using sklearn LabelEncoder

In [53]:
from sklearn.preprocessing import LabelEncoder
df_3 = df_1.head(25).copy()  # copy, so the new column is not written to a slice of df_1
df_3['ord_1_label_encode'] = LabelEncoder().fit_transform(df_3['ord_1'])  # codes follow alphabetical order
df_3


Unnamed: 0,nom_1,ord_1,month,ord_1_label_encode
0,Triangle,Grandmaster,2,2
1,Trapezoid,Grandmaster,8,2
2,Trapezoid,Expert,2,1
3,Trapezoid,Grandmaster,1,2
4,Trapezoid,Grandmaster,8,2
5,Polygon,Novice,2,4
6,Trapezoid,Grandmaster,4,2
7,Triangle,Novice,2,4
8,Square,Novice,4,4
9,Trapezoid,Expert,2,1
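
The two tables above assign different integers to the same labels because `pd.factorize` numbers categories in order of first appearance, while `LabelEncoder` numbers them in sorted order. The sketch below reproduces both numberings with pandas alone, using `factorize(sort=True)` as a stand-in for the sorted behaviour; the five-element series is made up for illustration.

```python
import pandas as pd

# Hypothetical series containing each ord_1 level once
levels = pd.Series(['Grandmaster', 'Expert', 'Novice', 'Contributor', 'Master'])

# Default: codes follow the order in which labels first appear
codes_seen, uniques_seen = pd.factorize(levels)

# sort=True: codes follow the sorted (alphabetical) order of the labels
codes_sorted, uniques_sorted = pd.factorize(levels, sort=True)

print(dict(zip(levels, codes_seen.tolist())))
print(dict(zip(levels, codes_sorted.tolist())))
```

With sorted order, Expert, Grandmaster and Novice receive 1, 2 and 4, matching the LabelEncoder table.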


### Ordinal Encoding
We do Ordinal encoding to ensure that the encoding of a variable retains its ordinal nature. This is reasonable only for ordinal variables. This encoding looks similar to Label Encoding but differs slightly: Label Encoding does not consider whether the variable is ordinal, and assigns a sequence of integers either
as per the order of the data (Pandas assigned Hot (0), Cold (1), "Very Hot" (2) and Warm (3)) or
as per alphabetically sorted order (scikit-learn assigned Cold (0), Hot (1), "Very Hot" (2) and Warm (3)).
If we take the temperature scale as the order, then the ordinal values should run from "Cold" to "Very Hot". Ordinal encoding will assign values as (Cold (1) < Warm (2) < Hot (3) < "Very Hot" (4)). Usually, Ordinal Encoding is done starting from 1.
Refer to the code below using Pandas, where we first assign the original order of the variable through a dictionary. Then we map each row of the variable as per the dictionary.

In [55]:
df_4 = df_1.head(15)
df_4

Unnamed: 0,nom_1,ord_1,month
0,Triangle,Grandmaster,2
1,Trapezoid,Grandmaster,8
2,Trapezoid,Expert,2
3,Trapezoid,Grandmaster,1
4,Trapezoid,Grandmaster,8
5,Polygon,Novice,2
6,Trapezoid,Grandmaster,4
7,Triangle,Novice,2
8,Square,Novice,4
9,Trapezoid,Expert,2


In [58]:
# Order of the levels; a level missing from this dictionary would be mapped to NaN
ord_1_dict = {'Novice': 0, 'Expert': 1, 'Grandmaster': 2}
# assign() returns a new data-frame, so nothing is written to a slice of df_1
df_4 = df_4.assign(ord_1_encode=df_4['ord_1'].map(ord_1_dict))
df_4


Unnamed: 0,nom_1,ord_1,month,ord_1_encode
0,Triangle,Grandmaster,2,2
1,Trapezoid,Grandmaster,8,2
2,Trapezoid,Expert,2,1
3,Trapezoid,Grandmaster,1,2
4,Trapezoid,Grandmaster,8,2
5,Polygon,Novice,2,0
6,Trapezoid,Grandmaster,4,2
7,Triangle,Novice,2,0
8,Square,Novice,4,0
9,Trapezoid,Expert,2,1
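
The dictionary above covers only three of the five `ord_1` levels. As a sketch of encoding all five, an ordered pandas `Categorical` can carry the ranking. The ranking Novice < Contributor < Expert < Master < Grandmaster is an assumption made here for illustration, and the toy data-frame is made up.

```python
import pandas as pd

# Assumed ranking of the five levels (an assumption, not stated in the data)
order = ['Novice', 'Contributor', 'Expert', 'Master', 'Grandmaster']

# Hypothetical toy data containing each level once
toy = pd.DataFrame({'ord_1': ['Grandmaster', 'Expert', 'Novice', 'Master', 'Contributor']})

cat = pd.Categorical(toy['ord_1'], categories=order, ordered=True)

# .codes start at 0 and are -1 for values not listed in `order`;
# adding 1 makes the lowest level 1 and turns unlisted values into 0
toy['ord_1_encode'] = cat.codes + 1

print(toy)
```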


### Helmert Encoding
In this encoding, the mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels.
The version in category_encoders is sometimes referred to as Reverse Helmert Coding, a name used to differentiate it from forward Helmert coding, in which each level is compared to the mean of the subsequent levels instead.

In [61]:
import category_encoders as ce
df_4 = df_1.head(15)
encoder = ce.HelmertEncoder(cols=['nom_1'], drop_invariant=True)
dfh = encoder.fit_transform(df_4['nom_1'])
df_enc_h = pd.concat([df_4, dfh], axis=1)
df_enc_h

Unnamed: 0,nom_1,ord_1,month,nom_1_0,nom_1_1,nom_1_2,nom_1_3
0,Triangle,Grandmaster,2,-1.0,-1.0,-1.0,-1.0
1,Trapezoid,Grandmaster,8,1.0,-1.0,-1.0,-1.0
2,Trapezoid,Expert,2,1.0,-1.0,-1.0,-1.0
3,Trapezoid,Grandmaster,1,1.0,-1.0,-1.0,-1.0
4,Trapezoid,Grandmaster,8,1.0,-1.0,-1.0,-1.0
5,Polygon,Novice,2,0.0,2.0,-1.0,-1.0
6,Trapezoid,Grandmaster,4,1.0,-1.0,-1.0,-1.0
7,Triangle,Novice,2,-1.0,-1.0,-1.0,-1.0
8,Square,Novice,4,0.0,0.0,3.0,-1.0
9,Trapezoid,Expert,2,1.0,-1.0,-1.0,-1.0
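
The values in the table above are consistent with a reverse Helmert contrast matrix in which levels are numbered in order of first appearance (Triangle, Trapezoid, Polygon, Square, and so on). The sketch below rebuilds such a matrix with NumPy; `reverse_helmert_matrix` is a hypothetical helper written for illustration, not a function of category_encoders.

```python
import numpy as np

def reverse_helmert_matrix(n_levels):
    """Return a contrast matrix with one row per level and n_levels - 1 columns."""
    matrix = np.zeros((n_levels, n_levels - 1))
    for j in range(n_levels - 1):
        # Column j compares level j + 1 with all levels before it
        matrix[: j + 1, j] = -1   # the previous levels 0..j
        matrix[j + 1, j] = j + 1  # the level being compared
    return matrix

# Five levels give four encoded columns, as in the table above
print(reverse_helmert_matrix(5))
```

Row 0 of the result corresponds to Triangle, row 1 to Trapezoid, row 2 to Polygon and row 3 to Square in the table.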
