## Encoding Techniques - Categorical Variables
Categorical variables are broadly of two types: ordinal and nominal.
##### Ordinal Variables : Variables whose values have a natural order or rank, like ['Fair','Good','Excellent']
##### Nominal Variables : Variables whose values have no natural order, like ['Male','Female']
We will use the category_encoders library to explore various categorical encoding techniques.

In [1]:
import pandas as pd
df=pd.DataFrame({'numeric':[1,2,3,4,5,6,7,8,9,10],
                 'category':['a','c','b','a','b','c','b','b','c','b']})
df.head()

Unnamed: 0,numeric,category
0,1,a
1,2,c
2,3,b
3,4,a
4,5,b


## Label Encoding
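This encoding technique converts ordinal variables into numerical features, using LabelEncoder in sklearn or OrdinalEncoder in category_encoders. As the outputs below show, LabelEncoder numbers the categories in sorted order starting at 0, while OrdinalEncoder numbers them in order of first appearance starting at 1. Neither knows the real ranking of the values, so the resulting integers are meaningful only if that numbering happens to match the intended order.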

In [2]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
le.fit_transform(df['category'])

array([0, 2, 1, 0, 1, 2, 1, 1, 2, 1])

In [3]:
#!pip install category_encoders

In [4]:
import category_encoders as ce
ce_ord=ce.OrdinalEncoder()
ce_ord.fit_transform(df['category'])

Unnamed: 0,category
0,1
1,2
2,3
3,1
4,3
5,2
6,3
7,3
8,2
9,3
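
Both encoders above number the categories without knowing their real rank. For a truly ordinal variable the order can be supplied explicitly. Here is a minimal sketch using plain pandas, assuming a hypothetical `quality` column ranked Fair < Good < Excellent (the column name and ranking are illustrative and not part of the dataset above):

```python
import pandas as pd

# Hypothetical ordinal column; the ranking below is an assumption for illustration.
quality = pd.Series(['Good', 'Fair', 'Excellent', 'Good'], name='quality')
rank_order = ['Fair', 'Good', 'Excellent']  # lowest to highest

# An ordered Categorical assigns integer codes that follow rank_order.
quality_codes = pd.Categorical(quality, categories=rank_order, ordered=True).codes.tolist()
print(quality_codes)
```

With this approach the integer codes increase with the rank, so comparisons such as "greater than Good" remain meaningful after encoding.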


## One Hot Encoder
This is a popular technique for turning categorical features into numerical ones: each category gets its own 0/1 column. However, if the cardinality of the variable is high, the result is sparse and the number of columns blows up.

In [5]:
ce_ohe=ce.OneHotEncoder()
ce_ohe.fit_transform(df['category'])

Unnamed: 0,category_1,category_2,category_3
0,1,0,0
1,0,1,0
2,0,0,1
3,1,0,0
4,0,0,1
5,0,1,0
6,0,0,1
7,0,0,1
8,0,1,0
9,0,0,1
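
For comparison, pandas can produce the same kind of encoding with `pd.get_dummies`, which names each column after the category value rather than its ordinal position. This is a minimal sketch on the same data, offered as an alternative and not as a description of what category_encoders does internally:

```python
import pandas as pd

df = pd.DataFrame({'category': ['a', 'c', 'b', 'a', 'b', 'c', 'b', 'b', 'c', 'b']})

# One 0/1 column per distinct category, named after the category value.
dummies = pd.get_dummies(df['category'], prefix='category').astype(int)
print(dummies.columns.tolist())
print(dummies.head())
```

Each row contains exactly one 1, so the number of columns always equals the number of distinct categories.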


## One Hot Encoder with High Cardinality
A workaround is to take the value counts of the categorical variable, keep only the most frequent categories (say the 10 most frequent out of 50), and one-hot encode just those 10. This is worth experimenting with: despite the information loss, it may give the same accuracy as other encoding schemes when the remaining 40 categories have low counts.

In [6]:
df.category.value_counts()

b    5
c    3
a    2
Name: category, dtype: int64

In [7]:
df_mod=df[df['category']=='b']
ce.OneHotEncoder().fit_transform(df_mod['category'])

Unnamed: 0,category_1
2,1
4,1
6,1
7,1
9,1
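
The cell above keeps only the rows of the single most frequent category. A variant that keeps every row and gives one column to each of the `n` most frequent categories (rows from the remaining categories get all zeros) could look like the sketch below. The helper name `one_hot_top_n` is hypothetical, and plain pandas is used instead of category_encoders:

```python
import pandas as pd

def one_hot_top_n(series, n):
    """One-hot encode only the n most frequent categories of a Series."""
    # value_counts() sorts by frequency, most frequent first.
    top_categories = series.value_counts().index[:n]
    return pd.DataFrame(
        {f"{series.name}_{cat}": (series == cat).astype(int) for cat in top_categories},
        index=series.index,
    )

df = pd.DataFrame({'category': ['a', 'c', 'b', 'a', 'b', 'c', 'b', 'b', 'c', 'b']})
encoded = one_hot_top_n(df['category'], n=2)
print(encoded.head())
```

Here `a` is the least frequent category, so its rows are encoded as all zeros while `b` and `c` each get their own column.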


## Binary Encoder 
Binary encoder first converts the categories to integers and then represents each integer by its binary digits, one column per bit, e.g. 1 as 001, 2 as 010 and 3 as 011. It is useful for variables with many categories and, unlike the hashing technique, it does not lose information.

In [8]:
ce_bin=ce.BinaryEncoder()
ce_bin.fit_transform(df['category'])

Unnamed: 0,category_0,category_1,category_2
0,0,0,1
1,0,1,0
2,0,1,1
3,0,0,1
4,0,1,1
5,0,1,0
6,0,1,1
7,0,1,1
8,0,1,0
9,0,1,1
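
To make the mechanism concrete, here is a minimal pure-Python sketch of the idea. It assumes categories are numbered from 1 in order of first appearance and that only as many bit columns as needed are kept; the function name `binary_encode` is hypothetical and this is not the category_encoders implementation:

```python
def binary_encode(values):
    """Encode each value as the binary digits of its ordinal code."""
    codes = {}
    for value in values:
        # Number categories 1, 2, 3, ... in order of first appearance.
        codes.setdefault(value, len(codes) + 1)
    # Number of bits needed to represent the largest code.
    width = max(codes.values()).bit_length()
    return [[int(bit) for bit in format(codes[value], f"0{width}b")] for value in values]

rows = binary_encode(['a', 'c', 'b', 'a'])
print(rows)

# 50 distinct categories need only 6 bit columns.
wide = binary_encode([f"state_{i}" for i in range(50)])
print(len(wide[0]))
```

The bit patterns for `a`, `c` and `b` match the last two columns of the table above; the extra leading column in that table is always 0.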


Let us try Binary Encoder on all US states. We get only 7 columns, in contrast to the 50 columns produced by one hot encoding. Note that the State_0 column is always 0 and carries no information, so it can be dropped, leaving 6 columns.

In [9]:
df_states=pd.read_csv(r'C:\Users\udays\Documents\data.csv',sep=',')
print(df_states.head())
ce.BinaryEncoder().fit_transform(df_states['State'])

        State  Abbrev Code
0     Alabama    Ala.   AL
1      Alaska  Alaska   AK
2     Arizona   Ariz.   AZ
3    Arkansas    Ark.   AR
4  California  Calif.   CA


Unnamed: 0,State_0,State_1,State_2,State_3,State_4,State_5,State_6
0,0,0,0,0,0,0,1
1,0,0,0,0,0,1,0
2,0,0,0,0,0,1,1
3,0,0,0,0,1,0,0
4,0,0,0,0,1,0,1
5,0,0,0,0,1,1,0
6,0,0,0,0,1,1,1
7,0,0,0,1,0,0,0
8,0,0,0,1,0,0,1
9,0,0,0,1,0,1,0


## Hashing
HashingEncoder implements the hashing trick. It is similar to one-hot encoding but produces fewer new dimensions, at the cost of some information loss due to collisions. Collisions do not significantly affect performance unless there is a great deal of overlap.

In [10]:
df

Unnamed: 0,numeric,category
0,1,a
1,2,c
2,3,b
3,4,a
4,5,b
5,6,c
6,7,b
7,8,b
8,9,c
9,10,b


In [11]:
X = df.drop('numeric', axis = 1)
y = df.drop('category', axis = 1)

In [12]:
#!pip install category_encoders==2.0.0

###### Note
HashingEncoder uses md5 as its default hash method and produces 8 columns by default. The number of columns can be changed with the n_components parameter of HashingEncoder().

In [13]:
ce_has=ce.HashingEncoder()
ce_has.fit_transform(df['category'])

Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7
0,0,1,0,0,0,0,0,0
1,0,0,0,1,0,0,0,0
2,0,0,0,0,0,0,0,1
3,0,1,0,0,0,0,0,0
4,0,0,0,0,0,0,0,1
5,0,0,0,1,0,0,0,0
6,0,0,0,0,0,0,0,1
7,0,0,0,0,0,0,0,1
8,0,0,0,1,0,0,0,0
9,0,0,0,0,0,0,0,1
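
The hashing trick itself can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions (md5 digest interpreted as an integer, bucket chosen by modulo); the function name `hash_encode` is hypothetical, and the bucket assignments need not match those of category_encoders:

```python
import hashlib

def hash_encode(values, n_components=8):
    """Map each value to one of n_components buckets via an md5 hash."""
    rows = []
    for value in values:
        digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % n_components  # the same value always gets the same bucket
        row = [0] * n_components
        row[bucket] = 1
        rows.append(row)
    return rows

hashed = hash_encode(['a', 'c', 'b', 'a'])
for row in hashed:
    print(row)
```

Because the number of buckets is fixed, two different categories can land in the same column; that is the collision-related information loss mentioned above.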


## Encoding based on Count/Frequency
This is used when a variable has many distinct categories. In this technique we count how many times each distinct value occurs and replace the value with that count. The advantage is that the number of columns does not increase. The disadvantage is that if two distinct values, say 'A' and 'B', occur the same number of times, say 8, both are replaced by 8 and the distinction between A and B is lost.

In [14]:
df['category2']=['e','d','d','f','f','f','e','e','f','d']

In [15]:
df1=df.copy()
from pandas.api.types import is_string_dtype

In [16]:
def count_based_encoding(df):
    # Replace every string column with the number of times each of its values occurs.
    # Note: the DataFrame passed in is modified in place and also returned.
    for col in df.columns:
        if(is_string_dtype(df[col])):
            df[col]=df[col].map(df[col].value_counts())
    return df

In [17]:
df1=count_based_encoding(df1)
df1

Unnamed: 0,numeric,category,category2
0,1,2,3
1,2,3,3
2,3,5,3
3,4,2,4
4,5,5,4
5,6,3,4
6,7,5,3
7,8,5,3
8,9,3,4
9,10,5,3
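
The same idea works with relative frequencies instead of raw counts, which keeps the encoded values between 0 and 1 regardless of dataset size. This is a minimal sketch using pandas `value_counts(normalize=True)` on the `category` column from above:

```python
import pandas as pd

df = pd.DataFrame({'category': ['a', 'c', 'b', 'a', 'b', 'c', 'b', 'b', 'c', 'b']})

# normalize=True returns relative frequencies instead of raw counts.
frequencies = df['category'].value_counts(normalize=True)
category_freq = df['category'].map(frequencies)
print(category_freq.tolist())
```

Frequency encoding shares the limitation of count encoding: categories that occur equally often receive the same value.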


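## Mean Encoding
In this method we replace each category of a categorical variable with a value derived from the output. For example, in binary classification, if value 'A' occurs 7 times with output=0 and 3 times with output=1, then 'A' is replaced by 0.7 in the rows where output=0 and by 0.3 in the rows where output=1. Note that this variant uses each row's own output, so it can only be computed where the output is known. The more common form of mean (target) encoding replaces every occurrence of 'A' with the mean output of that category, here 0.3.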

In [18]:
df2=df.copy()

In [19]:
df2['output']=[0,0,1,1,0,0,0,0,1,0]
df2

Unnamed: 0,numeric,category,category2,output
0,1,a,e,0
1,2,c,d,0
2,3,b,d,1
3,4,a,f,1
4,5,b,f,0
5,6,c,f,0
6,7,b,e,0
7,8,b,e,0
8,9,c,f,1
9,10,b,d,0


In [20]:
df2.dtypes

numeric       int64
category     object
category2    object
output        int64
dtype: object

In [21]:
#b=dict(df2.output.value_counts())
#for key,values in b.items():
#    df3=df2[df2['output']==key]

In [22]:
def mean_based_encoding(df2):
    # Replace each string column with the share of rows in its category
    # that have the same 'output' value as the current row.
    for col in df2.columns:
        if(is_string_dtype(df2[col])):
            # transform() returns values aligned with df2's index, so every row
            # receives the value computed for its own (category, output) pair
            pair_count=df2.groupby([col,'output'])['output'].transform('count')
            cat_count=df2.groupby(col)['output'].transform('count')
            df2[col]=pair_count/cat_count
    return df2

In [23]:
df2=mean_based_encoding(df2)
df2

Unnamed: 0,numeric,category,category2,output
0,1,0.5,1.0,0
1,2,0.666667,0.666667,0
2,3,0.2,0.333333,1
3,4,0.5,0.5,1
4,5,0.8,0.5,0
5,6,0.666667,0.5,0
6,7,0.8,1.0,0
7,8,0.8,1.0,0
8,9,0.333333,0.5,1
9,10,0.8,0.666667,0
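
The function above uses each row's own output value. The more conventional form of mean (target) encoding gives every row of a category the same number: the mean output of that category. Here is a minimal sketch with pandas on the same `category` and `output` data, not a reproduction of the function above:

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['a', 'c', 'b', 'a', 'b', 'c', 'b', 'b', 'c', 'b'],
    'output':   [0, 0, 1, 1, 0, 0, 0, 0, 1, 0],
})

# Every row of a category receives that category's mean output.
category_mean = df.groupby('category')['output'].transform('mean')
print(category_mean.round(6).tolist())
```

In practice the category means are usually computed on the training data only and then reused for unseen data, to limit leakage of the target into the features.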


## Other Encoders
category_encoders offers 15 categorical encoding techniques, of which I have explored and implemented 6 above. Refer to this link for the other encoders: https://contrib.scikit-learn.org/categorical-encoding/index.html