In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn import preprocessing
import statsmodels as sm

df = pd.DataFrame([["India", 44, 72000],
                   ["US", 34, 65000],
                   ["Japan", 46, 98000],
                   ["US", 35, 45000],
                   ["Japan", 23, 34000]],
                  columns=["Country", "Age", "Salary"])

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Country  5 non-null      object
 1   Age      5 non-null      int64 
 2   Salary   5 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 248.0+ bytes


In [4]:

# label_encoder object knows how to understand word labels. 
label_encoder = preprocessing.LabelEncoder()
# Encode labels in column 'Country'. 
df['Country']= label_encoder.fit_transform(df['Country']) 
print(df.head())

   Country  Age  Salary
0        0   44   72000
1        2   34   65000
2        1   46   98000
3        2   35   45000
4        1   23   34000


As the output shows, the label encoder assigns integer codes in alphabetical order of the labels. Hence, India has been encoded as 0, Japan as 1, and the US as 2.
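To check the mapping directly rather than reading it off the printed frame, the fitted encoder can be inspected. The following is a minimal sketch that rebuilds the Country values in a standalone Series; the names `countries`, `encoder`, and `mapping` are illustrative and not part of the notebook above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Same country values as in the example dataframe
countries = pd.Series(["India", "US", "Japan", "US", "Japan"])

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# classes_ holds the sorted unique labels; each label's position is its code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(list(codes))
```

Inspecting `classes_` before overwriting the original column is a cheap way to keep the label-to-code mapping available for later interpretation.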

Challenges with Label Encoding
In the above scenario, the country names do not have an order or rank. But when label encoding is performed, the countries are numbered in alphabetical order. Because the codes are integers, there is a real risk that the model treats them as ordered and learns a relationship such as India < Japan < the US.

This is something that we do not want! So how can we overcome this obstacle? Here comes the concept of One-Hot Encoding.

One-Hot Encoding


One-Hot Encoding is another popular technique for treating categorical variables. It simply creates additional features based on the number of unique values in the categorical feature. Every unique value in the category will be added as a feature.

One-Hot Encoding is the process of creating dummy variables.

In this encoding technique, each category is represented as a one-hot vector. Let’s see how to implement one-hot encoding in Python:

In [10]:
df.Country.values.reshape(-1,1)

array([[0],
       [2],
       [1],
       [2],
       [1]], dtype=int64)

In [13]:
# Import the one-hot encoder
from sklearn.preprocessing import OneHotEncoder
# Create the encoder object
onehotencoder = OneHotEncoder()
# fit_transform expects 2-D input, so reshape the 1-D Country array before fitting
X = onehotencoder.fit_transform(df.Country.values.reshape(-1,1)).toarray()
# Name one column per encoded category (X.shape[1]), not per column of df
dfOneHot = pd.DataFrame(X, columns = ["Country_"+str(i) for i in range(X.shape[1])])
# Add the encoded columns back to the original dataframe
df1 = pd.concat([df, dfOneHot], axis=1)
# Drop the original Country column
df2 = df1.drop(['Country'], axis=1)
# Print to verify
print(df2.head())

   Age  Salary  Country_0  Country_1  Country_2
0   44   72000        1.0        0.0        0.0
1   34   65000        0.0        0.0        1.0
2   46   98000        0.0        1.0        0.0
3   35   45000        0.0        0.0        1.0
4   23   34000        0.0        1.0        0.0
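For comparison, pandas can produce the same kind of dummy columns in one call. This is a minimal sketch that starts from the original string-valued Country column (before label encoding); the frame name `raw` is illustrative, and the resulting columns are named after the country labels rather than the integer codes.

```python
import pandas as pd

# Rebuild the example data with the original string labels
raw = pd.DataFrame({
    "Country": ["India", "US", "Japan", "US", "Japan"],
    "Age": [44, 34, 46, 35, 23],
    "Salary": [72000, 65000, 98000, 45000, 34000],
})

# One 0/1 column per unique country; the Country column itself is replaced
dummies = pd.get_dummies(raw, columns=["Country"], dtype=int)
print(dummies.columns.tolist())
print(dummies)
```

Keeping the label in the column name makes the encoded frame easier to read than numbered columns, at the cost of depending on the exact label strings.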


Challenges of One-Hot Encoding: Dummy Variable Trap
One-Hot Encoding can lead to the Dummy Variable Trap, because the value of any one dummy variable can be predicted exactly from the remaining ones: in every row the dummy columns sum to 1.

The Dummy Variable Trap is a scenario in which the dummy variables are linearly dependent on each other, and therefore highly correlated.

The Dummy Variable Trap leads to the problem known as multicollinearity. Multicollinearity occurs where there is a dependency between the independent features. Multicollinearity is a serious issue in machine learning models like Linear Regression and Logistic Regression.
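The dependency can be made concrete with a small NumPy check. This is a sketch under the assumption that the model also fits an intercept (a column of ones), which is typical for linear and logistic regression but is not shown in the notebook; the names `dummies` and `design` are illustrative.

```python
import numpy as np

# The three one-hot columns from the example (India, Japan, US)
dummies = np.array([
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
], dtype=float)

# Every row sums to 1, so any one column equals 1 minus the other two
row_sums = dummies.sum(axis=1)
print(row_sums)

# Assumed design matrix: intercept column plus the three dummies
design = np.column_stack([np.ones(len(dummies)), dummies])

# Four columns but fewer independent directions: the matrix is rank deficient
rank = np.linalg.matrix_rank(design)
print(design.shape[1], rank)
```

A rank lower than the number of columns means the regression coefficients are not uniquely determined, which is the practical symptom of the trap.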

So, to overcome the problem of multicollinearity, one of the dummy variables has to be dropped. The rest of this section demonstrates how one-hot encoding introduces multicollinearity and how dropping one dummy column reduces it.
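One way to drop a dummy column at encoding time is the encoder's `drop` option. The sketch below assumes a reasonably recent scikit-learn release that supports `drop="first"`, and works on the original string labels; the names `countries`, `encoder`, `kept`, and `reduced` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

countries = pd.DataFrame({"Country": ["India", "US", "Japan", "US", "Japan"]})

# drop="first" omits the first category (in sorted order) from the output
encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(countries[["Country"]]).toarray()

# categories_ lists every category seen; the first one was dropped
kept = encoder.categories_[0][1:]
reduced = pd.DataFrame(encoded, columns=[f"Country_{c}" for c in kept])
print(reduced)
```

The dropped category (India here) becomes the baseline: a row with zeros in all remaining dummy columns represents it.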

One of the common ways to check for multicollinearity is the Variance Inflation Factor (VIF):

VIF = 1: no multicollinearity
1 < VIF < 5: moderate multicollinearity
VIF > 5: high multicollinearity (this is what we have to avoid)
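For intuition, the VIF of a feature is 1 / (1 - R^2), where R^2 comes from regressing that feature on all the other features. The following is a minimal NumPy sketch of that definition on made-up toy data. It assumes the auxiliary regression includes an intercept; the statsmodels function used in this notebook works on the columns exactly as passed, so its numbers can differ from this sketch. The function name `vif_with_intercept` and both arrays are hypothetical.

```python
import numpy as np

def vif_with_intercept(X, i):
    """VIF of column i: 1 / (1 - R^2) from regressing it on the other columns."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    # Assumption: the auxiliary regression includes an intercept term
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    centered = y - y.mean()
    r_squared = 1.0 - (residuals @ residuals) / (centered @ centered)
    return 1.0 / (1.0 - r_squared)

# Toy data: two uncorrelated columns, and two nearly identical columns
uncorrelated = np.array([[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]])
correlated = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0], [5.0, 6.0]])

print(vif_with_intercept(uncorrelated, 0))  # close to 1
print(vif_with_intercept(correlated, 0))    # well above 5
```

The closer a feature is to being a linear combination of the others, the closer R^2 gets to 1 and the larger the VIF becomes.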
Compute the VIF scores:



In [14]:
# Import library for VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calc_vif(X):
    # Compute the VIF of every column of X against all the other columns
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif

In [15]:


X = df2.iloc[:,:]
df4 = calc_vif(X)
df4.iloc[2:5,:]

   variables        VIF
2  Country_0  17.402774
3  Country_1  16.261873
4  Country_2  22.040062


In [16]:
X = df2.drop(['Country_0'],axis=1)
df5 = calc_vif(X)
df5.iloc[2:5,:]

   variables       VIF
2  Country_1  2.172309
3  Country_2  2.215864

