
# Approaches to Categorical Encoding

1. Label Encoding

2. One-Hot Encoding


Categorical encoding is the process of converting categorical values into numbers.

There are two common ways to convert categorical variables into numeric variables:

1. Label Encoding: Assign each category an integer. With scikit-learn's LabelEncoder, the integers follow the sorted (alphabetical) order of the category names.

2. One-Hot Encoding: Create one new variable per category that takes the value 1 when a row belongs to that category and 0 otherwise.



In [None]:
# Creating a Dataset

import numpy as np
import pandas as pd

#create dataset
df = pd.DataFrame({'Country': ['India', 'US', 'Japan', 'US', 'Japan'],
                   'Age': [44, 34, 46, 35, 23],
                   'Salary': [72000, 65000, 98000, 45000, 34000]})

df

  Country  Age  Salary
0   India   44   72000
1      US   34   65000
2   Japan   46   98000
3      US   35   45000
4   Japan   23   34000


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Country  5 non-null      object
 1   Age      5 non-null      int64 
 2   Salary   5 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 248.0+ bytes


## Label Encoding

In [None]:
# Import the preprocessing module, which provides LabelEncoder
from sklearn import preprocessing

# Create the encoder; it maps each distinct label to an integer
label_encoder = preprocessing.LabelEncoder()

# Copy df to df1 so that label encoding is performed on df1 only
df1 = df.copy()

# Encode the labels in column 'Country'
df1['Country'] = label_encoder.fit_transform(df1['Country'])

print("before label encoding")
print(df)
print('\n')
print("after label encoding")
print(df1)

before label encoding
  Country  Age  Salary
0   India   44   72000
1      US   34   65000
2   Japan   46   98000
3      US   35   45000
4   Japan   23   34000


after label encoding
   Country  Age  Salary
0        0   44   72000
1        2   34   65000
2        1   46   98000
3        2   35   45000
4        1   23   34000


LabelEncoder sorts the distinct labels and numbers them in that order, which for these strings is alphabetical. Hence, India has been encoded as 0, Japan as 1, and US as 2.
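The mapping between categories and integers can be inspected on the fitted encoder. The sketch below rebuilds the Country values so that it runs on its own and reads the encoder's `classes_` attribute; the position of each label in that sorted array is its integer code. The names `countries`, `mapping` and `restored` are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

# Same Country values as in the dataset above
countries = ['India', 'US', 'Japan', 'US', 'Japan']

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(countries)

# classes_ holds the distinct labels in sorted order;
# a label's index in classes_ is its integer code
mapping = {str(label): code for code, label in enumerate(label_encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
restored = label_encoder.inverse_transform(codes).tolist()
print(restored)
```

Keeping the fitted encoder around matters in practice: the same object should be reused to transform new data so that the codes stay consistent.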

### Challenges with Label Encoding

In the above scenario, the country names do not have an order or rank. Label encoding nevertheless numbers them in alphabetical order, so a model may treat the codes as an ordering such as India < Japan < US, a relationship that does not exist in the data.

## One-Hot Encoding

In [None]:
# copy df to df2 so that one-hot encoding is performed on df2 only
df2 = df.copy()

# importing one hot encoder 
from sklearn.preprocessing import OneHotEncoder

# creating one hot encoder object 
onehotencoder = OneHotEncoder()

# reshape the 1-D Country array to 2-D, as fit_transform expects 2-D input, then fit and transform
X = onehotencoder.fit_transform(df2.Country.values.reshape(-1,1)).toarray()

# build one column per category (X.shape[1] of them) to add back into the original dataframe
df2OneHot = pd.DataFrame(X, columns = ["Country_"+str(i) for i in range(X.shape[1])]) 

df2 = pd.concat([df2, df2OneHot], axis=1)

# optionally, drop the original Country column
#df2 = df2.drop(['Country'], axis=1)


#printing to verify 
print("before encoding")
print(df)
print('\n')
print("after one hot encoding")
print(df2)

before encoding
  Country  Age  Salary
0   India   44   72000
1      US   34   65000
2   Japan   46   98000
3      US   35   45000
4   Japan   23   34000


after one hot encoding
  Country  Age  Salary  Country_0  Country_1  Country_2
0   India   44   72000        1.0        0.0        0.0
1      US   34   65000        0.0        0.0        1.0
2   Japan   46   98000        0.0        1.0        0.0
3      US   35   45000        0.0        0.0        1.0
4   Japan   23   34000        0.0        1.0        0.0
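As an alternative to building the columns by hand, pandas provides `pd.get_dummies`, which names each new column after its category rather than after a position. This is a minimal sketch of a different route to the same encoding, not the method used in the cell above; `dtype=int` is passed so that the result holds 0/1 integers regardless of the pandas version, and the name `encoded` is illustrative.

```python
import pandas as pd

# Same dataset as above, rebuilt so the example runs on its own
df = pd.DataFrame({'Country': ['India', 'US', 'Japan', 'US', 'Japan'],
                   'Age': [44, 34, 46, 35, 23],
                   'Salary': [72000, 65000, 98000, 45000, 34000]})

# Replace 'Country' with one 0/1 column per category
encoded = pd.get_dummies(df, columns=['Country'], dtype=int)
print(encoded)
```

Column names such as `Country_India` are easier to read than `Country_0`, which can help when inspecting model coefficients later.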


### When to use One-Hot encoding and Label encoding?

The choice of encoding technique depends on the data. In the example above, we encoded country names as numbers. These categories have no inherent order, so the integer codes produced by label encoding carry an ordering that the data does not have; one-hot encoding avoids this.

Label encoding is used when:

1. The categorical feature is ordinal, so the order of the codes carries meaning.

2. The number of categories is quite large, as one-hot encoding can lead to high memory consumption.

One-hot encoding is used when:

1. The order does not matter in the categorical feature.

2. The categories in the feature are few.

Note: With label encoding, a model may misread the codes as being in some kind of order, 0 < 1 < 2. In the Country example above, that would mean India < Japan < US. One-hot encoding overcomes this problem.
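When a categorical feature does have a natural order, alphabetical codes can scramble it, so it helps to state the order explicitly. The sketch below uses a hypothetical `Size` column (not part of the dataset above) with pandas' ordered `Categorical` type; the category names and their order are assumptions made purely for illustration.

```python
import pandas as pd

# Hypothetical ordinal feature, for illustration only
sizes = pd.DataFrame({'Size': ['Small', 'Large', 'Medium', 'Small']})

# State the intended order explicitly instead of relying on alphabetical sorting
size_order = ['Small', 'Medium', 'Large']
ordered = pd.Categorical(sizes['Size'], categories=size_order, ordered=True)

# .codes gives each value's position in size_order
sizes['Size_code'] = ordered.codes
print(sizes)
```

Alphabetical sorting would have ranked these as Large, Medium, Small, which is the reverse of the intended order.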

## Challenges of One-Hot Encoding

### Dummy Variable Trap

One-hot encoding can lead to the dummy variable trap: the value of any one dummy variable can be predicted exactly from the remaining ones, because the dummy variables of a feature always sum to 1.

The dummy variable trap is therefore a scenario in which the variables are linearly dependent on each other.

The dummy variable trap leads to the problem known as multicollinearity, which occurs when there is a dependency between the independent features. Multicollinearity is a serious issue for models such as linear regression and logistic regression.

To overcome the problem of multicollinearity, one of the dummy variables has to be dropped.
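The dependency between dummy variables can be checked numerically. The following sketch, which assumes pandas' `get_dummies` as the encoder and uses illustrative variable names, shows that the full set of dummies sums to 1 in every row, that adding an intercept column makes the design matrix rank-deficient, and that dropping one dummy with `drop_first=True` restores full column rank.

```python
import numpy as np
import pandas as pd

countries = pd.Series(['India', 'US', 'Japan', 'US', 'Japan'], name='Country')

# Full one-hot encoding: one column per category
full = pd.get_dummies(countries, prefix='Country', dtype=int)
row_sums = full.sum(axis=1)          # each row contains exactly one 1

# With an intercept, the intercept column equals the sum of the dummies
full_design = np.column_stack([np.ones(len(full)), full.to_numpy()])
rank_full = np.linalg.matrix_rank(full_design)

# Dropping the first dummy removes the redundancy
reduced = pd.get_dummies(countries, prefix='Country', drop_first=True, dtype=int)
reduced_design = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])
rank_reduced = np.linalg.matrix_rank(reduced_design)

print(full_design.shape[1], rank_full)        # number of columns vs. rank
print(reduced_design.shape[1], rank_reduced)  # number of columns vs. rank
```

The dropped category becomes the baseline: a row with 0 in every remaining dummy belongs to it, so no information is lost.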

## Multicollinearity

Multicollinearity describes the state where the independent variables used in a study exhibit a strong relationship with each other.

This can pose a problem in many cases as you would normally want your independent variables to be… independent.

Depending on the aim and scope of your projects, it may be worthwhile to identify and address any signs of multicollinearity.

### Detecting Multicollinearity

Naturally, we cannot assess whether two features have a significant relationship through intuition alone, especially when many datasets have dozens of features.

Here are two common metrics used for detecting multicollinearity.

1. Correlation Coefficient

Pearson's correlation coefficient measures the strength of the linear relationship between two variables. Its values range between -1 and 1.

The absolute value of the correlation coefficient signifies the strength of the relationship, with a value closer to 1 corresponding to a stronger relationship; the sign gives its direction.

By calculating the correlation coefficient between pairs of predictive features, you can identify features that may be contributing to multicollinearity.

2. Variance Inflation Factor

The second metric for gauging multicollinearity is the variance inflation factor (VIF). For a given feature, the VIF is the ratio of the variance of its coefficient estimate in the full model to the variance of that estimate in a model containing only that feature.

In layman's terms, it gauges how much the variance of a feature's coefficient is inflated by that feature's correlation with the other features in the model.

A VIF of 1 indicates that the feature has no correlation with any of the other features.

Typically, a VIF value exceeding 5 or 10 is deemed to be too high. Any feature with such VIF values is likely to be contributing to multicollinearity.
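The VIF of a feature can equivalently be written as 1 / (1 - R²), where R² comes from regressing that feature on all the other features (with an intercept). The sketch below computes it that way with plain NumPy least squares on the label-encoded dataset; the helper name `manual_vif` is hypothetical, and the code is meant to make the definition concrete rather than replace a library routine.

```python
import numpy as np
import pandas as pd

# Label-encoded dataset from above (India=0, Japan=1, US=2)
data = pd.DataFrame({'Country': [0, 2, 1, 2, 1],
                     'Age': [44, 34, 46, 35, 23],
                     'Salary': [72000, 65000, 98000, 45000, 34000]})

def manual_vif(frame):
    """Return the VIF of each column as 1 / (1 - R^2)."""
    vifs = {}
    for col in frame.columns:
        y = frame[col].to_numpy(dtype=float)
        others = frame.drop(columns=col).to_numpy(dtype=float)
        # regress this feature on the remaining features plus an intercept
        A = np.column_stack([np.ones(len(frame)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        residuals = y - A @ coef
        r_squared = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
        vifs[col] = 1 / (1 - r_squared)
    return pd.Series(vifs, name='VIF')

vif_values = manual_vif(data)
print(vif_values.sort_values(ascending=False))
```

A feature that the others predict well has an R² close to 1 and therefore a large VIF, which is why high values point to multicollinearity.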

In [None]:
# Correlation coefficient:
df1.corr()

          Country       Age    Salary
Country  1.000000 -0.371007 -0.297922
Age     -0.371007  1.000000  0.890411
Salary  -0.297922  0.890411  1.000000


In [None]:
# copy the label-encoded data in df1 to corr_df for the multicollinearity checks
corr_df = df1.copy()

corr_df.corr()

          Country       Age    Salary
Country  1.000000 -0.371007 -0.297922
Age     -0.371007  1.000000  0.890411
Salary  -0.297922  0.890411  1.000000
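With more than a handful of features, reading the correlation matrix by eye becomes impractical. One possible approach, sketched below, is to loop over feature pairs and keep those whose absolute correlation exceeds a cutoff. The cutoff of 0.8 is an assumption chosen for illustration, not a value taken from this notebook, and the names `threshold` and `high_pairs` are hypothetical.

```python
from itertools import combinations

import pandas as pd

# Label-encoded dataset from above (India=0, Japan=1, US=2)
data = pd.DataFrame({'Country': [0, 2, 1, 2, 1],
                     'Age': [44, 34, 46, 35, 23],
                     'Salary': [72000, 65000, 98000, 45000, 34000]})

threshold = 0.8  # assumed cutoff, adjust to the problem at hand

# Collect every pair of columns whose absolute Pearson correlation exceeds the cutoff
high_pairs = []
for a, b in combinations(data.columns, 2):
    r = data[a].corr(data[b])
    if abs(r) > threshold:
        high_pairs.append((a, b, float(r)))

print(high_pairs)
```

Pairwise correlation only catches dependencies between two features at a time; dependencies involving three or more features are what the VIF is meant to pick up.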


### Variance Inflation Factor (VIF)

In [None]:
# load statsmodels functions
from statsmodels.stats.outliers_influence import variance_inflation_factor

# compute the VIF for every column of the given dataframe
def compute_vif(considered_features):
    
    # work on a copy so that the caller's dataframe is not modified
    X = considered_features.copy()
    # the calculation of variance inflation requires a constant
    X['intercept'] = 1
    
    # create dataframe to store vif values
    vif = pd.DataFrame()
    vif["Variable"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    # the intercept is only a helper column, so leave it out of the result
    vif = vif[vif['Variable']!='intercept']
    return vif

In [None]:

# compute vif 
compute_vif(corr_df).sort_values('VIF', ascending=False)

  Variable       VIF
1      Age  5.130828
2   Salary  4.855557
0  Country  1.166482
