# Different Approaches to Categorical Encoding
**1) Label Encoding**
Label Encoding is a popular technique for handling categorical variables. Each distinct label is assigned a unique integer. scikit-learn's `LabelEncoder` assigns these integers according to the sorted (alphabetical) order of the labels, starting from 0.

**2) One-Hot Encoding**
One-Hot Encoding is another popular technique for treating categorical variables. It creates one new binary feature for each unique value of the categorical feature, a process also known as creating dummy variables.
Each category is then represented as a one-hot vector: a row with a 1 in the column for its own category and 0 in every other column.
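
To make the idea of a one-hot vector concrete, here is a minimal sketch that builds the vectors by hand with NumPy. The toy `values` list is made up for illustration; only the three category names come from the dataset used in this notebook.

```python
import numpy as np

# Hypothetical toy data: the three category names match the Species column
categories = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
values = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]

# Map each category name to a column position
index_of = {name: i for i, name in enumerate(categories)}

# Row i of the identity matrix is the one-hot vector for category i
identity = np.eye(len(categories), dtype=int)
one_hot = identity[[index_of[v] for v in values]]

print(one_hot)
```

Each row contains exactly one 1, placed in the column that belongs to that row's category.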


In [19]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder,OneHotEncoder

In [20]:
# load the data from a CSV file into a pandas DataFrame
iris_data = pd.read_csv('/content/iris_data.csv')

In [21]:
iris_data.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [22]:
iris_data['Species'].value_counts()

Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: Species, dtype: int64

# Label Encoding

In [23]:
# create the label encoder
label_encoder_1 = LabelEncoder()

In [24]:
iris_labels = label_encoder_1.fit_transform(iris_data.Species)
iris_data['target'] = iris_labels


In [25]:
# Alternatively, overwrite the Species column in place with its encoded values.
# The integer codes are the same, but the original species names are replaced.
# iris_data["Species"] = label_encoder_1.fit_transform(iris_data["Species"])

In [26]:
iris_data.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species,target
0,1,5.1,3.5,1.4,0.2,Iris-setosa,0
1,2,4.9,3.0,1.4,0.2,Iris-setosa,0
2,3,4.7,3.2,1.3,0.2,Iris-setosa,0
3,4,4.6,3.1,1.5,0.2,Iris-setosa,0
4,5,5.0,3.6,1.4,0.2,Iris-setosa,0


In [27]:
iris_data['target'].value_counts()

0    50
1    50
2    50
Name: target, dtype: int64

Iris-setosa --> 0

Iris-versicolor --> 1

Iris-virginica --> 2
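
Rather than reading the mapping off the value counts, it can be recovered from the fitted encoder. The sketch below uses a small made-up list of species instead of the CSV file so that it runs on its own; `classes_` and `inverse_transform` are standard `LabelEncoder` members.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy input standing in for iris_data["Species"]
species = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]

encoder = LabelEncoder()
codes = encoder.fit_transform(species)

# classes_ holds the sorted unique labels; a label's position is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns the integer codes back into the original strings
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```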

Species names have no inherent order or rank. Label encoding nevertheless assigns them integers in alphabetical order, so a model may treat the codes as ordered quantities and learn a spurious relationship such as Iris-virginica > Iris-versicolor > Iris-setosa. This is something we do not want.
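
One way to see the problem is to compare distances between encoded species. This is only an illustrative sketch: the helper functions `label_distance` and `one_hot_distance` are hypothetical names introduced here, and the integer codes are the ones from the mapping above.

```python
import numpy as np

# Integer codes from the mapping above
label_codes = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}

# One-hot vector for each species: row `code` of the 3x3 identity matrix
one_hot = {name: np.eye(3)[code] for name, code in label_codes.items()}

def label_distance(a, b):
    """Absolute difference between the integer codes of two species."""
    return abs(label_codes[a] - label_codes[b])

def one_hot_distance(a, b):
    """Euclidean distance between the one-hot vectors of two species."""
    return float(np.linalg.norm(one_hot[a] - one_hot[b]))

for a, b in [("Iris-setosa", "Iris-versicolor"), ("Iris-setosa", "Iris-virginica")]:
    print(a, b, label_distance(a, b), round(one_hot_distance(a, b), 3))
```

With integer codes, Iris-setosa appears twice as far from Iris-virginica as from Iris-versicolor, purely because of alphabetical order. With one-hot vectors, every pair of species is the same distance apart.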

# One Hot Encoding

In [28]:
# load the data from a CSV file into a pandas DataFrame
df = pd.read_csv('/content/iris_data.csv')

In [29]:
df.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [30]:
df['Species'].value_counts()

Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: Species, dtype: int64

In [37]:
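# pd.get_dummies already returns a DataFrame, so no extra wrapping or encoder object is needed.
# dtype=int keeps the dummy columns as 1/0 (recent pandas versions default to True/False).
one_hot_encoded_data = pd.get_dummies(df, columns=["Species"], dtype=int)
one_hot_encoded_data.head(10)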


Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species_Iris-setosa,Species_Iris-versicolor,Species_Iris-virginica
0,1,5.1,3.5,1.4,0.2,1,0,0
1,2,4.9,3.0,1.4,0.2,1,0,0
2,3,4.7,3.2,1.3,0.2,1,0,0
3,4,4.6,3.1,1.5,0.2,1,0,0
4,5,5.0,3.6,1.4,0.2,1,0,0
5,6,5.4,3.9,1.7,0.4,1,0,0
6,7,4.6,3.4,1.4,0.3,1,0,0
7,8,5.0,3.4,1.5,0.2,1,0,0
8,9,4.4,2.9,1.4,0.2,1,0,0
9,10,4.9,3.1,1.5,0.1,1,0,0
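
The same encoding can also be produced with scikit-learn's `OneHotEncoder`, which was imported earlier. The following is a sketch on a small made-up DataFrame rather than the CSV file; the column names are built by hand to mimic the `get_dummies` naming, and `handle_unknown="ignore"` is an optional setting chosen here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame standing in for df
toy = pd.DataFrame(
    {"Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]}
)

# handle_unknown="ignore" encodes categories not seen during fit as all zeros
encoder = OneHotEncoder(handle_unknown="ignore")

# The encoder returns a sparse matrix by default, so convert it to a dense array
encoded = encoder.fit_transform(toy[["Species"]]).toarray()

# categories_[0] lists the categories learned for the first (and only) column
columns = [f"Species_{c}" for c in encoder.categories_[0]]
encoded_df = pd.DataFrame(encoded.astype(int), columns=columns, index=toy.index)
print(encoded_df)

# A category that was not present when fitting becomes a row of zeros
unseen = encoder.transform(pd.DataFrame({"Species": ["Iris-unknown"]})).toarray()
print(unseen)
```

Unlike `pd.get_dummies`, a fitted encoder remembers its categories, so it can be reused to transform new data with the same column layout.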


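**One-hot encoding every category introduces the Dummy Variable Trap: the dummy columns always sum to 1, so any one of them can be predicted exactly from the others. Such a dependency between independent features is called multicollinearity. It mainly affects linear models such as linear regression, and it is commonly avoided by dropping one of the dummy columns.**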
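
A short sketch of the dependency and of one common remedy, using a made-up toy frame. `drop_first=True` is a standard `pd.get_dummies` option that drops the first category's column.

```python
import pandas as pd

# Hypothetical toy frame standing in for df
toy = pd.DataFrame(
    {"Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]}
)

# Full one-hot encoding: one column per category
full = pd.get_dummies(toy, columns=["Species"], dtype=int)

# Every row sums to 1, so each column is determined by the other two
row_sums = full.sum(axis=1)
print(row_sums.tolist())

# Dropping the first category removes the redundancy;
# Iris-setosa is then represented by zeros in both remaining columns
reduced = pd.get_dummies(toy, columns=["Species"], drop_first=True, dtype=int)
print(reduced)
```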

# We apply One-Hot Encoding when:

**1) The categorical feature is not ordinal (like the species names above)**

**2) The number of categories is small, so one-hot encoding can be applied without creating too many new columns**

# We apply Label Encoding when:

**1) The categorical feature is ordinal (like Jr. kg, Sr. kg, Primary school, High school)**

**2) The number of categories is quite large, since one-hot encoding can then lead to high memory consumption**
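
For an ordinal feature, note that `LabelEncoder` sorts labels alphabetically, which does not necessarily match their rank. The sketch below is one possible way to encode the rank explicitly, using a pandas ordered categorical. The `education` values are made up for illustration, and the order in `levels` is an assumption taken from the example levels listed above.

```python
import pandas as pd

# Rank order written out explicitly, lowest to highest (assumed from the example above)
levels = ["Jr. kg", "Sr. kg", "Primary school", "High school"]

# Hypothetical toy column
education = pd.Series(["Primary school", "Jr. kg", "High school", "Sr. kg"])

# An ordered categorical assigns codes by position in `levels`, not alphabetically
ordered = pd.Categorical(education, categories=levels, ordered=True)
ordinal_codes = pd.Series(ordered.codes, index=education.index)
print(ordinal_codes.tolist())

# For contrast: alphabetical sorting puts "High school" first, which breaks the rank
alphabetical = sorted(levels)
print(alphabetical)
```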
 