# *Categorical encoding is a process of converting categories to numbers.*

In many machine learning or data science tasks, the data set contains text or categorical (non-numerical) values. For example, a color feature may have values like red, orange, blue and white, and a meal plan feature may have values like breakfast, lunch, snacks, dinner and tea. A few algorithms, such as CatBoost and decision trees, can handle categorical values very well, but most algorithms expect numerical values to achieve state-of-the-art results. Because these models operate on numbers, we need to convert categorical columns to numerical columns.

In [None]:
import numpy as np
import pandas as pd 


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('/kaggle/input/cat-in-the-dat-ii/train.csv')
df.head(10)

In [None]:
for feature in df.columns:
    print('The feature is {} and number of categories are {}'.format(feature,len(df[feature].unique())))

# Techniques for Categorical Encoding

There are several techniques for categorical encoding:
* Label Encoding or Ordinal Encoding
* One hot Encoding
* Dummy Encoding
* Effect Encoding
* Binary Encoding
* BaseN Encoding
* Hash Encoding
* Target Encoding

In this notebook I will go through the two most commonly used encoding techniques: *Label Encoding and One-Hot Encoding*

In [None]:
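# Though there are many more columns in the dataset, we will focus on a few categorical columns to understand encoding.
features = ['bin_4','nom_0','nom_4','ord_2']

# .copy() gives an independent DataFrame, so adding columns later does not raise SettingWithCopyWarning.
df1 = df[features].copy()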

print(df1.shape)

df1.head()

In [None]:
df1.info()

# Label Encoding Or Ordinal Encoding

This approach is very simple: each value in a column is converted to a number. For an ordinal feature, retaining the order is important, so the encoding should reflect the sequence of the categories.

**In Label encoding, *each label is converted into an integer value*.** 

In [None]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

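# Let's take the feature ord_2 here. astype(str) turns missing values into the string 'nan', which is then encoded as its own category.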

df1['ord_2_en'] = label_encoder.fit_transform(df1['ord_2'].astype(str))

df1.head(10)

That is how label encoding is performed, but did you notice a pattern?

Yes, there is a pattern! `LabelEncoder` ranks the labels **alphabetically**, not by their meaning. The temperatures have a natural order (Freezing < Cold < Warm < Boiling Hot < Lava Hot), but the encoded integers follow Boiling Hot < Cold < Freezing < Lava Hot < Warm. There is therefore a very high probability that the model treats this alphabetical ranking as a real relationship between the temperatures. For a feature with no order at all, the integers impose one that does not exist.
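
The alphabetical ranking can be reproduced on a small toy column. The sketch below is only an illustration: the toy values and the `temp_order` list (coldest to hottest) are assumptions, not taken from the dataset. It contrasts `LabelEncoder` with scikit-learn's `OrdinalEncoder`, which accepts an explicit category order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Toy column (assumed values mirroring the temperature labels discussed above).
toy = pd.DataFrame({"temp": ["Cold", "Lava Hot", "Freezing", "Warm", "Boiling Hot"]})

# LabelEncoder sorts the labels, so the integer codes follow alphabetical order.
le = LabelEncoder()
toy["alphabetical"] = le.fit_transform(toy["temp"])

# OrdinalEncoder takes an explicit order (assumed here: coldest to hottest).
temp_order = ["Freezing", "Cold", "Warm", "Boiling Hot", "Lava Hot"]
oe = OrdinalEncoder(categories=[temp_order])
toy["by_temperature"] = oe.fit_transform(toy[["temp"]]).ravel().astype(int)

print(list(le.classes_))
print(toy)
```

With an explicit order, the integer codes increase with temperature instead of with the spelling of the label.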

This ordering issue is addressed in another commonly used technique: **One-Hot Encoding**.

# One-Hot Encoding

In this technique, **each category value is converted into a new column**, and every row gets a 1 or 0 (true/false) in that column depending on whether it belongs to that category.

In [None]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()

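# Let's take the feature bin_4 here. Rows with missing values are dropped first.
df1 = df1.dropna()

# join() aligns on the index, so the encoded frame must reuse df1's index;
# a fresh 0..n-1 index would no longer match df1 after dropna().
enc_df = pd.DataFrame(ohe.fit_transform(df1[['bin_4']]).toarray(),
                      index=df1.index,
                      columns=ohe.get_feature_names_out())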

final_df = df1.join(enc_df)

final_df.head(15)

The major issue with one-hot encoding is that it creates dummy variables (one separate feature per category). Because exactly one of these columns is 1 in every row, any one column can be predicted from the others, which leads to multicollinearity (often called the dummy variable trap). Multicollinearity occurs when there is a dependency between the independent features, and it is a serious issue for models such as Linear Regression and Logistic Regression. A common remedy is to drop one of the dummy columns.

# When to use Label Encoding vs. One-Hot Encoding?


We apply One-Hot Encoding when:

* The categorical feature is **not ordinal** (like the bin_4 feature above)
* The number of categories is small, so one-hot encoding can be applied effectively

We apply Label Encoding when:

* The categorical feature is **ordinal** (like warm, cold, freezing, etc)
* The number of categories is quite large as one-hot encoding can lead to high memory consumption
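
The memory trade-off in the last point can be sketched with a synthetic column. The column name, the 500 labels and the row count below are assumptions made up for illustration. Integer codes always occupy a single column, while one-hot encoding adds one column per distinct category.

```python
import numpy as np
import pandas as pd

# Hypothetical high-cardinality column: 1,000 rows drawn from 500 labels.
rng = np.random.default_rng(0)
labels = [f"cat_{i}" for i in range(500)]
col = pd.Series(rng.choice(labels, size=1000), name="feature")

# Integer codes: a single column regardless of the number of categories.
codes = col.astype("category").cat.codes

# One-hot: one column per category that actually occurs in the data.
one_hot = pd.get_dummies(col)

print("distinct categories:", col.nunique())
print("integer codes shape:", codes.shape)
print("one-hot shape:", one_hot.shape)
```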