# Section 0: Encoding

Feature encoding converts categorical data into a numerical format that machine learning algorithms can work with. Most algorithms require numeric input features, so categorical columns have to be encoded before modeling.

# Types of Feature Encoding
## Label Encoding
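Converts each category into a unique integer. scikit-learn's `LabelEncoder` numbers the categories in sorted order, so the integers are identifiers only and do not express a meaningful ranking.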

In [1]:
from sklearn.preprocessing import LabelEncoder


colors = ["Red", "Blue", "Green", "Blue", "Green"]

label_encoder = LabelEncoder()

encoded_colors = label_encoder.fit_transform(colors)

In [3]:
label_encoder.classes_

array(['Blue', 'Green', 'Red'], dtype='<U5')

In [4]:
encoded_colors

array([2, 0, 1, 0, 1])

In [5]:
label_encoder.inverse_transform(encoded_colors)

array(['Red', 'Blue', 'Green', 'Blue', 'Green'], dtype='<U5')
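
To make the mapping explicit, the same result can be reproduced by hand. This is a minimal sketch in plain Python, not scikit-learn's implementation; the names `classes`, `category_to_code`, `encoded`, and `decoded` are illustrative assumptions.

```python
# Hypothetical hand-rolled label encoding, for illustration only.
colors = ["Red", "Blue", "Green", "Blue", "Green"]

# Sort the distinct categories, then number them in that order.
classes = sorted(set(colors))
category_to_code = {category: code for code, category in enumerate(classes)}

# Encode each value, then decode it again by indexing into the sorted list.
encoded = [category_to_code[color] for color in colors]
decoded = [classes[code] for code in encoded]

print(classes)  # ['Blue', 'Green', 'Red']
print(encoded)  # [2, 0, 1, 0, 1]
print(decoded)  # ['Red', 'Blue', 'Green', 'Blue', 'Green']
```

The codes match `encoded_colors` above because both approaches number the categories alphabetically.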

## One-Hot Encoding
Converts each category into a binary vector where only one bit is set to 1 (indicating the presence of the category) and all other bits are 0.

In [9]:
import pandas as pd

colors_df = pd.DataFrame(colors)
colors_df

|   | 0     |
|---|-------|
| 0 | Red   |
| 1 | Blue  |
| 2 | Green |
| 3 | Blue  |
| 4 | Green |


In [17]:
from sklearn.preprocessing import OneHotEncoder


# First way (using OneHotEncoder)
one_hot_encoder = OneHotEncoder()
encoded_colors = one_hot_encoder.fit_transform(colors_df)
encoded_colors.toarray()

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.]])

In [18]:
one_hot_encoder.categories_

[array(['Blue', 'Green', 'Red'], dtype=object)]

In [19]:
one_hot_encoder.inverse_transform(encoded_colors.toarray())

array([['Red'],
       ['Blue'],
       ['Green'],
       ['Blue'],
       ['Green']], dtype=object)

In [20]:
# Second way (using pandas get_dummies)
pd.get_dummies(colors_df)

|   | 0_Blue | 0_Green | 0_Red |
|---|--------|---------|-------|
| 0 | False  | False   | True  |
| 1 | True   | False   | False |
| 2 | False  | True    | False |
| 3 | True   | False   | False |
| 4 | False  | True    | False |
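
What both approaches compute can be sketched without any library: each value becomes a row with a 1 in the column of its category and 0 everywhere else. The helper `one_hot` below is a hypothetical name used for illustration, and the alphabetical column order is an assumption chosen to match the outputs above.

```python
# Hypothetical hand-rolled one-hot encoding, for illustration only.
colors = ["Red", "Blue", "Green", "Blue", "Green"]

# One column per distinct category, in alphabetical order.
categories = sorted(set(colors))


def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]


one_hot_rows = [one_hot(color, categories) for color in colors]

for color, row in zip(colors, one_hot_rows):
    print(color, row)
# Red [0, 0, 1]
# Blue [1, 0, 0]
# Green [0, 1, 0]
# Blue [1, 0, 0]
# Green [0, 1, 0]
```

Every row sums to 1, which is the defining property of a one-hot vector.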


## Ordinal Encoding
Assigns integer values to categories according to their rank, so that larger codes mean "higher" categories. For instance, if the feature is **Education Level** with categories `["High School", "Bachelors", "Masters", "PhD"]`, it might be encoded as `[0, 1, 2, 3]`.

In [21]:
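import numpy as np
from sklearn.preprocessing import OrdinalEncoder


# Each row is one sample; the encoder expects a 2D array of shape (n_samples, n_features).
education_levels = np.array([["High School"], ["Bachelors"], ["Masters"], ["PhD"], ["Bachelors"], ["High School"]])

# List the categories from lowest to highest so the codes follow this order.
ordinal_encoder = OrdinalEncoder(categories=[["High School", "Bachelors", "Masters", "PhD"]])

encoded_levels = ordinal_encoder.fit_transform(education_levels)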

In [25]:
encoded_levels

array([[0.],
       [1.],
       [2.],
       [3.],
       [1.],
       [0.]])

In [26]:
ordinal_encoder.categories_

[array(['High School', 'Bachelors', 'Masters', 'PhD'], dtype=object)]
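
The explicit `categories` list matters because the order of the codes is the whole point of ordinal encoding. The following plain-Python sketch (not scikit-learn code; `education_order`, `rank`, and `alphabetical_rank` are illustrative names) contrasts the intended ranking with a numbering based on alphabetical order.

```python
# Hypothetical hand-rolled ordinal encoding, for illustration only.
education_levels = ["High School", "Bachelors", "Masters", "PhD", "Bachelors", "High School"]

# Intended ranking, from lowest to highest.
education_order = ["High School", "Bachelors", "Masters", "PhD"]
rank = {level: code for code, level in enumerate(education_order)}
encoded_by_rank = [rank[level] for level in education_levels]

# Alphabetical numbering, shown only for contrast.
alphabetical_rank = {level: code for code, level in enumerate(sorted(education_order))}
encoded_alphabetically = [alphabetical_rank[level] for level in education_levels]

print(encoded_by_rank)         # [0, 1, 2, 3, 1, 0]
print(encoded_alphabetically)  # [1, 0, 2, 3, 0, 1]
```

With alphabetical numbering, "Bachelors" receives a lower code than "High School", which contradicts the actual ordering of the levels.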