# Data Encoding Techniques in Machine Learning

In this notebook, we will explore four common types of data encoding used in machine learning:

1. **Label Encoding**
2. **One-Hot Encoding**
3. **Ordinal Encoding**
4. **Target Guided Ordinal Encoding**

Encoding categorical variables is an essential step in preparing data for machine learning algorithms, as most algorithms require numerical input.


## Import Required Libraries

Let's start by importing the necessary libraries for data encoding.


In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

## 1. Label Encoding

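**Label Encoding** maps each distinct value in a categorical column to a unique integer. scikit-learn's `LabelEncoder` assigns the integers in sorted (alphabetical) order of the labels, so the codes carry no meaningful ranking, and the class is intended mainly for encoding target labels. Applied to a nominal feature, the codes can suggest an order that does not exist in the data.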

**Example:**
Suppose we have a column of colors: `['red', 'green', 'blue', 'green', 'red']`. Label encoding assigns one integer to each unique value, so repeated colors share the same code.


In [2]:
# Sample data
df = pd.DataFrame({"Color": ["red", "green", "blue", "green", "red"]})

# Apply Label Encoding
le = LabelEncoder()
df["Color_encoded"] = le.fit_transform(df["Color"])
print(df)
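# Cast the NumPy integers to plain ints so the mapping prints cleanly
mapping = {label: int(code) for label, code in zip(le.classes_, le.transform(le.classes_))}
print("Label mapping:", mapping)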

   Color  Color_encoded
0    red              2
1  green              1
2   blue              0
3  green              1
4    red              2
Label mapping: {'blue': 0, 'green': 1, 'red': 2}
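
The alphabetical assignment and the reverse mapping can be checked directly. The following is a minimal sketch, assuming scikit-learn's `LabelEncoder`; the variable names (`colors`, `codes`, `decoded`) are illustrative and not part of the cells above.

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "green", "blue", "green", "red"]

le = LabelEncoder()
codes = le.fit_transform(colors)

# classes_ is sorted, so the integer codes follow alphabetical order,
# not any meaningful ranking of the colors.
classes = list(le.classes_)

# inverse_transform recovers the original labels from the integer codes.
decoded = list(le.inverse_transform(codes))

print(classes)
print(decoded)
```

Keeping the fitted encoder around is what makes the round trip possible, which matters when predictions have to be reported as labels rather than integers.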


## 2. One-Hot Encoding

**One-Hot Encoding** creates a new binary column for each category in the original column. Each row has a 1 in the column corresponding to its category and 0 elsewhere.

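This method is suitable for nominal (unordered) categorical data because it imposes no order on the categories. The cost is one extra column per category, which can become large for high-cardinality features.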


In [8]:
# Example DataFrame
df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red"]})

# pandas get_dummies
one_hot_df = pd.get_dummies(df["Color"])
print("One-Hot Encoding with pandas:")
print(one_hot_df)

# sklearn OneHotEncoder
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[["Color"]]).toarray()
one_hot_sklearn = pd.DataFrame(encoded, columns=encoder.categories_[0])

print("\nOne-Hot Encoding with sklearn:")
print(one_hot_sklearn)

One-Hot Encoding with pandas:
    Blue  Green    Red
0  False  False   True
1   True  False  False
2  False   True  False
3  False  False   True

One-Hot Encoding with sklearn:
   Blue  Green  Red
0   0.0    0.0  1.0
1   1.0    0.0  0.0
2   0.0    1.0  0.0
3   0.0    0.0  1.0


## 3. Ordinal Encoding

**Ordinal Encoding** also assigns each unique category an integer, but the integers follow an order that you specify. It is meant for categorical variables with an inherent ranking (e.g., low < medium < high), so that the numeric codes preserve that ranking. With scikit-learn's `OrdinalEncoder`, the order is passed through the `categories` parameter.


In [5]:
# Sample data for ordinal encoding
sizes = ["Small", "Medium", "Large", "Medium", "Small"]
df_ordinal = pd.DataFrame({"Size": sizes})

# Define the order
size_order = ["Small", "Medium", "Large"]

# Apply Ordinal Encoding
ordinal_encoder = OrdinalEncoder(categories=[size_order])
df_ordinal["Size_encoded"] = ordinal_encoder.fit_transform(df_ordinal[["Size"]])
print(df_ordinal)

     Size  Size_encoded
0   Small           0.0
1  Medium           1.0
2   Large           2.0
3  Medium           1.0
4   Small           0.0
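
To see why the explicit order matters, the sketch below compares the encoder's default behavior with the explicitly ordered version. It assumes the same `Size` values as above; the variable names `default_codes` and `explicit_codes` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"Size": ["Small", "Medium", "Large", "Medium", "Small"]})

# Without `categories`, the encoder sorts the labels alphabetically:
# Large -> 0, Medium -> 1, Small -> 2, which reverses the intended order.
default_codes = OrdinalEncoder().fit_transform(df[["Size"]]).ravel()

# With an explicit order, the codes follow Small < Medium < Large.
explicit = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
explicit_codes = explicit.fit_transform(df[["Size"]]).ravel()

print(default_codes.tolist())
print(explicit_codes.tolist())
```

The default happens to produce a valid ordering only when alphabetical order matches the real ranking, so passing `categories` is the safer habit for ordinal features.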


## 4. Target Guided Ordinal Encoding

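**Target Guided Ordinal Encoding** orders the categories by their relationship with the target variable, usually the mean of the target within each category. The example below maps each category directly to its mean target value, so the encoded values preserve that ordering; replacing the means with their ranks (0, 1, 2, ...) yields integer codes instead. The technique is most useful when the categorical variable has a strong relationship with the target. The means should be computed on the training data only, otherwise the encoding leaks target information into the features.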

Let's see an example using a simple dataset.


In [21]:
# Sample data for target guided ordinal encoding
data = {
    "City": ["A", "B", "A", "C", "B", "A", "C", "B", "C"],
    "Target": [100, 200, 150, 300, 250, 120, 320, 210, 310],
}
df_tg = pd.DataFrame(data)

# Calculate mean target for each category
mean_city = df_tg.groupby("City")["Target"].mean().to_dict()
df_tg["city_encoded"] = df_tg["City"].map(mean_city)

print("Mean Target by City:")
print(df_tg.groupby("City")["Target"].mean())
print("\nDataFrame with Target Guided Ordinal Encoding:")
print(df_tg)

Mean Target by City:
City
A    123.333333
B    220.000000
C    310.000000
Name: Target, dtype: float64

DataFrame with Target Guided Ordinal Encoding:
  City  Target  city_encoded
0    A     100    123.333333
1    B     200    220.000000
2    A     150    123.333333
3    C     300    310.000000
4    B     250    220.000000
5    A     120    123.333333
6    C     320    310.000000
7    B     210    220.000000
8    C     310    310.000000
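
The rank-based form of the encoding can be sketched as follows. This is a minimal illustration on the same toy data, not a definitive implementation: the names `rank_map` and `City_rank` are hypothetical, and in practice the mapping would be learned from the training split only and then applied to validation and test data.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["A", "B", "A", "C", "B", "A", "C", "B", "C"],
    "Target": [100, 200, 150, 300, 250, 120, 320, 210, 310],
})

# Mean target per category, sorted from lowest to highest.
means = df.groupby("City")["Target"].mean().sort_values()

# Rank categories by mean target: lowest mean -> 0, next -> 1, and so on.
rank_map = {city: rank for rank, city in enumerate(means.index)}
df["City_rank"] = df["City"].map(rank_map)

print(rank_map)
print(df)
```

Ranks discard the size of the gaps between category means, which makes the encoding less sensitive to outliers in the target than mapping to the raw means.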
