**Encoding** in Machine Learning (ML) means converting categorical (text / label) data into numerical form, because most ML algorithms operate on numbers, not strings.

Example:

"location" = Bangalore, Mumbai, Delhi → ML cannot directly process these values.

Why Encoding Is Required

1. ML models perform mathematical operations

2. Text data has no inherent numeric meaning

3. Many libraries (for example, scikit-learn estimators) raise an error when fit on raw string columns

4. A suitable encoding can improve accuracy and generalization


# 1. Label Encoding

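**Label Encoding** assigns each category a unique integer. It is simple and memory-efficient but may unintentionally imply an order among categories when none exists. In scikit-learn, `LabelEncoder` numbers the classes in sorted order and is intended for target labels; for input features, `OrdinalEncoder` is the usual choice.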

1. Used in tree-based models like Decision Trees or XGBoost.

2. Pros: Simple and memory-efficient.

3. Cons: Introduces implicit order which may be misinterpreted by non-tree models when used with nominal data.

**When to Use**

Ordinal categorical data (order matters), provided the sorted order of the labels matches the intended order

Tree-based models (Decision Tree, Random Forest)

Example Columns

size → 1 BHK, 2 BHK, 3 BHK

grade → A, B, C

In [1]:
from sklearn.preprocessing import LabelEncoder

data = ['Grade A', 'Grade B', 'Grade C', 'Grade A']

le = LabelEncoder()
encoded_data = le.fit_transform(data)
print(f"Encoded Data: {encoded_data}")

Encoded Data: [0 1 2 0]
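
To make the implied ordering concrete, here is a minimal pure-Python sketch of the idea behind label encoding. The helper `label_encode` is hypothetical (it is not a scikit-learn function); it numbers the distinct values in sorted order, which is also how `LabelEncoder` orders its classes.

```python
def label_encode(values):
    """Map each distinct value to an integer, numbering values in sorted order."""
    classes = sorted(set(values))  # alphabetical order for strings
    mapping = {label: index for index, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping


ratings = ["Low", "Medium", "High", "Medium"]
codes, mapping = label_encode(ratings)

print(mapping)  # {'High': 0, 'Low': 1, 'Medium': 2}
print(codes)    # [1, 2, 0, 2]
```

Because the numbering follows alphabetical order, `High` receives a smaller code than `Low`. For ordinal features, an explicit category order is therefore the safer choice.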


# 2. One-Hot Encoding

**One-Hot Encoding** converts categories into binary columns with each column representing one category. It prevents false ordering but can lead to high dimensionality if there are many unique values.

1. Used in linear models, logistic regression and neural networks.
2. Pros: Does not assume order; widely supported.
3. Cons: Can cause high dimensionality and sparse data when feature has many categories.

**When to Use**

Nominal categorical data

Linear Regression, Logistic Regression, SVM

**Drawback**

Adds one column per distinct category, so high-cardinality features produce wide, sparse data (curse of dimensionality)

In [2]:
import pandas as pd

data = ['Red', 'Blue', 'Green', 'Red']

df = pd.DataFrame(data, columns=['Color'])
one_hot_encoded = pd.get_dummies(df['Color'])

print(one_hot_encoded)

    Blue  Green    Red
0  False  False   True
1   True  False  False
2  False   True  False
3  False  False   True
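
The mechanics of one-hot encoding can also be sketched without any library. The function `one_hot_encode` below is a hypothetical helper written for illustration; it uses 0/1 integers instead of booleans and sorts the categories so the column order matches the pandas output above.

```python
def one_hot_encode(values):
    """Return the sorted category names and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


colors = ["Red", "Blue", "Green", "Red"]
categories, rows = one_hot_encode(colors)

print(categories)  # ['Blue', 'Green', 'Red']
for row in rows:
    print(row)
# [0, 0, 1]
# [1, 0, 0]
# [0, 1, 0]
# [0, 0, 1]
```

Each row contains exactly one 1, and the number of columns equals the number of distinct categories, which is where the dimensionality cost comes from.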


# 3. Ordinal Encoding

**Ordinal Encoding** maps categories to integers while preserving their natural order. This works well for ordered data like ratings but is not suitable for nominal variables.

1. Used for ordered features like ratings or education levels.
2. Pros: Maintains order; reduces dimensionality.
3. Cons: Not suitable for nominal categories.

**When to Use**

Ordered categories

Example Columns

size → 1 BHK < 2 BHK < 3 BHK

rating → low < medium < high

In [3]:
from sklearn.preprocessing import OrdinalEncoder
data = [['Low'], ['Medium'], ['High'], ['Medium'], ['Low']]

encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
encoded_data = encoder.fit_transform(data)

print(f"Encoded Ordinal Data: {encoded_data}")

Encoded Ordinal Data: [[0.]
 [1.]
 [2.]
 [1.]
 [0.]]
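
As a rough illustration of what the explicit `categories` argument achieves, the sketch below builds the mapping by hand. `ordinal_encode` is a hypothetical helper, not part of scikit-learn, and rejecting unknown categories with an error is an assumption made for this example.

```python
def ordinal_encode(values, order):
    """Encode values by their position in a caller-supplied order."""
    rank = {category: position for position, category in enumerate(order)}
    unknown = set(values) - set(rank)
    if unknown:
        # Fail loudly rather than guess a rank for unseen categories.
        raise ValueError(f"Unknown categories: {sorted(unknown)}")
    return [rank[value] for value in values]


ratings = ["Low", "Medium", "High", "Medium", "Low"]
encoded = ordinal_encode(ratings, order=["Low", "Medium", "High"])

print(encoded)  # [0, 1, 2, 1, 0]
```

The integers now respect Low < Medium < High, which sorting the labels alphabetically would not guarantee.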


# 4. Target Encoding
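**Target Encoding**, also known as Mean Encoding, replaces each category in a feature with the mean of the target variable for that category. Libraries usually blend this category mean with the global target mean (smoothing), so encoded values can differ from the raw per-category means.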

1. Useful for high-cardinality features like ZIP codes or product IDs.
2. Pros: Captures relationship to target variable.
3. Cons: Risk of overfitting, especially for rare categories; usually needs smoothing or out-of-fold estimation to be reliable.

**When to Use**

High-cardinality features

Regression and binary classification problems

**Risk**

Data leakage if category means are computed on rows that are also used for validation or testing

Example Columns

location, society

In [7]:
import pandas as pd
import category_encoders as ce

df = pd.DataFrame(
    {'City': ['London', 'Paris', 'London', 'Berlin'], 'Target': [1, 0, 1, 0]}
)

encoder = ce.TargetEncoder(cols=['City'])
df_tgt = encoder.fit_transform(df['City'], df['Target'])

print(f"Encoded Target Data:\n{df_tgt}")

Encoded Target Data:
       City
0  0.570926
1  0.434946
2  0.570926
3  0.434946
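
The smoothing idea can be sketched in a few lines of plain Python. This example uses simple additive smoothing with a weight `m`, chosen here for illustration; it is not the formula `category_encoders.TargetEncoder` applies, so the numbers differ from the output above. The function name `smoothed_target_means` and the parameter `m` are hypothetical.

```python
from collections import defaultdict


def smoothed_target_means(categories, targets, m=1.0):
    """Blend each category's target mean with the global mean.

    m acts like a number of pseudo-observations at the global mean:
    larger m pulls rare categories harder toward the global mean.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    return {
        category: (sums[category] + m * global_mean) / (counts[category] + m)
        for category in counts
    }


cities = ["London", "Paris", "London", "Berlin"]
target = [1, 0, 1, 0]

encoding = smoothed_target_means(cities, target, m=1.0)
encoded = [encoding[city] for city in cities]

print(encoding)  # London is about 0.833; Paris and Berlin are 0.25
```

To limit leakage, the mapping would be computed from training rows only and then applied to validation or test rows.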


# 5. Binary Encoding

**Binary encoding** first maps each category to an integer, then writes that integer in binary with one column per bit, so n categories need about log2(n) columns instead of n. It is efficient for high-cardinality data but slightly more complex to implement.

1. Applied in high-cardinality text/NLP tasks to save memory.
2. Pros: Reduces dimensionality, more memory-efficient than one-hot encoding.
3. Cons: Slightly more complex; requires careful handling of missing values.

**When to Use**

Large categorical columns

Neural Networks

In [8]:
import pandas as pd
import category_encoders as ce

data = ['Red', 'Green', 'Blue', 'Red']
encoder = ce.BinaryEncoder(cols=['Color'])
encoded_data = encoder.fit_transform(pd.DataFrame(data, columns=['Color']))
print(encoded_data)

   Color_0  Color_1
0        0        1
1        1        0
2        1        1
3        0        1
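
The two output columns are easier to read once the intermediate integer codes are visible. The following sketch is a hypothetical helper, `binary_encode`, that numbers categories from 1 in order of first appearance; that numbering is an assumption chosen because it reproduces the output above, not a statement about the library's internals.

```python
def binary_encode(values):
    """Assign integer codes by first appearance, then split them into bits."""
    codes = {}
    for value in values:
        # New categories get the next integer, starting at 1.
        codes.setdefault(value, len(codes) + 1)
    width = max(codes.values()).bit_length()  # number of bit columns needed
    return [
        [(codes[value] >> shift) & 1 for shift in range(width - 1, -1, -1)]
        for value in values
    ]


colors = ["Red", "Green", "Blue", "Red"]
bit_rows = binary_encode(colors)

for row in bit_rows:
    print(row)
# [0, 1]
# [1, 0]
# [1, 1]
# [0, 1]
```

Here Red, Green and Blue map to 1, 2 and 3, which are 01, 10 and 11 in binary. Three categories fit in two columns, where one-hot encoding would need three.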


# 6. Frequency Encoding

**Frequency Encoding** assigns categories values based on how often they occur in the dataset. It is simple and compact but can introduce data leakage if applied improperly.

1. Effective in retail, e-commerce or clickstream data for popularity trends.
2. Pros: Low computational and storage requirements.
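3. Cons: Categories with the same frequency receive the same value and become indistinguishable; frequencies computed on the full dataset instead of the training split can leak information.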

**When to Use**

High-cardinality columns

Cases where one-hot encoding would create too many columns

Example Columns

location

society

In [9]:
import pandas as pd

data = ['Red', 'Green', 'Blue', 'Red', 'Red']
series_data = pd.Series(data)
frequency_encoding = series_data.value_counts()

encoded_data = series_data.map(frequency_encoding).tolist()
print("Encoded Data:", encoded_data)

Encoded Data: [3, 1, 1, 3, 3]
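
One way to avoid the leakage mentioned above is to learn the frequencies from the training split only and reuse them on new data. The sketch below assumes a hypothetical train/test split and maps unseen categories to 0.0; both choices are illustrative, not prescribed by the text.

```python
from collections import Counter

train = ["Red", "Green", "Blue", "Red", "Red"]
test = ["Green", "Red", "Yellow"]  # 'Yellow' never appears in train

# Learn relative frequencies from the training data only.
counts = Counter(train)
frequency = {category: n / len(train) for category, n in counts.items()}

train_encoded = [frequency[category] for category in train]
# Unseen categories fall back to 0.0 (an assumption for this sketch).
test_encoded = [frequency.get(category, 0.0) for category in test]

print(train_encoded)  # [0.6, 0.2, 0.2, 0.6, 0.6]
print(test_encoded)   # [0.2, 0.6, 0.0]
```

Green and Blue share the value 0.2, which shows the collision drawback: the model can no longer tell them apart from this feature alone.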
