
Encoding converts categorical data into a numerical format that machine learning models can consume. The **main encoding techniques** are:

---

### 🔢 1. **Label Encoding**

* **What it does:** Assigns each unique category an integer label.
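* **Best for:** Target labels, binary categories, or tree-based models where the integer order does little harm. scikit-learn's `LabelEncoder` numbers categories in sorted (alphabetical) order, so it does not preserve a custom order such as "Low" < "Medium" < "High"; use Ordinal Encoding when the order matters.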
* **Example:**
  `['Male', 'Female'] → [1, 0]`


In [10]:
import seaborn as sns
import pandas as pd

df = sns.load_dataset('titanic')
df = df[['sex', 'embarked', 'class', 'who', 'survived']].copy()  # Select a few columns; copy so new columns can be added without a SettingWithCopyWarning

In [11]:
df['class'].value_counts()

class,count
Third,491
First,216
Second,184


In [12]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['class_label'] = le.fit_transform(df['class'])

In [13]:
df[['class_label','class']]

,class_label,class
0,2,Third
1,0,First
2,2,Third
3,0,First
4,2,Third
...,...,...
886,1,Second
887,0,First
888,2,Third
889,0,First
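It can help to check which integer each category actually received. The sketch below uses a small hypothetical list (not the Titanic data) to show that `LabelEncoder` assigns codes in sorted order, which does not match the natural "Low" < "Medium" < "High" ranking.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical ordered categories, used only for illustration
sizes = ["Low", "Medium", "High", "Low"]

le = LabelEncoder()
codes = le.fit_transform(sizes)

# classes_ holds the categories in sorted order; the position is the integer label
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# The codes follow alphabetical order, not Low < Medium < High
print(codes.tolist())

# inverse_transform recovers the original strings from the codes
print(le.inverse_transform(codes).tolist())
```

In the Titanic example above, "First", "Second", "Third" happen to sort alphabetically in their natural order, so the codes 0, 1, 2 look ordinal by coincidence.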



---

### 🔲 2. **One-Hot Encoding**

* **What it does:** Converts each category into a separate binary column (0 or 1).
* **Best for:** Nominal variables (no order).
* **Example:**
  `['Red', 'Green', 'Blue'] → [1, 0, 0], [0, 1, 0], [0, 0, 1]`
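As a minimal sketch of the Red/Green/Blue example, assuming a toy DataFrame with a hypothetical `color` column: `pd.get_dummies` creates one column per category, and the columns come out in sorted order, so the bit positions may differ from the illustration above.

```python
import pandas as pd

# Toy data; the column name "color" is an assumption for this example
colors = pd.DataFrame({"color": ["Red", "Green", "Blue"]})

# One binary column per category; dtype=int gives 0/1 instead of True/False
one_hot = pd.get_dummies(colors, columns=["color"], dtype=int)
print(one_hot)

# drop_first=True removes the first (sorted) category, leaving k-1 columns
reduced = pd.get_dummies(colors, columns=["color"], drop_first=True, dtype=int)
print(reduced)
```

Each row of `one_hot` contains exactly one 1. In `reduced`, the dropped category is represented by a row of all zeros.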


In [14]:
# pandas-only one-hot encoding; drop_first=True keeps k-1 dummy columns
df_encoded = pd.get_dummies(df, columns=['sex'], drop_first=True)


In [15]:
df_encoded

,embarked,class,who,survived,class_label,sex_male
0,S,Third,man,0,2,True
1,C,First,woman,1,0,False
2,S,Third,woman,1,2,False
3,S,First,woman,1,0,False
4,S,Third,man,0,2,True
...,...,...,...,...,...,...
886,S,Second,man,0,1,True
887,S,First,woman,1,0,False
888,S,Third,woman,0,2,False
889,C,First,man,1,0,True


In [16]:
df

,sex,embarked,class,who,survived,class_label
0,male,S,Third,man,0,2
1,female,C,First,woman,1,0
2,female,S,Third,woman,1,2
3,female,S,First,woman,1,0
4,male,S,Third,man,0,2
...,...,...,...,...,...,...
886,male,S,Second,man,0,1
887,female,S,First,woman,1,0
888,female,S,Third,woman,0,2
889,male,C,First,man,1,0


In [18]:
import seaborn as sns
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Load Titanic dataset
df = sns.load_dataset('titanic')

# Select 'sex' column and drop NA values for this example
sex_data = df[['sex']].dropna()

# Initialize the encoder
encoder = OneHotEncoder(drop='first', sparse_output=False)  # drop='first' to avoid dummy variable trap

# Fit and transform the 'sex' column
encoded_array = encoder.fit_transform(sex_data)

# Convert the encoded array to DataFrame
encoded_df = pd.DataFrame(encoded_array, columns=encoder.get_feature_names_out(['sex']))

# Align the index with sex_data so the join matches rows correctly
encoded_df.index = sex_data.index

# Combine with original DataFrame
df_encoded = df.join(encoded_df)

# Display result
print(df_encoded[['sex', 'sex_male']].head())


      sex  sex_male
0    male       1.0
1  female       0.0
2  female       0.0
3  female       0.0
4    male       1.0


In [21]:
df_encoded

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,sex_male
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False,1.0
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,0.0
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True,0.0
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False,0.0
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True,1.0
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True,0.0
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False,0.0
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True,1.0


In [19]:
encoded_df

,sex_male
0,1.0
1,0.0
2,0.0
3,0.0
4,1.0
...,...
886,1.0
887,0.0
888,0.0
889,1.0


### 🧱 3. **Ordinal Encoding**

* **What it does:** Assigns ordered numbers to categories based on hierarchy.
* **Best for:** Categorical features with clear order.
* **Example:**
  `['Small', 'Medium', 'Large'] → [0, 1, 2]`

---


In [22]:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Sample dataset with an ordered column
data = pd.DataFrame({
    'Education': ['High School', 'PhD', 'Master', 'Bachelor', 'Master', 'PhD']
})

# Define the order
education_order = [['High School', 'Bachelor', 'Master', 'PhD']]

# Create encoder
encoder = OrdinalEncoder(categories=education_order)

# Apply encoding
data['Education_Encoded'] = encoder.fit_transform(data[['Education']])

print(data)


     Education  Education_Encoded
0  High School                0.0
1          PhD                3.0
2       Master                2.0
3     Bachelor                1.0
4       Master                2.0
5          PhD                3.0
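A pandas-only alternative, sketched here as an option rather than as part of the original notebook, is an ordered `Categorical`. It reuses the same education order and returns integer codes directly.

```python
import pandas as pd

education = pd.Series(["High School", "PhD", "Master", "Bachelor", "Master", "PhD"])
order = ["High School", "Bachelor", "Master", "PhD"]

# An ordered Categorical stores each value as the position of its category
cat = pd.Categorical(education, categories=order, ordered=True)
codes = pd.Series(cat.codes, index=education.index)
print(codes.tolist())

# Values outside the declared categories become missing and get code -1
unseen = pd.Categorical(["Diploma"], categories=order, ordered=True)
print(unseen.codes[0])
```

The codes are integers rather than the floats produced by `OrdinalEncoder`, and the `-1` sentinel makes unexpected categories easy to spot.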



### 🔁 4. **Binary Encoding**

* **What it does:** Converts categories to binary numbers and splits them into columns.
* **Best for:** High cardinality variables (many unique values).
* **Example:**
  `Category A → 001, B → 010`


In [24]:
!pip install category_encoders

Collecting category_encoders
  Downloading category_encoders-2.8.1-py3-none-any.whl.metadata (7.9 kB)
Downloading category_encoders-2.8.1-py3-none-any.whl (85 kB)
Installing collected packages: category_encoders
Successfully installed category_encoders-2.8.1


In [25]:
import pandas as pd
import category_encoders as ce

# Sample data
df = pd.DataFrame({
    'City': ['London', 'Berlin', 'New York', 'London', 'Berlin', 'Tokyo']
})

# Apply Binary Encoding
encoder = ce.BinaryEncoder(cols=['City'])
df_encoded = encoder.fit_transform(df)

print(df_encoded)


   City_0  City_1  City_2
0       0       0       1
1       0       1       0
2       0       1       1
3       0       0       1
4       0       1       0
5       1       0       0
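To see where those bits come from, here is a hand-rolled sketch in plain pandas. It assumes categories are numbered from 1 in order of first appearance, which reproduces the output above for this data; the library's internals may differ, for example in how unknown or missing values are handled.

```python
import pandas as pd

cities = pd.Series(
    ["London", "Berlin", "New York", "London", "Berlin", "Tokyo"], name="City"
)

# Step 1: integer per category in order of first appearance, starting at 1
ordinal = pd.factorize(cities)[0] + 1  # London=1, Berlin=2, New York=3, Tokyo=4

# Step 2: number of bits needed to represent the largest integer
n_bits = int(ordinal.max()).bit_length()

# Step 3: one column per bit, most significant bit first
bits = {
    f"City_{i}": (ordinal >> (n_bits - 1 - i)) & 1
    for i in range(n_bits)
}
binary_df = pd.DataFrame(bits, index=cities.index)
print(binary_df)
```

Four categories fit in three columns here; one-hot encoding would need four, and the gap widens quickly as cardinality grows.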



### 🎯 5. **Target Encoding (Mean Encoding)**

* **What it does:** Replaces categories with the mean of the target variable for that category.
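* **Best for:** Categorical features that are strongly related to the target. Compute the means on training data only (ideally out-of-fold) to avoid leaking the target into the feature.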
* **Example:**
  `['A', 'B'] → [0.5, 0.7] (based on average target value for each)`


In [29]:
df = sns.load_dataset('titanic')
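# Note: the means are computed on the full dataset, which leaks the target; acceptable for illustration only
df['sex_encoded'] = df['sex'].map(df.groupby("sex")['survived'].mean())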
df

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,sex_encoded
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False,0.188908
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,0.742038
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True,0.742038
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False,0.742038
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True,0.188908
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True,0.188908
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True,0.742038
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False,0.742038
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True,0.188908
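Because each row's own target value contributes to the mean it receives, the encoding above can overfit. One common mitigation is out-of-fold encoding, sketched below on a small hypothetical dataset with a simple deterministic fold assignment; the column name `sex_te_oof` and the fold scheme are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Titanic dataset
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "female", "male", "male", "female"],
    "survived": [0, 1, 1, 0, 1, 1, 0, 0],
})

n_folds = 4
fold_id = np.arange(len(toy)) % n_folds  # row i goes to fold i % n_folds

toy["sex_te_oof"] = np.nan
for k in range(n_folds):
    held_out = fold_id == k
    train = toy[~held_out]

    # Category means come only from rows outside the held-out fold
    fold_means = train.groupby("sex")["survived"].mean()

    # Categories absent from the training part fall back to the training mean
    toy.loc[held_out, "sex_te_oof"] = (
        toy.loc[held_out, "sex"].map(fold_means).fillna(train["survived"].mean())
    )

print(toy)
```

Each row is encoded with statistics that never saw its own label, so the values differ slightly from the naive per-category means (0.25 for male and 0.75 for female on this toy data).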



---

### 📊 6. **Frequency / Count Encoding**

* **What it does:** Replaces categories with their frequency/count in the dataset.
* **Example:**
  `['Red', 'Red', 'Blue'] → [2, 2, 1]`


In [30]:
# Frequency / Count Encoding
count_encoded = df['embarked'].value_counts()
df['embarked_count_encoded'] = df['embarked'].map(count_encoded)

# View result
print(df.head())


   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  sex_encoded  \
0    man        True  NaN  Southampton    no  False     0.188908   
1  woman       False    C    Cherbourg   yes  False     0.742038   
2  woman       False  NaN  Southampton   yes   True     0.742038   
3  woman       False    C  Southampton   yes  False     0.742038   
4    man        True  NaN  Southampton    no   True     0.188908   

   embarked_count_encoded  
0                   644.0  
1                   168.0  
2                   644.0  
3                   64
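The float counts in the output above are consistent with missing values in `embarked`: `value_counts()` skips missing entries by default, so those rows map to `NaN` and the column becomes float. The sketch below uses a hypothetical toy series to show this, along with a normalized (relative frequency) variant.

```python
import pandas as pd

# Toy data with one missing value
colors = pd.Series(["Red", "Red", "Blue", None], name="color")

# Count encoding: missing values are excluded from the counts and map to NaN
counts = colors.value_counts()
count_encoded = colors.map(counts)
print(count_encoded.tolist())

# Frequency encoding: share of non-missing rows in each category
freqs = colors.value_counts(normalize=True)
freq_encoded = colors.map(freqs)
print(freq_encoded.tolist())

# One possible convention: treat missing as count 0 and restore integer dtype
count_filled = count_encoded.fillna(0).astype(int)
print(count_filled.tolist())
```

Filling with 0 is only one choice; counting missing values as their own category via `value_counts(dropna=False)` is another.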


---

### 🧮 7. **Hash Encoding (Hashing)**

* **What it does:** Uses a hash function to map categories to numerical space.
* **Best for:** Very high cardinality features; efficient in terms of memory.

---


In [31]:
import pandas as pd
import seaborn as sns
import category_encoders as ce

# Load dataset
df = sns.load_dataset('titanic')[['embarked']].dropna()

# Instantiate HashingEncoder with 4 output dimensions
encoder = ce.HashingEncoder(cols=['embarked'], n_components=4)  # You can change n_components

# Transform the data
hashed_df = encoder.fit_transform(df)

# Display result
print(hashed_df.head())


   col_0  col_1  col_2  col_3
0      0      0      1      0
1      0      0      0      1
2      0      0      1      0
3      0      0      1      0
4      0      0      1      0
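The idea behind the hashing trick can be sketched with the standard library. The helpers `hash_bucket` and `hash_encode` below are hypothetical names for illustration and are not claimed to match `HashingEncoder` exactly: each category is hashed, and the hash modulo `n_components` selects the column that receives a 1.

```python
import hashlib

import pandas as pd


def hash_bucket(value: str, n_components: int) -> int:
    """Map a category string to a column index using a stable hash."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_components


def hash_encode(values, n_components=4):
    """Return a DataFrame with one indicator column per hash bucket."""
    rows = []
    for v in values:
        row = [0] * n_components
        row[hash_bucket(str(v), n_components)] = 1
        rows.append(row)
    return pd.DataFrame(rows, columns=[f"col_{i}" for i in range(n_components)])


ports = ["S", "C", "S", "Q"]
hashed = hash_encode(ports, n_components=4)
print(hashed)
```

The number of output columns is fixed in advance and no category-to-column mapping has to be stored, which is what keeps memory use low. The trade-off is that different categories can collide in the same column.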



### 📈 Summary Table

| Encoding Type      | Best For              | Output Form      | Risk of Overfitting |
| ------------------ | --------------------- | ---------------- | ------------------- |
| Label Encoding     | Target labels, binary | Single integer   | Low                 |
| One-Hot Encoding   | Nominal data          | Many binary cols | Medium (high dims)  |
| Ordinal Encoding   | Ordered categories    | Single integer   | Low                 |
| Binary Encoding    | High-cardinality      | Binary columns   | Low-Medium          |
| Target Encoding    | Categorical w/target  | Mean value       | High (need CV)      |
| Frequency Encoding | Any categorical       | Count/integer    | Medium              |
| Hash Encoding      | Very high-cardinality | Fixed # of cols  | Low-Medium          |

