Date : 03/02/2026

### What is Feature Encoding?

Feature encoding is the process of converting categorical (text) data into numerical form so that Machine Learning models can understand and process it.

Most machine learning algorithms operate on numeric input, not raw text.

Example:

Gender → Male, Female  
City → Pune, Mumbai, Delhi

These must be converted into numbers.

#### 📌 Why do we use encoding here?

Because ML models:  
❌ cannot understand text like "Male", "Female"  
✅ understand only numbers like 0, 1

So we convert categorical data → numerical data.
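
As a minimal sketch of that conversion, the snippet below maps text categories to integers by hand with `Series.map`. The toy DataFrame and the `gender_map` dictionary are assumptions made for illustration; they are not part of the Social_Network_Ads dataset used later.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the idea
toy = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# Hand-written mapping; which category gets which integer is an arbitrary choice
gender_map = {"Female": 0, "Male": 1}
toy["Gender_num"] = toy["Gender"].map(gender_map)

print(toy)
```

The encoders in the following sections automate this mapping so it does not have to be written by hand.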

In [10]:
import pandas as pd
import numpy as np

In [11]:
df = pd.read_csv("Social_Network_Ads.csv")
df

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,Male,19,19000,0
1,15810944,Male,35,20000,0
2,15668575,Female,26,43000,0
3,15603246,Female,27,57000,0
4,15804002,Male,19,76000,0
...,...,...,...,...,...
395,15691863,Female,46,41000,1
396,15706071,Male,51,23000,1
397,15654296,Female,50,20000,1
398,15755018,Male,36,33000,0


In [12]:
df['Gender'].value_counts()

Gender
Female    204
Male      196
Name: count, dtype: int64

### 📌 Label Encoding

Label Encoding is a feature encoding technique that converts categorical (text) values such as Male/Female or Yes/No into integers.  
Each category is replaced by one integer, so the result stays in a single column.  

##### How Label Encoding Works

- Each unique category is assigned a number.
- scikit-learn's `LabelEncoder` sorts the unique categories and numbers them from 0 in that order, so here Female becomes 0 and Male becomes 1.
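
The numbering can be checked directly on a fitted encoder. This is a small sketch on a hypothetical list of labels rather than the notebook's DataFrame; `classes_` and `inverse_transform` are standard `LabelEncoder` members.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels for illustration
labels = ["Male", "Female", "Female", "Male"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the learned categories in sorted order;
# the position of a category in this array is its integer code
print(list(encoder.classes_))
print(codes.tolist())

# inverse_transform maps the integers back to the original text
print(list(encoder.inverse_transform(codes)))
```

Inspecting `classes_` is a quick way to confirm which integer stands for which category before interpreting an encoded column.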

##### When to Use Label Encoding  

Label Encoding is used when:

- The categorical data is binary (Yes/No, Male/Female).
- The data is ordinal (Low, Medium, High).
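- There is a meaningful order between categories and the assigned integers follow it. `LabelEncoder` numbers categories alphabetically, so Low, Medium, High would become High = 0, Low = 1, Medium = 2 unless the order is set explicitly.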
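
For ordinal data, one way to make the integers respect the real order is scikit-learn's `OrdinalEncoder` with an explicit `categories` list. This swaps in a different class from the `LabelEncoder` used in this notebook, and the `Level` column below is a hypothetical example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal column for illustration
levels = pd.DataFrame({"Level": ["Low", "High", "Medium", "Low"]})

# The categories list fixes the order: Low -> 0, Medium -> 1, High -> 2
ordinal = OrdinalEncoder(categories=[["Low", "Medium", "High"]])

# OrdinalEncoder expects 2D input and returns a 2D float array
levels["Level_num"] = ordinal.fit_transform(levels[["Level"]]).ravel().astype(int)

print(levels)
```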

Label Encoding is not suitable for categorical data without order (City, Country, Color); use One Hot Encoding instead.

In [13]:
# importing the class
from sklearn.preprocessing import LabelEncoder

In [14]:
# create an object 
le = LabelEncoder()

In [15]:
newdf = df.copy()   # we are creating a copy of the original dataframe to avoid modifying it directly.

In [16]:
newdf['Gender'] = le.fit_transform(newdf['Gender'])

In [17]:
newdf

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,1,19,19000,0
1,15810944,1,35,20000,0
2,15668575,0,26,43000,0
3,15603246,0,27,57000,0
4,15804002,1,19,76000,0
...,...,...,...,...,...
395,15691863,0,46,41000,1
396,15706071,1,51,23000,1
397,15654296,0,50,20000,1
398,15755018,1,36,33000,0


### 📌 One Hot Encoding  

What is One Hot Encoding?

One Hot Encoding is a feature encoding technique used to convert categorical data into binary (0 or 1) columns.

Each category becomes a separate column with value:  
1 → category present  
0 → category not present  

Because each category gets its own column, no numeric ranking between categories is implied.

#### When to Use One Hot Encoding

One Hot Encoding is used when:

- Categories have no order (City, Country, Color).
- There are more than 2 categories.
- You want to avoid false ranking.
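
A short sketch of the multi-category case, using the City example from the introduction. The toy DataFrame is an assumption for illustration; `dtype=int` asks `pd.get_dummies` for 0/1 integers instead of booleans.

```python
import pandas as pd

# Hypothetical nominal column with three categories
cities = pd.DataFrame({"City": ["Pune", "Mumbai", "Delhi", "Pune"]})

# One new column per category, named <column>_<category>
one_hot = pd.get_dummies(cities, columns=["City"], dtype=int)

print(one_hot)
```

Each row has a single 1, in the column of its own city, so no city is treated as larger or smaller than another.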

In [18]:
# pd.get_dummies() converts categorical (text) columns into numerical dummy variables (0/1) for machine learning.
pd.get_dummies(df)

Unnamed: 0,User ID,Age,EstimatedSalary,Purchased,Gender_Female,Gender_Male
0,15624510,19,19000,0,False,True
1,15810944,35,20000,0,False,True
2,15668575,26,43000,0,True,False
3,15603246,27,57000,0,True,False
4,15804002,19,76000,0,False,True
...,...,...,...,...,...,...
395,15691863,46,41000,1,True,False
396,15706071,51,23000,1,False,True
397,15654296,50,20000,1,True,False
398,15755018,36,33000,0,False,True
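
In the output above, `Gender_Female` and `Gender_Male` carry the same information, since one is always the opposite of the other. A possible variation, sketched here on a hypothetical three-row sample rather than the full dataset, limits the encoding to chosen columns, drops the first dummy column, and returns integers:

```python
import pandas as pd

# Hypothetical sample shaped like a slice of the dataset
sample = pd.DataFrame({
    "Gender": ["Male", "Male", "Female"],
    "Age": [19, 35, 26],
})

encoded = pd.get_dummies(
    sample,
    columns=["Gender"],   # encode only this column
    drop_first=True,      # drop Gender_Female, keep Gender_Male
    dtype=int,            # 0/1 instead of False/True
)

print(encoded)
```

Whether to drop a column depends on the model; this is shown as an option, not as a required step.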


| Feature         | Label Encoding                      | One Hot Encoding                      |
| --------------- | ----------------------------------- | ------------------------------------- |
| Output format   | Single numeric column               | Multiple binary columns               |
| Example         | Female = 0, Male = 1                | Female = [1,0], Male = [0,1]          |
| Creates order   | Yes (may be false order)            | No order created                      |
| Best for        | Binary or ordinal data              | Nominal data                          |
| Memory usage    | Low                                 | High                                  |
| Columns created | 1                                   | One per category                      |
| Risk            | Model assumes ranking               | No ranking problem                    |
| Used when       | Data has order or only 2 categories | Data has no order and many categories |
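
The "Creates order" and "Columns created" rows can be seen side by side in a small sketch. The City values are hypothetical, and `LabelEncoder` is applied to a nominal column here only to show the false ranking it produces.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical nominal data with three categories
cities = pd.Series(["Pune", "Mumbai", "Delhi", "Pune", "Delhi"], name="City")

# Label encoding: one column of integers, Delhi=0 < Mumbai=1 < Pune=2
label_codes = LabelEncoder().fit_transform(cities)

# One hot encoding: one 0/1 column per category, no ranking
one_hot = pd.get_dummies(cities, prefix="City", dtype=int)

print("label encoded:", label_codes.tolist())
print("one hot shape:", one_hot.shape)
```

The label-encoded version suggests Pune is "greater than" Mumbai, which has no meaning for cities, while the one-hot version pays for avoiding this with three columns instead of one.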