# 🔢 Transformations for Categorical Data

Most machine learning algorithms expect **numerical input**, so categorical features must be encoded as numbers before training.  
The right encoding depends on the type of categorical variable: how many categories it has and whether they have a natural order.

## 🔹 0–1 Transformation (Binary Encoding)

- For **binary categorical variables** (two categories only).  
- Assigns `0` to one category and `1` to the other; which category gets `1` is an arbitrary choice.

Example:  
- Gender → {Male = 0, Female = 1}  

**Pros**: Simple, efficient.  
**Cons**: Only works for binary features.
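
As a minimal sketch of the mapping in the example above, the encoding can be written as a plain dictionary lookup. The sample values and the names `genders` and `mapping` are illustrative assumptions, not taken from any dataset used here.

```python
# Hypothetical sample values; which category gets 0 or 1 is an arbitrary choice.
genders = ["Male", "Female", "Female", "Male"]
mapping = {"Male": 0, "Female": 1}

# Replace each category by its assigned bit.
encoded = [mapping[g] for g in genders]
print(encoded)  # [0, 1, 1, 0]
```

Writing the mapping out explicitly keeps the direction of the encoding visible, which a fitted encoder would otherwise decide on its own.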

## 🔹 1 and Others Transformation (Dummy Variable)

- Converts a category of interest into `1`, and all other categories into `0`.  
- Useful when there is one **dominant class** vs. "others".

Example:  
- Fruit → {Apple = 1, Others = 0}  

**Pros**: Useful for highlighting one class.  
**Cons**: Loses information about other classes.

## 🔹 Multi-Class Transformation (Label Encoding)

- Assigns an **integer value** to each category.  

Example:  
- Fruit → {Apple = 0, Banana = 1, Cherry = 2}  

**Pros**: Simple, compact representation.  
**Cons**: Implies an **ordinal relationship** that may not exist: a model may read the codes 0 < 1 < 2 as Apple < Banana < Cherry.
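
A minimal pure-Python sketch of label encoding is shown here, assuming the codes are assigned in alphabetical order of the category names. The sample list and variable names are hypothetical.

```python
# Hypothetical sample values.
fruits = ["Banana", "Apple", "Cherry", "Apple"]

# Assumption: number the distinct categories in alphabetical order.
categories = sorted(set(fruits))  # ['Apple', 'Banana', 'Cherry']
code_of = {category: code for code, category in enumerate(categories)}

# Encode, then decode again to show that the mapping is reversible.
labels = [code_of[fruit] for fruit in fruits]
decoded = [categories[code] for code in labels]
print(labels)  # [1, 0, 2, 0]
```

The integers only identify the categories; any arithmetic on them (for example, treating Cherry as "twice" Banana) has no meaning for nominal data.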

## 🔹 One-Hot Transformation (One-Hot Encoding)

- Creates a **binary column for each category**.  
- A row has `1` in the column of its category, and `0` in others.  

Example:  
- Fruit → {Apple, Banana, Cherry}  

| Apple | Banana | Cherry |
|-------|--------|--------|
|   1   |   0    |   0    |
|   0   |   1    |   0    |
|   0   |   0    |   1    |

**Pros**: No artificial ordering, works well for nominal data.  
**Cons**: Adds one column per category, so features with many categories inflate the dimensionality of the data (curse of dimensionality).
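
The table above can be reproduced with a small pure-Python sketch. The sample list is hypothetical, and the column order is assumed to be alphabetical.

```python
# Hypothetical sample values.
fruits = ["Apple", "Banana", "Cherry", "Apple"]

# One column per distinct category (assumed alphabetical column order).
categories = sorted(set(fruits))  # ['Apple', 'Banana', 'Cherry']

# Each row gets a 1 in the column of its own category and 0 elsewhere.
one_hot = [[1 if fruit == category else 0 for category in categories]
           for fruit in fruits]

for row in one_hot:
    print(row)
```

Every row contains exactly one `1`, and the number of columns grows with the number of distinct categories.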

---


#### » Load the "tips" dataset

In [7]:
import seaborn as sns

# Load the example dataset and work on a copy so the original stays untouched
tips = sns.load_dataset("tips")
df = tips.copy()
df.head()

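|   | total_bill | tip | sex | smoker | day | time | size |
|---|------------|-----|-----|--------|-----|------|------|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |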


## 0–1 Transformation (Binary Encoding)

In [8]:
from sklearn.preprocessing import LabelEncoder

# LabelEncoder numbers the classes in sorted order: Female -> 0, Male -> 1
lbe = LabelEncoder()
df["new_sex"] = lbe.fit_transform(df["sex"])
df

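|   | total_bill | tip | sex | smoker | day | time | size | new_sex |
|---|------------|-----|-----|--------|-----|------|------|---------|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 1 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 1 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 1 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 1 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 | 0 |
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | 1 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 | 1 |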


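#### Alternatively, use the pandas category codes

In [10]:
# cat.codes follows the column's own category order (here Male -> 0, Female -> 1),
# so the result is the reverse of the LabelEncoder mapping used for new_sex
df["new2_sex"] = df["sex"].cat.codes
df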

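|   | total_bill | tip | sex | smoker | day | time | size | new_sex | new2_sex |
|---|------------|-----|-----|--------|-----|------|------|---------|----------|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0 | 1 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 1 | 0 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 1 | 0 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 1 | 0 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 1 | 0 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 | 0 | 1 |
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | 1 | 0 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 | 1 | 0 |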

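The `new_sex` and `new2_sex` columns disagree because the two approaches number the categories differently. The sketch below reproduces this on a small hypothetical series, assuming the `sex` column stores its categories in the order `Male`, `Female` (which the `cat.codes` output above suggests).

```python
import pandas as pd

# Hypothetical three-row series; the category order Male, Female is an assumption
# that mirrors what the cat.codes output for the tips data suggests.
sex = pd.Series(pd.Categorical(["Female", "Male", "Male"],
                               categories=["Male", "Female"]))

# cat.codes numbers the values by position in the category list.
category_codes = sex.cat.codes.tolist()

# Reordering the categories alphabetically flips the codes.
alphabetical_codes = sex.cat.reorder_categories(["Female", "Male"]).cat.codes.tolist()

print(category_codes)      # [1, 0, 0]
print(alphabetical_codes)  # [0, 1, 1]
```

Either numbering is a valid 0/1 encoding; what matters is using the same one consistently for training and prediction data.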

## 1 and Others Transformation (Dummy Variable)

In [14]:
import numpy as np

# 1 for Sunday, 0 for every other day (exact match instead of a substring search)
df["new_day"] = np.where(df["day"] == "Sun", 1, 0)
df

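|   | total_bill | tip | sex | smoker | day | time | size | new_sex | new2_sex | new_day |
|---|------------|-----|-----|--------|-----|------|------|---------|----------|---------|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0 | 1 | 1 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 1 | 0 | 1 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 1 | 0 | 1 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 1 | 0 | 1 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 1 | 0 | 0 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 | 0 | 1 | 0 |
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | 1 | 0 | 0 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 | 1 | 0 | 0 |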


## Multi-Class Transformation (Label Encoding)

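In [15]:
from sklearn.preprocessing import LabelEncoder

# Codes follow alphabetical order (Fri -> 0, Sat -> 1, Sun -> 2, Thur -> 3),
# not the order of the days in the week
lbe = LabelEncoder()
df["new2_day"] = lbe.fit_transform(df["day"])
df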

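|   | total_bill | tip | sex | smoker | day | time | size | new_sex | new2_sex | new_day | new2_day |
|---|------------|-----|-----|--------|-----|------|------|---------|----------|---------|----------|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0 | 1 | 1 | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 1 | 0 | 1 | 2 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 1 | 0 | 1 | 2 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 1 | 0 | 1 | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 0 | 1 | 1 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 1 | 0 | 0 | 1 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 | 0 | 1 | 0 | 1 |
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | 1 | 0 | 0 | 1 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 | 1 | 0 | 0 | 1 |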

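When the integer codes should reflect a meaningful order, such as the order of the days in the week, the order can be stated explicitly instead of relying on alphabetical sorting. This is a sketch on a hypothetical four-value series; the list `week_order` is an assumption about the intended ordering.

```python
import pandas as pd

# Hypothetical sample values.
day = pd.Series(["Sun", "Sat", "Thur", "Fri"])

# Assumed intended order of the categories.
week_order = ["Thur", "Fri", "Sat", "Sun"]

# An ordered Categorical assigns codes by position in week_order.
ordered_day = pd.Categorical(day, categories=week_order, ordered=True)
codes = ordered_day.codes.tolist()
print(codes)  # [3, 2, 0, 1]
```

With this mapping a larger code does mean a later day, so the ordinal reading of the integers is at least consistent with the data.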

## One-Hot Transformation (One-Hot Encoding)

In [20]:
import pandas as pd

# One 0/1 column per category; the original "sex" column is replaced by them
df_one_hot = pd.get_dummies(df, columns=["sex"], dtype=int)
df_one_hot

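|   | total_bill | tip | smoker | day | time | size | new_sex | new2_sex | new_day | new2_day | sex_Male | sex_Female |
|---|------------|-----|--------|-----|------|------|---------|----------|---------|----------|----------|------------|
| 0 | 16.99 | 1.01 | No | Sun | Dinner | 2 | 0 | 1 | 1 | 2 | 0 | 1 |
| 1 | 10.34 | 1.66 | No | Sun | Dinner | 3 | 1 | 0 | 1 | 2 | 1 | 0 |
| 2 | 21.01 | 3.50 | No | Sun | Dinner | 3 | 1 | 0 | 1 | 2 | 1 | 0 |
| 3 | 23.68 | 3.31 | No | Sun | Dinner | 2 | 1 | 0 | 1 | 2 | 1 | 0 |
| 4 | 24.59 | 3.61 | No | Sun | Dinner | 4 | 0 | 1 | 1 | 2 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | No | Sat | Dinner | 3 | 1 | 0 | 0 | 1 | 1 | 0 |
| 240 | 27.18 | 2.00 | Yes | Sat | Dinner | 2 | 0 | 1 | 0 | 1 | 0 | 1 |
| 241 | 22.67 | 2.00 | Yes | Sat | Dinner | 2 | 1 | 0 | 0 | 1 | 1 | 0 |
| 242 | 17.82 | 1.75 | No | Sat | Dinner | 2 | 1 | 0 | 0 | 1 | 1 | 0 |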


In [27]:
pd.get_dummies(df, columns=["day"], dtype=int).head()

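|   | total_bill | tip | sex | smoker | time | size | new_sex | new2_sex | new_day | new2_day | day_Thur | day_Fri | day_Sat | day_Sun |
|---|------------|-----|-----|--------|------|------|---------|----------|---------|----------|----------|---------|---------|---------|
| 0 | 16.99 | 1.01 | Female | No | Dinner | 2 | 0 | 1 | 1 | 2 | 0 | 0 | 0 | 1 |
| 1 | 10.34 | 1.66 | Male | No | Dinner | 3 | 1 | 0 | 1 | 2 | 0 | 0 | 0 | 1 |
| 2 | 21.01 | 3.50 | Male | No | Dinner | 3 | 1 | 0 | 1 | 2 | 0 | 0 | 0 | 1 |
| 3 | 23.68 | 3.31 | Male | No | Dinner | 2 | 1 | 0 | 1 | 2 | 0 | 0 | 0 | 1 |
| 4 | 24.59 | 3.61 | Female | No | Dinner | 4 | 0 | 1 | 1 | 2 | 0 | 0 | 0 | 1 |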

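The one-hot columns of a single feature always sum to 1 in every row, so one of them carries no extra information. A common option, sketched here on a small hypothetical frame rather than on the tips data, is to drop the first level with `drop_first=True`. Because the `day` column in this sketch holds plain strings, the levels are sorted alphabetically and `Fri` is the one dropped.

```python
import pandas as pd

# Hypothetical five-row frame with a plain string column.
df_demo = pd.DataFrame({"day": ["Thur", "Fri", "Sat", "Sun", "Sun"]})

# drop_first=True removes the first level (Fri here), leaving k - 1 columns.
dummies = pd.get_dummies(df_demo, columns=["day"], drop_first=True, dtype=int)
print(dummies)
```

A row with zeros in all remaining columns then stands for the dropped category.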