# Target Guided Ordinal Encoding

Target guided ordinal encoding is a technique for encoding a categorical variable based on its relationship with the target variable. It is useful when a categorical variable has a large number of unique categories, where one-hot encoding would add one column per category, and we still want to use the variable as a feature in a machine learning model.

In target guided ordinal encoding, we replace each category with a numerical value derived from the target variable, such as the mean or median of the target for that category. The encoded values then rise and fall with the typical target value of each category, which gives a monotonic relationship between the encoded feature and the target and can improve the predictive power of the model.
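The idea can be sketched in a few lines of pandas. The helper `target_guided_ordinal` below is a hypothetical name, not a library function, and the `toy` data is made up for illustration. It sorts the categories by their aggregated target and assigns integer ranks, which is the "ordinal" form of the technique; the cells in this notebook use the category mean itself as the encoded value instead.

```python
import pandas as pd

def target_guided_ordinal(df, column, target, agg="mean"):
    """Map each category to its rank after sorting categories by the aggregated target."""
    # Aggregate the target per category ("mean" or "median")
    stats = df.groupby(column, observed=True)[target].agg(agg)
    # Sort categories from lowest to highest aggregate and number them 1..k
    ordered = stats.sort_values().index
    mapping = {category: rank for rank, category in enumerate(ordered, start=1)}
    return df[column].map(mapping), mapping

# Made-up example data
toy = pd.DataFrame({
    "size": ["S", "M", "L", "M", "S", "L"],
    "cost": [10, 20, 40, 30, 12, 50],
})
toy["size_rank"], rank_map = target_guided_ordinal(toy, "size", "cost")
print(rank_map)
```

Because the ranks follow the order of the per-category means, a higher rank always corresponds to a higher mean target in the data used to build the mapping.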

In [1]:
import pandas as pd

# create a sample dataframe with a categorical variable and a target variable
df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320]  # target variable
})

In [2]:
df

,city,price
0,New York,200
1,London,150
2,Paris,300
3,Tokyo,250
4,New York,180
5,Paris,320


In [4]:
# Calculate the mean price for each city
df.groupby('city')['price'].mean()  # if outliers are present, use the median instead

city
London      150.0
New York    190.0
Paris       310.0
Tokyo       250.0
Name: price, dtype: float64
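To see why the median is the safer choice when outliers are present, here is a small sketch on made-up data (the `prices` frame and its values are assumptions for illustration, not part of the dataset above). One extreme Paris price drags the mean far from the typical value, while the median barely moves.

```python
import pandas as pd

# Made-up data: the 3000 entry for Paris acts as an outlier
prices = pd.DataFrame({
    "city": ["Paris", "Paris", "Paris", "London", "London"],
    "price": [300, 320, 3000, 150, 170],
})

by_city = prices.groupby("city")["price"]
mean_by_city = by_city.mean().to_dict()      # Paris mean is pulled upward by the outlier
median_by_city = by_city.median().to_dict()  # Paris median stays at a typical price

print(mean_by_city)
print(median_by_city)
```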

In [8]:
# Convert the per-city means into a dictionary (city -> mean price)
mean_price = df.groupby('city')['price'].mean().to_dict()
mean_price

{'London': 150.0, 'New York': 190.0, 'Paris': 310.0, 'Tokyo': 250.0}

In [9]:
# Replace each city with its mean price
df['city_encoded'] = df['city'].map(mean_price)

In [10]:
df

,city,price,city_encoded
0,New York,200,190.0
1,London,150,150.0
2,Paris,300,310.0
3,Tokyo,250,250.0
4,New York,180,190.0
5,Paris,320,310.0
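One practical detail of `map` is that a category missing from the dictionary becomes `NaN`. The sketch below rebuilds the city data and encodes two new rows, one of which ('Berlin') never appears in the original frame. Falling back to the overall mean price is an assumption of this sketch, not something the technique prescribes, and the names `train`, `new_rows`, and `overall_mean` are illustrative.

```python
import pandas as pd

train = pd.DataFrame({
    "city": ["New York", "London", "Paris", "Tokyo", "New York", "Paris"],
    "price": [200, 150, 300, 250, 180, 320],
})
new_rows = pd.DataFrame({"city": ["Paris", "Berlin"]})  # 'Berlin' is not in train

mean_price = train.groupby("city")["price"].mean().to_dict()
overall_mean = train["price"].mean()  # fallback value (an assumption of this sketch)

# map() yields NaN for unseen categories; fillna() substitutes the fallback
new_rows["city_encoded"] = new_rows["city"].map(mean_price).fillna(overall_mean)
print(new_rows)
```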


In [11]:
import seaborn as sns

In [16]:
df = sns.load_dataset('tips')

In [17]:
df.head()

,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [18]:
# Assignment:
# Target -> total_bill
# Encoding on -> day

In [21]:
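# Calculate the mean total_bill for each day
# 'day' is a categorical column; observed=True groups only on the categories that occur
mean_total_bill = df.groupby('day', observed=True)['total_bill'].mean().to_dict()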

In [22]:
mean_total_bill

{'Thur': 17.682741935483868,
 'Fri': 17.15157894736842,
 'Sat': 20.44137931034483,
 'Sun': 21.41}

In [23]:
# Replace each day with its mean bill
# map() on a categorical column returns a categorical, so cast the result to float
df['day_encoded'] = df['day'].map(mean_total_bill).astype(float)

In [24]:
df

,total_bill,tip,sex,smoker,day,time,size,day_encoded
0,16.99,1.01,Female,No,Sun,Dinner,2,21.410000
1,10.34,1.66,Male,No,Sun,Dinner,3,21.410000
2,21.01,3.50,Male,No,Sun,Dinner,3,21.410000
3,23.68,3.31,Male,No,Sun,Dinner,2,21.410000
4,24.59,3.61,Female,No,Sun,Dinner,4,21.410000
...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,20.441379
240,27.18,2.00,Female,Yes,Sat,Dinner,2,20.441379
241,22.67,2.00,Male,Yes,Sat,Dinner,2,20.441379
242,17.82,1.75,Male,No,Sat,Dinner,2,20.441379
