## Data Encoding
1. Nominal/OHE Encoding
2. Label and Ordinal Encoding
3. Target Guided Ordinal Encoding

## Nominal/OHE Encoding
* One-hot encoding (OHE), also known as nominal encoding, is a technique that represents categorical data as numerical data, a form better suited to machine learning algorithms.
* In this technique, each category is represented as a binary vector in which each bit corresponds to one unique category.
* For example, if we have a categorical variable "color" with three possible values (red, green, blue), we can represent it with one-hot encoding as follows:
1. Red: [1, 0, 0]
2. Green: [0, 1, 0]
3. Blue: [0, 0, 1]
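
The mapping listed above can be sketched in plain Python to show what "one bit per category" means. This is only an illustration: the `one_hot` helper and the `CATEGORIES` order are names assumed here, not part of any library.

```python
# Category order taken from the example above (an assumption made for
# illustration, not something a library would choose for you).
CATEGORIES = ["red", "green", "blue"]


def one_hot(value, categories=CATEGORIES):
    """Return a list with a 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


vectors = {color: one_hot(color) for color in CATEGORIES}
print(vectors)
```

Note that scikit-learn's `OneHotEncoder` sorts the categories alphabetically by default, so its columns come out in the order blue, green, red rather than the order used in this sketch.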

In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [2]:
## create a dataframe
df = pd.DataFrame({'color':['red','green','blue','green','red','blue']})

In [3]:
df

| | color |
|---|---|
| 0 | red |
| 1 | green |
| 2 | blue |
| 3 | green |
| 4 | red |
| 5 | blue |


In [5]:
# we create an instance of OneHotEncoder
encoder = OneHotEncoder()

In [11]:
# perform fit and transform
encoded = encoder.fit_transform(df[['color']]).toarray()

In [12]:
encoder_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())

In [13]:
encoder_df


| | color_blue | color_green | color_red |
|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.0 | 1.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 |
| 3 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 0.0 | 1.0 |
| 5 | 1.0 | 0.0 | 0.0 |


In [16]:
## for new data
encoder.transform([['blue']]).toarray()



array([[1., 0., 0.]])
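
By default, `OneHotEncoder` raises an error if `transform` receives a category it did not see during `fit`. Here is a minimal sketch of one way to tolerate that, assuming an all-zero row is an acceptable representation for unknown values. `tolerant_encoder` and the `"purple"` value are hypothetical, introduced only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"color": ["red", "green", "blue", "green", "red", "blue"]})

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error (the default is handle_unknown="error").
tolerant_encoder = OneHotEncoder(handle_unknown="ignore")
tolerant_encoder.fit(df[["color"]])

# "purple" is a hypothetical value that was not present during fit.
new_data = pd.DataFrame({"color": ["blue", "purple"]})
encoded_new = tolerant_encoder.transform(new_data).toarray()
print(encoded_new)
```

Passing the new data as a DataFrame with the same column name keeps the feature names consistent between `fit` and `transform`.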

In [20]:
pd.concat([df,encoder_df],axis=1)

| | color | color_blue | color_green | color_red |
|---|---|---|---|---|
| 0 | red | 0.0 | 0.0 | 1.0 |
| 1 | green | 0.0 | 1.0 | 0.0 |
| 2 | blue | 1.0 | 0.0 | 0.0 |
| 3 | green | 0.0 | 1.0 | 0.0 |
| 4 | red | 0.0 | 0.0 | 1.0 |
| 5 | blue | 1.0 | 0.0 | 0.0 |
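
For quick exploration, the same one-hot columns can also be produced directly in pandas. This is an alternative sketch rather than the approach used in this notebook. Unlike a fitted `OneHotEncoder`, `pd.get_dummies` does not remember the categories, so it is less convenient for encoding new data later.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green", "red", "blue"]})

# dtype=float mirrors the 0.0/1.0 values produced by OneHotEncoder;
# without it, recent pandas versions return boolean columns.
dummies = pd.get_dummies(df, columns=["color"], dtype=float)
print(dummies)
```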


In [22]:
import seaborn as sns


In [23]:
df1 = sns.load_dataset('tips')

In [24]:
df1.head()

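| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |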


In [27]:
df1['time'].unique()

['Dinner', 'Lunch']
Categories (2, object): ['Lunch', 'Dinner']

In [28]:
df1_encoded = encoder.fit_transform(df1[['day']]).toarray()

In [29]:
encoded_df1 = pd.DataFrame(df1_encoded, columns=encoder.get_feature_names_out())

In [30]:
encoded_df1

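| | day_Fri | day_Sat | day_Sun | day_Thur |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 0.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... |
| 239 | 0.0 | 1.0 | 0.0 | 0.0 |
| 240 | 0.0 | 1.0 | 0.0 | 0.0 |
| 241 | 0.0 | 1.0 | 0.0 | 0.0 |
| 242 | 0.0 | 1.0 | 0.0 | 0.0 |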


In [34]:
encoder.transform([['Fri']]).toarray()



array([[1., 0., 0., 0.]])

In [35]:
pd.concat([df1['day'],encoded_df1],axis=1)

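| | day | day_Fri | day_Sat | day_Sun | day_Thur |
|---|---|---|---|---|---|
| 0 | Sun | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | Sun | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | Sun | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | Sun | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | Sun | 0.0 | 0.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... |
| 239 | Sat | 0.0 | 1.0 | 0.0 | 0.0 |
| 240 | Sat | 0.0 | 1.0 | 0.0 | 0.0 |
| 241 | Sat | 0.0 | 1.0 | 0.0 | 0.0 |
| 242 | Sat | 0.0 | 1.0 | 0.0 | 0.0 |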


## Label Encoding
* Label encoding and ordinal encoding are two techniques used to encode categorical data as numerical data.
* Label encoding involves assigning a unique numerical label to each category in the variable.
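* The labels are usually assigned in alphabetical order or based on the frequency of the categories.
* For example, if we have a categorical variable "color" with three possible values (red, green, blue), we can represent it using label encoding as follows (the numbers are illustrative; scikit-learn's `LabelEncoder` instead numbers the alphabetically sorted categories from 0, giving blue 0, green 1, red 2):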
  1. Red: 1
  2. Green: 2
  3. Blue: 3

In [4]:
df.head()

| | color |
|---|---|
| 0 | red |
| 1 | green |
| 2 | blue |
| 3 | green |
| 4 | red |


In [5]:
from sklearn.preprocessing import LabelEncoder

In [6]:
lbl_encoder = LabelEncoder()

In [7]:
# LabelEncoder expects 1-D input, so pass the Series rather than a one-column DataFrame
lbl_encoder.fit_transform(df['color'])


array([2, 1, 0, 1, 2, 0])

In [8]:
lbl_encoder.transform(['red'])


array([2])

In [9]:
lbl_encoder.transform(['blue'])


array([0])
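
The integer codes above follow the sorted order of the categories. A small sketch of how to inspect that mapping and reverse it, using the fitted encoder's `classes_` attribute and `inverse_transform`. The `mapping` dictionary is a helper built here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.Series(["red", "green", "blue", "green", "red", "blue"], name="color")

lbl_encoder = LabelEncoder()
codes = lbl_encoder.fit_transform(colors)  # 1-D input, as LabelEncoder expects

# classes_ holds the sorted categories; the index of each entry is its label.
mapping = {label: code for code, label in enumerate(lbl_encoder.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the integer codes.
decoded = lbl_encoder.inverse_transform(codes)
print(list(decoded))
```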

## Ordinal Encoding
* Ordinal encoding is used to encode categorical data that has an intrinsic order or ranking.
* In this technique, each category is assigned a numerical value based on its position in the order.
* For example, if we have a categorical variable "education level" with four possible values (high school, college, graduate, post-graduate), we can represent it using ordinal encoding as follows:
  1. High School: 1
  2. College: 2
  3. Graduate: 3
  4. Post-Graduate: 4
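
The education-level ranking above can be sketched with a plain dictionary and `Series.map`. The sample values and the `EDUCATION_RANK` name are hypothetical, made up for this illustration.

```python
import pandas as pd

# Hypothetical sample data; the ranks follow the ordering listed above.
education = pd.Series(
    ["College", "High School", "Post-Graduate", "Graduate", "College"],
    name="education_level",
)

EDUCATION_RANK = {
    "High School": 1,
    "College": 2,
    "Graduate": 3,
    "Post-Graduate": 4,
}

# map() looks up each value in the dictionary and returns its rank.
education_encoded = education.map(EDUCATION_RANK)
print(education_encoded.tolist())
```

An explicit dictionary makes the intended order visible in the code, which is the same idea as passing `categories=` to scikit-learn's `OrdinalEncoder`.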

In [11]:
## Ordinal Encoding
from sklearn.preprocessing import OrdinalEncoder

In [12]:
df = pd.DataFrame({
    'size': ['small','medium','large','medium','small','large']
})

In [13]:
df

| | size |
|---|---|
| 0 | small |
| 1 | medium |
| 2 | large |
| 3 | medium |
| 4 | small |
| 5 | large |


In [15]:
## create an instance of OrdinalEncoder and then perform fit_transform

encoder = OrdinalEncoder(categories=[['small','medium','large']])

In [16]:
encoder.fit_transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [1.],
       [0.],
       [2.]])

In [17]:
encoder.transform([['small']])



array([[0.]])

In [18]:
df1 = pd.DataFrame({
    'courses':['Excel','Sql','Python','ML','Python','Sql','Excel','ML']
})

In [19]:
df1

| | courses |
|---|---|
| 0 | Excel |
| 1 | Sql |
| 2 | Python |
| 3 | ML |
| 4 | Python |
| 5 | Sql |
| 6 | Excel |
| 7 | ML |


In [24]:
encoder1 = OrdinalEncoder(categories=[['Excel','Sql','Python','ML']])

In [26]:
t1 = encoder1.fit_transform(df1[['courses']])

In [29]:
df1['course_code'] = t1

In [30]:
df1

| | courses | course_code |
|---|---|---|
| 0 | Excel | 0.0 |
| 1 | Sql | 1.0 |
| 2 | Python | 2.0 |
| 3 | ML | 3.0 |
| 4 | Python | 2.0 |
| 5 | Sql | 1.0 |
| 6 | Excel | 0.0 |
| 7 | ML | 3.0 |


In [33]:
encoder1.transform([['ML']])



array([[3.]])
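
With the default settings, `OrdinalEncoder` raises an error when `transform` meets a category that is not in its list. Here is a hedged sketch of one way to handle that with the `handle_unknown="use_encoded_value"` option. The sentinel `-1`, the `tolerant_encoder` name, and the `"Tableau"` value are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df1 = pd.DataFrame({"courses": ["Excel", "Sql", "Python", "ML"]})

# unknown_value=-1 is an arbitrary sentinel chosen for this sketch; it must
# not collide with a real code (0..3 here).
tolerant_encoder = OrdinalEncoder(
    categories=[["Excel", "Sql", "Python", "ML"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
tolerant_encoder.fit(df1[["courses"]])

# "Tableau" is a hypothetical course that was not in the category list.
new_courses = pd.DataFrame({"courses": ["ML", "Tableau"]})
encoded_courses = tolerant_encoder.transform(new_courses)
print(encoded_courses)
```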

## Target Guided Ordinal Encoding
* It is a technique used to encode categorical variables based on their relationship with the target variable.
* This encoding technique is useful when we have a categorical variable with a large number of unique categories and we want to use it as a feature in our ML model.
* In target guided ordinal encoding, we replace each category in the categorical variable with a numerical value based on the mean or median of the target variable for that category.
* This creates a monotonic relationship between the categorical variable and the target variable, which can improve the predictive power of our model.

In [36]:
import pandas as pd

df = pd.DataFrame({'city':['Chennai','Delhi','Hyderabad','Bengaluru','Chennai','Hyderabad'],
                  'price':[200,150,300,250,180,320]
                  })

In [37]:
df

| | city | price |
|---|---|---|
| 0 | Chennai | 200 |
| 1 | Delhi | 150 |
| 2 | Hyderabad | 300 |
| 3 | Bengaluru | 250 |
| 4 | Chennai | 180 |
| 5 | Hyderabad | 320 |


In [39]:
mean_price = df.groupby('city')['price'].mean().to_dict()

In [40]:
mean_price

{'Bengaluru': 250.0, 'Chennai': 190.0, 'Delhi': 150.0, 'Hyderabad': 310.0}

In [41]:
df['city_encoded'] = df['city'].map(mean_price)

In [42]:
df

| | city | price | city_encoded |
|---|---|---|---|
| 0 | Chennai | 200 | 190.0 |
| 1 | Delhi | 150 | 150.0 |
| 2 | Hyderabad | 300 | 310.0 |
| 3 | Bengaluru | 250 | 250.0 |
| 4 | Chennai | 180 | 190.0 |
| 5 | Hyderabad | 320 | 310.0 |


In [43]:
df[['price','city_encoded']]

| | price | city_encoded |
|---|---|---|
| 0 | 200 | 190.0 |
| 1 | 150 | 150.0 |
| 2 | 300 | 310.0 |
| 3 | 250 | 250.0 |
| 4 | 180 | 190.0 |
| 5 | 320 | 310.0 |
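
The cells above map each city to its mean price. The description of the technique also mentions the median, and the per-category summary can be turned into integer ranks. Here is a sketch of both variants on the same city/price data. The ranking step and the `city_median` and `city_rank` column names are additions made for illustration, not part of the cells above.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Chennai", "Delhi", "Hyderabad", "Bengaluru", "Chennai", "Hyderabad"],
    "price": [200, 150, 300, 250, 180, 320],
})

# Step 1: summarise the target per category (median here; mean works the same way).
median_price = df.groupby("city")["price"].median()

# Step 2 (a variant for illustration): convert the summaries into integer
# ranks, 1 for the lowest median up to N for the highest.
city_rank = median_price.rank(method="dense").astype(int).to_dict()

df["city_median"] = df["city"].map(median_price)
df["city_rank"] = df["city"].map(city_rank)
print(df)
```

The ranks keep only the ordering of the categories by target value, while the median column also keeps the size of the gaps between them.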


In [44]:
import seaborn as sns

In [45]:
tips = sns.load_dataset('tips')

In [46]:
tips.head()

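| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |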


In [49]:
# 'time' is a categorical column; passing observed explicitly avoids the pandas FutureWarning
mean_bill = tips.groupby('time', observed=False)['total_bill'].mean()


In [50]:
mean_bill.head()

time
Lunch     17.168676
Dinner    20.797159
Name: total_bill, dtype: float64

In [51]:
tips['time_encoded'] = tips['time'].map(mean_bill)

In [54]:
tips[['total_bill','time_encoded']]

| | total_bill | time_encoded |
|---|---|---|
| 0 | 16.99 | 20.797159 |
| 1 | 10.34 | 20.797159 |
| 2 | 21.01 | 20.797159 |
| 3 | 23.68 | 20.797159 |
| 4 | 24.59 | 20.797159 |
| ... | ... | ... |
| 239 | 29.03 | 20.797159 |
| 240 | 27.18 | 20.797159 |
| 241 | 22.67 | 20.797159 |
| 242 | 17.82 | 20.797159 |
