## Data Encoding - Converting categorical values into numerical values so that a model can use the data

1. Nominal/OHE Encoding
2. Label and Ordinal Encoding
3. Target Guided Ordinal Encoding 

### Nominal/OHE Encoding
One-hot encoding, also known as nominal encoding, is a technique for representing categorical data as numerical data, which is more suitable for machine learning algorithms. Each category is represented as a binary vector in which every position corresponds to a unique category and exactly one position is set to 1. For example, a categorical variable "color" with three possible values (red, green, blue) can be one-hot encoded as follows:

1. Red: [1, 0, 0]
2. Green: [0, 1, 0]
3. Blue: [0, 0, 1]
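
The mapping above can be reproduced by hand. The following is a minimal sketch, assuming the category order red, green, blue; the names `categories`, `index`, and `one_hot` are hypothetical helpers, not part of any library.

```python
import numpy as np

# Hypothetical category list; the order fixes which position belongs to which color.
categories = ["red", "green", "blue"]
index = {name: i for i, name in enumerate(categories)}

def one_hot(value):
    """Return a binary vector with a single 1 at the position of `value`."""
    vec = np.zeros(len(categories), dtype=int)
    vec[index[value]] = 1
    return vec

for color in categories:
    print(color, one_hot(color).tolist())
```

scikit-learn's OneHotEncoder sorts the categories alphabetically, so in the cells that follow the columns come out in the order blue, green, red rather than red, green, blue.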

In [1]:
## Disadvantage of one-hot encoding: avoid it when a column has very many categories.
## Every category becomes its own feature, so the result is a wide, sparse matrix,
## and that many features can lead to overfitting of the model.
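
To make the size problem concrete, here is a small sketch that one-hot encodes a made-up high-cardinality column. The `user_id` column and its 1,000 values are assumptions invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical high-cardinality column: 1,000 distinct IDs, one row each.
ids = pd.DataFrame({"user_id": [f"user_{i}" for i in range(1000)]})

# fit_transform returns a SciPy sparse matrix by default.
wide = OneHotEncoder().fit_transform(ids[["user_id"]])

print(wide.shape)  # (1000, 1000): one new column per category
print(wide.nnz)    # 1000 stored non-zeros, one per row
```

Only one entry per row is non-zero, which is why the encoder returns a sparse matrix and why calling `.toarray()` on a very wide result can use a lot of memory.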

In [1]:

import pandas as pd 
from sklearn.preprocessing import OneHotEncoder

In [4]:
## Create a simple DataFrame 
df = pd.DataFrame({ "color" : ["red","green","blue","red","blue"] })
df

|   | color |
|---|-------|
| 0 | red |
| 1 | green |
| 2 | blue |
| 3 | red |
| 4 | blue |


In [5]:
## Create an instance of OneHotEncoder 
encoder = OneHotEncoder()

In [10]:
## perform fit and transform
## fit_transform returns a sparse matrix; .toarray() converts it to a dense array
encoded = encoder.fit_transform(df[["color"]]).toarray()
encoded

array([[0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

In [11]:
import pandas as pd 
encoder_df = pd.DataFrame(encoded,columns = encoder.get_feature_names_out())

In [12]:
encoder_df

|   | color_blue | color_green | color_red |
|---|------------|-------------|-----------|
| 0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.0 | 1.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 |
| 4 | 1.0 | 0.0 | 0.0 |


In [14]:
## for new data 
encoder.transform([["blue"]]).toarray()



array([[1., 0., 0.]])

In [17]:
encoder.transform([["green"]]).toarray()



array([[0., 1., 0.]])
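
With default settings, transforming a category that was not seen during fit raises an error. The sketch below shows one way to handle that, using the `handle_unknown="ignore"` option of OneHotEncoder; the value "purple" and the names `colors`, `safe_encoder`, and `new_rows` are assumptions made up for this example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

colors = pd.DataFrame({"color": ["red", "green", "blue", "red", "blue"]})

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error.
safe_encoder = OneHotEncoder(handle_unknown="ignore")
safe_encoder.fit(colors[["color"]])

# "purple" is a hypothetical category that was not present during fit.
new_rows = pd.DataFrame({"color": ["blue", "purple"]})
result = safe_encoder.transform(new_rows).toarray()
print(result)
```

The first row is the usual vector for blue; the second row is all zeros because "purple" matches none of the fitted categories.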

In [19]:
## Concatenating 
pd.concat([df,encoder_df],axis = 1)

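|   | color | color_blue | color_green | color_red |
|---|-------|------------|-------------|-----------|
| 0 | red | 0.0 | 0.0 | 1.0 |
| 1 | green | 0.0 | 1.0 | 0.0 |
| 2 | blue | 1.0 | 0.0 | 0.0 |
| 3 | red | 0.0 | 0.0 | 1.0 |
| 4 | blue | 1.0 | 0.0 | 0.0 |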


In [21]:
import seaborn as sns 
df1 = sns.load_dataset("tips")
df1.head()

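|   | total_bill | tip | sex | smoker | day | time | size |
|---|------------|-----|-----|--------|-----|------|------|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |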


In [22]:
encoder1 = OneHotEncoder()

In [28]:
encoded_1= encoder1.fit_transform(df1[["sex"]]).toarray()
encoded_1


array([[1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       ...
(output truncated)

In [29]:
encoded1_df = pd.DataFrame(encoded_1,columns = encoder1.get_feature_names_out())
encoded1_df


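|   | sex_Female | sex_Male |
|---|------------|----------|
| 0 | 1.0 | 0.0 |
| 1 | 0.0 | 1.0 |
| 2 | 0.0 | 1.0 |
| 3 | 0.0 | 1.0 |
| 4 | 1.0 | 0.0 |
| ... | ... | ... |
| 239 | 0.0 | 1.0 |
| 240 | 1.0 | 0.0 |
| 241 | 0.0 | 1.0 |
| 242 | 0.0 | 1.0 |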


In [30]:
encoder1.transform([["Male"]]).toarray()



array([[0., 1.]])

In [33]:
pd.concat([df1["sex"],encoded1_df],axis =1 )

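|   | sex | sex_Female | sex_Male |
|---|-----|------------|----------|
| 0 | Female | 1.0 | 0.0 |
| 1 | Male | 0.0 | 1.0 |
| 2 | Male | 0.0 | 1.0 |
| 3 | Male | 0.0 | 1.0 |
| 4 | Female | 1.0 | 0.0 |
| ... | ... | ... | ... |
| 239 | Male | 0.0 | 1.0 |
| 240 | Female | 1.0 | 0.0 |
| 241 | Male | 0.0 | 1.0 |
| 242 | Male | 0.0 | 1.0 |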


### Label Encoding 
Label encoding and ordinal encoding are two techniques used to encode categorical data as numerical data.

Label encoding assigns a unique integer label to each category in the variable. scikit-learn's LabelEncoder sorts the categories (alphabetically for strings) and numbers them starting from 0. For example, a categorical variable "color" with three possible values (red, green, blue) is label encoded as follows:

1. Blue: 0
2. Green: 1
3. Red: 2

In [34]:
df

|   | color |
|---|-------|
| 0 | red |
| 1 | green |
| 2 | blue |
| 3 | red |
| 4 | blue |


In [35]:
from sklearn.preprocessing import LabelEncoder
lbl_encoder = LabelEncoder()

In [43]:
## LabelEncoder expects a 1-D input, so pass the column itself rather than a one-column DataFrame
lbl_encoded=lbl_encoder.fit_transform(df["color"])
lbl_encoded

array([2, 1, 0, 2, 0])

In [39]:
lbl_encoder.transform(["blue"])

array([0])

In [40]:
lbl_encoder.transform(["green"])

array([1])

In [42]:
lbl_encoder.transform(["red"])

array([2])
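
The integer codes can be mapped back to the original labels. The following sketch uses the `classes_` attribute and `inverse_transform` method of LabelEncoder; the names `le` and `codes` are chosen only for this example.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["red", "green", "blue", "red", "blue"])

# classes_ holds the sorted categories; a category's position is its code.
print(le.classes_)

# inverse_transform turns codes back into the original labels.
print(le.inverse_transform(codes))
```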

In [None]:
## Problem with label encoding: here red = 2, green = 1, blue = 0, so an ML model may read
## the codes as an order, red > green > blue (2 > 1 > 0), even though colors have no ranking.
## When the categories do have a natural rank, use ordinal encoding and state the order explicitly.

### Ordinal Encoding -  based on ranks 
It is used to encode categorical data that have an intrinsic order or ranking. In this technique, each category is assigned a numerical value based on its position in the order. For example, if we have a categorical variable "education level" with four possible values (high school, college, graduate, post-graduate), we can represent it using ordinal encoding as follows:

1. High school: 1
2. College: 2
3. Graduate: 3
4. Post-graduate: 4
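
The same ranking can be sketched with plain pandas by declaring an ordered categorical. The sample `education` values are made up for illustration; `levels`, `ordered`, and `ranks` are hypothetical names.

```python
import pandas as pd

# The explicit order defines the ranking, lowest first.
levels = ["High school", "College", "Graduate", "Post-graduate"]
education = pd.Series(["College", "Post-graduate", "High school", "Graduate"])

ordered = pd.Categorical(education, categories=levels, ordered=True)

# .codes starts at 0, so add 1 to match the 1-to-4 numbering above.
ranks = pd.Series(ordered.codes + 1, index=education.index)
print(ranks.tolist())  # [2, 4, 1, 3]
```

scikit-learn's OrdinalEncoder, used in the cells that follow, numbers the categories from 0 instead of 1; only the relative order matters.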

In [45]:
## Ordinal Encoding
from sklearn.preprocessing import OrdinalEncoder

In [48]:
# create a sample dataframe with an ordinal variable
df = pd.DataFrame({
    'size': ['small', 'medium', 'large', 'medium', 'small', 'large']
})

In [49]:
df

|   | size |
|---|------|
| 0 | small |
| 1 | medium |
| 2 | large |
| 3 | medium |
| 4 | small |
| 5 | large |


In [50]:
## create an instance of OrdinalEncoder with the category order given explicitly, then fit_transform
encoder = OrdinalEncoder(categories=[["small","medium","large"]])

In [51]:
encoder.fit_transform(df[["size"]])

array([[0.],
       [1.],
       [2.],
       [1.],
       [0.],
       [2.]])

In [53]:
encoder.transform([["large"]])



array([[2.]])

In [58]:
df1 = pd.DataFrame({
    "degree" : ["College","Post-graduate","High school","College","Graduate","Post-graduate"]
})


In [59]:
encoder1 = OrdinalEncoder(categories=[["High school","College","Graduate","Post-graduate"]])

In [61]:
encoder1.fit_transform(df1[["degree"]])

array([[1.],
       [3.],
       [0.],
       [1.],
       [2.],
       [3.]])

In [62]:
encoder1.transform([["High school"]])



array([[0.]])

In [63]:
encoder1.transform([["College"]])



array([[1.]])

In [64]:
encoder1.transform([["Graduate"]])



array([[2.]])

In [65]:
encoder1.transform([["Post-graduate"]])



array([[3.]])

In [66]:
## Ordinal encoding - assign ranks 
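
As with one-hot encoding, an unseen category raises an error by default. A possible way to handle it, sketched here, is OrdinalEncoder's `handle_unknown="use_encoded_value"` together with `unknown_value`; the value "extra-large", the sentinel -1, and the names `sizes`, `enc`, and `new_sizes` are assumptions for this example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({"size": ["small", "medium", "large"]})

# Unseen categories are encoded with the sentinel value -1 instead of raising an error.
enc = OrdinalEncoder(
    categories=[["small", "medium", "large"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
enc.fit(sizes[["size"]])

# "extra-large" is a hypothetical category that was not present during fit.
new_sizes = pd.DataFrame({"size": ["large", "extra-large"]})
out = enc.transform(new_sizes)
print(out)
```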

### Target Guided Ordinal Encoding
It is a technique used to encode categorical variables based on their relationship with the target variable. This encoding technique is useful when we have a categorical variable with a large number of unique categories, and we want to use this variable as a feature in our machine learning model.

In Target Guided Ordinal Encoding, we replace each category in the categorical variable with a numerical value based on the mean or median of the target variable for that category. This creates a monotonic relationship between the categorical variable and the target variable, which can improve the predictive power of our model.
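
One variant of this idea replaces each category with its rank by mean target rather than with the mean itself. The sketch below assumes a small city/price table; `df_demo`, `city_means`, `rank_map`, and `city_rank` are hypothetical names.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "city": ["New York", "London", "Paris", "Tokyo", "New York", "Paris"],
    "price": [200, 150, 300, 250, 180, 320],
})

# Mean target per category, sorted from lowest to highest.
city_means = df_demo.groupby("city")["price"].mean().sort_values()

# Assign rank 1 to the category with the lowest mean, 2 to the next, and so on.
rank_map = {city: rank for rank, city in enumerate(city_means.index, start=1)}
df_demo["city_rank"] = df_demo["city"].map(rank_map)

print(rank_map)
print(df_demo)
```

Because the ranks follow the sorted means, a higher rank always corresponds to a higher average price, which is the monotonic relationship described above.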

In [68]:
# create a sample dataframe with a categorical variable and a target variable
df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320]
})

In [74]:
## target variable = price 
## categorical variable = city
df 

|   | city | price |
|---|------|-------|
| 0 | New York | 200 |
| 1 | London | 150 |
| 2 | Paris | 300 |
| 3 | Tokyo | 250 |
| 4 | New York | 180 |
| 5 | Paris | 320 |


In [71]:
## taking mean of target variable 
encoded = df.groupby("city")["price"].mean().to_dict()

In [72]:
encoded 

{'London': 150.0, 'New York': 190.0, 'Paris': 310.0, 'Tokyo': 250.0}

In [77]:
df["City_encoded"]=df["city"].map(encoded)
df["City_encoded"]

0    190.0
1    150.0
2    310.0
3    250.0
4    190.0
5    310.0
Name: City_encoded, dtype: float64

In [78]:
df[["city","price","City_encoded"]]

|   | city | price | City_encoded |
|---|------|-------|--------------|
| 0 | New York | 200 | 190.0 |
| 1 | London | 150 | 150.0 |
| 2 | Paris | 300 | 310.0 |
| 3 | Tokyo | 250 | 250.0 |
| 4 | New York | 180 | 190.0 |
| 5 | Paris | 320 | 310.0 |


In [79]:
## according to target variable price we have encoded city 
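
A category that is missing from the mapping becomes NaN after `.map()`. One simple option, sketched here as an assumption rather than a recommendation, is to fall back to the overall mean of the target. The city "Berlin" and the names `train`, `city_mean`, `global_mean`, and `new` are made up for this example.

```python
import pandas as pd

train = pd.DataFrame({
    "city": ["New York", "London", "Paris", "Tokyo", "New York", "Paris"],
    "price": [200, 150, 300, 250, 180, 320],
})

city_mean = train.groupby("city")["price"].mean()  # per-city mean of the target
global_mean = train["price"].mean()                # fallback for unknown cities

# "Berlin" is a hypothetical city with no entry in the mapping.
new = pd.DataFrame({"city": ["Paris", "Berlin"]})
new["city_encoded"] = new["city"].map(city_mean).fillna(global_mean)
print(new)
```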

In [92]:
tips_df = sns.load_dataset("tips")
tips_df

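|   | total_bill | tip | sex | smoker | day | time | size |
|---|------------|-----|-----|--------|-----|------|------|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |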


In [93]:
## time based on total bill
## target variable = total_bill
## categorical variable = time

In [100]:
## "time" is a categorical column; recent pandas versions warn when observed is not passed explicitly
mean_total_bill = tips_df.groupby("time", observed=False)["total_bill"].mean().to_dict()


In [101]:
mean_total_bill

{'Lunch': 17.168676470588235, 'Dinner': 20.79715909090909}

In [102]:
tips_df["time_encoded"] = tips_df["time"].map(mean_total_bill)
tips_df["time_encoded"]

0      20.797159
1      20.797159
2      20.797159
3      20.797159
4      20.797159
         ...    
239    20.797159
240    20.797159
241    20.797159
242    20.797159
243    20.797159
Name: time_encoded, Length: 244, dtype: category
Categories (2, float64): [17.168676, 20.797159]
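
The encoded column above still has dtype category because the source column is categorical. If a plain numeric column is wanted, a cast can be added after the mapping. The sketch below uses a tiny made-up stand-in for the tips data; `time`, `bills`, `means`, and `time_encoded` are hypothetical names.

```python
import pandas as pd

# Small stand-in for the categorical "time" column and the "total_bill" target.
time = pd.Series(["Dinner", "Lunch", "Dinner"], dtype="category")
bills = pd.Series([20.0, 10.0, 30.0])

# Mean bill per category.
means = bills.groupby(time, observed=True).mean().to_dict()

# Map the means onto the column, then cast to float to get a numeric column.
time_encoded = time.map(means).astype(float)
print(time_encoded)
```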

In [103]:
tips_df[["time","total_bill","time_encoded"]]

|   | time | total_bill | time_encoded |
|---|------|------------|--------------|
| 0 | Dinner | 16.99 | 20.797159 |
| 1 | Dinner | 10.34 | 20.797159 |
| 2 | Dinner | 21.01 | 20.797159 |
| 3 | Dinner | 23.68 | 20.797159 |
| 4 | Dinner | 24.59 | 20.797159 |
| ... | ... | ... | ... |
| 239 | Dinner | 29.03 | 20.797159 |
| 240 | Dinner | 27.18 | 20.797159 |
| 241 | Dinner | 22.67 | 20.797159 |
| 242 | Dinner | 17.82 | 20.797159 |
