#### Data Encoding

1. Nominal / One-Hot Encoding (OHE)
2. Label and Ordinal Encoding
3. Target Guided Encoding

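**Nominal / OHE Encoding**

One-hot encoding (OHE) creates one binary column per category of a nominal variable. It has some disadvantages:
1. If the variable has many categories, it creates an equally large number of features.
2. Sparse matrix: each row holds a single 1 and 0s everywhere else, so the feature space grows while each column carries little information, which can contribute to overfitting.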
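
To make the two disadvantages concrete, here is a minimal sketch that one-hot encodes a made-up column with 50 distinct categories and measures how sparse the result is. The column name `category`, the values, and the variable names are assumptions chosen for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column: 200 rows spread over 50 distinct categories
values = [f"cat_{i % 50}" for i in range(200)]
demo_df = pd.DataFrame({"category": values})

# By default OneHotEncoder returns a SciPy sparse matrix
demo_encoder = OneHotEncoder()
encoded_sparse = demo_encoder.fit_transform(demo_df[["category"]])

n_rows, n_cols = encoded_sparse.shape
# Fraction of entries that are non-zero (one per row)
density = encoded_sparse.nnz / (n_rows * n_cols)

print(n_cols)   # one column per distinct category
print(density)  # share of non-zero entries
```

Here a single input column becomes 50 features, and only 2% of the entries are non-zero.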


In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [2]:
# Create a simple DataFrame
df = pd.DataFrame({
    "color":["red","blue","green","red","blue"]
})

In [3]:
df.head()

,color
0,red
1,blue
2,green
3,red
4,blue


In [4]:
## Create an instance of OneHotEncoder
encoder = OneHotEncoder()

In [5]:
type(df[['color']])

pandas.core.frame.DataFrame

In [6]:
## Perform fit and then transform
encoded=encoder.fit_transform(df[['color']]).toarray()

In [7]:
encoder_df = pd.DataFrame(encoded,columns=encoder.get_feature_names_out())

In [8]:
encoder_df

,color_blue,color_green,color_red
0,0.0,0.0,1.0
1,1.0,0.0,0.0
2,0.0,1.0,0.0
3,0.0,0.0,1.0
4,1.0,0.0,0.0


In [9]:
# For new data: pass a DataFrame with the same column name to avoid a feature-name warning
encoder.transform(pd.DataFrame({'color': ['blue']})).toarray()



array([[1., 0., 0.]])
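
New data may contain a category the encoder never saw during `fit`, which raises an error with the default settings. The sketch below shows one way to handle this with the `handle_unknown="ignore"` option of `OneHotEncoder`; the category `purple` and the variable names are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_df = pd.DataFrame({"color": ["red", "blue", "green", "red", "blue"]})

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time
safe_encoder = OneHotEncoder(handle_unknown="ignore")
safe_encoder.fit(train_df[["color"]])

# "purple" is a hypothetical category that was not present during fit
new_df = pd.DataFrame({"color": ["blue", "purple"]})
new_encoded = safe_encoder.transform(new_df).toarray()

print(new_encoded)
```

The first row is the usual encoding for `blue`; the second row is all zeros because `purple` is unknown.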

In [10]:
pd.concat([df,encoder_df],axis=1)

,color,color_blue,color_green,color_red
0,red,0.0,0.0,1.0
1,blue,1.0,0.0,0.0
2,green,0.0,1.0,0.0
3,red,0.0,0.0,1.0
4,blue,1.0,0.0,0.0


In [11]:
import seaborn as sns

In [12]:
tips_df=sns.load_dataset('tips')

In [13]:
tips_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB


In [14]:
tips_df['sex'].value_counts()

sex
Male      157
Female     87
Name: count, dtype: int64

In [15]:
encoder = OneHotEncoder()

In [16]:
encoded = encoder.fit_transform(tips_df[['sex']]).toarray()

In [17]:
encoder_df = pd.DataFrame(encoded,columns=encoder.get_feature_names_out())

In [18]:
pd.concat([tips_df,encoder_df],axis=1).head(5)

,total_bill,tip,sex,smoker,day,time,size,sex_Female,sex_Male
0,16.99,1.01,Female,No,Sun,Dinner,2,1.0,0.0
1,10.34,1.66,Male,No,Sun,Dinner,3,0.0,1.0
2,21.01,3.5,Male,No,Sun,Dinner,3,0.0,1.0
3,23.68,3.31,Male,No,Sun,Dinner,2,0.0,1.0
4,24.59,3.61,Female,No,Sun,Dinner,4,1.0,0.0


**Label Encoding**

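Label encoding assigns a unique numerical value to each category. scikit-learn's LabelEncoder assigns the integers in sorted (alphabetical) order of the labels, and it is intended for encoding target labels rather than input features.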

In [19]:
df.head()

,color
0,red
1,blue
2,green
3,red
4,blue


In [20]:
from sklearn.preprocessing import LabelEncoder

In [21]:
le = LabelEncoder()

In [22]:
le.fit_transform(df['color'])

array([2, 0, 1, 2, 0])

In [23]:
## For any new value (LabelEncoder expects a 1D array-like)
le.transform(['red'])

array([2])

In [24]:
le.transform(['blue'])

array([0])
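
The integer codes above follow the sorted order of the labels, which is why `blue` maps to 0 and `red` to 2. A short sketch, using illustrative variable names, that inspects the learned mapping through `classes_` and reverses it with `inverse_transform`:

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "blue", "green", "red", "blue"]

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(colors)

# classes_ holds the labels in sorted order; the position is the code
print(label_encoder.classes_)
print(codes)

# inverse_transform maps the integer codes back to the original labels
decoded = label_encoder.inverse_transform(codes)
print(decoded)
```

Because the codes come from alphabetical order, they do not express any meaningful ranking between the colors.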

**Ordinal Encoding**

Ordinal encoding is used when the categories have a natural order and we want the encoded values to reflect that ranking (for example, small < medium < large).

In [25]:
from sklearn.preprocessing import OrdinalEncoder

In [26]:
df = pd.DataFrame({
    "size":["small","medium","large","medium","small","large"]
})

In [27]:
df

,size
0,small
1,medium
2,large
3,medium
4,small
5,large


In [29]:
df['size'].unique()

array(['small', 'medium', 'large'], dtype=object)

In [31]:
oe = OrdinalEncoder(categories=[['small','medium','large']])

In [32]:
oe.fit_transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [1.],
       [0.],
       [2.]])

In [33]:
oe.transform(pd.DataFrame({'size': ['small']}))



array([[0.]])

In [35]:
oe.transform(pd.DataFrame({'size': ['medium']}))



array([[1.]])
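
The explicit `categories` list is what gives the codes their rank. The sketch below contrasts it with the default behaviour (alphabetical order) and shows one possible way to handle unseen values with `handle_unknown="use_encoded_value"`. The value `extra-large` and the choice of `-1` as the unknown code are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

size_df = pd.DataFrame(
    {"size": ["small", "medium", "large", "medium", "small", "large"]}
)

# Without `categories`, codes follow alphabetical order:
# large=0, medium=1, small=2, which does not match the real ranking
default_oe = OrdinalEncoder()
default_codes = default_oe.fit_transform(size_df[["size"]]).ravel()
print(default_codes)

# Explicit order, plus a sentinel code (-1) for categories unseen during fit
ranked_oe = OrdinalEncoder(
    categories=[["small", "medium", "large"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
ranked_oe.fit(size_df[["size"]])

# "extra-large" is a hypothetical unseen category
new_sizes = pd.DataFrame({"size": ["large", "extra-large"]})
ranked_codes = ranked_oe.transform(new_sizes).ravel()
print(ranked_codes)
```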

**Target Guided Ordinal Encoding**

Target guided ordinal encoding is a technique for encoding a categorical variable based on its relationship with the target variable. It is useful when a categorical variable has a large number of unique categories and we want to use it as a feature in a machine learning model.

In target guided encoding, we replace each category in the categorical variable with a numerical value based on the mean or median of the target variable for that category.

In [37]:
df = pd.DataFrame({
    'city': ["New York","London","Paris","Tokyo","New York","Paris"],
    'price':[200,150,300,250,180,320]
})

In [38]:
df

,city,price
0,New York,200
1,London,150
2,Paris,300
3,Tokyo,250
4,New York,180
5,Paris,320


In [40]:
mean_price=df.groupby('city')['price'].mean().to_dict()

In [41]:
mean_price

{'London': 150.0, 'New York': 190.0, 'Paris': 310.0, 'Tokyo': 250.0}

In [46]:
df['city_encoded']=df['city'].map(mean_price)

In [47]:
df

,city,price,city_encoded
0,New York,200,190.0
1,London,150,150.0
2,Paris,300,310.0
3,Tokyo,250,250.0
4,New York,180,190.0
5,Paris,320,310.0
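
The description mentions the median as an alternative to the mean, and the name of the technique suggests an ordinal result. A minimal sketch of both ideas on the same data, assuming we rank cities by their median price (1 = lowest); the column names `city_median` and `city_rank` are made up for illustration.

```python
import pandas as pd

city_df = pd.DataFrame({
    "city": ["New York", "London", "Paris", "Tokyo", "New York", "Paris"],
    "price": [200, 150, 300, 250, 180, 320],
})

# Median of the target per category
median_price = city_df.groupby("city")["price"].median()

# Turn the medians into ordinal ranks: 1 = cheapest city
city_rank = median_price.rank(method="dense").astype(int).to_dict()

city_df["city_median"] = city_df["city"].map(median_price)
city_df["city_rank"] = city_df["city"].map(city_rank)
print(city_df)
```

With this data the ranking is London (1), New York (2), Tokyo (3), Paris (4).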


In [49]:
df[['city_encoded','price']]

,city_encoded,price
0,190.0,200
1,150.0,150
2,310.0,300
3,250.0,250
4,190.0,180
5,310.0,320
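
In the example the encoding is computed from every row, including the target of the row being encoded, so `city_encoded` closely tracks `price`. In practice the mapping is usually learned on training rows only and then applied to new rows, to avoid leaking the target into the feature. A sketch under that assumption, with a hypothetical test frame and a fallback to the overall training mean for unseen cities:

```python
import pandas as pd

train = pd.DataFrame({
    "city": ["New York", "London", "Paris", "Tokyo", "New York", "Paris"],
    "price": [200, 150, 300, 250, 180, 320],
})

# Hypothetical new rows; "Berlin" never appears in the training data
test = pd.DataFrame({"city": ["Paris", "Berlin"]})

# Learn the mapping from the training data only
city_means = train.groupby("city")["price"].mean()
global_mean = train["price"].mean()

# Unseen categories map to NaN, so fall back to the overall training mean
test["city_encoded"] = test["city"].map(city_means).fillna(global_mean)
print(test)
```

The fallback value is a design choice; other options include a dedicated sentinel value or smoothing the per-category mean toward the global mean.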
