
One Hot Encoding

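Nominal data is categorical data with no natural order or ranking, for example city names: ['Dhaka', 'Chittagong', 'Rajshahi'].
One-hot encoding suits nominal data because it creates one binary column per category and does not imply any order between categories.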
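
As a small illustration of the idea, the sketch below encodes the three city names mentioned above. The toy DataFrame and the `dtype=int` choice are assumptions made for readability; they are not part of the cars dataset used in the rest of the notebook.

```python
import pandas as pd

# Toy data (hypothetical): the three city names from the text, one repeated.
cities = pd.DataFrame({"city": ["Dhaka", "Chittagong", "Rajshahi", "Dhaka"]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(cities, columns=["city"], dtype=int)
print(encoded)
```

Each row has exactly one 1, in the column of its own city, so no city is treated as larger or smaller than another.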

In [51]:
import numpy as np
import pandas as pd

In [52]:
df=pd.read_csv('cars.csv')

In [53]:
df.head()

,brand,km_driven,fuel,owner,selling_price
0,Maruti,145500,Diesel,First Owner,450000
1,Skoda,120000,Diesel,Second Owner,370000
2,Honda,140000,Petrol,Third Owner,158000
3,Hyundai,127000,Diesel,First Owner,225000
4,Maruti,120000,Petrol,First Owner,130000


In [54]:
df.shape

(8128, 5)

In [55]:
df['brand'].value_counts()

brand,count
Maruti,2448
Hyundai,1415
Mahindra,772
Tata,734
Toyota,488
Honda,467
Ford,397
Chevrolet,230
Renault,228
Volkswagen,186


In [56]:
df['brand'].nunique()

32

In [57]:
df['fuel'].value_counts()

fuel,count
Diesel,4402
Petrol,3631
CNG,57
LPG,38


In [58]:
df['owner'].value_counts()

owner,count
First Owner,5289
Second Owner,2105
Third Owner,555
Fourth & Above Owner,174
Test Drive Car,5


One Hot Encoding using Pandas

In [59]:
# Convert each category in 'fuel' and 'owner' into its own indicator column.
# In the returned DataFrame the new columns replace the original 'fuel' and 'owner' columns; df itself is not modified.

pd.get_dummies(df, columns=['fuel', 'owner'])

,brand,km_driven,selling_price,fuel_CNG,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_First Owner,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,False,True,False,False,True,False,False,False,False
1,Skoda,120000,370000,False,True,False,False,False,False,True,False,False
2,Honda,140000,158000,False,False,False,True,False,False,False,False,True
3,Hyundai,127000,225000,False,True,False,False,True,False,False,False,False
4,Maruti,120000,130000,False,False,False,True,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,False,False,False,True,True,False,False,False,False
8124,Hyundai,119000,135000,False,True,False,False,False,True,False,False,False
8125,Maruti,120000,382000,False,True,False,False,True,False,False,False,False
8126,Tata,25000,290000,False,True,False,False,True,False,False,False,False


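K-1 encoding: drop_first=True removes one column per categorical feature, so a feature with k categories yields k-1 columns. This avoids the dummy variable trap (perfect multicollinearity), because the dropped category is already implied when all remaining columns are False.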

In [60]:
pd.get_dummies(df,columns=['fuel', 'owner'], drop_first=True)

,brand,km_driven,selling_price,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,True,False,False,False,False,False,False
1,Skoda,120000,370000,True,False,False,False,True,False,False
2,Honda,140000,158000,False,False,True,False,False,False,True
3,Hyundai,127000,225000,True,False,False,False,False,False,False
4,Maruti,120000,130000,False,False,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,False,False,True,False,False,False,False
8124,Hyundai,119000,135000,True,False,False,True,False,False,False
8125,Maruti,120000,382000,True,False,False,False,False,False,False
8126,Tata,25000,290000,True,False,False,False,False,False,False
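
The dummy variable trap can be checked numerically. The sketch below uses a hypothetical six-row `fuel` series and adds an intercept column, which is an assumption about how a linear model would use the features. With all k dummy columns the intercept equals their sum, so the matrix loses rank; with k-1 columns it does not.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): three fuel categories, two rows each.
fuel = pd.Series(["Diesel", "Petrol", "CNG", "Diesel", "Petrol", "CNG"], name="fuel")

def design_matrix(dummies):
    """Prepend an intercept column of ones to the dummy columns."""
    return np.column_stack([np.ones(len(dummies)), dummies.to_numpy(dtype=float)])

full = design_matrix(pd.get_dummies(fuel))                      # intercept + 3 dummies
reduced = design_matrix(pd.get_dummies(fuel, drop_first=True))  # intercept + 2 dummies

full_rank = np.linalg.matrix_rank(full)
reduced_rank = np.linalg.matrix_rank(reduced)
print("full:    columns =", full.shape[1], "rank =", full_rank)
print("reduced: columns =", reduced.shape[1], "rank =", reduced_rank)
```

The full matrix has four columns but rank three, which is the redundancy that drop_first=True removes.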


One Hot Encoding using Sklearn

In [61]:
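from sklearn.model_selection import train_test_split

# Features: the first four columns (brand, km_driven, fuel, owner).
# Target: the last column (selling_price).
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 0:4],
                                                    df.iloc[:, -1],
                                                    test_size=0.2,
                                                    random_state=0)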

In [62]:
X_train.head()

,brand,km_driven,fuel,owner
3042,Hyundai,60000,LPG,First Owner
1520,Tata,150000,Diesel,Third Owner
2611,Hyundai,110000,Diesel,Second Owner
3544,Mahindra,28000,Diesel,Second Owner
4138,Maruti,15000,Petrol,First Owner


In [63]:
X_test.head()

,brand,km_driven,fuel,owner
3558,Hyundai,40000,Diesel,First Owner
233,Mahindra,70000,Diesel,First Owner
7952,Maruti,5000,Petrol,First Owner
572,Maruti,120000,Petrol,Third Owner
6960,Lexus,20000,Petrol,First Owner


In [64]:
from sklearn.preprocessing import OneHotEncoder

In [67]:
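# drop='first' keeps k-1 columns per feature; sparse_output=False returns a dense NumPy array
# (the sparse_output parameter requires scikit-learn >= 1.2).
ohe = OneHotEncoder(drop='first', sparse_output=False)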

In [68]:
ohe.fit_transform(X_train[['fuel', 'owner']])

array([[0., 1., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 1.],
       [1., 0., 0., ..., 1., 0., 0.],
       ...,
       [0., 0., 1., ..., 1., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.]])

In [69]:
X_train_new = ohe.fit_transform(X_train[['fuel', 'owner']])

In [70]:
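# Reuse the categories learned from X_train; fit_transform here would refit the encoder on the test split.
X_test_new = ohe.transform(X_test[['fuel', 'owner']])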

In [71]:
X_train_new.shape

(6502, 7)

In [72]:
X_train[['brand','km_driven']].values

array([['Hyundai', 60000],
       ['Tata', 150000],
       ['Hyundai', 110000],
       ...,
       ['Hyundai', 90000],
       ['Volkswagen', 90000],
       ['Hyundai', 110000]], dtype=object)

In [73]:
np.hstack((X_train[['brand','km_driven']].values,X_train_new)).shape

(6502, 9)

One Hot Encoding with Top Categories

In [74]:
counts = df['brand'].value_counts()

In [77]:
df['brand'].nunique()
threshold = 100

In [78]:
# Brands whose count is less than or equal to the threshold (the rare ones).
repl = counts[counts <= threshold].index

In [99]:
#Replaces all brand names in repl (the rare ones) with 'uncommon'.
pd.get_dummies(df['brand'].replace(repl, 'uncommon')).sample(5)

,BMW,Chevrolet,Ford,Honda,Hyundai,Mahindra,Maruti,Renault,Skoda,Tata,Toyota,Volkswagen,uncommon
1572,False,False,False,False,False,False,False,False,False,False,True,False,False
7061,False,False,False,True,False,False,False,False,False,False,False,False,False
7126,False,False,False,False,False,True,False,False,False,False,False,False,False
671,False,False,False,False,True,False,False,False,False,False,False,False,False
860,False,False,False,False,True,False,False,False,False,False,False,False,False
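
The grouping step is easier to follow on a small series. The sketch below repeats the same counts/threshold logic on hypothetical brand data; the threshold of 1 is chosen only to fit the toy size, and `where` is used in place of `replace` as an equivalent way to relabel the rare brands.

```python
import pandas as pd

# Toy data (hypothetical): two frequent brands and two that appear once.
brands = pd.Series(["Maruti"] * 4 + ["Hyundai"] * 3 + ["BMW", "Lexus"], name="brand")

counts = brands.value_counts()
threshold = 1  # hypothetical cut-off for this toy series
rare = counts[counts <= threshold].index  # the brands BMW and Lexus

# Keep frequent brands and relabel everything else as 'uncommon'.
grouped = brands.where(~brands.isin(rare), "uncommon")
dummies = pd.get_dummies(grouped, dtype=int)

print(dummies.sum())
```

Four distinct brands collapse into three columns, and the 'uncommon' column collects both rare brands.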


In [100]:
df['brand'] = df['brand'].replace(repl, 'uncommon')


In [101]:
df[df['brand'] == 'uncommon']


,brand,km_driven,fuel,owner,selling_price
31,uncommon,50000,Petrol,Second Owner,70000
38,uncommon,42000,Petrol,First Owner,150000
41,uncommon,5000,Petrol,First Owner,2100000
49,uncommon,27800,Diesel,Second Owner,1450000
51,uncommon,151000,Diesel,First Owner,1090000
...,...,...,...,...,...
8072,uncommon,82000,Diesel,Second Owner,450000
8090,uncommon,170000,Diesel,First Owner,509999
8091,uncommon,40000,Petrol,Second Owner,125000
8101,uncommon,70000,Diesel,First Owner,450000
