# One Hot Encoding for variables with many categories

In [3]:
# We use the Mercedes-Benz dataset.
# We load only a few columns (features) that have many categories: ['X1','X2','X3','X4','X5','X6']

In [4]:
# The dataset comes from
#https://www.kaggle.com/aditya1702/mercedes-benz-data-exploration/data

In [6]:
import pandas as pd
import numpy as np

data = pd.read_csv('mercedezbenz.csv', usecols = ['X1','X2','X3','X4','X5','X6'])

# The usecols parameter takes a list of the columns to load, here ['X1','X2','X3','X4','X5','X6']

In [7]:
data.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6
0,v,at,a,d,u,j
1,t,av,e,d,y,l
2,w,n,c,d,x,j
3,t,n,f,d,x,l
4,v,n,f,d,h,d


In [10]:
# Let's see how many unique categories each column has.
# The for loop iterates over the columns ['X1','X2','X3','X4','X5','X6'].
# data[col] returns the values in a column.
# len(data[col]) would give the total number of rows in that column.
# len(data[col].unique()) gives the number of unique categories in that column.
# "data" is the DataFrame returned by pd.read_csv.

for col in data.columns:
    print(col, ':', len(data[col].unique()), 'categories')

X1 : 27 categories
X2 : 44 categories
X3 : 7 categories
X4 : 4 categories
X5 : 29 categories
X6 : 12 categories
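
The same counts can be obtained without an explicit loop through `DataFrame.nunique()`. Here is a minimal sketch on a small made-up frame (the `toy` data is hypothetical, not the Mercedes-Benz file). Note that `nunique()` ignores missing values by default, whereas `len(col.unique())` counts `NaN` as a value.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "X1": ["v", "t", "w", "t", "v"],
    "X2": ["at", "av", "n", "n", "n"],
})

# nunique() returns the number of distinct values per column
unique_counts = toy.nunique()
print(unique_counts)
```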


In [11]:
# If we one-hot encode all 6 features ['X1','X2','X3','X4','X5','X6'], how many extra features will be created?

In [13]:
# To one-hot encode we use pd.get_dummies.
# drop_first=True drops the first category of each column, so a column with k categories yields k-1 dummy columns.
# .shape gives the shape of the DataFrame produced by one-hot encoding.

In [15]:
pd.get_dummies(data,drop_first=True).shape

(4209, 117)
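
To see what `drop_first=True` does, it may help to compare both settings on a tiny example. The `toy` frame and its `colour` column below are made up for illustration only.

```python
import pandas as pd

# Hypothetical single-column frame with three categories
toy = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

full = pd.get_dummies(toy)                      # one dummy column per category
reduced = pd.get_dummies(toy, drop_first=True)  # first category (in sorted order) is dropped

print(list(full.columns))
print(list(reduced.columns))
```

The dropped category is still recoverable: it corresponds to the rows where all remaining dummy columns are 0.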

From the initial 6 categorical variables ['X1','X2','X3','X4','X5','X6'] we end up with 117 dummy columns after one-hot encoding, i.e. 111 more features than we started with.
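
The column count reported by `.shape` can be checked by hand from the category counts printed earlier: with `drop_first=True`, a column with k categories contributes k - 1 dummy columns.

```python
# Category counts printed earlier for X1..X6
category_counts = {"X1": 27, "X2": 44, "X3": 7, "X4": 4, "X5": 29, "X6": 12}

# Each column with k categories yields k - 1 dummy columns when drop_first=True
n_dummies = sum(k - 1 for k in category_counts.values())
print(n_dummies)  # 117
```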

This is not an optimal approach when the dataset is large.

We end up with many sparse 0/1 columns and can run into the curse of dimensionality, which can hurt model accuracy.

What is the curse of dimensionality:
https://www.youtube.com/watch?v=_4DaqzLyT08

How do we solve this problem?

In 2009 there was a challenge called the KDD Cup 2009 Orange Challenge.

## KDD Cup Orange Challenge

What can we do instead?

http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf In the winning solution of the KDD 2009 cup: "Winning the KDD Cup Orange Challenge with Ensemble

The team suggested using only the 10 most frequent labels (categories) and converting them into dummy variables with one-hot encoding.

How can we do that in Python?

In [16]:
# Let's find the top 10 categories for each column, starting with column "X2"

In [19]:
# data.X2 selects the X2 column.
# data.X2.value_counts() gives the count of each category in X2.
# To sort the counts (category frequencies) in descending order we use:
# data.X2.value_counts().sort_values(ascending=False)
# To display the top 20 rows of the sorted output:
# data.X2.value_counts().sort_values(ascending=False).head(20)

In [20]:
data.X2.value_counts().sort_values(ascending=False).head(20)

as    1659
ae     496
ai     415
m      367
ak     265
r      153
n      137
s       94
f       87
e       81
aq      63
ay      54
a       47
t       29
k       25
i       25
b       21
ao      20
ag      19
z       19
Name: X2, dtype: int64

In [25]:
# In column X2:
# category 'as' occurs 1659 times
# category 'ae' occurs 496 times
# and so on
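
Relative frequencies are often easier to read than raw counts. As a small sketch, `value_counts(normalize=True)` returns the share of rows per category. The series `s` below is made up for illustration.

```python
import pandas as pd

# Hypothetical column: 'as' appears 3 times, 'ae' twice, 'm' once
s = pd.Series(["as", "as", "as", "ae", "ae", "m"])

# normalize=True returns proportions instead of counts
shares = s.value_counts(normalize=True)
print(shares)
```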

In [26]:
# Let's extract the top 10 categories in column X2

In [27]:
# Using a list comprehension
# We extract the top 10 categories of column X2 with a list comprehension.
# data.X2.value_counts().sort_values(ascending=False).head(10) ---> the 10 most frequent categories
# .index gives the names of those 10 categories:
# data.X2.value_counts().sort_values(ascending=False).head(10).index
# We iterate over those names inside the list comprehension:
# top_10 = [x for x in data.X2.value_counts().sort_values(ascending=False).head(10).index]




top_10 = [x for x in data.X2.value_counts().sort_values(ascending=False).head(10).index]
top_10

['as', 'ae', 'ai', 'm', 'ak', 'r', 'n', 's', 'f', 'e']
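
A shorter way to build the same kind of list relies on the fact that `value_counts()` already returns counts sorted in descending order, so the extra `sort_values` call and the list comprehension are optional. This is a sketch on made-up data; `top_2` is an illustrative name.

```python
import pandas as pd

# Hypothetical column: 'as' x3, 'ae' x2, 'm' x1, 'n' x1
s = pd.Series(["as", "ae", "as", "m", "as", "ae", "n"])

# value_counts() is sorted most-frequent-first; .index.tolist() gives plain labels
top_2 = s.value_counts().head(2).index.tolist()
print(top_2)  # ['as', 'ae']
```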

Now we one-hot encode the top 10 categories of X2.



In [29]:
# For each category in top_10, written as "for label in top_10:" (label can be 'as', 'm', etc.),
# we create a new column named after the label: data[label].
# np.where puts 1 in the rows where X2 equals the label (e.g. 'as') and 0 everywhere else:
# data[label] = np.where(data['X2']==label, 1, 0)
# Then we show 'X2' next to the new one-hot columns for the first 40 rows:
# data[['X2']+top_10].head(40)

One-hot encoding the top 10 categories of "X2"

In [28]:
for label in top_10:
    data[label] = np.where(data['X2']==label, 1, 0)
    
data[['X2']+top_10].head(40)


# 'at' is not among the top 10 categories, so all of its dummy columns are 0

Unnamed: 0,X2,as,ae,ai,m,ak,r,n,s,f,e
0,at,0,0,0,0,0,0,0,0,0,0
1,av,0,0,0,0,0,0,0,0,0,0
2,n,0,0,0,0,0,0,1,0,0,0
3,n,0,0,0,0,0,0,1,0,0,0
4,n,0,0,0,0,0,0,1,0,0,0
5,e,0,0,0,0,0,0,0,0,0,1
6,e,0,0,0,0,0,0,0,0,0,1
7,as,1,0,0,0,0,0,0,0,0,0
8,as,1,0,0,0,0,0,0,0,0,0
9,aq,0,0,0,0,0,0,0,0,0,0
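
The same encoding can also be sketched without a loop, by masking every label outside the top list and letting `pd.get_dummies` do the rest. This is an alternative to the `np.where` loop, not the method used in this notebook. The toy column and `top_labels` list are made up, and the resulting columns come out in sorted order rather than frequency order.

```python
import pandas as pd

# Hypothetical toy column and a hypothetical top-2 list for it
col = pd.Series(["at", "n", "n", "e", "as", "as", "aq"], name="X2")
top_labels = ["as", "n"]

# Labels outside top_labels become NaN; get_dummies skips NaN by default,
# so those rows end up with 0 in every dummy column
masked = col.where(col.isin(top_labels))
encoded = pd.get_dummies(masked, prefix="X2", dtype=int)
print(encoded)
```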


In [30]:
# Getting dummy variables for all the features ['X1','X2','X3','X4','X5','X6']

In [31]:
def one_hot_top_x(df, variable, top_x_labels):
    # Create dummy variables for the most frequent labels.
    # The number of labels to encode can be varied via top_x_labels.
    # variable is the column name; df is modified in place.

    for label in top_x_labels:
        # use df (not the global data) so the function works on any DataFrame
        df[variable + "_" + label] = np.where(df[variable] == label, 1, 0)
        


In [32]:
#read data again
data = pd.read_csv('mercedezbenz.csv', usecols = ['X1','X2','X3','X4','X5','X6'])


    

In [33]:
data

Unnamed: 0,X1,X2,X3,X4,X5,X6
0,v,at,a,d,u,j
1,t,av,e,d,y,l
2,w,n,c,d,x,j
3,t,n,f,d,x,l
4,v,n,f,d,h,d
...,...,...,...,...,...,...
4204,s,as,c,d,aa,d
4205,o,t,d,d,aa,h
4206,v,r,a,d,aa,g
4207,r,e,f,d,aa,l


In [34]:
# One-hot encode column X2.
# Here variable='X2', df=data, top_x_labels=top_10 (the top 10 labels of 'X2' computed above)

one_hot_top_x(data,'X2',top_10)
data.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X2_as,X2_ae,X2_ai,X2_m,X2_ak,X2_r,X2_n,X2_s,X2_f,X2_e
0,v,at,a,d,u,j,0,0,0,0,0,0,0,0,0,0
1,t,av,e,d,y,l,0,0,0,0,0,0,0,0,0,0
2,w,n,c,d,x,j,0,0,0,0,0,0,1,0,0,0
3,t,n,f,d,x,l,0,0,0,0,0,0,1,0,0,0
4,v,n,f,d,h,d,0,0,0,0,0,0,1,0,0,0


In [36]:
#finding the 10 most frequent categories for X1

top_10_X1 = [x for x in data.X1.value_counts().sort_values(ascending=False).head(10).index]


In [37]:
# Calling the function
# one_hot_top_x(df, variable, top_x_labels)
# Here variable='X1', df=data, top_x_labels=top_10_X1 (the top 10 labels of 'X1' computed above)


one_hot_top_x(data,"X1",top_10_X1)
data.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X2_as,X2_ae,X2_ai,X2_m,...,X1_aa,X1_s,X1_b,X1_l,X1_v,X1_r,X1_i,X1_a,X1_c,X1_o
0,v,at,a,d,u,j,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,t,av,e,d,y,l,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,w,n,c,d,x,j,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,t,n,f,d,x,l,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,v,n,f,d,h,d,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


Similarly, we can do this for "X3", "X4", "X5" and "X6".
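
Rather than repeating the two steps per column by hand, they can be wrapped in a loop. The sketch below runs on a made-up `toy` frame with two columns and `top_n = 2` so that it stays self-contained. With the real data, the list would hold the six column names and `top_n` would be 10.

```python
import numpy as np
import pandas as pd

def one_hot_top_x(df, variable, top_x_labels):
    # One 0/1 column per frequent label; modifies df in place
    for label in top_x_labels:
        df[variable + "_" + label] = np.where(df[variable] == label, 1, 0)

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "X3": ["a", "c", "c", "f", "c", "a"],
    "X4": ["d", "d", "d", "b", "d", "d"],
})

top_n = 2  # illustrative; the text above uses 10
for col in ["X3", "X4"]:
    # most frequent labels of this column, most frequent first
    top_labels = toy[col].value_counts().head(top_n).index.tolist()
    one_hot_top_x(toy, col, top_labels)

print(toy.columns.tolist())
```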

### One Hot Encoding of Top Variables

### Advantages

* Straightforward to implement
* Does not require hours of variable exploration
* Does not massively increase the feature space (mitigates the curse of dimensionality problem)

### Disadvantages

* Does not add any extra information that could make the variable more predictive
* Does not keep the information of the ignored labels or categories
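
One common variation that partly addresses the second disadvantage is to collapse all ignored labels into a single "other" bucket before encoding, so the model can at least tell "rare label" apart from "no label". This is not part of the approach described above. The sketch below uses a made-up column, and the bucket name `"other"` is an arbitrary choice.

```python
import pandas as pd

# Hypothetical column: 'as' x3, 'ae' x2, plus two rare labels
col = pd.Series(["as", "as", "ae", "at", "av", "as", "ae"], name="X2")
top_labels = col.value_counts().head(2).index.tolist()

# Keep the top labels, replace everything else with the string "other"
grouped = col.where(col.isin(top_labels), "other")
encoded = pd.get_dummies(grouped, prefix="X2", dtype=int)
print(encoded.columns.tolist())
```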

In real data it is common for a categorical feature to have a few dense categories that dominate it.
There are also sparse categories that appear only a few times; these can be treated as noise and discarded.
Above we used the top 10 categories of each feature; we can use the top 5, top 20, etc. as needed.
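
One way to judge whether a given cutoff is reasonable is to check what share of rows the kept categories cover. The sketch below does this for X2, using the ten counts and the row total shown in the outputs above.

```python
# Counts of the ten most frequent X2 labels, copied from the value_counts output
top_10_counts = [1659, 496, 415, 367, 265, 153, 137, 94, 87, 81]
n_rows = 4209  # number of rows, from the .shape output

# Fraction of rows whose X2 label is among the top 10
coverage = sum(top_10_counts) / n_rows
print(round(coverage, 3))  # 0.892
```

For X2, the top 10 labels therefore cover roughly 89% of the rows; the remaining rows get 0 in every X2 dummy column.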