### https://www.kaggle.com/learn-forum/114857
### https://medium.com/dwadda/ways-of-encoding-categorical-variables-b7a798931c8c

In [1]:
import pandas as pd
import numpy as np

In [3]:
data = pd.read_csv('mercedes.csv',usecols=['X1','X2','X3','X4','X5','X6'])
data.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6
0,v,at,a,d,u,j
1,t,av,e,d,y,l
2,w,n,c,d,x,j
3,t,n,f,d,x,l
4,v,n,f,d,h,d


In [5]:
for col in data.columns:
    print(col,' ',len(data[col].unique()), ' labels')

X1   27  labels
X2   44  labels
X3   7  labels
X4   4  labels
X5   29  labels
X6   12  labels


In [6]:
pd.get_dummies(data,drop_first=True).shape

(4209, 117)

We can see that from just 6 initial categorical variables, we end up with 117 new variables.
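The figure of 117 can be checked by hand: with `drop_first=True`, each variable contributes one dummy per label minus one, so (27-1) + (44-1) + (7-1) + (4-1) + (29-1) + (12-1) = 117. The sketch below illustrates the same rule on a small made-up frame (the `toy` data is an assumption for illustration, not part of the Mercedes file).

```python
import pandas as pd

# Toy frame with hypothetical values, used only to illustrate the counting rule
toy = pd.DataFrame({
    "A": ["x", "y", "z", "x"],   # 3 distinct labels
    "B": ["p", "q", "p", "q"],   # 2 distinct labels
})

# With drop_first=True, each column yields (number of labels - 1) dummies
expected = sum(toy[c].nunique() - 1 for c in toy.columns)
actual = pd.get_dummies(toy, drop_first=True).shape[1]

print(expected, actual)
```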

These numbers are still not huge, and in practice we could work with them relatively easily. However, in business datasets, and in other Kaggle or KDD datasets, it is not unusual to find several categorical variables with many labels each. If we one-hot encode them, we end up with datasets with thousands of columns.

What can we do instead?

In the winning solution of the KDD Cup 2009, "Winning the KDD Cup Orange Challenge with Ensemble Selection" (http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf), the authors limit one-hot encoding to the 10 most frequent labels of a variable. That is, they create one binary variable for each of the 10 most frequent labels only. This is equivalent to grouping all the other labels under a new category, which in this case is dropped. Thus, the 10 new dummy variables indicate whether one of the 10 most frequent labels is present (1) or not (0) for a particular observation.

How can we do that in Python?

In [8]:
# Let's look at the 20 most frequent categories of variable X2 (the first 10 are the ones we will keep)
print(type(data.X2.value_counts().sort_values(ascending=False).head(20)))
data.X2.value_counts().sort_values(ascending=False).head(20)

<class 'pandas.core.series.Series'>


as    1659
ae     496
ai     415
m      367
ak     265
r      153
n      137
s       94
f       87
e       81
aq      63
ay      54
a       47
t       29
k       25
i       25
b       21
ao      20
z       19
ag      19
Name: X2, dtype: int64

In [12]:
# value_counts() already sorts by descending frequency, so the first 10 index entries are the top 10 labels
top_10 = list(data.X2.value_counts().head(10).index)
top_10

['as', 'ae', 'ai', 'm', 'ak', 'r', 'n', 's', 'f', 'e']

In [22]:
# Quick check: `+` concatenates lists, which is how ['X2'] + top_10 builds a column selection
[1,2]+[3,4]

[1, 2, 3, 4]

In [18]:
# Create one binary indicator column per frequent label of X2
for label in top_10:
    data[label] = np.where(data['X2']==label,1,0)
data[['X2']+top_10].head(10)

Unnamed: 0,X2,as,ae,ai,m,ak,r,n,s,f,e
0,at,0,0,0,0,0,0,0,0,0,0
1,av,0,0,0,0,0,0,0,0,0,0
2,n,0,0,0,0,0,0,1,0,0,0
3,n,0,0,0,0,0,0,1,0,0,0
4,n,0,0,0,0,0,0,1,0,0,0
5,e,0,0,0,0,0,0,0,0,0,1
6,e,0,0,0,0,0,0,0,0,0,1
7,as,1,0,0,0,0,0,0,0,0,0
8,as,1,0,0,0,0,0,0,0,0,0
9,aq,0,0,0,0,0,0,0,0,0,0


In [30]:
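def one_hot_encoding_top_10(df,cols):
    """Add to df, in place, one binary column for each of the 10 most frequent labels of every column in cols."""
    for col in cols:
        # Labels of this column, ordered by descending frequency
        top_10 = list(df[col].value_counts().sort_values(ascending=False).head(10).index)
        for label in top_10:
            # New column named <col>_<label>: 1 where the row has this label, else 0
            df[col+'_'+label] = np.where(df[col]==label,1,0)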

In [31]:
data = pd.read_csv('mercedes.csv',usecols=['X1','X2','X3','X4','X5','X6'])

one_hot_encoding_top_10(data,['X1','X2','X3','X4','X5','X6'])

In [33]:
data.columns

Index(['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X1_aa', 'X1_s', 'X1_b', 'X1_l',
       'X1_v', 'X1_r', 'X1_i', 'X1_a', 'X1_c', 'X1_o', 'X2_as', 'X2_ae',
       'X2_ai', 'X2_m', 'X2_ak', 'X2_r', 'X2_n', 'X2_s', 'X2_f', 'X2_e',
       'X3_c', 'X3_f', 'X3_a', 'X3_d', 'X3_g', 'X3_e', 'X3_b', 'X4_d', 'X4_a',
       'X4_c', 'X4_b', 'X5_v', 'X5_w', 'X5_q', 'X5_r', 'X5_s', 'X5_d', 'X5_n',
       'X5_p', 'X5_m', 'X5_i', 'X6_g', 'X6_j', 'X6_d', 'X6_i', 'X6_l', 'X6_a',
       'X6_h', 'X6_k', 'X6_c', 'X6_b'],
      dtype='object')
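The earlier remark that this approach is equivalent to grouping all remaining labels under a new category and then dropping it can be checked on a small example. The following is a minimal sketch under stated assumptions: the series `s`, the choice of the top 2 labels, and the placeholder name `"other"` are all made up for illustration, and `"other"` must not collide with a real label.

```python
import numpy as np
import pandas as pd

# Hypothetical toy variable: 'a' and 'b' are frequent, 'c' and 'd' are rare
s = pd.Series(["a", "a", "a", "b", "b", "c", "d"], name="X")
top_2 = s.value_counts().head(2).index.tolist()

# Route 1: one indicator column per frequent label
direct = pd.DataFrame({f"X_{label}": (s == label).astype(int) for label in top_2})

# Route 2: replace rare labels with "other", one-hot encode, then drop "other"
grouped = s.where(s.isin(top_2), "other")
via_other = pd.get_dummies(grouped, prefix="X").drop(columns="X_other").astype(int)

# Both routes should give the same indicator values
same = np.array_equal(
    direct.to_numpy(),
    via_other[list(direct.columns)].to_numpy(),
)
print(same)
```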

One-hot encoding of the most frequent labels<br>

Advantages<br><br>
Straightforward to implement<br>
Does not require hours of variable exploration<br>
Does not massively expand the feature space (the number of columns in the dataset)<br><br>
Disadvantages<br><br>
Does not add any information that may make the variable more predictive<br>
Does not keep the information of the ignored labels<br><br>
Because categorical variables often have a few dominating categories while the remaining labels add mostly noise, this simple approach may be useful on many occasions.<br>

Note that 10 is an arbitrary number of labels. You could also choose the top 5 or the top 20.<br>
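Since the number of labels to keep is a free choice, it can be exposed as a parameter. The following is a sketch of one possible variant, not the notebook's own function: the name `one_hot_encoding_top_n`, the `top_n` argument, and the toy data are assumptions for illustration. Unlike the function above, it returns a copy instead of modifying the input frame.

```python
import pandas as pd

def one_hot_encoding_top_n(df, cols, top_n=10):
    """Return a copy of df with one binary column per top_n most frequent label of each column in cols."""
    out = df.copy()
    for col in cols:
        # value_counts() sorts by descending frequency by default
        top_labels = out[col].value_counts().head(top_n).index
        for label in top_labels:
            out[f"{col}_{label}"] = (out[col] == label).astype(int)
    return out

# Hypothetical toy data: 'as' appears 3 times, 'ae' twice, 'm' once
toy = pd.DataFrame({"X2": ["as", "as", "as", "ae", "ae", "m"]})
encoded = one_hot_encoding_top_n(toy, ["X2"], top_n=2)
print(encoded.columns.tolist())  # ['X2', 'X2_as', 'X2_ae']
```

Rows with the label `m` get 0 in both new columns, which is how the ignored labels end up indistinguishable from each other.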