# What is Categorical Data?
Categorical data are variables that contain label values rather than numeric values.
The number of possible values is often limited to a fixed set.

### Nominal Categorical variables:
Some examples include:

- A “pet” variable with the values “dog” and “cat”.
- A “color” variable with the values “red”, “green” and “blue”.

Each value represents a different category, and the categories have no inherent order.

### Ordinal Categorical variables:
Some categories have a natural relationship to each other, such as a natural ordering.
A “place” variable with the values “first”, “second” and “third” has a natural ordering of its values. This type of categorical variable is called an ordinal variable.
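As a minimal sketch of the distinction, pandas lets us declare an explicit order for a categorical column. The toy data and the variable names below are assumptions made for illustration, not part of the dataset used later.

```python
import pandas as pd

# Hypothetical toy data reusing the "place" example from the text
place = pd.Series(["second", "first", "third", "first"])

# Declare the order explicitly; without ordered=True the categories are nominal
place_dtype = pd.CategoricalDtype(categories=["first", "second", "third"], ordered=True)
place_ordinal = place.astype(place_dtype)

# Integer codes follow the declared order: first -> 0, second -> 1, third -> 2
codes = place_ordinal.cat.codes.tolist()
print(codes)

# Order-aware comparisons only make sense for ordered categoricals
print((place_ordinal < "third").tolist())
```

A nominal variable such as “color” would not get `ordered=True`, because ranking its labels carries no meaning.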

# One Hot Encoding - variables with many categories

If a categorical variable contains many labels, re-encoding it with one hot encoding expands the feature space dramatically, which can lead to the "curse of dimensionality".

In [4]:
import numpy as np
import pandas as pd

data = pd.read_csv(r'mercedes_train.csv', usecols=['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8'])
data.head()

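| | X0 | X1 | X2 | X3 | X4 | X5 | X6 | X8 |
|---|---|---|---|---|---|---|---|---|
| 0 | k | v | at | a | d | u | j | o |
| 1 | k | t | av | e | d | y | l | o |
| 2 | az | w | n | c | d | x | j | x |
| 3 | az | t | n | f | d | x | l | e |
| 4 | az | v | n | f | d | h | d | n |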


In [5]:
# let's have a look at how many labels each variable has

for col in data.columns:
    print(col, ':', len(data[col].unique()), 'labels')

X0 : 47 labels
X1 : 27 labels
X2 : 44 labels
X3 : 7 labels
X4 : 4 labels
X5 : 29 labels
X6 : 12 labels
X8 : 25 labels


In [6]:
# let's examine how many columns we will obtain after one hot encoding these variables
# .shape reports (rows, columns); drop_first=True drops one dummy per variable,
# so 187 columns would be created in total
pd.get_dummies(data , drop_first=True).shape

(4209, 187)

We can see that from just 8 initial categorical variables, we end up with 187 new variables.
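The column count follows directly from the label counts: with `drop_first=True`, each variable contributes its number of labels minus one. The sketch below checks this on a small made-up frame (the `toy` data is an assumption, not the Mercedes dataset) and then repeats the arithmetic with the label counts printed above.

```python
import pandas as pd

# Hypothetical toy frame with 3, 2 and 4 distinct labels
toy = pd.DataFrame({
    "A": ["x", "y", "z", "x"],
    "B": ["p", "q", "p", "q"],
    "C": ["k", "l", "m", "n"],
})

n_labels = toy.nunique()
expected_cols = int((n_labels - 1).sum())  # (3-1) + (2-1) + (4-1) = 6
actual_cols = pd.get_dummies(toy, drop_first=True).shape[1]
print(expected_cols, actual_cols)

# Same arithmetic with the label counts printed above for X0 ... X8
label_counts = [47, 27, 44, 7, 4, 29, 12, 25]
print(sum(label_counts) - len(label_counts))  # 195 - 8 = 187
```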

These numbers are still not huge, and in practice we could work with them relatively easily. However, in business datasets, and in other Kaggle or KDD datasets, it is not unusual to find several categorical variables with many labels each. If we one hot encode them, we end up with datasets with thousands of columns.

What can we do instead?

In the winning solution of the KDD 2009 cup, "Winning the KDD Cup Orange Challenge with Ensemble Selection" (http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf), the authors limit one hot encoding to the 10 most frequent labels of each variable. That is, they create one binary variable for each of the 10 most frequent labels only. This is equivalent to grouping all the other labels under a new category, which in this case is dropped. Thus, the 10 new dummy variables indicate whether one of the 10 most frequent labels is present (1) or not (0) for a particular observation.
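The equivalence stated here can be checked on a toy column. This is only an illustrative sketch: the series `s`, the `"other"` placeholder label and the choice of 2 labels instead of 10 are assumptions made to keep the example small.

```python
import pandas as pd

# Hypothetical toy column; we keep the 2 most frequent labels instead of 10
s = pd.Series(["a", "a", "a", "b", "b", "c", "d"], name="var")
top_k = s.value_counts().head(2).index.tolist()

# Route 1: one indicator per frequent label
direct = pd.DataFrame({f"var_{lab}": (s == lab).astype(int) for lab in top_k})

# Route 2: group the rare labels as "other", one hot encode, then drop "other"
grouped = s.where(s.isin(top_k), "other")
via_other = (
    pd.get_dummies(grouped, prefix="var")
    .drop(columns="var_other")
    .astype(int)
)

# Both routes should give the same indicator columns
print((direct.values == via_other[direct.columns].values).all())
```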

How can we do that in Python?

In [9]:
# let's find the 20 most frequent categories for the variable X0
# (value_counts() already sorts in descending order of frequency)
data.X0.value_counts().head(20)

z     360
ak    349
y     324
ay    313
t     306
x     300
o     269
f     227
n     195
w     182
j     181
az    175
aj    151
s     106
ap    103
h      75
d      73
al     67
v      36
af     35
Name: X0, dtype: int64

In [10]:
# let's find the top 10 most frequent categories for the variable X2
data.X2.value_counts().head(10)

as    1659
ae     496
ai     415
m      367
ak     265
r      153
n      137
s       94
f       87
e       81
Name: X2, dtype: int64

In [13]:
# let's make a list with the most frequent categories of the variable X2

top_10 = data.X2.value_counts().head(10).index.tolist()
top_10

['as', 'ae', 'ai', 'm', 'ak', 'r', 'n', 's', 'f', 'e']

In [14]:
# and now we make the 10 binary variables

for label in top_10:
    data[label] = np.where(data['X2'] == label, 1, 0)

data[['X2'] + top_10].head(10)

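| | X2 | as | ae | ai | m | ak | r | n | s | f | e |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | at | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | av | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | n | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | n | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | n | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 5 | e | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 6 | e | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 7 | as | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | as | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | aq | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |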


In [15]:
# helper to create the dummy variables for one categorical variable;
# we call it once per variable to cover all the categorical variables

def one_hot_top_x(df, variable, top_x_labels):
    # create one binary column per label in top_x_labels (modifies df in place)
    # we can vary the number of most frequent labels that we encode

    for label in top_x_labels:
        # use the df argument rather than the global data frame
        df[variable + '_' + label] = np.where(df[variable] == label, 1, 0)


# encode X2 into the 10 most frequent categories
one_hot_top_x(data, 'X2', top_10)
data.head()
    

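| | X0 | X1 | X2 | X3 | X4 | X5 | X6 | X8 | as | ae | ... | X2_as | X2_ae | X2_ai | X2_m | X2_ak | X2_r | X2_n | X2_s | X2_f | X2_e |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | k | v | at | a | d | u | j | o | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | k | t | av | e | d | y | l | o | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | az | w | n | c | d | x | j | x | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | az | t | n | f | d | x | l | e | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | az | v | n | f | d | h | d | n | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |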


In [18]:

# find the 10 most frequent categories for X1

top_10 = data.X1.value_counts().head(10).index.tolist()

# now create the 10 most frequent dummy variables for X1
one_hot_top_x(data, 'X1', top_10)
data.head()

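| | X0 | X1 | X2 | X3 | X4 | X5 | X6 | X8 | as | ae | ... | X1_aa | X1_s | X1_b | X1_l | X1_v | X1_r | X1_i | X1_a | X1_c | X1_o |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | k | v | at | a | d | u | j | o | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | k | t | av | e | d | y | l | o | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | az | w | n | c | d | x | j | x | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | az | t | n | f | d | x | l | e | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | az | v | n | f | d | h | d | n | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |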


In [20]:

# find the 10 most frequent categories for X0
top_10 = data.X0.value_counts().head(10).index.tolist()

# now create the 10 most frequent dummy variables for X0
one_hot_top_x(data, 'X0', top_10)
data.head()

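| | X0 | X1 | X2 | X3 | X4 | X5 | X6 | X8 | as | ae | ... | X0_z | X0_ak | X0_y | X0_ay | X0_t | X0_x | X0_o | X0_f | X0_n | X0_w |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | k | v | at | a | d | u | j | o | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | k | t | av | e | d | y | l | o | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | az | w | n | c | d | x | j | x | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | az | t | n | f | d | x | l | e | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | az | v | n | f | d | h | d | n | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |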


# One Hot Encoding of the most frequent labels

#### Advantages
- Straightforward to implement
- Does not require hours of variable exploration
- Does not massively expand the feature space (number of columns in the dataset)

#### Disadvantages
- Does not add any information that may make the variable more predictive
- Does not keep the information of the ignored labels

Categorical variables often have a few dominating categories while the remaining labels add mostly noise, so this simple and straightforward approach can be useful on many occasions.
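To get a feel for how dominant the frequent labels are, we can compute the share of rows they cover. The sketch below is only arithmetic on numbers already shown in this notebook: the X2 counts come from the `value_counts()` output and the row count from the `.shape` output. The variable names are made up for illustration.

```python
# Counts of the 10 most frequent X2 labels, copied from the value_counts() output above
top_10_counts = [1659, 496, 415, 367, 265, 153, 137, 94, 87, 81]
n_rows = 4209  # number of rows reported by .shape above

covered = sum(top_10_counts)
coverage = covered / n_rows
print(f"{covered} of {n_rows} rows ({coverage:.1%})")
```

For X2, the 10 most frequent labels cover roughly 89% of the observations, so the 34 remaining labels share the other 11% or so.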

It is worth noting that keeping the top 10 labels is a totally arbitrary choice. You could also choose the top 5, or the top 20.
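One way to make that choice easy to vary is to pass the number of labels as a parameter. The helper below, `one_hot_top_k`, is a hypothetical variant of the function used earlier, not the code from the KDD solution: it finds the most frequent labels itself, handles several columns at once, and returns a copy instead of modifying the input. The toy frame is also an assumption.

```python
import pandas as pd

def one_hot_top_k(df, columns, k=10):
    """Return a copy of df with indicator columns for the k most frequent labels of each column."""
    out = df.copy()
    for col in columns:
        # value_counts() sorts by frequency, so head(k) gives the k most frequent labels
        top_labels = df[col].value_counts().head(k).index
        for label in top_labels:
            out[f"{col}_{label}"] = (df[col] == label).astype(int)
    return out

# Hypothetical toy data
toy = pd.DataFrame({
    "X0": ["z", "z", "ak", "y", "z", "ak"],
    "X2": ["as", "as", "ae", "as", "n", "ae"],
})
encoded = one_hot_top_k(toy, ["X0", "X2"], k=2)
print(encoded.columns.tolist())
```

Trying a few values of `k` and comparing model performance is a reasonable way to pick one for a given dataset.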

This modelling was enough for the team to win the KDD 2009 cup, together with some other powerful feature engineering that, as we will see in the following lectures, improved the performance of the variables dramatically.