#### Count or frequency encoding
#### High Cardinality


Variables that have a large number of categories are said to have high cardinality.

If a categorical variable has many labels (high cardinality), one-hot encoding adds one column per label and so expands the feature space dramatically.

One approach that is heavily used in Kaggle competitions is to replace each label of the categorical variable with its count, that is, the number of times the label appears in the dataset, or with its frequency, that is, the fraction of observations that carry the label. The two carry the same information, because the frequency is just the count divided by the total number of observations.
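As a minimal sketch of the two variants, the snippet below encodes a small made-up column. The `colour` series and the names `count_map` and `freq_map` are assumptions for illustration and are not part of the mall dataset.

```python
import pandas as pd

# Hypothetical toy column, not taken from Mall_Customers.csv
colour = pd.Series(['red', 'blue', 'red', 'green', 'red', 'blue'], name='colour')

# Count encoding: each label is replaced by the number of times it appears
count_map = colour.value_counts().to_dict()
colour_count = colour.map(count_map)

# Frequency encoding: each label is replaced by its share of the observations
freq_map = colour.value_counts(normalize=True).to_dict()
colour_freq = colour.map(freq_map)

print(pd.DataFrame({'colour': colour, 'count': colour_count, 'frequency': colour_freq}))
```

Either way the result is a single numeric column, so the feature space does not grow with the number of labels.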

Let's see how this works:

In [1]:
import pandas as pd
import numpy as np
df=pd.read_csv('Mall_Customers.csv')

In [2]:
df.head()

Unnamed: 0,CustomerID,Genre,Age,Annual Income (k$),Spending Score (1-100)
0,1,Male,19,15,39
1,2,Male,21,15,81
2,3,Female,20,16,6
3,4,Female,23,16,77
4,5,Female,31,17,40


In [4]:
df.shape

(200, 5)

#### One Hot encoding

In [5]:
pd.get_dummies(df).shape

(200, 6)
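The shape only grows from 5 to 6 columns because `pd.get_dummies` by default encodes only object, string, or category columns, which here means `Genre`; numeric columns such as `Age` are left untouched. The sketch below uses a hypothetical toy frame (an assumption, not the real file) to show what happens when a numeric column is explicitly passed through the `columns` parameter.

```python
import pandas as pd

# Hypothetical toy frame that mimics two of the columns above
toy = pd.DataFrame({
    'Genre': ['Male', 'Male', 'Female', 'Female', 'Female'],
    'Age': [19, 21, 20, 23, 31],
})

# Default behaviour: only the non-numeric column (Genre) is one-hot encoded
default_shape = pd.get_dummies(toy).shape

# Forcing Age to be treated as categorical adds one column per distinct age
forced_shape = pd.get_dummies(toy, columns=['Genre', 'Age']).shape

print(default_shape)
print(forced_shape)
```

In the toy frame every age is distinct, so the forced encoding produces one column per row for `Age`. On a column with dozens of labels the same call would add dozens of columns.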

In [6]:
len(df['Genre'].unique())

2

In [7]:
len(df['Age'].unique())

51

In [8]:
# Let's have a look at how many labels each variable has
for col in df.columns:
    print(col, ':', len(df[col].unique()), 'labels')

CustomerID : 200 labels
Genre : 2 labels
Age : 51 labels
Annual Income (k$) : 64 labels
Spending Score (1-100) : 84 labels


In [10]:
# Let's obtain the counts for each one of the labels in the variable Age
# Let's capture this in a dictionary that we can use to re-map the labels
df.Age.value_counts().to_dict()

{32: 11,
 35: 9,
 19: 8,
 31: 8,
 30: 7,
 49: 7,
 27: 6,
 47: 6,
 40: 6,
 23: 6,
 36: 6,
 38: 6,
 50: 5,
 48: 5,
 29: 5,
 21: 5,
 20: 5,
 34: 5,
 18: 4,
 28: 4,
 59: 4,
 24: 4,
 67: 4,
 54: 4,
 39: 3,
 25: 3,
 33: 3,
 22: 3,
 37: 3,
 43: 3,
 68: 3,
 45: 3,
 46: 3,
 60: 3,
 41: 2,
 57: 2,
 66: 2,
 65: 2,
 63: 2,
 58: 2,
 26: 2,
 70: 2,
 42: 2,
 53: 2,
 52: 2,
 51: 2,
 44: 2,
 55: 1,
 64: 1,
 69: 1,
 56: 1}

In [12]:
# And now let's replace each label in Age by its count
# First we make a dictionary that maps each label to its count
df_frequency_map = df.Age.value_counts().to_dict()
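The cell above only builds the dictionary. One way to apply it, sketched here on a hypothetical toy frame, is `Series.map`. The column name `Age_count` and the variable `age_count_map` are assumptions for illustration; writing to a new column keeps the original ages available for comparison.

```python
import pandas as pd

# Hypothetical toy data standing in for df; only the Age column matters here
toy = pd.DataFrame({'Age': [19, 21, 19, 32, 32, 32]})

# Same construction as the dictionary built above
age_count_map = toy['Age'].value_counts().to_dict()

# Write the counts to a new column (Age_count is an assumed name)
toy['Age_count'] = toy['Age'].map(age_count_map)

# A label that is missing from the dictionary maps to NaN
unseen = pd.Series([19, 99]).map(age_count_map)

print(toy)
print(unseen)
```

Labels that were not present when the dictionary was built come out as `NaN`, so new data may need a fallback such as `fillna(0)` after mapping.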

In [14]:
df.head(100)

Unnamed: 0,CustomerID,Genre,Age,Annual Income (k$),Spending Score (1-100)
0,1,Male,19,15,39
1,2,Male,21,15,81
2,3,Female,20,16,6
3,4,Female,23,16,77
4,5,Female,31,17,40
...,...,...,...,...,...
95,96,Male,24,60,52
96,97,Female,47,60,47
97,98,Female,27,60,50
98,99,Male,48,61,42
