# Count or frequency encoding
<h3> High Cardinality </h3>


Variables with a large number of categories are said to have high cardinality.

If a categorical variable has high cardinality, one-hot encoding expands the feature space dramatically, because every label becomes its own column.

One approach that is heavily used in Kaggle competitions is to replace each label of the categorical variable with its count, that is, the number of times the label appears in the dataset, or with its frequency, that is, the proportion of observations that carry the label. The two carry the same information, since the frequency is just the count divided by the total number of rows.
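To make the difference between count and frequency concrete, here is a minimal sketch on a toy column. The `colors` series and its labels are invented for illustration and are not part of the Mercedes-Benz data used below.

```python
import pandas as pd

# Toy column, invented for illustration only
colors = pd.Series(['red', 'blue', 'red', 'green', 'red', 'blue'])

# Count: number of rows per label
counts = colors.value_counts()

# Frequency: proportion of rows per label (count / total rows)
frequencies = colors.value_counts(normalize=True)

# Replace every label with its count or with its frequency
count_encoded = colors.map(counts)
frequency_encoded = colors.map(frequencies)

print(pd.DataFrame({'label': colors,
                    'count': count_encoded,
                    'frequency': frequency_encoded}))
```

Dividing the `count` column by `len(colors)` gives the `frequency` column, which is why the two encodings differ only by a constant scale factor.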

In [1]:
import pandas as pd
import numpy as np

# Download the dataset from the below link
#https://www.kaggle.com/aditya1702/mercedes-benz-data-exploration/data

df = pd.read_csv('mercedes/test.csv', usecols=['X1', 'X2'])
df.head()

Unnamed: 0,X1,X2
0,v,n
1,b,ai
2,v,as
3,l,n
4,s,as


In [2]:
df.shape

(4209, 2)

# One-Hot Encoding

In [3]:
pd.get_dummies(df).shape

(4209, 72)

One-hot encoding turns the 2 original columns into 72 dummy columns (27 for X1 plus 45 for X2), so the dataset gains 70 columns.
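The column count after one-hot encoding is simply the sum of the number of unique labels per variable, which is where 27 + 45 = 72 comes from. A small sketch on a made-up frame (the `toy` data and its column names are assumptions for illustration):

```python
import pandas as pd

# Made-up frame: column A has 3 labels, column B has 2 labels
toy = pd.DataFrame({'A': ['x', 'y', 'z', 'x'],
                    'B': ['p', 'q', 'p', 'p']})

# Total number of distinct labels across the categorical columns
n_labels = toy.nunique().sum()

# Number of columns produced by one-hot encoding
n_dummies = pd.get_dummies(toy).shape[1]

print(n_labels, n_dummies)
```

Both numbers match, so every extra label in a high-cardinality variable costs one more column.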

# Count/Frequency Encoding

In [4]:
# Count the unique labels in X1
len(df['X1'].unique())

27

X1 has 27 unique labels.

In [5]:
# Count the unique labels in X2
len(df['X2'].unique())

45

X2 has 45 unique labels.

In [6]:
# Let's look at how many labels each variable has

for col in df.columns:
    print(col, ':', len(df[col].unique()), 'labels')

X1 : 27 labels
X2 : 45 labels


In [7]:
# Now obtain the count for each one of the labels in variable X2
# And capture this in a dictionary that we can use to re-map the labels

df['X2'].value_counts().to_dict()

{'as': 1658,
 'ae': 478,
 'ai': 462,
 'm': 348,
 'ak': 260,
 'r': 155,
 'n': 113,
 's': 100,
 'f': 85,
 'e': 84,
 'ay': 78,
 'aq': 72,
 'a': 44,
 'b': 38,
 't': 25,
 'k': 25,
 'ag': 23,
 'ac': 20,
 'ao': 19,
 'i': 15,
 'z': 12,
 'ap': 11,
 'p': 10,
 'aw': 9,
 'd': 6,
 'h': 6,
 'g': 5,
 'au': 5,
 'q': 5,
 'af': 4,
 'ab': 4,
 'ad': 4,
 'al': 4,
 'w': 3,
 'ah': 3,
 'am': 3,
 'at': 3,
 'x': 2,
 'j': 2,
 'an': 1,
 'ax': 1,
 'u': 1,
 'av': 1,
 'aj': 1,
 'y': 1}

In [8]:
# Store the label-to-count mapping in a variable

df_frequency_map = df['X2'].value_counts().to_dict()

In [9]:
df['X2'].head(10)

0     n
1    ai
2    as
3     n
4    as
5    ai
6    ae
7    ae
8     s
9    as
Name: X2, dtype: object

In [10]:
# Replace each X2 label with its count

df['X2'] = df['X2'].map(df_frequency_map)

In [11]:
df

Unnamed: 0,X1,X2
0,v,113
1,b,462
2,v,1658
3,l,113
4,s,1658
...,...,...
4204,h,1658
4205,aa,462
4206,v,1658
4207,v,1658


<b>Now each category in X2 is replaced by its count.</b>
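Because the counts live in a plain dictionary, the same mapping can be reused on another dataset. The sketch below is one possible way to do that, not part of the original notebook: the `count_encode` helper and the toy series are hypothetical, and filling unseen labels with 0 is an assumed convention.

```python
import pandas as pd

def count_encode(train_col, other_col):
    """Hypothetical helper: learn counts on train_col, apply them to both columns."""
    count_map = train_col.value_counts().to_dict()
    train_encoded = train_col.map(count_map)
    # A label that never appeared in train_col is missing from the dictionary,
    # so map() returns NaN for it; 0 is used here as an assumed default.
    other_encoded = other_col.map(count_map).fillna(0).astype(int)
    return train_encoded, other_encoded

# Toy data, invented for illustration
train = pd.Series(['a', 'b', 'a', 'c', 'a'])
new = pd.Series(['a', 'c', 'd'])   # 'd' was never seen in train

train_enc, new_enc = count_encode(train, new)
print(train_enc.tolist())
print(new_enc.tolist())
```

Learning the counts on one dataset and applying them to the other keeps the encoded values consistent between the two.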

This encoding has some advantages and disadvantages, discussed below.



<h1> Advantages</h1>

1. It is very simple to implement.
2. It does not increase the dimensionality of the feature space.


<h1>Disadvantages </h1>

1. If two labels have the same count, they are replaced by the same number and become indistinguishable, so the information that separated them is lost.


2. It assigns somewhat arbitrary numbers, and therefore weights, to the different labels, and these may not be related to their predictive power.
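The first disadvantage is visible in the X2 counts above, where `t` and `k` both appear 25 times and therefore receive the same encoded value. A minimal sketch of the same effect on an invented toy series:

```python
import pandas as pd

# Toy labels, invented for illustration: 't' and 'k' appear equally often
labels = pd.Series(['t', 'k', 't', 'k', 'g'])

count_map = labels.value_counts().to_dict()
encoded = labels.map(count_map)

# Distinct values before and after encoding
n_before = labels.nunique()
n_after = encoded.nunique()

print(count_map)
print(n_before, n_after)
```

After encoding there are fewer distinct values than original labels, so a model can no longer tell `t` and `k` apart.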

See this Kaggle thread for more information: https://www.kaggle.com/general/16927