### https://github.com/krishnaik06/Complete-Feature-Engineering/

## Count or frequency encoding

### High Cardinality

Variables with a large number of categories are also said to have high cardinality.

If a categorical variable has many labels (high cardinality), one-hot encoding expands the feature space dramatically.

One approach that is heavily used in Kaggle competitions is to replace each label of the categorical variable with its count, that is, the number of times the label appears in the dataset, or with its frequency, that is, the percentage of observations that carry that label. The two carry the same information, since the frequency is just the count divided by the total number of observations.

Let's see how this works:

In [20]:
import pandas as pd
import numpy as np

In [21]:
df = pd.read_csv('mercedes.csv',usecols=['X1','X2'])

In [22]:
df.head()

Unnamed: 0,X1,X2
0,v,at
1,t,av
2,w,n
3,t,n
4,v,n


In [23]:
df.shape

(4209, 2)

## One Hot Encoding

The problem with one-hot encoding is that it increases the dimensionality: here, 2 columns are converted into 71 columns.

In [24]:
pd.get_dummies(df).shape

(4209, 71)

In [25]:
pd.get_dummies(df)

Unnamed: 0,X1_a,X1_aa,X1_ab,X1_b,X1_c,X1_d,X1_e,X1_f,X1_g,X1_h,...,X2_n,X2_o,X2_p,X2_q,X2_r,X2_s,X2_t,X2_x,X2_y,X2_z
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4204,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4205,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4206,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4207,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [26]:
len(df['X1'].unique())

27

In [27]:
len(df['X2'].unique())

44

In [28]:
for col in df.columns:
    print(col, ': ', len(df[col].unique()),' labels')

X1 :  27  labels
X2 :  44  labels
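The 71 dummy columns are simply the sum of the distinct labels in the two columns (27 + 44). The following is a minimal sketch of that relationship on a small made-up DataFrame; the `toy` data and its column names are assumptions for illustration, not part of the Mercedes dataset.

```python
import pandas as pd

# Hypothetical toy data, not the Mercedes dataset
toy = pd.DataFrame({
    "A": ["x", "y", "x", "z"],  # 3 distinct labels
    "B": ["p", "q", "p", "p"],  # 2 distinct labels
})

# Number of distinct labels per column
n_labels = toy.nunique()

# One-hot encoding creates one column per distinct label
n_dummy_cols = pd.get_dummies(toy).shape[1]

print(n_labels.sum(), n_dummy_cols)
```

With no missing values, both numbers should match, which is why high-cardinality columns inflate the feature space so quickly.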


### Frequency Count

In [29]:
df.head()

Unnamed: 0,X1,X2
0,v,at
1,t,av
2,w,n
3,t,n
4,v,n


In [30]:
df_frequency_map = df.X2.value_counts().to_dict()

In [31]:
# Replace each X2 label with its count (number of occurrences)
df.X2 = df.X2.map(df_frequency_map)

In [32]:
df.head()

Unnamed: 0,X1,X2
0,v,6
1,t,4
2,w,137
3,t,137
4,v,137
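The cells above encode `X2` with raw counts. As a rough sketch of the frequency variant, the same mapping can be built from proportions with `value_counts(normalize=True)`. The toy Series `s` below is made up for illustration and is not the Mercedes data.

```python
import pandas as pd

# Hypothetical toy column, not the Mercedes dataset
s = pd.Series(["n", "n", "at", "n", "av", "at"])

# label -> number of rows carrying that label
count_map = s.value_counts().to_dict()
# label -> share of rows carrying that label
freq_map = s.value_counts(normalize=True).to_dict()

count_encoded = s.map(count_map)
freq_encoded = s.map(freq_map)

# The two encodings differ only by the constant factor len(s)
print(count_encoded.tolist())
print(freq_encoded.tolist())
```

Note that `Series.map` with a dictionary returns `NaN` for any label that is missing from the dictionary, so labels that were not present when the map was built need separate handling.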


**Advantages**

1. It is very simple to implement.
2. It does not increase the dimensionality of the feature space.

**Disadvantages**

1. If two or more labels have the same count, they are replaced with the same number, so the encoding can no longer tell them apart and valuable information may be lost.
2. It assigns somewhat arbitrary numbers, and therefore weights, to the different labels, which may not be related to their predictive power.
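The first disadvantage can be illustrated with a small sketch. In the made-up Series below (an assumption for illustration only), the labels `"a"` and `"b"` both occur twice, so count encoding maps them to the same value.

```python
import pandas as pd

# Hypothetical toy column: "a" and "b" both appear twice
s = pd.Series(["a", "a", "b", "b", "c"])

# Count encoding, built the same way as the X2 mapping
encoded = s.map(s.value_counts().to_dict())

# For each encoded value, count how many distinct labels share it
collisions = (
    pd.DataFrame({"label": s, "encoded": encoded})
    .groupby("encoded")["label"]
    .nunique()
)

# Encoded values shared by more than one label
print(collisions[collisions > 1])
```

A check like this on the real column is one way to see how many labels become indistinguishable after encoding.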

See this Kaggle thread for more information: https://www.kaggle.com/general/16927