## Count or frequency encoding

Variables with a large number of categories are also called variables with **high cardinality**.

We observed in the previous lecture that if a categorical variable contains many labels, re-encoding them with one hot encoding expands the feature space dramatically.

One approach that is heavily used in Kaggle competitions is to replace each label of the categorical variable by its count, that is, the number of times the label appears in the dataset, or by its frequency, that is, the percentage of observations within that category. The two carry the same information, since the frequency is just the count divided by the total number of observations.

There is no particular rationale behind this transformation, other than its simplicity.
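To make the replacement described above concrete, here is a minimal sketch on a toy column. The `colours` series and its labels are made up for illustration and are not part of the dataset used later in this lecture.

```python
import pandas as pd

# Hypothetical toy column, invented for this illustration
colours = pd.Series(['red', 'blue', 'red', 'green', 'red', 'blue'], name='colour')

# count: number of times each label appears
count_map = colours.value_counts().to_dict()

# frequency: fraction of observations that carry each label
freq_map = colours.value_counts(normalize=True).to_dict()

# replace each label by its count or by its frequency
count_encoded = colours.map(count_map)
freq_encoded = colours.map(freq_map)

print(count_encoded.tolist())
print(freq_encoded.round(3).tolist())
```

Dividing `count_encoded` by the number of rows gives `freq_encoded`, which is why the two encodings carry the same information.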

### Advantages

- Simple
- Does not expand the feature space

### Disadvantages

- If 2 labels appear the same number of times in the dataset, that is, contain the same number of observations, they will be merged into the same value: we may lose valuable information
- Adds somewhat arbitrary numbers, and therefore weights, to the different labels, which may not be related to their predictive power
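The first disadvantage can be seen in a small sketch. The labels below are invented for illustration: 'b' and 'c' each appear twice, so after count encoding they can no longer be told apart.

```python
import pandas as pd

# Hypothetical toy data: 'b' and 'c' both appear twice
labels = pd.Series(['a', 'a', 'a', 'b', 'b', 'c', 'c'])

count_map = labels.value_counts().to_dict()
encoded = labels.map(count_map)

# number of distinct labels before, and distinct values after encoding
n_labels_before = labels.nunique()
n_values_after = encoded.nunique()

print(count_map)
print(n_labels_before, n_values_after)
```

Three labels go in, but only two distinct values come out, because 'b' and 'c' share the same count.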

Follow this thread in Kaggle for more information:
https://www.kaggle.com/general/16927

Let's see how this works:

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

# let's load the Mercedes-Benz dataset for demonstration

data = pd.read_csv('mercedesbenz.csv', usecols=['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'y'])
data.head()

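|   | y | X1 | X2 | X3 | X4 | X5 | X6 |
|---|---|---|---|---|---|---|---|
| 0 | 130.81 | v | at | a | d | u | j |
| 1 | 88.53 | t | av | e | d | y | l |
| 2 | 76.26 | w | n | c | d | x | j |
| 3 | 80.62 | t | n | f | d | x | l |
| 4 | 78.02 | v | n | f | d | h | d |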


In [2]:
# let's have a look at how many labels each variable has

for col in data.columns[1:]:
    print(col, ': ', len(data[col].unique()), ' labels')

X1 :  27  labels
X2 :  44  labels
X3 :  7  labels
X4 :  4  labels
X5 :  29  labels
X6 :  12  labels


### Important

When count-encoding categorical variables, it is important to calculate the count (or frequency = count / total observations) **over the training set only**, and then use those same numbers to replace the labels in both the train and the test set. This way no information from the test set leaks into the encoding.

In [3]:
X_train, X_test, y_train, y_test = train_test_split(data[['X1', 'X2', 'X3', 'X4', 'X5', 'X6']], data.y,
                                                    test_size=0.3,
                                                    random_state=0)
X_train.shape, X_test.shape

((2946, 6), (1263, 6))

In [4]:
# let's obtain the counts for each one of the labels in variable X2
# let's capture this in a dictionary that we can use to re-map the labels

X_train.X2.value_counts().to_dict()

{'a': 34,
 'aa': 1,
 'ac': 10,
 'ae': 342,
 'af': 1,
 'ag': 15,
 'ah': 3,
 'ai': 289,
 'ak': 188,
 'al': 3,
 'am': 1,
 'an': 3,
 'ao': 10,
 'ap': 5,
 'aq': 46,
 'as': 1155,
 'at': 5,
 'au': 3,
 'av': 2,
 'aw': 2,
 'ay': 40,
 'b': 12,
 'c': 1,
 'd': 12,
 'e': 61,
 'f': 59,
 'g': 10,
 'h': 4,
 'i': 15,
 'k': 16,
 'l': 1,
 'm': 284,
 'n': 97,
 'o': 1,
 'p': 1,
 'q': 3,
 'r': 101,
 's': 63,
 't': 17,
 'x': 8,
 'y': 8,
 'z': 14}

In [5]:
# let's look at X_train first, so we can compare it with the re-coded variable afterwards

X_train.head()

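|   | X1 | X2 | X3 | X4 | X5 | X6 |
|---|---|---|---|---|---|---|
| 3059 | aa | ai | c | d | q | g |
| 3014 | b | m | c | d | q | i |
| 3368 | o | f | f | d | s | l |
| 2772 | aa | as | d | d | p | j |
| 3383 | v | e | c | d | s | g |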


In [6]:
# And now let's replace each label in X2 by its count

# first we make a dictionary that maps each label to its count in the train set
X_frequency_map = X_train.X2.value_counts().to_dict()

# and now we replace the X2 labels in both train and test set with the same map
# (bracket assignment is the standard way to set a column in pandas)
X_train['X2'] = X_train['X2'].map(X_frequency_map)
X_test['X2'] = X_test['X2'].map(X_frequency_map)

X_train.head()

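|   | X1 | X2 | X3 | X4 | X5 | X6 |
|---|---|---|---|---|---|---|
| 3059 | aa | 289 | c | d | q | g |
| 3014 | b | 284 | c | d | q | i |
| 3368 | o | 59 | f | d | s | l |
| 2772 | aa | 1155 | d | d | p | j |
| 3383 | v | 61 | c | d | s | g |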


In the original dataset, the first observation shown (index 3059) had the label 'ai' in variable X2; it has now been replaced by its count, 289. The same holds for the rest of the categories (compare outputs 5 and 6).
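If you prefer the frequency over the count, the same procedure applies with `value_counts(normalize=True)`. The sketch below uses small made-up `train` and `test` frames as stand-ins for `X_train` and `X_test`, so that it runs without the csv file; the `X2_freq` column name is also just an assumption for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for X_train / X_test
train = pd.DataFrame({'X2': ['as', 'as', 'ae', 'as', 'ai', 'ae', 'as', 'ai']})
test = pd.DataFrame({'X2': ['ae', 'as', 'ai']})

# relative frequency of each label, learned from the training set only
frequency_map = train['X2'].value_counts(normalize=True).to_dict()

# apply the same map to both sets, keeping the original column for comparison
train['X2_freq'] = train['X2'].map(frequency_map)
test['X2_freq'] = test['X2'].map(frequency_map)

print(frequency_map)
print(test)
```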

### Note

I want you to keep in mind something important:

If a category is present in the test set that was not present in the train set, this method will generate missing data in the test set, because the map built on the train set has no entry for that label. This is why it is extremely important to handle rare categories, as covered in section 6 of this course.
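The following sketch shows the problem on made-up data: the label 'z' exists only in the test set, so mapping it produces a missing value. Filling with 0 at the end is just one possible option (an unseen label has, after all, a count of 0 in the train set), not the approach recommended by this course.

```python
import pandas as pd

# Hypothetical toy data: 'z' never appears in the train set
train = pd.Series(['a', 'a', 'b', 'a', 'b'], name='X2')
test = pd.Series(['a', 'b', 'z'], name='X2')

count_map = train.value_counts().to_dict()
test_encoded = test.map(count_map)

# labels without an entry in the map become NaN
n_missing = int(test_encoded.isnull().sum())
unseen = sorted(set(test) - set(train))
print(n_missing, unseen)

# one possible option (an assumption, not the course's method): treat unseen labels as count 0
test_filled = test_encoded.fillna(0).astype(int)
print(test_filled.tolist())
```

Checking `isnull().sum()` on the encoded test set is a cheap way to detect this situation before training a model.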

We can therefore combine rare label replacement with count encoding like this: we may choose to replace the 10 most frequent labels by their count, group all the other labels under a single label (for example "Rare"), and replace "Rare" by its count. Any label that appears only in the test set then falls under "Rare" as well, which takes care of the problem I just mentioned.
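A minimal sketch of this combination is shown below. The helper `count_encode_with_rare` and its parameters `n_top` and `rare_label` are hypothetical names introduced for this illustration, and the toy data keeps only the 2 most frequent labels so the example stays small.

```python
import pandas as pd

def count_encode_with_rare(train, test, n_top=10, rare_label='Rare'):
    """Hypothetical helper: keep the n_top most frequent train labels,
    group everything else under rare_label, then count-encode."""
    # most frequent labels, learned from the train set only
    top_labels = train.value_counts().index[:n_top]

    # every label outside the top group becomes rare_label
    train_grouped = train.where(train.isin(top_labels), rare_label)
    test_grouped = test.where(test.isin(top_labels), rare_label)

    # counts are again computed on the (grouped) train set
    count_map = train_grouped.value_counts().to_dict()
    # if the train set has no rare labels at all, give rare_label a count of 0
    count_map.setdefault(rare_label, 0)

    return train_grouped.map(count_map), test_grouped.map(count_map), count_map

# Hypothetical toy data: 'z' appears only in the test set
train = pd.Series(['a'] * 5 + ['b'] * 3 + ['c', 'd'])
test = pd.Series(['a', 'c', 'z'])

train_enc, test_enc, count_map = count_encode_with_rare(train, test, n_top=2)
print(count_map)
print(test_enc.tolist())
```

Here 'c' and 'd' are grouped as "Rare" with a combined count of 2, and the unseen label 'z' receives that same value instead of becoming missing.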

In the coming sections I will explain more methods of categorical encoding. I want you to keep in mind that there is no rule of thumb to indicate which method you should use to encode categorical variables. It mostly comes down to what makes sense for the data, and it also depends on what you are trying to achieve.

In general, for data competitions, we value model predictive power more, whereas in business scenarios we want to capture and understand the information, and we generally want to transform variables in a way that makes 'business sense'. Some common sense and a lot of conversation with the people who understand the data well will be required to encode categorical labels.
