# Count or Frequency Encoding
-----
Count or Frequency encoding is a technique used in data preprocessing and feature engineering in machine learning. It involves replacing categorical variables with their respective counts or frequencies of occurrence within the dataset.

**In count encoding,** each category or label in a categorical variable is replaced with the number of times it appears in the dataset. For example, if we have a categorical variable "Color" with the observed values {red, blue, green, red, red, blue}, count encoding would replace them with {3, 2, 1, 3, 3, 2}, the number of occurrences of each category.
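
The "Color" example can be reproduced with a minimal pandas sketch. The DataFrame `colors` and the column name `Color_count` are made up for illustration and are not part of the Mercedes-Benz dataset used later.

```python
import pandas as pd

# Toy data mirroring the "Color" example (hypothetical names)
colors = pd.DataFrame({"Color": ["red", "blue", "green", "red", "red", "blue"]})

# Build a label -> count lookup, then map every row through it
count_map = colors["Color"].value_counts().to_dict()
colors["Color_count"] = colors["Color"].map(count_map)

print(colors["Color_count"].tolist())  # [3, 2, 1, 3, 3, 2]
```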

**In frequency encoding,** each category is replaced with the proportion or percentage of times it appears in the dataset. It is similar to count encoding, but instead of using the raw count, we divide the count of each category by the total number of observations. Using the same example as above, the frequency encoding would replace the categories with {0.5, 0.33, 0.17, 0.5, 0.5, 0.33}.
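
A similar sketch for frequency encoding, assuming the same toy "Color" values: `value_counts(normalize=True)` returns proportions instead of raw counts. The names `colors_freq` and `Color_freq` are illustrative, and the rounding is only there to match the two-decimal figures in the example.

```python
import pandas as pd

# Toy data mirroring the "Color" example (hypothetical names)
colors_freq = pd.DataFrame({"Color": ["red", "blue", "green", "red", "red", "blue"]})

# normalize=True divides each count by the total number of observations
freq_map = colors_freq["Color"].value_counts(normalize=True)
colors_freq["Color_freq"] = colors_freq["Color"].map(freq_map).round(2)

print(colors_freq["Color_freq"].tolist())  # [0.5, 0.33, 0.17, 0.5, 0.5, 0.33]
```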

Count or frequency encoding can be useful when dealing with categorical variables, especially when the cardinality (the number of unique categories) is high. It turns each category into a single numerical value that reflects how common it is, so machine learning algorithms that require numeric input can use the feature without adding one column per category. Note that the encoding does not look at the target variable; it helps only to the extent that how often a category occurs is related to the target.

**However,** it is important to note that count or frequency encoding may introduce some bias in the data, especially if certain categories have a disproportionately large count or frequency. Therefore, it is essential to consider the potential impact on the model and evaluate the performance before and after applying this encoding technique.

In [1]:
import pandas as pd
import numpy as np
import warnings as wr
wr.filterwarnings('ignore')

In [2]:
# Forward slash avoids the invalid '\M' escape sequence and works on every OS
df = pd.read_csv('Datasets/Mercedes-Benz.csv', usecols = ['X1', 'X2'])
df.head()

Unnamed: 0,X1,X2
0,v,at
1,t,av
2,w,n
3,t,n
4,v,n


In [3]:
df.shape

(4209, 2)

#### Let's first check the "One-Hot Encoding" technique

In [4]:
pd.get_dummies(df).shape

(4209, 71)

In [5]:
len(df['X1'].unique())

27

In [6]:
len(df['X2'].unique())

44

In [7]:
# Print the number of distinct labels in each column
for col in df.columns:
    print(col, ':', len(df[col].unique()), ' labels')

X1 : 27  labels
X2 : 44  labels


In [8]:
# Map each X2 label to its raw count (number of occurrences, not a proportion)
df_frequency = df.X2.value_counts().to_dict()
df_frequency

{'as': 1659,
 'ae': 496,
 'ai': 415,
 'm': 367,
 'ak': 265,
 'r': 153,
 'n': 137,
 's': 94,
 'f': 87,
 'e': 81,
 'aq': 63,
 'ay': 54,
 'a': 47,
 't': 29,
 'k': 25,
 'i': 25,
 'b': 21,
 'ao': 20,
 'ag': 19,
 'z': 19,
 'd': 18,
 'ac': 13,
 'g': 12,
 'ap': 11,
 'y': 11,
 'x': 10,
 'aw': 8,
 'at': 6,
 'h': 6,
 'al': 5,
 'an': 5,
 'q': 5,
 'av': 4,
 'ah': 4,
 'p': 4,
 'au': 3,
 'am': 1,
 'j': 1,
 'af': 1,
 'l': 1,
 'aa': 1,
 'c': 1,
 'o': 1,
 'ar': 1}

In [9]:
# Replace each X2 label in df with its count from the dictionary
df.X2 = df.X2.map(df_frequency)
df.head()

Unnamed: 0,X1,X2
0,v,6
1,t,4
2,w,137
3,t,137
4,v,137


In [13]:
df['X2'].value_counts()

1659    1659
496      496
415      415
367      367
265      265
153      153
137      137
94        94
87        87
81        81
63        63
54        54
25        50
47        47
19        38
29        29
11        22
21        21
20        20
18        18
5         15
13        13
6         12
12        12
4         12
10        10
8          8
1          8
3          3
Name: X2, dtype: int64

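### Disadvantage
If two or more categories occur the same number of times in the dataset, they all receive the same encoded value, so the model can no longer tell them apart. In the output above, for example, the eight distinct X2 labels that each occur once are all encoded as 1.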
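
One way to spot such collisions before encoding is to group the labels by their count. The following is a minimal sketch on a made-up Series `s`, not on the dataset above; all names in it are hypothetical.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical toy column: "b" and "c" both occur twice
s = pd.Series(["a", "a", "a", "b", "b", "c", "c", "d"], name="cat")

counts = s.value_counts()

# Collect the labels that share each count
labels_by_count = defaultdict(list)
for label, n in counts.items():
    labels_by_count[int(n)].append(label)

# Keep only the counts shared by more than one label
collisions = {n: sorted(labels) for n, labels in labels_by_count.items() if len(labels) > 1}

print(collisions)  # {2: ['b', 'c']}
```

If `collisions` is non-empty, the listed labels become indistinguishable after count encoding, which may or may not matter for the model at hand.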

### Advantages

- Easy to implement
- It does not increase the dimension of the feature space
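
The second point can be illustrated with a small comparison against one-hot encoding. This is a sketch on a made-up frame `toy` with four labels; the names are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with one categorical column and four distinct labels
toy = pd.DataFrame({"X": ["a", "b", "c", "d", "a", "b"]})

# One-hot encoding creates one column per distinct label
one_hot = pd.get_dummies(toy)

# Count encoding overwrites the column in place, so the width is unchanged
count_encoded = toy.assign(X=toy["X"].map(toy["X"].value_counts()))

print(one_hot.shape)        # (6, 4)
print(count_encoded.shape)  # (6, 1)
```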