# Count or Frequency Encoding

In count encoding, we replace each category with the number of observations that show that category in the dataset. In frequency encoding, we replace it with the frequency (the fraction or percentage) of observations that show it. Both techniques capture how well each label is represented in the dataset, but the encoding is not necessarily predictive of the outcome.
The underlying assumption is that the number of observations shown by each category is somewhat informative of the predictive power of that category.
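
To make the two definitions concrete, here is a minimal sketch on a toy column. The `colours` series and its values are hypothetical and are not taken from the House Pricing dataset.

```python
import pandas as pd

# Hypothetical toy column, used only for illustration
colours = pd.Series(["red", "blue", "red", "green", "red", "blue"], name="colour")

# Count encoding: each category -> number of rows in which it appears
count_map = colours.value_counts().to_dict()
count_encoded = colours.map(count_map)

# Frequency encoding: each category -> fraction of rows in which it appears
frequency_map = colours.value_counts(normalize=True).to_dict()
frequency_encoded = colours.map(frequency_map)

print(count_encoded.tolist())  # [3, 2, 3, 1, 3, 2]
print(frequency_encoded.tolist())
```

The frequencies are just the counts divided by the number of rows, so the two encodings rank the categories identically.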


### Advantage:
- Does not expand the feature space

### Limitation:
- If two different categories appear the same number of times in the dataset, they are replaced by the same number, so they can no longer be told apart and valuable information may be lost.
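
The limitation above can be sketched on toy data. The frame below is hypothetical: categories `A` and `B` both appear twice, so they collapse to the same encoded value even though their targets are very different.

```python
import pandas as pd

# Hypothetical toy data: "A" and "B" both appear twice
df = pd.DataFrame({
    "category": ["A", "A", "B", "B", "C"],
    "target": [10, 12, 95, 105, 50],
})

count_map = df["category"].value_counts().to_dict()
df["category_count"] = df["category"].map(count_map)

# Distinct original categories that share each encoded value
collisions = df.groupby("category_count")["category"].unique()
print(collisions)

# Mean target per original category, to show what the collision hides
print(df.groupby("category")["target"].mean())
```

After encoding, `A` and `B` both become 2, and a model sees a single value where the original data held two clearly different groups.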


### Dataset:
- House Pricing dataset


### Content:

1. First Steps.
2. Count and Frequency Encoding with Pandas:
    - count encoding,
    - frequency encoding.

In [72]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

## 1. First Steps

In [73]:
# load dataset

data = pd.read_csv(
    '../houseprice.csv',
    usecols=['Neighborhood', 'Exterior1st', 'Exterior2nd', 'SalePrice'])

data.head()

Unnamed: 0,Neighborhood,Exterior1st,Exterior2nd,SalePrice
0,CollgCr,VinylSd,VinylSd,208500
1,Veenker,MetalSd,MetalSd,181500
2,CollgCr,VinylSd,VinylSd,223500
3,Crawfor,Wd Sdng,Wd Shng,140000
4,NoRidge,VinylSd,VinylSd,250000


In [74]:
# number of distinct labels in each variable

for col in data.columns:
    print(col, ': ', len(data[col].unique()), ' labels')

Neighborhood :  25  labels
Exterior1st :  15  labels
Exterior2nd :  16  labels
SalePrice :  663  labels


In [75]:
# train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    data[['Neighborhood', 'Exterior1st', 'Exterior2nd']],
    data['SalePrice'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((1022, 3), (438, 3))

## 2. Count and Frequency Encoding with Pandas

### - count encoding

In [76]:
def find_counts(df, variable):
    # learn the count of each category from the dataframe passed in
    count_map = df[variable].value_counts().to_dict()
    return count_map

def replace_with_counts(train, test, variable, count_mapping):
    # categories missing from the mapping (e.g. unseen in train) become NaN
    train[variable] = train[variable].map(count_mapping)
    test[variable] = test[variable].map(count_mapping)

In [77]:
for variable in ["Neighborhood", "Exterior1st", "Exterior2nd"]:
    mappings = find_counts(X_train, variable)
    replace_with_counts(X_train, X_test, variable, mappings)

In [78]:
X_train

Unnamed: 0,Neighborhood,Exterior1st,Exterior2nd
64,105,364,353
682,24,148,142
960,41,148,112
1384,71,21,29
1100,18,148,142
...,...,...,...
763,30,364,353
835,61,364,141
1216,61,364,353
559,12,364,353
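
Because the mapping is learned on the train set only, a category that appears only in the test set has no entry in it, and `map` returns `NaN` for it. The sketch below shows this on hypothetical toy series. Filling with 0 is one possible choice and an assumption here, not something the notebook itself does.

```python
import pandas as pd

# Hypothetical train/test columns; "Veenker" never appears in train
train = pd.Series(["CollgCr", "CollgCr", "NoRidge"])
test = pd.Series(["CollgCr", "Veenker"])

count_map = train.value_counts().to_dict()
test_encoded = test.map(count_map)  # "Veenker" has no mapping -> NaN

# Assumption: treat unseen categories as observed zero times in train
test_filled = test_encoded.fillna(0).astype(int)

print(test_encoded.isna().tolist())  # [False, True]
print(test_filled.tolist())          # [2, 0]
```

Whether 0 is a sensible fill value depends on the model. Checking `X_test.isnull().sum()` after encoding is a quick way to see whether the issue arises at all.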


### - frequency encoding 

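The procedure is the same as for count encoding, except that each category is replaced by its frequency: its count divided by the number of rows in the train set.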

In [79]:
# split again to restore the original categories
# (X_train and X_test were overwritten by the count encoding)

X_train, X_test, y_train, y_test = train_test_split(
    data[['Neighborhood', 'Exterior1st', 'Exterior2nd']],
    data['SalePrice'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((1022, 3), (438, 3))

In [80]:
def find_frequency(df, variable):
    # fraction of rows in df that show each category
    frequency_map = (df[variable].value_counts() / len(df)).to_dict()
    return frequency_map

def replace_with_frequency(train, test, variable, frequency_mapping):
    # categories missing from the mapping (e.g. unseen in train) become NaN
    train[variable] = train[variable].map(frequency_mapping)
    test[variable] = test[variable].map(frequency_mapping)

In [81]:
for variable in ["Neighborhood", "Exterior1st", "Exterior2nd"]:
    mappings = find_frequency(X_train, variable)
    replace_with_frequency(X_train, X_test, variable, mappings)

In [82]:
X_train

Unnamed: 0,Neighborhood,Exterior1st,Exterior2nd
64,0.102740,0.356164,0.345401
682,0.023483,0.144814,0.138943
960,0.040117,0.144814,0.109589
1384,0.069472,0.020548,0.028376
1100,0.017613,0.144814,0.138943
...,...,...,...
763,0.029354,0.356164,0.345401
835,0.059687,0.356164,0.137965
1216,0.059687,0.356164,0.345401
559,0.011742,0.356164,0.345401
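
As a sanity check, the frequencies should equal the counts divided by the 1022 rows of the train set. The sketch below copies the first five `Neighborhood` values from the two outputs shown in this notebook and compares them. The variable names are illustrative only.

```python
n_train = 1022  # rows in X_train, from X_train.shape

# First five Neighborhood values from the count-encoded output
neighborhood_counts = [105, 24, 41, 71, 18]
# Matching values from the frequency-encoded output (rounded to 6 decimals)
neighborhood_freqs = [0.102740, 0.023483, 0.040117, 0.069472, 0.017613]

# Largest gap between count / n_train and the displayed frequency
max_gap = max(
    abs(count / n_train - freq)
    for count, freq in zip(neighborhood_counts, neighborhood_freqs)
)
print(max_gap < 1e-6)  # True
```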
