
## Count or frequency encoding

In count encoding, we replace each category with the number of observations that show that category in the dataset. Similarly, in frequency encoding we replace the category with the fraction (or percentage) of observations that show it. That is, if 10 of our 100 observations show the colour blue, we would replace blue with 10 if doing count encoding, or with 0.1 if doing frequency encoding. These techniques capture how well each label is represented in the dataset, but the encoding is not necessarily predictive of the outcome. They are, however, very popular encoding methods in Kaggle competitions.

The assumption behind this technique is that the number of observations shown by each category is somewhat informative of the category's predictive power.
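To make the blue example concrete, here is a minimal sketch on made-up data: a toy series called `colour` (not part of the house price dataset) with 10 blue observations out of 100. It only illustrates the two mappings and is not the code used later in this notebook.

```python
import pandas as pd

# toy data (made up for illustration): 10 of 100 observations are 'blue'
colour = pd.Series(['blue'] * 10 + ['red'] * 90, name='colour')

# count encoding: category -> number of observations that show it
count_map = colour.value_counts().to_dict()

# frequency encoding: category -> fraction of observations that show it
frequency_map = colour.value_counts(normalize=True).to_dict()

# replace the labels with the learned numbers
colour_count = colour.map(count_map)
colour_frequency = colour.map(frequency_map)

print(count_map)
print(frequency_map)
```

Here blue is mapped to 10 under count encoding and to 0.1 under frequency encoding, as described above.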


### Advantages

- Simple
- Does not expand the feature space

### Disadvantages

- If 2 different categories appear the same number of times in the dataset, they will be replaced by the same number, so the encoding can no longer tell them apart and may lose valuable information.

For example, if there are 10 observations for the category blue and 10 observations for the category red, both will be replaced by 10 and, after the encoding, will appear to be the same thing.
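The collision can be seen on a small made-up series (the colours and counts below are invented for illustration): before encoding there are 3 distinct categories, after encoding only 2 distinct values remain.

```python
import pandas as pd

# toy data: blue and red both appear 10 times, green appears 5 times
colour = pd.Series(['blue'] * 10 + ['red'] * 10 + ['green'] * 5)

# learn the counts and replace the labels
count_map = colour.value_counts().to_dict()
encoded = colour.map(count_map)

# number of distinct values before and after the encoding
n_before = colour.nunique()
n_after = encoded.nunique()

print(n_before, n_after)
```

Blue and red both become 10, so a model trained on the encoded variable cannot distinguish between them.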


See this [thread on Kaggle](https://www.kaggle.com/general/16927) for more information.




In [43]:
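# the import below uses the pre-1.0 feature-engine API, so we pin the version;
# from feature-engine 1.0 onwards the class is feature_engine.encoding.CountFrequencyEncoder
!pip install "feature_engine<1.0"
import numpy as np
import pandas as pd

# to split the datasets
from sklearn.model_selection import train_test_split

# to encode with feature-engine
from feature_engine.categorical_encoders import CountFrequencyCategoricalEncoder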



In [44]:
# load dataset

data = pd.read_csv(
    'houseprice_train.csv',
    usecols=['Neighborhood', 'Exterior1st', 'Exterior2nd', 'SalePrice'])

data.head()

| | Neighborhood | Exterior1st | Exterior2nd | SalePrice |
|---|---|---|---|---|
| 0 | CollgCr | VinylSd | VinylSd | 208500 |
| 1 | Veenker | MetalSd | MetalSd | 181500 |
| 2 | CollgCr | VinylSd | VinylSd | 223500 |
| 3 | Crawfor | Wd Sdng | Wd Shng | 140000 |
| 4 | NoRidge | VinylSd | VinylSd | 250000 |


In [45]:
# let's have a look at how many labels each variable has

for col in data.columns:
    print(col, ': ', len(data[col].unique()), ' labels')

Neighborhood :  25  labels
Exterior1st :  15  labels
Exterior2nd :  16  labels
SalePrice :  663  labels


### Important

When doing count transformation of categorical variables, it is important to calculate the count (or frequency = count / total observations) **over the training set**, and then use those numbers to replace the labels in the test set.
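A minimal sketch of this train/test discipline, using two small made-up data frames (`train` and `test` are illustrative names, not the house price data): the counts are learned from the train set only and then mapped onto both sets. With plain pandas, a category that appears only in the test set has no entry in the dictionary and ends up as NaN.

```python
import pandas as pd

# toy train and test sets (made up); 'yellow' appears only in the test set
train = pd.DataFrame({'colour': ['blue', 'blue', 'red', 'green', 'blue', 'red']})
test = pd.DataFrame({'colour': ['red', 'blue', 'yellow']})

# learn the counts over the training set only
count_map = train['colour'].value_counts().to_dict()

# use the same numbers to replace the labels in both sets
train_enc = train['colour'].map(count_map)
test_enc = test['colour'].map(count_map)

print(test_enc)
```

The test set is encoded with the train counts (red becomes 2, blue becomes 3), and the unseen category yellow becomes NaN, which needs to be handled before training a model.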

In [46]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    data[['Neighborhood', 'Exterior1st', 'Exterior2nd']], # predictors
    data['SalePrice'],  # target
    test_size=0.3,  # percentage of obs in test set
    random_state=0)  # seed to ensure reproducibility

X_train.shape, X_test.shape

((1022, 3), (438, 3))

## Count and Frequency encoding with pandas

In [47]:
# let's obtain the counts for each one of the labels
# in the variable Neighborhood

count_map = X_train['Neighborhood'].value_counts().to_dict()

count_map

{'Blmngtn': 12,
 'Blueste': 2,
 'BrDale': 10,
 'BrkSide': 41,
 'ClearCr': 24,
 'CollgCr': 105,
 'Crawfor': 35,
 'Edwards': 71,
 'Gilbert': 55,
 'IDOTRR': 24,
 'MeadowV': 12,
 'Mitchel': 36,
 'NAmes': 151,
 'NPkVill': 7,
 'NWAmes': 51,
 'NoRidge': 30,
 'NridgHt': 51,
 'OldTown': 73,
 'SWISU': 18,
 'Sawyer': 61,
 'SawyerW': 45,
 'Somerst': 56,
 'StoneBr': 16,
 'Timber': 30,
 'Veenker': 6}

The dictionary contains the number of observations per category in Neighborhood.

In [0]:
# replace the labels with the counts

X_train['Neighborhood'] = X_train['Neighborhood'].map(count_map)
X_test['Neighborhood'] = X_test['Neighborhood'].map(count_map)

In [49]:
# let's explore the result

X_train['Neighborhood'].head(10)

64      105
682      24
960      41
1384     71
1100     18
416      61
1034     35
853     151
472      71
1011     71
Name: Neighborhood, dtype: int64

In [50]:
# if instead of the count we would like the frequency
# we need only divide the count by the total number of observations:

frequency_map = (X_train['Exterior1st'].value_counts() / len(X_train) ).to_dict()
frequency_map

{'AsbShng': 0.014677103718199608,
 'AsphShn': 0.0009784735812133072,
 'BrkComm': 0.0009784735812133072,
 'BrkFace': 0.03424657534246575,
 'CBlock': 0.0009784735812133072,
 'CemntBd': 0.03816046966731898,
 'HdBoard': 0.149706457925636,
 'ImStucc': 0.0009784735812133072,
 'MetalSd': 0.1350293542074364,
 'Plywood': 0.08414872798434442,
 'Stone': 0.0019569471624266144,
 'Stucco': 0.016634050880626222,
 'VinylSd': 0.3561643835616438,
 'Wd Sdng': 0.14481409001956946,
 'WdShing': 0.02054794520547945}

In [0]:
# replace the labels with the frequencies

X_train['Exterior1st'] = X_train['Exterior1st'].map(frequency_map)
X_test['Exterior1st'] = X_test['Exterior1st'].map(frequency_map)

We can then put these commands into 2 functions as we did in the previous 3 notebooks, and loop over all the categorical variables. If you don't know how to do this, please check any of the previous notebooks.
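As a rough sketch of what those 2 functions could look like: the names `find_count_map` and `apply_count_map` are hypothetical (they are not taken from the previous notebooks), and the toy data frames stand in for `X_train` and `X_test`.

```python
import pandas as pd


def find_count_map(df, variable, as_frequency=False):
    # learn {category: count} (or {category: frequency}) for one variable
    return df[variable].value_counts(normalize=as_frequency).to_dict()


def apply_count_map(train, test, variable, mapping):
    # work on copies so the original data frames are left untouched
    train, test = train.copy(), test.copy()
    train[variable] = train[variable].map(mapping)
    test[variable] = test[variable].map(mapping)
    return train, test


# toy data (made up) in place of X_train and X_test
train = pd.DataFrame({'a': ['x', 'x', 'y'], 'b': ['p', 'q', 'q']})
test = pd.DataFrame({'a': ['y', 'x'], 'b': ['q', 'p']})

# loop over all the categorical variables
for variable in ['a', 'b']:
    mapping = find_count_map(train, variable)  # learned from the train set only
    train, test = apply_count_map(train, test, variable, mapping)

print(train)
print(test)
```

Passing `as_frequency=True` to the first function would give frequency encoding instead of count encoding.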

## Count or Frequency Encoding with Feature-Engine

In [52]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    data[['Neighborhood', 'Exterior1st', 'Exterior2nd']], # predictors
    data['SalePrice'],  # target
    test_size=0.3,  # percentage of obs in test set
    random_state=0)  # seed to ensure reproducibility

X_train.shape, X_test.shape

((1022, 3), (438, 3))

In [53]:
count_enc = CountFrequencyCategoricalEncoder(
    encoding_method='count', # to do frequency ==> encoding_method='frequency'
    variables=['Neighborhood', 'Exterior1st', 'Exterior2nd'])

count_enc.fit(X_train)

CountFrequencyCategoricalEncoder(encoding_method='count',
                                 variables=['Neighborhood', 'Exterior1st',
                                            'Exterior2nd'])

In [54]:
# in the encoder dict we can observe the number of 
# observations per category for each variable

count_enc.encoder_dict_

{'Exterior1st': {'AsbShng': 15,
  'AsphShn': 1,
  'BrkComm': 1,
  'BrkFace': 35,
  'CBlock': 1,
  'CemntBd': 39,
  'HdBoard': 153,
  'ImStucc': 1,
  'MetalSd': 138,
  'Plywood': 86,
  'Stone': 2,
  'Stucco': 17,
  'VinylSd': 364,
  'Wd Sdng': 148,
  'WdShing': 21},
 'Exterior2nd': {'AsbShng': 17,
  'AsphShn': 1,
  'Brk Cmn': 4,
  'BrkFace': 18,
  'CBlock': 1,
  'CmentBd': 39,
  'HdBoard': 141,
  'ImStucc': 8,
  'MetalSd': 136,
  'Other': 1,
  'Plywood': 112,
  'Stone': 4,
  'Stucco': 16,
  'VinylSd': 353,
  'Wd Sdng': 142,
  'Wd Shng': 29},
 'Neighborhood': {'Blmngtn': 12,
  'Blueste': 2,
  'BrDale': 10,
  'BrkSide': 41,
  'ClearCr': 24,
  'CollgCr': 105,
  'Crawfor': 35,
  'Edwards': 71,
  'Gilbert': 55,
  'IDOTRR': 24,
  'MeadowV': 12,
  'Mitchel': 36,
  'NAmes': 151,
  'NPkVill': 7,
  'NWAmes': 51,
  'NoRidge': 30,
  'NridgHt': 51,
  'OldTown': 73,
  'SWISU': 18,
  'Sawyer': 61,
  'SawyerW': 45,
  'Somerst': 56,
  'StoneBr': 16,
  'Timber': 30,
  'Veenker': 6}}

In [55]:
X_train = count_enc.transform(X_train)
X_test = count_enc.transform(X_test)

# let's explore the result
X_train.head()

| | Neighborhood | Exterior1st | Exterior2nd |
|---|---|---|---|
| 64 | 105 | 364 | 353 |
| 682 | 24 | 148 | 142 |
| 960 | 41 | 148 | 112 |
| 1384 | 71 | 21 | 29 |
| 1100 | 18 | 148 | 142 |


**Note**

If the argument variables is left as None, the encoder will automatically identify all categorical variables. Is that not sweet?

The encoder will not encode numerical variables. So if some of your numerical variables are in fact categories, you will need to re-cast them as object before using the encoder.
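Re-casting can be done with pandas alone. The sketch below uses a made-up data frame in which `building_class` is a numeric code that is really a category. The `select_dtypes` call is only a stand-in to show the effect of the cast; how the encoder detects categorical variables internally is not shown here and may differ.

```python
import pandas as pd

# toy data (made up): 'building_class' is stored as integers but is really a category
df = pd.DataFrame({
    'building_class': [20, 60, 20, 30],
    'area': [850.0, 1200.0, 900.0, 700.0],
})

# re-cast the numerical codes as object so they are treated as categories
df['building_class'] = df['building_class'].astype('object')

# columns of type object after the cast
categorical = df.select_dtypes(include='object').columns.tolist()

print(df.dtypes)
print(categorical)
```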

Note that if a variable in the test set contains a category for which the encoder has no number to assign (because the category was not seen in the train set), the encoder will return an error.
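One way to anticipate this is to check for unseen categories before transforming the test set. The sketch below uses pandas only, on made-up data frames; it is a precautionary check we add for illustration and not part of the encoder itself.

```python
import pandas as pd

# toy data (made up): 'Veenker' appears in the test set but not in the train set
train = pd.DataFrame({'Neighborhood': ['NAmes', 'CollgCr', 'NAmes']})
test = pd.DataFrame({'Neighborhood': ['NAmes', 'Veenker']})

# for each variable, collect the test categories that were not seen in train
unseen = {
    col: sorted(set(test[col].dropna()) - set(train[col].dropna()))
    for col in train.columns
}

# keep only the variables that actually have unseen categories
unseen = {col: cats for col, cats in unseen.items() if cats}

print(unseen)
```

If the dictionary is not empty, the affected categories can be grouped or otherwise handled before the encoder is applied.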