## Count or frequency encoding

In count encoding, we replace each category with the number of observations that show that category in the dataset. Similarly, in frequency encoding, we replace each category with the fraction (or percentage) of observations that show it. For example, if 10 of our 100 observations show the colour blue, we would replace blue with 10 when doing count encoding, or with 0.1 when doing frequency encoding. These techniques capture how well each label is represented in the dataset, but the resulting encoding is not necessarily predictive of the outcome. They are, however, very popular encoding methods in Kaggle competitions.

The assumption behind this technique is that the number of observations shown by each category is somewhat informative of the predictive power of that category.
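As a minimal sketch of the two encodings, the toy series below is made up to match the colour example (10 blue observations out of 100); the variable names are illustrative only.

```python
import pandas as pd

# Toy data (made up for illustration): 10 of 100 observations are blue
colour = pd.Series(['blue'] * 10 + ['red'] * 60 + ['green'] * 30, name='colour')

# Number of observations per category
counts = colour.value_counts()

# Fraction of observations per category
frequencies = colour.value_counts(normalize=True)

# Replace each label by its count or by its frequency
count_encoded = colour.map(counts)
frequency_encoded = colour.map(frequencies)

# The first observation is blue
print(count_encoded.iloc[0], frequency_encoded.iloc[0])  # prints: 10 0.1
```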


### Advantages

- Simple
- Does not expand the feature space

### Disadvantages

- If two different categories appear in the same number of observations, they will be replaced by the same number, so the distinction between them, and potentially valuable information, is lost.

For example, if there are 10 observations for the category blue and 10 observations for the category red, both will be replaced by 10 and, after the encoding, will appear to be the same thing.
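The collision can be reproduced on a small made-up series (the data and variable names are illustrative assumptions): three categories go in, but only two distinct values come out.

```python
import pandas as pd

# Toy data (made up): blue and red both appear 10 times, green 5 times
colour = pd.Series(['blue'] * 10 + ['red'] * 10 + ['green'] * 5)

counts = colour.value_counts()
encoded = colour.map(counts)

n_categories_before = colour.nunique()  # 3 distinct labels
n_values_after = encoded.nunique()      # 2 distinct values: blue and red share 10

print(n_categories_before, n_values_after)  # prints: 3 2
```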


Follow this [thread in Kaggle](https://www.kaggle.com/general/16927) for more information.



## In this assignment:

You have to perform count or frequency encoding with:
- pandas
- Feature-Engine

You will also compare the advantages and limitations of each implementation, using the House Prices dataset.

In [3]:
pip install feature_engine

Successfully installed feature-engine-1.0.2 statsmodels-0.12.2


In [4]:
import numpy as np
import pandas as pd

# to split the datasets
from sklearn.model_selection import train_test_split

# to encode with feature-engine
from feature_engine.encoding import CountFrequencyEncoder

In [5]:
from google.colab import files
uploaded = files.upload()

Saving houseprice.csv to houseprice.csv


In [6]:
# load dataset

data = pd.read_csv(
    'houseprice.csv',
    usecols=['Neighborhood', 'Exterior1st', 'Exterior2nd', 'SalePrice'])

data.head()

|   | Neighborhood | Exterior1st | Exterior2nd | SalePrice |
|---|---|---|---|---|
| 0 | CollgCr | VinylSd | VinylSd | 208500 |
| 1 | Veenker | MetalSd | MetalSd | 181500 |
| 2 | CollgCr | VinylSd | VinylSd | 223500 |
| 3 | Crawfor | Wd Sdng | Wd Shng | 140000 |
| 4 | NoRidge | VinylSd | VinylSd | 250000 |


In [7]:
# let's have a look at how many labels each variable has
for col in data:
    print(col, ':', len(data[col].unique()), 'labels')

Neighborhood : 25 labels
Exterior1st : 15 labels
Exterior2nd : 16 labels
SalePrice : 663 labels


### Important

When doing count transformation of categorical variables, it is important to calculate the count (or frequency = count / total observations) **over the training set**, and then use those numbers to replace the labels in the test set.
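A minimal sketch of this train-then-apply pattern, on made-up toy series (all names here are illustrative). Note that a category absent from the training set has no count to map to; filling it with 0 is just one possible choice, assumed here for illustration.

```python
import pandas as pd

# Toy data (made up): 'd' never appears in the training set
train = pd.Series(['a', 'a', 'b', 'a', 'c', 'b'], name='cat')
test = pd.Series(['a', 'b', 'd'], name='cat')

# Learn the counts from the training set only
count_map = train.value_counts().to_dict()

# Apply the same mapping to both sets
train_encoded = train.map(count_map)
test_encoded = test.map(count_map)  # the unseen label 'd' becomes NaN

# Assumption for illustration: treat unseen labels as having count 0
test_filled = test_encoded.fillna(0).astype(int)

print(count_map)
print(test_filled.tolist())  # prints: [3, 2, 0]
```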

In [26]:
# let's separate into training and testing set
X_train, X_test, y_train, y_test = train_test_split(
    data[['Neighborhood', 'Exterior1st', 'Exterior2nd']],  # predictors
    data['SalePrice'],  # target
    test_size=0.3)

X_train.shape, X_test.shape

((1022, 3), (438, 3))

## Count and Frequency encoding with pandas

In [18]:
# let's obtain the counts for each one of the labels
# in the variable Neighborhood, using the training set only
cat_dict = dict(X_train.groupby('Neighborhood').size().sort_values(ascending=False))
cat_dict

{'Blmngtn': 11,
 'Blueste': 2,
 'BrDale': 11,
 'BrkSide': 41,
 'ClearCr': 21,
 'CollgCr': 99,
 'Crawfor': 34,
 'Edwards': 71,
 'Gilbert': 58,
 'IDOTRR': 30,
 'MeadowV': 11,
 'Mitchel': 30,
 'NAmes': 151,
 'NPkVill': 5,
 'NWAmes': 56,
 'NoRidge': 30,
 'NridgHt': 59,
 'OldTown': 81,
 'SWISU': 20,
 'Sawyer': 51,
 'SawyerW': 41,
 'Somerst': 57,
 'StoneBr': 19,
 'Timber': 26,
 'Veenker': 7}

The dictionary contains the number of observations per category in Neighborhood.

In [19]:
# replace the labels with the counts
X_train['Neighborhood'] = X_train['Neighborhood'].map(cat_dict)

In [20]:
# let's explore the result
X_train['Neighborhood'].head(10)

995    41
866    26
592    30
360    30
477    59
590    99
105    57
134    51
343    59
92     34
Name: Neighborhood, dtype: int64

In [21]:
X_train

|   | Neighborhood | Exterior1st | Exterior2nd |
|---|---|---|---|
| 995 | 41 | MetalSd | MetalSd |
| 866 | 26 | VinylSd | VinylSd |
| 592 | 30 | HdBoard | HdBoard |
| 360 | 30 | VinylSd | VinylSd |
| 477 | 59 | VinylSd | VinylSd |
| ... | ... | ... | ... |
| 1305 | 59 | VinylSd | VinylSd |
| 1234 | 20 | MetalSd | MetalSd |
| 1159 | 56 | HdBoard | HdBoard |
| 1066 | 58 | VinylSd | VinylSd |


In [22]:
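# if instead of the count we would like the frequency
# we need only divide the count by the total number of observations.
# note: the denominator used here is the size of the full dataset (data);
# to obtain frequencies over the training set only, divide by len(X_train)

# build a new dictionary so that cat_dict keeps the original counts
frequency_map3 = {label: count / len(data['Neighborhood'])
                  for label, count in cat_dict.items()}
frequency_map3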

{'Blmngtn': 0.007534246575342466,
 'Blueste': 0.0013698630136986301,
 'BrDale': 0.007534246575342466,
 'BrkSide': 0.028082191780821917,
 'ClearCr': 0.014383561643835616,
 'CollgCr': 0.06780821917808219,
 'Crawfor': 0.023287671232876714,
 'Edwards': 0.04863013698630137,
 'Gilbert': 0.03972602739726028,
 'IDOTRR': 0.02054794520547945,
 'MeadowV': 0.007534246575342466,
 'Mitchel': 0.02054794520547945,
 'NAmes': 0.10342465753424658,
 'NPkVill': 0.003424657534246575,
 'NWAmes': 0.038356164383561646,
 'NoRidge': 0.02054794520547945,
 'NridgHt': 0.04041095890410959,
 'OldTown': 0.05547945205479452,
 'SWISU': 0.0136986301369863,
 'Sawyer': 0.03493150684931507,
 'SawyerW': 0.028082191780821917,
 'Somerst': 0.03904109589041096,
 'StoneBr': 0.013013698630136987,
 'Timber': 0.01780821917808219,
 'Veenker': 0.004794520547945206}

In [25]:
X_train

|   | Neighborhood | Exterior1st | Exterior2nd |
|---|---|---|---|
| 995 | 41 | MetalSd | MetalSd |
| 866 | 26 | VinylSd | VinylSd |
| 592 | 30 | HdBoard | HdBoard |
| 360 | 30 | VinylSd | VinylSd |
| 477 | 59 | VinylSd | VinylSd |
| ... | ... | ... | ... |
| 1305 | 59 | VinylSd | VinylSd |
| 1234 | 20 | MetalSd | MetalSd |
| 1159 | 56 | HdBoard | HdBoard |
| 1066 | 58 | VinylSd | VinylSd |


In [30]:
# replace the labels with the frequencies
# note: this assumes X_train['Neighborhood'] still holds the original labels;
# if it was already replaced by counts, re-run the train / test split first
X_train['Neighborhood'] = X_train['Neighborhood'].map(frequency_map3)
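The frequencies can also be learned directly from the training set with `value_counts(normalize=True)`, which divides each count by the number of training rows. The sketch below uses a made-up toy frame (the `_toy` names and the `Neighborhood_freq` column are hypothetical) and writes the encoding to a new column, so the original labels remain available.

```python
import pandas as pd

# Toy frames (made up for illustration)
X_train_toy = pd.DataFrame(
    {'Neighborhood': ['NAmes', 'NAmes', 'CollgCr', 'NAmes', 'OldTown']})
X_test_toy = pd.DataFrame({'Neighborhood': ['CollgCr', 'NAmes']})

# Frequency = count / number of training observations
frequency_map = X_train_toy['Neighborhood'].value_counts(normalize=True).to_dict()

# Encode into a new column, using the training-set mapping for both sets
X_train_toy['Neighborhood_freq'] = X_train_toy['Neighborhood'].map(frequency_map)
X_test_toy['Neighborhood_freq'] = X_test_toy['Neighborhood'].map(frequency_map)

print(frequency_map)
print(X_test_toy)
```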



## Count or Frequency Encoding with Feature-Engine

In [36]:
# let's separate into training and testing set again,
# so that X_train holds the original labels
X_train, X_test, y_train, y_test = train_test_split(
    data[['Neighborhood', 'Exterior1st', 'Exterior2nd']],  # predictors
    data['SalePrice'],  # target
    test_size=0.3)

X_train.shape, X_test.shape

((1022, 3), (438, 3))

In [37]:
# set up the encoder, learn the counts from the train set
# and replace the labels in the train set with those counts
encoder = CountFrequencyEncoder(encoding_method='count')
X_train = encoder.fit_transform(X_train)
X_train.head()

|   | Neighborhood | Exterior1st | Exterior2nd |
|---|---|---|---|
| 849 | 9 | 74 | 99 |
| 723 | 69 | 160 | 156 |
| 503 | 38 | 38 | 132 |
| 1350 | 163 | 160 | 156 |
| 206 | 51 | 161 | 147 |


**Note**

If the argument `variables` is left as None, the encoder will automatically identify all categorical variables. Is that not sweet?

The encoder will not encode numerical variables. So if some of your numerical variables are in fact categories, you will need to re-cast them as object before using the encoder.

Note that if the test set contains a category that was not seen in the train set, the encoder has no number to assign to it. Depending on the Feature-Engine version and settings, the encoder will then either raise an error or return NaN for that category with a warning.