## Engineering Rare Labels / Rare Categories

Rare values are labels (categories) of a categorical variable that are present in only a small percentage of the observations.

There is no fixed rule for how small a percentage must be to count as rare, but labels present in less than 5% of the observations are typically treated as rare, and they may cause over-fitting in tree-based models.
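
As a minimal sketch of this idea, the snippet below computes the fraction of observations per label and flags those under a threshold. The toy variable `colours` and the 5% cut-off are assumptions made for illustration; the threshold should be tuned for each dataset.

```python
import pandas as pd

# Toy data (hypothetical): one categorical variable with an infrequent label
colours = pd.Series(['red'] * 60 + ['blue'] * 37 + ['green'] * 3)

# Fraction of observations that show each label
freq = colours.value_counts(normalize=True)

# Labels below the chosen threshold are flagged as rare
threshold = 0.05  # assumption: not a universal rule
rare_labels = freq[freq < threshold].index.tolist()
print(rare_labels)
```

Here only 'green', present in 3% of the rows, falls below the threshold.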

Rare labels may exist in variables that intrinsically have a huge number of labels, or in variables with only a few labels (e.g., 2-10). There is also no rule of thumb for how many different labels count as many (and therefore represent high cardinality); it depends on how many observations the dataset contains. In a dataset with 1,000 observations, 100 labels may seem a lot, whereas in a dataset with 100,000 observations it may not be so high.
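
One rough way to put cardinality in context is to look at the average number of observations per label. The sketch below uses two synthetic datasets (hypothetical, built only to mirror the 1,000 vs 100,000 example above); it is an illustration, not a formal criterion.

```python
import pandas as pd

# Two synthetic datasets with the same 100 labels but different sizes
small = pd.DataFrame({'label': [f'cat_{i % 100}' for i in range(1_000)]})
large = pd.DataFrame({'label': [f'cat_{i % 100}' for i in range(100_000)]})

for name, df in [('small', small), ('large', large)]:
    n_labels = df['label'].nunique()
    obs_per_label = len(df) / n_labels  # average observations per label
    print(name, n_labels, obs_per_label)
```

With the same 100 labels, each label is backed by 10 observations on average in the small dataset and by 1,000 in the large one.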

When rare labels are present in variables with only a few categories, the rare label may be adding some information. On the other hand, variables with a high number of categories will likely contain many low-frequency labels, which are more likely to add noise than information.

Whether rare labels should be processed before training a machine learning algorithm will depend on the dataset and problem at hand. Ideally, if there are not too many variables, you could try and explore the variables and their categories one at a time, and determine whether the rare labels add noise or information.

If, on the other hand, the dataset has very many categorical variables and exploring them one at a time is not an option, you may choose to sacrifice optimal performance for a higher delivery speed.

### Engineering rare labels

There are multiple ways of accounting for rare labels. Some of them handle rare labels while converting labels into numbers. I will explain those in the section "Engineer labels of categorical variables".

In this section of the course, I will expand on how to handle rare labels by re-categorising the observations that show rare labels for a certain variable. These observations can be re-categorised by:

- Replacing the rare label by most frequent label
- Grouping the observations that show rare labels into a unique category (with a new label like 'Rare', or 'Other')
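
The two options above can be sketched on a toy variable as follows. The series `s`, its labels and the 5% threshold are hypothetical and only serve to show the mechanics.

```python
import pandas as pd

# Toy variable (hypothetical): 'C' and 'D' are each present in 3% of the rows
s = pd.Series(['A'] * 70 + ['B'] * 24 + ['C'] * 3 + ['D'] * 3)

freq = s.value_counts(normalize=True)
rare = freq[freq < 0.05].index   # labels under the assumed 5% threshold
most_frequent = freq.idxmax()    # 'A' in this toy example

# Option 1: replace rare labels by the most frequent label
by_frequent = s.where(~s.isin(rare), most_frequent)

# Option 2: group rare labels under a new label 'Rare'
by_rare = s.where(~s.isin(rare), 'Rare')

print(by_frequent.value_counts())
print(by_rare.value_counts())
```

Option 1 leaves two labels ('A' and 'B'), while option 2 leaves three ('A', 'B' and 'Rare').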

In this and the coming lectures I will explain when it is convenient to use one or the other way of replacing rare values, and evaluate the consequences of replacing rare labels in variables with:

- One predominant category
- A small number of different categories
- High cardinality

**Note that grouping infrequent labels or categories under a new category called 'Rare' or 'Other' is the most common practice in machine learning for businesses.**

Specifically, in this lecture I will demonstrate how to work with rare labels in variables with a small number of categories using the House Sale dataset from Kaggle.


===============================================================================

## Real Life example: 

### Predicting Sale Price of Houses

The problem at hand aims to predict the final sale price of homes based on different explanatory variables describing aspects of residential homes. Predicting house prices is useful to identify fruitful investments, or to determine whether the price advertised for a house is too high or too low before deciding to buy.

=============================================================================

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.ensemble import RandomForestRegressor
#from sklearn.ensemble import GradientBoostingRegressor

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

pd.set_option('display.max_columns', None) # to display the total number columns present in the dataset

import warnings
warnings.filterwarnings('ignore')

## House Sale Price dataset

In [2]:
# let's load the house price dataset from Kaggle

data = pd.read_csv('houseprice.csv')
data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


### Rare value imputation - important

Rare labels should be identified using the training set only, and the result should then be propagated to the test set. That is, when a label that is rare in the training set also appears in the test set, it should be replaced there as well, regardless of whether it is rare in the test set (i.e., regardless of whether it is present in a tiny or a high percentage of the test observations).

In addition, there may be in the test set labels that were not present in the train set. They should be considered rare and preprocessed using the method that was selected to replace rare labels in the training set.

For example, let's imagine that we have in the training set the variable 'city' with the labels 'London', 'Manchester' and 'Yorkshire'. 'Yorkshire' is present in less than 5% of the observations so we decide to replace it by 'London', the most frequent city in the training dataset. In the test set, we should also replace 'Yorkshire' by 'London', regardless of the percentage of observations for 'Yorkshire' or whether 'London' is still the most represented city in the test set.

In addition, if in the test set we find the category 'Milton Keynes', which was not present in the training set, we should also replace that category by 'London'. That is, all categories present in the test set but not in the training set should be treated as rare values and imputed accordingly.
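
The 'city' example can be sketched as below. The helper names `find_frequent_labels` and `replace_rare` are hypothetical, as are the toy series and the 5% threshold. The key point is that the frequent labels and the replacement value are learned from the training set only, and anything else in the test set, including unseen labels, is replaced.

```python
import pandas as pd

def find_frequent_labels(train_col, threshold=0.05):
    """Learn, from the training set only, which labels are NOT rare."""
    freq = train_col.value_counts(normalize=True)
    return set(freq[freq >= threshold].index)

def replace_rare(col, frequent_labels, replacement):
    """Replace every label outside frequent_labels, including unseen ones.
    Note: missing values would be replaced too, so impute them beforehand."""
    return col.where(col.isin(frequent_labels), replacement)

# Toy data (hypothetical): 'Yorkshire' is rare in train, 'Milton Keynes' is unseen
train = pd.Series(['London'] * 60 + ['Manchester'] * 37 + ['Yorkshire'] * 3)
test = pd.Series(['London', 'Yorkshire', 'Yorkshire', 'Milton Keynes', 'Manchester'])

frequent = find_frequent_labels(train)
most_frequent = train.value_counts().idxmax()  # learned from train: 'London'

test_imputed = replace_rare(test, frequent, most_frequent)
print(test_imputed.tolist())
```

Both 'Yorkshire' and 'Milton Keynes' become 'London' in the test set, even though 'Yorkshire' makes up 40% of this toy test set.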

In [3]:
# let's go ahead and divide dataset into train and test set

X_train, X_test, y_train, y_test = train_test_split(data, data.SalePrice, test_size=0.3,
                                                    random_state=0)
X_train.shape, X_test.shape

((1022, 81), (438, 81))

### Functions

Below I will write a few functions to convert categories in categorical variables into numbers so we can use them in sklearn and then to quickly test these variables in a random forest.

In [4]:
def train_rf(X_train, y_train, X_test, y_test, columns):
    # function to train the random forest
    # and test it on train and test sets
    
    rf = RandomForestRegressor(n_estimators=800, random_state=39)
    
    if isinstance(columns, str): # if we train using only 1 variable (pass a string instead of list in the "columns" argument of the function)
        rf.fit(X_train[columns].to_frame(), y_train.values)
        pred_train = rf.predict(X_train[columns].to_frame())
        pred_test = rf.predict(X_test[columns].to_frame())
        
    else: # if we train using multiple variables (pass a list in the argument "columns")
        rf.fit(X_train[columns], y_train.values)
        pred_train = rf.predict(X_train[columns])
        pred_test = rf.predict(X_test[columns])
        
    print('Train set')
    print('Random Forests mse: {}'.format(mean_squared_error(y_train, pred_train)))
    print('Test set')
    print('Random Forests mse: {}'.format(mean_squared_error(y_test, pred_test)))

In [5]:
def labels_to_numbers(X_train, X_test, columns):
    # function to encode labels into numbers
    # each label will be assigned an ordinal number from 0 onwards
    
    for col in columns:
        # learn the mapping from the training set only
        labels_dict = {k:i for i, k in enumerate(X_train[col].unique(), 0)}
        X_train[col] = X_train[col].map(labels_dict)
        # labels in the test set that are absent from the training set become NaN
        X_test[col] = X_test[col].map(labels_dict)
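
One thing to keep in mind with this kind of mapping: a label that appears in the test set but not in the training set has no entry in the dictionary, so `map` returns NaN for it. The sketch below shows this on toy data (hypothetical labels), and one possible fallback, which is my assumption rather than part of the function above: reserving -1 for unseen labels.

```python
import pandas as pd

# Toy data (hypothetical): 'Fa' never appears in the training set
train = pd.Series(['TA', 'Gd', 'TA', 'Ex'])
test = pd.Series(['Gd', 'Fa'])

# Same mapping logic as in labels_to_numbers: learnt from the training set
labels_dict = {k: i for i, k in enumerate(train.unique())}

encoded = test.map(labels_dict)
print(encoded.isnull().sum())  # number of test labels without a mapping

# Assumed fallback: reserve -1 for labels not seen during training
encoded_safe = encoded.fillna(-1).astype(int)
print(encoded_safe.tolist())
```

Replacing rare and unseen labels before encoding, as described earlier, avoids the problem altogether.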

### Variables with few categories

In [6]:
# the columns in the below list have only 4 different labels
# let's inspect them

cols = ['MasVnrType', 'ExterQual', 'BsmtCond']
for col in cols:
    print(data.groupby(col)[col].count() / len(data))
    print()

MasVnrType
BrkCmn     0.010274
BrkFace    0.304795
None       0.591781
Stone      0.087671
Name: MasVnrType, dtype: float64

ExterQual
Ex    0.035616
Fa    0.009589
Gd    0.334247
TA    0.620548
Name: ExterQual, dtype: float64

BsmtCond
Fa    0.030822
Gd    0.044521
Po    0.001370
TA    0.897945
Name: BsmtCond, dtype: float64



The variables above have only 4 categories. In all three cases, there is at least one infrequent category, that is, one present in less than 5% of the observations.

When the variable has only a few categories, it may make no sense to re-categorise the rare labels into something else. Let's look for example at the first variable, MasVnrType. This variable shows only 1 rare label, BrkCmn. Thus, re-categorising it into a new label is not an option, because it would leave the variable in the same situation. We could replace that label by the most frequent category, but ideally we should first compare the distribution of the target (here, house prices) within the rare and the frequent label. If the distributions are similar, then it makes sense to merge the categories. If they are different, however, I would choose to leave the rare label as it is and use the original variable without modifications.
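
A quick way to make that comparison is to summarise the target within each label. The sketch below uses a toy dataframe with made-up prices (hypothetical, not taken from the house price dataset) and a few summary statistics; a boxplot or histogram per label would serve the same purpose.

```python
import pandas as pd

# Toy data (hypothetical prices): one frequent and one rare label
toy = pd.DataFrame({
    'label': ['common'] * 6 + ['rare'] * 2,
    'price': [100, 110, 120, 130, 140, 150, 118, 132],
})

# Summary of the target within each label
summary = toy.groupby('label')['price'].agg(['count', 'median', 'min', 'max'])
print(summary)
```

In this toy case both labels have the same median and the rare prices sit inside the range of the frequent ones, so merging them would lose little information.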

Below I will demonstrate the effects of engineering rare labels in variables with few categories.

In [7]:
# let's check if there are missing data

X_train[cols].isnull().sum()

MasVnrType     5
ExterQual      0
BsmtCond      24
dtype: int64

Two of the variables have missing data, so let's replace the missing values by the most frequent category, as we saw in previous lectures.

In [8]:
# let's create a function to replace NA by the most frequent category
# we have seen this function in previous lectures

def impute_na(df_train, df_test, variable):
    # find most frequent category
    most_frequent_category = df_train.groupby([variable])[variable].count().sort_values(ascending=False).index[0]
    
    # replace NA
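    # assign the result back instead of using inplace=True on a column selection,
    # which may not modify the dataframe in recent pandas versions
    df_train[variable] = df_train[variable].fillna(most_frequent_category)
    df_test[variable] = df_test[variable].fillna(most_frequent_category)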

In [9]:
# and now we impute the NA with the function we just created

for col in ['MasVnrType', 'BsmtCond']:
    impute_na(X_train, X_test, col)
    
X_train[cols].isnull().sum()

MasVnrType    0
ExterQual     0
BsmtCond      0
dtype: int64

#### Let's look at those rare labels

In [10]:
print(X_train.groupby('MasVnrType')['MasVnrType'].count() / len(X_train))

MasVnrType
BrkCmn     0.009785
BrkFace    0.294521
None       0.600783
Stone      0.094912
Name: MasVnrType, dtype: float64


The label BrkCmn is present in less than 1% of the observations. Since it is the only category under-represented, creating a new category called 'Rare' to group this label does not make much sense, as the new label Rare will be in essence the same as BrkCmn, and still under-represented. 

Thus, we may choose to replace the rare label by the most frequent category, in this case, 'None'.

In [11]:
# find the most frequent category, I will use this line in the below function
frequent_cat = X_train.groupby('MasVnrType')['MasVnrType'].count().sort_values().tail(1).index.values[0]
frequent_cat

'None'

In [12]:
# find the rare label, I will use this line in the below function
temp = X_train.groupby('MasVnrType')['MasVnrType'].count() / len(X_train)
[x for x in temp.loc[temp<0.05].index.values]

['BrkCmn']

In [13]:
# for comparison, I will replace rare values by both the most frequent category
# or by re-categorising them under a new label "Rare"
# I will create a function to make the 2 rare value imputations at once

def rare_imputation(X_train, X_test, variable):
    
    # find the most frequent category
    frequent_cat = X_train.groupby(variable)[variable].count().sort_values().tail(1).index.values[0]
    
    # find rare labels
    temp = X_train.groupby([variable])[variable].count() / len(X_train)
    rare_cat = [x for x in temp.loc[temp<0.05].index.values]
    
    # create new variables, with Rare labels imputed
    
    # by the most frequent category
    X_train[variable+'_freq_imp'] = np.where(X_train[variable].isin(rare_cat), frequent_cat, X_train[variable])
    X_test[variable+'_freq_imp'] = np.where(X_test[variable].isin(rare_cat), frequent_cat, X_test[variable])
    
    # by adding a new label 'Rare'
    X_train[variable+'_rare_imp'] = np.where(X_train[variable].isin(rare_cat), 'Rare', X_train[variable])
    X_test[variable+'_rare_imp'] = np.where(X_test[variable].isin(rare_cat), 'Rare', X_test[variable])

#### Variable MasVnrType

In [14]:
# impute rare labels
rare_imputation(X_train, X_test, 'MasVnrType')

# visualise the transformed dataset
X_train[['MasVnrType', 'MasVnrType_rare_imp', 'MasVnrType_freq_imp']].head(10)

Unnamed: 0,MasVnrType,MasVnrType_rare_imp,MasVnrType_freq_imp
64,BrkFace,BrkFace,BrkFace
682,None,None,None
960,None,None,None
1384,None,None,None
1100,None,None,None
416,BrkFace,BrkFace,BrkFace
1034,None,None,None
853,BrkFace,BrkFace,BrkFace
472,BrkFace,BrkFace,BrkFace
1011,None,None,None


In [15]:
# let's inspect the original variable distribution

print(X_train.groupby('MasVnrType')['MasVnrType'].count() / len(X_train))

MasVnrType
BrkCmn     0.009785
BrkFace    0.294521
None       0.600783
Stone      0.094912
Name: MasVnrType, dtype: float64


In [16]:
# and now the distribution after rare label to "rare" imputation

print(X_train.groupby('MasVnrType_rare_imp')['MasVnrType_rare_imp'].count() / len(X_train))

MasVnrType_rare_imp
BrkFace    0.294521
None       0.600783
Rare       0.009785
Stone      0.094912
Name: MasVnrType_rare_imp, dtype: float64


We can see that, in essence, we did nothing other than replace the category name 'BrkCmn' by 'Rare'. This is not useful. **Re-grouping categories under a new label 'Rare' only makes sense for variables that contain more than 1 rare category**.
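
This rule can be written as a small guard. The function `group_rare_if_useful` below is a hypothetical helper, not part of the functions defined in this notebook, and the toy series and 5% threshold are assumptions: it groups rare labels under 'Rare' only when at least two labels are rare, and otherwise returns the variable untouched.

```python
import pandas as pd

def group_rare_if_useful(train_col, threshold=0.05):
    """Group rare labels under 'Rare' only if there are at least 2 of them."""
    freq = train_col.value_counts(normalize=True)
    rare = freq[freq < threshold].index
    if len(rare) < 2:
        # a single rare label would only be renamed, so leave the variable as is
        return train_col.copy()
    return train_col.where(~train_col.isin(rare), 'Rare')

# Toy variables (hypothetical): one with a single rare label, one with two
one_rare = pd.Series(['A'] * 60 + ['B'] * 38 + ['C'] * 2)
two_rare = pd.Series(['A'] * 60 + ['B'] * 34 + ['C'] * 3 + ['D'] * 3)

print(group_rare_if_useful(one_rare).value_counts())
print(group_rare_if_useful(two_rare).value_counts())
```

The first variable comes back unchanged, whereas in the second one 'C' and 'D' are merged into a single 'Rare' label covering 6% of the rows.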

In [17]:
# and now let's inspect the rare label to 'frequent label' imputation
print(X_train.groupby('MasVnrType_freq_imp')['MasVnrType_freq_imp'].count() / len(X_train))

MasVnrType_freq_imp
BrkFace    0.294521
None       0.610568
Stone      0.094912
Name: MasVnrType_freq_imp, dtype: float64


The observations that originally displayed the rare label BrkCmn are grouped together with those that showed the most frequent label.

Let's examine the performance of the three variables in Random Forests

In [18]:
# first we encode the labels into numbers with the function that we wrote a few cells ago

labels_to_numbers(X_train, X_test, ['MasVnrType', 'MasVnrType_rare_imp', 'MasVnrType_freq_imp'])

# and then we build a random forest using the original distribution
train_rf(X_train, y_train, X_test, y_test, 'MasVnrType')

Train set
Random Forests mse: 4879648935.038812
Test set
Random Forests mse: 5785517057.405808


In [19]:
# or the distribution in which we grouped the rare values with those of the most frequent label

train_rf(X_train, y_train, X_test, y_test, 'MasVnrType_freq_imp')

Train set
Random Forests mse: 4880437822.86853
Test set
Random Forests mse: 5787398643.51672


In [20]:
# calculate the difference in test mse (engineered variable - original variable)

5787398643-5785517057

1881586

The two random forests perform almost identically. The test mse is 5.785e9 with the original variable and 5.787e9 with the variable in which the rare label was merged into the most frequent label, a difference of about 0.03%. Since a lower mse is better, merging the rare label did not improve the performance of the tree-based method here; at best, it left it unchanged.

#### Variable ExterQual

In [21]:
# let's now explore another variable
print(X_train.groupby('ExterQual')['ExterQual'].count() / len(X_train))

ExterQual
Ex    0.029354
Fa    0.011742
Gd    0.332681
TA    0.626223
Name: ExterQual, dtype: float64


This variable has 2 categories that are rare, 'Ex' and 'Fa'.

In [22]:
# let's engineer the rare labels into the 'Rare' or 'most frequent label' 
# using the function we defined above

rare_imputation(X_train, X_test, 'ExterQual')
X_train[['ExterQual', 'ExterQual_rare_imp', 'ExterQual_freq_imp']].head(10)

Unnamed: 0,ExterQual,ExterQual_rare_imp,ExterQual_freq_imp
64,TA,TA,TA
682,TA,TA,TA
960,TA,TA,TA
1384,TA,TA,TA
1100,TA,TA,TA
416,TA,TA,TA
1034,TA,TA,TA
853,TA,TA,TA
472,TA,TA,TA
1011,TA,TA,TA


In [23]:
# let's examine the original distribution again
print(X_train.groupby('ExterQual')['ExterQual'].count() / len(X_train))

ExterQual
Ex    0.029354
Fa    0.011742
Gd    0.332681
TA    0.626223
Name: ExterQual, dtype: float64


In [24]:
# and now the imputation into most frequent label
print(X_train.groupby('ExterQual_freq_imp')['ExterQual_freq_imp'].count() / len(X_train))

ExterQual_freq_imp
Gd    0.332681
TA    0.667319
Name: ExterQual_freq_imp, dtype: float64


In [25]:
# and the imputation into the rare label
print(X_train.groupby('ExterQual_rare_imp')['ExterQual_rare_imp'].count() / len(X_train))

ExterQual_rare_imp
Gd      0.332681
Rare    0.041096
TA      0.626223
Name: ExterQual_rare_imp, dtype: float64


The imputation into the rare label has generated an additional category called "Rare", under which the observations with the labels Ex and Fa are now grouped. On the other hand, the frequent label imputation has merged the observations of the labels Ex and Fa with those of the label TA, leaving only 2 categories in that variable.

Let's examine the performance of the different methods in random forests

In [26]:
# first we transform the label strings into numbers so we can use sklearn
labels_to_numbers(X_train, X_test, ['ExterQual', 'ExterQual_rare_imp', 'ExterQual_freq_imp'])

# and now we build Random Forests using the original variable
train_rf(X_train, y_train, X_test, y_test, 'ExterQual')

Train set
Random Forests mse: 3285699437.6173377
Test set
Random Forests mse: 3327665435.796429


In [27]:
# and comparatively, we build random forests using the rare imputation into the 'Rare' category
train_rf(X_train, y_train, X_test, y_test, 'ExterQual_rare_imp')

Train set
Random Forests mse: 3970209768.435656
Test set
Random Forests mse: 3747390792.6268826


In [28]:
# and finally, we build random forests using the rare into frequent label imputation methods
train_rf(X_train, y_train, X_test, y_test, 'ExterQual_freq_imp')

Train set
Random Forests mse: 4777247329.540922
Test set
Random Forests mse: 5572669187.136114


On this occasion, the random forest built using all the labels in the original variable performs best (test mse: 3.3e9). Grouping the rare labels under 'Rare' still shows reasonable performance (test mse: 3.7e9). However, replacing the infrequent labels by the most frequent one makes the random forest under-perform quite dramatically (test mse: 5.6e9). We can try to understand why merging the rare labels brought a drop in performance by examining the median house price within each label.

In [29]:
data.groupby('ExterQual')['SalePrice'].median()

ExterQual
Ex    364606.5
Fa     82250.0
Gd    220000.0
TA    139450.0
Name: SalePrice, dtype: float64

As expected, the median house price in the 2 rare categories is extremely different: 'Ex' has the highest median price and 'Fa' the lowest. Thus, merging them into one category masks the information these labels carry. This is why, in this case, keeping the labels separate gives better performance.

#### Variable BsmtCond

In [30]:
# let's do the exercise one more time for an additional variable
# let's examine the distribution of observations among the different categories within this variable

print(X_train.groupby('BsmtCond')['BsmtCond'].count() / len(X_train))

BsmtCond
Fa    0.032290
Gd    0.045010
Po    0.001957
TA    0.920744
Name: BsmtCond, dtype: float64


We can anticipate that replacing all rare labels (Fa, Gd and Po) by the most frequent category (TA) will in essence remove all of the information, as we will end up with only one category. If, alternatively, we group the rare categories into 1, we will end up with 2 categories, presumably removing substantial information as well. See below.

In [31]:
rare_imputation(X_train, X_test, 'BsmtCond')

In [32]:
# original distribution
print(X_train.groupby('BsmtCond')['BsmtCond'].count() / len(X_train))

BsmtCond
Fa    0.032290
Gd    0.045010
Po    0.001957
TA    0.920744
Name: BsmtCond, dtype: float64


In [33]:
# replacing by most frequent

print(X_train.groupby('BsmtCond_freq_imp')['BsmtCond_freq_imp'].count() / len(X_train))

BsmtCond_freq_imp
TA    1.0
Name: BsmtCond_freq_imp, dtype: float64


In [34]:
# grouping under rare
print(X_train.groupby('BsmtCond_rare_imp')['BsmtCond_rare_imp'].count() / len(X_train))

BsmtCond_rare_imp
Rare    0.079256
TA      0.920744
Name: BsmtCond_rare_imp, dtype: float64


I will leave the exercise of comparing how this impacts the performance of random forests to you. You got the idea already.

### Conclusion

In my opinion, engineering rare labels in variables with very few categories, like the ones we worked with in this notebook, is unlikely to boost the performance of the algorithm. This is because a handful of categories is unlikely to introduce much noise.

However, if the number of categories increases, then it becomes more important to handle the rare labels.

**That is all for this demonstration. I hope you enjoyed the notebook, and see you in the next one.**