### Dealing with rare categories in categorical variables

- In a categorical variable, some categories often appear far more frequently than others, while some appear in only a few observations.
- Categories that appear in only a tiny proportion of the observations (typically less than 5%) are called rare categories.

### High level strategy
- Merge all categories that appear in less than 5% of the observations into the most dominant category, or
- Group all categories that appear in less than 5% of the observations into a new category, 'Rare'.

In [189]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [190]:
df = pd.read_csv(r'./data/house_price.csv')

In [191]:
from sklearn.model_selection import train_test_split

In [192]:
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 0:-1], df.SalePrice, test_size=0.3, random_state=0)

In [193]:
X_train.shape

(1022, 80)

In [195]:
# number of unique values of Street
X_train['Street'].nunique()

2

In [196]:
# unique values of Street
X_train['Street'].unique()

# Type of street access to property
# Grvl: Gravel
# Pave: Paved

array(['Pave', 'Grvl'], dtype=object)

In [197]:
def display_category_percentage(data, col):
    print((data[col].value_counts(normalize=True)).apply(lambda x : str(np.round(x * 100, 1)) + '%'))
    print()

In [198]:
# categories of Street
display_category_percentage(data = X_train, col = 'Street')

Pave    99.5%
Grvl     0.5%
Name: Street, dtype: object



- The majority of the houses are located on paved streets ('Pave').
- In situations like this, we say that one category dominates the variable.
- Almost all of the observations show the same label.

### Different categorical variable scenarios
Categorical variables can present themselves in a variety of scenarios. Among these, we find variables with:
- One dominating category (most of the observations share the same label)
- A few categories
- High cardinality (a lot of different categories).
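
The three scenarios listed above can be told apart programmatically. The following is a minimal sketch rather than part of the original notebook: `classify_categorical` is a hypothetical helper, the 0.9 dominance cut-off is an assumption, the "more than 10 categories" cut-off follows the high-cardinality rule used later in this notebook, and the toy data is invented for illustration.

```python
import pandas as pd

def classify_categorical(data, dominance=0.9, high_cardinality=10):
    """Label each non-numeric column by scenario (thresholds are assumptions)."""
    scenarios = {}
    for col in data.columns:
        if pd.api.types.is_numeric_dtype(data[col]):
            continue  # only categorical (non-numeric) columns are of interest
        shares = data[col].value_counts(normalize=True)  # sorted, largest first
        if shares.empty:
            continue  # column holds only missing values
        if shares.iloc[0] >= dominance:
            scenarios[col] = "one dominating category"
        elif len(shares) > high_cardinality:
            scenarios[col] = "high cardinality"
        else:
            scenarios[col] = "few categories"
    return scenarios

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Street": ["Pave"] * 19 + ["Grvl"],
    "ExterQual": ["TA"] * 10 + ["Gd"] * 8 + ["Ex", "Fa"],
    "Code": [f"c{i}" for i in range(20)],
    "Area": range(20),
})
scenarios = classify_categorical(toy)
print(scenarios)
```

The dominance check runs first, so a column with many labels but one overwhelming category is still reported as dominated.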

#### Variables with low cardinality and one dominant category

In [199]:
  
def get_small_category_percent(data):
    for col in data.columns:
        if data[col].dtypes == 'O': # column is of object type
            # fewer than 3 distinct values (note: unique() counts NaN as a value)
            if len(data[col].unique()) < 3:
                display_category_percentage(data, col)


In [200]:
get_small_category_percent(X_train)

Pave    99.5%
Grvl     0.5%
Name: Street, dtype: object

AllPub    99.9%
NoSeWa     0.1%
Name: Utilities, dtype: object

Y    93.2%
N     6.8%
Name: CentralAir, dtype: object



* In the first two variables, Street and Utilities, one dominating category accounts for more than 99% of the observations.
* In the third variable, CentralAir, the dominating category is present in about 93% of the observations.

* When a variable has one dominating category, re-grouping the rare label is not an option: merging it into the dominant category would leave a constant column, and renaming it 'Rare' changes nothing. One has to choose between using the variable as it is and removing it from the dataset.
* These variables are often not useful for prediction, and we would usually remove them from the set of features used to build machine learning models.
* There are exceptions, for example when the target is imbalanced and the presence of the rare label is informative.
* The rare label can also be informative when the target is balanced.

* Therefore, instead of automating this step in a feature engineering pipeline, it is probably better to evaluate these variables individually.

#### Variables with few categories

In [201]:
# MasVnrType: Masonry veneer type
# ExterQual: Evaluates the quality of the material on the exterior
# BsmtCond: Evaluates the general condition of the basement

cols = ['MasVnrType', 'ExterQual', 'BsmtCond']

for col in cols:
    display_category_percentage(X_train, col)

None       59.9%
BrkFace    29.6%
Stone       9.5%
BrkCmn      1.0%
Name: MasVnrType, dtype: object

TA    62.6%
Gd    33.3%
Ex     2.9%
Fa     1.2%
Name: ExterQual, dtype: object

TA    91.9%
Gd     4.6%
Fa     3.3%
Po     0.2%
Name: BsmtCond, dtype: object



- The variables above have only 4 categories each, and in all three cases at least one category is infrequent, that is, present in less than 5% of the observations.
- When a variable has only a few categories, it may make no sense to re-categorise the rare labels into something else.
- Take the first variable, MasVnrType, as an example. It shows only one rare label, BrkCmn, so re-categorising it into a new label is not an option: the variable would be left in the same situation.
- Replacing that label with the most frequent category is possible, but ideally we should first compare the distribution of the target (here, house prices) within the rare and the frequent labels. If the distributions are similar, it makes sense to merge the categories. If they differ, I would leave the rare label as it is and use the original variable without modification.
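
The comparison described above could be sketched as follows. This is an illustration under stated assumptions, not code from the original notebook: `compare_rare_vs_frequent` is a hypothetical helper, the 5% threshold mirrors the one used throughout, and the prices are invented toy values rather than figures from `house_price.csv`.

```python
import pandas as pd

def compare_rare_vs_frequent(data, col, target, threshold=0.05):
    """Describe the target within the most frequent label and each rare label."""
    shares = data[col].value_counts(normalize=True)
    frequent_label = shares.index[0]
    rare_labels = shares[shares < threshold].index
    keep = [frequent_label, *rare_labels]
    subset = data[data[col].isin(keep)]
    # count, mean, std and quartiles of the target, one row per label
    return subset.groupby(col)[target].describe()

# Toy data, invented for illustration only (BrkCmn is 1 of 25 rows, i.e. 4%)
toy = pd.DataFrame({
    "MasVnrType": ["None"] * 13 + ["BrkFace"] * 7 + ["Stone"] * 4 + ["BrkCmn"],
    "SalePrice": [150] * 13 + [200] * 7 + [260] * 4 + [140],
})
comparison = compare_rare_vs_frequent(toy, "MasVnrType", "SalePrice")
print(comparison)
```

With very few observations in the rare label, these summary statistics are noisy, so they are better read as a rough guide than as a test.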

In [202]:
# let's check if there are missing data

X_train[cols].isnull().sum()

MasVnrType     5
ExterQual      0
BsmtCond      24
dtype: int64

In [203]:
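def impute_na(train, test, variable):
    # most frequent category of `variable`, learned from the training set only
    most_freq_cat = train.groupby(variable)[variable].count().sort_values(ascending=False).index[0]
    # caution: fillna is called on the whole DataFrame, so NaN in every column
    # (numeric ones included) is replaced with this label, not only NaN in `variable`
    train.fillna(most_freq_cat, inplace=True)
    test.fillna(most_freq_cat, inplace=True)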
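
Because `impute_na` calls `fillna` on the whole frame, the first call also writes the chosen label into every other column that has missing values. If the intent is to impute one column at a time, a column-scoped variant might look like the sketch below. `impute_na_column` is a hypothetical name and the toy frames are invented; the outputs recorded in the rest of this notebook were produced with the original `impute_na`, not with this variant.

```python
import numpy as np
import pandas as pd

def impute_na_column(train, test, variable):
    """Fill NaN in `variable` only, using the mode learned from train."""
    most_freq_cat = train[variable].mode().iloc[0]
    train[variable] = train[variable].fillna(most_freq_cat)
    test[variable] = test[variable].fillna(most_freq_cat)
    return most_freq_cat

# Toy data, invented for illustration only
train = pd.DataFrame({
    "BsmtCond": ["TA", "TA", "Gd", np.nan],
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
})
test = pd.DataFrame({
    "BsmtCond": [np.nan, "Fa"],
    "LotFrontage": [np.nan, 50.0],
})
label = impute_na_column(train, test, "BsmtCond")
print(label)
print(train.isnull().sum())
```

Only `BsmtCond` is filled; the missing `LotFrontage` values stay missing and the column keeps its numeric dtype.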

In [204]:
for col in ['MasVnrType', 'BsmtCond']:
    impute_na(X_train, X_test, col)

In [205]:
# let's check if there are missing data

X_train[cols].isnull().sum()

MasVnrType    0
ExterQual     0
BsmtCond      0
dtype: int64

In [206]:
cols = ['MasVnrType']

for col in cols:
    display_category_percentage(X_train, col)

None       60.1%
BrkFace    29.5%
Stone       9.5%
BrkCmn      1.0%
Name: MasVnrType, dtype: object



The label BrkCmn is present in 1% of the observations. Since it is the only under-represented category, creating a new category called 'Rare' to hold it does not make much sense: the new label would be BrkCmn under another name, and still under-represented.

Thus, we may choose to replace the rare label with the most frequent category, in this case 'None'.

### Rare value imputation
- Rare labels should be identified using the training set only, and the result then propagated to the test set.
- That is, once a label is identified as rare in the training set, it should be replaced in the test set as well, regardless of whether it is rare there.
- In addition, the test set may contain labels that were not present in the training set. They should be treated as rare and handled with the same method selected for rare labels in the training set.
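
One way to honour all three points, including labels seen only in the test set, is to learn the set of frequent labels from the training data and send everything else to 'Rare'. The sketch below is an assumption-laden illustration: `fit_frequent_labels` and `group_rare_labels` are hypothetical helpers, the 5% threshold is the one used in this notebook, and the toy series are invented.

```python
import pandas as pd

def fit_frequent_labels(train_col, threshold=0.05):
    """Labels covering at least `threshold` of the training observations."""
    shares = train_col.value_counts(normalize=True)
    return set(shares[shares >= threshold].index)

def group_rare_labels(col, frequent_labels, rare_label="Rare"):
    """Keep frequent labels and missing values; map everything else to rare_label."""
    keep = col.isin(frequent_labels) | col.isna()
    return col.where(keep, rare_label)

# Toy data, invented for illustration only
train_col = pd.Series(["TA"] * 60 + ["Gd"] * 35 + ["Ex"] * 3 + ["Fa"] * 2)
test_col = pd.Series(["TA", "Ex", "Po", None])  # 'Po' never appears in train

frequent = fit_frequent_labels(train_col)
train_grouped = group_rare_labels(train_col, frequent)
test_grouped = group_rare_labels(test_col, frequent)
print(test_grouped.tolist())
```

Defining "rare" as "not in the frequent set" rather than "in the rare set" is what lets unseen labels such as 'Po' fall into 'Rare' automatically. Missing values are left untouched here so that imputation remains a separate decision.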



In [207]:
def rare_imputation(train, test, variable):

    # find the most frequent category in train
    freq_cat = train[variable].value_counts().index[0]

    # find rare categories in train (present in less than 5% of the observations)
    S = train[variable].value_counts(normalize=True)
    rare_cat = S[S < 0.05].index.values

    # create new categorical variables with rare values imputed
    # note: labels that appear only in test are left unchanged by both approaches

    # Approach 1. replace rare labels with the most frequent category
    freq_dict = {k: freq_cat for k in rare_cat}

    train[variable + '_freq'] = train[variable].apply(lambda x: freq_dict.get(x, x))
    test[variable + '_freq'] = test[variable].apply(lambda x: freq_dict.get(x, x))

    # Approach 2. group rare labels under a new category label, 'Rare'
    train[variable + '_rare'] = np.where(train[variable].isin(rare_cat), 'Rare', train[variable])
    test[variable + '_rare'] = np.where(test[variable].isin(rare_cat), 'Rare', test[variable])
    

In [208]:
# rare label imputation
rare_imputation(X_train, X_test, 'MasVnrType')

In [209]:
# After rare value imputation 

for col in X_train.columns[X_train.columns.str.startswith('MasVnrType')]:
    display_category_percentage(X_train, col)

None       60.1%
BrkFace    29.5%
Stone       9.5%
BrkCmn      1.0%
Name: MasVnrType, dtype: object

None       61.1%
BrkFace    29.5%
Stone       9.5%
Name: MasVnrType_freq, dtype: object

None       60.1%
BrkFace    29.5%
Stone       9.5%
Rare        1.0%
Name: MasVnrType_rare, dtype: object



In [210]:
# rare label imputation
rare_imputation(X_train, X_test, 'ExterQual')

cols=['ExterQual', 'ExterQual_freq', 'ExterQual_rare']
for col in cols:
    display_category_percentage(X_train, col)

TA    62.6%
Gd    33.3%
Ex     2.9%
Fa     1.2%
Name: ExterQual, dtype: object

TA    66.7%
Gd    33.3%
Name: ExterQual_freq, dtype: object

TA      62.6%
Gd      33.3%
Rare     4.1%
Name: ExterQual_rare, dtype: object



### High Cardinality

In [211]:
# let's explore examples in which variables have several categories, say more than 10

def get_highcardinality_columns(dataframe):
    return [col for col in dataframe.columns if (dataframe[col].dtype == 'O') and (dataframe[col].nunique() > 10)]

In [212]:
# rare imputation for high cardinality columns
for col in get_highcardinality_columns(X_train):
    rare_imputation(X_train, X_test, col)
    print(f'Rare imputation for {col} done')

Rare imputation for LotFrontage done
Rare imputation for Neighborhood done
Rare imputation for Exterior1st done
Rare imputation for Exterior2nd done
Rare imputation for MasVnrArea done
Rare imputation for GarageYrBlt done
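
LotFrontage, MasVnrArea and GarageYrBlt hold numeric measurements, so their presence in this list appears to be a side effect of the earlier whole-frame `fillna` call, which wrote the string 'None' into them and turned them into object columns. A quick check for such columns could look like the sketch below. `numeric_like_object_columns` is a hypothetical helper, the 0.5 share cut-off is an assumption, and the toy frame is invented.

```python
import pandas as pd

def numeric_like_object_columns(data, min_numeric_share=0.5):
    """Object columns in which most values can be parsed as numbers."""
    flagged = []
    for col in data.columns:
        if data[col].dtype != "O":
            continue  # already numeric, or not stored as generic objects
        parsed = pd.to_numeric(data[col], errors="coerce")  # non-numbers become NaN
        if parsed.notna().mean() >= min_numeric_share:
            flagged.append(col)
    return flagged

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "LotFrontage": [60.0, "None", 80.0, 50.0],
    "Neighborhood": ["NAmes", "CollgCr", "NAmes", "OldTown"],
})
flagged = numeric_like_object_columns(toy)
print(flagged)
```

Columns flagged this way are usually better handled with numeric imputation and left out of rare-label grouping.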


In [213]:
highcardinality_cols=['LotFrontage', 'Neighborhood']

for hcol in highcardinality_cols:
    for col in X_train.columns[X_train.columns.str.startswith(hcol)]:
        display_category_percentage(X_train, col)

None     18.5%
60.0      9.2%
80.0      5.5%
50.0      4.0%
75.0      3.5%
         ...  
112.0     0.1%
109.0     0.1%
174.0     0.1%
103.0     0.1%
182.0     0.1%
Name: LotFrontage, Length: 102, dtype: object

None    85.3%
60.0     9.2%
80.0     5.5%
Name: LotFrontage_freq, dtype: object

Rare    66.8%
None    18.5%
60.0     9.2%
80.0     5.5%
Name: LotFrontage_rare, dtype: object

NAmes      14.8%
CollgCr    10.3%
OldTown     7.1%
Edwards     6.9%
Sawyer      6.0%
Somerst     5.5%
Gilbert     5.4%
NWAmes      5.0%
NridgHt     5.0%
SawyerW     4.4%
BrkSide     4.0%
Mitchel     3.5%
Crawfor     3.4%
NoRidge     2.9%
Timber      2.9%
ClearCr     2.3%
IDOTRR      2.3%
SWISU       1.8%
StoneBr     1.6%
Blmngtn     1.2%
MeadowV     1.2%
BrDale      1.0%
NPkVill     0.7%
Veenker     0.6%
Blueste     0.2%
Name: Neighborhood, dtype: object

NAmes      58.8%
CollgCr    10.3%
OldTown     7.1%
Edwards     6.9%
Sawyer      6.0%
Somerst     5.5%
Gilbert     5.4%
Name: Neighborhood_freq, dtype: o