# One Hot Encoding of Frequent Categories


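One hot encoding (OHE) of only the most frequent (top) categories creates binary variables for those categories alone. This is equivalent to grouping all the remaining categories under a single new category, which is represented by zeros in every binary variable.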
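The equivalence can be checked on a small example. The sketch below uses made-up toy data (not the house price dataset) and hypothetical names such as `var` and `Other`; it compares encoding only the top 2 categories against grouping the rest under `Other`, encoding everything, and dropping the `Other` column.

```python
import pandas as pd

# toy data, made up for illustration only
s = pd.Series(['a', 'a', 'a', 'b', 'b', 'c', 'd'], name='var')

# the 2 most frequent categories
top = s.value_counts().head(2).index.tolist()

# option 1: one binary variable per top category
ohe_top = pd.DataFrame(
    {f'var_{label}': (s == label).astype(int) for label in top})

# option 2: group the remaining categories under 'Other',
# encode every category, then drop the 'Other' column
grouped = s.where(s.isin(top), 'Other')
ohe_grouped = pd.get_dummies(
    grouped, prefix='var').astype(int).drop(columns='var_Other')

# both routes should give the same binary variables
print(ohe_top.equals(ohe_grouped[ohe_top.columns]))
```

Rows with the infrequent categories (`c` and `d` here) end up with zeros in every binary variable.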

### Advantages of OHE of top categories
- Does not require much time spent on variable exploration
- Does not massively expand the feature space
- Suitable for linear models


### Limitations
- Does not add any information that may make the variable more predictive
- Discards the information carried by the ignored (infrequent) labels



### Datasets:
- House Pricing dataset


### Content:

1. First Steps:
    - loading the data
    - exploring cardinality
    - train/test split
2. OHE with Pandas and NumPy:
    - for a single column
    - with re-usable functions

## 1. First Steps

### - loading the data 

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

In [2]:
# load dataset

data = pd.read_csv(
    '../houseprice.csv',
    usecols=['Neighborhood', 'Exterior1st', 'Exterior2nd', 'SalePrice'])

data.head()

,Neighborhood,Exterior1st,Exterior2nd,SalePrice
0,CollgCr,VinylSd,VinylSd,208500
1,Veenker,MetalSd,MetalSd,181500
2,CollgCr,VinylSd,VinylSd,223500
3,Crawfor,Wd Sdng,Wd Shng,140000
4,NoRidge,VinylSd,VinylSd,250000


### - exploring cardinality

In [3]:
# number of distinct labels in each variable

for col in data.columns:
    print(col, ': ', len(data[col].unique()), ' labels')

Neighborhood :  25  labels
Exterior1st :  15  labels
Exterior2nd :  16  labels
SalePrice :  663  labels


In [4]:
# explore the unique categories
data['Neighborhood'].unique()

array(['CollgCr', 'Veenker', 'Crawfor', 'NoRidge', 'Mitchel', 'Somerst',
       'NWAmes', 'OldTown', 'BrkSide', 'Sawyer', 'NridgHt', 'NAmes',
       'SawyerW', 'IDOTRR', 'MeadowV', 'Edwards', 'Timber', 'Gilbert',
       'StoneBr', 'ClearCr', 'NPkVill', 'Blmngtn', 'BrDale', 'SWISU',
       'Blueste'], dtype=object)

In [5]:
data['Exterior1st'].unique()

array(['VinylSd', 'MetalSd', 'Wd Sdng', 'HdBoard', 'BrkFace', 'WdShing',
       'CemntBd', 'Plywood', 'AsbShng', 'Stucco', 'BrkComm', 'AsphShn',
       'Stone', 'ImStucc', 'CBlock'], dtype=object)

In [6]:
data['Exterior2nd'].unique()

array(['VinylSd', 'MetalSd', 'Wd Shng', 'HdBoard', 'Plywood', 'Wd Sdng',
       'CmentBd', 'BrkFace', 'Stucco', 'AsbShng', 'Brk Cmn', 'ImStucc',
       'AsphShn', 'Stone', 'Other', 'CBlock'], dtype=object)

### - train/test split

It is important to select the top (most frequent) categories based on the train data only. We then use those same top categories to encode the variables in the test data as well.
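A minimal sketch of why this matters, using a made-up toy split (the frames and the variable name `var` are hypothetical): the top categories found in the train set can differ from those found in the full data, and test rows whose category is not in the train list simply get zeros.

```python
import numpy as np
import pandas as pd

# toy train and test sets, made up for illustration only
train = pd.DataFrame({'var': ['a', 'a', 'a', 'b', 'b', 'c']})
test = pd.DataFrame({'var': ['c', 'c', 'c', 'c', 'd']})

# top 2 categories learned from the train set only
top_train = train['var'].value_counts().head(2).index.tolist()

# for comparison: ranking on the full data lets the test set influence the choice
top_full = pd.concat([train, test])['var'].value_counts().head(2).index.tolist()

# encode both sets with the categories learned from train
for label in top_train:
    train['var_' + label] = np.where(train['var'] == label, 1, 0)
    test['var_' + label] = np.where(test['var'] == label, 1, 0)

print(top_train)
print(top_full)
print(test)
```

In this toy case every test row falls outside the train top categories, so all of its binary variables are 0.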

In [7]:
# separate the data into train and test sets

X_train, X_test, y_train, y_test = train_test_split(
    data[['Neighborhood', 'Exterior1st', 'Exterior2nd']], 
    data['SalePrice'], 
    test_size=0.3,  
    random_state=0) 

X_train.shape, X_test.shape

((1022, 3), (438, 3))

In [8]:
# examine how OHE expands the feature space

pd.get_dummies(X_train, drop_first=True).shape

(1022, 53)

From the initial 3 categorical variables, we end up with 53 variables. 
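With `drop_first=True`, `pd.get_dummies` creates k-1 binary variables for a variable with k distinct categories, so the size of the expanded feature space can be anticipated from the cardinality. A small sketch on made-up toy data (the column names `colour` and `size` are hypothetical):

```python
import pandas as pd

# toy data, made up for illustration only
df = pd.DataFrame({
    'colour': ['red', 'blue', 'green', 'red'],
    'size': ['S', 'M', 'S', 'L'],
})

# k - 1 dummies per variable when drop_first=True
expected = int((df.nunique() - 1).sum())

# number of columns actually created
actual = pd.get_dummies(df, drop_first=True).shape[1]

print(expected, actual)
```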

## 2. OHE with Pandas and NumPy

(Limitation: pandas and NumPy do not store the categories learned from the train data, so we have to keep the list of top categories ourselves in order to apply it to the test data.)
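One possible way around this limitation is to wrap the logic in a small object that remembers what it learned. The sketch below is only an illustration: `TopCategoryEncoder`, its `fit` and `transform` methods, and the toy data are hypothetical names and values, not part of pandas, NumPy, or this notebook.

```python
import pandas as pd


class TopCategoryEncoder:
    """Hypothetical helper: learns the top categories in fit()
    and reuses them in transform()."""

    def __init__(self, how_many=10):
        self.how_many = how_many
        self.top_categories_ = {}

    def fit(self, df, variables):
        # store the most frequent categories of each variable
        for variable in variables:
            self.top_categories_[variable] = (
                df[variable].value_counts().head(self.how_many).index.tolist())
        return self

    def transform(self, df):
        # work on a copy so the input frame is left untouched
        df = df.copy()
        for variable, labels in self.top_categories_.items():
            for label in labels:
                df[f'{variable}_{label}'] = (df[variable] == label).astype(int)
        return df


# toy data, made up for illustration only
train = pd.DataFrame({'var': ['a', 'a', 'a', 'b', 'b', 'c']})
test = pd.DataFrame({'var': ['b', 'c', 'd']})

encoder = TopCategoryEncoder(how_many=2).fit(train, ['var'])
train_enc = encoder.transform(train)
test_enc = encoder.transform(test)

print(encoder.top_categories_)
print(test_enc)
```

Because the learned categories live on the encoder, the same object can be applied later to any new data with the same columns.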

### - for a single column 

In [9]:
# find the top 10 most frequent categories for the variable 'Neighborhood'

X_train['Neighborhood'].value_counts().sort_values(ascending=False).head(10)

NAmes      151
CollgCr    105
OldTown     73
Edwards     71
Sawyer      61
Somerst     56
Gilbert     55
NWAmes      51
NridgHt     51
SawyerW     45
Name: Neighborhood, dtype: int64

In [10]:
# make a list with the most frequent categories of the variable

top_10 = [
    x for x in X_train['Neighborhood'].value_counts().sort_values(
        ascending=False).head(10).index
]

top_10

['NAmes',
 'CollgCr',
 'OldTown',
 'Edwards',
 'Sawyer',
 'Somerst',
 'Gilbert',
 'NWAmes',
 'NridgHt',
 'SawyerW']

In [11]:
# make the 10 binary variables

for label in top_10:
    X_train['Neighborhood' + '_' + label] = np.where(
        X_train['Neighborhood'] == label, 1, 0)
    
    X_test['Neighborhood' + '_' + label] = np.where(
        X_test['Neighborhood'] == label, 1, 0)

X_train[['Neighborhood'] + ['Neighborhood'+'_'+c for c in top_10]].head(10)

,Neighborhood,Neighborhood_NAmes,Neighborhood_CollgCr,Neighborhood_OldTown,Neighborhood_Edwards,Neighborhood_Sawyer,Neighborhood_Somerst,Neighborhood_Gilbert,Neighborhood_NWAmes,Neighborhood_NridgHt,Neighborhood_SawyerW
64,CollgCr,0,1,0,0,0,0,0,0,0,0
682,ClearCr,0,0,0,0,0,0,0,0,0,0
960,BrkSide,0,0,0,0,0,0,0,0,0,0
1384,Edwards,0,0,0,1,0,0,0,0,0,0
1100,SWISU,0,0,0,0,0,0,0,0,0,0
416,Sawyer,0,0,0,0,1,0,0,0,0,0
1034,Crawfor,0,0,0,0,0,0,0,0,0,0
853,NAmes,1,0,0,0,0,0,0,0,0,0
472,Edwards,0,0,0,1,0,0,0,0,0,0
1011,Edwards,0,0,0,1,0,0,0,0,0,0


In [12]:
X_train

,Neighborhood,Exterior1st,Exterior2nd,Neighborhood_NAmes,Neighborhood_CollgCr,Neighborhood_OldTown,Neighborhood_Edwards,Neighborhood_Sawyer,Neighborhood_Somerst,Neighborhood_Gilbert,Neighborhood_NWAmes,Neighborhood_NridgHt,Neighborhood_SawyerW
64,CollgCr,VinylSd,VinylSd,0,1,0,0,0,0,0,0,0,0
682,ClearCr,Wd Sdng,Wd Sdng,0,0,0,0,0,0,0,0,0,0
960,BrkSide,Wd Sdng,Plywood,0,0,0,0,0,0,0,0,0,0
1384,Edwards,WdShing,Wd Shng,0,0,0,1,0,0,0,0,0,0
1100,SWISU,Wd Sdng,Wd Sdng,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
763,NoRidge,VinylSd,VinylSd,0,0,0,0,0,0,0,0,0,0
835,Sawyer,VinylSd,HdBoard,0,0,0,0,1,0,0,0,0,0
1216,Sawyer,VinylSd,VinylSd,0,0,0,0,1,0,0,0,0,0
559,Blmngtn,VinylSd,VinylSd,0,0,0,0,0,0,0,0,0,0


### - with re-usable functions

In [13]:
def calculate_top_categories(df, variable, how_many=10):
    # return a list with the how_many most frequent categories of the variable
    return [
        x for x in df[variable].value_counts().sort_values(
            ascending=False).head(how_many).index
    ]


def one_hot_encode(train, test, variable, top_x_labels):
    # add one binary variable per top category to train and test (in place)

    for label in top_x_labels:
        train[variable + '_' + label] = np.where(
            train[variable] == label, 1, 0)

        test[variable + '_' + label] = np.where(
            test[variable] == label, 1, 0)

In [14]:
# run a loop over the remaining categorical variables

for variable in ['Exterior1st', 'Exterior2nd']:
    
    top_categories = calculate_top_categories(X_train, variable, how_many=10)
    
    one_hot_encode(X_train, X_test, variable, top_categories)

In [15]:
X_train.head()

,Neighborhood,Exterior1st,Exterior2nd,Neighborhood_NAmes,Neighborhood_CollgCr,Neighborhood_OldTown,Neighborhood_Edwards,Neighborhood_Sawyer,Neighborhood_Somerst,Neighborhood_Gilbert,...,Exterior2nd_VinylSd,Exterior2nd_Wd Sdng,Exterior2nd_HdBoard,Exterior2nd_MetalSd,Exterior2nd_Plywood,Exterior2nd_CmentBd,Exterior2nd_Wd Shng,Exterior2nd_BrkFace,Exterior2nd_AsbShng,Exterior2nd_Stucco
64,CollgCr,VinylSd,VinylSd,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
682,ClearCr,Wd Sdng,Wd Sdng,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
960,BrkSide,Wd Sdng,Plywood,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1384,Edwards,WdShing,Wd Shng,0,0,0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1100,SWISU,Wd Sdng,Wd Sdng,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0


Now there are 30 additional dummy variables (10 for each of the 3 categorical variables) instead of the 53 we obtained when we created dummies for all the categories.