## Frequent category imputation - pandas

To download the House Prices dataset, please refer to the lecture **Datasets** in **Section 2** of this course.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
# Two categorical columns and the target SalePrice

cols_to_use = ["BsmtQual", "FireplaceQu", "SalePrice"]

In [3]:
# Let's load the House Prices dataset.

data = pd.read_csv("../../houseprice.csv", usecols=cols_to_use)

data.head()

| | BsmtQual | FireplaceQu | SalePrice |
|---|---|---|---|
| 0 | Gd | NaN | 208500 |
| 1 | Gd | TA | 181500 |
| 2 | Gd | TA | 223500 |
| 3 | TA | Gd | 140000 |
| 4 | Gd | TA | 250000 |


**Remember: the most frequent category must be identified using the train set only, and then used to impute both the train and test sets.**

In [4]:
# Let's separate into training and testing sets.

X_train, X_test, y_train, y_test = train_test_split(
    data.drop("SalePrice", axis=1),
    data["SalePrice"],
    test_size=0.3,
    random_state=0,
)

X_train.shape, X_test.shape

((1022, 2), (438, 2))

In [5]:
# Fraction of missing values per variable

X_train.isnull().mean()

BsmtQual       0.023483
FireplaceQu    0.467710
dtype: float64

In [6]:
# Calculate the mode

X_train[["BsmtQual", "FireplaceQu"]].mode()

| | BsmtQual | FireplaceQu |
|---|---|---|
| 0 | TA | Gd |


Some variables may have more than one mode. In that case, `mode()` returns one row per tied category, and we need to decide which category to use as the replacement for the NaN values.
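To make the tie case concrete, here is a minimal sketch on a made-up toy DataFrame (the values are invented for illustration and are not taken from the House Prices data). It shows that `mode()` returns one row per tied category, pads the shorter columns with NaN, and that taking the first row is one simple way to pick a single category.

```python
import pandas as pd

# Toy data (hypothetical values): BsmtQual has a tie between Gd and TA.
toy = pd.DataFrame(
    {
        "BsmtQual": ["Gd", "TA", "Gd", "TA", None],
        "FireplaceQu": ["Gd", "Gd", "TA", None, None],
    }
)

# One row per mode; columns with fewer modes are padded with NaN.
modes = toy.mode()

# Modes are returned in sorted order, so the first row holds
# the alphabetically first category of each tie.
first_mode = modes.iloc[0].to_dict()

print(modes)
print(first_mode)
```

Taking the first row is only one possible rule. Any other choice among the tied categories is equally valid, as long as it is made on the train set and applied consistently.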

In [7]:
# Capture the mode of each variable in a dictionary
# (.iloc[0] takes the first row, i.e. the first mode if there is a tie)

imputation_dict = X_train[["BsmtQual", "FireplaceQu"]].mode().iloc[0].to_dict()

imputation_dict

{'BsmtQual': 'TA', 'FireplaceQu': 'Gd'}

In [8]:
# Replace missing data in both sets with the modes learned from the train set

X_train = X_train.fillna(imputation_dict)
X_test = X_test.fillna(imputation_dict)
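The learn-on-train, apply-to-both pattern used above can be sketched as two small helper functions. The names `learn_modes` and `impute` are hypothetical (they are not part of pandas or of this course), and the toy values are invented so that the test set's own mode differs from the train set's.

```python
import pandas as pd


def learn_modes(train, variables):
    """Return {variable: most frequent category}, learned from `train` only."""
    return train[variables].mode().iloc[0].to_dict()


def impute(df, modes):
    """Return a copy of `df` with NaN replaced by the learned categories."""
    return df.fillna(modes)


# Toy data (hypothetical values).
toy_train = pd.DataFrame({"BsmtQual": ["TA", "TA", "Gd", None]})
toy_test = pd.DataFrame({"BsmtQual": [None, "Ex", None]})

# The mode comes from the train set, even though the most frequent
# category observed in the test set is "Ex".
modes = learn_modes(toy_train, ["BsmtQual"])

toy_train_imputed = impute(toy_train, modes)
toy_test_imputed = impute(toy_test, modes)

print(modes)
print(toy_test_imputed)
```

Because `fillna` returns a new DataFrame here, the original toy sets are left unchanged, which makes it easy to compare the data before and after imputation.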

In [9]:
# Confirm that no missing values remain in the train set

X_train.isnull().sum()

BsmtQual       0
FireplaceQu    0
dtype: int64

In [10]:
# Confirm that no missing values remain in the test set

X_test.isnull().sum()

BsmtQual       0
FireplaceQu    0
dtype: int64
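Counting the remaining NaN values shows that nothing is missing, but not what the gaps were filled with. As a slightly stronger check, the sketch below compares category counts before and after imputation on a toy column (made-up values standing in for `FireplaceQu`): the imputed category should grow by exactly the number of missing values, and every other category should keep its count.

```python
import pandas as pd

# Toy column (hypothetical values) with two missing entries.
before = pd.Series(["Gd", "TA", None, "Gd", None], name="FireplaceQu")

fill_value = before.mode().iloc[0]
after = before.fillna(fill_value)

n_missing = before.isnull().sum()
counts_before = before.value_counts()
counts_after = after.value_counts()

# The imputed category gains exactly the number of missing values...
increase = counts_after[fill_value] - counts_before[fill_value]

# ...and all other categories keep their original counts.
others_unchanged = counts_after.drop(fill_value).equals(
    counts_before.drop(fill_value)
)

print(increase == n_missing, others_unchanged)
```

The same comparison could be run on `X_train` by keeping a copy of the data from before the `fillna` step.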