## Frequent Category Imputation in Pandas
We will use data from the
[Housing Dataset](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data?select=train.csv).

In [None]:
import io
import pandas as pd
from sklearn.model_selection import train_test_split

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
# Limit our data and use only these columns
cols_to_use = ["BsmtQual", "FireplaceQu", "SalePrice"]

In [None]:
# Load the House Prices dataset.
# The key passed to `uploaded` must match the name of the uploaded file.
data = pd.read_csv(io.BytesIO(uploaded['houseprice.csv']), usecols=cols_to_use)
data.head()

**Note:** The most frequent category must be identified using the ***TRAIN SET ONLY***, so that no information from the test set leaks into the imputation.

In [None]:
# Separate the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop("SalePrice", axis=1),  # predictors
    data["SalePrice"],  # target
    test_size=0.3,
    random_state=42,
)

X_train.shape, X_test.shape

In [None]:
# Fraction of missing values in each column
X_train.isnull().mean()

In [None]:
# Calculate the mode
X_train[["BsmtQual", "FireplaceQu"]].mode()

Some variables may have more than one mode, in which case `mode()` returns one row per tied category. We then need to decide which category to use as the replacement for the NaN values.
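The tie case can be illustrated with a minimal sketch. The `toy` DataFrame below is hypothetical and not taken from the Housing Dataset. It assumes that picking the first returned mode is an acceptable tie-break, which is only one of several reasonable choices.

```python
import pandas as pd

# Hypothetical toy column in which "Gd" and "TA" are tied for most frequent.
toy = pd.DataFrame({"BsmtQual": ["Gd", "TA", "Gd", "TA", "Ex", None]})

# mode() ignores NaN by default and returns one entry per tied category.
modes = toy["BsmtQual"].mode()
print(modes.tolist())

# Taking the first entry is one simple, deterministic tie-break.
fill_value = modes.iloc[0]
toy_imputed = toy.fillna({"BsmtQual": fill_value})
print(toy_imputed["BsmtQual"].tolist())
```

If the choice between tied categories matters for the analysis, it may be better to pick one based on domain knowledge rather than on position.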

In [None]:
# Capture the mode of each variable to impute in a dictionary.
# .iloc[0] keeps the first mode when a variable has more than one.
imputation_dict = X_train[["BsmtQual", "FireplaceQu"]].mode().iloc[0].to_dict()
imputation_dict

In [None]:
# Replace missing data with the modes learned from the train set.
# Reassigning avoids modifying a possible copy in place.
X_train = X_train.fillna(imputation_dict)
X_test = X_test.fillna(imputation_dict)

In [None]:
# Validate Replacement for Train Data
X_train.isnull().sum()

In [None]:
# Validate Replacement for Test Data
X_test.isnull().sum()
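
To see why the mode is taken from the train set only, here is a small self-contained sketch on hypothetical data. The `train` and `test` frames and their values are made up for illustration. The test set's own most frequent category differs from the one learned on the train set, and the train value is the one applied to both.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only.
train = pd.DataFrame({"FireplaceQu": ["Gd", "Gd", "TA", None]})
test = pd.DataFrame({"FireplaceQu": ["TA", "TA", None]})

# Learn the most frequent category from the training rows only.
train_modes = train.mode().iloc[0].to_dict()

# Apply the same learned value to both sets.
train_filled = train.fillna(train_modes)
test_filled = test.fillna(train_modes)

print(train_modes)
print(test_filled["FireplaceQu"].tolist())
```

The missing test value is filled with the train mode even though `"TA"` is more common in the test rows, which mirrors how the imputation would behave on unseen data.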