## Frequent Category Imputation in scikit-learn
We will use the data from the
[Housing Dataset](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data?select=train.csv).

In [None]:
import io
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
# Limit our data and use only these columns
cols_to_use = [
    "BsmtQual",
    "FireplaceQu",
    "LotFrontage",
    "MasVnrArea",
    "Street",
    "SalePrice",
]

In [None]:
# Load the House Prices dataset.
data = pd.read_csv(io.BytesIO(uploaded['houseprice.csv']), usecols=cols_to_use)
data.head()

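**Note:** The most frequent category must be learned from the ***TRAIN SET ONLY*** and then used to impute both the train and test sets. Learning it from the full dataset would leak information from the test set.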
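As a small illustration of this rule, the sketch below uses hypothetical toy data (the column name `quality` and its values are made up, not taken from the housing dataset). The imputer is fitted on the train frame only, so the test frame is filled with the train-set mode even when the test set has a different most frequent value.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one categorical column with missing values.
# dtype=object keeps the strings as plain Python objects.
train = pd.DataFrame({"quality": pd.Series(["TA", "TA", "Gd", np.nan], dtype=object)})
test = pd.DataFrame({"quality": pd.Series(["Gd", "Gd", np.nan], dtype=object)})

imputer = SimpleImputer(strategy="most_frequent")
imputer.fit(train)  # the mode is learned from the train set only

# The missing test value is filled with the train-set mode ("TA"),
# even though "Gd" is the most frequent value in the test set.
test_imputed = imputer.transform(test)
print(imputer.statistics_)
print(test_imputed)
```

Fitting on the train set and reusing the fitted imputer on the test set mirrors what happens at prediction time, when only the statistics learned during training are available.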

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop("SalePrice", axis=1),
    data["SalePrice"],
    test_size=0.3,
    random_state=42,
)

X_train.shape, X_test.shape

In [None]:
# Fraction of missing values in each column
X_train.isnull().mean()

The categorical variables `BsmtQual` and `FireplaceQu` contain missing data.

# SimpleImputer - default

In [None]:
imputer = SimpleImputer(strategy="most_frequent")

# Fit the imputer to the train set.
# The imputer learns the mode of every column, categorical and numerical alike.
imputer.fit(X_train)

In [None]:
# Imputation value (the mode) learned for each column
imputer.statistics_

**Note** that the transformer learns the most frequent value for both categorical AND numerical variables.
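A minimal sketch of this behaviour on hypothetical toy data (the column names `quality` and `area` are illustrative only): with `strategy="most_frequent"`, a single `SimpleImputer` fills a string column and a numeric column alike, each with its own mode. For a continuous variable the mode is often not a meaningful fill value, which is one reason to treat numerical columns separately.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one categorical and one numerical column.
toy = pd.DataFrame({
    "quality": pd.Series(["TA", "TA", "Gd", np.nan], dtype=object),
    "area": [100.0, 100.0, 250.0, np.nan],
})

imputer = SimpleImputer(strategy="most_frequent")
filled = imputer.fit_transform(toy)  # returns a NumPy array

# One learned value per column: the mode of "quality" and the mode of "area".
print(imputer.statistics_)
print(filled)
```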

In [None]:
# Cross-check the imputer statistics against the column modes computed by pandas
X_train.mode()

In [None]:
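# Impute both sets with the values learned from the train set
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

# By default, transform returns a NumPy array rather than a DataFrame
X_train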

In [None]:
# Convert the train set back to a DataFrame
X_train = pd.DataFrame(
    X_train,
    columns=imputer.get_feature_names_out(),  # the variable names
)

X_train.head()

# SimpleImputer - dataframe

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop("SalePrice", axis=1),
    data["SalePrice"],
    test_size=0.3,
    random_state=42,
)

X_train.shape, X_test.shape

In [None]:
imputer = SimpleImputer(strategy="most_frequent").set_output(transform="pandas")

In [None]:
imputer.fit(X_train)

In [None]:
# Imputation values learned from the train set
imputer.statistics_

In [None]:
# Impute the train and test sets
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

# The output is now a DataFrame
X_train.head()

In [None]:
# Confirm that no missing values remain
X_train.isnull().sum()

# SimpleImputer - feature subsets

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop("SalePrice", axis=1),
    data["SalePrice"],
    test_size=0.3,
    random_state=42,
)

X_train.shape, X_test.shape

In [None]:
# Missing values
X_train.isnull().mean()

### Impute Specific Columns with different strategies

- `Frequent Category` -> **Categorical Variables**
- `Mean` -> **Numerical Variables**
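The idea in the list above can be sketched on hypothetical toy data before applying it to the housing columns. The transformer names `num` and `cat` and the columns `area` and `quality` are made up for illustration. Each `SimpleImputer` sees only the columns assigned to it, so the numerical column is filled with its mean and the categorical column with its mode.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy frame: one numerical and one categorical column.
toy = pd.DataFrame({
    "area": [100.0, 200.0, np.nan],
    "quality": pd.Series(["TA", "TA", np.nan], dtype=object),
})

ct = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="mean"), ["area"]),
        ("cat", SimpleImputer(strategy="most_frequent"), ["quality"]),
    ]
)

# Columns come out in the order the transformers are listed.
result = ct.fit_transform(toy)
print(ct.named_transformers_["num"].statistics_)  # mean of "area"
print(ct.named_transformers_["cat"].statistics_)  # mode of "quality"
print(result)
```

Note that columns not assigned to any transformer are dropped by default (`remainder="drop"`), so every column to be kept should appear in one of the lists.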

In [None]:
numericfeatures = [
    "LotFrontage",
    "MasVnrArea",
]
categoricfeatures = [
    "BsmtQual", 
    "FireplaceQu", 
    "Street", 
]

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ("numeric_imputer", SimpleImputer(strategy="mean"), numericfeatures),
        ("categoric_imputer", SimpleImputer(strategy="most_frequent"), categoricfeatures),
    ]
)

In [None]:
preprocessor.set_output(transform="pandas")

In [None]:
preprocessor.fit(X_train)

In [None]:
# Explore the fitted imputers
preprocessor.transformers_

In [None]:
# Imputer Statistics for Numerical Features
preprocessor.named_transformers_["numeric_imputer"].statistics_

In [None]:
# Validate against the means of the numerical variables in the train set
X_train[numericfeatures].mean()

In [None]:
# Imputer Statistics for Categorical Features
preprocessor.named_transformers_["categoric_imputer"].statistics_

In [None]:
# Validate against the modes of the categorical variables in the train set
X_train[categoricfeatures].mode()

In [None]:
# Impute the train and test set
X_train = preprocessor.transform(X_train)
X_test = preprocessor.transform(X_test)

In [None]:
X_train.head()