## Missing Category imputation with Scikit-learn: SimpleImputer

Scikit-learn provides a class to perform the most common data imputation techniques.

The **SimpleImputer** provides basic strategies for imputing missing values, including:

- Mean and median imputation for numerical variables
- Most frequent category imputation for categorical variables
- Arbitrary value imputation for both categorical and numerical variables
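
The strategies listed above can be compared side by side on a tiny example. This is a minimal sketch on made-up toy data (the `area` and `quality` columns are hypothetical and not part of the House Prices dataset); it only shows which value each strategy learns during `fit`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data for illustration only (values are made up)
num = pd.DataFrame({"area": [10.0, 20.0, np.nan, 60.0]})
cat = pd.DataFrame({"quality": pd.Series(["Gd", "TA", np.nan, "Gd"], dtype=object)})

# One imputer per strategy; fit() learns the replacement value
mean_imputer = SimpleImputer(strategy="mean").fit(num)
median_imputer = SimpleImputer(strategy="median").fit(num)
frequent_imputer = SimpleImputer(strategy="most_frequent").fit(cat)
constant_imputer = SimpleImputer(strategy="constant", fill_value="Missing").fit(cat)

# The learned replacement values are stored in statistics_
print(mean_imputer.statistics_)      # mean of 10, 20 and 60
print(median_imputer.statistics_)    # median of 10, 20 and 60
print(frequent_imputer.statistics_)  # most frequent category
print(constant_imputer.statistics_)  # the fill_value we passed
```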

## Advantages

- Simple to use if applied to the entire dataframe
- Fast computation (it uses numpy for calculations)
- Handles different missing-value markers (with the `missing_values` parameter you can indicate whether missing values are encoded as np.nan, zeroes, etc.)
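
As a rough illustration of the last point, the sketch below treats zeroes as the missing-value marker through the `missing_values` parameter. The array is made up for the example, and whether a zero really means "missing" depends entirely on your data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy array (made up): here we assume 0 encodes a missing value
X = np.array(
    [
        [0.0, 2.0],
        [4.0, 0.0],
        [8.0, 6.0],
    ]
)

# Tell the imputer that zeroes, not np.nan, are the values to replace
zero_imputer = SimpleImputer(missing_values=0, strategy="mean")
X_imputed = zero_imputer.fit_transform(X)

# Each zero is replaced by the mean of the non-zero values in its column
print(X_imputed)
```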

## Limitations

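- Returns a numpy array by default (use `set_output(transform="pandas")` to get a dataframe instead)
- Applies the same imputation to every column of the dataframe (use the ColumnTransformer to impute feature subsets)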

## More details about the transformers

- [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)
- [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html)
- [Stackoverflow](https://stackoverflow.com/questions/54160370/how-to-use-sklearn-column-transformer)


## Dataset:

To download the House Prices dataset, please refer to the lecture **Datasets** in **Section 2** of this course.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

# transformers to impute missing data with sklearn:

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# to split the datasets
from sklearn.model_selection import train_test_split

In [2]:
# let's load the dataset with a few columns

# two categorical variables (BsmtQual, FireplaceQu), one numerical
# variable (LotFrontage) and the target SalePrice
cols_to_use = [
    "LotFrontage",
    "BsmtQual",
    "FireplaceQu",
    "SalePrice",
]

data = pd.read_csv("../../houseprice.csv", usecols=cols_to_use)

data.head()

Unnamed: 0,LotFrontage,BsmtQual,FireplaceQu,SalePrice
0,65.0,Gd,,208500
1,80.0,Gd,TA,181500
2,68.0,Gd,TA,223500
3,60.0,TA,Gd,140000
4,84.0,Gd,TA,250000


In [3]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    data.drop("SalePrice", axis=1),  # just the features
    data["SalePrice"],  # the target
    test_size=0.3,  # the percentage of obs in the test set
    random_state=0,  # for reproducibility
)

X_train.shape, X_test.shape

((1022, 3), (438, 3))

In [4]:
# let's check the fraction of missing data in each variable
X_train.isnull().mean()

LotFrontage    0.184932
BsmtQual       0.023483
FireplaceQu    0.467710
dtype: float64

In [5]:
# let's inspect the values of the categorical variable

X_train["BsmtQual"].unique()

array(['Gd', 'TA', 'Fa', nan, 'Ex'], dtype=object)

In [6]:
# let's inspect the values of the categorical variable

X_train["FireplaceQu"].unique()

array([nan, 'Gd', 'TA', 'Fa', 'Po', 'Ex'], dtype=object)

# SimpleImputer - default

In [7]:
# Now we impute the missing values with SimpleImputer

# Create an instance of the simple imputer
# indicating that we want to replace NA
# with 'Missing':

imputer = SimpleImputer(
    strategy="constant",
    fill_value="Missing",
)

# we fit the imputer to the train set
imputer.fit(X_train)

In [8]:
# we find the NA replacement values here:

imputer.statistics_

array(['Missing', 'Missing', 'Missing'], dtype=object)

In [9]:
# and now we impute the train and test set

# NOTE: the data is returned as a numpy array!!!

X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

X_train

array([['Missing', 'Gd', 'Missing'],
       ['Missing', 'Gd', 'Gd'],
       [50.0, 'TA', 'Missing'],
       ...,
       [68.0, 'Missing', 'Missing'],
       ['Missing', 'Gd', 'TA'],
       [58.0, 'Gd', 'Missing']], dtype=object)

In [10]:
# If we wanted to continue our data analysis, we would have to
# convert the train set back to a dataframe:

X_train = pd.DataFrame(
    X_train,
    columns=imputer.get_feature_names_out(),  # the variable names
)

X_train.head()

Unnamed: 0,LotFrontage,BsmtQual,FireplaceQu
0,Missing,Gd,Missing
1,Missing,Gd,Gd
2,50.0,TA,Missing
3,60.0,TA,Missing
4,60.0,TA,Missing


In [11]:
X_train["BsmtQual"].unique()

array(['Gd', 'TA', 'Fa', 'Missing', 'Ex'], dtype=object)

In [12]:
X_train.isnull().mean()

LotFrontage    0.0
BsmtQual       0.0
FireplaceQu    0.0
dtype: float64

**A MASSIVE NOTE OF CAUTION**:

Note that when SimpleImputer is set up with:
- strategy='constant'
- fill_value='Missing'

and the dataframe contains both numerical and categorical variables, NA in both will be replaced by 'Missing'. This converts the numerical variables into categorical ones, which is probably not what you are after.

Most datasets contain both numerical and categorical variables, so you will very likely need the ColumnTransformer to apply a different imputation to each group of variables.
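
The effect described in this note can be checked by comparing dtypes before and after imputation. The sketch below uses a made-up two-column dataframe that borrows the names `LotFrontage` and `BsmtQual` for readability; the values are not taken from the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data for illustration only (values are made up)
toy = pd.DataFrame(
    {
        "LotFrontage": [65.0, np.nan, 68.0],  # numerical
        "BsmtQual": pd.Series(["Gd", np.nan, "TA"], dtype=object),  # categorical
    }
)

# One imputer applied to the whole dataframe
imputer_all = SimpleImputer(strategy="constant", fill_value="Missing")
toy_imputed = pd.DataFrame(imputer_all.fit_transform(toy), columns=toy.columns)

# The numerical column now holds the string 'Missing' and is no longer numeric
print(toy.dtypes["LotFrontage"])          # float64 before imputation
print(toy_imputed.dtypes["LotFrontage"])  # object after imputation
print(toy_imputed)
```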

# SimpleImputer - feature subsets

To apply the imputation to a feature subset we need to use the [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html).

In [13]:
# let's load the dataset with both numerical and categorical variables

cols_to_use = [
    "BsmtQual",
    "FireplaceQu",
    "LotFrontage",
    "MasVnrArea",
    "GarageYrBlt",
    "SalePrice",
]

data = pd.read_csv("../../houseprice.csv", usecols=cols_to_use)

data.head()

Unnamed: 0,LotFrontage,MasVnrArea,BsmtQual,FireplaceQu,GarageYrBlt,SalePrice
0,65.0,196.0,Gd,,2003.0,208500
1,80.0,0.0,Gd,TA,1976.0,181500
2,68.0,162.0,Gd,TA,2001.0,223500
3,60.0,0.0,TA,Gd,1998.0,140000
4,84.0,350.0,Gd,TA,2000.0,250000


In [14]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    data.drop("SalePrice", axis=1),  # just the features
    data["SalePrice"],  # the target
    test_size=0.3,  # the percentage of obs in the test set
    random_state=0,  # for reproducibility
)

In [15]:
# let's look at the missing values

X_train.isnull().mean()

LotFrontage    0.184932
MasVnrArea     0.004892
BsmtQual       0.023483
FireplaceQu    0.467710
GarageYrBlt    0.052838
dtype: float64

For this demo, I will impute the numerical variables with the mean, and the categorical variables with the string "Missing".

In [16]:
# first we need to make lists, indicating which features
# will be imputed with each method

features_numeric = ["LotFrontage", "MasVnrArea", "GarageYrBlt"]
features_categoric = ["BsmtQual", "FireplaceQu"]

# then we pass the feature lists and the transformers to
# the column transformer

preprocessor = ColumnTransformer(
    transformers=[
        ("imputer_numeric", SimpleImputer(strategy="mean"), features_numeric),
        (
            "imputer_categoric",
            SimpleImputer(strategy="constant", fill_value="Missing"),
            features_categoric,
        ),
    ]
)
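
One detail worth keeping in mind: by default the ColumnTransformer drops any column that is not assigned to a transformer (`remainder="drop"`). In this notebook every feature is listed, so nothing is lost, but the sketch below shows the difference on made-up data. The `YrSold` column is included only to play the role of an unlisted variable.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy data for illustration only; YrSold is not assigned to any transformer
toy = pd.DataFrame(
    {
        "LotFrontage": [65.0, np.nan, 68.0],
        "YrSold": [2008, 2007, 2008],
    }
)

# Default: columns that are not listed are dropped from the output
dropping = ColumnTransformer(
    transformers=[("imputer_numeric", SimpleImputer(strategy="mean"), ["LotFrontage"])],
)

# remainder="passthrough" keeps the unlisted columns untouched
keeping = ColumnTransformer(
    transformers=[("imputer_numeric", SimpleImputer(strategy="mean"), ["LotFrontage"])],
    remainder="passthrough",
)

dropped_shape = dropping.fit_transform(toy).shape
kept_shape = keeping.fit_transform(toy).shape
print(dropped_shape, kept_shape)
```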

In [17]:
# set output to be a dataframe

preprocessor.set_output(transform="pandas")

In [18]:
# now we fit the preprocessor
preprocessor.fit(X_train)

In [19]:
# we can explore the transformers like this:
preprocessor.transformers

[('imputer_numeric',
  SimpleImputer(),
  ['LotFrontage', 'MasVnrArea', 'GarageYrBlt']),
 ('imputer_categoric',
  SimpleImputer(fill_value='Missing', strategy='constant'),
  ['BsmtQual', 'FireplaceQu'])]

In [20]:
# and we can look at the values like this:

# for the numerical imputer
preprocessor.named_transformers_["imputer_numeric"].statistics_

array([  69.66866747,  103.55358899, 1978.01239669])

In [21]:
# for the categorical imputer
preprocessor.named_transformers_["imputer_categoric"].statistics_

array(['Missing', 'Missing'], dtype=object)

In [22]:
# and now we can impute the data.
# Because we set the output to pandas above, it returns a dataframe.

X_train = preprocessor.transform(X_train)
X_test = preprocessor.transform(X_test)

In [23]:
X_train.head()

Unnamed: 0,imputer_numeric__LotFrontage,imputer_numeric__MasVnrArea,imputer_numeric__GarageYrBlt,imputer_categoric__BsmtQual,imputer_categoric__FireplaceQu
64,69.668667,573.0,1998.0,Gd,Missing
682,69.668667,0.0,1996.0,Gd,Gd
960,50.0,0.0,1978.012397,TA,Missing
1384,60.0,0.0,1939.0,TA,Missing
1100,60.0,0.0,1930.0,TA,Missing


In [24]:
# And explore the missing values.
# There should be none:

X_train.isnull().mean()

imputer_numeric__LotFrontage      0.0
imputer_numeric__MasVnrArea       0.0
imputer_numeric__GarageYrBlt      0.0
imputer_categoric__BsmtQual       0.0
imputer_categoric__FireplaceQu    0.0
dtype: float64