## Frequent category imputation

Imputation is the act of replacing missing data with statistical estimates of the missing values. The goal of any imputation technique is to produce a **complete dataset** that can be used to train machine learning models.

Mode imputation consists of replacing all occurrences of missing values (NA) within a variable with the mode, that is, the **most frequent value** or **most frequent category**.
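
As a minimal sketch of the idea, using a small made-up Series (the values are illustrative and not taken from the House Prices dataset), the mode can be computed and used to fill the gaps roughly like this:

```python
import numpy as np
import pandas as pd

# toy categorical variable with two missing values (made-up data)
s = pd.Series(['Gd', 'TA', np.nan, 'TA', 'Ex', np.nan])

# mode() ignores NaN by default and returns a Series,
# because several values can tie; we take the first entry
most_frequent = s.mode()[0]

# replace every missing value with the most frequent category
s_imputed = s.fillna(most_frequent)

print(most_frequent)
print(s_imputed.tolist())
```

Note that when two categories are equally frequent, `mode()` returns both in sorted order, so `[0]` picks the first of them.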

============================================

In this recipe, we will replace missing values with the most frequent category using pandas, Scikit-learn and Feature-engine, all open source Python libraries.

============================================

To download the House Prices dataset from Kaggle, visit [this website](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data) and download the train.csv file. Rename it to houseprice.csv and save it in the parent folder of your notebook folder.

## Frequent category imputation with pandas

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

# to split the datasets
from sklearn.model_selection import train_test_split

In [2]:
# let's load the dataset with a few columns for the demonstration

# these are categorical columns and the target SalePrice
cols_to_use = ['BsmtQual', 'FireplaceQu', 'SalePrice']

data = pd.read_csv('../houseprice.csv', usecols=cols_to_use)
data.head()

  BsmtQual FireplaceQu  SalePrice
0       Gd         NaN     208500
1       Gd          TA     181500
2       Gd          TA     223500
3       TA          Gd     140000
4       Gd          TA     250000


In [3]:
# let's inspect the percentage of missing values in each variable

data.isnull().mean()

BsmtQual       0.025342
FireplaceQu    0.472603
SalePrice      0.000000
dtype: float64

In [4]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(data,
                                                    data['SalePrice'],
                                                    test_size=0.3,
                                                    random_state=0)
X_train.shape, X_test.shape

((1022, 3), (438, 3))

In [5]:
# find the percentage of missing data within those variables
# same code as we learnt in section 3 on variable characteristics

X_train.isnull().mean()

BsmtQual       0.023483
FireplaceQu    0.467710
SalePrice      0.000000
dtype: float64

In [6]:
# let's make a function to fill missing values with the mode:
# the function takes the dataframe, the variable, and the value of the
# mode

# and returns the variable with the filled na


def impute_na(df, variable, value):

    return df[variable].fillna(value)

In [7]:
# replace the missing values with the mode learned from the train set

value = X_train['BsmtQual'].mode()[0]

X_train.loc[:,'BsmtQual'] = impute_na(X_train, 'BsmtQual', value)
X_test.loc[:,'BsmtQual'] = impute_na(X_test, 'BsmtQual', value)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
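
The warning above appears because pandas cannot tell whether the train and test sets are views or copies of `data`. One way to avoid it, sketched here on a made-up DataFrame (the `iloc` slices are an assumption standing in for the output of `train_test_split`), is to take explicit copies before assigning:

```python
import numpy as np
import pandas as pd

# made-up stand-in for the data; not the real House Prices file
data = pd.DataFrame({
    'BsmtQual': ['Gd', 'TA', np.nan, 'TA', 'Gd', 'TA'],
    'SalePrice': [208500, 181500, 223500, 140000, 250000, 143000],
})

# explicit copies, so later assignments modify independent objects
train = data.iloc[:4].copy()
test = data.iloc[4:].copy()

# learn the mode from the train set only
value = train['BsmtQual'].mode()[0]

# plain column assignment on the copies
train['BsmtQual'] = train['BsmtQual'].fillna(value)
test['BsmtQual'] = test['BsmtQual'].fillna(value)

print(train['BsmtQual'].isnull().sum())
```

The original `data` is left untouched, which is usually what we want when the raw data is needed again later.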


In [8]:
# and we repeat for the other variable
value = X_train['FireplaceQu'].mode()[0]

X_train.loc[:,'FireplaceQu'] = impute_na(X_train, 'FireplaceQu', value)
X_test.loc[:,'FireplaceQu'] = impute_na(X_test, 'FireplaceQu', value)
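
Repeating the same lines per variable gets tedious with many columns. As a hedged alternative, the sketch below learns all modes in one step and passes them to `fillna` as a dictionary; the small `train` and `test` DataFrames are made up for illustration:

```python
import numpy as np
import pandas as pd

# made-up train and test sets with missing categories
train = pd.DataFrame({
    'BsmtQual': ['Gd', 'TA', np.nan, 'TA'],
    'FireplaceQu': [np.nan, 'Gd', 'Gd', 'TA'],
})
test = pd.DataFrame({
    'BsmtQual': [np.nan, 'Ex'],
    'FireplaceQu': ['TA', np.nan],
})

# the first row of DataFrame.mode() holds the most frequent
# value of each column; we learn it from the train set only
modes = train.mode().iloc[0].to_dict()

# fillna accepts a dict that maps column name -> fill value
train_imputed = train.fillna(value=modes)
test_imputed = test.fillna(value=modes)

print(modes)
```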

## Frequent category imputation with Scikit-learn

In [9]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

# these are the packages we need to impute missing data
# with sklearn
from sklearn.impute import SimpleImputer

# to split the datasets
from sklearn.model_selection import train_test_split

In [10]:
# let's load the dataset with a few categorical columns

# these are categorical columns and the target SalePrice
cols_to_use = ['BsmtQual', 'FireplaceQu', 'SalePrice']

data = pd.read_csv('../houseprice.csv', usecols=cols_to_use)
data.head()

  BsmtQual FireplaceQu  SalePrice
0       Gd         NaN     208500
1       Gd          TA     181500
2       Gd          TA     223500
3       TA          Gd     140000
4       Gd          TA     250000


In [11]:
# let's separate into training and testing set

# first let's remove the target from the features
cols_to_use.remove('SalePrice')

X_train, X_test, y_train, y_test = train_test_split(
    data[['BsmtQual', 'FireplaceQu']],  # just the features
    data['SalePrice'],  # the target
    test_size=0.3,  # the percentage of obs in the test set
    random_state=0)  # for reproducibility

X_train.shape, X_test.shape

((1022, 2), (438, 2))

In [12]:
# Now we impute the missing values with SimpleImputer

# create an instance of the simple imputer
# we indicate that we want to impute with the most frequent category
imputer = SimpleImputer(strategy='most_frequent')

# we fit the imputer to the train set
# the imputer will learn the mode of each variable
imputer.fit(X_train)

SimpleImputer(copy=True, fill_value=None, missing_values=nan,
       strategy='most_frequent', verbose=0)

In [13]:
# we can look at the learnt modes like this:
imputer.statistics_

array(['TA', 'Gd'], dtype=object)

In [14]:
# and now we impute the train and test set

# NOTE: the data is returned as a numpy array!!!
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)
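
Because the result is a NumPy array, the column names and the index are lost. A minimal sketch of how they could be restored, using a made-up DataFrame `X` in place of the train set:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# made-up data with a non-default index; object dtype, as in the text
X = pd.DataFrame({
    'BsmtQual': ['Gd', 'TA', np.nan, 'TA'],
    'FireplaceQu': [np.nan, 'Gd', 'Gd', 'TA'],
}, index=[10, 11, 12, 13], dtype=object)

imputer = SimpleImputer(strategy='most_frequent')

# fit_transform returns a NumPy array without labels
X_array = imputer.fit_transform(X)

# wrap the array again, reusing the original column names and index
X_df = pd.DataFrame(X_array, columns=X.columns, index=X.index)

print(X_df)
```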

### Frequent category imputation selecting features to impute

In [15]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

# these are the packages we need to impute missing data
# with sklearn
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# to split the datasets
from sklearn.model_selection import train_test_split

In [16]:
# let's load the dataset with both numerical and categorical variables

cols_to_use = [
    'BsmtQual', 'FireplaceQu', 'LotFrontage', 'MasVnrArea', 'GarageYrBlt',
    'SalePrice'
]

data = pd.read_csv('../houseprice.csv', usecols=cols_to_use)
data.head()

   LotFrontage  MasVnrArea BsmtQual FireplaceQu  GarageYrBlt  SalePrice
0         65.0       196.0       Gd         NaN       2003.0     208500
1         80.0         0.0       Gd          TA       1976.0     181500
2         68.0       162.0       Gd          TA       2001.0     223500
3         60.0         0.0       TA          Gd       1998.0     140000
4         84.0       350.0       Gd          TA       2000.0     250000


In [17]:
# let's separate into training and testing set

# first drop the target from the feature list
cols_to_use.remove('SalePrice')

X_train, X_test, y_train, y_test = train_test_split(data,
                                                    data['SalePrice'],
                                                    test_size=0.3,
                                                    random_state=0)
X_train.shape, X_test.shape

((1022, 6), (438, 6))

In [18]:
# first we need to make lists, indicating which features
# will be imputed with each method

features_numeric = ['LotFrontage', 'MasVnrArea', 'GarageYrBlt']
features_categoric = ['BsmtQual', 'FireplaceQu']

# then we instantiate the imputers, within a pipeline
# we create one imputer for numerical and one imputer
# for categorical

imputer_numeric = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
])

imputer_categoric = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
])

# then we put the features list and the transformers together
# using the column transformer

preprocessor = ColumnTransformer(transformers=[
    ('imputer_numeric', imputer_numeric, features_numeric),
    ('imputer_categoric', imputer_categoric, features_categoric)
])

In [19]:
# now we fit the preprocessor
preprocessor.fit(X_train)

ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('imputer_numeric', Pipeline(memory=None,
     steps=[('imputer', SimpleImputer(copy=True, fill_value=None, missing_values=nan, strategy='mean',
       verbose=0))]), ['LotFrontage', 'MasVnrArea', 'GarageYrBlt']), ('imputer_categoric', Pipeline(memory=None,
     steps=[('imputer', SimpleImputer(copy=True, fill_value=None, missing_values=nan,
       strategy='most_frequent', verbose=0))]), ['BsmtQual', 'FireplaceQu'])])

In [20]:
# and now we can impute the data
# remember it returns a numpy array

X_train = preprocessor.transform(X_train)
X_test = preprocessor.transform(X_test)

In [21]:
# be careful: Scikit-learn transformers return NumPy arrays!!
X_train

array([[69.66866746698679, 573.0, 1998.0, 'Gd', 'Gd'],
       [69.66866746698679, 0.0, 1996.0, 'Gd', 'Gd'],
       [50.0, 0.0, 1978.0123966942149, 'TA', 'Gd'],
       ...,
       [68.0, 0.0, 1978.0, 'TA', 'Gd'],
       [69.66866746698679, 18.0, 2003.0, 'Gd', 'TA'],
       [58.0, 30.0, 1998.0, 'Gd', 'Gd']], dtype=object)
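
In the array above, the numerical features come first and the categorical ones last: the output follows the order of the transformers list, not the column order of the input, and columns that are not listed (here the target) are dropped. The sketch below, on a made-up DataFrame with one feature of each type, shows one way to put the labels back under that assumption:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# made-up data: one categorical feature, one numerical feature, the target
X = pd.DataFrame({
    'BsmtQual': pd.Series(['Gd', 'TA', np.nan, 'TA'], dtype=object),
    'LotFrontage': [65.0, np.nan, 68.0, 60.0],
    'SalePrice': [208500, 181500, 223500, 140000],
})

features_numeric = ['LotFrontage']
features_categoric = ['BsmtQual']

preprocessor = ColumnTransformer(transformers=[
    ('imputer_numeric', SimpleImputer(strategy='mean'), features_numeric),
    ('imputer_categoric', SimpleImputer(strategy='most_frequent'),
     features_categoric),
])

X_array = preprocessor.fit_transform(X)

# output columns follow the order of the transformers list;
# SalePrice is dropped because remainder='drop' is the default
X_df = pd.DataFrame(X_array,
                    columns=features_numeric + features_categoric,
                    index=X.index)

print(X_df)
```

Since the array mixes numbers and strings, every column in `X_df` has object dtype; numerical columns may need to be cast back with `astype(float)` before further processing.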

## Frequent category imputation with Feature-engine

In [22]:
import pandas as pd
import numpy as np

# to split the datasets
from sklearn.model_selection import train_test_split

from feature_engine import missing_data_imputers as mi

In [23]:
# let's load the House Prices dataset

# these are categorical columns and the target SalePrice
cols_to_use = ['BsmtQual', 'FireplaceQu', 'SalePrice']

data = pd.read_csv('../houseprice.csv', usecols=cols_to_use)

# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(data,
                                                    data['SalePrice'],
                                                    test_size=0.3,
                                                    random_state=0)

In [24]:
frequentLabel_imputer = mi.FrequentCategoryImputer(variables = ['BsmtQual', 'FireplaceQu'])
frequentLabel_imputer.fit(X_train)

FrequentCategoryImputer(variables=['BsmtQual', 'FireplaceQu'])

In [25]:
# dictionary with the mappings for each variable
frequentLabel_imputer.imputer_dict_

{'BsmtQual': 'TA', 'FireplaceQu': 'Gd'}

In [26]:
# transform the data
X_train = frequentLabel_imputer.transform(X_train)
X_test = frequentLabel_imputer.transform(X_test)

In [27]:
X_train.isnull().mean()

BsmtQual       0.0
FireplaceQu    0.0
SalePrice      0.0
dtype: float64