## Mean / Median imputation

Imputation is the act of replacing missing data with statistical estimates of the missing values. The goal of any imputation technique is to produce a **complete dataset** that can be used to train machine learning models.

Mean / median imputation consists of replacing all occurrences of missing values (NA) within a variable by the mean (if the variable has a Gaussian distribution) or median (if the variable has a skewed distribution).
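As a minimal sketch of the idea, the snippet below fills the missing entries of a small, hypothetical pandas Series (the name `s` and its values are made up for illustration) with either the mean or the median of the observed values.

```python
import numpy as np
import pandas as pd

# Hypothetical variable with two missing values (illustrative data only)
s = pd.Series([10.0, np.nan, 20.0, 60.0, np.nan])

# pandas ignores NaN when computing the mean and the median,
# so both statistics come from the observed values 10, 20 and 60
mean_filled = s.fillna(s.mean())
median_filled = s.fillna(s.median())

print(mean_filled.tolist())
print(median_filled.tolist())
```

Every missing entry receives the same value, which is why this technique is simple but can distort the distribution of the variable when many values are missing.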

**Note the following**:

- If a variable is normally distributed, the mean, median and mode are approximately the same. Therefore, replacing missing values with the mean or with the median is equivalent. Replacing missing data with the mode is not common practice for numerical variables.
- If the variable is skewed, the mean is biased by the values at the far end of the distribution. Therefore, the median is a better representation of the majority of the values in the variable.
- For discrete variables cast as 'int' (to save memory), the mean may not be an integer, so the whole variable will be re-cast as 'float'. To avoid this behaviour, we can replace NA with the median instead, which in most cases will also be an integer / discrete value. (With an even number of observations the median is the average of the two middle values, so it can occasionally be a half-integer.)
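The effect of skew on the mean can be seen in a small sketch. The Series below is hypothetical (illustrative values, not taken from the House Prices data): most observations are small and one is very large.

```python
import pandas as pd

# Hypothetical right-skewed variable: most values are small, one is extreme
skewed = pd.Series([1, 2, 2, 3, 3, 3, 4, 100])

mean_value = skewed.mean()      # pulled upwards by the extreme value
median_value = skewed.median()  # stays with the bulk of the observations

print(mean_value, median_value)
```

Here the mean lies far from every typical observation, while the median sits in the middle of them, which is the reason the median is usually preferred for skewed variables.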

============================================

In this recipe, we will replace missing values by the median or the mean using pandas, Scikit-learn and Feature-Engine, all open source Python libraries.

============================================

To download the House Prices dataset from Kaggle, visit [this website](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data) and download the train.csv file. Rename it to houseprice.csv and save it to the parent folder of your notebook folder.

## Mean / median imputation with pandas

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

# to split the datasets
from sklearn.model_selection import train_test_split

In [2]:
# we are going to use only the following variables:

cols_to_use =  ['LotFrontage', 'MasVnrArea', 'GarageYrBlt', 'SalePrice']

In [3]:
# let's load the House Prices dataset

data = pd.read_csv('../houseprice.csv', usecols=cols_to_use)
print(data.shape)
data.head()

(1460, 4)


   LotFrontage  MasVnrArea  GarageYrBlt  SalePrice
0         65.0       196.0       2003.0     208500
1         80.0         0.0       1976.0     181500
2         68.0       162.0       2001.0     223500
3         60.0         0.0       1998.0     140000
4         84.0       350.0       2000.0     250000


In [4]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(data,
                                                    data['SalePrice'],
                                                    test_size=0.3,
                                                    random_state=0)
X_train.shape, X_test.shape

((1022, 4), (438, 4))

In [5]:
# find the fraction of missing data within those variables
# same code as we learnt in section 3 on variable characteristics

X_train.isnull().mean()

LotFrontage    0.184932
MasVnrArea     0.004892
GarageYrBlt    0.052838
SalePrice      0.000000
dtype: float64
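The call above works because `isnull()` returns a boolean dataframe, and the mean of a boolean column is the fraction of `True` values. A minimal sketch on a hypothetical two-column dataframe (the names `df`, `a` and `b` are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe: column 'a' has 2 of 4 values missing, 'b' has none
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': [1.0, 2.0, 3.0, 4.0],
})

# True where a value is missing; the column mean is the missing fraction
missing_fraction = df.isnull().mean()

print(missing_fraction)
```

Multiplying the result by 100 gives the percentage of missing values per variable.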

In [6]:
# let's make a function to fill missing values with the mean or median:
# the function takes the dataframe, the variable, and the value of the
# mean or median, and returns the variable with the NA filled in


def impute_na(df, variable, value):

    return df[variable].fillna(value)

In [7]:
# replace the missing values with the median, learnt from the train set

value = X_train['LotFrontage'].median()

X_train.loc[:,'LotFrontage'] = impute_na(X_train, 'LotFrontage', value)
X_test.loc[:,'LotFrontage'] = impute_na(X_test, 'LotFrontage', value)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


In [8]:
# and we repeat for the other 2 variables
value = X_train['MasVnrArea'].median()

X_train.loc[:,'MasVnrArea'] = impute_na(X_train, 'MasVnrArea', value)
X_test.loc[:,'MasVnrArea'] = impute_na(X_test, 'MasVnrArea', value)

value = X_train['GarageYrBlt'].median()

X_train.loc[:, 'GarageYrBlt'] = impute_na(X_train, 'GarageYrBlt', value)
X_test.loc[:,'GarageYrBlt'] = impute_na(X_test, 'GarageYrBlt', value)
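The steps above repeat the same pattern for each variable, so they can be folded into a loop. The sketch below uses a hypothetical toy dataframe (`full`, with columns `a` and `b`) in place of the House Prices data. It also takes explicit copies of the train and test slices first, which in many pandas versions is one way to avoid the "value is trying to be set on a copy of a slice" message printed earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the full dataset
full = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 8.0, np.nan, 5.0],
    'b': [np.nan, 2.0, 4.0, 6.0, 1.0, np.nan],
})

# Explicit copies, so later assignments do not target a slice of `full`
train = full.iloc[:4].copy()
test = full.iloc[4:].copy()

# Learn the medians from the train set only
medians = {var: train[var].median() for var in ['a', 'b']}

# Apply the same learnt values to both the train and the test set
for var, value in medians.items():
    train[var] = train[var].fillna(value)
    test[var] = test[var].fillna(value)

print(medians)
```

Keeping the learnt values in a dictionary makes it easy to store them and reuse them later on new data.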

In [9]:
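# if instead we want to replace the missing values with the mean, we do the following
# (note: run this on freshly loaded data; after the cells above,
# the variables overwritten in place have no NA left to fill)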

value = X_train['LotFrontage'].mean()

X_train.loc[:,'LotFrontage_value'] = impute_na(X_train, 'LotFrontage', value)
X_test.loc[:,'LotFrontage_value'] = impute_na(X_test, 'LotFrontage', value)

value = X_train['MasVnrArea'].mean()

X_train.loc[:,'MasVnrArea'] = impute_na(X_train, 'MasVnrArea', value)
X_test.loc[:,'MasVnrArea'] = impute_na(X_test, 'MasVnrArea', value)

value = X_train['GarageYrBlt'].mean()

X_train.loc[:, 'GarageYrBlt'] = impute_na(X_train, 'GarageYrBlt', value)
X_test.loc[:,'GarageYrBlt'] = impute_na(X_test, 'GarageYrBlt', value)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)


## Mean / median imputation with Scikit-learn

In [10]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

# these are the packages we need to impute missing data
# with sklearn
from sklearn.impute import SimpleImputer

# to split the datasets
from sklearn.model_selection import train_test_split

In [11]:
# let's load the House Prices dataset

data = pd.read_csv('../houseprice.csv', usecols=cols_to_use)

# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(data,
                                                    data['SalePrice'],
                                                    test_size=0.3,
                                                    random_state=0)

In [12]:
# Now we impute the missing values with SimpleImputer

# create an instance of the simple imputer
# we indicate that we want to impute with the median
imputer = SimpleImputer(strategy='median')

# we fit the imputer to the train set
# the imputer will learn the median of all variables
imputer.fit(X_train)

# we can look at the learnt medians like this:
imputer.statistics_

array([6.900e+01, 0.000e+00, 1.979e+03, 1.630e+05])

In [13]:
# and now we impute the train and test set

# NOTE: the data is returned as a numpy array!!!
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)
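Because `transform` returns a NumPy array, the column names and the index are lost. One possible way to recover them, sketched here on a hypothetical toy dataframe (`train` with columns `a` and `b`, not the House Prices data), is to wrap the array back into a dataframe:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical train set with missing values
train = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 8.0],
    'b': [np.nan, 2.0, 4.0, 6.0],
})

imputer = SimpleImputer(strategy='median')
imputer.fit(train)

# transform returns an array; rebuild the dataframe with the
# original column names and index
train_imputed = pd.DataFrame(
    imputer.transform(train),
    columns=train.columns,
    index=train.index,
)

print(train_imputed)
```

This relies on `SimpleImputer` keeping the column order of its input, which holds when no column is entirely missing.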

### Mean / median imputation selecting features to impute

In [14]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

# these are the packages we need to impute missing data
# with sklearn
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# to split the datasets
from sklearn.model_selection import train_test_split

In [15]:
# let's load the House Prices dataset

data = pd.read_csv('../houseprice.csv', usecols=cols_to_use)

# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(data,
                                                    data['SalePrice'],
                                                    test_size=0.3,
                                                    random_state=0)

In [16]:
# first we need to make lists, indicating which features
# will be imputed with each method

numeric_features_mean = ['LotFrontage']
numeric_features_median = ['MasVnrArea', 'GarageYrBlt']

# then we instantiate the imputers, within a pipeline
# we create one mean imputer and one median imputer
# by changing the parameter in the strategy

numeric_mean_imputer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
])

numeric_median_imputer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
])

# then we put the features list and the transformers together
# using the column transformer

preprocessor = ColumnTransformer(transformers=[
    ('mean_imputer', numeric_mean_imputer, numeric_features_mean),
    ('median_imputer', numeric_median_imputer, numeric_features_median)
])

In [17]:
# now we fit the preprocessor
preprocessor.fit(X_train)

ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('mean_imputer', Pipeline(memory=None,
     steps=[('imputer', SimpleImputer(copy=True, fill_value=None, missing_values=nan, strategy='mean',
       verbose=0))]), ['LotFrontage']), ('median_imputer', Pipeline(memory=None,
     steps=[('imputer', SimpleImputer(copy=True, fill_value=None, missing_values=nan,
       strategy='median', verbose=0))]), ['MasVnrArea', 'GarageYrBlt'])])

In [18]:
# and now we can impute the data
X_train = preprocessor.transform(X_train)
X_test = preprocessor.transform(X_test)

In [19]:
# be careful: Scikit-learn transformers return NumPy arrays!
X_train

array([[  69.66866747,  573.        , 1998.        ],
       [  69.66866747,    0.        , 1996.        ],
       [  50.        ,    0.        , 1979.        ],
       ...,
       [  68.        ,    0.        , 1978.        ],
       [  69.66866747,   18.        , 2003.        ],
       [  58.        ,   30.        , 1998.        ]])
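Note that the array above has three columns rather than four. With the default `remainder='drop'`, the `ColumnTransformer` returns only the listed features, in the order of the transformers, so `SalePrice` is not included. The sketch below illustrates this on a hypothetical toy dataframe (columns `x`, `y` and `target` are made-up names) and rebuilds a dataframe from the output:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical data: two features with NA and one column we do not impute
df = pd.DataFrame({
    'x': [1.0, np.nan, 3.0, 8.0],
    'y': [np.nan, 2.0, 4.0, 6.0],
    'target': [10, 20, 30, 40],
})

mean_features = ['x']
median_features = ['y']

# remainder='drop' is the default, so 'target' is left out of the output
ct = ColumnTransformer(transformers=[
    ('mean_imputer', SimpleImputer(strategy='mean'), mean_features),
    ('median_imputer', SimpleImputer(strategy='median'), median_features),
])

arr = ct.fit_transform(df)

# Output columns follow the order of the transformers
out = pd.DataFrame(arr, columns=mean_features + median_features, index=df.index)

print(out)
```

Passing `remainder='passthrough'` instead keeps the unlisted columns, appended after the transformed ones.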

## Mean / median imputation with Feature-Engine

In [20]:
import pandas as pd
import numpy as np

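# note: this import path is from older Feature-engine releases;
# in version 1.0 and later the equivalent is
# from feature_engine.imputation import MeanMedianImputer
from feature_engine import missing_data_imputers as mi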

In [21]:
# let's load the House Prices dataset

data = pd.read_csv('../houseprice.csv', usecols=cols_to_use)

# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(data,
                                                    data['SalePrice'],
                                                    test_size=0.3,
                                                    random_state=0)

In [22]:
median_imputer = mi.MeanMedianImputer(
    imputation_method='median',
    variables=['LotFrontage', 'MasVnrArea', 'GarageYrBlt'])

median_imputer.fit(X_train)

MeanMedianImputer(imputation_method='median',
         variables=['LotFrontage', 'MasVnrArea', 'GarageYrBlt'])

In [23]:
# dictionary with the mappings for each variable
median_imputer.imputer_dict_

{'LotFrontage': 69.0, 'MasVnrArea': 0.0, 'GarageYrBlt': 1979.0}

In [24]:
# transform the data
X_train = median_imputer.transform(X_train)
X_test = median_imputer.transform(X_test)

In [25]:
X_train.isnull().mean()

LotFrontage    0.0
MasVnrArea     0.0
GarageYrBlt    0.0
SalePrice      0.0
dtype: float64