## Adding a variable to capture NA

In the previous lectures in this section we studied how to replace missing values with the most frequent category or with a random sample of the variable. Both methods assume that the data are missing completely at random (MCAR), and are suitable when the proportion of missing data is small; otherwise they may distort the distribution of the target within the labels of the variable.

So what if the proportion of missing data is not small, or the data are not MCAR?

We can capture the importance of missingness by creating an additional binary variable that indicates whether the value was missing for that observation (1) or not (0).

The procedure is exactly the same as for numerical variables.
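As a minimal sketch of the idea, the snippet below builds the indicator for a single categorical variable. The toy data frame is hypothetical and only borrows a column name from the dataset used later in this notebook.

```python
import numpy as np
import pandas as pd

# hypothetical toy data: a categorical variable with two missing values
toy = pd.DataFrame({'FireplaceQu': ['Gd', np.nan, 'TA', np.nan, 'Gd']})

# 1 if the value was missing for that observation, 0 otherwise
toy['FireplaceQu_NA'] = np.where(toy['FireplaceQu'].isnull(), 1, 0)

print(toy)
```

The original column is left untouched, so it still needs to be imputed with one of the methods from the previous lectures.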


### Advantages

- Easy to implement
- Captures the importance of missingness, if there is any

### Disadvantages

- Expands the feature space

This method of imputation adds one variable for each variable in the dataset that has missing values. So if a dataset contains 10 features, and all of them have missing values, we will end up with a dataset with 20 features: the 10 original features, in which we replaced the NA by the frequent label or by random sampling, and 10 additional features indicating, for each variable, whether the value was missing or not.
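To make the growth of the feature space concrete, here is a small sketch. The helper `add_missing_indicators` and the toy columns are hypothetical names introduced for illustration; they are not part of the notebook's own code.

```python
import numpy as np
import pandas as pd

def add_missing_indicators(df):
    """Return a copy of df with one 0/1 '<column>_NA' column
    for every column that contains at least one NA."""
    out = df.copy()
    for col in df.columns:
        if df[col].isnull().any():
            out[col + '_NA'] = df[col].isnull().astype(int)
    return out

# hypothetical toy data: three variables with NA and one complete variable
toy = pd.DataFrame({
    'A': ['x', np.nan, 'y'],
    'B': [np.nan, 'u', 'v'],
    'C': ['p', 'q', np.nan],
    'D': ['k', 'l', 'm'],
})

expanded = add_missing_indicators(toy)
print(toy.shape, expanded.shape)
```

Only the three variables with NA receive an indicator, so the toy frame grows from 4 to 7 columns; if every variable had missing values, the number of columns would double.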

This may not be a problem in datasets with tens to a few hundreds of variables, but if your original dataset contains thousands of variables, by creating an additional variable to indicate NA, you will end up with very big datasets.

In addition, data tend to be missing for the same observations across multiple variables, so many of the added variables may turn out to be very similar, or even identical, to each other.
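One way to spot such redundant indicators is to compare them directly. The sketch below uses hypothetical toy data in which two garage variables are missing on exactly the same rows; it is an illustration, not a step of this notebook.

```python
import numpy as np
import pandas as pd

# hypothetical toy data: the two garage variables are missing on the same rows
toy = pd.DataFrame({
    'GarageType':   ['Attchd', np.nan, 'Detchd', np.nan, 'Attchd'],
    'GarageFinish': ['Fin',    np.nan, 'Unf',    np.nan, 'RFn'],
    'FireplaceQu':  [np.nan,   'Gd',   'TA',     np.nan, 'Gd'],
})

# one 0/1 indicator per variable
indicators = toy.isnull().astype(int).add_suffix('_NA')

# correlation between indicators: values close to 1 flag near-duplicates
print(indicators.corr())

# indicators that are exact copies of an earlier indicator
duplicated = indicators.T.duplicated()
print(duplicated)
```

Indicators flagged as duplicates carry no extra information, so keeping just one of them would be a reasonable choice.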

===============================================================================

## Real-life example

### Predicting Sale Price of Houses

The problem at hand aims to predict the final sale price of homes based on different explanatory variables describing aspects of residential homes. Predicting house prices is useful to identify fruitful investments, or to determine whether the price advertised for a house is over- or underestimated, before making a purchase decision.

=============================================================================

In the following cells, I will demonstrate NA imputation by random sampling + adding an additional variable using the House Price datasets from Kaggle.

If you haven't downloaded the datasets yet, in the lecture "Guide to setting up your computer" in section 1, you can find the details on how to do so.

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

# for regression problems
from sklearn.linear_model import LinearRegression

# to split the dataset into train and test sets
from sklearn.model_selection import train_test_split

# to evaluate regression models
from sklearn.metrics import mean_squared_error

import warnings
warnings.filterwarnings('ignore')

### House Price dataset

In [2]:
# let's load the dataset with a few columns for the demonstration
cols_to_use = ['BsmtQual', 'FireplaceQu', 'GarageType', 'SalePrice']

data = pd.read_csv('houseprice.csv', usecols=cols_to_use)

# let's inspect the percentage of missing values in each variable
data.isnull().mean().sort_values(ascending=True)

SalePrice      0.000000
BsmtQual       0.025342
GarageType     0.055479
FireplaceQu    0.472603
dtype: float64

**To evaluate whether adding these additional variables to indicate missingness improves the performance of the ML algorithms, I will also replace the missing values by random sampling (see previous lecture).**

### Important note on imputation

Imputation should be done over the training set, and then propagated to the test set. This means that the random sampling of categories should be done from the training set, and used to replace NA both in train and test sets.
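The principle can be sketched on toy data as follows. The series and the helper `fill_from_pool` are hypothetical and only meant to show that the pool of categories comes from the train set, even when filling the test set.

```python
import numpy as np
import pandas as pd

# hypothetical toy train and test variables
train = pd.Series(['Gd', 'TA', np.nan, 'Gd', 'TA', 'Ex'], name='BsmtQual')
test = pd.Series([np.nan, 'Fa', np.nan], name='BsmtQual')

# the pool of categories to draw from is taken from the train set only
pool = train.dropna()

def fill_from_pool(series, pool, seed=0):
    """Fill the NA in series with values drawn at random from pool."""
    filled = series.copy()
    missing = filled.isnull()
    drawn = pool.sample(missing.sum(), random_state=seed)
    filled[missing] = drawn.values
    return filled

train_filled = fill_from_pool(train, pool)
test_filled = fill_from_pool(test, pool)

print(test_filled)
```

Every value used to fill the test set was observed in the train set, so no information from the test set leaks into the imputation.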

In [3]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(data[['BsmtQual', 'FireplaceQu', 'GarageType']],
                                                    data.SalePrice, test_size=0.3,
                                                    random_state=0)
X_train.shape, X_test.shape

((1022, 3), (438, 3))

In [4]:
# let's create a function that replaces NA with a random sample of the labels available within that variable
# in the same function we create the additional variable to indicate missingness

# make sure you understand every line of code.
# If unsure, run them separately in a cell in the notebook until you are familiar with the output
# of each line

def impute_na(df_train, df_test, variable):
    # add additional variable to indicate missingness
    df_train[variable+'_NA'] = np.where(df_train[variable].isnull(), 1, 0)
    df_test[variable+'_NA'] = np.where(df_test[variable].isnull(), 1, 0)
    
    # random sampling
    df_train[variable+'_random'] = df_train[variable]
    df_test[variable+'_random'] = df_test[variable]
    
    # extract the random sample to fill the NA
    # (both samples are drawn from the train set, so nothing is learned from the test set)
    random_sample_train = df_train[variable].dropna().sample(df_train[variable].isnull().sum(), random_state=0)
    random_sample_test = df_train[variable].dropna().sample(df_test[variable].isnull().sum(), random_state=0)

    # give the sampled values the index of the missing rows,
    # so that pandas aligns them correctly on assignment
    random_sample_train.index = df_train[df_train[variable].isnull()].index
    random_sample_test.index = df_test[df_test[variable].isnull()].index
    
    df_train.loc[df_train[variable].isnull(), variable+'_random'] = random_sample_train
    df_test.loc[df_test[variable].isnull(), variable+'_random'] = random_sample_test
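One edge case worth keeping in mind: `sample` draws without replacement by default, so it raises an error if a variable has more missing values than observed ones. That does not happen with the three variables used here, but the hypothetical toy series below sketches the problem and one possible workaround, sampling with replacement.

```python
import numpy as np
import pandas as pd

# hypothetical variable with more missing values (4) than observed ones (2)
s = pd.Series(['Gd', np.nan, np.nan, 'TA', np.nan, np.nan])
n_missing = s.isnull().sum()

# sampling without replacement cannot return more values than are available
try:
    s.dropna().sample(n_missing, random_state=0)
    failed = False
except ValueError:
    failed = True
print('sampling without replacement failed:', failed)

# sampling with replacement works for any number of missing values
sample = s.dropna().sample(n_missing, replace=True, random_state=0)
sample.index = s[s.isnull()].index

s_filled = s.copy()
s_filled.loc[s.isnull()] = sample
print(s_filled)
```

Sampling with replacement changes which values are drawn, so swapping it into the function above would not reproduce the exact outputs shown in this notebook.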

In [5]:
# and let's replace the NA
for variable in ['BsmtQual', 'FireplaceQu', 'GarageType',]:
    impute_na(X_train, X_test, variable)

In [6]:
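# let's check that the NA were replaced in the new '_random' variables
# (the original variables are left untouched and still contain NA)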
X_train.isnull().sum()

BsmtQual               24
FireplaceQu           478
GarageType             54
BsmtQual_NA             0
BsmtQual_random         0
FireplaceQu_NA          0
FireplaceQu_random      0
GarageType_NA           0
GarageType_random       0
dtype: int64

In [7]:
X_train.head()

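| | BsmtQual | FireplaceQu | GarageType | BsmtQual_NA | BsmtQual_random | FireplaceQu_NA | FireplaceQu_random | GarageType_NA | GarageType_random |
|---|---|---|---|---|---|---|---|---|---|
| 64 | Gd | NaN | Attchd | 0 | Gd | 1 | Gd | 0 | Attchd |
| 682 | Gd | Gd | Attchd | 0 | Gd | 0 | Gd | 0 | Attchd |
| 960 | TA | NaN | NaN | 0 | TA | 1 | TA | 1 | Attchd |
| 1384 | TA | NaN | Detchd | 0 | TA | 1 | TA | 0 | Detchd |
| 1100 | TA | NaN | Detchd | 0 | TA | 1 | Gd | 0 | Detchd |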


In [8]:
X_train.describe()

| | BsmtQual_NA | FireplaceQu_NA | GarageType_NA |
|---|---|---|---|
| count | 1022.0 | 1022.0 | 1022.0 |
| mean | 0.023483 | 0.46771 | 0.052838 |
| std | 0.151507 | 0.499201 | 0.223819 |
| min | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 |
| 75% | 0.0 | 1.0 | 0.0 |
| max | 1.0 | 1.0 | 1.0 |


In [9]:
# let's transform the categories into numbers quick and dirty so we can use them in scikit-learn

# the loop below numbers the labels from 0 to n-1, n being the number of different labels
# within the variable. The mapping is learned on the train set, so a label that appears
# only in the test set would be mapped to NaN

for col in ['BsmtQual_random', 'FireplaceQu_random', 'GarageType_random']:
    labels_dict = {k: i for i, k in enumerate(X_train[col].unique(), 0)}
    X_train[col] = X_train[col].map(labels_dict)
    X_test[col] = X_test[col].map(labels_dict)
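Because the mapping is built from the labels seen in the train set, it is worth checking whether the test set contains labels the mapping does not know about. A minimal sketch on hypothetical toy columns:

```python
import pandas as pd

# hypothetical toy columns: 'Fa' never appears in the train set
train_col = pd.Series(['Gd', 'TA', 'Gd', 'Ex'])
test_col = pd.Series(['TA', 'Fa', 'Gd'])

# same mapping logic as in the cell above
labels_dict = {k: i for i, k in enumerate(train_col.unique(), 0)}
print(labels_dict)

# labels present in test but absent from the mapping
unseen = set(test_col.unique()) - set(labels_dict)
print('unseen labels:', unseen)

# map() turns unseen labels into NaN, which scikit-learn would reject
encoded_test = test_col.map(labels_dict)
print(encoded_test)
```

If `unseen` is not empty, those rows need to be handled before fitting, for example by grouping rare labels, which is outside the scope of this demonstration.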

### Linear Regression

In [10]:
# Let's evaluate the performance of Linear Regression

# first we build a model using ONLY the variables with the NA replaced by a random sample
linreg = LinearRegression()
linreg.fit(X_train[['BsmtQual_random', 'FireplaceQu_random', 'GarageType_random']], y_train)
print('Test set random imputation')
pred = linreg.predict(X_test[['BsmtQual_random', 'FireplaceQu_random', 'GarageType_random']])
print('Linear Regression mse: {}'.format(mean_squared_error(y_test, pred)))
print()

# second we build a model including the variables that indicate missingness as well
linreg = LinearRegression()
linreg.fit(X_train[['BsmtQual_random', 'FireplaceQu_random', 'GarageType_random',
                   'BsmtQual_NA', 'FireplaceQu_NA', 'GarageType_NA']], y_train)
print('Test set random imputation + additional variable indicating missingness')
pred = linreg.predict(X_test[['BsmtQual_random', 'FireplaceQu_random', 'GarageType_random',
                             'BsmtQual_NA', 'FireplaceQu_NA', 'GarageType_NA']])
print('Linear Regression mse: {}'.format(mean_squared_error(y_test, pred)))

Test set random imputation
Linear Regression mse: 6456070592.706035

Test set random imputation + additional variable indicating missingness
Linear Regression mse: 4911877327.956806


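In this exercise we can see the value of creating the additional variables to capture missingness. The MSE on the test set decreased markedly when we included the variables that indicate whether the observations contained missing values. The absolute reduction in MSE is:

In [11]:
6456070592-4911877327

1544193265

Note that the MSE is expressed in squared dollars, so this difference of roughly 1.5 billion is not a saving in dollars. It corresponds to a reduction of about 24% in MSE. In terms of root mean squared error, the typical prediction error drops from roughly 80,350 to roughly 70,085 dollars.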
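To read the improvement in the units of the target, keep in mind that the MSE is measured in squared dollars, so its square root (the RMSE) is the quantity that can be interpreted as dollars. The short sketch below redoes the arithmetic using the two MSE values printed above; the variable names are introduced here for illustration.

```python
import math

# MSE values copied from the Linear Regression output above
mse_random = 6456070592.706035
mse_random_plus_na = 4911877327.956806

# RMSE is in the same units as SalePrice (dollars)
rmse_random = math.sqrt(mse_random)
rmse_random_plus_na = math.sqrt(mse_random_plus_na)

# relative reduction in MSE
relative_reduction = (mse_random - mse_random_plus_na) / mse_random

print('RMSE random imputation: {:.0f}'.format(rmse_random))
print('RMSE random imputation + missing indicators: {:.0f}'.format(rmse_random_plus_na))
print('Relative reduction in MSE: {:.1%}'.format(relative_reduction))
```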

**That is all for this demonstration. I hope you enjoyed the notebook, and see you in the next one.**