## Adding a missing indicator variable with Scikit-learn: MissingIndicator

Scikit-learn provides the **MissingIndicator** class to add a binary variable that flags NA.

MissingIndicator can add a binary missing indicator either for every variable in the dataset (`features='all'`) or only for the variables that show NA in the train set (`features='missing-only'`).
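As a minimal sketch of the difference between the two options, the snippet below fits the transformer on a small made-up DataFrame (`toy_train` and its columns are hypothetical, not part of the course data) in which only one column contains NA.

```python
import numpy as np
import pandas as pd
from sklearn.impute import MissingIndicator

# toy training data (hypothetical): only column "a" contains NA
toy_train = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, 6.0],
})

# features='missing-only': flag only the columns that had NA at fit time
only_missing = MissingIndicator(features="missing-only").fit(toy_train)

# features='all': flag every column, whether or not it had NA at fit time
all_features = MissingIndicator(features="all").fit(toy_train)

print(only_missing.features_)                   # positions of the flagged columns
print(all_features.features_)
print(only_missing.transform(toy_train).shape)  # one indicator column
print(all_features.transform(toy_train).shape)  # one indicator per input column
```

With `features='all'` the indicator for a column that never shows NA is constant, so `'missing-only'` is usually the more compact choice.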

### Attention!

The transformer returns only the binary indicator variables; we need to join them to the original data ourselves.

### More details about the transformers

- [MissingIndicator](https://scikit-learn.org/stable/modules/generated/sklearn.impute.MissingIndicator.html#sklearn.impute.MissingIndicator)

## In this demo:

We will add missing indicators to the variables of the Ames House Price dataset.

- To download the dataset please refer to the lecture **Datasets** in **Section 1** of this course.

In [2]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

# these are the objects we need to impute missing data
# with sklearn
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.pipeline import Pipeline

# to split the datasets
from sklearn.model_selection import train_test_split

In [3]:
# we use only the following variables for the demo:
# a mix of categorical and numerical

cols_to_use = ['BsmtQual', 'FireplaceQu', 'MSZoning',
               'BsmtUnfSF', 'LotFrontage', 'MasVnrArea',
               'Street', 'Alley', 'SalePrice']

In [4]:
# let's load the House Prices dataset

data = pd.read_csv('../houseprice.csv', usecols=cols_to_use)
print(data.shape)
data.head()

(1460, 9)


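| | MSZoning | LotFrontage | Street | Alley | MasVnrArea | BsmtQual | BsmtUnfSF | FireplaceQu | SalePrice |
|---|---|---|---|---|---|---|---|---|---|
| 0 | RL | 65.0 | Pave | NaN | 196.0 | Gd | 150 | NaN | 208500 |
| 1 | RL | 80.0 | Pave | NaN | 0.0 | Gd | 284 | TA | 181500 |
| 2 | RL | 68.0 | Pave | NaN | 162.0 | Gd | 434 | TA | 223500 |
| 3 | RL | 60.0 | Pave | NaN | 0.0 | TA | 540 | Gd | 140000 |
| 4 | RL | 84.0 | Pave | NaN | 350.0 | Gd | 490 | TA | 250000 |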


In [5]:
# let's check the null values
data.isnull().mean()

MSZoning       0.000000
LotFrontage    0.177397
Street         0.000000
Alley          0.937671
MasVnrArea     0.005479
BsmtQual       0.025342
BsmtUnfSF      0.000000
FireplaceQu    0.472603
SalePrice      0.000000
dtype: float64

In [6]:
# let's separate into training and testing set

# first let's remove the target from the features
cols_to_use.remove('SalePrice')

X_train, X_test, y_train, y_test = train_test_split(data[cols_to_use], # just the features
                                                    data['SalePrice'], # the target
                                                    test_size=0.3, # the percentage of obs in the test set
                                                    random_state=0) # for reproducibility
X_train.shape, X_test.shape

((1022, 8), (438, 8))

In [7]:
# let's check the missing data again
X_train.isnull().mean()

BsmtQual       0.023483
FireplaceQu    0.467710
MSZoning       0.000000
BsmtUnfSF      0.000000
LotFrontage    0.184932
MasVnrArea     0.004892
Street         0.000000
Alley          0.939335
dtype: float64

## Add a Missing Indicator

In [8]:
indicator = MissingIndicator(error_on_new=True, features='missing-only')
indicator.fit(X_train)  

MissingIndicator(error_on_new=True, features='missing-only',
         missing_values=nan, sparse='auto')
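The `error_on_new=True` setting used above matters only with `features='missing-only'`: at transform time the transformer raises an error if a column that had no NA during `fit` now contains NA. The sketch below illustrates this on made-up data (`toy_train` and `toy_new` are hypothetical frames, not the course dataset).

```python
import numpy as np
import pandas as pd
from sklearn.impute import MissingIndicator

# hypothetical data: "b" is complete at fit time but has NA in the new data
toy_train = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})
toy_new = pd.DataFrame({"a": [1.0, 2.0], "b": [np.nan, 5.0]})

# strict mode: NA in a column that was complete during fit raises ValueError
strict = MissingIndicator(features="missing-only", error_on_new=True).fit(toy_train)
try:
    strict.transform(toy_new)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)

# lenient mode: the unexpected NA in "b" is ignored, only "a" is flagged
lenient = MissingIndicator(features="missing-only", error_on_new=False).fit(toy_train)
flags = lenient.transform(toy_new)
print(flags.shape)
```

Strict mode is a useful safeguard: a new column with NA at prediction time would otherwise pass through without an indicator.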

In [9]:
# features_ holds the positional indices of the
# columns that showed NA when we fit the indicator

indicator.features_

array([0, 1, 4, 5, 7], dtype=int64)

In [12]:
# we can find the feature names by passing the index to the
# list of columns

X_train.columns[indicator.features_]

Index(['BsmtQual', 'FireplaceQu', 'LotFrontage', 'MasVnrArea', 'Alley'], dtype='object')

In [13]:
# the indicator returns only the additional indicators
# when we transform the dataset

tmp = indicator.transform(X_train)

tmp

array([[False,  True,  True, False,  True],
       [False, False,  True, False,  True],
       [False,  True, False, False,  True],
       ...,
       [ True,  True, False, False,  True],
       [False, False,  True, False,  True],
       [False,  True, False, False,  True]])

In [14]:
# so we need to join it manually to the original X_train

# let's create a column name for each of the new MissingIndicators
indicator_cols = [c+'_NA' for c in X_train.columns[indicator.features_]]

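# and now we concatenate
# note: reset_index() lines the rows up positionally with tmp,
# but it also keeps the old index as a new column called 'index'
X_train = pd.concat([
    X_train.reset_index(),
    pd.DataFrame(tmp, columns=indicator_cols)],
    axis=1)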

X_train.head()

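| | index | BsmtQual | FireplaceQu | MSZoning | BsmtUnfSF | LotFrontage | MasVnrArea | Street | Alley | BsmtQual_NA | FireplaceQu_NA | LotFrontage_NA | MasVnrArea_NA | Alley_NA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 64 | Gd | NaN | RL | 318 | NaN | 573.0 | Pave | NaN | False | True | True | False | True |
| 1 | 682 | Gd | Gd | RL | 288 | NaN | 0.0 | Pave | NaN | False | False | True | False | True |
| 2 | 960 | TA | NaN | RL | 162 | 50.0 | 0.0 | Pave | NaN | False | True | False | False | True |
| 3 | 1384 | TA | NaN | RL | 356 | 60.0 | 0.0 | Pave | NaN | False | True | False | False | True |
| 4 | 1100 | TA | NaN | RL | 0 | 60.0 | 0.0 | Pave | NaN | False | True | False | False | True |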


In [15]:
# now the same for the test set
tmp = indicator.transform(X_test)

X_test = pd.concat([
    X_test.reset_index(),
    pd.DataFrame(tmp, columns = indicator_cols)],
    axis=1)

X_test.head()

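| | index | BsmtQual | FireplaceQu | MSZoning | BsmtUnfSF | LotFrontage | MasVnrArea | Street | Alley | BsmtQual_NA | FireplaceQu_NA | LotFrontage_NA | MasVnrArea_NA | Alley_NA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 529 | TA | TA | RL | 816 | NaN | NaN | Pave | NaN | False | False | True | True | True |
| 1 | 491 | TA | TA | RL | 238 | 79.0 | 0.0 | Pave | NaN | False | False | False | False | True |
| 2 | 459 | TA | TA | RL | 524 | NaN | 161.0 | Pave | NaN | False | False | True | False | True |
| 3 | 279 | Gd | TA | RL | 768 | 83.0 | 299.0 | Pave | NaN | False | False | False | False | True |
| 4 | 655 | TA | NaN | RM | 525 | 21.0 | 381.0 | Pave | NaN | False | True | False | False | True |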
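The cells above call `reset_index()` before concatenating, which leaves the old row labels in an extra `index` column. One possible alternative, sketched here on a made-up frame (`toy`, `toy_indicator` and `flags` are hypothetical names), is to give the indicator DataFrame the same index as the original data so that `pd.concat` aligns the rows without adding a column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import MissingIndicator

# hypothetical frame with a shuffled index, as after train_test_split
toy = pd.DataFrame(
    {"LotFrontage": [np.nan, 60.0, 80.0], "BsmtUnfSF": [318.0, 288.0, 162.0]},
    index=[64, 682, 960],
)

toy_indicator = MissingIndicator(features="missing-only").fit(toy)
indicator_cols = [c + "_NA" for c in toy.columns[toy_indicator.features_]]

# reuse the original index so that concat aligns row by row
flags = pd.DataFrame(
    toy_indicator.transform(toy),
    columns=indicator_cols,
    index=toy.index,
)
joined = pd.concat([toy, flags], axis=1)
print(joined)
```

Without `index=toy.index`, the indicator frame would get a default 0..n-1 index and `pd.concat` would misalign the rows, which is why the original cells reset the index first.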


### SimpleImputer on the entire dataset

In [16]:
# Now we impute the missing values with SimpleImputer

# create an instance of the simple imputer
# we indicate that we want to impute with the 
# most frequent category

imputer = SimpleImputer(strategy='most_frequent')

# we fit the imputer to the train set
# the imputer will learn the most frequent value of each variable
imputer.fit(X_train)

SimpleImputer(copy=True, fill_value=None, missing_values=nan,
       strategy='most_frequent', verbose=0)

In [17]:
# we can look at the learned most frequent values like this:
imputer.statistics_

array([0, 'TA', 'Gd', 'RL', 0, 60.0, 0.0, 'Pave', 'Pave', False, False,
       False, False, True], dtype=object)

**Note** that the transformer learns the most frequent value for both categorical AND numerical variables, including the `index` column created by `reset_index()` and the new indicator columns.

In [18]:
# and now we impute the train and test set

# NOTE: transform() returns a numpy array, not a DataFrame
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

X_train

array([[64, 'Gd', 'Gd', ..., True, False, True],
       [682, 'Gd', 'Gd', ..., True, False, True],
       [960, 'TA', 'Gd', ..., False, False, True],
       ...,
       [1216, 'TA', 'Gd', ..., False, False, True],
       [559, 'Gd', 'TA', ..., True, False, True],
       [684, 'Gd', 'Gd', ..., False, False, True]], dtype=object)
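Because the imputer returns a plain numpy array of dtype `object`, the column names, the row index and the numeric dtypes are lost. A minimal sketch of one way to restore them, using a made-up frame (`toy` and `imputed` are hypothetical names):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# hypothetical mixed-type frame with NA in both columns
toy = pd.DataFrame(
    {"BsmtQual": ["Gd", np.nan, "Gd"], "LotFrontage": [60.0, 60.0, np.nan]},
    index=[64, 682, 960],
)

imputer = SimpleImputer(strategy="most_frequent")

# wrap the returned array back into a DataFrame with the original labels
imputed = pd.DataFrame(
    imputer.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

# every column comes back as object, so restore the numeric dtype explicitly
imputed = imputed.astype({"LotFrontage": float})
print(imputed)
print(imputed.dtypes)
```

In the demo above, the same wrapping would be applied to `X_train` and `X_test`, with the column names saved before calling `transform`.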