## Adding Missing Indicators with Scikit-learn ==> MissingIndicator

Scikit-learn provides the **MissingIndicator** class to add binary variables that flag missing values (NA).

The MissingIndicator can add missing indicators to all the variables in the dataset, or only to those that show NA in the train set.
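The two options correspond to the `features` parameter of the transformer. The following is a minimal sketch on a made-up toy DataFrame (the data and variable names are hypothetical, not part of the demo) that compares `features="missing-only"` with `features="all"`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import MissingIndicator

# Hypothetical toy data: only column "b" contains NA.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [np.nan, 5.0, np.nan]})

# features="missing-only": indicators only for columns that had NA during fit.
only_missing = MissingIndicator(features="missing-only").fit(toy)

# features="all": one indicator per column, whether or not it had NA.
all_features = MissingIndicator(features="all").fit(toy)

print(only_missing.features_)             # index of the columns that get an indicator
print(only_missing.transform(toy).shape)  # one indicator column
print(all_features.transform(toy).shape)  # two indicator columns
```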

### Attention!

The transformer returns only the binary indicator variables, which then need to be added to the original data.

### More details about the transformers

- [MissingIndicator](https://scikit-learn.org/stable/modules/generated/sklearn.impute.MissingIndicator.html)


## In this demo:

We will add missing indicators to the variables of the Ames House Prices dataset.

- To download the dataset please refer to the lecture **Datasets** in **Section 2** of this course.

In [1]:
import pandas as pd

from sklearn.impute import SimpleImputer, MissingIndicator

from sklearn.model_selection import train_test_split

In [2]:
# We use only the following variables for the demo,
# containing a mix of categorical and numerical features.

cols_to_use = [
    "BsmtQual",
    "FireplaceQu",
    "MSZoning",
    "BsmtUnfSF",
    "LotFrontage",
    "MasVnrArea",
    "Street",
    "Alley",
    "SalePrice",
]

In [3]:
# let's load the House Prices dataset

data = pd.read_csv("../../houseprice.csv", usecols=cols_to_use)

data.head()

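| | MSZoning | LotFrontage | Street | Alley | MasVnrArea | BsmtQual | BsmtUnfSF | FireplaceQu | SalePrice |
|---|---|---|---|---|---|---|---|---|---|
| 0 | RL | 65.0 | Pave | NaN | 196.0 | Gd | 150 | NaN | 208500 |
| 1 | RL | 80.0 | Pave | NaN | 0.0 | Gd | 284 | TA | 181500 |
| 2 | RL | 68.0 | Pave | NaN | 162.0 | Gd | 434 | TA | 223500 |
| 3 | RL | 60.0 | Pave | NaN | 0.0 | TA | 540 | Gd | 140000 |
| 4 | RL | 84.0 | Pave | NaN | 350.0 | Gd | 490 | TA | 250000 |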


In [4]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    data.drop("SalePrice", axis=1),  # just the features
    data["SalePrice"],  # the target
    test_size=0.3,  # the percentage of obs in the test set
    random_state=0,  # for reproducibility
)

X_train.shape, X_test.shape

((1022, 8), (438, 8))

In [5]:
# let's check the fraction of missing data in each variable
X_train.isnull().mean()

MSZoning       0.000000
LotFrontage    0.184932
Street         0.000000
Alley          0.939335
MasVnrArea     0.004892
BsmtQual       0.023483
BsmtUnfSF      0.000000
FireplaceQu    0.467710
dtype: float64

## Add Missing Indicators

In [6]:
indicator = MissingIndicator(
    error_on_new=True,
    features="missing-only",
)

indicator.fit(X_train)
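The `error_on_new=True` setting used above only matters together with `features="missing-only"`. As a rough illustration on hypothetical toy data (not the house prices data), `transform` raises a `ValueError` when a variable shows NA that had none during `fit`, while `error_on_new=False` ignores that variable:

```python
import numpy as np
import pandas as pd
from sklearn.impute import MissingIndicator

# Hypothetical toy data: during fit, only column "b" has NA.
train = pd.DataFrame({"a": [1.0, 2.0], "b": [np.nan, 4.0]})
# During transform, column "a" shows NA for the first time.
new = pd.DataFrame({"a": [np.nan, 2.0], "b": [3.0, 4.0]})

strict = MissingIndicator(features="missing-only", error_on_new=True).fit(train)
try:
    strict.transform(new)
    raised = False
except ValueError as err:
    raised = True
    print(err)

# With error_on_new=False the new NA in "a" is ignored:
# the output still has a single indicator, for "b" only.
lenient = MissingIndicator(features="missing-only", error_on_new=False).fit(train)
print(lenient.transform(new))
```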

In [7]:
# We can see the features that contained NA.
# The result shows the index of the columns.

indicator.features_

array([1, 3, 4, 5, 7], dtype=int64)

In [8]:
# we can find the feature names by passing the index to the
# list of columns

X_train.columns[indicator.features_]

Index(['LotFrontage', 'Alley', 'MasVnrArea', 'BsmtQual', 'FireplaceQu'], dtype='object')

In [9]:
# the indicator returns only the additional indicators
# when we transform the dataset

tmp = indicator.transform(X_train)

tmp

array([[ True,  True, False, False,  True],
       [ True,  True, False, False, False],
       [False,  True, False, False,  True],
       ...,
       [False,  True, False,  True,  True],
       [ True,  True, False, False, False],
       [False,  True, False, False,  True]])

In [10]:
# variable names for the indicators come out of the box

indicator.get_feature_names_out()

array(['missingindicator_LotFrontage', 'missingindicator_Alley',
       'missingindicator_MasVnrArea', 'missingindicator_BsmtQual',
       'missingindicator_FireplaceQu'], dtype=object)

In [11]:
# so we need to join them manually to the original X_train.
# reset_index() realigns the rows with the indicator array
# (the old index is kept as a new "index" column)
X_train = pd.concat(
    [
        X_train.reset_index(),
        pd.DataFrame(tmp, columns=indicator.get_feature_names_out()),
    ],
    axis=1,
)

X_train.head()

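| | index | MSZoning | LotFrontage | Street | Alley | MasVnrArea | BsmtQual | BsmtUnfSF | FireplaceQu | missingindicator_LotFrontage | missingindicator_Alley | missingindicator_MasVnrArea | missingindicator_BsmtQual | missingindicator_FireplaceQu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 64 | RL | NaN | Pave | NaN | 573.0 | Gd | 318 | NaN | True | True | False | False | True |
| 1 | 682 | RL | NaN | Pave | NaN | 0.0 | Gd | 288 | Gd | True | True | False | False | False |
| 2 | 960 | RL | 50.0 | Pave | NaN | 0.0 | TA | 162 | NaN | False | True | False | False | True |
| 3 | 1384 | RL | 60.0 | Pave | NaN | 0.0 | TA | 356 | NaN | False | True | False | False | True |
| 4 | 1100 | RL | 60.0 | Pave | NaN | 0.0 | TA | 0 | NaN | False | True | False | False | True |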


In [12]:
# now the same for the test set
tmp = indicator.transform(X_test)

X_test = pd.concat(
    [
        X_test.reset_index(),
        pd.DataFrame(tmp, columns=indicator.get_feature_names_out()),
    ],
    axis=1,
)

X_test.head()

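| | index | MSZoning | LotFrontage | Street | Alley | MasVnrArea | BsmtQual | BsmtUnfSF | FireplaceQu | missingindicator_LotFrontage | missingindicator_Alley | missingindicator_MasVnrArea | missingindicator_BsmtQual | missingindicator_FireplaceQu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 529 | RL | NaN | Pave | NaN | NaN | TA | 816 | TA | True | True | True | False | False |
| 1 | 491 | RL | 79.0 | Pave | NaN | 0.0 | TA | 238 | TA | False | True | False | False | False |
| 2 | 459 | RL | NaN | Pave | NaN | 161.0 | TA | 524 | TA | True | True | False | False | False |
| 3 | 279 | RL | 83.0 | Pave | NaN | 299.0 | Gd | 768 | TA | False | True | False | False | False |
| 4 | 655 | RM | 21.0 | Pave | NaN | 381.0 | TA | 525 | NaN | False | True | False | False | True |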


## Add indicators with the SimpleImputer

The SimpleImputer can add the missing indicators directly while it imputes, through the parameter `add_indicator=True`.

In [13]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    data.drop("SalePrice", axis=1),  # just the features
    data["SalePrice"],  # the target
    test_size=0.3,  # the percentage of obs in the test set
    random_state=0,  # for reproducibility
)

In [14]:
# We impute features with the most frequent value for
# simplicity to showcase how we can add indicators with
# the simple Imputer:

imputer = SimpleImputer(
    strategy="most_frequent",
    add_indicator=True,
).set_output(transform="pandas")

# we fit the imputer to the train set
# the imputer will learn the mode of all variables
imputer.fit(X_train)

In [15]:
# we can look at the frequent values like this:

imputer.statistics_

array(['RL', 60.0, 'Pave', 'Pave', 0.0, 'TA', 0, 'Gd'], dtype=object)

**Note** that the transformer learns the most frequent value for both categorical AND numerical variables.
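To make this behaviour easier to see, here is a small sketch on hypothetical toy data with one categorical and one numerical variable (the column names are made up). With `strategy="most_frequent"`, the imputer learns the mode of each variable, and `add_indicator=True` appends one indicator per variable that had NA during fit:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one categorical and one numerical variable, both with NA.
toy = pd.DataFrame(
    {
        "quality": ["Gd", "TA", "Gd", np.nan],
        "area": [0.0, 0.0, 120.0, np.nan],
    }
)

imputer = SimpleImputer(strategy="most_frequent", add_indicator=True)
imputed = imputer.fit_transform(toy)

print(imputer.statistics_)  # the mode learned for each variable
print(imputed.shape)        # the 2 original columns plus 2 indicators
```

In practice, a numerical variable is often imputed with the mean or median instead; the mode is used here only to mirror the demo.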

In [16]:
# and now we impute the train and test set

# NOTE: because we called set_output(transform="pandas"),
# the data is returned as a pandas DataFrame

X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

X_train

| | MSZoning | LotFrontage | Street | Alley | MasVnrArea | BsmtQual | BsmtUnfSF | FireplaceQu | missingindicator_LotFrontage | missingindicator_Alley | missingindicator_MasVnrArea | missingindicator_BsmtQual | missingindicator_FireplaceQu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 64 | RL | 60.0 | Pave | Pave | 573.0 | Gd | 318 | Gd | True | True | False | False | True |
| 682 | RL | 60.0 | Pave | Pave | 0.0 | Gd | 288 | Gd | True | True | False | False | False |
| 960 | RL | 50.0 | Pave | Pave | 0.0 | TA | 162 | Gd | False | True | False | False | True |
| 1384 | RL | 60.0 | Pave | Pave | 0.0 | TA | 356 | Gd | False | True | False | False | True |
| 1100 | RL | 60.0 | Pave | Pave | 0.0 | TA | 0 | Gd | False | True | False | False | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | RL | 82.0 | Pave | Pave | 673.0 | Gd | 89 | Gd | False | True | False | False | False |
| 835 | RL | 60.0 | Pave | Pave | 0.0 | Gd | 625 | Gd | False | True | False | False | True |
| 1216 | RM | 68.0 | Pave | Pave | 0.0 | TA | 0 | Gd | False | True | False | True | True |
| 559 | RL | 60.0 | Pave | Pave | 18.0 | Gd | 1374 | TA | True | True | False | False | False |
