## Missing Indicator ==> Feature-Engine

Feature-engine is an open-source Python package that was originally designed to support this course. It has since gained popularity and now supports transformations beyond those taught here. It was launched in 2017; since then, several releases have appeared, and a growing international community is beginning to lead its development.

- Feature-engine works like Scikit-learn, so it is easy to learn
- Feature-engine allows you to apply specific engineering steps to specific feature subsets
- Feature-engine can be integrated with the Scikit-learn pipeline, allowing for smooth model building

**Feature-engine allows you to design and store a feature engineering pipeline with different procedures for different variable groups.**

- Make sure you have installed feature-engine before running this notebook.

## In this demo

We will use Feature-engine to add missing indicators using the Ames House Price Dataset.

- To download the dataset visit the lecture **Datasets** in **Section 1** of the course.

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

# to split the datasets
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# from feature-engine
from feature_engine.imputation import (
    AddMissingIndicator,
    MeanMedianImputer,
    CategoricalImputer
)

In [2]:
# let's load the dataset with a selected group of variables

cols_to_use = [
    'BsmtQual', 'FireplaceQu', 'LotFrontage', 'MasVnrArea', 'GarageYrBlt',
    'SalePrice'
]

data = pd.read_csv('../houseprice.csv', usecols=cols_to_use)
data.head()

Unnamed: 0,LotFrontage,MasVnrArea,BsmtQual,FireplaceQu,GarageYrBlt,SalePrice
0,65.0,196.0,Gd,,2003.0,208500
1,80.0,0.0,Gd,TA,1976.0,181500
2,68.0,162.0,Gd,TA,2001.0,223500
3,60.0,0.0,TA,Gd,1998.0,140000
4,84.0,350.0,Gd,TA,2000.0,250000


In [3]:
data.isnull().mean()

LotFrontage    0.177397
MasVnrArea     0.005479
BsmtQual       0.025342
FireplaceQu    0.472603
GarageYrBlt    0.055479
SalePrice      0.000000
dtype: float64

In [4]:
# let's separate into training and testing set

# first drop the target from the feature list
cols_to_use.remove('SalePrice')

X_train, X_test, y_train, y_test = train_test_split(
    data[cols_to_use],
    data['SalePrice'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((1022, 5), (438, 5))

## Feature-engine's missing indicator selects all variables by default

In [5]:
# we set up the imputer from feature-engine
# the argument missing_only determines whether we want
# to add missing indicators to all variables (False), or only to
# those that show missing data in the train set (True)

imputer = AddMissingIndicator(missing_only=True)

In [6]:
# we fit the imputer

imputer.fit(X_train)

AddMissingIndicator()

In [7]:
# the attribute `variables` shows the variables entered by the user, in this
# case None

imputer.variables

In [8]:
# this attribute stores the variables, numerical and categorical,
# that had missing data in the train set

imputer.variables_

['BsmtQual', 'FireplaceQu', 'LotFrontage', 'MasVnrArea', 'GarageYrBlt']

In [9]:
# feature-engine returns a dataframe
# with the additional features

# no need to concatenate!!

tmp = imputer.transform(X_train)
tmp.head()

Unnamed: 0,BsmtQual,FireplaceQu,LotFrontage,MasVnrArea,GarageYrBlt,BsmtQual_na,FireplaceQu_na,LotFrontage_na,MasVnrArea_na,GarageYrBlt_na
64,Gd,,,573.0,1998.0,0,1,1,0,0
682,Gd,Gd,,0.0,1996.0,0,0,1,0,0
960,TA,,50.0,0.0,,0,1,0,0,1
1384,TA,,60.0,0.0,1939.0,0,1,0,0,0
1100,TA,,60.0,0.0,1930.0,0,1,0,0,0


In [10]:
# let's check NA

tmp.isnull().mean()

BsmtQual          0.023483
FireplaceQu       0.467710
LotFrontage       0.184932
MasVnrArea        0.004892
GarageYrBlt       0.052838
BsmtQual_na       0.000000
FireplaceQu_na    0.000000
LotFrontage_na    0.000000
MasVnrArea_na     0.000000
GarageYrBlt_na    0.000000
dtype: float64
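
For intuition, the fit/transform behaviour shown above can be approximated in plain pandas. This is only a sketch on made-up toy data, not Feature-engine's actual implementation: the helper `add_na_indicators` and the frames `train` and `test` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

# Toy data standing in for X_train / X_test (values are made up).
train = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "BsmtQual": ["Gd", "TA", None, "Gd"],
    "GarageYrBlt": [2003.0, 1976.0, 2001.0, 1998.0],  # complete in train
})
test = pd.DataFrame({
    "LotFrontage": [np.nan, 70.0],
    "BsmtQual": ["TA", "Gd"],
    "GarageYrBlt": [np.nan, 1990.0],
})

# "fit": learn from the train set which columns contain NA
vars_with_na = [col for col in train.columns if train[col].isna().any()]


def add_na_indicators(df, variables):
    """Return a copy of df with one binary <name>_na column per variable."""
    out = df.copy()
    for col in variables:
        out[f"{col}_na"] = out[col].isna().astype(int)
    return out


# "transform": the same learned list is applied to train and test
train_t = add_na_indicators(train, vars_with_na)
test_t = add_na_indicators(test, vars_with_na)

print(vars_with_na)
print(test_t)
```

Note that in this sketch `GarageYrBlt` gets no indicator even though it has NA in the toy test set, because the list of variables is learned from the train set only.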

## Feature-engine allows you to specify variable groups

In [11]:
# let's add missing indicators again, but this time
# let's select only a few variables

imputer = AddMissingIndicator(variables=['BsmtQual', 'FireplaceQu', 'LotFrontage'])

imputer.fit(X_train)

AddMissingIndicator(variables=['BsmtQual', 'FireplaceQu', 'LotFrontage'])

In [12]:
# now the imputer uses only the variables we indicated

imputer.variables

['BsmtQual', 'FireplaceQu', 'LotFrontage']

In [13]:
# missing indicators will be added for the following variables;
# these can differ from the ones passed by the user

# remember that with missing_only=True (the default) the imputer
# learns and stores only the variables that show NA in the train dataset

imputer.variables_

['BsmtQual', 'FireplaceQu', 'LotFrontage']
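
Here `variables` and `variables_` coincide because all three selected variables have NA in the train set. The sketch below illustrates, in plain pandas and on made-up data, how the two lists could differ. The function `learn_indicator_variables` is a hypothetical helper that mimics the behaviour described above; it is not part of Feature-engine.

```python
import numpy as np
import pandas as pd

# Toy train set (values are made up); FireplaceQu is complete here.
train = pd.DataFrame({
    "BsmtQual": ["Gd", None, "TA"],
    "FireplaceQu": ["TA", "Gd", "Gd"],
    "LotFrontage": [np.nan, 80.0, 68.0],
})


def learn_indicator_variables(df, requested, missing_only=True):
    """Return the variables that would receive a missing indicator.

    requested plays the role of the user-supplied `variables` list.
    """
    if missing_only:
        # keep only the requested variables that show NA in the train set
        return [var for var in requested if df[var].isna().any()]
    return list(requested)


requested = ["BsmtQual", "FireplaceQu", "LotFrontage"]

learned_missing_only = learn_indicator_variables(train, requested, missing_only=True)
learned_all = learn_indicator_variables(train, requested, missing_only=False)

print(learned_missing_only)
print(learned_all)
```

In this toy case `FireplaceQu` is dropped from the learned list when `missing_only=True`, since it has no NA in the train set.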

In [14]:
# feature-engine returns a dataframe
# with the additional features

# no need to concatenate!!

tmp = imputer.transform(X_train)

tmp.head()

Unnamed: 0,BsmtQual,FireplaceQu,LotFrontage,MasVnrArea,GarageYrBlt,BsmtQual_na,FireplaceQu_na,LotFrontage_na
64,Gd,,,573.0,1998.0,0,1,1
682,Gd,Gd,,0.0,1996.0,0,0,1
960,TA,,50.0,0.0,,0,1,0
1384,TA,,60.0,0.0,1939.0,0,1,0
1100,TA,,60.0,0.0,1930.0,0,1,0


## Feature-engine can be used with the Scikit-learn pipeline

In [15]:
# let's check the percentage of NA in each variable

X_train.isnull().mean()

BsmtQual       0.023483
FireplaceQu    0.467710
LotFrontage    0.184932
MasVnrArea     0.004892
GarageYrBlt    0.052838
dtype: float64

In [16]:
X_train.head()

Unnamed: 0,BsmtQual,FireplaceQu,LotFrontage,MasVnrArea,GarageYrBlt
64,Gd,,,573.0,1998.0
682,Gd,Gd,,0.0,1996.0
960,TA,,50.0,0.0,
1384,TA,,60.0,0.0,1939.0
1100,TA,,60.0,0.0,1930.0


These are the steps we will execute in series:

- Add missing indicators to all variables that show NA in the train set (here, all five)
- Median Imputation to numerical variables
- Frequent category imputation to categorical variables
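
To make the order of these steps concrete, here is a rough pandas-only sketch on made-up toy data. It is an illustration under stated assumptions, not the Feature-engine pipeline itself: `train`, `test` and `transform` are hypothetical names. The indicators are added first, while the NA values are still visible; the median and the most frequent category are learned from the train set and reused on unseen data.

```python
import numpy as np
import pandas as pd

# Toy data standing in for X_train / X_test (values are made up).
train = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
    "FireplaceQu": ["Gd", "Gd", None, "TA"],
})
test = pd.DataFrame({
    "LotFrontage": [np.nan, 65.0],
    "FireplaceQu": [None, "TA"],
})

numerical = ["LotFrontage"]
categorical = ["FireplaceQu"]

# "fit": every parameter is learned from the train set only
na_vars = [col for col in train.columns if train[col].isna().any()]
medians = train[numerical].median()           # Series: column -> median
modes = train[categorical].mode().iloc[0]     # Series: column -> most frequent category


def transform(df):
    """Apply the three steps in series using the parameters learned above."""
    out = df.copy()
    # 1. missing indicators, added before any NA is filled
    for col in na_vars:
        out[f"{col}_na"] = out[col].isna().astype(int)
    # 2. median imputation of numerical variables
    out[numerical] = out[numerical].fillna(medians)
    # 3. frequent category imputation of categorical variables
    out[categorical] = out[categorical].fillna(modes)
    return out


train_t = transform(train)
test_t = transform(test)
print(test_t)
```

If the imputation ran before the indicator step, the indicators would all be zero, because no NA would be left to flag.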

In [17]:
pipe = Pipeline([
    ('missing_ind', AddMissingIndicator()),

    ('imputer_mode', CategoricalImputer(
        imputation_method='frequent', variables=['FireplaceQu', 'BsmtQual'])),

    ('imputer_median', MeanMedianImputer(imputation_method='median',
                                         variables=['LotFrontage', 'MasVnrArea', 'GarageYrBlt'])),
])

In [18]:
# fit the pipe
pipe.fit(X_train)

Pipeline(steps=[('missing_ind', AddMissingIndicator()),
                ('imputer_mode',
                 CategoricalImputer(imputation_method='frequent',
                                    variables=['FireplaceQu', 'BsmtQual'])),
                ('imputer_median',
                 MeanMedianImputer(variables=['LotFrontage', 'MasVnrArea',
                                              'GarageYrBlt']))])

In [19]:
# inspect the separate steps
pipe.named_steps['missing_ind'].variables_

['BsmtQual', 'FireplaceQu', 'LotFrontage', 'MasVnrArea', 'GarageYrBlt']

In [20]:
pipe.named_steps['imputer_mode'].imputer_dict_

{'FireplaceQu': 'Gd', 'BsmtQual': 'TA'}

In [21]:
pipe.named_steps['imputer_median'].imputer_dict_

{'LotFrontage': 69.0, 'MasVnrArea': 0.0, 'GarageYrBlt': 1979.0}

In [22]:
# let's transform the data with the pipeline

# this pipeline will:
# - add the missing indicators
# - fill NA in the original variables
# leaving the dataset ready to use for ML

tmp = pipe.transform(X_train)

# let's check null values are gone
tmp.isnull().mean()

BsmtQual          0.0
FireplaceQu       0.0
LotFrontage       0.0
MasVnrArea        0.0
GarageYrBlt       0.0
BsmtQual_na       0.0
FireplaceQu_na    0.0
LotFrontage_na    0.0
MasVnrArea_na     0.0
GarageYrBlt_na    0.0
dtype: float64

In [23]:
tmp.shape

(1022, 10)