## Missing Indicator with Feature Engine


### What is Feature Engine

Feature Engine is an open source Python package that I created as a result of this course.

- Feature Engine includes all the feature engineering techniques described in the course
- Feature Engine works like Scikit-learn, so it is easy to learn
- Feature Engine allows you to apply specific engineering steps to specific feature subsets
- Feature Engine can be integrated with the Scikit-learn pipeline, allowing for smooth model building

**Feature Engine allows you to design and store a feature engineering pipeline with bespoke procedures for different variable groups.**

-------------------------------------------------------------------
Feature Engine can be installed via pip: `pip install feature-engine`

- Make sure you have installed feature-engine before running this notebook

For more information visit:
my website

## In this demo

We will use Feature Engine to add a missing indicator using the Ames House Price Dataset.

- To download the dataset visit the lecture **Datasets** in **Section 1** of the course.

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

# to split the datasets
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

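# from feature engine
# (module path and class names as in the early feature-engine release used here;
# later releases expose the imputers under feature_engine.imputation)
from feature_engine import missing_data_imputers as mdi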

In [2]:
# let's load the dataset with a selected group of variables

cols_to_use = [
    'BsmtQual', 'FireplaceQu', 'LotFrontage', 'MasVnrArea', 'GarageYrBlt',
    'SalePrice'
]

data = pd.read_csv('../houseprice.csv', usecols=cols_to_use)
data.head()

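|   | LotFrontage | MasVnrArea | BsmtQual | FireplaceQu | GarageYrBlt | SalePrice |
|---|---|---|---|---|---|---|
| 0 | 65.0 | 196.0 | Gd | NaN | 2003.0 | 208500 |
| 1 | 80.0 | 0.0 | Gd | TA | 1976.0 | 181500 |
| 2 | 68.0 | 162.0 | Gd | TA | 2001.0 | 223500 |
| 3 | 60.0 | 0.0 | TA | Gd | 1998.0 | 140000 |
| 4 | 84.0 | 350.0 | Gd | TA | 2000.0 | 250000 |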


In [3]:
data.isnull().mean()

LotFrontage    0.177397
MasVnrArea     0.005479
BsmtQual       0.025342
FireplaceQu    0.472603
GarageYrBlt    0.055479
SalePrice      0.000000
dtype: float64

In [4]:
# let's separate into training and testing set

# first drop the target from the feature list
cols_to_use.remove('SalePrice')

X_train, X_test, y_train, y_test = train_test_split(data[cols_to_use],
                                                    data['SalePrice'],
                                                    test_size=0.3,
                                                    random_state=0)
X_train.shape, X_test.shape

((1022, 5), (438, 5))

### Feature Engine's missing indicator selects all variables by default

In [5]:
# we call the imputer from feature engine
# we don't need to specify anything 

imputer = mdi.AddNaNBinaryImputer()

In [6]:
# we fit the imputer

imputer.fit(X_train)

AddNaNBinaryImputer(variables=['BsmtQual', 'FireplaceQu', 'LotFrontage', 'MasVnrArea', 'GarageYrBlt'])

In [7]:
# we see that the imputer selected all the variables,
# both numerical and categorical

imputer.variables

['BsmtQual', 'FireplaceQu', 'LotFrontage', 'MasVnrArea', 'GarageYrBlt']

In [8]:
# feature engine returns a dataframe
# with the additional features

# no need to concatenate!!

tmp = imputer.transform(X_train)
tmp.head()

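|   | BsmtQual | FireplaceQu | LotFrontage | MasVnrArea | GarageYrBlt | BsmtQual_na | FireplaceQu_na | LotFrontage_na | MasVnrArea_na | GarageYrBlt_na |
|---|---|---|---|---|---|---|---|---|---|---|
| 64 | Gd | NaN | NaN | 573.0 | 1998.0 | 0 | 1 | 1 | 0 | 0 |
| 682 | Gd | Gd | NaN | 0.0 | 1996.0 | 0 | 0 | 1 | 0 | 0 |
| 960 | TA | NaN | 50.0 | 0.0 | NaN | 0 | 1 | 0 | 0 | 1 |
| 1384 | TA | NaN | 60.0 | 0.0 | 1939.0 | 0 | 1 | 0 | 0 | 0 |
| 1100 | TA | NaN | 60.0 | 0.0 | 1930.0 | 0 | 1 | 0 | 0 | 0 |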


In [9]:
# let's check NA

tmp.isnull().mean()

BsmtQual          0.023483
FireplaceQu       0.467710
LotFrontage       0.184932
MasVnrArea        0.004892
GarageYrBlt       0.052838
BsmtQual_na       0.000000
FireplaceQu_na    0.000000
LotFrontage_na    0.000000
MasVnrArea_na     0.000000
GarageYrBlt_na    0.000000
dtype: float64
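
To make the transformation above concrete, here is a minimal sketch of the same idea in plain pandas. It is not Feature Engine's implementation: `add_missing_indicators` and the toy dataframe are hypothetical names and values made up for illustration, and only the `_na` suffix is taken from the output above.

```python
import numpy as np
import pandas as pd

# toy data standing in for X_train (illustrative values, not the Ames dataset)
toy = pd.DataFrame({
    "BsmtQual": ["Gd", "TA", None, "TA"],
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
})


def add_missing_indicators(df, variables=None):
    """Return a copy of df with one 0/1 '<name>_na' column per variable.

    Hypothetical helper: if `variables` is None, every column gets a flag.
    """
    variables = list(df.columns) if variables is None else list(variables)
    out = df.copy()
    for var in variables:
        # 1 where the original value is missing, 0 otherwise
        out[var + "_na"] = df[var].isnull().astype(int)
    return out


# flags for every column
toy_with_flags = add_missing_indicators(toy)
print(toy_with_flags)

# flags for a chosen subset only
subset_flags = add_missing_indicators(toy, variables=["LotFrontage"])
print(subset_flags.columns.tolist())
```

The original columns keep their missing values; the indicator only records where they were, which is why the NA fractions of the first five columns above are unchanged.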

## Feature Engine allows you to specify variable groups easily

In [10]:
# let's add the missing indicators again, but this time
# only for a few selected variables

imputer = mdi.AddNaNBinaryImputer(variables=['BsmtQual', 'FireplaceQu', 'LotFrontage'])

imputer.fit(X_train)

AddNaNBinaryImputer(variables=['BsmtQual', 'FireplaceQu', 'LotFrontage'])

In [11]:
# now the imputer uses only the variables we indicated

imputer.variables

['BsmtQual', 'FireplaceQu', 'LotFrontage']

In [12]:
# feature engine returns a dataframe
# with the additional features

# no need to concatenate!!

tmp = imputer.transform(X_train)

tmp.head()

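|   | BsmtQual | FireplaceQu | LotFrontage | MasVnrArea | GarageYrBlt | BsmtQual_na | FireplaceQu_na | LotFrontage_na |
|---|---|---|---|---|---|---|---|---|
| 64 | Gd | NaN | NaN | 573.0 | 1998.0 | 0 | 1 | 1 |
| 682 | Gd | Gd | NaN | 0.0 | 1996.0 | 0 | 0 | 1 |
| 960 | TA | NaN | 50.0 | 0.0 | NaN | 0 | 1 | 0 |
| 1384 | TA | NaN | 60.0 | 0.0 | 1939.0 | 0 | 1 | 0 |
| 1100 | TA | NaN | 60.0 | 0.0 | 1930.0 | 0 | 1 | 0 |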


## Feature Engine can be used with the Scikit-learn pipeline

In [13]:
# let's check the percentage of NA in each variable

X_train.isnull().mean()

BsmtQual       0.023483
FireplaceQu    0.467710
LotFrontage    0.184932
MasVnrArea     0.004892
GarageYrBlt    0.052838
dtype: float64

In [14]:
X_train.head()

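|   | BsmtQual | FireplaceQu | LotFrontage | MasVnrArea | GarageYrBlt |
|---|---|---|---|---|---|
| 64 | Gd | NaN | NaN | 573.0 | 1998.0 |
| 682 | Gd | Gd | NaN | 0.0 | 1996.0 |
| 960 | TA | NaN | 50.0 | 0.0 | NaN |
| 1384 | TA | NaN | 60.0 | 0.0 | 1939.0 |
| 1100 | TA | NaN | 60.0 | 0.0 | 1930.0 |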


These are the steps we will chain together in the pipeline:

- Add Missing Indicator to all variables
- Median Imputation to numerical variables
- Missing category imputation to categorical variables
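
The three steps listed above can be sketched by hand in plain pandas, to show why the order matters. This is only an illustration under stated assumptions: the toy dataframe and its values are made up, and the `"Missing"` label for the new category is an assumed placeholder rather than something confirmed by this notebook.

```python
import numpy as np
import pandas as pd

# toy training data (illustrative values, not the Ames dataset)
train = pd.DataFrame({
    "FireplaceQu": ["Gd", None, "TA", None],
    "LotFrontage": [60.0, np.nan, 70.0, 80.0],
})

categorical = ["FireplaceQu"]
numerical = ["LotFrontage"]

out = train.copy()

# step 1: flag the missing values BEFORE they are filled in,
# otherwise there would be nothing left to flag
for var in categorical + numerical:
    out[var + "_na"] = train[var].isnull().astype(int)

# step 2: median imputation, with the median learned from the training data
medians = train[numerical].median()
out[numerical] = out[numerical].fillna(medians)

# step 3: missing values in categorical variables become their own category
# ("Missing" is an assumed label)
out[categorical] = out[categorical].fillna("Missing")

print(out)
```

The two imputation steps touch different variables, so their relative order does not change the result; only the indicator step has to come first.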

In [15]:
pipe = Pipeline([
    ('missing_ind', mdi.AddNaNBinaryImputer()),
    ('imputer_missing', mdi.CategoricalVariableImputer(variables=['FireplaceQu', 'BsmtQual'])),
    ('imputer_median', mdi.MeanMedianImputer(imputation_method = 'median',
                                             variables=['LotFrontage', 'MasVnrArea', 'GarageYrBlt'])),
])

In [16]:
# fit the pipe
pipe.fit(X_train)

Pipeline(memory=None,
     steps=[('missing_ind', AddNaNBinaryImputer(variables=['BsmtQual', 'FireplaceQu', 'LotFrontage', 'MasVnrArea', 'GarageYrBlt'])), ('imputer_missing', CategoricalVariableImputer(variables=['FireplaceQu', 'BsmtQual'])), ('imputer_median', MeanMedianImputer(imputation_method='median',
         variables=['LotFrontage', 'MasVnrArea', 'GarageYrBlt']))])

In [17]:
# inspect the separate steps
pipe.named_steps['missing_ind'].variables

['BsmtQual', 'FireplaceQu', 'LotFrontage', 'MasVnrArea', 'GarageYrBlt']

In [18]:
pipe.named_steps['imputer_missing'].variables

['FireplaceQu', 'BsmtQual']

In [19]:
pipe.named_steps['imputer_median'].imputer_dict_

{'LotFrontage': 69.0, 'MasVnrArea': 0.0, 'GarageYrBlt': 1979.0}
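
The dictionary above holds the medians learned during `fit`. As a rough sketch of that fit/transform pattern (not Feature Engine's actual code), the hypothetical `ToyMedianImputer` below learns the medians from a training set and reuses them on new data. The class name and the toy values are assumptions; the attribute name `imputer_dict_` is borrowed from the cell above only for readability.

```python
import numpy as np
import pandas as pd


class ToyMedianImputer:
    """Hypothetical stand-in that mimics the fit/transform pattern."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # learn one median per variable from the data passed to fit
        self.imputer_dict_ = X[self.variables].median().to_dict()
        return self

    def transform(self, X):
        # fill each variable with its stored median; returns a new dataframe
        return X.fillna(value=self.imputer_dict_)


# toy train and test sets (illustrative values only)
train = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 70.0, 80.0],
    "MasVnrArea": [0.0, 0.0, np.nan, 100.0],
})
test = pd.DataFrame({
    "LotFrontage": [np.nan, 90.0],
    "MasVnrArea": [5.0, np.nan],
})

imputer = ToyMedianImputer(variables=["LotFrontage", "MasVnrArea"])
imputer.fit(train)
print(imputer.imputer_dict_)

# the test set is filled with the medians learned on the train set
test_filled = imputer.transform(test)
print(test_filled)
```

Storing the learned values on the fitted object is what lets the same parameters be applied to the test set later, without recomputing them on unseen data.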

In [20]:
# let's transform the data with the pipeline

# this pipeline will:
# - add the missing indicators
# - fill na in the original variables
# leaving the dataset ready to use for ML

tmp = pipe.transform(X_train)

# let's check null values are gone
tmp.isnull().mean()

BsmtQual          0.0
FireplaceQu       0.0
LotFrontage       0.0
MasVnrArea        0.0
GarageYrBlt       0.0
BsmtQual_na       0.0
FireplaceQu_na    0.0
LotFrontage_na    0.0
MasVnrArea_na     0.0
GarageYrBlt_na    0.0
dtype: float64

In [21]:
tmp.shape

(1022, 10)