## Frequent Category Imputation with Feature-engine

Feature-engine is an open-source Python package that was originally designed to support this course. It has since grown in popularity and now supports transformations beyond those taught here. It was launched in 2017; since then, several releases have appeared, and a growing international community is beginning to lead its development.

- Feature-engine works like Scikit-learn, so it is easy to learn.
- Feature-engine lets you apply specific engineering steps to specific feature subsets.
- Feature-engine can be integrated with the Scikit-learn pipeline, allowing for smooth model building.

**Feature-engine allows you to design and store a feature engineering pipeline with different procedures for different variable groups.**

- Make sure you have installed feature-engine before running this notebook.

## In this demo

We will use Feature-engine to perform frequent category imputation using the Ames House Price Dataset.

- To download the dataset visit the lecture **Datasets** in **Section 1** of the course.

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

# to split the datasets
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# from feature-engine
from feature_engine.imputation import CategoricalImputer

In [2]:
# let's load the dataset with a selected group of variables

cols_to_use = [
    'BsmtQual', 'FireplaceQu', 'LotFrontage', 'MasVnrArea', 'GarageYrBlt',
    'SalePrice'
]

data = pd.read_csv('../houseprice.csv', usecols=cols_to_use)
data.head()

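|   | LotFrontage | MasVnrArea | BsmtQual | FireplaceQu | GarageYrBlt | SalePrice |
|---|---|---|---|---|---|---|
| 0 | 65.0 | 196.0 | Gd | NaN | 2003.0 | 208500 |
| 1 | 80.0 | 0.0 | Gd | TA | 1976.0 | 181500 |
| 2 | 68.0 | 162.0 | Gd | TA | 2001.0 | 223500 |
| 3 | 60.0 | 0.0 | TA | Gd | 1998.0 | 140000 |
| 4 | 84.0 | 350.0 | Gd | TA | 2000.0 | 250000 |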


In [3]:
data.isnull().mean()

LotFrontage    0.177397
MasVnrArea     0.005479
BsmtQual       0.025342
FireplaceQu    0.472603
GarageYrBlt    0.055479
SalePrice      0.000000
dtype: float64

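All five predictor variables contain missing data, ranging from about 0.5% of the rows (MasVnrArea) to about 47% (FireplaceQu). The target, SalePrice, is complete.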

In [4]:
# let's separate into training and testing set

# first drop the target from the feature list
cols_to_use.remove('SalePrice')

X_train, X_test, y_train, y_test = train_test_split(data[cols_to_use],
                                                    data['SalePrice'],
                                                    test_size=0.3,
                                                    random_state=0)
X_train.shape, X_test.shape

((1022, 5), (438, 5))

### Feature-engine captures the categorical variables automatically

In [5]:
# We specify how we want to impute
# the categorical variables.

imputer = CategoricalImputer(imputation_method='frequent')

In [6]:
# we fit the imputer

imputer.fit(X_train)

CategoricalImputer(imputation_method='frequent')

In [7]:
# we see that the imputer found the categorical variables

imputer.variables_

['BsmtQual', 'FireplaceQu']
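
To give an intuition for this automatic selection, here is a rough pandas-only sketch that picks out the non-numerical columns of a toy DataFrame. It is an approximation for illustration, not Feature-engine's actual implementation, and the toy data is made up.

```python
import numpy as np
import pandas as pd

# hypothetical toy data with two categorical and one numerical variable
toy = pd.DataFrame({
    "BsmtQual": ["Gd", "TA", np.nan],
    "FireplaceQu": [np.nan, "TA", "Gd"],
    "LotFrontage": [65.0, np.nan, 68.0],
})

# keep every column that is not numerical
categorical_vars = toy.select_dtypes(exclude="number").columns.tolist()
print(categorical_vars)
```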

In [8]:
# here we can see the values that will be used
# to replace NA for each variable

imputer.imputer_dict_

{'BsmtQual': 'TA', 'FireplaceQu': 'Gd'}

In [9]:
# let's check those values against the train data

X_train[imputer.variables_].mode()

|   | BsmtQual | FireplaceQu |
|---|---|---|
| 0 | TA | Gd |
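
The comparison above suggests what frequent category imputation amounts to: learn the mode of each variable on the train set, then use it to fill NA in any dataset. The sketch below does this by hand with pandas. The toy train and test sets are made up, and the code is a manual equivalent for illustration rather than what Feature-engine runs internally.

```python
import numpy as np
import pandas as pd

# hypothetical toy train and test sets
train = pd.DataFrame({
    "BsmtQual": ["Gd", "TA", "TA", np.nan, "TA"],
    "FireplaceQu": ["Gd", np.nan, "Gd", "TA", np.nan],
})
test = pd.DataFrame({
    "BsmtQual": [np.nan, "Ex"],
    "FireplaceQu": ["TA", np.nan],
})

# learn the most frequent category of each variable from the train set only;
# mode() returns a DataFrame, and row 0 holds the first mode of each column
learned_modes = train.mode().iloc[0].to_dict()

# fill NA in both sets with the values learned on the train set
train_imputed = train.fillna(value=learned_modes)
test_imputed = test.fillna(value=learned_modes)

print(learned_modes)
print(test_imputed)
```

Learning the modes on the train set alone keeps information from the test set out of the imputation step.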


In [10]:
# feature-engine returns a dataframe

tmp = imputer.transform(X_train)
tmp.head()

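|   | BsmtQual | FireplaceQu | LotFrontage | MasVnrArea | GarageYrBlt |
|---|---|---|---|---|---|
| 64 | Gd | Gd | NaN | 573.0 | 1998.0 |
| 682 | Gd | Gd | NaN | 0.0 | 1996.0 |
| 960 | TA | Gd | 50.0 | 0.0 | NaN |
| 1384 | TA | Gd | 60.0 | 0.0 | 1939.0 |
| 1100 | TA | Gd | 60.0 | 0.0 | 1930.0 |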


In [11]:
# let's check that the categorical variables don't
# contain NA anymore

tmp[imputer.variables_].isnull().mean()

BsmtQual       0.0
FireplaceQu    0.0
dtype: float64

## Feature-engine allows you to specify variable groups

In [12]:
# let's impute just 1 variable:

imputer = CategoricalImputer(
    imputation_method='frequent', variables=['BsmtQual'])

imputer.fit(X_train)

CategoricalImputer(imputation_method='frequent', variables=['BsmtQual'])

In [13]:
# now the imputer imputes only the variables we indicated

imputer.variables_

['BsmtQual']

In [14]:
# and we can see the value assigned to each variable

imputer.imputer_dict_

{'BsmtQual': 'TA'}

In [15]:
# feature-engine returns a dataframe

tmp = imputer.transform(X_train)

# let's check that the null values are gone
# (variables_ is the list stored by fit)
tmp[imputer.variables_].isnull().mean()

BsmtQual    0.0
dtype: float64

It worked!