## Random Sample Imputation ==> Feature-Engine

[Feature Engineering for Machine Learning Course](https://www.trainindata.com/p/feature-engineering-for-machine-learning)

Feature-engine is an open-source Python package originally designed to support this course. It has since gained popularity and now supports transformations beyond those taught in the course. It was launched in 2017; since then, several releases have appeared and a growing international community has begun to lead its development.

- Feature-engine works like Scikit-learn, so it is easy to learn.
- Feature-engine allows you to apply specific engineering steps to specific feature subsets.
- Feature-engine can be integrated with the Scikit-learn pipeline, allowing for smooth model building.

**Feature-engine allows you to design and store a feature engineering pipeline with different procedures for different variable groups.**

- Make sure you have installed feature-engine before running this notebook.

## In this demo

We will use Feature-engine to perform random sample imputation using the Ames House Price Dataset.

- To download the dataset visit the lecture **Datasets** in **Section 2** of the course.
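
Before running the demo, it may help to see what random sample imputation does. The block below is a minimal pandas-only sketch, not Feature-engine's implementation: the helper `random_sample_impute` and the toy values are made up for illustration. It fills each missing value with a value drawn at random, with replacement, from the observed values of the same variable in the training data.

```python
import numpy as np
import pandas as pd


def random_sample_impute(train: pd.Series, target: pd.Series, seed: int) -> pd.Series:
    """Fill NA in `target` with random draws from the observed values of `train`.

    Hypothetical helper, written only to illustrate the technique.
    """
    out = target.copy()
    missing = out.isnull()
    if not missing.any():
        return out
    # draw as many values as there are NA, always from the training data
    draws = train.dropna().sample(
        n=int(missing.sum()), replace=True, random_state=seed
    )
    # assign by position so the draws land on the rows that were missing
    out.loc[missing] = draws.to_numpy()
    return out


# made-up toy values, not taken from the house price dataset
toy_train = pd.Series([65.0, np.nan, 68.0, 60.0, np.nan, 84.0], name="LotFrontage")
toy_test = pd.Series([np.nan, 70.0, np.nan], name="LotFrontage")

train_imputed = random_sample_impute(toy_train, toy_train, seed=29)
test_imputed = random_sample_impute(toy_train, toy_test, seed=29)

print(train_imputed.isnull().sum(), test_imputed.isnull().sum())
```

Note that the test set is imputed with values sampled from the train set, so no information from the test set is used to fill the gaps.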

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

# to split the datasets
from sklearn.model_selection import train_test_split

# from feature-engine
from feature_engine.imputation import RandomSampleImputer

In [2]:
# let's load the dataset with a selected group of variables

cols_to_use = [
    "BsmtQual",
    "FireplaceQu",
    "LotFrontage",
    "MasVnrArea",
    "GarageYrBlt",
    "SalePrice",
]

data = pd.read_csv("../../Datasets/houseprice.csv", usecols=cols_to_use)
data.head()

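|   | LotFrontage | MasVnrArea | BsmtQual | FireplaceQu | GarageYrBlt | SalePrice |
|---|---|---|---|---|---|---|
| 0 | 65.0 | 196.0 | Gd | NaN | 2003.0 | 208500 |
| 1 | 80.0 | 0.0 | Gd | TA | 1976.0 | 181500 |
| 2 | 68.0 | 162.0 | Gd | TA | 2001.0 | 223500 |
| 3 | 60.0 | 0.0 | TA | Gd | 1998.0 | 140000 |
| 4 | 84.0 | 350.0 | Gd | TA | 2000.0 | 250000 |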


In [3]:
data.isnull().mean()

LotFrontage    0.177397
MasVnrArea     0.005479
BsmtQual       0.025342
FireplaceQu    0.472603
GarageYrBlt    0.055479
SalePrice      0.000000
dtype: float64

In [4]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(["SalePrice"], axis=1),
    data["SalePrice"],
    test_size=0.3,
    random_state=0,
)

X_train.shape, X_test.shape

((1022, 5), (438, 5))

### Automatically impute all the variables

In [5]:
# we call the imputer from feature-engine

imputer = RandomSampleImputer(random_state=29)

In [6]:
# we fit the imputer

imputer.fit(X_train)

In [7]:
# we see that the imputer selected all the variables, numerical
# and categorical

imputer.variables_

['LotFrontage', 'MasVnrArea', 'BsmtQual', 'FireplaceQu', 'GarageYrBlt']

In [8]:
# the imputer stores a copy of the selected variables from
# the train set, from which to extract the random samples

imputer.X_.head()

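|   | LotFrontage | MasVnrArea | BsmtQual | FireplaceQu | GarageYrBlt |
|---|---|---|---|---|---|
| 64 | NaN | 573.0 | Gd | NaN | 1998.0 |
| 682 | NaN | 0.0 | Gd | Gd | 1996.0 |
| 960 | 50.0 | 0.0 | TA | NaN | NaN |
| 1384 | 60.0 | 0.0 | TA | NaN | 1939.0 |
| 1100 | 60.0 | 0.0 | TA | NaN | 1930.0 |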


In [9]:
# feature-engine returns a dataframe

X_train_t = imputer.transform(X_train)
X_test_t = imputer.transform(X_test)

X_train_t.head()

|   | LotFrontage | MasVnrArea | BsmtQual | FireplaceQu | GarageYrBlt |
|---|---|---|---|---|---|
| 64 | 60.0 | 573.0 | Gd | TA | 1998.0 |
| 682 | 90.0 | 0.0 | Gd | Gd | 1996.0 |
| 960 | 50.0 | 0.0 | TA | Gd | 1977.0 |
| 1384 | 60.0 | 0.0 | TA | Gd | 1939.0 |
| 1100 | 60.0 | 0.0 | TA | Gd | 1930.0 |


In [10]:
# let's check absence of NA

X_train_t[imputer.variables_].isnull().mean()

LotFrontage    0.0
MasVnrArea     0.0
BsmtQual       0.0
FireplaceQu    0.0
GarageYrBlt    0.0
dtype: float64
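
Beyond checking that no NA remain, one can also verify that every imputed value actually occurs in the training data. The sketch below is one way to do that with pandas alone; the helper `imputed_values_seen_in_train` and the toy frames are hypothetical. In this notebook it could presumably be called as `imputed_values_seen_in_train(X_test, X_test_t, imputer.X_)`.

```python
import numpy as np
import pandas as pd


def imputed_values_seen_in_train(
    before: pd.DataFrame, after: pd.DataFrame, train: pd.DataFrame
) -> bool:
    """Return True if every value that replaced an NA was observed in `train`.

    Hypothetical helper; assumes the three frames share the same columns and
    that `before` and `after` share the same index.
    """
    for col in before.columns:
        was_missing = before[col].isnull()
        filled = after.loc[was_missing, col]
        if not filled.isin(train[col].dropna()).all():
            return False
    return True


# made-up toy frames mixing a numerical and a categorical variable
toy_train = pd.DataFrame(
    {"LotFrontage": [65.0, np.nan, 68.0], "BsmtQual": ["Gd", "TA", np.nan]}
)
toy_before = pd.DataFrame({"LotFrontage": [np.nan, 70.0], "BsmtQual": [np.nan, "Gd"]})
toy_good = pd.DataFrame({"LotFrontage": [68.0, 70.0], "BsmtQual": ["TA", "Gd"]})
toy_bad = pd.DataFrame({"LotFrontage": [99.0, 70.0], "BsmtQual": ["TA", "Gd"]})

check_good = imputed_values_seen_in_train(toy_before, toy_good, toy_train)
check_bad = imputed_values_seen_in_train(toy_before, toy_bad, toy_train)
print(check_good, check_bad)
```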

The procedure for applying the RandomSampleImputer to a specific group of variables, and for integrating it into a Scikit-learn pipeline, is the same as in the previous notebooks.

## Setting the seed observation per observation

For details on how to set the seed observation per observation, check out the [documentation](https://feature-engine.readthedocs.io/en/latest/user_guide/imputation/RandomSampleImputer.html).

