# Creating a feature engineering pipeline

In the previous notebook, we refactored our feature engineering steps into Scikit-learn-like classes or replaced them with transformers from Feature-engine. These new classes can be incorporated into a Scikit-learn pipeline to make the feature creation process easier.

In this notebook, we will line up all the feature transformation steps into a pipeline and execute the feature extraction in fewer lines of code.
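
To illustrate the idea before working with the real data, here is a minimal sketch of how a Scikit-learn-like class slots into a `Pipeline`. The `AddConstant` transformer and the toy dataframe are hypothetical and exist only for this illustration; they are not part of the course code.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class AddConstant(BaseEstimator, TransformerMixin):
    """Hypothetical toy transformer: adds a constant to one column."""

    def __init__(self, column, value):
        self.column = column
        self.value = value

    def fit(self, X, y=None):
        # Nothing to learn in this toy step.
        return self

    def transform(self, X):
        # Work on a copy so the input dataframe is left untouched.
        X = X.copy()
        X[self.column] = X[self.column] + self.value
        return X


toy = pd.DataFrame({"x": [1.0, 2.0, 3.0]})

# Steps run in order: the output of one step is the input of the next.
toy_pipe = Pipeline([
    ("add_one", AddConstant("x", 1.0)),
    ("add_ten", AddConstant("x", 10.0)),
])

toy_out = toy_pipe.fit_transform(toy)
print(toy_out)
```

Any class that exposes `fit` and `transform` in this way can be chained with the Feature-engine transformers used later in this notebook.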

## Data

We will work with the Air Quality Dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Air+Quality).

For instructions on how to download, prepare, and store the dataset, refer to notebook number 3, in the folder "01-Datasets" from this repo.

In [1]:
import numpy as np
import pandas as pd

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

from feature_engine.creation import CyclicalTransformer, MathematicalCombination
from feature_engine.datetime import DatetimeFeatures
from feature_engine.imputation import DropMissingData

# Load Data

In [2]:
# If you followed the instructions to download the data,
# it should be located here:

filename = '../datasets/AirQualityUCI_ready.csv'

# Load data.
data = pd.read_csv(filename)

In [3]:
# We'll only use sensor data, temperature
# and relative humidity. Thus, we drop all
# other variables.

drop_vars = [var for var in data.columns if '_true' in var]
drop_vars.append('AH')

# Remove variables.
data.drop(labels=drop_vars, axis=1, inplace=True)

print(data.shape)

data.head()

(9357, 8)


Unnamed: 0,Date_Time,CO_sensor,NMHC_sensor,NOX_sensor,NO2_sensor,O3_sensor,T,RH
0,2004-10-03 18:00:00,1360.0,1046.0,1056.0,1692.0,1268.0,13.6,48.9
1,2004-10-03 19:00:00,1292.0,955.0,1174.0,1559.0,972.0,13.3,47.7
2,2004-10-03 20:00:00,1402.0,939.0,1140.0,1555.0,1074.0,11.9,54.0
3,2004-10-03 21:00:00,1376.0,948.0,1092.0,1584.0,1203.0,11.0,60.0
4,2004-10-03 22:00:00,1272.0,836.0,1205.0,1490.0,1110.0,11.2,59.6


## Cast time variable as datetime

In [4]:
# Cast date variable in datetime format.

data['Date_Time'] = pd.to_datetime(data['Date_Time'])

In [5]:
# Set the index to the timestamp.

data.index = data['Date_Time']

In [6]:
# Sanity: sort index.

data.sort_index(inplace=True)

## Reduce data span

We will narrow the dataset down to the period with fewer missing timestamps.

In [7]:
# Remove data with a lot of missing information.

# Check our notebook in section 2 to understand
# why we perform this step

data = data[
    (data['Date_Time'] >= '2004-04-01') &
    (data['Date_Time'] <= '2005-04-30')
]

# Quick check: data span.

data['Date_Time'].agg(['min', 'max'])

min   2004-04-04 00:00:00
max   2005-04-04 14:00:00
Name: Date_Time, dtype: datetime64[ns]

## Remove Outliers

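Next, we remove the rows that contain negative values in any of the variables. Note that the filter below keeps only the rows where every variable is strictly positive, so rows containing zeros are dropped as well.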

In [8]:
variables = [var for var in data.columns if var != 'Date_Time']

variables

['CO_sensor',
 'NMHC_sensor',
 'NOX_sensor',
 'NO2_sensor',
 'O3_sensor',
 'T',
 'RH']

In [9]:
print(data.shape)

data = data.loc[(data[variables]>0).all(axis=1)]

print(data.shape)

(7677, 8)
(7379, 8)
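
As a small illustration of how this row filter behaves, here is a sketch on a toy dataframe. The values, including the use of -200 as a placeholder for a bad reading, are made up for the example.

```python
import pandas as pd

# Toy data: -200 stands in for a bad reading (assumed for this example).
toy = pd.DataFrame({
    "CO_sensor": [1360.0, -200.0, 1402.0],
    "T": [13.6, 13.3, -200.0],
})

# True only for rows in which every column is strictly positive.
mask = (toy > 0).all(axis=1)

clean = toy.loc[mask]
print(clean)
```

A single non-positive value in any column is enough to drop the entire row, which is why the number of rows decreases above.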


# Our Feature engineering classes

## Lag Features

In [10]:
class LagFeatures(BaseEstimator, TransformerMixin):

    def __init__(self, features, frequency, label):

        # In the init we specify the parameters that
        # the user needs to pass to start the transformer.

        # The user needs to indicate which features to lag,
        # how much we should lag the variables, 
        # and the name for the new variables.
        
        self.features = features
        self.frequency = frequency
        self.label = label

    def fit(self, X, y=None):

        # We do not need to learn parameters

        return self

    def transform(self, X):

        # We lag the features

        # We make a copy not to over-write the original data
        X = X.copy()

        # Shift the data forward.
        tmp = X[self.features].shift(freq=self.frequency)

        # Name the new variables.
        tmp.columns = [v + self.label for v in self.features]

        # Add the variables to the original data.
        X = X.merge(tmp, left_index=True, right_index=True, how='left')

        return X
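
To see what the shift-and-merge logic does when timestamps are missing, here is a minimal sketch on a toy series with a gap at 02:00. The data are made up, and the frequency is written as `"1h"`, which recent pandas versions prefer over the `'1H'` spelling.

```python
import pandas as pd

# Toy hourly data with a missing timestamp at 02:00.
idx = pd.to_datetime([
    "2004-04-04 00:00", "2004-04-04 01:00", "2004-04-04 03:00",
])
toy = pd.DataFrame({"CO_sensor": [10.0, 20.0, 30.0]}, index=idx)

# Shifting with freq moves the timestamps, not the row positions.
lagged = toy.shift(freq="1h")
lagged.columns = ["CO_sensor_lag_1"]

# The left merge keeps the original timestamps and aligns on the index.
out = toy.merge(lagged, left_index=True, right_index=True, how="left")
print(out)
```

The row at 03:00 gets a missing lag because there is no observation at 02:00. Shifting by position instead would have silently used the value from 01:00.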

## Window features

In [11]:
class WindowFeatures(BaseEstimator, TransformerMixin):

    def __init__(self, features, window, frequency):

        # In the init we specify the parameters that
        # the user needs to pass to start the transformer.

        # The user needs to indicate the features to use for the computation
        # the size of the window,
        # and the frequency to shift forward.
        
        self.features = features
        self.window = window
        self.frequency = frequency

    def fit(self, X, y=None):

        # We do not need to learn parameters

        return self

    def transform(self, X):

        # First we calculate the average of the feature in
        # the indicated window, then we shift the value forward
        # based on the indicated frequency.

        X = X.copy()

        tmp = (X[self.features]
               .rolling(window=self.window).mean()
               .shift(freq=self.frequency)
               )

        # Rename the columns
        tmp.columns = [v + '_window' for v in self.features]

        # Add the variables to the original data.
        X = X.merge(tmp, left_index=True, right_index=True, how='left')

        return X
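
The following sketch shows the same rolling-then-shift logic on a toy hourly series, so the resulting values can be checked by hand. The data are made up, and the offsets are written as `"3h"` and `"1h"`.

```python
import pandas as pd

idx = pd.date_range("2004-04-04 00:00", periods=5, freq="h")
toy = pd.DataFrame({"CO_sensor": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=idx)

# Mean over the last 3 hours (current row included), then moved
# 1 hour forward so each row only sees values from before its timestamp.
window = toy.rolling(window="3h").mean().shift(freq="1h")
window.columns = ["CO_sensor_window"]

out = toy.merge(window, left_index=True, right_index=True, how="left")
print(out)
```

With a time-based window, pandas computes the mean as soon as one observation is available, so the first rows average fewer than three values rather than returning missing data.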

## Seasonality Features

Note that this is the only class that learns parameters from the data!

In [12]:
class SeasonalTransformer(BaseEstimator, TransformerMixin):

    def __init__(self, season_var, variables):

        # In the init we specify the parameters that
        # the user needs to pass to start the transformer.

        # The user needs to indicate the seasonal variable
        # and the variables that should be aggregated.

        self.season_var = season_var
        self.variables = variables

    def fit(self, X, y=None):

        # We want to estimate the mean value of the
        # time series in the seasonal term.

        # In our demo, that is the mean pollutant's 
        # concentration per hour.

        # We make a copy of the dataframe 
        # not to over-write the user's data.
        X = X.copy()

        # Calculate the mean pollutant concentration per hour.
        # The learned values will be stored in this attribute.
        self.seasonal_ = X.groupby(self.season_var)[self.variables].mean()

        # Rename the new variables.
        self.seasonal_.columns = [v + '_season' for v in self.variables]

        return self

    def transform(self, X):

        # We want to add the seasonal component to the
        # dataset to transform.

        X = X.copy()

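        # Note: merging on a column returns a fresh integer index,
        # so the datetime index of X is not preserved here. The
        # timestamps remain available in the Date_Time column.
        X = X.merge(self.seasonal_, on=self.season_var, how='left')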

        return X
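
Here is a minimal sketch of the fit and transform logic of this class, written with plain pandas on toy data. The dataframes are made up; the point is that the hourly means come from the training data only and are then attached to new data.

```python
import pandas as pd

train = pd.DataFrame({
    "Date_Time_hour": [0, 1, 0, 1],
    "CO_sensor": [10.0, 20.0, 30.0, 40.0],
})
test = pd.DataFrame({
    "Date_Time_hour": [1, 0],
    "CO_sensor": [999.0, 999.0],
})

# "fit": mean per hour, learned from the training data only.
seasonal = train.groupby("Date_Time_hour")[["CO_sensor"]].mean()
seasonal.columns = ["CO_sensor_season"]

# "transform": attach the learned means by matching on the hour.
test_out = test.merge(seasonal, on="Date_Time_hour", how="left")
print(test_out)
```

The extreme values in the toy test set do not affect the seasonal feature, because the means were computed before the test data was seen.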

# Feature Engineering Pipeline

In [13]:
# Some hard coded values

lag_1 = ['CO_sensor_lag_1',
         'NMHC_sensor_lag_1',
         'NOX_sensor_lag_1',
         'NO2_sensor_lag_1',
         'O3_sensor_lag_1',
         'T_lag_1',
         'RH_lag_1']

lag_24 = ['CO_sensor_lag_24',
          'NMHC_sensor_lag_24',
          'NOX_sensor_lag_24',
          'NO2_sensor_lag_24',
          'O3_sensor_lag_24',
          'T_lag_24',
          'RH_lag_24']

pollutants = variables[:-2]

In [14]:
engineering_pipe = Pipeline([

    # Extract datetime features
    ('datetime_features', DatetimeFeatures(
        features_to_extract=["month",
                             "week_of_the_year",
                             "day_of_the_week",
                             "day_of_the_month",
                             "hour",
                             "weekend",
                            ],
        drop_original=False)),

    # Lag Features
    ('lag_1', LagFeatures(variables, '1H', '_lag_1')),
    ('lag_24', LagFeatures(variables, '24H', '_lag_24')),

    # Window Features
    ('window_features', WindowFeatures(variables, '3H', '1H')),

    # Combine pollutants
    ('Combine_lag_1', MathematicalCombination(
        variables_to_combine=lag_1,
        math_operations=['sum'],
        new_variables_names=['total_poll_lag_1'],
        missing_values='ignore')),

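    # Caution: this step reuses the name 'total_poll_lag_1', so it
    # overwrites the 1-hour total created above with the 24-hour total.
    # Use a distinct name, e.g. 'total_poll_lag_24', to keep both.
    ('Combine_lag_24', MathematicalCombination(
        variables_to_combine=lag_24,
        math_operations=['sum'],
        new_variables_names=['total_poll_lag_1'],
        missing_values='ignore')),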

    # Periodic features
    ('Periodic', CyclicalTransformer(
        variables=['Date_Time_month', 'Date_Time_hour'],
        drop_original=False)),

    # Drop missing data
    ('dropna', DropMissingData(missing_only=True)),

])

In [15]:
print(data.shape)

data = engineering_pipe.fit_transform(data)

print(data.shape)

(7379, 8)
(6892, 40)


In [16]:
data.head()

Date_Time (index),Date_Time,CO_sensor,NMHC_sensor,NOX_sensor,NO2_sensor,O3_sensor,T,RH,Date_Time_month,Date_Time_woty,...,NOX_sensor_window,NO2_sensor_window,O3_sensor_window,T_window,RH_window,total_poll_lag_1,Date_Time_month_sin,Date_Time_month_cos,Date_Time_hour_sin,Date_Time_hour_cos
2004-04-05 00:00:00,2004-04-05 00:00:00,1065.0,789.0,936.0,1620.0,929.0,18.4,65.8,4,15,...,942.666667,1529.666667,733.333333,15.9,58.566667,5576.2,0.866025,-0.5,0.0,1.0
2004-04-05 01:00:00,2004-04-05 01:00:00,999.0,692.0,1038.0,1588.0,778.0,16.4,79.2,4,15,...,925.0,1572.0,827.666667,16.5,61.8,5475.1,0.866025,-0.5,0.269797,0.962917
2004-04-05 02:00:00,2004-04-05 02:00:00,911.0,599.0,1189.0,1517.0,633.0,16.1,80.0,4,15,...,968.0,1582.666667,830.666667,16.7,68.6,5206.6,0.866025,-0.5,0.519584,0.854419
2004-04-05 03:00:00,2004-04-05 03:00:00,873.0,545.0,1308.0,1471.0,497.0,15.6,81.0,4,15,...,1054.333333,1575.0,780.0,16.966667,75.0,5273.7,0.866025,-0.5,0.730836,0.682553
2004-04-05 04:00:00,2004-04-05 04:00:00,881.0,546.0,1335.0,1465.0,458.0,15.6,81.0,4,15,...,1178.333333,1525.333333,636.0,16.033333,80.066667,5024.6,0.866025,-0.5,0.887885,0.460065
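
The `_sin` and `_cos` columns in the table above are consistent with a sine and cosine encoding in which each value is divided by the maximum of its variable (12 for the month, 23 for the hour). The sketch below reproduces a few of those numbers under that assumption; `cyclical` is a hypothetical helper written for this illustration, not a Feature-engine function.

```python
import math


def cyclical(value, max_value):
    """Hypothetical helper: map a value onto the unit circle."""
    angle = 2 * math.pi * value / max_value
    return math.sin(angle), math.cos(angle)


# Month 4 (April), assuming a maximum of 12.
month_sin, month_cos = cyclical(4, 12)

# Hour 1, assuming a maximum of 23.
hour_sin, hour_cos = cyclical(1, 23)

print(round(month_sin, 6), round(month_cos, 6))
print(round(hour_sin, 6), round(hour_cos, 6))
```

Encoding each variable as a pair of coordinates keeps values that are close in the cycle, such as hour 23 and hour 0, close in the feature space.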


# Separate into train and test

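One of our feature engineering steps, the seasonal transformer, learns parameters from the data. To avoid leaking information from the test set into those parameters, we first split the data into train and test sets, fit the transformer on the train set only, and then apply it to both sets.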

In [17]:
# Find minimum and maximum dates.

data.index.min(), data.index.max()

(Timestamp('2004-04-05 00:00:00'), Timestamp('2005-04-04 14:00:00'))

In [18]:
# We will keep the last month of data to test
# the forecasting models.

X_train = data[data.index<='2005-03-04']
X_test = data[data.index>'2005-03-04']

X_train.shape, X_test.shape

((6398, 40), (494, 40))
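
One detail of this split is worth a quick check: a date string without a time is compared as midnight of that day. The sketch below, on made-up timestamps, shows on which side of the split each row lands.

```python
import pandas as pd

idx = pd.to_datetime([
    "2005-03-03 23:00", "2005-03-04 00:00", "2005-03-04 01:00",
])
toy = pd.DataFrame({"CO_sensor": [1.0, 2.0, 3.0]}, index=idx)

# '2005-03-04' is interpreted as 2005-03-04 00:00:00.
train = toy[toy.index <= "2005-03-04"]
test = toy[toy.index > "2005-03-04"]

print(train.index.max(), test.index.min())
```

The observation at midnight goes to the train set, and the test set starts at 01:00 of the same day.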

In [19]:
# Add seasonal features

season_tr = SeasonalTransformer(
    season_var='Date_Time_hour',
    variables=pollutants,
)

X_train = season_tr.fit_transform(X_train)
X_test = season_tr.transform(X_test)

X_train.head()

Unnamed: 0,Date_Time,CO_sensor,NMHC_sensor,NOX_sensor,NO2_sensor,O3_sensor,T,RH,Date_Time_month,Date_Time_woty,...,total_poll_lag_1,Date_Time_month_sin,Date_Time_month_cos,Date_Time_hour_sin,Date_Time_hour_cos,CO_sensor_season,NMHC_sensor_season,NOX_sensor_season,NO2_sensor_season,O3_sensor_season
0,2004-04-05 00:00:00,1065.0,789.0,936.0,1620.0,929.0,18.4,65.8,4,15,...,5576.2,0.866025,-0.5,0.0,1.0,1039.091603,871.484733,853.618321,1395.954198,973.206107
1,2004-04-05 01:00:00,999.0,692.0,1038.0,1588.0,778.0,16.4,79.2,4,15,...,5475.1,0.866025,-0.5,0.269797,0.962917,979.467681,797.292776,915.365019,1339.376426,882.163498
2,2004-04-05 02:00:00,911.0,599.0,1189.0,1517.0,633.0,16.1,80.0,4,15,...,5206.6,0.866025,-0.5,0.519584,0.854419,919.615094,714.524528,1009.015094,1283.0,781.750943
3,2004-04-05 03:00:00,873.0,545.0,1308.0,1471.0,497.0,15.6,81.0,4,15,...,5273.7,0.866025,-0.5,0.730836,0.682553,881.430712,657.846442,1086.344569,1245.958801,714.213483
4,2004-04-05 04:00:00,881.0,546.0,1335.0,1465.0,458.0,15.6,81.0,4,15,...,5024.6,0.866025,-0.5,0.887885,0.460065,865.406015,632.676692,1121.007519,1233.43985,689.834586


In [20]:
X_test.head()

Unnamed: 0,Date_Time,CO_sensor,NMHC_sensor,NOX_sensor,NO2_sensor,O3_sensor,T,RH,Date_Time_month,Date_Time_woty,...,total_poll_lag_1,Date_Time_month_sin,Date_Time_month_cos,Date_Time_hour_sin,Date_Time_hour_cos,CO_sensor_season,NMHC_sensor_season,NOX_sensor_season,NO2_sensor_season,O3_sensor_season
0,2005-03-04 01:00:00,951.0,642.0,895.0,929.0,666.0,12.7,40.8,3,9,...,4811.7,1.0,6.123234000000001e-17,0.269797,0.962917,979.467681,797.292776,915.365019,1339.376426,882.163498
1,2005-03-04 02:00:00,938.0,631.0,882.0,931.0,679.0,12.0,44.2,3,9,...,4681.7,1.0,6.123234000000001e-17,0.519584,0.854419,919.615094,714.524528,1009.015094,1283.0,781.750943
2,2005-03-04 03:00:00,921.0,642.0,861.0,948.0,768.0,10.5,48.7,3,9,...,4683.3,1.0,6.123234000000001e-17,0.730836,0.682553,881.430712,657.846442,1086.344569,1245.958801,714.213483
3,2005-03-04 04:00:00,850.0,583.0,955.0,914.0,589.0,10.1,49.7,3,9,...,5028.7,1.0,6.123234000000001e-17,0.887885,0.460065,865.406015,632.676692,1121.007519,1233.43985,689.834586
4,2005-03-04 05:00:00,811.0,494.0,1118.0,872.0,470.0,9.6,51.0,3,9,...,5075.8,1.0,6.123234000000001e-17,0.979084,0.203456,873.902256,646.620301,1104.515038,1246.18797,704.541353


We have now packaged most of our feature engineering steps into one pipeline and created suitable features for forecasting in a manner that can be re-utilized with new input data.

In the next video, we will add a Naive Forecaster and train a Linear Regression at the end of this feature engineering pipeline.

That is all for this notebook. I hope you enjoyed it!