# Creating a feature engineering pipeline

In the previous notebook, we refactored our feature engineering steps into Scikit-learn-like classes or replaced them with transformers from Feature-engine. These new classes can be incorporated into a Scikit-learn pipeline to make the feature creation process easier.

In this notebook, we will line up the feature transformation steps into a pipeline and execute the feature extraction in fewer lines of code.

## Data

We will work with the Air Quality Dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Air+Quality).

For instructions on how to download, prepare, and store the dataset, refer to notebook number 3, in the folder "01-Datasets" from this repo.

In [1]:
import numpy as np
import pandas as pd

from sklearn.base import BaseEstimator, TransformerMixin

# New import with respect to the previous notebook.
from sklearn.pipeline import Pipeline

from feature_engine.creation import CyclicalTransformer, MathematicalCombination
from feature_engine.datetime import DatetimeFeatures
from feature_engine.imputation import MeanMedianImputer

# Load Data

In [2]:
# Function to load and prepare input data.

def load_data():
    
    # Data lives here.
    filename = '../datasets/AirQualityUCI_ready.csv'

    # Load data: only the time variable and CO
    data = pd.read_csv(filename, usecols=['Date_Time', 'CO_sensor'])
        
    # Cast the date variable to datetime format.
    data['Date_Time'] = pd.to_datetime(data['Date_Time'])

    # Set the index to the timestamp.
    data.index = data['Date_Time']

    # Sanity: sort index.
    data.sort_index(inplace=True)
    
    # Reduce data span.
    data = data[
        (data['Date_Time'] >= '2004-04-01') &
        (data['Date_Time'] <= '2005-04-30')
    ]
    
    # Remove outliers: keep only positive sensor readings.
    data = data.loc[(data['CO_sensor']>0)]
    
    return data

In [3]:
# Load data.

data = load_data()

data.head()

| Date_Time (index) | Date_Time | CO_sensor |
|---|---|---|
| 2004-04-04 00:00:00 | 2004-04-04 00:00:00 | 1224.0 |
| 2004-04-04 01:00:00 | 2004-04-04 01:00:00 | 1215.0 |
| 2004-04-04 02:00:00 | 2004-04-04 02:00:00 | 1115.0 |
| 2004-04-04 03:00:00 | 2004-04-04 03:00:00 | 1124.0 |
| 2004-04-04 04:00:00 | 2004-04-04 04:00:00 | 1028.0 |


# Our Feature engineering classes

I have reorganised the notebook so that the new classes appear one after the other.

## Lag Features

In [4]:
class LagFeatures(BaseEstimator, TransformerMixin):

    def __init__(self, features, frequency, label):

        # In the init we specify the parameters that
        # the user needs to pass to start the transformer.

        # The user needs to indicate which features to lag,
        # how much we should lag the variables, 
        # and the name for the new variables.
        
        self.features = features
        self.frequency = frequency
        self.label = label

    def fit(self, X, y=None):

        # We do not need to learn parameters

        return self

    def transform(self, X):

        # We lag the features

        # We make a copy so as not to overwrite the original data.
        X = X.copy()

        # Shift the data forward.
        tmp = X[self.features].shift(freq=self.frequency)

        # Name the new variables.
        tmp.columns = [v + self.label for v in self.features]

        # Add the variables to the original data.
        X = X.merge(tmp, left_index=True, right_index=True, how='left')

        return X
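
To see what `shift(freq=...)` followed by a left merge does, here is a minimal sketch on a toy hourly series. The values and the `toy` dataframe are hypothetical and are not part of the air quality data; the steps mirror the body of `LagFeatures.transform`.

```python
import pandas as pd

# Toy hourly series (hypothetical values, not the air quality data).
index = pd.date_range("2004-04-04 00:00", periods=4, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"CO_sensor": [10.0, 20.0, 30.0, 40.0]}, index=index)

# Move every timestamp one hour forward; the values themselves do not change.
lagged = toy[["CO_sensor"]].shift(freq=pd.Timedelta(hours=1))
lagged.columns = ["CO_sensor_lag_1"]

# A left merge on the index keeps only the original timestamps,
# so the first row has no lagged value (NaN).
result = toy.merge(lagged, left_index=True, right_index=True, how="left")
print(result)
```

Because the shift acts on the timestamps rather than on row positions, a gap in the series produces a missing lag value instead of silently pairing a row with the wrong hour.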

## Window features

In [5]:
class WindowFeatures(BaseEstimator, TransformerMixin):

    def __init__(self, features, window, frequency):

        # In the init we specify the parameters that
        # the user needs to pass to start the transformer.

        # The user needs to indicate the features to use for the computation,
        # the size of the window,
        # and the frequency to shift forward.
        
        self.features = features
        self.window = window
        self.frequency = frequency

    def fit(self, X, y=None):

        # We do not need to learn parameters

        return self

    def transform(self, X):

        # First we calculate the average of the feature in
        # the indicated window, then we shift the value forward
        # based on the indicated frequency.

        X = X.copy()

        tmp = (X[self.features]
               .rolling(window=self.window).mean()
               .shift(freq=self.frequency)
               )

        # Rename the columns
        tmp.columns = [v + '_window' for v in self.features]

        # Add the variables to the original data.
        X = X.merge(tmp, left_index=True, right_index=True, how='left')

        return X
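
The following sketch repeats the rolling-then-shift logic on a toy hourly series, so that we can check by hand which observations feed each window value. The data are hypothetical, and the window and shift sizes are passed as `pd.Timedelta` objects instead of the `'3H'` and `'1H'` strings used in the pipeline.

```python
import pandas as pd

# Toy hourly series (hypothetical values, not the air quality data).
index = pd.date_range("2004-04-04 00:00", periods=5, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"CO_sensor": [10.0, 20.0, 30.0, 40.0, 50.0]}, index=index)

# Average over a 3-hour time window, then move the result 1 hour forward
# so that each row only sees values from earlier timestamps.
window = (
    toy[["CO_sensor"]]
    .rolling(window=pd.Timedelta(hours=3)).mean()
    .shift(freq=pd.Timedelta(hours=1))
)
window.columns = ["CO_sensor_window"]

result = toy.merge(window, left_index=True, right_index=True, how="left")
print(result)
```

With a time-based window, pandas averages whatever observations fall inside the window, so the first rows are computed from fewer than three values rather than being left empty.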

## Seasonality Features

Note that, among our custom classes, this is the only one that learns parameters from the data!

In [6]:
class SeasonalTransformer(BaseEstimator, TransformerMixin):

    def __init__(self, season_var, variables):

        # In the init we specify the parameters that
        # the user needs to pass to start the transformer.

        # The user needs to indicate the seasonal variable
        # and the variables that should be aggregated.

        self.season_var = season_var
        self.variables = variables

    def fit(self, X, y=None):

        # We want to estimate the mean value of the
        # time series in the seasonal term.

        # In our demo, that is the mean pollutant's 
        # concentration per hour.

        # We make a copy of the dataframe
        # so as not to overwrite the user's data.
        X = X.copy()

        # Calculate mean pollutant per hr.
        # The learned values will be stored in this attribute.
        self.seasonal_ = X.groupby(self.season_var)[self.variables].mean()

        # Rename the new variables.
        self.seasonal_.columns = [v + '_season' for v in self.variables]
        
        # Ensure returned grouping is a dataframe
        self.seasonal_ = self.seasonal_.reset_index()

        return self

    def transform(self, X):

        # Add the seasonal component to the
        # dataset to transform.

        X = X.copy()
        
        # Store the datetime index (it is lost in merge)
        index = X.index

        # Add the seasonal feature
        X = X.merge(self.seasonal_, on=self.season_var, how='left')
        
        # restore the datetime index to the df
        X.index = index

        return X
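
The split between learning in `fit` and applying in `transform` can be sketched with plain pandas. The `train` and `new` dataframes and the `hour` column below are hypothetical stand-ins; the point is that the hourly means come from one dataset and are then attached to another.

```python
import pandas as pd

# Hypothetical training data: two days, two hours each.
train_index = pd.to_datetime([
    "2004-04-04 00:00", "2004-04-04 01:00",
    "2004-04-05 00:00", "2004-04-05 01:00",
])
train = pd.DataFrame({"CO_sensor": [10.0, 20.0, 30.0, 40.0]}, index=train_index)
train["hour"] = train.index.hour

# "fit": learn the mean value per hour from the training data only.
seasonal = train.groupby("hour")[["CO_sensor"]].mean()
seasonal.columns = ["CO_sensor_season"]
seasonal = seasonal.reset_index()

# Hypothetical new data, seen after fitting.
new_index = pd.to_datetime(["2004-04-06 00:00", "2004-04-06 01:00"])
new = pd.DataFrame({"CO_sensor": [100.0, 200.0]}, index=new_index)
new["hour"] = new.index.hour

# "transform": attach the learned means; the merge resets the index,
# so we restore the datetime index afterwards.
merged = new.merge(seasonal, on="hour", how="left")
merged.index = new.index
print(merged)
```

The seasonal values attached to the new rows depend only on the training data, which is what we want when the transformer is later applied to unseen observations.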

# Feature Engineering Pipeline

Now, we add our classes and Feature-engine's transformers within a pipeline.

In [7]:
# New: we line up most feature engineering
# steps into a Pipeline.

engineering_pipe = Pipeline([

    # Extract datetime features
    ('datetime_features', DatetimeFeatures(
        features_to_extract=["month",
                             "week",
                             "day_of_week",
                             "day_of_month",
                             "hour",
                             "weekend",
                             ],
        drop_original=True)),

    # Lag Features
    ('lag_1', LagFeatures(['CO_sensor'], '1H', '_lag_1')),
    ('lag_24', LagFeatures(['CO_sensor'], '24H', '_lag_24')),

    # Window Features
    ('window_features', WindowFeatures(['CO_sensor'], '3H', '1H')),

    # Combine lag features
    ('Combine', MathematicalCombination(

        # the variables to average
        variables_to_combine=['CO_sensor_lag_1', 'CO_sensor_lag_24'],

        # we indicate we want the average
        math_operations=['mean'],

        # the name of the new feature
        new_variables_names=['CO_lag_ave'],

        # what to do if the variables have NA
        missing_values='ignore',
    )),


    # Periodic features
    ('Periodic', CyclicalTransformer(
        variables=['Date_Time_month', 'Date_Time_hour'],
        drop_original=False)),

    # Missing Data Imputation
    ('imputer', MeanMedianImputer(
        # the variables to impute
        variables=[
            'CO_sensor_lag_1',
            'CO_sensor_lag_24',
            'CO_sensor_window',
            'CO_lag_ave',
        ],
    )),

    # Seasonal features
    ('seasonal', SeasonalTransformer(

        # the variable for the grouping
        season_var='Date_Time_hour',

        # the variables to group
        variables=['CO_sensor'],
    )),

])
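
The order of the steps matters: the lag features are averaged before the imputer runs, so the average is computed from observed values only. The sketch below is a pandas approximation of those two steps, not Feature-engine's implementation. It assumes the combination behaves like a row-wise mean that skips missing values, and that the imputer fills with the median, which is `MeanMedianImputer`'s default. All values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical lag features with missing values at the start of the series.
toy = pd.DataFrame({
    "CO_sensor_lag_1": [np.nan, 10.0, 20.0],
    "CO_sensor_lag_24": [np.nan, np.nan, 40.0],
})

# Row-wise mean; pandas skips NaN by default, so a row with one
# observed value returns that value, and an empty row stays NaN.
toy["CO_lag_ave"] = toy[["CO_sensor_lag_1", "CO_sensor_lag_24"]].mean(axis=1)

# Median imputation applied afterwards, column by column.
filled = toy.fillna(toy.median())
print(filled)
```

If the imputer ran first, the average would mix observed values with imputed medians, which is a different feature.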

In [8]:
print(data.shape)

# In this simple step, we create most features
data = engineering_pipe.fit_transform(data)

print(data.shape)

(7393, 2)
(7393, 16)


See how the feature space expanded from 2 to 16 columns.

In [9]:
# We now have many new features in our data.

data.head()

| Date_Time (index) | CO_sensor | Date_Time_month | Date_Time_week | Date_Time_day_of_week | Date_Time_day_of_month | Date_Time_hour | Date_Time_weekend | CO_sensor_lag_1 | CO_sensor_lag_24 | CO_sensor_window | CO_lag_ave | Date_Time_month_sin | Date_Time_month_cos | Date_Time_hour_sin | Date_Time_hour_cos | CO_sensor_season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2004-04-04 00:00:00 | 1224.0 | 4 | 14 | 6 | 4 | 0 | 1 | 1050.0 | 1053.0 | 1054.0 | 1062.0 | 0.866025 | -0.5 | 0.0 | 1.0 | 1051.169935 |
| 2004-04-04 01:00:00 | 1215.0 | 4 | 14 | 6 | 4 | 1 | 1 | 1224.0 | 1053.0 | 1224.0 | 1224.0 | 0.866025 | -0.5 | 0.269797 | 0.962917 | 990.540717 |
| 2004-04-04 02:00:00 | 1115.0 | 4 | 14 | 6 | 4 | 2 | 1 | 1215.0 | 1053.0 | 1219.5 | 1215.0 | 0.866025 | -0.5 | 0.519584 | 0.854419 | 930.584416 |
| 2004-04-04 03:00:00 | 1124.0 | 4 | 14 | 6 | 4 | 3 | 1 | 1115.0 | 1053.0 | 1184.666667 | 1115.0 | 0.866025 | -0.5 | 0.730836 | 0.682553 | 892.711974 |
| 2004-04-04 04:00:00 | 1028.0 | 4 | 14 | 6 | 4 | 4 | 1 | 1124.0 | 1053.0 | 1151.333333 | 1124.0 | 0.866025 | -0.5 | 0.887885 | 0.460065 | 874.703226 |
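
The `_sin` and `_cos` columns above are consistent with the encoding sin(2π·x / max) and cos(2π·x / max), where max is the largest value of the variable (23 for the hour, 12 for the month). This formula is inferred from the numbers shown, not taken from the library's source. A minimal sketch that reproduces the values for hour 1 and month 4, using a hypothetical helper called `cyclical`:

```python
import math


def cyclical(value, max_value):
    """Return the (sin, cos) encoding of value on a cycle of length max_value."""
    angle = 2 * math.pi * value / max_value
    return math.sin(angle), math.cos(angle)


# Hour 1, assuming the maximum hour is 23.
hour_sin, hour_cos = cyclical(1, 23)

# Month 4, assuming the maximum month is 12.
month_sin, month_cos = cyclical(4, 12)

print(round(hour_sin, 6), round(hour_cos, 6))
print(round(month_sin, 6), round(month_cos, 6))
```

The printed values can be compared with the second row of the table, where `Date_Time_hour` is 1 and `Date_Time_month` is 4.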


We have now packaged our feature engineering steps into one pipeline and created features suitable for forecasting in a way that can be reused with new input data.
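
As a rough illustration of that reuse, the sketch below fits a small pipeline on one slice of a toy series and then applies it to a later slice. `HourFeature` and `HourlyMean` are hypothetical, simplified stand-ins for the transformers in this notebook, and the data are made up.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class HourFeature(BaseEstimator, TransformerMixin):
    """Hypothetical stateless step: adds the hour of the datetime index."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X["hour"] = X.index.hour
        return X


class HourlyMean(BaseEstimator, TransformerMixin):
    """Hypothetical stateful step: learns the mean per hour during fit."""

    def __init__(self, variable):
        self.variable = variable

    def fit(self, X, y=None):
        self.means_ = X.groupby("hour")[self.variable].mean()
        return self

    def transform(self, X):
        X = X.copy()
        X[self.variable + "_season"] = X["hour"].map(self.means_)
        return X


# Two days of toy hourly data: day 1 for fitting, day 2 as "new" data.
index = pd.date_range("2004-04-04", periods=48, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"CO_sensor": [float(i) for i in range(48)]}, index=index)
train, new = toy.iloc[:24], toy.iloc[24:]

pipe = Pipeline([
    ("hour", HourFeature()),
    ("season", HourlyMean("CO_sensor")),
])

pipe.fit(train)                 # parameters are learned from train only
new_t = pipe.transform(new)     # and then applied to the new data
print(new_t.head())
```

Calling `fit` once and `transform` afterwards keeps the learned parameters fixed, so new data are processed exactly as the training data were.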

In the next notebook, we will train a linear regression model on top of this feature engineering pipeline to perform multi-step forecasting.

That is all for this notebook. I hope you enjoyed it!