# Refactoring the feature engineering steps

In this notebook, we will modify the code to make it suitable for multi-step forecasting.

To perform multi-step forecasting, we will package the feature engineering code into Scikit-learn-like classes, or replace multi-line pandas code with transformers from the open-source library Feature-engine, whose transformers expose the same fit() and transform() methods as Scikit-learn.

In the previous notebook, we engineered most of the features by utilizing pandas. We performed the modifications directly on the original dataset.

This way of engineering features has some limitations. For example, it leaves no reusable code for creating the same features from new input data: we would have to rewrite the entire block of code for every new dataset, which is inefficient.

To avoid this, we can create classes with fit() and transform() methods that can be applied to new input data to recreate the features from the new information.

In this notebook, we will wrap the code that creates the features into classes that can be reused.
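
As a minimal sketch of the idea, the hypothetical `HourFeature` class below (not part of this notebook or of any library) exposes `fit()` and `transform()` and can be applied unchanged to a new dataframe. It is written without Scikit-learn's base classes to stay dependency-free; the classes later in this notebook inherit from `BaseEstimator` and `TransformerMixin`. The data values are made up for illustration.

```python
import pandas as pd


class HourFeature:
    """Hypothetical stateless transformer: adds the hour of the datetime index."""

    def __init__(self, name="hour"):
        self.name = name

    def fit(self, X, y=None):
        # Nothing to learn for this feature.
        return self

    def transform(self, X):
        # Work on a copy so the caller's dataframe is not modified.
        X = X.copy()
        X[self.name] = X.index.hour
        return X


old = pd.DataFrame(
    {"CO_sensor": [1224.0, 1215.0]},
    index=pd.to_datetime(["2004-04-04 00:00", "2004-04-04 01:00"]),
)
new = pd.DataFrame(
    {"CO_sensor": [1065.0]},
    index=pd.to_datetime(["2005-04-05 07:00"]),
)

hour_feature = HourFeature().fit(old)

# The same object recreates the feature on both datasets.
old_t = hour_feature.transform(old)
new_t = hour_feature.transform(new)
```

The point of the design is that the feature logic lives in one place, so new data only needs a call to `transform()`.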

**If you haven't done so already, check the Air Pollutants notebook in Section 2 to familiarize yourself with the feature engineering steps we used previously.**

## Data

We will work with the Air Quality Dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Air+Quality).

For instructions on how to download, prepare, and store the dataset, refer to notebook number 3, in the folder "01-Datasets" from this repo.

In [1]:
import numpy as np
import pandas as pd

# for our new classes
from sklearn.base import BaseEstimator, TransformerMixin

# to automate many of our engineering processes
from feature_engine.creation import CyclicalTransformer, MathematicalCombination
from feature_engine.datetime import DatetimeFeatures
from feature_engine.imputation import DropMissingData

# Load Data

In [2]:
# If you followed the instructions to download the data,
# it should be located here:

filename = '../datasets/AirQualityUCI_ready.csv'

# Load data.
data = pd.read_csv(filename)

In [3]:
# We'll only use sensor data, temperature
# and relative humidity. Thus, we drop all 
# other variables.

drop_vars  = [var for var in data.columns if '_true' in var]
drop_vars.append('AH')

# Remove variables.
data.drop(labels=drop_vars, axis=1, inplace=True)

print(data.shape)

data.head()

(9357, 8)


Unnamed: 0,Date_Time,CO_sensor,NMHC_sensor,NOX_sensor,NO2_sensor,O3_sensor,T,RH
0,2004-10-03 18:00:00,1360.0,1046.0,1056.0,1692.0,1268.0,13.6,48.9
1,2004-10-03 19:00:00,1292.0,955.0,1174.0,1559.0,972.0,13.3,47.7
2,2004-10-03 20:00:00,1402.0,939.0,1140.0,1555.0,1074.0,11.9,54.0
3,2004-10-03 21:00:00,1376.0,948.0,1092.0,1584.0,1203.0,11.0,60.0
4,2004-10-03 22:00:00,1272.0,836.0,1205.0,1490.0,1110.0,11.2,59.6


## Cast time variable as datetime

In [4]:
# Cast date variable in datetime format.

data['Date_Time'] = pd.to_datetime(data['Date_Time'])

In [5]:
# Set the index to the timestamp.

data.index = data['Date_Time']

In [6]:
# Sanity: sort index.

data.sort_index(inplace=True)

## Reduce data span

I will narrow down the dataset to those portions with fewer timestamps missing.

In [7]:
# Remove data with a lot of missing information.

# Check our notebook in section 2 to understand
# why we perform this step

data = data[
    (data['Date_Time'] >= '2004-04-01') &
    (data['Date_Time'] <= '2005-04-30')
]

# Quick check: data span.

data['Date_Time'].agg(['min', 'max'])

min   2004-04-04 00:00:00
max   2005-04-04 14:00:00
Name: Date_Time, dtype: datetime64[ns]

## Remove Outliers

Outliers are the negative values returned by the sensors.

In [8]:
variables = [var for var in data.columns if var != 'Date_Time']

variables

['CO_sensor',
 'NMHC_sensor',
 'NOX_sensor',
 'NO2_sensor',
 'O3_sensor',
 'T',
 'RH']

In [9]:
print(data.shape)

# Keep only the rows where every variable is strictly positive.
data = data.loc[(data[variables]>0).all(axis=1)]

print(data.shape)

(7677, 8)
(7379, 8)
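
To make the filter concrete, here is a small sketch on made-up values (the numbers and the two-column dataframe are hypothetical). Because the comparison is strict, a row that contains a zero is dropped just like a row that contains a negative value.

```python
import pandas as pd

# Hypothetical readings: the second row has a negative value, the third a zero.
df = pd.DataFrame({
    "CO_sensor": [1224.0, -200.0, 1115.0],
    "T": [16.7, 15.9, 0.0],
})

# True only for rows where all columns are greater than zero.
mask = (df > 0).all(axis=1)

clean = df.loc[mask]
```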


# Feature engineering

In the rest of the notebook, we create the new features from the time series. Our time series are the sensor readings, temperature, and relative humidity.

## Extract time related features

These are features derived from the timestamp.

We can extract date and time features automatically utilizing Feature-engine.

See `DatetimeFeatures` in the Feature-engine documentation.

In [10]:
# We can automate that with Feature-engine

dt_features = DatetimeFeatures(
    
    # the timestamp
    variables='Date_Time',
    
    # the features we want from the timestamp
    features_to_extract=["month",
                         "week_of_the_year",
                         "day_of_the_week",
                         "day_of_the_month",
                         "hour",
                         "weekend",
                        ],
    
    # if we want to drop the timestamp.
    drop_original=False
)

# Extract the datetime features
data = dt_features.fit_transform(data)

# Show new variables
data[[v for v in data.columns if "Date_Time" in v]].head()

Date_Time (index),Date_Time,Date_Time_month,Date_Time_woty,Date_Time_dotw,Date_Time_dotm,Date_Time_hour,Date_Time_weekend
2004-04-04 00:00:00,2004-04-04 00:00:00,4,14,6,4,0,1
2004-04-04 01:00:00,2004-04-04 01:00:00,4,14,6,4,1,1
2004-04-04 02:00:00,2004-04-04 02:00:00,4,14,6,4,2,1
2004-04-04 03:00:00,2004-04-04 03:00:00,4,14,6,4,3,1
2004-04-04 04:00:00,2004-04-04 04:00:00,4,14,6,4,4,1
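
For reference, roughly the same columns can be derived with plain pandas. This is only a sketch, not Feature-engine's implementation: the short column names are made up, and treating Saturday and Sunday as the weekend is an assumption that is consistent with the output above (April 4, 2004 was a Sunday).

```python
import pandas as pd

idx = pd.to_datetime(["2004-04-04 00:00", "2004-04-05 13:00"])
df = pd.DataFrame(index=idx)

df["month"] = idx.month
df["woty"] = idx.isocalendar().week.astype(int)   # ISO week number
df["dotw"] = idx.dayofweek                        # Monday=0 ... Sunday=6
df["dotm"] = idx.day
df["hour"] = idx.hour
df["weekend"] = (idx.dayofweek >= 5).astype(int)  # assumed: Sat/Sun = 1
```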


## Lag Features

We create the following lagged features:

- The pollutant concentration for the previous hour (t-1).

- The pollutant concentration for the same hour on the previous day (t-24).

**We need to be careful because we do not have values for all timestamps. To be safe, we must shift the data using pandas frequency.**
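
The difference matters as soon as a timestamp is missing. The sketch below uses made-up values with the 02:00 reading absent: shifting by rows silently pairs 03:00 with the value from 01:00, whereas shifting by frequency leaves 03:00 empty because there is no reading one hour earlier. `pd.Timedelta(hours=1)` is used here instead of a string alias purely to avoid differences in alias spelling across pandas versions.

```python
import pandas as pd

# Hypothetical series with the 02:00 timestamp missing.
idx = pd.to_datetime([
    "2004-04-04 00:00",
    "2004-04-04 01:00",
    "2004-04-04 03:00",
])
s = pd.Series([10.0, 20.0, 30.0], index=idx, name="CO_sensor")

# Shift by position: ignores the gap in the timestamps.
by_rows = s.shift(1)

# Shift by frequency: moves the index one hour forward,
# then we align the result back to the original timestamps.
shifted = s.shift(freq=pd.Timedelta(hours=1))
by_time = shifted.reindex(s.index)
```

Here `by_rows` at 03:00 holds 20.0 (the 01:00 reading, two hours old), while `by_time` at 03:00 is NaN.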

In [11]:
class LagFeatures(BaseEstimator, TransformerMixin):

    def __init__(self, features, frequency, label):

        # In the init we specify the parameters that
        # the user needs to pass to start the transformer.

        # The user needs to indicate which features to lag,
        # how much we should lag the variables, 
        # and the name for the new variables.
        
        self.features = features
        self.frequency = frequency
        self.label = label

    def fit(self, X, y=None):

        # We do not need to learn parameters

        return self

    def transform(self, X):

        # We lag the features

        # We make a copy not to over-write the original data
        X = X.copy()

        # Shift the data forward.
        tmp = X[self.features].shift(freq=self.frequency)

        # Name the new variables.
        tmp.columns = [v + self.label for v in self.features]

        # Add the variables to the original data.
        X = X.merge(tmp, left_index=True, right_index=True, how='left')

        return X

In [12]:
# Add the lag 1 Hr features.

lag1 = LagFeatures(variables, '1H', '_lag_1')

data = lag1.fit_transform(data)

data[['Date_Time', 'CO_sensor', 'CO_sensor_lag_1']].head()

Date_Time (index),Date_Time,CO_sensor,CO_sensor_lag_1
2004-04-04 00:00:00,2004-04-04 00:00:00,1224.0,
2004-04-04 01:00:00,2004-04-04 01:00:00,1215.0,1224.0
2004-04-04 02:00:00,2004-04-04 02:00:00,1115.0,1215.0
2004-04-04 03:00:00,2004-04-04 03:00:00,1124.0,1115.0
2004-04-04 04:00:00,2004-04-04 04:00:00,1028.0,1124.0


We see, for example, that the value 1224 recorded at 00:00 now appears in the lag column at 01:00, the next timestamp.

In [13]:
# Add the lag 24 Hr features.

lag24 = LagFeatures(variables, '24H', '_lag_24')

data = lag24.fit_transform(data)

data[['Date_Time', 'CO_sensor', 'CO_sensor_lag_24']].head(25)

Date_Time (index),Date_Time,CO_sensor,CO_sensor_lag_24
2004-04-04 00:00:00,2004-04-04 00:00:00,1224.0,
2004-04-04 01:00:00,2004-04-04 01:00:00,1215.0,
2004-04-04 02:00:00,2004-04-04 02:00:00,1115.0,
2004-04-04 03:00:00,2004-04-04 03:00:00,1124.0,
2004-04-04 04:00:00,2004-04-04 04:00:00,1028.0,
2004-04-04 05:00:00,2004-04-04 05:00:00,1010.0,
2004-04-04 06:00:00,2004-04-04 06:00:00,1074.0,
2004-04-04 07:00:00,2004-04-04 07:00:00,1034.0,
2004-04-04 08:00:00,2004-04-04 08:00:00,1130.0,
2004-04-04 09:00:00,2004-04-04 09:00:00,1275.0,


Note that 1224, the value recorded on April 4 at midnight, is now aligned with April 5 at midnight (beyond the rows shown above). All earlier rows show NA because there is no information about the features 24 hours before them.

## Window features

We take the average of the previous 3 hours of the time series to predict the current hour.

We first need to calculate the average of the 3 previous values, and then move that value forward.
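
The two steps can be sketched on a single series as follows. The five values are the first `CO_sensor` readings shown earlier in this notebook; the variable names are made up. With a time-based window, pandas averages whatever observations fall inside the window, so the first rows are means of fewer than three values.

```python
import pandas as pd

idx = pd.date_range("2004-04-04", periods=5, freq=pd.Timedelta(hours=1))
s = pd.Series([1224.0, 1215.0, 1115.0, 1124.0, 1028.0], index=idx)

# Step 1: mean over a 3-hour window that ends at (and includes) each timestamp.
window_mean = s.rolling(window=pd.Timedelta(hours=3)).mean()

# Step 2: move the result one hour forward, so that the value stored at t
# only uses observations up to t-1, then align to the original timestamps.
feature = window_mean.shift(freq=pd.Timedelta(hours=1)).reindex(s.index)
```

Without the second step, the feature at time t would include the value at t itself, which is the value we want to forecast.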

In [14]:
class WindowFeatures(BaseEstimator, TransformerMixin):

    def __init__(self, features, window, frequency):

        # In the init we specify the parameters that
        # the user needs to pass to start the transformer.

        # The user needs to indicate the features to use for the computation
        # the size of the window,
        # and the frequency to shift forward.
        
        self.features = features
        self.window = window
        self.frequency = frequency

    def fit(self, X, y=None):

        # We do not need to learn parameters

        return self

    def transform(self, X):

        # First we calculate the average of the feature in
        # the indicated window, then we shift the value forward
        # based on the indicated frequency.

        X = X.copy()

        tmp = (X[self.features]
               .rolling(window=self.window).mean()
               .shift(freq=self.frequency)
               )

        # Rename the columns
        tmp.columns = [v + '_window' for v in self.features]

        # Add the variables to the original data.
        X = X.merge(tmp, left_index=True, right_index=True, how='left')

        return X

In [15]:
# With the new class, we add the window features

win_feat = WindowFeatures(variables, '3H', '1H')

data = win_feat.fit_transform(data)

data[['Date_Time', 'CO_sensor', 'CO_sensor_window']].head()

Date_Time (index),Date_Time,CO_sensor,CO_sensor_window
2004-04-04 00:00:00,2004-04-04 00:00:00,1224.0,
2004-04-04 01:00:00,2004-04-04 01:00:00,1215.0,1224.0
2004-04-04 02:00:00,2004-04-04 02:00:00,1115.0,1219.5
2004-04-04 03:00:00,2004-04-04 03:00:00,1124.0,1184.666667
2004-04-04 04:00:00,2004-04-04 04:00:00,1028.0,1151.333333


**Important:** Notice how the average of the previous three hours was advanced an hour to time t, the time we want to forecast.

## Feature combinations: total pollutants

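We want to add up the pollutant concentrations in the previous hour and in the same hour of the previous day. Note that the column selection below matches every lagged variable, so the sums also include the lagged temperature (T) and relative humidity (RH).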

We will automate this process with Feature-engine.

See `MathematicalCombination` in the Feature-engine documentation.

In [16]:
lag_1 = [v for v in data.columns if 'lag_1' in v]

lag_1

['CO_sensor_lag_1',
 'NMHC_sensor_lag_1',
 'NOX_sensor_lag_1',
 'NO2_sensor_lag_1',
 'O3_sensor_lag_1',
 'T_lag_1',
 'RH_lag_1']

In [17]:
combine_1 = MathematicalCombination(
    
    # the variables to add up
    variables_to_combine=lag_1,
    
    # we indicate we want to add the variables
    math_operations=['sum'],
    
    # the name of the new feature
    new_variables_names=['total_poll_lag_1'],
    
    # what to do if the variables have NA
    missing_values='ignore',
)

data = combine_1.fit_transform(data)

data['total_poll_lag_1'].head()

Date_Time
2004-04-04 00:00:00       0.0
2004-04-04 01:00:00    5576.2
2004-04-04 02:00:00    5475.1
2004-04-04 03:00:00    5206.6
2004-04-04 04:00:00    5273.7
Name: total_poll_lag_1, dtype: float64

In [18]:
lag_24 = [v for v in data.columns if 'lag_24' in v]

combine_24 = MathematicalCombination(
    
    # the variables to add up
    variables_to_combine=lag_24,
    
    # we indicate we want to add the variables
    math_operations=['sum'],
    
    # the name of the new feature
    new_variables_names=['total_poll_lag_24'],
    
    # what to do if the variables have NA
    missing_values='ignore',
)

data = combine_24.fit_transform(data)

data['total_poll_lag_24'].head()

Date_Time
2004-04-04 00:00:00    0.0
2004-04-04 01:00:00    0.0
2004-04-04 02:00:00    0.0
2004-04-04 03:00:00    0.0
2004-04-04 04:00:00    0.0
Name: total_poll_lag_24, dtype: float64
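
The 0.0 values in the outputs above appear in rows where all the lagged inputs are missing. That is what a sum that skips NA returns for an all-NA row. The plain-pandas sketch below reproduces this behaviour on made-up columns; it is not Feature-engine's implementation, only an illustration of the arithmetic. Passing `min_count=1` is one way to keep such rows as NaN instead.

```python
import numpy as np
import pandas as pd

# Hypothetical lag columns: the first row has no lagged information.
lags = pd.DataFrame({
    "CO_sensor_lag_1": [np.nan, 1224.0],
    "NO2_sensor_lag_1": [np.nan, 1580.0],
})

# Skipping NA: an all-NA row sums to 0.0.
total_skipna = lags.sum(axis=1)

# Requiring at least one valid value: an all-NA row stays NaN.
total_strict = lags.sum(axis=1, min_count=1)
```

A total of 0.0 is not a real measurement, so these rows should not be interpreted as "no pollution"; in this notebook they are removed later, when the rows with missing lags are dropped.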

## Periodic Features

We transform the month and the hour with the sine and cosine functions to obtain a periodic representation of the features.

We automate this procedure with Feature-engine.

[CyclicalTransformer](https://feature-engine.readthedocs.io/en/latest/creation/CyclicalTransformer.html)

In [19]:
# Create features that capture the cyclical representation.

cyclical = CyclicalTransformer(
    
    # The features we want to transform.
    variables=['Date_Time_month', 'Date_Time_hour'],
    
    # Whether to drop the original features.
    drop_original=False, 
)

data = cyclical.fit_transform(data)

data.head()

Date_Time (index),Date_Time,CO_sensor,NMHC_sensor,NOX_sensor,NO2_sensor,O3_sensor,T,RH,Date_Time_month,Date_Time_woty,...,NO2_sensor_window,O3_sensor_window,T_window,RH_window,total_poll_lag_1,total_poll_lag_24,Date_Time_month_sin,Date_Time_month_cos,Date_Time_hour_sin,Date_Time_hour_cos
2004-04-04 00:00:00,2004-04-04 00:00:00,1224.0,892.0,884.0,1580.0,923.0,16.7,56.5,4,14,...,,,,,0.0,0.0,0.866025,-0.5,0.0,1.0
2004-04-04 01:00:00,2004-04-04 01:00:00,1215.0,843.0,929.0,1551.0,862.0,15.9,59.2,4,14,...,1580.0,923.0,16.7,56.5,5576.2,0.0,0.866025,-0.5,0.269797,0.962917
2004-04-04 02:00:00,2004-04-04 02:00:00,1115.0,782.0,980.0,1500.0,752.0,15.2,62.4,4,14,...,1565.5,892.5,16.3,57.85,5475.1,0.0,0.866025,-0.5,0.519584,0.854419
2004-04-04 03:00:00,2004-04-04 03:00:00,1124.0,793.0,965.0,1521.0,791.0,14.7,65.0,4,14,...,1543.666667,845.666667,15.933333,59.366667,5206.6,0.0,0.866025,-0.5,0.730836,0.682553
2004-04-04 04:00:00,2004-04-04 04:00:00,1028.0,682.0,1090.0,1448.0,697.0,14.3,65.3,4,14,...,1524.0,801.666667,15.266667,62.2,5273.7,0.0,0.866025,-0.5,0.887885,0.460065


We can see the newly created features at the end of the dataframe.
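
The values in the output above are consistent with the formulas sin(2πx / max) and cos(2πx / max), where max is the largest value of the variable (12 for the month, 23 for the hour). The sketch below reproduces them with NumPy; the maximum values are hard-coded here as an assumption, and the short column names are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"month": [4, 12], "hour": [1, 23]})

# Assumed maximum of each variable (hard-coded for this sketch).
max_values = {"month": 12, "hour": 23}

for var, max_val in max_values.items():
    angle = 2 * np.pi * df[var] / max_val
    df[var + "_sin"] = np.sin(angle)
    df[var + "_cos"] = np.cos(angle)
```

For month 4 this gives a sine of about 0.866025 and a cosine of -0.5, and for hour 1 a sine of about 0.269797, matching the first rows of the dataframe above.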

## Drop Missing data

When creating lag and window features, we introduced some missing data. 

In [20]:
data.isnull().sum()

Date_Time                0
CO_sensor                0
NMHC_sensor              0
NOX_sensor               0
NO2_sensor               0
O3_sensor                0
T                        0
RH                       0
Date_Time_month          0
Date_Time_woty           0
Date_Time_dotw           0
Date_Time_dotm           0
Date_Time_hour           0
Date_Time_weekend        0
CO_sensor_lag_1         29
NMHC_sensor_lag_1       29
NOX_sensor_lag_1        29
NO2_sensor_lag_1        29
O3_sensor_lag_1         29
T_lag_1                 29
RH_lag_1                29
CO_sensor_lag_24       475
NMHC_sensor_lag_24     475
NOX_sensor_lag_24      475
NO2_sensor_lag_24      475
O3_sensor_lag_24       475
T_lag_24               475
RH_lag_24              475
CO_sensor_window        29
NMHC_sensor_window      29
NOX_sensor_window       29
NO2_sensor_window       29
O3_sensor_window        29
T_window                29
RH_window               29
total_poll_lag_1         0
total_poll_lag_24        0
...

In [21]:
print('data size before')
print(data.shape)

dropna = DropMissingData()

data = dropna.fit_transform(data)

print('data size after')
print(data.shape)

data size before
(7379, 41)
data size after
(6892, 41)


In [22]:
# the variables that had NA
dropna.variables_

['CO_sensor_lag_1',
 'NMHC_sensor_lag_1',
 'NOX_sensor_lag_1',
 'NO2_sensor_lag_1',
 'O3_sensor_lag_1',
 'T_lag_1',
 'RH_lag_1',
 'CO_sensor_lag_24',
 'NMHC_sensor_lag_24',
 'NOX_sensor_lag_24',
 'NO2_sensor_lag_24',
 'O3_sensor_lag_24',
 'T_lag_24',
 'RH_lag_24',
 'CO_sensor_window',
 'NMHC_sensor_window',
 'NOX_sensor_window',
 'NO2_sensor_window',
 'O3_sensor_window',
 'T_window',
 'RH_window']

## Seasonality Features

We know that the pollutants have an intraday seasonality. We want to capture it using only data from the past with respect to the time of prediction. So we need to create a transformer class that we can use together with Scikit-learn's cross-validation functions.

In [23]:
class SeasonalTransformer(BaseEstimator, TransformerMixin):

    def __init__(self, season_var, variables):

        # In the init we specify the parameters that
        # the user needs to pass to start the transformer.

        # The user needs to indicate the seasonal variable
        # and the variables that should be aggregated.

        self.season_var = season_var
        self.variables = variables

    def fit(self, X, y=None):

        # We want to estimate the mean value of the
        # time series in the seasonal term.

        # In our demo, that is the mean pollutant's 
        # concentration per hour.

        # We make a copy of the dataframe 
        # not to over-write the user's data.
        X = X.copy()

        # Calculate the mean pollutant concentration per hour.
        # The learned values will be stored in this attribute.
        self.seasonal_ = X.groupby(self.season_var)[self.variables].mean()

        # Rename the new variables.
        self.seasonal_.columns = [v + '_season' for v in self.variables]

        return self

    def transform(self, X):

        # We want to add the seasonal component to the
        # dataset to transform.

        X = X.copy()

        X = X.merge(self.seasonal_, on=self.season_var, how='left')

        return X

In [24]:
pollutants = variables[:-2]

pollutants

['CO_sensor', 'NMHC_sensor', 'NOX_sensor', 'NO2_sensor', 'O3_sensor']

In [25]:
seasonal_t = SeasonalTransformer(
    
    # the variable for the grouping
    season_var='Date_Time_hour', 
    
    # the variables to group
    variables=pollutants,
)

# Add the seasonal features
data = seasonal_t.fit_transform(data)

data.head()

Unnamed: 0,Date_Time,CO_sensor,NMHC_sensor,NOX_sensor,NO2_sensor,O3_sensor,T,RH,Date_Time_month,Date_Time_woty,...,total_poll_lag_24,Date_Time_month_sin,Date_Time_month_cos,Date_Time_hour_sin,Date_Time_hour_cos,CO_sensor_season,NMHC_sensor_season,NOX_sensor_season,NO2_sensor_season,O3_sensor_season
0,2004-04-05 00:00:00,1065.0,789.0,936.0,1620.0,929.0,18.4,65.8,4,15,...,5576.2,0.866025,-0.5,0.0,1.0,1046.262411,867.762411,840.716312,1385.095745,982.609929
1,2004-04-05 01:00:00,999.0,692.0,1038.0,1588.0,778.0,16.4,79.2,4,15,...,5475.1,0.866025,-0.5,0.269797,0.962917,986.690141,793.503521,902.602113,1327.982394,889.644366
2,2004-04-05 02:00:00,911.0,599.0,1189.0,1517.0,633.0,16.1,80.0,4,15,...,5206.6,0.866025,-0.5,0.519584,0.854419,926.783217,711.363636,994.255245,1272.265734,789.213287
3,2004-04-05 03:00:00,873.0,545.0,1308.0,1471.0,497.0,15.6,81.0,4,15,...,5273.7,0.866025,-0.5,0.730836,0.682553,888.892361,655.559028,1070.447917,1235.996528,721.458333
4,2004-04-05 04:00:00,881.0,546.0,1335.0,1465.0,458.0,15.6,81.0,4,15,...,5024.6,0.866025,-0.5,0.887885,0.460065,871.860627,629.745645,1106.108014,1223.090592,693.836237
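
The integer index in the output above comes from `merge(..., on=...)`, which does not keep the left dataframe's index. The sketch below, on made-up data, learns the hourly mean from a training portion only and adds it to a later portion with `DataFrame.join`, which preserves the timestamp index. The split point, the column names and the values are assumptions for illustration, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Two days of hypothetical hourly readings.
idx = pd.date_range("2004-04-04", periods=48, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"CO_sensor": np.arange(48, dtype=float)}, index=idx)
df["hour"] = df.index.hour

# Past data to learn from, later data to transform.
train, test = df.iloc[:24], df.iloc[24:]

# "fit": mean concentration per hour, learned from the training portion only.
seasonal = train.groupby("hour")["CO_sensor"].mean().rename("CO_sensor_season")

# "transform": look up the learned mean by hour; the datetime index is kept.
test_out = test.join(seasonal, on="hour")
```

Learning the means on the training portion alone is what keeps information from the forecast period out of the feature.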


We have now created many features that we can use to predict the pollutant concentration. We have also created several classes that we can arrange in sequence inside a Pipeline.

In the next notebook, we will see how we can integrate the feature engineering steps inside a pipeline, and make a single point forecast.

That is all for this notebook. I hope you enjoyed it!