# Refactoring the feature engineering steps

We forecasted 24 hours of pollutant concentration using a naive forecast and then a simple linear regression using the previous hour's pollutant concentration as an input variable.

In Section 2, we learned that we can extract many features from a time series. We created most of those features using pandas. In this notebook, we will streamline feature creation using the open-source library Feature-engine. This takes us one step closer to multi-step forecasting with multiple features.

For simplicity, we will only predict the concentration of CO.

**Note, this notebook will bring forward the feature creation steps that we discussed in the second notebook in Section 2.**


## Data

We will work with the Air Quality Dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Air+Quality).

For instructions on how to download, prepare, and store the dataset, refer to notebook number 3, in the folder "01-Datasets" from this repo.

In [1]:
import pandas as pd

from feature_engine.creation import (
    CyclicalTransformer,
    MathematicalCombination,
)

from feature_engine.datetime import DatetimeFeatures
from feature_engine.imputation import DropMissingData
from feature_engine.timeseries.forecasting import (
    LagFeatures,
    WindowFeatures,
)

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

# Load data

In [2]:
# We pack all data preparation steps from the
# previous notebook in a function.

def load_data():

    # Data lives here.
    filename = "../datasets/AirQualityUCI_ready.csv"

    # Load data: only the time variable and CO.
    data = pd.read_csv(
        filename,
        usecols=["Date_Time", "CO_sensor"],
        parse_dates=["Date_Time"],
        index_col=["Date_Time"],
    )

    # Sanity: sort index.
    data.sort_index(inplace=True)

    # Reduce data span.
    data = data["2004-04-01":"2005-04-30"]

    # Remove outliers (non-positive readings).
    # Copy to avoid a SettingWithCopyWarning below.
    data = data.loc[data["CO_sensor"] > 0].copy()

    # Add timestamp as column
    data["Date_Time"] = data.index

    return data[['Date_Time', "CO_sensor"]]

In [3]:
# Load data.

data = load_data()

data.head()

Date_Time,Date_Time,CO_sensor
2004-04-04 00:00:00,2004-04-04 00:00:00,1224.0
2004-04-04 01:00:00,2004-04-04 01:00:00,1215.0
2004-04-04 02:00:00,2004-04-04 02:00:00,1115.0
2004-04-04 03:00:00,2004-04-04 03:00:00,1124.0
2004-04-04 04:00:00,2004-04-04 04:00:00,1028.0


# Feature engineering

## Datetime features

We can extract date and time features automatically utilizing Feature-engine.

[DatetimeFeatures](https://feature-engine.readthedocs.io/en/latest/api_doc/datetime/DatetimeFeatures.html)

In [4]:
dtf = DatetimeFeatures(
    
    # the input dt variable
    variables="Date_Time",
    
    # the features we want to create
    features_to_extract=[
        "month",
        "week",
        "day_of_week",
        "day_of_month",
        "hour",
        "weekend",
    ],
    
    # if we want to drop the dt column.
    drop_original=True,
)

# Extract the datetime features
data = dtf.fit_transform(data)

# Show new variables
data[[v for v in data.columns if "Date_Time" in v]].head()

Date_Time,Date_Time_month,Date_Time_week,Date_Time_day_of_week,Date_Time_day_of_month,Date_Time_hour,Date_Time_weekend
2004-04-04 00:00:00,4,14,6,4,0,1
2004-04-04 01:00:00,4,14,6,4,1,1
2004-04-04 02:00:00,4,14,6,4,2,1
2004-04-04 03:00:00,4,14,6,4,3,1
2004-04-04 04:00:00,4,14,6,4,4,1
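
For comparison, here is a minimal sketch of how the same datetime features could be derived with plain pandas. The toy frame and the name `features` are made up for illustration, and the weekend flag is assumed to mean Saturday or Sunday, which is consistent with the output above.

```python
import pandas as pd

# Toy frame with the first timestamps shown above (illustrative only).
idx = pd.date_range("2004-04-04 00:00", periods=5, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"Date_Time": idx}, index=idx)

dt = toy["Date_Time"].dt

features = pd.DataFrame(
    {
        "Date_Time_month": dt.month,
        # ISO week number of the year.
        "Date_Time_week": dt.isocalendar().week.astype(int),
        # Monday=0, ..., Sunday=6.
        "Date_Time_day_of_week": dt.dayofweek,
        "Date_Time_day_of_month": dt.day,
        "Date_Time_hour": dt.hour,
        # Assumption: weekend means Saturday (5) or Sunday (6).
        "Date_Time_weekend": (dt.dayofweek >= 5).astype(int),
    },
    index=toy.index,
)

print(features)
```

The transformer saves us from writing and maintaining this block by hand, and it can be placed inside a pipeline.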


## Lag features

We create the following lagged features:

- The pollutant concentration for the previous hour (t-1).

- The pollutant concentration for the same hour on the previous day (t-24).

In [5]:
# Set up the lag transformer.

lagf = LagFeatures(
    variables="CO_sensor", # the input variable
    freq=["1H", "24H"],    # lag by 1 hr and by 24 hrs
    missing_values="ignore",
)

# Add the lag features.

data = lagf.fit_transform(data)

# Show new variables
data[[v for v in data.columns if "CO" in v]].head()

Date_Time,CO_sensor,CO_sensor_lag_1H,CO_sensor_lag_24H
2004-04-04 00:00:00,1224.0,,
2004-04-04 01:00:00,1215.0,1224.0,
2004-04-04 02:00:00,1115.0,1215.0,
2004-04-04 03:00:00,1124.0,1115.0,
2004-04-04 04:00:00,1028.0,1124.0,
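
To make the frequency-based lag more concrete, the sketch below reproduces a 1-hour lag with plain pandas on a toy series. It is an illustration under assumptions, not Feature-engine's internal code: the toy data deliberately skips the 03:00 reading, and the names `lag_1h` and `lag_1row` are hypothetical.

```python
import pandas as pd

# Toy series with a gap: the 03:00 reading is missing (illustrative only).
idx = pd.to_datetime(
    [
        "2004-04-04 00:00",
        "2004-04-04 01:00",
        "2004-04-04 02:00",
        "2004-04-04 04:00",
    ]
)
co = pd.Series([1224.0, 1215.0, 1115.0, 1028.0], index=idx, name="CO_sensor")

# Frequency-based lag: move every timestamp 1 hour forward,
# then align the result with the original index.
lag_1h = co.shift(freq=pd.Timedelta(hours=1)).reindex(co.index)

# Row-based lag: take the previous row, whatever its timestamp.
lag_1row = co.shift(1)

print(pd.DataFrame({"CO_sensor": co, "lag_1h": lag_1h, "lag_1row": lag_1row}))
```

At 04:00 the frequency-based lag is missing because there is no reading at 03:00, whereas the row-based lag silently returns the 02:00 value. Lagging by frequency keeps the meaning of "the previous hour" intact when timestamps are missing.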


## Window features

We take the average of the previous 3 hours of the time series to predict the current hour. 

In [6]:
winf = WindowFeatures(
    variables="CO_sensor", # the input variable
    window="3H",           # average of 3 previous hours
    freq="1H",             # move 1 hr forward
    missing_values="ignore",
)

# Add the window features.
data = winf.fit_transform(data)

# Show new variables
data[[v for v in data.columns if "CO" in v]].head()

Date_Time,CO_sensor,CO_sensor_lag_1H,CO_sensor_lag_24H,CO_sensor_window_3H_mean
2004-04-04 00:00:00,1224.0,,,
2004-04-04 01:00:00,1215.0,1224.0,,1224.0
2004-04-04 02:00:00,1115.0,1215.0,,1219.5
2004-04-04 03:00:00,1124.0,1115.0,,1184.666667
2004-04-04 04:00:00,1028.0,1124.0,,1151.333333


**Important:** Notice how the average of the previous three hours was shifted forward by one hour, to time t, the time we want to forecast. This way, the feature at time t only uses values observed before t.
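
The following sketch reproduces the window feature with plain pandas, using the first five CO values shown above. It is meant as an illustration of the rolling-then-shift logic rather than Feature-engine's actual implementation; the names `rolling_mean` and `window_feature` are hypothetical.

```python
import pandas as pd

# First five hourly readings from the table above.
idx = pd.date_range("2004-04-04 00:00", periods=5, freq=pd.Timedelta(hours=1))
co = pd.Series([1224.0, 1215.0, 1115.0, 1124.0, 1028.0], index=idx, name="CO_sensor")

# Mean over a 3-hour window that ends at, and includes, each timestamp.
rolling_mean = co.rolling(pd.Timedelta(hours=3)).mean()

# Shift the result 1 hour forward so that the value at time t
# summarizes t-3, t-2 and t-1, and never the value at t itself.
window_feature = rolling_mean.shift(freq=pd.Timedelta(hours=1)).reindex(co.index)

print(pd.DataFrame({"CO_sensor": co, "window_3H_mean": window_feature}))
```

Without the shift, the window at time t would include the value we are trying to predict, which would leak the target into the feature.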

## Feature combination

We will determine the mean value of the pollutant at times t-1 and t-24.

We will automate this process with Feature-engine.

[MathematicalCombination](https://feature-engine.readthedocs.io/en/latest/api_doc/creation/MathematicalCombination.html)

In [7]:
combine = MathematicalCombination(
    
    # the variables to combine
    variables_to_combine=["CO_sensor_lag_1H", "CO_sensor_lag_24H"],
    
    # we indicate we want the average
    math_operations=["mean"],
    
    # the name of the new feature
    new_variables_names=["CO_lag_ave"],
    
    # what to do if the variables have NA
    missing_values="ignore",
)

data = combine.fit_transform(data)

# Show new variables
data[[v for v in data.columns if "CO" in v]].head()

Date_Time,CO_sensor,CO_sensor_lag_1H,CO_sensor_lag_24H,CO_sensor_window_3H_mean,CO_lag_ave
2004-04-04 00:00:00,1224.0,,,,
2004-04-04 01:00:00,1215.0,1224.0,,1224.0,1224.0
2004-04-04 02:00:00,1115.0,1215.0,,1219.5,1215.0
2004-04-04 03:00:00,1124.0,1115.0,,1184.666667,1115.0
2004-04-04 04:00:00,1028.0,1124.0,,1151.333333,1124.0
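
In the output above, `CO_lag_ave` equals `CO_sensor_lag_1H` wherever the 24-hour lag is missing. A row-wise mean that skips missing values behaves the same way, as the pandas sketch below shows. The toy frame `lags` is made up for illustration, and the comparison with `missing_values="ignore"` is an interpretation of the printed output, not a statement about the transformer's internals.

```python
import numpy as np
import pandas as pd

# Toy lag columns: both missing, one missing, none missing.
lags = pd.DataFrame(
    {
        "CO_sensor_lag_1H": [np.nan, 1224.0, 1188.0],
        "CO_sensor_lag_24H": [np.nan, np.nan, 1224.0],
    }
)

# Row-wise mean that ignores missing values.
cols = ["CO_sensor_lag_1H", "CO_sensor_lag_24H"]
lags["CO_lag_ave"] = lags[cols].mean(axis=1, skipna=True)

print(lags)
```

The average is only missing when both inputs are missing, which is why `CO_lag_ave` has fewer missing values than `CO_sensor_lag_24H`.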


## Periodic features

We transform the month and the hour with the sine and cosine to have a periodic representation of the features.

We automate this procedure with Feature-engine.

[CyclicalTransformer](https://feature-engine.readthedocs.io/en/latest/creation/CyclicalTransformer.html)

In [8]:
# Create features that capture the cyclical representation.

cyclicf = CyclicalTransformer(
    
    # The features we want to transform.
    variables=["Date_Time_month", "Date_Time_hour"],
    
    # Whether to drop the original features.
    drop_original=False,
)

data = cyclicf.fit_transform(data)

data[[v for v in data.columns if "Date_Time_month" in v or "Date_Time_hour" in v]].head()

Date_Time,Date_Time_month,Date_Time_hour,Date_Time_month_sin,Date_Time_month_cos,Date_Time_hour_sin,Date_Time_hour_cos
2004-04-04 00:00:00,4,0,0.866025,-0.5,0.0,1.0
2004-04-04 01:00:00,4,1,0.866025,-0.5,0.269797,0.962917
2004-04-04 02:00:00,4,2,0.866025,-0.5,0.519584,0.854419
2004-04-04 03:00:00,4,3,0.866025,-0.5,0.730836,0.682553
2004-04-04 04:00:00,4,4,0.866025,-0.5,0.887885,0.460065


We can see the newly created features at the end of the dataframe.
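
The values above can be reproduced by hand. The sketch below assumes that each variable is divided by its maximum value (12 for the month, 23 for the hour) before taking the sine and cosine; this assumption is inferred from the printed output, and the helper `cyclical_encode` is a hypothetical name.

```python
import math


def cyclical_encode(value, max_value):
    """Map a value onto the unit circle.

    Assumption: the period is the maximum value of the variable.
    """
    angle = 2.0 * math.pi * value / max_value
    return math.sin(angle), math.cos(angle)


# Month 4, with a maximum of 12.
month_sin, month_cos = cyclical_encode(4, 12)

# Hour 1, with a maximum of 23.
hour_sin, hour_cos = cyclical_encode(1, 23)

print(round(month_sin, 6), round(month_cos, 6))
print(round(hour_sin, 6), round(hour_cos, 6))
```

Under this assumption, hour 23 maps to the same point as hour 0, because 23 / 23 completes a full turn of the circle.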

## Missing data

When creating lag and window features, we introduced some missing data.

In [9]:
data.isnull().sum()

CO_sensor                     0
Date_Time_month               0
Date_Time_week                0
Date_Time_day_of_week         0
Date_Time_day_of_month        0
Date_Time_hour                0
Date_Time_weekend             0
CO_sensor_lag_1H             27
CO_sensor_lag_24H           461
CO_sensor_window_3H_mean     27
CO_lag_ave                   17
Date_Time_month_sin           0
Date_Time_month_cos           0
Date_Time_hour_sin            0
Date_Time_hour_cos            0
dtype: int64

In [10]:
# We drop the observations with NA

print(data.shape)

imputer = DropMissingData()

data = imputer.fit_transform(data)

print(data.shape)

(7393, 15)
(6922, 15)


## Seasonality features

We want to calculate the mean pollutant concentration per hour. 

We went over this class in Section 2, so it should look familiar.

In [11]:
class SeasonalTransformer(BaseEstimator, TransformerMixin):
    
    def __init__(self, season_var, variables):

        # In the init we specify the parameters that
        # the user needs to pass to start the transformer.

        # The user needs to indicate the seasonal variable
        # and the variables that should be aggregated.

        self.season_var = season_var
        self.variables = variables

    def fit(self, X, y=None):

        # We want to estimate the mean value of the
        # time series in the seasonal term.

        # In our demo, that is the mean pollutant's
        # concentration per hour.

        # We make a copy of the dataframe
        # not to over-write the user's data.
        X = X.copy()

        # Calculate mean pollutant per hr.
        # The learned values will be stored in this attribute.
        self.seasonal_ = X.groupby(self.season_var)[self.variables].mean()

        # Rename the new variables.
        self.seasonal_.columns = [v + "_season" for v in self.variables]

        # Reset index
        self.seasonal_ = self.seasonal_.reset_index()

        return self

    def transform(self, X):

        # Add the seasonal component to the
        # dataset to transform.

        X = X.copy()

        # Store the datetime index (it is lost in merge)
        index = X.index

        # Add the seasonal feature.
        X = X.merge(self.seasonal_, on=self.season_var, how="left")

        # Restore the datetime index to the df
        X.index = index
        
        # Drop input variable
        X = X.drop(self.variables, axis=1)

        return X

In [12]:
seasonf = SeasonalTransformer(
    
    # the seasonal variable
    season_var="Date_Time_hour",
    
    # the time series
    variables=["CO_sensor"],
)

# Add the seasonal features
data = seasonf.fit_transform(data)

# Show new variables
data[[v for v in data.columns if "CO" in v]].head()

Date_Time,CO_sensor_lag_1H,CO_sensor_lag_24H,CO_sensor_window_3H_mean,CO_lag_ave,CO_sensor_season
2004-04-05 00:00:00,1188.0,1224.0,1165.666667,1206.0,1046.262411
2004-04-05 01:00:00,1065.0,1215.0,1149.666667,1140.0,985.395105
2004-04-05 02:00:00,999.0,1115.0,1084.0,1057.0,925.888889
2004-04-05 03:00:00,911.0,1124.0,991.666667,1017.5,888.306897
2004-04-05 04:00:00,873.0,1028.0,927.666667,950.5,870.66323


In [13]:
# The transformer learned and stored the mean
# pollutant concentration per hour.

seasonf.seasonal_.head()

,Date_Time_hour,CO_sensor_season
0,0,1046.262411
1,1,985.395105
2,2,925.888889
3,3,888.306897
4,4,870.66323


In the two views of the dataframes above, we can see how the seasonal features with the mean pollutant concentration per hour were added to the corresponding hours in the main data set.
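
The core of the seasonal feature is a group-wise mean that is learned once and then looked up for each row. The sketch below shows that idea with plain pandas on made-up numbers; `train`, `new`, and `hourly_mean` are hypothetical names, and `map` is used here as a compact stand-in for the left merge in the class above.

```python
import pandas as pd

# Made-up training data: two readings for hour 0 and two for hour 1.
train = pd.DataFrame(
    {
        "Date_Time_hour": [0, 1, 0, 1],
        "CO_sensor": [1000.0, 900.0, 1100.0, 950.0],
    }
)

# New data to transform; hour 2 was never seen during fit.
new = pd.DataFrame({"Date_Time_hour": [1, 0, 2]})

# "fit": learn the mean concentration per hour.
hourly_mean = train.groupby("Date_Time_hour")["CO_sensor"].mean()

# "transform": look up the learned mean for each row.
new["CO_sensor_season"] = new["Date_Time_hour"].map(hourly_mean)

print(new)
```

Hours that were absent from the data used in fit receive a missing value, just as they would with a left merge.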

# Pipeline

We have now created a lot of features that we can use to predict the pollutant concentration. Let's extract all these features in one step using a feature engineering pipeline.

In [14]:
data.head()

Date_Time,Date_Time_month,Date_Time_week,Date_Time_day_of_week,Date_Time_day_of_month,Date_Time_hour,Date_Time_weekend,CO_sensor_lag_1H,CO_sensor_lag_24H,CO_sensor_window_3H_mean,CO_lag_ave,Date_Time_month_sin,Date_Time_month_cos,Date_Time_hour_sin,Date_Time_hour_cos,CO_sensor_season
2004-04-05 00:00:00,4,15,0,5,0,0,1188.0,1224.0,1165.666667,1206.0,0.866025,-0.5,0.0,1.0,1046.262411
2004-04-05 01:00:00,4,15,0,5,1,0,1065.0,1215.0,1149.666667,1140.0,0.866025,-0.5,0.269797,0.962917,985.395105
2004-04-05 02:00:00,4,15,0,5,2,0,999.0,1115.0,1084.0,1057.0,0.866025,-0.5,0.519584,0.854419,925.888889
2004-04-05 03:00:00,4,15,0,5,3,0,911.0,1124.0,991.666667,1017.5,0.866025,-0.5,0.730836,0.682553,888.306897
2004-04-05 04:00:00,4,15,0,5,4,0,873.0,1028.0,927.666667,950.5,0.866025,-0.5,0.887885,0.460065,870.66323


In [15]:
# Let's re-load the data, to start
# from scratch.

data = load_data()

data.head()

Date_Time,Date_Time,CO_sensor
2004-04-04 00:00:00,2004-04-04 00:00:00,1224.0
2004-04-04 01:00:00,2004-04-04 01:00:00,1215.0
2004-04-04 02:00:00,2004-04-04 02:00:00,1115.0
2004-04-04 03:00:00,2004-04-04 03:00:00,1124.0
2004-04-04 04:00:00,2004-04-04 04:00:00,1028.0


In [16]:
pipe = Pipeline([
    ("datetime_features", dtf),
    ("lagf", lagf),
    ("winf", winf),
    ("combine", combine),
    ("Periodic", cyclicf),
    ("dropna", imputer),
    ("seasonal", seasonf),
    ]
)

In [17]:
data = pipe.fit_transform(data)

data.head()

Date_Time,Date_Time_month,Date_Time_week,Date_Time_day_of_week,Date_Time_day_of_month,Date_Time_hour,Date_Time_weekend,CO_sensor_lag_1H,CO_sensor_lag_24H,CO_sensor_window_3H_mean,CO_lag_ave,Date_Time_month_sin,Date_Time_month_cos,Date_Time_hour_sin,Date_Time_hour_cos,CO_sensor_season
2004-04-05 00:00:00,4,15,0,5,0,0,1188.0,1224.0,1165.666667,1206.0,0.866025,-0.5,0.0,1.0,1046.262411
2004-04-05 01:00:00,4,15,0,5,1,0,1065.0,1215.0,1149.666667,1140.0,0.866025,-0.5,0.269797,0.962917,985.395105
2004-04-05 02:00:00,4,15,0,5,2,0,999.0,1115.0,1084.0,1057.0,0.866025,-0.5,0.519584,0.854419,925.888889
2004-04-05 03:00:00,4,15,0,5,3,0,911.0,1124.0,991.666667,1017.5,0.866025,-0.5,0.730836,0.682553,888.306897
2004-04-05 04:00:00,4,15,0,5,4,0,873.0,1028.0,927.666667,950.5,0.866025,-0.5,0.887885,0.460065,870.66323


In the next notebook, we will train a linear regression on top of this feature engineering pipeline to perform multi-step forecasting.

That is all for this notebook. I hope you enjoyed it!