# Refactoring the feature engineering steps


In Section 2, we learned that we can extract a lot of features from a time series. We used pandas to create most features. Then, we used those features to forecast CO concentration for the next hour. 

In this notebook, we will use the open-source library Feature-engine to line up the feature extraction steps within a Scikit-learn pipeline.

**In this notebook we bring forward the feature creation steps that we implemented in the second notebook in Section 2.**

We will apply the following steps to the hourly CO concentration and relative humidity (RH):

- Create date and time features
- Create lag features
- Create window features
- Create cyclical (periodic) features
- Remove missing data
- Drop the original time series


## Data

We will work with the Air Quality Dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Air+Quality).

For instructions on how to download, prepare, and store the dataset, refer to notebook number 3, in the folder "01-Datasets" from this repo.

In [1]:
import pandas as pd

from feature_engine.creation import CyclicalFeatures
from feature_engine.datetime import DatetimeFeatures
from feature_engine.imputation import DropMissingData
from feature_engine.selection import DropFeatures
from feature_engine.timeseries.forecasting import (
    LagFeatures,
    WindowFeatures,
)

from sklearn.pipeline import Pipeline

## Load data

In [2]:
# Same function we saw in section 2.

def load_data():

    # Data lives here.
    filename = "../datasets/AirQualityUCI_ready.csv"

    # Load data: only the time variable, CO, and RH.
    data = pd.read_csv(
        filename,
        usecols=["Date_Time", "CO_sensor", "RH"],
        parse_dates=["Date_Time"],
        index_col=["Date_Time"],
    )

    # Sanity: sort index.
    data.sort_index(inplace=True)

    # Reduce data span.
    data = data["2004-04-01":"2005-04-30"]

    # Remove outliers (non-positive sensor readings).
    data = data.loc[(data["CO_sensor"] > 0)]

    return data

In [3]:
# Load data.

data = load_data()

data.head()

Date_Time,CO_sensor,RH
2004-04-04 00:00:00,1224.0,56.5
2004-04-04 01:00:00,1215.0,59.2
2004-04-04 02:00:00,1115.0,62.4
2004-04-04 03:00:00,1124.0,65.0
2004-04-04 04:00:00,1028.0,65.3


## Datetime features

We can extract date and time features automatically with Feature-engine's [DatetimeFeatures](https://feature-engine.readthedocs.io/en/latest/api_doc/datetime/DatetimeFeatures.html).

In [4]:
dtf = DatetimeFeatures(
    # the datetime variable
    variables="index",
    
    # the features we want to create
    features_to_extract=[
        "month",
        "week",
        "day_of_week",
        "day_of_month",
        "hour",
        "weekend",
    ],
)

# Extract the datetime features
data = dtf.fit_transform(data)

# Show new variables
data.head()

Date_Time,CO_sensor,RH,month,week,day_of_week,day_of_month,hour,weekend
2004-04-04 00:00:00,1224.0,56.5,4,14,6,4,0,1
2004-04-04 01:00:00,1215.0,59.2,4,14,6,4,1,1
2004-04-04 02:00:00,1115.0,62.4,4,14,6,4,2,1
2004-04-04 03:00:00,1124.0,65.0,4,14,6,4,3,1
2004-04-04 04:00:00,1028.0,65.3,4,14,6,4,4,1
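
For reference, the same six features can be derived with plain pandas from a `DatetimeIndex`. The following is a minimal sketch on a toy frame that reuses the first five readings shown above; it is not Feature-engine's internal implementation, and the variable names `index` and `toy` are illustrative.

```python
import pandas as pd

# Toy hourly index starting at the first timestamp shown above.
index = pd.date_range(
    "2004-04-04 00:00",
    periods=5,
    freq=pd.Timedelta(hours=1),
    name="Date_Time",
)
toy = pd.DataFrame(
    {"CO_sensor": [1224.0, 1215.0, 1115.0, 1124.0, 1028.0]},
    index=index,
)

# Derive each feature from the DatetimeIndex.
toy["month"] = index.month
toy["week"] = index.isocalendar().week.astype("int64")  # ISO week number
toy["day_of_week"] = index.dayofweek  # Monday=0, Sunday=6
toy["day_of_month"] = index.day
toy["hour"] = index.hour
toy["weekend"] = (index.dayofweek >= 5).astype(int)  # 1 for Saturday or Sunday

print(toy)
```

Since 2004-04-04 was a Sunday, `day_of_week` is 6 and `weekend` is 1, which agrees with the output above.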


## Lag features

We create the following lag features for both the CO concentration and the relative humidity (RH), using [LagFeatures](https://feature-engine.readthedocs.io/en/latest/api_doc/timeseries/forecasting/LagFeatures.html):

- The value for the previous hour (t-1).

- The value for the same hour on the previous day (t-24).

In [5]:
# Set up the lag features transformer.

lagf = LagFeatures(
    variables=["CO_sensor", "RH"],  # the input variables
    freq=["1H", "24H"],  # move 1 hr and 24 hrs forward
    missing_values="ignore",
)

# Add the lag features.
data = lagf.fit_transform(data)

# Show new variables
data[[v for v in data.columns if "lag" in v]].head()

Date_Time,CO_sensor_lag_1H,RH_lag_1H,CO_sensor_lag_24H,RH_lag_24H
2004-04-04 00:00:00,,,,
2004-04-04 01:00:00,1224.0,56.5,,
2004-04-04 02:00:00,1215.0,59.2,,
2004-04-04 03:00:00,1115.0,62.4,,
2004-04-04 04:00:00,1124.0,65.0,,
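
A lag defined by a time offset (`freq`) behaves differently from a lag defined by a number of rows when the series has gaps. The sketch below illustrates this with plain pandas on a toy series in which the 03:00 reading is missing. It is an assumption-laden illustration of the idea (shift the timestamps, then align on the index), not Feature-engine's actual code, and the column names are made up for the example.

```python
import pandas as pd

# Toy series with a gap: there is no reading at 03:00.
index = pd.to_datetime(
    [
        "2004-04-04 00:00",
        "2004-04-04 01:00",
        "2004-04-04 02:00",
        "2004-04-04 04:00",
    ]
)
toy = pd.DataFrame({"CO_sensor": [1224.0, 1215.0, 1115.0, 1028.0]}, index=index)

# Time-based lag: move every timestamp 1 hour forward, then align on the index.
lag_by_time = (
    toy["CO_sensor"]
    .shift(freq=pd.Timedelta(hours=1))
    .rename("lag_1H_by_time")
)
toy = toy.join(lag_by_time, how="left")

# Row-based lag: take the previous row, whatever its timestamp is.
toy["lag_1_by_row"] = toy["CO_sensor"].shift(periods=1)

print(toy)
```

At 04:00 the time-based lag is missing, because there is no reading at 03:00, whereas the row-based lag silently returns the 02:00 reading. With a time-based lag, gaps in the timestamps therefore show up as additional missing values.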


## Window features

We take the average of the previous 3 hours of each time series as a predictor for the current hour, using [WindowFeatures](https://feature-engine.readthedocs.io/en/latest/api_doc/timeseries/forecasting/WindowFeatures.html).

In [6]:
winf = WindowFeatures(
    variables=["CO_sensor", "RH"],  # the input variables
    window="3H",  # average of 3 previous hours
    freq="1H",  # move 1 hr forward
    missing_values="ignore",
)

# Add the window features.
data = winf.fit_transform(data)

# Show new variables
data[[v for v in data.columns if "window" in v]].head()

Date_Time,CO_sensor_window_3H_mean,RH_window_3H_mean
2004-04-04 00:00:00,,
2004-04-04 01:00:00,1224.0,56.5
2004-04-04 02:00:00,1219.5,57.85
2004-04-04 03:00:00,1184.666667,59.366667
2004-04-04 04:00:00,1151.333333,62.2
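
The window feature combines two operations: a rolling aggregation and a shift. The following sketch reproduces the `CO_sensor` column above with plain pandas, using the first five readings of the dataset. It is one way to obtain the same numbers, not necessarily how Feature-engine computes them, and the variable names are illustrative.

```python
import pandas as pd

index = pd.date_range("2004-04-04 00:00", periods=5, freq=pd.Timedelta(hours=1))
co = pd.Series(
    [1224.0, 1215.0, 1115.0, 1124.0, 1028.0],
    index=index,
    name="CO_sensor",
)

# Mean over a 3-hour time window that ends at, and includes, each timestamp.
rolling_mean = co.rolling(window=pd.Timedelta(hours=3)).mean()

# Shift 1 hour forward so that each row only sees values from earlier hours.
window_feature = rolling_mean.shift(freq=pd.Timedelta(hours=1)).rename(
    "CO_sensor_window_3H_mean"
)

# Align the shifted feature with the original timestamps.
toy = co.to_frame().join(window_feature, how="left")

print(toy)
```

Without the shift, the window ending at time t would include the value at t itself, which is the value we want to forecast.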


## Periodic features

We transform the month and the hour with the sine and cosine to have a periodic representation of the features.

We automate this procedure with Feature-engine's [CyclicalFeatures](https://feature-engine.readthedocs.io/en/latest/api_doc/creation/CyclicalFeatures.html).

In [7]:
# Create features that capture the cyclical representation.

cyclicf = CyclicalFeatures(
    # The features we want to transform.
    variables=["month", "hour"],
    # Whether to drop the original features.
    drop_original=False,
)

data = cyclicf.fit_transform(data)

data[[v for v in data.columns if "month" in v or "hour" in v]].head()

Date_Time,month,day_of_month,hour,month_sin,month_cos,hour_sin,hour_cos
2004-04-04 00:00:00,4,4,0,0.866025,-0.5,0.0,1.0
2004-04-04 01:00:00,4,4,1,0.866025,-0.5,0.269797,0.962917
2004-04-04 02:00:00,4,4,2,0.866025,-0.5,0.519584,0.854419
2004-04-04 03:00:00,4,4,3,0.866025,-0.5,0.730836,0.682553
2004-04-04 04:00:00,4,4,4,0.866025,-0.5,0.887885,0.460065


We can see the newly created features at the end of the dataframe.
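
The values above are consistent with the mapping sin(2π·x / max) and cos(2π·x / max), where max is the largest value of the variable. Here is a small sketch with NumPy, assuming maxima of 12 for the month and 23 for the hour; a fitted transformer would learn these from the data, and the `max_values` dictionary below is our own illustrative name.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"month": [4, 4, 4], "hour": [0, 1, 2]})

# Assumed maxima for this sketch (largest month and largest hour in the data).
max_values = {"month": 12, "hour": 23}

for var, max_value in max_values.items():
    angle = 2 * np.pi * toy[var] / max_value
    toy[f"{var}_sin"] = np.sin(angle)
    toy[f"{var}_cos"] = np.cos(angle)

print(toy.round(6))
```

Note that dividing by the maximum maps hour 23 onto the same point as hour 0. If that is not desired, `CyclicalFeatures` accepts a `max_values` dictionary, so a different divisor (for example 24 for the hour) can be set explicitly.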

## Missing data

When creating lag and window features, we introduced some missing data. We remove the rows that contain missing values with [DropMissingData](https://feature-engine.readthedocs.io/en/latest/api_doc/imputation/DropMissingData.html).

In [8]:
data.isnull().sum()

CO_sensor                     0
RH                            0
month                         0
week                          0
day_of_week                   0
day_of_month                  0
hour                          0
weekend                       0
CO_sensor_lag_1H             27
RH_lag_1H                    27
CO_sensor_lag_24H           461
RH_lag_24H                  461
CO_sensor_window_3H_mean     27
RH_window_3H_mean            27
month_sin                     0
month_cos                     0
hour_sin                      0
hour_cos                      0
dtype: int64

In [9]:
# We drop the observations with NA.

print(data.shape)

imputer = DropMissingData()

data = imputer.fit_transform(data)

print(data.shape)

(7393, 18)
(6922, 18)
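
In the run above, 7393 - 6922 = 471 rows were removed. In plain pandas, removing every row that has at least one missing value corresponds to `dropna`. A minimal sketch on a toy frame with made-up values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "CO_sensor": [1224.0, 1215.0, 1115.0],
        "CO_sensor_lag_1H": [np.nan, 1224.0, 1215.0],
    }
)

print(toy.shape)

# Drop every row that contains at least one missing value.
toy = toy.dropna(axis=0, how="any")

print(toy.shape)
```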


In [10]:
data.head()

Date_Time,CO_sensor,RH,month,week,day_of_week,day_of_month,hour,weekend,CO_sensor_lag_1H,RH_lag_1H,CO_sensor_lag_24H,RH_lag_24H,CO_sensor_window_3H_mean,RH_window_3H_mean,month_sin,month_cos,hour_sin,hour_cos
2004-04-05 00:00:00,1065.0,65.8,4,15,0,5,0,0,1188.0,60.8,1224.0,56.5,1165.666667,58.566667,0.866025,-0.5,0.0,1.0
2004-04-05 01:00:00,999.0,79.2,4,15,0,5,1,0,1065.0,65.8,1215.0,59.2,1149.666667,61.8,0.866025,-0.5,0.269797,0.962917
2004-04-05 02:00:00,911.0,80.0,4,15,0,5,2,0,999.0,79.2,1115.0,62.4,1084.0,68.6,0.866025,-0.5,0.519584,0.854419
2004-04-05 03:00:00,873.0,81.0,4,15,0,5,3,0,911.0,80.0,1124.0,65.0,991.666667,75.0,0.866025,-0.5,0.730836,0.682553
2004-04-05 04:00:00,881.0,81.0,4,15,0,5,4,0,873.0,81.0,1028.0,65.3,927.666667,80.066667,0.866025,-0.5,0.887885,0.460065


## Drop original time series

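We drop the original time series to avoid look-ahead bias: the values of `CO_sensor` and `RH` at time t are not known when we forecast time t, so they cannot be used as predictors. We remove them with [DropFeatures](https://feature-engine.readthedocs.io/en/latest/api_doc/selection/DropFeatures.html).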

In [11]:
drop_ts = DropFeatures(features_to_drop=["CO_sensor", "RH"])

data = drop_ts.fit_transform(data)

data.head()

Date_Time,month,week,day_of_week,day_of_month,hour,weekend,CO_sensor_lag_1H,RH_lag_1H,CO_sensor_lag_24H,RH_lag_24H,CO_sensor_window_3H_mean,RH_window_3H_mean,month_sin,month_cos,hour_sin,hour_cos
2004-04-05 00:00:00,4,15,0,5,0,0,1188.0,60.8,1224.0,56.5,1165.666667,58.566667,0.866025,-0.5,0.0,1.0
2004-04-05 01:00:00,4,15,0,5,1,0,1065.0,65.8,1215.0,59.2,1149.666667,61.8,0.866025,-0.5,0.269797,0.962917
2004-04-05 02:00:00,4,15,0,5,2,0,999.0,79.2,1115.0,62.4,1084.0,68.6,0.866025,-0.5,0.519584,0.854419
2004-04-05 03:00:00,4,15,0,5,3,0,911.0,80.0,1124.0,65.0,991.666667,75.0,0.866025,-0.5,0.730836,0.682553
2004-04-05 04:00:00,4,15,0,5,4,0,873.0,81.0,1028.0,65.3,927.666667,80.066667,0.866025,-0.5,0.887885,0.460065
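
Dropping `CO_sensor` also removes the value we want to forecast from the dataframe. One possible way to handle this, sketched below with plain pandas on a toy frame, is to set the target aside before dropping the column. This is an illustration under our own assumptions (the names `X` and `y` are ours), not necessarily how the next notebook proceeds.

```python
import pandas as pd

index = pd.date_range("2004-04-05 00:00", periods=3, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame(
    {
        # Value at time t: this is the target.
        "CO_sensor": [1065.0, 999.0, 911.0],
        # Value at time t-1: this is a valid predictor.
        "CO_sensor_lag_1H": [1188.0, 1065.0, 999.0],
    },
    index=index,
)

# Set the target aside before dropping it from the predictors.
y = toy["CO_sensor"].copy()
X = toy.drop(columns=["CO_sensor"])

# Predictors and target remain aligned on the same timestamps.
print(X.join(y.rename("target")))
```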


# Pipeline

We have now created a lot of features that we can use to predict the CO concentration. Let's extract all these features in one step using a feature engineering pipeline.

In [12]:
# Let's re-load the data, to start
# from scratch.

data = load_data()

data.head()

Date_Time,CO_sensor,RH
2004-04-04 00:00:00,1224.0,56.5
2004-04-04 01:00:00,1215.0,59.2
2004-04-04 02:00:00,1115.0,62.4
2004-04-04 03:00:00,1124.0,65.0
2004-04-04 04:00:00,1028.0,65.3


In [13]:
# We line up the engineering steps within
# a pipeline.

pipe = Pipeline(
    [
        ("datetime_features", dtf),
        ("lagf", lagf),
        ("winf", winf),
        ("Periodic", cyclicf),
        ("dropna", imputer),
        ("drop_ts", drop_ts),
    ]
)

In [14]:
# Fit the pipeline to the data and add
# features.

data = pipe.fit_transform(data)

data.head()

Date_Time,month,week,day_of_week,day_of_month,hour,weekend,CO_sensor_lag_1H,RH_lag_1H,CO_sensor_lag_24H,RH_lag_24H,CO_sensor_window_3H_mean,RH_window_3H_mean,month_sin,month_cos,hour_sin,hour_cos
2004-04-05 00:00:00,4,15,0,5,0,0,1188.0,60.8,1224.0,56.5,1165.666667,58.566667,0.866025,-0.5,0.0,1.0
2004-04-05 01:00:00,4,15,0,5,1,0,1065.0,65.8,1215.0,59.2,1149.666667,61.8,0.866025,-0.5,0.269797,0.962917
2004-04-05 02:00:00,4,15,0,5,2,0,999.0,79.2,1115.0,62.4,1084.0,68.6,0.866025,-0.5,0.519584,0.854419
2004-04-05 03:00:00,4,15,0,5,3,0,911.0,80.0,1124.0,65.0,991.666667,75.0,0.866025,-0.5,0.730836,0.682553
2004-04-05 04:00:00,4,15,0,5,4,0,873.0,81.0,1028.0,65.3,927.666667,80.066667,0.866025,-0.5,0.887885,0.460065
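
The pipeline returns the same dataframe that we obtained by applying the transformers one at a time. Conceptually, fitting and transforming with a pipeline feeds the output of each step into the next one. The sketch below shows that idea with two toy transformers and a hypothetical helper, `run_steps`; it is a simplified illustration, not Scikit-learn's implementation.

```python
import pandas as pd


class AddLag:
    """Toy step: add the previous row's value as a new column."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X["x_lag_1"] = X["x"].shift(1)
        return X

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


class DropNA:
    """Toy step: remove the rows with missing values."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.dropna()

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


def run_steps(steps, X):
    """Apply each (name, transformer) pair in order, feeding the output forward."""
    for _name, step in steps:
        X = step.fit_transform(X)
    return X


toy = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
out = run_steps([("lag", AddLag()), ("dropna", DropNA())], toy)

print(out)
```

The order of the steps matters: the missing values only exist after the lag step has run, so the step that removes them has to come later.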


In the next notebook, we will train a linear regression at the end of this feature engineering pipeline to forecast 1 step ahead.

That is all for this notebook. I hope you enjoyed it!