# Feature engineering

- In this notebook, we will create features for predicting the CO concentration in the next hour.

- We assume that we have data available up to the hour before the forecast.

<img src='../images/forecasting_framework.png' width="600" height="600">

We want to predict the pollutant concentration at time t, and we know the concentration up to t-1. So for each t, we can only use data up to t-1.

The exception is the timestamp: we know the time for which we want to predict pollutants, so features derived from the timestamp can be computed for t itself.

Let's create some features.

In [1]:
import numpy as np
import pandas as pd
from feature_engine.creation import CyclicalFeatures

## Load data

In [2]:
# This function summarizes the various steps in
# the previous notebook.

def load_data():

    # Data lives here.
    filename = "../datasets/AirQualityUCI_ready.csv"

    # Load data: only the time variable, CO and RH.
    data = pd.read_csv(
        filename,
        usecols=["Date_Time", "CO_sensor", "RH"],
        parse_dates=["Date_Time"],
        index_col=["Date_Time"],
    )

    # Sanity: sort index.
    data.sort_index(inplace=True)

    # Reduce data span.
    data = data.loc["2004-04-01":"2005-04-30"]

    # Remove outliers (negative readings of CO or RH).
    data = data.loc[(data["CO_sensor"] >= 0) & (data["RH"] >= 0)]

    return data

In [3]:
# Load data.

data = load_data()

data.head()

Date_Time,CO_sensor,RH
2004-04-04 00:00:00,1224.0,56.5
2004-04-04 01:00:00,1215.0,59.2
2004-04-04 02:00:00,1115.0,62.4
2004-04-04 03:00:00,1124.0,65.0
2004-04-04 04:00:00,1028.0,65.3


- Date_Time: the timestamp, stored in the index.

- CO_sensor: carbon monoxide concentration.

- RH: relative humidity (in the air).

## Extract time related features

These are features that capture information from the timestamp.

In [4]:
# Extract date and time features.

data["Month"] = data.index.month
data["Week"] = data.index.isocalendar().week
data["Day"] = data.index.day
data["Day_of_week"] = data.index.day_of_week
data["Hour"] = data.index.hour

# find out if it is a weekend.
data["is_weekend"] = np.where(data["Day_of_week"]>4, 1, 0)

# Show new variables
data.head()

Date_Time,CO_sensor,RH,Month,Week,Day,Day_of_week,Hour,is_weekend
2004-04-04 00:00:00,1224.0,56.5,4,14,4,6,0,1
2004-04-04 01:00:00,1215.0,59.2,4,14,4,6,1,1
2004-04-04 02:00:00,1115.0,62.4,4,14,4,6,2,1
2004-04-04 03:00:00,1124.0,65.0,4,14,4,6,3,1
2004-04-04 04:00:00,1028.0,65.3,4,14,4,6,4,1


## Lag features

Lag features are past values of the variable that we can use to predict future values.

<img src='../images/lag_features.png' width="600" height="600">


I will use the following lag features to predict the next hour's pollutant concentration:

- The pollutant concentration for the previous hour (t-1).

- The pollutant concentration for the same hour on the previous day (t-24).

The reasoning behind this is that pollutant concentrations do not change quickly and, as previously demonstrated, have a 24-hour seasonality.

**We need to be careful because we do not have values for all timestamps. To be safe, we must shift the data by a time offset (the `freq` argument in pandas) rather than by a number of rows.**
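To see why this matters, here is a minimal sketch on a made-up three-row series with one missing hour. The `toy` frame and its values are illustrative and not taken from the dataset. Shifting by rows pairs 03:00 with the value from 01:00, which is two hours earlier, whereas shifting by a time offset leaves that row empty.

```python
import pandas as pd

# Hypothetical series: the 02:00 reading is missing.
idx = pd.to_datetime(
    ["2004-04-04 00:00", "2004-04-04 01:00", "2004-04-04 03:00"]
)
toy = pd.DataFrame({"CO_sensor": [10.0, 20.0, 30.0]}, index=idx)

# Shift by one row: ignores the gap in the timestamps.
toy["lag_1_rows"] = toy["CO_sensor"].shift(1)

# Shift the index by one hour, then align back on the original timestamps.
shifted = toy["CO_sensor"].shift(freq=pd.Timedelta(hours=1))
toy["lag_1_freq"] = shifted.reindex(toy.index)

print(toy)
```

In this sketch, `reindex` plays the same role as the left merge on the index used in the cells below: only timestamps present in the original data are kept.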

In [5]:
# Here, I show how to move the variables forward by 1 hr,
# so that the pollutant concentration from the previous
# hour (t-1) is aligned with the current hour (t),
# which is the forecasting point.

# raw time series
variables = ["CO_sensor", "RH"]

# Shift the data forward 1 Hr.
tmp = data[variables].shift(freq="1H")

# Names for the new variables.
tmp.columns = [v + "_lag_1" for v in variables]

# Add the variables to the original data.
print("data size before")
print(data.shape)

data = data.merge(tmp, left_index=True, right_index=True, how="left")

print("data size after")
print(data.shape)

data.head()

data size before
(7393, 8)
data size after
(7393, 10)


Date_Time,CO_sensor,RH,Month,Week,Day,Day_of_week,Hour,is_weekend,CO_sensor_lag_1,RH_lag_1
2004-04-04 00:00:00,1224.0,56.5,4,14,4,6,0,1,,
2004-04-04 01:00:00,1215.0,59.2,4,14,4,6,1,1,1224.0,56.5
2004-04-04 02:00:00,1115.0,62.4,4,14,4,6,2,1,1215.0,59.2
2004-04-04 03:00:00,1124.0,65.0,4,14,4,6,3,1,1115.0,62.4
2004-04-04 04:00:00,1028.0,65.3,4,14,4,6,4,1,1124.0,65.0


In [6]:
data[["CO_sensor", "CO_sensor_lag_1"]].head()

Date_Time,CO_sensor,CO_sensor_lag_1
2004-04-04 00:00:00,1224.0,
2004-04-04 01:00:00,1215.0,1224.0
2004-04-04 02:00:00,1115.0,1215.0
2004-04-04 03:00:00,1124.0,1115.0
2004-04-04 04:00:00,1028.0,1124.0


We see, for example, that the value 1224 has moved forward one hour, from 00:00 to 01:00.

In [7]:
# In this procedure, we introduced missing
# data whenever there was no data available in
# the previous hour.

data.isnull().sum()

CO_sensor           0
RH                  0
Month               0
Week                0
Day                 0
Day_of_week         0
Hour                0
is_weekend          0
CO_sensor_lag_1    27
RH_lag_1           27
dtype: int64

Our timestamps are not equidistant. This means that not every row has information from the previous hour.
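One way to check this, sketched here on a hypothetical index (`idx` is made up for illustration), is to count the timestamps whose previous hour is absent. Applied to the real index, this count would be expected to match the number of missing values in the lag-1 columns.

```python
import pandas as pd

# Hypothetical index: the 02:00 timestamp is missing.
idx = pd.to_datetime(
    [
        "2004-04-04 00:00",
        "2004-04-04 01:00",
        "2004-04-04 03:00",
        "2004-04-04 04:00",
    ]
)

one_hour = pd.Timedelta(hours=1)

# For each timestamp, check whether the timestamp one hour earlier exists.
has_previous_hour = (idx - one_hour).isin(idx)

# Timestamps without a previous hour: the first row and the row after the gap.
n_without_previous = int((~has_previous_hour).sum())
print(idx[~has_previous_hour])
print(n_without_previous)
```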

In [8]:
# Now we repeat the exercise, but this time
# the values are moved forward 24 hours.

# Move forward 24 hrs.
tmp = data[variables].shift(freq="24H")

# Rename the variables.
tmp.columns = [v + "_lag_24" for v in variables]

# Add the features to the original data.
print("data size before")
print(data.shape)

data = data.merge(tmp, left_index=True, right_index=True, how="left")

print("data size after")
print(data.shape)

data[["CO_sensor", "CO_sensor_lag_24"]].head(25)

data size before
(7393, 10)
data size after
(7393, 12)


Date_Time,CO_sensor,CO_sensor_lag_24
2004-04-04 00:00:00,1224.0,
2004-04-04 01:00:00,1215.0,
2004-04-04 02:00:00,1115.0,
2004-04-04 03:00:00,1124.0,
2004-04-04 04:00:00,1028.0,
2004-04-04 05:00:00,1010.0,
2004-04-04 06:00:00,1074.0,
2004-04-04 07:00:00,1034.0,
2004-04-04 08:00:00,1130.0,
2004-04-04 09:00:00,1275.0,


See how 1224, which is the value corresponding to April 4 at midnight, is now located on April 5th at midnight.

We have NA for all previous rows because there is no information about the pollutant concentration 24 hours before for those rows.

In [9]:
# In this procedure, we introduced missing
# data whenever there was no data available
# 24 hours earlier.

data.isnull().sum()

CO_sensor             0
RH                    0
Month                 0
Week                  0
Day                   0
Day_of_week           0
Hour                  0
is_weekend            0
CO_sensor_lag_1      27
RH_lag_1             27
CO_sensor_lag_24    461
RH_lag_24           461
dtype: int64

## Window features

Window features are mathematical computations of the features' values over a pre-defined time window, prior to the time we want to forecast.

<img src='../images/window_features.png' width="600" height="600">

For the demonstration, I will take the average of the time series over the previous 3 hours to predict the current value.

We first need to calculate the average of the 3 previous values, and then move that value forward.

In [10]:
# Use the mean of the 3 previous hours as input variables.

tmp = (
    data[variables]
    .rolling(window="3H")
    .mean()  # Average the last 3 hr values.
    .shift(freq="1H")  # Move the average 1 hour forward
)

# Rename the columns
tmp.columns = [v + "_window" for v in variables]


# view of the result
tmp.head(10)

Date_Time,CO_sensor_window,RH_window
2004-04-04 01:00:00,1224.0,56.5
2004-04-04 02:00:00,1219.5,57.85
2004-04-04 03:00:00,1184.666667,59.366667
2004-04-04 04:00:00,1151.333333,62.2
2004-04-04 05:00:00,1089.0,64.233333
2004-04-04 06:00:00,1054.0,65.6
2004-04-04 07:00:00,1037.333333,66.966667
2004-04-04 08:00:00,1039.333333,66.8
2004-04-04 09:00:00,1079.333333,64.3
2004-04-04 10:00:00,1146.333333,57.866667


In [11]:
# Join the new variables to the original data.
print("data size before")
print(data.shape)

data = data.merge(tmp, left_index=True, right_index=True, how="left")

print("data size after")
print(data.shape)

data[["CO_sensor", "CO_sensor_window"]].head()

data size before
(7393, 12)
data size after
(7393, 14)


Date_Time,CO_sensor,CO_sensor_window
2004-04-04 00:00:00,1224.0,
2004-04-04 01:00:00,1215.0,1224.0
2004-04-04 02:00:00,1115.0,1219.5
2004-04-04 03:00:00,1124.0,1184.666667
2004-04-04 04:00:00,1028.0,1151.333333


Now we do some manual calculations to convince ourselves of the results.

In [12]:
(1215 + 1224) / 2

1219.5

In [13]:
(1115 + 1215 + 1224) / 3

1184.6666666666667

**Important:** Notice how the average of the previous three hours was moved forward an hour to time t, the time we want to forecast.
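As a cross-check, the same feature can be sketched in a single step by asking pandas to leave the current row out of the window with `closed="left"`. This is an alternative formulation, not the one used in this notebook, and the toy series below simply reuses the first six CO values shown above. On an evenly spaced series, the two approaches should agree wherever both are defined.

```python
import pandas as pd

one_hour = pd.Timedelta(hours=1)
three_hours = pd.Timedelta(hours=3)

# Toy hourly series with the first six CO values from the data.
idx = pd.date_range("2004-04-04", periods=6, freq=one_hour)
toy = pd.Series([1224.0, 1215.0, 1115.0, 1124.0, 1028.0, 1010.0], index=idx)

# Approach used in this notebook: average the last 3 hours,
# then move the result 1 hour forward.
shifted = toy.rolling(window=three_hours).mean().shift(freq=one_hour)

# Alternative sketch: the window covers [t - 3h, t), so the value
# at t itself is not part of the average.
left_closed = toy.rolling(window=three_hours, closed="left").mean()

# Compare on the timestamps where both versions have a value.
comparison = pd.DataFrame(
    {"shifted": shifted, "left_closed": left_closed}
).loc[idx[1:]]

print(comparison)
```

Note that the alternative returns a missing value for the first row, because there is no earlier observation to average.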

## Periodic features

Some features are periodic, for example hours, days, and months.

We can encode those periodic features using a sine and cosine transformation with the feature's period. This brings together values that are far apart numerically but adjacent in the cycle. For example, December (12) is closer to January (1) than June (6) is, yet the plain numerical representation of these features suggests the opposite. Transforming these variables with sine and cosine captures this relationship.
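A minimal sketch of the idea with plain NumPy, using a hypothetical `months` frame. It assumes the angle is computed as 2π × value / maximum value, which is consistent with the `Month_sin` and `Month_cos` outputs shown later in this notebook. Check the Feature-engine documentation for the exact definition.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one row per month.
months = pd.DataFrame({"Month": np.arange(1, 13)})

# Assumption: scale by the maximum value of the variable (12 for months).
max_value = months["Month"].max()
angle = 2 * np.pi * months["Month"] / max_value

months["Month_sin"] = np.sin(angle)
months["Month_cos"] = np.cos(angle)

# Coordinates of each month on the unit circle.
coords = months.set_index("Month")[["Month_sin", "Month_cos"]]


def distance(m1, m2):
    """Euclidean distance between two months in (sin, cos) space."""
    return float(np.linalg.norm(coords.loc[m1].to_numpy() - coords.loc[m2].to_numpy()))


# December ends up much closer to January than June does.
print(distance(12, 1), distance(6, 1))
```

Both the sine and the cosine are needed: with only one of them, two different months can share the same value.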

We will discuss this technique later on in the course. For now, let's create these features automatically with the open source library Feature-engine.

In [14]:
# Create features that capture the cyclical representation.

cyclical = CyclicalFeatures(
    variables=["Month", "Hour"],  # The features we want to transform.
    drop_original=False,  # Whether to drop the original features.
)

data = cyclical.fit_transform(data)

In [15]:
cyclical_vars = [var for var in data.columns if "sin" in var or "cos" in var]

data[cyclical_vars].head()

Date_Time,Month_sin,Month_cos,Hour_sin,Hour_cos
2004-04-04 00:00:00,0.866025,-0.5,0.0,1.0
2004-04-04 01:00:00,0.866025,-0.5,0.269797,0.962917
2004-04-04 02:00:00,0.866025,-0.5,0.519584,0.854419
2004-04-04 03:00:00,0.866025,-0.5,0.730836,0.682553
2004-04-04 04:00:00,0.866025,-0.5,0.887885,0.460065


We can see the newly created features at the end of the dataframe.

## Drop missing data

When creating lag and window features, we introduced missing data. 

In [16]:
# Determine fraction of missing data.

data.isnull().sum() / len(data)

CO_sensor           0.000000
RH                  0.000000
Month               0.000000
Week                0.000000
Day                 0.000000
Day_of_week         0.000000
Hour                0.000000
is_weekend          0.000000
CO_sensor_lag_1     0.003652
RH_lag_1            0.003652
CO_sensor_lag_24    0.062356
RH_lag_24           0.062356
CO_sensor_window    0.003652
RH_window           0.003652
Month_sin           0.000000
Month_cos           0.000000
Hour_sin            0.000000
Hour_cos            0.000000
dtype: float64

## Imputation

There is not a lot of data missing, so instead of imputing it, I will just remove those observations.

In [17]:
print("data size before")
print(data.shape)

data.dropna(inplace=True)

print("data size after")
print(data.shape)

data size before
(7393, 18)
data size after
(6922, 18)


## Seasonality features

We know that the pollutant concentration has an intraday seasonality, and we want to capture it.

At this point, we will use the entire dataset to create these features. In the next section we will explain why this is not correct.
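One plausible reading of this caveat is that averaging over the whole dataset uses observations that come after the forecasting point. The sketch below, on made-up data, shows one way to compute the hourly means from an earlier portion only and then map them onto every row. The `toy` frame and the `split_point` date are hypothetical, and this is not the approach used in the cells that follow.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly data covering 4 days.
idx = pd.date_range("2004-04-05", periods=96, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"CO_sensor": np.arange(96, dtype=float)}, index=idx)
toy["Hour"] = toy.index.hour

# Assumption: everything before this date is treated as the training portion.
split_point = pd.Timestamp("2004-04-07")
train = toy.loc[toy.index < split_point]

# Mean concentration per hour of the day, from the training portion only.
hourly_mean = train.groupby("Hour")["CO_sensor"].mean()

# Map the hourly means onto every row; unlike a merge on a column,
# this keeps the datetime index intact.
toy["CO_sensor_season"] = toy["Hour"].map(hourly_mean)

print(toy.head())
```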

In [18]:
tmp = data.groupby(["Hour"])[variables].mean()

# Rename the new variables.
tmp.columns = [v + "_season" for v in variables]

tmp.head()

Hour,CO_sensor_season,RH_season
0,1046.262411,54.527305
1,985.395105,55.729021
2,925.888889,56.929167
3,888.306897,58.190345
4,870.66323,59.357388


In [19]:
tmp.merge(data, on="Hour", how="left")

,Hour,CO_sensor_season,RH_season,CO_sensor,RH,Month,Week,Day,Day_of_week,is_weekend,CO_sensor_lag_1,RH_lag_1,CO_sensor_lag_24,RH_lag_24,CO_sensor_window,RH_window,Month_sin,Month_cos,Hour_sin,Hour_cos
0,0,1046.262411,54.527305,1065.0,65.8,4,15,5,0,0,1188.0,60.8,1224.0,56.5,1165.666667,58.566667,0.866025,-5.000000e-01,0.000000e+00,1.0
1,0,1046.262411,54.527305,974.0,49.2,4,15,6,1,0,932.0,59.5,1065.0,65.8,996.333333,62.100000,0.866025,-5.000000e-01,0.000000e+00,1.0
2,0,1046.262411,54.527305,1217.0,51.9,4,15,7,2,0,1120.0,44.7,974.0,49.2,1036.333333,40.866667,0.866025,-5.000000e-01,0.000000e+00,1.0
3,0,1046.262411,54.527305,1074.0,51.5,4,15,8,3,0,1086.0,39.2,1217.0,51.9,1041.666667,34.533333,0.866025,-5.000000e-01,0.000000e+00,1.0
4,0,1046.262411,54.527305,1197.0,45.3,4,15,9,4,0,939.0,49.4,1074.0,51.5,956.333333,49.100000,0.866025,-5.000000e-01,0.000000e+00,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6917,23,1068.169611,53.007774,1107.0,68.9,3,13,28,0,0,1120.0,65.8,1160.0,77.7,1128.666667,60.266667,1.000000,6.123234e-17,-2.449294e-16,1.0
6918,23,1068.169611,53.007774,1025.0,70.8,3,13,29,1,0,1010.0,67.3,1107.0,68.9,1140.666667,64.666667,1.000000,6.123234e-17,-2.449294e-16,1.0
6919,23,1068.169611,53.007774,1090.0,65.4,3,13,30,2,0,1066.0,61.9,1025.0,70.8,1158.000000,57.566667,1.000000,6.123234e-17,-2.449294e-16,1.0
6920,23,1068.169611,53.007774,886.0,47.6,3,13,31,3,0,886.0,47.1,1090.0,65.4,919.000000,46.400000,1.000000,6.123234e-17,-2.449294e-16,1.0


In [20]:
# Join the new variables to the original data.
print("data size before")
print(data.shape)

# save index for later
index_ = data.index

data = data.merge(tmp, on="Hour", how="left")

# add index
data.index = index_

print("data size after")
print(data.shape)

data[["CO_sensor", "CO_sensor_season"]].head()

data size before
(6922, 18)
data size after
(6922, 20)


Date_Time,CO_sensor,CO_sensor_season
2004-04-05 00:00:00,1065.0,1046.262411
2004-04-05 01:00:00,999.0,985.395105
2004-04-05 02:00:00,911.0,925.888889
2004-04-05 03:00:00,873.0,888.306897
2004-04-05 04:00:00,881.0,870.66323


In [21]:
# Inspect the data with the new variables added.

data.head()

Date_Time,CO_sensor,RH,Month,Week,Day,Day_of_week,Hour,is_weekend,CO_sensor_lag_1,RH_lag_1,CO_sensor_lag_24,RH_lag_24,CO_sensor_window,RH_window,Month_sin,Month_cos,Hour_sin,Hour_cos,CO_sensor_season,RH_season
2004-04-05 00:00:00,1065.0,65.8,4,15,5,0,0,0,1188.0,60.8,1224.0,56.5,1165.666667,58.566667,0.866025,-0.5,0.0,1.0,1046.262411,54.527305
2004-04-05 01:00:00,999.0,79.2,4,15,5,0,1,0,1065.0,65.8,1215.0,59.2,1149.666667,61.8,0.866025,-0.5,0.269797,0.962917,985.395105,55.729021
2004-04-05 02:00:00,911.0,80.0,4,15,5,0,2,0,999.0,79.2,1115.0,62.4,1084.0,68.6,0.866025,-0.5,0.519584,0.854419,925.888889,56.929167
2004-04-05 03:00:00,873.0,81.0,4,15,5,0,3,0,911.0,80.0,1124.0,65.0,991.666667,75.0,0.866025,-0.5,0.730836,0.682553,888.306897,58.190345
2004-04-05 04:00:00,881.0,81.0,4,15,5,0,4,0,873.0,81.0,1028.0,65.3,927.666667,80.066667,0.866025,-0.5,0.887885,0.460065,870.66323,59.357388


In [22]:
# Drop the raw relative humidity (we do not know its
# value at the time of the forecast).

data.drop("RH", inplace=True, axis=1)

data.head()

Date_Time,CO_sensor,Month,Week,Day,Day_of_week,Hour,is_weekend,CO_sensor_lag_1,RH_lag_1,CO_sensor_lag_24,RH_lag_24,CO_sensor_window,RH_window,Month_sin,Month_cos,Hour_sin,Hour_cos,CO_sensor_season,RH_season
2004-04-05 00:00:00,1065.0,4,15,5,0,0,0,1188.0,60.8,1224.0,56.5,1165.666667,58.566667,0.866025,-0.5,0.0,1.0,1046.262411,54.527305
2004-04-05 01:00:00,999.0,4,15,5,0,1,0,1065.0,65.8,1215.0,59.2,1149.666667,61.8,0.866025,-0.5,0.269797,0.962917,985.395105,55.729021
2004-04-05 02:00:00,911.0,4,15,5,0,2,0,999.0,79.2,1115.0,62.4,1084.0,68.6,0.866025,-0.5,0.519584,0.854419,925.888889,56.929167
2004-04-05 03:00:00,873.0,4,15,5,0,3,0,911.0,80.0,1124.0,65.0,991.666667,75.0,0.866025,-0.5,0.730836,0.682553,888.306897,58.190345
2004-04-05 04:00:00,881.0,4,15,5,0,4,0,873.0,81.0,1028.0,65.3,927.666667,80.066667,0.866025,-0.5,0.887885,0.460065,870.66323,59.357388


In [23]:
# store new dataset

data.to_csv("air_qual_preprocessed.csv", index=True)