# Feature engineering for forecasting

[Forecasting with Machine Learning - Course](https://www.trainindata.com/p/forecasting-with-machine-learning)

In this notebook, we will create a table of predictive features and a target from a time series dataset, using [Feature-engine](https://feature-engine.trainindata.com).

In [2]:
import matplotlib.pyplot as plt
import pandas as pd

from feature_engine.datetime import DatetimeFeatures
from feature_engine.timeseries.forecasting import LagFeatures, WindowFeatures

## Load data

We will use the electricity demand dataset found [here](https://github.com/tidyverts/tsibbledata/tree/master/data-raw/vic_elec/VIC2015).

**Citation:**

Godahewa, Rakshitha, Bergmeir, Christoph, Webb, Geoff, Hyndman, Rob, & Montero-Manso, Pablo. (2021). Australian Electricity Demand Dataset (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.4659727

**Description of data:**

A description of the data can be found [here](https://rdrr.io/cran/tsibbledata/man/vic_elec.html). The data contains electricity demand in Victoria, Australia, at 30-minute intervals over a period of about 13 years, from 2002 to early 2015. It also includes the temperature in Melbourne at 30-minute intervals and public holiday dates.

In [3]:
# Electricity demand.
url = "https://raw.githubusercontent.com/tidyverts/tsibbledata/master/data-raw/vic_elec/VIC2015/demand.csv"
df = pd.read_csv(url)

df.drop(columns=["Industrial"], inplace=True)

# Convert the integer Date (days elapsed since 1899-12-30) to a datetime
df["date"] = pd.to_datetime(df["Date"], unit="D", origin="1899-12-30")

# Create a timestamp from the integer Period, which numbers the
# 30-minute intervals within each day starting at 1
df["date_time"] = df["date"] + pd.to_timedelta((df["Period"] - 1) * 30, unit="min")

df.dropna(inplace=True)

# Keep the timestamp and the demand, and rename the columns
df = df[["date_time", "OperationalLessIndustrial"]]

df.columns = ["date_time", "demand"]

# Resample to hourly totals (sum of the two half-hourly readings)
df = (
    df.set_index("date_time")
    .resample("h")
    .agg({"demand": "sum"})
)

df.head()

| date_time | demand |
|---|---|
| 2002-01-01 00:00:00 | 6919.366092 |
| 2002-01-01 01:00:00 | 7165.974188 |
| 2002-01-01 02:00:00 | 6406.542994 |
| 2002-01-01 03:00:00 | 5815.537828 |
| 2002-01-01 04:00:00 | 5497.732922 |
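
To make the date arithmetic in the loading cell concrete, here is a minimal sketch on a toy frame. The `raw` frame and its demand values are made up for illustration; only the column layout (`Date` as a day count from 1899-12-30, `Period` as a half-hour slot starting at 1) follows the loading code above.

```python
import pandas as pd

# Toy rows in the same layout as the raw file (demand values are made up).
raw = pd.DataFrame({
    "Date": [37257, 37257, 37257, 37257],  # days elapsed since 1899-12-30
    "Period": [1, 2, 3, 4],                # half-hour slot within the day
    "demand": [10.0, 20.0, 30.0, 40.0],
})

# Day count -> calendar date.
raw["date"] = pd.to_datetime(raw["Date"], unit="D", origin="1899-12-30")

# Period 1 starts at 00:00, period 2 at 00:30, and so on.
raw["date_time"] = raw["date"] + pd.to_timedelta(
    (raw["Period"] - 1) * 30, unit="min"
)

# Summing the two half-hourly readings in each hour gives the hourly total.
hourly = raw.set_index("date_time")["demand"].resample("h").sum()
print(hourly)
```

Summing is the natural aggregation here if the half-hourly values are treated as amounts per interval; if they were average rates, a mean would be the more appropriate choice.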


## Lag features

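A lag feature is the value of the series at an earlier timestamp. We shift past values of the time series forward, so that each row holds the demand observed a fixed number of hours before.

With Feature-engine, we can create all lags in one go.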

In [7]:
# We'll use the previous value, the value 24 hours before,
# and the value at the same time the prior week.

lag_f = LagFeatures(
    variables="demand",  # if None, it will make lags of all numerical variables
    periods=[1, 24, 7 * 24],  # lags in hours, because the data is hourly
)

df = lag_f.fit_transform(df)

df.head()

| date_time | demand | demand_lag_1 | demand_lag_24 | demand_lag_168 |
|---|---|---|---|---|
| 2002-01-01 00:00:00 | 6919.366092 | NaN | NaN | NaN |
| 2002-01-01 01:00:00 | 7165.974188 | 6919.366092 | NaN | NaN |
| 2002-01-01 02:00:00 | 6406.542994 | 7165.974188 | NaN | NaN |
| 2002-01-01 03:00:00 | 5815.537828 | 6406.542994 | NaN | NaN |
| 2002-01-01 04:00:00 | 5497.732922 | 5815.537828 | NaN | NaN |
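
For intuition, a lag is just a shift of the series along its index. The sketch below reproduces the idea with plain pandas on a toy series; the `toy` frame and its values are made up, and this is an illustration of the concept rather than Feature-engine's internal implementation.

```python
import pandas as pd

# Toy hourly series (values are made up).
idx = pd.date_range("2002-01-01", periods=6, freq="h")
toy = pd.DataFrame({"demand": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]}, index=idx)

# shift(k) moves every value k rows forward, so each row
# receives the demand observed k hours earlier.
for lag in (1, 2):
    toy[f"demand_lag_{lag}"] = toy["demand"].shift(lag)

print(toy)
```

The first `k` rows of each lag column are `NaN`, because no value exists `k` hours before the start of the series.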


## Window features

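We aggregate values within windows in the past. By default, `WindowFeatures` shifts the aggregated values forward by one period, so each window ends at the previous hour and never includes the value we want to predict.

With Feature-engine, we can combine several window sizes with several aggregation functions, all in one go.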

In [6]:
window_f = WindowFeatures(
    variables="demand",  # if None, it will make window features from all numerical variables
    window=[3, 24],  # window sizes, in hours
    functions=["mean", "std"],
    missing_values="ignore",
)

df = window_f.fit_transform(df)

df.head()

| date_time | demand | demand_lag_1 | demand_lag_24 | demand_lag_168 | demand_window_3_mean | demand_window_3_std | demand_window_24_mean | demand_window_24_std |
|---|---|---|---|---|---|---|---|---|
| 2002-01-01 00:00:00 | 6919.366092 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2002-01-01 01:00:00 | 7165.974188 | 6919.366092 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2002-01-01 02:00:00 | 6406.542994 | 7165.974188 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2002-01-01 03:00:00 | 5815.537828 | 6406.542994 | NaN | NaN | 6830.627758 | 387.414253 | NaN | NaN |
| 2002-01-01 04:00:00 | 5497.732922 | 5815.537828 | NaN | NaN | 6462.685003 | 676.966421 | NaN | NaN |
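
The same idea can be sketched with plain pandas: a rolling aggregation followed by a one-step shift. The `toy` frame and its values are made up, and the snippet illustrates the concept rather than Feature-engine's internal implementation.

```python
import pandas as pd

# Toy hourly series (values are made up).
idx = pd.date_range("2002-01-01", periods=6, freq="h")
toy = pd.DataFrame({"demand": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]}, index=idx)

# rolling(3) aggregates the current row and the two rows before it;
# shift(1) then moves the result forward so that the window ends
# at the previous hour and the current value is excluded.
rolled = toy["demand"].rolling(window=3)
toy["demand_window_3_mean"] = rolled.mean().shift(1)
toy["demand_window_3_std"] = rolled.std().shift(1)

print(toy)
```

Without the shift, the window would include the current demand, which is the value we are trying to forecast, and the feature would leak the target.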


## Datetime features

With Feature-engine, we can extract several calendar features from the datetime index in one step. Here we extract the month, the day of the week, and the hour.

In [8]:
date_f = DatetimeFeatures(
    variables="index",
    features_to_extract=["month", "day_of_week", "hour"]
)

df = date_f.fit_transform(df)

df.head()

| date_time | demand | demand_lag_1 | demand_lag_24 | demand_lag_168 | demand_window_3_mean | demand_window_3_std | demand_window_24_mean | demand_window_24_std | month | day_of_week | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2002-01-01 00:00:00 | 6919.366092 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 | 1 | 0 |
| 2002-01-01 01:00:00 | 7165.974188 | 6919.366092 | NaN | NaN | NaN | NaN | NaN | NaN | 1 | 1 | 1 |
| 2002-01-01 02:00:00 | 6406.542994 | 7165.974188 | NaN | NaN | NaN | NaN | NaN | NaN | 1 | 1 | 2 |
| 2002-01-01 03:00:00 | 5815.537828 | 6406.542994 | NaN | NaN | 6830.627758 | 387.414253 | NaN | NaN | 1 | 1 | 3 |
| 2002-01-01 04:00:00 | 5497.732922 | 5815.537828 | NaN | NaN | 6462.685003 | 676.966421 | NaN | NaN | 1 | 1 | 4 |
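
These features correspond to attributes of the pandas `DatetimeIndex`. The sketch below derives them by hand on a toy frame; the `toy` frame and its values are made up, and the column names simply mirror the ones shown in the output above.

```python
import pandas as pd

# Toy hourly series (values are made up).
idx = pd.date_range("2002-01-01", periods=3, freq="h")
toy = pd.DataFrame({"demand": [1.0, 2.0, 3.0]}, index=idx)

toy["month"] = toy.index.month            # 1 = January ... 12 = December
toy["day_of_week"] = toy.index.dayofweek  # 0 = Monday ... 6 = Sunday
toy["hour"] = toy.index.hour              # 0 ... 23

print(toy)
```

1 January 2002 was a Tuesday, which is why `day_of_week` equals 1 in the first rows of the table above.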


## Finalize tabularization

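The lag and window features are undefined at the start of the series: the longest lag is one week, so the first 168 hourly rows contain missing values. We drop those rows, and then separate the data into the table of features and the target variable.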

In [9]:
df.dropna(inplace=True)

y = df["demand"]
X = df.drop("demand", axis=1)

# Predictors

X.head()

| date_time | demand_lag_1 | demand_lag_24 | demand_lag_168 | demand_window_3_mean | demand_window_3_std | demand_window_24_mean | demand_window_24_std | month | day_of_week | hour |
|---|---|---|---|---|---|---|---|---|---|---|
| 2002-01-08 00:00:00 | 7406.04791 | 6808.008916 | 6919.366092 | 7003.111898 | 369.947168 | 7637.818532 | 865.185351 | 1 | 1 | 0 |
| 2002-01-08 01:00:00 | 7077.081904 | 7209.285712 | 7165.974188 | 7053.973603 | 364.178734 | 7649.029907 | 855.655756 | 1 | 1 | 1 |
| 2002-01-08 02:00:00 | 7445.35431 | 6535.818342 | 6406.542994 | 7309.494708 | 202.232618 | 7658.866098 | 851.728742 | 1 | 1 | 2 |
| 2002-01-08 03:00:00 | 6800.577478 | 6112.382636 | 5815.537828 | 7107.671231 | 323.474993 | 7669.897729 | 838.157008 | 1 | 1 | 3 |
| 2002-01-08 04:00:00 | 6340.914086 | 6165.882096 | 5497.732922 | 6862.281958 | 554.799634 | 7679.419873 | 820.811716 | 1 | 1 | 4 |


In [10]:
# target

y.head()

date_time
2002-01-08 00:00:00    7077.081904
2002-01-08 01:00:00    7445.354310
2002-01-08 02:00:00    6800.577478
2002-01-08 03:00:00    6340.914086
2002-01-08 04:00:00    6277.978250
Freq: h, Name: demand, dtype: float64

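That's it! Each row of `X` holds only information that is known before the demand at that timestamp is observed, so we can now train any regression model on `X` and `y` to forecast the electricity demand one hour ahead.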
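
As a rough illustration of that last step, the sketch below builds a small feature table from a synthetic series, splits it in time order, and fits an ordinary least squares model with NumPy. Everything in it is an assumption made for the example: the synthetic series, the two lag features, the 80/20 split, and the choice of a linear model. It is not the model used in the course.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series with a daily cycle (made up, not the demand data).
idx = pd.date_range("2002-01-01", periods=24 * 30, freq="h")
rng = np.random.default_rng(0)
hours = idx.hour.to_numpy()
values = 100 + 10 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 1, len(idx))
toy = pd.DataFrame({"demand": values}, index=idx)

# Two lag features, then drop the rows where they are undefined.
toy["demand_lag_1"] = toy["demand"].shift(1)
toy["demand_lag_24"] = toy["demand"].shift(24)
toy = toy.dropna()

y_toy = toy["demand"]
X_toy = toy.drop(columns="demand")

# Split by time, not at random: the test period must follow the training period.
split = int(len(toy) * 0.8)
X_train, X_test = X_toy.iloc[:split], X_toy.iloc[split:]
y_train, y_test = y_toy.iloc[:split], y_toy.iloc[split:]

# Ordinary least squares with an intercept column.
A_train = np.column_stack([np.ones(len(X_train)), X_train.to_numpy()])
coef, *_ = np.linalg.lstsq(A_train, y_train.to_numpy(), rcond=None)

# One-hour-ahead predictions on the held-out period.
A_test = np.column_stack([np.ones(len(X_test)), X_test.to_numpy()])
pred = A_test @ coef
mae = np.mean(np.abs(pred - y_test.to_numpy()))
print(f"Test MAE: {mae:.2f}")
```

The time-ordered split matters because a random split would let the model train on observations that come after the ones it is evaluated on.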

In this notebook, we only extracted features from the time series. We can add more features from external data sources. We will address that in coming notebooks.