# Feature engineering for forecasting

[Forecasting with Machine Learning - Course](https://www.trainindata.com/p/forecasting-with-machine-learning)

In this notebook, we will create a table of predictive features and a target from a time series dataset, using [Feature-engine](https://feature-engine.trainindata.com).

In [1]:
import matplotlib.pyplot as plt
import pandas as pd

from feature_engine.datetime import DatetimeFeatures
from feature_engine.timeseries.forecasting import LagFeatures, WindowFeatures

# Load data

We will use the electricity demand dataset found [here](https://github.com/tidyverts/tsibbledata/tree/master/data-raw/vic_elec/VIC2015).

**Citation:**

Godahewa, Rakshitha, Bergmeir, Christoph, Webb, Geoff, Hyndman, Rob, & Montero-Manso, Pablo. (2021). Australian Electricity Demand Dataset (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.4659727

**Description of data:**

A description of the data can be found [here](https://rdrr.io/cran/tsibbledata/man/vic_elec.html). The data contains electricity demand in Victoria, Australia, at 30-minute intervals over a period of about 13 years, from 2002 to early 2015. It also includes the temperature in Melbourne at 30-minute intervals and public holiday dates.

In [2]:
# Electricity demand.
url = "https://raw.githubusercontent.com/tidyverts/tsibbledata/master/data-raw/vic_elec/VIC2015/demand.csv"
df = pd.read_csv(url)

df.drop(columns=["Industrial"], inplace=True)

# Convert the integer Date to an actual date with datetime type
df["date"] = df["Date"].apply(
    lambda x: pd.Timestamp("1899-12-30") + pd.Timedelta(x, unit="days")
)

# Create a timestamp from the integer Period representing 30 minute intervals
df["date_time"] = df["date"] + \
    pd.to_timedelta((df["Period"] - 1) * 30, unit="m")

df.dropna(inplace=True)

# Rename columns
df = df[["date_time", "OperationalLessIndustrial"]]

df.columns = ["date_time", "demand"]

# Resample to hourly
df = (
    df.set_index("date_time")
    .resample("H")
    .agg({"demand": "sum"})
)

df.head()

| date_time | demand |
|---|---|
| 2002-01-01 00:00:00 | 6919.366092 |
| 2002-01-01 01:00:00 | 7165.974188 |
| 2002-01-01 02:00:00 | 6406.542994 |
| 2002-01-01 03:00:00 | 5815.537828 |
| 2002-01-01 04:00:00 | 5497.732922 |
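
The date handling in the loading cell can be checked on a few hand-made rows. The sketch below is not part of the original notebook: the toy values are made up, and it assumes the `Date` column holds spreadsheet-style serial days counted from 1899-12-30 (as the loading code implies) and that `Period` numbers the half-hours of the day starting at 1. It uses the vectorised `pd.to_datetime(..., unit="D", origin=...)` form rather than the row-wise lambda, and the `"60min"` alias instead of `"H"` so that it runs across pandas versions.

```python
import pandas as pd

# Toy rows mimicking the raw file (values are made up for illustration):
# a serial day number and a half-hour period within that day.
raw = pd.DataFrame(
    {
        "Date": [37257, 37257, 37257, 37257],
        "Period": [1, 2, 3, 4],
        "demand": [10.0, 20.0, 30.0, 40.0],
    }
)

# Serial day -> calendar date, counting days from 1899-12-30.
raw["date"] = pd.to_datetime(raw["Date"], unit="D", origin="1899-12-30")

# Period 1 starts at 00:00, period 2 at 00:30, and so on.
raw["date_time"] = raw["date"] + pd.to_timedelta(
    (raw["Period"] - 1) * 30, unit="min"
)

# Sum the two half-hour readings that fall in each hour.
hourly = raw.set_index("date_time")["demand"].resample("60min").sum()
print(hourly)
```

Each hourly value is the sum of its two half-hour readings, which is what the `.agg({"demand": "sum"})` step does on the real data.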


## Lag features

Lag features are past values of the time series shifted forward, so that each row also holds the value observed a given number of steps earlier.

With Feature-engine, we can create all lags in one go.

In [3]:
# We'll use the previous value, the value 24 hours before,
# and the value 6 * 24 = 144 hours (6 days) before.
# Note: the same hour in the prior week would be 7 * 24 = 168.

lag_f = LagFeatures(
    variables="demand",  # if None, it will make lags of all numerical variables
    periods=[1, 24, 6 * 24],
)

df = lag_f.fit_transform(df)

df.head()

| date_time | demand | demand_lag_1 | demand_lag_24 | demand_lag_144 |
|---|---|---|---|---|
| 2002-01-01 00:00:00 | 6919.366092 | NaN | NaN | NaN |
| 2002-01-01 01:00:00 | 7165.974188 | 6919.366092 | NaN | NaN |
| 2002-01-01 02:00:00 | 6406.542994 | 7165.974188 | NaN | NaN |
| 2002-01-01 03:00:00 | 5815.537828 | 6406.542994 | NaN | NaN |
| 2002-01-01 04:00:00 | 5497.732922 | 5815.537828 | NaN | NaN |
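
To see what a lag is doing, it can help to reproduce it with plain pandas on a tiny series. The sketch below uses made-up values and `Series.shift`; it is an illustration of the idea, not a description of how `LagFeatures` is implemented internally. The column names simply mimic the ones shown in the output above.

```python
import pandas as pd

# Tiny hourly series with made-up values.
idx = pd.date_range("2002-01-01", periods=6, freq="60min")
toy = pd.DataFrame({"demand": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=idx)

# shift(k) moves values k rows forward, so each row sees the value
# observed k steps earlier. The first k rows have no past value (NaN).
for lag in [1, 2]:
    toy[f"demand_lag_{lag}"] = toy["demand"].shift(lag)

print(toy)
```

The leading rows are missing for the same reason they are missing in the table above: there is no earlier observation to copy from.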


## Window features

Window features aggregate past values within a window, for example the mean demand over the previous 3 hours.

With Feature-engine, we can combine several window sizes with several aggregation functions, all in one go.

In [4]:
window_f = WindowFeatures(
    variables="demand",  # if None, it will make window features from all numerical variables
    window=[3, 24],
    functions=["mean", "std"],
    missing_values="ignore",
)

df = window_f.fit_transform(df)

df.head()

| date_time | demand | demand_lag_1 | demand_lag_24 | demand_lag_144 | demand_window_3_mean | demand_window_3_std | demand_window_24_mean | demand_window_24_std |
|---|---|---|---|---|---|---|---|---|
| 2002-01-01 00:00:00 | 6919.366092 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2002-01-01 01:00:00 | 7165.974188 | 6919.366092 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2002-01-01 02:00:00 | 6406.542994 | 7165.974188 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2002-01-01 03:00:00 | 5815.537828 | 6406.542994 | NaN | NaN | 6830.627758 | 387.414253 | NaN | NaN |
| 2002-01-01 04:00:00 | 5497.732922 | 5815.537828 | NaN | NaN | 6462.685003 | 676.966421 | NaN | NaN |
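
In the output above, the 3-hour mean at 03:00 (6830.627758) equals the average of the demand at 00:00, 01:00 and 02:00, so each window ends at the previous row and does not include the current value. A rough pandas equivalent, shown here on made-up data, is a rolling aggregation followed by a one-step shift. This is a sketch of the idea under that assumption, not the library's actual implementation.

```python
import pandas as pd

# Tiny hourly series with made-up values.
idx = pd.date_range("2002-01-01", periods=6, freq="60min")
toy = pd.DataFrame({"demand": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=idx)

# Aggregate over the last 3 rows, then shift by one step so the window
# only covers values strictly before the current timestamp.
rolled = toy["demand"].rolling(window=3).agg(["mean", "std"]).shift(1)
rolled.columns = [f"demand_window_3_{name}" for name in rolled.columns]

toy = toy.join(rolled)
print(toy)
```

Without the shift, the window would include the value we are trying to predict, which would leak the target into the features.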


## Datetime features

With Feature-engine, we can extract several features from the datetime index automatically, here the month, the day of the week, and the hour.

In [5]:
date_f = DatetimeFeatures(
    variables="index",
    features_to_extract=["month", "day_of_week", "hour"]
)

df = date_f.fit_transform(df)

df.head()

| date_time | demand | demand_lag_1 | demand_lag_24 | demand_lag_144 | demand_window_3_mean | demand_window_3_std | demand_window_24_mean | demand_window_24_std | month | day_of_week | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2002-01-01 00:00:00 | 6919.366092 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1 | 1 | 0 |
| 2002-01-01 01:00:00 | 7165.974188 | 6919.366092 | NaN | NaN | NaN | NaN | NaN | NaN | 1 | 1 | 1 |
| 2002-01-01 02:00:00 | 6406.542994 | 7165.974188 | NaN | NaN | NaN | NaN | NaN | NaN | 1 | 1 | 2 |
| 2002-01-01 03:00:00 | 5815.537828 | 6406.542994 | NaN | NaN | 6830.627758 | 387.414253 | NaN | NaN | 1 | 1 | 3 |
| 2002-01-01 04:00:00 | 5497.732922 | 5815.537828 | NaN | NaN | 6462.685003 | 676.966421 | NaN | NaN | 1 | 1 | 4 |
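
The same three features can be read directly from a pandas `DatetimeIndex`, which is a quick way to check what the numbers mean. The sketch below is an illustration with a hand-made index, not the transformer's internals. Note that pandas counts days of the week from Monday = 0, which matches the value 1 shown above for Tuesday, 1 January 2002.

```python
import pandas as pd

# Three consecutive hours starting on Tuesday, 1 January 2002.
idx = pd.date_range("2002-01-01", periods=3, freq="60min")

feats = pd.DataFrame(
    {
        "month": idx.month,            # 1 = January
        "day_of_week": idx.dayofweek,  # Monday = 0, so Tuesday = 1
        "hour": idx.hour,              # 0 to 23
    },
    index=idx,
)
print(feats)
```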


## Finalize tabularization

Now we drop the rows with missing values introduced by the lags and windows, and separate the data into the table of features and the target variable.

In [6]:
df.dropna(inplace=True)

y = df["demand"]
X = df.drop("demand", axis=1)

# Predictors

X.head()

| date_time | demand_lag_1 | demand_lag_24 | demand_lag_144 | demand_window_3_mean | demand_window_3_std | demand_window_24_mean | demand_window_24_std | month | day_of_week | hour |
|---|---|---|---|---|---|---|---|---|---|---|
| 2002-01-07 00:00:00 | 7290.234268 | 6722.984526 | 6919.366092 | 7060.366613 | 268.288238 | 7172.477074 | 882.158974 | 1 | 0 | 0 |
| 2002-01-07 01:00:00 | 6808.008916 | 7140.591176 | 7165.974188 | 6954.605689 | 291.436254 | 7176.019757 | 880.444422 | 1 | 0 | 1 |
| 2002-01-07 02:00:00 | 7209.285712 | 6562.022104 | 6406.542994 | 7102.509632 | 258.236657 | 7178.882029 | 880.435899 | 1 | 0 | 2 |
| 2002-01-07 03:00:00 | 6535.818342 | 5976.02078 | 5815.537828 | 6851.037657 | 338.789284 | 7177.790206 | 881.249994 | 1 | 0 | 3 |
| 2002-01-07 04:00:00 | 6112.382636 | 5688.468222 | 5497.732922 | 6619.16223 | 553.180569 | 7183.47195 | 873.570993 | 1 | 0 | 4 |


In [7]:
# target

y.head()

date_time
2002-01-07 00:00:00    6808.008916
2002-01-07 01:00:00    7209.285712
2002-01-07 02:00:00    6535.818342
2002-01-07 03:00:00    6112.382636
2002-01-07 04:00:00    6165.882096
Freq: H, Name: demand, dtype: float64

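That's it! We can now forecast the energy demand in the next hour as a regression problem: every feature in `X` is either a past value of the series or a calendar attribute known in advance.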
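
As a rough illustration of that last step, the sketch below fits a linear regression on a lag table and evaluates it on the most recent rows. Everything in it is an assumption made for the example: the series is synthetic (not the Victoria demand), the 80/20 split point is arbitrary, and ordinary least squares via `numpy.linalg.lstsq` stands in for whichever regressor you prefer. The point is only that the split must be chronological, so the model is never trained on data that comes after the test period.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series with a daily cycle plus noise (illustrative only).
idx = pd.date_range("2002-01-01", periods=24 * 20, freq="60min")
rng = np.random.default_rng(0)
hours = idx.hour.to_numpy()
demand = 100 + 10 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 1, len(idx))

toy = pd.DataFrame({"demand": demand}, index=idx)
toy["demand_lag_1"] = toy["demand"].shift(1)
toy["demand_lag_24"] = toy["demand"].shift(24)
toy = toy.dropna()

y = toy["demand"]
X = toy.drop(columns="demand")

# Chronological split: the first 80% of rows train, the last 20% test.
split = int(len(toy) * 0.8)
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]

# Ordinary least squares with an intercept column.
A_train = np.column_stack([np.ones(len(X_train)), X_train.to_numpy()])
coef, *_ = np.linalg.lstsq(A_train, y_train.to_numpy(), rcond=None)

A_test = np.column_stack([np.ones(len(X_test)), X_test.to_numpy()])
pred = A_test @ coef

rmse = float(np.sqrt(np.mean((y_test.to_numpy() - pred) ** 2)))
print(f"Test RMSE: {rmse:.3f}")
```

A random shuffle split would mix future rows into the training set, which tends to make the error look better than it would be in a real forecast.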

In this notebook, we only extracted features from the time series. We can add more features from external data sources. We will address that in coming notebooks.