# Exercise 1: Multiple independent time series

[Forecasting for machine learning](https://www.trainindata.com/p/forecasting-with-machine-learning)

This notebook is an exercise in forecasting multiple independent time series. The solutions shown are only one way of answering the questions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```

# Data preparation

We will use a dataset of quarterly overnight trips (in thousands) across Australia, from 1998 Q1 to 2016 Q4. The number of trips is split by `State`, `Region`, and `Purpose`.

**In this exercise we are going to forecast the total number of trips for each `Region`. There are 76 regions, so we will have 76 time series. We shall treat this as a multiple independent time series forecasting problem.**

Source: Wang, E, D Cook, and RJ Hyndman (2020). A new tidy data structure to support exploration and modeling of temporal data, Journal of Computational and Graphical Statistics, 29:3, 466-478, doi:10.1080/10618600.2019.1695624.

Shape of the dataset: (24320, 5)

```python
from skforecast.datasets import fetch_dataset

# Load the data
data = fetch_dataset(name="australia_tourism", raw=True)
data.head()
```

australia_tourism
-----------------
Quarterly overnight trips (in thousands) from 1998 Q1 to 2016 Q4 across
Australia. The tourism regions are formed through the aggregation of Statistical
Local Areas (SLAs) which are defined by the various State and Territory tourism
authorities according to their research and marketing needs.
Wang, E, D Cook, and RJ Hyndman (2020). A new tidy data structure to support
exploration and modeling of temporal data, Journal of Computational and
Graphical Statistics, 29:3, 466-478, doi:10.1080/10618600.2019.1695624.
Shape of the dataset: (24320, 5)


|   | date_time | Region | State | Purpose | Trips |
|---|---|---|---|---|---|
| 0 | 1998-01-01 | Adelaide | South Australia | Business | 135.07769 |
| 1 | 1998-04-01 | Adelaide | South Australia | Business | 109.987316 |
| 2 | 1998-07-01 | Adelaide | South Australia | Business | 166.034687 |
| 3 | 1998-10-01 | Adelaide | South Australia | Business | 127.160464 |
| 4 | 1999-01-01 | Adelaide | South Australia | Business | 137.448533 |


Pre-process the data by performing the following:
1) Convert the `date_time` column to datetime type.
2) Create a dataframe with one column per `Region`, giving the total number of `Trips` for each date.
3) Ensure the index is `date_time` and has quarterly-start (`QS`) frequency.
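
The three steps above could be sketched as follows. This is only one possible approach, shown on a small made-up frame (`raw`, with two regions and two purposes) rather than the real dataset; the values are hypothetical. Using `asfreq` instead of `resample` is a choice made here and assumes each quarter already appears exactly once after pivoting.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the raw long-format data (values are made up).
dates = pd.date_range("1998-01-01", periods=4, freq="QS")
raw = pd.DataFrame(
    {
        "date_time": np.tile(dates.strftime("%Y-%m-%d"), 4),
        "Region": np.repeat(["Adelaide", "Adelaide", "Sydney", "Sydney"], 4),
        "State": np.repeat(
            ["South Australia", "South Australia", "New South Wales", "New South Wales"], 4
        ),
        "Purpose": np.repeat(["Business", "Holiday", "Business", "Holiday"], 4),
        "Trips": np.arange(16, dtype=float),
    }
)

# 1) Convert the date column from strings to datetime.
raw["date_time"] = pd.to_datetime(raw["date_time"])

# 2) Sum Trips over State and Purpose: one column per Region, one row per date.
data = raw.pivot_table(
    index="date_time", columns="Region", values="Trips", aggfunc="sum"
)

# 3) Make the quarterly-start frequency explicit on the index.
data = data.asfreq("QS")

print(data)
```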


| date_time | Adelaide | Adelaide Hills | Alice Springs | Australia's Coral Coast | Australia's Golden Outback | Australia's North West | Australia's South West | Ballarat | Barkly | Barossa | ... | Sunshine Coast | Sydney | The Murray | Tropical North Queensland | Upper Yarra | Western Grampians | Whitsundays | Wilderness West | Wimmera | Yorke Peninsula |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1998-01-01 | 658.553895 | 9.79863 | 20.207638 | 132.516409 | 161.726948 | 120.77545 | 474.858729 | 182.239341 | 18.465206 | 46.796083 | ... | 742.602299 | 2288.955629 | 356.500087 | 220.915346 | 102.791022 | 86.996591 | 60.226649 | 63.335097 | 18.804743 | 160.681637 |
| 1998-04-01 | 449.853935 | 26.066952 | 56.356223 | 172.615378 | 164.97378 | 158.404387 | 411.622281 | 137.566539 | 7.510969 | 49.428717 | ... | 609.883333 | 1814.45948 | 312.291189 | 253.097616 | 74.855136 | 84.939977 | 106.190848 | 42.607076 | 52.482311 | 104.324252 |
| 1998-07-01 | 592.904597 | 26.491072 | 110.918441 | 173.904335 | 206.879934 | 184.619035 | 360.039657 | 117.642761 | 43.565625 | 29.743302 | ... | 615.306331 | 1989.731939 | 376.718698 | 423.506735 | 59.465405 | 79.974884 | 81.771005 | 18.851214 | 35.657551 | 68.996468 |
| 1998-10-01 | 524.24276 | 27.256859 | 40.86827 | 207.002571 | 198.509591 | 138.878263 | 462.62005 | 136.072724 | 29.359239 | 78.193066 | ... | 684.430239 | 2150.913627 | 336.367694 | 283.694451 | 35.238855 | 116.235617 | 105.600143 | 50.450965 | 27.204455 | 103.340264 |
| 1999-01-01 | 548.394105 | 13.772975 | 48.368038 | 198.856638 | 140.213443 | 103.337122 | 562.974629 | 156.456242 | 6.341997 | 35.27791 | ... | 842.167418 | 1779.286905 | 323.418472 | 194.509904 | 67.823457 | 101.765635 | 111.504972 | 59.888003 | 50.219851 | 146.65829 |


Check for missing values.
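
A minimal sketch of such a check, assuming the data is already in wide format with one column per region. The small frame below is a hypothetical stand-in with one deliberately missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical wide frame with one gap, standing in for the real `data`.
data = pd.DataFrame(
    {"Adelaide": [1.0, 2.0, np.nan, 4.0], "Sydney": [5.0, 6.0, 7.0, 8.0]},
    index=pd.date_range("1998-01-01", periods=4, freq="QS"),
)

# Count missing values per column, then overall.
missing_per_region = data.isna().sum()
total_missing = int(missing_per_region.sum())

# Show only the regions that actually have gaps.
print(missing_per_region[missing_per_region > 0])
print(f"Total missing values: {total_missing}")
```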

Later we may want to use LightGBM, which does not support special JSON characters (e.g., `'`) in column names. Let's remove these characters from the column names.

```python
import re

# Keep only letters, digits, and underscores in each column name
data = data.rename(columns=lambda x: re.sub("[^A-Za-z0-9_]+", "", x))
```

Assign the names of the regions (the column names) to a variable `Region`. We will use this later.

# Exploratory data analysis

Print the number of data points in the time series, the start time, and the end time of the time series.
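
One way to do this, sketched on a hypothetical eight-quarter frame rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical wide frame standing in for the real `data`.
index = pd.date_range("1998-01-01", periods=8, freq="QS")
data = pd.DataFrame({"Adelaide": np.arange(len(index), dtype=float)}, index=index)

# Length, first date, and last date of the series.
n_points = len(data)
start, end = data.index.min(), data.index.max()

print(f"Number of data points: {n_points}")
print(f"Start: {start:%Y-%m-%d}, end: {end:%Y-%m-%d}")
```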

Plot the time series summed over all states.

Plot a subsample of the time series from different regions.

These series appear to have yearly seasonality, and some appear to be anti-correlated (i.e., some regions experience peaks whilst others experience troughs).

Create a quarter-of-the-year feature, which could help capture the yearly seasonality.
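
A minimal sketch, assuming the dataframe has a `DatetimeIndex`. The column name `quarter` and the small frame are choices made here for illustration.

```python
import pandas as pd

# Hypothetical wide frame standing in for the real `data`.
index = pd.date_range("1998-01-01", periods=8, freq="QS")
data = pd.DataFrame({"Adelaide": range(8)}, index=index)

# Quarter of the year (1 to 4), derived from the DatetimeIndex.
data["quarter"] = data.index.quarter

print(data)
```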

# Forecasting

Import the class needed for recursive forecasting of multiple independent time series from `skforecast`.

Import a transformer from `sklearn` to scale the data.

Import a model of your choice.

Assign the names of the regions to a `target_cols` variable and any exogenous features to an `exog_cols` variable.

Specify a forecast horizon and assign it to a variable `steps`. Try forecasting 8 quarters into the future.

Create a dataframe for the future values of any exogenous features.

Hint: `pd.DateOffset` and using `freq="QS"` in `pd.date_range` might be helpful.
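
Following the hint, the future exogenous dataframe could be built like this. It is a sketch that assumes the only exogenous feature is a `quarter` column and that the horizon is `steps = 8`; the historical frame is a hypothetical stand-in.

```python
import pandas as pd

steps = 8  # forecast horizon in quarters

# Hypothetical historical frame with a `quarter` exogenous feature.
index = pd.date_range("1998-01-01", periods=12, freq="QS")
data = pd.DataFrame({"Adelaide": range(12)}, index=index)
data["quarter"] = data.index.quarter

# Start one quarter after the last observed date and cover `steps` quarters.
future_start = data.index.max() + pd.DateOffset(months=3)
future_index = pd.date_range(start=future_start, periods=steps, freq="QS")

# The quarter is known in advance, so it can be computed for future dates.
future_exog = pd.DataFrame(index=future_index)
future_exog["quarter"] = future_index.quarter

print(future_exog)
```

Calendar features such as the quarter are convenient exogenous variables because their future values are always known at prediction time.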

Define window features using the `RollingFeatures` class from skforecast. Try windows of 4 and 8 quarters (1 and 2 years).
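
To build intuition for what such features contain, here is a plain pandas illustration of rolling means over the previous 4 and 8 quarters. It does not use the `RollingFeatures` class that the exercise asks for; the column names and the toy series are assumptions made for this sketch.

```python
import pandas as pd

# Hypothetical quarterly series.
y = pd.Series(
    [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0, 26.0, 28.0],
    index=pd.date_range("1998-01-01", periods=10, freq="QS"),
)

# Shift by one step so each row only sees values strictly before it,
# which avoids leaking the target into its own features.
past = y.shift(1)

window_features = pd.DataFrame(
    {
        "roll_mean_4": past.rolling(window=4).mean(),  # previous 4 quarters (1 year)
        "roll_mean_8": past.rolling(window=8).mean(),  # previous 8 quarters (2 years)
    }
).dropna()  # rows without a full 8-quarter history are dropped

print(window_features)
```

Note that the longest window determines how many initial rows are lost: with a window of 8, the first 8 observations cannot be used as training targets.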

Define a weight function (a function of the time axis) that linearly decreases the weight from 1 to 0 as we go back in time. This will give more weight to recent dates. Define it so there are no hard-coded dates in the function.

Hint: Consider using `np.linspace`
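
Following the hint, one possible weight function is sketched below. The name `linear_weights` is hypothetical, and the sketch assumes the index it receives is sorted in ascending order and evenly spaced, so that position in the index is a fair proxy for time.

```python
import numpy as np
import pandas as pd


def linear_weights(index: pd.DatetimeIndex) -> np.ndarray:
    """Weights rising linearly from 0 (oldest date) to 1 (most recent date).

    Only the length of the index is used, so no dates are hard-coded.
    """
    return np.linspace(start=0.0, stop=1.0, num=len(index))


# Usage example on a hypothetical quarterly index.
example_index = pd.date_range("1998-01-01", periods=5, freq="QS")
example_weights = linear_weights(example_index)
print(pd.Series(example_weights, index=example_index))
```

Because the oldest observation receives a weight of exactly 0, it is effectively ignored during training. The chosen model must also support sample weights in its `fit` method for the weights to have any effect.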

Define a forecaster to predict all the time series. Pass your weight function and window features to the forecaster.

Fit the forecaster.

Make a forecast.

Plot a random subset of the time series and the forecast.

# Feature importance

Plot the 10 most important features.