# Part 5. Time series forecasting - exercise

> Try your best in one of the Monash datasets!

Today you'll apply what you learned in part 5 to forecast one of the datasets
from the Monash time series forecasting archive (TSF). You don't have to build
the forecasting algorithm from scratch if you don't want to; you can use
high-level tools instead. Use the ones from previous exercises, such as:
- [aeon](https://github.com/aeon-toolkit/aeon)
- [tsai](https://github.com/timeseriesAI/tsai)
- [tslearn](https://github.com/tslearn-team/tslearn#available-features)
- [sktime](https://github.com/sktime/sktime)

Or new ones seen in this course:
- [statsmodels](https://www.statsmodels.org/stable/index.html): implements traditional
statistical forecasting models.
- [pytorch-forecasting](https://pytorch-forecasting.readthedocs.io/en/stable/): PyTorch
library built on top of [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/)
that implements several neural forecasting models, including N-HiTS.

We are going to use the sunspot dataset, which contains a single very long
daily time series of sunspot numbers from 1818-01-08 to 2020-05-31. Be aware that
the series has missing values. In the version without missing values, the gaps
were filled with the LOCF (last observation carried forward) imputation method.
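
Carrying the last observation forward (LOCF) means that each missing value is replaced by the most recent observed one. The sketch below illustrates the idea on a made-up toy series using pandas `ffill`; the values and the names `toy` and `toy_locf` are hypothetical, and this is not necessarily how the archive itself produced the filled version.

```python
import numpy as np
import pandas as pd

# Toy daily series with gaps (values are made up for illustration).
toy = pd.Series(
    [65.0, np.nan, np.nan, 37.0, np.nan, 77.0],
    index=pd.date_range("1818-01-08", periods=6, freq="D"),
    name="sunspot",
)

# LOCF: every missing value takes the most recent observed value.
toy_locf = toy.ffill()

print(toy_locf.tolist())  # [65.0, 65.0, 65.0, 37.0, 37.0, 77.0]
```

Note that a gap at the very start of a series has no earlier observation to carry forward, so it would stay missing under LOCF and needs separate handling.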

## Preparing the data

Running the following code is **mandatory**, as it loads the datasets as intended
for the competition.

### Download

In [61]:
import requests
import zipfile
import os
# Zenodo archives holding the two versions of the sunspot dataset
url_train = 'https://zenodo.org/api/records/4654773/files-archive'
url_test = 'https://zenodo.org/api/records/4654722/files-archive'

DATA_FOLDER = 'data'
TEMP_FOLDER = 'raw_data'

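def get(url):
    # Download the zip archive and unpack it into TEMP_FOLDER
    response = requests.get(url)
    response.raise_for_status()  # Fail early on an HTTP error instead of saving an error page
    zip_file_path = os.path.join(TEMP_FOLDER, 'temp.zip')  # Path where the zip file is saved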
    with open(zip_file_path, 'wb') as file:
        file.write(response.content)
    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        zip_ref.extractall(TEMP_FOLDER)
    os.remove(zip_file_path)

def extract():
    extracted_zipfile = list(os.listdir(TEMP_FOLDER))[0]
    extracted_zipfile = os.path.join(TEMP_FOLDER, extracted_zipfile)
    with zipfile.ZipFile(extracted_zipfile, 'r') as zip_ref:
        zip_ref.extractall(DATA_FOLDER)
    os.remove(extracted_zipfile)

# Create the working folders if they do not exist yet
os.makedirs(TEMP_FOLDER, exist_ok=True)
os.makedirs(DATA_FOLDER, exist_ok=True)

get(url_train)
extract()
get(url_test)
extract()

os.rmdir(TEMP_FOLDER)

### Load to memory

In [64]:
# We use the aeon package to load the data. Install it with `!pip3 install aeon`
from aeon.datasets import load_from_tsf_file

#DATA_FOLDER = 'data' # Uncomment if the dataset is already downloaded
TRAIN_DATA_FOLDER = os.path.join(DATA_FOLDER, 'sunspot_dataset_with_missing_values.tsf')
TEST_DATA_FOLDER = os.path.join(DATA_FOLDER, 'sunspot_dataset_without_missing_values.tsf')

missing_data, missing_metadata = load_from_tsf_file(TRAIN_DATA_FOLDER)
nonmissing_data, nonmissing_metadata = load_from_tsf_file(TEST_DATA_FOLDER)

print(missing_metadata)
print(nonmissing_metadata)

{'frequency': 'daily', 'forecast_horizon': None, 'contain_missing_values': True, 'contain_equal_length': True}
{'frequency': 'daily', 'forecast_horizon': None, 'contain_missing_values': False, 'contain_equal_length': True}


### Prepare train and test sets

In [69]:
import datetime
import numpy as np
import pandas as pd

def to_dataframe(dataset):
    numeric_data = np.array(dataset.series_value[0])
    interval_date = datetime.timedelta(days=1) * (len(numeric_data) - 1)
    start_date = dataset.start_timestamp[0]
    date_index = pd.date_range(start_date, interval_date + start_date , freq='D')
    return pd.DataFrame(numeric_data, index=date_index, columns=['sunspot'])

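# Label-based slicing includes both endpoints, so stop at 2019-12-31 to keep 2020-01-01 out of the training set
training_data = to_dataframe(missing_data)[:datetime.datetime(2019, 12, 31)]
# The test set is taken from the version without missing values, starting in 2020.
# It is filled with LOCF, so some errors are to be expected
TESTING_DATA = to_dataframe(nonmissing_data)[datetime.datetime(2020, 1, 1):]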

| date | sunspot |
|---|---|
| 1818-01-08 | 65.0 |
| 1818-01-09 | NaN |
| 1818-01-10 | NaN |
| 1818-01-11 | NaN |
| 1818-01-12 | NaN |
| ... | ... |
| 2020-05-27 | 0.0 |
| 2020-05-28 | 0.0 |
| 2020-05-29 | 0.0 |
| 2020-05-30 | 0.0 |
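
Because label-based slicing on a pandas `DatetimeIndex` includes both endpoints, it is worth checking that the training and testing sets do not share any date. The following is a minimal sketch on a toy frame; the names `toy`, `cutoff`, `train_inclusive`, `train_exclusive` and `test` are hypothetical and only illustrate the check.

```python
import pandas as pd

# Toy frame spanning the train/test boundary (values are placeholders).
idx = pd.date_range("2019-12-30", "2020-01-03", freq="D")
toy = pd.DataFrame({"sunspot": range(len(idx))}, index=idx)

cutoff = pd.Timestamp("2020-01-01")

# Slicing by label keeps the cutoff day on BOTH sides of the split.
train_inclusive = toy.loc[:cutoff]
test = toy.loc[cutoff:]

# A boolean mask with a strict inequality leaves the cutoff day out of training.
train_exclusive = toy[toy.index < cutoff]

overlap_inclusive = train_inclusive.index.intersection(test.index)
overlap_exclusive = train_exclusive.index.intersection(test.index)

print(len(overlap_inclusive), len(overlap_exclusive))  # 1 0
```

Running the same intersection check on ``training_data`` and ``TESTING_DATA`` should give an empty index if the split matches the task definition.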


## Task Definition

Two datasets are available: ``training_data`` and ``TESTING_DATA``. Neither is preprocessed. You should preprocess the training data, but the testing data must never be modified in any way. The training data contains sunspot numbers up to (not including) 2020-01-01, while the testing data covers 2020-01-01 to 2020-05-31. Your task is to forecast the series through 2020-05-31, generating one forecast for each day of the testing time series. The objective metric to minimize is *RMSE*.
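
For reference, RMSE is the square root of the mean squared difference between the actual and forecast values, $\sqrt{\frac{1}{n}\sum_{t=1}^{n}(y_t - \hat{y}_t)^2}$. The sketch below computes it with NumPy and scores a naive "repeat the last observed value" baseline on made-up numbers. The function `rmse` and the arrays `history`, `actual` and `naive_forecast` are hypothetical names for illustration, not part of the exercise code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up values standing in for the end of the training series
# and for the observed test values.
history = np.array([4.0, 6.0, 5.0, 3.0])
actual = np.array([3.0, 0.0, 0.0, 7.0])

# Naive baseline: repeat the last observed value for every day of the horizon.
naive_forecast = np.full(len(actual), history[-1])

score = rmse(actual, naive_forecast)
print(round(score, 4))
```

A simple baseline like this can serve as a reference point: a model is only useful for this task if its RMSE on the testing period is lower.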

> Good luck!
 

## Your implementation