# Persistence model

You remember the advice of your Time Series professor:

Don't build a crazy model before trying persistence. A baseline is always valuable: sometimes it gives good enough results, and it always sets the bar for more complex approaches.

<div class="alert alert-block alert-warning">
<b>Simplification.</b> 

There is absolutely no good reason to go through the following complex procedure just to build a persistence baseline: the `.shift()` method of `pandas.Series` does the job. However, we will use the persistence model to demonstrate how to create a model with custom code and a custom container.
</div>
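
For reference, here is a minimal sketch of the `.shift()` one-liner mentioned in the note. The series name `load` and its values are invented for illustration and are not taken from the workshop data.

```python
import pandas as pd

# Toy daily series; the values are invented for illustration only.
index = pd.date_range("2020-01-01", periods=5, freq="D")
load = pd.Series([10.0, 12.0, 11.0, 13.0, 14.0], index=index, name="load")

# One-step persistence: the forecast for day t is the observation at day t-1.
persistence_forecast = load.shift(1)

# The first day has no previous observation, so its forecast is NaN.
print(persistence_forecast)
```

Shifting by a larger number of periods (for example `load.shift(365)` on daily data) would give a seasonal variant of the same idea.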

You heard from your fellow data scientist Marta about a cool library for time series forecasting built on top of sklearn, and you want to try it out. So you start by installing sktime.

# Setup

In [None]:
# 'ml.m5.xlarge' is included in the AWS Free Tier
INSTANCE_TYPE = 'ml.m5.xlarge'

In [None]:
! pip install sktime --user
! pip install pandas s3fs --upgrade

Please restart the kernel if this is the first time you run this notebook.

This ensures that the libraries installed in the previous cell can actually be imported.

In [None]:
import os

import pandas as pd
import numpy as np
from sktime.forecasting.base import ForecastingHorizon

import boto3
import sagemaker
from sagemaker.estimator import Estimator

In [None]:
# Configuring the default size for matplotlib plots
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (20, 6)

In [None]:
image_name = prefix = 'persistence-baseline'

boto3_session = boto3.Session()
sagemaker_session = sagemaker.Session()
sagemaker_bucket = sagemaker_session.default_bucket()

region = boto3_session.region_name

# Image preparation
You figure out the process of building your Docker image. First, you create the ECR (Elastic Container Registry) repository. Then, you build the Docker image with the training and inference code, and finally you push it to that repository.

Fortunately, your favourite ML Engineers, Matteo and Gabriele, have already done it for you, and you can use it directly.
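
Images stored in a private ECR repository are addressed by a URI that combines the account ID, the region, the repository name and a tag. As a small sketch, the helper below composes such a URI; `ecr_image_uri` is a hypothetical name introduced here, and the account ID is a placeholder.

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str = "latest") -> str:
    """Compose the URI of an image stored in a private ECR repository."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Placeholder account ID, not a real one.
uri = ecr_image_uri("123456789012", "eu-west-1", "persistence-baseline")
print(uri)
```

The result has the same layout as the `image_uri` used in this notebook.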

In [None]:
# %%bash -s "$image_name" "$region"
# chmod 755 build_push.sh
# ./build_push.sh $1 $2

In [None]:
# ! docker image ls

In [None]:
image_uri = "919788038405.dkr.ecr.eu-west-1.amazonaws.com/persistence-baseline:latest"

# Raw data gathering
The data processing pipeline created by Matteo and Gabriele deposits the final dataset in a conventional location on S3.
In order to retrieve the data to crunch, you first load the S3 object. 

With `pandas`, the integration is immediate: once `s3fs` is installed, S3 URIs are resolved as if they were file paths.

You also create some objects that will be useful throughout the notebook.

In [None]:
raw_data_s3_path = "s3://public-workshop/normalized_data/processed/2006_2022_data.parquet"
raw_df = pd.read_parquet(raw_data_s3_path)
resampled_df = raw_df.resample('D').sum()
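
To see what the daily resampling does, here is a toy sketch. It assumes an hourly frame with a single `Load` column; the frequency of the real raw data is not stated in this notebook, and the values below are invented.

```python
import pandas as pd

# Toy hourly frame spanning two days; every hour has a load of 1.0.
index = pd.date_range("2020-01-01", periods=48, freq=pd.Timedelta(hours=1))
hourly = pd.DataFrame({"Load": 1.0}, index=index)

# Group the rows by calendar day and add up the values within each day.
daily = hourly.resample("D").sum()
print(daily)
```

Each daily row is the sum of the 24 hourly values that fall on that calendar day.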

In [None]:
NOW = '2019-12-31 23:59'
TRAIN_END = '2017-12-31 23:59'

load_df = resampled_df[:NOW].copy()
load_df.head()
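
Slicing a `DatetimeIndex` with a timestamp string is label-based, so a cut-off such as `TRAIN_END` splits the data without overlap. A minimal sketch on an invented six-day frame, using `.loc` for the slices:

```python
import pandas as pd

# Six invented daily observations around the end of 2017.
index = pd.date_range("2017-12-29", periods=6, freq="D")
df = pd.DataFrame({"Load": range(6)}, index=index)

TRAIN_END = "2017-12-31 23:59"

train = df.loc[:TRAIN_END]  # rows up to and including 2017-12-31
test = df.loc[TRAIN_END:]   # rows from 2018-01-01 onwards

print(train.index.max(), test.index.min())
```

Because the cut-off falls at 23:59, the midnight timestamp of 2017-12-31 lands in the first slice and 2018-01-01 in the second.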

# Upload on S3
SageMaker training jobs read their data from S3, so you need to upload the training set.

In [None]:
main_prefix = "amld22-workshop-sagemaker"
local_train_path = "persistence_train.parquet"

s3_train_path = f's3://{sagemaker_bucket}/{main_prefix}/data/modelling/{prefix}/train.parquet'
# Upload only the training window, so that the test period stays unseen
load_df[:TRAIN_END].to_parquet(s3_train_path)
print(f"Data uploaded to: {s3_train_path}")

# Create estimator & Train
Then you can use your custom Docker image to train the persistence model. Yeah, you smile when you think about "training persistence".

Still, you have a final look at your `persistence/train.py` and `persistence/serve.py` files, and you run the cell.
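
The hyperparameters passed below (`strategy="last"`, `sp=365`) look like those of sktime's `NaiveForecaster`, although `persistence/train.py` is not shown here, so that mapping is an assumption. Under that assumption, a seasonal persistence forecast repeats the last full season of observations. The NumPy sketch below illustrates the idea with a hypothetical helper, `seasonal_persistence`, and a short season of 3 instead of 365.

```python
import numpy as np


def seasonal_persistence(history, horizon: int, sp: int) -> np.ndarray:
    """Repeat the last `sp` observations until `horizon` steps are covered."""
    history = np.asarray(history, dtype=float)
    if len(history) < sp:
        raise ValueError("need at least one full season of history")
    last_season = history[-sp:]
    repeats = int(np.ceil(horizon / sp))
    return np.tile(last_season, repeats)[:horizon]


history = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
forecast = seasonal_persistence(history, horizon=5, sp=3)
print(forecast)  # [4. 5. 6. 4. 5.]
```

With `sp=365` on daily data, the same logic would predict each day with the value observed on the corresponding day of the last year of history.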

In [None]:
sk_model = Estimator(
    image_uri=image_uri,
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type=INSTANCE_TYPE,
    hyperparameters={
        "strategy": "last",
        "sp": 365,
    }
)

sk_model.fit({
    "training": s3_train_path,
})
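
Inside a training container, SageMaker exposes the hyperparameters as a JSON file at `/opt/ml/input/config/hyperparameters.json`, with the values stored as strings. How `persistence/train.py` actually reads them is not shown here; the sketch below is one possible way to do it, with a hypothetical `load_hyperparameters` helper and a temporary file standing in for the real path.

```python
import json
import tempfile
from pathlib import Path

# Default location inside a SageMaker training container.
HYPERPARAMETERS_PATH = Path("/opt/ml/input/config/hyperparameters.json")


def load_hyperparameters(path: Path = HYPERPARAMETERS_PATH) -> dict:
    """Read the hyperparameters file and cast the values used in this example."""
    raw = json.loads(Path(path).read_text())
    return {
        "strategy": raw.get("strategy", "last"),
        "sp": int(raw.get("sp", 1)),  # values arrive as strings
    }


# Simulate the file locally, since the real path only exists in the container.
with tempfile.TemporaryDirectory() as tmp_dir:
    fake_path = Path(tmp_dir) / "hyperparameters.json"
    fake_path.write_text(json.dumps({"strategy": "last", "sp": "365"}))
    params = load_hyperparameters(fake_path)

print(params)  # {'strategy': 'last', 'sp': 365}
```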

# Deployment
Using AWS SageMaker, you deploy the model to a managed endpoint.

SageMaker then uses the `persistence/serve.py` module to spin up a Flask server and make predictions.

<div class="alert alert-block alert-warning">
<b>Simplification.</b> 

The Flask development server we use is **NOT** a production server. Please make sure to set up a more robust and secure serving method when deploying to production. For example, have a look at the inference approach of: https://github.com/aws/amazon-sagemaker-examples/tree/main/advanced_functionality/scikit_bring_your_own/container
</div>
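
Whatever web framework is used, a SageMaker inference container has to answer two routes on port 8080: `GET /ping` for health checks and `POST /invocations` for predictions. The sketch below shows that contract with a bare WSGI callable from the standard library rather than Flask, to stay dependency-free. The payload format (a comma-separated list of timestamps in, a JSON mapping out) and the dummy prediction are assumptions, since `persistence/serve.py` is not shown here.

```python
import json


def app(environ, start_response):
    """Minimal WSGI app exposing the two routes SageMaker calls."""
    method = environ["REQUEST_METHOD"]
    path = environ["PATH_INFO"]

    if method == "GET" and path == "/ping":
        # Health check: a 200 response tells SageMaker the container is up.
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b""]

    if method == "POST" and path == "/invocations":
        length = int(environ.get("CONTENT_LENGTH") or 0)
        body = environ["wsgi.input"].read(length).decode("utf-8")
        # Placeholder "model": map each requested timestamp to a dummy value.
        timestamps = [item for item in body.strip().split(",") if item]
        payload = json.dumps({ts: 0.0 for ts in timestamps}).encode("utf-8")
        start_response("200 OK", [("Content-Type", "application/json")])
        return [payload]

    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]
```

A real handler would load the fitted model once at startup and call it inside the `/invocations` branch.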

In [None]:
sk_predictor = sk_model.deploy(
    initial_instance_count=1,
    instance_type=INSTANCE_TYPE,
    serializer=sagemaker.serializers.CSVSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer()
)

# Prediction
You use the deployed API to predict on the test set.

Results are not that bad, but there is definitely room for improvement.

You smile, and open Google Scholar to look for inspiration.

In [None]:
def mean_absolute_percentage_error(y_true, y_pred):
    # MAPE is normalised by the actual values, not by the predictions
    return np.mean(np.abs((y_true - y_pred) / y_true))
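
As a quick sanity check of the metric, here is a worked example on two invented points. MAPE is conventionally the mean of the absolute errors divided by the actual values; the function name `mape` and the numbers are illustrative only.

```python
import numpy as np


def mape(y_true, y_pred):
    """Mean absolute percentage error, relative to the actual values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true))


actual = [100.0, 200.0]
predicted = [110.0, 180.0]

# Relative errors are 10/100 and 20/200, so the mean is 10 %.
error = mape(actual, predicted)
print(f"{100 * error:.2f} %")  # 10.00 %
```

Note that the metric is undefined when an actual value is zero, which is worth checking before applying it to other series.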

In [None]:
y_true = load_df[TRAIN_END:].Load
prediction_index = y_true.index
fh_absolute = ForecastingHorizon(prediction_index, is_relative=False)

# Predict using the deployed model
y_pred = sk_predictor.predict(fh_absolute)
# The endpoint returns a JSON mapping; its values are assumed to be in horizon order
y_pred_series = pd.Series(list(y_pred.values()), index=prediction_index)

# Compute MAPE
naive_mape = mean_absolute_percentage_error(y_true, y_pred_series)

# Plot results
plt.title(f"Persistence | MAPE: {100 * naive_mape:.2f} %")
plt.plot(y_true, label='Actual')
plt.plot(y_pred_series, label='Predicted')
plt.legend()
plt.grid(alpha=0.4)
plt.show()

# Cleanup
If you are done with this notebook, please run the cells below with `CLEANUP = True`.

This removes the model and the hosted endpoint, so that no stray instance keeps running and incurring charges.

In [None]:
CLEANUP = True

In [None]:
if CLEANUP:
    sk_predictor.delete_model()
    sk_predictor.delete_endpoint()