# ML Pipeline for "Forecasting Air Quality with Amazon SageMaker DeepAR"

In this example, we build an ML pipeline that automates an air quality forecasting application with the [AWS Step Functions Data Science SDK](https://aws-step-functions-data-science-sdk.readthedocs.io).

## ML Pipeline

### Outcome
* Create the ML workflow to build, train, and deploy the air quality forecasting model
* Create a simple retraining flow

### Design
* Use Step Functions Data Science SDK to orchestrate the ML flow
* Use SageMaker Processing for data preprocessing. In particular:
 * A common Docker image will be built for data retrieval (interacting with Amazon Athena) and for data/feature engineering
* Use SageMaker Processing to do Model Evaluation
* A scheduled job mechanism will be used to do model retraining.
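
One possible way to realise the scheduled retraining mentioned in the last design point is an Amazon EventBridge rule that starts the state machine on a schedule. The sketch below only builds the arguments for the boto3 `events` client calls `put_rule` and `put_targets`; the rule name, schedule, ARNs, and input payload are placeholders for illustration, not values from this workshop.

```python
import json


def build_retrain_schedule(rule_name, state_machine_arn, events_role_arn,
                           execution_input, schedule="rate(7 days)"):
    """Return keyword arguments for events.put_rule and events.put_targets."""
    rule_kwargs = {
        "Name": rule_name,
        "ScheduleExpression": schedule,  # EventBridge rate or cron expression
        "State": "ENABLED",
    }
    target_kwargs = {
        "Rule": rule_name,
        "Targets": [{
            "Id": "retrain-workflow",          # hypothetical target id
            "Arn": state_machine_arn,          # state machine to start
            "RoleArn": events_role_arn,        # role EventBridge assumes to start it
            "Input": json.dumps(execution_input),
        }],
    }
    return rule_kwargs, target_kwargs


# Placeholder values, for illustration only.
rule_kwargs, target_kwargs = build_retrain_schedule(
    rule_name="aqf-weekly-retrain",
    state_machine_arn="arn:aws:states:REGION:ACCOUNT_ID:stateMachine:EXAMPLE",
    events_role_arn="arn:aws:iam::ACCOUNT_ID:role/EXAMPLE",
    execution_input={"ToDoHPO": False, "ToDoTraining": True},
)

# With AWS credentials available, the rule could then be created with:
#   events = boto3.client("events")
#   events.put_rule(**rule_kwargs)
#   events.put_targets(**target_kwargs)
```

SageMaker job names must be unique, so a fixed input like this would only work for one execution. In practice, something such as a small Lambda function that generates fresh names would probably sit between the rule and the workflow.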

## Implementation

### Initialize Environment

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
# import sys
# !{sys.executable} -m pip install --upgrade pip
# !{sys.executable} -m pip install -qU awscli boto3 "sagemaker==1.72.0"
# !{sys.executable} -m pip install -qU "stepfunctions==1.1.1"
# !{sys.executable} -m pip show sagemaker stepfunctions

In [None]:
import uuid
import time
import boto3
import os, urllib.request
import stepfunctions
from stepfunctions import steps
from stepfunctions.inputs import ExecutionInput
from stepfunctions.steps.sagemaker import *
from stepfunctions.steps.states import *
from stepfunctions.workflow import Workflow

import sagemaker
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter
from sagemaker.model import Model

session = boto3.Session()
region = session.region_name
account_id = session.client('sts').get_caller_identity().get('Account')
bucket_name = f'{account_id}-openaq-forecasting'

sagemaker_session = sagemaker.Session()
role = get_execution_role()

In [None]:
# upload existing model artifact to working bucket
s3 = boto3.client('s3')

os.makedirs('model', exist_ok=True)
urllib.request.urlretrieve('https://d8pl0xx4oqh22.cloudfront.net/model.tar.gz', 'model/model.tar.gz')
s3.upload_file('model/model.tar.gz', bucket_name, 'sagemaker/model/model.tar.gz')

In [None]:
EXISTING_MODEL_URI = f"s3://{bucket_name}/sagemaker/model/model.tar.gz"

Set up the workflow execution role. For the role ARN, refer to the Outputs tab of the CloudFormation stack.

In [None]:
WORKFLOW_EXECUTION_ROLE = "arn:aws:iam::593380422482:role/aqf-workshop-StepFunctionsWorkflowExecutionRole-1P06POQ2UPORW"
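
Instead of copying the ARN by hand, it could be read from the stack outputs. The helper below is a minimal sketch that searches a `describe_stacks` response for an output key; the stack name `aqf-workshop` and the output key `StepFunctionsWorkflowExecutionRole` are assumptions, so check the Outputs tab for the real names.

```python
def find_stack_output(describe_stacks_response, output_key):
    """Return the OutputValue for output_key from a describe_stacks response."""
    for stack in describe_stacks_response.get("Stacks", []):
        for output in stack.get("Outputs", []):
            if output.get("OutputKey") == output_key:
                return output["OutputValue"]
    raise KeyError(f"Stack output {output_key!r} not found")


# Offline sample with the same shape as a describe_stacks response.
sample_response = {
    "Stacks": [{
        "StackName": "aqf-workshop",  # assumed stack name
        "Outputs": [{
            "OutputKey": "StepFunctionsWorkflowExecutionRole",  # assumed key
            "OutputValue": "arn:aws:iam::ACCOUNT_ID:role/EXAMPLE",
        }],
    }]
}
role_arn = find_stack_output(sample_response, "StepFunctionsWorkflowExecutionRole")

# With AWS credentials available, the real response would come from:
#   boto3.client("cloudformation").describe_stacks(StackName="aqf-workshop")
```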

### Build Docker image for SageMaker Processing

Define your own processing container and install related dependencies.

Below, we walk through how to create a processing container and how to use a `ScriptProcessor` to run your own code within it. The container supports data preprocessing, feature engineering, and model evaluation.

This block of code builds the container using the `docker` command, creates an Amazon Elastic Container Registry (Amazon ECR) repository, and pushes the image to Amazon ECR.

In [None]:
# define repository name and uri variables
ecr_repository = 'air-quality-forecasting-preprocessing'
tag = ':latest'
uri_suffix = 'amazonaws.com'
if region in ['cn-north-1', 'cn-northwest-1']:
    uri_suffix = 'amazonaws.com.cn'
processing_repository_uri = f'{account_id}.dkr.ecr.{region}.{uri_suffix}/{ecr_repository + tag}'

# build the image.
!docker build -t $ecr_repository docker_preprocessing

In [None]:
# ECR repository should have been created with CloudFormation stack. Uncomment below to create it in case it wasn't.
#!aws ecr create-repository --repository-name $ecr_repository

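# Log in and push the built Docker image.
# Note: `aws ecr get-login` is only available in AWS CLI v1. With AWS CLI v2, use
# `aws ecr get-login-password` piped into `docker login --username AWS --password-stdin <registry>`.
!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)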
!docker tag {ecr_repository + tag} $processing_repository_uri
!docker push $processing_repository_uri

### Create the ProcessingStep
We will now create the [ProcessingStep](https://aws-step-functions-data-science-sdk.readthedocs.io/en/stable/sagemaker.html#stepfunctions.steps.sagemaker.ProcessingStep) that will launch a SageMaker Processing Job.

The processing job script `preprocessing.py` performs the following actions:

* Create Athena table with external source - OpenAQ
* Query Sydney OpenAQ data 
* Feature engineering on the dataset
* Split training and test data 
* Store the data on S3 buckets.

Upload the preprocessing script.

In [None]:
PREPROCESSING_SCRIPT_LOCATION = "preprocessing.py"
input_code = sagemaker_session.upload_data(
    PREPROCESSING_SCRIPT_LOCATION,
    bucket = bucket_name,
    key_prefix = "preprocessing/code",
)

Define the S3 location for the preprocessing output (training, test, and all-features data).

In [None]:
output_data = f"s3://{bucket_name}/preprocessing/output"

The `ScriptProcessor` class lets you run a command inside the container, which you can use to run your own script.

In [None]:
from sagemaker.processing import ScriptProcessor

preprocessing_processor = ScriptProcessor(
    command = ['python3'],
    image_uri = processing_repository_uri,
    role = role,
    instance_count = 1,
    instance_type = 'ml.m5.xlarge',
    max_runtime_in_seconds = 1200
)

The processing step uses the `ScriptProcessor` defined above together with the `inputs` and `outputs` objects defined in the next cell.

In [None]:
inputs = [
    ProcessingInput(
        source = input_code,
        destination = "/opt/ml/processing/input/code",
        input_name = "code"
    )
]

outputs = [
    ProcessingOutput(
        source = "/opt/ml/processing/output/all",
        destination = f"{output_data}/all",
        output_name = "all_data"
    ),
    ProcessingOutput(
        source = "/opt/ml/processing/output/train",
        destination = f"{output_data}/train",
        output_name = "train_data"
    ),
    ProcessingOutput(
        source = "/opt/ml/processing/output/test",
        destination = f"{output_data}/test",
        output_name = "test_data"
    )
]

In [None]:
# Workflow Execution parameters
execution_input = ExecutionInput(
    schema = {
        "PreprocessingJobName": str,
        "ToDoHPO": bool,
        "ToDoTraining": bool,
        "TrainingJobName": str,
        "TuningJobName": str,
        "ModelName": str,
        "EndpointName": str,
        "EvaluationProcessingJobName": str
    }
)

In [None]:
processing_step = ProcessingStep(
    "AirQualityForecasting Pre-processing Step",
    processor = preprocessing_processor,
    job_name = execution_input["PreprocessingJobName"],
    inputs = inputs,
    outputs = outputs,
    container_arguments = ["--split-days", "30"],
    container_entrypoint = ["python3", "/opt/ml/processing/input/code/preprocessing.py"]
)
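
The `--split-days 30` container argument is handed to `preprocessing.py`, whose contents are not shown here. As a rough sketch of how such a script might consume it, assuming the value means "hold out the most recent N days as test data", the argument can be parsed with `argparse` and turned into a cutoff timestamp. The function names and the sample date are illustrative.

```python
import argparse
from datetime import datetime, timedelta


def parse_args(argv=None):
    """Parse the command-line arguments passed via container_arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--split-days", type=int, default=30,
                        help="number of most recent days held out for testing")
    return parser.parse_args(argv)


def split_cutoff(latest_timestamp, split_days):
    """Observations at or after the returned timestamp go to the test set."""
    return latest_timestamp - timedelta(days=split_days)


args = parse_args(["--split-days", "30"])
cutoff = split_cutoff(datetime(2020, 6, 30), args.split_days)
print(cutoff)
```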

### Hyperparameter Tuning

Set up the tuning step and use a Choice state to decide whether to run HPO.

In [None]:
image_name = get_image_uri(region, "forecasting-deepar", "latest")

In [None]:
tuning_output_path = f's3://{bucket_name}/sagemaker/tuning/output'

ml_instance_type = 'ml.c5.9xlarge'

tuning_estimator = sagemaker.estimator.Estimator(
        sagemaker_session = sagemaker_session,
        image_name = image_name,
        role = role,
        train_instance_count = 1,
        train_instance_type = ml_instance_type,
        base_job_name = 'deepar-openaq-demo',
        output_path = tuning_output_path
)

#### Set static hyperparameters
The static parameters are the ones we know to be the best based on previously run HPO jobs, as well as the non-tunable parameters like prediction length and time frequency that are set according to requirements.

In [None]:
hpo = dict(
    time_freq= '1H'
    ,early_stopping_patience= 40
    ,prediction_length= 48
    ,num_eval_samples= 10

    # the default quantiles [0.1, 0.2, 0.3, ..., 0.9] are used
    #,test_quantiles= quantiles
    
    # not setting these since HPO will use range of values
    #,epochs= 400
    #,context_length= 3
    #,num_cells= 157
    #,num_layers= 4
    #,dropout_rate= 0.04
    #,embedding_dimension= 12
    #,mini_batch_size= 633
    #,learning_rate= 0.0005
)

#### Set hyper-parameter ranges
The hyperparameter ranges define the parameters we want the tuner to search across.

> Explore: Look in the [user guide](https://docs.aws.amazon.com/sagemaker/latest/dg/deepar_hyperparameters.html) for DeepAR and add the recommended ranges for `embedding_dimension` to the below.

In [None]:
hpo_ranges = dict(
    epochs= IntegerParameter(1, 1000)
    ,context_length= IntegerParameter(7, 48)
    ,num_cells= IntegerParameter(30,200)
    ,num_layers= IntegerParameter(1,8)
    ,dropout_rate= ContinuousParameter(0.0, 0.2)
    ,embedding_dimension= IntegerParameter(1, 50)
    ,mini_batch_size= IntegerParameter(32, 1028)
    ,learning_rate= ContinuousParameter(.00001, .1)
)

#### Create HPO tuning job step
Once we have the HPO tuner defined, we can define the tuning step.

In [None]:
tuning_estimator.set_hyperparameters(**hpo)

hpo_tuner = HyperparameterTuner(
    estimator = tuning_estimator, 
    objective_metric_name = 'train:final_loss',
    objective_type = 'Minimize',
    hyperparameter_ranges = hpo_ranges,
    max_jobs = 2,
    max_parallel_jobs = 1
)

hpo_data = dict(
    train = f"{output_data}/train",
    test = f"{output_data}/test"
)
# as long as HPO is selected, wait for completion.
tuning_step = TuningStep(
    "HPO Step",
    tuner = hpo_tuner,
    job_name = execution_input["TuningJobName"],
    data = hpo_data,
    wait_for_completion = True
)
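
Once a tuning job has finished, its best hyperparameters can be carried over into the static training configuration. The sketch below extracts them from a `describe_hyper_parameter_tuning_job` response; the sample response is hand-written for illustration and only contains the fields the helper reads.

```python
def best_hyperparameters(tuning_job_description):
    """Return (training job name, tuned hyperparameters) of the best job."""
    best = tuning_job_description.get("BestTrainingJob")
    if not best:
        raise ValueError("tuning job has no best training job yet")
    return best["TrainingJobName"], dict(best["TunedHyperParameters"])


# Hand-written sample; the API returns tuned values as strings.
sample_description = {
    "HyperParameterTuningJobName": "example-tuning-job",
    "BestTrainingJob": {
        "TrainingJobName": "example-tuning-job-001",
        "TunedHyperParameters": {"epochs": "400", "context_length": "3"},
    },
}
job_name, tuned = best_hyperparameters(sample_description)

# With AWS credentials available, the real description would come from:
#   boto3.client("sagemaker").describe_hyper_parameter_tuning_job(
#       HyperParameterTuningJobName="<your tuning job name>")
```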

### Training

We create a DeepAR estimator, which we will use to run a training job. This estimator is then used to create a TrainingStep for the workflow.

#### Setup the training job step

In [None]:
training_output_path = f's3://{bucket_name}/sagemaker/training/output'
training_estimator = sagemaker.estimator.Estimator(
        sagemaker_session = sagemaker_session,
        image_name = image_name,
        role = role,
        train_instance_count = 1,
        train_instance_type = ml_instance_type,
        base_job_name = 'deepar-openaq-demo',
        output_path = training_output_path
)


In [None]:
# best hyperparameters found by previous tuning runs
hpo = dict(
    time_freq= '1H'
    ,early_stopping_patience= 40
    ,prediction_length= 48
    ,num_eval_samples= 10
    #,test_quantiles= quantiles
    ,epochs= 400
    ,context_length= 3
    ,num_cells= 157
    ,num_layers= 4
    ,dropout_rate= 0.04
    ,embedding_dimension= 12
    ,mini_batch_size= 633
    ,learning_rate= 0.0005
)
training_estimator.set_hyperparameters(**hpo)

In [None]:
# use all the features for training.
data = dict(train = f"{output_data}/all/all_features.json")
training_step = TrainingStep(
    "Training Step",
    estimator = training_estimator,
    data = data,
    job_name = execution_input["TrainingJobName"],
    wait_for_completion = True
)
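
DeepAR reads training data in JSON Lines format, one time series per line, with a `start` timestamp, a `target` array, and optionally a `cat` array of categorical features. The contents of `all_features.json` are not shown in this notebook, so the following is only a sketch of what such records look like, with made-up values.

```python
import json


def to_deepar_line(start, target, cat=None):
    """Serialize one time series as a single JSON Lines record."""
    record = {"start": start, "target": list(target)}
    if cat is not None:
        record["cat"] = list(cat)
    return json.dumps(record)


# Made-up hourly readings for two hypothetical locations.
lines = [
    to_deepar_line("2020-06-01 00:00:00", [12.0, 11.5, 13.2], cat=[0]),
    to_deepar_line("2020-06-01 00:00:00", [8.1, 7.9, 9.4], cat=[1]),
]
payload = "\n".join(lines)
print(payload)
```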

#### Create Model Step

In the following cell, we define a model step that will create a model in Amazon SageMaker using the artifacts created during the TrainingStep. See  [ModelStep](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/sagemaker.html#stepfunctions.steps.sagemaker.ModelStep) in the AWS Step Functions Data Science SDK documentation to learn more.

The model creation step typically follows the training step. The Step Functions SDK provides the [get_expected_model](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/sagemaker.html#stepfunctions.steps.sagemaker.TrainingStep.get_expected_model) method in the TrainingStep class to provide a reference for the trained model artifacts. Please note that this method is only useful when the ModelStep directly follows the TrainingStep.

In [None]:
model_step = steps.ModelStep(
    "Save Model",
    model = training_step.get_expected_model(),
    model_name = execution_input["ModelName"],
    result_path = "$.ModelStepResults"
)

# for deploying existing model
existing_model_name = f"aqf-model-{uuid.uuid1().hex}"
existing_model = Model(
    model_data = EXISTING_MODEL_URI,
    image = image_name,
    role = role,
    name = existing_model_name
)
existing_model_step = steps.ModelStep(
    "Existing Model",
    model = existing_model,
    model_name = execution_input["ModelName"]
)

### Create an Endpoint Configuration Step
In the following cell we create an endpoint configuration step. See [EndpointConfigStep](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/sagemaker.html#stepfunctions.steps.sagemaker.EndpointConfigStep) in the AWS Step Functions Data Science SDK documentation to learn more.

In [None]:
endpoint_config_step = steps.EndpointConfigStep(
    "Create Model Endpoint Config",
    endpoint_config_name = execution_input["ModelName"],
    model_name = execution_input["ModelName"],
    initial_instance_count = 1,
    instance_type = 'ml.c5.xlarge'
)

### Update the Model Endpoint Step
In the following cell, we create the Endpoint step that deploys the model as a managed API endpoint. With `update = False` the step creates a new SageMaker endpoint; set `update = True` to update an existing endpoint instead.

In [None]:
endpoint_step = steps.EndpointStep(
    "Update Model Endpoint",
    endpoint_name = execution_input["EndpointName"],
    endpoint_config_name = execution_input["ModelName"],
    update = False
)

#### Setup workflow process

Create a `Fail` state to mark the workflow as failed if any of the steps fails.

In [None]:
failed_state_sagemaker_pipeline_step_failure = Fail(
    "ML Workflow failed", cause = "SageMakerPipelineStepFailed"
)

In [None]:
training_path = Chain([training_step, model_step, endpoint_config_step, endpoint_step])
deploy_existing_model_path = Chain([existing_model_step, endpoint_config_step, endpoint_step])

#### Choice state

Now we need to set up Choice states that decide whether to run HPO and whether to run training. See *Choice Rules* in the [AWS Step Functions Data Science SDK documentation](https://aws-step-functions-data-science-sdk.readthedocs.io).

In [None]:
from stepfunctions.steps import *

hpo_choice = Choice(
    "To do HPO?"
)
training_choice = Choice(
    "To do Model Training?"
)

# refer to execution input variable with required format - not user friendly.
hpo_choice.add_choice(
    rule = ChoiceRule.BooleanEquals(variable = "$$.Execution.Input['ToDoHPO']", value = True),
    next_step = tuning_step
)
hpo_choice.add_choice(
    rule = ChoiceRule.BooleanEquals(variable = "$$.Execution.Input['ToDoHPO']", value = False),
    next_step = training_choice
)
training_choice.add_choice(
    rule = ChoiceRule.BooleanEquals(variable = "$$.Execution.Input['ToDoTraining']", value = True),
    next_step = training_path
)
training_choice.add_choice(
    rule = ChoiceRule.BooleanEquals(variable = "$$.Execution.Input['ToDoTraining']", value = False),
    next_step = deploy_existing_model_path
)
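
The branching defined by the two Choice states can be summarised in plain Python. This is only an illustration of the decision logic, not part of the workflow; the returned strings are the names of the first state on each branch.

```python
def select_branch(execution_input):
    """Mirror the Choice rules: HPO first, then training, else the existing model."""
    if execution_input["ToDoHPO"]:
        return "HPO Step"
    if execution_input["ToDoTraining"]:
        return "Training Step"
    return "Existing Model"


for flags in (
    {"ToDoHPO": True, "ToDoTraining": False},
    {"ToDoHPO": False, "ToDoTraining": True},
    {"ToDoHPO": False, "ToDoTraining": False},
):
    print(flags, "->", select_branch(flags))
```

Note that, as the states are wired in this notebook, the HPO branch has no successor, so an execution with `ToDoHPO` set to `True` ends after the tuning step.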

#### Add error handling to the workflow

In [None]:
catch_state_processing = stepfunctions.steps.states.Catch(
    error_equals = ["States.TaskFailed"],
    next_step = failed_state_sagemaker_pipeline_step_failure   
)
processing_step.add_catch(catch_state_processing)
tuning_step.add_catch(catch_state_processing)
training_step.add_catch(catch_state_processing)
model_step.add_catch(catch_state_processing)
endpoint_config_step.add_catch(catch_state_processing)
endpoint_step.add_catch(catch_state_processing)
existing_model_step.add_catch(catch_state_processing)
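
In Amazon States Language terms, a catcher is an entry in a task state's `Catch` array with `ErrorEquals` and `Next` fields. The fragment below is hand-written to show that shape for the `Fail` state used here; it is not output captured from the SDK, and the other fields of the task state are omitted.

```python
import json


def catch_block(error_equals, next_state):
    """Build one Amazon States Language catcher."""
    return {"ErrorEquals": list(error_equals), "Next": next_state}


# Partial task state: only the fields relevant to error handling are shown.
task_state_fragment = {
    "Type": "Task",
    "Catch": [catch_block(["States.TaskFailed"], "ML Workflow failed")],
}
print(json.dumps(task_state_fragment, indent=2))
```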

#### Generate unique names for the workflow execution input

In [None]:
preprocessing_job_name = f"aqf-preprocessing-{uuid.uuid1().hex}"
# Hyperparameter tuning job names are limited to 32 characters, so shorten the unique suffix.
tuning_job_name = f"aqf-tuning-{uuid.uuid1().hex[:20]}"
training_job_name = f"aqf-training-{uuid.uuid1().hex}"
model_job_name = f"aqf-model-{uuid.uuid1().hex}"
endpoint_job_name = f"aqf-endpoint-{uuid.uuid1().hex}"
evaluation_job_name = f"aqf-evaluation-{uuid.uuid1().hex}"
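
SageMaker enforces length limits on resource names. The helper below is a sketch that truncates the random suffix to fit, assuming a 32-character limit for hyperparameter tuning jobs and 63 characters for the other resource types; verify these limits against the SageMaker API reference before relying on them. The helper and its constants are illustrative, not part of the SageMaker or Step Functions SDKs.

```python
import uuid

# Assumed name length limits (see the lead-in above).
MAX_NAME_LENGTH = {"tuning": 32}
DEFAULT_MAX_LENGTH = 63


def unique_job_name(kind, prefix="aqf"):
    """Return '<prefix>-<kind>-<random hex>' truncated to the assumed limit."""
    base = f"{prefix}-{kind}-"
    limit = MAX_NAME_LENGTH.get(kind, DEFAULT_MAX_LENGTH)
    suffix = uuid.uuid4().hex[: limit - len(base)]
    return base + suffix


names = {
    kind: unique_job_name(kind)
    for kind in ("preprocessing", "tuning", "training",
                 "model", "endpoint", "evaluation")
}
for kind, name in names.items():
    print(f"{kind}: {name} ({len(name)} characters)")
```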

### Create and execute the workflow

In [None]:
workflow_graph = Chain([processing_step, hpo_choice])
workflow = Workflow(
    name = "AirQualityForecastingWorkflow2-02",
    definition = workflow_graph,
    role = WORKFLOW_EXECUTION_ROLE
)
workflow.create()
# call update() so that an existing workflow picks up the new definition; create() only returns the ARN if the workflow already exists.
workflow.update(definition = workflow_graph) 

# execute workflow
execution = workflow.execute(
    inputs = {
        "PreprocessingJobName": preprocessing_job_name,
        "ToDoHPO": False,
        "ToDoTraining": False,
        "TrainingJobName": training_job_name,
        "TuningJobName": tuning_job_name,
        "ModelName": model_job_name,
        "EndpointName": endpoint_job_name,
        "EvaluationProcessingJobName": evaluation_job_name
    }
)
execution_output = execution.get_output(wait = True)

In [None]:
execution.render_progress()

### [Pending] Create inferences (predictions)

Now that we have a trained model, we need to evaluate it using the holdout data. The holdout data is only needed when you first create the model, to get an idea of how the model will perform against new data in production. Once the model is running in production, it is better to always retrain it on all available data and then monitor model performance over time against a trailing set of historical data.

#### Generate test sets to predict
To get an idea of how the model performs, we will create predictions on a 12 hour rolling basis for all of the locations, and then graph and compare them to the actuals. The method below generates the features from the holdout set to do this.

from datetime import date, timedelta
import pandas as pd

def filter_dates(df, min_time, max_time, frequency):
    min_time = None if min_time is None else pd.to_datetime(min_time)
    max_time = None if max_time is None else pd.to_datetime(max_time)
    interval = pd.Timedelta(frequency)
    
    def _filter_dates(r): 
        if min_time is not None and r['start'] < min_time:
            start_idx = int((min_time - r['start']) / interval)
            r['target'] = r['target'][start_idx:]
            r['start'] = min_time
        
        end_time = r['start'] + len(r['target']) * interval
        if max_time is not None and end_time > max_time:
            end_idx = int((end_time - max_time) / interval)
            # slice by absolute length so that end_idx == 0 does not empty the series
            r['target'] = r['target'][:len(r['target']) - end_idx]
            
        return r
    
    filtered = df.apply(_filter_dates, axis=1) 
    filtered = filtered[filtered['target'].str.len() > 0]
    return filtered

def get_tests(features, split_dates, frequency, context_length, prediction_length):
    tests = []
    end_date_delta = pd.Timedelta(f'{frequency} hour') * context_length
    prediction_id = 0
    for split_date in split_dates:
        context_end = split_date + end_date_delta
        test = filter_dates(features, split_date, context_end, f'{frequency}H')
        test['prediction_start'] = context_end
        test['prediction_id'] = prediction_id
        test['start'] = test['start'].dt.strftime('%Y-%m-%d %H:%M:%S')
        tests.append(test)
        prediction_id += 1
        
    tests = pd.concat(tests).reset_index().set_index(['id', 'prediction_id', 'prediction_start']).sort_index()
    return tests


test_data_uri = f"{output_data}/test/test.json"
test_data_uri
local_result_file = "test.json"
s3 = boto3.resource('s3')
s3.Bucket(bucket_name).download_file("preprocessing/output/test/test.json", local_result_file)
test = pd.read_json(local_result_file, orient="records", lines = True, convert_dates=['start'])
#test.reset_index(inplace=True)
test.index.set_names(['id'], inplace = True)

test.head()

ten_days_ago = date.today() - timedelta(days = 10)
test_dates = pd.date_range(ten_days_ago, periods = 216, freq = '1H')
tests = get_tests(test, test_dates, '1', 3, 48)
tests.head()
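
The trimming arithmetic used in `filter_dates` can be illustrated on a single series with the standard library only. The sketch below applies the same index calculations to one made-up hourly series; `trim_series` is an illustrative name and not a function from this notebook.

```python
from datetime import datetime, timedelta


def trim_series(start, target, min_time, max_time, interval):
    """Keep only the observations that fall in [min_time, max_time)."""
    if min_time is not None and start < min_time:
        skip = int((min_time - start) / interval)  # observations before min_time
        target = target[skip:]
        start = min_time
    end_time = start + len(target) * interval
    if max_time is not None and end_time > max_time:
        drop = int((end_time - max_time) / interval)  # observations after max_time
        target = target[:len(target) - drop]
    return start, target


hour = timedelta(hours=1)
# Ten hourly values starting at midnight, trimmed to the window 02:00 to 05:00.
start, target = trim_series(
    datetime(2020, 6, 1, 0), list(range(10)),
    datetime(2020, 6, 1, 2), datetime(2020, 6, 1, 5), hour,
)
print(start, target)
```

Slicing with `len(target) - drop` instead of a negative index keeps the series intact when `drop` works out to zero.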

### Test the endpoint
From the above, you can see that we will need to call our endpoint 4060 times for each of our tests, as we are backtesting every hour, across all locations, for the previous 10 days.
Before we call the endpoint with all of the tests we have generated, let's first try calling it for just one location and time. The request passes in an array of features, one for each location, as well as configuration settings.

> **Try this:** Modify the request to get a different quantile, or the predictions for a different test set.

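import json
from sagemaker.predictor import Predictor  # requires SageMaker Python SDK v2 or later

predictor = Predictor(
    endpoint_job_name,  # the endpoint name passed to the workflow as "EndpointName"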
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer()
)

features = tests[['start','target','cat']].iloc[0].to_dict()
json.dumps(predictor.predict({
    'instances': [features]
    ,'configuration': {
        'num_samples': 20
        ,'output_types': ['quantiles']
        ,'quantiles': ['0.5']
    }
}))

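# NOTE: `predict` and `quantiles` are not defined anywhere in this notebook; this section is still pending.
predictions = predict(predictor.endpoint_name, tests, quantiles) 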
predictions.head()
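
For the "Try this" exercise above, the request body can be assembled by a small helper. This is a minimal sketch of the DeepAR JSON request format with made-up instance values; `build_deepar_request` is an illustrative helper, not part of the SageMaker SDK.

```python
import json


def build_deepar_request(instances, quantiles=("0.5",), num_samples=20):
    """Build a DeepAR inference request that asks for quantile forecasts."""
    return {
        "instances": list(instances),
        "configuration": {
            "num_samples": num_samples,
            "output_types": ["quantiles"],
            # DeepAR expects quantiles as strings, e.g. "0.5"
            "quantiles": [str(q) for q in quantiles],
        },
    }


# Made-up instance with the same fields as the test features above.
request = build_deepar_request(
    [{"start": "2020-06-01 00:00:00", "target": [12.0, 11.5, 13.2], "cat": [0]}],
    quantiles=(0.1, 0.5, 0.9),
)
body = json.dumps(request)
print(body)
```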