# ML Pipeline for "Forecasting Air Quality with Amazon SageMaker DeepAR"

In this example, we build an ML pipeline that automates an air quality forecasting application with the [AWS Step Functions Data Science SDK](https://aws-step-functions-data-science-sdk.readthedocs.io).

## ML Pipeline

### Outcome
* Create the build/train/deploy flow for the air quality forecasting ML process
* Create a simple retraining flow

### Design
* Use the Step Functions Data Science SDK to orchestrate the ML flow
* Use SageMaker Processing for data preprocessing
 * A common Docker image will be built for data retrieval (interacting with Amazon Athena) and data/feature engineering
* Use SageMaker Processing for model evaluation
* Use a scheduled job to retrain the model

### Implementation

#### Initialize

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import sys
!{sys.executable} -m pip install --upgrade pip
!{sys.executable} -m pip install -qU awscli boto3 "sagemaker==1.71.0"
!{sys.executable} -m pip install -qU "stepfunctions==1.1.0"
!{sys.executable} -m pip show sagemaker stepfunctions

In [None]:
import uuid
import time
import boto3
import stepfunctions
from stepfunctions import steps
from stepfunctions.inputs import ExecutionInput
from stepfunctions.steps import (
    Chain,
    ChoiceRule,
    ModelStep,
    ProcessingStep,
    TrainingStep,
    TransformStep
)
from stepfunctions.workflow import Workflow

import sagemaker
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.processing import ProcessingInput, ProcessingOutput

sagemaker_session = sagemaker.Session()

region = boto3.session.Session().region_name
role = get_execution_role()

#### Create Docker Image for SageMaker Processing

Define your own processing container and install related dependencies.

Below, we walk through how to create a processing container and how to use a `ScriptProcessor` to run your own code within it. The container supports data preprocessing, feature engineering and model evaluation.

In [None]:
# create a subfolder for docker 
!mkdir -p docker

Below is the Dockerfile used to create the processing container. It installs pandas, NumPy, GeoPandas, scikit-learn, fsspec, s3fs and boto3. You can add your own dependencies.

In [1]:
%%writefile docker/Dockerfile

FROM python:3.7-slim-buster
    
COPY ./sql /opt/ml/processing/sql
    
RUN pip install pandas numpy geopandas scikit-learn fsspec s3fs boto3

ENV PYTHONUNBUFFERED=TRUE

ENTRYPOINT ["python3"]

Overwriting docker/Dockerfile


The following cells build the container using the `docker` command, create an Amazon Elastic Container Registry (Amazon ECR) repository, and push the image to Amazon ECR.

In [None]:
import boto3

account_id = boto3.client('sts').get_caller_identity().get('Account')
ecr_repository = 'aq-forecasting-processing-container'
tag = ':latest'

uri_suffix = 'amazonaws.com'
if region in ['cn-north-1', 'cn-northwest-1']:
    uri_suffix = 'amazonaws.com.cn'
processing_repository_uri = f'{account_id}.dkr.ecr.{region}.{uri_suffix}/{ecr_repository + tag}'


In [None]:
processing_repository_uri

In [None]:
# @todo consider using CFN template to create ECR repo and only manage the docker image build and push.
!docker build -t $ecr_repository docker

In [None]:
!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)
!aws ecr create-repository --repository-name $ecr_repository
!docker tag {ecr_repository + tag} $processing_repository_uri
!docker push $processing_repository_uri

The file `preprocessing.py` contains the pre-processing script. If you update the script, rerun the upload cell below so that the new version is used. The script runs as a SageMaker Processing job in the workflow and performs the following actions:

* Create an Athena table over the external OpenAQ data source
* Query OpenAQ data 
* Feature engineering on the dataset
* Split and store the data on S3 buckets.
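
The split step can be sketched as follows. This is a minimal, standard-library-only illustration, not the notebook's actual `preprocessing.py`: it assumes that the `--split-days` argument (passed to the container later in this notebook) means "hold out the most recent N days as test data". The names `parse_args` and `split_by_days` and the synthetic readings are hypothetical stand-ins for the real script and the OpenAQ query results.

```python
import argparse
from datetime import datetime, timedelta


def parse_args(argv=None):
    """Parse the container arguments passed to the script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--split-days", type=int, default=30)
    return parser.parse_args(argv)


def split_by_days(records, split_days):
    """Hold out the last `split_days` days of (timestamp, value) records as test data."""
    if not records:
        return [], []
    cutoff = max(ts for ts, _ in records) - timedelta(days=split_days)
    train = [r for r in records if r[0] <= cutoff]
    test = [r for r in records if r[0] > cutoff]
    return train, test


args = parse_args(["--split-days", "30"])

# Synthetic daily readings standing in for queried OpenAQ data (assumption).
start = datetime(2020, 1, 1)
records = [(start + timedelta(days=i), float(i)) for i in range(90)]

train, test = split_by_days(records, args.split_days)
print(len(train), len(test))
```

Splitting by time rather than at random keeps the test window strictly after the training window, which is what a forecasting model needs.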

Upload the pre-processing script.

In [None]:
PREPROCESSING_SCRIPT_LOCATION = "preprocessing.py"

input_code = sagemaker_session.upload_data(
    PREPROCESSING_SCRIPT_LOCATION,
    bucket = sagemaker_session.default_bucket(),
    key_prefix = "processing/code",
)

S3 locations of preprocessing output and training data.

In [None]:
s3_bucket_base_uri = f"s3://{sagemaker_session.default_bucket()}"
output_data = f"{s3_bucket_base_uri}/preprocessing_data/output"
#preprocessed_training_data = f"{output_data}/train_data"

The `ScriptProcessor` class lets you run a command inside the container, which you can use to run your own script.

In [None]:
from sagemaker.processing import ScriptProcessor

preprocessing_processor = ScriptProcessor(
    command = ['python3'],
    image_uri = processing_repository_uri,
    role = role,
    instance_count = 1,
    instance_type = 'ml.m5.xlarge',
    max_runtime_in_seconds = 1200
)

### Create the ProcessingStep
We will now create the [ProcessingStep](https://aws-step-functions-data-science-sdk.readthedocs.io/en/stable/sagemaker.html#stepfunctions.steps.sagemaker.ProcessingStep) that will launch a SageMaker Processing Job.

This step uses the `ScriptProcessor` defined above, together with the input and output objects defined in the cells below.

In [None]:
inputs = [
    ProcessingInput(
        source = input_code,
        destination = "/opt/ml/processing/input/code",
        input_name = "code"
    )
]

outputs = [
    ProcessingOutput(
        source = "/opt/ml/processing/output",
        destination = f"{output_data}",
        output_name = "output_data"
    )
]

In [None]:
execution_input = ExecutionInput(
    schema = {
        "PreprocessingJobName": str,
        "TrainingJobName": str,
        "TuningJobName": str,
        "EvaluationProcessingJobName": str
    }
)

#### Create the ProcessingStep

In [None]:
processing_step = ProcessingStep(
    "Air Quality Forecasting pre-processing step",
    processor = preprocessing_processor,
    job_name = execution_input["PreprocessingJobName"],
    inputs = inputs,
    outputs = outputs,
    container_arguments = ["--split-days", "30"],
    container_entrypoint = ["python3", "/opt/ml/processing/input/code/preprocessing.py"]
)

### Training Using the pre-processed data

We create a DeepAR estimator to run a training job. The estimator is then used to create a `TrainingStep` for the workflow.
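
DeepAR reads its training channel as JSON Lines, one time series per line, with a `start` timestamp and a `target` array of values. The sketch below shows how pre-processed readings could be serialized into that layout. The helper name `to_deepar_record` and the sample values are made up for illustration, and the notebook's own pre-processing script may write its output differently.

```python
import json
from datetime import datetime


def to_deepar_record(start, values):
    """Serialize one time series as a single DeepAR JSON Lines record."""
    return json.dumps({
        "start": start.strftime("%Y-%m-%d %H:%M:%S"),
        "target": list(values),
    })


# Made-up readings for one monitoring location (assumption).
line = to_deepar_record(datetime(2020, 1, 1), [12.5, 13.0, 11.8])
print(line)
```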

### Model Evaluation

Model evaluation runs as a SageMaker Processing job.
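
The evaluation script itself is not shown here. As a rough sketch of what such a job might compute, the snippet below scores a forecast against held-out observations with root mean squared error (RMSE) and mean absolute percentage error (MAPE). The function names and sample numbers are illustrative assumptions, not the metrics this pipeline is known to use.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error, skipping zero-valued observations."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("no non-zero observations to score")
    return 100.0 * sum(abs((a - p) / a) for a, p in pairs) / len(pairs)


# Made-up observations and forecasts, for illustration only.
actual = [10.0, 20.0, 40.0]
predicted = [12.0, 18.0, 40.0]
print(f"RMSE={rmse(actual, predicted):.3f}  MAPE={mape(actual, predicted):.2f}%")
```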

Create a `Fail` state to mark the workflow as failed if any of the steps fails.

In [None]:
failed_state_sagemaker_processing_failure = stepfunctions.steps.states.Fail(
    "ML Workflow failed", cause = "SageMakerProcessingJobFailed"
)

#### Add error handling to the workflow

In [None]:
catch_state_processing = stepfunctions.steps.states.Catch(
    error_equals = ["States.TaskFailed"],
    next_step = failed_state_sagemaker_processing_failure   
)
processing_step.add_catch(catch_state_processing)
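
In the Amazon States Language definition that the SDK generates, a catch like this appears as a `Catch` array on the task state, and the `Fail` state is a terminal state with a `Cause`. The dictionary below is a hand-written, simplified sketch of that shape (the task's `Resource` and `Parameters` are omitted). It is not output captured from the workflow definition.

```python
import json

# Simplified, hand-written sketch of the relevant state machine fragment.
states = {
    "Air Quality Forecasting pre-processing step": {
        "Type": "Task",
        # Route any task failure to the Fail state instead of ending silently.
        "Catch": [
            {
                "ErrorEquals": ["States.TaskFailed"],
                "Next": "ML Workflow failed",
            }
        ],
        "End": True,
    },
    "ML Workflow failed": {
        "Type": "Fail",
        "Cause": "SageMakerProcessingJobFailed",
    },
}

print(json.dumps(states, indent=2))
```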

#### Workflow Role

In [None]:
# Replace with the ARN of the Step Functions workflow execution role in your own account.
workflow_execution_role = "arn:aws:iam::593380422482:role/StepFunctionsWorkflowExecutionRole"

#### Generate job names for the Step Functions workflow execution input

In [None]:
preprocessing_job_name = f"aqf-preprocessing-{uuid.uuid1().hex}"
training_job_name = f"aqf-training-{uuid.uuid1().hex}"
tuning_job_name = f"aqf-tuning-{uuid.uuid1().hex}"
evaluation_job_name = f"aqf-evaluation-{uuid.uuid1().hex}"

In [None]:
len(preprocessing_job_name)
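
The length check matters because SageMaker restricts job names. The sketch below assumes the limits given in the SageMaker API reference (63 characters for processing and training jobs, 32 for hyperparameter tuning jobs, alphanumerics and hyphens only); verify them against the current documentation. `is_valid_job_name` and `make_job_name` are hypothetical helpers, not part of any SDK.

```python
import re
import uuid

# Assumed limits; check the current SageMaker API reference before relying on them.
MAX_JOB_NAME_LENGTH = 63          # processing and training jobs
MAX_TUNING_JOB_NAME_LENGTH = 32   # hyperparameter tuning jobs
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")


def is_valid_job_name(name, max_length=MAX_JOB_NAME_LENGTH):
    """Return True if the name fits the assumed length and character rules."""
    return len(name) <= max_length and JOB_NAME_PATTERN.match(name) is not None


def make_job_name(prefix, max_length=MAX_JOB_NAME_LENGTH):
    """Build a unique job name, truncating the UUID suffix to fit max_length."""
    room = max_length - len(prefix) - 1  # characters left after "prefix-"
    if room < 8:
        raise ValueError("prefix leaves too little room for a unique suffix")
    return f"{prefix}-{uuid.uuid4().hex[:room]}"


preprocessing_name = make_job_name("aqf-preprocessing")
tuning_name = make_job_name("aqf-tuning", MAX_TUNING_JOB_NAME_LENGTH)
print(len(preprocessing_name), len(tuning_name))
```

Under these assumptions, `aqf-tuning-` followed by a full 32-character UUID hex string (43 characters in total) would be too long for a tuning job, which is why the helper truncates the suffix.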

### Create and execute the workflow

In [None]:
workflow_graph = Chain([processing_step])
workflow = Workflow(
    name = "AirQualityForecastingWorkflow",
    definition = workflow_graph,
    role = workflow_execution_role
)
workflow.create()

# execute workflow
execution = workflow.execute(
    inputs = {
        "PreprocessingJobName": preprocessing_job_name,
        "TrainingJobName": training_job_name,
        "TuningJobName": tuning_job_name,
        "EvaluationProcessingJobName": evaluation_job_name
    }
)
execution_output = execution.get_output(wait = True)

In [None]:
execution.render_progress()