# ML Pipeline for "Forecasting Air Quality with Amazon SageMaker DeepAR"

In this example, we build an ML pipeline that automates an air quality forecasting application with the [AWS Step Functions Data Science SDK](https://aws-step-functions-data-science-sdk.readthedocs.io).

## ML Pipeline

### Outcome
* Create the build/train/deploy flow for the air quality forecasting ML process
* Create a simple retraining flow

### Design
* Use the Step Functions Data Science SDK to orchestrate the ML flow
* Use SageMaker Processing for data preprocessing; in particular:
 * A common Docker image will be built for data retrieval (interacting with Amazon Athena) and data/feature engineering
* Use SageMaker Processing for model evaluation
* Use a scheduled job mechanism for model retraining
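
The design above can be pictured as a linear state machine. The following is a minimal sketch in plain Python that builds an Amazon States Language definition by hand. It does not use the Data Science SDK, and the step names and their order are assumptions made for illustration. The `Resource` values are the Step Functions service integration identifiers for SageMaker; the `Parameters` of each task (job names, image URIs, S3 paths) are left out on purpose.

```python
import json

# Hypothetical step names, paired with the Step Functions service
# integration identifiers for the corresponding SageMaker API calls.
SAGEMAKER_TASKS = [
    ("Preprocess", "arn:aws:states:::sagemaker:createProcessingJob.sync"),
    ("Train", "arn:aws:states:::sagemaker:createTrainingJob.sync"),
    ("Evaluate", "arn:aws:states:::sagemaker:createProcessingJob.sync"),
    ("CreateModel", "arn:aws:states:::sagemaker:createModel"),
    ("CreateEndpointConfig", "arn:aws:states:::sagemaker:createEndpointConfig"),
    ("CreateEndpoint", "arn:aws:states:::sagemaker:createEndpoint"),
]


def build_definition(tasks):
    """Chain (name, resource) pairs into a linear state machine definition."""
    states = {}
    for index, (name, resource) in enumerate(tasks):
        state = {"Type": "Task", "Resource": resource}
        if index + 1 < len(tasks):
            # Every state except the last one points at its successor.
            state["Next"] = tasks[index + 1][0]
        else:
            # The last state terminates the execution.
            state["End"] = True
        states[name] = state
    return {"StartAt": tasks[0][0], "States": states}


definition = build_definition(SAGEMAKER_TASKS)
print(json.dumps(definition, indent=2))
```

A retraining flow could reuse the same chain and simply be started on a schedule; how the schedule is wired up is outside this sketch.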

### Implementation

#### Initialize

In [13]:
%load_ext autoreload
%autoreload 2

In [14]:
!pip install --upgrade sagemaker

Collecting sagemaker
  Downloading sagemaker-2.5.5.tar.gz (293 kB)
Collecting google-pasta
  Using cached google_pasta-0.2.0-py3-none-any.whl (57 kB)
Collecting smdebug-rulesconfig==0.1.5
  Using cached smdebug_rulesconfig-0.1.5-py2.py3-none-any.whl (6.2 kB)
Building wheels for collected packages: sagemaker
  Building wheel for sagemaker (setup.py) ... done
  Created wheel for sagemaker: filename=sagemaker-2.5.5-py2.py3-none-any.whl size=415596 sha256=093aeed935e58d031f10ff82ead34292fec7520eb5c069ea69c216331a63ad81
  Stored in directory: /home/ec2-user/.cache/pip/wheels/0d/55/96/5edc5b32f17c32cf305789d97dff9688167dc93ea4e4af6667
Successfully built sagemaker
Installing collected packages: google-pasta, smdebug-rulesconfig, sagemaker
  Attempting uninstall: smdebug-rulesconfig
    Found existing installation: smdebug-rulesconfig 0.1.4
    Uninstalling smdebug-rulesconfig-0.1.4:
      Successfully uninstal

In [3]:
import boto3
import sagemaker
from sagemaker import get_execution_role

region = boto3.session.Session().region_name
role = get_execution_role()

#### Create Docker Image for SageMaker Processing

Define your own processing container and install related dependencies.

The following cells walk through how to create a processing container and how to use a `ScriptProcessor` to run your own code inside it. The container supports data preprocessing, feature engineering, and model evaluation.

In [4]:
# create a subfolder for docker 
!mkdir docker

Below is the Dockerfile for the processing container. It installs PyAthena (with its pandas extra), GeoPandas, and scikit-learn. You can add your own dependencies.

In [5]:
%%writefile docker/Dockerfile

FROM python:3.7-slim-buster
    
RUN pip install PyAthena[Pandas] geopandas scikit-learn
ENV PYTHONUNBUFFERED=TRUE

ENTRYPOINT ["python3"]

Writing docker/Dockerfile


The following cells build the container using the `docker` command, create an Amazon Elastic Container Registry (Amazon ECR) repository, and push the image to Amazon ECR.

In [6]:
import boto3

account_id = boto3.client('sts').get_caller_identity().get('Account')
ecr_repository = 'aq-forecasting-processing-container'
tag = ':latest'

uri_suffix = 'amazonaws.com'
if region in ['cn-north-1', 'cn-northwest-1']:
    uri_suffix = 'amazonaws.com.cn'
processing_repository_uri = f'{account_id}.dkr.ecr.{region}.{uri_suffix}/{ecr_repository + tag}'


In [7]:
processing_repository_uri

'593380422482.dkr.ecr.us-east-1.amazonaws.com/aq-forecasting-processing-container:latest'

In [10]:
# @todo consider using CFN template to create ECR repo and only manage the docker image build and push.
!docker build -t $ecr_repository docker


Sending build context to Docker daemon  2.048kB
Step 1/4 : FROM python:3.7-slim-buster
3.7-slim-buster: Pulling from library/python

f8d1c412: Pulling fs layer
2574cc82: Pulling fs layer
6349c99d: Pulling fs layer
c0b72728: Pulling fs layer
Digest: sha256:4731bee5e891e5bf8af43f0bfd3c25f8b999eb7ef6757f756f4cb9836c929eeb
Status: Downloaded newer image for python:3.7-slim-buster
 ---> 4d4a9832278b
Step 2/4 : RUN pip install PyAthena[Pandas] geopandas scikit-learn
 ---> Running in badac19e526a
Collecting PyAthena[Pandas]
  Downloading PyAthena-1.11.1-py2.py3-none-any.whl (49 kB)
Collecting geopandas
  Downloading geopandas-0.8.1-py2.py3-none-any.whl (962 kB)
Collecting scikit-learn
  Downloading scikit_learn-0.23.2-cp37-cp37m-manylinux1_x86_64.whl (6.8 MB)

In [12]:
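# Note: `aws ecr get-login` is only available in AWS CLI v1; AWS CLI v2 replaces it with `aws ecr get-login-password`.
!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)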
!aws ecr create-repository --repository-name $ecr_repository
!docker tag {ecr_repository + tag} $processing_repository_uri
!docker push $processing_repository_uri

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded

An error occurred (RepositoryAlreadyExistsException) when calling the CreateRepository operation: The repository with name 'aq-forecasting-processing-container' already exists in the registry with id '593380422482'
The push refers to repository [593380422482.dkr.ecr.us-east-1.amazonaws.com/aq-forecasting-processing-container]

4b9d8fee: Preparing
e5632dc8: Preparing
857805ec: Preparing
87503449: Preparing
6688d36c: Preparing
4b9d8fee: Pushed
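
The `RepositoryAlreadyExistsException` in the output above appears whenever the cell is rerun, because the repository is created unconditionally. A minimal sketch of an idempotent alternative using the boto3 ECR client follows. The helper name `ensure_ecr_repository` is hypothetical, and the client is passed in as an argument so the function can be exercised without AWS credentials.

```python
def ensure_ecr_repository(ecr_client, repository_name):
    """Create the repository if it is missing.

    Returns True when the repository was created and False when it
    already existed. `ecr_client` is expected to behave like
    `boto3.client("ecr")`.
    """
    try:
        ecr_client.create_repository(repositoryName=repository_name)
        return True
    except ecr_client.exceptions.RepositoryAlreadyExistsException:
        # Rerunning the notebook is fine: the repository is already there.
        return False


# Usage in the notebook (requires AWS credentials, so it is left commented out):
# import boto3
# ensure_ecr_repository(boto3.client("ecr"), "aq-forecasting-processing-container")
```

Any other error, such as missing permissions, still propagates, so real failures are not hidden.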

The cell below writes `preprocessing.py`, which contains the preprocessing script. You can update the script and rerun the cell to overwrite `preprocessing.py`. You run it as a processing job in a following cell. The script performs these actions:

* Create an Athena table over the external OpenAQ source
* Query the OpenAQ data
* Run feature engineering on the dataset
* Split the data and store it in S3 buckets
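
The split-and-store step can be illustrated with a minimal sketch. It assumes evenly spaced hourly series, that the training set is each series truncated at the split timestamp while the test set keeps the full series (a convention commonly used with DeepAR), and that records are serialized in DeepAR's JSON Lines layout with `start` and `target` fields. The helper names `split_series` and `to_deepar_record` are hypothetical and are not taken from `preprocessing.py`.

```python
import json
import math
from datetime import datetime, timedelta


def split_series(start, values, split_at, freq_hours=1):
    """Split one evenly spaced series at `split_at`.

    Returns (train_values, test_values). The training part holds the
    observations stamped strictly before `split_at`; the test part is
    the full series (assumed convention, see the text above).
    """
    step = timedelta(hours=freq_hours)
    # Number of observations whose timestamp is earlier than split_at.
    n_train = math.ceil((split_at - start) / step)
    n_train = max(0, min(n_train, len(values)))
    return values[:n_train], values


def to_deepar_record(start, values):
    """Serialize one series as a single JSON Lines record."""
    return json.dumps({
        "start": start.strftime("%Y-%m-%d %H:%M:%S"),
        "target": list(values),
    })


series_start = datetime(2020, 1, 1, 0, 0)
observations = [1.0, 2.0, 3.0, 4.0, 5.0]
train, test = split_series(series_start, observations, datetime(2020, 1, 1, 3, 0))
print(to_deepar_record(series_start, train))
print(to_deepar_record(series_start, test))
```

Each printed line is one record; a training or test file is the concatenation of such lines, one per time series.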

In [None]:
%%writefile preprocessing.py

import argparse
import json
import os
import time
import urllib.request
import warnings
from datetime import date, timedelta
from multiprocessing import Pool

import boto3
import geopandas as gpd
import numpy as np
import pandas as pd
import s3fs

# the train test split date is used to split each time series into train and test sets
train_test_split_date = date.today() - timedelta(days = 30)

# the sampling frequency determines the number of hours per sample
# and is used for aggregating and filling missing values
frequency = '1'

# prediction length is how many hours into future to predict values for
prediction_length = 48

# context length is how many prior time steps the predictor needs to make a prediction
context_length = 3

warnings.filterwarnings('ignore')

session = boto3.Session()
region = session.region_name
account = session.client('sts').get_caller_identity().get('Account')
bucket_name = f"{account}-openaq-lab"

s3 = boto3.client('s3')

# @todo to evaluate whether we should store existing model.tar.gz onto s3 bucket.

# processing Athena
def athena_execute(query_file, ext, wait):
    with open(query_file) as f:
        query_str = f.read()
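
For reference, a generic submit-and-poll pattern for Athena with boto3 can be sketched as follows. This is not the body of `athena_execute`; the function name `run_athena_query`, its parameters, and the timeout handling are assumptions made for illustration. The client is passed in so the logic can be tried without AWS access.

```python
import time


def run_athena_query(athena, query, output_location, database="default",
                     poll_seconds=1.0, timeout_seconds=300.0):
    """Submit a query and block until Athena reports a final state.

    `athena` is expected to behave like `boto3.client("athena")`.
    Returns the query execution id on success.
    """
    response = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = response["QueryExecutionId"]
    deadline = time.monotonic() + timeout_seconds

    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        state = status["State"]
        if state == "SUCCEEDED":
            return query_id
        if state in ("FAILED", "CANCELLED"):
            reason = status.get("StateChangeReason", "no reason reported")
            raise RuntimeError(f"Athena query {query_id} ended as {state}: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Athena query {query_id} is still {state}")
        # QUEUED or RUNNING: wait before asking again.
        time.sleep(poll_seconds)


# Usage (requires AWS credentials, so it is left commented out):
# import boto3
# run_athena_query(boto3.client("athena"), query_str, "s3://<bucket>/athena-results/")
```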
        
    

The `ScriptProcessor` class lets you run a command inside the container, which you can use to run your own script.

In [None]:
from sagemaker.processing import ScriptProcessor

preprocessing_processor = ScriptProcessor(
    command = ['python3'],
    image_uri = processing_repository_uri,
    role = role,
    instance_count = 1,
    instance_type = 'ml.m5.xlarge'
)
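
Once the processor exists, the script is submitted with its `run` method. The sketch below wraps that call. The helper names, the `aq-preprocessing` prefix, and the timestamped job name are assumptions for illustration; only the `code`, `job_name`, `wait`, and `logs` arguments of `run` are used.

```python
import time


def unique_job_name(prefix, now=None):
    """Build a job name of at most 63 characters with a UTC timestamp suffix."""
    stamp = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime(now))
    # Trim the prefix, not the timestamp, so names stay unique.
    return f"{prefix[:63 - len(stamp) - 1]}-{stamp}"


def run_preprocessing(processor, script="preprocessing.py",
                      job_prefix="aq-preprocessing", wait=True):
    """Submit `script` as a processing job and return the job name.

    `processor` is expected to behave like the `ScriptProcessor`
    created above.
    """
    job_name = unique_job_name(job_prefix)
    # Logs can only be streamed while waiting, so `logs` follows `wait`.
    processor.run(code=script, job_name=job_name, wait=wait, logs=wait)
    return job_name


# Usage in the notebook (starts a billable job, so it is left commented out):
# run_preprocessing(preprocessing_processor)
```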