## Automate Model Retraining & Deployment Using the AWS Step Functions Data Science SDK

**This sample is provided for demonstration purposes only. Make sure to test it appropriately before adapting this code to your own use cases!**

This notebook describes how to use the AWS Step Functions Data Science SDK to create a machine learning model retraining workflow. The SDK is an open-source library that lets data scientists create and execute machine learning workflows using AWS Step Functions and Amazon SageMaker. For more information, see the following resources:
* [AWS Step Functions](https://aws.amazon.com/step-functions/)
* [AWS Step Functions Developer Guide](https://docs.aws.amazon.com/step-functions/latest/dg/welcome.html)
* [AWS Step Functions Data Science SDK](https://aws-step-functions-data-science-sdk.readthedocs.io)


### Step 0: Get Admin Setup Results
Load the outputs of the admin setup: bucket names, CodeCommit repo, Docker image, IAM roles, ...

To keep things organized, we will save our `Source Code` (data processing, model training/serving scripts), `datasets`, and our trained `model(s) binaries` with their `test-performance metrics` on S3, **versioned by the date/time of each update.**
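
As a minimal sketch of this layout, the helper below builds date-stamped S3 URIs. The helper name `versioned_s3_uri` and the bucket `my-project-bucket` are hypothetical and used only for illustration; the notebook itself builds the same kind of paths inline with `str.format`.

```python
from time import gmtime, strftime


def versioned_s3_uri(bucket, run_timestamp, *parts):
    """Return s3://<bucket>/<run_timestamp>/<parts...> for a single workflow run."""
    key = "/".join((run_timestamp,) + parts)
    return "s3://{}/{}".format(bucket, key)


# One timestamp per workflow run keeps code, data, and model artifacts together.
run_timestamp = strftime("%Y-%m-%d-%H-%M-%S", gmtime())

# "my-project-bucket" is a placeholder, not a real bucket.
source_code_uri = versioned_s3_uri(
    "my-project-bucket", run_timestamp, "source-code", "sourcedir.tar.gz"
)
train_data_uri = versioned_s3_uri(
    "my-project-bucket", run_timestamp, "data", "train", "train.csv"
)
print(source_code_uri)
print(train_data_uri)
```

Because every path shares the same timestamp prefix, listing that prefix in S3 shows everything that belongs to one retraining run.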

In [None]:
# Upgrade the stepfunctions library
import sys
!{sys.executable} -m pip install --upgrade stepfunctions

In [None]:
import json
from time import gmtime, strftime
import logging
import stepfunctions
from stepfunctions import steps
from stepfunctions.steps.choice_rule import ChoiceRule
from stepfunctions.steps import TrainingStep, ModelStep
from stepfunctions.inputs import ExecutionInput
from stepfunctions.workflow import Workflow
stepfunctions.set_stream_logger(level=logging.INFO)


# Set project bucket, IAM Roles and Docker Image for Training
with open('admin_setup.txt', 'r') as filehandle:
    admin_setup = json.load(filehandle)

SOURCE_DATA = admin_setup["raw_data_path"]
BUCKET = admin_setup["project_bucket"]
REPO_NAME = admin_setup["repo_name"]
TRAINING_IMAGE = admin_setup["docker_image"]
WORKFLOW_EXECUTION_ROLE = admin_setup["workflow_execution_role"]


# MLOps Hygiene
WORKFLOW_NAME = "my-project"
WORKFLOW_DATE_TIME = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
TRAINING_JOB_NAME = "{}-{}".format(WORKFLOW_NAME, WORKFLOW_DATE_TIME)

SOURCE_CODE_PREFIX = "{}/{}".format(WORKFLOW_DATE_TIME, "source-code")
SOURCE_CODE = "s3://{}/{}/{}".format(BUCKET, SOURCE_CODE_PREFIX, "sourcedir.tar.gz")

OUTPUT_ARTIFACTS_PREFIX = "{}/{}".format(WORKFLOW_DATE_TIME, "model-artifacts")
OUTPUT_ARTIFACTS_PATH = "s3://{}/{}/".format(BUCKET, OUTPUT_ARTIFACTS_PREFIX)


TRAINING_DATA_PATH = "s3://{}/{}/data/train/train.csv".format(BUCKET, WORKFLOW_DATE_TIME)
VALIDATION_DATA_PATH = "s3://{}/{}/data/validation/validation.csv".format(BUCKET, WORKFLOW_DATE_TIME)
TESTING_DATA_PATH = "s3://{}/{}/data/test/test.csv".format(BUCKET, WORKFLOW_DATE_TIME)

codecommit_to_s3_event = {
    "s3BucketName":BUCKET,
    "s3BucketKey":"{}/{}".format(WORKFLOW_DATE_TIME, "source-code"),
    "repository": REPO_NAME,
    "branch": "master",
    "codecommitRegion":"us-east-1",
    "repository_sagemaker_key": "sagemaker-train-serve-src",
    "repository_sm_processing_key": "sagemaker-processing-src"
}

processing_job_event = {
    "JOB_NAME":"{}-{}".format(WORKFLOW_NAME, WORKFLOW_DATE_TIME),
    "BUCKET": BUCKET,
    "WORKFLOW_DATE_TIME":WORKFLOW_DATE_TIME,
    "SOURCE_CODE_PREFIX":"{}/{}".format(WORKFLOW_DATE_TIME, "source-code"),
    "ENTRY_POINT_SCRIPT":"processing.py",
    "TRAINING_IMAGE":TRAINING_IMAGE,
    "ROLE_ARN":WORKFLOW_EXECUTION_ROLE,
    "INSTANCE_TYPE":"ml.c5.xlarge",
    "INSTANCE_COUNT":1,
    "VOLUME_SIZE_GB":10,
    "DATA_SOURCE": SOURCE_DATA
}

training_job_event = {
    "TRAINING_JOB_NAME":"{}-{}".format(WORKFLOW_NAME, WORKFLOW_DATE_TIME),
    "TRAINING_DATA":TRAINING_DATA_PATH,
    "TESTING_DATA":VALIDATION_DATA_PATH,
    "SOURCE_CODE":"s3://{}/{}/{}".format(BUCKET, WORKFLOW_DATE_TIME, "source-code/sourcedir.tar.gz"),
    "ENTRY_POINT_SCRIPT":"train.py",
    "TRAINING_IMAGE":TRAINING_IMAGE,
    "ROLE_ARN":WORKFLOW_EXECUTION_ROLE,
    "OUTPUT_ARTIFACTS_PATH":"s3://{}/{}/{}/".format(BUCKET, WORKFLOW_DATE_TIME, "model-artifacts"),
    "INSTANCE_TYPE":"ml.c5.xlarge",
    "INSTANCE_COUNT":1,
    "VOLUME_SIZE_GB":10,
    "PROCESSING_JOB_NAME":"{}-{}".format(WORKFLOW_NAME, WORKFLOW_DATE_TIME)
}

deploy_event = {
    "EndPointConfigName":"{}-{}".format(WORKFLOW_NAME, WORKFLOW_DATE_TIME),
    "EndPointName":WORKFLOW_NAME,
    "ModelURL":"s3://{}/{}/{}".format(BUCKET, WORKFLOW_DATE_TIME, "model-artifacts/"),
    "Directory":"s3://{}/{}/{}".format(BUCKET, WORKFLOW_DATE_TIME, "source-code/sourcedir.tar.gz"),
    "Program":"train.py",
    "Region":"us-east-1",
    "TrainingImage":TRAINING_IMAGE,
    "ROLE_ARN":WORKFLOW_EXECUTION_ROLE,
    "OUTPUT_ARTIFACTS_PATH":"s3://{}/{}/{}/{}".format(BUCKET, WORKFLOW_DATE_TIME, "model-artifacts", "{}-{}".format(WORKFLOW_NAME, WORKFLOW_DATE_TIME)+"/output/model.tar.gz"),
    "DeploymentInstanceType":"ml.c5.xlarge",
    "DeploymentInstanceCount":1
}


processing_status_event = {
    'JOB_NAME':"{}-{}".format(WORKFLOW_NAME, WORKFLOW_DATE_TIME)
}
training_status_event = {
    'JOB_NAME':"{}-{}".format(WORKFLOW_NAME, WORKFLOW_DATE_TIME)
}

model_accuracy_event = {
    'TrainingJobName':"{}-{}".format(WORKFLOW_NAME, WORKFLOW_DATE_TIME)
}
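
The setup cell reads five keys from `admin_setup.txt` and fails with a `KeyError` if any is missing. The sketch below shows one way to check the file's shape up front. The key names come from the cell above; the helper `missing_admin_keys` and every value in `example_setup` are hypothetical placeholders, not real resources.

```python
import json

# Keys that the setup cell reads from admin_setup.txt.
REQUIRED_KEYS = (
    "raw_data_path",
    "project_bucket",
    "repo_name",
    "docker_image",
    "workflow_execution_role",
)


def missing_admin_keys(admin_setup):
    """Return the required keys that are absent or empty in the setup dictionary."""
    return [key for key in REQUIRED_KEYS if not admin_setup.get(key)]


# Placeholder values only; real values come from the admin setup step.
example_setup = json.loads("""
{
    "raw_data_path": "s3://example-bucket/raw/data.csv",
    "project_bucket": "example-bucket",
    "repo_name": "example-repo",
    "docker_image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/example:latest",
    "workflow_execution_role": "arn:aws:iam::123456789012:role/example-role"
}
""")

print(missing_admin_keys(example_setup))
print(missing_admin_keys({"project_bucket": "example-bucket"}))
```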


### Define Workflow Schema

In [None]:
execution_input = ExecutionInput(schema={
    'CodeCommitToS3Step': str,
    'DataProcessingStep': str,
    'DataProcessingStatusStep': str,
    'TrainingStep': str,
    'TrainingStatusStep': str,
    'ModelAccuracyStep': str,
    'DeployModelStep': str
})
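
Each schema key is a placeholder that is resolved at execution time, here to the name of a Lambda function. The following plain-Python sketch checks a candidate inputs dictionary against the same keys before an execution is started. It is independent of any validation the SDK may perform itself, and the helper `check_execution_inputs` is a hypothetical name introduced for illustration.

```python
# Same keys as the ExecutionInput schema above; every value is expected to be a string.
SCHEMA = {
    "CodeCommitToS3Step": str,
    "DataProcessingStep": str,
    "DataProcessingStatusStep": str,
    "TrainingStep": str,
    "TrainingStatusStep": str,
    "ModelAccuracyStep": str,
    "DeployModelStep": str,
}


def check_execution_inputs(schema, inputs):
    """Return a list of human-readable problems; an empty list means the inputs match."""
    problems = []
    for key, expected_type in schema.items():
        if key not in inputs:
            problems.append("missing: " + key)
        elif not isinstance(inputs[key], expected_type):
            problems.append("wrong type: " + key)
    for key in inputs:
        if key not in schema:
            problems.append("unexpected: " + key)
    return problems


# Function names below are illustrative; real names depend on the deployed Lambdas.
complete_inputs = {key: "my-project-" + key.lower() for key in SCHEMA}
print(check_execution_inputs(SCHEMA, complete_inputs))
print(check_execution_inputs(SCHEMA, {"TrainingStep": 42, "Extra": "x"}))
```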

### Create Workflow States (steps)

In [None]:
# StepN: Create Fail State
fail_step = steps.states.Fail(
    'Workflow Failed',
    comment='Either validation accuracy is below the threshold, or one of the processing, training, or deployment jobs has failed.'
)

# Step1: Copy source code from CodeCommit to S3
codecommit_to_s3_step = steps.compute.LambdaStep(
    state_id = 'Put Code on S3',
    parameters={ 
        "FunctionName": execution_input['CodeCommitToS3Step'],
        'Payload':codecommit_to_s3_event
    }
)
# Step2: Run SageMaker Data Processing Job
data_processing_step = steps.compute.LambdaStep(
    state_id = 'Data Processing',
    parameters={  
        "FunctionName": execution_input['DataProcessingStep'],
        'Payload':processing_job_event
    }
)

# Step3: Wait a little bit
wait_for_data_processing = steps.states.Wait(
    state_id = "Wait 30 Seconds",
    seconds = 30
)

# Step4: Check if processing job has finished
get_processing_status = steps.compute.LambdaStep(
    state_id = "Processing Job Status",
    parameters={  
        "FunctionName": execution_input['DataProcessingStatusStep'],
        'Payload':processing_status_event
    }
)

# Step5: If processing job is not done, go back to waiting (Step3), if done go to Step6, else go to failure
# We will author this step later
# ...

# Step6: Start SageMaker Training Job
model_training_step = steps.compute.LambdaStep(
    'Model Training',
    parameters={  
        "FunctionName": execution_input['TrainingStep'],
        'Payload':training_job_event
    }
)

# Step7: Wait a little bit
wait_for_training = steps.states.Wait(
    state_id = "Wait 60 Seconds",
    seconds = 60
)

# Step8: Check if training job has finished
get_training_status = steps.compute.LambdaStep(
    state_id = "Training Job Status",
    parameters={  
        "FunctionName": execution_input['TrainingStatusStep'],
        'Payload':training_status_event
    }
)

# Step9: If training job is not done, go back to waiting (Step7), if done go to Step10, else go to failure
# We will author this step later
# ...

# Step10: Get model accuracy (custom print to logs during training)
get_model_accuracy = steps.compute.LambdaStep(
    state_id = "Get Model Median Abs. Err.",
    parameters={  
        "FunctionName": execution_input['ModelAccuracyStep'],
        'Payload':model_accuracy_event
    }
)

# Step11: If the model's Median Abs. Err. is less than 3, go to the next step (deployment), else go to failure
# We will author this step later
# ...

# Step12: Create Endpoint (or update it if it exists)
deploy_model_step = steps.compute.LambdaStep(
    'Deploy Model',
    parameters={  
        "FunctionName": execution_input['DeployModelStep'],
        'Payload':deploy_event
    }
)
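
The status-polling steps above expect their Lambda functions to return the job status in the payload. The Lambda source is not part of this notebook, so the handler below is only a sketch of what the function behind `DataProcessingStatusStep` might look like, assuming it calls the SageMaker `DescribeProcessingJob` API through boto3. The optional `sagemaker_client` argument is an addition for local testing; Lambda itself passes only `event` and `context`.

```python
def lambda_handler(event, context, sagemaker_client=None):
    """Return the status of the SageMaker processing job named in the event."""
    if sagemaker_client is None:
        # Imported lazily so the sketch can be exercised without AWS access.
        import boto3
        sagemaker_client = boto3.client("sagemaker")

    # The workflow sends {"JOB_NAME": ...} as the Lambda payload.
    response = sagemaker_client.describe_processing_job(
        ProcessingJobName=event["JOB_NAME"]
    )
    # The Choice state later reads this value from ['Payload']['ProcessingJobStatus'].
    return {"ProcessingJobStatus": response["ProcessingJobStatus"]}


class FakeSageMakerClient:
    """Stand-in client for local experiments; returns a fixed, made-up status."""

    def describe_processing_job(self, ProcessingJobName):
        return {
            "ProcessingJobName": ProcessingJobName,
            "ProcessingJobStatus": "InProgress",
        }


result = lambda_handler({"JOB_NAME": "my-project-example"}, None, FakeSageMakerClient())
print(result)
```

A training-status function could follow the same pattern with `describe_training_job` and the `TrainingJobStatus` field.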

In [None]:
# Step5: If processing job is not done, go back to waiting (Step3), if done go to Step6, else go to failure
check_pocessing_status = steps.states.Choice(
    state_id = "Processing Job Complete?",
)

processing_job_output = get_processing_status.output()

completed_rule = ChoiceRule.StringEquals(variable=processing_job_output['Payload']['ProcessingJobStatus'],
                                    value="Completed"
                                   )
in_progress_rule = ChoiceRule.StringEquals(variable=processing_job_output['Payload']['ProcessingJobStatus'],
                                           value="InProgress"
                                          )

check_pocessing_status.add_choice(rule=completed_rule, next_step=model_training_step)
check_pocessing_status.add_choice(rule=in_progress_rule, next_step=wait_for_data_processing)
check_pocessing_status.default_choice(fail_step)


# Step9: If training job is not done, go back to waiting (Step7), if done go to Step10, else go to failure
check_training_status = steps.states.Choice(
    state_id = "Training Job Complete?",
)

training_job_output = get_training_status.output()

completed_rule = ChoiceRule.StringEquals(variable=training_job_output['Payload']['TrainingJobStatus'],
                                    value="Completed"
                                   )
in_progress_rule = ChoiceRule.StringEquals(variable=training_job_output['Payload']['TrainingJobStatus'],
                                           value="InProgress"
                                          )

check_training_status.add_choice(rule=completed_rule, next_step=get_model_accuracy)
check_training_status.add_choice(rule=in_progress_rule, next_step=wait_for_training)
check_training_status.default_choice(fail_step)


# Step11: If the model's Median Abs. Err. is less than 3, go to the next step (deployment), else go to failure
check_accuracy_step = steps.states.Choice(
    'Median-AE < 3'
)

# The input to this Choice state is the output of the preceding model accuracy step
threshold_rule = ChoiceRule.NumericLessThan(variable=get_model_accuracy.output()['Payload']['trainingMetrics'][0]['Value'], value=3)

check_accuracy_step.add_choice(rule=threshold_rule, next_step=deploy_model_step)

check_accuracy_step.default_choice(next_step=fail_step)
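
The routing performed by the three Choice states can be mirrored in plain Python to reason about the branches without running a state machine. This is only an illustration of the decision logic; the function names are hypothetical, the polled statuses are made-up data, and the state names are the `state_id` values used above.

```python
def route_job_status(status, on_completed, on_in_progress, on_other="Workflow Failed"):
    """Mirror a job-status Choice state: two explicit rules plus a default."""
    if status == "Completed":
        return on_completed
    if status == "InProgress":
        return on_in_progress
    # Any other status (for example "Failed" or "Stopped") takes the default branch.
    return on_other


def route_accuracy(median_abs_err, threshold=3):
    """Mirror the accuracy Choice state: deploy only if the error is strictly below the threshold."""
    return "Deploy Model" if median_abs_err < threshold else "Workflow Failed"


# Simulated sequence of statuses returned by successive polls (made-up data).
polled = ["InProgress", "InProgress", "Completed"]
path = [route_job_status(s, "Model Training", "Wait 30 Seconds") for s in polled]
print(path)
print(route_job_status("Failed", "Model Training", "Wait 30 Seconds"))
print(route_accuracy(2.4), route_accuracy(3.0))
```

Note that the comparison is strict, so an error of exactly 3 is routed to the fail state.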

### Link all the Steps Together
We create a workflow definition by chaining together the steps we created above. See [Chain](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/sagemaker.html#stepfunctions.steps.states.Chain) in the AWS Step Functions Data Science SDK documentation to learn more.

In [None]:
# Chain Steps 6-12
model_training_step.next(wait_for_training)
wait_for_training.next(get_training_status)
get_training_status.next(check_training_status)
get_model_accuracy.next(check_accuracy_step)

# Chain the whole workflow
workflow_definition = steps.Chain([
    codecommit_to_s3_step,
    data_processing_step,
    wait_for_data_processing,
    get_processing_status,
    check_pocessing_status
])
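
The chain lists only the first five states; the remaining ones are attached through the Choice branches and the explicit `.next()` calls. As a sanity check, the sketch below writes the intended transitions out by hand as a plain dictionary (keyed by the `state_id` values used above) and walks it from the first state. The dictionary is transcribed from this notebook rather than generated by the SDK, and `reachable_states` is a hypothetical helper.

```python
from collections import deque

# Hand-transcribed transitions between the states defined in this notebook.
TRANSITIONS = {
    "Put Code on S3": ["Data Processing"],
    "Data Processing": ["Wait 30 Seconds"],
    "Wait 30 Seconds": ["Processing Job Status"],
    "Processing Job Status": ["Processing Job Complete?"],
    "Processing Job Complete?": ["Model Training", "Wait 30 Seconds", "Workflow Failed"],
    "Model Training": ["Wait 60 Seconds"],
    "Wait 60 Seconds": ["Training Job Status"],
    "Training Job Status": ["Training Job Complete?"],
    "Training Job Complete?": ["Get Model Median Abs. Err.", "Wait 60 Seconds", "Workflow Failed"],
    "Get Model Median Abs. Err.": ["Median-AE < 3"],
    "Median-AE < 3": ["Deploy Model", "Workflow Failed"],
    "Deploy Model": [],
    "Workflow Failed": [],
}


def reachable_states(start, transitions):
    """Breadth-first walk returning every state reachable from the start state."""
    seen = [start]
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for next_state in transitions[state]:
            if next_state not in seen:
                seen.append(next_state)
                queue.append(next_state)
    return seen


states = reachable_states("Put Code on S3", TRANSITIONS)
print(len(states), "of", len(TRANSITIONS), "states are reachable")
```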

### Run the Workflow
Create your workflow using the workflow definition above, and render the graph with [render_graph](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/workflow.html#stepfunctions.workflow.Workflow.render_graph):

In [8]:
workflow = Workflow(
    name=WORKFLOW_NAME,
    definition=workflow_definition,
    role=WORKFLOW_EXECUTION_ROLE,
    execution_input=execution_input
)
workflow.render_graph()

Create the workflow in AWS Step Functions with [create](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/workflow.html#stepfunctions.workflow.Workflow.create):

In [None]:
workflow.create()

Run the workflow with [execute](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/workflow.html#stepfunctions.workflow.Workflow.execute):

In [None]:
execution = workflow.execute(
    inputs={
        'CodeCommitToS3Step': WORKFLOW_NAME + '-codecommit-to-s3',
        'DataProcessingStep': WORKFLOW_NAME + '-create-sagemaker-prcoessing-job',
        'DataProcessingStatusStep': WORKFLOW_NAME + '-query-data-processing-status',
        'TrainingStep': WORKFLOW_NAME + '-create-sagemaker-training-job',
        'TrainingStatusStep': WORKFLOW_NAME + '-query-training-status',
        'ModelAccuracyStep': WORKFLOW_NAME + '-query-model-accuracy',
        'DeployModelStep': WORKFLOW_NAME + '-deploy-sagemaker-model-job'
    }
)

Render workflow progress with [render_progress](https://aws-step-functions-data-science-sdk.readthedocs.io/en/latest/workflow.html#stepfunctions.workflow.Execution.render_progress). This generates a snapshot of the current state of your workflow as it executes. The snapshot is a static image, so you must run the cell again to check progress:

In [None]:
execution.render_progress()

In [None]:
execution.list_events(html=True)
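
To see which states an execution passed through without reading the full event table, the event list can be reduced to state names. The sketch below assumes the events follow the Step Functions `GetExecutionHistory` shape, where state-entry events have a `type` ending in `StateEntered` and carry the state name under `stateEnteredEventDetails`. The helper `entered_states` is hypothetical and the sample events are hand-written, not real execution output.

```python
def entered_states(events):
    """Return the names of the states entered, in order, from a list of history events."""
    names = []
    for event in events:
        if event.get("type", "").endswith("StateEntered"):
            names.append(event["stateEnteredEventDetails"]["name"])
    return names


# Hand-written sample events, trimmed to the fields used here.
sample_events = [
    {"id": 1, "type": "ExecutionStarted"},
    {"id": 2, "type": "TaskStateEntered",
     "stateEnteredEventDetails": {"name": "Put Code on S3"}},
    {"id": 3, "type": "TaskStateExited",
     "stateExitedEventDetails": {"name": "Put Code on S3"}},
    {"id": 4, "type": "TaskStateEntered",
     "stateEnteredEventDetails": {"name": "Data Processing"}},
    {"id": 5, "type": "FailStateEntered",
     "stateEnteredEventDetails": {"name": "Workflow Failed"}},
]

print(entered_states(sample_events))
```

Assuming `execution.list_events()` called without `html=True` returns the events as a list of dictionaries, the same helper could be applied to it directly.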