# TABLE OF CONTENTS:
---
* [Notebook Summary](#Notebook-Summary)
* [Setup](#Setup)
    * [Notebook Parameters](#Notebook-Parameters)
    * [Connect to Workspace](#Connect-to-Workspace)
* [Compute Target](#Compute-Target)
* [Pipeline Run Configuration & Environment](#Pipeline-Run-Configuration-&-Environment)
* [Pipeline Inputs](#Pipeline-Inputs)
* [Create Pipeline](#Create-Pipeline)
    * [Training Step](#Training-Step)
    * [Evaluate Step](#Evaluate-Step)
    * [Register Step](#Register-Step)
* [Publish Pipeline](#Publish-Pipeline)
* [Run Pipeline](#Run-Pipeline)
* [Resource Clean Up](#Resource-Clean-Up)
---

# Notebook Summary

In this notebook, an Azure Machine Learning (AML) training / retraining pipeline will be built and published. After building and publishing the pipeline, a REST endpoint can be used to trigger the pipeline from any HTTP library on any platform. This pipeline will be used in the MlOps process for continuous model retraining, e.g. when data or model drift is detected or in general when the model should be retrained. A pipeline gives a more operationalizable way of training than a script run (which was used for original model training in the `02_model_training` notebook) as it can be easily automated and run based on triggers. It also allows for chaining of different steps that can then be executed sequentially. In general, machine learning pipelines help to optimize the workflow in terms of speed, portability and reuse.

The training / retraining pipeline built in this notebook will consist of three different steps that are executed sequentially:
- Model training using the same code for training as in the `02_model_training` notebook
- Model evaluation (comparing the newly trained model with the model currently in production or with a manual threshold)
- Model registration (registering the newly trained model to the AML workspace based on the outcomes of the model evaluation)

Check out the [AML Documentation](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-machine-learning-pipelines) for more info on how to build pipelines in general.

# Setup

In [1]:
# Import libraries
import azureml
from azureml.core import Dataset, Datastore, Environment, Experiment, Workspace
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException
from azureml.core.runconfig import RunConfiguration
from azureml.pipeline.core import Pipeline, PipelineData, PublishedPipeline
from azureml.pipeline.core.graph import PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

print("SDK version:", azureml.core.VERSION)

SDK version: 1.20.0


### Notebook Parameters

In [2]:
# Define the name of the remote compute target cluster
cluster_name = "gpu-cluster"

# Define the name of the training environment created in the 00_environment_setup notebook
env_name = "stanford-dogs-train-env"

# Determine whether the pipeline training run should be evaluated before model registration
run_evaluation = True

# Define the pipeline endpoint name
pipeline_name = "dog_clf_model_training_pipeline"

# Define the pipeline endpoint version
pipeline_version = "1.ß"

# Define the model_name
model_name = "dog_clf_model"

# Define the experiment name
experiment_name = "stanford_dogs_classifier_train"

### Connect to Workspace

In order to connect and communicate with the AML workspace, a workspace object needs to be instantiated using the AML SDK.

In [3]:
# Connect to the AML workspace
ws = Workspace.from_config()

# Compute Target

Retrieve a remote compute target to run the pipeline experiments on. The below code will first check whether a compute target with name **cluster_name** (defined in the [Notebook Parameters](#Notebook-Parameters) section) already exists and if it does, will retrieve it. Otherwise it will create a new compute cluster.

**Note**: At the moment it is not possible to create a new compute cluster so please specify the name of an existing compute cluster.

AML pipelines need to be run on a remote compute target and cannot be run locally.

In [4]:
# Verify that cluster does not exist already
try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print("Found existing cluster, use it.")
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_D2_V2", # CPU
                                                           # vm_size='STANDARD_NC6', # GPU
                                                           max_nodes=4,
                                                           idle_seconds_before_scaledown=2400)
    
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)

# Use get_status() to get a detailed status for the current cluster
print(compute_target.get_status().serialize())

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2021-03-25T06:14:03.766000+00:00', 'errors': None, 'creationTime': '2021-02-04T18:49:38.130943+00:00', 'modifiedTime': '2021-02-04T18:49:53.799036+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 1, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'LowPriority', 'vmSize': 'STANDARD_NC6'}


# Pipeline Run Configuration & Environment

Load the model training environment that has been registered as part of the `00_environment_setup` notebook and use it for the pipeline run.

In [5]:
env = Environment.get(workspace=ws, name=env_name)

Create a pipeline run configuration containing the retrieved environment.

In [6]:
run_config = RunConfiguration()
run_config.environment = env

# Pipeline Inputs

Create a PipelineData object to pass data between steps.

While here the pipeline will consist of a single step only, a usual flow with multiple steps will include:
- Using Dataset objects as inputs to fetch raw data, performing some transformations, then outputting a PipelineData object.
- Use the previous step's PipelineData output object as an input object, repeated for subsequent steps.

In [7]:
pipeline_data = PipelineData("pipeline_data", datastore=ws.get_default_datastore())

Create PipelineParameter objects to be able to pass versatile arguments to the PythonScriptSteps.

In [8]:
dataset_name_param = PipelineParameter(name="dataset_name", default_value="stanford_dogs_dataset")
dataset_version_param = PipelineParameter(name="dataset_version", default_value=1)
data_file_path_param = PipelineParameter(name="data_file_path", default_value="none")
model_name_param = PipelineParameter(name="model_name", default_value="dog_clf_model")
caller_run_id_param = PipelineParameter(name="caller_run_id", default_value="none")

# Create Pipeline

In order to create a pipeline, the individual steps need to be created first.

A pipeline step is an object that encapsulates everything that is needed for running a pipeline including:

- environment and dependency settings
- the compute target to run the pipeline on
- input and output data, and any custom parameters
- reference to a script or SDK-logic to run during the step

There are multiple classes that inherit from the parent class PipelineStep to assist with building a step using certain frameworks and stacks. Here, the PythonScriptStep class is used to define the step logic using the train_model.py script.

An object reference in the outputs array becomes available as an input for a subsequent pipeline step, for scenarios where there is more than one step.

For a list of all classes for different step types, see the [steps package](https://docs.microsoft.com/en-gb/python/api/azureml-pipeline-steps/azureml.pipeline.steps?view=azure-ml-py).

### Training Step

Create the pipeline training step using the PipelineParameter objects.

In [9]:
train_step = PythonScriptStep(name="Train Model",
                              script_name="pipeline/train_model_step.py",
                              compute_target=compute_target,
                              source_directory="../src",
                              outputs=[pipeline_data],
                              arguments=["--model_name", model_name_param,
                                         "--step_output", pipeline_data,
                                         "--dataset_version", dataset_version_param,
                                         "--data_file_path", data_file_path_param,
                                         "--caller_run_id", caller_run_id_param,
                                         "--dataset_name", dataset_name_param],
                              runconfig=run_config,
                              allow_reuse=True)

print("Training step has been created.")

Training step has been created.


### Evaluate Step

Create the pipeline evaluate step using the PipelineParameter objects.

In [10]:
evaluate_step = PythonScriptStep(name="Evaluate Model",
                                 script_name="pipeline/evaluate_model_step.py",
                                 compute_target=compute_target,
                                 source_directory="../src",
                                 arguments=["--model_name", model_name_param,
                                            "--allow_run_cancel", True],
                                 runconfig=run_config,
                                 allow_reuse=False)

print("Evaluate step has been created.")

Evaluate step has been created.


### Register Step

Create the pipeline register step using the PipelineParameter objects.

In [11]:
register_step = PythonScriptStep(name="Register Model ",
                                 script_name="pipeline/register_model_step.py",
                                 compute_target=compute_target,
                                 source_directory="../src",
                                 inputs=[pipeline_data],
                                 arguments=["--model_name", model_name_param,
                                            "--step_input", pipeline_data],
                                 runconfig=run_config,
                                 allow_reuse=False)

print("Register step has been created.")

Register step has been created.


Stitch the three pipeline steps together.

In [12]:
# Check run_evaluation flag to include or exclude evaluation step.
if run_evaluation == True:
    print("Include evaluation step before register step.")
    evaluate_step.run_after(train_step)
    register_step.run_after(evaluate_step)
    steps = [train_step, evaluate_step, register_step]
else:
    print("Exclude evaluation step and directly run register step.")
    register_step.run_after(train_step)
    steps = [train_step, register_step]

Include evaluation step before register step.


Create and validate the pipeline based on the pipeline steps.

In [13]:
train_pipeline = Pipeline(workspace=ws, steps=steps)
train_pipeline._set_experiment_name
train_pipeline.validate()

Step Train Model is ready to be created [0bad7338]
Step Evaluate Model is ready to be created [970d713e]
Step Register Model  is ready to be created [aee6a099]


[]

# Publish Pipeline

Publish the pipeline to create a REST endpoint that allows to rerun the pipeline from any HTTP library on any platform. The published pipeline can also be run from the AML workspace where different metdata such as run history and duration are tracked as well. If a pipeline with the same version has already been published, retrieve the existing published pipeline instead.

In [14]:
pipelines = PublishedPipeline.list(ws)
matched_pipes = []

for p in pipelines:
    if p.name == pipeline_name:
        if p.version == pipeline_version:
            matched_pipes.append(p)

if(len(matched_pipes) == 0):
    published_pipeline = train_pipeline.publish(name=pipeline_name,
                                                description="Model training/retraining pipeline",
                                                version=pipeline_version)
    
    print(f"Published pipeline '{published_pipeline.name}' with version {published_pipeline.version}.")

else:
    published_pipeline = matched_pipes[0]
    print(f"Retrieved published pipeline with id {published_pipeline.id}.")

Created step Train Model [0bad7338][53dfba84-57e8-4d7a-8ef5-522938d3e807], (This step will run and generate new outputs)
Created step Evaluate Model [970d713e][da441d12-0054-451a-9b67-ed2ac6df05c7], (This step will run and generate new outputs)
Created step Register Model  [aee6a099][94fef048-2bf0-47b8-868f-47afffae3002], (This step will run and generate new outputs)
Published pipeline 'dog_clf_model_training_pipeline' with version 1.ß.


# Run Pipeline

The first pipeline run takes more time than subsequent runs, as all dependencies must be downloaded, a Docker image is created, and the Python environment is provisioned/created. Running it again takes significantly less time as those resources are reused. Total run time depends on the workload of your scripts and processes running in each pipeline step.

In [15]:
pipeline_parameters = {"model_name": model_name}
tags = {"trigger": "jupyter notebook",
        "model_architecture" : "transfer-learning with ResNext-50"}

# Create an AML Experiment
experiment = Experiment(workspace=ws, name=experiment_name)
    
# Submit an Experiment Run using the published pipeline and defined pipeline parameters
run = experiment.submit(published_pipeline,
                        tags=tags,
                        pipeline_parameters=pipeline_parameters)

Submitted PipelineRun 5dd4d202-711a-458a-8041-56effefc11c9
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/stanford_dogs_classifier_train/runs/5dd4d202-711a-458a-8041-56effefc11c9?wsid=/subscriptions/bf088f59-f015-4332-bd36-54b988be7c90/resourcegroups/amlbrikserg/workspaces/amlbriksews


In [None]:
# Wait for completion of the run and show output log
run.wait_for_completion(show_output=True)

PipelineRunId: 5dd4d202-711a-458a-8041-56effefc11c9
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/stanford_dogs_classifier_train/runs/5dd4d202-711a-458a-8041-56effefc11c9?wsid=/subscriptions/bf088f59-f015-4332-bd36-54b988be7c90/resourcegroups/amlbrikserg/workspaces/amlbriksews
PipelineRun Status: Running


# Resource Clean Up

Delete the compute target.

**Note**: At the moment the compute target can and should not be deleted.

In [None]:
# compute_target.delete()