# TABLE OF CONTENTS:
---
* [Notebook Summary](#Notebook-Summary)
* [Setup](#Setup)
    * [Notebook Parameters](#Notebook-Parameters)
    * [Connect to Workspace](#Connect-to-Workspace)
* [Compute Target](#Compute-Target)
* [Pipeline Run Configuration & Environment](#Pipeline-Run-Configuration-&-Environment)
* [Pipeline Inputs](#Pipeline-Inputs)
* [Create Pipeline](#Create-Pipeline)
    * [Training Step](#Training-Step)
    * [Evaluate Step](#Evaluate-Step)
    * [Register Step](#Register-Step)
* [Publish Pipeline](#Publish-Pipeline)
* [Run Pipeline](#Run-Pipeline)
* [Resource Clean Up](#Resource-Clean-Up)
---

# Notebook Summary

This notebook builds and publishes an Azure Machine Learning (AML) training / retraining pipeline. Once published, the pipeline can be triggered through a REST endpoint from any HTTP library on any platform. The pipeline is used in the MLOps process for continuous model retraining, e.g. when data or model drift is detected or whenever the model should be retrained. Compared with a script run (which was used for the original model training in the `02_model_training` notebook), a pipeline is easier to operationalize because it can be automated and run based on triggers. It also allows different steps to be chained and executed sequentially. In general, machine learning pipelines help to optimize the workflow in terms of speed, portability and reuse.

The training / retraining pipeline built in this notebook will consist of three different steps that are executed sequentially:
- Model training using the same code for training as in the `02_model_training` notebook
- Model evaluation (comparing the newly trained model with the model currently in production or with a manual threshold)
- Model registration (registering the newly trained model to the AML workspace based on the outcomes of the model evaluation)

Check out the [AML Documentation](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-machine-learning-pipelines) for more info on how to build pipelines in general.

**Note**: The entire code of this notebook has also been refactored into the Python scripts `build_train_pipeline.py` and `run_train_pipeline.py` in the `<PROJECT_ROOT>/src/pipeline` folder so that the logic can be triggered inside a CI/CD workflow on Azure DevOps.

# Setup

In [1]:
# Import libraries
import azureml
from azureml.core import Dataset, Datastore, Environment, Experiment, Workspace
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException
from azureml.core.runconfig import RunConfiguration
from azureml.pipeline.core import Pipeline, PipelineData, PublishedPipeline
from azureml.pipeline.core.graph import PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

print("SDK version:", azureml.core.VERSION)

SDK version: 1.20.0


### Notebook Parameters

Specify the notebook parameters which are used in the source code below.

In [2]:
# Define the name of the remote compute target cluster
cluster_name = "gpu-cluster"

# Define the name of the training environment created in the 00_environment_setup notebook
env_name = "dogs_clf_train_env"

# Determine whether the pipeline training run should be evaluated before model registration
run_evaluation = True

# Define the pipeline endpoint name
pipeline_name = "dog_clf_model_training_pipeline"

# Define the pipeline endpoint version
# Make sure to update this every time you want to publish changes to your pipeline!!!
pipeline_version = "1.0"

# Define the model_name
model_name = "dog_clf_model"

# Define the experiment name
experiment_name = "stanford_dogs_classifier_train"

### Connect to Workspace

In order to connect and communicate with the AML workspace, a workspace object needs to be instantiated using the AML SDK.

In [3]:
# Connect to the AML workspace
ws = Workspace.from_config()

# Compute Target

Retrieve a remote compute target to run the pipeline experiments on. The code below first checks whether a compute target named **cluster_name** (defined in the [Notebook Parameters](#Notebook-Parameters) section) already exists and, if it does, retrieves it. Otherwise, it creates a new compute cluster.

AML pipelines need to be run on a remote compute target and cannot be run locally.

In [4]:
# Reuse the cluster if it already exists; otherwise create it
try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print("Found existing cluster, use it.")
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_D2_V2", # CPU
                                                           # vm_size='STANDARD_NC6', # GPU
                                                           max_nodes=4,
                                                           idle_seconds_before_scaledown=2400)
    
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)

# Use get_status() to get a detailed status for the current cluster
print(compute_target.get_status().serialize())

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2021-06-29T13:12:27.451000+00:00', 'errors': None, 'creationTime': '2021-06-28T06:49:27.130474+00:00', 'modifiedTime': '2021-06-28T06:49:57.842385+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 2, 'nodeIdleTimeBeforeScaleDown': 'PT300S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_NC6'}


# Pipeline Run Configuration & Environment

Load the model training environment that has been registered as part of the `00_environment_setup` notebook and use it for the pipeline run.

In [5]:
env = Environment.get(workspace=ws, name=env_name)

Create a pipeline run configuration containing the retrieved environment.

In [6]:
run_config = RunConfiguration()
run_config.environment = env

# Pipeline Inputs

Create a PipelineData object to pass data between steps. In a pipeline with more than one step, an object referenced in the outputs array of one step becomes available as an input to a subsequent step.


While the pipeline here consists of only one step that requires access to data, a typical flow with multiple steps will:
- Use Dataset objects as inputs to fetch raw data, perform some transformations, and then output a PipelineData object.
- Use the previous step's PipelineData output object as an input object, repeating this for subsequent steps.

In [7]:
pipeline_data = PipelineData("pipeline_data", datastore=ws.get_default_datastore())
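
At run time, a `PipelineData` reference passed in a step's `arguments` is resolved to a directory path on the compute target, so step scripts can exchange files through ordinary file I/O. The following is a minimal sketch of that hand-off, not the project's actual step code: the helper names `write_step_output` and `read_step_input`, the file name `model_info.json`, and the payload contents are assumptions made for illustration.

```python
import json
import os
import tempfile


def write_step_output(step_output_dir, payload, file_name="model_info.json"):
    """Producing step: persist a small JSON payload into the step output folder."""
    # The output folder is not guaranteed to exist yet, so create it first.
    os.makedirs(step_output_dir, exist_ok=True)
    file_path = os.path.join(step_output_dir, file_name)
    with open(file_path, "w") as f:
        json.dump(payload, f)
    return file_path


def read_step_input(step_input_dir, file_name="model_info.json"):
    """Consuming step: load the payload written by the previous step."""
    with open(os.path.join(step_input_dir, file_name)) as f:
        return json.load(f)


# Local stand-in for the path that would arrive via --step_output / --step_input.
shared_dir = os.path.join(tempfile.mkdtemp(), "pipeline_data")
write_step_output(shared_dir, {"run_id": "example-run", "val_acc": 0.9})
handoff = read_step_input(shared_dir)
```

Creating the directory inside the producing step keeps the sketch safe to run whether or not the folder already exists.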

Create PipelineParameter objects to be able to pass versatile arguments to the PythonScriptSteps.

In [8]:
dataset_name_param = PipelineParameter(name="dataset_name", default_value="stanford_dogs_dataset")
dataset_version_param = PipelineParameter(name="dataset_version", default_value=1)
data_file_path_param = PipelineParameter(name="data_file_path", default_value="none")
model_name_param = PipelineParameter(name="model_name", default_value="dog_clf_model")
caller_run_id_param = PipelineParameter(name="caller_run_id", default_value="none")

# Create Pipeline

In order to create a pipeline, the individual steps need to be created first.

A pipeline step is an object that encapsulates everything that is needed for running that step, including:

- environment and dependency settings
- the compute target to run the step on
- input and output data, and any custom parameters
- a reference to a script or SDK logic to run during the step

There are multiple classes that inherit from the parent class PipelineStep to assist with building a step using certain frameworks and stacks. Here, the PythonScriptStep class is used to define the logic of the three steps in Python scripts. These Python scripts can be found in the `<PROJECT_ROOT>/src/pipeline` folder:
- the training step: train_model_step.py
- the evaluate step: evaluate_model_step.py
- the register step: register_model_step.py

For a list of all classes for different step types, see the [steps package](https://docs.microsoft.com/en-gb/python/api/azureml-pipeline-steps/azureml.pipeline.steps?view=azure-ml-py).

### Training Step

Create the pipeline training step using the PipelineParameter objects.

In [9]:
train_step = PythonScriptStep(name="Train Model",
                              script_name="pipeline/train_model_step.py",
                              compute_target=compute_target,
                              source_directory="../src",
                              outputs=[pipeline_data],
                              arguments=["--caller_run_id", caller_run_id_param,
                                         "--dataset_name", dataset_name_param,
                                         "--dataset_version", dataset_version_param,
                                         "--data_file_path", data_file_path_param,
                                         "--model_name", model_name_param,
                                         "--step_output", pipeline_data],
                              runconfig=run_config,
                              allow_reuse=False)

print("Training step has been created.")

Training step has been created.
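
The `arguments` list above is handed to the step script as ordinary command-line arguments. The contents of `train_model_step.py` are not shown in this notebook, so the following is only a sketch of how such a script might parse them with `argparse`; the function name `parse_train_args` and the handling of the `"none"` sentinel are assumptions.

```python
import argparse


def parse_train_args(argv=None):
    """Parse the command-line arguments that the training step passes to its script."""
    parser = argparse.ArgumentParser(description="Training step arguments (sketch)")
    parser.add_argument("--caller_run_id", type=str, default="none")
    parser.add_argument("--dataset_name", type=str, default="stanford_dogs_dataset")
    parser.add_argument("--dataset_version", type=int, default=1)
    parser.add_argument("--data_file_path", type=str, default="none")
    parser.add_argument("--model_name", type=str, default="dog_clf_model")
    parser.add_argument("--step_output", type=str, required=True)
    return parser.parse_args(argv)


# Argument values reach the script as strings; argparse converts the numeric one.
args = parse_train_args(["--dataset_version", "2", "--step_output", "/tmp/pipeline_data"])

# Assumption: the "none" sentinel means that no explicit file path was given.
data_file_path = None if args.data_file_path == "none" else args.data_file_path
```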


### Evaluate Step

Create the pipeline evaluate step using the PipelineParameter objects.

In [10]:
evaluate_step = PythonScriptStep(name="Evaluate Model",
                                 script_name="pipeline/evaluate_model_step.py",
                                 compute_target=compute_target,
                                 source_directory="../src",
                                 arguments=["--model_name", model_name_param,
                                            "--allow_run_cancel", True],
                                 runconfig=run_config,
                                 allow_reuse=False)

print("Evaluate step has been created.")

Evaluate step has been created.
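
As described in the notebook summary, the evaluation compares the newly trained model with the model currently in production or with a manual threshold. The logic of `evaluate_model_step.py` is not shown here, so the following is a hypothetical sketch of such a decision rule; the function name `should_register`, the choice of a "higher is better" metric, and the strict-improvement rule are all assumptions.

```python
def should_register(new_metric, production_metric=None, threshold=None):
    """Decide whether a newly trained model should be registered.

    Higher metric values are assumed to be better (e.g. accuracy).
    """
    if production_metric is not None:
        # A production model exists: require a strict improvement.
        return new_metric > production_metric
    if threshold is not None:
        # No production model yet: fall back to the manual threshold.
        return new_metric >= threshold
    # Nothing to compare against: register the first model.
    return True
```

For example, `should_register(0.90, production_metric=0.85)` evaluates to `True`, while `should_register(0.80, production_metric=0.85)` evaluates to `False`.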


### Register Step

Create the pipeline register step using the PipelineParameter objects.

In [11]:
register_step = PythonScriptStep(name="Register Model ",
                                 script_name="pipeline/register_model_step.py",
                                 compute_target=compute_target,
                                 source_directory="../src",
                                 inputs=[pipeline_data],
                                 arguments=["--model_name", model_name_param,
                                            "--step_input", pipeline_data],
                                 runconfig=run_config,
                                 allow_reuse=False)

print("Register step has been created.")

Register step has been created.


Stitch the three pipeline steps together.

In [12]:
# Check run_evaluation flag to include or exclude evaluation step.
if run_evaluation:
    print("Include evaluation step before register step.")
    evaluate_step.run_after(train_step)
    register_step.run_after(evaluate_step)
    steps = [train_step, evaluate_step, register_step]
else:
    print("Exclude evaluation step and directly run register step.")
    register_step.run_after(train_step)
    steps = [train_step, register_step]

Include evaluation step before register step.


Create and validate the pipeline based on the pipeline steps.

In [13]:
train_pipeline = Pipeline(workspace=ws, steps=steps)
train_pipeline.validate()

Step Train Model is ready to be created [3d73f846]
Step Evaluate Model is ready to be created [397abf58]
Step Register Model  is ready to be created [5c4eed1a]



[]

# Publish Pipeline

Publish the pipeline to create a REST endpoint that allows the pipeline to be rerun from any HTTP library on any platform. The published pipeline can also be run from the AML workspace, where metadata such as run history and duration is tracked as well.

Before publishing the pipeline, the training parameters need to be specified in the `pipeline_parameters.json` file that can be found in the `<PROJECT_ROOT>/src/config` folder:

<img src="../docs/images/aml_pipeline_parameters.png" alt="aml_pipeline_parameters" width="600"/>

Adjust all parameters as desired and then run the following cell to publish the pipeline.

**Note**: If a pipeline with the same version has already been published, the code will retrieve the existing published pipeline instead. This means that whenever you make changes to the pipeline you need to specify a new pipeline version!

In [14]:
pipelines = PublishedPipeline.list(ws)
matched_pipes = []

for p in pipelines:
    if p.name == pipeline_name and p.version == pipeline_version:
        matched_pipes.append(p)

if not matched_pipes:
    published_pipeline = train_pipeline.publish(name=pipeline_name,
                                                description="Model training/retraining pipeline",
                                                version=pipeline_version)
    
    print(f"Published pipeline '{published_pipeline.name}' with version {published_pipeline.version}.")

else:
    published_pipeline = matched_pipes[0]
    print(f"Retrieved published pipeline with id {published_pipeline.id}.")

Created step Train Model [3d73f846][7cb029e8-098f-4960-980f-e4c5d1e43a2e], (This step will run and generate new outputs)
Created step Evaluate Model [397abf58][132248b0-cef5-449e-b122-fd53f8f87df8], (This step will run and generate new outputs)
Created step Register Model  [5c4eed1a][bbcab9ab-1240-4388-90d3-58f5c1ad3518], (This step will run and generate new outputs)
Published pipeline 'dog_clf_model_training_pipeline' with version 1.0.


The pipeline is now published in the AML workspace:

<img src="../docs/images/aml_pipeline.png" alt="aml_pipeline" width="1200"/>  
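
A published pipeline can be triggered by sending an authenticated POST request to its REST endpoint, with the experiment name and any parameter overrides in a JSON body. The following is a minimal sketch that only builds such a request without sending it. The URL and token are placeholders; in practice the URL would come from `published_pipeline.endpoint` and the header from an AML authentication object. The helper name `build_pipeline_request` is an assumption.

```python
import json
import urllib.request


def build_pipeline_request(endpoint_url, auth_header, experiment_name, parameters):
    """Build (but do not send) the POST request that triggers a published pipeline."""
    body = {
        "ExperimentName": experiment_name,
        "ParameterAssignments": parameters,
    }
    headers = {"Content-Type": "application/json", **auth_header}
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Placeholder values for illustration only.
request = build_pipeline_request(
    endpoint_url="https://example.invalid/pipelines/placeholder-id",
    auth_header={"Authorization": "Bearer <token>"},
    experiment_name="stanford_dogs_classifier_train",
    parameters={"model_name": "dog_clf_model"},
)
# urllib.request.urlopen(request) would submit the run; it is left out here on purpose.
```

Separating request construction from sending makes the payload easy to inspect before anything is submitted to the workspace.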

# Run Pipeline

The first pipeline run takes more time than subsequent runs, as all dependencies must be downloaded, a Docker image is created, and the Python environment is provisioned/created. Running it again takes significantly less time as those resources are reused. Total run time depends on the workload of your scripts and processes running in each pipeline step.

In [15]:
pipeline_parameters = {"model_name": model_name}
tags = {"trigger": "jupyter notebook",
        "model_architecture": "transfer-learning with ResNext-50"}

# Create an AML Experiment
experiment = Experiment(workspace=ws, name=experiment_name)
    
# Submit an Experiment Run using the published pipeline and defined pipeline parameters
run = experiment.submit(published_pipeline,
                        tags=tags,
                        pipeline_parameters=pipeline_parameters)

Submitted PipelineRun b360d7c1-5d4d-4cf6-9147-6c69fc33c36e
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/stanford_dogs_classifier_train/runs/b360d7c1-5d4d-4cf6-9147-6c69fc33c36e?wsid=/subscriptions/e58a23da-421e-4b52-99d5-e615f2f8be41/resourcegroups/mlopstemplaterg/workspaces/mlopstemplatewsbfdc24


In [16]:
# Wait for completion of the run and show output log
run.wait_for_completion(show_output=True)

PipelineRunId: b360d7c1-5d4d-4cf6-9147-6c69fc33c36e
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/stanford_dogs_classifier_train/runs/b360d7c1-5d4d-4cf6-9147-6c69fc33c36e?wsid=/subscriptions/e58a23da-421e-4b52-99d5-e615f2f8be41/resourcegroups/mlopstemplaterg/workspaces/mlopstemplatewsbfdc24
PipelineRun Status: Running


StepRunId: 14c90a42-b648-4c62-b401-838b149670cc
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/stanford_dogs_classifier_train/runs/14c90a42-b648-4c62-b401-838b149670cc?wsid=/subscriptions/e58a23da-421e-4b52-99d5-e615f2f8be41/resourcegroups/mlopstemplaterg/workspaces/mlopstemplatewsbfdc24
StepRun( Train Model ) Status: NotStarted
StepRun( Train Model ) Status: Running

Streaming azureml-logs/55_azureml-execution-tvmps_c9a77e93a71e9584e7235b2523fef9602788c1f007e7504932dc8fae63a1ca91_d.txt
2021-06-29T15:06:07Z Successfully mounted a/an Blobfuse File System at /mnt/batch/tasks/shared/LS_root/jobs/mlopstemplatewsbfdc24/azure

2021/06/29 15:12:38 Not exporting to RunHistory as the exporter is either stopped or there is no data.
Stopped: false
OriginalData: 1
FilteredData: 0.
--------------------
START MODEL TRAINING
--------------------
Hyperparameter number of epochs: 36
Hyperparameter batch size: 8
Hyperparameter learning rate: 0.01
Hyperparameter momentum: 0.9
Hyperparameter number of frozen layers: 7
Hyperparameter number of neurons fc layer: 512
Hyperparameter dropout probability fc layer: 0
Hyperparameter lr scheduler step size: 7
Downloading: "https://download.pytorch.org/models/resnext50_32x4d-7cdf4587.pth" to /root/.cache/torch/hub/checkpoints/resnext50_32x4d-7cdf4587.pth

  0%|          | 0.00/95.8M [00:00<?, ?B/s]
  9%|▉         | 8.77M/95.8M [00:00<00:00, 91.9MB/s]
 33%|███▎      | 31.5M/95.8M [00:00<00:00, 113MB/s] 
 53%|█████▎    | 50.6M/95.8M [00:00<00:00, 130MB/s]
 74%|███████▍  | 71.1M/95.8M [00:00<00:00, 147MB/s]
 94%|█████████▍| 90.0M/95.8M [00:00<00:00, 160MB/s]
100%|██████████| 95.8M/95.




StepRunId: 8a81375d-6e42-4c6a-b508-d1e04a03e758
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/stanford_dogs_classifier_train/runs/8a81375d-6e42-4c6a-b508-d1e04a03e758?wsid=/subscriptions/e58a23da-421e-4b52-99d5-e615f2f8be41/resourcegroups/mlopstemplaterg/workspaces/mlopstemplatewsbfdc24
StepRun( Evaluate Model ) Status: NotStarted
StepRun( Evaluate Model ) Status: Running

Streaming azureml-logs/55_azureml-execution-tvmps_c9a77e93a71e9584e7235b2523fef9602788c1f007e7504932dc8fae63a1ca91_d.txt
2021-06-29T17:00:28Z Successfully mounted a/an Blobfuse File System at /mnt/batch/tasks/shared/LS_root/jobs/mlopstemplatewsbfdc24/azureml/8a81375d-6e42-4c6a-b508-d1e04a03e758/mounts/workspaceblobstore
2021-06-29T17:00:29Z Failed to start nvidia-fabricmanager due to exit status 5 with output Failed to start nvidia-fabricmanager.service: Unit nvidia-fabricmanager.service not found.
. Please ignore this if the GPUs don't utilize NVIDIA® NVLink® switches.
2021-06-29T17:00:2




StepRunId: b0462b01-c067-44d6-b2e2-1963cfde2b66
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/stanford_dogs_classifier_train/runs/b0462b01-c067-44d6-b2e2-1963cfde2b66?wsid=/subscriptions/e58a23da-421e-4b52-99d5-e615f2f8be41/resourcegroups/mlopstemplaterg/workspaces/mlopstemplatewsbfdc24
StepRun( Register Model  ) Status: Running

Streaming azureml-logs/55_azureml-execution-tvmps_c9a77e93a71e9584e7235b2523fef9602788c1f007e7504932dc8fae63a1ca91_d.txt
2021-06-29T17:06:30Z Successfully mounted a/an Blobfuse File System at /mnt/batch/tasks/shared/LS_root/jobs/mlopstemplatewsbfdc24/azureml/b0462b01-c067-44d6-b2e2-1963cfde2b66/mounts/workspaceblobstore
2021-06-29T17:06:30Z Failed to start nvidia-fabricmanager due to exit status 5 with output Failed to start nvidia-fabricmanager.service: Unit nvidia-fabricmanager.service not found.
. Please ignore this if the GPUs don't utilize NVIDIA® NVLink® switches.
2021-06-29T17:06:30Z Starting output-watcher...
2021-06-29T17:


StepRun(Register Model ) Execution Summary
StepRun( Register Model  ) Status: Finished
{'runId': 'b0462b01-c067-44d6-b2e2-1963cfde2b66', 'target': 'gpu-cluster', 'status': 'Completed', 'startTimeUtc': '2021-06-29T17:06:30.080675Z', 'endTimeUtc': '2021-06-29T17:12:17.282622Z', 'properties': {'azureml.git.repository_uri': 'https://github.com/sebastianbirk/pytorch-mlops-template-azure-ml.git', 'mlflow.source.git.repoURL': 'https://github.com/sebastianbirk/pytorch-mlops-template-azure-ml.git', 'azureml.git.branch': 'develop', 'mlflow.source.git.branch': 'develop', 'azureml.git.commit': '24ee5abcfc13dfd6c7da3ede3b9f3013132587b1', 'mlflow.source.git.commit': '24ee5abcfc13dfd6c7da3ede3b9f3013132587b1', 'azureml.git.dirty': 'True', 'ContentSnapshotId': 'a77d3106-296d-4662-949a-a22a25b989ae', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.moduleid': 'bbcab9ab-1240-4388-90d3-58f5c1ad3518', 'azureml.runsource': 'azureml.StepRun', 'azureml.nodeid': '5c4eed1a', 'azurem



PipelineRun Execution Summary
PipelineRun Status: Finished
{'runId': 'b360d7c1-5d4d-4cf6-9147-6c69fc33c36e', 'status': 'Completed', 'startTimeUtc': '2021-06-29T15:01:00.125885Z', 'endTimeUtc': '2021-06-29T17:12:20.728765Z', 'properties': {'azureml.git.repository_uri': 'https://github.com/sebastianbirk/pytorch-mlops-template-azure-ml.git', 'mlflow.source.git.repoURL': 'https://github.com/sebastianbirk/pytorch-mlops-template-azure-ml.git', 'azureml.git.branch': 'develop', 'mlflow.source.git.branch': 'develop', 'azureml.git.commit': '24ee5abcfc13dfd6c7da3ede3b9f3013132587b1', 'mlflow.source.git.commit': '24ee5abcfc13dfd6c7da3ede3b9f3013132587b1', 'azureml.git.dirty': 'True', 'azureml.runsource': 'azureml.PipelineRun', 'runSource': 'SDK', 'runType': 'SDK', 'azureml.parameters': '{"model_name":"dog_clf_model","caller_run_id":"none","dataset_name":"stanford_dogs_dataset","dataset_version":"1","data_file_path":"none"}', 'azureml.pipelineid': 'b6216aa0-5fee-44d8-b064-98cd8f278ab6'}, 'inputDa

'Finished'

# Resource Clean Up

Uncomment to delete the compute target.

In [17]:
# compute_target.delete()