# Creating an Azure Machine Learning Pipeline

You can perform the various steps required to ingest data, train a model, and register the model individually by using the Azure ML SDK to run script-based experiments. However, in an enterprise environment it is common to encapsulate the sequence of discrete steps required to build a machine learning solution into a *pipeline* that can be run on one or more compute targets, either on-demand by a user, from an automated build process, or on a schedule.

In this lab, you'll bring together all of these elements to create a simple pipeline that trains and registers a model.

## Connect to Your Workspace

The first thing you need to do is to connect to your workspace using the Azure ML SDK.

> **Note**: If the authenticated session with your Azure subscription has expired since you completed the previous exercise, you'll be prompted to reauthenticate.

In [2]:
from azureml import core

ws = core.Workspace.from_config()
print(f'Ready to use Azure ML {core.VERSION} to work with {ws.name}')

Ready to use Azure ML 1.11.0 to work with workspace


## Prepare the Training Data

You can use local data files to train a model, but when running training workloads automatically on cloud-based compute, it makes more sense to store the data centrally in the cloud and ingest it into the training script wherever it happens to be running.

In this lab, you'll upload the training data to a *datastore* and define a *dataset* that can be used to access the data from a training script. For simplicity, you'll upload the data to the default datastore for your Azure Machine Learning workspace - this is an Azure Storage blob container that was created when you provisioned the workspace. In a real solution, you'd likely register a datastore that references the cloud location where you typically store your data. You'll then create a *tabular* dataset that references the CSV files you uploaded.

In [3]:
default_ds = ws.get_default_datastore()
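
The cell above only retrieves the default datastore. A minimal sketch of the upload and registration steps described earlier might look like the following. The helper name `upload_and_register`, the `target_path` value, and the local file list are assumptions for illustration; the dataset name matches the one the pipeline retrieves later in this lab. The SDK import is deferred to inside the function so the definition itself does not require a workspace connection.

```python
def upload_and_register(ws, local_files,
                        dataset_name="diabetes dataset",
                        target_path="diabetes-data/"):
    """Upload local CSV files to the default datastore and register a tabular dataset.

    `local_files` is a list of local CSV paths (hypothetical; supply your own).
    """
    # Imported lazily so this function can be defined without the SDK loaded.
    from azureml.core import Dataset

    datastore = ws.get_default_datastore()

    # Copy the local files into the blob container behind the datastore.
    datastore.upload_files(files=local_files,
                           target_path=target_path,
                           overwrite=True,
                           show_progress=True)

    # Reference every CSV under the target path as a single tabular dataset.
    tabular = Dataset.Tabular.from_delimited_files(
        path=(datastore, target_path + "*.csv"))

    # Registering under a name lets later cells look the dataset up by that name.
    return tabular.register(workspace=ws,
                            name=dataset_name,
                            create_new_version=True)
```

You could then call it with something like `upload_and_register(ws, ['./data/diabetes.csv'])`, where the path is a placeholder for wherever your CSV files live.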

## Prepare a Compute Environment for the Pipeline Steps

The pipeline will eventually be published and run on-demand, so it needs a compute environment in which to run. In this exercise, you'll use the same compute for both steps, but it's important to realize that each step runs independently, so you could specify a different compute context for each step if appropriate.

First, you need a compute target. In this case, you'll use an Azure Machine Learning compute cluster that already exists in your workspace; the code below retrieves it by name and waits until it is ready.

> **Important**: Change the value of `cluster_name` in the code below to the unique name of your compute cluster before running it!

In [4]:
from azureml.core import compute

cluster_name = "susumu-cluster"

pipeline_cluster = compute.ComputeTarget(workspace=ws, name=cluster_name)

pipeline_cluster.wait_for_completion(show_output=True)

Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
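
The cell above raises an exception if no compute target with the given name exists. If the cluster might not have been created yet, a create-or-reuse pattern could be sketched as follows. The helper name `get_or_create_cluster` and the default `vm_size` and `max_nodes` values are assumptions for illustration, not settings taken from this lab.

```python
def get_or_create_cluster(ws, cluster_name,
                          vm_size="STANDARD_DS11_V2", max_nodes=2):
    """Return the named compute cluster, creating it if it does not exist.

    The vm_size and max_nodes defaults are illustrative assumptions.
    """
    # Imported lazily so this function can be defined without the SDK loaded.
    from azureml.core.compute import AmlCompute, ComputeTarget
    from azureml.core.compute_target import ComputeTargetException

    try:
        # Look up an existing compute target by name.
        cluster = ComputeTarget(workspace=ws, name=cluster_name)
    except ComputeTargetException:
        # Not found: describe the cluster we want and ask the workspace to create it.
        config = AmlCompute.provisioning_configuration(vm_size=vm_size,
                                                       max_nodes=max_nodes)
        cluster = ComputeTarget.create(ws, cluster_name, config)

    # Block until the cluster is provisioned and ready.
    cluster.wait_for_completion(show_output=True)
    return cluster
```

With this helper, `pipeline_cluster = get_or_create_cluster(ws, cluster_name)` would work whether or not the cluster was provisioned beforehand.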


The compute will require a Python environment with the necessary package dependencies installed, so we'll create a run configuration.

In [5]:
from azureml.core import conda_dependencies
from azureml.core import runconfig

diabetes_env = core.Environment('diabetes-pipeline-env')
diabetes_env.python.user_managed_dependencies = False
diabetes_env.docker.enabled = True

diabetes_packages = conda_dependencies.CondaDependencies.create(
    conda_packages=['scikit-learn','pandas'],
    pip_packages=['azureml-defaults','azureml-dataprep[pandas]']
)

diabetes_env.python.conda_dependencies = diabetes_packages

diabetes_env.register(workspace=ws)
registered_env = core.Environment.get(ws, 'diabetes-pipeline-env')

pipeline_run_config = runconfig.RunConfiguration()

pipeline_run_config.target = pipeline_cluster

pipeline_run_config.environment = registered_env

print ("Run configuration created.")

Run configuration created.


## Create and Run a Pipeline

Now you're ready to define and run the pipeline.

First, you need to define the steps for the pipeline and any data references that need to be passed between them. In this case, the first step must write the model to a folder that the second step can read. Because the steps run on remote compute (and could each run on different compute), the folder path must be passed as a data reference to a location in a datastore within the workspace. The **PipelineData** object is a special kind of data reference that passes data from the output of one pipeline step to the input of another, creating a dependency between them. You'll create one and use it as the output of the first step and the input of the second step. Note that you also need to pass it as a script argument so that your code can access the datastore location it references.

In [8]:
from azureml import pipeline
from azureml.pipeline import steps
from azureml.train import estimator

diabetes_ds = ws.datasets.get('diabetes dataset')

model_folder = pipeline.core.PipelineData(
    "model_folder", datastore=ws.get_default_datastore(),
)

experiment_folder = 'diabetes-pipeline'
config = estimator.Estimator(
    source_directory=experiment_folder,
    compute_target = pipeline_cluster,
    environment_definition=pipeline_run_config.environment,
    entry_script='train_diabetes.py'
)

train_step = steps.EstimatorStep(
    name='Train Model',
    estimator=config, 
    estimator_entry_script_arguments=['--output_folder', model_folder],
    inputs=[diabetes_ds.as_named_input('diabetes_train')],
    outputs=[model_folder],
    compute_target = pipeline_cluster,
    allow_reuse = True,
)

register_step = steps.PythonScriptStep(
    name='Register Model',
    source_directory= experiment_folder,
    script_name='register_diabetes.py',
    arguments = ['--model_folder', model_folder],
    inputs=[model_folder],
    compute_target = pipeline_cluster,
    runconfig = pipeline_run_config,
    allow_reuse = True,
)

print('Pipeline steps defined')

Pipeline steps defined
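
The step definitions pass the **PipelineData** reference to each script as a command-line argument (`--output_folder` for training, `--model_folder` for registration). The contents of `train_diabetes.py` and `register_diabetes.py` are not shown in this lab, so the following is only a sketch of how the script side of that hand-off might work. The helper names and the JSON placeholder "model" are assumptions; a real training script would more likely serialize a fitted scikit-learn model. The example simulates both steps locally with a temporary folder standing in for the datastore location.

```python
import argparse
import json
import os
import tempfile


def parse_folder_arg(flag, argv):
    """Read a single folder path passed as `flag` on the command line."""
    parser = argparse.ArgumentParser()
    parser.add_argument(flag, type=str, dest="folder", required=True)
    # parse_known_args tolerates any extra arguments added by the runtime.
    args, _ = parser.parse_known_args(argv)
    return args.folder


def save_output(folder, payload, filename="model.json"):
    """Training side: create the output folder if needed and write into it."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    with open(path, "w") as f:
        json.dump(payload, f)
    return path


def load_input(folder, filename="model.json"):
    """Registration side: read what the previous step wrote."""
    with open(os.path.join(folder, filename)) as f:
        return json.load(f)


# Local simulation: a temporary directory plays the role of the shared location.
shared = os.path.join(tempfile.mkdtemp(), "model_folder")

# Step 1 receives the path as --output_folder and writes its result there.
train_folder = parse_folder_arg("--output_folder", ["--output_folder", shared])
save_output(train_folder, {"coef": [0.1, 0.2]})

# Step 2 receives the same path as --model_folder and reads the result back.
register_folder = parse_folder_arg("--model_folder", ["--model_folder", shared])
restored = load_input(register_folder)
```

Calling `os.makedirs(..., exist_ok=True)` before writing is a defensive choice in this sketch, since the script should not assume the output folder already exists when the step starts.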


OK, now you're ready to build the pipeline from the steps you've defined and run it as an experiment.

> **Note**: This may take a while. The training cluster must be started and configured with the Python environment before the scripts can be run. Now might be a good time to take a coffee break!

In [None]:
from azureml import widgets

pipeline_steps = [train_step, register_step]
pl = pipeline.core.Pipeline(workspace=ws, steps=pipeline_steps)
print('Pipeline is built.')

experiment = core.Experiment(workspace=ws, name='diabetes-training-pipeline')
pipeline_run = experiment.submit(pl, regenerate_outputs=True)
print('Pipeline submitted for execution.')

widgets.RunDetails(pipeline_run).show()
pipeline_run.wait_for_completion()

The widget above shows details of the pipeline as it runs. You can also monitor pipeline runs in the **Experiments** page in [Azure Machine Learning studio](https://ml.azure.com).

> **Note**: If the widget displays the message `["AttributeError: 'NoneType' object has no attribute 'id'\n"]`, you can safely ignore it!

When the pipeline has finished, a new model should be registered with a *Training context* tag indicating it was trained in a pipeline. Run the following code to verify this.

In [None]:
from azureml.core import Model

for model in Model.list(ws):
    print(model.name, 'version:', model.version)
    for tag_name in model.tags:
        tag = model.tags[tag_name]
        print ('\t',tag_name, ':', tag)
    for prop_name in model.properties:
        prop = model.properties[prop_name]
        print ('\t',prop_name, ':', prop)
    print('\n')

## Publish the Pipeline

Now that you've created a pipeline and verified it works, you can publish it as a REST service.

In [None]:
published_pipeline = pl.publish(name="Diabetes_Training_Pipeline",
                                description="Trains diabetes model",
                                version="1.0")
rest_endpoint = published_pipeline.endpoint
print(rest_endpoint)

To use the endpoint, client applications need to make a REST call over HTTP. This request must be authenticated, so an authorization header is required. A real application would require a service principal with which to be authenticated, but to test this out, we'll use the authorization header from your current connection to your Azure workspace, which you can get using the following code:

In [None]:
from azureml.core.authentication import InteractiveLoginAuthentication

interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()

Now you're ready to call the REST interface. The pipeline runs asynchronously, so you'll get an identifier back, which you can use to track the pipeline experiment as it runs:

In [None]:
import requests
experiment_name = 'Run-diabetes-pipeline'

response = requests.post(rest_endpoint, 
                         headers=auth_header, 
                         json={"ExperimentName": experiment_name})
run_id = response.json()["Id"]
run_id
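
The cell above assumes the request succeeded and that the response contains an `Id` field. If authentication has expired or the endpoint is wrong, indexing the response directly fails with an unhelpful error. As a hedged sketch, a small helper (the name `run_id_from_response` is hypothetical) could check the response before extracting the run ID:

```python
import json


def run_id_from_response(status_code, body_text):
    """Return the run ID from a pipeline endpoint response, or raise with context."""
    if status_code >= 400:
        # Include a short excerpt of the body to help diagnose auth or URL problems.
        raise RuntimeError(
            f"Pipeline endpoint returned HTTP {status_code}: {body_text[:200]}")
    try:
        payload = json.loads(body_text)
    except json.JSONDecodeError as err:
        raise RuntimeError("Response body was not valid JSON") from err
    if not isinstance(payload, dict) or "Id" not in payload:
        raise RuntimeError("No 'Id' field in response")
    return payload["Id"]
```

With the `response` object from the cell above, this would be used as `run_id = run_id_from_response(response.status_code, response.text)`.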

Since you have the run ID, you can use the **RunDetails** widget to view the experiment as it runs.

> **Note**: The pipeline should complete quickly, because each step was configured to allow output reuse. This was done primarily for convenience and to save time in this example. In reality, you'd likely want the first step to run every time in case the data has changed, and trigger the subsequent steps only if the output from step one changes.
>
> The widget may not refresh quickly enough to indicate that the pipeline run has completed. Keep an eye on the kernel indicator at the top right of the page: when it turns from **&#9899;** to **&#9711;**, the code has finished running.

In [None]:
from azureml.pipeline.core.run import PipelineRun
from azureml.widgets import RunDetails

published_pipeline_run = PipelineRun(ws.experiments[experiment_name], run_id)
RunDetails(published_pipeline_run).show()
published_pipeline_run.wait_for_completion()

This is a simple example, designed to demonstrate the principle. In reality, you could build more sophisticated logic into the pipeline steps - for example, evaluating the model against some test data to calculate a performance metric like AUC or accuracy, comparing the metric to that of any previously registered versions of the model, and only registering the new model if it performs better.
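
The "register only if better" logic could be sketched as below. This assumes, purely for illustration, that each previously registered model version stores its metric as a string in its properties under a key such as `AUC`; the function names and that key are hypothetical and are not part of this lab's scripts.

```python
def best_previous_metric(property_dicts, metric_name="AUC"):
    """Return the highest recorded metric across earlier model versions, or None."""
    values = []
    for props in property_dicts:
        raw = props.get(metric_name)
        if raw is None:
            continue  # this version did not record the metric
        try:
            values.append(float(raw))  # property values are assumed to be strings
        except (TypeError, ValueError):
            continue  # skip values that cannot be read as numbers
    return max(values) if values else None


def should_register(new_metric, property_dicts, metric_name="AUC"):
    """Register when there is nothing to compare against or the new model is strictly better."""
    best = best_previous_metric(property_dicts, metric_name)
    return best is None or new_metric > best
```

In a registration step, `property_dicts` might be built from `[m.properties for m in Model.list(ws, name=...)]`, with the model name left for you to fill in.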

You can use the [Azure Machine Learning extension for Azure DevOps](https://marketplace.visualstudio.com/items?itemName=ms-air-aiagility.vss-services-azureml) to combine Azure ML pipelines with Azure DevOps pipelines (yes, it *is* confusing that they have the same name!) and integrate model retraining into a *continuous integration/continuous deployment (CI/CD)* process. For example, you could use an Azure DevOps *build* pipeline to trigger an Azure ML pipeline that trains and registers a model, and when the model is registered it could trigger an Azure DevOps *release* pipeline that deploys the model as a web service, along with the application or service that consumes the model.