Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.


# Azure Machine Learning Pipeline with AutoMLStep
This notebook demonstrates how to use `AutoMLStep` in an Azure Machine Learning pipeline.

## Introduction
This example shows how to use an Azure ML `Dataset` to load data for AutoML in an Azure ML pipeline.

If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you have executed the [configuration](https://aka.ms/pl-config) before running this notebook.

In this notebook you will learn how to:
1. Create an `Experiment` in an existing `Workspace`.
2. Create an `AmlCompute` cluster or attach an existing one to the workspace.
3. Define data loading in a `TabularDataset`.
4. Configure AutoML using `AutoMLConfig`.
5. Use `AutoMLStep`.
6. Train the model using `AmlCompute`.
7. Explore the results.
8. Test the best fitted model.

## Azure Machine Learning and Pipeline SDK-specific imports

In [47]:
import logging
import os
import csv

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
import pkg_resources

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.dataset import Dataset
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

from azureml.train.automl.runtime import AutoMLStep

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

SDK version: 1.0.83


## Initialize Workspace
Initialize a workspace object from persisted configuration. Make sure the config file is present at `.\config.json`.

In [2]:
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

## Create an Azure ML experiment
Let's create an experiment named "automlstep-classification2" and a folder to hold the training scripts. The script runs will be recorded under the experiment in Azure.

A good practice is to keep each step's scripts and their dependent files in a separate folder and to specify that folder as the step's `source_directory`. This reduces the size of the snapshot created for the step, because only that folder is snapshotted. Because a change to any file in the `source_directory` triggers a re-upload of the snapshot, keeping the folders separate also lets a step be reused when nothing in its own `source_directory` has changed.
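
The effect of keeping one folder per step can be illustrated without the SDK. The sketch below is only an analogy: it assumes a hypothetical layout with `prep_step` and `train_step` folders and uses a simple SHA-256 fingerprint of each folder's contents. It does not reproduce how Azure ML actually builds or compares snapshots. It shows that editing a file in one folder leaves the other folder's fingerprint unchanged.

```python
import hashlib
import os
import tempfile


def folder_fingerprint(folder):
    """Hash the relative paths and contents of all files in a folder.

    Illustrative only: this is not the mechanism Azure ML uses for snapshots.
    """
    digest = hashlib.sha256()
    for root, dirs, files in os.walk(folder):
        dirs.sort()  # visit subfolders in a stable order
        for name in sorted(files):
            path = os.path.join(root, name)
            digest.update(os.path.relpath(path, folder).encode("utf-8"))
            with open(path, "rb") as f:
                digest.update(f.read())
    return digest.hexdigest()


def write_file(path, text):
    with open(path, "w") as f:
        f.write(text)


with tempfile.TemporaryDirectory() as base:
    # Hypothetical layout: one source folder per pipeline step.
    prep_dir = os.path.join(base, "prep_step")
    train_dir = os.path.join(base, "train_step")
    os.makedirs(prep_dir)
    os.makedirs(train_dir)
    write_file(os.path.join(prep_dir, "prep.py"), "print('prep')\n")
    write_file(os.path.join(train_dir, "train.py"), "print('train')\n")

    prep_before = folder_fingerprint(prep_dir)
    train_before = folder_fingerprint(train_dir)

    # Edit only the training script.
    write_file(os.path.join(train_dir, "train.py"), "print('train v2')\n")

    prep_changed = folder_fingerprint(prep_dir) != prep_before
    train_changed = folder_fingerprint(train_dir) != train_before

print("prep folder changed:", prep_changed)
print("train folder changed:", train_changed)
```

If both scripts lived in a single shared folder, the edit to `train.py` would change the fingerprint seen by both steps.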

In [49]:
# Choose a name for the run history container in the workspace.
experiment_name = 'automlstep-classification2'
project_folder = './project'

experiment = Experiment(ws, experiment_name)
experiment

Name,Workspace,Report Page,Docs Page
automlstep-classification2,AMLWKSP2020,Link to Azure Machine Learning studio,Link to Documentation


### Create or Attach an AmlCompute cluster
You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for your AutoML run. In this tutorial, you use an `AmlCompute` cluster as the training compute resource. The cell below reuses the cluster named `amlcluster` if it already exists in the workspace and creates it otherwise.

In [50]:
# Choose a name for your cluster.
amlcompute_cluster_name = "amlcluster"

found = False
# Check if this compute target already exists in the workspace.
cts = ws.compute_targets
if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'AmlCompute':
    found = True
    print('Found existing compute target.')
    compute_target = cts[amlcompute_cluster_name]
    
if not found:
    print('Creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = "STANDARD_D2_V2", # for GPU, use "STANDARD_NC6"
                                                                #vm_priority = 'lowpriority', # optional
                                                                max_nodes = 4)

    # Create the cluster.
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, provisioning_config)
    
    # Can poll for a minimum number of nodes and for a specific timeout.
    # If no min_node_count is provided, it will use the scale settings for the cluster.
    compute_target.wait_for_completion(show_output = True, min_node_count = 1, timeout_in_minutes = 10)
    
    # For a more detailed view of current AmlCompute status, use get_status().

Found existing compute target.


In [51]:
# create a new RunConfig object
conda_run_config = RunConfiguration(framework="python")

conda_run_config.environment.docker.enabled = True
conda_run_config.environment.docker.base_image = azureml.core.runconfig.DEFAULT_CPU_IMAGE

cd = CondaDependencies.create(pip_packages=['azureml-sdk[automl]'],
                              conda_packages=['numpy==1.16.2', 'scipy==1.1.0',
                                              'scikit-learn==0.20.3', 'py-xgboost<=0.80'])
conda_run_config.environment.python.conda_dependencies = cd

print('run config is ready')

run config is ready


## Data

In [52]:

train_data = Dataset.get_by_name(ws,"iris")


In [53]:
train_data.take(5).to_pandas_dataframe()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


### Review the Dataset Result

You can peek at the result of a `TabularDataset` at any range using `skip(i)` and `take(j).to_pandas_dataframe()`. Doing so evaluates only `j` records for all the steps in the `TabularDataset`, which makes it fast even against large datasets.

`TabularDataset` objects are composed of a list of transformation steps (optional).
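
To see why a lazily evaluated chain of steps makes previews cheap, here is a plain-Python analogy built on generators and `itertools.islice`. It is not the `TabularDataset` implementation. The functions `read_records` and `add_double` are hypothetical stand-ins for a data source and a transformation step. Unlike the behavior described above, this analogy still reads the skipped records, but it never touches the rest of the data.

```python
from itertools import islice


def read_records(n, log):
    """Yield n fake records, logging each one that is actually produced."""
    for i in range(n):
        log.append(i)
        yield {"row": i}


def add_double(records):
    """A lazy transformation step applied one record at a time."""
    for rec in records:
        yield {**rec, "double": rec["row"] * 2}


produced = []
steps = add_double(read_records(1_000_000, produced))

# Rough analogue of skip(2) followed by take(3): rows 2, 3 and 4.
preview = list(islice(steps, 2, 5))

print(preview)
print("records produced:", len(produced))
```

Only a handful of the one million possible records are ever generated, because each step pulls records on demand instead of materializing the whole dataset first.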

## Train
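The following cell defines the general AutoML settings and passes them to `AutoMLConfig` together with the compute target, run configuration, training data, and label column.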

In [54]:
target_column_name = 'species'

automl_settings = {
    "iteration_timeout_minutes" : 10,
    "iterations" : 5,
    "primary_metric" : 'AUC_weighted',
    "preprocess" : True,
    "verbosity" : logging.INFO
}
automl_config = AutoMLConfig(task = 'classification',
                             debug_log = 'automl_errors.log',
                             path = project_folder,
                             compute_target=compute_target,
                             max_concurrent_iterations=4,
                             max_cores_per_iteration=-1,
                             run_configuration=conda_run_config,
                             training_data=train_data,
                             label_column_name=target_column_name,
                             **automl_settings
                            )

You can define outputs for the AutoMLStep using TrainingOutput.

In [55]:
from azureml.pipeline.core import PipelineData, TrainingOutput

ds = ws.get_default_datastore()
metrics_output_name = 'metrics_output'
best_model_output_name = 'best_model_output'

metrics_data = PipelineData(name='metrics_data',
                           datastore=ds,
                           pipeline_output_name=metrics_output_name,
                           training_output=TrainingOutput(type='Metrics'))
model_data = PipelineData(name='model_data',
                           datastore=ds,
                           pipeline_output_name=best_model_output_name,
                           training_output=TrainingOutput(type='Model'))

Create an AutoMLStep.

In [56]:
automl_step = AutoMLStep(
    name='automl_module',
    automl_config=automl_config,
    outputs=[metrics_data, model_data],
    allow_reuse=True)

In [57]:
from azureml.pipeline.core import Pipeline
pipeline = Pipeline(
    description="pipeline_with_automlstep_4",
    workspace=ws,    
    steps=[automl_step])

In [58]:
pipeline_run = experiment.submit(pipeline)

Created step automl_module [8201866d][207bec97-9317-486f-9873-a0dd9961612a], (This step will run and generate new outputs)
Submitted PipelineRun 2c4de33c-6765-4097-817d-17ea1a27292f
Link to Azure Machine Learning studio: https://ml.azure.com/experiments/automlstep-classification2/runs/2c4de33c-6765-4097-817d-17ea1a27292f?wsid=/subscriptions/08d28dbf-1252-4438-b3be-0188e3803935/resourcegroups/AMLGRP2020/workspaces/AMLWKSP2020


In [59]:
from azureml.widgets import RunDetails
RunDetails(pipeline_run).show()

_PipelineWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', …

In [65]:
pipeline_run.wait_for_completion()

PipelineRunId: 2c4de33c-6765-4097-817d-17ea1a27292f
Link to Portal: https://ml.azure.com/experiments/automlstep-classification2/runs/2c4de33c-6765-4097-817d-17ea1a27292f?wsid=/subscriptions/08d28dbf-1252-4438-b3be-0188e3803935/resourcegroups/AMLGRP2020/workspaces/AMLWKSP2020

PipelineRun Execution Summary
PipelineRun Status: Finished
{'runId': '2c4de33c-6765-4097-817d-17ea1a27292f', 'status': 'Completed', 'startTimeUtc': '2020-01-28T23:05:27.357261Z', 'endTimeUtc': '2020-01-28T23:49:13.746891Z', 'properties': {'azureml.runsource': 'azureml.PipelineRun', 'runSource': 'SDK', 'runType': 'SDK', 'azureml.parameters': '{}'}, 'inputDatasets': [], 'logFiles': {'logs/azureml/executionlogs.txt': 'https://amlwksp20209321273990.blob.core.windows.net/azureml/ExperimentRun/dcid.2c4de33c-6765-4097-817d-17ea1a27292f/logs/azureml/executionlogs.txt?sv=2019-02-02&sr=b&sig=YezqVec8oZfE2ELXRA7TeJLIY3oAoaUjCiUIZgVKbRM%3D&st=2020-01-28T23%3A39%3A23Z&se=2020-01-29T07%3A49%3A23Z&sp=r', 'logs/azureml/stderrlogs

'Finished'

## Examine Results

### Retrieve the metrics of all child runs
The outputs of the run above can be used as inputs to other steps in a pipeline. In this tutorial, we examine the outputs by retrieving the output data and inspecting it.

In [62]:
metrics_output = pipeline_run.get_pipeline_output(metrics_output_name)
num_file_downloaded = metrics_output.download('.', show_progress=True)

Downloading azureml/9361527c-b8db-4fa8-93e2-584c7d0555f4/metrics_data
Downloaded azureml/9361527c-b8db-4fa8-93e2-584c7d0555f4/metrics_data, 1 files out of an estimated total of 1


In [63]:
pip freeze

absl-py==0.9.0
adal==1.2.2
alabaster==0.7.12
alembic==1.3.2
anaconda-client==1.7.2
anaconda-project==0.8.3
ansiwrap==0.8.4
applicationinsights==0.11.9
asn1crypto==1.0.1
astor==0.8.1
astroid==2.3.1
astropy==3.2.1
atomicwrites==1.3.0
attrs==19.2.0
azure-common==1.1.24
azure-graphrbac==0.61.1
azure-mgmt-authorization==0.60.0
azure-mgmt-containerregistry==2.8.0
azure-mgmt-keyvault==2.0.0
azure-mgmt-resource==7.0.0
azure-mgmt-storage==7.0.0
azure-storage-blob==2.1.0
azure-storage-common==2.1.0
azureml-automl-core==1.0.83.1
azureml-automl-runtime==1.0.83
azureml-contrib-datadrift==1.0.83
azureml-contrib-interpret==1.0.83
azureml-contrib-notebook==1.0.83
azureml-contrib-pipeline-steps==1.0.83
azureml-contrib-reinforcementlearning==1.0.83
azureml-contrib-server==1.0.83
azureml-contrib-services==1.0.83
azureml-core==1.0.83
azureml-datadrift==1.0.83
azureml-dataprep==1.1.35
azureml-dataprep-native==13.2.0
azureml-defaults==1.0.83
azureml-explain-model==1.0.83
azureml-interpret==1.0.83
azureml-ml

In [64]:
import json
with open(metrics_output._path_on_datastore) as f:
    metrics_output_result = f.read()
    
deserialized_metrics_output = json.loads(metrics_output_result)
df = pd.DataFrame(deserialized_metrics_output)
df

Unnamed: 0,9361527c-b8db-4fa8-93e2-584c7d0555f4_0,9361527c-b8db-4fa8-93e2-584c7d0555f4_2,9361527c-b8db-4fa8-93e2-584c7d0555f4_1,9361527c-b8db-4fa8-93e2-584c7d0555f4_3,9361527c-b8db-4fa8-93e2-584c7d0555f4_4
AUC_macro,[0.9993827160493828],[0.9987654320987656],[1],[1],[0.9993827160493828]
AUC_micro,[0.9964444444444445],[0.9975555555555558],[0.9982222222222223],[0.9975555555555555],[0.9953333333333333]
AUC_weighted,[0.9992592592592592],[0.9985185185185186],[1],[1],[0.9992592592592592]
accuracy,[0.96],[0.9533333333333334],[0.9733333333333334],[0.96],[0.9666666666666668]
average_precision_score_macro,[0.9992063492063492],[0.9984126984126984],[1],[1],[0.9992063492063492]
average_precision_score_micro,[0.9934921494913753],[0.9954869281045753],[0.9967091503267973],[0.9955081699346404],[0.9900387009791034]
average_precision_score_weighted,[0.9990476190476191],[0.9980952380952381],[1],[1],[0.9990476190476191]
balanced_accuracy,[0.9588888888888889],[0.9533333333333334],[0.9755555555555556],[0.9588888888888889],[0.9700000000000001]
f1_score_macro,[0.9479715056185644],[0.9431282007752596],[0.9621933621933622],[0.9479715056185644],[0.9565989565989564]
f1_score_micro,[0.96],[0.9533333333333334],[0.9733333333333334],[0.96],[0.9666666666666668]
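
The metrics can also be ranked without pandas. The sketch below assumes the downloaded JSON has the shape `{child_run_id: {metric_name: [value]}}`, which is inferred from how the DataFrame above is laid out and is not a documented schema. It copies a subset of the values shown above and uses a hypothetical helper, `rank_runs`, that assumes higher values are better.

```python
import json

# Subset of the table above, keyed by child run id (assumed JSON shape).
parent = "9361527c-b8db-4fa8-93e2-584c7d0555f4"
sample = {
    parent + "_0": {"AUC_weighted": [0.9992592592592592], "accuracy": [0.96]},
    parent + "_2": {"AUC_weighted": [0.9985185185185186], "accuracy": [0.9533333333333334]},
    parent + "_1": {"AUC_weighted": [1], "accuracy": [0.9733333333333334]},
    parent + "_3": {"AUC_weighted": [1], "accuracy": [0.96]},
    parent + "_4": {"AUC_weighted": [0.9992592592592592], "accuracy": [0.9666666666666668]},
}

# Stand-in for reading the downloaded metrics file from disk.
metrics = json.loads(json.dumps(sample))


def rank_runs(metrics, metric_name):
    """Return (run_id, value) pairs sorted from highest to lowest value."""
    scores = [(run_id, values[metric_name][0]) for run_id, values in metrics.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = rank_runs(metrics, "accuracy")
best_run_id, best_accuracy = ranking[0]
print("best run by accuracy:", best_run_id, best_accuracy)
```

Note that two child runs tie on `AUC_weighted` in this table, so ranking on the primary metric alone does not single out one run here.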


### Retrieve the Best Model

In [66]:
best_model_output = pipeline_run.get_pipeline_output(best_model_output_name)
num_file_downloaded = best_model_output.download('.', show_progress=True)

Downloading azureml/9361527c-b8db-4fa8-93e2-584c7d0555f4/model_data
Downloaded azureml/9361527c-b8db-4fa8-93e2-584c7d0555f4/model_data, 1 files out of an estimated total of 1


### Publish Pipeline

In [67]:
from azureml.pipeline.core import PublishedPipeline

published_pipeline = pipeline.publish(name="auotml_training_pipeline6",
                                        description="Auto ML Training Pipeline",
                                        continue_on_step_failure=True)
published_pipeline

Name,Id,Status,Endpoint
auotml_training_pipeline6,cfc020a9-b25d-44b9-a9ed-dc3acc2b29c9,Active,REST Endpoint


In [68]:
from azureml.pipeline.core import PublishedPipeline

all_pub_pipelines = PublishedPipeline.list(ws, active_only=True, _service_endpoint=None)
all_pub_pipelines

[Pipeline(Name: auotml_training_pipeline6,
 Id: cfc020a9-b25d-44b9-a9ed-dc3acc2b29c9,
 Status: Active,
 Endpoint: https://westus.aether.ms/api/v1.0/subscriptions/08d28dbf-1252-4438-b3be-0188e3803935/resourceGroups/AMLGRP2020/providers/Microsoft.MachineLearningServices/workspaces/AMLWKSP2020/PipelineRuns/PipelineSubmit/cfc020a9-b25d-44b9-a9ed-dc3acc2b29c9),
 Pipeline(Name: Copy of MortageExp2020125 on 01-25-2020-real time inference 01-26-2020-12-03,
 Id: 6117bd53-d38d-4e01-a1c0-b718e27d9490,
 Status: Active,
 Endpoint: https://westus.aether.ms/api/v1.0/subscriptions/08d28dbf-1252-4438-b3be-0188e3803935/resourceGroups/AMLGRP2020/providers/Microsoft.MachineLearningServices/workspaces/AMLWKSP2020/PipelineRuns/PipelineSubmit/6117bd53-d38d-4e01-a1c0-b718e27d9490),
 Pipeline(Name: Copy of MortageExp2020125 on 01-25-2020 01-25-2020-02-37,
 Id: 4eec5cd5-1b5a-4837-b0b0-c9cf00650c78,
 Status: Active,
 Endpoint: https://westus.aether.ms/api/v1.0/subscriptions/08d28dbf-1252-4438-b3be-0188e3803935/r

In [73]:
#pipeline_id = 'copy from above output'
#published_pipeline = PublishedPipeline.get(ws, pipeline_id)
#published_pipeline

In [71]:
from azureml.core.authentication import InteractiveLoginAuthentication
import requests

auth = InteractiveLoginAuthentication()
aad_token = auth.get_authentication_header()

rest_endpoint1 = published_pipeline.endpoint

print("You can perform HTTP POST on URL {} to trigger this pipeline".format(rest_endpoint1))

# specify the param when running the pipeline
response = requests.post(rest_endpoint1, 
                         headers=aad_token, 
                         json={"ExperimentName": "My_Pipeline3",
                               "RunSource": "SDK",
                               "ParameterAssignments": {"pipeline_arg": 45}})

You can perform HTTP POST on URL https://westus.aether.ms/api/v1.0/subscriptions/08d28dbf-1252-4438-b3be-0188e3803935/resourceGroups/AMLGRP2020/providers/Microsoft.MachineLearningServices/workspaces/AMLWKSP2020/PipelineRuns/PipelineSubmit/cfc020a9-b25d-44b9-a9ed-dc3acc2b29c9 to trigger this pipeline
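
The JSON body sent in the POST request above can be assembled by a small helper, which keeps the field names in one place when the pipeline is triggered from several scripts. This is a minimal sketch: `build_submit_body` is a hypothetical helper, and the three field names are simply the ones used in the cell above. It builds and serializes the body only and does not contact the endpoint.

```python
import json


def build_submit_body(experiment_name, parameters=None):
    """Assemble the request body used to trigger a published pipeline.

    Field names mirror the request in the notebook cell; no other
    fields are assumed.
    """
    return {
        "ExperimentName": experiment_name,
        "RunSource": "SDK",
        "ParameterAssignments": dict(parameters or {}),
    }


body = build_submit_body("My_Pipeline3", {"pipeline_arg": 45})
encoded = json.dumps(body)
print(encoded)
```

The resulting dictionary can be passed as the `json=` argument of `requests.post`, as in the cell above.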


In [72]:
try:
    response.raise_for_status()
except Exception:    
    raise Exception('Received bad response from the endpoint: {}\n'
                    'Response Code: {}\n'
                    'Headers: {}\n'
                    'Content: {}'.format(rest_endpoint1, response.status_code, response.headers, response.content))

run_id = response.json().get('Id')
print('Submitted pipeline run: ', run_id)

Submitted pipeline run:  a5a388ee-eca7-4dca-a4bf-1af975a7f8a3
