Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# Azure Machine Learning Pipeline with AutoMLStep
This notebook demonstrates the use of `AutoMLStep` in an Azure Machine Learning Pipeline.

## Introduction
In this example we run automated machine learning (AutoML) in Azure ML on a dataset that is already registered in the workspace.

If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you have executed the [configuration](https://aka.ms/pl-config) before running this notebook.

In this notebook we:
1. Create an `Experiment` in an existing `Workspace`.
2. Create or attach an existing `AmlCompute` target to the workspace.
3. Define data loading in a `TabularDataset`.
4. Configure AutoML using `AutoMLConfig`.
5. Use `AutoMLStep` in a pipeline.
6. Train the model using `AmlCompute`.
7. Explore the results.
8. Test the best fitted model.

## Azure Machine Learning and Pipeline SDK-specific imports

In [1]:
import logging
import os
import csv

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
import pkg_resources

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.core.dataset import Dataset

from azureml.pipeline.steps import AutoMLStep
from sklearn.model_selection import train_test_split

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

SDK version: 1.20.0


## Initialize Workspace
Initialize a workspace object.

In [2]:
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

quick-starts-ws-137127
aml-quickstarts-137127
southcentralus
aa7cf8e8-d23f-4bce-a7b9-1f0b4e0ac8ee


## Create an Azure ML experiment
Let's create an experiment named "automlstep-regression" and a folder to hold the training scripts. The script runs will be recorded under the experiment in Azure.

The best practice is to give each step its own folder for its scripts and dependent files, and to specify that folder as the step's `source_directory`. This reduces the size of the snapshot created for the step (only that folder is snapshotted). Because a change to any file in the `source_directory` triggers a re-upload of the snapshot, keeping the folders separate also lets a step be reused when nothing in its own `source_directory` has changed.

In [3]:
# Choose a name for the run history container in the workspace.
# NOTE: update these to match your existing experiment name
experiment_name = 'automlstep-regression'
project_folder = './pipeline-project'

experiment = Experiment(ws, experiment_name)
experiment

Name,Workspace,Report Page,Docs Page
automlstep-regression,quick-starts-ws-137127,Link to Azure Machine Learning studio,Link to Documentation


### Create or Attach an AmlCompute cluster
You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for your AutoML run. Here, we use a cluster named `automl` as the training compute resource.

**Udacity Note** There is no need to create a new compute target; the code below reuses the previous cluster if it is present.

In [4]:
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.compute_target import ComputeTargetException

# NOTE: update the cluster name to match the existing cluster
# Choose a name for your CPU cluster
amlcompute_cluster_name = "automl"

# Verify that cluster does not exist already
try:
    compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(
        vm_size='STANDARD_D2_V2',  # for GPU, use "STANDARD_NC6"
        # vm_priority='lowpriority',  # optional
        max_nodes=4)
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True, min_node_count = 1, timeout_in_minutes = 10)
# For a more detailed view of current AmlCompute status, use get_status().

Creating
Succeeded...............................................................................................................
AmlCompute wait for completion finished

Wait timeout has been reached
Current provisioning state of AmlCompute is "Succeeded" and current node count is "0"


## Data

**Udacity note:** Make sure `key` matches the name of the uploaded dataset and that the description matches as well. If the name is unknown or hard to find, loop over `ws.datasets.keys()` and `print()` each one.
If the dataset is *not* found because it was deleted, it can be recreated from the link to the source CSV.

In [5]:
ws.datasets.keys()

KeysView({'Houses': DatasetRegistration(id='61b32d94-91fa-4e84-ad15-faed934c58de', name='Houses', version=1, description='House prices and characteristics.', tags={})})

In [6]:
# Try to load the registered dataset from the workspace.
# NOTE: update the key to match the dataset name
key = "Houses"
description_text = "House prices and characteristics."

if key in ws.datasets.keys():
    dataset = ws.datasets[key]
else:
    # Fallback: look the dataset up by name in the same workspace.
    # This raises if the dataset is not registered; in that case,
    # re-register it from the source CSV first.
    dataset = Dataset.get_by_name(ws, name=key)

df = dataset.to_pandas_dataframe()
df.describe()

Unnamed: 0,Transactieprijs_m2,BU_g_woz_v2018,BU_g_woz_v2016,WK_g_woz_v2019,BU_g_ink_po_v2017,Woonoppervlakte,BAGbouwjaar,maanden_sinds_jan2004,BUURT_m2_alle_objecten,Woningtype_appartement,...,Koop_historisch_2020_med_transactieprijsm2_tussenwoningen_PC_123456,Koop_historisch_2019_med_transactieprijsm2_tussenwoningen_PC_123456,Koop_historisch_2018_med_transactieprijsm2_tussenwoningen_PC_123456,Koop_historisch_2020_med_transactieprijsm2_hoekwoningen_PC_12345,Koop_historisch_2019_med_transactieprijsm2_hoekwoningen_PC_12345,Koop_historisch_2018_med_transactieprijsm2_hoekwoningen_PC_12345,Koop_historisch_2019_med_transactieprijsm2_hoekwoningen_PC_123456,Koop_historisch_2020_med_transactieprijsm2_2onder1kappers_PC_12345,Koop_historisch_2019_med_transactieprijsm2_2onder1kappers_PC_12345,Koop_historisch_2018_med_transactieprijsm2_2onder1kappers_PC_12345
count,51314.0,51314.0,51314.0,51314.0,51314.0,51314.0,51314.0,51314.0,51314.0,51314.0,...,50458.0,50774.0,50822.0,51314.0,51314.0,48512.0,51263.0,51202.0,51070.0,50961.0
mean,3545.260494,272.382644,220.076568,303.708949,35.010845,101.161009,1952.495245,109.128074,282246.8,0.458627,...,-1.0,-1.0,-1.0,265.703654,277.5477,-1.0,-1.0,-1.0,-1.0,-1.0
std,1237.777275,94.270025,78.045218,66.238606,8.615431,46.006606,64.167687,60.577667,191316.9,0.49829,...,0.0,0.0,0.0,1006.343176,951.008235,0.0,0.0,0.0,0.0,0.0
min,207.428571,-1.0,-1.0,177.0,-1.0,11.0,1250.0,0.0,21498.0,0.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
25%,2625.0,212.0,164.0,250.0,29.1,74.0,1926.0,52.0,182797.0,0.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
50%,3294.117647,265.0,214.0,328.0,33.9,93.0,1961.0,119.0,245066.0,0.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
75%,4200.0,326.0,262.0,354.0,41.1,120.0,1997.0,162.0,326001.0,1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
max,23890.909091,771.0,647.0,403.0,63.8,1145.0,2020.0,203.0,1649683.0,1.0,...,-1.0,-1.0,-1.0,5326.219181,4895.16129,-1.0,-1.0,-1.0,-1.0,-1.0


### Review the Dataset Result

You can peek at the result of a `TabularDataset` over any range using `skip(i)` and `take(j).to_pandas_dataframe()`. Doing so evaluates only `j` records through all the steps in the `TabularDataset`, which makes it fast even against large datasets.

A `TabularDataset` is defined by a list of (optional) transformation steps, which are only evaluated when data is requested.

In [7]:
dataset.take(5).to_pandas_dataframe()

Unnamed: 0,Transactieprijs_m2,BU_g_woz_v2018,BU_g_woz_v2016,WK_g_woz_v2019,BU_g_ink_po_v2017,Woonoppervlakte,BAGbouwjaar,maanden_sinds_jan2004,BUURT_m2_alle_objecten,Woningtype_appartement,...,Koop_historisch_2020_med_transactieprijsm2_tussenwoningen_PC_123456,Koop_historisch_2019_med_transactieprijsm2_tussenwoningen_PC_123456,Koop_historisch_2018_med_transactieprijsm2_tussenwoningen_PC_123456,Koop_historisch_2020_med_transactieprijsm2_hoekwoningen_PC_12345,Koop_historisch_2019_med_transactieprijsm2_hoekwoningen_PC_12345,Koop_historisch_2018_med_transactieprijsm2_hoekwoningen_PC_12345,Koop_historisch_2019_med_transactieprijsm2_hoekwoningen_PC_123456,Koop_historisch_2020_med_transactieprijsm2_2onder1kappers_PC_12345,Koop_historisch_2019_med_transactieprijsm2_2onder1kappers_PC_12345,Koop_historisch_2018_med_transactieprijsm2_2onder1kappers_PC_12345
0,2700.0,156,137,219,28.1,40,1962,1,290265,1,...,-1,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1
1,1725.0,149,125,219,23.1,80,1956,1,413897,1,...,-1,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1
2,2350.0,189,151,247,28.1,60,1979,1,184252,1,...,-1,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1
3,2875.0,213,167,250,26.9,48,1919,1,116432,1,...,-1,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1
4,2057.142857,156,137,219,28.1,70,1962,1,290265,1,...,-1,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1


### Train - Test split

Now that we have the data, we do not need to create a train and test set ourselves as we did in the HyperDrive example. According to the Microsoft documentation, if you specify neither the `validation_data` nor the `n_cross_validations` parameter, AutoML applies a default technique based on the number of rows in the dataset. With more than 20k rows, it makes a train-validation split in which 10% of the data is reserved for the validation set, and the returned metrics are based on that validation set. With fewer than 20k rows, it applies cross-validation: 10-fold for fewer than 1,000 rows, otherwise 3-fold. In the HyperDrive example we chose a validation/test set of 20%. To keep the results comparable, we therefore set `validation_size` to 0.2 in the AutoML configuration as well.
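
The row-count rule described above can be written out as a small function. This is a minimal sketch of the defaults as summarized in this notebook, not AutoML's actual implementation. The function name `default_validation_strategy` and its return format are hypothetical, and the handling of exactly 1,000 or 20,000 rows is an assumption.

```python
def default_validation_strategy(n_rows):
    """Return the default validation technique described above for n_rows rows.

    Hypothetical helper for illustration only. How the boundaries at exactly
    1,000 and 20,000 rows are treated is an assumption.
    """
    if n_rows > 20_000:
        # Large datasets: hold out 10% of the rows as a validation set.
        return {"method": "train_validation_split", "validation_size": 0.1}
    if n_rows < 1_000:
        # Very small datasets: 10-fold cross-validation.
        return {"method": "cross_validation", "n_folds": 10}
    # Everything in between: 3-fold cross-validation.
    return {"method": "cross_validation", "n_folds": 3}


# The Houses dataset has 51,314 rows, so the default would be a 10% holdout.
print(default_validation_strategy(51_314))
```

Because this notebook passes `validation_size` explicitly, that default is replaced by a 20% holdout.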

## Train
This creates a general AutoML settings object.

**Udacity notes:** These inputs must match what was used when training in the portal. In this example, `label_column_name` has to be `Transactieprijs_m2` (the actual transaction price per m², i.e. the price the house sold for).

In [8]:
automl_settings = {
    "experiment_timeout_minutes": 30,
    "max_concurrent_iterations": 5,
    "primary_metric": 'spearman_correlation'
}

automl_config = AutoMLConfig(compute_target=compute_target,
                             task="regression",
                             training_data=dataset,
                             label_column_name="Transactieprijs_m2",
                             path=project_folder,
                             validation_size=0.2,
                             enable_early_stopping=True,
                             featurization='auto',
                             debug_log="automl_errors.log",
                             **automl_settings
                            )
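
The settings above select `spearman_correlation` as the primary metric. As a rough illustration of what that metric rewards, the sketch below computes a Spearman rank correlation in plain Python: it is the Pearson correlation of the ranks, so it measures whether predictions put the houses in the right order rather than how close the predicted prices are. The helper names are hypothetical, the predictions are made-up values, and this is not the code AutoML itself runs.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# First five Transactieprijs_m2 values from the dataset preview.
actual = [2700.0, 1725.0, 2350.0, 2875.0, 2057.142857]
# Made-up predictions: wrong in value, but in the same order as the actuals.
predicted = [2600.0, 1900.0, 2500.0, 3100.0, 2000.0]
print(spearman(actual, predicted))
```

In this toy case the score is perfect even though every prediction is off, which is why the notebook also looks at MAE when judging the model.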

#### Create Pipeline and AutoMLStep

You can define outputs for the `AutoMLStep` using `TrainingOutput`. Here we expose two outputs: the metrics of all child runs and the best model.

In [9]:
from azureml.pipeline.core import PipelineData, TrainingOutput

ds = ws.get_default_datastore()
metrics_output_name = 'metrics_output'
best_model_output_name = 'best_model_output'

metrics_data = PipelineData(name='metrics_data',
                           datastore=ds,
                           pipeline_output_name=metrics_output_name,
                           training_output=TrainingOutput(type='Metrics'))
model_data = PipelineData(name='model_data',
                           datastore=ds,
                           pipeline_output_name=best_model_output_name,
                           training_output=TrainingOutput(type='Model'))

Create an AutoMLStep.

In [10]:
automl_step = AutoMLStep(
    name='automl_module',
    automl_config=automl_config,
    outputs=[metrics_data, model_data],
    allow_reuse=True)

In [11]:
from azureml.pipeline.core import Pipeline
pipeline = Pipeline(
    description="pipeline_with_automlstep",
    workspace=ws,    
    steps=[automl_step])

In [12]:
pipeline_run = experiment.submit(pipeline)

Created step automl_module [8f1e87b8][db633287-4d61-4dfc-ba19-29d41e469787], (This step will run and generate new outputs)
Submitted PipelineRun 00f7a25f-80c7-4985-9039-9fe510c7b55c
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/automlstep-regression/runs/00f7a25f-80c7-4985-9039-9fe510c7b55c?wsid=/subscriptions/aa7cf8e8-d23f-4bce-a7b9-1f0b4e0ac8ee/resourcegroups/aml-quickstarts-137127/workspaces/quick-starts-ws-137127


In [13]:
from azureml.widgets import RunDetails
RunDetails(pipeline_run).show()

_PipelineWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', …

In [14]:
pipeline_run.wait_for_completion()

PipelineRunId: 00f7a25f-80c7-4985-9039-9fe510c7b55c
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/automlstep-regression/runs/00f7a25f-80c7-4985-9039-9fe510c7b55c?wsid=/subscriptions/aa7cf8e8-d23f-4bce-a7b9-1f0b4e0ac8ee/resourcegroups/aml-quickstarts-137127/workspaces/quick-starts-ws-137127
PipelineRun Status: Running


StepRunId: 9a406e11-de4c-4606-b078-eea3142f8067
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/automlstep-regression/runs/9a406e11-de4c-4606-b078-eea3142f8067?wsid=/subscriptions/aa7cf8e8-d23f-4bce-a7b9-1f0b4e0ac8ee/resourcegroups/aml-quickstarts-137127/workspaces/quick-starts-ws-137127
StepRun( automl_module ) Status: NotStarted
StepRun( automl_module ) Status: Running

StepRun(automl_module) Execution Summary
StepRun( automl_module ) Status: Finished



PipelineRun Execution Summary
PipelineRun Status: Finished
{'runId': '00f7a25f-80c7-4985-9039-9fe510c7b55c', 'status': 'Completed', 'startTimeUtc': '2021-02-03T21:15:17

'Finished'

## Examine Results

### Retrieve the metrics of all child runs
The outputs of the run above can be used as inputs to other steps in a pipeline. In this tutorial, we examine the outputs by retrieving the output data and running some tests.

In [15]:
metrics_output = pipeline_run.get_pipeline_output(metrics_output_name)
num_file_downloaded = metrics_output.download('.', show_progress=True)

Downloading azureml/9a406e11-de4c-4606-b078-eea3142f8067/metrics_data
Downloaded azureml/9a406e11-de4c-4606-b078-eea3142f8067/metrics_data, 1 files out of an estimated total of 1


In [16]:
import json
with open(metrics_output._path_on_datastore) as f:
    metrics_output_result = f.read()
    
deserialized_metrics_output = json.loads(metrics_output_result)
df = pd.DataFrame(deserialized_metrics_output)
df

Unnamed: 0,9a406e11-de4c-4606-b078-eea3142f8067_24,9a406e11-de4c-4606-b078-eea3142f8067_3,9a406e11-de4c-4606-b078-eea3142f8067_32,9a406e11-de4c-4606-b078-eea3142f8067_5,9a406e11-de4c-4606-b078-eea3142f8067_9,9a406e11-de4c-4606-b078-eea3142f8067_29,9a406e11-de4c-4606-b078-eea3142f8067_14,9a406e11-de4c-4606-b078-eea3142f8067_30,9a406e11-de4c-4606-b078-eea3142f8067_33,9a406e11-de4c-4606-b078-eea3142f8067_28,...,9a406e11-de4c-4606-b078-eea3142f8067_4,9a406e11-de4c-4606-b078-eea3142f8067_27,9a406e11-de4c-4606-b078-eea3142f8067_18,9a406e11-de4c-4606-b078-eea3142f8067_21,9a406e11-de4c-4606-b078-eea3142f8067_2,9a406e11-de4c-4606-b078-eea3142f8067_31,9a406e11-de4c-4606-b078-eea3142f8067_6,9a406e11-de4c-4606-b078-eea3142f8067_39,9a406e11-de4c-4606-b078-eea3142f8067_10,9a406e11-de4c-4606-b078-eea3142f8067_17
normalized_median_absolute_error,[0.012676222982082016],[0.016404131975244738],[0.010949974172276385],[0.012479171483708617],[0.013570918494786817],[0.012160415985806926],[0.016666064480518966],[0.031986681729188005],[0.013417276651437802],[0.008905657966835121],...,[0.014578823831505236],[0.01198065645014124],[0.012233511944231116],[0.020278213391588164],[0.016751390004634416],[0.017275427337798757],[0.012083619872604392],[0.0089722252658295],[0.017727600738080943],[0.020568422763233]
normalized_root_mean_squared_error,[0.025529386296620726],[0.031240779254342144],[0.021238988259433413],[0.02384720964387451],[0.026820819208066466],[0.024838171537315955],[0.031415518962133474],[0.049615925582494946],[0.026772922268113605],[0.019845174005562606],...,[0.027043878545322288],[0.025456378961278395],[0.024058794871597497],[0.03383192069209687],[0.032117261071134835],[0.03265265432369813],[0.02443720080805284],[0.018926020457749233],[0.0329929009733318],[0.03374392715907567]
explained_variance,[0.7678254872839818],[0.6524043793996734],[0.8392804303170844],[0.8444049329250115],[0.7437711864387516],[0.782283704035649],[0.6938251905988955],[0.12309684083178585],[0.7446351353882292],[0.8599119466136259],...,[0.7394905748876697],[0.7713521479189167],[0.8410133835012343],[0.5922221396906968],[0.6326298634402923],[0.6202213794207319],[0.787287366974958],[0.872381654108706],[0.6123006376452744],[0.5943396643295822]
normalized_root_mean_squared_log_error,[0.032806384168403466],[0.03940707953211152],[0.02784711216734022],[0.029596690865504486],[0.03462568397808487],[0.032016611638314846],[0.03944228664758565],[0.06585783763146846],[0.03425327761407578],[0.024701878736918103],...,[0.03562111725807456],[0.03185783776457423],[0.02943216959671178],[0.04618069339413656],[0.04027827106727284],[0.04202756481872241],[0.031387230580001015],[0.024128927459149397],[0.041724231756950884],[0.046484821121675485]
normalized_mean_absolute_error,[0.017662583045474974],[0.022176848570878795],[0.014922869713281535],[0.016798010417990627],[0.01877771688538825],[0.017084999461857188],[0.0223252883039429],[0.0381059915989378],[0.01860998318169144],[0.013147861293577097],...,[0.0194574327797426],[0.017196681626346754],[0.01679722280750054],[0.025344499646469502],[0.022765616410424037],[0.02338545624668073],[0.016946533251819353],[0.012876308206283646],[0.023695815537492053],[0.02537021937121655]
r2_score,[0.7677758070587986],[0.6522473655042205],[0.8392709557831506],[0.7973708830557819],[0.7436869079154853],[0.7801806311247415],[0.6483463017276603],[0.12286001818425707],[0.7446015444763251],[0.8596745204983703],...,[0.7394058475938141],[0.7691021082272175],[0.7937592634097992],[0.592169200597393],[0.6324607570776661],[0.6201048983015143],[0.787220576632994],[0.8723721883139838],[0.6121464958695739],[0.5942878981805881]
mean_absolute_percentage_error,[12.016915979620533],[15.064993046666988],[10.197412785671986],[10.472051322080137],[12.675528028339698],[11.445066397918898],[14.231675200317024],[27.213345897783814],[12.655021556632777],[8.64684303572322],...,[13.440424721349872],[11.356618883574898],[10.39490980872707],[17.870678214042623],[15.404236333700682],[15.872535416136524],[11.45133157458556],[8.601540919486814],[16.100656899584816],[17.944322098379217]
mean_absolute_error,[418.31144148121354],[525.2249611118775],[353.4254941492492],[397.835352500512],[444.72169205541314],[404.63225193023],[528.7405306382182],[902.4825097089321],[440.74917415144944],[311.3871168192652],...,[460.8197301981357],[407.2772742972921],[397.81669914281304],[600.2459636531415],[539.1690327702439],[553.8489974574272],[401.3528901421927],[304.95579456634596],[561.199385675397],[600.8550962531547]
root_mean_squared_error,[604.6247230303101],[739.8903868836036],[503.0131646957668],[564.7849250446701],[635.2103492307513],[588.2543517435381],[744.0288313490593],[1175.0778069890146],[634.075982986435],[470.00279196644334],...,[640.4931706993377],[602.8956552259508],[569.7959996636573],[801.2576346478862],[760.6485269172922],[773.3285025846361],[578.7579692881532],[448.2340368223939],[781.3867274830538],[799.1736415227385]
median_absolute_error,[300.21708005673077],[388.50694007469633],[259.3334999979227],[295.5502147336699],[321.4065838027416],[288.00097510863793],[394.7104134607773],[757.5559536160479],[317.7678101988081],[210.91697697069617],...,[345.2772902103927],[283.74364364750875],[289.73214181603],[480.25867182954744],[396.73121884898],[409.14224682045824],[286.18217585763387],[212.49352229966394],[419.8512867374686],[487.1318398294686]


Here we can see that the best model has a mean_absolute_error (MAE) on validation data of: 311.39
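
Reading values off a wide table is error-prone, so it can help to rank the child runs programmatically. The sketch below assumes the downloaded JSON has the shape `{run_id: {metric_name: [value]}}`, which is what the table above suggests. The `rank_runs` helper and the shortened run IDs are hypothetical; the numbers are copied from three of the visible columns. Keep in mind that the run with the lowest MAE is not necessarily the one AutoML reports as best, because selection is driven by the configured primary metric (`spearman_correlation` here).

```python
# Metrics keyed by child run ID, mirroring the assumed JSON shape.
# Values are taken from the columns ending in _24, _28 and _39 above.
metrics = {
    "run_24": {"mean_absolute_error": [418.31144148121354],
               "r2_score": [0.7677758070587986]},
    "run_28": {"mean_absolute_error": [311.3871168192652],
               "r2_score": [0.8596745204983703]},
    "run_39": {"mean_absolute_error": [304.95579456634596],
               "r2_score": [0.8723721883139838]},
}


def rank_runs(metrics, metric_name, higher_is_better=False):
    """Return (run_id, value) pairs sorted from best to worst on one metric."""
    scored = [
        (run_id, values[metric_name][0])
        for run_id, values in metrics.items()
        if metric_name in values
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=higher_is_better)


# Lower is better for error metrics such as MAE.
for run_id, mae in rank_runs(metrics, "mean_absolute_error"):
    print(f"{run_id}: MAE = {mae:.2f}")

# Higher is better for r2_score.
best_r2_run, best_r2 = rank_runs(metrics, "r2_score", higher_is_better=True)[0]
print(f"Best r2_score: {best_r2_run} ({best_r2:.4f})")
```

With the full `deserialized_metrics_output` dictionary in place of the three-run sample, the same helper ranks every child run.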

### Retrieve the Best Model

In [17]:
# Retrieve best model from Pipeline Run
best_model_output = pipeline_run.get_pipeline_output(best_model_output_name)
num_file_downloaded = best_model_output.download('.', show_progress=True)

Downloading azureml/9a406e11-de4c-4606-b078-eea3142f8067/model_data
Downloaded azureml/9a406e11-de4c-4606-b078-eea3142f8067/model_data, 1 files out of an estimated total of 1


In [18]:
import pickle

with open(best_model_output._path_on_datastore, "rb" ) as f:
    best_model = pickle.load(f)
best_model

RegressionPipeline(pipeline=Pipeline(memory=None,
                                     steps=[('datatransformer',
                                             DataTransformer(enable_dnn=None,
                                                             enable_feature_sweeping=None,
                                                             feature_sweeping_config=None,
                                                             feature_sweeping_timeout=None,
                                                             featurization_config=None,
                                                             force_text_dnn=None,
                                                             is_cross_validation=None,
                                                             is_onnx_compatible=None,
                                                             logger=None,
                                                             observer=None,
                                         

In [19]:
best_model.steps

[('datatransformer',
  DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                  feature_sweeping_config=None, feature_sweeping_timeout=None,
                  featurization_config=None, force_text_dnn=None,
                  is_cross_validation=None, is_onnx_compatible=None, logger=None,
                  observer=None, task=None, working_dir=None)),
 ('prefittedsoftvotingregressor',
  PreFittedSoftVotingRegressor(estimators=[('0',
                                            Pipeline(memory=None,
                                                     steps=[('maxabsscaler',
                                                             MaxAbsScaler(copy=True)),
                                                            ('lightgbmregressor',
                                                             LightGBMRegressor(boosting_type='gbdt',
                                                                               class_weight=None,
                             

#### Testing Our Best Fitted Model

We evaluate the model on the mean absolute error (MAE), the same metric we used in the HyperDrive example. MAE is a common evaluation measure in real estate, specifically for price prediction. The MAE of the best HyperDrive model was 525.5451960245264. Note that the cell below predicts on the full dataset, which includes the rows AutoML trained on, so the resulting MAE is not a held-out score.

In [20]:
from sklearn.metrics import mean_absolute_error
df_test = dataset.to_pandas_dataframe()
y_test = df_test['Transactieprijs_m2']
x_test = df_test.drop(['Transactieprijs_m2'], axis=1)
preds = best_model.predict(x_test) 
mae = mean_absolute_error(y_test, preds)

print('The Mean Absolute Error (MAE) of the best AutoML model is on the train-dataset: ', mae)

The Mean Absolute Error (MAE) of the best AutoML model is on the train-dataset:  291.43639037728326
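
For reference, the MAE computed above is simply the average of the absolute differences between actual and predicted values. The sketch below spells that out in plain Python; `mae_from_scratch` is a hypothetical helper and the predictions are made-up values. It also checks one observation about the metrics table: the `normalized_mean_absolute_error` values are consistent with dividing MAE by the range (max minus min) of the label column. That relationship is inferred from the numbers in this notebook, so treat it as an assumption.

```python
def mae_from_scratch(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# First five actual prices per m2 from the dataset preview; the predictions
# are made-up values for illustration only.
y_true = [2700.0, 1725.0, 2350.0, 2875.0, 2057.142857]
y_pred = [2600.0, 1900.0, 2500.0, 3100.0, 2000.0]
print("MAE:", mae_from_scratch(y_true, y_pred))

# Relating MAE to normalized MAE with numbers reported earlier in the notebook.
y_min, y_max = 207.428571, 23890.909091  # min / max of Transactieprijs_m2
mae_run = 418.31144148121354             # mean_absolute_error of one child run
reported_normalized = 0.017662583045474974
computed_normalized = mae_run / (y_max - y_min)
print("Computed:", computed_normalized, "Reported:", reported_normalized)
```

A normalized error makes runs comparable across targets with different scales, while the raw MAE stays in the label's own units (price per m²).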


## Publish and run from REST endpoint

Run the following code to publish the pipeline to your workspace. In your workspace in the portal, you can see metadata for the pipeline including run history and durations. You can also run the pipeline manually from the portal.

Additionally, publishing the pipeline enables a REST endpoint to rerun the pipeline from any HTTP library on any platform.


In [21]:
published_pipeline = pipeline_run.publish_pipeline(
    name="Houses_train", description="Training house price prediction pipeline", version="1.0")

published_pipeline

Name,Id,Status,Endpoint
Houses_train,54901fcd-4ac5-4741-a8e4-5caca93cfd74,Active,REST Endpoint


Authenticate once again to retrieve the `auth_header` so that the endpoint can be used.

In [22]:
from azureml.core.authentication import InteractiveLoginAuthentication

interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()

Get the REST URL from the `endpoint` property of the published pipeline object. You can also find the REST URL in your workspace in the portal. Build an HTTP POST request to the endpoint, specifying your authentication header, and add a JSON payload object with the experiment name.

Make the request to trigger the run. Access the `Id` key from the response dict to get the value of the run ID.


In [23]:
import requests

rest_endpoint = published_pipeline.endpoint
response = requests.post(rest_endpoint, 
                         headers=auth_header, 
                         json={"ExperimentName": "pipeline-rest-endpoint"}
                        )

In [24]:
try:
    response.raise_for_status()
except Exception:    
    raise Exception("Received bad response from the endpoint: {}\n"
                    "Response Code: {}\n"
                    "Headers: {}\n"
                    "Content: {}".format(rest_endpoint, response.status_code, response.headers, response.content))

run_id = response.json().get('Id')
print('Submitted pipeline run: ', run_id)

Submitted pipeline run:  f7226768-f26b-4aec-aa15-3b8c31cb61cc


Use the run id to monitor the status of the new run. This will take another 10-15 min to run and will look similar to the previous pipeline run, so if you don't need to see another pipeline run, you can skip watching the full output.

In [25]:
from azureml.pipeline.core.run import PipelineRun
from azureml.widgets import RunDetails

published_pipeline_run = PipelineRun(ws.experiments["pipeline-rest-endpoint"], run_id)
RunDetails(published_pipeline_run).show()

_PipelineWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', …

In [26]:
print("End of notebook")

End of notebook
