Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.


# Showcasing GlobalDataset and PipelineParameter

This notebook demonstrates the use of **GlobalDataset** and **PipelineParameters** in AML Pipelines. You will learn how strings and **GlobalDataset** values can be parameterized and submitted to AML Pipelines via **PipelineParameters**.
To learn more about how parameters work between steps, please refer to [aml-pipelines-with-data-dependency-steps](https://aka.ms/pl-data-dep).

* [How to create a Pipeline with a GlobalDataset PipelineParameter](#index1)
* [How to submit a Pipeline with a GlobalDataset PipelineParameter](#index2)
* [How to submit a Pipeline and change the GlobalDataset PipelineParameter value from the SDK](#index3)
* [How to submit a Pipeline and change the GlobalDataset PipelineParameter value using a REST call](#index4)
* [How to create a datastore trigger schedule and use the data_path_parameter_name to get the path of the changed blob in the Pipeline](#index5)

## Azure Machine Learning and Pipeline SDK-specific imports

In [None]:
import azureml.core  # needed for azureml.core.VERSION below
from azureml.core import Experiment, Dataset
from azureml.core.compute import ComputeTarget, AmlCompute

from azureml.pipeline.wrapper import Module, Pipeline, PipelineRun, dsl
from azureml.pipeline.wrapper._dataset import _GlobalDataset
from azureml.pipeline.wrapper._pipeline_parameters import PipelineParameter

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

## Initialize Workspace

Log in to Azure with the CLI and set the default workspace using the `az ml folder attach` command.

After this operation, the workspace can be retrieved with `Workspace.from_config()` for SDK usage.

In [None]:
# NOTE: Update the following information with your environment

SUBSCRIPTION_ID = '<your subscription ID>'
WORKSPACE_NAME = '<your workspace name>'
RESOURCE_GROUP_NAME = '<your resource group>'

In [None]:
!az login -o none 
!az account set -s $SUBSCRIPTION_ID 
!az ml folder attach -w $WORKSPACE_NAME -g $RESOURCE_GROUP_NAME 

In [None]:
from azureml.core import Workspace

workspace = Workspace.from_config()
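
`Workspace.from_config()` reads the workspace details from a small JSON file. As a minimal sketch, assuming the usual `config.json` layout with `subscription_id`, `resource_group`, and `workspace_name` keys stored in a `.azureml` folder, the same file can be written by hand when the CLI is not available. The helper name `write_workspace_config` and the temporary directory are illustrative and not part of the SDK.

```python
import json
import tempfile
from pathlib import Path


def write_workspace_config(root, subscription_id, resource_group, workspace_name):
    """Write <root>/.azureml/config.json and return its path (hypothetical helper)."""
    config_dir = Path(root) / ".azureml"
    config_dir.mkdir(parents=True, exist_ok=True)
    config_path = config_dir / "config.json"
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    config_path.write_text(json.dumps(config, indent=4))
    return config_path


# Demo in a throwaway directory; replace the placeholders with real values.
demo_root = tempfile.mkdtemp()
config_path = write_workspace_config(
    demo_root,
    subscription_id="<your subscription ID>",
    resource_group="<your resource group>",
    workspace_name="<your workspace name>",
)

# Read the file back to confirm what was stored.
loaded_config = json.loads(config_path.read_text())
print(sorted(loaded_config))
```

Writing the file into the notebook's working directory instead of a temporary one should let `Workspace.from_config()` pick it up in the same way as after `az ml folder attach`.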

In [None]:
from workspace_helpers import setup_default_workspace
workspace = setup_default_workspace()

## Create an Azure ML experiment

Let's use an experiment named "showcasing-GlobalDataset-PipelineParameter". The pipeline runs will be recorded under this experiment in Azure.

In [None]:
# Choose a name for the run history container in the workspace.
experiment_name = 'showcasing-GlobalDataset-PipelineParameter'

## Create or Attach an AmlCompute cluster
You will need a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for your pipeline run. In this tutorial, you retrieve the `AmlCompute` cluster named `cpu-cluster` if it already exists, or create it otherwise, and use it as the compute resource.

In [None]:
from azureml.core.compute_target import ComputeTargetException

# Retrieve an already attached Azure Machine Learning Compute.
cluster_name = "cpu-cluster"
try:
    compute_target = ComputeTarget(workspace=workspace, name=cluster_name)
    print('Found existing compute target {}.'.format(cluster_name))
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size="Standard_D2_v2",
                                                           max_nodes=4)

    compute_target = ComputeTarget.create(workspace, cluster_name, compute_config)
    compute_target.wait_for_completion(show_output=True, timeout_in_minutes=20)

print("Azure Machine Learning Compute attached")

## Data and arguments setup 

We use the 'Select Columns in Dataset' module to illustrate how to use GlobalDataset and PipelineParameter.

In [None]:
dataset_name = 'titanic-cleaned.csv'

# get dataset
titanic_dataset = Dataset.get_by_name(workspace, dataset_name)

# module parameters
select_columns = "{\"isFilter\":true,\"rules\":[{\"exclude\":false,\"ruleType\":\"AllColumns\"}]}"

# use 'Select Columns in Dataset' as the sample module
select_column_module_func = Module.load(workspace, namespace='azureml', name='Select Columns in Dataset')
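
The hand-escaped `select_columns` string above is easy to mistype. As a minimal sketch, the same JSON can be generated from a plain Python dictionary with the standard `json` module. The variable names `select_columns_rules` and `select_columns_json` are illustrative, and the rule structure is copied from the string above rather than from any module documentation.

```python
import json

# Same rule set as the hand-escaped string above: keep every column.
select_columns_rules = {
    "isFilter": True,
    "rules": [
        {"exclude": False, "ruleType": "AllColumns"},
    ],
}

# Compact separators avoid the spaces json.dumps inserts by default,
# so the result matches the hand-written string exactly.
select_columns_json = json.dumps(select_columns_rules, separators=(",", ":"))
print(select_columns_json)
```

Building the value this way keeps the quoting correct when rules are added or changed later.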

<a id='index1'></a>

## Create a Pipeline with a GlobalDataset PipelineParameter

Create a module by assigning its parameters directly.

In [None]:
# define a module
select_column_module = select_column_module_func(dataset=titanic_dataset, select_columns=select_columns)

In [None]:
# define a pipeline
pipeline = Pipeline(nodes=[select_column_module],
                    workspace=workspace,
                    name="select-column-sample-pipeline",
                    description="pipeline for GlobalDataset-PipelineParameter sample usage",
                    default_compute_target=cluster_name)

In [None]:
pipeline_run = pipeline.submit(experiment_name=experiment_name)

pipeline_run.wait_for_completion()

Create a pipeline that uses a GlobalDataset PipelineParameter.

In [None]:
# GlobalDataset wrapped in a PipelineParameter

datastore_name = 'workspaceblobstore'
datapath = _GlobalDataset(workspace=workspace, data_store_name=datastore_name, relative_path=dataset_name)

datapath_parameter = PipelineParameter(name='datapath', default_value=datapath)

In [None]:
# pipeline parameter
pipeline_parameter = PipelineParameter(name='select_columns', default_value=select_columns)

In [None]:
select_column_module = select_column_module_func()
select_column_module.set_inputs(dataset=datapath_parameter)
select_column_module.set_parameters(select_columns=pipeline_parameter)

In [None]:
pipeline = Pipeline(nodes=[select_column_module],
                    workspace=workspace,
                    name="select-column-sample-pipeline",
                    description="pipeline for GlobalDataset-PipelineParameter sample usage",
                    default_compute_target=cluster_name)

pipeline_run_with_params = pipeline.submit(experiment_name=experiment_name)

<a id='index2'></a>

## Submit a Pipeline with a GlobalDataset PipelineParameter

A pipeline that is submitted without any parameter values runs with the default value of each PipelineParameter.

In [None]:
pipeline_run = pipeline.submit(experiment_name=experiment_name)
print("Pipeline is submitted for execution")

In [None]:
pipeline_run

In [None]:
pipeline_run.wait_for_completion()

<a id='index3'></a>

## Submit a Pipeline and change the GlobalDataset PipelineParameter value from the SDK

Alternatively, a pipeline can be submitted with values other than the defaults by passing them in `pipeline_parameters`.

In [None]:
@dsl.pipeline(name='select-column-sample-pipeline',
              description='pipeline for GlobalDataset-PipelineParameter sample usage',
              default_compute_target=cluster_name)
def sample_pipeline(input, _select_columns):
    select_column_module = select_column_module_func(dataset=input,
                                                     select_columns=_select_columns)
    return select_column_module.outputs

pipeline = sample_pipeline(input=None, _select_columns=None)
pipeline_run_with_params = pipeline.submit(experiment_name=experiment_name,
                                           pipeline_parameters={'input': titanic_dataset,
                                                                '_select_columns': select_columns})

In [None]:
pipeline_run_with_params.wait_for_completion()

In [None]:
pipeline_run_with_params
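
The relationship between default values and submitted values in the last two sections can be pictured as a dictionary merge. The sketch below is a conceptual model only, not the SDK's implementation: `resolve_pipeline_parameters` is a hypothetical helper, and rejecting unknown names is a choice made for this sketch rather than documented SDK behaviour.

```python
def resolve_pipeline_parameters(defaults, overrides=None):
    """Return the values a run would use: defaults updated by overrides."""
    overrides = overrides or {}
    # Flag names that were never declared as pipeline parameters.
    unknown = sorted(set(overrides) - set(defaults))
    if unknown:
        raise KeyError("Unknown pipeline parameter(s): {}".format(", ".join(unknown)))
    resolved = dict(defaults)
    resolved.update(overrides)
    return resolved


# Parameter names mirror sample_pipeline above; the values are placeholders.
defaults = {"input": None, "_select_columns": None}
overrides = {
    "_select_columns": '{"isFilter":true,"rules":[{"exclude":false,"ruleType":"AllColumns"}]}'
}

resolved_defaults = resolve_pipeline_parameters(defaults)
resolved_overrides = resolve_pipeline_parameters(defaults, overrides)
print(resolved_defaults)
print(resolved_overrides)
```

Submitting without `pipeline_parameters` corresponds to the first call, and submitting with them corresponds to the second.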

<a id='index4'></a>

## Submit a Pipeline and change the GlobalDataset PipelineParameter value using a REST call

Let's publish the pipeline so that it can be triggered through its REST endpoint.

In [None]:
pipeline = sample_pipeline(input=titanic_dataset, _select_columns=select_columns)

published_pipeline = pipeline._publish(experiment_name=experiment_name,
                                       name='published pipeline',
                                       description="Pipeline to test GlobalDataset")

published_pipeline

In [None]:
from azureml.core.authentication import InteractiveLoginAuthentication
import requests

auth = InteractiveLoginAuthentication()
aad_token = auth.get_authentication_header()

rest_endpoint = published_pipeline.endpoint

print("You can perform HTTP POST on URL {} to trigger this pipeline".format(rest_endpoint))

In [None]:
def_blob_store = workspace.get_default_datastore()
print("Default datastore's name: {}".format(def_blob_store.name))

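# Specify parameter values for this run. The keys under DataPathAssignments and
# ParameterAssignments must match parameter names defined on the published pipeline.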
response = requests.post(rest_endpoint, 
                         headers=aad_token, 
                         json={"ExperimentName": "MyRestPipeline",
                               "RunSource": "SDK",
                               "DataPathAssignments": {
                                   "datapath": { 
                                       "DataStoreName": def_blob_store.name
                                   }
                               },
                               "ParameterAssignments": {"input_string": "sample_string3"}
                              }
                        )
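
The request body above can also be assembled by a small function, which keeps the key names in one place. This is a minimal sketch: `build_run_request` is a hypothetical helper, it uses only the keys that appear in the call above, and whether the endpoint accepts a body with the optional sections omitted is an assumption.

```python
import json


def build_run_request(experiment_name, datapath_assignments=None, parameter_assignments=None):
    """Assemble the JSON body for a published-pipeline POST (hypothetical helper)."""
    body = {"ExperimentName": experiment_name, "RunSource": "SDK"}
    if datapath_assignments:
        # Map each data path parameter name to the datastore it should read from.
        body["DataPathAssignments"] = {
            name: {"DataStoreName": datastore_name}
            for name, datastore_name in datapath_assignments.items()
        }
    if parameter_assignments:
        body["ParameterAssignments"] = dict(parameter_assignments)
    return body


request_body = build_run_request(
    "MyRestPipeline",
    datapath_assignments={"datapath": "workspaceblobstore"},
    parameter_assignments={"input_string": "sample_string3"},
)
print(json.dumps(request_body, indent=2))
```

The resulting dictionary can be passed as the `json` argument of `requests.post` in place of the inline literal.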

In [None]:
try:
    response.raise_for_status()
except requests.exceptions.HTTPError as err:
    # Re-raise with the endpoint and response details attached for easier debugging.
    raise Exception('Received bad response from the endpoint: {}\n'
                    'Response Code: {}\n'
                    'Headers: {}\n'
                    'Content: {}'.format(rest_endpoint, response.status_code,
                                         response.headers, response.content)) from err

run_id = response.json().get('Id')
print('Submitted pipeline run: ', run_id)

In [None]:
published_pipeline_run_via_rest = PipelineRun(workspace.experiments["MyRestPipeline"], run_id)
published_pipeline_run_via_rest

In [None]:
published_pipeline_run_via_rest.wait_for_completion()

<a id='index5'></a>

## Create a Datastore trigger schedule and use data path parameter

When a pipeline is scheduled with a GlobalDataset parameter, the schedule is triggered whenever data is added or modified under `path_on_datastore`, which should be a folder. For each triggered run, the value of the GlobalDataset parameter named by `data_path_parameter_name` is replaced by the path of the changed data.

In [None]:
from azureml.pipeline.core import Schedule

schedule = Schedule.create(workspace=workspace, 
                           name="Datastore_trigger_schedule",
                           pipeline_id=published_pipeline.id, 
                           experiment_name='Scheduled_Pipeline',
                           datastore=def_blob_store,
                           wait_for_provisioning=True,
                           description="Datastore trigger schedule demo",
                           path_on_datastore="sample_datapath_for_folder",
                           data_path_parameter_name="datapath")  # same name as used above to create the PipelineParameter

print("Created schedule with id: {}".format(schedule.id))
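
To make the folder requirement concrete, the sketch below checks which changed blob paths fall inside the watched folder. It only illustrates the idea and is not how the service evaluates triggers: `is_under_watched_folder` is a hypothetical helper, the blob paths are made up, and treating files in nested subfolders as matches is an assumption.

```python
from pathlib import PurePosixPath


def is_under_watched_folder(blob_path, path_on_datastore):
    """Return True if blob_path lies inside the watched folder (illustrative only)."""
    blob = PurePosixPath(blob_path.strip("/"))
    folder = PurePosixPath(path_on_datastore.strip("/"))
    # Compare whole path segments so that similarly named folders do not match.
    return folder in blob.parents


watched_folder = "sample_datapath_for_folder"
changed_blobs = [
    "sample_datapath_for_folder/new_file.csv",
    "sample_datapath_for_folder/2020/data.csv",
    "other_folder/new_file.csv",
    "sample_datapath_for_folder_backup/file.csv",
]

triggering = [b for b in changed_blobs if is_under_watched_folder(b, watched_folder)]
print(triggering)
```

In this model, the path of a matching blob is the value that would replace the `datapath` parameter for the triggered run.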

In [None]:
schedule.disable()
schedule