Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# Showcasing Dataset and Pipeline Parameter

This notebook demonstrates the usage of [**Dataset**](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.dataset%28class%29?view=azure-ml-py), more specifically [**FileDataset**](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.filedataset?view=azure-ml-py); [**TabularDataset**](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) is not supported for now. You will learn how a **Dataset** and other parameters are submitted to AML Pipelines via **Pipeline Parameters**. By parametrizing datasets, you can dynamically run pipeline experiments with different datasets without any code change.

* [How to create a Pipeline with Pipeline Parameter](#create_pipeline)
* [How to submit a Pipeline with Pipeline Parameter](#submit_pipeline)
* [How to submit a Pipeline and change the Pipeline Parameter value from the SDK](#submit_with_pipeline_parameters)
* [How to submit a Pipeline and change the Pipeline Parameter value using a REST call](#submit_using_rest_call)

## Azure Machine Learning and Pipeline SDK-specific imports

In [None]:
import azureml.core
from azureml.core import Dataset
from azureml.core.compute import ComputeTarget, AmlCompute

from azureml.pipeline.wrapper import Module, Pipeline, PipelineRun, dsl

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

## Initialize Workspace

Log in to Azure with the CLI and set the default workspace using the `az ml folder attach` command.

After this operation, the workspace can be retrieved with `Workspace.from_config()` for SDK usage.

In [None]:
# NOTE: Update the following information with your environment

SUBSCRIPTION_ID = '<your subscription ID>'
WORKSPACE_NAME = '<your workspace name>'
RESOURCE_GROUP_NAME = '<your resource group>'

In [None]:
!az login -o none 
!az account set -s $SUBSCRIPTION_ID 
!az ml folder attach -w $WORKSPACE_NAME -g $RESOURCE_GROUP_NAME 

In [None]:
from azureml.core import Workspace

workspace = Workspace.from_config()
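
`Workspace.from_config()` reads a small JSON file (by default `config.json`, usually inside a `.azureml` folder) that holds the subscription ID, resource group, and workspace name. As a rough illustration of what such a file contains, the sketch below writes one by hand using only the standard library. The helper `write_workspace_config` is hypothetical and not part of the SDK, the values are placeholders, and it is an assumption that the file produced by `az ml folder attach` has exactly this layout.

```python
import json
import tempfile
from pathlib import Path


def write_workspace_config(root, subscription_id, resource_group, workspace_name):
    """Hypothetical helper: write <root>/.azureml/config.json and return its path."""
    config_dir = Path(root) / ".azureml"
    config_dir.mkdir(parents=True, exist_ok=True)
    config_path = config_dir / "config.json"
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    config_path.write_text(json.dumps(config, indent=4))
    return config_path


# Demo in a throwaway directory with placeholder values (not real identifiers).
with tempfile.TemporaryDirectory() as tmp:
    path = write_workspace_config(
        tmp,
        "<your subscription ID>",
        "<your resource group>",
        "<your workspace name>",
    )
    written_config = json.loads(path.read_text())
    print(written_config)
```

Writing the file by hand is only an alternative for environments where the CLI is unavailable; the CLI route described above remains the one this notebook uses.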

## Retrieve or create an Azure Machine Learning compute target

In [None]:
from azureml.core.compute_target import ComputeTargetException

# Retrieve an already attached Azure Machine Learning Compute.
cluster_name = "cpu-cluster"
try:
    compute_target = ComputeTarget(workspace=workspace, name=cluster_name)
    print('Found existing compute target {}.'.format(cluster_name))
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size="Standard_D2_v2",
                                                           max_nodes=4)

    compute_target = ComputeTarget.create(workspace, cluster_name, compute_config)
    compute_target.wait_for_completion(show_output=True, timeout_in_minutes=20)

print("Azure Machine Learning Compute attached")

## Dataset and Arguments Setup

The following illustrates how to create a [FileDataset](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.filedataset?view=azure-ml-py) from an external CSV file. We use 'Titanic.csv' as the sample dataset.

We use the 'Select Columns in Dataset' module as the sample module to illustrate how a pipeline is created.

In [None]:
# create a FileDataset
titanic_dataset = Dataset.File.from_files('https://dprepdata.blob.core.windows.net/demo/Titanic.csv')
print('Datatype of file_dataset: {}'.format(type(titanic_dataset)))

In [None]:
# get 'Select Columns in Dataset' module function
select_column_module_func = Module.load(workspace, namespace='azureml', name='Select Columns in Dataset')

# define module parameters
select_columns = "{\"isFilter\":true,\"rules\":[{\"exclude\":false,\"ruleType\":\"AllColumns\"}]}"

# choose a name for the run history container in the workspace.
experiment_name = 'showcasing-Dataset-PipelineParameter'
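
The `select_columns` value is a JSON document passed as a string, so hand-escaping the quotes is easy to get wrong. A minimal sketch, assuming the rule structure seen in this notebook (`isFilter`, `rules`, `ruleType` of `AllColumns` or `ColumnNames`), is to build the string with `json.dumps`. The helper `build_select_columns` is hypothetical and not part of the SDK, and other rule types the module may accept are not covered here.

```python
import json


def build_select_columns(column_names=None):
    """Hypothetical helper: return the column-selection rule as a compact JSON string.

    With no column names, select all columns; otherwise select the named columns.
    """
    if column_names is None:
        rule = {"exclude": False, "ruleType": "AllColumns"}
    else:
        rule = {
            "exclude": False,
            "ruleType": "ColumnNames",
            "columns": list(column_names),
        }
    # Compact separators reproduce the whitespace-free form used in this notebook.
    return json.dumps({"isFilter": True, "rules": [rule]}, separators=(",", ":"))


all_columns_rule = build_select_columns()
date_only_rule = build_select_columns(["Date"])
print(all_columns_rule)
print(date_only_rule)
```

The first string is identical to the hand-written `select_columns` value above, which makes the helper a drop-in replacement for the escaped literal.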

<a id='create_pipeline'></a>

## Create a Pipeline with Pipeline Parameter


Create a pipeline that uses pipeline parameters. In the sample pipeline function below, the pipeline parameters are:
* input
* _select_columns

In [None]:
# define a pipeline function
@dsl.pipeline(name='select-column-sample-pipeline',
              description='pipeline for Dataset and Pipeline Parameter sample usage',
              default_compute_target=cluster_name)
def sample_pipeline(input, _select_columns):
    print('Datatype of input: {}'.format(type(input)))
    print('Datatype of _select_columns: {}'.format(type(_select_columns)))

    select_column_module = select_column_module_func(dataset=input,
                                                     select_columns=_select_columns)
    return select_column_module.outputs

In [None]:
# create a pipeline using pipeline parameter
# dsl.pipeline converts the input parameters into the PipelineParameter datatype.
pipeline = sample_pipeline(input=titanic_dataset, _select_columns=select_columns)
print("Pipeline is created")

<a id='submit_pipeline'></a>

## Submit a Pipeline with default Pipeline Parameters

A pipeline can be submitted with the default values of its Pipeline Parameters by not specifying any parameters at submission time.

In [None]:
# submit pipeline
pipeline_run = pipeline.submit(experiment_name=experiment_name)
print("Pipeline is submitted for execution")

In [None]:
pipeline_run

In [None]:
pipeline_run.wait_for_completion()

<a id='submit_with_pipeline_parameters'></a>

## Submit a Pipeline and change the Pipeline Parameter value from the SDK

The pipeline can be reused with different input datasets by passing them in as `pipeline_parameters`.

In [None]:
# create a new FileDataset
crime_dataset = Dataset.File.from_files('https://dprepdata.blob.core.windows.net/demo/crime-spring.csv')

In [None]:
# override the pipeline parameters at submission time using 'pipeline_parameters'
pipeline_run_with_params = pipeline.submit(experiment_name=experiment_name,
                                           pipeline_parameters={'input': crime_dataset,
                                                                '_select_columns': select_columns})
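
Conceptually, the values passed in `pipeline_parameters` take the place of the defaults captured when the pipeline was created. The sketch below is a plain-Python mental model of that merge, not the SDK's implementation. The function `resolve_pipeline_parameters` is hypothetical, the dataset values are stand-in strings, and rejecting unknown names is a design choice of this sketch rather than confirmed SDK behavior.

```python
def resolve_pipeline_parameters(defaults, overrides=None):
    """Hypothetical model: return the defaults with submit-time overrides applied."""
    overrides = overrides or {}
    # Catch misspelled parameter names early instead of silently ignoring them.
    unknown = sorted(set(overrides) - set(defaults))
    if unknown:
        raise KeyError("Unknown pipeline parameter(s): {}".format(", ".join(unknown)))
    resolved = dict(defaults)  # copy so the defaults stay untouched
    resolved.update(overrides)
    return resolved


# Stand-in values; in the notebook these would be a FileDataset and a JSON rule string.
defaults = {"input": "titanic_dataset", "_select_columns": "all-columns rule"}
resolved = resolve_pipeline_parameters(defaults, {"input": "crime_dataset"})
print(resolved)
```

Parameters that are not overridden keep their default, which is why a submission with no `pipeline_parameters` runs with the original dataset.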

In [None]:
pipeline_run_with_params.wait_for_completion()

In [None]:
pipeline_run_with_params

<a id='submit_using_rest_call'></a>

## Submit a Pipeline and change the Pipeline Parameter value using a REST call

Let's publish the pipeline so that it can be triggered through the REST endpoint of the published pipeline. We publish a pipeline using **PipelineEndpoint**.

In [None]:
from azureml.pipeline.wrapper import PipelineEndpoint

# publish the pipeline to an endpoint named "PipelineParameterTest" and make it the default version
pipeline_endpoint = PipelineEndpoint.publish(workspace=workspace, name="PipelineParameterTest",
                                             pipeline=pipeline, description="Test description Notebook", 
                                             set_as_default=True)

pipeline_endpoint

In [None]:
# only Pipeline Endpoints with "Active" status can be submitted
if pipeline_endpoint.status == "Disabled":
    pipeline_endpoint.enable()

In [None]:
from azureml.core.authentication import InteractiveLoginAuthentication
import requests

auth = InteractiveLoginAuthentication()
aad_token = auth.get_authentication_header()

rest_endpoint = pipeline_endpoint.endpoint

print("You can perform HTTP POST on URL {} to trigger this pipeline".format(rest_endpoint))

In [None]:
# we can change the "_select_columns" parameter; "Date" is a column name in crime_dataset
select_columns = "{\"isFilter\":true,\"rules\":"\
                 "[{\"exclude\":false,\"ruleType\":\"ColumnNames\",\"columns\":[\"Date\"]}]}"

# specify the param when running the pipeline
# NOTE: the parameter names "input" and "_select_columns" must match those in the pipeline definition
response = requests.post(rest_endpoint,
                         headers=aad_token,
                         json={"ExperimentName": "MyRestPipeline",
                               "RunSource": "SDK",
                               "DataSetDefinitionValueAssignments": {
                                   "input": {
                                       "SavedDataSetReference": {"Id": crime_dataset.id}
                                   }
                               },
                               "ParameterAssignments": {"_select_columns": select_columns}
                               }
                         )
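
The request body above separates dataset parameters (under `DataSetDefinitionValueAssignments`) from plain-value parameters (under `ParameterAssignments`). As a hedged sketch, the payload can be assembled by a small helper so the nesting is written once. The function `build_submit_payload` is hypothetical, the key names are copied from the request in this notebook, and the ID and rule values below are placeholders.

```python
import json


def build_submit_payload(experiment_name, dataset_ids, parameters):
    """Hypothetical helper: build the JSON body for the published pipeline's REST endpoint.

    dataset_ids maps a pipeline parameter name to a registered/saved dataset ID;
    parameters maps a pipeline parameter name to a plain value.
    """
    return {
        "ExperimentName": experiment_name,
        "RunSource": "SDK",
        "DataSetDefinitionValueAssignments": {
            name: {"SavedDataSetReference": {"Id": dataset_id}}
            for name, dataset_id in dataset_ids.items()
        },
        "ParameterAssignments": dict(parameters),
    }


payload = build_submit_payload(
    "MyRestPipeline",
    {"input": "<dataset id>"},
    {"_select_columns": "<rule JSON string>"},
)
print(json.dumps(payload, indent=2))
```

In the notebook, the resulting dictionary would be passed as the `json=` argument of `requests.post`, with `crime_dataset.id` in place of the placeholder ID.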

In [None]:
try:
    response.raise_for_status()
except requests.HTTPError as error:
    # include the endpoint, status code, headers, and body to make failures easier to diagnose
    raise Exception('Received bad response from the endpoint: {}\n'
                    'Response Code: {}\n'
                    'Headers: {}\n'
                    'Content: {}'.format(rest_endpoint, response.status_code,
                                         response.headers, response.content)) from error

run_id = response.json().get('Id')
print('Submitted pipeline run: ', run_id)

In [None]:
published_pipeline_run_via_rest = PipelineRun(workspace.experiments["MyRestPipeline"], run_id)
published_pipeline_run_via_rest

In [None]:
published_pipeline_run_via_rest.wait_for_completion()

## Finish

Disable the PipelineEndpoint and PublishedPipeline created in this notebook.

In [None]:
# disable pipeline endpoint
pipeline_endpoint.disable()

# disable the published pipeline
default_version = pipeline_endpoint.default_version
pipeline_list = pipeline_endpoint.list_pipelines(active_only=True)
pipeline_list[default_version].disable()