Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# Using Customized Modules to Create an Azure Machine Learning Pipeline

In this notebook, we will demonstrate how to use customized modules to create an Azure Machine Learning Pipeline which is associated with the [fastText](https://fasttext.cc/) algorithm. Customized modules are created with the extension azure-cli-ml of Azure CLI. 

If you don't have the input dataset, you could prepare your it with the instructions in ```prepare_data.ipynb```. 

The outline of this notebook is as follows:

- Upload the dataset onto a blob container and register it to the workspace.
- Register an anonymous module from yaml file to workspace.
- Construct the pipeline.
- Visualize and run the pipeline.

## Prerequisites
If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you go through the configuration Notebook located at https://github.com/Azure/MachineLearningNotebooks first. This sets you up with a working config file that has information on your workspace, subscription id, etc.

In [1]:
from azureml.core import Dataset, Datastore, Workspace
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.data.datapath import DataPath
from azureml.pipeline.wrapper import Module, dsl

### Connect to workspace
Create a workspace object from the existing workspace. Workspace.from_config() reads the file config.json and loads the details into an object named workspace.

In [2]:
workspace = Workspace.from_config('config.json')
print(workspace.name, workspace.resource_group, workspace.location, workspace.subscription_id,
      workspace.compute_targets.keys(), sep='\n')

DesignerDRI_EASTUS
DesignerDRI
eastus
74eccef0-4b8d-4f83-b5f9-fa100d155b22
dict_keys(['attached-aks', 'default', 'compute', 'aml-compute', 'aml-compute-gpu'])


### Create or Attach existing compute resource
By using Azure Machine Learning Compute, a managed service, data scientists can train machine learning models on clusters of Azure virtual machines. Examples include VMs with GPU support. In this tutorial, you create Azure Machine Learning Compute as your training environment. The code below creates the compute clusters for you if they don't already exist in your workspace.

**Creation of compute takes approximately 5 minutes. If the AmlCompute with that name is already in your workspace the code will skip the creation process.**


In [3]:
aml_compute_name = 'aml-compute-gpu'
if aml_compute_name in workspace.compute_targets:
    aml_compute = AmlCompute(workspace, aml_compute_name)
    print("Found existing compute target: {}".format(aml_compute_name))
else:
    print("Creating new compute target: {}".format(aml_compute_name))
    provisioning_config = AmlCompute.provisioning_configuration(vm_size="Standard_NC6", min_nodes=0, max_nodes=2)
    aml_compute = ComputeTarget.create(workspace, aml_compute_name, provisioning_config)
    aml_compute.wait_for_completion(show_output=True)

Found existing compute target: aml-compute-gpu


### If you use your own dataset, you need to install an extra dependence
if you use our dataset, just skip the next cell

In [4]:
# Install dataset runtime to enable dataset registration in sample notebooks
!pip install azureml-dataset-runtime[fuse] --extra-index-url https://azuremlsdktestpypi.azureedge.net/modulesdkpreview --user --upgrade

Looking in indexes: https://pypi.org/simple, https://azuremlsdktestpypi.azureedge.net/modulesdkpreview
Requirement already up-to-date: azureml-dataset-runtime[fuse] in /home/azureuser/.local/lib/python3.6/site-packages (1.11.0.post1)


### Load dataset
If you use your own dataset, give your dataset a name by ```dataset_name```
Your dataset will be uploaded onto a blob container and registered to this workspace.

In [5]:
dataset_name = "THUCNews"
# if the workspace don't contain the dataset, it will be registered automatically
if not dataset_name in workspace.datasets:
    # your files will be uploaded to the directory of 'path_on_datastore' in the blob container
    path_on_datastore = 'my_dataset'
    datastore = Datastore.get(workspace=workspace, datastore_name='workspaceblobstore')
    # upload files in 'data/data_for_pipeline'
    datastore.upload(src_dir='data/data_for_pipeline', target_path=path_on_datastore, overwrite=True, show_progress=True)
    # dataset description
    description = 'THUCNews dataset is generated by filtering and filtering historical data \
    of Sina News RSS subscription channel from 2005 to 2011'
    # get the DataPath object associated with the uploaded dataset.
    datastore_path = [DataPath(datastore=datastore, path_on_datastore=path_on_datastore)]
    # Dataset.File.from_files() needs the dependence named 'azureml-dataset-runtime[fuse]'
    data = Dataset.File.from_files(path=datastore_path)
    # register the dataset to this workspace.
    data.register(workspace=workspace, name=dataset_name, description=description, create_new_version=True)
dataset = workspace.datasets[dataset_name]

Uploading an estimated of 3 files
Uploading data/data_for_pipeline/data.txt
Uploading data/data_for_pipeline/label.txt
Uploading data/data_for_pipeline/word_to_index.json
Uploaded data/data_for_pipeline/data.txt, 1 files out of an estimated total of 3
Uploaded data/data_for_pipeline/label.txt, 2 files out of an estimated total of 3
Uploaded data/data_for_pipeline/word_to_index.json, 3 files out of an estimated total of 3
Uploaded 3 files


### Register an anonymous module from yaml file to this workspace.
If you decorate your module function with ```@dsl.module```, azure-cli-ml could help to generate the ```*.spec.yaml``` file.

In [6]:
split_data_txt_module_func = Module.from_yaml(workspace, 'split_data_txt/split_data_txt.spec.yaml')
fasttext_train_module_func = Module.from_yaml(workspace, 'fasttext_train/fasttext_train.spec.yaml')
fasttext_evaluation_module_func = Module.from_yaml(workspace, 'fasttext_evaluation/fasttext_evaluation.spec.yaml')
compare_two_models_module_func = Module.from_yaml(workspace, 'compare_two_models/compare_two_models.spec.yaml')

### Construct the pipeline
our pipeline contains two sub pipelines. They represent two training processes of the fastText model with different parameters. 

In [7]:
# sub pipeline
@dsl.pipeline(name='sub_pipeline', description='A sub pipeline including processes of data processing/train/evaluation',
              default_compute_target=aml_compute_name)
def training_pipeline(epochs, batch_size, max_len):
    split_data_txt = split_data_txt_module_func(
        input_dir=dataset,
        training_data_ratio=0.7,
        validation_data_ratio=0.1
    )
    fasttext_train = fasttext_train_module_func(
        training_data_dir=split_data_txt.outputs.training_data_output,
        validation_data_dir=split_data_txt.outputs.validation_data_output,
        epochs=epochs,
        batch_size=batch_size,
        max_len=max_len,
        embed_dim=300,
        hidden_size=256,
        ngram_size=200000,
        learning_rate=0.001
    )

    fasttext_evaluation = fasttext_evaluation_module_func(
        trained_model_dir=fasttext_train.outputs.trained_model_dir,
        test_data_dir=split_data_txt.outputs.test_data_output
    )

    return {**fasttext_evaluation.outputs, **fasttext_train.outputs}

In [8]:
@dsl.pipeline(name='fasttext_pipeline',
              description='The pipeline that trains two fasttext models and output the better one',
              default_compute_target=aml_compute_name)
def fasttext_pipeline():
    train_and_evalute_model1 = training_pipeline(epochs=3, batch_size=64, max_len=32)
    train_and_evalute_model2 = training_pipeline(epochs=6, batch_size=64, max_len=32)
    compare = compare_two_models_module_func(
        first_trained_model=train_and_evalute_model1.outputs.trained_model_dir,
        first_trained_result=train_and_evalute_model1.outputs.model_testing_result,
        second_trained_model=train_and_evalute_model2.outputs.trained_model_dir,
        second_trained_result=train_and_evalute_model2.outputs.model_testing_result
    )
    return {**compare.outputs}


### Visualize and run the pipeline.

In [9]:
# get the pipeline
pipeline = fasttext_pipeline()
# save the pipeline if necessary
# pipeline.save(experiment_name=experiment_name)

In [None]:
# visualize the pipeline
pipeline.validate()

In [None]:
# run the pipeline
experiment_name = 'fasttext_pipeline'
# regenerate_outputs indicates whether to force regeneration of all step outputs and disallow data reuse for this run
# if regenerate_outputs is False, this run may reuse results from previous runs and subsequent runs may reuse the results of this run
pipeline_run = pipeline.submit(experiment_name=experiment_name, regenerate_outputs=False)
# wait_for_completion() visualize the execution process of the pipeline
# you could also view this process on Azure Machine Learning Portal
# pipeline_run.wait_for_completion()

<IPython.core.display.Javascript object>

ValidateView(container_id='container_id_011247b6-81de-4b5f-8177-eb8aed67dc00_widget', env_json='{"subscription…

{'result': 'validation passed', 'errors': []}

In [11]:
# run the pipeline
experiment_name = 'fasttext_pipeline'
# regenerate_outputs indicates whether to force regeneration of all step outputs and disallow data reuse for this run
# if regenerate_outputs is False, this run may reuse results from previous runs and subsequent runs may reuse the results of this run
pipeline_run = pipeline.submit(experiment_name=experiment_name, regenerate_outputs=False)
# wait_for_completion() visualize the execution process of the pipeline
# you could also view this process on Azure Machine Learning Portal
# pipeline_run.wait_for_completion()

Submitted PipelineRun 8606db9f-6277-4357-98e2-397124e48ce8
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/fasttext_pipeline/runs/8606db9f-6277-4357-98e2-397124e48ce8?wsid=/subscriptions/74eccef0-4b8d-4f83-b5f9-fa100d155b22/resourcegroups/DesignerDRI/workspaces/DesignerDRI_EASTUS
