<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# Objective

The goal of this notebook is to create and run a parallel scoring pipeline. This is the third of three notebooks.

The notebooks should be executed in this order:

- [parallel_scoring_0_prepare_azure_resources](parallel_scoring_0_prepare_azure_resources.ipynb): This creates the Azure resources used in the other notebooks. It must be run before this notebook.
- [parallel_scoring_1_prepare_data_and_model](parallel_scoring_1_prepare_data_and_model.ipynb): This uploads ratings data and an example model to Azure, where the compute agents can access them. It must be run before this notebook.
- [parallel_scoring_2_prepare_and_run_amlpipeline](parallel_scoring_2_prepare_and_run_amlpipeline.ipynb) **(This notebook)**: This creates and executes a parallel pipeline that leverages the resources and files created in prior steps.

## Load relevant modules

In [None]:
import os
import json
import sys
import shutil
from datetime import datetime
from azureml.core.compute import AmlCompute
from azureml.core.datastore import Datastore
from azureml.data.data_reference import DataReference
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep
from azureml.core.runconfig import CondaDependencies, RunConfiguration
from azureml.core import Workspace, Run, Experiment
from azureml.core.authentication import ServicePrincipalAuthentication
from azureml.pipeline.core.schedule import ScheduleRecurrence, Schedule


## Load the configuration data created in the initial notebook

In [None]:
pipeline_config = 'pipeline_config_programmatic.json'
with open(pipeline_config) as f:
    j = json.loads(f.read())
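
The later cells read many keys from `j`, so a missing key only surfaces when the cell that needs it runs. As a minimal sketch, the check below lists the keys this notebook reads and reports any that are absent. The helper name `missing_config_keys` and the example dictionary are hypothetical and not part of the original notebook.

```python
import json

# Keys that the cells in this notebook read from the configuration dictionary `j`.
REQUIRED_KEYS = [
    "sp_tenant", "sp_client", "sp_secret",
    "aml_work_space", "subscription_id", "resource_group_name",
    "data_blob_container", "models_blob_container", "preds_blob_container",
    "blob_account", "blob_key",
    "pip_packages", "conda_packages", "python_version",
    "cluster_name", "python_script_directory", "python_script_name",
]


def missing_config_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are absent from `config`, in order."""
    return [key for key in required if key not in config]


# Usage with a deliberately incomplete, made-up configuration.
example = json.loads('{"sp_tenant": "t", "sp_client": "c"}')
missing = missing_config_keys(example)
print(missing[:3])
```

In the notebook itself, the equivalent call would be `missing_config_keys(j)`, which should return an empty list before continuing.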

## Sign in to the Azure Machine Learning workspace

In [None]:
# SP authentication
sp_auth = ServicePrincipalAuthentication(
    tenant_id=j["sp_tenant"], username=j["sp_client"], password=j["sp_secret"]
)

# AML workspace
aml_ws = Workspace.get(
    name=j["aml_work_space"],
    auth=sp_auth,
    subscription_id=str(j["subscription_id"]),
    resource_group=j["resource_group_name"],
)

## Set up data stores and references

A `Datastore` is an object that references a store of some kind. Here, every datastore is an Azure Blob Storage container.

A `DataReference` is an object that resolves to the mount-point where its corresponding named `Datastore` is attached on a compute agent.

In [None]:
# Pipeline inputs, models, and outputs
inputs_ds = Datastore.register_azure_blob_container(
    aml_ws,
    datastore_name="inputs_ds",
    container_name=j["data_blob_container"],
    account_name=j["blob_account"],
    account_key=j["blob_key"],
)
inputs_dir = DataReference(datastore=inputs_ds, data_reference_name="inputs")

models_ds = Datastore.register_azure_blob_container(
    aml_ws,
    datastore_name="models_ds",
    container_name=j["models_blob_container"],
    account_name=j["blob_account"],
    account_key=j["blob_key"],
)
models_dir = DataReference(datastore=models_ds, data_reference_name="models")

outputs_ds = Datastore.register_azure_blob_container(
    aml_ws,
    datastore_name="outputs_ds",
    container_name=j["preds_blob_container"],
    account_name=j["blob_account"],
    account_key=j["blob_key"],
)
outputs_dir = PipelineData(name="outputs", datastore=outputs_ds, is_directory=True)

## Create the details of the environment

The `RunConfiguration` specifies the prerequisites that must be installed on the compute agent.

In [None]:
# Run config
conda_dependencies = CondaDependencies.create(
    pip_packages=j["pip_packages"],
    conda_packages=j["conda_packages"],
    python_version=j["python_version"]
)
run_config = RunConfiguration(conda_dependencies=conda_dependencies)
run_config.environment.docker.enabled = True

## Set up some parameters for the pipeline

In [None]:
MOVIELENS_DATA_SIZE = '10m'

# MAX_ALL is the upper bound of the range that is split into chunks;
# NUM_PER_RUN is the size of the chunk handled by each pipeline step.
if MOVIELENS_DATA_SIZE == '10m':
    MAX_ALL = 72000
    NUM_PER_RUN = 10000
    # AML compute target named in the configuration file
    compute_target = AmlCompute(aml_ws, j["cluster_name"])
    # compute_target = AmlCompute(aml_ws, "top10-mvl-d4v2")
else:
    MAX_ALL = 140000
    NUM_PER_RUN = 10000
    # getting memory errors...
    compute_target = AmlCompute(aml_ws, "top10-mvl-d4v2")


## Add one dependency

The scoring script depends on the `reco_utils` module because it uses the `SARSingleNode` class, which is defined there. For that module to be available on the compute agents, it must be inside the directory referenced by the `source_directory` parameter of `PythonScriptStep()`, so we copy it there.

In [None]:
## copy reco_utils into scripts for this...
reco_utils_home = '../../reco_utils'
pipeline_reco_utils_copy = os.path.join(j["python_script_directory"],'reco_utils')

if os.path.exists(pipeline_reco_utils_copy):
    ## if it already exists, then remove it to make sure nothing is cached incorrectly
    print('removing stale copied version')
    shutil.rmtree(pipeline_reco_utils_copy)

## copy it over.
shutil.copytree(reco_utils_home, pipeline_reco_utils_copy)
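
The remove-then-copy pattern above can be wrapped in a small helper. The sketch below is illustrative only: `refresh_copy` and all paths are hypothetical, and the demonstration runs in a temporary directory rather than on the real `reco_utils` tree. It shows why the old copy is deleted first: a file that no longer exists in the source would otherwise linger in the destination.

```python
import os
import shutil
import tempfile


def refresh_copy(src, dst):
    """Replace `dst` with a fresh copy of the directory tree at `src`."""
    if os.path.exists(dst):
        # Remove the old copy first so files deleted from `src` do not linger.
        shutil.rmtree(dst)
    shutil.copytree(src, dst)


# Demonstration in a temporary directory (all paths here are made up).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "module_src")
    dst = os.path.join(tmp, "scripts", "module_copy")
    os.makedirs(src)
    os.makedirs(dst)
    with open(os.path.join(src, "current.py"), "w") as f:
        f.write("VALUE = 1\n")
    with open(os.path.join(dst, "stale.py"), "w") as f:
        f.write("VALUE = 0\n")

    refresh_copy(src, dst)
    copied_files = sorted(os.listdir(dst))

print(copied_files)
```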

## Create the AML Pipeline

Create a `PythonScriptStep` for each of the chunks in your dataset. The size of your chunk is determined by `NUM_PER_RUN` set above.

Because there are no data dependencies between the steps (that is, no step consumes the output of another), the steps are queued in parallel and executed as resources become free.
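
As a loose local analogy (not how AML schedules work internally), independent tasks submitted to a worker pool behave the same way: each task starts as soon as a worker is free, and no task waits on another's result. In this sketch, `score_chunk` is a hypothetical stand-in for one pipeline step, and the two threads play the role of two free compute nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def score_chunk(bounds):
    """Stand-in for one independent step: returns a label for its chunk."""
    cur_min, cur_max = bounds
    return "{}_{}".format(cur_min, cur_max)


chunks = [(1, 10001), (10001, 20001), (20001, 30001), (30001, 40001)]

# Two workers act as two free nodes; the remaining tasks wait in the queue.
with ThreadPoolExecutor(max_workers=2) as pool:
    labels = list(pool.map(score_chunk, chunks))

print(labels)
```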

Each step contains information on:

- `name`: the name of the step.
- `script_name`: the name of the script to run
- `arguments`: arguments passed to the script
- `inputs`: inputs in the form of DataReferences. What blobs or datastores need to get mounted on the compute agent?
- `outputs`: outputs in the form of `PipelineData` objects. What blobs or datastores does this step write to?
- `source_directory`: What is the lowest point in the directory tree that contains the script and all relevant dependencies? If there are modules that `script_name` depends on, then these should live inside the `source_directory`.
- `compute_target`: What computational engine should be used?
- `runconfig`: How does that computational engine need to be configured in order to run `script_name`? What dependencies must be installed?
- `allow_reuse`: a flag that indicates whether the step may reuse the results of a previous run when nothing has changed. `False` means that a new run is always generated.
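
The scoring script itself is not shown in this notebook. As a hedged sketch of the receiving side, the parser below accepts seven positional arguments in the order the steps pass them; on the compute agent, the data references arrive as path strings. Every argument name here is hypothetical, and the meaning of the sixth argument (the literal `'10'`) is not stated in this notebook.

```python
import argparse


def parse_scoring_args(argv):
    """Parse seven positional arguments in the order the steps pass them."""
    parser = argparse.ArgumentParser(description="Hypothetical scoring-script arguments")
    parser.add_argument("range_min", type=int)  # CUR_MIN
    parser.add_argument("range_max", type=int)  # CUR_MAX
    parser.add_argument("inputs_path")          # resolved path for inputs_dir
    parser.add_argument("models_path")          # resolved path for models_dir
    parser.add_argument("outputs_path")         # resolved path for outputs_dir
    parser.add_argument("sixth_arg", type=int)  # the literal '10'; meaning not stated here
    parser.add_argument("data_size")            # MOVIELENS_DATA_SIZE
    return parser.parse_args(argv)


# Made-up example values; the mount paths are placeholders.
args = parse_scoring_args(
    ["1", "10001", "/mnt/inputs", "/mnt/models", "/mnt/outputs", "10", "10m"]
)
print(args.range_min, args.range_max, args.data_size)
```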

In [None]:
# Create a pipeline step for a subset of data...

steps = []
CUR_MIN = 1
CUR_MAX = CUR_MIN + NUM_PER_RUN

# Assumes the reco_utils directory has already been copied into the script directory.
while CUR_MIN < MAX_ALL:
    outputs_dir = PipelineData(name="outputs", datastore=outputs_ds, is_directory=True)
    cur_name = "{}_{}_{}".format(CUR_MIN, CUR_MAX, MOVIELENS_DATA_SIZE)
    print(cur_name)
    step = PythonScriptStep(
        name=cur_name,
        script_name=j["python_script_name"],
        arguments=[CUR_MIN, CUR_MAX, inputs_dir, models_dir, outputs_dir, '10', MOVIELENS_DATA_SIZE],
        inputs=[models_dir, inputs_dir],
        outputs=[outputs_dir],
        source_directory=j["python_script_directory"],
        compute_target=compute_target,
        runconfig=run_config,
        allow_reuse=False,
    )
    steps.append(step)
    CUR_MIN = CUR_MAX
    CUR_MAX = CUR_MIN + NUM_PER_RUN
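
To see how many steps the loop creates without touching Azure, the same range arithmetic can be isolated in a small generator. This is a sketch only; `chunk_ranges` is a hypothetical helper that mirrors how the while-loop above advances `CUR_MIN` and `CUR_MAX`.

```python
def chunk_ranges(max_all, num_per_run, start=1):
    """Yield (cur_min, cur_max) pairs the same way the while-loop advances."""
    cur_min = start
    while cur_min < max_all:
        yield cur_min, cur_min + num_per_run
        cur_min += num_per_run


ranges_10m = list(chunk_ranges(72000, 10000))
print(len(ranges_10m))   # 8
print(ranges_10m[0])     # (1, 10001)
print(ranges_10m[-1])    # (70001, 80001)
```

With the `'10m'` settings this gives eight steps. Note that the last upper bound (80001) exceeds `MAX_ALL`; how the scoring script handles that is not shown in this notebook.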

## Create and Validate the Pipeline

In [None]:
pipeline = Pipeline(workspace=aml_ws, steps=steps)
pipeline.validate()

## Run the Pipeline as an Experiment

In [None]:
exp_name = 'prog_reco_score_{}'.format(MOVIELENS_DATA_SIZE)
print(exp_name)
pipeline_run = Experiment(aml_ws, exp_name).submit(pipeline)

## Monitor the Run

In [None]:
from azureml.widgets import RunDetails
RunDetails(pipeline_run).show()