Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# Using a Trained Model for Batch Inference

In this notebook, we demonstrate how to make predictions on large quantities of data asynchronously using ML pipelines in Azure Machine Learning. Batch inference (or batch scoring) is a cost-effective way to run high-throughput inference for asynchronous applications. Batch prediction pipelines can scale to perform inference on terabytes of production data, and they are optimized for fire-and-forget predictions over a large collection of data.

> **Tip**
> The dataset used here is small; the goal is to walk through the batch inference workflow. If your system requires low-latency processing (to process a single document or a small set of documents quickly), use real-time inference instead. Refer to fasttext_realtime_inference.ipynb for more details.

The outline of this notebook is as follows:

- Create a DataStore referencing documents stored in a blob container.
- Reference a trained fastText model from a completed experiment.
- Use the fastText model to do batch inference on the documents in the data blob container.

## Prerequisites
If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the configuration notebook located at https://github.com/Azure/MachineLearningNotebooks first. It sets you up with a working config file that contains your workspace name, subscription ID, and related details.
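
As a quick sanity check before connecting, the sketch below verifies that a config file contains the three keys a workspace `config.json` normally carries (`subscription_id`, `resource_group`, `workspace_name`). The helper name `missing_config_keys` and the placeholder values are illustrative assumptions, not part of the Azure ML SDK.

```python
import json
import tempfile
from pathlib import Path

# Keys a workspace config.json is expected to carry.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_config_keys(config_path):
    """Return the required keys that are absent or empty in the config file."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Demonstration with a throwaway file and placeholder values (hypothetical).
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "config.json"
    sample_path.write_text(json.dumps({
        "subscription_id": "<your-subscription-id>",
        "resource_group": "<your-resource-group>",
    }), encoding="utf-8")
    missing = missing_config_keys(sample_path)
    print("Missing keys:", missing)
```

Running the same helper against your own config file before calling `Workspace.from_config` can surface a malformed file early.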

In [3]:
import pandas as pd
from azureml.core import Workspace, Dataset, Datastore, Run
from azureml.core.authentication import InteractiveLoginAuthentication
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.data.datapath import DataPath
from azureml.core.experiment import Experiment
from azureml.pipeline.wrapper import PipelineRun, Module, dsl

### Connect to workspace
Create a workspace object from the existing workspace. By default, Workspace.from_config() reads the file config.json; the cell below passes 'config2.json' explicitly and loads the details into an object named workspace.


In [4]:
workspace = Workspace.from_config('config2.json')
print(workspace.name, workspace.resource_group, workspace.location, workspace.subscription_id,
      workspace.compute_targets.keys(), sep='\n')

fundamental3
fundamental
eastasia
4f455bd0-f95a-4b7d-8d08-078611508e0b
dict_keys(['myaks2', 'aml-compute', 'my-compute'])


### Create or Attach existing compute resource
By using Azure Machine Learning Compute, a managed service, data scientists can run machine learning workloads on clusters of Azure virtual machines, including VMs with GPU support. In this notebook, you use Azure Machine Learning Compute to run the batch inference pipeline. The code below creates the compute cluster for you if it doesn't already exist in your workspace.

**Creating the compute takes approximately 5 minutes. If an AmlCompute with that name already exists in your workspace, the code skips the creation process.**


In [5]:
aml_compute_name = 'aml-compute'
if aml_compute_name in workspace.compute_targets:
    aml_compute = AmlCompute(workspace, aml_compute_name)
    print("Found existing compute target: {}".format(aml_compute_name))
else:
    print("Creating new compute target: {}".format(aml_compute_name))
    provisioning_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_D2_V2", min_nodes=1, max_nodes=4)
    aml_compute = ComputeTarget.create(workspace, aml_compute_name, provisioning_config)
    aml_compute.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

Found existing compute target: aml-compute


### Upload the dataset to a blob container and register it in the workspace

In [6]:
dataset_name = 'THUCNews_For_Batch_Inference'
# if the workspace does not contain the dataset, upload and register it
if dataset_name not in workspace.datasets:
    # upload the local files to the 'path_on_datastore' directory of the blob container
    path_on_datastore = 'data_for_batch_inference'
    datastore = Datastore.get(workspace=workspace, datastore_name='workspaceblobstore')
    datastore.upload(src_dir='data/data_for_batch_inference', target_path=path_on_datastore, overwrite=True, show_progress=True)
    # description of the dataset
    description = ('THUCNews dataset is generated by filtering historical data '
                   'of Sina News RSS subscription channel from 2005 to 2011')
    # get the DataPath object associated with the uploaded dataset
    datastore_path = [DataPath(datastore=datastore, path_on_datastore=path_on_datastore)]
    data = Dataset.File.from_files(path=datastore_path)
    # register the dataset to your workspace
    data.register(workspace=workspace, name=dataset_name, description=description, create_new_version=True)
# get the registered dataset
dataset = workspace.datasets[dataset_name]

Uploading an estimated of 100 files
Uploading data/data_for_batch_inference/0
Uploading data/data_for_batch_inference/1
Uploading data/data_for_batch_inference/10
Uploading data/data_for_batch_inference/11
Uploading data/data_for_batch_inference/12
Uploading data/data_for_batch_inference/13
Uploading data/data_for_batch_inference/14
Uploading data/data_for_batch_inference/15
Uploading data/data_for_batch_inference/16
Uploading data/data_for_batch_inference/17
Uploading data/data_for_batch_inference/18
Uploading data/data_for_batch_inference/19
Uploading data/data_for_batch_inference/2
Uploading data/data_for_batch_inference/20
Uploading data/data_for_batch_inference/21
Uploading data/data_for_batch_inference/22
Uploading data/data_for_batch_inference/23
Uploading data/data_for_batch_inference/24
Uploading data/data_for_batch_inference/25
Uploading data/data_for_batch_inference/26
Uploading data/data_for_batch_inference/27
Uploading data/data_for_batch_inference/28
Uploading data/data_f

Uploaded data/data_for_batch_inference/65, 63 files out of an estimated total of 100
Uploading data/data_for_batch_inference/94
Uploaded data/data_for_batch_inference/66, 64 files out of an estimated total of 100
Uploading data/data_for_batch_inference/95
Uploaded data/data_for_batch_inference/67, 65 files out of an estimated total of 100
Uploading data/data_for_batch_inference/96
Uploaded data/data_for_batch_inference/68, 66 files out of an estimated total of 100
Uploading data/data_for_batch_inference/97
Uploaded data/data_for_batch_inference/69, 67 files out of an estimated total of 100
Uploading data/data_for_batch_inference/98
Uploaded data/data_for_batch_inference/7, 68 files out of an estimated total of 100
Uploading data/data_for_batch_inference/99
Uploaded data/data_for_batch_inference/70, 69 files out of an estimated total of 100
Uploaded data/data_for_batch_inference/71, 70 files out of an estimated total of 100
Uploaded data/data_for_batch_inference/72, 71 files out of an e

### Register an anonymous module from a YAML file in the workspace
If you decorate your module function with ```@dsl.module```, azure-cli-ml can generate the ```*.spec.yaml``` file for you.

In [7]:
fasttext_score_module_func = Module.from_yaml(workspace, 'fasttext_score/fasttext_score.spec.yaml')

### Load a trained fastText model from a completed experiment
- get all experiments
- choose an experiment from all experiments
- get the latest run
- get a PipelineRun associated with the run
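
The "get the latest run" step amounts to taking the first item from an iterator of runs. Below is a minimal sketch of doing that safely, using stand-in objects instead of the Azure ML SDK. The `FakeRun` class and the `first_completed` helper are hypothetical, and the sketch assumes the iterator yields the most recent run first.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FakeRun:
    """Stand-in for a run object; only the fields used below are modeled."""
    id: str
    status: str


def first_completed(runs: Iterable[FakeRun]) -> Optional[FakeRun]:
    """Return the first run whose status is 'Completed', or None if there is none."""
    completed = (run for run in runs if run.status == "Completed")
    # next() with a default avoids StopIteration when nothing matches.
    return next(completed, None)


runs = [
    FakeRun("run-3", "Failed"),
    FakeRun("run-2", "Completed"),
    FakeRun("run-1", "Completed"),
]
latest = first_completed(runs)
print(latest.id if latest else "No completed run found")

# An experiment without any completed run yields None instead of raising.
none_found = first_completed([FakeRun("run-0", "Running")])
```

Passing a default to `next()` keeps the notebook from failing with an opaque `StopIteration` when the chosen experiment has no completed run yet.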

In [8]:
exp_name_list = [exp.name for exp in Experiment.list(workspace)]
exp_name_list

['fasttext_test',
 'sample-pipelines',
 'automobile',
 'fasttext_predict',
 'sample-pipelines2',
 'fasttext_with_two_training_process',
 'train-within-notebook',
 'train-on-local',
 'logging-api-test',
 'fasttext_with_one_training_process',
 'fasttext_train',
 'my_test',
 'split_data_txt',
 'compare_two_models',
 'yucongj-test',
 'fasttext_parallel_score',
 'parallel',
 'dir',
 'test0717',
 'test_0727',
 'test_0727_experiment',
 'localtest',
 'mpi_0729',
 'mpi_0729_experiment',
 'test',
 'para_0729',
 'para_0729_experiment',
 'basic_0721',
 'basic_0721_experiment',
 'deploy',
 'fasttext_training_process',
 'fasttext_batch_inference',
 'fasttext_pipeline',
 'fasttext_evaluation']

### Choose an experiment by name

In [9]:
experiment_name = "fasttext_pipeline"
experiment = Experiment(workspace, experiment_name)
experiment

Name,Workspace,Report Page,Docs Page
fasttext_pipeline,fundamental3,Link to Azure Machine Learning studio,Link to Documentation


In [10]:
# take the first completed run from the generator (type: azureml.pipeline.core.run.PipelineRun)
run = next(Run.list(experiment, status='Completed'))
run

Experiment,Id,Type,Status,Details Page,Docs Page
fasttext_pipeline,42946bf0-b870-4af9-b230-7c257e9523d8,azureml.PipelineRun,Completed,Link to Azure Machine Learning studio,Link to Documentation


### Get a PipelineRun object

In [11]:
run_id = run.id
# azureml.pipeline.wrapper._pipeline_run.PipelineRun
pipeline_run = PipelineRun(experiment, run_id)
pipeline_run

Experiment,Id,Type,Status,Details Page,Docs Page
fasttext_pipeline,42946bf0-b870-4af9-b230-7c257e9523d8,azureml.PipelineRun,Completed,Link to Azure Machine Learning studio,Link to Documentation


### Visualize the pipeline to obtain information about its modules

In [12]:
pipeline_run.visualize()

<IPython.core.display.Javascript object>

ValidateView(container_id='container_id_288aeb82-97eb-4459-841a-4157f0783542_widget', env_json='{}', graph_jso…

### Get the trained model from a StepRun object
- get a StepRun from the PipelineRun
- get the port with the trained model from the StepRun
- get DataPath from the port
- change DataPath into the form of module input

In [13]:
# obtain step_run_id from the visualization result.
step_run_id = '9502941c-d015-4e8d-b7cb-436c783f17d1'
step_run = pipeline_run.get_step_run(step_run_id)
step_run

Experiment,Id,Type,Status,Details Page,Docs Page
fasttext_pipeline,9502941c-d015-4e8d-b7cb-436c783f17d1,azureml.StepRun,Completed,Link to Azure Machine Learning studio,Link to Documentation


### Install an extra dependency to use the trained model from a port without registering it

In [14]:
# Install dataset runtime to enable dataset registration in sample notebooks
!pip install azureml-dataset-runtime[fuse] --extra-index-url https://azuremlsdktestpypi.azureedge.net/modulesdkpreview --user --upgrade

Looking in indexes: https://pypi.org/simple, https://azuremlsdktestpypi.azureedge.net/modulesdkpreview
Requirement already up-to-date: azureml-dataset-runtime[fuse] in /home/azureuser/.local/lib/python3.6/site-packages (1.11.0.post1)


### Use the trained model as the input of a new pipeline

In [15]:
# get_port() should support three forms of the name: (1) The better model (2) the_better_model (3) The_better_model
port = step_run.get_port(name='The_better_model')
data_path = port.get_data_path()
model = Dataset.File.from_files(path=[data_path]).as_named_input('model_for_batch_inference')
model

<azureml.data.dataset_consumption_config.DatasetConsumptionConfig at 0x7f63ab6ba320>
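
The comment in the cell above lists three spellings that should resolve to the same port. How the SDK matches names internally is not shown in this notebook; the sketch below only illustrates one way such matching could work, by normalizing every spelling to a lowercase, underscore-separated form. The helper `normalize_port_name` is hypothetical.

```python
def normalize_port_name(name: str) -> str:
    """Map a port display name to a lowercase, underscore-separated form."""
    # split() with no arguments also collapses repeated whitespace.
    return "_".join(name.strip().lower().split())


candidates = ["The better model", "the_better_model", "The_better_model"]
normalized = {normalize_port_name(candidate) for candidate in candidates}
print(normalized)
```

All three spellings collapse to a single key, which is the property a name lookup needs in order to accept any of them.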

### Construct the pipeline

In [16]:
@dsl.pipeline(name='batch inference', description='Batch Inference', default_compute_target=aml_compute.name)
def training_pipeline():
    fasttext_score = fasttext_score_module_func(
        texts_to_score=dataset,
        fasttext_model_dir=model
    )
    fasttext_score.runsettings.configure(node_count=1, process_count_per_node=2, mini_batch_size="64")
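
To get a feel for these run settings, the sketch below works out how many mini-batches the scoring step would process and how many worker processes would share them. It assumes that `mini_batch_size` counts files per mini-batch and that the 100 uploaded files are the input; how the service actually schedules mini-batches is not shown in this notebook, and the helper `plan_mini_batches` is hypothetical.

```python
def plan_mini_batches(num_files, mini_batch_size, node_count, process_count_per_node):
    """Return the size of each mini-batch and the total number of worker processes."""
    sizes = [
        min(mini_batch_size, num_files - start)
        for start in range(0, num_files, mini_batch_size)
    ]
    workers = node_count * process_count_per_node
    return sizes, workers


# Values taken from this notebook: 100 files, mini_batch_size="64",
# node_count=1, process_count_per_node=2.
sizes, workers = plan_mini_batches(100, 64, 1, 2)
print("mini-batch sizes:", sizes)
print("worker processes:", workers)
```

Under these assumptions the input splits into two mini-batches, so both worker processes have work to do; a much larger `mini_batch_size` would leave one of them idle.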

In [17]:
# pipeline
pipeline = training_pipeline()
# pipeline.save(experiment_name=experiment_name)
pipeline

<azureml.pipeline.wrapper._pipeline.Pipeline at 0x7f63aabc1320>

In [None]:
# pipeline_run
experiment_name = 'fasttext_batch_inference'
pipeline_run = pipeline.submit(experiment_name=experiment_name, regenerate_outputs=True)
pipeline_run.wait_for_completion()

Submitted PipelineRun dd31b79d-3146-4e6a-b49b-6dfb8f646c56
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/fasttext_batch_inference/runs/dd31b79d-3146-4e6a-b49b-6dfb8f646c56?wsid=/subscriptions/4f455bd0-f95a-4b7d-8d08-078611508e0b/resourcegroups/fundamental/workspaces/fundamental3
PipelineRunId: dd31b79d-3146-4e6a-b49b-6dfb8f646c56
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/fasttext_batch_inference/runs/dd31b79d-3146-4e6a-b49b-6dfb8f646c56?wsid=/subscriptions/4f455bd0-f95a-4b7d-8d08-078611508e0b/resourcegroups/fundamental/workspaces/fundamental3


<IPython.core.display.Javascript object>

ValidateView(container_id='container_id_9229f8b8-24c7-4160-b2a0-1c2430380d20_widget', env_json='{}', graph_jso…

### Download the batch inference results

In [None]:
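# NOTE: `step_run` must refer to the scoring step of the batch inference run submitted above,
# not the training step retrieved earlier. Obtain its id from the visualization and call
# pipeline_run.get_step_run(<step_run_id>) before running this cell.
port = step_run.get_port(name='Scored data output dir')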
save_path = port.download(overwrite=True)
save_path
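
Once the download finishes, the results can be inspected locally. Below is a minimal sketch for listing what was downloaded, assuming `save_path` is a local directory path. The helper `list_files` and the sample file name are hypothetical, and a temporary directory stands in for the real download location.

```python
import tempfile
from pathlib import Path


def list_files(root):
    """Return (relative path, size in bytes) for every file under root, sorted by path."""
    root = Path(root)
    return sorted(
        (path.relative_to(root).as_posix(), path.stat().st_size)
        for path in root.rglob("*")
        if path.is_file()
    )


# Demonstration with a throwaway directory instead of the real save_path.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / "scored" / "result.txt"
    sample.parent.mkdir(parents=True)
    sample.write_text("hello", encoding="utf-8")
    listing = list_files(tmp_dir)
    for relative_path, size in listing:
        print(f"{relative_path}\t{size} bytes")
```

In the notebook, calling `list_files(save_path)` would give a quick overview of the scored output before loading it for further analysis.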