Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# Using a Trained fastText Model for Batch Inference

In this notebook, we demonstrate how to make predictions on large quantities of data asynchronously using ML pipelines with Azure Machine Learning. Batch inference (or batch scoring) is a cost-effective way to run high-throughput, fire-and-forget predictions over a large collection of data, and batch prediction pipelines can scale to perform inference on terabytes of production data.

> **Tip**
The dataset used here is small; the goal is to walk you through the batch inference workflow. If your system requires low-latency processing (scoring a single document or a small set of documents quickly), use real-time inference instead. Refer to fasttext_realtime_inference.ipynb for more details.

The outline of this notebook is as follows:

- Create a Datastore referencing documents stored in a blob container.
- Reference a trained fastText model from a completed experiment.
- Use the fastText model to do batch inference on the documents in the data blob container.

## Prerequisites
If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you go through the configuration notebook located at https://github.com/Azure/MachineLearningNotebooks first. This sets you up with a working config file that contains your workspace name, subscription ID, and resource group.

In [1]:
import os
import pandas as pd
from azureml.core import Workspace, Dataset, Datastore, Run
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.data.datapath import DataPath
from azureml.core.experiment import Experiment
from azureml.pipeline.wrapper import PipelineRun, Module, dsl
from azureml.pipeline.wrapper import PipelineEndpoint

### Connect to workspace
Create a workspace object from the existing workspace. Workspace.from_config() reads the file config.json and loads the details into an object named workspace.
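
For orientation, the sketch below writes and checks a config.json of the shape that `Workspace.from_config()` expects. The three key names are the ones the SDK reads; the values are placeholders, and the helper functions `write_config` and `missing_keys` are hypothetical names introduced only for this illustration.

```python
import json
import tempfile
from pathlib import Path

# Placeholder values; replace them with your own workspace details.
config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}


def write_config(directory, settings):
    """Write settings to config.json inside directory and return the file path."""
    path = Path(directory) / "config.json"
    path.write_text(json.dumps(settings, indent=4))
    return path


def missing_keys(path):
    """Return the required keys that are absent from the config file."""
    required = {"subscription_id", "resource_group", "workspace_name"}
    loaded = json.loads(Path(path).read_text())
    return sorted(required - set(loaded))


# Write the file to a temporary directory and confirm nothing is missing.
with tempfile.TemporaryDirectory() as tmp:
    config_path = write_config(tmp, config)
    print(missing_keys(config_path))
```

An empty list means all three keys are present. In practice, the configuration notebook or the Azure portal generates this file for you.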


In [2]:
workspace = Workspace.from_config('config.json')
print(workspace.name, workspace.resource_group, workspace.location, workspace.subscription_id,
      workspace.compute_targets.keys(), sep='\n')

DesignerDRI_EASTUS
DesignerDRI
eastus
74eccef0-4b8d-4f83-b5f9-fa100d155b22
dict_keys(['attached-aks', 'default', 'compute', 'aml-compute', 'aml-compute-gpu'])


### Create or attach an existing compute resource
Azure Machine Learning Compute is a managed service that lets data scientists run machine learning workloads on clusters of Azure virtual machines, including VMs with GPU support. In this notebook, you use Azure Machine Learning Compute to run the batch inference pipeline. The code below creates the compute cluster for you if it doesn't already exist in your workspace.

**Creation of compute takes approximately 5 minutes. If an AmlCompute with that name is already in your workspace, the code skips the creation process.**


In [3]:
aml_compute_name = 'aml-compute'
if aml_compute_name in workspace.compute_targets:
    aml_compute = AmlCompute(workspace, aml_compute_name)
    print("Found existing compute target: {}".format(aml_compute_name))
else:
    print("Creating new compute target: {}".format(aml_compute_name))
    provisioning_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_D2_V2", min_nodes=1, max_nodes=4)
    aml_compute = ComputeTarget.create(workspace, aml_compute_name, provisioning_config)
    aml_compute.wait_for_completion(show_output=True)

Found existing compute target: aml-compute


### Upload the dataset into a blob container and prepare it as the input of batch inference

In [4]:
# upload the local files to the 'path_on_datastore' directory in the blob container
path_on_datastore = 'data_for_batch_inference'
datastore = Datastore.get(workspace=workspace, datastore_name='workspaceblobstore')
datastore.upload(src_dir='data/data_for_batch_inference', target_path=path_on_datastore, overwrite=True, show_progress=False)

# get the DataPath object associated with the uploaded dataset
datastore_path = [DataPath(datastore=datastore, path_on_datastore=path_on_datastore)]

# dataset used as the input of the batch inference
dataset = Dataset.File.from_files(path=datastore_path).as_named_input('dataset_for_batch_inference')

### Register an anonymous module from yaml file to the workspace
If you decorate your module function with ```@dsl.module```, azure-cli-ml can generate the ```*.spec.yaml``` file for you.
Refer to the customized modules for more details.

In [5]:
fasttext_score_module_func = Module.from_yaml(workspace, 'fasttext_score/fasttext_score.spec.yaml')

### Load a trained fastText model from a completed experiment
- get all experiments
- choose an experiment from all experiments
- get the latest run
- get a PipelineRun associated with the run

In [6]:
exp_name_list = [exp.name for exp in Experiment.list(workspace)]
exp_name_list

['sample10',
 'sample5',
 'sample5-realtime',
 'simple10-batch',
 'pythonscript',
 'Data_dependency',
 'clement',
 'new_module',
 'test_module2',
 'test_m',
 'module_SDK_local_module_test',
 'fasttext_pipeline',
 'fasttext_batch_inference',
 'fasttext_pipeline2',
 'fasttext_pipeline_endpoint',
 'split_data_txt',
 'fasttext_train']

### Choose the experiment you want with its name

In [7]:
experiment_name = "fasttext_pipeline"
experiment = Experiment(workspace, experiment_name)
experiment

Name,Workspace,Report Page,Docs Page
fasttext_pipeline,DesignerDRI_EASTUS,Link to Azure Machine Learning studio,Link to Documentation


In [8]:
# take the first completed run returned for the experiment (an azureml.core.Run)
run = next(Run.list(experiment, status='Completed'))
run

Experiment,Id,Type,Status,Details Page,Docs Page
fasttext_pipeline,2f56028c-6ba7-4d76-a55b-05836feab252,azureml.PipelineRun,Completed,Link to Azure Machine Learning studio,Link to Documentation
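
Taking the first item from an empty iterator raises `StopIteration`, so the cell above fails with an unhelpful error if the experiment has no completed run. The sketch below shows one way to guard against that using the two-argument form of `next`. The run records are hypothetical dictionaries standing in for `azureml.core.Run` objects, and `first_completed` is a name made up for this example.

```python
from typing import Iterable, Optional


def first_completed(runs: Iterable[dict]) -> Optional[dict]:
    """Return the first run whose status is 'Completed', or None if there is none."""
    completed = (r for r in runs if r["status"] == "Completed")
    # The second argument to next() is returned instead of raising StopIteration.
    return next(completed, None)


# Hypothetical run records, listed in the order the service would return them.
runs = [
    {"id": "run-3", "status": "Failed"},
    {"id": "run-2", "status": "Completed"},
    {"id": "run-1", "status": "Completed"},
]

latest = first_completed(runs)
print(latest["id"] if latest else "no completed run found")
```

With real `Run` objects, the same idea applies: pass a default to `next` and check for `None` before building the `PipelineRun`.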


### Get a PipelineRun object

In [9]:
# azureml.pipeline.wrapper.PipelineRun
pipeline_run = PipelineRun(experiment, run.id)
pipeline_run

Experiment,Id,Type,Status,Details Page,Docs Page
fasttext_pipeline,2f56028c-6ba7-4d76-a55b-05836feab252,azureml.PipelineRun,Completed,Link to Azure Machine Learning studio,Link to Documentation


### Visualize the pipeline to obtain information about its modules

In [10]:
pipeline_run.visualize()

PipelineRunId: 2f56028c-6ba7-4d76-a55b-05836feab252
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/fasttext_pipeline/runs/2f56028c-6ba7-4d76-a55b-05836feab252?wsid=/subscriptions/74eccef0-4b8d-4f83-b5f9-fa100d155b22/resourcegroups/DesignerDRI/workspaces/DesignerDRI_EASTUS
use default ui version set: ~=0.1.0


<IPython.core.display.Javascript object>

ValidateView(container_id='container_id_2467d273-d046-48ec-b65b-6f21b07169cc_widget', env_json='{}', graph_jso…

### Get a StepRun object

In [11]:
# Update the step run id below to match your own run.
# Once the visualization has loaded, right-click the "Compare Two Models" module and copy the step run id from "View Run Id".
step_run_id = 'e2a3e63f-6e3c-4cc4-82bf-37c818e7e302'
step_run = pipeline_run.get_step_run(step_run_id)
step_run

Experiment,Id,Type,Status,Details Page,Docs Page
fasttext_pipeline,e2a3e63f-6e3c-4cc4-82bf-37c818e7e302,azureml.StepRun,Completed,Link to Azure Machine Learning studio,Link to Documentation


### To use the trained model from a port without registration, we need to install an extra dependency

In [None]:
# Install dataset runtime to enable dataset registration in sample notebooks
!pip install azureml-dataset-runtime[fuse] --extra-index-url https://azuremlsdktestpypi.azureedge.net/modulesdkpreview --user --upgrade

### Use the trained model as the input of a new pipeline

In [13]:
port = step_run.get_port(name='The better model')
model_data_path = port.get_data_path()
model = Dataset.File.from_files(path=[model_data_path]).as_named_input('model_for_batch_inference')
model

<azureml.data.dataset_consumption_config.DatasetConsumptionConfig at 0x7f9438426ad0>

### Construct the pipeline

In [14]:
@dsl.pipeline(name='batch inference', description='Batch Inference', default_compute_target=aml_compute.name)
def training_pipeline(dataset, model):
    fasttext_score = fasttext_score_module_func(
        texts_to_score=dataset,
        fasttext_model_dir=model
    )
    fasttext_score.runsettings.configure(node_count=2, process_count_per_node=2, mini_batch_size='64')
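
To get a feel for the run settings above, the following sketch works through the arithmetic. It assumes that `mini_batch_size` counts files per mini-batch (as it does for file datasets in Azure ML parallel runs) and that the input holds 200 documents, matching the 200 rows reported in the results later in this notebook. The helper `split_into_mini_batches` is hypothetical; the actual scheduling of mini-batches onto workers is handled by the service and is not modeled here.

```python
def split_into_mini_batches(items, mini_batch_size):
    """Split items into consecutive chunks of at most mini_batch_size elements."""
    return [
        items[start:start + mini_batch_size]
        for start in range(0, len(items), mini_batch_size)
    ]


# Values mirroring the run settings in the pipeline definition.
node_count = 2
process_count_per_node = 2
mini_batch_size = 64

# Assumed input: 200 files named '0' to '199'.
file_names = [str(i) for i in range(200)]

batches = split_into_mini_batches(file_names, mini_batch_size)
workers = node_count * process_count_per_node

print("mini-batches:", len(batches))
print("sizes:", [len(batch) for batch in batches])
print("worker processes:", workers)
```

Under these assumptions, the input splits into three full mini-batches of 64 files and one of 8, spread over four worker processes. A smaller `mini_batch_size` gives finer-grained work units at the cost of more per-batch overhead.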

In [15]:
pipeline = training_pipeline(dataset=dataset, model=model)
# pipeline.save(experiment_name=experiment_name)
pipeline

<azureml.pipeline.wrapper.pipeline.Pipeline at 0x7f943842d290>

### Run the pipeline

In [None]:
# submit the pipeline to a new experiment and wait for the run to finish
experiment_name = 'fasttext_batch_inference'
pipeline_run = pipeline.submit(experiment_name=experiment_name, regenerate_outputs=False)
pipeline_run.wait_for_completion()

### Download results of batch inference

In [17]:
# Update the step run id below to match your own run.
# Once the visualization has loaded, right-click the "FastText Score" module and copy the step run id from "View Run Id".
step_run_id = 'c2f5e15a-2fa0-40fd-8688-360086a95330'
step_run = pipeline_run.get_step_run(step_run_id)
step_run

Experiment,Id,Type,Status,Details Page,Docs Page
fasttext_batch_inference,c2f5e15a-2fa0-40fd-8688-360086a95330,azureml.StepRun,Completed,Link to Azure Machine Learning studio,Link to Documentation


In [18]:
port = step_run.get_port(name='Scored data output dir')
saved_path = port.download(overwrite=True)
saved_path

Downloading azureml/c2f5e15a-2fa0-40fd-8688-360086a95330/Scored_data_output_dir/303a8f68f1ac4bc78c26a7fecdcf9734.parquet
Downloading azureml/c2f5e15a-2fa0-40fd-8688-360086a95330/Scored_data_output_dir/61d5f90f9e9543098d7b60c36aba5a49.parquet
Downloading azureml/c2f5e15a-2fa0-40fd-8688-360086a95330/Scored_data_output_dir/7852799bfc5348cf8977fd789f57f478.parquet
Downloading azureml/c2f5e15a-2fa0-40fd-8688-360086a95330/Scored_data_output_dir/faece9eafbfc4b9c9f8c2e2ded7c6c49.parquet
Downloaded azureml/c2f5e15a-2fa0-40fd-8688-360086a95330/Scored_data_output_dir/303a8f68f1ac4bc78c26a7fecdcf9734.parquet, 1 files out of an estimated total of 4
Downloaded azureml/c2f5e15a-2fa0-40fd-8688-360086a95330/Scored_data_output_dir/faece9eafbfc4b9c9f8c2e2ded7c6c49.parquet, 2 files out of an estimated total of 4
Downloaded azureml/c2f5e15a-2fa0-40fd-8688-360086a95330/Scored_data_output_dir/61d5f90f9e9543098d7b60c36aba5a49.parquet, 3 files out of an estimated total of 4
Downloaded azureml/c2f5e15a-2fa0-40f

'/tmp/azureml/c2f5e15a-2fa0-40fd-8688-360086a95330/Scored_data_output_dir'

### Check the results of batch inference

In [19]:
# read every parquet file in the download directory and combine them into one DataFrame
df_list = []
for file_name in os.listdir(saved_path):
    path = os.path.join(saved_path, file_name)
    df_list.append(pd.read_parquet(path, engine='pyarrow'))
df = pd.concat(df_list)
print(df.shape)
df.head(n=10)

(200, 2)


Unnamed: 0,Filename,Class
0,/tmp/tmppr1hm8vu/0,entertainment
1,/tmp/tmppr1hm8vu/1,education
2,/tmp/tmppr1hm8vu/10,finance
3,/tmp/tmppr1hm8vu/100,game
4,/tmp/tmppr1hm8vu/101,education
5,/tmp/tmppr1hm8vu/102,society
6,/tmp/tmppr1hm8vu/103,game
7,/tmp/tmppr1hm8vu/104,education
8,/tmp/tmppr1hm8vu/105,entertainment
9,/tmp/tmppr1hm8vu/106,game
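
A quick sanity check on scored data is to look at how the predictions are distributed across classes. The sketch below does this with the standard library for the ten rows shown above; `class_distribution` is a hypothetical helper name. On the full DataFrame, `df['Class'].value_counts()` gives the equivalent summary.

```python
from collections import Counter

# The ten predicted classes from the preview above, in row order.
predicted = [
    "entertainment", "education", "finance", "game", "education",
    "society", "game", "education", "entertainment", "game",
]


def class_distribution(labels):
    """Return (label, count) pairs sorted by descending count."""
    return Counter(labels).most_common()


for label, count in class_distribution(predicted):
    print(f"{label}: {count}")
```

A heavily skewed distribution is not necessarily wrong, but it is a useful prompt to spot-check a few documents from the dominant class.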


## Reuse this pipeline with PipelineEndpoint
Suppose you need to run batch inference on a new dataset, or with a different model. In either case, you can reuse this pipeline through a PipelineEndpoint.

### Create a pipeline endpoint
Publish the above pipeline to a pipeline endpoint.

In [20]:
name = 'fasttext_endpoint'
try:
    pipeline_endpoint = PipelineEndpoint.get(workspace=workspace, name=name)
except Exception:
    # If no pipeline endpoint with this name exists yet, publish the above pipeline to a new pipeline endpoint
    pipeline_endpoint = PipelineEndpoint.publish(workspace=workspace, name=name, pipeline=pipeline_run)
pipeline_endpoint

Name,Description,Date updated,Updated by,Last run time,Last run status,Status,tags,Portal Link
fasttext_endpoint,,2020-08-18 09:37:10.713346+00:00,Xiaoyu Yang,2020-08-21 03:59:38.319468+00:00,Unknown,Unknown,azureml.Designer: true,Link


### Prepare a new dataset
Here, we reuse the dataset from earlier. To distinguish it from the original input, we give it a new input name.

In [21]:
dataset = Dataset.File.from_files(path=datastore_path).as_named_input('dataset_for_batch_inference_new')
dataset

<azureml.data.dataset_consumption_config.DatasetConsumptionConfig at 0x7f9439b2c110>

### Choose a new model
Here, we reuse the model from earlier. To distinguish it from the original input, we give it a new input name.

In [22]:
model = Dataset.File.from_files(path=[model_data_path]).as_named_input('model_for_batch_inference_new')
model

<azureml.data.dataset_consumption_config.DatasetConsumptionConfig at 0x7f943816ba90>

### Submit a pipeline experiment through pipeline parameters

In [None]:
experiment_name = 'fasttext_pipeline_endpoint'
pipeline_run = pipeline_endpoint.submit(experiment_name=experiment_name, parameters={'dataset':dataset, 'model':model})
pipeline_run.wait_for_completion()