# Using Azure Machine Learning Pipelines for Batch Inference

In this notebook, we demonstrate how to make predictions on large quantities of data asynchronously using ML pipelines with Azure Machine Learning. Batch inference (or batch scoring) provides cost-effective, high-throughput inference for asynchronous applications. Batch prediction pipelines can scale to perform inference on terabytes of production data and are optimized for fire-and-forget predictions over large collections of data.

In this example, we take a digit identification model already trained on the MNIST dataset using the [AzureML training with deep learning example notebook](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/training-with-deep-learning/train-hyperparameter-tune-deploy-with-keras/train-hyperparameter-tune-deploy-with-keras.ipynb), and run that trained model on some of the MNIST test images in batch.

The input dataset used for this notebook differs from a standard MNIST dataset in that it has been converted to PNG images to demonstrate the use of files as inputs to batch inference. A sample of PNG-converted images of the MNIST dataset was taken from [this repository](https://github.com/myleott/mnist_png).

The outline of this notebook is as follows:

- Create a datastore referencing MNIST images stored in a blob container.
- Upload the pretrained MNIST model to the default datastore.
- Use the uploaded model to do batch inference on the images in the blob container.

In [None]:
# Install the Azure ML SDK pipeline wrapper package (provides PipelineRun)
# Important: restart the kernel after the installation succeeds

%config IPCompleter.greedy=True
!pip install azureml-pipeline-wrapper[notebooks]==0.1.0.20471586 --extra-index-url https://azuremlsdktestpypi.azureedge.net/CLI-SDK-Runners-Validation/20471586 --user --upgrade

## Connect to your workspace

In [None]:
from azureml.core import Workspace

subscription_id="your_subscription_id"
resource_group="your_resource_group"
name="your_workspace_name"

ws = Workspace.get(subscription_id=subscription_id, resource_group=resource_group, name=name)
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\n')

## Create or Attach existing compute resource
By using Azure Machine Learning Compute, a managed service, data scientists can run machine learning workloads on clusters of Azure virtual machines, including VMs with GPU support. In this tutorial, you create Azure Machine Learning Compute as the environment for batch inference. The code below creates the compute cluster for you if it doesn't already exist in your workspace.

**Creation of compute takes approximately 5 minutes. If an AmlCompute with that name is already in your workspace, the code skips the creation process.**

In [None]:
import os
from azureml.core.compute import AmlCompute, ComputeTarget

# choose a name for your cluster
compute_name = os.environ.get("AML_COMPUTE_CLUSTER_NAME", "aml-compute")
# environment variables are strings when set, so convert node counts to int
compute_min_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MIN_NODES", 0))
compute_max_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MAX_NODES", 4))

# This example uses a CPU VM. To use a GPU VM, set the SKU to STANDARD_NC6
vm_size = os.environ.get("AML_COMPUTE_CLUSTER_SKU", "STANDARD_D2_V2")


if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    if compute_target and type(compute_target) is AmlCompute:
        print('found compute target. just use it. ' + compute_name)
else:
    print('creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = vm_size,
                                                                min_nodes = compute_min_nodes, 
                                                                max_nodes = compute_max_nodes)

    # create the cluster
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)
    
    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
    # For a more detailed view of current AmlCompute status, use get_status()
    print(compute_target.get_status().serialize())
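
Values read from `os.environ` are always strings when the variable is set, while the defaults in the cell above are integers. The following is a minimal sketch of a helper that normalizes both cases before the values are passed on as node counts. The name `env_int` is hypothetical and not part of the Azure ML SDK.

```python
import os

def env_int(name, default):
    """Return environment variable `name` as an int, or `default` if it is unset or empty."""
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    return int(raw)

# Simulate one user override and one unset variable (illustrative values only)
os.environ["AML_COMPUTE_CLUSTER_MAX_NODES"] = "8"
os.environ.pop("AML_COMPUTE_CLUSTER_MIN_NODES", None)

min_nodes = env_int("AML_COMPUTE_CLUSTER_MIN_NODES", 0)
max_nodes = env_int("AML_COMPUTE_CLUSTER_MAX_NODES", 4)
print(min_nodes, max_nodes)
```

A non-numeric value raises `ValueError` here, which surfaces a misconfiguration before any cluster is requested.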

## Create a datastore containing sample images
We have created a public blob container named *sampledata* on a storage account named *pipelinedata*, containing the PNG-converted images from the MNIST dataset. In the next step, we create a datastore named *mnist_datastore* that points to this blob container. In the call to *register_azure_blob_container* below, setting the *overwrite* flag to True overwrites any datastore previously created with that name.

This step can be changed to point to your blob container by providing your own *datastore_name*, *container_name*, and *account_name*.

In [None]:
from azureml.core.datastore import Datastore

account_name = "pipelinedata"
datastore_name = "mnist_datastore"
container_name = "sampledata"

mnist_data = Datastore.register_azure_blob_container(ws, 
                      datastore_name=datastore_name, 
                      container_name=container_name, 
                      account_name=account_name,
                      overwrite=True)
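
To make the relationship between *account_name*, *container_name*, and the underlying storage location concrete, here is a small sketch that composes the public URL of a blob container. It assumes the public Azure cloud endpoint suffix `blob.core.windows.net`, which matches the model URL used later in this notebook. The helper `blob_container_url` is hypothetical and not an SDK function.

```python
def blob_container_url(account_name, container_name, blob_path=""):
    """Compose the HTTPS URL of a container (or of a blob inside it).

    Assumes the public Azure cloud endpoint suffix; sovereign clouds differ.
    """
    base = f"https://{account_name}.blob.core.windows.net/{container_name}"
    if blob_path:
        return f"{base}/{blob_path.lstrip('/')}"
    return base

images_url = blob_container_url("pipelinedata", "sampledata")
model_url = blob_container_url("pipelinedata", "mnist-model", "mnist-tf.tar.gz")
print(images_url)
print(model_url)
```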

Next, specify the default datastore, which is used for the outputs and for uploading the trained model.

In [None]:
def_data_store = ws.get_default_datastore()

## Create a FileDataset
A [FileDataset](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.filedataset?view=azure-ml-py) references one or more files in your datastores or at public URLs. The files can be of any format. A FileDataset lets you download or mount the files to your compute. By creating a dataset, you create a reference to the data source location. If you apply any subsetting transformations to the dataset, they are stored in the dataset as well. The data remains in its existing location, so no extra storage cost is incurred.

In [None]:
from azureml.core.dataset import Dataset

mnist_ds_name = 'mnist_sample_data'

path_on_datastore = mnist_data.path('mnist')
input_mnist_ds = Dataset.File.from_files(path=path_on_datastore, validate=False)

## Download the Model
Download the model from https://pipelinedata.blob.core.windows.net/mnist-model/mnist-tf.tar.gz and extract it to the "models" directory.

In [None]:
import os
import tarfile
import urllib.request

# create directory for model (no error if it already exists)
model_dir = 'models'
os.makedirs(model_dir, exist_ok=True)

url = "https://pipelinedata.blob.core.windows.net/mnist-model/mnist-tf.tar.gz"
urllib.request.urlretrieve(url, "model.tar.gz")

# the context manager closes the archive once extraction finishes
with tarfile.open("model.tar.gz", "r:gz") as tar:
    tar.extractall(model_dir)

os.listdir(model_dir)
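
To try the extract step without network access, the sketch below builds a stand-in `.tar.gz` archive in a temporary directory and extracts it the same way. The file name `dummy_model.txt` and its contents are placeholders, not the contents of the real model archive.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    # Build a stand-in archive; the file name and contents are placeholders.
    src = os.path.join(workdir, "dummy_model.txt")
    with open(src, "w") as f:
        f.write("weights placeholder")

    archive = os.path.join(workdir, "model.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname="dummy_model.txt")

    # Extract into a fresh "models" directory, closing the archive afterwards.
    model_dir = os.path.join(workdir, "models")
    os.makedirs(model_dir, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(model_dir)

    extracted = sorted(os.listdir(model_dir))
    print(extracted)
```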

## Upload model to default datastore

In [None]:
target_path = 'batch_inf_models'
def_data_store.upload(src_dir='models/', target_path=target_path, show_progress=True)
model_data_path = Dataset.File.from_files(path=(def_data_store, target_path))

## Register a module from an existing function to use the model to make batch predictions

In [None]:
from azureml.pipeline.wrapper import PipelineRun, Module, dsl
from batch_score import batch_score
score_module_func = Module.from_func(ws, batch_score)

Use the help() function to see the module definition.

In [None]:
help(score_module_func)

## Define the inference pipeline

In [None]:
@dsl.pipeline(name='batch inference', description='Batch Inference', default_compute_target=compute_name)
def scoring_pipeline(dataset, model, output_file):
    score_module = score_module_func(
        images_to_score=dataset,
        model_dir=model,
        scored_data_output_name=output_file,
    )
    score_module.runsettings.configure(node_count=2, process_count_per_node=2, mini_batch_size='64')
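
The run settings above request 2 nodes with 2 processes each and a mini-batch size of 64. The sketch below illustrates the idea of splitting a file list into mini-batches and spreading them over `node_count * process_count_per_node` workers. It assumes that the mini-batch size counts files, and it uses a simple round-robin assignment purely for illustration. The actual scheduling performed by Azure Machine Learning may differ, and both function names are hypothetical.

```python
def make_mini_batches(files, mini_batch_size):
    """Split a list of files into consecutive chunks of at most mini_batch_size."""
    return [files[i:i + mini_batch_size] for i in range(0, len(files), mini_batch_size)]

def assign_round_robin(batches, node_count, process_count_per_node):
    """Assign mini-batches to worker processes in round-robin order (illustrative only)."""
    workers = node_count * process_count_per_node
    assignment = {w: [] for w in range(workers)}
    for i, batch in enumerate(batches):
        assignment[i % workers].append(batch)
    return assignment

# 300 hypothetical image names standing in for the MNIST PNG files
files = [f"{i}.png" for i in range(300)]
batches = make_mini_batches(files, 64)
assignment = assign_round_robin(batches, node_count=2, process_count_per_node=2)

print(len(batches), [len(b) for b in batches])
print({w: len(bs) for w, bs in assignment.items()})
```

With 300 files, this yields four full mini-batches and one partial one, so one of the four workers handles two mini-batches while the others handle one each.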

## Create the pipeline with parameters

In [None]:
output_file_name = 'inference_result.txt'
pipeline = scoring_pipeline(dataset=input_mnist_ds, model=model_data_path, output_file=output_file_name)

## Run the pipeline

In [None]:
pipeline_run = pipeline.submit(experiment_name='batch-inf-test')

## Monitor the run

In [None]:
pipeline_run.wait_for_completion()

## View the prediction results per input image
The batch_score function imported above returns a result list with the filename and the prediction for each image. These results are written to the output port of the scoring step, which in the code below is named "Scored data output dir". It contains the outputs from all of the worker nodes used in the compute cluster. You can download this data to view the results; the code below shows only the first 10 rows.

In [None]:
import pandas as pd
import os

batch_run = pipeline_run.find_step_run('Batch Score')[0]
port = batch_run.get_port(name='Scored data output dir')
saved_path = port.download(overwrite=True)

output_file_name = 'inference_result.txt'
saved_file = os.path.join(saved_path, output_file_name)
df = pd.read_csv(saved_file, delimiter=":", header=None)
df.columns = ["Filename", "Prediction"]
print(f"Prediction has {df.shape[0]} rows")
df.head(10)
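
The cell above reads the result file as colon-delimited text with two columns. To see that parsing step in isolation, here is a minimal sketch using only the standard library. The sample lines are made up for illustration and are not real pipeline output.

```python
import csv
import io

# Hypothetical sample in the "filename: prediction" layout read above
sample_output = "0.png: 7\n1.png: 2\n2.png: 1\n"

reader = csv.reader(io.StringIO(sample_output), delimiter=":")
rows = [(filename.strip(), int(prediction)) for filename, prediction in reader]

print(f"Prediction has {len(rows)} rows")
for filename, prediction in rows:
    print(filename, prediction)
```

This sketch assumes that filenames contain no colon; a path that does would need a different delimiter or a split on the last colon only.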