Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# Using Azure Machine Learning Pipelines for Batch Inference

In this notebook, we demonstrate how to make predictions on large quantities of data asynchronously using Azure Machine Learning pipelines. Batch inference (or batch scoring) provides cost-effective, high-throughput inference for asynchronous applications. Batch prediction pipelines can scale to perform inference on terabytes of production data and are optimized for fire-and-forget predictions over large collections of data.

> **Tip**
> If your system requires low-latency processing (to process a single document or a small set of documents quickly), use [real-time scoring](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-consume-web-service) instead of batch prediction.

In this example, we take a digit identification model already trained on the MNIST dataset using the [AzureML training with deep learning example notebook](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/training-with-deep-learning/train-hyperparameter-tune-deploy-with-keras/train-hyperparameter-tune-deploy-with-keras.ipynb), and run that trained model on some of the MNIST test images in batch.

The input dataset used for this notebook differs from the standard MNIST dataset in that it has been converted to PNG images to demonstrate the use of files as inputs to Batch Inference. A sample of PNG-converted images of the MNIST dataset was taken from [this repository](https://github.com/myleott/mnist_png).

The outline of this notebook is as follows:

- Create a DataStore referencing MNIST images stored in a blob container.
- Register the pretrained MNIST model into the model registry.
- Use the registered model to do batch inference on the images in the data blob container.

## Prerequisites
If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you go through the configuration Notebook located at https://github.com/Azure/MachineLearningNotebooks first. This sets you up with a working config file that has information on your workspace, subscription id, etc.



In [None]:
import os
import json
from azureml.core import Workspace, Dataset, Datastore
from azureml.data.datapath import DataPath


### Create word_to_index.json

In [12]:
path = 'data/data.txt'
# Count how often each space-separated token occurs in the text column.
word_count = {}
with open(path, 'r', encoding='utf-8') as f:
    for line in f:
        text = line.split('\t')[0]
        for word in text.split(' '):
            if word not in word_count:
                word_count[word] = 0
            word_count[word] += 1
# Most frequent words first.
word_count_list = sorted(word_count.items(), key=lambda x: x[1], reverse=True)

# Indices 0 and 1 are reserved; real words start at index 2.
word_to_index = {"[PAD]": 0, "[UNK]": 1}
vocab_size = 5000
index = 2
for word, _ in word_count_list:
    if index == vocab_size:
        break
    word_to_index[word] = index
    index += 1
with open('data/word_to_index.json', 'w', encoding='utf-8') as f:
    json.dump(word_to_index, f)
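
To illustrate how a vocabulary of this shape might be consumed, here is a minimal sketch that maps space-separated text to a fixed-length list of indices. The `encode` helper, its `max_len` parameter, and the toy vocabulary are hypothetical and not part of the original notebook; only the `[PAD]` = 0 and `[UNK]` = 1 convention comes from the cell above.

```python
def encode(text, word_to_index, max_len=8):
    """Map space-separated tokens to indices, then truncate or pad to max_len."""
    pad = word_to_index["[PAD]"]
    unk = word_to_index["[UNK]"]
    # Words missing from the vocabulary fall back to the [UNK] index.
    ids = [word_to_index.get(word, unk) for word in text.split(' ')]
    ids = ids[:max_len]
    # Right-pad short sequences with the [PAD] index.
    return ids + [pad] * (max_len - len(ids))


# Toy vocabulary standing in for the contents of data/word_to_index.json.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "stock": 2, "market": 3}
encoded = encode("stock market rally", toy_vocab, max_len=5)
print(encoded)  # [2, 3, 1, 0, 0]
```

Here `rally` is not in the toy vocabulary, so it maps to `[UNK]`, and the remaining two positions are filled with `[PAD]`.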

### Create a datastore containing sample images

We have created a public blob container `sampledata` on an account named `pipelinedata`, containing these images from the MNIST dataset. In the next step, we create a datastore with the name `images_datastore`, which points to this blob container. In the call to `register_azure_blob_container` below, setting the `overwrite` flag to `True` overwrites any datastore that was created previously with that name.

This step can be changed to point to your blob container by providing your own `datastore_name`, `container_name`, and `account_name`.
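
The paragraph above refers to a `register_azure_blob_container` call. A minimal sketch of such a call, using the account, container, and datastore names mentioned in the text, might look like the following. The wrapper function `register_images_datastore` is a hypothetical name, and the sketch assumes the container is publicly readable, so no account key or SAS token is passed. The `azureml.core` import sits inside the function so that defining it does not itself require the SDK.

```python
ACCOUNT_NAME = "pipelinedata"        # storage account named in the text
CONTAINER_NAME = "sampledata"        # public blob container named in the text
DATASTORE_NAME = "images_datastore"  # datastore name named in the text


def register_images_datastore(workspace):
    """Register the sample blob container as a datastore in the workspace.

    Hypothetical helper: the name and structure are assumptions made for
    illustration, not code from the original notebook.
    """
    # Deferred import: only needed when the function is actually called.
    from azureml.core import Datastore

    return Datastore.register_azure_blob_container(
        workspace=workspace,
        datastore_name=DATASTORE_NAME,
        container_name=CONTAINER_NAME,
        account_name=ACCOUNT_NAME,
        overwrite=True,  # replace any datastore previously registered under this name
    )
```

With a workspace object loaded, the helper would be called as `register_images_datastore(workspace)`. To point at your own storage, change the three constants.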



### Prepare data for batch inference

Batch inference takes files as inputs, so we write each input sample to its own file.

In [3]:
path = 'data/data.txt'
dir_ = 'data_for_batch_inference'
os.makedirs(dir_, exist_ok=True)
num = 200  # number of samples to export
# Keep the text column (before the first tab) of the first `num` lines.
with open(path, 'r', encoding='utf-8') as f:
    lines = []
    for i, line in enumerate(f):
        if i == num:
            break
        lines.append(line.split('\t')[0])
# Write each sample to its own file, named by its line index.
for i, line in enumerate(lines):
    sample_path = os.path.join(dir_, str(i))
    with open(sample_path, 'w', encoding='utf-8') as f:
        f.write(line)
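
Because the files are named `0`, `1`, `2`, and so on, a plain directory listing sorted as strings would place `10` before `2`. The following minimal sketch reads such a directory back in numeric order. The `read_samples` helper is a hypothetical name, and the demonstration runs against a temporary directory rather than the real `data_for_batch_inference` folder.

```python
import os
import tempfile


def read_samples(dir_):
    """Return file contents from dir_, ordered by the integer file name."""
    names = sorted(os.listdir(dir_), key=int)
    samples = []
    for name in names:
        with open(os.path.join(dir_, name), 'r', encoding='utf-8') as f:
            samples.append(f.read())
    return samples


# Self-contained demonstration on a temporary directory with 12 files.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(12):
        with open(os.path.join(tmp, str(i)), 'w', encoding='utf-8') as f:
            f.write('sample %d' % i)
    samples = read_samples(tmp)

print(samples[:3], samples[-1])
```

Sorting with `key=int` assumes every file name in the directory is a plain integer, which holds for the files written by the cell above.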

### Prepare your workspace

In [4]:
workspace = Workspace.from_config('config.json')
workspace

Workspace.create(name='fundamental3', subscription_id='4f455bd0-f95a-4b7d-8d08-078611508e0b', resource_group='fundamental')

### Datastore

In [5]:
path_on_datastore = 'my_dataset'
datastore = Datastore.get(workspace=workspace, datastore_name='workspaceblobstore')
datastore.upload(src_dir='data', target_path=path_on_datastore, overwrite=True, show_progress=True)

Uploading an estimated of 4 files
Uploading data/azureml/358a3e99-f299-4089-b2cf-cecc32ac34f8/Trained_model_dir/BestModel
Uploading data/data.txt
Uploading data/label.txt
Uploading data/word_to_index.json
Uploaded data/label.txt, 1 files out of an estimated total of 4
Uploaded data/azureml/358a3e99-f299-4089-b2cf-cecc32ac34f8/Trained_model_dir/BestModel, 2 files out of an estimated total of 4
Uploaded data/word_to_index.json, 3 files out of an estimated total of 4
Uploaded data/data.txt, 4 files out of an estimated total of 4
Uploaded 4 files


$AZUREML_DATAREFERENCE_451c7d517a0a42188a394042cb895dfa

### Register the dataset


In [6]:
dataset_name = 'THUCNews'
description = 'THUCNews dataset is generated by filtering and filtering historical data \
of Sina News RSS subscription channel from 2005 to 2011'
datastore_path = [DataPath(datastore=datastore, path_on_datastore=path_on_datastore)]
data = Dataset.File.from_files(path=datastore_path)
data.register(workspace=workspace, name=dataset_name, description=description, create_new_version=True)

{
  "source": [
    "('workspaceblobstore', 'my_dataset')"
  ],
  "definition": [
    "GetDatastoreFiles"
  ],
  "registration": {
    "id": "f2d71ae3-678a-4673-9485-56c56e2e1389",
    "name": "THUCNews",
    "version": 1,
    "description": "THUCNews dataset is generated by filtering and filtering historical data of Sina News RSS subscription channel from 2005 to 2011",
    "workspace": "Workspace.create(name='fundamental3', subscription_id='4f455bd0-f95a-4b7d-8d08-078611508e0b', resource_group='fundamental')"
  }
}

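### Datastore

In [7]:
# Upload the per-sample files to their own path on the same datastore.
path_on_datastore = 'my_dataset_for_batch_inference'
datastore.upload(src_dir='data_for_batch_inference', target_path=path_on_datastore, overwrite=True, show_progress=True)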

Uploading an estimated of 200 files
Uploading data_for_batch_inference/0
Uploading data_for_batch_inference/1
Uploading data_for_batch_inference/10
Uploading data_for_batch_inference/100
Uploading data_for_batch_inference/101
Uploading data_for_batch_inference/102
Uploading data_for_batch_inference/103
Uploading data_for_batch_inference/104
Uploading data_for_batch_inference/105
Uploading data_for_batch_inference/106
Uploading data_for_batch_inference/107
Uploading data_for_batch_inference/108
Uploading data_for_batch_inference/109
Uploading data_for_batch_inference/11
Uploading data_for_batch_inference/110
Uploading data_for_batch_inference/111
Uploading data_for_batch_inference/112
Uploading data_for_batch_inference/113
Uploading data_for_batch_inference/114
Uploading data_for_batch_inference/115
Uploading data_for_batch_inference/116
Uploading data_for_batch_inference/117
Uploading data_for_batch_inference/118
Uploading data_for_batch_inference/119
...

Uploaded data_for_batch_inference/171, 82 files out of an estimated total of 200
Uploading data_for_batch_inference/2
Uploaded data_for_batch_inference/169, 83 files out of an estimated total of 200
Uploading data_for_batch_inference/20
Uploading data_for_batch_inference/21
Uploaded data_for_batch_inference/185, 84 files out of an estimated total of 200
Uploading data_for_batch_inference/22
Uploaded data_for_batch_inference/18, 85 files out of an estimated total of 200
Uploaded data_for_batch_inference/179, 86 files out of an estimated total of 200
Uploading data_for_batch_inference/23
Uploading data_for_batch_inference/24
Uploaded data_for_batch_inference/181, 87 files out of an estimated total of 200
Uploading data_for_batch_inference/25
Uploaded data_for_batch_inference/188, 88 files out of an estimated total of 200
Uploading data_for_batch_inference/26
Uploaded data_for_batch_inference/183, 89 files out of an estimated total of 200
Uploading data_for_batch_inference/27
...

Uploaded data_for_batch_inference/20, 152 files out of an estimated total of 200
Uploading data_for_batch_inference/84
Uploaded data_for_batch_inference/69, 153 files out of an estimated total of 200
Uploading data_for_batch_inference/85
Uploaded data_for_batch_inference/65, 154 files out of an estimated total of 200
Uploading data_for_batch_inference/86
Uploaded data_for_batch_inference/66, 155 files out of an estimated total of 200
Uploading data_for_batch_inference/87
Uploaded data_for_batch_inference/71, 156 files out of an estimated total of 200
Uploading data_for_batch_inference/88
Uploaded data_for_batch_inference/7, 157 files out of an estimated total of 200
Uploading data_for_batch_inference/89
Uploaded data_for_batch_inference/45, 158 files out of an estimated total of 200
Uploading data_for_batch_inference/9
Uploaded data_for_batch_inference/68, 159 files out of an estimated total of 200
Uploading data_for_batch_inference/90
...

$AZUREML_DATAREFERENCE_b09637c7feff45f793275fa2a9123854

### Register the dataset


In [8]:
dataset_name = 'THUCNews_For_Batch_Inference'
description = 'THUCNews dataset is generated by filtering and filtering historical data \
of Sina News RSS subscription channel from 2005 to 2011'
datastore_path = [DataPath(datastore=datastore, path_on_datastore=path_on_datastore)]
data = Dataset.File.from_files(path=datastore_path)
data.register(workspace=workspace, name=dataset_name, description=description, create_new_version=True)

{
  "source": [
    "('workspaceblobstore', 'my_dataset_for_batch_inference')"
  ],
  "definition": [
    "GetDatastoreFiles"
  ],
  "registration": {
    "id": "f519240d-08f3-44bb-8e79-58b8c90810d6",
    "name": "THUCNews_For_Batch_Inference",
    "version": 1,
    "description": "THUCNews dataset is generated by filtering and filtering historical data of Sina News RSS subscription channel from 2005 to 2011",
    "workspace": "Workspace.create(name='fundamental3', subscription_id='4f455bd0-f95a-4b7d-8d08-078611508e0b', resource_group='fundamental')"
  }
}
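
Once registered, a dataset can be looked up by name from any session that has access to the workspace. The sketch below is one way this might be done; `load_registered_files` is a hypothetical helper, and the default name is simply the one registered above. The `azureml.core` import is placed inside the function so that defining it does not itself require the SDK.

```python
def load_registered_files(workspace, name="THUCNews_For_Batch_Inference"):
    """Return the relative file paths of a registered file dataset.

    Hypothetical helper: the name and structure are assumptions made for
    illustration, not code from the original notebook.
    """
    # Deferred import: only needed when the function is actually called.
    from azureml.core import Dataset

    # Fetches the latest registered version by default.
    dataset = Dataset.get_by_name(workspace, name=name)
    # For a file dataset, to_path() lists the files it references.
    return dataset.to_path()
```

Calling `load_registered_files(workspace)` should list one path per uploaded sample file, which is a quick way to confirm that the registration points at the intended folder.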