Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


# Deploying a web service hosted on NVIDIA Triton to Azure Container Instances (ACI)
This notebook shows the steps for deploying a service with [NVIDIA Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server): registering a model, writing a scoring script, and deploying the service to ACI.
We then test the service and delete the service and the model.

In [1]:
import azureml.core
print(azureml.core.VERSION)

1.10.0


# Get workspace
Load existing workspace from the config file info.

In [5]:
import os

from azureml.core.workspace import Workspace

subscription_id = os.getenv("SUBSCRIPTION_ID", default="a5fe3bc5-98f0-4c84-affc-a589f54d9b23")
resource_group = os.getenv("RESOURCE_GROUP", default="yultest")
workspace_name = os.getenv("WORKSPACE_NAME", default="yultest")
workspace_region = os.getenv("WORKSPACE_REGION", default="centraluseuap")
ws = Workspace(subscription_id = subscription_id, resource_group = resource_group, workspace_name = workspace_name)
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

If you run your code in unattended mode, i.e., where you can't give a user input, then we recommend to use ServicePrincipalAuthentication or MsiAuthentication.
Please refer to aka.ms/aml-notebook-auth for different authentication mechanisms in azureml-sdk.
yultest
yultest
centraluseuap
a5fe3bc5-98f0-4c84-affc-a589f54d9b23
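
The warning in the output above recommends `ServicePrincipalAuthentication` for unattended runs. A minimal sketch of how that might look, assuming the credentials are supplied through environment variables. The variable names `AZURE_TENANT_ID`, `AZURE_CLIENT_ID` and `AZURE_CLIENT_SECRET` and both helper functions are hypothetical, not part of the SDK.

```python
import os


def service_principal_settings(env=None):
    """Read service principal settings from environment variables.

    The variable names are assumptions for this sketch; use whatever
    your CI system or secret store actually provides.
    """
    env = os.environ if env is None else env
    names = {
        "tenant_id": "AZURE_TENANT_ID",
        "service_principal_id": "AZURE_CLIENT_ID",
        "service_principal_password": "AZURE_CLIENT_SECRET",
    }
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {arg: env[var] for arg, var in names.items()}


def get_workspace_unattended(subscription_id, resource_group, workspace_name, env=None):
    """Connect to the workspace without an interactive login prompt."""
    # Imported lazily so the helper above works without the SDK installed.
    from azureml.core import Workspace
    from azureml.core.authentication import ServicePrincipalAuthentication

    auth = ServicePrincipalAuthentication(**service_principal_settings(env))
    return Workspace(subscription_id=subscription_id,
                     resource_group=resource_group,
                     workspace_name=workspace_name,
                     auth=auth)
```

With the three variables set, `get_workspace_unattended(subscription_id, resource_group, workspace_name)` could replace the `Workspace(...)` call in the cell above.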


# Download the model

Before registering the model, you need a model in one of the [formats supported by Triton](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/model_repository.html#framework-model-definition). This cell downloads a [pretrained ONNX DenseNet](https://contentmamluswest001.blob.core.windows.net/content/14b2744cf8d6418c87ffddc3f3127242/9502630827244d60a1214f250e3bbca7/08aed7327d694b8dbaee2c97b8d0fcba/densenet121-1.2.onnx) model.

**Note:** If you have a previously registered model, the file name must follow the [naming convention](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/model_repository.html?highlight=model%20onnx#framework-model-definition) or be specified in the model configuration file below.

In [23]:
import os
import requests

model_url = 'https://contentmamluswest001.blob.core.windows.net/content/14b2744cf8d6418c87ffddc3f3127242/9502630827244d60a1214f250e3bbca7/08aed7327d694b8dbaee2c97b8d0fcba/densenet121-1.2.onnx'

version = '1'
model_dir = os.path.join('models', 'triton', 'densenet_onnx', version)
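# makedirs creates the intermediate directories; os.mkdir fails when parents are missing.
os.makedirs(model_dir, exist_ok=True)

target_file = os.path.join(model_dir, 'model.onnx')

if not os.path.exists(target_file):
    response = requests.get(model_url)
    response.raise_for_status()  # Do not save an HTTP error page as the model file.
    with open(target_file, 'wb') as f:
        f.write(response.content)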

config_file = os.path.join('models', 'triton', 'densenet_onnx', 'config.pbtxt')

# Add Model Configuration file

Each model needs a [Model Configuration](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/model_configuration.html) that provides required and optional information about the model.


In [24]:
%%writefile $config_file
name: "densenet_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 0
input [
  {
    name: "data_0"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 224, 224 ]
    reshape { shape: [ 1, 3, 224, 224 ] }
  }
]
output [
  {
    name: "fc6_1"
    data_type: TYPE_FP32
    dims: [ 1000 ]
    reshape { shape: [ 1, 1000, 1, 1 ] }
    label_filename: "densenet_labels.txt"
  }
]

Writing models/triton/densenet_onnx/config.pbtxt
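
Each `reshape` entry in the configuration above maps the shape seen by the client to the shape the model expects, so both shapes should hold the same number of elements. A small sanity check, as a sketch: the helper `reshape_is_consistent` is hypothetical and assumes fixed dimensions only (no `-1` entries).

```python
from math import prod


def reshape_is_consistent(dims, reshape_shape):
    """Return True when both shapes contain the same number of elements."""
    return prod(dims) == prod(reshape_shape)


# Values copied from the input and output sections of config.pbtxt.
input_ok = reshape_is_consistent([3, 224, 224], [1, 3, 224, 224])
output_ok = reshape_is_consistent([1000], [1, 1000, 1, 1])
print(input_ok, output_ok)
```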


# Register the model
Register the downloaded model with a description and tags.

**Note:** Under `model_path` there must be a sub-directory named `triton`, which has the structure of a Triton [Model Repository](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/model_repository.html#repository-layout).
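
The layout requirement in the note can be checked locally before uploading. The sketch below assumes the layout used in this notebook (`<model_path>/triton/<model_name>/config.pbtxt` plus `<model_path>/triton/<model_name>/<version>/model.onnx`); the helper `check_triton_layout` is hypothetical and is not part of the Azure ML SDK or Triton.

```python
import os


def check_triton_layout(model_path, model_name, version="1", model_file="model.onnx"):
    """Return a list of human-readable problems; an empty list means the layout looks fine."""
    problems = []

    repo = os.path.join(model_path, "triton")
    if not os.path.isdir(repo):
        problems.append("missing directory: " + repo)

    model_root = os.path.join(repo, model_name)
    config = os.path.join(model_root, "config.pbtxt")
    if not os.path.isfile(config):
        problems.append("missing model configuration: " + config)

    weights = os.path.join(model_root, version, model_file)
    if not os.path.isfile(weights):
        problems.append("missing model file: " + weights)

    return problems


# Check the directory that is passed as model_path when registering.
for problem in check_triton_layout("models", "densenet_onnx"):
    print(problem)
```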

In [34]:
from azureml.core.model import Model

model = Model.register(model_path="models", # This points to the local directory to upload.
                       model_name="densenet_onnx", # This is the name the model is registered as.
                       tags={'area': "Image classification", 'type': "classification"},
                       description="Image classification trained on Imagenet Dataset",
                       workspace=ws)

print(model.name, model.description, model.version)

Registering model densenet_onnx
densenet_onnx Image classification trained on Imagenet Dataset 2


# Deploy the model as a web service to ACI

First, create a scoring script.

**Note:** The Triton server listens on a fixed local port (`localhost:8000` in the script below). You can use the Triton Python [client library](https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/client_library.html) to talk to it, which keeps pre- and post-processing flexible.

In [35]:
%%writefile score.py
import numpy as np
from PIL import Image
import sys
from functools import partial
import os
import io

import tritonhttpclient
from tritonclientutils import InferenceServerException

from azureml.contrib.services.aml_request import AMLRequest, rawhttp
from azureml.contrib.services.aml_response import AMLResponse

sys.path.append(os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'models'))
from utils import preprocess, postprocess


triton_client = None
_url = "localhost:8000"
_model = "densenet_onnx"
_scaling = "INCEPTION"

def init():
    global triton_client, max_batch_size, input_name, output_name, dtype

    triton_client = tritonhttpclient.InferenceServerClient(_url)

    max_batch_size = 0
    input_name = "data_0"
    output_name = "fc6_1"
    dtype = "FP32"


@rawhttp
def run(request):
    if request.method == 'POST':
        
        reqBody = request.get_data(False)
        img = Image.open(io.BytesIO(reqBody))
        
        image_data = preprocess(img, _scaling, dtype)
        
        input = tritonhttpclient.InferInput(input_name, image_data.shape, dtype)
        input.set_data_from_numpy(image_data, binary_data=True)
        output = tritonhttpclient.InferRequestedOutput(output_name, binary_data=True, class_count=1)
    
        res = triton_client.infer(_model,
                                [input],
                                request_id="0",
                                outputs=[output])

        result = postprocess(res, output_name, 1, max_batch_size > 0)

        return AMLResponse(result, 200)
    else:
        # A non-POST request is a client error, not a server failure.
        return AMLResponse("method not allowed; use POST", 405)

Overwriting score.py
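
The scoring script imports `preprocess` from a `utils` module that is not shown in this notebook. As a rough illustration of what INCEPTION-style scaling for an NCHW model could involve, here is a sketch. The function `preprocess_inception` and its signature are assumptions and may differ from the real `utils.preprocess`; it expects an image that has already been resized and converted to an H x W x 3 RGB array.

```python
import numpy as np


def preprocess_inception(pixels, dtype=np.float32):
    """Scale 8-bit RGB pixels to [-1, 1] and reorder HWC -> CHW.

    Hypothetical stand-in for utils.preprocess; resizing to 224x224
    is assumed to have happened before this call.
    """
    pixels = np.asarray(pixels)
    if pixels.ndim != 3 or pixels.shape[2] != 3:
        raise ValueError("expected an HxWx3 RGB array")
    scaled = pixels.astype(dtype) / 127.5 - 1.0  # [0, 255] -> [-1, 1]
    return np.transpose(scaled, (2, 0, 1))       # channels first, as FORMAT_NCHW expects


example = preprocess_inception(np.zeros((224, 224, 3), dtype=np.uint8))
print(example.shape, example.dtype)
```

The resulting array has shape `(3, 224, 224)`, matching the `dims` declared for `data_0` in the model configuration.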


### Load environment

In [36]:

from azureml.core.environment import Environment

# env = Environment.get(workspace=ws, name="AzureML-Triton-20.07")
env = Environment.load_from_directory(path = "./myenv")
 


Now create the deployment configuration objects and deploy the model as a web service.

In [37]:
# Set the web service configuration (2 CPU cores and 4 GB of memory)
from azureml.core import Webservice
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice
from azureml.exceptions import WebserviceException


inference_config = InferenceConfig(entry_script="score.py", environment=env)
aci_config = AciWebservice.deploy_configuration(cpu_cores=2, memory_gb=4)


In [39]:
%%time
service_name = 'densenet-onnx'

# Delete any existing service with the same name so the name can be reused.
try:
    Webservice(ws, service_name).delete()
except WebserviceException:
    pass

service = Model.deploy(workspace=ws,
                       name=service_name,
                       models=[model],
                       inference_config=inference_config,
                       deployment_config=aci_config)

service.wait_for_deployment(show_output = True)
print(service.state)

Running..............................................................................................................................................
Succeeded
ACI service creation operation finished, operation "Succeeded"
Healthy
CPU times: user 1.52 s, sys: 167 ms, total: 1.69 s
Wall time: 13min 29s


In [None]:
service.get_logs()

# Test the web service
We test the web service by posting the raw bytes of a test image.

In [43]:
%%time
import requests

# If key auth is enabled, fetch the keys and include one in the request:
# primary_key, secondary_key = service.get_keys()
# headers = {'Authorization': 'Bearer ' + primary_key}

# If token auth is enabled (AKS deployments only), fetch a token and include it in the request:
# access_token, fetch_after = service.get_token()
# headers = {'Authorization': 'Bearer ' + access_token}

with open('car.jpg', 'rb') as f:
    test_sample = f.read()
resp = requests.post(service.scoring_uri, data=test_sample)
print(resp.text)

12.989695 (817) = SPORTS CAR
CPU times: user 5.15 ms, sys: 139 µs, total: 5.29 ms
Wall time: 227 ms
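
The response line above reads as score, class index, and label. The `postprocess` helper that produces it is not shown in this notebook; the sketch below assumes Triton returns each classification as a `score:index:label` string (the format of its classification extension) and shows how such a string could be turned into that line. The function `format_classification` is hypothetical.

```python
def format_classification(entry):
    """Turn a 'score:index:label' entry into 'score (index) = label'."""
    if isinstance(entry, bytes):
        entry = entry.decode("utf-8")
    # Split at most twice so a label containing ':' stays intact.
    score, index, label = entry.split(":", 2)
    return "{} ({}) = {}".format(score, index, label)


print(format_classification(b"12.989695:817:SPORTS CAR"))
```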


# Clean up
Delete the service and the registered model.

In [None]:
%%time
service.delete()
model.delete()
