# Deploy To MIR (Bring your own container)


## Prerequisites

1. Azure CLI (https://docs.microsoft.com/en-us/cli/azure/)
2. CLI extension for Azure Machine Learning (https://docs.microsoft.com/en-us/azure/machine-learning/reference-azure-machine-learning-cli)

## Prepare model repository
The model repository is the directory where you place the models. The directory layout must follow the [Triton model repository layout](https://github.com/triton-inference-server/server/blob/main/docs/model_repository.md):

```
<model-repository-path>/
    <model-name>/
      [config.pbtxt]
      <version>/
        <model-definition-file>
      <version>/
        <model-definition-file>
      ...
    <model-name>/
      [config.pbtxt]
      <version>/
        <model-definition-file>
      <version>/
        <model-definition-file>
      ...
    ...
```
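
Before registering the repository, it can help to check that the directory matches the layout above. The following is a minimal sketch, not part of the original tooling: the function name `check_model_repository` is hypothetical, and it assumes only that every model directory needs at least one numerically named, non-empty version directory.

```python
from pathlib import Path


def check_model_repository(repo_path):
    """Return a list of layout problems found under repo_path (empty if none)."""
    repo = Path(repo_path)
    if not repo.is_dir():
        return [f"{repo} is not a directory"]

    problems = []
    for model_dir in sorted(p for p in repo.iterdir() if p.is_dir()):
        # Version directories are expected to have numeric names such as "1".
        versions = sorted(
            p for p in model_dir.iterdir() if p.is_dir() and p.name.isdigit()
        )
        if not versions:
            problems.append(f"{model_dir.name}: no numeric version directory")
            continue
        for version_dir in versions:
            # A version directory must contain a model definition of some kind.
            if not any(version_dir.iterdir()):
                problems.append(
                    f"{model_dir.name}/{version_dir.name}: version directory is empty"
                )
    return problems
```

Calling `check_model_repository("./models")` after the repository has been generated should return an empty list if the layout is as expected.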

In [None]:
# For demo purposes, we use the `mock_repo.py` script to duplicate the sample model into many copies.

! python ../mock_repo.py --copy 100 --name densenet_onnx  ../repository_sample  ./models

In [None]:
# prepare azureml workspace
SUBSCRIPTION="<your sub id>"
RESOURCE_GROUP="<your rg>"
WORKSPACE="<your ws>"
! az account set --subscription $SUBSCRIPTION
! az configure --defaults workspace=$WORKSPACE group=$RESOURCE_GROUP

In [None]:
# Register the whole repository as a single Azure ML model
MODEL_NAME="my-multi-model"
! az ml model create --name $MODEL_NAME --local-path=./models  --version 1 

## Prepare the Multi-Model Triton image



In [None]:
# find the ACR name
acr_id = !az ml workspace show --query container_registry
acr_name = acr_id[-1].replace('"', '').split('/')[-1]
print('The ACR name is ' + acr_name)

IMAGE_NAME = f"{acr_name}.azurecr.io/multi-model-triton:latest"

!az acr import --name $acr_name --source amlitpmvp.azurecr.io/yulhuang/multi-model-triton:latest --image multi-model-triton:latest 
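
The ACR name lookup above relies on the registry name being the last path segment of the resource ID that the CLI prints. A small sketch of that parsing step follows; the helper names and the resource ID in the usage note are hypothetical and only illustrate the string handling.

```python
def acr_name_from_resource_id(resource_id):
    """Extract the registry name from an Azure resource ID string."""
    # Remove surrounding whitespace and the JSON quotes the CLI prints.
    cleaned = resource_id.strip().strip('"')
    # The registry name is the last path segment of the resource ID.
    return cleaned.rstrip("/").split("/")[-1]


def image_reference(acr_name, repository="multi-model-triton", tag="latest"):
    """Build the full image reference inside the workspace registry."""
    return f"{acr_name}.azurecr.io/{repository}:{tag}"
```

For a made-up ID such as `"/subscriptions/0000/resourceGroups/my-rg/providers/Microsoft.ContainerRegistry/registries/myacr"`, the first helper returns `myacr`, and `image_reference("myacr")` gives `myacr.azurecr.io/multi-model-triton:latest`.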



## Create MIR endpoint

In [None]:
# create the endpoint YAML file; change ENDPOINT_NAME to your own endpoint name

ENDPOINT_NAME="multi-model-triton-byoc"   # change this to your endpoint name
endpoint_yaml = f"""
name: {ENDPOINT_NAME}
auth_mode: key
"""

%store endpoint_yaml >endpoint.yaml

In [None]:
# create endpoint
! az ml online-endpoint create --file endpoint.yaml
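
The `%store ... >file` line is IPython magic, so it only works inside a notebook. When running these steps from a plain Python script, the same YAML file could be written as in the sketch below. The function name `write_endpoint_yaml` is hypothetical; the two fields are the ones used in the endpoint definition above.

```python
from pathlib import Path


def write_endpoint_yaml(endpoint_name, path="endpoint.yaml", auth_mode="key"):
    """Write a minimal endpoint definition and return the text written."""
    text = f"name: {endpoint_name}\nauth_mode: {auth_mode}\n"
    Path(path).write_text(text, encoding="utf-8")
    return text
```

The resulting file can then be passed to `az ml online-endpoint create --file endpoint.yaml` as before.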

## Create MIR deployment

In [None]:
DEPLOYMENT_NAME="triton"  # change this to your deployment name

deployment_yaml = f"""
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: {DEPLOYMENT_NAME}
endpoint_name: {ENDPOINT_NAME}
model:
   name: {MODEL_NAME}
   version: 1
   local_path: "./models"
environment:
  name: multi-model-triton-env
  version: 1
  image: {IMAGE_NAME}
  os_type: linux
  inference_config:
    liveness_route:
      port: 9000
      path: /v2/health/live
    readiness_route:
      port: 9000
      path: /v2/health/ready
    scoring_route:
      port: 9000
      path: /v2/models
instance_type: Standard_F2s_v2
instance_count: 1

"""

%store deployment_yaml >deployment.yaml

In [None]:
# create the deployment
!az ml online-deployment create  --file ./deployment.yaml

## Test the deployment

In [None]:
# find the scoring url
content=!az ml online-endpoint show --name $ENDPOINT_NAME --query "scoring_uri"
score_uri = content[-1].replace('"', '')
print(score_uri)

from urllib.parse import urlparse
u = urlparse(score_uri)
base_url = u.scheme +"://"+ u.netloc

In [None]:
# find the auth key
content = !az ml online-endpoint get-credentials --name $ENDPOINT_NAME --query primaryKey
key = content[-1].replace('"', '')  # strip the JSON quotes around the key

In [None]:
# Test inference with curl
model_name = "densenet_onnx_1"
# `sample-request_onnx.json` contains the request body

!curl --request POST $score_uri/$model_name/infer \
    --header "azureml-model-deployment: $DEPLOYMENT_NAME" \
    --header "Authorization: Bearer $key" \
    --header 'Content-Type: application/json' \
    --data "@sample-request_onnx.json"

In [None]:
# You can call the multi-model repository API to retrieve the status of the models

!curl $base_url/v2/repository/index --header "Authorization: Bearer $key" 
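
The curl inference call above can also be sketched in Python with the standard library. This is an illustrative equivalent, not part of the original sample: the function name `build_infer_request` is hypothetical, and it only builds the request object so the URL and headers can be inspected before anything is sent.

```python
import json
import urllib.request


def build_infer_request(score_uri, model_name, key, deployment_name, body):
    """Build a POST request for <score_uri>/<model_name>/infer."""
    url = f"{score_uri.rstrip('/')}/{model_name}/infer"
    headers = {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
        # Routes the request to a specific deployment behind the endpoint.
        "azureml-model-deployment": deployment_name,
    }
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(url, data=data, headers=headers, method="POST")
```

To send it, load the body with `json.load(open("sample-request_onnx.json"))`, build the request with the `score_uri`, `key`, and deployment name from the earlier cells, and pass the result to `urllib.request.urlopen`.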

In [None]:
# Unlike the vanilla Triton server, our multi-model server manages model loading and unloading automatically by monitoring memory usage.
# When it receives a request, it first loads the model if it is not already loaded.
# If memory usage is high, the model that has been idle the longest is unloaded.

# client.py is a script that spawns inference requests at random
from urllib.parse import urlparse
u = urlparse(score_uri)
base_url = u.scheme +"://"+ u.netloc
 
! python ../client.py --base $base_url --request-file="./sample-request_onnx.json" --key $key '*'
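
The load and unload behavior described above resembles a least-recently-used cache. The toy model below illustrates the idea only and is not the server's actual implementation: the class name, the per-model `size` values, and the fixed `memory_budget` are all assumptions, whereas the real server monitors actual memory usage.

```python
from collections import OrderedDict


class IdleFirstModelCache:
    """Toy model: load on demand, evict the longest-idle model when full."""

    def __init__(self, memory_budget):
        self.memory_budget = memory_budget
        # Maps model name to size; the longest-idle model comes first.
        self.loaded = OrderedDict()

    def request(self, name, size):
        """Serve a request for a model and return the names evicted for it."""
        if name in self.loaded:
            # Already loaded: mark it as the most recently used.
            self.loaded.move_to_end(name)
            return []
        evicted = []
        # Unload the longest-idle models until the new one fits.
        while self.loaded and sum(self.loaded.values()) + size > self.memory_budget:
            victim, _ = self.loaded.popitem(last=False)
            evicted.append(victim)
        self.loaded[name] = size
        return evicted
```

With a budget of 2 and models of size 1, requesting `a`, `b`, `a`, and then `c` evicts `b`, because `a` was used more recently.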