# Amazon SageMaker MultiModel Endpoints

With Amazon SageMaker MultiModel Endpoints (internal feature name), customers can create a single Endpoint that hosts multiple models. These Endpoints are well suited to cases where a large number of models can be served from a shared inference container and where the customer making an InvokeEndpoint request can tolerate occasional cold-start latency penalties when invoking infrequently used models.

At a high level, Amazon SageMaker manages the in-memory lifetime of the models for MultiModel Endpoints. When an invocation request is made for a particular model, Amazon SageMaker routes the request to a particular instance, downloads the model from S3 to that instance, and loads the model into the memory of the customer container. Amazon SageMaker then performs the invocation on the model. If the model is already loaded in memory, the invocation is fast because the download and load steps are skipped. A 'popular' model that is invoked frequently is likely to be in memory already, so its inference requests should also be served quickly.
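
The cold and warm paths described above can be illustrated with a toy in-memory cache. This is only a sketch: the class name `ModelCacheSketch`, the fixed capacity, and the least-recently-used eviction order are assumptions made for illustration, not a description of how SageMaker actually decides which models to keep loaded.

```python
from collections import OrderedDict


class ModelCacheSketch:
    """Toy stand-in for one instance's in-memory model cache (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity      # assumed limit on models held in memory
        self.loaded = OrderedDict()   # model name -> placeholder model object

    def invoke(self, model_name):
        """Return 'warm' if the model was already loaded, otherwise 'cold'."""
        if model_name in self.loaded:
            # Already in memory: skip download and load, mark as recently used.
            self.loaded.move_to_end(model_name)
            return "warm"
        # Not in memory: evict the least recently used model if the cache is
        # full (an assumed policy), then simulate the download-and-load step.
        if len(self.loaded) >= self.capacity:
            self.loaded.popitem(last=False)
        self.loaded[model_name] = object()
        return "cold"


cache = ModelCacheSketch(capacity=2)
for name in ["a", "a", "b", "a", "c", "a"]:
    print(name, cache.invoke(name))
```

In this run the frequently invoked model `a` stays warm after its first load, while `b` is evicted to make room for `c`.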

---

### Contents

1. [Introduction to MXNet Model Server (MMS)](#Introduction-to-MXNet-Model-Server-(MMS))
1. [Building and registering a container using MMS](#Building-and-registering-a-container-using-MMS)
1. [Set up Boto to use private SageMaker fields](#Set-up-Boto-to-use-private-SageMaker-fields)
1. [Upload model artifacts to S3](#Upload-model-artifacts-to-S3)
1. [Import models into hosting](#Import-models-into-hosting)
1. [Invoke a model](#Invoke-a-model)

### Introduction to MXNet Model Server (MMS)

[MXNet Model Server](https://github.com/awslabs/mxnet-model-server) is an open source framework for serving machine learning models. It provides the HTTP frontend and model management capabilities that MultiModel Endpoints require to host multiple models within a single container, load models into and unload models from the container dynamically, and perform inference on a specified loaded model.

Though the name implies the models are MXNet models, MMS supports a pluggable backend handler where you can implement your own algorithm.

This example uses a handler that supports loading and inference for MXNet models, which we will inspect below.

In [None]:
!cat container/model_handler.py

Of note are the `handle(data, context)` and `initialize(self, context)` methods.

The `initialize` method will be called when a model is loaded into memory. In this example, it loads the model artifacts at `model_dir` into MXNet.

The `handle` method will be called when invoking the model. In this example, it validates the input payload and then forwards the input to MXNet, returning the output.

This handler class is instantiated for every model loaded into the container, so state in the handler is not shared across models.
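
To make the division of work between `initialize` and `handle` concrete, here is a minimal skeleton with the same two methods. It is a hypothetical sketch, not the code in `container/model_handler.py`: the class name `SketchModelHandler`, the plain-dict `context` with a `model_dir` key, and the placeholder "model" that doubles its inputs are all assumptions chosen so the example runs without MXNet or MMS.

```python
class SketchModelHandler:
    """Hypothetical handler skeleton; not the code in container/model_handler.py."""

    def __init__(self):
        self.initialized = False
        self.model_dir = None
        self.model = None

    def initialize(self, context):
        # Called once, when the model is loaded into memory.
        # Assumption: context is a plain dict with a "model_dir" key.
        self.model_dir = context["model_dir"]
        # Placeholder "model": a real handler would load artifacts from model_dir.
        self.model = lambda values: [v * 2 for v in values]
        self.initialized = True

    def handle(self, data, context):
        # Called on every invocation: validate the payload, then run inference.
        if not self.initialized:
            raise RuntimeError("initialize() must run before handle()")
        if not data:
            raise ValueError("empty payload")
        return self.model(data)


# One handler instance per loaded model, so state is not shared between them.
handler_a = SketchModelHandler()
handler_b = SketchModelHandler()
handler_a.initialize({"model_dir": "/models/a"})
handler_b.initialize({"model_dir": "/models/b"})
print(handler_a.model_dir, handler_b.model_dir)
print(handler_a.handle([1, 2, 3], context=None))
```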

### Handling Out Of Memory conditions
If MXNet fails to load the model due to lack of memory, a `MemoryError` is raised. Any time a model cannot be loaded due to lack of memory or any other resource, a `MemoryError` must be raised. MMS will interpret the `MemoryError`, and return a 507 HTTP status code to SageMaker, where SageMaker will initiate unloading unused models to reclaim resources so the requested model can be loaded.
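
One way a handler could honor this contract is to wrap its load step so that resource-exhaustion failures surface as `MemoryError`. The sketch below is an assumption-laden illustration: `load_with_memory_guard`, `DeviceOutOfMemory`, and `failing_loader` are hypothetical names, and which exceptions count as "out of resources" depends on the framework, so the caller passes them in explicitly.

```python
class DeviceOutOfMemory(Exception):
    """Hypothetical framework error standing in for a resource-exhaustion failure."""


def load_with_memory_guard(load_fn, model_dir, resource_errors):
    """Call load_fn(model_dir); re-raise the listed resource errors as MemoryError."""
    try:
        return load_fn(model_dir)
    except resource_errors as exc:
        # Chain the original error so the root cause stays visible in logs.
        raise MemoryError("not enough resources to load {}".format(model_dir)) from exc


def failing_loader(model_dir):
    # Simulates a framework that fails to allocate memory for the model.
    raise DeviceOutOfMemory("simulated allocation failure")


try:
    load_with_memory_guard(failing_loader, "/models/a", (DeviceOutOfMemory,))
except MemoryError as err:
    print("MemoryError:", err)
```

A `MemoryError` raised directly by the loader passes through unchanged, since it is not in the translated set.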

### Building and registering a container using MMS
The shell script below will build a docker image which uses MMS as the front end, and `container/model_handler.py` that we inspected above as the backend handler. It will then upload the image to an ECR repository in your account.

In [None]:
%%sh

# The name of our algorithm
algorithm_name=demo-sagemaker-multimodel

cd container

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

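# Log Docker in to ECR. `aws ecr get-login` was removed in AWS CLI v2;
# `get-login-password` is available in v2 and in recent v1 releases.
aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin "${account}.dkr.ecr.${region}.amazonaws.com"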

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build -t ${algorithm_name} .
docker tag ${algorithm_name} ${fullname}

docker push ${fullname}

### Set up Boto to use private SageMaker fields
The new API fields required to create and invoke MultiModel Endpoints are packaged in this sample notebook. Below we install them into Boto.

In [None]:
!aws configure add-model --service-model file://sagemaker-2017-07-24.normal.json --service-name sagemaker-multimodel-endpoints
!aws configure add-model --service-model file://sagemaker-runtime.normal.json --service-name sagemaker-runtime-multimodel-endpoints

import boto3

sm_client = boto3.client(service_name='sagemaker-multimodel-endpoints')
runtime_sm_client = boto3.client(service_name='sagemaker-runtime-multimodel-endpoints')

### Set up the environment
Define the S3 bucket and prefix that will hold the model artifacts your MultiModel Endpoint can invoke.

Also define the IAM role that will give SageMaker access to the model artifacts and ECR image that was created above.

In [None]:
from sagemaker import get_execution_role

account_id = boto3.client('sts').get_caller_identity()['Account']
region = boto3.Session().region_name

bucket = 'sagemaker-{}-{}'.format(region, account_id)
prefix = 'demo-multimodel-endpoint'

role = get_execution_role()

### Upload model artifacts to S3
In this example we will use ResNet-18 and ResNet-152 models, both trained on the ImageNet dataset. First we will download the pre-trained models from MXNet's model zoo, then upload them to S3.

In [None]:
%%sh

mkdir resnet_18
cd resnet_18
wget -O resnet-18-0000.params http://data.mxnet.io/models/imagenet/resnet/18-layers/resnet-18-0000.params 
wget -O resnet-18-symbol.json http://data.mxnet.io/models/imagenet/resnet/18-layers/resnet-18-symbol.json 
wget -O synset.txt http://data.mxnet.io/models/imagenet/synset.txt 
echo '[{"shape": [1, 3, 224, 224], "name": "data"}]' > resnet-18-shapes.json
cd ..
tar -zcvf resnet_18.tar.gz -C resnet_18 .

In [None]:
%%sh

mkdir resnet_152
cd resnet_152
wget -O resnet-152-0000.params http://data.mxnet.io/models/imagenet/resnet/152-layers/resnet-152-0000.params 
wget -O resnet-152-symbol.json http://data.mxnet.io/models/imagenet/resnet/152-layers/resnet-152-symbol.json 
wget -O synset.txt http://data.mxnet.io/models/imagenet/synset.txt 
echo '[{"shape": [1, 3, 224, 224], "name": "data"}]' > resnet-152-shapes.json
cd ..
tar -zcvf resnet_152.tar.gz -C resnet_152 .

In [None]:
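from botocore.exceptions import ClientError
import os

s3 = boto3.resource('s3')
try:
    s3.meta.client.head_bucket(Bucket=bucket)
except ClientError:
    # The bucket is missing (or not accessible), so create it. Outside
    # us-east-1, S3 requires a LocationConstraint that matches the region.
    if region == 'us-east-1':
        s3.create_bucket(Bucket=bucket)
    else:
        s3.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={'LocationConstraint': region})

models = {'resnet_18.tar.gz', 'resnet_152.tar.gz'}

# Upload each archive under the shared prefix; the file name becomes the
# part of the key that is later passed as TargetModel.
for model in models:
    key = os.path.join(prefix, model)
    with open(model, 'rb') as file_obj:
        s3.Bucket(bucket).Object(key).upload_fileobj(file_obj)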

### Import models into hosting
A big difference for MultiModel endpoints is that when creating the Model entity, the container's `ModelDataUrl` is the S3 prefix where the model artifacts that are invokable by the endpoint are located. The rest of the S3 path will be specified when actually invoking the model.

The `Mode` of the container is specified as `MultiModel` to signify that the container will host multiple models.

In [None]:
from time import gmtime, strftime

model_name = 'DEMO-MultiModelModel' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
model_url = 'https://s3-{}.amazonaws.com/{}/{}/'.format(region, bucket, prefix)
container = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account_id, region, 'demo-sagemaker-multimodel')

print('Model name: ' + model_name)
print('Model data Url: ' + model_url)
print('Container image: ' + container)

container = {
    'Image': container,
    'ModelDataUrl': model_url,
    'Mode': 'MultiModel'
}

create_model_response = sm_client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    Containers = [container])

print("Model Arn: " + create_model_response['ModelArn'])

### Create endpoint configuration
Endpoint config creation works the same way as it does for single model endpoints.

In [None]:
endpoint_config_name = 'DEMO-MultiModelEndpointConfig-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print('Endpoint config name: ' + endpoint_config_name)

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants=[{
        'InstanceType': 'ml.m5.4xlarge',
        'InitialInstanceCount': 2,
        'InitialVariantWeight': 1,
        'ModelName': model_name,
        'VariantName': 'AllTraffic'}])

print("Endpoint Config Arn: " + create_endpoint_config_response['EndpointConfigArn'])

### Create endpoint
Similarly, endpoint creation works the same way as for single model endpoints.

In [None]:
import time

endpoint_name = 'DEMO-MultiModelEndpoint-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print('Endpoint name: ' + endpoint_name)

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name)
print('Endpoint Arn: ' + create_endpoint_response['EndpointArn'])

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp['EndpointStatus']
print("Endpoint Status: " + status)

while status=='Creating':
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp['EndpointStatus']
    print("Endpoint Status: " + status)

print("Endpoint Arn: " + resp['EndpointArn'])
print("Endpoint Status: " + status)
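
The polling loop above keeps waiting for as long as the endpoint reports `Creating`. A bounded variant is sketched below under stated assumptions: `wait_while_creating` is a hypothetical helper, the 30-minute default timeout is arbitrary, and the status source is passed in as a callable so the example can run here against a fake status sequence instead of a real endpoint.

```python
import time


def wait_while_creating(describe_status, poll_seconds=60, timeout_seconds=1800,
                        sleep=time.sleep):
    """Poll describe_status() until it stops returning 'Creating' or time runs out."""
    waited = 0
    status = describe_status()
    while status == 'Creating':
        if waited >= timeout_seconds:
            raise TimeoutError('still Creating after {} seconds'.format(waited))
        sleep(poll_seconds)
        waited += poll_seconds
        status = describe_status()
    return status


# Demonstration with a fake status sequence and a no-op sleep.
fake_statuses = iter(['Creating', 'Creating', 'InService'])
final_status = wait_while_creating(lambda: next(fake_statuses), sleep=lambda s: None)
print('Endpoint Status: ' + final_status)

# With the clients from this notebook, the callable could be:
#   lambda: sm_client.describe_endpoint(EndpointName=endpoint_name)['EndpointStatus']
```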

### Invoke a model
Now we invoke the models that we uploaded to S3 previously. The first invocation of a model may be slow because, behind the scenes, SageMaker downloads the model artifacts from S3 to the instance and loads them into the container.

First we will download an image of a cat as the payload to invoke the model, then call InvokeEndpoint to invoke the ResNet 18 model. The `TargetModel` field is concatenated with the S3 prefix specified in `ModelDataUrl` when creating the model, to generate the location of the model in S3.
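
The concatenation described above can be pictured as a simple string join. The helper below is a hypothetical illustration only: `resolve_model_location` is not a SageMaker API, the slash normalization is an assumption, and the bucket name with account ID `123456789012` is a made-up example.

```python
def resolve_model_location(model_data_url, target_model):
    """Join the ModelDataUrl prefix and TargetModel with exactly one slash."""
    return model_data_url.rstrip('/') + '/' + target_model.lstrip('/')


# Hypothetical prefix in the same shape as the model_url built earlier.
example_prefix = ('https://s3-us-west-2.amazonaws.com/'
                  'sagemaker-us-west-2-123456789012/demo-multimodel-endpoint/')

print(resolve_model_location(example_prefix, 'resnet_18.tar.gz'))
print(resolve_model_location(example_prefix, 'demo-subfolder/squeezenet_v1.0_0.tar.gz'))
```

Because `TargetModel` is only the remainder of the path, it can include subfolders beneath the prefix, as the second call shows.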

In [None]:
!wget -O cat.jpg https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/doc/tutorials/python/predict_image/cat.jpg

with open('cat.jpg', 'rb') as f:
    payload = f.read()

In [None]:
%%time

import json

response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/x-image',
    TargetModel='resnet_18.tar.gz', # this represents the rest of the S3 path where the model artifacts are located
    Body=payload)

print(*json.loads(response['Body'].read()), sep = '\n')

When we invoke the same ResNet 18 model a 2nd time, it is already downloaded to the instance and loaded in the container, so inference is faster.

In [None]:
%%time

response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/x-image',
    TargetModel='resnet_18.tar.gz',
    Body=payload)

print(*json.loads(response['Body'].read()), sep = '\n')
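
To compare the first and later invocations outside of `%%time`, the calls can be timed in a loop. The sketch below assumes a hypothetical helper `time_calls` and substitutes a fake `invoke` function that sleeps on the first call for each model, so it runs without an endpoint; the 0.05 second delay is arbitrary and says nothing about real cold-start latency.

```python
import time


def time_calls(invoke, target_model, repeats=3):
    """Return the elapsed seconds of each invoke(target_model) call."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        invoke(target_model)
        timings.append(time.perf_counter() - start)
    return timings


already_loaded = set()


def fake_invoke(target_model):
    # Stand-in for invoke_endpoint: only the first call per model is slow.
    if target_model not in already_loaded:
        time.sleep(0.05)
        already_loaded.add(target_model)


timings = time_calls(fake_invoke, 'resnet_18.tar.gz')
print(['{:.3f}s'.format(t) for t in timings])
```

With the real client, `invoke` could be a small function that wraps `runtime_sm_client.invoke_endpoint` with the `EndpointName`, `ContentType`, and `Body` used in the cells above.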

### Invoke another model
Exercising the power of a MultiModel Endpoint, we can specify a different model (resnet_152.tar.gz) as `TargetModel` and perform inference on it using the same Endpoint.

In [None]:
%%time

response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/x-image',
    TargetModel='resnet_152.tar.gz',
    Body=payload)

print(*json.loads(response['Body'].read()), sep = '\n')

### Invoke many models
We can add more models to the endpoint without having to update the endpoint. Below we are adding a 3rd model, `squeezenet_v1.0`. To demonstrate hosting multiple models behind the endpoint, this model is duplicated 10 times with a slightly different name in S3. In a more realistic scenario, these could be 10 new different models.

In [None]:
%%sh

mkdir squeezenet_v1.0
cd squeezenet_v1.0
wget -O squeezenet_v1.0-0000.params http://data.mxnet.io/models/imagenet/squeezenet/squeezenet_v1.0-0000.params
wget -O squeezenet_v1.0-symbol.json http://data.mxnet.io/models/imagenet/squeezenet/squeezenet_v1.0-symbol.json
wget -O synset.txt http://data.mxnet.io/models/imagenet/synset.txt
echo '[{"shape": [1, 3, 224, 224], "name": "data"}]' > squeezenet_v1.0-shapes.json 
cd ..
tar -zcvf squeezenet_v1.0.tar.gz -C squeezenet_v1.0 .

In [None]:
file = 'squeezenet_v1.0.tar.gz'

for x in range(0, 10):
    s3_file_name = 'demo-subfolder/squeezenet_v1.0_{}.tar.gz'.format(x)
    key = os.path.join(prefix, s3_file_name)
    with open(file, 'rb') as file_obj:
        s3.Bucket(bucket).Object(key).upload_fileobj(file_obj)
    models.add(s3_file_name)
    
print('Number of models: {}'.format(len(models)))
print('Models: {}'.format(models))

After uploading the SqueezeNet models to S3, we will invoke the endpoint 100 times, randomly choosing from one of the 12 models behind the S3 prefix for each invocation.

In [None]:
%%time

import random
from collections import defaultdict

results = defaultdict(int)

for x in range(0, 100):
    target_model = random.choice(tuple(models))
    response = runtime_sm_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='application/x-image',
        TargetModel=target_model,
        Body=payload)

    results[json.loads(response['Body'].read())[0]] += 1
    
print(*results.items(), sep = '\n')

### Updating a model
To update a model, you would follow the same approach as above and add it as a new model. For example, if you have retrained the `resnet_18.tar.gz` model and want to start invoking it, you would upload the updated model artifacts behind the S3 prefix with a new name such as `resnet_18_v2.tar.gz`, and then change the `TargetModel` field to invoke `resnet_18_v2.tar.gz` instead of `resnet_18.tar.gz`.
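
A small helper can keep such versioned names consistent. This is a sketch of one possible naming convention: `versioned_artifact_name` is a hypothetical function, and the `_v<N>` pattern simply mirrors the `resnet_18_v2.tar.gz` example above rather than anything SageMaker requires.

```python
def versioned_artifact_name(artifact_name, version, suffix='.tar.gz'):
    """Insert '_v<version>' before the archive suffix of artifact_name."""
    if not artifact_name.endswith(suffix):
        raise ValueError('expected a name ending in {}'.format(suffix))
    stem = artifact_name[:-len(suffix)]
    return '{}_v{}{}'.format(stem, version, suffix)


new_target_model = versioned_artifact_name('resnet_18.tar.gz', 2)
print(new_target_model)
```

The returned name is what would be uploaded behind the S3 prefix and then passed as `TargetModel`.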

### (Optional) Delete the hosting resources

In [None]:
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)