# Amazon SageMaker Multi-Model Endpoints using XGBoost
_**Hosting multiple trained machine learning models on a single Amazon SageMaker Endpoint**_

This notebook demonstrates:

* Hosting multiple trained machine learning models on a single Amazon SageMaker endpoint
* Directing inference traffic to the endpoint and to a specific model


**Table of Contents** 

1. [Introduction](#intro)
2. [Section 1 - Setup](#setup)
3. [Section 2 - Create the multi-model endpoint](#create-endpoint)
4. [Section 3 - Execute movie recommendation predictions](#movie-predictions)
5. [Section 4 - Update the multi-model endpoint with second recommendation model](#update-endpoint)
6. [Section 5 - Execute music recommendation predictions](#music-predictions)
7. [Clean up](#cleanup)

## Introduction <a id='intro'></a>

Amazon SageMaker provides every developer and data scientist with the ability to build, train, and deploy machine learning models quickly. 
Amazon SageMaker is a fully-managed service that covers the entire machine learning workflow. You can label and 
prepare your data, choose an algorithm, train a model, and then tune and optimize it for deployment. Amazon SageMaker 
gets your models into production to make predictions or take actions with less effort and lower costs than was 
previously possible.

With Amazon SageMaker Multi-Model Endpoints, you can create an endpoint that hosts multiple models. These endpoints are well suited to cases where a large number of models can be served from a shared inference container and where prediction requests can tolerate occasional cold-start latency when invoking infrequently used models.

At a high level, Amazon SageMaker manages the lifetime of the models in memory for multi-model endpoints. When an invocation request is made for a particular model, Amazon SageMaker routes the request to a particular instance, downloads the model from S3 to that instance, and loads the required model into the memory of the container. Then Amazon SageMaker performs an invocation on the model. If the model is already loaded in memory, the invocation will be fast since the downloading and loading steps are skipped.
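The load-on-first-use behaviour described above can be pictured with a small in-memory sketch. This is only a toy model of the idea, not how SageMaker is implemented: the class `OnDemandModelCache` and the helper `fake_fetch` are hypothetical names introduced here, and the "download and load" step is replaced by a plain function call.

```python
class OnDemandModelCache:
    """Toy stand-in for the load-on-first-use behaviour of a multi-model endpoint."""

    def __init__(self, fetch_model):
        # fetch_model is a hypothetical callable: model name -> loaded model object.
        self._fetch_model = fetch_model
        self._loaded = {}

    def invoke(self, model_name, payload):
        """Return (prediction, was_cold_start) for one request."""
        cold = model_name not in self._loaded
        if cold:
            # Cold path: stands in for the S3 download plus the load into memory.
            self._loaded[model_name] = self._fetch_model(model_name)
        model = self._loaded[model_name]
        return model(payload), cold


def fake_fetch(model_name):
    # Pretend every "model" simply tags the payload with its own name.
    return lambda payload: '{}:{}'.format(model_name, payload)


cache = OnDemandModelCache(fake_fetch)
first = cache.invoke('movie-rec-model.tar.gz', '1,2')   # cold: model is fetched
second = cache.invoke('movie-rec-model.tar.gz', '1,2')  # warm: model is reused
print(first, second)
```

The second call skips the fetch entirely, which is the reason repeated invocations of the same model are expected to be faster than the first one.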

To demonstrate how multi-model endpoints are created, updated and used, this notebook provides an example using two XGBoost models, one for movie recommendations and one for music recommendations. The multi-model endpoint capability is designed to work across all machine learning frameworks and algorithms including those where you bring your own container.

## Section 1 - Setup <a id='setup'></a>

In this section, we will import the necessary libraries, set up variables, and examine the data that was used to train the XGBoost movie recommendation model provided with this notebook.

Let's start by specifying:

* The AWS region used to host your model.
* The IAM role associated with this SageMaker notebook instance.
* The S3 bucket used to store the data used to train your model, any additional model data, and the data captured from model invocations.

## Import libraries 

In [None]:
import numpy as np
import pandas as pd
import json
import datetime
import time
from time import gmtime, strftime
import matplotlib.pyplot as plt
import os

import sagemaker
from sagemaker import get_execution_role
from sagemaker.predictor import csv_serializer
import boto3

## Build and register an XGBoost container that can serve multiple models

In [None]:
!pip install -qU awscli boto3 sagemaker

For the inference container to serve multiple models in a multi-model endpoint, it must implement additional APIs in order to load, list, get, unload and invoke specific models.

The 'mme' branch of the SageMaker XGBoost Container repository is an example of how to adapt SageMaker's XGBoost framework container to use Multi Model Server. Multi Model Server is a framework that provides an HTTP frontend implementing the additional container APIs required by multi-model endpoints, along with a pluggable backend handler for serving models with a custom framework, in this case XGBoost.

Using this branch, we will build an XGBoost container below that fulfills all of the multi-model endpoint container requirements, and then upload that image to Amazon Elastic Container Registry (ECR). Because uploading the image to ECR may create a new ECR repository, this notebook requires permissions in addition to the regular SageMakerFullAccess permissions. The easiest way to add these permissions is to add the managed policy AmazonEC2ContainerRegistryFullAccess to the role that you used to start your notebook instance. There's no need to restart your notebook instance when you do this; the new permissions will be available immediately.

In [None]:
ALGORITHM_NAME = 'multi-model-xgboost'

In [None]:
%%sh -s $ALGORITHM_NAME

algorithm_name=$1

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration
region=$(aws configure get region)

ecr_image="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Get the login command from ECR and execute it directly
$(aws ecr get-login --region ${region} --no-include-email --registry-ids ${account})

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

# First clear out any prior version of the cloned repo
rm -rf sagemaker-xgboost-container/

# Clone the xgboost container repo
git clone --single-branch --branch mme https://github.com/aws/sagemaker-xgboost-container.git
cd sagemaker-xgboost-container/

# Build the "base" container image that encompasses the installation of the
# XGBoost framework and all of the dependencies needed.
docker build -q -t xgboost-container-base:0.90-2-cpu-py3 -f docker/0.90-2/base/Dockerfile.cpu .

# Create the SageMaker XGBoost Container Python package.
python setup.py bdist_wheel --universal

# Build the "final" container image that encompasses the installation of the
# code that implements the SageMaker multi-model container requirements.
docker build -q -t ${algorithm_name} -f docker/0.90-2/final/Dockerfile.cpu .

docker tag ${algorithm_name} ${ecr_image}

docker push ${ecr_image}

## Define Variables

In [None]:
sm_client = boto3.client(service_name='sagemaker')
runtime_sm_client = boto3.client(service_name='sagemaker-runtime')

s3 = boto3.resource('s3')
s3_client = boto3.client('s3')

sagemaker_session = sagemaker.Session()
role = get_execution_role()

ACCOUNT_ID = boto3.client('sts').get_caller_identity()['Account']
REGION     = boto3.Session().region_name
BUCKET     = sagemaker_session.default_bucket()

from sagemaker.amazon.amazon_estimator import get_image_uri
XGB_CONTAINER = get_image_uri(REGION, 'xgboost', '0.90-1')

DATA_PREFIX = 'sagemaker/Recommendations-MultiModelEndpoint'

RECOMMENDATIONS_MODEL_NAME     = 'recommendations' 
MULTI_MODEL_ARTIFACTS  = 'multi_model_artifacts'
ENDPOINT_INSTANCE_TYPE = 'ml.m4.xlarge'

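# Override the built-in image URI with the multi-model XGBoost image built and pushed to ECR above.
XGB_CONTAINER = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(ACCOUNT_ID, REGION, ALGORITHM_NAME)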


In [None]:
#Pretrained models and data

LOCAL_MODELS_DIR='../../models'
LOCAL_DATA_DIR='../../data'

MOVIE_RECOMMENDATION_MODEL='movie-rec-model.tar.gz'
MUSIC_RECOMMENDATION_MODEL='music-rec-model.tar.gz'

MOVIE_RECOMMENDATION_TEST_DATA='movielens_users_items_for_predictions.csv'

MUSIC_RECOMMENDATION_TEST_DATA='music_users_items_for_predictions.csv'

MOVIE_META_DATA='movie_metadata.csv'
SONG_META_DATA='song_metadata.csv'

## Import models into hosting
A big difference for multi-model endpoints is that when creating the Model entity, the container's `ModelDataUrl` is the S3 prefix under which the model artifacts that can be invoked through the endpoint are located. The rest of the S3 path is specified when actually invoking a model. Remember to end the location with a trailing slash.

The `Mode` of the container is specified as `MultiModel` to signify that the container will host multiple models.
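To make the split between the S3 prefix and the per-request model path concrete, here is a minimal sketch of how the two pieces combine into a full artifact location. The helper `target_model_uri` and the bucket name `my-bucket` are hypothetical and only serve the illustration; the prefix and artifact name follow the variables defined earlier in this notebook.

```python
def target_model_uri(model_data_url, target_model):
    """Join the endpoint's S3 prefix with the relative path sent as TargetModel."""
    if not model_data_url.startswith('s3://'):
        raise ValueError('ModelDataUrl should be an S3 URI: {!r}'.format(model_data_url))
    # Normalise so the prefix always ends with exactly one trailing slash.
    prefix = model_data_url.rstrip('/') + '/'
    return prefix + target_model.lstrip('/')


# 'my-bucket' is a made-up bucket name used only for this example.
uri = target_model_uri(
    's3://my-bucket/sagemaker/Recommendations-MultiModelEndpoint/multi_model_artifacts',
    'movie-rec-model.tar.gz',
)
print(uri)
```

The same helper also accepts nested relative paths such as `region-a/model.tar.gz`, which matches the idea that only the part after the prefix is supplied at invocation time.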

### Deploy model artifacts to be found by the endpoint
As described above, the multi-model endpoint is configured to find its model artifacts in a specific location in S3. For each trained model, we make a copy of its model artifacts into that location.

In our example, we are storing all the models within a single folder. The implementation of multi-model endpoints is flexible enough to permit an arbitrary folder structure. For a set of housing models for example, you could have a top level folder for each region, and the model artifacts would be copied to those regional folders.

Note that we are purposely *not* copying the music recommendation model yet. It will be copied later in the notebook to demonstrate how to dynamically add new models to an already running endpoint.

In [None]:
## Copy a model artifact from the local models directory to the S3 bucket.
def copy_model_to_s3(model_name):
    key = os.path.join(DATA_PREFIX, MULTI_MODEL_ARTIFACTS, model_name)
    with open(os.path.join(LOCAL_MODELS_DIR, model_name), 'rb') as file_obj:
        print("Uploading ", model_name, " to bucket ", BUCKET, " as ", key)
        s3.Bucket(BUCKET).Object(key).upload_fileobj(file_obj)

In [None]:
##Copy movie recommendation model to S3
copy_model_to_s3(MOVIE_RECOMMENDATION_MODEL)

### Create the Amazon SageMaker model metadata
Here we use `boto3` to establish the model metadata. Instead of describing a single model, this metadata will indicate the use of multi-model semantics and will identify the source location of all specific model artifacts.

In [None]:
def create_multi_model_metadata(multi_model_name, role):
    # establish the place in S3 from which the endpoint will pull individual models
    _model_url  = 's3://{}/{}/{}/'.format(BUCKET, DATA_PREFIX, MULTI_MODEL_ARTIFACTS)
    _container = {
        'Image':        XGB_CONTAINER,
        'ModelDataUrl': _model_url,
        'Mode':         'MultiModel'
    }
    create_model_response = sm_client.create_model(
        ModelName = multi_model_name,
        ExecutionRoleArn = role,
        Containers = [_container])
    
    return _model_url

In [None]:
name = '{}-{}'.format(RECOMMENDATIONS_MODEL_NAME, strftime('%Y-%m-%d-%H-%M-%S', gmtime()))
model_url = create_multi_model_metadata(name, role)
print("model_url ", model_url)

### Create the multi-model endpoint <a id='create-endpoint'></a>
There is nothing special about the SageMaker endpoint config metadata for a multi-model endpoint. You need to consider the appropriate instance type and number of instances for the projected prediction workload. The number and size of the individual models will drive memory requirements.
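As a rough aid for sizing, the sketch below totals the size of the model archives under a prefix. It assumes entries shaped like the `Contents` list returned by the S3 `list_objects_v2` call (each with `Key` and `Size`); the helper name `summarize_artifacts` and the sample sizes are made up for illustration. Compressed archive size is only a loose lower bound on the memory a loaded model needs, so treat the result as a starting point.

```python
def summarize_artifacts(objects):
    """Return (count, total_mib) for entries shaped like S3 list_objects_v2 'Contents'."""
    archives = [obj for obj in objects if obj['Key'].endswith('.tar.gz')]
    total_bytes = sum(obj['Size'] for obj in archives)
    return len(archives), total_bytes / (1024 * 1024)


# Made-up listing used only to show the shape of the input.
sample_listing = [
    {'Key': 'prefix/multi_model_artifacts/movie-rec-model.tar.gz', 'Size': 1048576},
    {'Key': 'prefix/multi_model_artifacts/music-rec-model.tar.gz', 'Size': 2097152},
    {'Key': 'prefix/multi_model_artifacts/README.txt', 'Size': 100},
]
count, total_mib = summarize_artifacts(sample_listing)
print(count, total_mib)
```

In this notebook the input could come from something like `s3_client.list_objects_v2(Bucket=BUCKET, Prefix=...).get('Contents', [])`, with the prefix built from `DATA_PREFIX` and `MULTI_MODEL_ARTIFACTS`.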

Once the endpoint config is in place, the endpoint creation is straightforward.

In [None]:
endpoint_config_name = name
print('Endpoint config name: ' + endpoint_config_name)

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants=[{
        'InstanceType': ENDPOINT_INSTANCE_TYPE,
        'InitialInstanceCount': 1,
        'InitialVariantWeight': 1,
        'ModelName': name,
        'VariantName': 'AllTraffic'}])

endpoint_name = name
print('Endpoint name: ' + endpoint_name)

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name)
print('Endpoint Arn: ' + create_endpoint_response['EndpointArn'])

This step takes about 10 minutes.

In [None]:
print('Waiting for {} endpoint to be in service...'.format(endpoint_name))
resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp['EndpointStatus']

while status=='Creating':
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp['EndpointStatus']
    print('    {}...'.format(status))
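# Fail loudly if the endpoint did not reach InService (for example, status 'Failed')
if status != 'InService':
    raise RuntimeError('Endpoint ended with status {}: {}'.format(
        status, resp.get('FailureReason', 'no failure reason reported')))
print('DONE')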
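An alternative to a hand-written polling loop is the `endpoint_in_service` waiter that the boto3 SageMaker client provides. The sketch below wraps it in a hypothetical helper, `wait_until_in_service`; the delay and attempt counts are illustrative defaults, not recommendations. The waiter raises an exception if the endpoint fails or the attempts run out.

```python
def wait_until_in_service(sm_client, endpoint_name, delay_seconds=30, max_attempts=60):
    """Block until the endpoint is InService; the waiter raises on failure or timeout."""
    waiter = sm_client.get_waiter('endpoint_in_service')
    waiter.wait(
        EndpointName=endpoint_name,
        # Poll every delay_seconds, giving up after max_attempts polls.
        WaiterConfig={'Delay': delay_seconds, 'MaxAttempts': max_attempts},
    )
    return sm_client.describe_endpoint(EndpointName=endpoint_name)['EndpointStatus']
```

With the variables from this notebook, the call would look like `wait_until_in_service(sm_client, endpoint_name)`.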

## Exercise the multi-model endpoint

### Establish a predictor

Since we are using the boto3 interface above to create the endpoint config and endpoint, we use `RealTimePredictor` to get access to the endpoint for predictions.

In [None]:
from sagemaker import RealTimePredictor

xgb_predictor = RealTimePredictor(endpoint_name)

xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer
xgb_predictor.deserializer = None

In [None]:
print("Model url - ", model_url)
print('Here are the models served by the endpoint :')
!aws s3 ls $model_url

### Section 3 - Execute movie recommendation predictions <a id='movie-predictions'></a>

In [None]:
movie_df = pd.read_csv(LOCAL_DATA_DIR+"/"+MOVIE_META_DATA, delimiter ='|', encoding='latin-1')

movie_df.columns = ["movie id", "movie title", "release date", "video release date",
              "IMDb URL", "unknown", "Action", "Adventure", "Animation",
              "Children's","Comedy","Crime","Documentary","Drama","Fantasy",
              "Film-Noir","Horror","Musical","Mystery", "Romance","Sci-Fi",
              "Thriller","War","Western"]

In [None]:
runtime_client = boto3.client('runtime.sagemaker')

def get_recommendations_for_user(model_name,user_id, show_predictions):
    predictions_for_user = str(user_id)
    predictions = []
    
    with open(LOCAL_DATA_DIR+"/"+MOVIE_RECOMMENDATION_TEST_DATA, 'r') as f:
        contents = f.readlines() 
    
    for line in contents:
        if not line.strip():
            continue  # skip blank lines, such as a trailing empty line
        split_data = line.split(',')
        #Remove the original rating value from data used for prediction
        original_value = split_data.pop(0)
        original_value = split_data.pop(0)
        #print('original rating ', original_value)

        user = split_data[0]
        item = split_data[1]
        #print('Predicting rating for User ', user, 'for item ', item)

        if (user == predictions_for_user) : 

            payload = ','.join(split_data)

            response = runtime_client.invoke_endpoint(EndpointName=endpoint_name,
                                                  ContentType='text/csv', 
                                                  TargetModel=model_name,
                                                  Body=payload)
            prediction = response['Body'].read().decode('utf-8')

            predictions.append([item, prediction])

            #print("Original Value ", original_value , "Prediction : ", float(prediction))

    if show_predictions:
        # Sort numerically; the endpoint returns each prediction as text
        sorted_predictions = sorted(predictions, key=lambda x: float(x[1]), reverse=True)

        ## Let's show only the top 10 recommendations
        recommendations = sorted_predictions[0:10]

        print("Recommended movies for user with id : ", predictions_for_user)
        for rec in recommendations:
            movie_id = int(rec[0])
            movie_match = movie_df.loc[movie_df['movie id'] == movie_id]
            movie_title = movie_match['movie title'].values[0]
            print("\t", movie_title)

In [None]:
## Get movie recommendations for a couple of users.
user_ids = [100, 235]

for user_id in user_ids:
    get_recommendations_for_user(MOVIE_RECOMMENDATION_MODEL,user_id, True)
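Because the endpoint returns each prediction as text, ranking needs a numeric conversion first: compared as strings, `'9.5'` sorts above `'10.2'`. The following is a minimal sketch of a top-N selection over `[item, score_text]` pairs like those collected above; the helper name `top_n` and the sample values are hypothetical.

```python
def top_n(predictions, n=10):
    """Rank [item, score_text] pairs by numeric score, highest first, and keep n."""
    ranked = sorted(predictions, key=lambda pair: float(pair[1]), reverse=True)
    return ranked[:n]


# Made-up scores: as strings, '9.5' would incorrectly outrank '10.2'.
sample = [['12', '9.5'], ['7', '10.2'], ['3', '2.0']]
print(top_n(sample, n=2))
```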

### Section 4 - Update the multi-model endpoint with second recommendation model  <a id='update-endpoint'></a>

#### Dynamically deploy another model
Here we demonstrate the power of dynamic loading of new models. We purposely did not copy the music recommendation model when deploying models earlier. Now we deploy it as an additional model and can immediately invoke it through the multi-model endpoint. As with the earlier model, the first invocation of the new model takes longer, as the endpoint takes time to download the model and load it into memory.
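One way to observe the cold-start effect is to time two consecutive invocations of the same model. The sketch below is a generic timing helper, not part of the SageMaker SDK; `timed_call` and `compare_first_and_second` are hypothetical names, and `invoke` is any zero-argument callable that performs one request.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def compare_first_and_second(invoke):
    """Call invoke twice and report both latencies; the first is the likely cold start."""
    _, first = timed_call(invoke)
    _, second = timed_call(invoke)
    return {'first_s': first, 'second_s': second}
```

In this notebook, `invoke` could be a lambda wrapping `runtime_client.invoke_endpoint(...)` with `TargetModel` set to the newly uploaded artifact and `Body` set to one CSV row of features. The measured times include network latency, so they are only indicative.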

In [None]:
##Copy music recommendation model to S3
copy_model_to_s3(MUSIC_RECOMMENDATION_MODEL)

In [None]:
print("Model url - ", model_url)
print('Here are the models served by the endpoint :')
!aws s3 ls $model_url

### Section 5 - Execute music recommendation predictions <a id='music-predictions'></a>

In [None]:
song_df = pd.read_csv(LOCAL_DATA_DIR+"/"+SONG_META_DATA, delimiter =',', encoding='latin-1')

In [None]:
song_df.columns

In [None]:
def get_music_recommendations_for_user(model_name,user_id, show_predictions):
    predictions_for_user = str(user_id)
    predictions = []
    
    with open(LOCAL_DATA_DIR+"/"+MUSIC_RECOMMENDATION_TEST_DATA, 'r') as f:
        contents = f.readlines() 
    
    for line in contents:
        if not line.strip():
            continue  # skip blank lines, such as a trailing empty line
        
        #print("line ", line)
        split_data = line.split(',')
        
        #print("split_data ", split_data)
        #Remove the original rating value from data used for prediction
        original_value = split_data.pop(0)
        
        user = split_data[0]
        item = split_data[1]
        
        #print('Predicting rating for User ', user, 'for item ', item)

        if (user == predictions_for_user) : 

            payload = ','.join(split_data)

            response = runtime_client.invoke_endpoint(EndpointName=endpoint_name,
                                                  ContentType='text/csv', 
                                                  TargetModel=model_name,
                                                  Body=payload)
            prediction = response['Body'].read().decode('utf-8')

            predictions.append([item, prediction])

            #print("Original Value ", original_value , "Prediction : ", prediction)

    if show_predictions:
        # Sort numerically; the endpoint returns each prediction as text
        sorted_predictions = sorted(predictions, key=lambda x: float(x[1]), reverse=True)

        ## Let's show only the top 10 recommendations
        recommendations = sorted_predictions[0:10]

        # Report the requested user, not the last user seen in the loop
        print("Recommended songs for user with id : ", predictions_for_user)
        for rec in recommendations:
            song_id = float(rec[0])
            song_match = song_df.loc[song_df['short_song_id'] == song_id]
            song_title = song_match['title'].values[0]
            print("\t", song_title)

In [None]:
user_id = '30544.0'

get_music_recommendations_for_user(MUSIC_RECOMMENDATION_MODEL,user_id, True)

### Updating a model
To update a model, you would follow the same approach as above and add it as a new model. For example, if you have retrained the `movie-rec-model.tar.gz` model and want to start invoking it, upload the updated model artifacts behind the S3 prefix with a new name such as `movie-rec-model_v2.tar.gz`, and then change the `TargetModel` field to invoke `movie-rec-model_v2.tar.gz` instead of `movie-rec-model.tar.gz`. Do not overwrite the existing model artifacts in Amazon S3, because the old version of the model might still be loaded in the containers or on the storage volume of the instances behind the endpoint. Invocations of the overwritten name could then still be served by the old version of the model.
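The renaming step can be sketched as a small helper that derives a versioned artifact name rather than reusing the old one. The function `versioned_artifact_name` and the `_v<N>` suffix convention are assumptions made for this example; any scheme that produces a new, unique name serves the same purpose.

```python
def versioned_artifact_name(artifact_name, version):
    """Insert a _v<version> marker before the .tar.gz suffix of an artifact name."""
    suffix = '.tar.gz'
    if not artifact_name.endswith(suffix):
        raise ValueError('Expected a .tar.gz artifact, got {!r}'.format(artifact_name))
    stem = artifact_name[:-len(suffix)]
    return '{}_v{}{}'.format(stem, version, suffix)


new_name = versioned_artifact_name('movie-rec-model.tar.gz', 2)
print(new_name)
```

The new name is what would be used both as the S3 object name under the endpoint's prefix and as the `TargetModel` value in later invocations.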

Alternatively, you could stop the endpoint and re-deploy a fresh set of models.

## (Optional) Clean up <a id='cleanup'></a>
Here, to be sure we are not billed for endpoints we are no longer using, we clean up.

In [None]:
# shut down the endpoint
xgb_predictor.delete_endpoint()

In [None]:
# delete the model metadata as well
xgb_predictor.delete_model()