# Amazon SageMaker Multi-Model Endpoints using XGBoost
With Amazon SageMaker Multi-Model Endpoints (new feature under NDA), customers can create an endpoint that hosts multiple models. These endpoints are well suited to cases where a large number of models can be served from a shared inference container and where prediction requests can tolerate an occasional cold-start latency penalty when an infrequently used model is invoked.

At a high level, Amazon SageMaker manages the in-memory lifetime of the models for multi-model endpoints. When an invocation request is made for a particular model, Amazon SageMaker routes the request to a particular instance, downloads the model from S3 to that instance, and loads the model into the memory of the container. Then Amazon SageMaker performs the invocation on the model. If the model is already loaded in memory, the invocation is fast because the downloading and loading steps are skipped.
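
The load-on-demand behavior described above can be pictured with a toy in-memory cache. This is only an illustrative sketch: the class `ToyModelCache`, its `download_and_load` callback, and the lack of any eviction policy are assumptions made for the example, not a description of how Amazon SageMaker implements model management.

```python
class ToyModelCache:
    """Toy stand-in for a per-instance model cache (hypothetical, for illustration only)."""

    def __init__(self, download_and_load):
        # download_and_load: callable taking a model name and returning a loaded model
        self._download_and_load = download_and_load
        self._loaded = {}

    def invoke(self, model_name, payload):
        cold = model_name not in self._loaded
        if cold:
            # Cold start: fetch the artifact and load it before serving the request.
            self._loaded[model_name] = self._download_and_load(model_name)
        model = self._loaded[model_name]
        # Return the prediction together with a flag saying whether this call was cold.
        return model(payload), cold


def fake_loader(model_name):
    # Pretend every "model" simply tags the payload with its own name.
    return lambda payload: '{}:{}'.format(model_name, payload)


cache = ToyModelCache(fake_loader)
first = cache.invoke('a.tar.gz', 'x')    # model not yet loaded: cold call
second = cache.invoke('a.tar.gz', 'x')   # model already in memory: warm call
print(first, second)
```

In this sketch only the first call for a given model name pays the loading cost; later calls reuse the loaded object.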

To demonstrate how multi-model endpoints are created and used, this notebook provides an example using a set of XGBoost models that each predict housing prices for a single location. The multi-model endpoint capability is designed to work across all machine learning frameworks and algorithms including those where you bring your own container.

## Generate synthetic data for housing models

In [None]:
import numpy as np
import pandas as pd
import json
import datetime
import time
from time import gmtime, strftime
import matplotlib.pyplot as plt
import os

## TEMPORARY FOR BETA: Get access to the new feature in boto3

In [None]:
!aws configure add-model --service-model file://sagemaker-2017-07-24.normal.json --service-name sagemaker-multimodel-endpoints
!aws configure add-model --service-model file://sagemaker-runtime.normal.json --service-name sagemaker-runtime-multimodel-endpoints

## Train multiple house value prediction models

In [None]:
import sagemaker
from sagemaker import get_execution_role
from sagemaker.predictor import csv_serializer
import boto3

sm_client = boto3.client(service_name='sagemaker-multimodel-endpoints')
runtime_sm_client = boto3.client(service_name='sagemaker-runtime-multimodel-endpoints')

s3 = boto3.resource('s3')
s3_client = boto3.client('s3')

sagemaker_session = sagemaker.Session()
role = get_execution_role()

ACCOUNT_ID = boto3.client('sts').get_caller_identity()['Account']
REGION     = boto3.Session().region_name
BUCKET     = sagemaker_session.default_bucket()

from sagemaker.amazon.amazon_estimator import get_image_uri
XGB_CONTAINER = get_image_uri(REGION, 'xgboost', '0.90-1')
# TEMP during beta: hard-code uri for beta container with multi-model endpoint support
#XGB_CONTAINER = '878107166805.dkr.ecr.us-east-1.amazonaws.com/preprod-xgboost-framework:mms_beta_1'
XGB_CONTAINER = '878107166805.dkr.ecr.us-west-2.amazonaws.com/preprod-xgboost-framework:mms_beta_1'

DATA_PREFIX            = 'DEMO_MME_REC'
HOUSING_MODEL_NAME     = 'recommendations' ##TODO : CHANGE THIS 
MULTI_MODEL_ARTIFACTS  = 'multi_model_artifacts'

#TRAIN_INSTANCE_TYPE    = 'ml.m4.xlarge'
ENDPOINT_INSTANCE_TYPE = 'ml.m4.xlarge'

### Split a given dataset into train, validation, and test

### Save datasets locally to support copying to s3

### Launch a single training job for a given housing location
Models hosted on a multi-model endpoint need no special treatment: they are trained in the same way as any other SageMaker model. Here we use the XGBoost estimator and do not wait for the job to complete.

### Kick off a model training job for each housing location

### Wait for all model training to finish

## Import models into hosting
A big difference for multi-model endpoints is that when creating the Model entity, the container's `ModelDataUrl` is the S3 prefix where the model artifacts that are invokable by the endpoint are located. The rest of the S3 path will be specified when actually invoking the model. Remember to close the location with a trailing slash.

The container's `Mode` is specified as `MultiModel` to signify that the container will host multiple models.
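
To make the role of the trailing slash concrete, the following sketch joins an S3 prefix with the model name supplied at invocation time. The helper `full_artifact_url` and the bucket name `my-bucket` are hypothetical and exist only for illustration; the actual path resolution happens inside the service.

```python
def full_artifact_url(model_data_url, target_model):
    """Join the S3 prefix from the Model entity with the per-request model name."""
    if not model_data_url.startswith('s3://'):
        raise ValueError('ModelDataUrl must be an S3 URL')
    # Normalise so the prefix ends with exactly one slash.
    prefix = model_data_url.rstrip('/') + '/'
    return prefix + target_model.lstrip('/')


# 'my-bucket' is a placeholder bucket name.
url = full_artifact_url('s3://my-bucket/DEMO_MME_REC/multi_model_artifacts/',
                        'movie-rec-model.tar.gz')
print(url)
```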

### Deploy model artifacts to be found by the endpoint
As described above, the multi-model endpoint is configured to find its model artifacts in a specific location in S3. For each trained model, we make a copy of its model artifacts into that location.

In our example, we are storing all the models within a single folder. The implementation of multi-model endpoints is flexible enough to permit an arbitrary folder structure. For a set of housing models for example, you could have a top level folder for each region, and the model artifacts would be copied to those regional folders.
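
As a sketch of the regional layout mentioned above, the helpers below build the S3 key for an artifact and the corresponding relative path that would be passed at invocation time. The function names, the region `northwest`, and the file `houston.tar.gz` are hypothetical examples, not part of this notebook's data.

```python
import posixpath


def regional_model_key(data_prefix, artifacts_folder, region, model_file):
    # S3 keys use forward slashes, so posixpath.join is used rather than os.path.join.
    return posixpath.join(data_prefix, artifacts_folder, region, model_file)


def target_model_for(region, model_file):
    # Path relative to the ModelDataUrl prefix, i.e. "the rest of the S3 path".
    return posixpath.join(region, model_file)


key = regional_model_key('DEMO_MME_REC', 'multi_model_artifacts', 'northwest', 'houston.tar.gz')
target = target_model_for('northwest', 'houston.tar.gz')
print(key, target)
```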

Note that we are purposely *not* copying the first model. This will be copied later in the notebook to demonstrate how to dynamically add new models to an already running endpoint.

In [None]:
models = {'movie-rec-model.tar.gz', 'model-maybe-music.tar.gz'}

for model in models:
    key = os.path.join(DATA_PREFIX, MULTI_MODEL_ARTIFACTS, model)
    with open('models/'+model, 'rb') as file_obj:
        print("Uploading ", file_obj.name, " to bucket ", BUCKET, " as ", key)
        s3.Bucket(BUCKET).Object(key).upload_fileobj(file_obj)

### Create the Amazon SageMaker model metadata
Here we use `boto3` to establish the model metadata. Instead of describing a single model, this metadata will indicate the use of multi-model semantics and will identify the source location of all specific model artifacts.

In [None]:
def create_multi_model_metadata(multi_model_name, role):
    # establish the place in S3 from which the endpoint will pull individual models
    _model_url  = 's3://{}/{}/{}/'.format(BUCKET, DATA_PREFIX, MULTI_MODEL_ARTIFACTS)
    _container = {
        'Image':        XGB_CONTAINER,
        'ModelDataUrl': _model_url,
        'Mode':         'MultiModel'
    }
    create_model_response = sm_client.create_model(
        ModelName = multi_model_name,
        ExecutionRoleArn = role,
        Containers = [_container])
    
    return _model_url

In [None]:
name = '{}-{}'.format(HOUSING_MODEL_NAME, strftime('%Y-%m-%d-%H-%M-%S', gmtime()))
model_url = create_multi_model_metadata(name, role)
print("model_url ", model_url)

### Create the multi-model endpoint
There is nothing special about the SageMaker endpoint config metadata for a multi-model endpoint. You need to consider the appropriate instance type and number of instances for the projected prediction workload. The number and size of the individual models will drive memory requirements.
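
A rough way to reason about the memory requirement is a back-of-the-envelope estimate like the one below. Everything here is an assumption for illustration: the helper `models_that_fit`, the 16 GiB instance memory, the 50 MiB average model size, and the 25% of memory set aside for the container and operating system. The in-memory footprint of a loaded model can also differ from the size of its artifact, so treat the result as an order-of-magnitude guide only.

```python
def models_that_fit(instance_memory_gib, avg_model_size_mib, reserved_fraction=0.25):
    """Rough count of models that could be resident in memory at once on one instance."""
    # Memory left for models after reserving a share for everything else (assumed share).
    usable_mib = instance_memory_gib * 1024 * (1.0 - reserved_fraction)
    return int(usable_mib // avg_model_size_mib)


# Assumed figures: 16 GiB of instance memory, 50 MiB per loaded model.
estimate = models_that_fit(instance_memory_gib=16, avg_model_size_mib=50)
print(estimate)
```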

Once the endpoint config is in place, the endpoint creation is straightforward.

In [None]:
#name = '{}-{}'.format("Recommendations", strftime('%Y-%m-%d-%H-%M-%S', gmtime()))

endpoint_config_name = name
print('Endpoint config name: ' + endpoint_config_name)

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants=[{
        'InstanceType': ENDPOINT_INSTANCE_TYPE,
        'InitialInstanceCount': 1,
        'InitialVariantWeight': 1,
        'ModelName': name,
        'VariantName': 'AllTraffic'}])

endpoint_name = name
print('Endpoint name: ' + endpoint_name)

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name)
print('Endpoint Arn: ' + create_endpoint_response['EndpointArn'])

In [None]:
print('Waiting for {} endpoint to be in service...'.format(endpoint_name))
resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp['EndpointStatus']

while status=='Creating':
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp['EndpointStatus']
    print('    {}...'.format(status))
print('DONE' if status == 'InService' else 'Endpoint creation ended with status: ' + status)
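
The polling pattern above can also be wrapped in a helper that gives up after a timeout and raises if the endpoint does not reach `InService`. This is a minimal sketch under stated assumptions: `wait_for_endpoint`, its parameters, and the 30-minute default timeout are hypothetical, and the demonstration uses a fake `describe` function instead of a real client.

```python
import time


def wait_for_endpoint(describe, endpoint_name, poll_seconds=60,
                      timeout_seconds=1800, sleep=time.sleep):
    """Poll until the endpoint leaves 'Creating'; raise on timeout or failure."""
    waited = 0
    status = describe(EndpointName=endpoint_name)['EndpointStatus']
    while status == 'Creating':
        if waited >= timeout_seconds:
            raise TimeoutError('{} still Creating after {}s'.format(endpoint_name, waited))
        sleep(poll_seconds)
        waited += poll_seconds
        status = describe(EndpointName=endpoint_name)['EndpointStatus']
    if status != 'InService':
        raise RuntimeError('{} ended in status {}'.format(endpoint_name, status))
    return status


# Demonstration with a fake describe call that reports two 'Creating' polls first.
_statuses = iter(['Creating', 'Creating', 'InService'])


def fake_describe(EndpointName):
    return {'EndpointStatus': next(_statuses)}


final_status = wait_for_endpoint(fake_describe, 'demo-endpoint', sleep=lambda s: None)
print(final_status)
```

Against a real endpoint, the call would look like `wait_for_endpoint(sm_client.describe_endpoint, endpoint_name)`.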

## Exercise the multi-model endpoint

### Establish a predictor

Since we are using the boto3 interface above to create the endpoint config and endpoint, we use `RealTimePredictor` to get access to the endpoint for predictions.

In [None]:
from sagemaker import RealTimePredictor

xgb_predictor = RealTimePredictor(endpoint_name)

xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer
xgb_predictor.deserializer = None

### Invoke multiple individual models hosted behind a single endpoint
Here we iterate through a set of housing predictions, choosing the specific location-based housing model at random. Notice the cold start price paid for the first invocation of any given model. Subsequent invocations of the same model take advantage of the model already being loaded into memory.
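
The cold start effect can be made visible by timing repeated calls to the same model. The sketch below does this with a simulated endpoint so that it runs anywhere; `timed_invocations` and `fake_invoke` are hypothetical helpers, and the 50 ms "load time" is an arbitrary stand-in for the real download and load cost.

```python
import time


def timed_invocations(invoke, model_name, payload, repeats=3):
    """Call invoke(model_name, payload) several times; return latencies in milliseconds."""
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        invoke(model_name, payload)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


# Simulated endpoint: the first call for each model sleeps to mimic loading.
_loaded = set()


def fake_invoke(model_name, payload):
    if model_name not in _loaded:
        time.sleep(0.05)  # arbitrary stand-in for download + load time
        _loaded.add(model_name)
    return '0.0'


latencies = timed_invocations(fake_invoke, 'movie-rec-model.tar.gz', '1,2,3')
print(['{:.1f} ms'.format(ms) for ms in latencies])
```

With a real endpoint, `invoke` could be a small wrapper around `runtime_sm_client.invoke_endpoint` that passes the model name as `TargetModel`.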

In [None]:
print("Model url - ", model_url)
print('Here are the models that the endpoint has at its disposal:')
!aws s3 ls $model_url

In [None]:
full_model_name="movie-rec-model.tar.gz"
#full_model_name="model-maybe-music.tar.gz"

#payload='502,678,883702448,23092,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0'
#payload='741,682,891455960,63108,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1'
#payload='276,127,874786568,95064,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0'
payload='574,347,891278860,53188,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0'
#payload='542,194,886532534,60515,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0'

response = runtime_sm_client.invoke_endpoint(
                        EndpointName=endpoint_name,
                        ContentType='text/csv',
                        TargetModel=full_model_name,
                        Body=payload)

#predicted_value = json.loads(response['Body'].read())[0]

prediction = response['Body'].read().decode('utf-8')
print("prediction : ", prediction)

In [None]:
##TODO : Need to show predictions from the test data file.
##Test data has already been prepped (in the XGBoost notebook) and uploaded here.
test_data = pd.read_csv('data/movielens_test.csv')

In [None]:
with open('data/movielens_test.csv', 'r') as f:
    contents = f.readlines()

# Score the first 20 rows: the first column is the original value, the rest are the features
for line in contents[:20]:
    split_data = line.strip().split(',')
    original_value = split_data.pop(0)
    payload = ','.join(split_data)

    response = runtime_sm_client.invoke_endpoint(
                        EndpointName=endpoint_name,
                        ContentType='text/csv',
                        TargetModel=full_model_name,
                        Body=payload)

    prediction = response['Body'].read().decode('utf-8')
    print("Original Value ", original_value, "Prediction : ", prediction)

In [None]:
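# Invoke a randomly chosen model ten times, showing the result and latency of each call.
# Reuses `models` and `payload` from the cells above; assumes every model accepts the same feature layout.
model_names = sorted(models)
for i in range(10):
    full_model_name = model_names[np.random.randint(len(model_names))]
    start = time.time()
    response = runtime_sm_client.invoke_endpoint(
        EndpointName=endpoint_name, ContentType='text/csv',
        TargetModel=full_model_name, Body=payload)
    prediction = response['Body'].read().decode('utf-8').strip()
    print('{}: {} ({:.0f} ms)'.format(full_model_name, prediction, (time.time() - start) * 1000))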

### Dynamically deploy another model
Here we demonstrate the power of dynamic loading of new models. We purposely did not copy the first model when deploying models earlier. Now we deploy an additional model and can immediately invoke it through the multi-model endpoint. As with the earlier models, the first invocation to the new model takes longer, as the endpoint takes time to download the model and load it into memory.
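
Adding a model amounts to copying one more artifact under the S3 prefix the endpoint reads from. The sketch below shows that step with a hypothetical helper, `upload_model_artifact`, which takes any object exposing boto3's S3 client `upload_file(Filename, Bucket, Key)` method. The bucket `my-bucket` and the file `new-model.tar.gz` are placeholders, and a recording stub stands in for the real client so the example runs without AWS access.

```python
import posixpath


def upload_model_artifact(s3_client, bucket, prefix, local_path):
    """Upload one model archive under the endpoint's S3 prefix; return its TargetModel name."""
    model_file = posixpath.basename(local_path.replace('\\', '/'))
    key = posixpath.join(prefix, model_file)
    s3_client.upload_file(local_path, bucket, key)
    return model_file


class RecordingS3Client:
    """Stub that records uploads instead of contacting S3 (for demonstration only)."""

    def __init__(self):
        self.uploads = []

    def upload_file(self, Filename, Bucket, Key):
        self.uploads.append((Filename, Bucket, Key))


stub = RecordingS3Client()
target_model = upload_model_artifact(
    stub, 'my-bucket', 'DEMO_MME_REC/multi_model_artifacts', 'models/new-model.tar.gz')
print(target_model, stub.uploads)
```

In the notebook, the same helper could be called with `s3_client`, `BUCKET`, and the prefix built from `DATA_PREFIX` and `MULTI_MODEL_ARTIFACTS`; the returned name is what would be passed as `TargetModel`.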

In [None]:
# add another model to the endpoint and exercise it
##TODO : Show how to add another model
#deploy_artifacts_to_mme(training_jobs[0])

### Invoke the newly deployed model
Exercise the newly deployed model without the need for any endpoint update or restart.

In [None]:
print('Here are the models that the endpoint has at its disposal:')
!aws s3 ls $model_url

In [None]:
##TODO : Show using multiple models

#model_name = LOCATIONS[0]
#full_model_name = '{}.tar.gz'.format(model_name)
#for i in range(5):
 #   features = gen_random_house()
  #  predict_one_house_value(gen_random_house()[1:], full_model_name)

## Clean up
To avoid being billed for an endpoint we are no longer using, we clean up.

In [None]:
# shut down the endpoint
xgb_predictor.delete_endpoint()

In [None]:
# maybe delete model too
xgb_predictor.delete_model()
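
As an alternative to the predictor helpers above, the same resources can be removed through the low-level client that created them. This is a sketch rather than the notebook's method: `delete_mme_resources` is a hypothetical helper, and a recording stub replaces the real client so the example runs without AWS access. The three calls it makes (`delete_endpoint`, `delete_endpoint_config`, `delete_model`) are standard SageMaker client operations.

```python
def delete_mme_resources(sm, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint, then its config, then the model entity."""
    sm.delete_endpoint(EndpointName=endpoint_name)
    sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm.delete_model(ModelName=model_name)


class RecordingSageMakerClient:
    """Stub that records delete calls instead of contacting SageMaker."""

    def __init__(self):
        self.calls = []

    def delete_endpoint(self, EndpointName):
        self.calls.append(('delete_endpoint', EndpointName))

    def delete_endpoint_config(self, EndpointConfigName):
        self.calls.append(('delete_endpoint_config', EndpointConfigName))

    def delete_model(self, ModelName):
        self.calls.append(('delete_model', ModelName))


stub_sm = RecordingSageMakerClient()
delete_mme_resources(stub_sm, 'demo', 'demo', 'demo')
print(stub_sm.calls)
```

In this notebook the endpoint, endpoint config, and model all share the value of `name`, so the real call would pass `name` three times. Use either this approach or the predictor helpers, not both, since the second attempt would find the resources already deleted.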