# Amazon SageMaker Model Monitoring Beta Data Capture
_**Hosting a Model in Amazon SageMaker and Capturing Inference requests, results, and metadata**_

*NOTE - THIS FEATURE IS CONFIDENTIAL AND SHARED UNDER NDA. THE FEATURE IS IN BETA AND SHOULD NOT BE USED FOR PRODUCTION ENDPOINTS. THIS FEATURE IS CURRENTLY IN DEVELOPMENT AND THE API SPECIFICATIONS MAY CHANGE*

---
## Background

Amazon SageMaker provides every developer and data scientist with the ability to build, train, and deploy machine learning models quickly. Amazon SageMaker is a fully-managed service that covers the entire machine learning workflow. You can label and prepare your data, choose an algorithm, train a model, and then tune and optimize it for deployment. Amazon SageMaker gets your models into production to make predictions or take actions with less effort and lower costs than was previously possible.

Amazon SageMaker is adding new capabilities that monitor ML models in production and detect deviations in data quality relative to a baseline dataset (for example, the training dataset). They enable you to capture the metadata, input, and output for invocations of the models that you deploy with Amazon SageMaker. They also enable you to analyze the data and monitor its quality.

This notebook shows you how to capture the model invocation data from an endpoint and then view that data in S3. Soon, we plan to provide an additional example that shows how to analyze the data collected. 

---
## Setup

Let's start by specifying:

* The AWS region used to host your model.
* The IAM role ARN used to give learning and hosting access to your data. See the documentation for how to specify these.
* The S3 bucket used to store the data used to train your model, any additional model data, and the data captured from model invocations.

In [None]:
%%time

import os
import boto3
import re
import json
from sagemaker import get_execution_role

region = boto3.Session().region_name

role = get_execution_role()

bucket = 'reinvent-sagemaker-deployment-2019'  # replace with the name of your own S3 bucket (create it first)
prefix = 'sagemaker/movie-recommendations-xgboost'

## Add the SageMaker Internal Model

This step is required to enable the data capture feature for beta.

In [None]:
## First, make sure you have access to "sagemaker-2017-07-24.normal.json".
## This is needed for the beta period.
!aws configure add-model --service-model file://sagemaker-2017-07-24.normal.json --service-name sagemaker-internal

## Deploy model on a SageMaker Endpoint

### Upload the pre-trained model to S3

This step uploads a pre-trained XGBoost model that is ready for you to deploy. You can also use your own pre-trained model in this step. If you already have a pre-trained model in S3, you can use it instead by specifying its `s3_key`.
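
The upload described here could look like the following minimal sketch. It assumes the model archive is a local file (the name `model.tar.gz` matches the key used later in this notebook) and that the S3 client is passed in. The helper name `upload_model_artifact` is hypothetical and not part of any SDK.

```python
import os


def upload_model_artifact(s3_client, local_path, bucket, prefix):
    """Upload a local model archive to s3://<bucket>/<prefix>/model.tar.gz.

    Hypothetical helper for illustration; s3_client is expected to be a
    boto3 S3 client, for example boto3.client("s3").
    """
    if not os.path.isfile(local_path):
        raise FileNotFoundError(local_path)
    key = "{}/model.tar.gz".format(prefix.strip("/"))
    # upload_file(Filename, Bucket, Key) is the standard boto3 S3 client call.
    s3_client.upload_file(local_path, bucket, key)
    return "s3://{}/{}".format(bucket, key)


# Typical use in this notebook (assumes model.tar.gz is in the working directory):
#   upload_model_artifact(boto3.client("s3"), "model.tar.gz", bucket, prefix)
```

Passing the client in as an argument keeps the helper independent of credentials and region settings, which are already configured in the setup cell.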

### Import model into hosting
This step creates an Amazon SageMaker model from the model file previously uploaded to S3. If you have already created an Amazon SageMaker model, you can skip this step.

In [None]:
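from sagemaker.amazon.amazon_estimator import get_image_uri
# Look up the built-in XGBoost container image for this region.
# Note: SageMaker Python SDK v2 replaces this helper with sagemaker.image_uris.retrieve('xgboost', region, '0.90-1').
container = get_image_uri(region, 'xgboost', '0.90-1')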

In [None]:
%%time
from time import gmtime, strftime

model_name = "MoviePredictions-EndpointDataCaptureModel-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

##TODO : sagemaker-internal service needs to be changed to 'sagemaker' post GA
sm_client = boto3.client('sagemaker-internal')

# S3 location of the model artifact (model.tar.gz under the prefix)
model_url = 's3://{}/{}/model.tar.gz'.format(bucket, prefix)

print(model_url)

primary_container = {
    'Image': container,
    'ModelDataUrl': model_url,
}

create_model_response = sm_client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = primary_container)

print(create_model_response['ModelArn'])

### Create Endpoint Configuration

This step is required to deploy an Amazon SageMaker model on an endpoint. To enable data capture for monitoring model data quality, specify the new capture option called `DataCaptureConfig`. With this configuration you can capture the request payload, the response payload, or both. Data capture is configured at the endpoint configuration level and applies to all variants. The captured data is stored in JSON format. If you are using your own Amazon SageMaker model, you still need to complete this step to create a new endpoint configuration. The comments in the code highlight the new API parameters for data capture.

In [None]:
from time import gmtime, strftime

data_capture_sub_folder = "datacapture-xgboost-movie-recommendations"
s3_capture_upload_path = 's3://{}/{}'.format(bucket, data_capture_sub_folder)

data_capture_configuration = {
    "EnableCapture": True, # flag turns data capture on and off
    "InitialSamplingPercentage": 90, # sampling rate to capture data. max is 100%
    "DestinationS3Uri": s3_capture_upload_path, # s3 location where captured data is saved
    "CaptureOptions": [
        {
            "CaptureMode": "Output" # The type of capture this option enables. Values can be: [Output/Input]
        },
        {
            "CaptureMode": "Input" # The type of capture this option enables. Values can be: [Output/Input]
        }
    ],
    "CaptureContentTypeHeader": {
        "CsvContentTypes": ["text/csv"], # headers which should signal to decode the payload into CSV format
        "JsonContentTypes": ["application/json"] # headers which should signal to decode the payload into JSON format
    }
}

endpoint_config_name = 'XGBoost-MovieRec-DataCaptureEndpointConfig-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(endpoint_config_name)
create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants=[{
        'InstanceType':'ml.m5.xlarge',
        'InitialInstanceCount':1,
        'InitialVariantWeight':1,
        'ModelName':model_name,
        'VariantName':'AllTrafficVariant'
    }],
    DataCaptureConfig = data_capture_configuration) # This is where the new capture options are applied

print("Endpoint Config Arn: " + create_endpoint_config_response['EndpointConfigArn'])

### Create Endpoint
This step uses the endpoint configuration specified above to create an endpoint. Creation takes approximately 9 minutes to complete.

In [None]:
%%time
import time

endpoint_name = 'XGBoost-MovieRec-DataCaptureEndpoint-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(endpoint_name)
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name)
print(create_endpoint_response['EndpointArn'])

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp['EndpointStatus']
print("Status: " + status)

while status=='Creating':
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp['EndpointStatus']
    print("Status: " + status)

print("Arn: " + resp['EndpointArn'])
print("Status: " + status)
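
The polling loop above waits for as long as the endpoint reports `Creating` and does not distinguish a successful deployment from a failed one. A slightly more defensive variant is sketched below. The helper name `wait_for_endpoint` and its timeout defaults are assumptions for illustration; it relies only on the `describe_endpoint` call already used in this notebook.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=60,
                      timeout_seconds=1800, sleep=time.sleep):
    """Poll describe_endpoint until the endpoint leaves 'Creating'.

    Hypothetical helper: raises TimeoutError if the endpoint is still being
    created after timeout_seconds, and RuntimeError if creation ends in any
    status other than 'InService'.
    """
    waited = 0
    while True:
        resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        if status != "Creating":
            break
        if waited >= timeout_seconds:
            raise TimeoutError(
                "Endpoint {} still Creating after {} s".format(endpoint_name, waited))
        sleep(poll_seconds)
        waited += poll_seconds
    if status != "InService":
        raise RuntimeError("Endpoint {} ended in status {}: {}".format(
            endpoint_name, status, resp.get("FailureReason", "no reason given")))
    return status


# Typical use in this notebook:
#   wait_for_endpoint(sm_client, endpoint_name)
```

The `sleep` parameter exists only so the wait can be replaced in tests; in the notebook the default `time.sleep` is what you want.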

## Invoke the Deployed Model

You can now send data to this endpoint to get inferences in real time. Because data capture was enabled in the previous steps, the request and response payloads, along with some additional metadata, are saved in the S3 location you specified.

In [None]:
runtime_client = boto3.client('runtime.sagemaker')

In [None]:
# Check that we can predict with a subset of the data.
with open('movielens_test.csv', 'r') as f:
    contents = f.readlines()

for line in contents[:20]:
    # Strip the trailing newline before splitting the CSV row
    split_data = line.strip().split(',')
    # Remove the original rating value (first column) from the data used for prediction
    original_value = split_data.pop(0)

    payload = ','.join(split_data)

    response = runtime_client.invoke_endpoint(EndpointName=endpoint_name,
                                              ContentType='text/csv',
                                              Body=payload)
    prediction = response['Body'].read().decode('utf-8')

    print("Original Value ", original_value, "Prediction : ", float(prediction))
    

This step invokes the endpoint with the included sample data, sending 40 requests about one second apart. Data is captured according to the sampling percentage specified in the endpoint configuration. Capture continues for as long as the endpoint is running and receiving requests.

In [None]:
##TODO : Fix the test data so we can run this for the entire dataset
for line in contents[:40]:
    # Split off the original rating (first column) and send the remaining features
    split_data = line.strip().split(',')
    original_value = split_data.pop(0)
    payload = ','.join(split_data)

    response = runtime_client.invoke_endpoint(EndpointName=endpoint_name,
                                              ContentType='text/csv',
                                              Body=payload)
    prediction = response['Body'].read().decode('utf-8')
    print("Original Value ", original_value, "Prediction : ", float(prediction))
    time.sleep(1)  # pace requests at roughly one per second

## View Captured Data

Now let's list the data capture files stored in S3. Expect to see files from different time periods, organized by the hour in which the invocation occurred. The format of the S3 path is:

`s3://bucket-name/endpoint-name/year/month/day/hour/variant-name/filename.jsonl`

**NOTE:** This path is subject to change during the beta period
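
To group capture files by hour or by variant, the object key can be split into the components named in the path format above. The sketch below assumes exactly that layout, which may change during the beta. The function name `parse_capture_key` and the example key are hypothetical.

```python
from collections import namedtuple

# Field order mirrors the documented path:
# endpoint-name/year/month/day/hour/variant-name/filename.jsonl
CaptureKey = namedtuple(
    "CaptureKey", "endpoint year month day hour variant filename")


def parse_capture_key(key):
    """Split an S3 object key into its capture-path components.

    Only the last seven path segments are used, so any leading prefix
    (such as a destination sub-folder) is ignored.
    """
    parts = key.strip("/").split("/")
    if len(parts) < 7:
        raise ValueError("unexpected capture key layout: {}".format(key))
    return CaptureKey(*parts[-7:])


# Hypothetical key, for illustration only:
example = parse_capture_key(
    "capture/my-endpoint/2019/12/03/17/AllTrafficVariant/part-0.jsonl")
print(example.hour, example.variant)
```

Counting segments from the end keeps the parser working whether or not the destination URI includes an extra prefix.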

In [None]:
s3_client = boto3.Session().client('s3')
result = s3_client.list_objects(Bucket=bucket, Prefix=data_capture_sub_folder)
print("data_capture_sub_folder : ", data_capture_sub_folder)
# 'Contents' is absent when nothing has been captured yet, so default to an empty list
capture_files = [capture_file.get("Key") for capture_file in result.get('Contents', [])]
print("Found Capture Files:")
print("\n".join(capture_files))
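
A single `list_objects` call returns at most 1,000 keys, so a long-running endpoint can accumulate more capture files than one call will show. The sketch below uses the boto3 `list_objects_v2` paginator to collect every key under the prefix. The helper name `list_capture_keys` is hypothetical.

```python
def list_capture_keys(s3_client, bucket, prefix):
    """Return every object key under the prefix, following pagination.

    s3_client is expected to be a boto3 S3 client.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page has no 'Contents' entry when the prefix matches nothing.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys


# Typical use in this notebook:
#   capture_files = list_capture_keys(s3_client, bucket, data_capture_sub_folder)
```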

Next, let's view the contents of a single capture file. You should see all of the captured data in a JSON Lines file (one JSON record per line). For simplicity, the code shows only the first 2,000 characters of the file.

In [None]:
def get_obj_body(obj_key):
    return s3_client.get_object(Bucket=bucket, Key=obj_key).get('Body').read().decode("utf-8")

capture_file = get_obj_body(capture_files[-1])
print(capture_file[:2000])

Finally, let's view the contents of a single line. This follows the data capture naming convention that you provided during the Endpoint Config setup.

In [None]:
import json
print(json.dumps(json.loads(capture_file.split('\n')[0]), indent=2))
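
To work with every record rather than only the first line, the whole file can be parsed line by line. The sketch below makes no assumption about the fields inside each record, since the capture schema may change during the beta. The function name `parse_capture_records` and the sample text are hypothetical.

```python
import json


def parse_capture_records(text):
    """Parse JSON Lines text into a list of Python objects.

    Blank lines are skipped; a malformed line raises ValueError that
    names the offending line number.
    """
    records = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as err:
            raise ValueError(
                "line {} is not valid JSON: {}".format(line_number, err))
    return records


# Hypothetical sample text; real capture files have their own schema.
sample = '{"id": 1}\n\n{"id": 2}\n'
print(len(parse_capture_records(sample)))
```

In the notebook, `parse_capture_records(capture_file)` would return one object per captured invocation in that file.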

## Analyzing collected data for data quality issues

The data analysis feature is not yet enabled in the beta. We will reach out to you when it becomes available.

## (Optional) Delete the Resources

You can keep your endpoint running to continue capturing data. If you do not plan to collect more data or use this endpoint further, you should delete the endpoint to avoid incurring additional charges. Note that deleting your endpoint does not delete the data that was captured during the model invocations. That data persists in S3 until you delete it yourself.

In [None]:
sm_client.delete_endpoint(EndpointName=endpoint_name)

In [None]:
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)

In [None]:
sm_client.delete_model(ModelName=model_name)