# XGBoost AutoScaling Example

Amazon SageMaker supports automatic scaling (autoscaling) for your hosted models. Autoscaling dynamically adjusts the number of instances provisioned for a model in response to changes in your workload. When the workload increases, autoscaling brings more instances online. When the workload decreases, autoscaling removes unnecessary instances so that you don't pay for provisioned instances that you aren't using.

**Define a scaling policy**


To specify the metrics and target values for a scaling policy, you configure a target-tracking scaling policy. You can use either a predefined metric or a custom metric.

A scaling policy configuration is represented by a JSON block, which you save in a text file. You pass that file when invoking the AWS CLI or the Application Auto Scaling API. For more information about policy configuration syntax, see TargetTrackingScalingPolicyConfiguration in the Application Auto Scaling API Reference.

The following options are available for defining a target-tracking scaling policy configuration.


**Use a predefined metric**

To quickly define a target-tracking scaling policy for a variant, use the SageMakerVariantInvocationsPerInstance predefined metric. SageMakerVariantInvocationsPerInstance is the average number of times per minute that each instance for a variant is invoked. We strongly recommend using this metric.

To use a predefined metric in a scaling policy, create a target tracking configuration for your policy. In the target tracking configuration, include a PredefinedMetricSpecification for the predefined metric and a TargetValue for the target value of that metric.
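
As a minimal sketch, the configuration described above can be built as a Python dictionary and written out as the JSON block that the AWS CLI reads. The target value of 70 and the file name `config.json` are illustrative assumptions, not values taken from this notebook.

```python
import json
import os
import tempfile

# Target-tracking configuration that uses the predefined SageMaker metric.
# TargetValue is an assumed example; derive a real value from load tests of your model.
predefined_config = {
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
        "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
    },
}

# Save the JSON block to a text file; "config.json" is a hypothetical file name.
config_path = os.path.join(tempfile.gettempdir(), "config.json")
with open(config_path, "w") as config_file:
    json.dump(predefined_config, config_file, indent=4)

print(json.dumps(predefined_config, indent=4))
```

The same dictionary can also be passed directly as `TargetTrackingScalingPolicyConfiguration` when calling the Application Auto Scaling API from Boto3, as Cell 016 does later in this notebook.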

**Use a custom metric**

If you need to define a target-tracking scaling policy that meets your custom requirements, define a custom metric. You can define a custom metric based on any production variant metric that changes in proportion to scaling.

Not all SageMaker metrics work for target tracking. The metric must be a valid utilization metric, and it must describe how busy an instance is. The value of the metric must increase or decrease in inverse proportion to the number of variant instances. That is, the value of the metric should decrease when the number of instances increases.
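
A hedged sketch of a custom-metric configuration follows. It assumes that average `CPUUtilization` from the `/aws/sagemaker/Endpoints` CloudWatch namespace is a suitable utilization metric for your variant. The helper name `build_custom_metric_config`, the endpoint name `my-endpoint`, and the 50 percent target are hypothetical.

```python
import json


def build_custom_metric_config(endpoint_name, variant_name, target_value):
    """Return a target-tracking configuration based on a custom CloudWatch metric."""
    return {
        "TargetValue": target_value,
        "CustomizedMetricSpecification": {
            # Assumed metric: average CPU utilization of the variant's instances.
            "MetricName": "CPUUtilization",
            "Namespace": "/aws/sagemaker/Endpoints",
            "Dimensions": [
                {"Name": "EndpointName", "Value": endpoint_name},
                {"Name": "VariantName", "Value": variant_name},
            ],
            "Statistic": "Average",
            "Unit": "Percent",
        },
    }


# "my-endpoint" and the 50.0 target are placeholders for illustration.
custom_config = build_custom_metric_config("my-endpoint", "AllTraffic", 50.0)
print(json.dumps(custom_config, indent=4))
```

Average CPU utilization fits the rule above because, for a fixed workload, it falls as more instances share the requests.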


## Start Lab 1: Prepare a Real-Time Endpoint for Bring Your Own Model

In [None]:
# Cell 01

import boto3
import sagemaker
from sagemaker.estimator import Estimator

boto_session = boto3.session.Session()
region = boto_session.region_name

sagemaker_session = sagemaker.Session()
base_job_prefix = 'xgboost-example'
role = sagemaker.get_execution_role()

default_bucket = sagemaker_session.default_bucket()
s3_prefix = base_job_prefix

training_instance_type = 'ml.m5.xlarge'

BUCKET=sagemaker_session.default_bucket()
print(BUCKET)

## Download Data and Prepare Training Input in S3

In [None]:
# Cell 02
!aws s3 cp s3://sagemaker-sample-files/datasets/tabular/uci_abalone/train_csv/abalone_dataset1_train.csv .  
    

In [None]:
# Cell 03
!aws s3 cp abalone_dataset1_train.csv s3://{default_bucket}/xgboost-regression/train.csv  
    

## Retrieve XGBoost Image for the Model Container

In [None]:
# Cell 04
model_path = f's3://{default_bucket}/{s3_prefix}/xgb_model'

image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=region,
    version="1.0-1",
    py_version="py3",
    instance_type=training_instance_type,
)



## Upload the Model for Real-Time Inference to S3

In [None]:
# Cell 05
# Upload the local model archive to S3 and keep the resulting S3 URI
model_artifacts = sagemaker.s3.S3Uploader().upload(
    local_path='./models/realtime/model.tar.gz',
    desired_s3_uri=f"s3://{BUCKET}/models/realtime",
)
model_artifacts

In [None]:
# Cell 06
model_artifacts

## Create SageMaker Client to Create Model, Endpoint Config, and Endpoint

In [None]:
# Cell 07
sm_client = boto3.client(service_name='sagemaker')

## Model Creation

In [None]:
# Cell 08
from time import gmtime, strftime
model_name = 'xgboost-uploaded' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print('Model name: ' + model_name)

reference_container = {
    "Image": image_uri,
    "ModelDataUrl": model_artifacts
}

create_model_response = sm_client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer= reference_container)

print("Model Arn: " + create_model_response['ModelArn'])

## Endpoint Config Creation

In [None]:
# Cell 09
endpoint_config_name = 'xgboost-config' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
instance_type='ml.m4.xlarge'
print('Endpoint config name: ' + endpoint_config_name)

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants=[{
        'InstanceType': instance_type,
        'InitialInstanceCount': 1,
        'InitialVariantWeight': 1,
        'ModelName': model_name,
        'VariantName': 'AllTraffic',
        }])

print("Endpoint config Arn: " + create_endpoint_config_response['EndpointConfigArn'])

## Endpoint Creation

In [None]:
%%time
# Cell 010

import time

endpoint_name = 'xgboost-uploaded' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print('Endpoint name: ' + endpoint_name)

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name)
print('Endpoint Arn: ' + create_endpoint_response['EndpointArn'])

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp['EndpointStatus']
print("Endpoint Status: " + status)

print('Waiting for {} endpoint to be in service...'.format(endpoint_name))
waiter = sm_client.get_waiter('endpoint_in_service')
waiter.wait(EndpointName=endpoint_name)

## Sample Invocation

In [None]:
# Cell 011
import boto3
smr = boto3.client('sagemaker-runtime')

resp = smr.invoke_endpoint(EndpointName=endpoint_name, Body=b'.345,0.224414,.131102,0.042329,.279923,-0.110329,-0.099358,0.0', 
                           ContentType='text/csv')

print(resp['Body'].read())
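
The request body above is a single CSV row typed by hand. As a small sketch, a helper (hypothetical name `to_csv_payload`) can build the same kind of payload from a list of feature values, assuming the model expects one comma-separated row of numbers.

```python
def to_csv_payload(features):
    """Encode one row of numeric features as a UTF-8 CSV line for a text/csv request."""
    return ",".join(str(value) for value in features).encode("utf-8")


# The same feature values that the sample invocation sends.
sample_features = [0.345, 0.224414, 0.131102, 0.042329, 0.279923, -0.110329, -0.099358, 0.0]
payload = to_csv_payload(sample_features)
print(payload)
```

The resulting bytes can be passed as `Body=payload` to `invoke_endpoint` together with `ContentType='text/csv'`.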

## END LAB 1

In [None]:
# Cell 012
print("Excellent, you have created a real-time endpoint with your own model")

## START Lab 2

## AutoScaling SageMaker Real-Time Endpoint

Here we define a scaling policy based on invocations per instance. We set the maximum instance count to 3. We can define this using the Boto3 SDK. There are several types of scaling policies: Simple Scaling, Target Tracking Scaling, Step Scaling, Scheduled Scaling, and On-Demand Scaling. In this lab we use Target Tracking Scaling with the invocations-per-instance metric as the basis for scaling.
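
To build intuition for target tracking, the sketch below estimates how many instances would bring the per-instance load back to the target. It is a simplified proportional model, not the exact algorithm that Application Auto Scaling runs. The function name and the example loads are assumptions; the capacity limits of 1 and 3 match the values registered in Cell 013.

```python
import math


def estimate_desired_instances(current_instances, invocations_per_instance,
                               target_value, min_capacity=1, max_capacity=3):
    """Estimate the instance count that returns the per-instance load to the target."""
    # Total load is what all current instances handle together.
    total_load = current_instances * invocations_per_instance
    # Spread that load so each instance sits at or below the target.
    desired = math.ceil(total_load / target_value)
    # Respect the registered capacity limits.
    return max(min_capacity, min(max_capacity, desired))


print(estimate_desired_instances(1, 120.0, 70.0))  # 2
print(estimate_desired_instances(1, 500.0, 70.0))  # 3 (capped at max_capacity)
```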

The cell below registers the endpoint variant as a scalable target, which sets the minimum and maximum instance counts that autoscaling may use.

In [None]:
# Cell 013
# AutoScaling client
asg = boto3.client('application-autoscaling')

# Resource type is variant and the unique identifier is the resource ID.
resource_id=f"endpoint/{endpoint_name}/variant/AllTraffic"

# Register the variant as a scalable target with 1 to 3 instances
response = asg.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=3
)
print(f"registered:scalable:{response}::")


In [None]:
# Cell 014
# scaling policies - should be empty since no policy has been attached yet
asg.describe_scaling_policies(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
)


In [None]:
# Cell 015
# scaling activities - should be nothing since we do not have any alarms triggered
asg.describe_scaling_activities(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
)

In [None]:
# Cell 016

# Target tracking: keep invocations per instance per minute near the target value
response = asg.put_scaling_policy(
    PolicyName='SagemakerEndpointInvocationScalingPolicy',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance',
        },
        'TargetValue': 0.5,  # a very low target, so even light traffic triggers scale-out
        'ScaleOutCooldown': 30,  # seconds to wait between scale-out activities
        'DisableScaleIn': True  # this policy only adds instances and never removes them
    }
)
print(f"Target invocations created: {response}")


In [None]:
# Cell 017
from threading import Thread
import time
invoke_endpoint=True

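def invoke_endpoint_forever():
    # Each thread creates its own runtime client
    smr_local = boto3.client('sagemaker-runtime')
    # Keep sending requests until the global invoke_endpoint flag is set to False (Cell 025)
    while invoke_endpoint:
        try:
            resp = smr_local.invoke_endpoint(
                EndpointName=endpoint_name, 
                Body=b'.345,0.224414,.131102,0.042329,.279923,-0.110329,-0.099358,0.0', 
                ContentType='text/csv')
            time.sleep(0.0005)

        except Exception:
            # Ignore failed requests (for example, throttling) and keep generating load
            pass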




In [None]:
# Cell 018
# - Thread 1
thread1 = Thread(target=invoke_endpoint_forever)
thread1.start()

# - Thread 2
thread2 = Thread(target=invoke_endpoint_forever)
thread2.start()
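
The threads above stop only when the global `invoke_endpoint` flag is flipped in Cell 025. A hedged alternative sketch uses `threading.Event` as the stop signal and joins the threads, so you know when the load has really ended. The names `run_load` and `fake_request` are hypothetical, and `fake_request` stands in for the real `invoke_endpoint` call so that the sketch runs without AWS access.

```python
import threading
import time


def run_load(send_request, stop_event, pause_seconds=0.0005):
    """Call send_request repeatedly until stop_event is set."""
    while not stop_event.is_set():
        try:
            send_request()
        except Exception as error:
            # Report the failure and keep generating load.
            print(f"request failed: {error}")
        time.sleep(pause_seconds)


stop_event = threading.Event()
counts = []


def fake_request():
    """Stand-in for smr.invoke_endpoint(...); records that a request was made."""
    counts.append(1)


workers = [threading.Thread(target=run_load, args=(fake_request, stop_event)) for _ in range(2)]
for worker in workers:
    worker.start()

time.sleep(0.05)   # let the workers run briefly
stop_event.set()   # signal both workers to stop
for worker in workers:
    worker.join()  # wait until both loops have exited

print(f"sent {len(counts)} requests")
```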


In [None]:
# Cell 019
request_duration = 250
end_time = time.time() + request_duration
print(f"test will run for {request_duration} seconds")
while time.time() < end_time:
    resp = smr.invoke_endpoint(EndpointName=endpoint_name, Body=b'.345,0.224414,.131102,0.042329,.279923,-0.110329,-0.099358,0.0', 
                           ContentType='text/csv')
    
print("Test finished:time to look at stats")

In [None]:
# Cell 020
# scaling activities - scale-out events triggered by the load test should now be listed
asg.describe_scaling_activities(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount', 
)
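
The raw response can be long. The sketch below condenses it into one line per activity, oldest first. It assumes the response has the shape returned by `describe_scaling_activities`, where each entry in `ScalingActivities` carries `StartTime`, `StatusCode`, and `Description`. The sample response is made up for illustration, and the helper name is hypothetical.

```python
from datetime import datetime


def summarize_scaling_activities(response):
    """Return (start time, status, description) tuples, oldest activity first."""
    rows = [
        (activity["StartTime"], activity["StatusCode"], activity["Description"])
        for activity in response.get("ScalingActivities", [])
    ]
    return sorted(rows, key=lambda row: row[0])


# Made-up response for illustration only; real values come from
# asg.describe_scaling_activities(...).
sample_response = {
    "ScalingActivities": [
        {"StartTime": datetime(2024, 1, 1, 12, 5), "StatusCode": "Successful",
         "Description": "Setting desired instance count to 3."},
        {"StartTime": datetime(2024, 1, 1, 12, 0), "StatusCode": "Successful",
         "Description": "Setting desired instance count to 2."},
    ]
}

for start, status, description in summarize_scaling_activities(sample_response):
    print(start.isoformat(), status, description)
```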

You can monitor these invocations through CloudWatch, which you can access from the SageMaker console.

You can zoom in on the graph to examine the InvocationsPerInstance metric in more detail.

<img src='./images/AutoScale1.png' width="900" height="400">
<img src='./images/AutoScale2.png' width="900" height="400">

In [None]:
# Cell 021
response = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = response['EndpointStatus']
print("Status: " + status)


while status=='Updating':
    time.sleep(1)
    response = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = response['EndpointStatus']
    instance_count = response['ProductionVariants'][0]['CurrentInstanceCount']
    print(f"Status: {status}")
    print(f"Current Instance count: {instance_count}")

print("Update completed!")
response = sm_client.describe_endpoint(EndpointName=endpoint_name)
instance_count = response['ProductionVariants'][0]['CurrentInstanceCount']
print(f"Status: {status}")
print(f"Current Instance count: {instance_count}")
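
The loop above polls every second with no upper bound. As a sketch under stated assumptions, the helper below polls less often and gives up after a timeout. The name `wait_while_updating`, the 15-second interval, and the 30-minute timeout are all hypothetical choices, and `fake_describe` stands in for `sm_client.describe_endpoint` so that the example runs without AWS access.

```python
import time


def wait_while_updating(describe, poll_seconds=15, timeout_seconds=1800, sleep=time.sleep):
    """Poll describe() until EndpointStatus is no longer 'Updating', or raise on timeout."""
    waited = 0
    response = describe()
    while response["EndpointStatus"] == "Updating":
        if waited >= timeout_seconds:
            raise TimeoutError(f"endpoint still updating after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
        response = describe()
    return response


# Stand-in for the real describe call: reports 'Updating' twice, then 'InService'.
fake_statuses = iter(["Updating", "Updating", "InService"])


def fake_describe():
    return {"EndpointStatus": next(fake_statuses)}


final = wait_while_updating(fake_describe, sleep=lambda seconds: None)
print(final["EndpointStatus"])  # InService
```

Against the real endpoint, the call would look like `wait_while_updating(lambda: sm_client.describe_endpoint(EndpointName=endpoint_name))`.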

In [None]:
# Cell 022
response 

In [None]:
# Cell 023
import pandas as pd
import datetime

cw = boto3.Session().client("cloudwatch")


def get_invocation_metrics_for_endpoint_variant(endpoint_name, variant_name, start_time, end_time):
    metrics = cw.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName="Invocations",
        StartTime=start_time,
        EndTime=end_time,
        Period=60,
        Statistics=["Sum"],
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
    )
    return (
        pd.DataFrame(metrics["Datapoints"])
        .sort_values("Timestamp")
        .set_index("Timestamp")
        .drop("Unit", axis=1)
        .rename(columns={"Sum": variant_name})
    )


def get_instance_count_metrics_for_endpoint_variant(endpoint_name, variant_name, start_time, end_time):
    # Despite its name, this helper queries InvocationsPerInstance, not the instance count
    metrics = cw.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName="InvocationsPerInstance",
        StartTime=start_time,
        EndTime=end_time,
        Period=60,
        Statistics=["Maximum"],
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
    )
    return (
        pd.DataFrame(metrics["Datapoints"])
        .sort_values("Timestamp")
        .set_index("Timestamp")
        .drop("Unit", axis=1)
        # The requested statistic is "Maximum", so that is the column to rename
        .rename(columns={"Maximum": variant_name})
    )


def plot_endpoint_metrics(start_time=None):
    # Use timezone-aware UTC timestamps so the CloudWatch query window is unambiguous
    end_time = datetime.datetime.now(datetime.timezone.utc)
    start_time = start_time or end_time - datetime.timedelta(minutes=60)
    metrics_variant = get_invocation_metrics_for_endpoint_variant(
        endpoint_name, 'AllTraffic', start_time, end_time
    )
    metrics_variant.plot()
    
    metrics_variant = get_instance_count_metrics_for_endpoint_variant(
        endpoint_name, 'AllTraffic', start_time, end_time
    )
    metrics_variant.plot()
    return metrics_variant

In [None]:
# Cell 024
#time.sleep(20)  # let metrics catch up
plot_endpoint_metrics()
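
If CloudWatch has not published any datapoints yet, `pd.DataFrame(metrics["Datapoints"])` is empty and `sort_values("Timestamp")` raises a `KeyError`. The sketch below shows one way to order datapoints that also tolerates an empty result. It uses plain Python so that it runs standalone; the helper name and the sample datapoints are made up for illustration.

```python
from datetime import datetime


def datapoints_to_series(datapoints, statistic):
    """Return (timestamp, value) pairs sorted by time; an empty list if there is no data."""
    pairs = [(point["Timestamp"], point[statistic]) for point in datapoints]
    return sorted(pairs, key=lambda pair: pair[0])


# Made-up datapoints in the shape that get_metric_statistics returns under "Datapoints".
sample_datapoints = [
    {"Timestamp": datetime(2024, 1, 1, 12, 1), "Sum": 40.0, "Unit": "None"},
    {"Timestamp": datetime(2024, 1, 1, 12, 0), "Sum": 25.0, "Unit": "None"},
]

series = datapoints_to_series(sample_datapoints, "Sum")
print(series)
print(datapoints_to_series([], "Sum"))  # [] instead of a KeyError
```

If the result is empty, waiting a minute or two for metrics to catch up (as the commented-out `time.sleep` in Cell 024 suggests) and retrying is usually enough.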

In [None]:
# Cell 025
invoke_endpoint=False

In [None]:
# Cell 026
# Deregister the scalable target; this also removes its scaling policies
response_de = asg.deregister_scalable_target(
        ServiceNamespace='sagemaker',
        ResourceId=resource_id,
        ScalableDimension='sagemaker:variant:DesiredInstanceCount'
)
print(f"deregistered:scalable:{response_de}::")