# XGBoost AutoScaling Example

Amazon SageMaker supports automatic scaling (autoscaling) for your hosted models. Autoscaling dynamically adjusts the number of instances provisioned for a model in response to changes in your workload. When the workload increases, autoscaling brings more instances online. When the workload decreases, autoscaling removes unnecessary instances so that you don't pay for provisioned instances that you aren't using.

**Define a scaling policy**


To specify the metrics and target values for a scaling policy, you configure a target-tracking scaling policy. You can use either a predefined metric or a custom metric.

A scaling policy configuration is represented by a JSON block. You save that JSON block in a text file and pass the file to the AWS CLI or the Application Auto Scaling API. For more information about policy configuration syntax, see TargetTrackingScalingPolicyConfiguration in the Application Auto Scaling API Reference.

The following options are available for defining a target-tracking scaling policy configuration.


**Use a predefined metric**

To quickly define a target-tracking scaling policy for a variant, use the SageMakerVariantInvocationsPerInstance predefined metric. SageMakerVariantInvocationsPerInstance is the average number of times per minute that each instance for a variant is invoked. We strongly recommend using this metric.

To use a predefined metric in a scaling policy, create a target tracking configuration for your policy. In the target tracking configuration, include a PredefinedMetricSpecification for the predefined metric and a TargetValue for the target value of that metric.
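As a minimal sketch, the target tracking configuration described above can be written as a Python dictionary and serialized to the JSON block that you would save in a text file. The target value of 70.0 and the file name `config.json` mentioned in the comments are illustrative assumptions, not values taken from this notebook.

```python
import json

# Target-tracking configuration that uses the predefined metric.
# The TargetValue (70 invocations per instance per minute) is an
# illustrative assumption; pick a value based on your own load tests.
target_tracking_config = {
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
        "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
    },
}

# Serialize to the JSON block you would save in a text file
# (for example, a hypothetical config.json) for use with the AWS CLI.
config_json = json.dumps(target_tracking_config, indent=2)
print(config_json)
```

The same dictionary can be passed directly as `TargetTrackingScalingPolicyConfiguration` when calling `put_scaling_policy` through Boto3.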

**Use a custom metric**

If you need to define a target-tracking scaling policy that meets your custom requirements, define a custom metric. You can define a custom metric based on any production variant metric that changes in proportion to scaling.

Not all SageMaker metrics work for target tracking. The metric must be a valid utilization metric, and it must describe how busy an instance is. The value of the metric must increase or decrease in inverse proportion to the number of variant instances. That is, the value of the metric should decrease when the number of instances increases.
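To make the inverse-proportion requirement concrete, here is a small arithmetic sketch. It assumes a steady, evenly balanced workload; the helper `per_instance_load` and the figure of 1200 invocations per minute are hypothetical and used only for illustration.

```python
def per_instance_load(total_invocations_per_minute: float, instance_count: int) -> float:
    """Average invocations per minute handled by each instance."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return total_invocations_per_minute / instance_count


total = 1200.0  # hypothetical steady workload, in invocations per minute
loads = {n: per_instance_load(total, n) for n in (1, 2, 3, 4)}
for n, load in loads.items():
    print(f"{n} instance(s): {load:.0f} invocations per instance per minute")
```

A metric that behaves this way gives target tracking a usable signal: adding instances pushes the value down toward the target, and removing instances pushes it back up.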


In [34]:
# Cell 01
import boto3
import sagemaker
from sagemaker.estimator import Estimator

boto_session = boto3.session.Session()
region = boto_session.region_name

sagemaker_session = sagemaker.Session()
base_job_prefix = 'xgboost-example'
role = sagemaker.get_execution_role()

default_bucket = sagemaker_session.default_bucket()
s3_prefix = base_job_prefix

training_instance_type = 'ml.m5.xlarge'

## Download Data and Prepare Training Input in S3

In [5]:
!aws s3 cp s3://sagemaker-sample-files/datasets/tabular/uci_abalone/train_csv/abalone_dataset1_train.csv .  
    

download: s3://sagemaker-sample-files/datasets/tabular/uci_abalone/train_csv/abalone_dataset1_train.csv to ./abalone_dataset1_train.csv


In [6]:
!aws s3 cp abalone_dataset1_train.csv s3://{default_bucket}/xgboost-regression/train.csv  
    

upload: ./abalone_dataset1_train.csv to s3://sagemaker-us-east-1-622343165275/xgboost-regression/train.csv


In [7]:
from sagemaker.inputs import TrainingInput
training_path = f's3://{default_bucket}/xgboost-regression/train.csv'
train_input = TrainingInput(training_path, content_type="text/csv")

## Retrieve XGBoost Image and Prepare Training Estimator with Hyperparameters

In [8]:
model_path = f's3://{default_bucket}/{s3_prefix}/xgb_model'

image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=region,
    version="1.0-1",
    py_version="py3",
    instance_type=training_instance_type,
)

xgb_train = Estimator(
    image_uri=image_uri,
    instance_type=training_instance_type,
    instance_count=1,
    output_path=model_path,
    sagemaker_session=sagemaker_session,
    role=role
)

xgb_train.set_hyperparameters(
    objective="reg:linear",
    num_round=50,
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.7,
    silent=0,
)

## Model Training

In [9]:
xgb_train.fit({'train': train_input})

2022-09-30 18:34:55 Starting - Starting the training job...
2022-09-30 18:35:15 Starting - Preparing the instances for trainingProfilerReport-1664562895: InProgress
......
2022-09-30 18:36:21 Downloading - Downloading input data...
2022-09-30 18:36:41 Training - Downloading the training image...
2022-09-30 18:37:21 Training - Training image download completed. Training in progress..
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value reg:linear to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
[18:37:23] 2923x8 matrix with 23384 entries loaded from /opt/ml/input/data/train?for

## Retrieve Model Data

In [10]:
model_artifacts = xgb_train.model_data
model_artifacts

's3://sagemaker-us-east-1-622343165275/xgboost-example/xgb_model/sagemaker-xgboost-2022-09-30-18-34-55-039/output/model.tar.gz'

## Upload the Model to S3 for Real-Time Inference

In [154]:
import sagemaker

BUCKET=sagemaker.Session().default_bucket()
print(BUCKET)

sagemaker-us-east-1-622343165275


In [158]:
model_artifacts = sagemaker.s3.S3Uploader().upload(
    
    local_path='./models/realtime/model.tar.gz',
    desired_s3_uri=f"s3://{BUCKET}/models/realtime",
)
model_artifacts

's3://sagemaker-us-east-1-622343165275/models/realtime/model.tar.gz'


## Create SageMaker Client to Create Model, Endpoint Config, and Endpoint

In [160]:
sm_client = boto3.client(service_name='sagemaker')

## Model Creation

In [190]:
from time import gmtime, strftime
model_name = 'xgboost-uploaded' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print('Model name: ' + model_name)

reference_container = {
    "Image": image_uri,
    "ModelDataUrl": model_artifacts
}

create_model_response = sm_client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer= reference_container)

print("Model Arn: " + create_model_response['ModelArn'])

Model name: xgboost-uploaded2022-10-01-00-58-03
Model Arn: arn:aws:sagemaker:us-east-1:622343165275:model/xgboost-uploaded2022-10-01-00-58-03


## Endpoint Config Creation

In [191]:
endpoint_config_name = 'xgboost-config' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
instance_type='ml.m4.xlarge'
print('Endpoint config name: ' + endpoint_config_name)

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants=[{
        'InstanceType': instance_type,
        'InitialInstanceCount': 1,
        'InitialVariantWeight': 1,
        'ModelName': model_name,
        'VariantName': 'AllTraffic',
        }])

print("Endpoint config Arn: " + create_endpoint_config_response['EndpointConfigArn'])

Endpoint config name: xgboost-config2022-10-01-00-58-07
Endpoint config Arn: arn:aws:sagemaker:us-east-1:622343165275:endpoint-config/xgboost-config2022-10-01-00-58-07


## Endpoint Creation

In [192]:
%%time

import time

endpoint_name = 'xgboost-uploaded' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print('Endpoint name: ' + endpoint_name)

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name)
print('Endpoint Arn: ' + create_endpoint_response['EndpointArn'])

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp['EndpointStatus']
print("Endpoint Status: " + status)

print('Waiting for {} endpoint to be in service...'.format(endpoint_name))
waiter = sm_client.get_waiter('endpoint_in_service')
waiter.wait(EndpointName=endpoint_name)

Endpoint name: xgboost-uploaded2022-10-01-00-58-11
Endpoint Arn: arn:aws:sagemaker:us-east-1:622343165275:endpoint/xgboost-uploaded2022-10-01-00-58-11
Endpoint Status: Creating
Waiting for xgboost-uploaded2022-10-01-00-58-11 endpoint to be in service...
CPU times: user 120 ms, sys: 7.97 ms, total: 128 ms
Wall time: 4min 1s


## Sample Invocation

In [196]:
import boto3
smr = boto3.client('sagemaker-runtime')

resp = smr.invoke_endpoint(EndpointName=endpoint_name, Body=b'.345,0.224414,.131102,0.042329,.279923,-0.110329,-0.099358,0.0', 
                           ContentType='text/csv')

print(resp['Body'].read())

b'4.566554546356201'
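The request body above is a single CSV line of feature values. As a minimal sketch, a payload like this can be built from a Python list instead of being typed by hand. The helper name `to_csv_payload` is hypothetical and not part of the SageMaker SDK.

```python
def to_csv_payload(features) -> bytes:
    """Encode one record of numeric features as a single CSV line."""
    return ",".join(str(value) for value in features).encode("utf-8")


# Same feature values as the sample invocation, written as floats.
payload = to_csv_payload(
    [0.345, 0.224414, 0.131102, 0.042329, 0.279923, -0.110329, -0.099358, 0.0]
)
print(payload)
```

The resulting bytes can be passed as `Body` to `invoke_endpoint` with `ContentType='text/csv'`.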


## AutoScaling SageMaker Real-Time Endpoint

Here we define a scaling policy based on invocations per instance, using the Boto3 SDK, and cap the endpoint at a maximum of 3 instances. There are different types of scaling policies, such as target tracking scaling, step scaling, and scheduled scaling. For this example we use target tracking scaling with the SageMakerVariantInvocationsPerInstance metric as the basis for scaling.

In [227]:
# AutoScaling client
asg = boto3.client('application-autoscaling')

# Resource type is variant and the unique identifier is the resource ID.
resource_id=f"endpoint/{endpoint_name}/variant/AllTraffic"

# Register the endpoint variant as a scalable target that can run 1 to 3 instances
response = asg.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=3
)
print(f"registered:scalable:{response}::")


registered:scalable:{'ResponseMetadata': {'RequestId': '3fe55f40-37fe-4701-9b1b-7a06993e7ab6', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '3fe55f40-37fe-4701-9b1b-7a06993e7ab6', 'content-type': 'application/x-amz-json-1.1', 'content-length': '2', 'date': 'Sat, 01 Oct 2022 01:58:19 GMT'}, 'RetryAttempts': 0}}::


In [236]:
# List the scaling policies attached to the endpoint variant
asg.describe_scaling_policies(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
)


{'ScalingPolicies': [{'PolicyARN': 'arn:aws:autoscaling:us-east-1:622343165275:scalingPolicy:ee74f36f-117a-4e14-afd4-3fe2ffc640cb:resource/sagemaker/endpoint/xgboost-uploaded2022-10-01-00-58-11/variant/AllTraffic:policyName/SagemakerEndpointInvocationScalingPolicy',
   'PolicyName': 'SagemakerEndpointInvocationScalingPolicy',
   'ServiceNamespace': 'sagemaker',
   'ResourceId': 'endpoint/xgboost-uploaded2022-10-01-00-58-11/variant/AllTraffic',
   'ScalableDimension': 'sagemaker:variant:DesiredInstanceCount',
   'PolicyType': 'TargetTrackingScaling',
   'TargetTrackingScalingPolicyConfiguration': {'TargetValue': 0.5,
    'PredefinedMetricSpecification': {'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'},
    'ScaleOutCooldown': 30,
    'DisableScaleIn': True},
   'Alarms': [{'AlarmName': 'TargetTracking-endpoint/xgboost-uploaded2022-10-01-00-58-11/variant/AllTraffic-AlarmHigh-8073af5b-268e-422d-b91d-024dbe66febe',
     'AlarmARN': 'arn:aws:cloudwatch:us-east-1:6223431652

In [237]:
# List recent scaling activities for the endpoint variant
asg.describe_scaling_activities(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
)

{'ScalingActivities': [{'ActivityId': 'c4fb7e94-5d50-4f22-afeb-0d3d31f20c7c',
   'ServiceNamespace': 'sagemaker',
   'ResourceId': 'endpoint/xgboost-uploaded2022-10-01-00-58-11/variant/AllTraffic',
   'ScalableDimension': 'sagemaker:variant:DesiredInstanceCount',
   'Description': 'Setting desired instance count to 1.',
   'Cause': 'monitor alarm TargetTracking-endpoint/xgboost-uploaded2022-10-01-00-58-11/variant/AllTraffic-AlarmLow-32d6e02e-c8c9-4185-815c-19fbd29d1a9c in state ALARM triggered policy SagemakerEndpointInvocationScalingPolicy',
   'StartTime': datetime.datetime(2022, 10, 1, 1, 38, 17, 454000, tzinfo=tzlocal()),
   'EndTime': datetime.datetime(2022, 10, 1, 1, 38, 52, 74000, tzinfo=tzlocal()),
   'StatusCode': 'Successful',
   'StatusMessage': 'Successfully set desired instance count to 1. Change successfully fulfilled by sagemaker.'},
  {'ActivityId': '0250cc7e-662e-453e-98ea-09156822d5fa',
   'ServiceNamespace': 'sagemaker',
   'ResourceId': 'endpoint/xgboost-uploaded202
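The raw `describe_scaling_activities` response is verbose. As a minimal sketch, the fields that matter most for following scaling events can be pulled into a compact, time-ordered list. The helper `summarize_scaling_activities` is hypothetical, and `sample_response` is made-up data shaped like the output above (the second activity does not come from this notebook's run).

```python
import datetime


def summarize_scaling_activities(response):
    """Return (start_time, status_code, description) tuples, oldest first."""
    activities = response.get("ScalingActivities", [])
    rows = [(a["StartTime"], a["StatusCode"], a["Description"]) for a in activities]
    return sorted(rows, key=lambda row: row[0])


# Made-up sample shaped like a describe_scaling_activities response.
sample_response = {
    "ScalingActivities": [
        {
            "StartTime": datetime.datetime(2022, 10, 1, 1, 38, 17),
            "StatusCode": "Successful",
            "Description": "Setting desired instance count to 1.",
        },
        {
            "StartTime": datetime.datetime(2022, 10, 1, 1, 30, 0),
            "StatusCode": "Successful",
            "Description": "Setting desired instance count to 2.",
        },
    ]
}

summary = summarize_scaling_activities(sample_response)
for start, status, description in summary:
    print(start.isoformat(), status, description)
```

With a live endpoint, the same function could be applied to the dictionary returned by `asg.describe_scaling_activities(...)`.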

In [233]:

# Target tracking scaling policy
response = asg.put_scaling_policy(
    PolicyName='SagemakerEndpointInvocationScalingPolicy',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'PredefinedMetricSpecification': {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
        },
        'TargetValue': 0.5,  # target average invocations per instance per minute
        'ScaleOutCooldown': 30,  # seconds to wait after a scale-out activity before starting another
        "DisableScaleIn": True  # this policy never removes instances
    }
)
print(f"Target invocations created: {response}")


Target invocations created: {'PolicyARN': 'arn:aws:autoscaling:us-east-1:622343165275:scalingPolicy:ee74f36f-117a-4e14-afd4-3fe2ffc640cb:resource/sagemaker/endpoint/xgboost-uploaded2022-10-01-00-58-11/variant/AllTraffic:policyName/SagemakerEndpointInvocationScalingPolicy', 'Alarms': [{'AlarmName': 'TargetTracking-endpoint/xgboost-uploaded2022-10-01-00-58-11/variant/AllTraffic-AlarmHigh-8073af5b-268e-422d-b91d-024dbe66febe', 'AlarmARN': 'arn:aws:cloudwatch:us-east-1:622343165275:alarm:TargetTracking-endpoint/xgboost-uploaded2022-10-01-00-58-11/variant/AllTraffic-AlarmHigh-8073af5b-268e-422d-b91d-024dbe66febe'}], 'ResponseMetadata': {'RequestId': 'd7dc7af1-156e-430a-a873-7a671944e17d', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'd7dc7af1-156e-430a-a873-7a671944e17d', 'content-type': 'application/x-amz-json-1.1', 'content-length': '584', 'date': 'Sat, 01 Oct 2022 02:01:12 GMT'}, 'RetryAttempts': 0}}
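The `TargetValue` of 0.5 used above is very low, presumably so that scaling triggers quickly in this demo. For a production endpoint, one commonly cited starting point from the AWS load-testing guidance is to take the peak requests per second that a single instance sustained in a load test, multiply it by a safety factor (0.5 is the suggested starting value), and convert it to a per-minute figure. The sketch below assumes that formula; the function name and the example figure of 20 requests per second are hypothetical.

```python
def invocations_per_instance_target(max_rps: float, safety_factor: float = 0.5) -> float:
    """Estimate a per-minute target from the load-tested peak RPS of one instance."""
    if max_rps <= 0:
        raise ValueError("max_rps must be positive")
    if not 0 < safety_factor <= 1:
        raise ValueError("safety_factor must be in the range (0, 1]")
    # The metric is reported per minute, so convert from per second.
    return max_rps * safety_factor * 60


# Hypothetical load-test result: one instance sustains 20 requests per second.
target = invocations_per_instance_target(max_rps=20)
print(target)
```

Treat the result as a first estimate and refine it by load testing the endpoint with the policy in place.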


In [171]:
# Example 2 - CPUUtilization custom metric (disabled; change the condition to True to run)
if False:
    response = asg.put_scaling_policy(
        PolicyName='SagemakerEndpointInvocationScalingPolicy1',
        ServiceNamespace='sagemaker',
        ResourceId=resource_id,
        ScalableDimension='sagemaker:variant:DesiredInstanceCount',
        PolicyType='TargetTrackingScaling',
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 40,
            "CustomizedMetricSpecification": {
                "MetricName": "CPUUtilization",
                "Namespace": "/aws/sagemaker/Endpoints",
                "Dimensions": [
                    {"Name": "EndpointName", "Value": endpoint_name},
                    {"Name": "VariantName", "Value": "AllTraffic"},
                ],
                "Statistic": "Average",
                "Unit": "Percent",
            },
            "DisableScaleIn": True,
        },
    )
    print(f"CPU utilization scaling created: {response}")

    
    

In [226]:
# Deregister the scalable target (disabled; change the condition to True to run)
if False:
    response_de = asg.deregister_scalable_target(
        ServiceNamespace='sagemaker',
        ResourceId=resource_id,
        ScalableDimension='sagemaker:variant:DesiredInstanceCount'
    )
    print(f"registered:scalable:{response_de}::")

registered:scalable:{'ResponseMetadata': {'RequestId': '5523d4ab-a969-4bd2-9bf7-31894fa4c32e', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '5523d4ab-a969-4bd2-9bf7-31894fa4c32e', 'content-type': 'application/x-amz-json-1.1', 'content-length': '2', 'date': 'Sat, 01 Oct 2022 01:45:55 GMT'}, 'RetryAttempts': 0}}::


In [239]:
from threading import Thread
import time

# Flag checked by the worker threads; set it to False to stop them.
invoke_endpoint = True

def invoke_endpoint_forever():
    # Each thread uses its own runtime client.
    smr_local = boto3.client('sagemaker-runtime')
    while invoke_endpoint:
        try:
            resp = smr_local.invoke_endpoint(
                EndpointName=endpoint_name,
                Body=b'.345,0.224414,.131102,0.042329,.279923,-0.110329,-0.099358,0.0',
                ContentType='text/csv')
            time.sleep(0.0005)
            #print(resp['Body'].read())
        except Exception:
            # Ignore failed requests and keep generating load.
            pass
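The load-generation loop above stops when a module-level flag is flipped. As an alternative sketch, a `threading.Event` makes the stop signal explicit, and counting failures avoids silently discarding errors. The names `run_load` and `fake_invoke` are hypothetical; `fake_invoke` stands in for the real `invoke_endpoint` call so that the example runs without an endpoint.

```python
import threading
import time


def run_load(invoke, stop_event, counts, pause_seconds=0.0005):
    """Call invoke() repeatedly until stop_event is set, tallying outcomes in counts."""
    while not stop_event.is_set():
        try:
            invoke()
            counts["ok"] += 1
        except Exception:
            counts["failed"] += 1
        time.sleep(pause_seconds)


def fake_invoke():
    """Stand-in for smr_local.invoke_endpoint(...); does nothing."""
    return None


stop_event = threading.Event()
counts = {"ok": 0, "failed": 0}
worker = threading.Thread(target=run_load, args=(fake_invoke, stop_event, counts))
worker.start()

time.sleep(0.05)       # let the worker issue some requests
stop_event.set()       # signal the worker to stop
worker.join(timeout=5)
print(f"requests: {counts}, worker alive: {worker.is_alive()}")
```

Each worker here has its own `counts` dictionary, so no locking is needed; sharing one dictionary across several threads would require a lock.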




In [240]:
# - Thread 1
thread1 = Thread(target=invoke_endpoint_forever)
thread1.start()

# - Thread 2
thread2 = Thread(target=invoke_endpoint_forever)
thread2.start()

# - Thread 3
thread3 = Thread(target=invoke_endpoint_forever)
#thread3.start()

# - Thread 4
thread4 = Thread(target=invoke_endpoint_forever)
#thread4.start()

# - Thread 5
thread5 = Thread(target=invoke_endpoint_forever)
#thread5.start()

In [66]:
#Thread(target=invoke_endpoint_forever).start()

In [248]:
invoke_endpoint=False

In [None]:
request_duration = 250
end_time = time.time() + request_duration
print(f"test will run for {request_duration} seconds")
while time.time() < end_time:
    resp = smr.invoke_endpoint(EndpointName=endpoint_name, Body=b'.345,0.224414,.131102,0.042329,.279923,-0.110329,-0.099358,0.0', 
                           ContentType='text/csv')
    
print("Test finished:time to look at stats")

test will run for 250 seconds


We can monitor these invocations in CloudWatch, which is accessible from the SageMaker console.

<img src='invocations.png' width="900" height="400">

We can zoom in to inspect the InvocationsPerInstance metric more closely.

<img src='./images/AutoScale1.png' width="900" height="400">
<img src='./images/AutoScale2.png' width="900" height="400">

In [None]:
response = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = response['EndpointStatus']
print("Status: " + status)

# Poll once per second while the endpoint is being updated (scaling in progress).
while status == 'Updating':
    time.sleep(1)
    response = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = response['EndpointStatus']
    instance_count = response['ProductionVariants'][0]['CurrentInstanceCount']
    print(f"Status: {status}")
    print(f"Current Instance count: {instance_count}")

print("Update completed!")
# Refresh both the status and the instance count before the final report.
response = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = response['EndpointStatus']
instance_count = response['ProductionVariants'][0]['CurrentInstanceCount']
print(f"Status: {status}")
print(f"Current Instance count: {instance_count}")

In [None]:
response 

In [None]:
import pandas as pd
import datetime

cw = boto3.Session().client("cloudwatch")


def get_invocation_metrics_for_endpoint_variant(endpoint_name, variant_name, start_time, end_time):
    metrics = cw.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName="Invocations",
        StartTime=start_time,
        EndTime=end_time,
        Period=60,
        Statistics=["Sum"],
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
    )
    return (
        pd.DataFrame(metrics["Datapoints"])
        .sort_values("Timestamp")
        .set_index("Timestamp")
        .drop("Unit", axis=1)
        .rename(columns={"Sum": variant_name})
    )


def get_invocations_per_instance_metrics_for_endpoint_variant(endpoint_name, variant_name, start_time, end_time):
    # Fetches InvocationsPerInstance (not the instance count) at one-minute resolution.
    metrics = cw.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName="InvocationsPerInstance",
        StartTime=start_time,
        EndTime=end_time,
        Period=60,
        Statistics=["Sum"],
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
    )
    return (
        pd.DataFrame(metrics["Datapoints"])
        .sort_values("Timestamp")
        .set_index("Timestamp")
        .drop("Unit", axis=1)
        .rename(columns={"Sum": variant_name})
    )


def plot_endpoint_metrics(start_time=None):
    # Use timezone-aware UTC timestamps; Boto3 treats naive datetimes as UTC.
    now = datetime.datetime.now(datetime.timezone.utc)
    start_time = start_time or now - datetime.timedelta(minutes=60)
    end_time = now

    # Plot 1: total invocations per minute.
    metrics_variant = get_invocation_metrics_for_endpoint_variant(
        endpoint_name, 'AllTraffic', start_time, end_time
    )
    metrics_variant.plot()

    # Plot 2: invocations per instance per minute.
    metrics_variant = get_invocations_per_instance_metrics_for_endpoint_variant(
        endpoint_name, 'AllTraffic', start_time, end_time
    )
    metrics_variant.plot()
    return metrics_variant

In [None]:
#time.sleep(20)  # let metrics catch up
plot_endpoint_metrics()