# SageMaker Serverless Inference


---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. 

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/serverless-inference|Serverless-Inference-Walkthrough.ipynb)

---

In [39]:
import boto3

boto3.__version__

'1.38.35'

## XGBoost Regression Example

Amazon SageMaker Serverless Inference is a purpose-built inference option that makes it easy to deploy and scale ML models. It is well suited to workloads that have idle periods between traffic spurts and can tolerate cold starts. Serverless endpoints automatically launch compute resources and scale them in and out depending on traffic, so you do not need to choose instance types or manage scaling policies.

In this notebook we use the SageMaker XGBoost algorithm to train a model and then deploy it to a serverless endpoint. The example uses the Abalone regression dataset, which is available in a public S3 bucket.

<b>Notebook Setting</b>
- <b>SageMaker Classic Notebook Instance</b>: ml.m5.xlarge Notebook Instance & `conda_python3` Kernel
- <b>SageMaker Studio</b>: Python 3 (Data Science)
- <b>Regions Available</b>: SageMaker Serverless Inference is currently available in the following regions: US East (Northern Virginia), US East (Ohio), US West (Oregon), EU (Ireland), Asia Pacific (Tokyo) and Asia Pacific (Sydney)

## Table of Contents
- Setup
- Model Training
- Deployment
    - Model Creation
    - Endpoint Configuration Creation
    - Serverless Endpoint Creation
    - Endpoint Invocation
- Clean Up

## Setup

To run this notebook, the notebook's IAM execution role must have <b>SageMaker Full Access</b> (the `AmazonSageMakerFullAccess` managed policy).

Start by upgrading the SageMaker Python SDK, `botocore`, `boto3`, and the AWS CLI (Command Line Interface).

In [1]:
! pip install sagemaker botocore boto3 awscli --upgrade

Collecting sagemaker
  Downloading sagemaker-2.246.0-py3-none-any.whl.metadata (17 kB)
Collecting botocore
  Downloading botocore-1.38.35-py3-none-any.whl.metadata (5.7 kB)
Collecting boto3
  Downloading boto3-1.38.35-py3-none-any.whl.metadata (6.6 kB)
Collecting awscli
  Downloading awscli-1.40.34-py3-none-any.whl.metadata (11 kB)
Collecting cloudpickle>=2.2.1 (from sagemaker)
  Using cached cloudpickle-3.1.1-py3-none-any.whl.metadata (7.1 kB)
Collecting docker (from sagemaker)
  Using cached docker-7.1.0-py3-none-any.whl.metadata (3.8 kB)
Collecting fastapi (from sagemaker)
  Using cached fastapi-0.115.12-py3-none-any.whl.metadata (27 kB)
Collecting google-pasta (from sagemaker)
  Using cached google_pasta-0.2.0-py3-none-any.whl.metadata (814 bytes)
Collecting graphene<4,>=3 (from sagemaker)
  Downloading graphene-3.4.3-py2.py3-none-any.whl.metadata (6.9 kB)
Collecting importlib-metadata<7.0,>=1.4.0 (from sagemaker)
  Using cached importlib_metadata-6.11.0-py3-none-any.whl.metadata (

In [2]:
# Setup clients
import boto3

client = boto3.client(service_name="sagemaker")
runtime = boto3.client(service_name="sagemaker-runtime")

### SageMaker Setup
To begin, we import the AWS SDK for Python (Boto3) and the SageMaker Python SDK, then set up our environment, including an IAM role and an S3 bucket to store our data.

In [4]:
import boto3
import sagemaker
from sagemaker.estimator import Estimator

boto_session = boto3.session.Session()
region = boto_session.region_name
print(region)

sagemaker_session = sagemaker.Session()
base_job_prefix = "xgboost-example"
default_bucket_prefix = sagemaker_session.default_bucket_prefix

# If a default bucket prefix is specified, append it to the s3 path
if default_bucket_prefix:
    base_job_prefix = f"{default_bucket_prefix}/{base_job_prefix}"

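# Execution role ARN hard-coded for this run; replace it with your own role ARN,
# or call sagemaker.get_execution_role() when running inside SageMaker.
role = "arn:aws:iam::794038231401:role/service-role/AmazonSageMaker-ExecutionRole-20250509T182764"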
print(role)

default_bucket = sagemaker_session.default_bucket()
s3_prefix = base_job_prefix

training_instance_type = "ml.m5.xlarge"

us-east-1
arn:aws:iam::794038231401:role/service-role/AmazonSageMaker-ExecutionRole-20250509T182764


Retrieve the Abalone dataset from a publicly hosted S3 bucket.

In [5]:
# retrieve data
s3 = boto3.client("s3")
s3.download_file(
    f"sagemaker-example-files-prod-{region}",
    "datasets/tabular/uci_abalone/train_csv/abalone_dataset1_train.csv",
    "abalone_dataset1_train.csv",
)

Upload the Abalone dataset to the default S3 bucket.

In [6]:
# upload data to S3
!aws s3 cp abalone_dataset1_train.csv s3://{default_bucket}/xgboost-regression/train.csv

upload: ./abalone_dataset1_train.csv to s3://sagemaker-us-east-1-794038231401/xgboost-regression/train.csv
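
The download and upload steps above both address objects by bucket and key. As a minimal sketch (the helper names `join_s3_uri` and `split_s3_uri` are illustrative and are not part of boto3 or the SageMaker SDK), composing and splitting such URIs can be done with plain string handling:

```python
from typing import Tuple


def join_s3_uri(bucket: str, *parts: str) -> str:
    """Build an s3:// URI from a bucket and key segments (hypothetical helper)."""
    # Drop empty segments and stray slashes so the key has no doubled separators.
    key = "/".join(p.strip("/") for p in parts if p and p.strip("/"))
    return f"s3://{bucket}/{key}" if key else f"s3://{bucket}"


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an s3:// URI into (bucket, key); the key may be empty."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {uri!r}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket in {uri!r}")
    return bucket, key
```

For example, `join_s3_uri(default_bucket, "xgboost-regression", "train.csv")` would produce the same training path that later cells build with an f-string.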


## Model Training

Now, we train an ML model using the XGBoost Algorithm. In this example, we use a SageMaker-provided [XGBoost](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) container image and configure an estimator to train our model.

In [7]:
from sagemaker.inputs import TrainingInput

training_path = f"s3://{default_bucket}/xgboost-regression/train.csv"
train_input = TrainingInput(training_path, content_type="text/csv")

In [8]:
model_path = f"s3://{default_bucket}/{s3_prefix}/xgb_model"

# retrieve xgboost image
image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=region,
    version="1.0-1",
    py_version="py3",
    instance_type=training_instance_type,
)

# Configure Training Estimator
xgb_train = Estimator(
    image_uri=image_uri,
    instance_type=training_instance_type,
    instance_count=1,
    output_path=model_path,
    sagemaker_session=sagemaker_session,
    role=role,
)

# Set Hyperparameters
xgb_train.set_hyperparameters(
    objective="reg:linear",
    num_round=50,
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.7,
    silent=0,
)

Train the model on the Abalone dataset.

In [9]:
# Fit model
xgb_train.fit({"train": train_input})

INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2025-06-11-23-42-08-404


2025-06-11 23:42:09 Starting - Starting the training job...
2025-06-11 23:42:43 Downloading - Downloading input data...
2025-06-11 23:43:09 Downloading - Downloading the training image......
2025-06-11 23:44:13 Training - Training image download completed. Training in progress.
2025-06-11 23:44:13 Uploading - Uploading generated training model
2025-06-11 23:44:13 Completed - Training job completed
[2025-06-11 23:43:55.887 ip-10-0-160-101.ec2.internal:7 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value reg:linear to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root

## Deployment

After training the model, retrieve the model artifacts so that we can deploy the model to an endpoint.

In [23]:
# Retrieve model data from training job
model_artifacts = xgb_train.model_data
model_artifacts

's3://sagemaker-us-east-1-794038231401/xgboost-example/xgb_model/sagemaker-xgboost-2025-06-11-23-42-08-404/output/model.tar.gz'

### Model Creation
Create a model by providing your model artifacts, the container image URI, environment variables for the container (if applicable), a model name, and the SageMaker IAM role.

In [24]:
from time import gmtime, strftime

model_name = "xgboost-serverless" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Model name: " + model_name)

# dummy environment variables
byo_container_env_vars = {"SAGEMAKER_CONTAINER_LOG_LEVEL": "20", "SOME_ENV_VAR": "myEnvVar"}

create_model_response = client.create_model(
    ModelName=model_name,
    Containers=[
        {
            "Image": image_uri,
            "Mode": "SingleModel",
            "ModelDataUrl": model_artifacts,
            "Environment": byo_container_env_vars,
        }
    ],
    ExecutionRoleArn=role,
)

print("Model Arn: " + create_model_response["ModelArn"])

Model name: xgboost-serverless2025-06-12-12-37-12
Model Arn: arn:aws:sagemaker:us-east-1:794038231401:model/xgboost-serverless2025-06-12-12-37-12
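
The names above are built by concatenating a prefix and a timestamp with no separator. A small sketch of a name builder that inserts a hyphen and checks the result follows. The helper `timestamped_name` is hypothetical, and the 63-character limit and the letters, digits, and hyphens pattern are assumptions based on the documented constraints for SageMaker model and endpoint names; verify them against the current API reference.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Assumed constraints: at most 63 characters, alphanumerics separated by hyphens.
_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")
_MAX_NAME_LENGTH = 63


def timestamped_name(prefix: str, now: Optional[datetime] = None) -> str:
    """Return '<prefix>-YYYY-MM-DD-HH-MM-SS' (UTC by default), validated."""
    now = now or datetime.now(timezone.utc)
    name = f"{prefix}-{now.strftime('%Y-%m-%d-%H-%M-%S')}"
    if len(name) > _MAX_NAME_LENGTH or not _NAME_PATTERN.match(name):
        raise ValueError(f"invalid SageMaker resource name: {name!r}")
    return name
```

Failing early on an invalid name keeps the error local instead of surfacing later as a validation error from `create_model` or `create_endpoint`.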


### Endpoint Configuration Creation

This is where you adjust the <b>Serverless Configuration</b> for your endpoint. `MaxConcurrency`, the maximum number of concurrent invocations for a single endpoint, can be any value from <b>1 to 200</b>, and `MemorySizeInMB` must be one of the following: <b>1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, or 6144 MB</b>.

In [25]:
xgboost_epc_name = "xgboost-serverless-epc" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

endpoint_config_response = client.create_endpoint_config(
    EndpointConfigName=xgboost_epc_name,
    ProductionVariants=[
        {
            "VariantName": "byoVariant",
            "ModelName": model_name,
            "ServerlessConfig": {
                "MemorySizeInMB": 1024,
                "MaxConcurrency": 10,
            },
        },
    ],
)

print("Endpoint Configuration Arn: " + endpoint_config_response["EndpointConfigArn"])

Endpoint Configuration Arn: arn:aws:sagemaker:us-east-1:794038231401:endpoint-config/xgboost-serverless-epc2025-06-12-12-37-17
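
The limits described in this section can be checked locally before calling `create_endpoint_config`. The sketch below assumes the ranges quoted above (1 to 200 for `MaxConcurrency` and the six listed memory sizes); `build_serverless_config` is a hypothetical helper, not an SDK function, and account-level quotas may be lower than these limits.

```python
# Values taken from the limits quoted in this notebook; treat them as assumptions.
ALLOWED_MEMORY_SIZES_MB = (1024, 2048, 3072, 4096, 5120, 6144)
MIN_CONCURRENCY, MAX_CONCURRENCY = 1, 200


def build_serverless_config(memory_size_in_mb: int, max_concurrency: int) -> dict:
    """Validate the inputs and return a dict usable as a ServerlessConfig value."""
    if memory_size_in_mb not in ALLOWED_MEMORY_SIZES_MB:
        raise ValueError(
            f"MemorySizeInMB must be one of {ALLOWED_MEMORY_SIZES_MB}, got {memory_size_in_mb}"
        )
    if not MIN_CONCURRENCY <= max_concurrency <= MAX_CONCURRENCY:
        raise ValueError(
            f"MaxConcurrency must be between {MIN_CONCURRENCY} and {MAX_CONCURRENCY}, "
            f"got {max_concurrency}"
        )
    return {"MemorySizeInMB": memory_size_in_mb, "MaxConcurrency": max_concurrency}
```

The returned dictionary has the same shape as the `ServerlessConfig` entry passed in the production variant above.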


### Serverless Endpoint Creation
Now that we have an endpoint configuration, we can create a serverless endpoint and deploy our model to it. When creating the endpoint, provide the name of your endpoint configuration and a name for the new endpoint.

In [26]:
endpoint_name = "xgboost-serverless-ep" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

create_endpoint_response = client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=xgboost_epc_name,
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

Endpoint Arn: arn:aws:sagemaker:us-east-1:794038231401:endpoint/xgboost-serverless-ep2025-06-12-12-37-37


Wait until the endpoint status is `InService` before invoking the endpoint.

In [27]:
# Poll describe_endpoint until the endpoint leaves the "Creating" state
import time

describe_endpoint_response = client.describe_endpoint(EndpointName=endpoint_name)

while describe_endpoint_response["EndpointStatus"] == "Creating":
    time.sleep(15)
    describe_endpoint_response = client.describe_endpoint(EndpointName=endpoint_name)
    print(describe_endpoint_response["EndpointStatus"])

describe_endpoint_response

{'EndpointName': 'xgboost-serverless-ep2025-06-12-12-37-37',
 'EndpointArn': 'arn:aws:sagemaker:us-east-1:794038231401:endpoint/xgboost-serverless-ep2025-06-12-12-37-37',
 'EndpointConfigName': 'xgboost-serverless-epc2025-06-12-12-37-17',
 'ProductionVariants': [{'VariantName': 'byoVariant',
   'DeployedImages': [{'SpecifiedImage': '683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.0-1-cpu-py3',
     'ResolvedImage': '683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost@sha256:da43a3b51e4fddd7743132d10eb2578d42c33f1a4d256bb4eaad349d4515b9b7',
     'ResolutionTime': datetime.datetime(2025, 6, 12, 6, 37, 38, 328000, tzinfo=tzlocal())}],
   'CurrentWeight': 1.0,
   'DesiredWeight': 1.0,
   'CurrentInstanceCount': 0,
   'CurrentServerlessConfig': {'MemorySizeInMB': 1024, 'MaxConcurrency': 10}}],
 'EndpointStatus': 'InService',
 'CreationTime': datetime.datetime(2025, 6, 12, 6, 37, 37, 598000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2025, 6, 12, 6,
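
The polling loop above can be generalized. The following is a minimal sketch, not part of boto3: `wait_for_status` and its parameters are hypothetical names, and the set of pending states is an assumption. It polls a caller-supplied status function, returns the first non-pending status, and raises after a timeout so a stuck endpoint does not block the notebook indefinitely.

```python
import time
from typing import Callable, FrozenSet


def wait_for_status(
    get_status: Callable[[], str],
    pending: FrozenSet[str] = frozenset({"Creating", "Updating"}),
    poll_seconds: float = 15.0,
    timeout_seconds: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it returns a status outside `pending`."""
    deadline = time.monotonic() + timeout_seconds
    status = get_status()
    while status in pending:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_seconds} seconds")
        sleep(poll_seconds)
        status = get_status()
    # The caller decides what to do with "InService", "Failed", and so on.
    return status
```

In this notebook it could be called as `wait_for_status(lambda: client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"])`. boto3 also provides an `endpoint_in_service` waiter on the SageMaker client (`client.get_waiter("endpoint_in_service")`), which may be preferable when no custom handling is needed.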

### Endpoint Invocation
Invoke the endpoint by sending a request to it. The request body below is a single sample row taken from the Abalone CSV file downloaded earlier.

In [28]:
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=b".345,0.224414,.131102,0.042329,.279923,-0.110329,-0.099358,0.0",
    ContentType="text/csv",
)

print(response["Body"].read())

b'4.566554546356201'
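
The request body and the response are both raw bytes. As a sketch under the assumption of a single-row request and a single-value response (as in the call above), the conversion in each direction might look like this; `to_csv_body` and `parse_prediction` are hypothetical helper names.

```python
from typing import Sequence


def to_csv_body(features: Sequence[float]) -> bytes:
    """Serialize one feature row as a comma-separated, UTF-8 encoded body."""
    return ",".join(repr(float(x)) for x in features).encode("utf-8")


def parse_prediction(body: bytes) -> float:
    """Parse a single-value response body such as b'4.566554546356201'."""
    return float(body.decode("utf-8").strip())
```

With these helpers, `parse_prediction(response["Body"].read())` would return the prediction as a Python float rather than a byte string. Multi-row responses would need additional splitting, which this sketch does not attempt.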


In [29]:
import os
import tarfile
from io import StringIO

import boto3
import pandas as pd

# Build a sample payload archive from the training data uploaded earlier.
# The Inference Recommender job in the next cell reads it via SamplePayloadUrl.
# Requires default_bucket from the setup cell.

training_path = f"s3://{default_bucket}/xgboost-regression/train.csv"

# Download the training data
s3 = boto3.client('s3')
bucket, key = training_path.replace("s3://", "").split("/", 1)
obj = s3.get_object(Bucket=bucket, Key=key)
data = obj['Body'].read().decode('utf-8')

# Read into pandas DataFrame
df = pd.read_csv(StringIO(data), header=None)

# The CSV has no header row and the first column is the target, so features start at column 1
features = df.iloc[:, 1:]

# Select a subset, say first 10 rows
sample_df = features.head(10)

# Save to local CSV
local_csv = "sample.csv"
sample_df.to_csv(local_csv, index=False, header=False)

# Create tar.gz file
local_tar = "payload.tar.gz"
with tarfile.open(local_tar, "w:gz") as tar:
    tar.add(local_csv, arcname=os.path.basename(local_csv))

# Upload to S3
sample_key = "xgboost-regression/payload.tar.gz"
s3.upload_file(local_tar, bucket, sample_key)

sample_payload_url = f"s3://{bucket}/{sample_key}"
print(f"Sample payload uploaded to {sample_payload_url}")

# Optionally, clean up local files
os.remove(local_csv)
os.remove(local_tar)

Sample payload uploaded to s3://sagemaker-us-east-1-794038231401/xgboost-regression/payload.tar.gz
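
The cell above writes temporary files to disk and removes them afterwards. An alternative sketch builds the same kind of archive in memory; `build_payload_archive` is a hypothetical helper and the default member name `sample.csv` simply mirrors the file name used above.

```python
import io
import tarfile


def build_payload_archive(csv_text: str, member_name: str = "sample.csv") -> bytes:
    """Return a gzip-compressed tar archive containing one CSV member."""
    data = csv_text.encode("utf-8")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        info = tarfile.TarInfo(name=member_name)
        info.size = len(data)  # the size must be set before adding the member
        tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()
```

The resulting bytes could then be uploaded with `s3.put_object(Bucket=bucket, Key=sample_key, Body=archive)`, which avoids the local clean-up step.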


In [62]:
import time
from datetime import datetime

import boto3

# Initialize SageMaker client
sagemaker_client = boto3.client('sagemaker')

# Reuse the endpoint configuration created earlier in this notebook
endpoint_config_name = xgboost_epc_name

# Get the model name from the endpoint config
response = sagemaker_client.describe_endpoint_config(EndpointConfigName=endpoint_config_name)
model_name = response['ProductionVariants'][0]['ModelName']

# IAM role used to run the recommendation job
role_arn = role

# Job names must be unique, so append the current Unix timestamp
job_name = f'my-serverless-recommendation-job-{int(datetime.now().timestamp())}'

response = sagemaker_client.create_inference_recommendations_job(
    JobName=job_name,
    JobType='Advanced',
    RoleArn=role_arn,
    StoppingConditions= {
        'MaxInvocations': 10000,  # Maximum requests per minute to test
        'ModelLatencyThresholds': [
            {
                'Percentile': 'P95',  # 95th percentile latency
                'ValueInMilliseconds': 1000  # Stop if P95 latency exceeds 1000 ms
            }
        ]
    },
    InputConfig={
        'ModelName': model_name,
        'EndpointConfigurations': [
            {
                'ServerlessConfig': {
                    'MemorySizeInMB': 1024,
                    'MaxConcurrency': 200
                }
            },
            {
                'ServerlessConfig': {
                    'MemorySizeInMB': 2048,
                    'MaxConcurrency': 200
                }
            },
            {
                'ServerlessConfig': {
                    'MemorySizeInMB': 4096,
                    'MaxConcurrency': 200
                }
            }
        ],
        'ContainerConfig': {
            'Framework': 'XGBoost',
            'FrameworkVersion': '1.0-1',
            'PayloadConfig': {
                'SamplePayloadUrl': sample_payload_url,
                'SupportedContentTypes': ['text/csv']
            }
        },
        'JobDurationInSeconds': 15000,
        'TrafficPattern': {
            'TrafficType': 'STAIRS',
            'Stairs': {
                'DurationInSeconds': 120,
                'NumberOfSteps': 60,
                'UsersPerStep': 3
            }
        },
        'ResourceLimit': {
            'MaxNumberOfTests': 10,
            'MaxParallelOfTests': 5
        },
    }
)
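
To sanity-check the `STAIRS` settings against `JobDurationInSeconds`, the arithmetic can be written out. This sketch assumes that each step lasts `DurationInSeconds` and adds `UsersPerStep` new simulated users, which is how the API reference describes the fields; `summarize_stairs` is a hypothetical helper.

```python
def summarize_stairs(duration_in_seconds: int, number_of_steps: int, users_per_step: int) -> dict:
    """Compute the total ramp time and peak user count of a STAIRS pattern."""
    return {
        # Every step runs for the same duration, one after another.
        "total_ramp_seconds": duration_in_seconds * number_of_steps,
        # Users accumulate: each step spawns users_per_step additional users.
        "peak_users": number_of_steps * users_per_step,
    }


summary = summarize_stairs(duration_in_seconds=120, number_of_steps=60, users_per_step=3)
print(summary)
```

Under that assumption, the values used in the job above give a 7200-second ramp that peaks at 180 users, which fits within the 15000-second job duration for a single test.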

In [63]:
print(f"Inference recommendation job created: {response['JobArn']}")

# Wait for the job to complete
while True:
    response = sagemaker_client.describe_inference_recommendations_job(JobName=job_name)
    status = response['Status']
    if status in ['COMPLETED', 'FAILED', 'STOPPED']:
        break
    print(f"Job status: {status}")
    time.sleep(60)

if status == 'COMPLETED':
    recommendations = response['InferenceRecommendations']
    for rec in recommendations:
        if 'ServerlessConfig' in rec['EndpointConfiguration']:
            serverless_config = rec['EndpointConfiguration']['ServerlessConfig']
            memory_size = serverless_config['MemorySizeInMB']
            max_concurrency = serverless_config['MaxConcurrency']
            metrics = rec['Metrics']
            print(f"MemorySizeInMB: {memory_size}, MaxConcurrency: {max_concurrency}")
            print(f"Metrics: {metrics}")
else:
    print(f"Job did not complete successfully. Status: {status}")

Inference recommendation job created: arn:aws:sagemaker:us-east-1:794038231401:inference-recommendations-job/my-serverless-recommendation-job-1749749311
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
Job status: IN_PROGRESS
MemorySizeInMB: 2048, MaxConcurrency: 200
Metrics: {'CostPerHour': 0.14399999380111694, 'CostPerInference': 1.8453022221365245e-07, 'MaxInvocations': 13006, 'ModelLatency': 4, 'MemoryUtilization': 17.245738983154297, 'ModelSetupTime': 5700835}
MemorySizeInMB: 1024, MaxConcurrency: 200
Metrics: {'CostPerHour': 0.07199999690055847, 'CostPerInference': 8.48476275905341e-08, 'MaxInvocations':
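
The recommendations are easier to compare once sorted. The sketch below ranks serverless recommendations by `CostPerInference`; `rank_by_cost_per_inference` is a hypothetical helper, and the sample input uses only the two values visible in the output above, in the same shape as `response['InferenceRecommendations']`.

```python
from typing import Iterable, List


def rank_by_cost_per_inference(recommendations: Iterable[dict]) -> List[dict]:
    """Return serverless recommendations sorted from cheapest to most expensive."""
    rows = []
    for rec in recommendations:
        serverless = rec.get("EndpointConfiguration", {}).get("ServerlessConfig")
        if serverless is None:
            continue  # skip instance-based recommendations
        rows.append(
            {
                "MemorySizeInMB": serverless["MemorySizeInMB"],
                "MaxConcurrency": serverless["MaxConcurrency"],
                "CostPerInference": rec["Metrics"]["CostPerInference"],
            }
        )
    return sorted(rows, key=lambda row: row["CostPerInference"])


# Sample input limited to the fields shown in the job output above.
sample = [
    {
        "EndpointConfiguration": {"ServerlessConfig": {"MemorySizeInMB": 2048, "MaxConcurrency": 200}},
        "Metrics": {"CostPerInference": 1.8453022221365245e-07},
    },
    {
        "EndpointConfiguration": {"ServerlessConfig": {"MemorySizeInMB": 1024, "MaxConcurrency": 200}},
        "Metrics": {"CostPerInference": 8.48476275905341e-08},
    },
]
ranked = rank_by_cost_per_inference(sample)
print(ranked[0]["MemorySizeInMB"])
```

Cost is only one criterion; latency and `MaxInvocations` from the same metrics dictionary would usually be weighed alongside it.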

## Clean Up
Delete any resources you created in this notebook that you no longer wish to use.

In [None]:
client.delete_model(ModelName=model_name)
client.delete_endpoint_config(EndpointConfigName=xgboost_epc_name)
client.delete_endpoint(EndpointName=endpoint_name)
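
If one of the delete calls above fails, the calls after it never run. A defensive variant is sketched below: `delete_serverless_resources` is a hypothetical helper that takes the boto3 SageMaker client as an argument, deletes the endpoint first, then the endpoint configuration and the model, and collects failures instead of stopping at the first one.

```python
from typing import List


def delete_serverless_resources(
    client, endpoint_name: str, endpoint_config_name: str, model_name: str
) -> List[str]:
    """Attempt every deletion and return a list of failure messages."""
    steps = [
        ("endpoint", lambda: client.delete_endpoint(EndpointName=endpoint_name)),
        (
            "endpoint-config",
            lambda: client.delete_endpoint_config(EndpointConfigName=endpoint_config_name),
        ),
        ("model", lambda: client.delete_model(ModelName=model_name)),
    ]
    failures = []
    for label, call in steps:
        try:
            call()
        except Exception as exc:  # keep going so later resources are still removed
            failures.append(f"{label}: {exc}")
    return failures
```

It could be invoked as `delete_serverless_resources(client, endpoint_name, xgboost_epc_name, model_name)`; an empty return value means every deletion succeeded.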

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/serverless-inference|Serverless-Inference-Walkthrough.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/serverless-inference|Serverless-Inference-Walkthrough.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/serverless-inference|Serverless-Inference-Walkthrough.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/serverless-inference|Serverless-Inference-Walkthrough.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/serverless-inference|Serverless-Inference-Walkthrough.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/serverless-inference|Serverless-Inference-Walkthrough.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/serverless-inference|Serverless-Inference-Walkthrough.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/serverless-inference|Serverless-Inference-Walkthrough.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/serverless-inference|Serverless-Inference-Walkthrough.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/serverless-inference|Serverless-Inference-Walkthrough.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/serverless-inference|Serverless-Inference-Walkthrough.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/serverless-inference|Serverless-Inference-Walkthrough.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/serverless-inference|Serverless-Inference-Walkthrough.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/serverless-inference|Serverless-Inference-Walkthrough.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/serverless-inference|Serverless-Inference-Walkthrough.ipynb)
