# 🚀 Deploy Qwen2.5 Coder-3B-Instruct Model on Amazon SageMaker AI using LMI

## Introduction: [Qwen2.5 Coder 3B](https://huggingface.co/Qwen/Qwen2.5-Coder-3B-Instruct)

Qwen2.5-Coder is the latest series of Code-Specific Qwen large language models (formerly known as CodeQwen). As of now, Qwen2.5-Coder has covered six mainstream model sizes, 0.5, 1.5, 3, 7, 14, 32 billion parameters, to meet the needs of different developers. Qwen2.5-Coder brings the following improvements upon CodeQwen1.5:

- Significant improvements in code generation, code reasoning and code fixing. Based on the strong Qwen2.5, the training tokens were scaled up to 5.5 trillion, including source code, text-code grounding, synthetic data, etc. Qwen2.5-Coder-32B has become the current state-of-the-art open-source code LLM, with its coding abilities matching those of GPT-4o.
- A more comprehensive foundation for real-world applications such as Code Agents, not only enhancing coding capabilities but also maintaining its strengths in mathematics and general competencies.

This repo contains the instruction-tuned 3B Qwen2.5-Coder model, which has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, Attention QKV bias and tied word embeddings
- Number of Parameters: 3.09B
- Number of Parameters (Non-Embedding): 2.77B
- Number of Layers: 36
- Number of Attention Heads (GQA): 16 for Q and 2 for KV
- Context Length: Full 32,768 tokens

For more details, please refer to the Qwen blog, GitHub, Documentation, and Arxiv paper.
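
The figures above are enough for a rough memory estimate, which helps when picking an instance type. The sketch below is a back-of-the-envelope calculation, not a measurement: the head dimension of 128 and the 16-bit storage for weights and KV cache are assumptions that the list above does not state.

```python
# Rough memory estimate for Qwen2.5-Coder-3B-Instruct.
PARAMS = 3.09e9          # total parameters (from the feature list above)
NUM_LAYERS = 36          # from the feature list above
NUM_KV_HEADS = 2         # GQA key/value heads (from the feature list above)
CONTEXT_LENGTH = 32_768  # from the feature list above
HEAD_DIM = 128           # ASSUMPTION: not stated in the feature list
BYTES_PER_VALUE = 2      # ASSUMPTION: 16-bit weights and KV cache


def weight_bytes(params=PARAMS, bytes_per_value=BYTES_PER_VALUE):
    """Bytes needed to hold the model weights."""
    return params * bytes_per_value


def kv_cache_bytes_per_token(num_layers=NUM_LAYERS, num_kv_heads=NUM_KV_HEADS,
                             head_dim=HEAD_DIM, bytes_per_value=BYTES_PER_VALUE):
    """Bytes of KV cache per token: one key and one value vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


weights_gb = weight_bytes() / 1e9
kv_per_token = kv_cache_bytes_per_token()
kv_full_context_gib = kv_per_token * CONTEXT_LENGTH / 2**30

print(f"Weights: ~{weights_gb:.2f} GB")
print(f"KV cache per token: {kv_per_token} bytes")
print(f"KV cache for one full-length sequence: ~{kv_full_context_gib:.3f} GiB")
```

Under these assumptions the weights take roughly 6 GB, so the model fits comfortably on a single 24 GB accelerator with room left for the KV cache.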

In [1]:
%pip install -Uq sagemaker boto3 huggingface_hub --force-reinstall --no-cache-dir --quiet --no-warn-conflicts

Note: you may need to restart the kernel to use updated packages.


In [2]:
import json
import sagemaker
import boto3
import sys
import time
from typing import List, Dict
from datetime import datetime
from sagemaker.huggingface import (
    HuggingFaceModel, 
    get_huggingface_llm_image_uri
)

boto_region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session(boto_session=boto3.Session(region_name=boto_region))
role = sagemaker.get_execution_role()
sagemaker_client = boto3.client("sagemaker")
sagemaker_runtime_client = boto3.client("sagemaker-runtime")
s3_client = boto3.client("s3")

model_bucket = 'practiceb22' #sagemaker_session.default_bucket()  # bucket to house artifacts
s3_model_prefix = (
    "qwenmodels"  # folder within bucket where code artifact will go
)
prefix = sagemaker.utils.unique_name_from_base("DEMO")
print(f"prefix: {prefix}")


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
prefix: DEMO-1751885837-6db5


## Setup your SageMaker Real-time Endpoint 
### Create a SageMaker endpoint configuration

We begin by creating the endpoint configuration and set MinInstanceCount to 0. This allows the endpoint to scale in all the way down to zero instances when not in use. See the [notebook example for SageMaker AI endpoint scale down to zero](https://github.com/aws-samples/sagemaker-genai-hosting-examples/tree/02236395d44cf54c201eefec01fd8da0a454092d/scale-to-zero-endpoint).

There are a few parameters to set up for the endpoint. We start with the variant name and the instance type. We also set *model_data_download_timeout_in_seconds* and *container_startup_health_check_timeout_in_seconds* as guardrails for deploying inference components to the endpoint. In addition, we use Managed Instance Scaling, which lets SageMaker scale the number of instances according to the needs of your inference components. We set *MinInstanceCount* and *MaxInstanceCount* to size the endpoint for the workload you want to serve while keeping costs under control. Lastly, we set *RoutingStrategy* to control how requests are routed to instances and inference components for the best performance.

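This notebook hosts the Qwen2.5-Coder-3B-Instruct model on a single-GPU `ml.g5.2xlarge` instance, as set in the cell below. Multi-GPU instance types such as `ml.g5.24xlarge`, `ml.g6.12xlarge`, or `ml.g6e.12xlarge` are intended for much larger models and are not required here.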

In [None]:
# Set a unique endpoint config name
endpoint_config_name = f"{prefix}-endpoint-config"
print(f"Demo endpoint config name: {endpoint_config_name}")

# Set variant name and instance type for hosting
variant_name = "AllTraffic"
instance_type = "ml.g5.2xlarge"
model_data_download_timeout_in_seconds = 3600
container_startup_health_check_timeout_in_seconds = 3600

min_instance_count = 0  # Must be 0 so the endpoint can scale in to zero instances
max_instance_count = 2

sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ExecutionRoleArn=role,
    ProductionVariants=[
        {
            "VariantName": variant_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
            "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": min_instance_count,
                "MaxInstanceCount": max_instance_count,
            },
            "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
        }
    ],
)
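
Setting `MinInstanceCount` to 0 only permits the endpoint to scale in to zero instances. The instances are released once the inference component copy count reaches zero, which is usually driven by Application Auto Scaling. The sketch below builds the arguments for registering an inference component as a scalable target with a minimum of zero copies. The helper name `scale_to_zero_target_kwargs` and the component name `my-inference-component` are hypothetical placeholders; the inference component itself is created later in this notebook, and a scaling policy is still needed on top of this registration.

```python
def scale_to_zero_target_kwargs(inference_component_name, max_copy_count):
    """Build arguments for Application Auto Scaling's register_scalable_target.

    Hypothetical helper for illustration: it only assembles the request and
    does not call AWS.
    """
    if max_copy_count < 1:
        raise ValueError("max_copy_count must be at least 1")
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"inference-component/{inference_component_name}",
        "ScalableDimension": "sagemaker:inference-component:DesiredCopyCount",
        "MinCapacity": 0,  # allow the copy count to drop to zero
        "MaxCapacity": max_copy_count,
    }


target_kwargs = scale_to_zero_target_kwargs("my-inference-component", max_copy_count=2)
print(target_kwargs)
```

With AWS credentials available, the dictionary could be passed on as `boto3.client("application-autoscaling").register_scalable_target(**target_kwargs)`. The linked scale-to-zero example covers the scaling policies and alarms in full.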

### Create the SageMaker endpoint
Next, we create our endpoint using the endpoint config above.

In [23]:
# Set a unique endpoint name
endpoint_name = f"{prefix}-endpoint"
print(f"Demo endpoint name: {endpoint_name}")

sagemaker_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)

Demo endpoint name: DEMO-1751885837-6db5-endpoint


{'EndpointArn': 'arn:aws:sagemaker:ap-south-1:572285620711:endpoint/DEMO-1751885837-6db5-endpoint',
 'ResponseMetadata': {'RequestId': '2f2641f3-44ae-4c1b-b4c8-838468814c98',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '2f2641f3-44ae-4c1b-b4c8-838468814c98',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '98',
   'date': 'Mon, 07 Jul 2025 11:44:12 GMT'},
  'RetryAttempts': 0}}

In [24]:
sagemaker_session.wait_for_endpoint(endpoint_name)

------!

{'EndpointName': 'DEMO-1751885837-6db5-endpoint',
 'EndpointArn': 'arn:aws:sagemaker:ap-south-1:572285620711:endpoint/DEMO-1751885837-6db5-endpoint',
 'EndpointConfigName': 'DEMO-1751885837-6db5-endpoint-config',
 'ProductionVariants': [{'VariantName': 'AllTraffic',
   'CurrentInstanceCount': 1,
   'DesiredInstanceCount': 1,
   'ManagedInstanceScaling': {'Status': 'ENABLED',
    'MinInstanceCount': 0,
    'MaxInstanceCount': 2},
   'RoutingConfig': {'RoutingStrategy': 'LEAST_OUTSTANDING_REQUESTS'}}],
 'EndpointStatus': 'InService',
 'CreationTime': datetime.datetime(2025, 7, 7, 11, 44, 12, 968000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2025, 7, 7, 11, 47, 17, 664000, tzinfo=tzlocal()),
 'ResponseMetadata': {'RequestId': 'cad40cae-5618-47c2-aba7-63d08b06de88',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'cad40cae-5618-47c2-aba7-63d08b06de88',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '557',
   'date': 'Mon, 07 Jul 2025 11

## Deploy using Amazon SageMaker Large Model Inference (LMI) container 
In this example we are going to use the LMI v15 container powered by vLLM 0.8.4 with support for the vLLM V1 engine. This version supports the latest open-source models, such as Meta’s Llama 4 models Scout and Maverick, Google’s Gemma 3, Alibaba’s Qwen, Mistral AI, DeepSeek-R, and many more. You can find more details about the LMI v15 container in [the blog here](https://aws.amazon.com/blogs/machine-learning/supercharge-your-llm-performance-with-amazon-sagemaker-large-model-inference-container-v15/).



### Create Model Artifact
We will be deploying the Qwen2.5-Coder-3B-Instruct model using the LMI container. To do so, you need to set the image you would like to use with the proper configuration. You also create a SageMaker model, which is referenced when you create your inference component.

#### Download the model from Hugging Face and upload the model artifacts on Amazon S3
In this example, we demonstrate how to download your copy of the model from Hugging Face, upload it to an S3 location in your AWS account, and then deploy the model to an endpoint from the uploaded artifacts.

First, download the model artifacts from Hugging Face.


In [6]:
from huggingface_hub import snapshot_download
from pathlib import Path
import os
import sagemaker
import jinja2

qwen25_3B = "Qwen/Qwen2.5-Coder-3B-Instruct"

# - This downloads the model into the directory where the Jupyter notebook is running
local_model_path = Path(".")
local_model_path.mkdir(exist_ok=True)
model_name = qwen25_3B
# Only download pytorch checkpoint files
allow_patterns = ["*.json", "*.safetensors", "*.bin", "*.txt"]

# - Use snapshot_download to fetch the model, since the repository stores the weights using LFS
model_download_path = snapshot_download(
    repo_id=model_name,
    cache_dir=local_model_path,
    allow_patterns=allow_patterns,
)

Fetching 9 files:   0%|          | 0/9 [00:00<?, ?it/s]

config.json:   0%|          | 0.00/661 [00:00<?, ?B/s]

model.safetensors.index.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

generation_config.json:   0%|          | 0.00/243 [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/1.21G [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.96G [00:00<?, ?B/s]
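
To see which files a pattern list like `allow_patterns` would keep before starting a multi-gigabyte download, the filter can be approximated offline. This is a minimal sketch that assumes glob-style matching comparable to Python's `fnmatch`; the file list is illustrative (the `README.md`, `LICENSE` and `.gitattributes` entries are assumed, not taken from the output above).

```python
from fnmatch import fnmatch

allow_patterns = ["*.json", "*.safetensors", "*.bin", "*.txt"]


def is_allowed(filename, patterns=allow_patterns):
    """Return True if the filename matches at least one glob pattern."""
    return any(fnmatch(filename, pattern) for pattern in patterns)


# Illustrative repository listing (partly assumed for this example)
repo_files = [
    "config.json",
    "model-00001-of-00002.safetensors",
    "merges.txt",
    "README.md",
    "LICENSE",
    ".gitattributes",
]

kept = [name for name in repo_files if is_allowed(name)]
skipped = [name for name in repo_files if not is_allowed(name)]
print("kept:", kept)
print("skipped:", skipped)
```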

In [7]:
# define a variable to contain the s3url of the location that has the model
pretrained_model_location = f"s3://{model_bucket}/{s3_model_prefix}/"
print(f"Pretrained model will be uploaded to ---- > {pretrained_model_location}")

Pretrained model will be uploaded to ---- > s3://practiceb22/qwenmodels/


Upload the model data to S3.

In [18]:
model_download_path

'./models--Qwen--Qwen2.5-Coder-3B-Instruct/snapshots/488639f1ff808d1d3d0ba301aef8c11461451ec5'

In [19]:
model_artifact = sagemaker_session.upload_data(path=model_download_path, bucket = model_bucket, key_prefix=s3_model_prefix)
print(f"Model uploaded to --- > {model_artifact}")
print(f"We will set option.s3url={model_artifact}")

Model uploaded to --- > s3://practiceb22/qwenmodels
We will set option.s3url=s3://practiceb22/qwenmodels


In [None]:
# optional
# !rm -rf {model_download_path}

To find out more about the SageMaker `create_model` API call, see [the boto3 doc](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_model.html). Note that you can use **CompressionType** to specify how the model data is prepared.

If you choose `Gzip` and choose `S3Object` as the value of `S3DataType`, `S3Uri` identifies an object that is a gzip-compressed TAR archive. SageMaker will attempt to decompress and untar the object during model deployment.

If you choose `None` and `S3Prefix` as the value of `S3DataType`, then for each S3 object under the key name prefix referenced by `S3Uri`, SageMaker will trim its key by the prefix and use the remainder as the path (relative to `/opt/ml/model`) of the file holding the content of the S3 object. SageMaker will split the remainder by slash (/), using intermediate parts as directory names and the last part as the filename of the file holding the content of the S3 object.
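
The key-trimming rule for `S3Prefix` can be illustrated with a few lines of plain Python. This is only a sketch of the mapping described above, not SageMaker's implementation; the helper name `container_path_for_key` and the nested example key are hypothetical.

```python
import posixpath


def container_path_for_key(s3_key, s3_prefix, model_dir="/opt/ml/model"):
    """Map an S3 object key to the file path it would get inside the container."""
    if not s3_key.startswith(s3_prefix):
        raise ValueError(f"{s3_key!r} is not under prefix {s3_prefix!r}")
    # Trim the prefix, then use the remaining parts as directories and filename
    remainder = s3_key[len(s3_prefix):].lstrip("/")
    return posixpath.join(model_dir, *remainder.split("/"))


example_prefix = "qwenmodels/"
example_keys = [
    "qwenmodels/config.json",
    "qwenmodels/tokenizer.json",
    "qwenmodels/nested/dir/weights.safetensors",  # hypothetical nested key
]
paths = [container_path_for_key(key, example_prefix) for key in example_keys]
for key, path in zip(example_keys, paths):
    print(f"{key} -> {path}")
```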


In [25]:
# Define region where you have capacity
REGION = boto_region

#Select the latest container. Check the link for the latest available version https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers 
CONTAINER_VERSION = '0.33.0-lmi15.0.0-cu128'

# Construct container URI
container_uri = f'763104351884.dkr.ecr.{REGION}.amazonaws.com/djl-inference:{CONTAINER_VERSION}'

pretrained_model_location = f"s3://{model_bucket}/{s3_model_prefix}/"
qwen2_5_model = {
    "Image": container_uri,
    'ModelDataSource': {
                'S3DataSource': {
                    'S3Uri': pretrained_model_location,
                    'S3DataType': 'S3Prefix',
                    'CompressionType': 'None',
                }
            },
    "Environment": {
        "SAGEMAKER_MODEL_SERVER_WORKERS": "1",
        "MESSAGES_API_ENABLED": "true",
        "OPTION_MAX_ROLLING_BATCH_SIZE": "4",
        "OPTION_MODEL_LOADING_TIMEOUT": "1500",
        "SERVING_FAIL_FAST": "true",
        "OPTION_ROLLING_BATCH": "disable",
        "OPTION_ASYNC_MODE": "true",
        "OPTION_ENTRYPOINT": "djl_python.lmi_vllm.vllm_async_service",
        "OPTION_ENABLE_STREAMING": "true"
    },
}
model_name_qwen2_5 = f"qwen2-5-coder-3b-tgi-{datetime.now().strftime('%y%m%d-%H%M%S')}"
# create SageMaker Model
sagemaker_client.create_model(
    ModelName=model_name_qwen2_5,
    ExecutionRoleArn=role,
    Containers=[qwen2_5_model],
)

{'ModelArn': 'arn:aws:sagemaker:ap-south-1:572285620711:model/qwen2-5-coder-3b-tgi-250707-114804',
 'ResponseMetadata': {'RequestId': 'd8799bb8-1192-49e3-a29f-e4a2b3b5b0b6',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'd8799bb8-1192-49e3-a29f-e4a2b3b5b0b6',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '97',
   'date': 'Mon, 07 Jul 2025 11:48:05 GMT'},
  'RetryAttempts': 0}}

We can now create the inference component, which will be deployed on the endpoint that you specify. Note that the specification can reference either a SageMaker model or a container. If you provide a container, you need to supply an image and an artifact URL as parameters. In this example we reference the model name we prepared in the cells above. You can also set `ComputeResourceRequirements` to tell SageMaker what should be reserved for each copy of the inference component, and set the copy count to the number of copies you would like to deploy. Copies can be managed and scaled as capacity becomes available.

Note that in this example we set `NumberOfAcceleratorDevicesRequired` to a value of `1`. This reserves one accelerator for each copy of the inference component, which matches the single GPU on `ml.g5.2xlarge`. A value greater than 1 is only needed when a model is sharded across several accelerators with tensor parallelism.

In [26]:
inference_component_name_qwen = f"{prefix}-IC-qwen3-30b-{datetime.now().strftime('%y%m%d-%H%M%S')}"
variant_name = "AllTraffic"

sagemaker_client.create_inference_component(
    InferenceComponentName=inference_component_name_qwen,
    EndpointName=endpoint_name,
    VariantName=variant_name,
    Specification={
        "ModelName": model_name_qwen2_5,
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,
            "NumberOfCpuCoresRequired": 1,
            "MinMemoryRequiredInMb": 1024,
        },
    },
    RuntimeConfig={"CopyCount": 1},
)

{'InferenceComponentArn': 'arn:aws:sagemaker:ap-south-1:572285620711:inference-component/DEMO-1751885837-6db5-IC-qwen3-30b-250707-114811',
 'ResponseMetadata': {'RequestId': '34186504-be2b-431f-8acc-f3e8d2670d83',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '34186504-be2b-431f-8acc-f3e8d2670d83',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '137',
   'date': 'Mon, 07 Jul 2025 11:48:11 GMT'},
  'RetryAttempts': 0}}
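
The `ComputeResourceRequirements` above also bound how many copies can share one instance. The sketch below does the simple division for illustration. It assumes the published `ml.g5.2xlarge` shape (1 GPU, 8 vCPUs, 32 GiB of memory) and ignores any resources the platform reserves for itself, so treat the result as an upper bound; the helper name `max_copies_per_instance` is hypothetical.

```python
def max_copies_per_instance(instance, requirements):
    """Upper bound on inference component copies per instance, by simple division."""
    return min(
        instance["accelerators"] // requirements["NumberOfAcceleratorDevicesRequired"],
        instance["vcpus"] // requirements["NumberOfCpuCoresRequired"],
        instance["memory_mb"] // requirements["MinMemoryRequiredInMb"],
    )


# ASSUMPTION: nominal instance shape; usable capacity may be lower in practice
g5_2xlarge = {"accelerators": 1, "vcpus": 8, "memory_mb": 32 * 1024}

requirements = {
    "NumberOfAcceleratorDevicesRequired": 1,
    "NumberOfCpuCoresRequired": 1,
    "MinMemoryRequiredInMb": 1024,
}

copies = max_copies_per_instance(g5_2xlarge, requirements)
print(f"At most {copies} copy per ml.g5.2xlarge instance")
```

Here the single GPU is the limiting resource, so scaling beyond one copy requires an additional instance, up to `MaxInstanceCount`.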

Wait until the inference component is `InService`.

In [None]:
import time
# Measure how long the deployment takes
start_time = time.time()
while True:
    desc = sagemaker_client.describe_inference_component(
        InferenceComponentName=inference_component_name_qwen
    )
    status = desc["InferenceComponentStatus"]
    print(status)
    sys.stdout.flush()
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)
total_time = time.time() - start_time
print(f"\nTotal time taken: {total_time:.2f} seconds ({total_time/60:.2f} minutes)")

Creating
Creating
Creating
Creating
Creating
InService

Total time taken: 512.24 seconds (8.54 minutes)
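
The polling loop above runs until the status is `InService` or `Failed`, with no upper bound on the wait. A variation with a timeout and an explicit error on failure is sketched below. The helper `wait_for_status` is hypothetical, and reading a `FailureReason` field from the describe response is an assumption; the demonstration uses a fake describe function so it runs without AWS access.

```python
import time


def wait_for_status(describe, status_key, done=("InService",), failed=("Failed",),
                    poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll describe() until a terminal status is reached or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        desc = describe()
        status = desc[status_key]
        if status in done:
            return desc
        if status in failed:
            # ASSUMPTION: the response may carry a FailureReason field
            reason = desc.get("FailureReason", "no reason reported")
            raise RuntimeError(f"Resource entered status {status}: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {status} after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Offline demonstration with a fake describe function and no real sleeping
fake_statuses = iter(["Creating", "Creating", "InService"])
result = wait_for_status(
    lambda: {"InferenceComponentStatus": next(fake_statuses)},
    "InferenceComponentStatus",
    sleep=lambda seconds: None,
)
print(result)
```

Against the real service, the `describe` argument could be `lambda: sagemaker_client.describe_inference_component(InferenceComponentName=inference_component_name_qwen)`.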


#### Invoke endpoint with boto3
Now you can invoke the endpoint with the boto3 `invoke_endpoint` or `invoke_endpoint_with_response_stream` runtime API calls. Because the model is deployed as an inference component, pass both the endpoint name and the `InferenceComponentName`. An existing endpoint can be invoked the same way, using only its names, without redeploying anything.


In [65]:
import boto3
import json
import time
sagemaker_runtime = boto3.client('sagemaker-runtime')

prompt = {
    'messages':[
    {"role": "user", "content": "Generate a code to convert list to json"}
],
    'temperature':0.7,
    'top_p':0.8,
    'top_k':20,
    'max_tokens':512,
    #'system': 'Just share the code and no examples or explanation'
}

system_prompt = 'Just share the code and no examples or explanation'
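# Alternative payload in the `inputs`/`parameters` format, kept for reference only.
# It is NOT sent below; the request uses the Messages API `prompt` defined above.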
payload = {
    "inputs": 'Generate a code to convert list to json',
    "parameters": {
        "max_new_tokens": 512,
        "temperature": 0.7,
        "do_sample": False
    },
    "system": system_prompt  # Adding system prompt to the request
}

start_time = time.time()
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    InferenceComponentName=inference_component_name_qwen,
    ContentType="application/json",
    Body=json.dumps(prompt)
)
end_time = time.time()
overall_latency = end_time - start_time

response_dict = json.loads(response['Body'].read().decode("utf-8"))
response_content = response_dict['choices'][0]['message']['content']
print(response_content)
print(overall_latency)

model_usage = response_dict['usage']
print(model_usage)

Certainly! In Python, you can convert a list to JSON using the `json` module. Here's a simple example:

```python
import json

# Example list
my_list = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25},
    {"name": "Charlie", "age": 35}
]

# Convert list to JSON
json_string = json.dumps(my_list)

# Print the JSON string
print(json_string)
```

This code will output:

```json
[{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}, {"name": "Charlie", "age": 35}]
```

The `json.dumps()` function converts a Python object into a JSON formatted string. If you want to write the JSON data to a file instead of printing it, you can use `json.dump()`, which writes the JSON representation of the object to a file:

```python
import json

# Example list
my_list = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25},
    {"name": "Charlie", "age": 35}
]

# Write JSON to a file
with open('output.json', 'w') as file:
    json.dump(my_list, file, indent=4)
```

This w

In [66]:
response

{'ResponseMetadata': {'RequestId': '77cb1759-650b-42fd-bbe9-7cea62ef990a',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '77cb1759-650b-42fd-bbe9-7cea62ef990a',
   'x-amzn-invoked-production-variant': 'AllTraffic',
   'date': 'Mon, 14 Jul 2025 05:10:03 GMT',
   'content-type': 'application/json',
   'content-length': '1902',
   'connection': 'keep-alive'},
  'RetryAttempts': 0},
 'ContentType': 'application/json',
 'InvokedProductionVariant': 'AllTraffic',
 'Body': <botocore.response.StreamingBody at 0x7fce23d59300>}

In [67]:
response_dict
#response_dict['generated_text']

{'id': 'chatcmpl-7e5255922ed2436f8091a794d204c9ff',
 'object': 'chat.completion',
 'created': 1752469796,
 'model': 'lmi',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'reasoning_content': None,
    'content': 'Certainly! In Python, you can convert a list to JSON using the `json` module. Here\'s a simple example:\n\n```python\nimport json\n\n# Example list\nmy_list = [\n    {"name": "Alice", "age": 30},\n    {"name": "Bob", "age": 25},\n    {"name": "Charlie", "age": 35}\n]\n\n# Convert list to JSON\njson_string = json.dumps(my_list)\n\n# Print the JSON string\nprint(json_string)\n```\n\nThis code will output:\n\n```json\n[{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}, {"name": "Charlie", "age": 35}]\n```\n\nThe `json.dumps()` function converts a Python object into a JSON formatted string. If you want to write the JSON data to a file instead of printing it, you can use `json.dump()`, which writes the JSON representation of the object to a file:\n\n```pyt
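
In the request above, the `system` key is commented out and `system_prompt` is never sent, so the model still returns explanations. In OpenAI-style chat schemas, system instructions are conventionally passed as a message with the `system` role. The sketch below builds such a request body; the helper name `build_chat_request` is hypothetical, and how closely the model follows the instruction has not been verified here.

```python
import json


def build_chat_request(user_prompt, system_prompt=None, **sampling_params):
    """Assemble a chat-style request body, with an optional system message first."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return {"messages": messages, **sampling_params}


chat_request = build_chat_request(
    "Generate a code to convert list to json",
    system_prompt="Just share the code and no examples or explanation",
    temperature=0.7,
    top_p=0.8,
    max_tokens=512,
)
body = json.dumps(chat_request)
print(body)
```

The resulting `body` string could be passed as the `Body` argument of `invoke_endpoint` in place of `json.dumps(prompt)`.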

#### Streaming response from the endpoint
Additionally, the LMI container allows you to invoke the endpoint and receive a streaming response. Below is an example of how to interact with the endpoint using a streaming response.

In [68]:
import io
import json

# Example class that processes an inference stream:
class SmrInferenceStream:
    
    def __init__(self, sagemaker_runtime, endpoint_name, inference_component_name=None):
        self.sagemaker_runtime = sagemaker_runtime
        self.endpoint_name = endpoint_name
        self.inference_component_name = inference_component_name
        # A buffered I/O stream to combine the payload parts:
        self.buff = io.BytesIO() 
        self.read_pos = 0
        
    def stream_inference(self, request_body):
        # Gets a streaming inference response 
        # from the specified model endpoint:
        response = self.sagemaker_runtime\
            .invoke_endpoint_with_response_stream(
                EndpointName=self.endpoint_name, 
                InferenceComponentName=self.inference_component_name,
                Body=json.dumps(request_body), 
                ContentType="application/json"
        )
        # Gets the EventStream object returned by the SDK:
        event_stream = response['Body']
        for event in event_stream:
            # Passes the contents of each payload part
            # to be concatenated:
            self._write(event['PayloadPart']['Bytes'])
            # Iterates over lines to parse whole JSON objects:
            for line in self._readlines():
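                try:
                    resp = json.loads(line)
                except ValueError:
                    # Skip blank or incomplete lines that are not valid JSON
                    continue
                if isinstance(resp, dict):
                    choices = resp.get('choices') or []
                    if not choices:
                        continue
                    # Extract the incremental text from the chat completion chunk
                    part = choices[0].get('delta', {}).get('content') or ''
                else:
                    part = resp
                # Returns parts incrementally:
                yield part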
    
    # Writes to the buffer to concatenate the contents of the parts:
    def _write(self, content):
        self.buff.seek(0, io.SEEK_END)
        self.buff.write(content)

    # The JSON objects in buffer end with '\n'.
    # This method reads lines to yield a series of JSON objects:
    def _readlines(self):
        self.buff.seek(self.read_pos)
        for line in self.buff.readlines():
            self.read_pos += len(line)
            yield line[:-1]

In [69]:
request_body = {
    'messages':[
        {"role": "user", "content": "Generate a code to convert list to json"},
    ],
    'temperature':0.9,
    'max_tokens':512,
    'stream': True,
}

smr_inference_stream = SmrInferenceStream(
    sagemaker_runtime, endpoint_name, inference_component_name_qwen)
stream = smr_inference_stream.stream_inference(request_body)
for part in stream:
    print(part, end='')

Certainly! To convert a list to JSON in Python, you can use the `json` module, which is part of the standard library. Here's a simple example:

```python
import json

# Example list
my_list = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25},
    {"name": "Charlie", "age": 35}
]

# Convert list to JSON
json_output = json.dumps(my_list, indent=4)

print(json_output)
```

In this example, `json.dumps()` is used to convert the list into a JSON-formatted string. The `indent` parameter is optional and is used to pretty-print the JSON output with an indentation of 4 spaces for better readability.

If you want to write the JSON data to a file, you can use the `json.dump()` function instead:

```python
import json

# Example list
my_list = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25},
    {"name": "Charlie", "age": 35}
]

# Write list to JSON file
with open('output.json', 'w') as json_file:
    json.dump(my_list, json_file, indent=4)
```

This will create
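
The buffering idea behind `SmrInferenceStream` can be exercised offline, because payload parts may split a JSON object at an arbitrary byte. The sketch below is a variation rather than the class above: it holds back an incomplete trailing line until its newline arrives. The class name `JsonLineBuffer` and the sample chunks are hypothetical, and the chunks assume newline-delimited JSON objects shaped like the streamed chat completion deltas.

```python
import json


class JsonLineBuffer:
    """Accumulates byte chunks and yields one parsed JSON object per complete line."""

    def __init__(self):
        self._pending = b""

    def feed(self, chunk):
        self._pending += chunk
        # Everything before the last newline is complete; the rest waits for more data
        *complete, self._pending = self._pending.split(b"\n")
        for line in complete:
            line = line.strip()
            if line:
                yield json.loads(line)


# Two JSON lines, deliberately split across three chunks at awkward positions
chunks = [
    b'{"choices": [{"delta": {"content": "Hel',
    b'lo"}}]}\n{"choices": [{"delta"',
    b': {"content": " world"}}]}\n',
]

buffer = JsonLineBuffer()
text = ""
for chunk in chunks:
    for obj in buffer.feed(chunk):
        text += obj["choices"][0]["delta"]["content"]
print(text)
```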

## Cleanup
  
Make sure to delete the endpoint and other resources that were created to avoid unnecessary cost. You can also go to the SageMaker AI console to delete all the resources created in this example.

In [None]:
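sagemaker_client.delete_inference_component(InferenceComponentName=inference_component_name_qwen)
sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
# Also remove the SageMaker model created earlier in this notebook
sagemaker_client.delete_model(ModelName=model_name_qwen2_5)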