# LMI vLLM Qwen3-32B vLLM deployment guide

In this tutorial, you will use LMI container from DLC to SageMaker and run inference with it.

Please make sure the following permission granted before running the notebook:

- S3 bucket push access
- SageMaker access

## Step 1: Let's bump up SageMaker and import stuff

In [44]:
%pip install -U sagemaker

In [46]:
import os
from pathlib import Path
import boto3
import sagemaker
from sagemaker import Model, image_uris, serializers, deserializers

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess._region_name  # region name of the current SageMaker Studio environment
account_id = sess.account_id()  # account_id of the current SageMaker Studio environment
sagemaker_default_bucket = sess.default_bucket()

## Step 2: Start preparing model artifacts
In LMI contianer, we expect some artifacts to help setting up the model
- serving.properties (required): Defines the model server settings
- model.py (optional): A python file to define the core inference logic
- requirements.txt (optional): Any additional pip wheel need to install

In [47]:
model_name = "Qwen/Qwen3-32B"

model_lineage = model_name.split("/")[0]
model_specific_name = model_name.split("/")[1]

s3url = "s3://sagemaker-us-west-2-831762732388/lmi/Qwen3-32B"

### Compress model artifacts

In [48]:
with open("serving.properties", "w") as wf:
    wf.write(f"""
engine=Python
option.entryPoint=djl_python.lmi_vllm.vllm_async_service
option.model_id={s3url}
option.async_mode=true
option.tensor_parallel_degree=4
option.rolling_batch=disable
#option.max_rolling_batch_size=8
option.gpu_memory_utilization=0.9
option.enable_auto_tool_choice=true
option.tool_call_parser=hermes
SERVING_FAIL_FAST=true
option.enable_prefix_caching=true
""")

In [49]:
%%sh
mkdir mymodel
mv serving.properties mymodel/
tar czvf mymodel.tar.gz mymodel/
rm -rf mymodel

mymodel/
mymodel/serving.properties


### Upload artifact on S3 and create SageMaker model

In [50]:
s3_code_prefix = "large-model-lmi/code-Qwen-Qwen3-32B"

bucketName = sess.default_bucket()  # bucket to house artifacts

code_artifact = sess.upload_data("mymodel.tar.gz", bucketName, s3_code_prefix)

print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")

S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-west-2-831762732388/large-model-lmi/code-Qwen-Qwen3-32B/mymodel.tar.gz


## Step 3: Start building SageMaker endpoint
In this step, we will build SageMaker endpoint from scratch

### Getting the container image URI

For more versions or regions, you should checkout [Large Model Inference available DLC](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers)

In [51]:
image_uri = "763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.34.0-lmi16.0.0-cu128-v1.2"

# for China (Beijing) cn-north-1
# image_uri = "727897471807.dkr.ecr.cn-north-1.amazonaws.com.cn/djl-inference:0.34.0-lmi16.0.0-cu128-v1.2"

# for China (Ningxia) cn-northwest-1
# image_uri = "727897471807.dkr.ecr.cn-northwest-1.amazonaws.com.cn/djl-inference:0.34.0-lmi16.0.0-cu128-v1.2"

In [52]:
model = Model(image_uri=image_uri, model_data=code_artifact, role=role)

## Step4: Create SageMaker endpoint

You need to specify the instance to use and endpoint names

In [None]:
instance_type = "ml.g5.48xlarge"
endpoint_name = sagemaker.utils.name_from_base(f"lmi-model-{model_lineage}-{model_specific_name}").replace(".", "-")

model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=1800
)

------------------------!

## Step 5: Test and benchmark the inference

In [54]:
import time
import json
import boto3

first_token_received = False
ttft = 0
token_count = 0

prompt = "tell me a long story."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
payload = {
    "messages": messages,
    "max_tokens": 4096,
    "temperature": 0.7,
    "top_p": 0.8,
    "stream": "true",
    #"chat_template_kwargs": {"enable_thinking": False},
}

# Create SageMaker Runtime client
sagemaker_runtime_client = boto3.client("sagemaker-runtime")

start_time = time.time()

# Invoke the model
response_stream = sagemaker_runtime_client.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    Body=json.dumps(payload),
    ContentType="application/json",
    CustomAttributes='accept_eula=false'
)

for event in response_stream['Body']:
    if 'PayloadPart' in event:
        chunk = event['PayloadPart']['Bytes'].decode()
        try:
            # Handle SSE format (data: prefix)
            if chunk.startswith('data: '):
                data = json.loads(chunk[6:])  # Skip "data: " prefix
            else:
                data = json.loads(chunk)
            # Extract token based on OpenAI format
            if 'choices' in data and len(data['choices']) > 0:
                if 'delta' in data['choices'][0] and 'content' in data['choices'][0]['delta']:
                    token_count += 1
                    token_text = data['choices'][0]['delta']['content']
                    # Record time to first token
                    if not first_token_received:
                        ttft = time.time() - start_time
                        first_token_received = True
                    print(token_text, end='', flush=True)
        
        except json.JSONDecodeError:
            continue
            
# Print metrics after completion
end_time = time.time()
total_latency = end_time - start_time

print("\n\nMetrics:")
print(f"Time to First Token (TTFT): {ttft:.2f} seconds" if ttft else "TTFT: N/A")
print(f"Total Tokens Generated: {token_count}")
print(f"Total Latency: {total_latency:.2f} seconds")
if token_count > 0 and total_latency > 0:
    print(f"Tokens per second: {token_count/total_latency:.2f}")


, user wants a long story. Let me think about what kind of story would be engaging. Maybe a fantasy adventure with some depth. I should create a unique world with interesting characters. Let me start by setting up a mystical land with some conflict. Maybe a hero's journey? That structure usually works well.

First, I need a setting. Let's go with a place called Elyndor, a realm where magic and nature are intertwined. There's a threat to the balance, maybe a dark force. The protagonist could be someone ordinary who discovers they have a special role. Let's name her Lira. She's a young woman living in a village near an ancient forest. 

Introduce some magical elements. The forest, Sylwen, is alive and sentient. The elders speak of a time when the forest and sky were connected through the Celestial Tree. Now, the connection is broken, causing chaos. Lira has a connection to this tree through her lineage. Maybe she finds a pendant that's a key to restoring the connection.

Conflict arises

### Tool calling
Ref: [https://docs.djl.ai/master/docs/serving/serving/docs/lmi/user_guides/tool_calling.html](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/user_guides/tool_calling.html)

In [56]:
import json
import boto3

# Create SageMaker Runtime client
sagemaker_runtime_client = boto3.client("sagemaker-runtime")

payload =  {
    "messages": [
        {
            "role": "user",
            "content": "Hi! How are you doing today?"
        }, 
        {
            "role": "assistant",
            "content": "I'm doing well! How can I help you?"
        }, 
        {
            "role": "user",
            "content": "Can you tell me what the temperate will be in Dallas, in fahrenheit?"
        }
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type":
                            "string",
                        "description":
                            "The city to find the weather for, e.g. 'San Francisco'"
                    },
                    "state": {
                        "type":
                            "string",
                        "description":
                            "the two-letter abbreviation for the state that the city is in, e.g. 'CA' which would mean 'California'"
                    },
                    "unit": {
                        "type": "string",
                        "description":
                            "The unit to fetch the temperature in",
                        "enum": ["celsius", "fahrenheit"]
                    }
                },
                "required": ["city", "state", "unit"]
            }
        }
    }],
}

response = sagemaker_runtime_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(payload),
    ContentType="application/json",
    CustomAttributes='accept_eula=false'
)

print(json.loads(response["Body"].read()))

{'id': 'chatcmpl-21b54e712f334f33a6623b9a60f7e789', 'object': 'chat.completion', 'created': 1761873507, 'model': 'lmi', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': "<think>\nOkay, the user is asking for the temperature in Dallas in Fahrenheit. Let me check the tools available. There's a function called get_current_weather that requires city, state, and unit. The user mentioned Dallas, so the city is Dallas. The state for Dallas is Texas, which is abbreviated as TX. The unit they want is Fahrenheit. So I need to call the get_current_weather function with city: Dallas, state: TX, unit: fahrenheit. Let me make sure all required parameters are included. Yes, city, state, and unit are all there. Alright, I'll structure the tool call accordingly.\n</think>\n\n", 'refusal': None, 'annotations': None, 'audio': None, 'function_call': None, 'tool_calls': [{'id': 'chatcmpl-tool-d75310c1346d4e78998fb217ef53beb3', 'type': 'function', 'function': {'name': 'get_current_weathe

## Clean up the environment

In [None]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
model.delete_model()