# SageMaker vLLM endpoint example

## 1. Create your container repository

Create an Amazon ECR repository for your container. One way is the AWS console: https://console.aws.amazon.com/ecr/create-repository

For example, name it `sagemaker_endpoint/vllm`.

In [1]:
# set some variables
VLLM_VERSION = "v0.6.4.post1"
REPO_NAMESPACE = "sagemaker_endpoint/vllm"
ACCOUNT = !aws sts get-caller-identity --query Account --output text
REGION = !aws configure get region
ACCOUNT = ACCOUNT[0]
REGION = REGION[0]
CONTAINER = f"{ACCOUNT}.dkr.ecr.{REGION}.amazonaws.com/{REPO_NAMESPACE}:{VLLM_VERSION}"
MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"
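
The `CONTAINER` string above hardcodes the `amazonaws.com` suffix. ECR registries in the China regions (`cn-north-1`, `cn-northwest-1`) use `amazonaws.com.cn` instead, which matters if you follow the China-region option later in this notebook. The helper below is a minimal sketch of building the URI for either case; `ecr_image_uri` is a hypothetical name, and `build_and_push.sh` (not shown here) would need to push to the same URI.

```python
def ecr_image_uri(account: str, region: str, repo: str, tag: str) -> str:
    """Build an ECR image URI, picking the domain suffix from the region."""
    # China regions (cn-north-1, cn-northwest-1) live under amazonaws.com.cn.
    suffix = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
    return f"{account}.dkr.ecr.{region}.{suffix}/{repo}:{tag}"
```

With the variables from the cell above, `CONTAINER = ecr_image_uri(ACCOUNT, REGION, REPO_NAMESPACE, VLLM_VERSION)` would yield the same value as the f-string in the global regions.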

## 2. Build the container

The demo code is in `app/`.
Build and push the Docker image with the following command:

**The image only needs to be built once.**
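
Since one build is enough, a check like the following could tell you whether the tag is already in ECR before you run the build cell. This is a sketch, not part of the original notebook: `image_exists` is a hypothetical helper, and `ecr_client` is assumed to behave like `boto3.client("ecr")`.

```python
def image_exists(ecr_client, repository_name: str, image_tag: str) -> bool:
    """Return True if `repository_name:image_tag` is already present in ECR."""
    try:
        ecr_client.describe_images(
            repositoryName=repository_name,
            imageIds=[{"imageTag": image_tag}],
        )
        return True
    except (
        ecr_client.exceptions.ImageNotFoundException,
        ecr_client.exceptions.RepositoryNotFoundException,
    ):
        # Either the tag or the whole repository is missing.
        return False
```

For example, `image_exists(boto3.client("ecr"), REPO_NAMESPACE, VLLM_VERSION)` returning `True` would mean the build cell below can be skipped.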


In [2]:
cmd = f"VLLM_VERSION={VLLM_VERSION} REPO_NAMESPACE={REPO_NAMESPACE} ACCOUNT={ACCOUNT} REGION={REGION} bash ./build_and_push.sh"
print("Running:", cmd)
!{cmd}

Running: VLLM_VERSION=v0.6.4.post1 REPO_NAMESPACE=sagemaker_endpoint/vllm ACCOUNT=236995464743 REGION=us-west-2 bash ./build_and_push.sh
236995464743.dkr.ecr.us-west-2.amazonaws.com/sagemaker_endpoint/vllm:v0.6.4.post1
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
[+] Building 0.0s (0/1)                                          docker:default
[+] Building 0.1s (9/9) FINISHED                                 docker:default
 => [internal] load build definition from dockerfile                       0.0s
 => => transferring dockerfile: 2.05kB                                     0.0s
 => [internal] load metadata for docker.io/vllm/vllm-openai:v0.6.4.post1   0.0s
 => [internal] load .dockerignore                                          0.0s
 => => transferring context: 2B                                            0.0s
 => [1/4] FROM docker.io/vllm/vllm-ope

## 3. Deploy on SageMaker

Define the model and deploy it on SageMaker.


In [3]:
%pip install -U boto3 sagemaker transformers huggingface_hub modelscope

Note: you may need to restart the kernel to use updated packages.


### 3.1 Init SageMaker session

In [4]:
import os
import re
import json
from datetime import datetime
import time

import boto3
import sagemaker
from sagemaker import Model

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
default_bucket = sess.default_bucket()

sagemaker_client = boto3.client("sagemaker")



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [5]:
model_name = MODEL_ID.replace("/", "-").replace(".", "-")
local_model_path = os.environ['HOME'] + "/models/" + model_name
s3_model_path = f"s3://{default_bucket}/models/" + model_name

%mkdir -p code {local_model_path}

print("local_model_path:", local_model_path)

local_model_path: /home/ec2-user/models/Qwen-Qwen2-5-0-5B-Instruct
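
SageMaker resource names accept only alphanumerics and hyphens, which is why the `/` and `.` in `MODEL_ID` are replaced above. Model IDs that contain other characters (an underscore, for instance) would still slip through the two `replace` calls. The sketch below is a more general variant; `to_sagemaker_name` is a hypothetical helper, and unlike the original it collapses runs of invalid characters into a single hyphen.

```python
import re


def to_sagemaker_name(model_id: str) -> str:
    """Turn a model ID into a string of alphanumerics and single hyphens."""
    # Replace every run of characters outside [a-zA-Z0-9] with one hyphen,
    # then drop hyphens left at either end.
    return re.sub(r"[^a-zA-Z0-9]+", "-", model_id).strip("-")
```

For the `MODEL_ID` used in this notebook the result is identical to `model_name` above.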


### 3.2 Download and upload model file

##### Option 1: global region

In [6]:
!huggingface-cli download --resume-download {MODEL_ID} --local-dir {local_model_path}

Fetching 10 files: 100%|█████████████████████| 10/10 [00:00<00:00, 66156.21it/s]
/home/ec2-user/models/Qwen-Qwen2-5-0-5B-Instruct


##### Option 2: China region

In [7]:
# !modelscope download --local_dir {local_model_path} {MODEL_ID} 

#### Upload to S3

In [8]:
!aws s3 sync {local_model_path} {s3_model_path}
print("s3_model_path:", s3_model_path)

upload: ../../../models/Qwen-Qwen2-5-0-5B-Instruct/.cache/huggingface/download/.gitattributes.lock to s3://sagemaker-us-west-2-236995464743/models/Qwen-Qwen2-5-0-5B-Instruct/.cache/huggingface/download/.gitattributes.lock
upload: ../../../models/Qwen-Qwen2-5-0-5B-Instruct/.cache/huggingface/download/tokenizer.json.lock to s3://sagemaker-us-west-2-236995464743/models/Qwen-Qwen2-5-0-5B-Instruct/.cache/huggingface/download/tokenizer.json.lock
upload: ../../../models/Qwen-Qwen2-5-0-5B-Instruct/.cache/huggingface/download/merges.txt.lock to s3://sagemaker-us-west-2-236995464743/models/Qwen-Qwen2-5-0-5B-Instruct/.cache/huggingface/download/merges.txt.lock
upload: ../../../models/Qwen-Qwen2-5-0-5B-Instruct/.cache/huggingface/download/tokenizer_config.json.lock to s3://sagemaker-us-west-2-236995464743/models/Qwen-Qwen2-5-0-5B-Instruct/.cache/huggingface/download/tokenizer_config.json.lock
upload: ../../../models/Qwen-Qwen2-5-0-5B-Instruct/.cache/huggingface/download/LICENSE.lock to s3://sagema
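
The log above shows that the sync also uploads the `.cache/` bookkeeping files left behind by the download. They are most likely not needed for serving, so one option is to skip them with the `--exclude` flag of `aws s3 sync`. The sketch below builds such a command from Python; `build_sync_command` and `sync_model` are hypothetical helpers, not part of the original notebook.

```python
import subprocess


def build_sync_command(src: str, dst: str, excludes=(".cache/*",)) -> list:
    """Build an `aws s3 sync` command that skips the given patterns."""
    cmd = ["aws", "s3", "sync", src, dst]
    for pattern in excludes:
        cmd += ["--exclude", pattern]
    return cmd


def sync_model(src: str, dst: str) -> None:
    """Run the sync; raises CalledProcessError if the AWS CLI fails."""
    subprocess.run(build_sync_command(src, dst), check=True)
```

Calling `sync_model(local_model_path, s3_model_path)` would then replace the `!aws s3 sync` line in the cell above.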

### 3.3 Prepare the vLLM start script

In [9]:
endpoint_model_name = sagemaker.utils.name_from_base(model_name, short=True)
local_code_path = endpoint_model_name
s3_code_path = f"s3://{default_bucket}/endpoint_code/vllm_byoc/{endpoint_model_name}.tar.gz"

%mkdir -p {local_code_path}

print("local_code_path:", local_code_path)

local_code_path: Qwen-Qwen2-5-0-5B-Instruct-241128-1505


In [10]:
with open(f"{local_code_path}/start.sh", "w") as f:
    # The shebang has to be the very first line of the file, so the string
    # starts right after the opening quotes.
    f.write(f"""#!/bin/bash

# The server has to listen on $SAGEMAKER_BIND_TO_PORT.

# Download the model from S3 to local disk.
aws s3 sync {s3_model_path} /opt/ml/modelfile/

# Adjust the vLLM arguments below as needed.
python3 -m vllm.entrypoints.openai.api_server \\
    --port $SAGEMAKER_BIND_TO_PORT \\
    --trust-remote-code \\
    --max-model-len 8192 \\
    --model /opt/ml/modelfile/
""")

In [11]:
!rm -f {local_code_path}.tar.gz
!tar czvf {local_code_path}.tar.gz {local_code_path}/
!aws s3 cp {local_code_path}.tar.gz {s3_code_path}
print("s3_code_path:", s3_code_path)

Qwen-Qwen2-5-0-5B-Instruct-241128-1505/
Qwen-Qwen2-5-0-5B-Instruct-241128-1505/start.sh
upload: ./Qwen-Qwen2-5-0-5B-Instruct-241128-1505.tar.gz to s3://sagemaker-us-west-2-236995464743/endpoint_code/vllm_byoc/Qwen-Qwen2-5-0-5B-Instruct-241128-1505.tar.gz
s3_code_path: s3://sagemaker-us-west-2-236995464743/endpoint_code/vllm_byoc/Qwen-Qwen2-5-0-5B-Instruct-241128-1505.tar.gz
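
If you would rather not shell out to `tar`, the same archive layout (the directory itself plus `start.sh` inside it) can be produced with the standard library. This is a minimal sketch; `make_code_archive` is a hypothetical helper, and the upload step would still be done separately.

```python
import tarfile
from pathlib import Path


def make_code_archive(code_dir: str, archive_path: str) -> list:
    """Pack `code_dir` into a gzip-compressed tarball and return its member names."""
    code_dir = Path(code_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps only the directory name, mirroring `tar czvf dir.tar.gz dir/`.
        tar.add(str(code_dir), arcname=code_dir.name)
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()
```

For example, `make_code_archive(local_code_path, f"{local_code_path}.tar.gz")` would write the archive that the `aws s3 cp` line above uploads.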


### 3.4 Deploy model

In [12]:
create_model_response = sagemaker_client.create_model(
    ModelName=endpoint_model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": CONTAINER,
        "ModelDataUrl": s3_code_path
    },
    
)
print(create_model_response)
print("endpoint_model_name:", endpoint_model_name)

{'ModelArn': 'arn:aws:sagemaker:us-west-2:236995464743:model/Qwen-Qwen2-5-0-5B-Instruct-241128-1505', 'ResponseMetadata': {'RequestId': 'aad44c70-0abf-4c42-94fa-844bf9693f20', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'aad44c70-0abf-4c42-94fa-844bf9693f20', 'content-type': 'application/x-amz-json-1.1', 'content-length': '100', 'date': 'Thu, 28 Nov 2024 15:05:39 GMT'}, 'RetryAttempts': 0}}
endpoint_model_name: Qwen-Qwen2-5-0-5B-Instruct-241128-1505


In [13]:
endpoint_config_name = sagemaker.utils.name_from_base(model_name, short=True)

endpoint_config_response = sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": endpoint_model_name,
            "InstanceType": "ml.g5.2xlarge",
            "InitialInstanceCount": 1,
            "ContainerStartupHealthCheckTimeoutInSeconds": 1000,
            # "EnableSSMAccess": True,
        },
    ],
)

print(endpoint_config_response)
print("endpoint_config_name:", endpoint_config_name)

{'EndpointConfigArn': 'arn:aws:sagemaker:us-west-2:236995464743:endpoint-config/Qwen-Qwen2-5-0-5B-Instruct-241128-1505', 'ResponseMetadata': {'RequestId': 'c65860b9-67aa-4d63-9de3-6ee0cd6ea597', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'c65860b9-67aa-4d63-9de3-6ee0cd6ea597', 'content-type': 'application/x-amz-json-1.1', 'content-length': '119', 'date': 'Thu, 28 Nov 2024 15:05:39 GMT'}, 'RetryAttempts': 0}}
endpoint_config_name: Qwen-Qwen2-5-0-5B-Instruct-241128-1505


In [None]:
endpoint_name = sagemaker.utils.name_from_base(model_name, short=True)

create_endpoint_response = sagemaker_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(create_endpoint_response)
print("endpoint_name:", endpoint_name)

# Poll once a minute until the endpoint leaves the "Creating" state.
while True:
    status = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
    if status != "Creating":
        break
    print(datetime.now().strftime('%Y%m%d-%H:%M:%S') + " status: " + status)
    time.sleep(60)
# Anything other than "InService" here means the deployment did not succeed.
if status != "InService":
    raise RuntimeError(f"Endpoint {endpoint_name} ended in status {status}")
print("Endpoint created:", endpoint_name)

{'EndpointArn': 'arn:aws:sagemaker:us-west-2:236995464743:endpoint/Qwen-Qwen2-5-0-5B-Instruct-241128-1505', 'ResponseMetadata': {'RequestId': '743cd93d-285f-4c8f-85d5-9f1828a92457', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '743cd93d-285f-4c8f-85d5-9f1828a92457', 'content-type': 'application/x-amz-json-1.1', 'content-length': '106', 'date': 'Thu, 28 Nov 2024 15:05:40 GMT'}, 'RetryAttempts': 0}}
endpoint_name: Qwen-Qwen2-5-0-5B-Instruct-241128-1505
20241128-15:05:40 status: Creating
20241128-15:06:40 status: Creating
20241128-15:07:40 status: Creating
20241128-15:10:41 status: Creating
Endpoint created: Qwen-Qwen2-5-0-5B-Instruct-241128-1505
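
The polling loop above can be wrapped in a small function that also enforces a timeout and surfaces the failure reason. The sketch below is one way to do that; `wait_for_endpoint` is a hypothetical helper, and `describe` is assumed to be any callable that returns a `describe_endpoint`-style dict for an endpoint name. boto3 also ships an `endpoint_in_service` waiter on the SageMaker client, which may be simpler if you do not need the progress output.

```python
import time


def wait_for_endpoint(describe, endpoint_name, poll_seconds=60,
                      timeout_seconds=3600, sleep=time.sleep):
    """Poll until the endpoint is InService; raise on failure or timeout."""
    waited = 0
    while True:
        desc = describe(endpoint_name)
        status = desc["EndpointStatus"]
        if status == "InService":
            return status
        if status != "Creating":
            # describe_endpoint reports FailureReason when creation fails.
            reason = desc.get("FailureReason", "no reason reported")
            raise RuntimeError(f"Endpoint {endpoint_name} is {status}: {reason}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint {endpoint_name} still Creating after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

In this notebook it could be called as `wait_for_endpoint(lambda name: sagemaker_client.describe_endpoint(EndpointName=name), endpoint_name)`.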


## 4. Test

You can invoke the endpoint with the boto3 SageMaker runtime client.

In [None]:
messages = [{
        "role": "user",
        "content": "Write a quick sort in python"
}]

### 4.1 Message API, non-stream mode

In [None]:
sagemaker_runtime = boto3.client('runtime.sagemaker')

payload = {
    "messages": messages,
    "max_tokens": 1024,
    "stream": False
}
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

print(json.loads(response['Body'].read())["choices"][0]["message"]["content"])

Certainly! Here is a quicksort algorithm implemented in Python:
```python
def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + middle + quicksort(right)

# Example usage:
arr = [3, 6, 8, 10, 1, 2, 1]
sorted_arr = quicksort(arr)
print(sorted_arr)
```
This function will return a sorted list of the input list. It works by choosing a pivot element from the array and partitioning the remaining elements into three lists: elements less than the pivot, elements equal to the pivot, and elements greater than the pivot. It then recursively applies this process to the left and right partitions.
I hope this helps! Let me know if you have any questions.
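
The non-stream calls in this section return an OpenAI-style JSON body: chat requests carry the answer in `choices[0]["message"]["content"]`, prompt requests in `choices[0]["text"]`. The helper below is a small sketch that handles both and also returns `finish_reason`, where `"length"` would indicate that generation stopped at `max_tokens`. `extract_text` is a hypothetical name, and the field layout is assumed to follow the OpenAI-compatible schema seen in the cells of this notebook.

```python
import json


def extract_text(body: bytes):
    """Return (text, finish_reason) from a chat or completion response body."""
    choice = json.loads(body)["choices"][0]
    if "message" in choice:
        text = choice["message"]["content"]  # chat-style response
    else:
        text = choice["text"]  # completion-style response
    return text, choice.get("finish_reason")
```

Usage would look like `text, reason = extract_text(response['Body'].read())`.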


### 4.2 Message API, stream mode

In [None]:
payload = {
    "messages": messages,
    "max_tokens": 1024,
    "stream": True
}

response = sagemaker_runtime.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

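buffer = ""
for t in response['Body']:
    buffer += t["PayloadPart"]["Bytes"].decode()
    last_idx = 0
    # Each server-sent event looks like "data: {...}\n\n". re.MULTILINE lets "^"
    # match every event in the buffer instead of only the first one.
    for match in re.finditer(r'^data:\s*(.+?)(\n\n)', buffer, flags=re.MULTILINE):
        try:
            data = json.loads(match.group(1).strip())
            last_idx = match.end()
            print(data["choices"][0]["delta"]["content"], end="")
        except (json.JSONDecodeError, KeyError, IndexError):
            pass
    # Keep only the unparsed tail (an incomplete event) for the next chunk.
    buffer = buffer[last_idx:]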

Sure! Here is a Python implementation of the Quick Sort algorithm using recursion. The implementation is called "QuickSort" and it sorts a list of integers in ascending order.

```python
def quicksort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[len(arr) // 2]
        left = [x for x in arr if x < pivot]
        middle = [x for x in arr if x == pivot]
        right = [x for x in arr if x > pivot]
        return quicksort(left) + middle + quicksort(right)

# Example usage
arr = [3, 6, 8, 10, 1, 2, 1]
sorted_arr = quicksort(arr)
print("Sorted array:", sorted_arr)
```

### Explanation:
1. **Base Case**: If the array has 1 or 0 elements, it is already sorted, so we return it.
2. **Pivot Selection**: We choose the middle element of the array as the pivot.
3. **Partitioning**: We create three lists: `left` contains elements less than the pivot, `middle` contains elements equal to the pivot, and `right` contains elements greater than the pivot.
4. **Recursive C
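
The streaming cell decodes each payload part on its own, which can fail if a multi-byte UTF-8 character is split across two parts. The generator below is a sketch of a more defensive parser: it decodes incrementally, splits the buffer on the blank line that ends each server-sent event, and stops at the `[DONE]` marker. `iter_sse_json` is a hypothetical helper; it assumes the endpoint forwards vLLM's `data: {...}` events unchanged, as the cell above also does.

```python
import codecs
import json


def iter_sse_json(byte_chunks):
    """Yield the JSON payload of each `data:` event from an iterable of bytes."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    buffer = ""
    for chunk in byte_chunks:
        # The incremental decoder holds back incomplete multi-byte characters.
        buffer += decoder.decode(chunk)
        while "\n\n" in buffer:
            event, buffer = buffer.split("\n\n", 1)
            for line in event.splitlines():
                if not line.startswith("data:"):
                    continue
                payload = line[len("data:"):].strip()
                if payload == "[DONE]":
                    return
                try:
                    yield json.loads(payload)
                except json.JSONDecodeError:
                    continue  # skip malformed events


# Possible usage with the streaming response from the cell above:
# parts = (p["PayloadPart"]["Bytes"] for p in response["Body"] if "PayloadPart" in p)
# for event in iter_sse_json(parts):
#     print(event["choices"][0]["delta"].get("content") or "", end="")
```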

### 4.3 Completion API, non-stream mode
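
In this mode the request carries a raw `prompt`, so the chat template has to be applied on the client, which is what `tokenizer.apply_chat_template` does in the cell below. To give a feel for the result, here is a simplified ChatML-style formatter, the general shape used by Qwen chat models. It is an illustration only: `to_chatml` is a hypothetical helper, the real template may add more (for example a default system message), and the tokenizer's own template remains the source of truth.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render chat messages in a simplified ChatML layout (illustrative only)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # An open assistant turn tells the model to start its reply here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)
```

Printing the `prompt` variable from the cell below and comparing it with `to_chatml(messages)` is a quick way to see what the actual template adds.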

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

payload = {
    "prompt": prompt,
    "max_tokens": 1024,
    "stream": False
}

response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

print(json.loads(response['Body'].read())["choices"][0]["text"])

To write a quick sort in Python, you can follow these steps:

1. Define the `quick_sort` function with two parameters: `arr` (the array to be sorted) and `low` (the index of the left array boundary).
2. Compare the middle element of the left and right arrays.
3. If the middle element is less than or equal to the rightmost element, swap them.
4. Otherwise, keep the left element as is and recurse on the remaining array.

The `quick_sort` function will recursively divide the array into smaller subarrays until the base case (`n == 1`) is reached, where `n` is the length of the array.

Here's an example implementation:

```python
def quick_sort(arr, low, high):
    if low < high:
        pi = partition(arr, low, high)
        quick_sort(arr, low, pi - 1)
        quick_sort(arr, pi + 1, high)

def partition(arr, low, high):
    pivot = arr[high]
    i = low - 1
    for j in range(low, high):
        if arr[j] < pivot:
            i += 1
            arr[i], arr[j] = arr[j], arr[i]
    arr[i +

### 4.4 Completion API, stream mode

In [None]:
payload = {
    "prompt": prompt,
    "max_tokens": 1024,
    "stream": True
}

response = sagemaker_runtime.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

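buffer = ""
for t in response['Body']:
    buffer += t["PayloadPart"]["Bytes"].decode()
    last_idx = 0
    # Each server-sent event looks like "data: {...}\n\n". re.MULTILINE lets "^"
    # match every event in the buffer instead of only the first one.
    for match in re.finditer(r'^data:\s*(.+?)(\n\n)', buffer, flags=re.MULTILINE):
        try:
            data = json.loads(match.group(1).strip())
            last_idx = match.end()
            # print(data)
            print(data["choices"][0]["text"], end="")
        except (json.JSONDecodeError, KeyError, IndexError):
            pass
    # Keep only the unparsed tail (an incomplete event) for the next chunk.
    buffer = buffer[last_idx:]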

Certainly! Below is a Python implementation of the quick sort algorithm. Quick sort is a divide-and-conquer algorithm that arranges elements in ascending order.

```python
def partition(arr, low, high):
    i = low - 1
    pivot = arr[high]
    
    for j in range(low, high):
        if arr[j] < pivot:
            i += 1
            arr[i], arr[j] = arr[j], arr[i]
    
    arr[i+1], arr[high] = arr[high], arr[i+1]
    return i + 1

def quick_sort(arr, low, high):
    if low < high:
        pi = partition(arr, low, high)
        quick_sort(arr, low, pi - 1)
        quick_sort(arr, pi + 1, high)

# Example usage
arr = [10, 7, 8, 9, 1, 5]
n = len(arr)
quick_sort(arr, 0, n - 1)
print("Sorted array is:", arr)
```

### Explanation:

1. **Partitioning**: The `partition` function selects the pivot element and rearranges the elements such that all elements less than the pivot are on its left side and all elements greater than the pivot are on its right side. The element surrounded by `i` is mov