# SageMaker Example

## 1. Create your container repository

Open the AWS console and create an ECR repository for your container: https://us-west-2.console.aws.amazon.com/ecr/create-repository?region=us-west-2

For example: `236995464743.dkr.ecr.us-west-2.amazonaws.com/sagemaker_endpoint/vllm`
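
The console step can also be scripted. The following is a minimal sketch, assuming a boto3 ECR client is passed in; the helper name `create_ecr_repository` is illustrative and not part of any SDK. It calls `create_repository` and returns the repository URI that the later `docker tag` and `docker push` commands need.

```python
def create_ecr_repository(ecr_client, repo_name: str) -> str:
    """Create an ECR repository and return its URI (hypothetical helper).

    `ecr_client` is expected to be a boto3 ECR client, for example
    boto3.client("ecr", region_name="us-west-2").
    """
    response = ecr_client.create_repository(repositoryName=repo_name)
    return response["repository"]["repositoryUri"]
```

With a real client this would be called as `create_ecr_repository(boto3.client("ecr", region_name="us-west-2"), "sagemaker_endpoint/vllm")`. The call raises an error if the repository already exists, so it is meant for first-time setup.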

In [1]:
# login
!aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 236995464743.dkr.ecr.us-west-2.amazonaws.com

VLLM_VERSION = "v0.5.4"
REPO_NAME = "sagemaker_endpoint/vllm"
CONTAINER = f"236995464743.dkr.ecr.us-west-2.amazonaws.com/{REPO_NAME}:{VLLM_VERSION}"


https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded


## 2. Build the container

The demo code is in `app/`.
Build and push the Docker image with the following commands:

In [2]:
!docker build --build-arg VLLM_VERSION={VLLM_VERSION} -t {REPO_NAME}:{VLLM_VERSION} .
!docker tag {REPO_NAME}:{VLLM_VERSION} {CONTAINER}
!docker push {CONTAINER}

Sending build context to Docker daemon   51.2kB
Step 1/9 : ARG VLLM_VERSION
Step 2/9 : FROM vllm/vllm-openai:$VLLM_VERSION
 ---> 28a486908227
Step 3/9 : WORKDIR /app
 ---> Using cache
 ---> 26cd4f6c1062
Step 4/9 : RUN sed -i 's|/v1/chat/completions|/invocations|g' /usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py;     sed -i 's|/health|/ping|g' /usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py;
 ---> Using cache
 ---> a0213e02b6ca
Step 5/9 : COPY app/ /app
 ---> Using cache
 ---> 3bc86774b549
Step 6/9 : EXPOSE 8080
 ---> Using cache
 ---> b46a7dab5ab2
Step 7/9 : ENV PATH="/app:${PATH}"
 ---> Using cache
 ---> 179b301761c6
Step 8/9 : ENTRYPOINT []
 ---> Running in 5dbba2abcdb1
Removing intermediate container 5dbba2abcdb1
 ---> e31beea64c81
Step 9/9 : CMD ["serve"]
 ---> Running in c706e5e7532c
Removing intermediate container c706e5e7532c
 ---> 6cf4c83dc5c9
Successfully built 6cf4c83dc5c9
Successfully tagged sagemaker_endpoint/vllm:
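
SageMaker hosting sends inference requests to `/invocations` and health checks to `/ping` on port 8080, which is why step 4 of the build log patches vLLM's `api_server.py` with `sed`. The snippet below only illustrates those two substitutions on a made-up sample string; the sample is an assumption and not the actual contents of `api_server.py`.

```python
# Substitutions taken from the `sed` commands in the build log above.
REWRITES = [
    ("/v1/chat/completions", "/invocations"),
    ("/health", "/ping"),
]


def patch_routes(source: str) -> str:
    """Apply the same literal, global replacements as the two sed commands."""
    for old, new in REWRITES:
        source = source.replace(old, new)
    return source


# Hypothetical excerpt, for illustration only.
sample = '@router.get("/health")\n@router.post("/v1/chat/completions")\n'
print(patch_routes(sample))
```

Because the replacement is global, any other route that contains `/health` or `/v1/chat/completions` would be renamed as well, which is worth checking when moving to a different vLLM version.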

## 3. Deploy on SageMaker

Define the model and deploy it on SageMaker.


### 3.1 Init SageMaker session

In [3]:
import re
import json

import boto3
import sagemaker
from sagemaker import Model

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


### 3.2 Prepare model file

#### Option 1: deploy vLLM with a start script

In [4]:
!echo the entrypoint of the endpoint is "start.sh"
!echo ====================================================
!cat vllm_by_scripts/start.sh
!echo ====================================================

# -f avoids an error when no previous tarball exists
!rm -f vllm_by_scripts.tar.gz
!tar czvf vllm_by_scripts.tar.gz vllm_by_scripts/


s3_code_prefix = "sagemaker_endpoint/vllm/"
bucket = sess.default_bucket()
code_artifact = sess.upload_data("vllm_by_scripts.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")

the entrypoint of the endpoint is start.sh
#!/bin/bash

# port needs to be 8080

python3 -m vllm.entrypoints.openai.api_server \
    --port 8080 \
    --trust-remote-code \
    --model deepseek-ai/deepseek-coder-1.3b-instruct
vllm_by_scripts/
vllm_by_scripts/start.sh
S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-west-2-236995464743/sagemaker_endpoint/vllm//vllm_by_scripts.tar.gz
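
The `rm` and `tar` shell commands can be replaced with Python's standard `tarfile` module, which overwrites an existing archive on its own. This is a minimal sketch; the function name `make_tarball` and the placeholder `start.sh` content are assumptions for the demo, not part of the original project.

```python
import tarfile
import tempfile
from pathlib import Path


def make_tarball(source_dir: str, output_path: str) -> list[str]:
    """Pack `source_dir` into a gzip tarball and return the archived member names."""
    source = Path(source_dir)
    with tarfile.open(output_path, "w:gz") as tar:
        # arcname keeps the top-level folder name, like `tar czvf x.tar.gz dir/`.
        tar.add(source, arcname=source.name)
    with tarfile.open(output_path, "r:gz") as tar:
        return tar.getnames()


# Demo in a temporary directory with a stand-in start.sh.
with tempfile.TemporaryDirectory() as tmp:
    script_dir = Path(tmp) / "vllm_by_scripts"
    script_dir.mkdir()
    (script_dir / "start.sh").write_text("#!/bin/bash\necho placeholder\n")
    members = make_tarball(str(script_dir), str(Path(tmp) / "vllm_by_scripts.tar.gz"))
    print(members)
```

The resulting file can be passed to `sess.upload_data` exactly like the tarball produced by the shell commands.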


#### Option 2: deploy vLLM by model_id

In [5]:
!echo write the model_id to file "model_id"
!echo ====================================================
!cat vllm_by_model_id/model_id
!echo ====================================================
!echo 
!echo write envs to file ".env"
!echo ====================================================
!cat vllm_by_model_id/.env
!echo ====================================================

# -f avoids an error when no previous tarball exists
!rm -f vllm_by_model_id.tar.gz
!tar czvf vllm_by_model_id.tar.gz vllm_by_model_id/


s3_code_prefix = "sagemaker_endpoint/vllm/"
bucket = sess.default_bucket()
code_artifact = sess.upload_data("vllm_by_model_id.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")

write the model_id to file model_id
deepseek-ai/deepseek-coder-1.3b-instruct

write envs to file .env
# Environment Variables: https://docs.vllm.ai/en/latest/serving/env_vars.html
export HF_TOKEN="<your Hugging Face token>"
vllm_by_model_id/
vllm_by_model_id/.env
vllm_by_model_id/model_id
S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-west-2-236995464743/sagemaker_endpoint/vllm//vllm_by_model_id.tar.gz
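
The contents of `app/` are not shown in this notebook, so how the container actually consumes `model_id` and `.env` is not confirmed here. Purely as an illustration of the `.env` format above (`export KEY="value"` lines plus `#` comments), the following sketch parses such a file into a dictionary. The function name `parse_env_file` and the sample token are hypothetical.

```python
import shlex


def parse_env_file(text: str) -> dict[str, str]:
    """Parse lines such as `export KEY="value"`; skip blanks and # comments."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=value assignment
        # shlex removes surrounding quotes the way a shell would.
        parts = shlex.split(value)
        env[key.strip()] = parts[0] if parts else ""
    return env


# Made-up sample in the same format as the .env file above.
sample_env = '# comment\nexport HF_TOKEN="hf_example_token"\n'
print(parse_env_file(sample_env))
```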


### 3.3 Deploy model

In [6]:
model = Model(
    name="sagemaker-vllm",
    model_data=code_artifact,
    image_uri=CONTAINER,
    role=role,
)

# Deploy the model to an endpoint
endpoint_name = sagemaker.utils.name_from_base("sagemaker-vllm")
print(f"endpoint_name: {endpoint_name}")
predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.g5.2xlarge',
    endpoint_name=endpoint_name,
)

endpoint_name: sagemaker-vllm-2024-08-29-10-10-27-438


Using already existing model: sagemaker-vllm


-----------!
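
The log line `Using already existing model: sagemaker-vllm` indicates that the SDK found a model with that fixed name and reused it, so a newly uploaded `code_artifact` may not take effect until the old model is removed. The following is a minimal cleanup sketch, assuming a boto3 SageMaker client is passed in; the function name `delete_endpoint_resources` is hypothetical.

```python
def delete_endpoint_resources(sm_client, endpoint_name: str, model_name: str) -> None:
    """Delete an endpoint, its endpoint config, and the named model.

    `sm_client` is expected to be boto3.client("sagemaker").
    """
    # Look up the config name before the endpoint is deleted.
    description = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = description["EndpointConfigName"]
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    sm_client.delete_model(ModelName=model_name)
```

With a real client this would be called as `delete_endpoint_resources(boto3.client("sagemaker"), endpoint_name, "sagemaker-vllm")` once testing is finished, which also stops the instance from accruing charges.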

## 4. Test

You can invoke the endpoint with the SageMaker runtime client from boto3.

### 4.1 Non-stream mode

In [7]:
# 'sagemaker-runtime' is the documented service name for the runtime client
runtime = boto3.client('sagemaker-runtime')

payload = {
    "model": "deepseek-ai/deepseek-coder-1.3b-instruct",
    "messages": [
    {
        "role": "user",
        "content": "Write a quick sort in python"
    }
    ],
    "max_tokens": 1024,
    "stream": False
}
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

print(json.loads(response['Body'].read())["choices"][0]["message"]["content"])

Sure, here is a basic implementation of Quick Sort in Python:

```python
def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[0]
        less_than_pivot = [x for x in arr[1:] if x <= pivot]
        greater_than_pivot = [x for x in arr[1:] if x > pivot]
        return quick_sort(less_than_pivot) + [pivot] + quick_sort(greater_than_pivot)

# Test the function
print(quick_sort([3,6,8,10,1,2,1]))
# Output: [1, 1, 2, 3, 6, 8, 10]
```

This implementation works by choosing the first element of the list as the pivot, and then dividing the rest of the list into two lists, one with elements less than the pivot and one with elements greater than the pivot. The pivot is then in its final position in the sorted list. The function is then recursively called on the two lists.
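
The request and response handling above follows the OpenAI chat completions shape, since `/invocations` is mapped to vLLM's chat completions route. As a small sketch, the two steps can be wrapped in helpers; the names `build_chat_payload` and `extract_reply` are illustrative, and the response body below is hand-written rather than real endpoint output.

```python
import json


def build_chat_payload(model: str, prompt: str, max_tokens: int = 1024, stream: bool = False) -> str:
    """Serialize an OpenAI-style chat completion request body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": stream,
    })


def extract_reply(body: bytes) -> str:
    """Pull the assistant message text out of a non-streaming response body."""
    return json.loads(body)["choices"][0]["message"]["content"]


# Hand-written response in the same shape, for illustration only.
fake_body = json.dumps(
    {"choices": [{"message": {"role": "assistant", "content": "hello"}}]}
).encode()
print(extract_reply(fake_body))
```

With the live endpoint, `build_chat_payload(...)` would be passed as `Body` to `invoke_endpoint`, and `extract_reply(response['Body'].read())` would return the generated text.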



### 4.2 Stream mode

In [20]:
payload = {
    "model": "deepseek-ai/deepseek-coder-1.3b-instruct",
    "messages": [
    {
        "role": "user",
        "content": "Write a quick sort in python"
    }
    ],
    "max_tokens": 1024,
    "stream": True
}

response = runtime.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

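for event in response['Body']:
    # Each PayloadPart may carry one or more SSE lines of the form "data: {...}"
    chunk = event.get("PayloadPart", {}).get("Bytes", b"").decode()
    for line in chunk.splitlines():
        line = re.sub(r"^data: ", "", line).strip()
        if not line or line == "[DONE]":
            continue
        try:
            delta = json.loads(line)["choices"][0]["delta"]
            print(delta.get("content") or "", end="")
        except (json.JSONDecodeError, KeyError, IndexError):
            pass  # skip partial or non-JSON fragments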

Sure, here is a simple implementation of the Quick Sort algorithm in Python:

```python
def quicksort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[len(arr) // 2]
        left = [x for x in arr if x < pivot]
        middle = [x for x in arr if x == pivot]
        right = [x for x in arr if x > pivot]
        return quicksort(left) + middle + quicksort(right)

# Testing the function
print(quicksort([3,6,8,10,1,2,1]))
# Output: [1, 1, 2, 3, 6, 8, 10]
```

In this code, we first check if the length of the array is less than or equal to 1, in which case the array is already sorted, so we return it directly. If the array is longer, we choose a pivot element from the middle of the array. We then create three lists: one for elements less than the pivot, one for elements equal to the pivot, and one for elements greater than the pivot. We then recursively sort the elements less than the pivot and those greater than the pivot, and concatenate the results with the e