# SageMaker Example

## 1. Create your container repository

Open the AWS console and create an ECR repository for your container: https://us-west-2.console.aws.amazon.com/ecr/create-repository?region=us-west-2

For example: `236995464743.dkr.ecr.us-west-2.amazonaws.com/sagemaker_endpoint/vllm`
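
The repository URI above follows the pattern `<account>.dkr.ecr.<region>.amazonaws.com/<repository>`. As a minimal sketch, the helper below composes such a URI. `ecr_image_uri` is a hypothetical name, not part of any AWS SDK, and the `amazonaws.com` suffix assumes a standard commercial region.

```python
def ecr_image_uri(account_id: str, region: str, repo_name: str, tag: str = "") -> str:
    """Compose an ECR image URI; the tag is optional."""
    # Assumption: standard commercial partition (domain suffix amazonaws.com).
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    uri = f"{registry}/{repo_name}"
    return f"{uri}:{tag}" if tag else uri


print(ecr_image_uri("236995464743", "us-west-2", "sagemaker_endpoint/vllm"))
print(ecr_image_uri("236995464743", "us-west-2", "sagemaker_endpoint/vllm", "v0.5.5"))
```

Building the URI from its parts keeps the account ID and region in one place if the notebook is reused in another account.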

In [1]:
# login
!aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 236995464743.dkr.ecr.us-west-2.amazonaws.com

VLLM_VERSION = "v0.5.5"
REPO_NAME = "sagemaker_endpoint/vllm"
CONTAINER = f"236995464743.dkr.ecr.us-west-2.amazonaws.com/{REPO_NAME}:{VLLM_VERSION}"


https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded


## 2. Build the container

The demo code is in `app/`.
Build and push the Docker image with the following commands:

In [2]:
!docker build --build-arg VLLM_VERSION={VLLM_VERSION} -t {REPO_NAME}:{VLLM_VERSION} .
!docker tag {REPO_NAME}:{VLLM_VERSION} {CONTAINER}
!docker push {CONTAINER}

Sending build context to Docker daemon  76.29kB
Step 1/9 : ARG VLLM_VERSION
Step 2/9 : FROM vllm/vllm-openai:$VLLM_VERSION
 ---> d55dc98813f3
Step 3/9 : WORKDIR /app
 ---> Using cache
 ---> a35af76520a9
Step 4/9 : RUN sed -i '/if __name__ == "__main__":/i@router.get("/ping")\nasync def ping() -> Response:\n    return await health()\n\nfrom typing import Union\n@router.post("/invocations")\nasync def invocations(request: Union[ChatCompletionRequest, CompletionRequest],\n                                 raw_request: Request):\n    if isinstance(request, ChatCompletionRequest):\n        return await create_chat_completion(request, raw_request)\n    elif isinstance(request, CompletionRequest):\n        return await create_completion(request, raw_request)\n    else:\n        return JSONResponse("unknow request paras",\n                            status_code=HTTPStatus.BAD_REQUEST)\n' /usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py;
 ---> Using cache
 ---> 0f3a
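
The `sed` step in the build log patches vLLM's `api_server.py` to add the two routes a SageMaker container must serve: `GET /ping` for health checks and `POST /invocations` for inference. The patched handler forwards the request to the chat or the completion handler depending on its type. The sketch below imitates that dispatch in plain Python, outside FastAPI. `classify_request` is a hypothetical name, and telling the two request types apart by the `messages` and `prompt` keys is an assumption that only approximates how the `Union[ChatCompletionRequest, CompletionRequest]` body is resolved.

```python
def classify_request(payload: dict) -> str:
    """Return which OpenAI-style handler a JSON body would be routed to."""
    if "messages" in payload:
        return "chat_completion"   # handled by create_chat_completion in the patch
    if "prompt" in payload:
        return "completion"        # handled by create_completion in the patch
    # The patched route answers with HTTP 400 in this case.
    raise ValueError("unknown request parameters")


print(classify_request({"model": "m", "messages": [{"role": "user", "content": "hi"}]}))
print(classify_request({"model": "m", "prompt": "hi"}))
```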

## 3. Deploy on SageMaker

Define the model and deploy it to a SageMaker endpoint.


### 3.1 Init SageMaker session

In [3]:
!pip install boto3 sagemaker transformers
import re
import json

import boto3
import sagemaker
from sagemaker import Model

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

Collecting attrs<24,>=23.1.0 (from sagemaker)
  Using cached attrs-23.2.0-py3-none-any.whl.metadata (9.5 kB)
Collecting boto3
  Downloading boto3-1.35.9-py3-none-any.whl.metadata (6.6 kB)
Collecting botocore<1.36.0,>=1.35.9 (from boto3)
  Downloading botocore-1.35.9-py3-none-any.whl.metadata (5.7 kB)
Collecting s3transfer<0.11.0,>=0.10.0 (from boto3)
  Using cached s3transfer-0.10.2-py3-none-any.whl.metadata (1.7 kB)
Downloading boto3-1.35.9-py3-none-any.whl (139 kB)
Using cached attrs-23.2.0-py3-none-any.whl (60 kB)
Downloading botocore-1.35.9-py3-none-any.whl (12.5 MB)
Using cached s3transfer-0.10.2-py3-none-any.whl (82 kB)
Installing collected packages: attrs, botocore, s3transfer, boto3
  Attempting uninstall: attrs
    Found existing installation: attrs 24.2.0
    Uninstalling attrs-24.2.0:
      Successfully uninstalled attrs-24.2.0
  Attempting uninstall:

### 3.2 Prepare model file

#### Option 1: deploy vLLM with a start script

In [4]:
!echo the entrypoint of the endpoint is "start.sh"
!echo ====================================================
!cat vllm_by_scripts/start.sh
!echo ====================================================

!rm -f vllm_by_scripts.tar.gz
!tar czvf vllm_by_scripts.tar.gz vllm_by_scripts/


s3_code_prefix = "sagemaker_endpoint/vllm/"
bucket = sess.default_bucket()
code_artifact = sess.upload_data("vllm_by_scripts.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")

the entrypoint of the endpoint is start.sh
#!/bin/bash

# port needs to be 8080

python3 -m vllm.entrypoints.openai.api_server \
    --port 8080 \
    --trust-remote-code \
    --model deepseek-ai/deepseek-coder-1.3b-instruct
vllm_by_scripts/
vllm_by_scripts/start.sh
vllm_by_scripts/.ipynb_checkpoints/
vllm_by_scripts/.ipynb_checkpoints/start-checkpoint.sh
S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-west-2-236995464743/sagemaker_endpoint/vllm//vllm_by_scripts.tar.gz
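
The uploaded URI in the log contains a double slash (`vllm//vllm_by_scripts.tar.gz`), presumably because `s3_code_prefix` ends with `/` and the SDK adds its own separator. The upload still works, but the key is easy to mistype later. A minimal sketch of joining key segments safely follows; `join_s3_uri` is a hypothetical helper, not a SageMaker SDK function.

```python
def join_s3_uri(bucket: str, *parts: str) -> str:
    """Join key segments with single slashes, dropping empty segments."""
    segments = [p.strip("/") for p in parts]
    key = "/".join(s for s in segments if s)
    return f"s3://{bucket}/{key}"


print(join_s3_uri("sagemaker-us-west-2-236995464743",
                  "sagemaker_endpoint/vllm/", "vllm_by_scripts.tar.gz"))
```

Dropping the trailing slash from `s3_code_prefix` before calling `upload_data` should have the same effect.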


#### Option 2: deploy vLLM by model_id

In [5]:
!echo write the model_id to file "model_id"
!echo ====================================================
!cat vllm_by_model_id/model_id
!echo ====================================================
!echo 
!echo write envs to file ".env"
!echo ====================================================
!cat vllm_by_model_id/.env
!echo ====================================================

!rm -f vllm_by_model_id.tar.gz
!tar czvf vllm_by_model_id.tar.gz vllm_by_model_id/


s3_code_prefix = "sagemaker_endpoint/vllm/"
bucket = sess.default_bucket()
code_artifact = sess.upload_data("vllm_by_model_id.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")

write the model_id to file model_id
deepseek-ai/deepseek-coder-1.3b-instruct

write envs to file .env
# Environment Variables: https://docs.vllm.ai/en/latest/serving/env_vars.html
export HF_TOKEN="<redacted>"
vllm_by_model_id/
vllm_by_model_id/.env
vllm_by_model_id/.env.swp
vllm_by_model_id/.ipynb_checkpoints/
vllm_by_model_id/.ipynb_checkpoints/model_id-checkpoint
vllm_by_model_id/model_id
S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-west-2-236995464743/sagemaker_endpoint/vllm//vllm_by_model_id.tar.gz
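
The `tar` listing above shows that editor and notebook leftovers (`.env.swp`, `.ipynb_checkpoints/`) end up in the archive. As a hedged sketch, the same archive can be built with Python's `tarfile` module and a filter that skips those entries. `make_archive`, `SKIP_DIRS`, and `SKIP_SUFFIXES` are hypothetical names chosen for this example, and the demo runs against a temporary directory rather than the real `vllm_by_model_id/` folder.

```python
import tarfile
import tempfile
from pathlib import Path

SKIP_DIRS = {".ipynb_checkpoints"}   # notebook checkpoint folders
SKIP_SUFFIXES = {".swp"}             # editor swap files


def _keep(info: tarfile.TarInfo):
    """tarfile filter: return None to drop an entry, or the entry to keep it."""
    path = Path(info.name)
    if SKIP_DIRS.intersection(path.parts) or path.suffix in SKIP_SUFFIXES:
        return None
    return info


def make_archive(src_dir: str, out_path: str) -> list:
    """Create a gzip tarball of src_dir and return the sorted member names."""
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(src_dir, arcname=Path(src_dir).name, filter=_keep)
    with tarfile.open(out_path, "r:gz") as tar:
        return sorted(tar.getnames())


# Demo on a throwaway directory that mimics the layout listed above.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "vllm_by_model_id"
    (src / ".ipynb_checkpoints").mkdir(parents=True)
    (src / ".ipynb_checkpoints" / "model_id-checkpoint").write_text("x")
    (src / "model_id").write_text("deepseek-ai/deepseek-coder-1.3b-instruct\n")
    (src / ".env").write_text("# environment variables\n")
    (src / ".env.swp").write_text("swap")
    names = make_archive(str(src), str(Path(tmp) / "vllm_by_model_id.tar.gz"))

print(names)
```

Keeping swap files out of the archive also avoids shipping stale copies of `.env` contents to S3.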


### 3.3 Deploy model

In [6]:
model = Model(
    name="sagemaker-vllm",
    model_data=code_artifact,
    image_uri=CONTAINER,
    role=role,
)

# Deploy the model to an endpoint
endpoint_name = sagemaker.utils.name_from_base("sagemaker-vllm")
print(f"endpoint_name: {endpoint_name}")
predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.g5.2xlarge',
    endpoint_name=endpoint_name,
)

endpoint_name: sagemaker-vllm-2024-08-30-09-08-19-576


Using already existing model: sagemaker-vllm


------------!
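
The dashes ending in `!` are the SDK's progress indicator while `deploy` blocks until the endpoint is ready. If deployment is started without waiting, the status can be polled separately. The sketch below assumes a boto3 `sagemaker` client, whose `describe_endpoint` call returns an `EndpointStatus` field. `wait_until_in_service` is a hypothetical helper, and the polling interval and timeout are arbitrary defaults.

```python
import time


def wait_until_in_service(sm_client, endpoint_name: str,
                          poll_seconds: float = 30.0, timeout: float = 1800.0) -> str:
    """Poll describe_endpoint until the endpoint is InService, fails, or times out."""
    deadline = time.monotonic() + timeout
    while True:
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"endpoint {endpoint_name} failed to deploy")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"endpoint {endpoint_name} still {status} after {timeout}s")
        time.sleep(poll_seconds)
```

With real credentials this would be called as `wait_until_in_service(boto3.client("sagemaker"), endpoint_name)`.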

## 4. Test

You can invoke the endpoint with boto3's SageMaker runtime client.

### 4.1 Message API, non-streaming mode

In [7]:
runtime = boto3.client('runtime.sagemaker')

payload = {
    "model": "deepseek-ai/deepseek-coder-1.3b-instruct",
    "messages": [
    {
        "role": "user",
        "content": "Write a quick sort in python"
    }
    ],
    "max_tokens": 1024,
    "stream": False
}
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

print(json.loads(response['Body'].read())["choices"][0]["message"]["content"])

Sure, here is a basic implementation of Quick Sort in Python:

```python
def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quick_sort(left) + middle + quick_sort(right)
```

This function works by selecting a 'pivot' element from the array and partitioning the other elements into two sub-arrays, according to whether they are less than or greater than the pivot. The sub-arrays are then recursively sorted.

Here's how you can use it:

```python
print(quick_sort([3,6,8,10,1,2,1]))
# Output: [1, 1, 2, 3, 6, 8, 10]
```
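
The cell above indexes straight into `choices[0]["message"]["content"]`, which raises a bare `IndexError` or `KeyError` if the body has a different shape. A slightly more defensive sketch follows. `extract_chat_text` is a hypothetical helper, and the idea that a failed request returns a body without `choices` is an assumption, not documented behavior.

```python
import json


def extract_chat_text(body: bytes) -> str:
    """Return the assistant message text from an OpenAI-style chat completion body."""
    data = json.loads(body)
    choices = data.get("choices") or []
    if not choices:
        # Surface the raw body so an unexpected response is easy to debug.
        raise ValueError(f"no choices in response: {data}")
    return choices[0]["message"]["content"]


sample = json.dumps(
    {"choices": [{"message": {"role": "assistant", "content": "hello"}}]}
).encode()
print(extract_chat_text(sample))
```

Against the live endpoint, the argument would be `response['Body'].read()`.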



### 4.2 Message API, streaming mode

In [8]:
payload = {
    "model": "deepseek-ai/deepseek-coder-1.3b-instruct",
    "messages": [
    {
        "role": "user",
        "content": "Write a quick sort in python"
    }
    ],
    "max_tokens": 1024,
    "stream": True
}

response = runtime.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

buffer = ""
for event in response['Body']:
    buffer += event["PayloadPart"]["Bytes"].decode()
    last_idx = 0
    # re.MULTILINE lets ^ match each "data:" line, so every event in a chunk is handled
    for match in re.finditer(r'^data:\s*(.+?)\n\n', buffer, flags=re.MULTILINE):
        last_idx = match.end()
        try:
            data = json.loads(match.group(1).strip())
            print(data["choices"][0]["delta"]["content"], end="")
        except (json.JSONDecodeError, KeyError, IndexError):
            pass  # skip the "[DONE]" sentinel and deltas without content
    buffer = buffer[last_idx:]

Sure, here is a simple implementation of a quicksort algorithm in Python:

```python
def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + middle + quicksort(right)

print(quicksort([3,6,8,10,1,2,1]))
```

In this code, the quicksort function takes an array as input. If the input array is of length 0 or 1, it is already sorted, so it returns the input array.

Otherwise, it chooses a pivot element from the array. It then creates three lists: one for elements less than the pivot, one for elements equal to the pivot, and one for elements greater than the pivot. It then recursively sorts the elements less than and greater than the pivot and concatenates the results and the elements equal to the pivot.

The final sorted array is returned by the function. The last two lines of the code call the quicks
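
The streamed body arrives as server-sent events: each event is a `data: <json>` line followed by a blank line, and a `PayloadPart` may contain several events or only part of one. The generator below is a minimal sketch of reassembling events across chunk boundaries. `iter_sse_json` is a hypothetical helper; it assumes `\n\n` separators and a final `data: [DONE]` sentinel, and the demo uses hand-written bytes instead of a live endpoint.

```python
import codecs
import json


def iter_sse_json(chunks):
    """Yield decoded JSON objects from an iterable of SSE byte chunks."""
    # An incremental decoder tolerates multi-byte characters split across chunks.
    decoder = codecs.getincrementaldecoder("utf-8")()
    buffer = ""
    for chunk in chunks:
        buffer += decoder.decode(chunk)
        # Complete events end with a blank line; the last piece may be partial.
        *events, buffer = buffer.split("\n\n")
        for event in events:
            for line in event.splitlines():
                if not line.startswith("data:"):
                    continue
                data = line[len("data:"):].strip()
                if data == "[DONE]":
                    return
                yield json.loads(data)


raw = (
    b'data: {"choices": [{"delta": {"content": "Hel"}}]}\n\n'
    b'data: {"choices": [{"delta": {"content": "lo"}}]}\n\n'
    b'data: [DONE]\n\n'
)
# Split at arbitrary offsets to mimic PayloadPart boundaries.
chunks = [raw[:20], raw[20:75], raw[75:]]
text = "".join(e["choices"][0]["delta"].get("content") or "" for e in iter_sse_json(chunks))
print(text)
```

With the real response, the chunks would come from `(part["PayloadPart"]["Bytes"] for part in response["Body"])`.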

### 4.3 Completion API, non-streaming mode

In [9]:
from transformers import AutoTokenizer

# Use the tokenizer of the deployed model so the chat template matches
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-1.3b-instruct", trust_remote_code=True)
messages = [
    {"role": "user", "content": "write a quick sort algorithm in python."}
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

payload = {
    "model": "deepseek-ai/deepseek-coder-1.3b-instruct",
    "prompt": prompt,
    "max_tokens": 1024,
    "stream": False
}

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

print(json.loads(response['Body'].read())["choices"][0]["text"])

Sure, I can provide a basic quick sort algorithm in Python. Here's a plain version:

```python
def quickSort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[0]
        less_than_pivot = [x for x in arr[1:] if x <= pivot]
        greater_than_pivot = [x for x in arr[1:] if x > pivot]
        return quickSort(less_than_pivot) + [pivot] + quickSort(greater_than_pivot)

# Example usage:
arr = [3,6,8,10,1,2,1]
print(quickSort(arr))
# Output: [1, 1, 2, 3, 6, 8, 10]
```

This algorithm works by selecting a 'pivot' element from the array and partitioning the other elements into two sub-arrays, according to whether they are less than or greater than the pivot. The sub-arrays are then recursively sorted.
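
The chat and completion calls differ only in whether the payload carries `messages` or a pre-rendered `prompt`. A small sketch of a builder that enforces this follows. `build_payload` is a hypothetical helper, and its defaults simply mirror the values used in the cells above.

```python
def build_payload(model, *, messages=None, prompt=None, max_tokens=1024, stream=False):
    """Build a chat payload (messages) or a completion payload (prompt), never both."""
    if (messages is None) == (prompt is None):
        raise ValueError("pass exactly one of `messages` or `prompt`")
    payload = {"model": model, "max_tokens": max_tokens, "stream": stream}
    if messages is not None:
        payload["messages"] = messages
    else:
        payload["prompt"] = prompt
    return payload


chat = build_payload(
    "deepseek-ai/deepseek-coder-1.3b-instruct",
    messages=[{"role": "user", "content": "Write a quick sort in python"}],
)
print(sorted(chat))
```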



### 4.4 Completion API, streaming mode

In [10]:
payload = {
    "model": "deepseek-ai/deepseek-coder-1.3b-instruct",
    "prompt": prompt,
    "max_tokens": 1024,
    "stream": True
}

response = runtime.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

buffer = ""
for event in response['Body']:
    buffer += event["PayloadPart"]["Bytes"].decode()
    last_idx = 0
    # re.MULTILINE lets ^ match each "data:" line, so every event in a chunk is handled
    for match in re.finditer(r'^data:\s*(.+?)\n\n', buffer, flags=re.MULTILINE):
        last_idx = match.end()
        try:
            data = json.loads(match.group(1).strip())
            print(data["choices"][0]["text"], end="")
        except (json.JSONDecodeError, KeyError, IndexError):
            pass  # skip the "[DONE]" sentinel and malformed events
    buffer = buffer[last_idx:]


Sure, here is a simple implementation of the Quick Sort algorithm in Python. The algorithm is a bit modified to handle the last step where you need to compare the pivot with the rightmost element (Hopefully, to make it more compatible with Python's sorting behavior).

```python
def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[-1]
        left = [x for x in arr[:-1] if x < pivot]
        middle = [x for x in arr if x == pivot]
        right = [x for x in arr[0:-1] if x > pivot]
        return quick_sort(left) + middle + quick_sort(right)

# Testing the function
print(quick_sort([3,6,8,10,1,2,1]))
# Output: [1, 1, 2, 3, 6, 8, 10]
```

This function works by selecting a pivot element from the array and partitioning the other elements into two sub-arrays, according to whether they are less than or greater than the pivot. The sub-arrays are then recursively sorted.
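
The two streaming cells read different fields: chat chunks carry text in `choices[0]["delta"]["content"]`, completion chunks in `choices[0]["text"]`. As a sketch, one accessor can cover both shapes. `event_text` is a hypothetical helper and takes an already decoded JSON event.

```python
def event_text(event: dict) -> str:
    """Return the text carried by one streamed event, or '' if it has none."""
    choices = event.get("choices") or []
    if not choices:
        return ""
    choice = choices[0]
    if "delta" in choice:
        # Chat completion chunk; the content may be missing or null.
        return choice["delta"].get("content") or ""
    # Completion chunk.
    return choice.get("text") or ""


print(event_text({"choices": [{"delta": {"content": "def "}}]}))
print(event_text({"choices": [{"text": "quick_sort"}]}))
```

Returning an empty string for events without text means the caller can print every event without a `try`/`except` around the lookup.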
