# Qwen1.5-72B-Chat-AWQ vLLM deployment guide
In this tutorial, you will deploy a Large Model Inference (LMI) container from the AWS Deep Learning Containers (DLC) to SageMaker and run inference against it.

Please make sure the following permissions are granted before running the notebook:

- S3 bucket push access
- SageMaker access

## Step 1: Upgrade the SageMaker SDK and import dependencies

In [1]:
%pip install sagemaker --upgrade  --quiet
%pip install boto3==1.34.101

Note: you may need to restart the kernel to use updated packages.
Collecting boto3==1.34.101
  Downloading boto3-1.34.101-py3-none-any.whl.metadata (6.6 kB)
Collecting botocore<1.35.0,>=1.34.101 (from boto3==1.34.101)
  Downloading botocore-1.34.111-py3-none-any.whl.metadata (5.7 kB)
Downloading boto3-1.34.101-py3-none-any.whl (139 kB)
Downloading botocore-1.34.111-py3-none-any.whl (12.3 MB)
Installing collected packages: botocore, boto3
  Attempting uninstall: botocore
    Found existing installation: botocore 1.34.51
    Uninstalling botocore-1.34.51:
      Successfully uninstalled botocore-1.34.51
  Attempting uninstall: boto3
    Found existing installation: boto3 1.34.51
    Uninstalling boto3-1.34.51:
      Success

In [12]:
import boto3
import sagemaker
from sagemaker import Model, image_uris, serializers, deserializers

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # region name of the current SageMaker Studio environment
account_id = sess.account_id()  # account_id of the current SageMaker Studio environment

## Step 2: Start preparing model artifacts
In the LMI container, we expect the following artifacts to help set up the model:
- serving.properties (required): defines the model server settings
- model.py (optional): a Python file that defines the core inference logic
- requirements.txt (optional): any additional pip packages to install

In [13]:
%%writefile serving.properties
engine=Python
option.model_id=Qwen/Qwen1.5-72B-Chat-AWQ
option.task=text-generation
option.trust_remote_code=true
option.tensor_parallel_degree=4
option.rolling_batch=vllm
option.quantize=awq
option.dtype=fp16
option.max_rolling_batch_size=64
option.max_model_len=10272

Writing serving.properties
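
Before packaging, it can help to sanity-check the settings. The following is a minimal sketch, not part of the LMI tooling: `parse_properties` and `check_settings` are hypothetical helpers, and the `gpu_count=4` value is an assumption about the instance used later in this guide.

```python
# Sketch: parse "key=value" properties text and run two illustrative checks.
# PROPERTIES_TEXT mirrors the serving.properties file written above.
PROPERTIES_TEXT = """\
engine=Python
option.model_id=Qwen/Qwen1.5-72B-Chat-AWQ
option.task=text-generation
option.trust_remote_code=true
option.tensor_parallel_degree=4
option.rolling_batch=vllm
option.quantize=awq
option.dtype=fp16
option.max_rolling_batch_size=64
option.max_model_len=10272
"""


def parse_properties(text):
    """Return {key: value} for each non-blank, non-comment line."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_settings(props, gpu_count):
    """Hypothetical sanity checks; returns a list of problems found."""
    problems = []
    tp = int(props.get("option.tensor_parallel_degree", "1"))
    if tp > gpu_count:
        problems.append(f"tensor_parallel_degree={tp} exceeds {gpu_count} GPUs")
    model_id = props.get("option.model_id", "")
    if props.get("option.quantize") == "awq" and "AWQ" not in model_id:
        problems.append("quantize=awq but model_id does not look like an AWQ checkpoint")
    return problems


props = parse_properties(PROPERTIES_TEXT)
print(check_settings(props, gpu_count=4))  # gpu_count is an assumption
```

An empty list means neither check fired; the checks are illustrative and do not replace the validation the container performs at startup.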


In [14]:
%%sh
mkdir mymodel
mv serving.properties mymodel/
tar czvf mymodel.tar.gz mymodel/
rm -rf mymodel

mymodel/
mymodel/serving.properties
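
If a shell is not available, roughly the same archive can be produced from Python with the standard `tarfile` module. This is a sketch under assumptions: `build_model_archive` and `list_archive` are hypothetical helper names, and unlike the `tar` command above this version does not add a separate entry for the `mymodel/` directory itself.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def build_model_archive(properties_text, out_dir, folder="mymodel"):
    """Write <out_dir>/<folder>.tar.gz containing <folder>/serving.properties."""
    archive_path = Path(out_dir) / f"{folder}.tar.gz"
    data = properties_text.encode("utf-8")
    with tarfile.open(archive_path, "w:gz") as tar:
        info = tarfile.TarInfo(name=f"{folder}/serving.properties")
        info.size = len(data)  # size must be set before adding the file object
        tar.addfile(info, io.BytesIO(data))
    return archive_path


def list_archive(archive_path):
    """Return the member names stored in a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


# Demonstration in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = build_model_archive("engine=Python\n", tmp)
    members = list_archive(path)
    print(members)
```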


## Step 3: Start building SageMaker endpoint
In this step, we will build a SageMaker endpoint from scratch.

### Getting the container image URI

See the list of [available Large Model Inference DLC images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers).


In [15]:
image_uri = image_uris.retrieve(
    framework="djl-deepspeed",
    region=sess.boto_session.region_name,
    version="0.27.0",
)

### Upload the artifact to S3 and create the SageMaker model

In [17]:
s3_code_prefix = "large-model-lmi/code-qwen1.5-72B"
bucket = sess.default_bucket()  # bucket to house artifacts
code_artifact = sess.upload_data("mymodel.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")

model = Model(image_uri=image_uri, model_data=code_artifact, role=role)

S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-east-1-434444145045/large-model-lmi/code-qwen1.5-72B/mymodel.tar.gz


### Create the SageMaker endpoint

You need to specify the instance type and the endpoint name.

In [18]:
instance_type = "ml.g5.12xlarge"
endpoint_name = sagemaker.utils.name_from_base("lmi-model-qwen1-5-72B")

model.deploy(initial_instance_count=1,
             instance_type=instance_type,
             endpoint_name=endpoint_name,
             # container_startup_health_check_timeout=3600
            )



--------------!
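
A back-of-the-envelope estimate shows why a 4-bit AWQ checkpoint with `tensor_parallel_degree=4` can plausibly fit on this instance. The sketch below is only arithmetic under stated assumptions: 72e9 is the nominal parameter count from the model name, the GPU count and per-GPU memory are assumed values for `ml.g5.12xlarge`, and the formula ignores quantization scales, zero points, and any layers kept in fp16.

```python
def awq_weight_gib(num_params, bits_per_weight=4):
    """Approximate weight footprint in GiB for a uniformly quantized model."""
    return num_params * bits_per_weight / 8 / 2**30


NUM_PARAMS = 72e9      # nominal parameter count taken from the model name
GPU_COUNT = 4          # assumption: GPUs available on ml.g5.12xlarge
GPU_MEMORY_GIB = 24    # assumption: memory per GPU

total = awq_weight_gib(NUM_PARAMS)
per_gpu = total / GPU_COUNT          # weights are sharded by tensor parallelism
headroom = GPU_MEMORY_GIB - per_gpu  # left for KV cache, activations, overhead

print(f"weights: {total:.1f} GiB total, {per_gpu:.1f} GiB per GPU, "
      f"about {headroom:.1f} GiB per GPU left for everything else")
```

The real footprint will be somewhat higher than this estimate, so treat the headroom as an upper bound when choosing `max_model_len` and `max_rolling_batch_size`.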

In [19]:
# endpoint_name = 'lmi-model-qwen1-5-72B-2024-05-23-09-10-23-101'
print(endpoint_name)

lmi-model-qwen1-5-72B-2024-05-23-13-32-16-606


In [20]:
# Requests and responses are JSON, so we specify a JSON serializer and deserializer
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
    deserializer=deserializers.JSONDeserializer(),
)

## Step 5: Test and benchmark the inference

In [12]:
!pip install transformers -U

Collecting transformers
  Downloading transformers-4.41.1-py3-none-any.whl.metadata (43 kB)
Collecting huggingface-hub<1.0,>=0.23.0 (from transformers)
  Downloading huggingface_hub-0.23.1-py3-none-any.whl.metadata (12 kB)
Collecting tokenizers<0.20,>=0.19 (from transformers)
  Downloading tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading transformers-4.41.1-py3-none-any.whl (9.1 MB)
Downloading huggingface_hub-0.23.1-py3-none-any.whl (401 kB)
Downloading tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)

In [21]:
from transformers import AutoTokenizer

MODEL_DIR = "Qwen/Qwen1.5-72B-Chat-AWQ"
# model = AutoModelForCausalLM.from_pretrained(MODEL_DIR, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, use_fast=False)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [22]:
prompt = "世界上第二高峰是哪座？"  # "Which is the second-highest peak in the world?"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
parameters = {
    "max_new_tokens": 1024,
    "do_sample": True,
    "stop_token_ids": [151645, 151643],
    "repetition_penalty": 1.05,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
}
response = predictor.predict(
    {"inputs": inputs, "parameters": parameters}
)
# text = str(response, "utf-8")
print(response)

{'generated_text': '世界第二高峰是乔戈里峰，又称K2峰，海拔8,611米，位于巴基斯坦和中国交界的喀喇昆仑山脉。'}
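
To see what `apply_chat_template` hands to the endpoint, it helps to build the prompt string by hand. The sketch below assumes the tokenizer uses the ChatML layout (`<|im_start|>` / `<|im_end|>` markers), which is our understanding of the Qwen1.5 chat template; `to_chatml` is a hypothetical helper, and the tokenizer's own template remains the source of truth. Under the same assumption, the two `stop_token_ids` above would correspond to `<|im_end|>` and `<|endoftext|>`.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render chat messages in ChatML form (assumed layout, see lead-in)."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Which is the second-highest peak in the world?"},
]
rendered = to_chatml(example_messages)
print(rendered)
```

Comparing `rendered` with the string returned by `tokenizer.apply_chat_template(..., tokenize=False)` is a quick way to confirm or refute the assumed layout.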


## Streaming

In [23]:
import json
import boto3

smr_client = boto3.client("sagemaker-runtime")

In [24]:
import io
import json

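class TokenIterator:
    """Iterate over the token texts of a SageMaker response stream."""

    def __init__(self, stream):
        self.byte_iterator = iter(stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()
            if line and line[-1] == ord("\n"):
                # A complete line is buffered: consume it.
                self.read_pos += len(line)
                full_line = line[:-1].decode("utf-8").strip()
                if not full_line:
                    continue  # skip blank separator lines
                # Drop an optional "data:" prefix before parsing the JSON line.
                line_data = json.loads(full_line.removeprefix("data:"))
                return line_data["token"].get("text", "")
            # No complete line yet: read the next chunk.
            # StopIteration from the exhausted stream ends the iteration.
            chunk = next(self.byte_iterator)
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk["PayloadPart"]["Bytes"])
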
        
def get_realtime_response_stream(sagemaker_runtime, endpoint_name, payload):
    response_stream = sagemaker_runtime.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        Body=json.dumps(payload),
        ContentType="application/json",
        CustomAttributes='accept_eula=false'
    )
    return response_stream

In [25]:
prompt = "世界上第二高峰是哪座？"  # "Which is the second-highest peak in the world?"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

parameters = {
    "max_new_tokens": 1024,
    "do_sample": True,
    "stop_token_ids": [151645, 151643],
    "repetition_penalty": 1.05,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
}

payload = {
    "inputs": inputs,
    "parameters": parameters,
    "stream": True,  # request a streamed response
}
response_stream = get_realtime_response_stream(smr_client, endpoint_name, payload)
# print_response_stream(response_stream)
for token in TokenIterator(response_stream["Body"]):
    # pass
    print(token, end="")

世界第二高峰是乔戈里峰，又称K2峰，海拔8,611米，位于巴基斯坦和中国交界的喀喇昆仑山脉。
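
The line buffering matters because a chunk boundary can fall in the middle of a JSON line. The sketch below replays that situation offline with hand-made events, so it needs no endpoint. `iter_tokens` is a hypothetical generator-based variant of the iterator above, and the event payloads are invented for illustration; only the `PayloadPart`/`Bytes` shape and the `token.text` field are taken from the code in this section.

```python
import json


def iter_tokens(events):
    """Yield token text from PayloadPart events that carry JSON lines."""
    buffer = b""
    for event in events:
        buffer += event["PayloadPart"]["Bytes"]
        # Emit every complete line; keep the unfinished tail in the buffer.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            line = line.strip()
            if line:
                yield json.loads(line)["token"].get("text", "")


# Hypothetical events: the second JSON line is split across two chunks.
fake_events = [
    {"PayloadPart": {"Bytes": b'{"token": {"text": "K2"}}\n{"token": {"te'}},
    {"PayloadPart": {"Bytes": b'xt": " is"}}\n'}},
    {"PayloadPart": {"Bytes": b'{"token": {"text": " second."}}\n'}},
]

tokens = list(iter_tokens(fake_events))
print("".join(tokens))
```

Because bytes are only parsed once a full line has arrived, multi-byte UTF-8 characters split across chunks are also handled.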

## Performance test

In [26]:
%pip install langchain

Note: you may need to restart the kernel to use updated packages.


In [27]:
from langchain_core.runnables import RunnableLambda

In [28]:
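# English translation of the prompt below:
#   You are a novelist who is passionate about creative writing and storytelling.
#   Please write a story for me, aimed at primary-school students aged 10-12.
#   Story background: a young woman named Lila discovers that she can control
#   the weather. She lives in a small town where everyone knows each other.
#   Other requirements: avoid violence, sexual content, and vulgar language;
#   the story must be at least 500 characters long. Please begin:
text1 = \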
"""你是一名小说家，热衷于创意写作和编写故事。 
请帮我编写一个故事，对象是10-12岁的小学生
故事背景：
讲述一位名叫莉拉的年轻女子发现自己有控制天气的能力。她住在一个小镇上，每个人都互相认识。
其他要求：
-避免暴力，色情，粗俗的语言
-长度要求不少于500字
请开始：
"""

In [55]:
def invoke_sagemaker(prompt):
    """Send one chat prompt to the endpoint and return the generated text."""
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ]
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    parameters = {
        "max_new_tokens": 100,
        "do_sample": True,
        "stop_token_ids": [151645, 151643],
        "repetition_penalty": 1.05,
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 250,
    }
    response = predictor.predict({"inputs": inputs, "parameters": parameters})
    print(response)
    return response["generated_text"]
    

In [56]:
chain = RunnableLambda(invoke_sagemaker)
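
Wrapping a synchronous function this way lets `abatch` issue several requests concurrently; to our understanding the calls are dispatched to worker threads. The same pattern can be sketched with the standard library alone. Everything below is illustrative: `fake_invoke` is a hypothetical stand-in that sleeps instead of calling the endpoint, and `timed_batch` is a hypothetical helper.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_invoke(prompt, latency_s=0.1):
    """Hypothetical stand-in for invoke_sagemaker: sleeps, then echoes."""
    time.sleep(latency_s)
    return f"echo: {prompt}"


def timed_batch(fn, prompts, max_workers):
    """Run fn over prompts in a thread pool; return (results, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(fn, prompts))
    return results, time.perf_counter() - start


results, elapsed = timed_batch(fake_invoke, ["p"] * 5, max_workers=5)
print(len(results), f"{elapsed:.2f}s")
```

With five workers, the five simulated requests overlap, so the batch takes roughly one request's latency rather than five times that. Swapping `fake_invoke` for `invoke_sagemaker` would give a LangChain-free version of the benchmark.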

In [57]:
import time

In [60]:
time1 = time.time()
await chain.abatch([text1]*1)
print(f'time cost:{time.time()-time1}')

{'generated_text': '在一个宁静的小镇上，住着一个活泼开朗的小女孩，名叫莉拉。莉拉是个十岁的小学生，她的眼睛像两颗闪烁的星星，总是充满了好奇和探索的光芒。这个小镇被绿色的田野和蔚蓝的天空环绕，生活平静而和谐。\n\n有一天，莉拉在放学回家的路上，突然天空乌云密布，电闪雷鸣。她看到一只小鸟在风雨中挣扎，试图找到避雨的地方。莉拉心中'}
time cost:3.8615078926086426


In [61]:
time1 = time.time()
await chain.abatch([text1]*5)
print(f'time cost:{time.time()-time1}')

{'generated_text': '在一个宁静的小村庄里，住着一个活泼开朗的小女孩，名叫莉拉。莉拉有一头金色的卷发，眼睛像晴朗的天空一样湛蓝。她的笑容总是能照亮整个小镇，而她的秘密，就像天空中的云朵，神秘又迷人。\n\n有一天，莉拉在森林里玩耍，无意中发现了自己的特殊能力——她可以控制天气。当她心情愉快时，天空就会放晴，阳光洒满大地；当她感到伤心'}
{'generated_text': '在阳光明媚的艾尔小镇上，住着一个活泼开朗的小女孩，名叫莉拉。莉拉是个普通的小学生，喜欢画画，热爱大自然，尤其是那些变幻莫测的云朵。然而，她的生活在一个普通的午后发生了改变。\n\n那天，莉拉在后院的苹果树下看书，突然一片乌云遮住了太阳，让她感到惊讶。她抬头看去，只见一朵巨大的乌云正快速向小镇飘来。她疑惑地伸出手'}
{'generated_text': '在一个名叫晴空镇的小地方，住着一个活泼可爱的10岁女孩，名叫莉拉。莉拉拥有一头亮丽的金色卷发和一双闪烁着好奇光芒的湛蓝眼睛。这个小镇以其四季如春的气候闻名，人们总是笑容满面，和谐共处。\n\n一天，莉拉在森林里的秘密基地玩耍，突然天空乌云密布，电闪雷鸣。她惊讶地发现，只要她心里想着晴天，乌云'}
{'generated_text': '在一个名叫晴空镇的小镇上，住着一个活泼可爱的小女孩，名叫莉拉。莉拉有着明亮的蓝眼睛和一头乱蓬蓬的金发，她的笑容总是像阳光一样温暖。这个小镇以四季如春的气候闻名，人们在这里过着宁静而和谐的生活。\n\n一天，莉拉在后院玩耍时，无意中发现了一个神奇的秘密。当她全心全意地想象着雨滴落在手中的情景，天空突然乌云'}
{'generated_text': '故事标题：莉拉与天空的秘密\n\n在宁静的艾尔文小镇上，住着一个活泼可爱的小女孩，名叫莉拉。她有一头金色的卷发，眼睛像湖水一样湛蓝，总是闪烁着好奇的光芒。莉拉是个特别的孩子，她对天空有着异乎寻常的热爱，每天放学后，她都会坐在后院的大橡树下，仰望那片无尽的蓝色。\n\n有一天，当莉拉如常地'}
time cost:6.017059326171875


In [63]:
100 / 6  # rough per-request rate: 100 new tokens in ~6 s for the batch of 5

16.666666666666668
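
The division above gives the per-request rate; the aggregate rate across concurrent requests is also worth recording. The sketch below recomputes both from the timings printed earlier. It assumes every response used the full `max_new_tokens=100` budget, which the truncated outputs suggest but which was not measured directly; `throughput` is a hypothetical helper.

```python
def throughput(num_requests, tokens_per_request, elapsed_s):
    """Return aggregate and per-request token rates for one timed batch."""
    total_tokens = num_requests * tokens_per_request
    return {
        "aggregate_tokens_per_s": total_tokens / elapsed_s,
        "per_request_tokens_per_s": tokens_per_request / elapsed_s,
    }


# Timings copied from the two runs above (batch sizes 1 and 5).
single = throughput(1, 100, 3.8615078926086426)
batch = throughput(5, 100, 6.017059326171875)

for name, stats in (("batch of 1", single), ("batch of 5", batch)):
    print(name, {key: round(value, 1) for key, value in stats.items()})
```

Under that assumption, the batch of 5 delivers more than three times the aggregate rate of the single request while each individual request slows down, which is the trade-off rolling batching is designed to make.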

## Clean up the environment

In [11]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
# model.delete_model()

ClientError: An error occurred (ValidationException) when calling the DeleteEndpoint operation: Could not find endpoint "lmi-model-qwen1-5-72B-2024-05-23-09-10-23-101".
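
The error above comes from trying to delete an endpoint name that no longer exists. A cleanup cell can be made tolerant of that case. The following is a sketch, not the SDK's own behavior: `delete_endpoint_if_exists` is a hypothetical helper meant to be called with `boto3.client("sagemaker")`, and the fake client classes exist only so the example runs without AWS access. The matched error code and message are taken from the traceback above.

```python
def delete_endpoint_if_exists(sm_client, endpoint_name):
    """Delete an endpoint; return False instead of raising if it is already gone."""
    try:
        sm_client.delete_endpoint(EndpointName=endpoint_name)
        return True
    except sm_client.exceptions.ClientError as err:
        error = err.response.get("Error", {})
        missing = (
            error.get("Code") == "ValidationException"
            and "Could not find endpoint" in error.get("Message", "")
        )
        if missing:
            return False
        raise  # any other failure is unexpected and should surface


# --- Offline demonstration with a hypothetical fake client (no AWS calls) ---
class FakeClientError(Exception):
    def __init__(self, code, message):
        super().__init__(message)
        self.response = {"Error": {"Code": code, "Message": message}}


class FakeSageMakerClient:
    class exceptions:
        ClientError = FakeClientError

    def __init__(self, existing):
        self.existing = set(existing)

    def delete_endpoint(self, EndpointName):
        if EndpointName not in self.existing:
            raise FakeClientError(
                "ValidationException",
                f'Could not find endpoint "{EndpointName}".',
            )
        self.existing.remove(EndpointName)


fake = FakeSageMakerClient(existing={"demo-endpoint"})
first = delete_endpoint_if_exists(fake, "demo-endpoint")
second = delete_endpoint_if_exists(fake, "demo-endpoint")
print(first, second)
```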