# Qwen1.5-110B-Chat-AWQ vLLM deployment guide
In this tutorial, you will deploy an LMI (Large Model Inference) container from the AWS Deep Learning Containers (DLC) to SageMaker and run inference with it.

Please make sure the following permissions are granted before running the notebook:

- S3 bucket push access
- SageMaker access

## Step 1: Upgrade the SageMaker SDK and import dependencies

In [20]:
%pip install sagemaker --upgrade  --quiet
%pip install boto3==1.34.101

Note: you may need to restart the kernel to use updated packages.


In [21]:
import boto3
import sagemaker
from sagemaker import Model, image_uris, serializers, deserializers

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # region name of the current SageMaker Studio environment
account_id = sess.account_id()  # account_id of the current SageMaker Studio environment

## Step 2: Start preparing model artifacts
In the LMI container, we expect the following artifacts to help set up the model:
- serving.properties (required): defines the model server settings
- model.py (optional): a Python file that defines the core inference logic
- requirements.txt (optional): any additional pip packages to install

In [22]:
%%writefile serving.properties
engine=Python
option.model_id=Qwen/Qwen1.5-110B-Chat-AWQ
option.task=text-generation
option.trust_remote_code=true
option.tensor_parallel_degree=8
option.rolling_batch=vllm
option.quantize=awq
option.dtype=fp16
option.max_model_len=10272

Writing serving.properties
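
Before packaging, it can help to sanity-check the settings. The following is a minimal sketch, not part of the LMI tooling: `parse_properties` and `check_tensor_parallel` are hypothetical helpers that read `key=value` lines and confirm that `option.tensor_parallel_degree` does not exceed the number of GPUs on the target instance (assumed here to be 8, as on `ml.g5.48xlarge`).

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {raw!r}")
        props[key.strip()] = value.strip()
    return props


def check_tensor_parallel(props, gpus_per_instance):
    """Return True when the configured tensor parallel degree fits the GPU count."""
    degree = int(props["option.tensor_parallel_degree"])
    return 1 <= degree <= gpus_per_instance


# Same keys as the serving.properties cell above (subset, for illustration).
SERVING_PROPERTIES = """\
engine=Python
option.model_id=Qwen/Qwen1.5-110B-Chat-AWQ
option.tensor_parallel_degree=8
option.rolling_batch=vllm
option.quantize=awq
option.max_model_len=10272
"""

props = parse_properties(SERVING_PROPERTIES)
print(props["option.model_id"], check_tensor_parallel(props, gpus_per_instance=8))
```

In the notebook, the text could instead be read from the `serving.properties` file before it is moved into `mymodel/`.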


In [23]:
%%sh
mkdir mymodel
mv serving.properties mymodel/
tar czvf mymodel.tar.gz mymodel/
rm -rf mymodel

mymodel/
mymodel/serving.properties
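
If a shell is not available, the same archive layout can be produced from Python. This is a sketch using only the standard library; `package_model` and `list_members` are illustrative names, and the example writes into a temporary directory rather than the notebook's working directory.

```python
import os
import tarfile
import tempfile


def package_model(properties_text, out_path, arc_dir="mymodel"):
    """Store serving.properties under <arc_dir>/ inside a gzip-compressed tarball."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "serving.properties")
        with open(src, "w", encoding="utf-8") as f:
            f.write(properties_text)
        with tarfile.open(out_path, "w:gz") as tar:
            tar.add(src, arcname=f"{arc_dir}/serving.properties")
    return out_path


def list_members(tar_path):
    """Return the member names of a gzip-compressed tarball."""
    with tarfile.open(tar_path, "r:gz") as tar:
        return tar.getnames()


with tempfile.TemporaryDirectory() as work_dir:
    archive = package_model("engine=Python\n", os.path.join(work_dir, "mymodel.tar.gz"))
    members = list_members(archive)
    print(members)
```

Listing the members before uploading is a quick way to confirm that `serving.properties` sits inside the `mymodel/` directory, matching the output of the `tar` command above.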


## Step 3: Start building SageMaker endpoint
In this step, we will build the SageMaker endpoint from scratch.

### Getting the container image URI

[Large Model Inference available DLC](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers)


In [24]:
image_uri = image_uris.retrieve(
        framework="djl-deepspeed",
        region=sess.boto_session.region_name,
        version="0.27.0"
    )

### Upload artifact on S3 and create SageMaker model

In [25]:
s3_code_prefix = "large-model-lmi/code-qwen1.5-110B-8GPU"
bucket = sess.default_bucket()  # bucket to house artifacts
code_artifact = sess.upload_data("mymodel.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")

model = Model(image_uri=image_uri, model_data=code_artifact, role=role)

S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-east-1-475089398927/large-model-lmi/code-qwen1.5-110B-8GPU/mymodel.tar.gz


### Create SageMaker endpoint

You need to specify the instance type and the endpoint name.

In [None]:
instance_type = "ml.g5.48xlarge"
endpoint_name = sagemaker.utils.name_from_base("lmi-model-qwen1-5-110B-8GPU")

model.deploy(initial_instance_count=1,
             instance_type=instance_type,
             endpoint_name=endpoint_name,
             # Uncomment to give the container more time to download and load the model:
             # container_startup_health_check_timeout=3600
            )

# Requests and responses are JSON, so set a matching serializer and deserializer.
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
    deserializer=deserializers.JSONDeserializer(),
)

--------------------!
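
Loading a model of this size can take a while, and the kernel may be restarted before the deployment finishes. The sketch below shows one way to poll the endpoint status until it is ready. `wait_for_endpoint` is a hypothetical helper; it relies only on the `describe_endpoint` call of the boto3 SageMaker client and its `EndpointStatus` field. `FakeSageMakerClient` is a stand-in so the example runs without AWS access.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30, timeout_seconds=3600):
    """Poll describe_endpoint until the endpoint is InService, fails, or times out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"endpoint {endpoint_name} failed to deploy")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"endpoint {endpoint_name} still {status} after timeout")
        time.sleep(poll_seconds)


class FakeSageMakerClient:
    """Offline stand-in for boto3.client("sagemaker"); returns scripted statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_endpoint(self, EndpointName):
        return {"EndpointName": EndpointName, "EndpointStatus": next(self._statuses)}


fake = FakeSageMakerClient(["Creating", "Creating", "InService"])
final_status = wait_for_endpoint(fake, "demo-endpoint", poll_seconds=0)
print(final_status)
```

In the notebook, `boto3.client("sagemaker")` and the real `endpoint_name` would replace the fake client and the demo name.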

## Step 4: Test and benchmark the inference

In [27]:
%pip install transformers



In [28]:
from transformers import AutoTokenizer

MODEL_DIR = "Qwen/Qwen1.5-110B-Chat-AWQ"
# model = AutoModelForCausalLM.from_pretrained(MODEL_DIR, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, use_fast=False)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [29]:
prompt = "列出世界前十大高峰？"  # "List the ten highest mountains in the world"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
parameters = {
    "max_new_tokens": 1024,
    "do_sample": True,
    "stop_token_ids": [151645, 151643],
    "repetition_penalty": 1.05,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
}
response = predictor.predict(
    {"inputs": inputs, "parameters": parameters}
)
# text = str(response, "utf-8")
print(response)

{'generated_text': '世界前十大高峰依次是：珠穆朗玛峰、喜马拉雅山、洛子峰、马卡鲁峰、乔戈里峰、卓奥友峰、道拉吉利峰、马纳斯卢峰、南伽帕尔巴特峰和安纳普尔那峰。这些山峰都位于亚洲的喜马拉雅山脉和喀喇昆仑山脉，其中珠穆朗玛峰是世界最高峰，位于中国与尼泊尔边境上。'}
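
The string sent as `inputs` is the chat-formatted prompt, not the raw question. As a rough illustration of what `apply_chat_template` produces, here is a sketch that assumes the Qwen1.5 chat template follows the ChatML layout (`<|im_start|>role ... <|im_end|>`). `to_chatml` is a hypothetical helper; the tokenizer's own template remains authoritative.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render chat messages in ChatML form (assumed layout for Qwen1.5 chat models)."""
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "List the ten highest mountains in the world."},
]
text = to_chatml(messages)
print(text)
```

Under the same assumption, the `stop_token_ids` in the request presumably correspond to `<|im_end|>` and `<|endoftext|>`; this can be checked with `tokenizer.convert_ids_to_tokens([151645, 151643])`.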


## Streaming

In [30]:
import json
import boto3

smr_client = boto3.client("sagemaker-runtime")

In [31]:
import io
import json

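class TokenIterator:
    """Iterate over the text of each token in a SageMaker response stream."""

    def __init__(self, stream):
        self.byte_iterator = iter(stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()
            if line and line[-1] == ord("\n"):
                # A complete line is buffered: consume it and parse its JSON payload.
                self.read_pos += len(line)
                full_line = line[:-1].decode("utf-8").strip()
                # Strip an optional "data:" prefix (str.lstrip would strip characters, not a prefix).
                if full_line.startswith("data:"):
                    full_line = full_line[len("data:"):]
                if not full_line:
                    continue  # skip blank separator lines
                line_data = json.loads(full_line)
                return line_data["token"].get("text", "")
            # No complete line yet: append the next chunk (StopIteration ends the iteration).
            chunk = next(self.byte_iterator)
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk["PayloadPart"]["Bytes"])
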
def get_realtime_response_stream(sagemaker_runtime, endpoint_name, payload):
    response_stream = sagemaker_runtime.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        Body=json.dumps(payload),
        ContentType="application/json",
        CustomAttributes='accept_eula=false'
    )
    return response_stream
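
To see why `TokenIterator` buffers bytes before parsing, here is a minimal offline sketch of the same idea. Stream chunks can end in the middle of a JSON line, so bytes are accumulated and only complete, newline-terminated lines are parsed. `iter_token_text` and the fake chunks are illustrative; the `{"token": {"text": ...}}` line shape is taken from the parsing code above.

```python
import json


def iter_token_text(chunks):
    """Yield token text from byte chunks that may split JSON lines arbitrarily."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Parse every complete line; keep any trailing partial line in the buffer.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            text = line.decode("utf-8").strip()
            if text.startswith("data:"):
                text = text[len("data:"):]
            if text:
                yield json.loads(text)["token"].get("text", "")


# The second JSON line is deliberately split across two chunks.
fake_chunks = [
    b'{"token": {"text": "Hel"}}\n{"tok',
    b'en": {"text": "lo"}}\n',
    b'data:{"token": {"text": "!"}}\n',
]
streamed = "".join(iter_token_text(fake_chunks))
print(streamed)
```

Splitting on the newline byte is safe for UTF-8 text because `0x0A` never appears inside a multi-byte character.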

In [36]:
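# Set the doctor's name and the source file
doctorName = "谭鹏程"
filePath = './信达HCPDCR-20240122-021-2.txt'
with open(filePath, 'r', encoding='utf-8') as file:
    # Read the file contents (open() already decodes them as UTF-8)
    pageContent = file.read()

# Build the extraction prompt (in Chinese). In English it reads roughly:
#   "You are an experienced information-extraction expert for medical articles. From a web page,
#    extract doctor {doctorName}'s name, gender, hospital, department, title, academic title and
#    administrative post. Output format: {...}. If the page contains no information about the
#    doctor, output '查无此人' (no such person) for every field. If only part of the information
#    is present, output '未知' (unknown) for the missing fields. Output nothing else and no reasoning."
prompt = f"你是一个经验丰富的医疗方面文章的信息抽取专家，你可以从一篇网页中抽取出{doctorName}医生的姓名，性别，医院，科室，职称，学术头衔，行政职务\n"
prompt += "\n输出的格式为: \n{'姓名': 'XX', '医院': 'xx', '性别': 'XX', '科室': 'XX', '职称': 'XX', '学术头衔': 'XX', '行政职务': 'XX'} "
prompt += f"\n如果在网页信息中没有发现{doctorName}医生的任何信息, "
prompt += "输出的内容为: {'姓名': 查无此人, '医院': '查无此人', '性别': '查无此人', '科室': '查无此人', '职称': '查无此人', '学术头衔': '查无此人', '行政职务': '查无此人'}"
prompt += f"\n如果在网页信息中只包括{doctorName}医生的部分信息，在输出的对应字段后面输出\"未知\""
prompt += "\n不要输出任何其他内容，不要输出推理过程"
prompt += f"\n请提取{doctorName}医生的相关信息，包含{doctorName}医生的网页信息如下: \n"
prompt += pageContent


print(prompt)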

你是一个经验丰富的医疗方面文章的信息抽取专家，你可以从一篇网页中抽取出谭鹏程医生的姓名，性别，医院，科室，职称，学术头衔，行政职务

输出的格式为: 
{'姓名': 'XX', '医院': 'xx', '性别': 'XX', '科室': 'XX', '职称': 'XX', '学术头衔': 'XX', '行政职务': 'XX'} 
如果在网页信息中没有发现谭鹏程医生的任何信息, 输出的内容为: {'姓名': 查无此人, '医院': '查无此人', '性别': '查无此人', '科室': '查无此人', '职称': '查无此人', '学术头衔': '查无此人', '行政职务': '查无此人'}
如果在网页信息中只包括谭鹏程医生的部分信息，在输出的对应字段后面输出"未知"
不要输出任何其他内容，不要输出推理过程
请提取谭鹏程医生的相关信息，包含谭鹏程医生的网页信息如下: 
内科系统-烟台市中医医院官方网站ENGLISH中文版专家介绍◌创伤显微外科、手足外科◌中医药预防保健中心◌莱山区扶正堂门诊部◌开发区扶正堂门诊部难治性肠癌、难治耐药性肿瘤、难治性肝胆胰恶性肿瘤的微创介入治疗及各实体肿瘤中西医结合综合治疗擅长中西医治疗各系统恶性肿瘤，如结直肠癌、胃癌、肺癌、乳腺癌、肝胆肿瘤、胰腺癌、前列腺癌、卵巢癌、子宫内膜癌、宫颈癌、小细胞癌、肉瘤等。尤擅长将特色中医和现代医学有机结合，中西医结合治疗肿瘤的不同阶段，针对肿瘤的癌前病变、术后促恢复、防复发转移、联合西医减毒增效、晚期肿瘤患者长期带瘤生存、提高生活质量、延长生存期等方面疗效显著。擅长肺癌、胃癌、结直肠癌、肝癌、乳腺癌、卵巢癌、胰腺癌等各系统恶性肿瘤的中西医结合诊疗及肿瘤的精准穿刺活检与介入治疗。临证诊疗中，秉承中西医结合优势互补原则，通过中医药与其他治癌手段综合有序的配合，以达到中西医结合防治肿瘤复发转移，增加放化疗、靶向治疗、免疫治疗等的疗效、减轻其毒性，延长生存期，提高生活质量的目的。擅于中西医结合治疗消化系统肿瘤、泌尿系统肿瘤、乳腺肿瘤、呼吸道肿瘤。善于中医药治疗临床常见病、多发病。在中医药配合手术、放化疗、靶向治疗等减毒增效、预防复发转移及中医药治疗晚期恶性肿瘤方面具有丰富的经验。擅长胃癌、肺癌、食管癌、肝癌、肠癌、乳腺癌、妇科肿瘤等的中西医结合治疗，尤其擅长以中医中药治疗化疗药物、靶向药物、免疫药物等抗肿瘤药物产
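
Because the whole page is appended to the prompt, it is worth checking that the request fits the context window set by `option.max_model_len=10272` in `serving.properties`. The sketch below assumes the prompt tokens and the generated tokens share that window. `max_prompt_tokens` and `fits_context` are hypothetical helpers, and the token counts passed in are placeholders.

```python
def max_prompt_tokens(max_new_tokens, max_model_len):
    """Tokens left for the prompt once the generation budget is reserved."""
    return max(max_model_len - max_new_tokens, 0)


def fits_context(prompt_tokens, max_new_tokens, max_model_len):
    """Return True when prompt plus requested generation fits the context window."""
    return prompt_tokens + max_new_tokens <= max_model_len


MAX_MODEL_LEN = 10272   # option.max_model_len from serving.properties
MAX_NEW_TOKENS = 1024   # max_new_tokens from the request parameters

budget = max_prompt_tokens(MAX_NEW_TOKENS, MAX_MODEL_LEN)
print(budget)  # 9248
print(fits_context(9000, MAX_NEW_TOKENS, MAX_MODEL_LEN))
```

In the notebook, the prompt token count could be obtained with `len(tokenizer(inputs)["input_ids"])` once `inputs` has been built.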

In [37]:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

parameters = {
        "max_new_tokens":1024, 
        "do_sample": True,
        "stop_token_ids":[151645,151643],
        "repetition_penalty": 1.05,
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,
    }

payload = {
    "inputs": inputs,
    "parameters": parameters,
    "stream": True,  # ask the endpoint to stream the response
}
response_stream = get_realtime_response_stream(smr_client, endpoint_name, payload)
# Print tokens as they arrive
for token in TokenIterator(response_stream["Body"]):
    print(token, end="")

{'姓名': '谭鹏程', '医院': '烟台市中医医院', '性别': '未知', '科室': '未知', '职称': '未知', '学术头衔': '未知', '行政职务': '未知'}
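
The model returns its answer as a Python-style dict literal with single quotes, which `json.loads` would reject. One possible way to turn it into a dictionary is `ast.literal_eval`, sketched below. `parse_extraction` and `EXPECTED_KEYS` are illustrative names; the sample string is the output shown above.

```python
import ast

# Field names requested in the prompt: name, hospital, gender, department,
# title, academic title, administrative post.
EXPECTED_KEYS = {"姓名", "医院", "性别", "科室", "职称", "学术头衔", "行政职务"}


def parse_extraction(raw_text):
    """Parse the model's dict-literal answer and check that every field is present."""
    record = ast.literal_eval(raw_text.strip())
    if not isinstance(record, dict):
        raise ValueError("model output is not a dict literal")
    missing = EXPECTED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return record


sample = "{'姓名': '谭鹏程', '医院': '烟台市中医医院', '性别': '未知', '科室': '未知', '职称': '未知', '学术头衔': '未知', '行政职务': '未知'}"
record = parse_extraction(sample)
print(record["医院"])
```

`ast.literal_eval` only evaluates literals, so it is safer than `eval` for model output, but it still raises `ValueError` or `SyntaxError` on malformed text, so callers should be prepared to handle both.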

In [38]:
!nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.



## Clean up the environment

In [None]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
model.delete_model()