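### 1. Set the S3 path for the model (because of network issues, download the model from Hugging Face ahead of time and upload it to S3)
[Reference screenshot of the model files in S3](https://github.com/sundamu/aws-sagemaker-llm/blob/main/chatglm2-6b/s3_model_path_sample.png)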

In [18]:
# Upgrade awscli and sagemaker if needed; deployment currently succeeds with the default versions

import sagemaker
import boto3

role = sagemaker.get_execution_role()
sess = sagemaker.session.Session()
bucket = "llm-chatglm2-6b-sundamu"  # Change this to your own S3 bucket

# Use the public attribute rather than the private sess._region_name
region = sess.boto_region_name
account_id = sess.account_id()

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

In [19]:
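# Before running this cell, manually create the directory named by s3_model_prefix
# in the S3 bucket set in the previous step, and upload the model files to it
# chatglm2-6b model on Hugging Face: https://huggingface.co/THUDM/chatglm2-6b/tree/main

# S3 directory that holds the chatglm2-6b model files
s3_model_prefix = "chatglm2_6b_model"
# S3 directory for the deployment script; the script is generated later in this notebook
s3_code_prefix = "deploy_code"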
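
Since the model files have to be uploaded by hand, it can help to confirm they are actually under the prefix before deploying. The following is a minimal sketch of such a check; the helper name `list_model_files` is hypothetical and not part of the original notebook, and it takes the S3 client as an argument so it can be tried without touching a real bucket.

```python
def list_model_files(s3_client, bucket, prefix):
    """Return every object key found under s3://<bucket>/<prefix>/."""
    keys = []
    kwargs = {"Bucket": bucket, "Prefix": prefix.rstrip("/") + "/"}
    while True:
        page = s3_client.list_objects_v2(**kwargs)
        # "Contents" is absent when the prefix is empty
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
        if not page.get("IsTruncated"):
            return keys
        # Results are paginated; continue from where the last page stopped
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
```

In the notebook this could be called as `list_model_files(s3_client, bucket, s3_model_prefix)`; an empty list suggests the upload has not been done yet or the prefix is misspelled.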

### 2. Generate the model deployment script

In [20]:
# Local temporary directory for the deployment script and model configuration
!mkdir -p llm_chatglm2_deploy_code

In [21]:
%%writefile llm_chatglm2_deploy_code/model.py
from djl_python import Input, Output
import logging

from transformers import AutoModel, AutoTokenizer

model = None
tokenizer = None

def load_model(properties):
    # Use model_id when the container provides it; otherwise fall back to model_dir
    model_location = properties['model_dir']
    if "model_id" in properties:
        model_location = properties['model_id']
    logging.info(f"Loading model in {model_location}")

    # ChatGLM2 ships custom modeling code, so trust_remote_code is required
    tokenizer = AutoTokenizer.from_pretrained(model_location, trust_remote_code=True)
    # Load the weights in half precision and move them to the GPU
    model = AutoModel.from_pretrained(model_location, trust_remote_code=True).half().cuda()

    logging.info(f"Finished Loading model in {model_location}")

    return model, tokenizer

def handle(inputs: Input):
    logging.info("Start inference request")

    # Load the model lazily on the first call and cache it in the module globals
    global model, tokenizer
    if model is None:
        model, tokenizer = load_model(inputs.get_properties())

    # The model server makes an empty call to warm up the model
    if inputs.is_empty():
        return None
    data = inputs.get_as_json()

    input_sentences = data["inputs"]
    # "parameters" and "history" are optional in the request body
    params = data.get("parameters", {})
    history = data.get("history", [])

    response, history = model.chat(tokenizer, input_sentences, history=history, **params)
    result = {"outputs": response, "history": history}

    logging.info("Finished inference request")

    return Output().add_as_json(result)

Writing llm_chatglm2_deploy_code/model.py
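
The handler reads three keys from the JSON request body: `inputs`, `parameters`, and `history`. As a rough illustration of that contract, the sketch below validates a request and fills in defaults for the two optional keys. The helper `parse_chat_request` is hypothetical; it is not part of `djl_python` or of the original script.

```python
def parse_chat_request(data):
    """Split a request body into (prompt, generation parameters, chat history)."""
    if "inputs" not in data:
        raise ValueError('request body must contain "inputs"')
    # Treat a missing or null value as "no parameters" / "no history"
    params = dict(data.get("parameters") or {})
    history = list(data.get("history") or [])
    return data["inputs"], params, history
```

A request such as `{"inputs": "hello"}` would then yield an empty parameter dict and an empty history instead of raising a `KeyError`.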


In [22]:
print(f"option.s3url ==> s3://{bucket}/{s3_model_prefix}/")

option.s3url ==> s3://llm-chatglm2-6b-sundamu/chatglm2_6b_model/


#### Note: change option.s3url below to your own S3 path; you can copy it directly from the output of the previous cell

In [23]:
%%writefile llm_chatglm2_deploy_code/serving.properties
engine=Python
option.tensor_parallel_degree=1
option.s3url = s3://llm-chatglm2-6b-sundamu/chatglm2_6b_model/

Writing llm_chatglm2_deploy_code/serving.properties
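
Copying the S3 path by hand is easy to get wrong. One possible alternative, sketched here under the assumption that the three settings shown above are all that is needed, is to render the file from the notebook variables. The function names `render_serving_properties` and `write_serving_properties` are hypothetical.

```python
from pathlib import Path


def render_serving_properties(bucket, model_prefix, tensor_parallel_degree=1):
    """Build the text of serving.properties from the bucket and model prefix."""
    lines = [
        "engine=Python",
        f"option.tensor_parallel_degree={tensor_parallel_degree}",
        f"option.s3url=s3://{bucket}/{model_prefix.strip('/')}/",
    ]
    return "\n".join(lines) + "\n"


def write_serving_properties(directory, bucket, model_prefix):
    """Write serving.properties into the deployment code directory."""
    path = Path(directory) / "serving.properties"
    path.write_text(render_serving_properties(bucket, model_prefix))
    return path
```

In the notebook this could be called as `write_serving_properties("llm_chatglm2_deploy_code", bucket, s3_model_prefix)` in place of the `%%writefile` cell.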


#### Upgrade transformers (see [Issue344](https://github.com/THUDM/ChatGLM-6B/issues/344))

In [24]:
%%writefile llm_chatglm2_deploy_code/requirements.txt
-i https://pypi.tuna.tsinghua.edu.cn/simple
transformers==4.28.1

Writing llm_chatglm2_deploy_code/requirements.txt


In [25]:
# Package the deployment script locally and upload it to S3
# -f keeps rm quiet when no tarball from an earlier run exists
!rm -f model.tar.gz
!rm -rf llm_chatglm2_deploy_code/.ipynb_checkpoints
!tar czvf model.tar.gz llm_chatglm2_deploy_code

s3_code_artifact = sess.upload_data("model.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {s3_code_artifact}")

llm_chatglm2_deploy_code/
llm_chatglm2_deploy_code/requirements.txt
llm_chatglm2_deploy_code/serving.properties
llm_chatglm2_deploy_code/model.py
S3 Code or Model tar ball uploaded to --- > s3://llm-chatglm2-6b-sundamu/deploy_code/model.tar.gz
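
The same packaging step can also be done without shell commands. The sketch below uses the standard-library `tarfile` module to build the archive, skip `.ipynb_checkpoints`, and return the archived file names so the contents can be checked before uploading. The helper name `build_code_tarball` is hypothetical, and unlike `tar czvf` it stores only file entries, not a separate entry for the directory itself.

```python
import tarfile
from pathlib import Path


def build_code_tarball(source_dir, tar_path="model.tar.gz"):
    """Pack source_dir into a gzip tarball and return the archived file names."""
    source = Path(source_dir)
    with tarfile.open(tar_path, "w:gz") as tar:
        for path in sorted(source.rglob("*")):
            # Skip notebook checkpoint folders and bare directories
            if ".ipynb_checkpoints" in path.parts or not path.is_file():
                continue
            # Keep the top-level folder name inside the archive
            arcname = (Path(source.name) / path.relative_to(source)).as_posix()
            tar.add(path, arcname=arcname)
    with tarfile.open(tar_path, "r:gz") as tar:
        return tar.getnames()
```

Calling `build_code_tarball("llm_chatglm2_deploy_code")` writes `model.tar.gz` in the current directory, which can then be passed to `sess.upload_data` as before.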


### 3. Deploy the model

In [26]:
# Default container image; no need to change it unless you have special requirements
inference_image_uri = (
    f"727897471807.dkr.ecr.{region}.amazonaws.com.cn/djl-inference:0.21.0-deepspeed0.8.3-cu117"
)
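
The image URI above hard-codes the `amazonaws.com.cn` DNS suffix, which is used by the AWS China regions such as `cn-north-1`. The following sketch derives the suffix from the region name instead. The helper `djl_image_uri` is hypothetical. Its default account ID is the one used in this notebook; the registry account for regions outside China is different and is not guessed here, so it would need to be looked up and passed in explicitly.

```python
def djl_image_uri(
    region,
    account_id="727897471807",  # registry account taken from this notebook
    repo_tag="djl-inference:0.21.0-deepspeed0.8.3-cu117",
):
    """Build an ECR image URI, choosing the DNS suffix from the region name."""
    # China regions (cn-north-1, cn-northwest-1) use the .com.cn suffix
    dns_suffix = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
    return f"{account_id}.dkr.ecr.{region}.{dns_suffix}/{repo_tag}"
```

With `region = "cn-north-1"` this produces the same URI as the cell above.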

In [27]:
# Create the model
from sagemaker.utils import name_from_base

model_name = name_from_base("chatglm2")
print(model_name)
print(f"Image going to be used is ---- > {inference_image_uri}")

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": s3_code_artifact
    },
    
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

chatglm2-2023-08-08-03-48-32-618
Image going to be used is ---- > 727897471807.dkr.ecr.cn-north-1.amazonaws.com.cn/djl-inference:0.21.0-deepspeed0.8.3-cu117
Created Model: arn:aws-cn:sagemaker:cn-north-1:086238767671:model/chatglm2-2023-08-08-03-48-32-618


In [28]:
# Create the endpoint config
# For chatglm2-6b, deploy on ml.g4dn.2xlarge, ml.p3.2xlarge, or a larger instance type (16 GB of GPU memory)
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "poctest",
            "ModelName": model_name,
            "InstanceType": "ml.g4dn.2xlarge",
            "InitialInstanceCount": 1,
        },
    ],
)
endpoint_config_response

{'EndpointConfigArn': 'arn:aws-cn:sagemaker:cn-north-1:086238767671:endpoint-config/chatglm2-2023-08-08-03-48-32-618-config',
 'ResponseMetadata': {'RequestId': '846a9faf-3f05-4bb0-8a3e-4f0946d02490',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '846a9faf-3f05-4bb0-8a3e-4f0946d02490',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '124',
   'date': 'Tue, 08 Aug 2023 03:48:34 GMT'},
  'RetryAttempts': 0}}

In [29]:
# Deploy the model
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

Created Endpoint: arn:aws-cn:sagemaker:cn-north-1:086238767671:endpoint/chatglm2-2023-08-08-03-48-32-618-endpoint


#### Poll the deployment status (usually takes less than 10 minutes)

In [30]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws-cn:sagemaker:cn-north-1:086238767671:endpoint/chatglm2-2023-08-08-03-48-32-618-endpoint
Status: InService
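
The loop above exits on any status other than `Creating`, so a failed deployment ends the loop quietly, and a stuck one keeps it running. A slightly more defensive variant is sketched below: it raises on a non-`InService` outcome and gives up after a timeout. The function name `wait_for_endpoint` and its default timeout are assumptions, and the `sleep` argument exists only so the loop can be exercised without real waiting.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=60,
                      timeout_seconds=1800, sleep=time.sleep):
    """Poll describe_endpoint until the endpoint is InService, fails, or times out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        print("Status: " + status)
        if status == "InService":
            return resp
        if status != "Creating":
            # e.g. "Failed"; FailureReason is included by the API when available
            reason = resp.get("FailureReason", "no reason given")
            raise RuntimeError(f"Endpoint {endpoint_name} is {status}: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} still Creating after {timeout_seconds}s")
        sleep(poll_seconds)
```

boto3 also offers a built-in waiter, `sm_client.get_waiter("endpoint_in_service")`, which may be preferable when per-poll status printing is not needed.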


### 4. Test the model

In [31]:
import json
import boto3

smr_client = boto3.client("sagemaker-runtime")

parameters = {
  "max_length": 2048,
  "temperature": 0.01,
  "num_beams": 1,
  "do_sample": False,
  "top_p": 0.7,
  "logits_processor" : None,
}

In [34]:
# The prompt means "Please introduce yourself"
prompt_test = "请介绍下你自己"
response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "inputs": prompt_test,
            "parameters": parameters,
            "history": [],
        }
    ),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")

'{\n  "outputs":"我是一名人工智能语言模型,由清华大学 KEG 实验室和智谱 AI 公司于 2022年共同训练的语言模型 GLM-130B 开发而来。我的任务是针对用户的问题和要求提供适当的答复和支持。我可以说中文和英文两种语言,并且可以进行文本生成、对话和翻译等多种任务。",\n  "history":[\n    [\n      "请介绍下你自己",\n      "我是一名人工智能语言模型,由清华大学 KEG 实验室和智谱 AI 公司于 2022年共同训练的语言模型 GLM-130B 开发而来。我的任务是针对用户的问题和要求提供适当的答复和支持。我可以说中文和英文两种语言,并且可以进行文本生成、对话和翻译等多种任务。"\n    ]\n  ]\n}'
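
The response carries the updated `history`, so a multi-turn conversation can be held by sending that history back with the next prompt. The helper below is a minimal sketch of this round trip; the name `chat` is hypothetical, and it assumes the endpoint returns the `outputs` and `history` keys shown in the response above.

```python
import json


def chat(smr_client, endpoint_name, prompt, history=None, parameters=None):
    """Send one prompt to the endpoint and return (reply, updated history)."""
    body = {
        "inputs": prompt,
        "parameters": parameters or {},
        "history": history or [],
    }
    response = smr_client.invoke_endpoint(
        EndpointName=endpoint_name,
        Body=json.dumps(body),
        ContentType="application/json",
    )
    # The body is a stream of UTF-8 encoded JSON
    result = json.loads(response["Body"].read().decode("utf8"))
    return result["outputs"], result["history"]
```

A two-turn exchange could then look like `reply, history = chat(smr_client, endpoint_name, prompt_test, parameters=parameters)` followed by a second call that passes `history=history`.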

### 5. Clean up resources (you can also delete them manually in the SageMaker Console)

In [None]:
# !aws sagemaker delete-endpoint --endpoint-name <your endpoint name> 

In [None]:
# !aws sagemaker delete-endpoint-config --endpoint-config-name <your endpoint config name> 

In [None]:
# !aws sagemaker delete-model --model-name <your model name> 
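
The three CLI commands above have direct boto3 equivalents, so the cleanup can also be run from Python using the names already defined in this notebook. The wrapper function `delete_endpoint_resources` is a hypothetical convenience; the three client calls it makes are standard SageMaker API operations.

```python
def delete_endpoint_resources(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint, then its config, then the model."""
    # The endpoint is the billed resource, so remove it first
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm_client.delete_model(ModelName=model_name)
```

In the notebook this could be called as `delete_endpoint_resources(sm_client, endpoint_name, endpoint_config_name, model_name)`.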