
### Serve large models on SageMaker with DJL DeepSpeed Container

In this notebook, we explore how to host a large language model on SageMaker using the latest container built on DeepSpeed and DJL. DJL provides the serving framework, while DeepSpeed is the key sharding library we use to enable hosting of large models. We use DJLServing as the model serving solution in this example. DJLServing is a high-performance, universal model serving solution powered by the Deep Java Library (DJL) that is programming-language agnostic. To learn more about DJL and DJLServing, you can refer to our recent blog post (https://aws.amazon.com/blogs/machine-learning/deploy-large-models-on-amazon-sagemaker-using-djlserving-and-deepspeed-model-parallel-inference/).

Model parallelism can help deploy large models that would normally be too large for a single GPU. With model parallelism, we partition and distribute a model across multiple GPUs. Each GPU holds a different part of the model, resolving the memory capacity issue for the largest deep learning models with billions of parameters. This notebook uses tensor parallelism techniques which allow GPUs to work simultaneously on the same layer of a model and achieve low latency inference relative to a pipeline parallel solution.
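To make the idea concrete, here is a toy sketch of column-wise tensor parallelism in plain Python. It is an illustration only, not how DeepSpeed or any serving engine implements sharding: the helper names (`matvec`, `split_columns`) are made up for this example, and the "devices" are just list slices.

```python
# Toy illustration of column-wise tensor parallelism (hypothetical helpers, no GPUs involved).

def matvec(x, w):
    """Multiply row vector x (length n) by matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, parts):
    """Split the columns of w into `parts` equal shards, one per simulated device."""
    cols = len(w[0])
    assert cols % parts == 0, "column count must be divisible by the number of shards"
    step = cols // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

# Each simulated device holds one shard and computes its slice of the same layer.
shards = split_columns(w, parts=2)
partial_outputs = [matvec(x, shard) for shard in shards]

# Concatenating the slices (an all-gather in a real system) recovers the full output.
sharded_result = [value for part in partial_outputs for value in part]
full_result = matvec(x, w)
```

Because every device works on the same layer at the same time, the per-layer latency shrinks with the number of shards, at the cost of a communication step to combine the slices.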

SageMaker has rolled out a DeepSpeed container, which lets users leverage SageMaker's managed serving capabilities and takes care of the undifferentiated heavy lifting.

In this notebook, we deploy an open source Llama model across the GPUs of an ml.g5.48xlarge instance (the code below downloads `meta-llama/Llama-2-13b-hf`). Note that the Llama 7B fp16 model can be deployed on a single GPU such as g5.2xlarge (24 GB VRAM); we just show the code that can deploy the LLM across multiple GPUs in SageMaker. DeepSpeed is used for tensor parallelism inference while DJLServing handles inference requests and the distributed workers. For further reading on DeepSpeed you can refer to https://arxiv.org/pdf/2207.00032.pdf
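As a back-of-the-envelope check on the memory claim, the sketch below estimates the size of fp16 weights at 2 bytes per parameter. It counts weights only and ignores the KV cache, activations, and framework overhead, so real usage is higher. The function names are hypothetical helpers for this illustration.

```python
def fp16_weight_gb(params_billion):
    """Approximate fp16 weight size in GB: 2 bytes per parameter (weights only)."""
    return params_billion * 2.0

def per_gpu_weight_gb(params_billion, tensor_parallel_degree):
    """Weight size per GPU when the weights are sharded evenly across GPUs."""
    return fp16_weight_gb(params_billion) / tensor_parallel_degree

single_gpu_7b = fp16_weight_gb(7)          # weights of a 7B model on one GPU
sharded_13b = per_gpu_weight_gb(13, 8)     # weights of a 13B model split over 8 GPUs
```

By this estimate, 7B parameters in fp16 take roughly 14 GB, which is why such a model can fit on a single 24 GB GPU.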


## Create SageMaker compatible Model artifact and Upload Model to S3

SageMaker needs the model to be in a tarball format. In this notebook we create the model artifact containing the inference code, which shortens the endpoint creation time.

The tarball has the following layout:

```
code
├── model.py
├── requirements.txt
└── serving.properties
```


- `model.py` is the key file which will handle any requests for serving. 
- `requirements.txt` has the required libraries needed to be installed when the container starts up.
- `serving.properties` is the configuration file whose settings are read when the model is loaded and can be used to customize `model.py` at run time.
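The packaging step can also be done from Python. The following is a minimal sketch using the standard-library `tarfile` module; `build_model_tarball` is a hypothetical helper, and the three files are empty placeholders created in a temporary directory purely for the demo.

```python
import tarfile
import tempfile
from pathlib import Path

def build_model_tarball(src_dir, tar_path):
    """Pack every file in src_dir into a gzip-compressed tarball and return the member names."""
    src_dir = Path(src_dir)
    with tarfile.open(tar_path, "w:gz") as tar:
        for path in sorted(src_dir.iterdir()):
            # arcname keeps the top-level folder name inside the archive
            tar.add(path, arcname=f"{src_dir.name}/{path.name}")
    with tarfile.open(tar_path, "r:gz") as tar:
        return tar.getnames()

# Demo with placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    code_dir = Path(tmp) / "code"
    code_dir.mkdir()
    for name in ("model.py", "requirements.txt", "serving.properties"):
        (code_dir / name).write_text("# placeholder\n")
    members = build_model_tarball(code_dir, Path(tmp) / "model.tar.gz")
```

Listing the members after writing is a cheap way to confirm the archive has the expected layout before uploading it.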


### Model download and upload to S3

In [None]:
!git clone https://github.com/vllm-project/vllm.git

In [None]:
!pip install huggingface-hub -Uqq
!pip install -U sagemaker

In [1]:
import sagemaker
from sagemaker.model import Model
from sagemaker import serializers, deserializers
from sagemaker import image_uris
import boto3
import os
import time
import json

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts

region = sess.boto_region_name  # public attribute; avoids relying on the private _region_name
account_id = sess.account_id()

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [2]:
from huggingface_hub import snapshot_download
from pathlib import Path

local_model_path = Path("./LLM_llama2_model")
local_model_path.mkdir(exist_ok=True)
#model_name = "meta-llama/Llama-2-70b-chat-hf"
model_name = "meta-llama/Llama-2-13b-hf"
#commit_hash = "36d9a7388cc80e5f4b3e9701ca2f250d21a96c30"
#set your hugging face access token
token = ""

In [3]:
#snapshot_download(repo_id=model_name, revision=commit_hash, cache_dir=local_model_path, token = token)
snapshot_download(repo_id=model_name, cache_dir=local_model_path, token = token)

'LLM_llama2_model/models--meta-llama--Llama-2-13b-hf/snapshots/db6b8eb1feabb38985fdf785a89895959e944936'

In [7]:
s3_model_prefix = "LLM-RAG/workshop/LLM_llama2_model"  # folder where model checkpoint will go
model_snapshot_path = list(local_model_path.glob("**/snapshots/*"))[0]
s3_code_prefix = "LLM-RAG/workshop/LLM_llama2_sb_deploy_code"
print(f"s3_code_prefix: {s3_code_prefix}")
print(f"model_snapshot_path: {model_snapshot_path}")

s3_code_prefix: LLM-RAG/workshop/LLM_llama2_sb_deploy_code
model_snapshot_path: LLM_llama2_model/models--meta-llama--Llama-2-13b-hf/snapshots/db6b8eb1feabb38985fdf785a89895959e944936


In [5]:
!aws s3 rm --recursive s3://{bucket}/{s3_model_prefix}
!aws s3 cp --recursive {model_snapshot_path} s3://{bucket}/{s3_model_prefix}

delete: s3://sagemaker-us-west-2-687912291502/LLM-RAG/workshop/LLM_llama2_model/.gitattributes
delete: s3://sagemaker-us-west-2-687912291502/LLM-RAG/workshop/LLM_llama2_model/MODEL_CARD.md
delete: s3://sagemaker-us-west-2-687912291502/LLM-RAG/workshop/LLM_llama2_model/LICENSE.txt
delete: s3://sagemaker-us-west-2-687912291502/LLM-RAG/workshop/LLM_llama2_model/README.md
delete: s3://sagemaker-us-west-2-687912291502/LLM-RAG/workshop/LLM_llama2_model/Responsible-Use-Guide.pdf
delete: s3://sagemaker-us-west-2-687912291502/LLM-RAG/workshop/LLM_llama2_model/generation_config.json
delete: s3://sagemaker-us-west-2-687912291502/LLM-RAG/workshop/LLM_llama2_model/config.json
delete: s3://sagemaker-us-west-2-687912291502/LLM-RAG/workshop/LLM_llama2_model/USE_POLICY.md
delete: s3://sagemaker-us-west-2-687912291502/LLM-RAG/workshop/LLM_llama2_model/model-00003-of-00015.safetensors
delete: s3://sagemaker-us-west-2-687912291502/LLM-RAG/workshop/LLM_llama2_model/model-00005-of-00015.safetensors
delete: 

In [6]:
s3_model_location = f"s3://{bucket}/{s3_model_prefix}/"
print("s3_model_location => {}".format(s3_model_location))

s3_model_location => s3://sagemaker-us-west-2-687912291502/LLM-RAG/workshop/LLM_llama2_model/


In [8]:
!aws s3 ls s3://{bucket}/LLM-RAG/workshop/LLM_llama2_model/

2023-09-12 11:44:45       1581 .gitattributes
2023-09-12 11:44:45       7020 LICENSE.txt
2023-09-12 11:44:45      10371 README.md
2023-09-12 11:44:45    1253223 Responsible-Use-Guide.pdf
2023-09-12 11:44:45       4766 USE_POLICY.md
2023-09-12 11:44:45        610 config.json
2023-09-12 11:44:45        188 generation_config.json
2023-09-12 11:44:45 9948693272 model-00001-of-00003.safetensors
2023-09-12 11:44:45 9904129368 model-00002-of-00003.safetensors
2023-09-12 11:44:45 6178962272 model-00003-of-00003.safetensors
2023-09-12 11:44:45      33444 model.safetensors.index.json
2023-09-12 11:44:45 9948728430 pytorch_model-00001-of-00003.bin
2023-09-12 11:44:45 9904165024 pytorch_model-00002-of-00003.bin
2023-09-12 11:47:05 6178983625 pytorch_model-00003-of-00003.bin
2023-09-12 11:48:28      33444 pytorch_model.bin.index.json
2023-09-12 11:48:28        414 special_tokens_map.json
2023-09-12 11:48:28    1842767 tokenizer.json
2023-09-12 11:48:28     499723 tokenizer.model
2023-09-12 11:48:28

### Model deployment

#### `serving.properties` has an `engine` parameter that tells the DJL model server which engine to use to load the model (here `Python`, with vLLM rolling batch enabled through `option.rolling_batch`).

`option.tensor_parallel_degree`: we use the g5.48xlarge, which has 8 GPUs, so we set `tensor_parallel_degree` to 8.

`option.s3url`: use your own model path here. The S3 path must end with "/".

`batch_size`: controls server-side batching at the request level. You can set it to the largest value that does not cause an out-of-memory (OOM) error. The `model.py` in this example is only a demo that handles one prompt per client request.

`max_batch_delay`: measured in milliseconds.
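If you generate `serving.properties` programmatically, a small check can catch the trailing-slash requirement early. This is only a sketch: `render_serving_properties` is a hypothetical helper, and the bucket and prefix in the example are placeholders.

```python
def render_serving_properties(options):
    """Render a dict as serving.properties text, with a basic check on the S3 URL."""
    s3url = options.get("option.s3url", "")
    if s3url and not s3url.endswith("/"):
        raise ValueError("option.s3url must end with '/'")
    return "\n".join(f"{key}={value}" for key, value in options.items()) + "\n"

example = render_serving_properties({
    "engine": "Python",
    "option.s3url": "s3://my-bucket/my-model-prefix/",  # hypothetical bucket and prefix
    "option.tensor_parallel_degree": 8,
})
```

The returned string can be written to `src/serving.properties` in place of the `%%writefile` cell.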

In [57]:
!rm -rf src
!mkdir src

In [58]:
%%writefile src/serving.properties
engine=Python
option.s3url=s3://sagemaker-us-west-2-687912291502/LLM-RAG/workshop/LLM_llama2_model/
option.task=text-generation
option.trust_remote_code=true
option.tensor_parallel_degree=8
option.rolling_batch=vllm
option.dtype=fp16
option.enable_streaming=true

Writing src/serving.properties


In [59]:
%%writefile src/requirements.txt
vllm==0.1.7
pandas
transformers>=4.32.0

Writing src/requirements.txt


In [60]:
%%writefile ./src/model.py
from vllm import EngineArgs, LLMEngine, SamplingParams
from vllm.utils import random_uuid
from djl_python import Input, Output
from transformers.models.llama.tokenization_llama import LlamaTokenizer
import os
import torch
import torch.distributed as dist

predictor = None
tokenizer = None

def get_model(properties):
    model_location = properties['model_dir']
    tensor_parallel_degree = properties["tensor_parallel_degree"]
    
    if "model_id" in properties:
        model_location = properties['model_id']

    args = EngineArgs(
            model=model_location,
            tensor_parallel_size=int(tensor_parallel_degree),
            dtype='float16',
            seed=0,
    ) 
    engine = LLMEngine.from_engine_args(args)
    tokenizer = LlamaTokenizer.from_pretrained(model_location, torch_dtype=torch.float16)
    return engine,tokenizer


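def stream_gen(input_map):
    # Accept generation settings under either "params" or "parameters".
    params = input_map.get("params") or input_map.get("parameters") or {}
    prompts = input_map["inputs"]
    # A single prompt may arrive as a plain string; wrap it so we do not iterate over its characters.
    if isinstance(prompts, str):
        prompts = [prompts]
    for item in prompts:
        request_id = random_uuid()
        # max_tokens falls back to 256 (an arbitrary default) when the client does not send it.
        sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=params.get("max_tokens", 256))
        predictor.add_request(request_id, item, sampling_params)
    while predictor.has_unfinished_requests():
        # Each step() runs one decoding iteration and returns the current request outputs.
        request_outputs = predictor.step()
        intermediate_result = []
        for request_output in request_outputs:
            samples = {}
            for item in request_output.outputs:
                samples[item.text] = item.cumulative_logprob
            intermediate_result.append(samples)
        # Yield after every step, including the last one, so the final text is not dropped.
        yield intermediate_result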


def handle(inputs: Input) -> None:
    global predictor
    global tokenizer
    if not predictor:
        predictor,tokenizer = get_model(inputs.get_properties())

    if inputs.is_empty():
        # Model server makes an empty call to warmup the model on startup
        return None

    data = inputs.get_as_json()
    
    outputs = Output()
    outputs.add_property("content-type", "application/jsonlines")
    outputs.add_stream_content(stream_gen(data))
    return outputs

Writing ./src/model.py


#### Create and initialize the variables required to create the endpoint (we use boto3 for this)

In [61]:
import sagemaker
from sagemaker import image_uris
import boto3
import os
import time
import json
from pathlib import Path

sage_session = sagemaker.Session()
model_bucket = sage_session.default_bucket()  # bucket to house artifacts
s3_code_prefix = (
    "llama2-rollingbatch-stream/code"
)

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")



**Image URI for the DJL container is being used here**

In [62]:
# Note: modify the image URI to match your region.
#inference_image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.21.0-deepspeed0.8.0-cu117"
#print(f"Image going to be used is ---- > {inference_image_uri}")

inference_image_uri = "763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.23.0-deepspeed0.9.5-cu118" 
print(f"Image going to be used is ---- > {inference_image_uri}")

Image going to be used is ---- > 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.23.0-deepspeed0.9.5-cu118
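To avoid editing the URI by hand for each region, the string can be assembled from its parts. This is a sketch under an assumption: it reuses the registry account ID and tag from the URI above, which may not apply in every region or partition, so verify the result against the published container list. `djl_inference_image_uri` is a hypothetical helper.

```python
def djl_inference_image_uri(region, tag="0.23.0-deepspeed0.9.5-cu118", account="763104351884"):
    """Assemble the ECR image URI; account and tag default to the values used in this notebook."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/djl-inference:{tag}"

uri_us_west_2 = djl_inference_image_uri("us-west-2")
uri_us_east_1 = djl_inference_image_uri("us-east-1", tag="0.21.0-deepspeed0.8.0-cu117")
```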


**Create the Tarball and then upload to S3 location**

In [63]:
!rm -f model.tar.gz
!tar czvf model.tar.gz src

src/
src/requirements.txt
src/model.py
src/serving.properties


In [64]:
s3_code_artifact = sage_session.upload_data("model.tar.gz", model_bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {s3_code_artifact}")

S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-west-2-687912291502/llama2-rollingbatch-stream/code/model.tar.gz


In [65]:
print(f"S3 Model Bucket is -- > {model_bucket}")

S3 Model Bucket is -- > sagemaker-us-west-2-687912291502


### To create the endpoint, the steps are:

1. Create the model using the container image and the model tarball uploaded earlier
2. Create the endpoint config using the following key parameters

    a) Instance type is ml.g5.48xlarge
    
    b) ContainerStartupHealthCheckTimeoutInSeconds is 15*60, to allow time for the model to load before the health check must pass
    
3. Create the endpoint using the endpoint config created
    

One of the key parameters here is **TENSOR_PARALLEL_DEGREE**, which tells the DeepSpeed library to partition the model across 8 GPUs. This is a tunable and configurable parameter.

This parameter also controls the number of workers per model that are started when DJL Serving runs. For example, if we have an 8-GPU machine and create 8 partitions, then we will have 1 worker per model to serve the requests. For further reading on DeepSpeed, see https://www.deepspeed.ai/tutorials/inference-tutorial/#initializing-for-inference.
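The relationship between GPUs, partitions, and workers is simple integer division. The sketch below assumes workers are derived as GPUs divided by the tensor parallel degree, which matches the 8-GPU, 8-partition example above; `workers_per_model` is a hypothetical helper, not a DJL Serving API.

```python
def workers_per_model(num_gpus, tensor_parallel_degree):
    """Number of model workers when each worker spans tensor_parallel_degree GPUs."""
    if tensor_parallel_degree < 1 or num_gpus % tensor_parallel_degree != 0:
        raise ValueError("num_gpus must be a positive multiple of tensor_parallel_degree")
    return num_gpus // tensor_parallel_degree

full_shard = workers_per_model(8, 8)   # one worker spanning all 8 GPUs
half_shard = workers_per_model(8, 4)   # two workers, each spanning 4 GPUs
```

A smaller degree leaves room for more workers (higher throughput) as long as each shard still fits in GPU memory.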

In [66]:
from sagemaker.utils import name_from_base

model_name = name_from_base(f"llama2-70b-vllm")
print(model_name)

role = sagemaker.get_execution_role()

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": s3_code_artifact,
    },
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

llama2-70b-vllm-2023-09-15-03-19-58-025
Created Model: arn:aws:sagemaker:us-west-2:687912291502:model/llama2-70b-vllm-2023-09-15-03-19-58-025


`VolumeSizeInGB` has been left commented out. Use this value for instance types that support EBS volume mounts. The instance we are using comes with preconfigured storage and does not support additional volume mounts.

In [67]:
endpoint_config_name = f"{model_name}-config-0902"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.48xlarge",
            "InitialInstanceCount": 1,
            #"VolumeSizeInGB" : 300,
            "ModelDataDownloadTimeoutInSeconds": 15*60,
            "ContainerStartupHealthCheckTimeoutInSeconds": 15*60,
        },
    ],
    #environment={"SERVING_OPTS":"-Dai.djl.logging.level=debug"}
)
endpoint_config_response

{'EndpointConfigArn': 'arn:aws:sagemaker:us-west-2:687912291502:endpoint-config/llama2-70b-vllm-2023-09-15-03-19-58-025-config-0902',
 'ResponseMetadata': {'RequestId': '80385173-97e3-4133-8c9c-7ad9875fcb9c',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '80385173-97e3-4133-8c9c-7ad9875fcb9c',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '132',
   'date': 'Fri, 15 Sep 2023 03:19:59 GMT'},
  'RetryAttempts': 0}}

In [68]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=f"{endpoint_name}", EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

Created Endpoint: arn:aws:sagemaker:us-west-2:687912291502:endpoint/llama2-70b-vllm-2023-09-15-03-19-58-025-endpoint


#### Wait for the endpoint to be created

This step can take about 15 minutes or longer, so please be patient.

In [None]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
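The polling loop above runs until the status changes, however long that takes. A variant with a timeout is sketched below. `wait_for_endpoint` is a hypothetical helper that takes the describe function as an argument, so it can be exercised offline with a fake, as the demo does; no AWS call is made here.

```python
import time

def wait_for_endpoint(describe, endpoint_name, poll_seconds=60, timeout_seconds=3600, sleep=time.sleep):
    """Poll describe(EndpointName=...) until the status leaves 'Creating' or the timeout expires."""
    waited = 0
    while True:
        status = describe(EndpointName=endpoint_name)["EndpointStatus"]
        if status != "Creating":
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"{endpoint_name} still Creating after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds

# Offline demo with a fake describe function and a no-op sleep
_statuses = iter(["Creating", "Creating", "InService"])
final_status = wait_for_endpoint(
    lambda EndpointName: {"EndpointStatus": next(_statuses)},
    "demo-endpoint",
    sleep=lambda seconds: None,
)
```

Against a real endpoint the call would be `wait_for_endpoint(sm_client.describe_endpoint, endpoint_name)`. boto3 also ships a built-in waiter, `sm_client.get_waiter("endpoint_in_service")`, which may be preferable when custom status printing is not needed.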


#### Use Boto3 to invoke the endpoint

This is a generative model, so we pass in text as a prompt and the model completes the sentence and returns the results.


In [None]:
import io
class StreamScanner:
    
    def __init__(self):
        self.buff = io.BytesIO()
        self.read_pos = 0
        
    def write(self, content):
        self.buff.seek(0, io.SEEK_END)
        self.buff.write(content)
        
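    def readlines(self):
        self.buff.seek(self.read_pos)
        for line in self.buff.readlines():
            # Only consume complete lines; a partial line stays buffered until the rest arrives.
            # Slice (line[-1:]) so bytes are compared with bytes; line[-1] would be an int.
            if line[-1:] == b'\n':
                self.read_pos += len(line)
                yield line[:-1]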
                
    def reset(self):
        self.read_pos = 0
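The reason a scanner is needed is that the response stream delivers arbitrary byte chunks, so one JSON line can be split across two events. The sketch below shows the same buffering idea as a standalone generator; `iter_json_lines` is a hypothetical helper, and the `{"outputs": ...}` payload shape is made up for the demo.

```python
import json

def iter_json_lines(chunks):
    """Yield one parsed JSON object per complete line, buffering partial lines across chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line:
                yield json.loads(line)

# Two JSON lines delivered in three chunks that break mid-line
chunks = [b'{"outputs": "Hel', b'lo"}\n{"outputs"', b': " world"}\n']
parsed = list(iter_json_lines(chunks))
```

Parsing only after a newline has arrived means a half-received line is never handed to `json.loads`.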

In [None]:
%%time
import json
import boto3
import time

endpoint_name="llama2-70b-vllm-2023-09-13-03-48-33-872-endpoint"
start_time = time.time()

smr_client = boto3.client("sagemaker-runtime")
# The prompts below are intentionally written in Chinese and other languages to exercise multilingual generation:
#   prompt1: write a product pitch for an e-commerce livestream host from a product description
#   prompt2: write a generic opening or interaction script from a list of keywords
#   prompt3: generate a Q&A library from a product description, an intent, and question/answer templates
#   prompt4: answer a viewer's question in the first person using only the given knowledge base
prompt1 = """根据以下反引号内的商品详细描述，为电商直播主持人创作一段引人注目的商品推介话术
‘’‘
iPhone 14是苹果公司在2022年9月8日正式发布的最新手机。它配备了一块6.1英寸的OLED屏幕，并提供了六种独特的颜色选择：蓝色、紫色、午夜色、星光色、红色和黄色。手机的尺寸设计优雅，长度为146.7毫米，宽度为71.5毫米，厚度为7.8毫米，重量约为172克。
在性能上，iPhone 14搭载了强大的苹果A15仿生芯片，内部含有6 核中央处理器，有 2 个性能核心和 4 个能效核心，还有5 核GPU图形处理器。。它不仅支持车祸检测和卫星通信等实用功能，而且在拍照方面也表现出色。
后置摄像头包括一个1200万像素的主镜头和一个1200万像素的超广角镜头，前置摄像头也是1200万像素
此外，该手机还支持光像引擎、深度融合技术、智能HDR4和人像模式等摄影技术，确保用户可以轻松捕捉每一个美好瞬间
’‘’
话术中应包括商品的主要特点、优势及互动环节,使用中文撰写，并保持话术简洁、有趣且具吸引力,并确保包含上述要求的所有元素"""

prompt2="""根据以下反引号内的关键词，为电商直播主持人创作一段通用的开场、互动或欢迎话术。请确保话术融入这些关键词，使用中文撰写，内容要简洁、有趣且具吸引力，同时适应广泛的商品和场景。
‘’‘
精选
性价比
品质
日常家居
穿搭
限时折扣
免费赠品
抽奖活动
大品牌合作
独家优惠
’‘’
请使用上述关键词，编写一段具有普遍适用性，适于电商直播开头或互动环节的话术"""


prompt3="""
请根据以下反引号内的商品描述、意图、问题模板和回答模板，为电商直播商品生成一个问答库。要求生成的回答应当有至少一组，最多五组。请确保答案基于商品描述和回答模板生成。如果无法生成回答，表示为“根据已知信息无法生成回答”。格式应如下：{{"Q":"问题","A":['答案1-1','答案1-2'...]}}
‘’‘
商品描述：iPhone 14是苹果公司在2022年9月8日正式发布的最新手机。它配备了一块6.1英寸的OLED屏幕，并提供了六种独特的颜色选择：蓝色、紫色、午夜色、星光色、红色和黄色。手机的尺寸设计优雅，长度为146.7毫米，宽度为71.5毫米，厚度为7.8毫米，重量约为172克。
在性能上，iPhone 14搭载了强大的苹果A15仿生芯片，内部含有6 核中央处理器，有 2 个性能核心和 4 个能效核心，还有5 核GPU图形处理器。。它不仅支持车祸检测和卫星通信等实用功能，而且在拍照方面也表现出色。后置摄像头包括一个1200万像素的主镜头和一个1200万像素的超广角镜头，前置摄像头也是1200万像素。此外，该手机还支持光像引擎、深度融合技术、智能HDR4和人像模式等摄影技术，确保用户可以轻松捕捉每一个美好瞬间。}
意图：性能
问题模板：手机的性能如何？
回答模板：[商品名称]采用了最新的[芯片名称]，搭载了[核心数量]核CPU和[GPU核心数量]核GPU，为用户提供强大的性能。
谈到[商品名称]的性能，不得不提及它的[芯片名称]，配备[核心数量]核CPU和[GPU核心数量]核GPU，应对各种任务都游刃有余。
[商品名称]在性能上表现卓越，得益于其[芯片名称]和[核心数量]核处理器，加上[GPU核心数量]核GPU，让每次使用都顺畅无比。}
’‘’
问答生成：请基于上述商品描述、意图、问题模板和回答模板，为电商直播商品提供符合上述格式的问答库。
"""

prompt4="""
请以电商直播主持人的第一人称角度回答观众的商品相关问题。确保只回答与商品相关的问题，并只使用以下反引号内知识库的信息来回答。回答中请勿随意编造内容。格式应如下:[{{"intention": "意图1", "answer": "回答1"}},{{"intention": "意图2", "answer": "回答2"}}]
‘’‘
[问题：iPhone 14有哪些可选的颜色？][回答：iPhone 14提供了六种时尚的颜色选择，包括蓝色、紫色、午夜色、星光色、红色和黄色。][意图：颜色]
[问题：关于摄像头，iPhone 14的前置和后置摄像头分辨率是多少？][回答：iPhone 14的前置和后置摄像头分辨率都是1200万像素。][意图：分辨率]
[问题：我经常用手机办公和玩游戏，iPhone 14的性能如何？][回答：iPhone 14搭载了强大的苹果A15六核中央处理器，无论是玩游戏、看视频，还是办公，它都可以轻松应对。][意图：性能]}
’‘’
观众问题：主播小姐姐好漂亮
使用第一人称直接回答观众关于商品的提问。检查知识库中是否有与观众提问相匹配的回答。对于在知识库中找到的每个匹配意图，请依次提供对应的回答，并确保从知识库中的意图中提取相应的意图标签。如果所有的意图都在知识库中找不到答案，回答“根据已知信息无法回答问题”。确保不使用emoji。
"""

other="""我很在意手机的颜色和摄像头功能，能给我介绍一下iPhone 14在这两方面的特点吗？
便宜点就好了"""


prompt_prefix = "你正在一个聊天室里和不同国家的人们聊天，你能读懂所有国家的语言，你负责通过聊天记录分析所有聊天者的性格和有效信息，具体步骤如下：\
1.阅读他们的聊天记录 \
2.总结他们聊天里面的重要信息 \
3.抽象他们的人设 \
4.使用评分体系抽象他们之间的人际关系，然后给一个评分，范围1-10分，分越高关系越好 \
聊天信息如下: " 

chats_infos = """
WaRGazmo : "you lucked out there buddy" 
WarLord : "suerte? eso no existe " 
WarLord : "soy más rápido que la luz " 
WaRGazmo : "it exists.. or karma" 
DirtyE1bow : "so you was a planned birth ?" 
WaRGazmo : "thats what she said bruh" 
WarLord : "te amo mi amor " 
Manowarik : "Мир вам,люди добрые.." 
kotofei : "и тебе боярин, что не подался в челядь королю)" 
XxNORxXMithra : "God morgen folkens :) " 
kotofei : "и прочие жители галактики " 
XxNORxXMithra : "Ja de også forsåvidt :) " 
Manowarik : "Котофей-это который по цепи кругом?Песни там,сказки?😆😆" 
kotofei : "не, то дальний убогий родственник " 
Manowarik : "Эххх..Лукоморье мимо..((" 
kipl : "Котофей он из сказки Лиса и Котофей Иванович. " 
kipl : "Межвидовой брак и крышевание леса" 
kotofei : "лиса 🦊 мералиса и Котофей Иваныч " 
leister : "😆" 
XxFoxyQBAxX : "po co tyle zrobiłeś?"
"""

sql_prompt= """
You are a MySQL expert. Given an input question, first create a syntactically correct MySQL query to run, then look at the results of the query and return the answer to the input question.
Unless the user specifies in the question a specific number of examples to obtain, query for at most 3 results using the LIMIT clause as per MySQL. You can order the results to return the most informative data in the database.
Never query for all columns from a table. You must query only the columns that are needed to answer the question. Wrap each column name in backticks (`) to denote them as delimited identifiers.
Pay attention to use only the column names you can see in the tables below. Be careful to not query for columns that do not exist. Also, pay attention to which column is in which table.
Pay attention to use CURDATE() function to get the current date, if the question involves "today".

Use the following format:

Question: Question here
SQLQuery: SQL Query to run
SQLResult: Result of the SQLQuery
Answer: Final answer here

Only use the following tables:

CREATE TABLE customer (
	c_customer_sk INTEGER NOT NULL, 
	c_customer_id CHAR(16) NOT NULL, 
	c_current_cdemo_sk INTEGER, 
	c_current_hdemo_sk INTEGER, 
	c_current_addr_sk INTEGER, 
	c_first_shipto_date_sk INTEGER, 
	c_first_sales_date_sk INTEGER, 
	c_salutation CHAR(10), 
	c_first_name CHAR(20), 
	c_last_name CHAR(30), 
	c_preferred_cust_flag CHAR(1), 
	c_birth_day INTEGER, 
	c_birth_month INTEGER, 
	c_birth_year INTEGER, 
	c_birth_country VARCHAR(20), 
	c_login CHAR(13), 
	c_email_address CHAR(50), 
	c_last_review_date CHAR(10), 
	PRIMARY KEY (c_customer_sk)
)DEFAULT CHARSET=utf8 ENGINE=InnoDB


CREATE TABLE web_sales (
	ws_sold_date_sk INTEGER, 
	ws_sold_time_sk INTEGER, 
	ws_ship_date_sk INTEGER, 
	ws_item_sk INTEGER NOT NULL, 
	ws_bill_customer_sk INTEGER, 
	ws_bill_cdemo_sk INTEGER, 
	ws_bill_hdemo_sk INTEGER, 
	ws_bill_addr_sk INTEGER, 
	ws_ship_customer_sk INTEGER, 
	ws_ship_cdemo_sk INTEGER, 
	ws_ship_hdemo_sk INTEGER, 
	ws_ship_addr_sk INTEGER, 
	ws_web_page_sk INTEGER, 
	ws_web_site_sk INTEGER, 
	ws_ship_mode_sk INTEGER, 
	ws_warehouse_sk INTEGER, 
	ws_promo_sk INTEGER, 
	ws_order_number INTEGER NOT NULL, 
	ws_quantity INTEGER, 
	ws_wholesale_cost DECIMAL(7, 2), 
	ws_list_price DECIMAL(7, 2), 
	ws_sales_price DECIMAL(7, 2), 
	ws_ext_discount_amt DECIMAL(7, 2), 
	ws_ext_sales_price DECIMAL(7, 2), 
	ws_ext_wholesale_cost DECIMAL(7, 2), 
	ws_ext_list_price DECIMAL(7, 2), 
	ws_ext_tax DECIMAL(7, 2), 
	ws_coupon_amt DECIMAL(7, 2), 
	ws_ext_ship_cost DECIMAL(7, 2), 
	ws_net_paid DECIMAL(7, 2), 
	ws_net_paid_inc_tax DECIMAL(7, 2), 
	ws_net_paid_inc_ship DECIMAL(7, 2), 
	ws_net_paid_inc_ship_tax DECIMAL(7, 2), 
	ws_net_profit DECIMAL(7, 2), 
	PRIMARY KEY (ws_item_sk, ws_order_number)
)DEFAULT CHARSET=utf8 ENGINE=InnoDB

Question: 我需要知道销售报表中，下单金额最大的客户id
"""

prompt="##Eva:How often do you travel?## Malcolm:I like David Bowie too. I don’t travel much any more, but I used to.## Eva:That's cool! I recently took a road trip with my friend. We had so much fun and it opened up so many possibilities for us. What kind of places did you like to explore?## Malcolm:I love history and culture, so those are my favorite.## Eva: He was born in Birmingham, England and raised in Los Angeles, California.Eva: Yes, Sir. Queen is one of the most influential bands of all time.## Malcolm:It is. They are one of my favorite rock groups. What about you?## Eva:I'm more into classic rock, especially David Bowie. Who is your favorite artist?## Malcolm:Marylin Manson. You?## Eva:My favorite artist is David Bowie.## Eva:How often do you travel?## Malcolm:I like David Bowie too. I don’t travel much any more, but I used to.## Eva:That's cool! I recently took a road trip with my friend. We had so much fun and it opened up so many possibilities for us. What kind of places did you like to explore?## Malcolm:I love history and culture, so those are my favorite.## Eva: He was born in Birmingham, England and raised in Los Angeles, California.##Eva: Yes, Sir. Queen is one of the most influential bands of all time.## Malcolm:It is. They are one of my favorite rock groups. What about you?## Eva:I'm more into classic rock, especially David Bowie. Who is your favorite artist?## Malcolm:Marylin Manson. You?## Eva:My favorite artist is David Bowie.## Eva:How often do you travel?## Malcolm:I like David Bowie too. I don’t travel much any more, but I used to.## Eva:That's cool! I recently took a road trip with my friend. We had so much fun and it opened up so many possibilities for us. What kind of places did you like to explore?## Malcolm:I love history and culture, so those are my favorite.## Eva: He was born in Birmingham, England and raised in Los Angeles, California.##Eva: Yes, Sir. Queen is one of the most influential bands of all time.## Malcolm:It is. 
They are one of my favorite rock groups. What about you?## Eva:I'm more into classic rock, especially David Bowie. Who is your favorite artist?## Malcolm:Marylin Manson. You?## Eva:My favorite artist is David Bowie.## Eva:How often do you travel?## Malcolm:I like David Bowie too. I don’t travel much any more, but I used to.## Eva:That's cool! I recently took a road trip with my friend. We had so much fun and it opened up so many possibilities for us. What kind of places did you like to explore?## Malcolm:I love history and culture, so those are my favorite.## Eva: He was born in Birmingham, England and raised in Los Angeles, California.#### Malcolm:Oh. What are you wearing right now, pet?## Eva:"
prompt="a happy weekend with my family, I"
parameters = {
  "early_stopping": True,
  "max_tokens": 300,
  "min_new_tokens": 128,
  #"do_sample": False,
  #"temperature": 1.0,
}


#response_model = smr_client.invoke_endpoint_async(
#            EndpointName=endpoint_name,
#            Body=json.dumps(
#            {
#                "inputs": [prompt],
#                #"inputs": [prompt1,prompt3,prompt2,prompt2,prompt2,prompt2,prompt4],
#                "params": parameters
#            }
#            ),
#            ContentType="application/json"
#        )
#
#end_time = time.time()
#time_interval = end_time - start_time
#print(f"Elapsed time (seconds): {time_interval}")
#response_model['Body'].read().decode('utf8')




In [None]:
from joblib import Parallel, delayed

prompts = [prompt1,prompt2,prompt3,prompt4]

def call_endpoint(prompt):
    response_model = smr_client.invoke_endpoint_with_response_stream(
            EndpointName=endpoint_name,
            Body=json.dumps(
            {
                # model.py iterates over "inputs" and reads generation settings from "params"
                "inputs": [prompt],
                "params": parameters
            }
            ),
            ContentType="application/json",
        )

    event_stream = response_model['Body']
    scanner = StreamScanner()
    for event in event_stream:
        scanner.write(event['PayloadPart']['Bytes'])
        for line in scanner.readlines():
            try:
                resp = json.loads(line)
                print(resp)
                # print(resp.get("outputs")['outputs'], end='')
            except Exception as e:
                # print(line)
                continue


results = Parallel(n_jobs=10, prefer='threads', verbose=1, )(
    delayed(call_endpoint)(prompt)
    for prompt in prompts
)

## Asynchronous inference deployment

In [38]:
from sagemaker.async_inference import AsyncInferenceConfig
import uuid


endpoint_config_name = f"{model_name}-config-0913"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.48xlarge",
            "InitialInstanceCount": 1,
            #"VolumeSizeInGB" : 300,
            "ModelDataDownloadTimeoutInSeconds": 15*60,
            "ContainerStartupHealthCheckTimeoutInSeconds": 15*60,
        },
    ],
    AsyncInferenceConfig={"OutputConfig":{
    "S3OutputPath":'s3://{0}/{1}/asyncinvoke/out/'.format(model_bucket, 'llama2-70b')}}
)
endpoint_config_response

print(f'endpoint_name: {endpoint_name}')

endpoint_name: llama2-70b-vllm-2023-09-13-03-48-33-872-endpoint


In [22]:
import json
import io
import traceback
import time
from sagemaker.async_inference.waiter_config import WaiterConfig


s3_resource = boto3.resource('s3')

def get_bucket_and_key(s3uri):
    pos = s3uri.find('/', 5)
    bucket = s3uri[5 : pos]
    key = s3uri[pos + 1 : ]
    return bucket, key

def print_gen_text(response):
    try:
        bucket, key = get_bucket_and_key(response.output_path)
        obj = s3_resource.Object(bucket, key)
        output_json = json.loads(obj.get()['Body'].read().decode("utf-8"))
        print(output_json["generated_text"])
    except Exception as e:
        traceback.print_exc()
        print(e)


def async_predict_fn(predictor,inputs):
    response = predictor.predict_async(inputs)
    
    print(f"Response object: {response}")
    print(f"Response output path: {response.output_path}")
    print("Start Polling to get response:")
    
    start = time.time()
    config = WaiterConfig(
        max_attempts=100, #  number of attempts
        delay=10 #  time in seconds to wait between attempts
    )

    response.get_result(config)
    print_gen_text(response)

    print(f"Time taken: {time.time() - start}s")
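As an alternative to slicing the URI by position, the standard library can split an S3 URI into bucket and key. This is a sketch: `split_s3_uri` is a hypothetical helper, and the URI in the demo is a placeholder.

```python
from urllib.parse import urlparse

def split_s3_uri(s3_uri):
    """Return (bucket, key) for an s3://bucket/key URI, rejecting anything else."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")

bucket_name, object_key = split_s3_uri("s3://my-bucket/asyncinvoke/out/result.json")
```

Unlike the positional version, this raises a clear error when the input is not an `s3://` URI.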

In [None]:
import boto3
import json
import time

s3_client = boto3.client('s3')

# Build the request payload as a dict so that it is serialized exactly once.
data = {
    "inputs": [prompt],
    #"inputs": [prompt1,prompt3,prompt2,prompt2,prompt2,prompt2,prompt4],
    "params": parameters
}

# Convert the payload to a JSON string
json_string = json.dumps(data)

# Write the JSON payload to an object in the S3 bucket
s3_client.put_object(
    Bucket=model_bucket,
    Key="llama2/async/inputs.data",
    Body=json_string,
    ContentType='application/json'
)

input_location=f"s3://{model_bucket}/llama2/async/inputs.data"
start_time = time.time()
response_model = smr_client.invoke_endpoint_async(
            EndpointName=endpoint_name,
            InputLocation=input_location,
            ContentType="application/json"
        )

end_time = time.time()
time_interval = end_time - start_time
print(f"Elapsed time (seconds): {time_interval}")
# The async call returns immediately; the result is written to this S3 location once inference completes.
response_model["OutputLocation"]