
### Serve large models on SageMaker with DJL DeepSpeed Container

In this notebook, we explore how to host a large language model on SageMaker using the latest container built on DeepSpeed and DJL. DJL provides the serving framework, while DeepSpeed is the key sharding library we leverage to enable hosting of large models. We use DJLServing as the model serving solution in this example. DJLServing is a high-performance, programming-language-agnostic universal model serving solution powered by the Deep Java Library (DJL). To learn more about DJL and DJLServing, you can refer to our recent blog post (https://aws.amazon.com/blogs/machine-learning/deploy-large-models-on-amazon-sagemaker-using-djlserving-and-deepspeed-model-parallel-inference/).

Model parallelism can help deploy large models that would normally be too large for a single GPU. With model parallelism, we partition and distribute a model across multiple GPUs. Each GPU holds a different part of the model, resolving the memory capacity issue for the largest deep learning models with billions of parameters. This notebook uses tensor parallelism techniques which allow GPUs to work simultaneously on the same layer of a model and achieve low latency inference relative to a pipeline parallel solution.

SageMaker has rolled out a DeepSpeed container, which lets users leverage SageMaker's managed serving capabilities and takes care of the undifferentiated heavy lifting.

In this notebook, we deploy the open source Llama 2 70B chat model (`meta-llama/Llama-2-70b-chat-hf`) across the GPUs of an ml.g5.48xlarge instance. Note that a Llama 7B model in fp16 can be deployed on a single GPU, for example on a g5.2xlarge (24 GB VRAM); here we show code that deploys the LLM across multiple GPUs in SageMaker. In the code below, `model.py` uses vLLM for tensor-parallel inference inside the DJL DeepSpeed container, while DJLServing handles inference requests and the distributed workers. For further reading on DeepSpeed you can refer to https://arxiv.org/pdf/2207.00032.pdf
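
To make the sizing argument concrete, here is a back-of-the-envelope estimate of the memory needed for the model weights alone. It is a rough sketch: it assumes 2 bytes per parameter (fp16), counts 1 GB as 10^9 bytes, and ignores the KV cache, activations, and framework overhead, so real usage is higher. The function name `weight_memory_gb` is ours and not part of any library.

```python
def weight_memory_gb(num_params_billion: float, bytes_per_param: int = 2, num_gpus: int = 1) -> float:
    """Approximate per-GPU memory (GB) for model weights sharded evenly across GPUs."""
    total_bytes = num_params_billion * 1e9 * bytes_per_param
    return total_bytes / num_gpus / 1e9


# 7B parameters in fp16 on a single GPU
print(weight_memory_gb(7))
# 70B parameters in fp16 on a single GPU
print(weight_memory_gb(70))
# 70B parameters in fp16 sharded across the 8 GPUs of an ml.g5.48xlarge
print(weight_memory_gb(70, num_gpus=8))
```

Under these assumptions, the 7B weights (about 14 GB) fit within the 24 GB of a single GPU, while the 70B weights (about 140 GB) only fit once they are split across the eight GPUs (about 17.5 GB each).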


## Create SageMaker compatible Model artifact and Upload Model to S3

SageMaker needs the model to be in tarball format. In this notebook, we package only the inference code in the tarball and load the model weights from S3, which shortens the endpoint creation time.

The tarball is in the following format:

```
code
├── model.py
├── requirements.txt
└── serving.properties

```


- `model.py` is the key file that handles all serving requests.
- `requirements.txt` lists the libraries to install when the container starts up.
- `serving.properties` is the configuration file whose properties can be used to customize `model.py` at run time.
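
As an illustration of this layout, the sketch below packages a directory containing the three files into a gzip-compressed tarball with Python's standard `tarfile` module and lists the archived members. The helper name `make_model_tarball` and the placeholder file contents are assumptions for demonstration only; the notebook itself builds the archive later with the `tar` command.

```python
import tarfile
import tempfile
from pathlib import Path
from typing import List


def make_model_tarball(code_dir: Path, tarball_path: Path) -> List[str]:
    """Pack code_dir into a .tar.gz and return the archived member names."""
    with tarfile.open(tarball_path, "w:gz") as tar:
        # arcname keeps only the directory name, not the full local path
        tar.add(code_dir, arcname=code_dir.name)
    with tarfile.open(tarball_path, "r:gz") as tar:
        return sorted(tar.getnames())


with tempfile.TemporaryDirectory() as tmp:
    code_dir = Path(tmp) / "code"
    code_dir.mkdir()
    # Placeholder contents, only to demonstrate the layout
    for name in ("model.py", "requirements.txt", "serving.properties"):
        (code_dir / name).write_text("# placeholder\n")
    members = make_model_tarball(code_dir, Path(tmp) / "model.tar.gz")
    print(members)
```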


### Model download and upload to S3

In [1]:
!git clone https://github.com/vllm-project/vllm.git

fatal: destination path 'vllm' already exists and is not an empty directory.


In [1]:
!pip install huggingface-hub -Uqq
!pip install -U sagemaker

Collecting sagemaker
  Downloading sagemaker-2.184.0.tar.gz (884 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sagemaker
  Building wheel for sagemaker (setup.py) ... done
  Created wheel for sagemaker: filename=sagemaker-2.184.0-py2.py3-none-any.whl size=1185356 sha256=3a3fd9dfca664b033af389d1dba493f84dc4da1674fddb8e0c0a55778b0b9c15
  Stored in directory: /home/ec2-user/.cache/pip/wheels/e0/88/68/6fe23600506acbeffff228ff9b6a0b8d523b64d5d0ffd24654
Successfully built sagemaker
Installing collected packages: sagemaker
  Attempting uninstall: sagemaker
    Found existing installation: sagemaker 2.179.0
    Uninstalling sagemaker-2.179.0:
      Successfully uninstalled sagemaker-2.179.0
Successfully installed sagemaker-2.184.0


In [2]:
import sagemaker
from sagemaker.model import Model
from sagemaker import serializers, deserializers
from sagemaker import image_uris
import boto3
import os
import time
import json

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts

region = sess.boto_region_name  # public attribute instead of the private _region_name
account_id = sess.account_id()

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [3]:
from huggingface_hub import snapshot_download
from pathlib import Path

local_model_path = Path("./LLM_llama2_model")
local_model_path.mkdir(exist_ok=True)
model_name = "meta-llama/Llama-2-70b-chat-hf"
commit_hash = "36d9a7388cc80e5f4b3e9701ca2f250d21a96c30"
token = os.environ.get("HF_TOKEN")  # read the Hugging Face access token from the environment; never hard-code it

In [4]:
snapshot_download(repo_id=model_name, revision=commit_hash, cache_dir=local_model_path, token = token)

'LLM_llama2_model/models--meta-llama--Llama-2-70b-chat-hf/snapshots/36d9a7388cc80e5f4b3e9701ca2f250d21a96c30'

In [5]:
s3_model_prefix = "LLM-RAG/workshop/LLM_llama2_model"  # folder where model checkpoint will go
model_snapshot_path = list(local_model_path.glob("**/snapshots/*"))[0]
s3_code_prefix = "LLM-RAG/workshop/LLM_llama2_sb_deploy_code"
print(f"s3_code_prefix: {s3_code_prefix}")
print(f"model_snapshot_path: {model_snapshot_path}")

s3_code_prefix: LLM-RAG/workshop/LLM_llama2_sb_deploy_code
model_snapshot_path: LLM_llama2_model/models--meta-llama--Llama-2-70b-chat-hf/snapshots/36d9a7388cc80e5f4b3e9701ca2f250d21a96c30


In [6]:
!aws s3 rm --recursive s3://{bucket}/{s3_model_prefix}
!aws s3 cp --recursive {model_snapshot_path} s3://{bucket}/{s3_model_prefix}

delete: s3://sagemaker-us-west-2-687912291502/LLM-RAG/workshop/LLM_llama2_model/LICENSE.txt
delete: s3://sagemaker-us-west-2-687912291502/LLM-RAG/workshop/LLM_llama2_model/.gitattributes
delete: s3://sagemaker-us-west-2-687912291502/LLM-RAG/workshop/LLM_llama2_model/Notice
delete: s3://sagemaker-us-west-2-687912291502/LLM-RAG/workshop/LLM_llama2_model/README.md
delete: s3://sagemaker-us-west-2-687912291502/LLM-RAG/workshop/LLM_llama2_model/config.json
delete: s3://sagemaker-us-west-2-687912291502/LLM-RAG/workshop/LLM_llama2_model/USE_POLICY.md
delete: s3://sagemaker-us-west-2-687912291502/LLM-RAG/workshop/LLM_llama2_model/generation_config.json
delete: s3://sagemaker-us-west-2-687912291502/LLM-RAG/workshop/LLM_llama2_model/model-00004-of-00015.safetensors
delete: s3://sagemaker-us-west-2-687912291502/LLM-RAG/workshop/LLM_llama2_model/model-00002-of-00015.safetensors
delete: s3://sagemaker-us-west-2-687912291502/LLM-RAG/workshop/LLM_llama2_model/model-00007-of-00015.safetensors
delete: 

In [7]:
s3_model_location = f"s3://{bucket}/{s3_model_prefix}/"
print("s3_model_location => {}".format(s3_model_location))

s3_model_location => s3://sagemaker-us-west-2-687912291502/LLM-RAG/workshop/LLM_llama2_model/


In [8]:
!aws s3 ls s3://{bucket}/LLM-RAG/workshop/LLM_llama2_model/

2023-09-07 14:48:37       1581 .gitattributes
2023-09-07 14:48:37       7020 LICENSE.txt
2023-09-07 14:48:37       7230 MODEL_CARD.md
2023-09-07 14:48:37       9972 README.md
2023-09-07 14:48:37    1253223 Responsible-Use-Guide.pdf
2023-09-07 14:48:37       4766 USE_POLICY.md
2023-09-07 14:48:37        614 config.json
2023-09-07 14:48:37        188 generation_config.json
2023-09-07 14:48:37 9852591960 model-00001-of-00015.safetensors
2023-09-07 14:48:37 9798099016 model-00002-of-00015.safetensors
2023-09-07 14:48:37 9965870512 model-00003-of-00015.safetensors
2023-09-07 14:48:37 9798066064 model-00004-of-00015.safetensors
2023-09-07 14:48:37 9798099064 model-00005-of-00015.safetensors
2023-09-07 14:52:19 9798099056 model-00006-of-00015.safetensors
2023-09-07 14:52:19 9965870512 model-00007-of-00015.safetensors
2023-09-07 14:52:21 9798066064 model-00008-of-00015.safetensors
2023-09-07 14:52:21 9798099064 model-00009-of-00015.safetensors
2023-09-07 14:52:25 9798099056 model-00010-of-0001

### Model deployment

#### serving.properties has an `engine` parameter that tells the DJL model server which engine to use to load the model. Here it is set to `Python`, so the model is loaded by `model.py`.

`option.tensor_parallel_degree`: we use the g5.48xlarge, which has 8 GPUs, so we set tensor_parallel_degree to 8.

`option.s3url`: use your own model path here. The S3 path must end with "/".

`batch_size`: server-side batching at the request level. You can set batch_size to the largest value that does not cause an out-of-memory (OOM) error. The current `model.py` is only a demo that handles one prompt per client request.

`max_batch_delay`: specified in milliseconds.
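
Because `%%writefile` writes the cell verbatim, the S3 path has to be typed by hand. As an alternative, the sketch below renders the same properties from Python variables. The helper `render_serving_properties` is hypothetical, the `batch_size` and `max_batch_delay` keys are only emitted when given, and the bucket name in the usage line is a placeholder.

```python
from typing import Optional


def render_serving_properties(
    s3_url: str,
    tensor_parallel_degree: int,
    batch_size: Optional[int] = None,
    max_batch_delay: Optional[int] = None,
) -> str:
    """Return the text of a serving.properties file for the Python engine."""
    if not s3_url.endswith("/"):
        s3_url += "/"  # the S3 path must end with "/"
    lines = [
        "engine=Python",
        f"option.tensor_parallel_degree={tensor_parallel_degree}",
        f"option.s3url={s3_url}",
    ]
    if batch_size is not None:
        lines.append(f"batch_size={batch_size}")
    if max_batch_delay is not None:
        lines.append(f"max_batch_delay={max_batch_delay}")  # milliseconds
    return "\n".join(lines) + "\n"


# Placeholder bucket name; in the notebook you would pass s3_model_location
properties_text = render_serving_properties("s3://my-bucket/LLM-RAG/workshop/LLM_llama2_model", 8)
print(properties_text)
```

The returned string can then be written to `./src/serving.properties`, for example with `Path("src/serving.properties").write_text(properties_text)`.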

In [9]:
!rm -rf src
!mkdir src

In [10]:
%%writefile ./src/serving.properties
engine=Python
option.tensor_parallel_degree=8
#option.model_id=THUDM/llama2-6b
#option.s3url=s3_model_location
option.s3url=s3://sagemaker-us-west-2-687912291502/LLM-RAG/workshop/LLM_llama2_model/

Writing ./src/serving.properties


In [11]:
%%writefile ./src/requirements.txt
vllm
accelerate>=0.20.3
ray[air]
transformers>=4.32.0

Writing ./src/requirements.txt


In [12]:
%%writefile ./src/model.py
from typing import Optional

from vllm import LLM, SamplingParams
from djl_python import Input, Output
from transformers.models.llama.tokenization_llama import LlamaTokenizer
import os
import torch
import torch.distributed as dist

# Disable NCCL peer-to-peer (P2P) communication between GPUs
os.environ['NCCL_P2P_DISABLE'] = '1'

# Loaded lazily on the first request and reused afterwards
predictor = None
tokenizer = None

def get_model(properties):
    model_location = properties['model_dir']
    tensor_parallel_degree = properties["tensor_parallel_degree"]
    
    # A model_id property, if present, overrides the model directory
    if "model_id" in properties:
        model_location = properties['model_id']

    llm = LLM(model=model_location, tensor_parallel_size=int(tensor_parallel_degree))
    tokenizer = LlamaTokenizer.from_pretrained(model_location, torch_dtype=torch.float16)
    return llm,tokenizer


def handle(inputs: Input) -> Optional[Output]:
    global predictor
    global tokenizer
    if not predictor:
        predictor,tokenizer = get_model(inputs.get_properties())

    if inputs.is_empty():
        # Model server makes an empty call to warmup the model on startup
        return None

    data = inputs.get_as_json()
    params = data.get("params",{})

    # Fall back to 256 new tokens (an arbitrary default) when the request omits max_tokens
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=params.get("max_tokens", 256))
    result = predictor.generate(data["inputs"], sampling_params)
    
    result_json = []
    for output in result:
        prompt = output.prompt
        #input_tokens = tokenizer.tokenize(prompt)
        #input_token_lenth = len(input_tokens)
        #print(f"input token lenth:{input_token_lenth}")
        
        generated_text = output.outputs[0].text
        #output_tokens = tokenizer.tokenize(generated_text)
        #output_token_lenth = len(output_tokens)
        #print(f"output token lenth:{output_token_lenth}")
        
        result_json.append(generated_text)
    return Output().add(result_json)

Writing ./src/model.py


#### Create and initialize the variables required to create the endpoint. We use boto3 for this.

In [13]:
import sagemaker
from sagemaker import image_uris
import boto3
import os
import time
import json
from pathlib import Path

sage_session = sagemaker.Session()
model_bucket = sage_session.default_bucket()  # bucket to house artifacts
s3_code_prefix = (
    #"hf-large-model-llama-7b-0625/code"  # folder within bucket where code artifact will go
    "llama2-vllm/code"
)

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")



**Image URI of the DJL container used here**

In [14]:
# Note: modify the image URI to match your region.
# Example for us-east-1 with an older release:
# inference_image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.21.0-deepspeed0.8.0-cu117"

inference_image_uri = "763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.22.1-deepspeed0.9.2-cu118"
print(f"Image going to be used is ---- > {inference_image_uri}")

Image going to be used is ---- > 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.22.1-deepspeed0.9.2-cu118


**Create the Tarball and then upload to S3 location**

In [15]:
!rm -f model.tar.gz
!tar czvf model.tar.gz src

src/
src/requirements.txt
src/model.py
src/serving.properties


In [16]:
s3_code_artifact = sage_session.upload_data("model.tar.gz", model_bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {s3_code_artifact}")

S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-west-2-687912291502/llama2-vllm/code/model.tar.gz


In [17]:
print(f"S3 Model Bucket is -- > {model_bucket}")

S3 Model Bucket is -- > sagemaker-us-west-2-687912291502


### Steps to create the endpoint

1. Create the model using the container image and the model tarball uploaded earlier
2. Create the endpoint config using the following key parameters

    a) Instance type is ml.g5.48xlarge

    b) ContainerStartupHealthCheckTimeoutInSeconds is 15*60 to ensure the health check starts only after the model is ready

3. Create the endpoint using the endpoint config created


One of the key parameters here is **TENSOR_PARALLEL_DEGREE** (set in this notebook through `option.tensor_parallel_degree` in `serving.properties`), which tells the inference library to partition the model across 8 GPUs. This is a tunable and configurable parameter.

This parameter also controls the number of workers per model that are started when DJL Serving runs. For example, if we have an 8-GPU machine and create 8 partitions, we will have 1 worker per model to serve the requests. For further reading on DeepSpeed, you can follow the link https://www.deepspeed.ai/tutorials/inference-tutorial/#initializing-for-inference.
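
The relationship between GPUs, partitions, and workers described above can be written as a small calculation. This is a sketch of the arithmetic only, assuming the number of workers is the integer quotient of the GPU count and the tensor parallel degree; `workers_per_model` is a hypothetical helper, not a DJL Serving API.

```python
def workers_per_model(num_gpus: int, tensor_parallel_degree: int) -> int:
    """Number of model workers when each worker spans tensor_parallel_degree GPUs."""
    if tensor_parallel_degree < 1 or tensor_parallel_degree > num_gpus:
        raise ValueError("tensor_parallel_degree must be between 1 and num_gpus")
    return num_gpus // tensor_parallel_degree


print(workers_per_model(8, 8))  # the configuration used in this notebook
print(workers_per_model(8, 4))  # a smaller degree leaves room for more workers
```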

In [18]:
from sagemaker.utils import name_from_base

model_name = name_from_base("llama2-70b-vllm")
print(model_name)

role = sagemaker.get_execution_role()

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": s3_code_artifact,
    },
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

llama2-70b-vllm-2023-09-07-15-09-47-634
Created Model: arn:aws:sagemaker:us-west-2:687912291502:model/llama2-70b-vllm-2023-09-07-15-09-47-634


VolumeSizeInGB has been left commented out. Use this value for instance types that support EBS volume mounts. The instance we are using comes with preconfigured storage and does not support additional volume mounts.

In [19]:
endpoint_config_name = f"{model_name}-config-0902"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.48xlarge",
            "InitialInstanceCount": 1,
            #"VolumeSizeInGB" : 300,
            "ModelDataDownloadTimeoutInSeconds": 15*60,
            "ContainerStartupHealthCheckTimeoutInSeconds": 15*60,
        },
    ],
)
endpoint_config_response

{'EndpointConfigArn': 'arn:aws:sagemaker:us-west-2:687912291502:endpoint-config/llama2-70b-vllm-2023-09-07-15-09-47-634-config-0902',
 'ResponseMetadata': {'RequestId': 'ec070e2a-e47b-4384-a7af-16b586d4e59d',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'ec070e2a-e47b-4384-a7af-16b586d4e59d',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '132',
   'date': 'Thu, 07 Sep 2023 15:09:47 GMT'},
  'RetryAttempts': 0}}

In [20]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=f"{endpoint_name}", EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

Created Endpoint: arn:aws:sagemaker:us-west-2:687912291502:endpoint/llama2-70b-vllm-2023-09-07-15-09-47-634-endpoint


#### Wait for the endpoint to be created.

### This step can take about 15 minutes or longer, so please be patient

In [21]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:us-west-2:687912291502:endpoint/llama2-70b-vllm-2023-09-07-15-09-47-634-endpoint
Status: InService
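
The loop above keeps polling for as long as the status is `Creating`. A variant with an upper bound on the waiting time might look like the following sketch. It takes the describe call as a parameter so it can be exercised without AWS; `wait_for_endpoint`, its default values, and the fake describe call in the usage example are illustrative assumptions.

```python
import time
from typing import Callable, Dict


def wait_for_endpoint(
    describe: Callable[[], Dict],
    poll_seconds: float = 60,
    timeout_seconds: float = 45 * 60,
) -> str:
    """Poll describe() until the status leaves "Creating" or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = describe()["EndpointStatus"]
        if status != "Creating":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"endpoint still {status} after {timeout_seconds} seconds")
        time.sleep(poll_seconds)


# Fake describe call that reports "Creating" twice and then "InService"
responses = iter(["Creating", "Creating", "InService"])
final_status = wait_for_endpoint(lambda: {"EndpointStatus": next(responses)}, poll_seconds=0)
print(final_status)
```

With the notebook's client, the call would be `wait_for_endpoint(lambda: sm_client.describe_endpoint(EndpointName=endpoint_name))`. The function returns whatever final status it sees, so the caller should still check that it is `InService` rather than `Failed`. boto3 also offers an `endpoint_in_service` waiter on the SageMaker client as a built-in alternative.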


#### Use Boto3 to invoke the endpoint.

This is a generative model, so we pass in text as a prompt and the model completes the sentence and returns the result.
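
The handler in `model.py` reads a JSON body with an `inputs` list and a `params` dictionary, and returns a JSON list with one generated text per prompt. The sketch below shows helpers for building that request body and decoding the response. `build_payload` and `parse_response` are hypothetical names, and the sample response bytes are made up for illustration.

```python
import json
from typing import List


def build_payload(prompts: List[str], max_tokens: int = 300) -> str:
    """Serialize prompts and generation parameters in the shape model.py expects."""
    return json.dumps({"inputs": prompts, "params": {"max_tokens": max_tokens}})


def parse_response(body: bytes) -> List[str]:
    """Decode the endpoint response body into a list of generated texts."""
    return json.loads(body.decode("utf-8"))


payload = build_payload(["a happy weekend with my family, I"])
print(payload)

# Made-up response bytes standing in for response_model["Body"].read()
sample_body = b'["sat down to work on a new blog post."]'
print(parse_response(sample_body))
```

The string returned by `build_payload` can be passed as the `Body` argument of `invoke_endpoint` together with `ContentType="application/json"`.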


In [22]:
%%time
import json
import boto3
import time


start_time = time.time()

smr_client = boto3.client("sagemaker-runtime")
prompt1 = """根据以下反引号内的商品详细描述，为电商直播主持人创作一段引人注目的商品推介话术
‘’‘
iPhone 14是苹果公司在2022年9月8日正式发布的最新手机。它配备了一块6.1英寸的OLED屏幕，并提供了六种独特的颜色选择：蓝色、紫色、午夜色、星光色、红色和黄色。手机的尺寸设计优雅，长度为146.7毫米，宽度为71.5毫米，厚度为7.8毫米，重量约为172克。
在性能上，iPhone 14搭载了强大的苹果A15仿生芯片，内部含有6 核中央处理器，有 2 个性能核心和 4 个能效核心，还有5 核GPU图形处理器。。它不仅支持车祸检测和卫星通信等实用功能，而且在拍照方面也表现出色。
后置摄像头包括一个1200万像素的主镜头和一个1200万像素的超广角镜头，前置摄像头也是1200万像素
此外，该手机还支持光像引擎、深度融合技术、智能HDR4和人像模式等摄影技术，确保用户可以轻松捕捉每一个美好瞬间
’‘’
话术中应包括商品的主要特点、优势及互动环节,使用中文撰写，并保持话术简洁、有趣且具吸引力,并确保包含上述要求的所有元素"""

prompt2="""根据以下反引号内的关键词，为电商直播主持人创作一段通用的开场、互动或欢迎话术。请确保话术融入这些关键词，使用中文撰写，内容要简洁、有趣且具吸引力，同时适应广泛的商品和场景。
‘’‘
精选
性价比
品质
日常家居
穿搭
限时折扣
免费赠品
抽奖活动
大品牌合作
独家优惠
’‘’
请使用上述关键词，编写一段具有普遍适用性，适于电商直播开头或互动环节的话术"""


prompt3="""
请根据以下反引号内的商品描述、意图、问题模板和回答模板，为电商直播商品生成一个问答库。要求生成的回答应当有至少一组，最多五组。请确保答案基于商品描述和回答模板生成。如果无法生成回答，表示为“根据已知信息无法生成回答”。格式应如下：{{"Q":"问题","A":['答案1-1','答案1-2'...]}}
‘’‘
商品描述：iPhone 14是苹果公司在2022年9月8日正式发布的最新手机。它配备了一块6.1英寸的OLED屏幕，并提供了六种独特的颜色选择：蓝色、紫色、午夜色、星光色、红色和黄色。手机的尺寸设计优雅，长度为146.7毫米，宽度为71.5毫米，厚度为7.8毫米，重量约为172克。
在性能上，iPhone 14搭载了强大的苹果A15仿生芯片，内部含有6 核中央处理器，有 2 个性能核心和 4 个能效核心，还有5 核GPU图形处理器。。它不仅支持车祸检测和卫星通信等实用功能，而且在拍照方面也表现出色。后置摄像头包括一个1200万像素的主镜头和一个1200万像素的超广角镜头，前置摄像头也是1200万像素。此外，该手机还支持光像引擎、深度融合技术、智能HDR4和人像模式等摄影技术，确保用户可以轻松捕捉每一个美好瞬间。}
意图：性能
问题模板：手机的性能如何？
回答模板：[商品名称]采用了最新的[芯片名称]，搭载了[核心数量]核CPU和[GPU核心数量]核GPU，为用户提供强大的性能。
谈到[商品名称]的性能，不得不提及它的[芯片名称]，配备[核心数量]核CPU和[GPU核心数量]核GPU，应对各种任务都游刃有余。
[商品名称]在性能上表现卓越，得益于其[芯片名称]和[核心数量]核处理器，加上[GPU核心数量]核GPU，让每次使用都顺畅无比。}
’‘’
问答生成：请基于上述商品描述、意图、问题模板和回答模板，为电商直播商品提供符合上述格式的问答库。
"""

prompt4="""
请以电商直播主持人的第一人称角度回答观众的商品相关问题。确保只回答与商品相关的问题，并只使用以下反引号内知识库的信息来回答。回答中请勿随意编造内容。格式应如下:[{{"intention": "意图1", "answer": "回答1"}},{{"intention": "意图2", "answer": "回答2"}}]
‘’‘
[问题：iPhone 14有哪些可选的颜色？][回答：iPhone 14提供了六种时尚的颜色选择，包括蓝色、紫色、午夜色、星光色、红色和黄色。][意图：颜色]
[问题：关于摄像头，iPhone 14的前置和后置摄像头分辨率是多少？][回答：iPhone 14的前置和后置摄像头分辨率都是1200万像素。][意图：分辨率]
[问题：我经常用手机办公和玩游戏，iPhone 14的性能如何？][回答：iPhone 14搭载了强大的苹果A15六核中央处理器，无论是玩游戏、看视频，还是办公，它都可以轻松应对。][意图：性能]}
’‘’
观众问题：主播小姐姐好漂亮
使用第一人称直接回答观众关于商品的提问。检查知识库中是否有与观众提问相匹配的回答。对于在知识库中找到的每个匹配意图，请依次提供对应的回答，并确保从知识库中的意图中提取相应的意图标签。如果所有的意图都在知识库中找不到答案，回答“根据已知信息无法回答问题”。确保不使用emoji。
"""

other="""我很在意手机的颜色和摄像头功能，能给我介绍一下iPhone 14在这两方面的特点吗？
便宜点就好了"""


prompt_prefix = "你正在一个聊天室里和不同国家的人们聊天，你能读懂所有国家的语言，你负责通过聊天记录分析所有聊天者的性格和有效信息，具体步骤如下：\
1.阅读他们的聊天记录 \
2.总结他们聊天里面的重要信息 \
3.抽象他们的人设 \
4.使用评分体系抽象他们之间的人际关系，然后给一个评分，范围1-10分，分越高关系越好 \
聊天信息如下: " 

chats_infos = """
WaRGazmo : "you lucked out there buddy" 
WarLord : "suerte? eso no existe " 
WarLord : "soy más rápido que la luz " 
WaRGazmo : "it exists.. or karma" 
DirtyE1bow : "so you was a planned birth ?" 
WaRGazmo : "thats what she said bruh" 
WarLord : "te amo mi amor " 
Manowarik : "Мир вам,люди добрые.." 
kotofei : "и тебе боярин, что не подался в челядь королю)" 
XxNORxXMithra : "God morgen folkens :) " 
kotofei : "и прочие жители галактики " 
XxNORxXMithra : "Ja de også forsåvidt :) " 
Manowarik : "Котофей-это который по цепи кругом?Песни там,сказки?😆😆" 
kotofei : "не, то дальний убогий родственник " 
Manowarik : "Эххх..Лукоморье мимо..((" 
kipl : "Котофей он из сказки Лиса и Котофей Иванович. " 
kipl : "Межвидовой брак и крышевание леса" 
kotofei : "лиса 🦊 мералиса и Котофей Иваныч " 
leister : "😆" 
XxFoxyQBAxX : "po co tyle zrobiłeś?"
"""

prompt="##Eva:How often do you travel?## Malcolm:I like David Bowie too. I don’t travel much any more, but I used to.## Eva:That's cool! I recently took a road trip with my friend. We had so much fun and it opened up so many possibilities for us. What kind of places did you like to explore?## Malcolm:I love history and culture, so those are my favorite.## Eva: He was born in Birmingham, England and raised in Los Angeles, California.Eva: Yes, Sir. Queen is one of the most influential bands of all time.## Malcolm:It is. They are one of my favorite rock groups. What about you?## Eva:I'm more into classic rock, especially David Bowie. Who is your favorite artist?## Malcolm:Marylin Manson. You?## Eva:My favorite artist is David Bowie.## Eva:How often do you travel?## Malcolm:I like David Bowie too. I don’t travel much any more, but I used to.## Eva:That's cool! I recently took a road trip with my friend. We had so much fun and it opened up so many possibilities for us. What kind of places did you like to explore?## Malcolm:I love history and culture, so those are my favorite.## Eva: He was born in Birmingham, England and raised in Los Angeles, California.##Eva: Yes, Sir. Queen is one of the most influential bands of all time.## Malcolm:It is. They are one of my favorite rock groups. What about you?## Eva:I'm more into classic rock, especially David Bowie. Who is your favorite artist?## Malcolm:Marylin Manson. You?## Eva:My favorite artist is David Bowie.## Eva:How often do you travel?## Malcolm:I like David Bowie too. I don’t travel much any more, but I used to.## Eva:That's cool! I recently took a road trip with my friend. We had so much fun and it opened up so many possibilities for us. What kind of places did you like to explore?## Malcolm:I love history and culture, so those are my favorite.## Eva: He was born in Birmingham, England and raised in Los Angeles, California.##Eva: Yes, Sir. Queen is one of the most influential bands of all time.## Malcolm:It is. 
They are one of my favorite rock groups. What about you?## Eva:I'm more into classic rock, especially David Bowie. Who is your favorite artist?## Malcolm:Marylin Manson. You?## Eva:My favorite artist is David Bowie.## Eva:How often do you travel?## Malcolm:I like David Bowie too. I don’t travel much any more, but I used to.## Eva:That's cool! I recently took a road trip with my friend. We had so much fun and it opened up so many possibilities for us. What kind of places did you like to explore?## Malcolm:I love history and culture, so those are my favorite.## Eva: He was born in Birmingham, England and raised in Los Angeles, California.#### Malcolm:Oh. What are you wearing right now, pet?## Eva:"
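# Overrides the longer prompt above; this short prompt is the one actually sent
prompt="a happy weekend with my family, I"
# model.py only reads "max_tokens" from these parameters; the other keys are ignored by the handler
parameters = {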
  "early_stopping": True,
  "max_tokens": 300,
  "min_new_tokens": 128,
  #"do_sample": False,
  #"temperature": 1.0,
}


response_model = smr_client.invoke_endpoint(
            EndpointName=endpoint_name,
            Body=json.dumps(
            {
                "inputs": [prompt],
                #"inputs": [prompt1,prompt3,prompt2,prompt2,prompt2,prompt2,prompt4],
                "params": parameters
            }
            ),
            ContentType="application/json",
        )

end_time = time.time()
time_interval = end_time - start_time
print(f"Code execution time (seconds): {time_interval}")

response_model['Body'].read().decode('utf8')

Code execution time (seconds): 20.264828205108643
CPU times: user 20.5 ms, sys: 3.21 ms, total: 23.7 ms
Wall time: 20.3 s


'[\n  "sat down to work on a new blog post. As I started writing, I realized that my mind was a complete blank. I couldn\'t think of anything to say.\\n\\nI tried to brainstorm, but my ideas were dull and uninspired. I couldn\'t seem to come up with anything that I thought would be interesting or valuable to my readers.\\n\\nI began to feel frustrated and worried. What was wrong with me? Why couldn\'t I think of anything to write about?\\n\\nThen, I remembered a piece of advice that a wise writer once gave me: \\"Write about what you\'re passionate about.\\"\\n\\nI took a deep breath and let my mind wander. What was I passionate about? What did I care about deeply?\\n\\nAs I thought about it, I realized that I was passionate about helping people. I wanted to inspire and motivate others to live their best lives.\\n\\nWith that in mind, I started writing again. This time, the words flowed easily. I wrote about the importance of pursuing your passions, and how it can bring fulfillment and