### 1. Install the Hugging Face libraries and download the model locally

In [1]:
%pip install sagemaker huggingface_hub --upgrade  --quiet

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
from huggingface_hub import login
# Read the token from the HF_TOKEN environment variable rather than hard-coding a secret in the notebook.
login(token=os.environ["HF_TOKEN"])

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/ec2-user/.cache/huggingface/token
Login successful


In [3]:
from huggingface_hub import snapshot_download
from pathlib import Path
local_model_path = Path("./LLM_llama3_8b_model")
local_model_path.mkdir(exist_ok=True)

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
# commit_hash = "41b61a33a2483885c981aa79e0df6b32407ed873"

In [4]:
snapshot_download(repo_id=model_name, cache_dir=local_model_path)

Fetching 17 files:   0%|          | 0/17 [00:00<?, ?it/s]

'LLM_llama3_8b_model/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/c4a54320a52ed5f88b7a2f84496903ea4ff07b45'

### 2. Copy the model to S3 in preparation for deployment

In [10]:
import boto3
import sagemaker
from sagemaker import Model, image_uris, serializers, deserializers

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # region name of the current SageMaker Studio environment
account_id = sess.account_id()  # account_id of the current SageMaker Studio environment
bucket = sess.default_bucket()
s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

In [11]:
s3_model_prefix = f"aigc-llm-models/{model_name}"  # folder where model checkpoint will go
model_snapshot_path = list(local_model_path.glob("**/snapshots/*"))[0]
s3_code_prefix = f"aigc-llm-models/{model_name}_deploy_code"
print(f"s3_code_prefix: {s3_code_prefix}")
print(f"model_snapshot_path: {model_snapshot_path}")

s3_code_prefix: aigc-llm-models/meta-llama/Meta-Llama-3-8B-Instruct_deploy_code
model_snapshot_path: LLM_llama3_8b_model/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/c4a54320a52ed5f88b7a2f84496903ea4ff07b45
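The cell above takes the first match of the glob. If the cache directory ever holds more than one revision, that picks an arbitrary one. A minimal sketch of a stricter lookup follows; `find_snapshot_dir` is a hypothetical helper name, and it assumes the `<cache_dir>/models--<org>--<name>/snapshots/<commit>` layout shown in the output above.

```python
from pathlib import Path


def find_snapshot_dir(cache_dir: Path) -> Path:
    """Return the single snapshot directory found under cache_dir.

    Raises instead of guessing when there are zero or several snapshots.
    """
    candidates = sorted(p for p in cache_dir.glob("**/snapshots/*") if p.is_dir())
    if not candidates:
        raise FileNotFoundError(f"No snapshot directory found under {cache_dir}")
    if len(candidates) > 1:
        raise RuntimeError(f"Expected one snapshot, found {len(candidates)}: {candidates}")
    return candidates[0]
```

In the notebook this could be called as `model_snapshot_path = find_snapshot_dir(local_model_path)`.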


In [12]:
s3_path = f"s3://{bucket}/{s3_model_prefix}/"

In [13]:
!aws s3 cp --recursive --exclude "*.pth" {model_snapshot_path} {s3_path}

upload: LLM_llama3_8b_model/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/c4a54320a52ed5f88b7a2f84496903ea4ff07b45/README.md to s3://sagemaker-us-east-1-357224784104/aigc-llm-models/meta-llama/Meta-Llama-3-8B-Instruct/README.md
upload: LLM_llama3_8b_model/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/c4a54320a52ed5f88b7a2f84496903ea4ff07b45/config.json to s3://sagemaker-us-east-1-357224784104/aigc-llm-models/meta-llama/Meta-Llama-3-8B-Instruct/config.json
upload: LLM_llama3_8b_model/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/c4a54320a52ed5f88b7a2f84496903ea4ff07b45/LICENSE to s3://sagemaker-us-east-1-357224784104/aigc-llm-models/meta-llama/Meta-Llama-3-8B-Instruct/LICENSE
upload: LLM_llama3_8b_model/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/c4a54320a52ed5f88b7a2f84496903ea4ff07b45/USE_POLICY.md to s3://sagemaker-us-east-1-357224784104/aigc-llm-models/meta-llama/Meta-Llama-3-8B-Instruct/USE_POLICY.md
upload: LLM_llama3_8b_model/models--meta-lla

### 3. Prepare for deployment (entrypoint script, container image, serving configuration)

In [14]:
inference_image_uri = image_uris.retrieve(
        framework="djl-deepspeed",
        region=sess.boto_session.region_name,
        version="0.27.0"
    )

In [15]:
local_code_dir = s3_code_prefix.split('/')[-1]
!mkdir -p {local_code_dir}

#### Note: option.model_id must be set to the S3 URL the model was uploaded to

In [29]:
%%writefile {local_code_dir}/serving.properties
engine=Python
option.model_id=S3PATH
option.dtype=bf16
option.task=text-generation
option.rolling_batch=vllm
option.tensor_parallel_degree=1
option.device_map=auto
option.gpu_memory_utilization=0.85
option.max_model_len=8192
option.max_tokens=8192
option.output_formatter = json
option.model_loading_timeout = 1200
option.enforce_eager=true

Overwriting Meta-Llama-3-8B-Instruct_deploy_code/serving.properties


In [30]:
!sed -i "s|option.model_id=S3PATH|option.model_id={s3_path}|" {local_code_dir}/serving.properties
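The placeholder substitution can also be done in Python by rendering the file from a dictionary, which avoids the intermediate `S3PATH` marker. The following is only a sketch: `render_serving_properties` is a hypothetical helper, and the bucket name in the example is a placeholder, not the one used in this notebook.

```python
def render_serving_properties(options: dict, engine: str = "Python") -> str:
    """Render a serving.properties file body from an options mapping."""

    def fmt(value):
        # Properties files expect lowercase booleans, e.g. enforce_eager=true.
        return str(value).lower() if isinstance(value, bool) else str(value)

    lines = [f"engine={engine}"]
    lines += [f"option.{key}={fmt(value)}" for key, value in options.items()]
    return "\n".join(lines) + "\n"


example = render_serving_properties(
    {
        # Hypothetical bucket; in the notebook this would be s3_path.
        "model_id": "s3://example-bucket/aigc-llm-models/meta-llama/Meta-Llama-3-8B-Instruct/",
        "dtype": "bf16",
        "rolling_batch": "vllm",
        "tensor_parallel_degree": 1,
        "enforce_eager": True,
    }
)
print(example)
```

The result could then be written with `Path(local_code_dir, "serving.properties").write_text(example)`.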

In [31]:
!rm -f model.tar.gz
!cd {local_code_dir} && rm -rf ".ipynb_checkpoints"
!tar czvf model.tar.gz {local_code_dir}

Meta-Llama-3-8B-Instruct_deploy_code/
Meta-Llama-3-8B-Instruct_deploy_code/serving.properties
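The same archive can be built without shell commands using the standard `tarfile` module. This is a sketch under the assumption that only `.ipynb_checkpoints` folders need to be excluded; `build_model_archive` is a hypothetical helper name.

```python
import tarfile
from pathlib import Path


def build_model_archive(code_dir: Path, archive_path: Path) -> list[str]:
    """Pack code_dir into a gzip tarball, skipping notebook checkpoint folders.

    Returns the member names of the archive that was written.
    """

    def skip_checkpoints(info: tarfile.TarInfo):
        # Returning None drops the member (and, for a directory, its contents).
        return None if ".ipynb_checkpoints" in Path(info.name).parts else info

    # Mode "w:gz" overwrites an existing archive, so no prior delete is needed.
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(code_dir, arcname=code_dir.name, filter=skip_checkpoints)
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()
```

For example, `build_model_archive(Path(local_code_dir), Path("model.tar.gz"))` would produce the same two entries listed in the output above.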


In [32]:
s3_code_artifact = sess.upload_data("model.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {s3_code_artifact}")

S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-east-1-357224784104/aigc-llm-models/meta-llama/Meta-Llama-3-8B-Instruct_deploy_code/model.tar.gz


### 4. Create the model and the endpoint

In [33]:
from sagemaker.utils import name_from_base
import boto3

model_name = name_from_base("llama-8b-instruct")  # SageMaker model name; overwrites the Hugging Face repo ID stored in model_name earlier
print(model_name)
print(f"Image going to be used is ---- > {inference_image_uri}")

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": s3_code_artifact
    },
    
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

llama-8b-instruct-2024-05-16-15-29-16-267
Image going to be used is ---- > 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121
Created Model: arn:aws:sagemaker:us-east-1:357224784104:model/llama-8b-instruct-2024-05-16-15-29-16-267


In [34]:
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

# Note: ml.g4dn.2xlarge is another option
endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.p4d.24xlarge",
            "InitialInstanceCount": 1,
            # "VolumeSizeInGB" : 400,
            # "ModelDataDownloadTimeoutInSeconds": 2400,
            "ContainerStartupHealthCheckTimeoutInSeconds": 10*60,
        },
    ],
)
endpoint_config_response

{'EndpointConfigArn': 'arn:aws:sagemaker:us-east-1:357224784104:endpoint-config/llama-8b-instruct-2024-05-16-15-29-16-267-config',
 'ResponseMetadata': {'RequestId': '761236c0-bfd2-4106-a93c-5c3a9dfb1e10',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '761236c0-bfd2-4106-a93c-5c3a9dfb1e10',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '129',
   'date': 'Thu, 16 May 2024 15:29:19 GMT'},
  'RetryAttempts': 0}}

In [35]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=f"{endpoint_name}", EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

Created Endpoint: arn:aws:sagemaker:us-east-1:357224784104:endpoint/llama-8b-instruct-2024-05-16-15-29-16-267-endpoint


#### Poll the endpoint deployment status

In [40]:
import time
resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:us-east-1:357224784104:endpoint/llama-8b-instruct-2024-05-13-09-54-02-776-endpoint
Status: InService
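The loop above exits on any status other than `Creating`, so a failed deployment is only visible in the last printed line. A sketch of a stricter wait follows, with a timeout and an explicit error on non-success states. `wait_for_endpoint` is a hypothetical helper; it takes a `describe` callable so that it can be exercised without AWS access, and it assumes the response carries `EndpointStatus` and, on failure, `FailureReason`.

```python
import time
from typing import Callable


def wait_for_endpoint(
    describe: Callable[[], dict],
    poll_seconds: float = 60,
    timeout_seconds: float = 3600,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll describe() until the endpoint is InService, or raise."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        resp = describe()
        status = resp["EndpointStatus"]
        if status == "InService":
            return status
        if status != "Creating":
            # Any other state (for example Failed) will not become InService by waiting.
            reason = resp.get("FailureReason", "no reason reported")
            raise RuntimeError(f"Endpoint ended in status {status}: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint still Creating after {timeout_seconds} s")
        sleep(poll_seconds)
```

In the notebook it could be called as `wait_for_endpoint(lambda: sm_client.describe_endpoint(EndpointName=endpoint_name))`. The boto3 waiter `sm_client.get_waiter("endpoint_in_service")` is an alternative when custom progress output is not needed.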


### 5. Test the model

#### Non-streaming invocation

In [25]:
%%time
import json
import boto3

smr_client = boto3.client("sagemaker-runtime")

parameters = {
  "max_new_tokens": 8192,
  "temperature": 0.9,
  "top_p":0.8
}

CPU times: user 6.06 ms, sys: 553 µs, total: 6.62 ms
Wall time: 6.96 ms


In [42]:
# Prompt in Chinese: "Write a 500-word science-fiction story set during a space war."
prompts1 = """写一篇500字的科幻小说，背景关于宇宙战争"""
start = time.time()
response_model = smr_client.invoke_endpoint(
            EndpointName=endpoint_name,
            Body=json.dumps(
            {
                "inputs": prompts1,
                "parameters": parameters,
                "history" : [],
            }
            ),
            ContentType="application/json",
        )

resp = response_model['Body'].read()
print (f"\ntime:{time.time()-start} s")
print(json.loads(resp)['generated_text'])


time:9.496861457824707 s

The Cosmic War
In the year 2256, humanity had finally reached the stars, colonizing distant planets and moons. But with the expansion of space travel came the inevitable: conflict. The United Earth Government, formed to govern the newly colonized planets, was torn apart by internal strife and petty squabbles. The once-peaceful galaxy was now a battleground, with factions vying for control.

The first shots were fired when the Mars Colonies, tired of being treated as second-class citizens, declared independence from the United Earth Government. The Earth-based government, led by the ruthless and cunning President Zhang, responded with force, sending a fleet of warships to quell the rebellion.

But the Mars Colonies were not alone. The Jupiter Colonies, led by the enigmatic and brilliant Admiral Zhang Wei, had been secretly building a powerful fleet of their own. They saw the conflict as an opportunity to strike back against the Earth-based government, which th
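The request and response handling in the cell above can be separated from the network call, which makes the payload shape easy to check offline. This is a sketch only: `build_request_body` and `extract_generated_text` are hypothetical helpers, and the `inputs`, `parameters` and `generated_text` field names are taken from the cells in this notebook.

```python
import json


def build_request_body(prompt: str, parameters: dict, **extra_fields) -> str:
    """Serialize an invocation payload; extra_fields covers keys such as history or stream."""
    body = {"inputs": prompt, "parameters": parameters}
    body.update(extra_fields)
    return json.dumps(body)


def extract_generated_text(raw: bytes) -> str:
    """Pull the generated text out of a non-streaming response body."""
    return json.loads(raw)["generated_text"]
```

With these, the call would read `Body=build_request_body(prompts1, parameters, history=[])`, and the result would be printed with `extract_generated_text(response_model['Body'].read())`.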

#### Streaming invocation

In [26]:
import io
import re

# The stream carries newlines as the escaped two-character sequence "\n"; this pattern matches it.
NEWLINE = re.compile(r'\\n')

class LineIterator:
    """
    A helper class for parsing the byte stream returned by a model served with the LMI container.

    The model output arrives as incremental pieces of a single JSON object, for example:

        b'{"generated_text": "'
        b'lo from L"'
        b'LM \\n\\n'
        b'How are you?"}'
        ...

    On each iteration we read only the newly arrived bytes, then advance the read position
    so that the next iteration continues from there.
    """

    START_SEQUENCE = b'{"generated_text": "'
    STOP_SEQUENCE = b'"}'

    def __init__(self, stream):
        self.byte_iterator = iter(stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()
            if line:
                self.read_pos += len(line)
                # Slice off the JSON wrapper as an exact prefix/suffix. bytes.lstrip()/rstrip()
                # strip a set of bytes and would also eat leading or trailing text.
                if line.startswith(self.START_SEQUENCE):
                    line = line[len(self.START_SEQUENCE):]
                if line.endswith(self.STOP_SEQUENCE):
                    line = line[:-len(self.STOP_SEQUENCE)]
                line = line.decode('utf-8')
                # Turn escaped newlines into real ones (this also covers "\n\n").
                return NEWLINE.sub('\n', line)
            try:
                chunk = next(self.byte_iterator)
            except StopIteration:
                # Drain whatever is still buffered before ending the iteration.
                if self.read_pos < self.buffer.getbuffer().nbytes:
                    continue
                raise
            if 'PayloadPart' not in chunk:
                # chunk is a dict, so format it rather than concatenating it to a str.
                print(f'Unknown event type: {chunk}')
                continue
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk['PayloadPart']['Bytes'])
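One detail worth checking in a parser like this is how the JSON wrapper is removed from a chunk. `bytes.lstrip` and `bytes.rstrip` treat their argument as a set of bytes, not as a prefix or suffix, so they can remove real text as well. The sketch below compares them with `removeprefix`/`removesuffix` (Python 3.9+); `strip_wrapper` is a hypothetical helper and the sample chunk is made up for illustration.

```python
START_SEQUENCE = b'{"generated_text": "'
STOP_SEQUENCE = b'"}'


def strip_wrapper(chunk: bytes) -> bytes:
    """Remove the JSON wrapper as an exact prefix and suffix."""
    return chunk.removeprefix(START_SEQUENCE).removesuffix(STOP_SEQUENCE)


first_chunk = b'{"generated_text": "tea and'

# Every byte of "tea and" also occurs in START_SEQUENCE, so lstrip removes it all.
print(first_chunk.lstrip(START_SEQUENCE))  # b''

# Exact-prefix removal keeps the generated text intact.
print(strip_wrapper(first_chunk))  # b'tea and'
```

The same applies at the end of the stream: `strip_wrapper(b'end."}')` returns `b'end.'`, while a text that legitimately ends in a quote or brace would lose those characters under `rstrip`.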

In [36]:
import json
import boto3

# Prompt in Chinese: "Write a 500-word science-fiction story set during a space war."
input_text = """写一篇500字的科幻小说，背景关于宇宙战争"""


smr_client = boto3.client("sagemaker-runtime")
response_model = smr_client.invoke_endpoint_with_response_stream(
            EndpointName=endpoint_name,
            Body=json.dumps(
            {
                "inputs": input_text,
                "parameters": parameters,
                "stream" : True
            }
            ),
            ContentType="application/json",
        )

def print_response_stream(response_stream):
    event_stream = response_stream.get('Body')
    for line in LineIterator(event_stream):
        print(line, end='')
        
print_response_stream(response_model)


The Cosmic War
In the year 2256, humanity had finally reached the stars, colonizing distant planets and moons. But with the expansion of space travel came the inevitable: conflict. The United Earth Government, formed to govern the newly colonized planets, was torn apart by internal strife and petty squabbles. The once-peaceful galaxy was now a battleground, with factions vying for control.

The first shots were fired when the Mars Colonies, tired of being treated as second-class citizens, declared independence from the United Earth Government. The Earth-based government, led by the ruthless and cunning President Zhang, responded with force, sending a fleet of warships to quell the rebellion.

But the Mars Colonies were not alone. The Jupiter Colonies, led by the enigmatic and brilliant Admiral Zhang Wei, had been secretly building a powerful fleet of their own. They saw the conflict as an opportunity to strike back against the Earth-based government, which they believed had neglected 

In [28]:
!aws sagemaker delete-endpoint --endpoint-name {endpoint_name}
!aws sagemaker delete-endpoint-config --endpoint-config-name {endpoint_config_name}
!aws sagemaker delete-model --model-name {model_name}
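The same cleanup can be done from Python with the SageMaker client, which keeps going if one of the resources is already gone. This is a sketch: `delete_endpoint_resources` is a hypothetical helper, and it accepts any object exposing `delete_endpoint`, `delete_endpoint_config` and `delete_model`, such as the `sm_client` created earlier.

```python
def delete_endpoint_resources(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Delete endpoint, endpoint config and model in that order.

    Returns a list of (method_name, exception) pairs for the steps that failed,
    so that one missing resource does not stop the remaining deletions.
    """
    steps = [
        ("delete_endpoint", {"EndpointName": endpoint_name}),
        ("delete_endpoint_config", {"EndpointConfigName": endpoint_config_name}),
        ("delete_model", {"ModelName": model_name}),
    ]
    errors = []
    for method, kwargs in steps:
        try:
            getattr(sm_client, method)(**kwargs)
        except Exception as exc:  # collect and continue with the next resource
            errors.append((method, exc))
    return errors
```

A call such as `delete_endpoint_resources(sm_client, endpoint_name, endpoint_config_name, model_name)` returns an empty list when all three deletions succeed.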