
### Serve large models on SageMaker with DJL DeepSpeed Container

In this notebook, we explore how to host a large language model on SageMaker using the latest DeepSpeed container from DJL. DJL provides the serving framework, while DeepSpeed is the key sharding library we leverage to host large models. We use DJLServing as the model serving solution in this example. DJLServing is a high-performance, programming-language-agnostic universal model serving solution powered by the Deep Java Library (DJL). To learn more about DJL and DJLServing, you can refer to our recent blog post (https://aws.amazon.com/blogs/machine-learning/deploy-large-models-on-amazon-sagemaker-using-djlserving-and-deepspeed-model-parallel-inference/).

Model parallelism can help deploy large models that would normally be too large for a single GPU. With model parallelism, we partition and distribute a model across multiple GPUs. Each GPU holds a different part of the model, resolving the memory capacity issue for the largest deep learning models with billions of parameters. This notebook uses tensor parallelism techniques which allow GPUs to work simultaneously on the same layer of a model and achieve low latency inference relative to a pipeline parallel solution.
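
To make the idea of tensor parallelism concrete, here is a minimal pure-Python sketch, not how DeepSpeed or any GPU library implements it. It splits the weight matrix of one linear layer by columns across hypothetical "devices" (plain lists here), lets each compute its slice of the output, and concatenates the slices. The helper names are illustrative only.

```python
# Illustrative only: column-wise tensor parallelism for one linear layer.
# Each "device" is just a Python list; real libraries shard GPU tensors.

def matmul(x, w):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [
        [sum(x_row[i] * w[i][j] for i in range(len(w))) for j in range(len(w[0]))]
        for x_row in x
    ]


def split_columns(w, num_shards):
    """Split the columns of w into num_shards contiguous, equal-width shards."""
    cols = len(w[0])
    assert cols % num_shards == 0, "column count must divide evenly"
    width = cols // num_shards
    return [[row[s * width:(s + 1) * width] for row in w] for s in range(num_shards)]


def tensor_parallel_linear(x, w, num_shards):
    """Compute x @ w by giving each shard a slice of w's columns."""
    partials = [matmul(x, shard) for shard in split_columns(w, num_shards)]
    # Concatenate the per-shard outputs along the column axis.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]
# The sharded result matches the unsharded matrix product.
assert tensor_parallel_linear(x, w, num_shards=2) == matmul(x, w)
```

Every shard works on the same layer at the same time, which is why this style of parallelism can reduce latency compared with running layers one after another on different GPUs.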

SageMaker has rolled out a DeepSpeed container that lets users take advantage of its managed serving capabilities and offloads the undifferentiated heavy lifting.

In this notebook, we deploy the open source llama 7B model across the GPUs of an ml.g5.48xlarge instance. Note that the llama 7B fp16 model can be deployed on a single GPU, such as the one in a g5.2xlarge (24 GB VRAM); we just show the code that deploys the LLM across multiple GPUs in SageMaker. DeepSpeed is used for tensor parallel inference, while DJLServing handles inference requests and the distributed workers. Note that the cells below, as written, set `option.model_id` to THUDM/chatglm2-6b and load it with vLLM in `model.py`; the llama model IDs are left commented out in `serving.properties`. For further reading on DeepSpeed, refer to https://arxiv.org/pdf/2207.00032.pdf
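
The single-GPU claim can be checked with a rough back-of-the-envelope estimate. The sketch below counts the weights only (2 bytes per parameter in fp16) and ignores the KV cache, activations, and framework overhead, so treat it as a lower bound. The helper names and the round 7e9 parameter count are assumptions made for illustration.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights only, in GiB (fp16 = 2 bytes/param)."""
    return num_params * bytes_per_param / 1024**3


def fits_on_gpu(num_params, gpu_mem_gib, tensor_parallel_degree=1, bytes_per_param=2):
    """True if each shard's weights fit; ignores KV cache and activations."""
    per_gpu = weight_memory_gib(num_params, bytes_per_param) / tensor_parallel_degree
    return per_gpu < gpu_mem_gib


# Roughly 13 GiB of fp16 weights for 7e9 parameters, below a 24 GB GPU.
print(f"{weight_memory_gib(7e9):.1f} GiB")
print(fits_on_gpu(7e9, gpu_mem_gib=24))
```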


## Create SageMaker compatible Model artifact and Upload Model to S3

SageMaker needs the model to be packaged as a tarball. In this notebook, we create the model artifact with the inference code to shorten the endpoint creation time.

The tarball has the following layout:

```
code
├── model.py
├── requirements.txt
└── serving.properties
```


- `model.py` is the key file that handles all serving requests.
- `requirements.txt` lists the libraries to install when the container starts up.
- `serving.properties` is the configuration file whose properties are passed to `model.py` and can be used to customize it at run time.
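
As a rough illustration of how the properties reach `model.py`, the sketch below parses `key=value` lines into a dictionary. The helper `parse_properties` is hypothetical and is not part of DJLServing. It assumes that keys written as `option.<name>` are seen by the handler as `<name>`, which is what the lookups `properties["model_id"]` and `properties["tensor_parallel_degree"]` in `model.py` suggest.

```python
def parse_properties(text: str) -> dict:
    """Parse key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        # Assumption: "option.<name>" is exposed to the handler as "<name>".
        if key.startswith("option."):
            key = key[len("option."):]
        props[key] = value.strip()
    return props


example = """
engine=Python
option.tensor_parallel_degree=4
#option.model_id=huggyllama/llama-7b
option.model_id=THUDM/chatglm2-6b
"""
props = parse_properties(example)
print(props["model_id"], int(props["tensor_parallel_degree"]))
```

Values arrive as strings, which is why `model.py` converts `tensor_parallel_degree` with `int(...)`.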


#### `serving.properties` has an `engine` parameter that tells the DJL model server which engine to use to load the model.

option.tensor_parallel_degree:  the number of GPUs to shard the model across. A g5.48xlarge has 8 GPUs, so you would set it to 8 on that instance; the file below sets it to 4, which matches the 4 GPUs of the ml.g5.24xlarge used in the endpoint configuration.

option.s3url:  set this to the S3 path of your model. The S3 path must end with "/". The file below uses option.model_id with a Hugging Face model ID instead.

batch_size:   enables server-side batching at the request level. Set it to the largest value that does not cause an out-of-memory (OOM) error. The `model.py` in this notebook is only a demo for one prompt per client request.

max_batch_delay:   the maximum time to wait while a batch fills up, measured in milliseconds.
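
To show how `batch_size` and `max_batch_delay` interact, here is a toy simulation of request-level batching. It is a sketch under stated assumptions and not DJLServing's implementation. A batch is closed when it reaches `batch_size` requests, or when a new request arrives more than `max_batch_delay` milliseconds after the first request in the open batch. The function name and the arrival times are made up for illustration.

```python
def group_into_batches(arrival_ms, batch_size, max_batch_delay):
    """Group sorted request arrival times (in ms) into batches."""
    batches = []
    current = []
    for t in arrival_ms:
        # Close the open batch if this request arrives after its delay window.
        if current and t - current[0] > max_batch_delay:
            batches.append(current)
            current = []
        current.append(t)
        # Close the batch as soon as it is full.
        if len(current) == batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


arrivals = [0, 10, 20, 150, 160, 400]
batches = group_into_batches(arrivals, batch_size=2, max_batch_delay=100)
print(batches)
```

A larger `batch_size` improves throughput until GPU memory runs out, while a larger `max_batch_delay` gives batches more time to fill at the cost of added latency for the first request in each batch.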

In [1]:
!rm -rf src
!mkdir src

In [22]:
%%writefile ./src/serving.properties
engine=Python
option.tensor_parallel_degree=4
#option.model_id=huggyllama/llama-7b
#option.model_id=huggyllama/llama-13b
option.model_id=THUDM/chatglm2-6b

Overwriting ./src/serving.properties


In [23]:
%%writefile ./src/requirements.txt
vllm==0.1.1

Overwriting ./src/requirements.txt


In [24]:
%%writefile ./src/model.py
from vllm import LLM, SamplingParams
from djl_python import Input, Output
import os

# Disable NCCL peer-to-peer transfers between GPUs.
os.environ['NCCL_P2P_DISABLE'] = '1'

# Loaded lazily on the first call to handle() and then reused.
predictor = None


def get_model(properties):
    # model_id and tensor_parallel_degree come from serving.properties.
    model_name = properties["model_id"]
    tensor_parallel_degree = int(properties["tensor_parallel_degree"])
    llm = LLM(model=model_name, tensor_parallel_size=tensor_parallel_degree)
    return llm


def handle(inputs: Input) -> Output:
    global predictor
    if not predictor:
        predictor = get_model(inputs.get_properties())

    if inputs.is_empty():
        # Model server makes an empty call to warmup the model on startup
        return None

    data = inputs.get_as_json()
    params = data.get("params", {})
    # Only max_tokens is read from the request; temperature and top_p are fixed.
    # If max_tokens is absent, the SamplingParams default is used instead of
    # raising a KeyError.
    sampling_kwargs = {"temperature": 0.8, "top_p": 0.95}
    if "max_tokens" in params:
        sampling_kwargs["max_tokens"] = params["max_tokens"]
    sampling_params = SamplingParams(**sampling_kwargs)
    result = predictor.generate(data["inputs"], sampling_params)
    # Keep the first completion for each prompt.
    result_json = [output.outputs[0].text for output in result]
    return Output().add(result_json)

Overwriting ./src/model.py
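
The handler above fixes `temperature` and `top_p` and reads only `max_tokens` from the request. If you wanted callers to override the other two as well, one possible approach is sketched below. The helper `resolve_sampling_kwargs` is hypothetical, and the set of accepted keys is an assumption. The resulting dictionary could be passed on as `SamplingParams(**kwargs)`.

```python
# Defaults taken from the values hard-coded in model.py.
DEFAULTS = {"temperature": 0.8, "top_p": 0.95}
# Assumption: these are the only request keys we choose to forward.
ALLOWED_KEYS = ("temperature", "top_p", "max_tokens")


def resolve_sampling_kwargs(params: dict) -> dict:
    """Overlay recognised request params on the defaults; ignore other keys."""
    kwargs = dict(DEFAULTS)
    for key in ALLOWED_KEYS:
        if key in params:
            kwargs[key] = params[key]
    return kwargs


print(resolve_sampling_kwargs({"max_tokens": 170, "do_sample": True, "temperature": 1.0}))
```

Filtering to an explicit list of keys means that Hugging Face style options such as `do_sample` or `min_new_tokens` are dropped rather than passed to a constructor that may not accept them.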


#### Create and initialize the variables needed to create the endpoint; we use boto3 for this

In [26]:
import sagemaker
from sagemaker import image_uris
import boto3
import os
import time
import json
from pathlib import Path

sage_session = sagemaker.Session()
model_bucket = sage_session.default_bucket()  # bucket to house artifacts
s3_code_prefix = (
    #"hf-large-model-llama-7b-0625/code"  # folder within bucket where code artifact will go
    "chatglm2-vllm/code"
)

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

**Image URI of the DJL container used here**

In [27]:
# Note: modify the image URI to match your region.
# inference_image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.21.0-deepspeed0.8.0-cu117"
inference_image_uri = "763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.22.1-deepspeed0.9.2-cu118"
print(f"Image going to be used is ---- > {inference_image_uri}")

Image going to be used is ---- > 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.22.1-deepspeed0.9.2-cu118
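
Since the image URI depends on the region, a small helper can build it instead of editing the string by hand. This is a sketch: `djl_image_uri` is a hypothetical helper, and it assumes the registry account ID and tag shown above are valid for your region, which is not guaranteed for every AWS region. Check the AWS Deep Learning Containers image list if the pull fails.

```python
def djl_image_uri(
    region: str,
    tag: str = "0.22.1-deepspeed0.9.2-cu118",
    account: str = "763104351884",
) -> str:
    """Build the ECR URI of the DJL inference image for a given region."""
    # Assumption: the same account ID and tag are published in `region`.
    return f"{account}.dkr.ecr.{region}.amazonaws.com/djl-inference:{tag}"


print(djl_image_uri("us-west-2"))
```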


**Create the Tarball and then upload to S3 location**

In [28]:
!rm -f model.tar.gz
!tar czvf model.tar.gz src

src/
src/model.py
src/serving.properties
src/requirements.txt
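
If you prefer not to shell out, the same archive can be produced with Python's standard `tarfile` module. This is a minimal sketch; `make_tarball` is a hypothetical helper that is meant to mirror `tar czvf model.tar.gz src` by keeping the top-level folder name inside the archive.

```python
import tarfile
from pathlib import Path


def make_tarball(src_dir: str, out_path: str = "model.tar.gz") -> list:
    """Pack src_dir into a gzip tarball and return the archived member names."""
    src = Path(src_dir)
    with tarfile.open(out_path, "w:gz") as tar:
        # arcname keeps only the folder name (e.g. "src") as the archive root.
        tar.add(str(src), arcname=src.name)
    # Re-open the archive to list what was actually written.
    with tarfile.open(out_path, "r:gz") as tar:
        return tar.getnames()


# Usage (assumes ./src exists, as created earlier in this notebook):
# print(make_tarball("src"))
```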


In [29]:
s3_code_artifact = sage_session.upload_data("model.tar.gz", model_bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {s3_code_artifact}")

S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-west-2-687912291502/chatglm2-vllm/code/model.tar.gz


In [30]:
print(f"S3 Model Bucket is -- > {model_bucket}")

S3 Model Bucket is -- > sagemaker-us-west-2-687912291502


### To create the endpoint, the steps are:

1. Create the Model using the Image container and the Model Tarball uploaded earlier
2. Create the endpoint config using the following key parameters

    a) Instance Type is ml.g5.24xlarge 
    
    b) ContainerStartupHealthCheckTimeoutInSeconds is 15*60 to ensure health check starts after the model is ready
    
3. Create the endpoint using the endpoint config created
    

One of the key parameters here is **TENSOR_PARALLEL_DEGREE**, which tells the DeepSpeed library how many GPUs to partition the model across, for example 8 on an 8-GPU instance. This is a tunable and configurable parameter.

This parameter also controls the number of workers per model that are started when DJLServing runs. For example, if we have an 8-GPU machine and create 8 partitions, we will have 1 worker per model to serve requests. For further reading on DeepSpeed, see https://www.deepspeed.ai/tutorials/inference-tutorial/#initializing-for-inference.
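
The relationship between GPUs, partitions, and workers can be written as a one-line calculation. The sketch below follows the example in the text (8 GPUs with 8 partitions gives 1 worker). The function name is hypothetical, and the requirement that the degree divides the GPU count evenly is an assumption made to keep the arithmetic clean.

```python
def workers_per_model(num_gpus: int, tensor_parallel_degree: int) -> int:
    """Number of model workers that fit when each one spans several GPUs."""
    if tensor_parallel_degree < 1:
        raise ValueError("tensor_parallel_degree must be at least 1")
    # Assumption: the degree must divide the GPU count evenly.
    if num_gpus % tensor_parallel_degree != 0:
        raise ValueError("tensor_parallel_degree must divide num_gpus")
    return num_gpus // tensor_parallel_degree


print(workers_per_model(8, 8))  # 8 GPUs, 8 partitions
print(workers_per_model(8, 4))  # 8 GPUs, 4 partitions
```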

In [31]:
from sagemaker.utils import name_from_base

model_name = name_from_base("llama-7b-finetuned")
print(model_name)

role = sagemaker.get_execution_role()

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": s3_code_artifact,
    },
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

llama-7b-finetuned-2023-08-22-12-30-50-678
Created Model: arn:aws:sagemaker:us-west-2:687912291502:model/llama-7b-finetuned-2023-08-22-12-30-50-678


`VolumeSizeInGB` has been left commented out. Use this value for instance types that support EBS volume mounts. The instance we are using comes with preconfigured storage and does not support additional volume mounts.
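
One way to avoid commenting the field in and out is to build the production variant with a small helper that adds `VolumeSizeInGB` only when a size is given. This is a sketch; `build_production_variant` is a hypothetical helper, and the other field values simply mirror the ones used in the cell below.

```python
def build_production_variant(model_name, instance_type, volume_size_gb=None,
                             timeout_seconds=15 * 60):
    """Return a ProductionVariants entry, adding VolumeSizeInGB only if set."""
    variant = {
        "VariantName": "variant1",
        "ModelName": model_name,
        "InstanceType": instance_type,
        "InitialInstanceCount": 1,
        "ModelDataDownloadTimeoutInSeconds": timeout_seconds,
        "ContainerStartupHealthCheckTimeoutInSeconds": timeout_seconds,
    }
    # Only pass a volume size for instance types that support EBS volumes.
    if volume_size_gb is not None:
        variant["VolumeSizeInGB"] = volume_size_gb
    return variant


print(build_production_variant("my-model", "ml.g5.24xlarge"))
```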

In [32]:
endpoint_config_name = f"{model_name}-config-072614"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.24xlarge",
            "InitialInstanceCount": 1,
            #"VolumeSizeInGB" : 300,
            "ModelDataDownloadTimeoutInSeconds": 15*60,
            "ContainerStartupHealthCheckTimeoutInSeconds": 15*60,
        },
    ],
)
endpoint_config_response

{'EndpointConfigArn': 'arn:aws:sagemaker:us-west-2:687912291502:endpoint-config/llama-7b-finetuned-2023-08-22-12-30-50-678-config-072614',
 'ResponseMetadata': {'RequestId': '7650d0e4-f02b-4e49-86d5-3f2398bc51cd',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '7650d0e4-f02b-4e49-86d5-3f2398bc51cd',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '137',
   'date': 'Tue, 22 Aug 2023 12:30:54 GMT'},
  'RetryAttempts': 0}}

In [33]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=f"{endpoint_name}", EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

Created Endpoint: arn:aws:sagemaker:us-west-2:687912291502:endpoint/llama-7b-finetuned-2023-08-22-12-30-50-678-endpoint


#### Wait for the endpoint to be created.

### This step can take ~15 min or longer, so please be patient

In [34]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Failed
Arn: arn:aws:sagemaker:us-west-2:687912291502:endpoint/llama-7b-finetuned-2023-08-22-12-30-50-678-endpoint
Status: Failed
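
The polling loop above stops silently when the status becomes `Failed`. A variant that adds a timeout and surfaces the `FailureReason` field returned by `describe_endpoint` is sketched below. `wait_for_endpoint` is a hypothetical helper, and the default poll interval and timeout are assumptions, not recommended values.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=60,
                      timeout_seconds=3600, sleep=time.sleep):
    """Poll describe_endpoint until the status leaves 'Creating'."""
    waited = 0
    while True:
        resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        if status != "Creating":
            break
        if waited >= timeout_seconds:
            raise TimeoutError(f"{endpoint_name} still Creating after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds
    if status == "Failed":
        # describe_endpoint includes FailureReason when creation fails.
        raise RuntimeError(f"Endpoint failed: {resp.get('FailureReason', 'unknown')}")
    return resp


# Usage with the client and name defined earlier in this notebook:
# resp = wait_for_endpoint(sm_client, endpoint_name)
```

Raising on failure keeps later cells from invoking an endpoint that never came up, and the failure reason together with the endpoint's CloudWatch logs is usually the first place to look.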


#### Use Boto3 to invoke the endpoint.

This is a generative model, so we pass in text as a prompt and the model completes it and returns the results.
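
The handler in `model.py` reads the prompts from the `inputs` key and the generation settings from `params`. A small helper that builds a request body in that shape might look like the sketch below. `build_request_body` is a hypothetical name, and wrapping a single prompt in a list is a design choice made here, not something the handler requires.

```python
import json


def build_request_body(prompts, max_tokens):
    """Serialize prompts and max_tokens in the shape model.py expects."""
    # Accept a single prompt string as a convenience.
    if isinstance(prompts, str):
        prompts = [prompts]
    if not prompts:
        raise ValueError("at least one prompt is required")
    return json.dumps(
        {"inputs": list(prompts), "params": {"max_tokens": int(max_tokens)}}
    )


body = build_request_body("The house is wonderful. I", max_tokens=32)
print(body)
```

The returned string can be passed as `Body` to `invoke_endpoint` together with `ContentType="application/json"`.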


In [21]:
%%time
import json
import boto3

smr_client = boto3.client("sagemaker-runtime")

prompt1 = "The house is wonderful. I"
prompt2="##Eva:How often do you travel?## Malcolm:I like David Bowie too. I don’t travel much any more, but I used to.## Eva:That's cool! I recently took a road trip with my friend. We had so much fun and it opened up so many possibilities for us. What kind of places did you like to explore?## Malcolm:I love history and culture, so those are my favorite.## Eva: He was born in Birmingham, England and raised in Los Angeles, California.Eva: Yes, Sir. Queen is one of the most influential bands of all time.## Malcolm:It is. They are one of my favorite rock groups. What about you?## Eva:I'm more into classic rock, especially David Bowie. Who is your favorite artist?## Malcolm:Marylin Manson. You?## Eva:My favorite artist is David Bowie.## Eva:How often do you travel?## Malcolm:I like David Bowie too. I don’t travel much any more, but I used to.## Eva:That's cool! I recently took a road trip with my friend. We had so much fun and it opened up so many possibilities for us. What kind of places did you like to explore?## Malcolm:I love history and culture, so those are my favorite.## Eva: He was born in Birmingham, England and raised in Los Angeles, California.##Eva: Yes, Sir. Queen is one of the most influential bands of all time.## Malcolm:It is. They are one of my favorite rock groups. What about you?## Eva:I'm more into classic rock, especially David Bowie. Who is your favorite artist?## Malcolm:Marylin Manson. You?## Eva:My favorite artist is David Bowie.## Eva:How often do you travel?## Malcolm:I like David Bowie too. I don’t travel much any more, but I used to.## Eva:That's cool! I recently took a road trip with my friend. We had so much fun and it opened up so many possibilities for us. What kind of places did you like to explore?## Malcolm:I love history and culture, so those are my favorite.## Eva: He was born in Birmingham, England and raised in Los Angeles, California.##Eva: Yes, Sir. Queen is one of the most influential bands of all time.## Malcolm:It is. 
They are one of my favorite rock groups. What about you?## Eva:I'm more into classic rock, especially David Bowie. Who is your favorite artist?## Malcolm:Marylin Manson. You?## Eva:My favorite artist is David Bowie.## Eva:How often do you travel?## Malcolm:I like David Bowie too. I don’t travel much any more, but I used to.## Eva:That's cool! I recently took a road trip with my friend. We had so much fun and it opened up so many possibilities for us. What kind of places did you like to explore?## Malcolm:I love history and culture, so those are my favorite.## Eva: He was born in Birmingham, England and raised in Los Angeles, California.#### Malcolm:Oh. What are you wearing right now, pet?## Eva:"

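# Note: model.py only reads "max_tokens" from these params; the other keys
# are sent along but ignored by the handler as written.
parameters = {
  "early_stopping": True,
  "max_tokens": 170,
  "min_new_tokens": 128,
  "do_sample": True,
  "temperature": 1.0,
}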

response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            #"input": prompt1,
            #"inputs": prompt2,
            #"inputs": [prompt2, prompt2],
            #"inputs": [prompt2, prompt2, prompt2, prompt2],
            #"input": [prompt1, prompt1, prompt1, prompt1, prompt1, prompt1, prompt1, prompt1],
            #"inputs": [prompt2, prompt2, prompt2, prompt2, prompt2, prompt2, prompt2, prompt2],
            "inputs": [prompt2, prompt2, prompt2, prompt2, prompt2, prompt2, prompt2, prompt2, prompt2, prompt2, prompt2, prompt2, prompt2, prompt2, prompt2, prompt2],
            #"input": [prompt1, prompt2],
            #"input": [prompt1, prompt2, prompt1, prompt2, prompt1, prompt2, prompt1, prompt2],
            "params": parameters,
        }
    ),
    ContentType="application/json",
)

response_model['Body'].read().decode('utf8')

CPU times: user 12.7 ms, sys: 0 ns, total: 12.7 ms
Wall time: 12 s


'[\n  "I’m wearing a red bra and a black thong.## Malcolm:Why don’t you show me?## Eva:It’s kind of cold outside!## Malcolm:Yes, it is. But you still look gorgeous.## Eva:Thank you, Sir!## Malcolm:I’m going to take my coat off.## Eva:Sounds like a good idea.## Malcolm:Are you wearing a bra?## Eva:Yes, Sir.## Malcolm:You can take it off for me.## Eva:I’m going to start by unbuttoning my top.## Malcolm:I like it.## Eva:I’m going to unbutton my top now.## Malcolm:Thank you, pet.",\n  "A blue and white dress. I’m a little chilly.## Malcolm:I’m sorry. I’ll turn the heat up.## Eva:Thank you, Sir.## Malcolm:What are you doing this evening?## Eva:I have to finish up my homework.## Malcolm:That’s too bad. You have too much school work?## Eva: Yes, I’m a graduate student.## Malcolm:What are you studying?## Eva:I’m studying philosophy.## Malcolm:That is interesting. I never thought of you as a philosophy major. What do you think about the world?## Eva:I think the world needs more love.## Malcolm:
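
The response body shown above is a JSON array with one generated string per prompt. Rather than printing the raw string, it can be decoded into a Python list, as in the sketch below. `parse_generations` is a hypothetical helper, and the shape check reflects the output format seen in this notebook rather than a documented contract.

```python
import json


def parse_generations(body: bytes) -> list:
    """Decode the endpoint response into a list of generated strings."""
    texts = json.loads(body.decode("utf8"))
    # Assumption: the handler returns a JSON array, one entry per prompt.
    if not isinstance(texts, list):
        raise ValueError("expected a JSON array of generated strings")
    return [str(t) for t in texts]


# Usage with the response from invoke_endpoint:
# generations = parse_generations(response_model["Body"].read())
print(parse_generations(b'["first completion", "second completion"]'))
```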

In [37]:
parameters = {
  "early_stopping": True,
  "max_new_tokens": 128,
  "min_new_tokens": 128,
  "do_sample": True,
  "temperature": 1.0,
}

In [38]:
max_t = parameters["max_new_tokens"]
max_t

128