
### Serve large models on SageMaker with the DeepSpeed container: hosting Bloom-176B

In this notebook, we explore how to host a large language model on SageMaker using the latest container that packages DeepSpeed and DJL. DJL provides the serving framework, while DeepSpeed is the key sharding library we use to enable hosting of large models. We use DJLServing as the model serving solution in this example. DJLServing is a high-performance, programming-language-agnostic universal model serving solution powered by the Deep Java Library (DJL). To learn more about DJL and DJLServing, refer to our recent blog post (https://aws.amazon.com/blogs/machine-learning/deploy-large-models-on-amazon-sagemaker-using-djlserving-and-deepspeed-model-parallel-inference/).

Language models have recently exploded in both size and popularity. In 2018, BERT-large entered the scene and, with its 340M parameters and novel transformer architecture, set the standard on NLP task accuracy. Within just a few years, state-of-the-art NLP model size has grown by more than 500x, with models such as OpenAI’s 175 billion parameter GPT-3 and the similarly sized open source Bloom 176B raising the bar on NLP accuracy. This increase in the number of parameters is driven by the simple and empirically demonstrated positive relationship between model size and accuracy: more is better. With easy access from model zoos such as Hugging Face and improved accuracy in NLP tasks such as classification and text generation, practitioners are increasingly reaching for these large models. However, deploying them can be a challenge because of their size.

Model parallelism can help deploy large models that would normally be too large for a single GPU. With model parallelism, we partition and distribute a model across multiple GPUs. Each GPU holds a different part of the model, resolving the memory capacity issue for the largest deep learning models with billions of parameters. This notebook uses tensor parallelism techniques which allow GPUs to work simultaneously on the same layer of a model and achieve low latency inference relative to a pipeline parallel solution.
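To make the idea of several GPUs working on the same layer concrete, here is a minimal sketch of column-wise tensor parallelism for a single linear layer. It is a toy illustration only: plain Python lists stand in for GPU tensors, the function names are hypothetical, and this is not how DeepSpeed implements sharding internally.

```python
# Toy illustration of tensor (column) parallelism for one linear layer.
# Plain Python lists stand in for GPU tensors; no real devices are involved.

def matmul(x, w):
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, num_shards):
    """Split the weight matrix column-wise into num_shards equal shards."""
    cols = len(w[0])
    assert cols % num_shards == 0, "columns must divide evenly across shards"
    step = cols // num_shards
    return [[row[s * step:(s + 1) * step] for row in w] for s in range(num_shards)]


def tensor_parallel_matmul(x, w, num_shards):
    """Each 'device' computes its slice of the output; the slices are concatenated."""
    partial_outputs = [matmul(x, shard) for shard in split_columns(w, num_shards)]
    return [value for part in partial_outputs for value in part]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

full = matmul(x, w)                                   # single-device result
sharded = tensor_parallel_matmul(x, w, num_shards=2)  # two simulated devices
print(full, sharded)
```

Each simulated device holds only half of the weight columns, yet the concatenated result matches the single-device computation, which is why every GPU can stay busy on the same layer at the same time.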

SageMaker now offers a DeepSpeed container, so you can use SageMaker's managed serving capabilities for large models and offload the undifferentiated heavy lifting of running the serving infrastructure.

In this notebook, we deploy the open source, quantized Bloom 176B model across the GPUs of an ml.p4d.24xlarge instance. DeepSpeed is used for tensor parallel inference, while DJLServing handles inference requests and the distributed workers. For further reading on DeepSpeed, refer to https://arxiv.org/pdf/2207.00032.pdf
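A rough back-of-the-envelope check suggests why a quantized checkpoint is attractive on this instance. The sketch below counts model weights only (activations, the attention cache, and framework overhead are ignored) and assumes the 8 NVIDIA A100 GPUs with 40 GB of memory each that an ml.p4d.24xlarge provides; the helper name is hypothetical.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 176e9   # Bloom 176B
NUM_GPUS = 8         # GPUs on ml.p4d.24xlarge
GPU_MEMORY_GB = 40   # assumption: A100 40 GB per GPU

fp16_gb = weight_memory_gb(NUM_PARAMS, 2)   # 2 bytes per FP16 parameter
int8_gb = weight_memory_gb(NUM_PARAMS, 1)   # 1 byte per INT8 parameter
total_gpu_gb = NUM_GPUS * GPU_MEMORY_GB

print(f"FP16 weights: {fp16_gb:.0f} GB, INT8 weights: {int8_gb:.0f} GB")
print(f"Total GPU memory: {total_gpu_gb} GB")
print(f"INT8 weights per GPU with 8-way sharding: {int8_gb / NUM_GPUS:.0f} GB")
```

Under these assumptions the FP16 weights alone exceed the combined GPU memory, while the INT8 weights leave headroom on each GPU once they are sharded 8 ways.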


## License agreement
View the license information at https://huggingface.co/spaces/bigscience/license for this model, including the use-based restrictions in Section 5, before using the model.


In [None]:
# Install the boto3 library to create the model and run inference workloads
%pip install -Uqq boto3 awscli

## Optional Section to Download Model from Hugging Face Hub

Use this section if you want to download the model directly from the Hugging Face Hub and store it in your own S3 bucket. In that case, set the variable `install_model_locally` to True.

**However, this notebook currently uses the model stored in an AWS public S3 location for ease of use, so you can skip this step.**

Downloading the model and then uploading it to S3 can take several minutes because the model is extremely large.

In [None]:
install_model_locally=False

In [None]:
if install_model_locally:
    %pip install huggingface-hub -Uqq 

In [None]:
if install_model_locally:
    
    from huggingface_hub import snapshot_download
    from pathlib import Path
    
    import sagemaker
    bucket = sagemaker.session.Session().default_bucket()
    
    # - This will download the model into the ./model directory, relative to wherever the Jupyter notebook is running
    local_model_path = Path("./model")
    local_model_path.mkdir(exist_ok=True)
    model_name = "microsoft/bloom-deepspeed-inference-int8"
    commit_hash = "aa00a6626f6484a2eef68e06d1e089e4e32aa571"
    
    # - Use snapshot_download to fetch the model, since it is stored in the repository using Git LFS
    snapshot_download(repo_id=model_name, revision=commit_hash, cache_dir=local_model_path)
    
    # - Upload to S3 using AWS CLI 
    s3_model_prefix = "hf-large-model-djl-ds/model"  # folder where model checkpoint will go
    model_snapshot_path = list(local_model_path.glob("**/snapshots/*"))[0]
    
    !aws s3 cp --recursive {model_snapshot_path} s3://{bucket}/{s3_model_prefix}

## Create SageMaker compatible Model artifact and Upload Model to S3

SageMaker needs the model to be in a tarball format. In this notebook, the tarball contains only the inference code, which shortens the endpoint creation time. The inference code then downloads the model weights into the container with the AWS CLI, which transfers files in parallel.

The tarball is in the following format

```
code
├── model.py
├── requirements.txt
└── serving.properties
```

The actual model is stored in an S3 location and is downloaded directly into the container when the endpoint is created. For that, we pass in two environment variables

1.  "MODEL_S3_BUCKET": the S3 bucket that holds the model artifacts
2.  "MODEL_S3_PREFIX": the S3 prefix under which the model artifact files are located

The model.py file uses these to read in the actual model artifacts.

- `model.py` is the key file that handles all serving requests. It is also responsible for loading the model from S3.
- `requirements.txt` lists the awscli library, which is installed when the container starts up.
- `serving.properties` is the configuration file whose settings can be used to customize the model server and model.py at run time.
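As an illustration of the tarball layout described above, the following sketch packages a directory with Python's standard `tarfile` module and lists the archive members. The directory and file contents are placeholders created in a temporary folder, and `package_code` is a hypothetical helper, not part of the SageMaker SDK; the notebook itself uses the `tar` command later on.

```python
import tarfile
import tempfile
from pathlib import Path


def package_code(code_dir, tar_path):
    """Create a gzip-compressed tarball of code_dir and return its member names."""
    code_dir = Path(code_dir)
    with tarfile.open(str(tar_path), "w:gz") as tar:
        # arcname keeps only the directory name, not the absolute path
        tar.add(str(code_dir), arcname=code_dir.name)
    with tarfile.open(str(tar_path), "r:gz") as tar:
        return sorted(tar.getnames())


# Demo with placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "code"
    demo_dir.mkdir()
    for name in ("model.py", "requirements.txt", "serving.properties"):
        (demo_dir / name).write_text("# placeholder\n")
    members = package_code(demo_dir, Path(tmp) / "model.tar.gz")

print(members)
```

Listing the members before uploading is a cheap way to confirm that all three files ended up under a single top-level folder.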


In [1]:
import sagemaker
from sagemaker import image_uris
import boto3
import os
import time
import json
from pathlib import Path

#### Create and initialize the variables needed to create the endpoint; we use boto3 for this

In [256]:
role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts
model_bucket = "sagemaker-sample-files"
s3_code_prefix = "hf-large-model-djl-ds/code"  # folder within bucket where code artifact will go
s3_model_prefix = "models/bloom-176B/raw_model_microsoft/" # "bloom-176B/raw_model_microsoft/"  # folder where model checkpoint will go
# --  s3://sagemaker-sample-files/models/bloom-176B/raw_model_microsoft/ -

region = sess.boto_region_name  # public attribute; avoids relying on the private _region_name
account_id = sess.account_id()

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

In [257]:
print(region)

us-east-1


**Image URI of the DJL Container to be used**

In [258]:
#inference_image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/djl-ds:latest"
inference_image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.19.0-deepspeed0.7.3-cu113"
print(f"Image going to be used is ---- > {inference_image_uri}")


Image going to be used is ---- > 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.19.0-deepspeed0.7.3-cu113


**Create the Tarball and then upload to S3 location**

In [259]:
!mkdir -p code_bloom176

In [260]:
%%writefile code_bloom176/model.py
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from djl_python import Input, Output
import os
import deepspeed
import torch
import torch.distributed as dist
import sys
import subprocess
import time
from glob import glob


tokenizer = None
model = None

def check_config():
    local_rank = os.getenv('LOCAL_RANK')
    curr_pid = os.getpid()
    print(f'__Number CUDA Devices:{torch.cuda.device_count()}:::local_rank={local_rank}::curr_pid={curr_pid}::')
    
    if not local_rank:
        return False

    return True

def get_model():

    if not check_config():
        raise Exception("DJL:DeepSpeed configurations are not default. This code does not support non default configurations") 
        
    deepspeed.init_distributed("nccl")
    
    tensor_parallel = int(os.getenv('TENSOR_PARALLEL_DEGREE', '1'))
    local_rank = int(os.getenv('LOCAL_RANK', '0'))
    model_dir = "/tmp/model"
    bucket = os.environ.get("MODEL_S3_BUCKET")
    key_prefix = os.environ.get("MODEL_S3_PREFIX")
    curr_pid = os.getpid()
    print(f'tensor_parallel={tensor_parallel}::curr_pid={curr_pid}::')
    print(f"Current Rank: {local_rank}:: pid={curr_pid}::Going to load the model weights on rank 0: bucket={bucket}::key={key_prefix}::")
    
    if local_rank == 0: 
            
        if f"{model_dir}/DONE" not in glob(f"{model_dir}/*"):
            print(f"Starting Model downloading files pid={curr_pid}::")
            print(f"Starting Model pid={curr_pid}::")
            
            try:
                # -- 
                proc_run = subprocess.run(["aws", "s3", "cp", "--recursive", f"s3://{bucket}/{key_prefix}", model_dir], capture_output=True, text=True, check=True) # capture_output needs Python 3.7+; check=True raises before the DONE marker is written
                print(f"Model download finished: pid={curr_pid}::")
                
                # write file when download complete. Could use dist.barrier() but this makes it easier to check if model is downloaded in case of retry 
                with open(f"{model_dir}/DONE", "w") as f:
                    f.write("download_complete")

                print(f"Model download checkmark written out pid={curr_pid}::return_code:{proc_run.returncode}:stderr:-- >:{proc_run.stderr}")
                proc_run.check_returncode() # to throw the error in case there was one
                
            except subprocess.CalledProcessError as e:
                print ( "Model download failed: Error:\nreturn code: ", e.returncode, "\nOutput: ", e.stderr )
                raise # FAIL FAST 
                
    dist.barrier() # - to ensure all processes load fine
        
    print(f"Load the Model  pid={curr_pid}::")
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    
    # has to be FP16 as Int8 model loading not yet supported
    with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
        model = AutoModelForCausalLM.from_config(
            AutoConfig.from_pretrained(model_dir), torch_dtype=torch.bfloat16
        )
        
    model = model.eval()
    
    model = deepspeed.init_inference(
        model,
        mp_size=tensor_parallel,
        dtype=torch.int8,
        base_dir = model_dir,
        checkpoint=os.path.join(model_dir, "ds_inference_config.json"),
        replace_method='auto',
        replace_with_kernel_inject=True
    )

    model = model.module
    dist.barrier()
    return model, tokenizer



def handle(inputs: Input):
    print("Model In handle")
    global model, tokenizer
    if not model:
        model, tokenizer = get_model()

    if inputs.is_empty():
        print("Model warm up: inputs were empty:called by Model server to warmup")
        # Model server makes an empty call to warmup the model on startup
        return None
    
    inputs = inputs.get_as_json()
    
    #print(inputs)
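    data = inputs["input"]
    generate_kwargs = inputs.get("gen_kwargs", {})
    # bool("False") is True for any non-empty string, so parse the flag explicitly
    padding = str(inputs.get("padding", True)).strip().lower() in ("true", "1", "yes")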
    
    start = time.time() 
    input_tokens = tokenizer(data, return_tensors="pt", padding=padding)
    print(len(input_tokens))
    
    for t in input_tokens:
        if torch.is_tensor(input_tokens[t]):
            input_tokens[t] = input_tokens[t].to(torch.cuda.current_device())
    #print(f"Model:Tokenizer:ENCODE:time:{(time.time() - start) * 1000} ms")
    
    start = time.time()    
    with torch.no_grad():
        generate_kwargs.pop('padding', None)
        output = model.generate(**input_tokens, **generate_kwargs)
    #print(output)
    print(f"Model:Prediction:time:{(time.time() - start) * 1000} ms")
    
    start = time.time()
    output = tokenizer.batch_decode(output, skip_special_tokens=True)
    #print(f"Model:Tokenizer:DECODE:time:{(time.time() - start) * 1000} ms")
    
    torch.cuda.empty_cache() # free up cached GPU memory
    return Output().add_as_json(output)


Overwriting code_bloom176/model.py


In [261]:
gen_kwargs = {
                "min_length": 5,
                "max_new_tokens": 100,
                "temperature": 0.8,
                "num_beams": 5,
                "no_repeat_ngram_size": 2,
                "padding":'True',
                "padding" : True,
}

gen_kwargs.pop('padding', None)
    
print(bool('True'), bool('true'), type(bool('True')))
gen_kwargs

True True <class 'bool'>


{'min_length': 5,
 'max_new_tokens': 100,
 'temperature': 0.8,
 'num_beams': 5,
 'no_repeat_ngram_size': 2}
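The cell above shows that `bool()` returns True for any non-empty string, so a request that sends `"padding": "False"` would still be treated as True. One possible way to handle string flags is sketched below; `parse_bool` is a hypothetical helper and the set of accepted spellings is an assumption.

```python
def parse_bool(value, default=True):
    """Interpret common string spellings of a boolean flag."""
    if value is None:
        return default
    if isinstance(value, bool):
        return value
    # Assumption: only these spellings count as True; everything else is False
    return str(value).strip().lower() in ("true", "1", "yes")


print(bool("False"), parse_bool("False"))
print(parse_bool("True"), parse_bool(True), parse_bool(None))
```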

#### with Pipeline Batch size 

#### serving.properties has an engine parameter that tells the DJL model server to use the DeepSpeed engine to load the model

In [262]:
%%writefile code_bloom176/serving.properties
engine=DeepSpeed
batch_size=50
batchSize=50
BATCH_SIZE=50
max_batch_delay=10
maxBatchDelay=10
MAX_BATCH_DELAY=10

Overwriting code_bloom176/serving.properties


#### requirements.txt tells the container which additional libraries to install. We need these to download the model into the container

In [263]:
%%writefile code_bloom176/requirements.txt
boto3
awscli

Overwriting code_bloom176/requirements.txt


In [264]:
!rm -f model.tar.gz
!tar czvf model.tar.gz code_bloom176

code_bloom176/
code_bloom176/serving.properties
code_bloom176/model.py
code_bloom176/requirements.txt


In [265]:
s3_code_artifact = sess.upload_data("model.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {s3_code_artifact}")

S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-east-1-622343165275/hf-large-model-djl-ds/code/model.tar.gz


In [266]:
print(f"S3 Model Prefix where the model files are -- > {s3_model_prefix}")
print(f"S3 Model Bucket is -- > {model_bucket}")

S3 Model Prefix where the model files are -- > models/bloom-176B/raw_model_microsoft/
S3 Model Bucket is -- > sagemaker-sample-files


### Optional: specify a VpcConfig when creating the endpoint

For more details, refer to https://docs.aws.amazon.com/sagemaker/latest/dg/host-vpc.html

The below is just an example of how to extract the information about security groups and subnets needed for the configuration

In [221]:
!aws ec2 describe-security-groups --filters "Name=vpc-id,Values=<use vpcId>" | python3 -c "import sys, json; print(json.load(sys.stdin)['SecurityGroups'])"


In [None]:
# - provide networking configs if needed.
security_group_ids = []  # add the security group id's
subnets = []  # add the subnet id for this vpc
privateVpcConfig = {"SecurityGroupIds": security_group_ids, "Subnets": subnets}
print(privateVpcConfig)

### To create the endpoint, the steps are:

1. Create the model using the container image and the model tarball uploaded earlier
2. Create the endpoint config using the following key parameters

    a) Instance type is ml.p4d.24xlarge
    
    b) ModelDataDownloadTimeoutInSeconds is 2400, which gives the model enough time to download from S3 successfully
    
    c) ContainerStartupHealthCheckTimeoutInSeconds is 2400, which gives the container enough time to load the model before it must pass the health check
    
3. Create the endpoint using the endpoint config created
    

One of the key parameters here is **TENSOR_PARALLEL_DEGREE**, which tells the DeepSpeed library to partition the model across 8 GPUs. This is a tunable and configurable parameter. For the purpose of this notebook, we leave it at the **default settings**.

This parameter also controls the number of workers per model that are started when DJL Serving runs. For example, if we have an 8-GPU machine and create 8 partitions, then we have 1 worker per model to serve the requests. For further reading on DeepSpeed, follow the link https://www.deepspeed.ai/tutorials/inference-tutorial/#initializing-for-inference.
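The relationship between GPUs, partitions, and workers can be sketched as simple integer arithmetic. The text above only states the 8 GPU, 8 partition case; extending it to other degrees by integer division is an assumption made for illustration, and `workers_per_model` is a hypothetical helper.

```python
def workers_per_model(num_gpus, tensor_parallel_degree):
    """Number of model workers when each worker spans tensor_parallel_degree GPUs."""
    if tensor_parallel_degree < 1 or tensor_parallel_degree > num_gpus:
        raise ValueError("tensor_parallel_degree must be between 1 and num_gpus")
    # Assumption: workers = GPUs divided by partitions, rounded down
    return num_gpus // tensor_parallel_degree


print(workers_per_model(8, 8))  # the configuration used in this notebook
print(workers_per_model(8, 4))  # hypothetical alternative for a smaller model
```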

In [267]:
from sagemaker.utils import name_from_base

model_name = name_from_base(f"bloom-djl-ds")
print(model_name)

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": s3_code_artifact,
        "Environment": {
            "MODEL_S3_BUCKET": model_bucket,
            "MODEL_S3_PREFIX": s3_model_prefix,
            "TENSOR_PARALLEL_DEGREE": "8",
        },
    },
    # Uncomment if providing networking configs
    # VpcConfig=privateVpcConfig
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

bloom-djl-ds-2022-12-05-17-11-54-151
Created Model: arn:aws:sagemaker:us-east-1:622343165275:model/bloom-djl-ds-2022-12-05-17-11-54-151


VolumeSizeInGB has been left commented out. Use this value for instance types that support EBS volume mounts. The instance type we are using here comes with preconfigured storage and does not support additional volume mounts.
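One way to make the optional key explicit, instead of commenting it in and out, is to build the production variant with a small helper. This is a sketch only: `build_production_variant` is a hypothetical function, and the dictionary keys mirror the ones used in this notebook's `create_endpoint_config` call.

```python
def build_production_variant(model_name, instance_type, timeout_s=2400, volume_size_gb=None):
    """Return a ProductionVariants entry; VolumeSizeInGB is added only when requested."""
    variant = {
        "VariantName": "variant1",
        "ModelName": model_name,
        "InstanceType": instance_type,
        "InitialInstanceCount": 1,
        "ModelDataDownloadTimeoutInSeconds": timeout_s,
        "ContainerStartupHealthCheckTimeoutInSeconds": timeout_s,
    }
    if volume_size_gb is not None:
        # Only valid for instance types that support EBS volume mounts
        variant["VolumeSizeInGB"] = volume_size_gb
    return variant


p4d_variant = build_production_variant("demo-model", "ml.p4d.24xlarge")
print("VolumeSizeInGB" in p4d_variant)
```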

In [268]:
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.p4d.24xlarge",
            "InitialInstanceCount": 1,
            #"VolumeSizeInGB" : 400,
            "ModelDataDownloadTimeoutInSeconds": 2400,
            "ContainerStartupHealthCheckTimeoutInSeconds": 2400,
        },
    ],
)
endpoint_config_response

{'EndpointConfigArn': 'arn:aws:sagemaker:us-east-1:622343165275:endpoint-config/bloom-djl-ds-2022-12-05-17-11-54-151-config',
 'ResponseMetadata': {'RequestId': '8e35a899-486c-4c6a-96d1-fe97b3b881bc',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '8e35a899-486c-4c6a-96d1-fe97b3b881bc',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '124',
   'date': 'Mon, 05 Dec 2022 17:11:58 GMT'},
  'RetryAttempts': 0}}

In [269]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=f"{endpoint_name}", EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

Created Endpoint: arn:aws:sagemaker:us-east-1:622343165275:endpoint/bloom-djl-ds-2022-12-05-17-11-54-151-endpoint


#### Wait for the endpoint to be created. This can take a couple of minutes or longer. Please be patient
While that happens, let us look at the critical parts of the helper files we use to load the model
1. The code snippet in model.py that shows the model download mechanism
2. requirements.txt to see the libraries that need to be installed
3. serving.properties to see the environment-related properties

In [47]:
# This is the code snippet which is responsible to load the model from S3
! sed -n '40,60p' code_bloom176/model.py

    print(f"Current Rank: {local_rank}:: pid={curr_pid}::Going to load the model weights on rank 0: bucket={bucket}::key={key_prefix}::")
    
    if local_rank == 0: 
            
        if f"{model_dir}/DONE" not in glob(f"{model_dir}/*"):
            print(f"Starting Model downloading files pid={curr_pid}::")
            print(f"Starting Model pid={curr_pid}::")
            
            try:
                # -- 
                proc_run = subprocess.run(["aws", "s3", "cp", "--recursive", f"s3://{bucket}/{key_prefix}", model_dir], capture_output=True, text=True, check=True) # capture_output needs Python 3.7+; check=True raises before the DONE marker is written
                print(f"Model download finished: pid={curr_pid}::")
                
                # write file when download complete. Could use dist.barrier() but this makes it easier to check if model is downloaded in case of retry 
                with open(f"{model_dir}/DONE", "w") as f:
                    f.write("download_complete")

                print(f"Model download checkmark written out p

In [None]:
# This is the code snippet which loads the libraries into the container needed for run
! sed -n '1,3p' code_bloom176/requirements.txt

In [None]:
# This is the code snippet which shows the environment variables being used to customize runtime
! sed -n '1,3p' code_bloom176/serving.properties

### This step can take ~ 15 min or longer so please be patient

In [None]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
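The polling loop above runs until the status changes, however long that takes. A variant with an upper bound on the waiting time is sketched below. `wait_for_endpoint` is a hypothetical helper that takes the describe function as an argument so it can be exercised without AWS; `fake_describe` is a stand-in for `sm_client.describe_endpoint`.

```python
import time


def wait_for_endpoint(describe, endpoint_name, timeout_s=3600, poll_s=60, sleep=time.sleep):
    """Poll until the endpoint leaves 'Creating', or raise after timeout_s seconds."""
    waited = 0
    while True:
        status = describe(EndpointName=endpoint_name)["EndpointStatus"]
        if status != "Creating":
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"{endpoint_name} still Creating after {waited} s")
        sleep(poll_s)
        waited += poll_s


# Demo with a stand-in that reports 'Creating' twice, then 'InService'
_responses = iter(["Creating", "Creating", "InService"])


def fake_describe(EndpointName):
    return {"EndpointStatus": next(_responses)}


final_status = wait_for_endpoint(fake_describe, "demo-endpoint", sleep=lambda seconds: None)
print(final_status)
```

In the notebook this could be called as `wait_for_endpoint(sm_client.describe_endpoint, endpoint_name)`. Boto3 also provides a built-in waiter, `sm_client.get_waiter("endpoint_in_service")`, which may be a simpler choice when custom logging is not needed.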


#### Use Boto3 to invoke the endpoint

This is a generative model, so we pass in text (the 'input' field in the JSON) as a prompt, and the model completes the sentence and returns the results. More details on these parameters can be found at https://huggingface.co/docs/api-inference/detailed_parameters#text-generation-task. Some quick explanations are below
1. temperature -- > The temperature of the sampling operation. 1 means regular sampling, 0 means always take the highest score, and values approaching 100 get closer to uniform probability
2. max_new_tokens -- > The number of new tokens to be generated. More tokens increase the prediction time
3. num_beams -- > Beam search keeps track of the n most likely word sequences at each step
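To illustrate the effect of temperature described above, the sketch below applies a temperature-scaled softmax to a made-up set of three logits. It is a generic illustration of the technique, not the code path used inside `model.generate`, and the logit values are arbitrary.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; 0 corresponds to greedy selection")
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # arbitrary example scores for three candidate tokens
for t in (0.1, 0.8, 1.0, 100.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Low temperatures concentrate almost all probability on the highest-scoring token, while very high temperatures flatten the distribution toward uniform.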


In [273]:
%%time
smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "input": "Amazon.com is the best ",
            "gen_kwargs": {
                "min_length": 5,
                "max_new_tokens": 100,
                "temperature": 0.8,
                "num_beams": 5,
                "no_repeat_ngram_size": 2,
            },
        }
    ),
    ContentType="application/json",
)["Body"].read().decode("utf8")

CPU times: user 21.1 ms, sys: 0 ns, total: 21.1 ms
Wall time: 11.7 s


'[\n  "Amazon.com is the best  online shopping site in the world. It has a wide range of products. You can buy anything you want from the site. The site is very easy to use and you can search for the product you are looking for. There are a lot of options to choose from and the prices are very reasonable. I have been using this site for a long time now and I am very happy with the service. They have a very good customer service and they are always ready to help you with any problem you have"\n]'
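The endpoint returns a JSON-encoded list of strings in which each string repeats the prompt followed by the generated continuation. A small sketch for decoding the body and separating the continuation from the prompt follows; `parse_generations` is a hypothetical helper, and `sample_body` is a shortened copy of the response shown above.

```python
import json


def parse_generations(body_text, prompts):
    """Decode the JSON list and strip each prompt from the start of its generation."""
    generations = json.loads(body_text)
    completions = []
    for prompt, text in zip(prompts, generations):
        completions.append(text[len(prompt):] if text.startswith(prompt) else text)
    return completions


# Shortened version of the response body shown above
sample_body = '[\n  "Amazon.com is the best  online shopping site in the world."\n]'
completions = parse_generations(sample_body, ["Amazon.com is the best "])
print(completions)
```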

#### Batch tokens

In [None]:
%%time
smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "input": ["Amazon.com is the best ", "DJL is the best serving model", "deepspeed works best",],
            "gen_kwargs": {
                "min_length": 5,
                "max_new_tokens": 100,
                "temperature": 0.8,
                "num_beams": 5,
                "no_repeat_ngram_size": 2,
                "padding":'True',
                #"padding" : True,
            },
        }
    ),
    ContentType="application/json",
)["Body"].read().decode("utf8")

#### Time test
`max_new_tokens` is the main driver of inference time, since this is a text generation model:

- `max_new_tokens` = 100 leads to ~ 11 sec
- `max_new_tokens` = 50 leads to ~ 5 sec
- `max_new_tokens` = 10 leads to ~ 1 sec

So the response time is fairly linear in the number of generated tokens.
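The roughly linear relationship can be quantified with a least-squares line through the origin fitted to the three approximate timings listed above. This is only a sketch: the timings are the rounded values from this notebook, and the through-the-origin model (no fixed per-request overhead) is a simplifying assumption.

```python
# (max_new_tokens, approximate latency in seconds) from the observations above
measurements = [(100, 11.0), (50, 5.0), (10, 1.0)]


def seconds_per_token(points):
    """Least-squares slope of latency = slope * tokens (line through the origin)."""
    numerator = sum(tokens * seconds for tokens, seconds in points)
    denominator = sum(tokens * tokens for tokens, _ in points)
    return numerator / denominator


slope = seconds_per_token(measurements)


def estimate_latency(max_new_tokens):
    """Rough latency estimate in seconds for a given max_new_tokens."""
    return slope * max_new_tokens


print(f"{slope:.3f} s per token, ~{estimate_latency(200):.1f} s for 200 tokens")
```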

now with batch size of 100

In [None]:
from datetime import timedelta
import time
from timeit import default_timer as timer
import numpy as np
results = []
for i in range(0, 10):
    start = time.time()
    smr_client.invoke_endpoint(
        EndpointName=endpoint_name,
        Body=json.dumps(
            {
                "input": ["Amazon.com is the best ", "DJL is the best serving model", "deepspeed works best",], #"Amazon.com is the best ",
                "gen_kwargs": {
                    "min_length": 5,
                    "max_new_tokens": 50, # 100
                    "temperature": 0.8, # 10 # --  0 
                    "num_beams": 5,
                    "no_repeat_ngram_size": 2,
                    'padding': 'True',
                },
            }
        ),
        ContentType="application/json",
    )["Body"].read().decode("utf8")
    results.append((time.time() - start) * 1000)
    
print("\nPredictions for model latency: \n")
print("\nP95: " + str(np.percentile(results, 95)) + " ms\n")
print("P90: " + str(np.percentile(results, 90)) + " ms\n")
print("Average: " + str(np.average(results)) + " ms\n")

In [132]:
import numpy as np
print("\nPredictions for model latency: \n")
print("\nP95: " + str(np.percentile(results, 95)) + " ms\n")
print("P90: " + str(np.percentile(results, 90)) + " ms\n")
print("Average: " + str(np.average(results)) + " ms\n")


Predictions for model latency: 


P95: 6168.650722503662 ms

P90: 6163.633060455322 ms

Average: 6126.430702209473 ms



#### Batch Tests max tokens is 50

1. Run with batch size 10: send 5, 10, 15, 20, and 25 input prompts, and store the results in the file 10_batch (values such as the number of prompts and the P95 latency in ms over 10 runs)
2. Run with batch size 20
3. Run with batch size 30
4. Run with batch size 40

The total number of tokens produced during one invocation is estimated as follows

1. max_new_tokens x number of inputs (prompt_size) gives the total number of tokens
2. Divide by the total time in seconds (or the P95 latency)
3. This gives the throughput in tokens per second of wall time
4. Open question: for tokens per second, should this also be divided by the batch size?
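The throughput estimate above can be written as a small function. In this sketch, `tokens_per_second` is a hypothetical helper, and the example plugs in the P95 latency measured earlier in this notebook for 3 prompts with `max_new_tokens` set to 50. The result is an upper bound, because generation may stop before reaching `max_new_tokens`.

```python
def tokens_per_second(num_prompts, max_new_tokens, latency_ms):
    """Upper-bound throughput: requested new tokens divided by wall time in seconds."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    total_tokens = num_prompts * max_new_tokens
    return total_tokens * 1000.0 / latency_ms


# 3 prompts, 50 new tokens each, P95 latency from the run above
example = tokens_per_second(3, 50, 6168.650722503662)
print(f"{example:.1f} tokens per second")
```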

In [None]:
input_prompts=[
    "Amazon.com is the best ", 
    "DJL is the best serving model", 
    "deepspeed works best for Large models",
    "Large models in machine learning",
    "performance bench mark tests for machine learning"
]*40
print(len(input_prompts))
input_prompts[0]

In [None]:
!mkdir -p temp-data/llm-perf

In [163]:
for prompt_size in range(1,25,5):
    print(prompt_size)

1
6
11
16
21


In [275]:
input_prompts[:prompt_size]

['Amazon.com is the best ',
 'DJL is the best serving model',
 'deepspeed works best for Large models',
 'Large models in machine learning',
 'performance bench mark tests for machine learning',
 'Amazon.com is the best ',
 'DJL is the best serving model',
 'deepspeed works best for Large models',
 'Large models in machine learning',
 'performance bench mark tests for machine learning',
 'Amazon.com is the best ',
 'DJL is the best serving model',
 'deepspeed works best for Large models',
 'Large models in machine learning',
 'performance bench mark tests for machine learning',
 'Amazon.com is the best ',
 'DJL is the best serving model',
 'deepspeed works best for Large models',
 'Large models in machine learning',
 'performance bench mark tests for machine learning']

In [None]:
%%time

prompt_size = 10
batch_size = 50
max_new_tokens=50

smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "input": input_prompts[:prompt_size],
            "gen_kwargs": {
                "min_length": 5,
                "max_new_tokens": 50,
                "temperature": 0.8,
                "num_beams": 5,
                "no_repeat_ngram_size": 2,
                "padding":'True',
                #"padding" : True,
            },
        }
    ),
    ContentType="application/json",
)["Body"].read().decode("utf8")

#### Test the prompt for question answering

In [None]:
%%time


prompt = """
Please parse the product into words by white space. First word should be the main concept and the main concept should be as short as possible. main concept should be consistent with the category. category: product: smoked turkey breast output: turkey breast; smoked. Explanation: turkey is the main concept. smoked is a way of cooking category: food->pantry->pasta->spaghetti pasta product: whole wheat thin spaghetti box output: spaghetti; whole wheat, thin, box. Explanation: spaghetti is the main concept. whole wheat is a nutrition fact. thin is a shape. box is a packaging method. category: food->fresh produce->fresh vegetables->root vegetables->potatoes->sweet potatoes product: sweet potatoes output: sweet potatoes;. Explanation: sweet potatoes is the main concept. There is no attribute. category: food->frozen food->frozen desserts->ice creams->ice creams product: premium mint chocolate chip frozen dessert output: ice cream; premium, mint, chocolate chip, frozen, dessert. Explanation: ice cream is the main concept. ice cream is not in the product name but is implied by its category. premium is a quality. mint is a flavor. chocolate chip is a flavor. frozen is a state. category: home & garden->home->bed product: 15" cotton voile california king bed skirt in ivory output: bed skirt; 15", cotton, voile, california, king, ivory. Explanation: bed skirt is the main concept. 15" is a size. cotton is a material. king is a size. ivory is a color. category: pet->pet food product: organic avocado output: avocado; organic. 
Explanation: avocado is the main concept. avocado is not a pet food. the category information is incorrect. organic is a quality. category: {c} product: {p} output:
"""
input_prompt = prompt.format(c="home improvement->bathroom->bathroom hardware->towel bar", p="stainless steel towel bar")


smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "input": input_prompt,
            "gen_kwargs": {
                "min_length": 5,
                "max_new_tokens": 50,
                "temperature": 0, #0.8,
                "num_beams": 5,
                "no_repeat_ngram_size": 2,
                "padding":'True',
                'do_sample': False,
                #"padding" : True,
            },
        }
    ),
    ContentType="application/json",
)["Body"].read().decode("utf8")

In [233]:
%%time

prompt_size = 5
batch_size = 10
max_new_tokens=50

from datetime import timedelta
import time
from timeit import default_timer as timer
import numpy as np

prompt_size_result = []

for prompt_size in range(1, 42, 5):  # number of prompts sent in each request
    results = []  # latencies (ms) of successful invocations only
    error_count = 0
    total_runs = 0
    for i in range(0, 3):
        start = time.time()
        try:
            total_runs = total_runs + 1
            smr_client.invoke_endpoint(
                EndpointName=endpoint_name,
                Body=json.dumps(
                    {
                        "input": input_prompts[:prompt_size],
                        "gen_kwargs": {
                            "min_length": 5,
                            "max_new_tokens": 50, # 100
                            "temperature": 0.8, # 10 # --  0 
                            "num_beams": 5,
                            "no_repeat_ngram_size": 2,
                            'padding': 'True',
                        },
                    }
                ),
                ContentType="application/json",
            )["Body"].read().decode("utf8")
            results.append((time.time() - start) * 1000)
            time.sleep(2)
        except Exception:
            error_count = error_count + 1

    total_tokens = prompt_size * max_new_tokens
    if results:
        p_95_response_ms = np.percentile(results, 95)
        tokens_per_sec = total_tokens * 1000 / p_95_response_ms  # latency is in ms
    else:
        # every invocation failed, so there is no latency to report
        p_95_response_ms = float("nan")
        tokens_per_sec = 0.0
    p_95_ms = str(p_95_response_ms) + " ms"

    prompt_size_result.append(f"Total_invocation={total_runs}:NoOfInputs={prompt_size}:P-95={p_95_ms}:total_tokens={total_tokens}:tokens_per_sec={tokens_per_sec}:error_count={error_count}:\n")

with open(f"./temp-data/llm-perf/{batch_size}_batch.txt", "w+") as f:
    f.writelines(prompt_size_result)


CPU times: user 40.7 ms, sys: 0 ns, total: 40.7 ms
Wall time: 437 ms
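The benchmark cell writes one colon-separated line of `key=value` fields per prompt size. A sketch for reading those lines back into dictionaries is shown below; `parse_result_line` is a hypothetical helper, and the numbers in `sample_line` are synthetic placeholders, not measured results.

```python
def parse_result_line(line):
    """Split a 'key=value:key=value:' line from the benchmark file into a dict."""
    record = {}
    for field in line.strip().split(":"):
        if not field:
            continue  # skip the empty piece after the trailing colon
        key, _, value = field.partition("=")
        record[key] = value
    return record


# Synthetic placeholder values in the same format the benchmark cell writes
sample_line = "Total_invocation=3:NoOfInputs=6:P-95=1000.0 ms:total_tokens=300:tokens_per_sec=300.0:error_count=0:\n"
record = parse_result_line(sample_line)
print(record["NoOfInputs"], record["P-95"], float(record["tokens_per_sec"]))
```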


## Conclusion
In this notebook, we demonstrated how to use SageMaker large model inference containers to host a large language model, BLOOM-176B. We used DeepSpeed’s model parallel techniques with multiple GPUs on a single SageMaker machine learning instance. For more details about Amazon SageMaker and its large model inference capabilities, refer to the following:

* Amazon SageMaker now supports deploying large models through configurable volume size and timeout quotas (https://aws.amazon.com/about-aws/whats-new/2022/09/amazon-sagemaker-deploying-large-models-volume-size-timeout-quotas/)
* Real-time inference – Amazon SageMaker (https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html)


## Clean Up

In [277]:
# - Delete the endpoint
sm_client.delete_endpoint(EndpointName=endpoint_name)

{'ResponseMetadata': {'RequestId': 'b2f89942-1183-4615-bce3-d0ffea0608c1',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'b2f89942-1183-4615-bce3-d0ffea0608c1',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Tue, 06 Dec 2022 03:38:23 GMT'},
  'RetryAttempts': 0}}

In [278]:
# - Even if the endpoint failed to create, we still want to delete the endpoint config and the model
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)

{'ResponseMetadata': {'RequestId': 'd46ac0b7-42c2-4908-8c4a-edcef18d663e',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'd46ac0b7-42c2-4908-8c4a-edcef18d663e',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Tue, 06 Dec 2022 03:38:24 GMT'},
  'RetryAttempts': 0}}

#### Optionally delete the model checkpoint from S3

In [None]:
!aws s3 rm --recursive s3://{bucket}/{s3_model_prefix}

In [None]:
s3_client = boto3.client("s3")

In [None]:
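# "Contents" is absent when no objects match, so fall back to an empty list
prefix = s3_model_prefix.rstrip("/") + "/"  # avoid a double slash when the prefix already ends with "/"
len(s3_client.list_objects(Bucket=bucket, Prefix=prefix).get("Contents", []))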