
### Serve large models on SageMaker with DeepSpeed Container

In this notebook, we explore how to host a large language model on SageMaker using the latest DeepSpeed container.

Language models have recently exploded in both size and popularity. In 2018, BERT-large entered the scene and, with its 340M parameters and novel transformer architecture, set the standard on NLP task accuracy. Within just a few years, state-of-the-art NLP model size has grown by more than 500x, with models such as OpenAI’s 175 billion parameter GPT-3 and the similarly sized open source Bloom 176B raising the bar on NLP accuracy. This increase in the number of parameters is driven by the simple and empirically demonstrated positive relationship between model size and accuracy: more is better. With easy access from model zoos such as Hugging Face and improved accuracy in NLP tasks such as classification and text generation, practitioners are increasingly reaching for these large models. However, deploying them can be a challenge because of their size.

Model parallelism can help deploy large models that would normally be too large for a single GPU. With model parallelism, we partition and distribute a model across multiple GPUs. Each GPU holds a different part of the model, resolving the memory capacity issue for the largest deep learning models with billions of parameters. This notebook uses tensor parallelism techniques which allow GPUs to work simultaneously on the same layer of a model and achieve low latency inference relative to a pipeline parallel solution.
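
To make the idea of tensor parallelism concrete, here is a toy sketch in pure Python. It is only an illustration under simplifying assumptions: the two "devices" are plain Python lists, the helper names (`matmul`, `split_columns`, `concat_columns`) are made up for this example, and no DeepSpeed or GPU API is involved. Each device holds half of the columns of one layer's weight matrix, computes its partial output independently, and the partial outputs are concatenated.

```python
# Toy illustration of tensor (intra-layer) parallelism in pure Python.
# Two hypothetical "devices" each hold half of the columns of one layer's
# weight matrix; no real GPUs or DeepSpeed APIs are involved.

def matmul(x, w):
    """Multiply a (rows x k) matrix by a (k x cols) matrix given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def split_columns(w, parts):
    """Split matrix w column-wise into `parts` equally sized shards."""
    cols = len(w[0])
    assert cols % parts == 0, "column count must divide evenly across devices"
    step = cols // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]

def concat_columns(shards):
    """Concatenate per-device partial outputs back into one matrix."""
    return [sum((shard[r] for shard in shards), []) for r in range(len(shards[0]))]

x = [[1.0, 2.0], [3.0, 4.0]]                        # activations: 2 x 2
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]    # layer weights: 2 x 4

shards = split_columns(w, parts=2)                  # each "device" stores a 2 x 2 shard
partials = [matmul(x, shard) for shard in shards]   # computed independently per device
y_parallel = concat_columns(partials)               # gather the partial outputs

y_reference = matmul(x, w)                          # same layer on a single device
print(y_parallel == y_reference)
```

Because every device works on the same layer at the same time, the per-layer latency shrinks with the number of devices, at the cost of a communication step to gather the partial results.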

SageMaker has rolled out a DeepSpeed container that lets users take advantage of SageMaker's managed serving capabilities, which handle the undifferentiated heavy lifting of hosting.

In this notebook, we deploy the open source OPT-30B model (`facebook/opt-30b`) across the GPUs of a single instance; the endpoint configuration below uses ml.g5.48xlarge, with ml.p4d.24xlarge as an alternative. DeepSpeed is used for tensor parallelism inference while DJLServing handles inference requests and the distributed workers.


In [None]:
# Install boto3 library to create model and run inference workloads
%pip install -Uqq boto3 awscli

## Setup Docker Image
This section should be removed after DLC release

In [None]:
%%sh
docker pull deepjavalibrary/djl-serving:0.19.0-deepspeed

In [None]:
%%bash 
# Look up the local image ID of the DJL Serving DeepSpeed image pulled above
image_id=$(docker images | grep 0.19.0-deepspeed | tr -s " " | cut -d " " -f 3)
echo "image_id=${image_id}"


repo_name='djl-ds' # same as algorithm name
echo "repo_name=$repo_name"

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-east-1 if none defined)
region=$(aws configure get region)
region=${region:-us-east-1}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${repo_name}:latest"
echo "Full_name=$fullname"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${repo_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${repo_name}" > /dev/null
fi

# Log Docker in to ECR (`aws ecr get-login` was removed in AWS CLI v2)
aws ecr get-login-password --region "${region}" | docker login --username AWS --password-stdin "${account}.dkr.ecr.${region}.amazonaws.com"

# Tag the image pulled above with the full ECR name and then push it to ECR.
image_id=$(docker images | grep 0.19.0-deepspeed | tr -s " " | cut -d " " -f 3)

echo "image_id=${image_id}"

#docker build -q -t ${algorithm_name} .

docker tag $image_id ${fullname}
#docker tag ${algorithm_name} ${fullname}

docker push ${fullname}



## Optional Section to Download Model from Hugging Face Hub

Use this section if you are interested in downloading the model directly from the Hugging Face Hub and storing it in your own S3 bucket.

**However, this notebook currently uses the model stored in an AWS public S3 location for ease of use, so you can skip this step.**

The step below, which downloads the model and then uploads it to S3, can take several minutes because the model is extremely large.

In [None]:
%pip install huggingface-hub -Uqq

In [1]:
from huggingface_hub import snapshot_download
from pathlib import Path

In [2]:
# - This will download the model into the ./model_30b directory, relative to wherever the notebook is running
local_model_path = Path("./model_30b")
local_model_path.mkdir(exist_ok=True)

#commit_hash = "aa00a6626f6484a2eef68e06d1e089e4e32aa571" 

In [3]:
# - Use snapshot_download to download the model, since the model is stored in the repository using LFS
from huggingface_hub import snapshot_download

snapshot_download(repo_id="facebook/opt-30b", ignore_patterns=["*.msgpack", "*.h5"], cache_dir=local_model_path)


'model_30b/models--facebook--opt-30b/snapshots/463007d7da4e87fe962909a027811a8c0b32ede8'

#### Upload to S3 using the awscli 

In [22]:
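# Note: `bucket` (used in the next cell) is defined in the setup cell later in this notebook via sess.default_bucket(); run that cell first
s3_model_prefix = "hf-large-model-djl-opt30b/model" # folder where model checkpoint will go
model_snapshot_path = list(local_model_path.glob("**/snapshots/*"))[0]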

In [11]:
!aws s3 cp --recursive {model_snapshot_path} s3://{bucket}/{s3_model_prefix}

upload: model_30b/models--facebook--opt-30b/snapshots/463007d7da4e87fe962909a027811a8c0b32ede8/LICENSE.md to s3://sagemaker-us-east-1-622343165275/hf-large-model-djl-opt30b/model/LICENSE.md
upload: model_30b/models--facebook--opt-30b/snapshots/463007d7da4e87fe962909a027811a8c0b32ede8/.gitattributes to s3://sagemaker-us-east-1-622343165275/hf-large-model-djl-opt30b/model/.gitattributes
upload: model_30b/models--facebook--opt-30b/snapshots/463007d7da4e87fe962909a027811a8c0b32ede8/flax_model.msgpack.index.json to s3://sagemaker-us-east-1-622343165275/hf-large-model-djl-opt30b/model/flax_model.msgpack.index.json
upload: model_30b/models--facebook--opt-30b/snapshots/463007d7da4e87fe962909a027811a8c0b32ede8/README.md to s3://sagemaker-us-east-1-622343165275/hf-large-model-djl-opt30b/model/README.md
upload: model_30b/models--facebook--opt-30b/snapshots/463007d7da4e87fe962909a027811a8c0b32ede8/config.json to s3://sagemaker-us-east-1-622343165275/hf-large-model-djl-opt30b/model/config.json
uplo

In [15]:
print(f's3://{bucket}/{s3_model_prefix}')
!aws s3 ls s3://{bucket}/{s3_model_prefix}/

s3://sagemaker-us-east-1-622343165275/hf-large-model-djl-opt30b/model
2022-10-28 06:41:53       1173 .gitattributes
2022-10-28 06:41:53      11117 LICENSE.md
2022-10-28 06:41:53      10014 README.md
2022-10-28 06:41:53        651 config.json
2022-10-28 06:41:53      68114 flax_model.msgpack.index.json
2022-10-28 06:41:53     456318 merges.txt
2022-10-28 06:41:53 9794466629 pytorch_model-00001-of-00007.bin
2022-10-28 06:41:53 9866534401 pytorch_model-00002-of-00007.bin
2022-10-28 06:41:53 9866534465 pytorch_model-00003-of-00007.bin
2022-10-28 06:41:54 9866534465 pytorch_model-00004-of-00007.bin
2022-10-28 06:41:53 9866534465 pytorch_model-00005-of-00007.bin
2022-10-28 06:47:46 9866534465 pytorch_model-00006-of-00007.bin
2022-10-28 06:47:54  822185815 pytorch_model-00007-of-00007.bin
2022-10-28 06:47:58      62801 pytorch_model.bin.index.json
2022-10-28 06:47:58        221 special_tokens_map.json
2022-10-28 06:47:59      81232 tf_model.h5.index.json
2022-10-28 06:47:59        685 tokeniz

## Create SageMaker compatible Model artifact and Upload Model to S3

SageMaker needs the model artifact to be in a tarball format. In this notebook the tarball contains only the inference code, which shortens the endpoint creation time. In the inference code we kick off a multi-threaded download of the model weights into the container using the awscli.

The tarball is in the following format

```
code_opt30
├── model.py
├── requirements.txt
└── serving.properties
```

The actual model is stored in S3 and is downloaded into the container directly when the endpoint is created. For that we pass in two environment variables:

1.  "MODEL_S3_BUCKET" : the S3 bucket where the model artifacts are stored
2.  "MODEL_S3_PREFIX" : the S3 prefix under which the model artifact files are located

These are used in the model.py file to read in the actual model artifacts.

- `model.py` is the key file that handles all serving requests. It is also responsible for loading the model from S3.
- `requirements.txt` lists the libraries (here, the awscli) that are installed when the container starts up.
- `serving.properties` is the configuration file whose properties can be used to customize model.py at run time.
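
As a rough sketch of how the two environment variables could be consumed, the helper below turns them into an S3 URI. The function name `resolve_model_location` and the example values are assumptions made for illustration; the actual `model.py` may read the variables differently.

```python
import os

def resolve_model_location(env=None):
    """Return (bucket, prefix, s3_uri) built from the two environment variables."""
    env = os.environ if env is None else env
    bucket_name = env["MODEL_S3_BUCKET"]
    key_prefix = env["MODEL_S3_PREFIX"].strip("/")   # tolerate stray slashes
    return bucket_name, key_prefix, f"s3://{bucket_name}/{key_prefix}/"

# Example with made-up values; inside the container, os.environ would be used instead.
example_env = {
    "MODEL_S3_BUCKET": "my-example-bucket",
    "MODEL_S3_PREFIX": "hf-large-model-djl-opt30b/model",
}
bucket_name, key_prefix, s3_uri = resolve_model_location(example_env)
print(s3_uri)
```

Reading the location from the environment keeps the tarball free of bucket-specific values, so the same code artifact can serve checkpoints stored in different buckets.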


In [32]:
import sagemaker
from sagemaker import image_uris
import boto3
import os
import time
import json
from pathlib import Path

#### Create and initialize the variables required to create the endpoint (we use boto3 for this)

In [33]:
role = sagemaker.get_execution_role()      # execution role for the endpoint
sess = sagemaker.session.Session()         # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()             # bucket to house artifacts
s3_code_prefix = "hf-large-model-djl-opt30b/code"       # folder within bucket where code artifact will go
s3_model_prefix = 'hf-large-model-djl-opt30b/model' # folder where model checkpoint will go

region = sess.boto_region_name
account_id = sess.account_id()

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

In [34]:
inference_image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/djl-ds:latest"
print(f"Image going to be used is ---- > {inference_image_uri}")
# 622343165275.dkr.ecr.us-east-1.amazonaws.com/djl-ds:latest

Image going to be used is ---- > 622343165275.dkr.ecr.us-east-1.amazonaws.com/djl-ds:latest


**Create the Tarball and then upload to S3 location**

In [35]:
!rm -f model.tar.gz
!tar czvf model.tar.gz code_opt30 

code_opt30/
code_opt30/serving.properties
code_opt30/model.py
code_opt30/.ipynb_checkpoints/
code_opt30/.ipynb_checkpoints/model-checkpoint.py
code_opt30/requirements.txt


In [36]:
s3_code_artifact = sess.upload_data("model.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {s3_code_artifact}")

S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-east-1-622343165275/hf-large-model-djl-opt30b/code/model.tar.gz


In [37]:

print(f"S3 Model Prefix where the model files are -- > {s3_model_prefix}")
print(f"S3 Model Bucket is -- > {bucket}")

S3 Model Prefix where the model files are -- > hf-large-model-djl-opt30b/model
S3 Model Bucket is -- > sagemaker-us-east-1-622343165275


### Optional: specify a VpcConfig when creating the endpoint

For more details, refer to https://docs.aws.amazon.com/sagemaker/latest/dg/host-vpc.html

The cells below are just an example of how to extract the Security Group and Subnet information needed for the configuration.

In [None]:
!aws ec2 describe-security-groups --filters "Name=vpc-id,Values=<your-vpc-id>" | python3 -c "import sys, json; print(json.load(sys.stdin)['SecurityGroups'])"


In [None]:
# - provide networking configs if needed. 
security_group_ids = [] # add the security group IDs
subnets = [] # add the subnet IDs for this VPC
privateVpcConfig={
    'SecurityGroupIds': security_group_ids, 
    'Subnets': subnets
}
print(privateVpcConfig)


### To create the endpoint, the steps are:

1. Create the Model using the container image and the model tarball uploaded earlier
2. Create the endpoint config using the following key parameters

    a) Instance Type is ml.g5.48xlarge in the cell below (ml.p4d.24xlarge is left as a commented-out alternative)
    
    b) ModelDataDownloadTimeoutInSeconds is 2400, which is needed to ensure the model downloads from S3 successfully
    
    c) ContainerStartupHealthCheckTimeoutInSeconds is 2400, which gives the container time to download and load the model before it has to pass the health check
    
3. Create the endpoint using the endpoint config created
    

In [38]:
from sagemaker.utils import name_from_base
model_name = name_from_base(f"opt30-djl-ds")
print(model_name)

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": s3_code_artifact,
        "Environment": {
            "MODEL_S3_BUCKET": bucket,
            "MODEL_S3_PREFIX": s3_model_prefix,
            "TENSOR_PARALLEL_DEGREE": "8"
        },
    },
    # Uncomment if providing networking configs
    #VpcConfig=privateVpcConfig
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

opt30-djl-ds-2022-10-28-19-41-23-882
Created Model: arn:aws:sagemaker:us-east-1:622343165275:model/opt30-djl-ds-2022-10-28-19-41-23-882


In [39]:
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.48xlarge", #"ml.p4d.24xlarge",
            "InitialInstanceCount": 1,
            # "VolumeSizeInGB" : 200
            'ModelDataDownloadTimeoutInSeconds': 2400,
            'ContainerStartupHealthCheckTimeoutInSeconds': 2400
        },
    ],
     
)
endpoint_config_response

{'EndpointConfigArn': 'arn:aws:sagemaker:us-east-1:622343165275:endpoint-config/opt30-djl-ds-2022-10-28-19-41-23-882-config',
 'ResponseMetadata': {'RequestId': '66b53937-07d4-4ec1-b019-d8147b36bbd4',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '66b53937-07d4-4ec1-b019-d8147b36bbd4',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '124',
   'date': 'Fri, 28 Oct 2022 19:41:36 GMT'},
  'RetryAttempts': 0}}

In [40]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=f"{endpoint_name}", EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

Created Endpoint: arn:aws:sagemaker:us-east-1:622343165275:endpoint/opt30-djl-ds-2022-10-28-19-41-23-882-endpoint


#### Wait for the endpoint to be created. This can take a couple of minutes or longer, so please be patient
While that happens, let us look at the critical parts of the helper files we use to load the model:
1. The code snippet in `model.py` that implements the model downloading mechanism
2. `requirements.txt`, to see the libraries that need to be installed
3. `serving.properties`, to see the serving-related properties

In [27]:
# This is the code snippet that is responsible for downloading the model from S3
! sed -n '26,34p' code_opt30/model.py

    if local_rank == 0: 
            
        if f"{model_dir}/DONE" not in glob(f"{model_dir}/*"):
            print("Starting Model downloading files")
            # download_files(s3_paths, model_dir)
            subprocess.run(["aws", "s3", "cp", "--recursive", f"s3://{bucket}/{key_prefix}", model_dir])
            print("Model downloading finished")
            
            # write file when download complete. Could use dist.barrier() but this makes it easier to check if model is downloaded in case of retry 


In [28]:
# This shows the libraries that are installed into the container at startup
! sed -n '1,3p' code_opt30/requirements.txt

boto3
awscli

In [29]:
# This shows the serving properties used to customize the runtime
! sed -n '1,3p' code_opt30/serving.properties

engine=Rubikon

In [None]:
import time
resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp['EndpointStatus']
print("Status: " + status)

while status=='Creating':
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp['EndpointStatus']
    print("Status: " + status)

print("Arn: " + resp['EndpointArn'])
print("Status: " + status)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
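
The loop above exits as soon as the status is no longer `Creating`, whatever the final status is. A slightly more defensive variant, sketched below under the assumption that only `InService` counts as success, raises on any other terminal status and gives up after a timeout. The helper `wait_for_endpoint` and the `FakeSageMakerClient` class are hypothetical names for this example; the fake client only mimics the `EndpointStatus` field of `describe_endpoint` so the sketch runs without AWS access.

```python
import time

def wait_for_endpoint(describe_fn, endpoint_name, poll_seconds=60,
                      timeout_seconds=3600, sleep_fn=time.sleep):
    """Poll until the endpoint is InService; raise on failure or timeout."""
    waited = 0
    while True:
        resp = describe_fn(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        if status == "InService":
            return resp
        if status != "Creating":
            reason = resp.get("FailureReason", "no reason given")
            raise RuntimeError(f"Endpoint {endpoint_name} ended in status {status}: {reason}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint {endpoint_name} still Creating after {waited}s")
        sleep_fn(poll_seconds)
        waited += poll_seconds

class FakeSageMakerClient:
    """Test double that returns a scripted sequence of endpoint statuses."""
    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_endpoint(self, EndpointName):
        return {"EndpointName": EndpointName, "EndpointStatus": next(self._statuses)}

fake = FakeSageMakerClient(["Creating", "Creating", "InService"])
result = wait_for_endpoint(fake.describe_endpoint, "demo-endpoint", sleep_fn=lambda s: None)
print(result["EndpointStatus"])
```

With the real client from this notebook, the call would be `wait_for_endpoint(sm_client.describe_endpoint, endpoint_name)`. Boto3 also offers a built-in `endpoint_in_service` waiter, obtained with `sm_client.get_waiter("endpoint_in_service")`, which covers the same need.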


#### Use Boto3 to invoke the endpoint

This is a generative model, so we pass in text as a prompt and the model completes the sentence and returns the result.


In [43]:
%%time
smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps({"input": "Amazon.com is the best ", "gen_kwargs": {"min_length":5, "max_new_tokens": 100, "temperature": 0.8, "num_beams": 5, "no_repeat_ngram_size": 2} }),
    ContentType='application/json'
)["Body"].read().decode("utf8")

ValidationError: An error occurred (ValidationError) when calling the InvokeEndpoint operation: Endpoint opt30-djl-ds-2022-10-28-19-41-23-882-endpoint of account 622343165275 not found.
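
The request body used in the invocation cell is a JSON object with an `input` prompt and a `gen_kwargs` dictionary. As a minimal sketch, the two helpers below (hypothetical names, not part of boto3 or of this notebook's `model.py`) build that body and decode a response. The fake response only mimics the `Body` stream returned by `invoke_endpoint`, and its content is made up, so the example runs without an endpoint.

```python
import io
import json

def build_request_body(prompt, **gen_kwargs):
    """Serialize a prompt and generation arguments in the format used above."""
    return json.dumps({"input": prompt, "gen_kwargs": gen_kwargs})

def read_response_text(response):
    """Decode the streaming body of an invoke_endpoint-style response."""
    return response["Body"].read().decode("utf8")

body = build_request_body("Amazon.com is the best ", max_new_tokens=100, temperature=0.8)

# Stand-in for a real response; the payload content is invented for illustration.
fake_response = {"Body": io.BytesIO(b'"generated text would appear here"')}
text = read_response_text(fake_response)
print(body)
print(text)
```

Keeping the payload construction in one place makes it easier to vary generation arguments between calls without retyping the JSON structure.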

## Clean Up

In [None]:
# - Delete the endpoint
sm_client.delete_endpoint(EndpointName=endpoint_name)

In [None]:
# - Even if the endpoint failed, we still want to delete the endpoint config and the model
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)

#### Optionally delete the model checkpoint from S3

In [None]:
!aws s3 rm --recursive s3://{bucket}/{s3_model_prefix}

In [None]:
s3_client = boto3.client("s3")

In [None]:
# Count the remaining objects under the prefix; "Contents" is absent when nothing is left
len(s3_client.list_objects(Bucket=bucket, Prefix=f"{s3_model_prefix}/").get("Contents", []))