
### Serve large models on SageMaker with DeepSpeed Container

In this notebook, we explore how to host a large language model on SageMaker using the latest DeepSpeed-enabled container.

Language models have recently exploded in both size and popularity. In 2018, BERT-large entered the scene and, with its 340M parameters and novel transformer architecture, set the standard on NLP task accuracy. Within just a few years, state-of-the-art NLP model size has grown by more than 500x, with models such as OpenAI’s 175 billion parameter GPT-3 and the similarly sized open-source Bloom 176B raising the bar on NLP accuracy. This increase in the number of parameters is driven by the simple and empirically demonstrated positive relationship between model size and accuracy: more is better. With easy access from model zoos such as Hugging Face and improved accuracy in NLP tasks such as classification and text generation, practitioners are increasingly reaching for these large models. However, deploying them can be a challenge because of their size.

Model parallelism can help deploy large models that would normally be too large for a single GPU. With model parallelism, we partition and distribute a model across multiple GPUs. Each GPU holds a different part of the model, resolving the memory capacity issue for the largest deep learning models with billions of parameters. This notebook uses tensor parallelism techniques which allow GPUs to work simultaneously on the same layer of a model and achieve low latency inference relative to a pipeline parallel solution.
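
To make the idea of GPUs working "simultaneously on the same layer" concrete, here is a minimal toy sketch in plain Python. It is not how DeepSpeed is implemented; it only illustrates one common tensor-parallel pattern, in which each device holds a slice of a layer's weight columns and the partial outputs are concatenated. The function names are made up for this illustration.

```python
# Toy illustration of tensor (intra-layer) parallelism using plain Python lists.
# Each "device" holds a contiguous slice of the weight matrix's columns and
# computes its share of the SAME layer; partial outputs are then concatenated.

def matmul(x, w):
    """Multiply a row vector x (length k) by a k x n matrix w."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, num_devices):
    """Partition the columns of w into num_devices contiguous shards."""
    n = len(w[0])
    assert n % num_devices == 0, "columns must divide evenly in this sketch"
    width = n // num_devices
    return [[row[d * width:(d + 1) * width] for row in w] for d in range(num_devices)]

def tensor_parallel_matmul(x, w, num_devices):
    """Compute x @ w shard by shard, then gather the pieces in order."""
    shards = split_columns(w, num_devices)
    # On real hardware these per-shard products would run concurrently.
    partial_outputs = [matmul(x, shard) for shard in shards]
    return [value for part in partial_outputs for value in part]

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
print(tensor_parallel_matmul(x, w, num_devices=2))
```

The sharded result matches the unsharded `matmul(x, w)`. In a pipeline-parallel layout, by contrast, each device would hold whole layers and process them one after another.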

SageMaker has rolled out a DeepSpeed container that lets users leverage its managed serving capabilities, which take care of the undifferentiated heavy lifting of hosting.

In this notebook, we deploy the open-source, quantized Bloom 176B model across the GPUs of an ml.p4d.24xlarge instance. DeepSpeed provides tensor-parallel inference, while DJLServing handles inference requests and the distributed workers.
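
A rough back-of-the-envelope calculation shows why quantization matters here. The sketch below counts weight memory only, ignoring activations, the key/value cache, and framework overhead. It assumes an ml.p4d.24xlarge has 8 GPUs with 40 GB each, so check the current instance specification before relying on these numbers.

```python
# Weight-only memory estimate; real deployments need extra headroom.
NUM_PARAMETERS = 176e9   # Bloom 176B
NUM_GPUS = 8             # assumption: GPUs on one ml.p4d.24xlarge
GPU_MEMORY_GB = 40       # assumption: memory per GPU, in GB

def weight_memory_gb(num_parameters, bytes_per_parameter):
    """Memory needed to hold the weights alone, in GB (10**9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9

for label, bytes_per_parameter in [("fp16", 2), ("int8", 1)]:
    total_gb = weight_memory_gb(NUM_PARAMETERS, bytes_per_parameter)
    per_gpu_gb = total_gb / NUM_GPUS
    fits = per_gpu_gb < GPU_MEMORY_GB
    print(f"{label}: {total_gb:.0f} GB total, {per_gpu_gb:.0f} GB per GPU, fits={fits}")
```

Under these assumptions, the fp16 weights alone would exceed the combined GPU memory, while the int8 weights take roughly half of it.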


In [68]:
# Install boto3 and the AWS CLI to create the model and run inference workloads
%pip install -Uqq boto3 awscli

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
docker-compose 1.29.2 requires jsonschema<4,>=2.5.1, but you have jsonschema 4.16.0 which is incompatible.
docker-compose 1.29.2 requires websocket-client<1,>=0.32.0, but you have websocket-client 1.4.1 which is incompatible.
Note: you may need to restart the kernel to use updated packages.


## Setup Docker Image
This section should be removed after DLC release

In [2]:
%%sh
docker pull deepjavalibrary/djl-serving:0.19.0-deepspeed

0.19.0-deepspeed: Pulling from deepjavalibrary/djl-serving
d5fd17ec1767: Pulling fs layer
602a45a9c0c5: Pulling fs layer
e1bae4c1f40f: Pulling fs layer
d9d586ab2510: Pulling fs layer
2b44adc78060: Pulling fs layer
cd4d84563a60: Pulling fs layer
e19a4e23074d: Pulling fs layer
a69bd65705b8: Pulling fs layer
7145f7b4815b: Pulling fs layer
e367c0f08642: Pulling fs layer
f98db42fa8f9: Pulling fs layer
889527aaa22f: Pulling fs layer
a061992ad5a9: Pulling fs layer
4557083dcd61: Pulling fs layer
c5b78129d513: Pulling fs layer
1971a3fe9e0d: Pulling fs layer
ff79d3f300b0: Pulling fs layer
601a5fb776e4: Pulling fs layer
2b44adc78060: Waiting
cd4d84563a60: Waiting
e19a4e23074d: Waiting
a69bd65705b8: Waiting
7145f7b4815b: Waiting
e367c0f08642: Waiting
f98db42fa8f9: Waiting
889527aaa22f: Waiting
a061992ad5a9: Waiting
4557083dcd61: Waiting
c5b78129d513: Waiting
1971a3fe9e0d: Waiting
ff79d3f300b0: Waiting
d9d586ab2510: Waiting
601a5fb776e4: Waiting
602a45a9c0c5: Verifying Checksum
602a45a9c0c5: Downlo

In [31]:
%%bash 
# Look up the ID of the image pulled above
image_id=$(docker images | grep 0.19.0-deepspeed | tr -s " " | cut -d " " -f 3)
echo "image_id=${image_id}"


repo_name='djl-ds' # same as algorithm name
echo "repo_name=$repo_name"

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-east-1 if none defined)
region=$(aws configure get region)
region=${region:-us-east-1}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${repo_name}:latest"
echo "Full_name=$fullname"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${repo_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${repo_name}" > /dev/null
fi

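# Log Docker in to ECR. get-login was removed in AWS CLI v2;
# get-login-password works in recent v1 releases and in v2.
aws ecr get-login-password --region "${region}" | docker login --username AWS --password-stdin "${account}.dkr.ecr.${region}.amazonaws.com"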

# Tag the pulled image with the full ECR name and then push it to ECR.
image_id=$(docker images | grep 0.19.0-deepspeed | tr -s " " | cut -d " " -f 3)

echo "image_id=${image_id}"

#docker build -q -t ${algorithm_name} .

docker tag $image_id ${fullname}
#docker tag ${algorithm_name} ${fullname}

docker push ${fullname}



image_id=8344d1c51e98
repo_name=djl-ds
Full_name=622343165275.dkr.ecr.us-east-1.amazonaws.com/djl-ds:latest
Login Succeeded
image_id=8344d1c51e98
The push refers to repository [622343165275.dkr.ecr.us-east-1.amazonaws.com/djl-ds]
b67adf46a2f4: Preparing
5dd1da667968: Preparing
f4c0ee9d3ebe: Preparing
ac511274f948: Preparing
ba1a8dd44a3b: Preparing
36aca82a9594: Preparing
a8bb2bd009b8: Preparing
2830fca3c261: Preparing
1a5fac543081: Preparing
a8d0c4c62eef: Preparing
7ed9a71261c7: Preparing
a1eeba43cdbe: Preparing
6127942867a5: Preparing
e592fe6d10a9: Preparing
f42691182163: Preparing
68016c5bb65c: Preparing
8034550a3bbe: Preparing
bf8cedc62fb3: Preparing
36aca82a9594: Waiting
a8bb2bd009b8: Waiting
2830fca3c261: Waiting
1a5fac543081: Waiting
a8d0c4c62eef: Waiting
7ed9a71261c7: Waiting
a1eeba43cdbe: Waiting
6127942867a5: Waiting
e592fe6d10a9: Waiting
f42691182163: Waiting
68016c5bb65c: Waiting
8034550a3bbe: Waiting
bf8cedc62fb3: Waiting
ac511274f948: Pushed
ba1a8dd44a3b: Pushed
f4c0ee9d3e

https://docs.docker.com/engine/reference/commandline/login/#credentials-store



## Optional: Download Model from Hugging Face Hub

Use this section if you are interested in downloading the model directly from the Hugging Face Hub and storing it in your own S3 bucket. This notebook currently leverages the Microsoft model stored in an AWS public S3 location for ease of use. Downloading the model and then uploading it to S3 can take several minutes because the model is extremely large.

In [4]:
%pip install huggingface-hub -Uqq

Note: you may need to restart the kernel to use updated packages.


In [5]:
from huggingface_hub import snapshot_download
from pathlib import Path

In [2]:
# - This will download the model into ./model, relative to the directory the notebook is running in
local_model_path = Path("./model")
local_model_path.mkdir(exist_ok=True)
model_name = "microsoft/bloom-deepspeed-inference-int8"
commit_hash = "aa00a6626f6484a2eef68e06d1e089e4e32aa571" 

In [None]:
# - Use snapshot_download to fetch the model, since the repository stores it using Git LFS
snapshot_download(repo_id=model_name, 
                  revision=commit_hash,
                  cache_dir=local_model_path)

#### Upload to S3 using the awscli 

In [None]:
s3_model_prefix = "hf-large-model-djl-ds/model" # folder where model checkpoint will go
model_snapshot_path = list(local_model_path.glob("**/snapshots/*"))[0]

In [None]:
!aws s3 cp --recursive {model_snapshot_path} s3://{bucket}/{s3_model_prefix}

## Create SageMaker compatible Model artifact and Upload Model to S3

SageMaker needs the model to be in a tarball format. In this notebook, we package only the inference code in the model artifact to shorten the endpoint creation time. The inference code kicks off a multi-threaded download of the model weights into the container using awscli.
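
The contents of `code/model.py` are not shown in this notebook, so the following is only a hypothetical sketch of what a multi-threaded download with awscli could look like. The helper names (`build_cp_command`, `download_in_parallel`), the `runner` parameter, and the default worker count are assumptions made for illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def build_cp_command(bucket, key, dest_dir):
    """Build an `aws s3 cp` command that copies one object into dest_dir."""
    filename = key.rsplit("/", 1)[-1]
    return ["aws", "s3", "cp", f"s3://{bucket}/{key}", f"{dest_dir}/{filename}"]

def download_in_parallel(bucket, keys, dest_dir, max_workers=8, runner=subprocess.run):
    """Run one `aws s3 cp` per key on a thread pool (hypothetical helper)."""
    commands = [build_cp_command(bucket, key, dest_dir) for key in keys]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # check=True makes a failed copy raise instead of passing silently.
        return list(pool.map(lambda cmd: runner(cmd, check=True), commands))
```

Threads are enough here because each worker just waits on an external `aws` process. Passing `runner` as a parameter lets the logic be exercised without AWS credentials.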

In [6]:
import sagemaker
from sagemaker import image_uris
import boto3
import os
import time
import json
from pathlib import Path

In [7]:
role = sagemaker.get_execution_role()      # execution role for the endpoint
sess = sagemaker.session.Session()         # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()             # bucket to house artifacts
s3_code_prefix = "hf-large-model-djl-ds/code"    # folder within bucket where code artifact will go
s3_model_prefix = "hf-large-model-djl-ds/model" # folder where model checkpoint will go

region = sess.boto_region_name
account_id = sess.account_id()

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

In [32]:
inference_image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/djl-ds:latest"
inference_image_uri
# 622343165275.dkr.ecr.us-east-1.amazonaws.com/djl-ds:latest

'622343165275.dkr.ecr.us-east-1.amazonaws.com/djl-ds:latest'

In [33]:
!rm -f model.tar.gz
!tar czvf model.tar.gz code 

code/
code/serving.properties
code/model.py
code/.ipynb_checkpoints/
code/.ipynb_checkpoints/model-checkpoint.py
code/.ipynb_checkpoints/requirements-checkpoint.txt
code/.ipynb_checkpoints/serving-checkpoint.properties
code/requirements.txt
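
The listing above shows that Jupyter's `.ipynb_checkpoints` directory was packed into the archive. As one possible alternative to the shell command, this sketch builds the tarball with Python's `tarfile` module and filters those files out. The function names are illustrative, not part of the original notebook.

```python
import tarfile
from pathlib import Path

def skip_notebook_checkpoints(tarinfo):
    """Return None for anything under .ipynb_checkpoints so tarfile omits it."""
    if ".ipynb_checkpoints" in Path(tarinfo.name).parts:
        return None
    return tarinfo

def build_model_archive(source_dir="code", archive_path="model.tar.gz"):
    """Pack source_dir into a gzip-compressed tarball, minus checkpoint files."""
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname keeps the top-level folder name regardless of where source_dir lives.
        archive.add(source_dir, arcname=Path(source_dir).name, filter=skip_notebook_checkpoints)
    return archive_path
```

Calling `build_model_archive()` from the notebook's working directory would produce `model.tar.gz` containing only the `code/` files that the container needs.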


In [35]:
s3_code_artifact = sess.upload_data("model.tar.gz", bucket, s3_code_prefix)
s3_code_artifact

's3://sagemaker-us-east-1-622343165275/hf-large-model-djl-ds/code/model.tar.gz'

#### This works

In [36]:
s3_model_prefix = 'bloom-176B/raw_model_microsoft/'
print(s3_model_prefix)
print(bucket)

bloom-176B/raw_model_microsoft/
sagemaker-us-east-1-622343165275


#### Test the new downloads

In [71]:
s3_model_prefix = 'hf-large-model-djl-ds/model/'
print(s3_model_prefix)
print(bucket)

hf-large-model-djl-ds/model/
sagemaker-us-east-1-622343165275


In [37]:
!aws sagemaker describe-domain --domain-id d-vj9jud4p6ywy | python3 -c "import sys, json; print(json.load(sys.stdin)['SubnetIds'])"
!aws sagemaker describe-domain --domain-id d-vj9jud4p6ywy | python3 -c "import sys, json; print(json.load(sys.stdin)['VpcId'])"
!aws ec2 describe-security-groups --filter Name=vpc-id,Values=vpc-05edeb4f9b293161c | python3 -c "import sys, json; print(json.load(sys.stdin)['SecurityGroups'][0]['GroupId'])"


['subnet-0508539ff391bc62a', 'subnet-0f88fe2e674a870c4', 'subnet-02e4d3f4bd7ac9e66', 'subnet-0814f48bf38ffc0ae', 'subnet-076597677e5d1293b', 'subnet-09e3b111fe0bc7fa7']
vpc-05edeb4f9b293161c
sg-042c834d701c600a1


In [72]:
# - provide networking configs if needed. 
security_group_ids = [] # add the security group IDs
subnets = [] # add the subnet IDs for this VPC
privateVpcConfig={
    'SecurityGroupIds': security_group_ids, 
    'Subnets': subnets
}
print(privateVpcConfig)


{'SecurityGroupIds': [], 'Subnets': []}


In [73]:
# - print the Inference image
print(inference_image_uri)

622343165275.dkr.ecr.us-east-1.amazonaws.com/djl-ds:latest


In [74]:
from sagemaker.utils import name_from_base
model_name = name_from_base("bloom-djl-ds")
print(model_name)

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": s3_code_artifact,
        "Environment": {
            "MODEL_S3_BUCKET": bucket,
            "MODEL_S3_PREFIX": s3_model_prefix,
            "TENSOR_PARALLEL_DEGREE": "8"
        },
    },
    # Uncomment if providing networking configs
    #VpcConfig=privateVpcConfig
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

bloom-djl-ds-2022-10-27-19-41-28-480
Created Model: arn:aws:sagemaker:us-east-1:622343165275:model/bloom-djl-ds-2022-10-27-19-41-28-480
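
Instead of commenting `VpcConfig` in and out by hand, the request can be assembled as a dictionary that only includes networking settings when they are filled in. This is a small sketch; `build_create_model_kwargs` is a hypothetical helper, and the dictionary keys mirror the `create_model` call above.

```python
def build_create_model_kwargs(model_name, role, image_uri, model_data_url,
                              environment, vpc_config=None):
    """Assemble create_model arguments, adding VpcConfig only when it is populated."""
    kwargs = {
        "ModelName": model_name,
        "ExecutionRoleArn": role,
        "PrimaryContainer": {
            "Image": image_uri,
            "ModelDataUrl": model_data_url,
            "Environment": environment,
        },
    }
    # Skip VpcConfig when either list is empty, as in the default cell above.
    if vpc_config and vpc_config.get("SecurityGroupIds") and vpc_config.get("Subnets"):
        kwargs["VpcConfig"] = vpc_config
    return kwargs
```

The result could then be passed along as `sm_client.create_model(**kwargs)`.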


In [75]:
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.p4d.24xlarge",
            "InitialInstanceCount": 1,
            # "VolumeSizeInGB" : 200
            'ModelDataDownloadTimeoutInSeconds': 2400,
            'ContainerStartupHealthCheckTimeoutInSeconds': 2400
        },
    ],
     
)
endpoint_config_response

In [76]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

Created Endpoint: arn:aws:sagemaker:us-east-1:622343165275:endpoint/bloom-djl-ds-2022-10-27-19-41-28-480-endpoint


In [77]:
import time
resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp['EndpointStatus']
print("Status: " + status)

while status=='Creating':
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp['EndpointStatus']
    print("Status: " + status)

print("Arn: " + resp['EndpointArn'])
print("Status: " + status)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:us-east-1:622343165275:endpoint/bloom-djl-ds-2022-10-27-19-41-28-480-endpoint
Status: InService
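
The loop above exits as soon as the status is no longer `Creating`, so a `Failed` endpoint would fall through silently. A slightly more defensive variant, sketched below, adds a timeout and raises on any final status other than `InService`. The function name, its parameters, and the default timeout are assumptions for illustration.

```python
import time

def wait_for_endpoint(describe, endpoint_name, poll_seconds=60,
                      timeout_seconds=3600, sleep=time.sleep):
    """Poll describe(EndpointName=...) until the endpoint leaves 'Creating'."""
    deadline = time.monotonic() + timeout_seconds
    resp = describe(EndpointName=endpoint_name)
    while resp["EndpointStatus"] == "Creating":
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{endpoint_name} still Creating after {timeout_seconds}s")
        sleep(poll_seconds)
        resp = describe(EndpointName=endpoint_name)
    if resp["EndpointStatus"] != "InService":
        reason = resp.get("FailureReason", "no reason reported")
        raise RuntimeError(f"{endpoint_name} ended in {resp['EndpointStatus']}: {reason}")
    return resp
```

In the notebook this would be called as `wait_for_endpoint(sm_client.describe_endpoint, endpoint_name)`.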


In [78]:
smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps({"input": "Amazon.com is the best ", "gen_kwargs": {"min_length":5, "max_new_tokens": 100, "temperature": 0.8, "num_beams": 5, "no_repeat_ngram_size": 2} }),
    ContentType='application/json'
)["Body"].read().decode("utf8")

'[\n  "Amazon.com is the best  online shopping site in the world. It has a wide range of products. You can buy anything you want from the site. The site is very easy to use and you can search for the product you are looking for. There are a lot of options to choose from and the prices are very reasonable. I have been using this site for a long time now and I am very happy with the service. They have a very good customer service and they are always ready to help you with any problem you have"\n]'
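
The response body shown above is a JSON array of generated strings. The sketch below builds the request payload and decodes such a body. The helper names are illustrative, and the payload shape (`input`, `gen_kwargs`) is taken from the call above rather than from any documented contract.

```python
import json

def build_payload(prompt, **gen_kwargs):
    """Serialize a prompt and generation settings in the shape used above."""
    return json.dumps({"input": prompt, "gen_kwargs": gen_kwargs})

def parse_generations(raw_body):
    """Decode the endpoint's response body into a list of generated strings."""
    generations = json.loads(raw_body)
    if not isinstance(generations, list):
        raise ValueError(f"expected a JSON list, got {type(generations).__name__}")
    return generations

payload = build_payload("Amazon.com is the best ", max_new_tokens=100, temperature=0.8)
sample_body = '[\n  "Amazon.com is the best  online shopping site in the world."\n]'
print(parse_generations(sample_body)[0])
```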

## Clean Up

In [None]:
# - Delete the endpoint
sm_client.delete_endpoint(EndpointName=endpoint_name)

In [65]:
# - In case the endpoint failed, we still want to delete the endpoint config and the model
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)

{'ResponseMetadata': {'RequestId': '0151fe17-2ae0-431d-98df-764c3f8f9151',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '0151fe17-2ae0-431d-98df-764c3f8f9151',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Thu, 27 Oct 2022 17:39:48 GMT'},
  'RetryAttempts': 0}}

In [None]:
# Optionally delete the model checkpoint from S3

In [None]:
!aws s3 rm --recursive s3://{bucket}/{s3_model_prefix}

In [None]:
s3_client = boto3.client("s3")

In [None]:
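# Count objects still under the prefix (at most 1,000 are returned per call); 0 means the checkpoint is gone
len(s3_client.list_objects_v2(Bucket=bucket, Prefix=s3_model_prefix).get("Contents", []))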