# A sample for deploying WizardCoder-15B on SageMaker

In [1]:
## Update sagemaker python sdk version
!pip install -U sagemaker

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com


In [12]:
import boto3
import sagemaker
from sagemaker import get_execution_role


sess                     = sagemaker.Session()
role                     = get_execution_role()
sagemaker_default_bucket = sess.default_bucket()

account                  = sess.boto_session.client("sts").get_caller_identity()["Account"]
region                   = sess.boto_session.region_name

## Download pretrained model from HuggingFace Hub

To avoid failures when downloading the model from the Hugging Face Hub, we download the model files first and then push them to an S3 bucket.

In [6]:
!pip install huggingface_hub

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com


In [21]:
from huggingface_hub import snapshot_download
from pathlib import Path


local_cache_path = Path("./model")
local_cache_path.mkdir(exist_ok=True)

model_name = "WizardLM/WizardCoder-15B-V1.0"

# Only download pytorch checkpoint files
allow_patterns = ["*.json", "*.pt", "*.bin", "*.model", "*.py"]

model_download_path = snapshot_download(
    repo_id=model_name,
    cache_dir=local_cache_path,
    allow_patterns=allow_patterns,
    revision='926ca1b215c4631bc5f8c3e47173381452c23e5c'
)

Fetching 8 files:   0%|          | 0/8 [00:00<?, ?it/s]

Downloading pytorch_model.bin:   0%|          | 0.00/31.0G [00:00<?, ?B/s]

Downloading (…)5c/added_tokens.json:   0%|          | 0.00/21.0 [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

Downloading (…)23e5c/tokenizer.json:   0%|          | 0.00/2.06M [00:00<?, ?B/s]

Downloading (…)452c23e5c/vocab.json:   0%|          | 0.00/777k [00:00<?, ?B/s]

Downloading (…)52c23e5c/config.json:   0%|          | 0.00/997 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/556 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/717 [00:00<?, ?B/s]

**Upload model files to S3**

In [22]:
# Locate the directory that holds the downloaded model files
import os

local_model_path = None

for root, dirs, files in os.walk("./model"):
    if "config.json" in files:
        print(os.path.join(root, "config.json"))
        # Keep the trailing "/" so the sync command treats the path as a directory
        local_model_path = os.path.join(root, "")
        print(local_model_path)
if local_model_path is None:
    print("Model download may have failed, please check the prior step!")

./model/models--WizardLM--WizardCoder-15B-V1.0/snapshots/926ca1b215c4631bc5f8c3e47173381452c23e5c/config.json
./model/models--WizardLM--WizardCoder-15B-V1.0/snapshots/926ca1b215c4631bc5f8c3e47173381452c23e5c/
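
As a side note, `snapshot_download` returns the local snapshot directory, so `model_download_path` from the earlier cell can often be used directly. When a directory search is still wanted, a `pathlib`-based lookup is one way to avoid slicing the path string by hand. The sketch below is only an illustration; the helper name `find_model_dir` is hypothetical and not part of the original notebook.

```python
from pathlib import Path
from typing import Optional


def find_model_dir(cache_root: str) -> Optional[str]:
    """Return the first directory under cache_root that contains config.json.

    The returned path ends with "/" so that a sync command treats it as a
    directory. Returns None when no config.json is found.
    """
    for config_file in sorted(Path(cache_root).rglob("config.json")):
        return f"{config_file.parent.as_posix()}/"
    return None


# Hypothetical usage in the notebook:
# local_model_path = find_model_dir("./model")
```

Returning `None` instead of raising keeps the behavior close to the cell above, which prints a warning when nothing is found.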


In [9]:
%%script env sagemaker_default_bucket=$sagemaker_default_bucket local_model_path=$local_model_path bash

chmod +x ./s5cmd
./s5cmd sync ${local_model_path} s3://${sagemaker_default_bucket}/llm/models/wizardcoder/WizardLM/WizardLM-15B/

rm -rf model

cp model/models--WizardLM--WizardCoder-15B-V1.0/snapshots/926ca1b215c4631bc5f8c3e47173381452c23e5c/added_tokens.json s3://sagemaker-us-west-2-928808346782/llm/models/wizardcoder/WizardLM/WizardLM-15B/added_tokens.json
cp model/models--WizardLM--WizardCoder-15B-V1.0/snapshots/926ca1b215c4631bc5f8c3e47173381452c23e5c/generation_config.json s3://sagemaker-us-west-2-928808346782/llm/models/wizardcoder/WizardLM/WizardLM-15B/generation_config.json
cp model/models--WizardLM--WizardCoder-15B-V1.0/snapshots/926ca1b215c4631bc5f8c3e47173381452c23e5c/tokenizer_config.json s3://sagemaker-us-west-2-928808346782/llm/models/wizardcoder/WizardLM/WizardLM-15B/tokenizer_config.json
cp model/models--WizardLM--WizardCoder-15B-V1.0/snapshots/926ca1b215c4631bc5f8c3e47173381452c23e5c/config.json s3://sagemaker-us-west-2-928808346782/llm/models/wizardcoder/WizardLM/WizardLM-15B/config.json
cp model/models--WizardLM--WizardCoder-15B-V1.0/snapshots/926ca1b215c4631bc5f8c3e47173381452c23e5c/special_tokens_map.json


### Serve large models on SageMaker with DJL DeepSpeed Container

In this notebook, we explore how to host a large language model on SageMaker using the latest container that packages DeepSpeed and DJL. DJL provides the serving framework, while DeepSpeed is the key sharding library we leverage to enable hosting of large models. We use DJLServing as the model serving solution in this example. DJLServing is a high-performance universal model serving solution powered by the Deep Java Library (DJL) that is programming language agnostic. To learn more about DJL and DJLServing, you can refer to our recent blog post (https://aws.amazon.com/blogs/machine-learning/deploy-large-models-on-amazon-sagemaker-using-djlserving-and-deepspeed-model-parallel-inference/).

Model parallelism can help deploy large models that would normally be too large for a single GPU. With model parallelism, we partition and distribute a model across multiple GPUs. Each GPU holds a different part of the model, resolving the memory capacity issue for the largest deep learning models with billions of parameters. This notebook uses tensor parallelism techniques which allow GPUs to work simultaneously on the same layer of a model and achieve low latency inference relative to a pipeline parallel solution.
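
To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It only counts the bytes needed for the weights and ignores activations, the KV cache, and framework overhead, so real usage will be higher. The helper name `weight_memory_gib` is hypothetical, and treating "15B" as exactly 15e9 parameters is an assumption.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int,
                      tensor_parallel_degree: int = 1) -> float:
    """Approximate per-GPU memory (GiB) needed for the model weights alone."""
    if tensor_parallel_degree < 1:
        raise ValueError("tensor_parallel_degree must be >= 1")
    total_bytes = num_params * bytes_per_param
    return total_bytes / tensor_parallel_degree / 2**30


# Assumption: "15B" is taken as a nominal 15e9 parameters; fp16 uses 2 bytes each.
fp16_one_gpu = weight_memory_gib(15e9, bytes_per_param=2)
fp16_four_gpus = weight_memory_gib(15e9, bytes_per_param=2, tensor_parallel_degree=4)
print(f"fp16 weights on 1 GPU : {fp16_one_gpu:.1f} GiB")
print(f"fp16 weights on 4 GPUs: {fp16_four_gpus:.1f} GiB per GPU")
```

Under these assumptions the fp16 weights come to roughly 28 GiB in total, which is why splitting them across several GPUs is attractive when each GPU has 24GB of memory.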

SageMaker has rolled out a DeepSpeed container, which lets users leverage SageMaker's managed serving capabilities and offloads the undifferentiated heavy lifting.

In this notebook, we deploy the open source WizardCoder-15B model across the GPUs of an ml.g5.12xlarge instance, which has 4 GPUs. Note that a smaller model, such as llama 7B in fp16, can be deployed on a single GPU instance such as g5.2xlarge (24GB VRAM); here we show code that can deploy an LLM across multiple GPUs in SageMaker. DeepSpeed is used for tensor parallelism inference while DJLServing handles inference requests and the distributed workers. For further reading on DeepSpeed you can refer to https://arxiv.org/pdf/2207.00032.pdf


## Create SageMaker compatible Model artifact and Upload Model to S3

SageMaker needs the model to be in a tarball format. In this notebook we package only the inference code in the tarball, which shortens the endpoint creation time.

The tarball is in the following format

```
src
├── model.py
├── requirements.txt
└── serving.properties
```


- `model.py` is the key file which will handle any requests for serving. 
- `requirements.txt` has the required libraries needed to be installed when the container starts up.
- `serving.properties` is the configuration file whose settings are passed to the model server and can be used to customize `model.py` at run time.
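
The packaging step can also be done from Python. The following is a minimal sketch using the standard `tarfile` module that mirrors the `tar czvf model.tar.gz src` command used later in this notebook; the helper name `build_model_tarball` is hypothetical.

```python
import tarfile
from pathlib import Path
from typing import List


def build_model_tarball(src_dir: str, tar_path: str = "model.tar.gz") -> List[str]:
    """Pack src_dir into a gzip tarball and return the sorted member names."""
    src = Path(src_dir)
    with tarfile.open(tar_path, "w:gz") as tar:
        # arcname keeps only the folder name (for example "src") inside the archive
        tar.add(str(src), arcname=src.name)
    with tarfile.open(tar_path, "r:gz") as tar:
        return sorted(tar.getnames())


# Hypothetical usage once ./src has been populated:
# print(build_model_tarball("./src"))
```

Listing the members after writing is a cheap way to confirm that `model.py`, `requirements.txt`, and `serving.properties` all made it into the archive.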


#### serving.properties has an engine parameter that tells the DJL model server to use the DeepSpeed engine to load the model.

option.tensor_parallel_degree:  the number of GPUs the model is partitioned across. For example, a g5.48xlarge has 8 GPUs, so you would set tensor_parallel_degree to 8. This notebook sets it to 4 for a g5.12xlarge.

option.s3url:  use your own model path here. The S3 path must end with "/".

batch_size:   the server-side batch size, applied at the request level. You can set batch_size to the largest value that does not cause an out-of-memory (OOM) error. The current model.py is only a demo that handles one prompt per client request.

max_batch_delay:   the maximum time to wait while forming a batch, in milliseconds.
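
The constraints above can be checked before the file is written. The sketch below renders the same keys used in this notebook and validates the two rules stated in the text (the trailing "/" on the S3 path, and a tensor parallel degree that fits the GPU count). The function name `render_serving_properties` and the bucket in the usage comment are hypothetical.

```python
def render_serving_properties(s3url: str, tensor_parallel_degree: int, num_gpus: int,
                              batch_size: int = 1, max_batch_delay_ms: int = 50) -> str:
    """Return the text of a serving.properties file after basic validation."""
    if not s3url.startswith("s3://") or not s3url.endswith("/"):
        raise ValueError('option.s3url must start with "s3://" and end with "/"')
    if not 1 <= tensor_parallel_degree <= num_gpus:
        raise ValueError("tensor_parallel_degree must be between 1 and num_gpus")
    lines = [
        "engine=DeepSpeed",
        f"option.tensor_parallel_degree={tensor_parallel_degree}",
        f"option.s3url={s3url}",
        f"batch_size={batch_size}",
        f"max_batch_delay={max_batch_delay_ms}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical usage (the bucket name is a placeholder):
# text = render_serving_properties("s3://my-bucket/llm/models/wizardcoder/", 4, num_gpus=4)
# open("./src/serving.properties", "w").write(text)
```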

In [10]:
!rm -rf src
!mkdir src

In [5]:
%%writefile ./src/serving.properties
engine=DeepSpeed
option.tensor_parallel_degree=4
option.s3url=s3://sagemaker-us-west-2-928808346782/llm/models/wizardcoder/WizardLM/WizardLM-15B/
batch_size=1
max_batch_delay=50

Overwriting ./src/serving.properties


In [17]:
!pip install transformers==4.30.1
!pip install deepspeed==0.10.0

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting deepspeed==0.10.0
  Downloading deepspeed-0.10.0.tar.gz (836 kB)
  Preparing metadata (setup.py) ... done
Collecting hjson (from deepspeed==0.10.0)
  Downloading hjson-3.1.0-py3-none-any.whl (54 kB)
Collecting ninja (from deepspeed==0.10.0)
  Downloading ninja-1.11.1-py2.py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (145 kB)
Collecting py-cpuinfo (from deepspeed==0.10.0)
  Downloading py_cpuinfo-9.0.0-p

In [2]:
import os
import logging
import torch
import deepspeed
import transformers
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer, AutoModel, LlamaForCausalLM


[2023-07-31 04:18:17,620] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)

Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.8/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...


  warn(msg)
  warn(msg)


In [3]:
base_model="./model/models--WizardLM--WizardCoder-15B-V1.0/snapshots/926ca1b215c4631bc5f8c3e47173381452c23e5c/"
# base_model="WizardLM/WizardCoder-15B-V1.0"

In [4]:
tokenizer = AutoTokenizer.from_pretrained(base_model)

In [5]:
load_8bit=True

In [None]:
model = AutoModelForCausalLM.from_pretrained(
            base_model,
            load_in_8bit=load_8bit,
            torch_dtype=torch.float16,
            device_map="auto",
        )

In [None]:
model

In [6]:
%%writefile ./src/requirements.txt
transformers==4.30.1
deepspeed==0.10.0
sagemaker
nvgpu

Overwriting ./src/requirements.txt


In [7]:
%%writefile ./src/model.py
from djl_python import Input, Output
import os
import logging
import torch
import deepspeed
import transformers
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer, AutoModel, LlamaForCausalLM
from transformers.models.llama.tokenization_llama import LlamaTokenizer


predictor = None
#here, we need to set the global variable batch_size according to the batch_size in the serving.properties file.
batch_size = 1


def load_model(properties):
    # Property values arrive as strings, so convert before passing to DeepSpeed
    tensor_parallel = int(properties["tensor_parallel_degree"])
    model_location = properties['model_dir']
    if "model_id" in properties:
        model_location = properties['model_id']
    logging.info(f"Loading model in {model_location}")
    
    # The tokenizer has no weights, so no dtype argument is needed here
    tokenizer = AutoTokenizer.from_pretrained(model_location)

    # for deepspeed inference 
    model = AutoModelForCausalLM.from_pretrained(
        model_location, 
        load_in_8bit=True,
        low_cpu_mem_usage=True, 
        torch_dtype=torch.float16,
        device_map="auto")
    
    
    print("----------model dtype is {0}---------".format(model.dtype))
    print("----------model config is {0}---------".format(model.config))
    model.config.pad_token_id = tokenizer.pad_token_id
    print("----------model config is {0}---------".format(model.config))
    
    model.eval()
    
    model = deepspeed.init_inference(
        model,
        mp_size=tensor_parallel,
        dtype=torch.half,
        replace_method="auto",
        replace_with_kernel_inject=True,
    )
        
    local_rank = int(os.getenv("LOCAL_RANK", "0"))
    generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer, use_cache=True, device=local_rank)
    
    generator.tokenizer.pad_token_id = model.config.pad_token_id
    
    return generator, model, tokenizer


def generate_prompt(instruction, input=None):
    return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:"""


def handle(inputs: Input) -> None:
    global predictor, model, tokenizer
    try:
        if not predictor:
            predictor,model,tokenizer = load_model(inputs.get_properties())

        print(inputs)
        if inputs.is_empty():
            # Model server makes an empty call to warmup the model on startup
            return None
        
        if inputs.is_batch():
            #the demo code is just suitable for single sample per client request
            bs = inputs.get_batch_size()
            logging.info(f"Dynamic batching size: {bs}.")
            batch = inputs.get_batches()
            #print(batch)
            tmp_inputs = []
            for _, item in enumerate(batch):
                tmp_item = item.get_as_json()
                tmp_inputs.append(tmp_item.get("input"))
            
            # For server side batch, we just use the custom generation parameters for single Sagemaker Endpoint.
            result = predictor(tmp_inputs, batch_size = bs, max_new_tokens = 128, min_new_tokens = 128, temperature = 1.0, do_sample = True)
            
            outputs = Output()
            for i in range(len(result)):
                outputs.add(result[i], key="generate_text", batch_index=i)
            return outputs
        else:
            inputs = inputs.get_as_json()
            if not inputs.get("input"):
                return Output().add_as_json({"code":-1,"msg":"input field can't be null"})

            #input data
            data = inputs.get("input")
            params = inputs.get("params",{})
            print("data  :{}".format(data))
            print("params:{}".format(params))

            #for pure client side batch
            if type(data) == str:
                bs = 1
            elif type(data) == list:
                if len(data) > batch_size:
                    bs = batch_size
                else:
                    bs = len(data)
            else:
                return Output().add_as_json({"code":-1,"msg": "input has wrong type"})
                
            print("client side batch size is ", bs)
            #predictor
            result = predictor(data, batch_size = bs, **params)
            print("result:{}".format(result))

            return Output().add_as_json({"code":0,"msg":"ok","data":result})
    except Exception as e:
        # Exceptions are not JSON serializable, so return the message text
        return Output().add_as_json({"code":-1,"msg":str(e)})

Overwriting ./src/model.py
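
For reference, the non-batch branch of `handle` reads a JSON body with an `input` field and an optional `params` dictionary. Note that `handle` does not call `generate_prompt`, so a client may want to apply the instruction template itself. The sketch below shows one way to build such a request body; the helper name `build_request_body` and the sample instruction are hypothetical.

```python
import json

# Same instruction template as generate_prompt() in model.py
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def build_request_body(instruction: str, **generation_params) -> str:
    """Serialize one request in the {"input": ..., "params": ...} shape read by handle()."""
    payload = {
        "input": PROMPT_TEMPLATE.format(instruction=instruction),
        "params": generation_params,
    }
    return json.dumps(payload)


body = build_request_body("Write a Python function that reverses a string.",
                          max_new_tokens=128)
# The body could then be sent with the sagemaker-runtime client, for example:
# smr_client.invoke_endpoint(EndpointName=endpoint_name,
#                            ContentType="application/json", Body=body)
```

Any keys placed in `params` are forwarded to the text-generation pipeline as keyword arguments, so they should be limited to generation options the pipeline accepts.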


#### Create and initialize the variables required to create the endpoint. We leverage boto3 for this.

In [8]:
import sagemaker
from sagemaker import image_uris
import boto3
import os
import time
import json
from pathlib import Path


s3_code_prefix = (
    "llm/models/wizardcoder/code-15b"  # folder within bucket where code artifact will go
)

s3_client      = boto3.client("s3")
sm_client      = boto3.client("sagemaker")
smr_client     = boto3.client("sagemaker-runtime")

**Image URI for the DJL container is being used here**

All available images on SageMaker:

https://github.com/aws/deep-learning-containers/blob/master/available_images.md

In [9]:
#Note that: you can modify the image url according to your specific region.
inference_image_uri = "763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.23.0-deepspeed0.9.5-cu118"
print(f"Image going to be used is ---- > {inference_image_uri}")

Image going to be used is ---- > 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.23.0-deepspeed0.9.5-cu118
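
If the notebook is run in another region, the URI can be assembled from the session's region instead of being hardcoded. This is a minimal sketch that assumes the same registry account and image tag shown above; some regions use a different account ID or domain suffix, so check the available images list linked above before relying on it. The function name `djl_image_uri` is hypothetical.

```python
def djl_image_uri(region: str,
                  tag: str = "0.23.0-deepspeed0.9.5-cu118",
                  account: str = "763104351884") -> str:
    """Build the DJL inference image URI for a region (account and tag are assumptions)."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/djl-inference:{tag}"


# Hypothetical usage with the region variable defined earlier in the notebook:
# inference_image_uri = djl_image_uri(region)
print(djl_image_uri("us-west-2"))
```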


In [10]:
!rm -f model.tar.gz
!tar czvf model.tar.gz src

src/
src/requirements.txt
src/model.py
src/serving.properties


In [13]:
s3_code_artifact = sess.upload_data("model.tar.gz", sagemaker_default_bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {s3_code_artifact}")

S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-west-2-928808346782/llm/models/wizardcoder/code-15b/model.tar.gz


### To create the endpoint, the steps are:

1. Create the Model using the Image container and the Model Tarball uploaded earlier
2. Create the endpoint config using the following key parameters

    a) Instance Type is ml.g5.12xlarge
    
    b) ContainerStartupHealthCheckTimeoutInSeconds is 15*60 to ensure the health check starts after the model is ready
    
3. Create the endpoint using the endpoint config created
    

One of the key parameters here is **TENSOR_PARALLEL_DEGREE**, which tells the DeepSpeed library how many GPUs to partition the model across. In this notebook it is set to 4 through `option.tensor_parallel_degree` in serving.properties. This is a tunable and configurable parameter.

This parameter also controls the number of workers per model that are started when DJL Serving runs. For example, if we have an 8 GPU machine and we create 8 partitions, then we will have 1 worker per model to serve the requests. For further reading on DeepSpeed you can follow the link https://www.deepspeed.ai/tutorials/inference-tutorial/#initializing-for-inference.
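
The relationship between GPU count, partitions, and workers can be written as simple integer arithmetic. The sketch below generalizes the 8 GPU, 8 partition example from the text; the assumption that other combinations follow the same division is mine, and the helper name `workers_per_model` is hypothetical.

```python
def workers_per_model(num_gpus: int, tensor_parallel_degree: int) -> int:
    """Number of model workers that fit when each worker spans tensor_parallel_degree GPUs."""
    if not 1 <= tensor_parallel_degree <= num_gpus:
        raise ValueError("tensor_parallel_degree must be between 1 and num_gpus")
    return num_gpus // tensor_parallel_degree


print(workers_per_model(8, 8))  # the example from the text: 1 worker
print(workers_per_model(4, 4))  # the ml.g5.12xlarge setup in this notebook: 1 worker
```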

In [14]:
from sagemaker.utils import name_from_base


model_name              = name_from_base(f"wizardcoder-15b")
print(model_name)

create_model_response   = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": s3_code_artifact,
    },
)
model_arn               = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

wizardcoder-15b-2023-07-31-04-26-20-234
Created Model: arn:aws:sagemaker:us-west-2:928808346782:model/wizardcoder-15b-2023-07-31-04-26-20-234


In [15]:
endpoint_config_name     = f"{model_name}-config"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.12xlarge",
            "InitialInstanceCount": 1,
            #"VolumeSizeInGB" : 300,
            #"ModelDataDownloadTimeoutInSeconds": 15*60,
            "ContainerStartupHealthCheckTimeoutInSeconds": 15*60,
        },
    ],
)
endpoint_config_response

{'EndpointConfigArn': 'arn:aws:sagemaker:us-west-2:928808346782:endpoint-config/wizardcoder-15b-2023-07-31-04-26-20-234-config',
 'ResponseMetadata': {'RequestId': 'dc4a0bcc-1deb-4bce-9010-981d1b0ecc1b',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'dc4a0bcc-1deb-4bce-9010-981d1b0ecc1b',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '127',
   'date': 'Mon, 31 Jul 2023 04:26:28 GMT'},
  'RetryAttempts': 0}}

In [16]:
endpoint_name            = f"{model_name}-endpoint"

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=f"{endpoint_name}", EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

Created Endpoint: arn:aws:sagemaker:us-west-2:928808346782:endpoint/wizardcoder-15b-2023-07-31-04-26-20-234-endpoint


#### Wait for the endpoint to be created.

### This step can take about 15 minutes or longer, so please be patient

In [17]:
import time


resp   = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Failed
Arn: arn:aws:sagemaker:us-west-2:928808346782:endpoint/wizardcoder-15b-2023-07-31-04-26-20-234-endpoint
Status: Failed
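
When the status ends in `Failed`, as it does above, the `describe_endpoint` response usually carries a `FailureReason` field, and the container logs are written to CloudWatch under `/aws/sagemaker/Endpoints/<endpoint name>`. The polling loop can be extended to surface that reason and to stop after a timeout. The sketch below takes the describe call as a function argument so it can be exercised without AWS access; the helper name `wait_for_endpoint` and its parameters are hypothetical.

```python
import time
from typing import Callable, Dict


def wait_for_endpoint(describe: Callable[[], Dict],
                      poll_seconds: float = 60,
                      timeout_seconds: float = 3600,
                      sleep: Callable[[float], None] = time.sleep) -> Dict:
    """Poll describe() until the endpoint leaves 'Creating' or the timeout elapses."""
    waited = 0.0
    resp = describe()
    while resp["EndpointStatus"] == "Creating" and waited < timeout_seconds:
        sleep(poll_seconds)
        waited += poll_seconds
        resp = describe()
    if resp["EndpointStatus"] == "Failed":
        # FailureReason is only present when SageMaker reports one
        print("FailureReason:", resp.get("FailureReason", "<not provided>"))
    return resp


# Hypothetical usage with the clients defined earlier in the notebook:
# resp = wait_for_endpoint(lambda: sm_client.describe_endpoint(EndpointName=endpoint_name))
# print("Status:", resp["EndpointStatus"])
```

Injecting `sleep` keeps the loop testable and makes it easy to shorten the polling interval during experiments.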
