# Deploy the CPM-Bee 10B model on Amazon SageMaker

Now that we have fine-tuned the model, we will show you how to deploy it on SageMaker.

In this notebook, we explore how to host a large language model on SageMaker using the [Large Model Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-inference.html) container that is optimized for hosting large models using DJLServing. DJLServing is a high-performance universal model serving solution powered by the Deep Java Library (DJL) that is programming language agnostic. To learn more about DJL and DJLServing, you can refer to our recent [blog post](https://aws.amazon.com/blogs/machine-learning/deploy-large-models-on-amazon-sagemaker-using-djlserving-and-deepspeed-model-parallel-inference/).

## Create a SageMaker Model for Deployment
As a first step, we'll import the relevant libraries and configure several global variables, such as the hosting image that will be used and the S3 location of our model artifacts.

In [None]:
import sagemaker
from sagemaker.model import Model
from sagemaker import serializers, deserializers
from sagemaker import image_uris
import boto3
import os
import time
import json
import jinja2
from pathlib import Path

In [None]:
role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts

region = sess.boto_region_name  # region name of the current SageMaker Studio environment
account_id = sess.account_id()  # account_id of the current SageMaker Studio environment

s3_client = boto3.client("s3")  # client to interact with the S3 API
sm_client = boto3.client("sagemaker")  # client to interact with SageMaker
smr_client = boto3.client("sagemaker-runtime")  # client to interact with SageMaker Endpoints
jinja_env = jinja2.Environment() # jinja environment to generate model configuration templates

In [None]:
# lookup the inference image uri based on our current region
djl_inference_image_uri = (
    f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.21.0-deepspeed0.8.3-cu117"
)
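The image URI above follows the usual ECR pattern of registry account, Region, repository, and tag. As a minimal sketch, the hypothetical helper `djl_image_uri` below (not part of the SageMaker SDK) makes those parts explicit. The default account ID and tag are copied from the cell above; the registry account can differ in some Regions, so treat the defaults as assumptions to verify for your Region.

```python
def djl_image_uri(
    region: str,
    tag: str = "0.21.0-deepspeed0.8.3-cu117",
    account: str = "763104351884",
) -> str:
    """Assemble the ECR URI of the DJL inference image (illustrative helper)."""
    # <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
    return f"{account}.dkr.ecr.{region}.amazonaws.com/djl-inference:{tag}"


print(djl_image_uri("us-west-2"))
```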

**The model we tested is [openbmb/cpm-bee-10b](https://huggingface.co/openbmb/cpm-bee-10b). To host it, download the model first and upload it to S3; the commented-out cells below show one way to do this.**

In [None]:
# !pip install huggingface_hub

In [None]:
# from huggingface_hub import snapshot_download
# from pathlib import Path

# local_cache_path = Path("./model")
# local_cache_path.mkdir(exist_ok=True)

# model_name = "openbmb/cpm-bee-10b"

# # Only download pytorch checkpoint files
# allow_patterns = ["*.json", "*.pt", "*.bin", "*.model"]

# model_download_path = snapshot_download(
#     repo_id=model_name,
#     cache_dir=local_cache_path,
#     allow_patterns=allow_patterns,
# )

In [None]:
# # Get the model files path
# import os
# from glob import glob

# local_model_path = None

# paths = os.walk(r'./model')
# for root, dirs, files in paths:
#     for file in files:
#         if file == 'config.json':
#             print(os.path.join(root,file))
#             local_model_path = str(os.path.join(root,file))[0:-11]
#             print(local_model_path)
# if local_model_path is None:
#     print("Model download may have failed; please check the previous step!")
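The directory walk above locates the downloaded snapshot by looking for `config.json` and slicing the file name off the path. A minimal alternative sketch with `pathlib` is shown here; `find_model_dir` is a hypothetical helper name, and it assumes the first `config.json` found under the cache directory belongs to the model.

```python
from pathlib import Path
from typing import Optional


def find_model_dir(root: str) -> Optional[Path]:
    """Return the first directory under `root` that contains config.json, else None."""
    for config_file in sorted(Path(root).rglob("config.json")):
        # The parent directory of config.json is the model directory
        return config_file.parent
    return None
```

After the download cell has run, `local_model_path = find_model_dir("./model")` would give the directory to sync to S3, or `None` if the download did not complete.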

In [None]:
# %%script env sagemaker_default_bucket=$bucket local_model_path=$local_model_path bash

# chmod +x ./s5cmd
# ./s5cmd sync ${local_model_path} s3://${sagemaker_default_bucket}/llm/models/cpm-bee/10B/

In [None]:
# # remove model artifact from local notebook storage
# !rm -rf model

In [None]:
pretrained_model_location = "s3://sagemaker-us-west-2-169088282855/llm/models/cpm-bee/10B/"  # change this to the S3 path of your model artifacts, e.g. the output of the fine-tuning job
print(f"Pretrained model will be downloaded from ---- > {pretrained_model_location}")
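Because the S3 location has to be edited by hand, it can help to validate it before deploying. The sketch below splits an S3 URI into bucket and key prefix using only the standard library. `split_s3_uri` is a hypothetical helper, and the bucket name in the example call is a placeholder.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/prefix/' into (bucket, prefix); raise on anything else."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# "example-bucket" is a placeholder, not a real bucket
bucket_name, prefix = split_s3_uri("s3://example-bucket/llm/models/cpm-bee/10B/")
print(bucket_name, prefix)
```

With the parts separated, a call such as `s3_client.list_objects_v2(Bucket=bucket_name, Prefix=prefix, MaxKeys=1)` can confirm that at least one object exists under the prefix.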

## Deploying a Large Language Model using Hugging Face Accelerate
The DJL Inference Image which we will be utilizing ships with a number of built-in inference handlers for a wide variety of tasks including:
- `text-generation`
- `question-answering`
- `text-classification`
- `token-classification`

You can refer to this [GitHub repository](https://github.com/deepjavalibrary/djl-serving/tree/master/engines/python/setup/djl_python) for a list of additional handlers and available NLP tasks. <br>
These handlers can be utilized as is without having to write any custom inference code. We simply need to create a `serving.properties` text file with our desired hosting options and package it up into a `tar.gz` artifact.

Let's take a look at the `serving.properties` file that we'll be using for our first example.

In [None]:
!mkdir accelerate_src

**IMPORTANT** The `option.tensor_parallel_degree` option specifies how many GPUs are used to load a single copy of the model. A 10B-parameter model needs at least 20 GB of GPU memory in fp16, so here we set the value to 4 and host the model across 4 GPUs.
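The 20 GB figure comes from simple arithmetic: number of parameters times bytes per parameter. The sketch below reproduces it and shows the per-GPU share when the weights are split evenly over 4 GPUs. It counts weights only; activations and framework overhead are deliberately ignored, so real usage will be higher. `weight_memory_gb` is an illustrative name.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


num_params = 10e9  # 10B parameters
num_gpus = 4       # matches option.tensor_parallel_degree in this notebook

total_fp16 = weight_memory_gb(num_params, 2)  # fp16 uses 2 bytes per parameter
total_fp32 = weight_memory_gb(num_params, 4)  # fp32 uses 4 bytes per parameter
per_gpu_fp16 = total_fp16 / num_gpus

print(f"fp16 weights: {total_fp16:.0f} GB total, {per_gpu_fp16:.0f} GB per GPU")
print(f"fp32 weights: {total_fp32:.0f} GB total")
```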

In [None]:
%%writefile accelerate_src/requirements.txt
transformers==4.30.2
accelerate==0.20.3

In [None]:
%%writefile accelerate_src/serving.template
engine=Python
# option.entryPoint=djl_python.huggingface
option.s3url={{ s3url }}
option.task=text-generation
option.device_map=auto
option.dtype=fp16
option.tensor_parallel_degree=4

In [None]:
# we plug the S3 location of the model artifacts into our `serving.properties` file
template = jinja_env.from_string(Path("accelerate_src/serving.template").read_text())
Path("accelerate_src/serving.properties").write_text(template.render(s3url=pretrained_model_location))
!pygmentize accelerate_src/serving.properties | cat -n

There are a few options specified here. Let's go through them in turn.<br>
1. `engine` - specifies the engine that will be used for this workload. In this case we'll be hosting a model using the [DJL Python Engine](https://github.com/deepjavalibrary/djl-serving/tree/master/engines/python)
2. `option.entryPoint` - specifies the entrypoint code that will be used to host the model. `djl_python.huggingface` refers to the `huggingface.py` module from the [djl_python repo](https://github.com/deepjavalibrary/djl-serving/tree/master/engines/python/setup/djl_python). It is commented out in the template above because this notebook packages its own `model.py` handler instead.
3. `option.s3url` - specifies the location of the model files. Alternatively, an `option.model_id` option can be used to specify a model from the Hugging Face Hub (e.g. `EleutherAI/gpt-j-6B`), in which case the model is downloaded automatically from the Hub. The `s3url` approach is recommended because it lets you host the model artifact within your own environment and enables faster deployments by using the optimized transfer from S3 to the hosting instance that is built into the DJL inference container.
4. `option.task` - This is specific to the `huggingface.py` inference handler and specifies for which task this model will be used
5. `option.device_map` - Enables layer-wise model partitioning through [Hugging Face Accelerate](https://huggingface.co/docs/accelerate/usage_guides/big_modeling#designing-a-device-map). With `option.device_map=auto`, Accelerate will determine where to put each **layer** to maximize the use of your fastest devices (GPUs) and offload the rest on the CPU, or even the hard drive if you don’t have enough GPU RAM (or CPU RAM). Even if the model is split across several devices, it will run as you would normally expect.
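As an alternative to `auto`, a device map can also be written by hand as a plain dictionary from module name to GPU index, which is what the custom `model.py` handler in this notebook does. The sketch below builds such a map by assigning layers round-robin. `round_robin_device_map` is a hypothetical helper, and the module names are the ones used by this notebook's handler, so they are assumptions about the CPM-Bee module layout.

```python
from typing import Dict


def round_robin_device_map(
    num_layers: int, num_gpus: int, layer_prefix: str, pinned: Dict[str, int]
) -> Dict[str, int]:
    """Map each layer to a GPU index in round-robin order, keeping pinned modules as given."""
    device_map = dict(pinned)
    for i in range(num_layers):
        device_map[f"{layer_prefix}.{i}"] = i % num_gpus
    return device_map


device_map = round_robin_device_map(
    num_layers=48,
    num_gpus=4,
    layer_prefix="cpmbee.encoder.layers",
    pinned={"cpmbee.input_embedding": 0, "lm_head": 0},
)
print(device_map["cpmbee.encoder.layers.5"])
```

Round-robin placement gives every GPU the same number of layers, which keeps memory use balanced when all layers have the same size.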

For more information on the available options, please refer to the [SageMaker Large Model Inference Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html)
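Since `serving.properties` is a plain `key=value` file, the rendered result is easy to sanity-check before packaging. The following sketch parses such text into a dictionary and skips comment lines. `parse_properties` is a hypothetical helper, and the bucket name in the sample text is a placeholder.

```python
from typing import Dict


def parse_properties(text: str) -> Dict[str, str]:
    """Parse key=value lines, ignoring blank lines and lines starting with '#'."""
    props = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"Expected key=value, got: {raw_line!r}")
        props[key.strip()] = value.strip()
    return props


# Sample rendered file; "example-bucket" is a placeholder
rendered = """engine=Python
# option.entryPoint=djl_python.huggingface
option.s3url=s3://example-bucket/llm/models/cpm-bee/10B/
option.task=text-generation
option.device_map=auto
option.dtype=fp16
option.tensor_parallel_degree=4
"""

props = parse_properties(rendered)
assert props["option.s3url"].startswith("s3://"), "s3url was not rendered"
print(props)
```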

Our initial approach here is to utilize the built-in functionality within Hugging Face Transformers to enable Large Language Model hosting. 

We place the `serving.properties` file, together with `requirements.txt` and the custom `model.py` handler defined below, into a tarball and upload it to S3.

In [None]:
%%writefile accelerate_src/model.py
from djl_python import Input, Output
import os
import logging
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import dispatch_model

model = None
tokenizer = None

def load_model(properties):
    # Number of GPUs to spread the model over; int() also covers a value passed as a string
    num_gpus = int(properties["tensor_parallel_degree"])
    model_location = properties['model_dir']
    if "model_id" in properties:
        model_location = properties['model_id']
    logging.info(f"Loading model in {model_location}")

    # trust_remote_code=True lets Transformers run the modeling code shipped with the checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_location, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_location, trust_remote_code=True)

    # for HF accelerate inference: pin the embedding, position bias, LM head
    # and output layer norm to GPU 0
    device_map = {
        "cpmbee.input_embedding": 0,
        "cpmbee.position_bias": 0,
        "lm_head": 0,
        "cpmbee.encoder.output_layernorm": 0
    }

    # spread the 48 encoder layers round-robin across the GPUs
    for i in range(48):
        device_map["cpmbee.encoder.layers.{}".format(i)] = i % num_gpus

    model = dispatch_model(model, device_map=device_map)

    return model, tokenizer


def handle(inputs: Input):  # returns an Output, or None for the warm-up call
    global model, tokenizer
    try:
        if not model:
            model,tokenizer = load_model(inputs.get_properties())

        #print(inputs)
        if inputs.is_empty():
            # Model server makes an empty call to warmup the model on startup
            return None
        
        if inputs.is_batch():
            #the demo code is just suitable for single sample per client request
            bs = inputs.get_batch_size()
            logging.info(f"Dynamic batching size: {bs}.")
            batch = inputs.get_batches()
            #print(batch)
            tmp_inputs = []
            for _, item in enumerate(batch):
                tmp_item = item.get_as_json()
                tmp_inputs.append(tmp_item.get("inputs"))
            
            #For server side batch, we just use the custom generation parameters for single Sagemaker Endpoint.
            result = model.generate(tmp_inputs, tokenizer)
            
            outputs = Output()
            for i in range(len(result)):
                outputs.add(result[i], key="generate_text", batch_index=i)
            return outputs
        else:
            inputs = inputs.get_as_json()
            if not inputs.get("inputs"):
                return Output().add_as_json({"code":-1,"msg":"input field can't be null"})

            #input data
            data = inputs.get("inputs")
            params = inputs.get("parameters",{})

            #for pure client side batch
            if isinstance(data, str):
                bs = 1
            elif isinstance(data, list):
                bs = len(data)
            else:
                return Output().add_as_json({"code":-1,"msg": "input has wrong type"})

            logging.info(f"Client-side batch size: {bs}")
            #predictor
            result = model.generate(data, tokenizer, **params)

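            #return the result as JSON, consistent with the error paths
            return Output().add_as_json({"code":0,"msg":"ok","data":result})
    except Exception as e:
        # exceptions are not JSON serializable, so return the message as a string
        logging.exception("Inference failed")
        return Output().add_as_json({"code":-1,"msg":str(e)})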

In [None]:
!tar czvf acc_model.tar.gz accelerate_src/ 

In [None]:
s3_code_prefix = "cpm-bee/deploy/code/10B"

code_artifact = sess.upload_data("acc_model.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")

## Deploy Model to a SageMaker Endpoint
With a helper function, we can now deploy our endpoint and invoke it with some sample inputs.

In [None]:
def deploy_model(image_uri, model_data, role, endpoint_name, instance_type, sagemaker_session):
    """Helper function to create the SageMaker Endpoint resources and return a predictor"""
    model = Model(
        image_uri=image_uri,
        model_data=model_data,
        role=role,
        sagemaker_session=sagemaker_session,
    )

    model.deploy(
        initial_instance_count=1,
        instance_type=instance_type,
        endpoint_name=endpoint_name,
        # model_data_download_timeout=60 * 15,
        container_startup_health_check_timeout=60 * 15,  # allow up to 15 minutes for the model to load
    )

    # our requests and responses will be in json format so we specify the serializer and the deserializer
    predictor = sagemaker.Predictor(
        endpoint_name=endpoint_name,
        sagemaker_session=sagemaker_session,
        serializer=serializers.JSONSerializer(),
        deserializer=deserializers.JSONDeserializer(),
    )

    return predictor

In [None]:
# creates a unique endpoint name
endpoint_name = sagemaker.utils.name_from_base("cpm-bee-10B")
print(f"Our endpoint will be called {endpoint_name}")

In [None]:
# deployment will take about 10 minutes
predictor = deploy_model(image_uri=djl_inference_image_uri, 
                            model_data=code_artifact, 
                            role=role, 
                            endpoint_name=endpoint_name, 
                            instance_type="ml.g5.12xlarge", 
                            sagemaker_session=sess)
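If the notebook kernel disconnects while the endpoint is being created, the endpoint status can be checked separately. The following sketch polls a describe function until the endpoint reaches `InService` or `Failed`. `wait_for_endpoint` is a hypothetical helper; it takes the describe function as an argument so it can be tried without AWS access, and in this notebook that argument would be `sm_client.describe_endpoint`.

```python
import time
from typing import Callable


def wait_for_endpoint(
    describe: Callable[..., dict],
    endpoint_name: str,
    poll_seconds: float = 30.0,
    timeout_seconds: float = 1800.0,
) -> str:
    """Poll until the endpoint is InService or Failed and return that status."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = describe(EndpointName=endpoint_name)["EndpointStatus"]
        if status in ("InService", "Failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{endpoint_name} still {status} after {timeout_seconds}s")
        time.sleep(poll_seconds)
```

Usage would look like `wait_for_endpoint(sm_client.describe_endpoint, endpoint_name)`. boto3 also offers a built-in waiter, `sm_client.get_waiter("endpoint_in_service")`, that serves the same purpose.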

Let's run an example with two CPM-Bee prompts: a text continuation request and a fill-in-the-blank request that uses `<mask_n>` placeholders.

In [None]:
%%time

data_list = [
    # English gloss of the two Chinese samples below:
    #   1. input: "The weather today is really", prompt: "Continue with two more sentences"
    #   2. a weather advisory for Beijing and Tianjin (April 12, wind and dust) with four <mask_n> blanks to fill in
    {"input": "今天天气是真的", "prompt": "往后写两句话", "<ans>": ""},
    {"input": "北京市气象台提示，4月12日午后偏南风加大，阵风可达6级左右，南下的沙尘可能伴随回流北上进京，外出仍需注意<mask_0>，做好健康防护。天津市气象台也提示，受<mask_1>影响，我市4月12日有浮尘天气，PM10浓度<mask_2>。请注意关好门窗，老人儿童尽量减少户外活动，外出注意带好<mask_3>。” ","<ans>":{"<mask_0>":"","<mask_1>":"","<mask_2>":"","<mask_3>":""}},
]

result = predictor.predict({ 
                    "inputs" : data_list, 
                    "parameters": {"max_new_tokens": 50, "repetition_penalty": 1.1, "temperature": 0.5}
                })

for res in result["data"]:
    print(res)
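The handler in `model.py` expects a JSON body with an `inputs` field (a string or a list) and an optional `parameters` field, and it answers with a `code`, `msg`, and `data` envelope. The sketch below mirrors that contract on the client side. `build_payload` and `unwrap_response` are hypothetical helper names, not part of the SageMaker SDK.

```python
def build_payload(inputs, **parameters) -> dict:
    """Build the request body the handler expects; reject empty or wrongly typed inputs."""
    if not isinstance(inputs, (str, list)) or not inputs:
        raise ValueError("inputs must be a non-empty string or list")
    payload = {"inputs": inputs}
    if parameters:
        payload["parameters"] = parameters
    return payload


def unwrap_response(response: dict):
    """Return the 'data' field, or raise if the handler reported an error code."""
    if response.get("code") != 0:
        raise RuntimeError(f"Endpoint returned an error: {response.get('msg')}")
    return response["data"]
```

With these helpers, the call above could be written as `unwrap_response(predictor.predict(build_payload(data_list, max_new_tokens=50)))`, which surfaces handler errors as exceptions instead of a `KeyError` on `result["data"]`.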

## Clean up the endpoint to save cost

In [None]:
# Delete the endpoint so that it stops incurring charges
predictor.delete_endpoint()

## Reference

[sagemaker-hosting/Large-Language-Model-Hosting/](https://github.com/aws-samples/sagemaker-hosting/tree/main/Large-Language-Model-Hosting)