# Deploy the aaditya/Llama3-OpenBioLLM-70B Model with High Performance on SageMaker Using SageMaker LMI and Rolling Batch

ref:  
[https://huggingface.co/aaditya/Llama3-OpenBioLLM-70B](https://huggingface.co/aaditya/Llama3-OpenBioLLM-70B)

[https://aws.amazon.com/blogs/machine-learning/meta-llama-3-models-are-now-available-in-amazon-sagemaker-jumpstart/](https://aws.amazon.com/blogs/machine-learning/meta-llama-3-models-are-now-available-in-amazon-sagemaker-jumpstart/)

[https://github.com/aws/amazon-sagemaker-examples/blob/main/inference/generativeai/llm-workshop/lab11-llama2/meta-llama-2-70b-lmi.ipynb](https://github.com/aws/amazon-sagemaker-examples/blob/main/inference/generativeai/llm-workshop/lab11-llama2/meta-llama-2-70b-lmi.ipynb)



In [None]:
!pip install -qU sagemaker boto3 huggingface_hub

In [None]:
import sagemaker
import jinja2
from sagemaker import image_uris
import boto3
import os
import time
import json
from pathlib import Path

In [None]:
role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts

In [None]:
model_bucket = sess.default_bucket()  # bucket to house model artifacts
s3_code_prefix = "hf-large-model-djl/aaditya/Llama3-OpenBioLLM-70B/code"  # folder within bucket where code artifact will go

s3_model_prefix = "models/aaditya/Llama3-OpenBioLLM-70B"  # folder within bucket where model artifact will go
region = sess.boto_region_name  # public property; avoids relying on the private _region_name attribute
account_id = sess.account_id()

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

jinja_env = jinja2.Environment()

### [OPTIONAL] Download the model from Hugging Face and upload the model artifacts on Amazon S3

If you want to download your own copy of the model and upload it to an S3 location in your AWS account, follow the steps below; otherwise, skip to the next step.

In [None]:
from huggingface_hub import snapshot_download
from pathlib import Path
import os

# - This downloads the model into the directory where the Jupyter notebook is running
local_model_path = Path(".")
local_model_path.mkdir(exist_ok=True)
model_name = "aaditya/Llama3-OpenBioLLM-70B"
# Only download the model weights, tokenizer, and configuration files
allow_patterns = ["*.json", "*.txt", "*.model", "*.safetensors", "*.bin", "*.chk", "*.pth"]

# - Use snapshot_download to fetch the model, since the repository stores the weights using LFS
model_download_path = snapshot_download(
    repo_id=model_name, cache_dir=local_model_path, allow_patterns=allow_patterns
)

In [None]:
# Upload the files from local storage to S3 (s5cmd is a faster alternative for large models)
pretrained_model_location = sess.upload_data(path=model_download_path, key_prefix=s3_model_prefix)
print(f"Model uploaded to --- > {pretrained_model_location}")
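The upload comment above mentions `s5cmd` as an alternative. A minimal sketch of how such a command could be assembled is shown below. It assumes `s5cmd` is installed on the notebook instance and that its `cp` subcommand is used to copy a directory tree; the helper name `build_s5cmd_upload`, the local path, and the bucket and prefix are hypothetical placeholders.

```python
import shlex
from pathlib import Path
from typing import List


def build_s5cmd_upload(local_dir: str, bucket: str, prefix: str) -> List[str]:
    """Build an `s5cmd cp` argument list that copies a directory tree to S3."""
    # Trailing slash on the source: copy the directory contents, keeping the hierarchy
    source = str(Path(local_dir)) + "/"
    destination = f"s3://{bucket}/{prefix.strip('/')}/"
    return ["s5cmd", "cp", source, destination]


# Placeholder values for illustration only
cmd = build_s5cmd_upload("models--demo/snapshots/abc", "my-bucket", "models/demo")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the paths point at the real download directory and bucket.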

In [None]:
#Cleanup locally stored model files post S3 upload
#!rm -rf {model_download_path}

### Define a variable to contain the s3url of the location that has the model

In [None]:
# If you skipped the optional download above, uncomment the next line and point it at the S3 location that holds the model.
#pretrained_model_location = f"s3://{model_bucket}/models/aaditya/Llama3-OpenBioLLM-70B"

## Create a SageMaker-compatible model artifact and upload it to S3

SageMaker Large Model Inference containers can be used to host models without providing your own inference code. This is extremely useful when there is no custom pre-processing of the input data or post-processing of the model's predictions.

SageMaker needs the model artifacts to be packaged as a tarball. In this example, the tarball contains a single file, `serving.properties`.

The tarball has the following layout:

```
code
└── serving.properties
```

- `serving.properties` is the configuration file used to configure the model server.


#### Create serving.properties 
This is a configuration file to indicate to DJL Serving which model parallelization and inference optimization libraries you would like to use. Depending on your need, you can set the appropriate configuration.

The settings used in this configuration file include:

- `engine`: The engine for DJL to use. In this case, it is set to MPI.
- `option.model_id`: The model ID of a pretrained model hosted in a model repository on [huggingface.co](https://huggingface.co/models), or the S3 path to the model artifacts.
- `option.tensor_parallel_degree`: The number of GPU devices over which the model is partitioned. This parameter also controls the number of workers per model that are started when DJL Serving runs. For example, on a 4-GPU machine with 4 partitions, there is 1 worker per model to serve requests.

For more details on the configuration options and an exhaustive list, refer to the documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html.
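The relationship between GPU count, tensor parallel degree, and workers per model described above can be sketched as simple arithmetic. This is an illustration only: it assumes each worker occupies exactly `tensor_parallel_degree` GPUs, and the helper names and the S3 path are hypothetical, not part of DJL Serving.

```python
def workers_per_model(num_gpus: int, tensor_parallel_degree: int) -> int:
    """Number of model workers when each worker spans `tensor_parallel_degree` GPUs."""
    if tensor_parallel_degree < 1 or num_gpus % tensor_parallel_degree != 0:
        raise ValueError("tensor_parallel_degree must evenly divide the GPU count")
    return num_gpus // tensor_parallel_degree


def render_serving_properties(options: dict) -> str:
    """Render key=value lines in the order the options were given."""
    return "\n".join(f"{key}={value}" for key, value in options.items()) + "\n"


options = {
    "engine": "MPI",
    "option.model_id": "s3://my-bucket/models/demo/",  # placeholder location
    "option.tensor_parallel_degree": 8,
}
print(render_serving_properties(options), end="")

# The 4-GPU example from the text, and the 8-GPU ml.g5.48xlarge used in this notebook
print(workers_per_model(num_gpus=4, tensor_parallel_degree=4))
print(workers_per_model(num_gpus=8, tensor_parallel_degree=8))
```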



In [None]:
!rm -rf code_aaditya_Llama3_OpenBioLLM_70B
!mkdir -p code_aaditya_Llama3_OpenBioLLM_70B

In [None]:
%%writefile code_aaditya_Llama3_OpenBioLLM_70B/serving.properties
engine=MPI
option.tensor_parallel_degree=8
option.rolling_batch=auto
option.max_rolling_batch_size=4
option.model_loading_timeout=3600
option.model_id={{model_id}}
option.paged_attention=true
option.trust_remote_code=true
option.dtype=fp16
option.max_rolling_batch_prefill_tokens=8709
option.enable_streaming=True

In [None]:
# we plug in the appropriate model location into our `serving.properties`
properties_path = Path("code_aaditya_Llama3_OpenBioLLM_70B/serving.properties")
# read_text/write_text open and close the file for us, so no file handles are left dangling
template = jinja_env.from_string(properties_path.read_text())
properties_path.write_text(template.render(model_id=pretrained_model_location))
!pygmentize code_aaditya_Llama3_OpenBioLLM_70B/serving.properties | cat -n
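Before packaging the file, it can be worth checking that the template was actually rendered. The sketch below parses the `key=value` lines and looks for leftover `{{...}}` placeholders; the helper names and the sample strings are hypothetical and only illustrate the idea.

```python
import re


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and # comments."""
    properties = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        properties[key.strip()] = value.strip()
    return properties


def unrendered_placeholders(text: str) -> list:
    """Return the names of any {{placeholder}} markers still present."""
    return re.findall(r"\{\{\s*(\w+)\s*\}\}", text)


# Sample inputs for illustration; in the notebook, read the real serving.properties instead
raw = "engine=MPI\noption.model_id={{model_id}}\n"
rendered = "engine=MPI\noption.model_id=s3://my-bucket/models/demo/\n"

print(unrendered_placeholders(raw))
print(unrendered_placeholders(rendered))
print(parse_properties(rendered)["option.model_id"])
```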

**Retrieve the image URI for the DJL container**

In [None]:
inference_image_uri = image_uris.retrieve(
    framework="djl-deepspeed", region=region, version="0.27.0"
)
print(f"Image going to be used is ---- > {inference_image_uri}")

**Create the Tarball and then upload to S3 location**

In [None]:
!rm -f model.tar.gz
!tar czvf model.tar.gz code_aaditya_Llama3_OpenBioLLM_70B

In [None]:
s3_code_artifact = sess.upload_data("model.tar.gz", bucket, s3_code_prefix)
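As an alternative to the shell commands, the tarball can be built and inspected with Python's standard `tarfile` module. The sketch below is a hedged equivalent, not the notebook's method; it works on a temporary directory with a hypothetical `code_demo` folder so it can run without touching the real artifacts.

```python
import tarfile
import tempfile
from pathlib import Path


def make_model_tarball(code_dir, tar_path) -> list:
    """Create a gzip tarball of code_dir and return the sorted member names."""
    code_dir = Path(code_dir)
    with tarfile.open(tar_path, "w:gz") as tar:
        # arcname keeps only the folder name, not the full local path
        tar.add(code_dir, arcname=code_dir.name)
    with tarfile.open(tar_path, "r:gz") as tar:
        return sorted(tar.getnames())


with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "code_demo"  # hypothetical folder for illustration
    demo_dir.mkdir()
    (demo_dir / "serving.properties").write_text("engine=MPI\n")
    members = make_model_tarball(demo_dir, Path(tmp) / "model.tar.gz")

print(members)
```

Listing the members before uploading is a quick way to confirm that `serving.properties` sits where the container expects it.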

### Steps to create the endpoint

1. Create the model using the container image and the model tarball uploaded earlier.
2. Create the endpoint config using the following key parameters:

    a) Instance type is ml.g5.48xlarge.

    b) ContainerStartupHealthCheckTimeoutInSeconds is 3600, so that the container has enough time to load the model before the startup health check times out.
3. Create the endpoint using the endpoint config created.


#### Create the Model
Use the image URI for the DJL container and the s3 location to which the tarball was uploaded.

The container downloads the model into the `/tmp` space on the instance, because SageMaker maps `/tmp` to the Amazon Elastic Block Store (Amazon EBS) volume that is mounted when we specify the endpoint creation parameter `VolumeSizeInGB`.
It uses [`s5cmd`](https://github.com/peak/s5cmd), which offers very fast download speeds and is therefore extremely useful when downloading large models.

For instances such as p4d, which come with built-in instance storage volumes, we can continue to use `/tmp` on the container. The size of this mount is large enough to hold the model.


In [None]:
from sagemaker.utils import name_from_base

model_name = name_from_base("aaditya-Llama3-OpenBioLLM-70B")
print(model_name)

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": s3_code_artifact,
        "Environment": {"MODEL_LOADING_TIMEOUT": "900"},
    },
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

In [None]:
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.48xlarge",
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": 900,
            "ContainerStartupHealthCheckTimeoutInSeconds": 3600,  # matches the 3600 s stated in the steps above; loading a 70B model can exceed 900 s
        },
    ],
)
endpoint_config_response

In [None]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

### This step can take ~20 minutes or longer, so please be patient

In [None]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)
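The polling loop above exits on any status other than `Creating`, so a failed endpoint is only visible in the printed status. A minimal sketch of a stricter wait is shown below. It assumes the `describe_endpoint` response carries `EndpointStatus` and, on failure, `FailureReason`; the function name `wait_for_endpoint` and the fake responses are hypothetical and exist only so the example runs without AWS access.

```python
import time


def wait_for_endpoint(describe, endpoint_name, poll_seconds=60,
                      timeout_seconds=3600, sleep=time.sleep):
    """Poll until the endpoint is InService; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        resp = describe(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        if status == "InService":
            return resp
        if status != "Creating":
            reason = resp.get("FailureReason", "no reason given")
            raise RuntimeError(f"Endpoint ended in status {status}: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint still Creating after {timeout_seconds} s")
        sleep(poll_seconds)


# Fake describe call for illustration: one Creating response, then InService
_fake_responses = iter([
    {"EndpointStatus": "Creating"},
    {"EndpointStatus": "InService", "EndpointArn": "arn:demo"},
])
result = wait_for_endpoint(lambda EndpointName: next(_fake_responses), "demo",
                           sleep=lambda seconds: None)
print(result["EndpointStatus"])
```

In the notebook this would be called as `wait_for_endpoint(sm_client.describe_endpoint, endpoint_name)`. Boto3 also provides a built-in waiter, `sm_client.get_waiter("endpoint_in_service")`, which serves a similar purpose.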

#### While you wait for the endpoint to be created, you can read more about:
- [Deep Learning containers for large model inference](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-dlc.html)

#### Use Boto3 to invoke the endpoint

This is a generative model, so we pass in text as a prompt, and the model completes it and returns the result.

You pass a prompt to the model by setting `inputs` in the request body. The model then returns a result for each prompt. Text generation can be configured using appropriate parameters.
These parameters are passed to the endpoint as a dictionary under `parameters`. Refer to the documentation at https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig for more details.

The code sample below invokes the endpoint with a text prompt and sets some of these parameters.

In [None]:
%%time
# Replace with your own endpoint name, or remove this line to use the endpoint created above
endpoint_name = "aaditya-Llama3-OpenBioLLM-70B-2024-05-21-01-28-34-858-endpoint"
# a single string
inputs = "You are an expert and experienced from the healthcare and biomedical domain with extensive medical knowledge and practical experience. Your name is OpenBioLLM, and you were developed by Saama AI Labs. who's willing to help answer the user's query with explanation. In your explanation, leverage your deep medical expertise such as relevant anatomical structures, physiological processes, diagnostic criteria, treatment guidelines, or other pertinent medical concepts. Use precise medical terminology while still aiming to make the explanation clear and accessible to a general audience. How can i split a 3mg or 4mg waefin pill so i can get a 2.5mg pill?"
smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "inputs": inputs,
            "parameters": {
                "do_sample": True,
                "max_new_tokens": 100,
                "min_new_tokens": 100,
                "temperature": 0.3,
                "watermark": True,
            },
        }
    ),
    ContentType="application/json",
)["Body"].read().decode("utf8")
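The request body and the decoded response can be wrapped in two small helpers, sketched below. The request shape (`inputs` plus `parameters`) follows the call above. The response shape, a JSON object with a `generated_text` field that may be wrapped in a list, is an assumption and should be checked against what your container version returns; the helper names and the sample response are hypothetical.

```python
import json


def build_payload(prompt: str, **parameters) -> str:
    """Serialize a prompt and generation parameters into the request body."""
    return json.dumps({"inputs": prompt, "parameters": parameters})


def extract_generated_text(raw_body: str) -> str:
    """Pull the generated text out of a decoded response body.

    Assumes {"generated_text": ...}, optionally wrapped in a one-element list.
    """
    data = json.loads(raw_body)
    if isinstance(data, list):
        data = data[0]
    return data["generated_text"]


body = build_payload("Hello", do_sample=True, max_new_tokens=100, temperature=0.3)
print(body)

# Hypothetical response body, for illustration only
sample_response = '{"generated_text": " world"}'
print(extract_generated_text(sample_response))
```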

In [None]:
%%time
# formatted prompt in one string
inputs = "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are an expert and experienced from the healthcare and biomedical domain with extensive medical knowledge and practical experience. Your name is OpenBioLLM, and you were developed by Saama AI Labs. who's willing to help answer the user's query with explanation. In your explanation, leverage your deep medical expertise such as relevant anatomical structures, physiological processes, diagnostic criteria, treatment guidelines, or other pertinent medical concepts. Use precise medical terminology while still aiming to make the explanation clear and accessible to a general audience.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow can i split a 3mg or 4mg waefin pill so i can get a 2.5mg pill?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "inputs": inputs,
            "parameters": {
                "do_sample": True,
                "max_new_tokens": 100,
                "min_new_tokens": 100,
                "temperature": 0.3,
                "watermark": True,
            },
        }
    ),
    ContentType="application/json",
)["Body"].read().decode("utf8")

In [None]:
# Turn a list of chat messages into a Llama 3 formatted prompt
def build_llama3_prompt(messages):
    start_prompt = "<|begin_of_text|>"
    end_prompt = "<|start_header_id|>assistant<|end_header_id|>\n\n"
    conversation = []
    for index, message in enumerate(messages):
        if message["role"] == "system" and index == 0:
            # A system message is only given a header when it is the first message
            conversation.append("<|start_header_id|>system<|end_header_id|>\n\n")
            conversation.append(f"{message['content']}<|eot_id|>")
        elif message["role"] == "user":
            conversation.append("<|start_header_id|>user<|end_header_id|>\n\n")
            conversation.append(f"{message['content'].strip()}<|eot_id|>")
        elif message["role"] == "assistant":
            conversation.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
            conversation.append(f"{message['content'].strip()}<|eot_id|>")
        else:
            # Any other message (including a later system message) is appended without a header
            conversation.append(f" {message['content'].strip()}")

    # End with an open assistant header so the model generates the reply
    return start_prompt + "".join(conversation) + end_prompt
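Because the prompt builder silently appends messages with unexpected roles, or a system message that is not first, without a header, it can help to check the message list before building the prompt. The sketch below is one possible check; the rules it enforces and the name `validate_messages` are assumptions made for illustration, not requirements of the model.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_messages(messages: list) -> list:
    """Return a list of human-readable problems; an empty list means the input looks fine."""
    problems = []
    for index, message in enumerate(messages):
        role = message.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {index}: unknown role {role!r}")
        elif role == "system" and index != 0:
            problems.append(f"message {index}: system message must come first")
        if not str(message.get("content", "")).strip():
            problems.append(f"message {index}: empty content")
    if messages and messages[-1].get("role") != "user":
        problems.append("last message should come from the user")
    return problems


good = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is ubiquitination?"},
]
bad = [
    {"role": "user", "content": "What is ubiquitination?"},
    {"role": "system", "content": "You are a helpful assistant."},
]
print(validate_messages(good))
print(validate_messages(bad))
```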
 


In [None]:
%%time
# build and print the formatted prompt

messages = [
    {"role": "system", "content": "You are an expert and experienced from the healthcare and biomedical domain with extensive medical knowledge and practical experience. Your name is OpenBioLLM, and you were developed by Saama AI Labs. who's willing to help answer the user's query with explanation. In your explanation, leverage your deep medical expertise such as relevant anatomical structures, physiological processes, diagnostic criteria, treatment guidelines, or other pertinent medical concepts. Use precise medical terminology while still aiming to make the explanation clear and accessible to a general audience."},
    {"role": "user", "content": "How can i split a 3mg or 4mg waefin pill so i can get a 2.5mg pill?"},
]
#messages = "The diamondback terrapin was the first reptile to be"
#instruction = "What are some cool ideas to do in the summer?"
#messages.append({"role": "user", "content": instruction})
prompt = build_llama3_prompt(messages)
print(prompt)

In [None]:
%%time
# invoke the endpoint with a prompt built from the messages
messages = [
    {"role": "system", "content": "You are an expert and experienced from the healthcare and biomedical domain with extensive medical knowledge and practical experience. Your name is OpenBioLLM, and you were developed by Saama AI Labs. who's willing to help answer the user's query with explanation. In your explanation, leverage your deep medical expertise such as relevant anatomical structures, physiological processes, diagnostic criteria, treatment guidelines, or other pertinent medical concepts. Use precise medical terminology while still aiming to make the explanation clear and accessible to a general audience."},
    {"role": "user", "content": "How can i split a 3mg or 4mg waefin pill so i can get a 2.5mg pill?"},
]
smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "inputs": build_llama3_prompt(messages),
            "parameters": {
                "do_sample": True,
                "max_new_tokens": 100,
                "min_new_tokens": 100,
                "temperature": 0.3,
                "watermark": True,
            },
        }
    ),
    ContentType="application/json",
)["Body"].read().decode("utf8")

In [None]:
# other examples from https://huggingface.co/aaditya/Llama3-OpenBioLLM-70B
#1.Summarize Clinical Notes
messages = [
    {"role": "system", "content": "You are an expert and experienced from the healthcare and biomedical domain with extensive medical knowledge and practical experience. Your name is OpenBioLLM, and you were developed by Saama AI Labs. who's willing to help answer the user's query with explanation. In your explanation, leverage your deep medical expertise such as relevant anatomical structures, physiological processes, diagnostic criteria, treatment guidelines, or other pertinent medical concepts. Use precise medical terminology while still aiming to make the explanation clear and accessible to a general audience."},
    {"role": "user", "content": """Please summarize the key points from the following clinical note, focusing on the patient's chief complaint, relevant medical history, physical examination findings, diagnosis, and treatment plan:; keep it short and concise

```REASON FOR CONSULTATION: Abnormal echocardiogram findings and followup. Shortness of breath, congestive heart failure, and valvular insufficiency.
HISTORY OF PRESENT ILLNESS: The patient is an 86-year-old female admitted for evaluation of abdominal pain and bloody stools. The patient has colitis and also diverticulitis, undergoing treatment. During the hospitalization, the patient complains of shortness of breath, which is worsening. The patient underwent an echocardiogram, which shows severe mitral regurgitation and also large pleural effusion. This consultation is for further evaluation in this regard. As per the patient, she is an 86-year-old female, has limited activity level. She has been having shortness of breath for many years. She also was told that she has a heart murmur, which was not followed through on a regular basis.

CORONARY RISK FACTORS: History of hypertension, no history of diabetes mellitus, nonsmoker, cholesterol status unclear, no prior history of coronary artery disease, and family history noncontributory.
FAMILY HISTORY: Nonsignificant.

MEDICATIONS: Presently on Lasix, potassium supplementation, Levaquin, hydralazine 10 mg b.i.d., antibiotic treatments, and thyroid supplementation.
ALLERGIES: AMBIEN, CARDIZEM, AND IBUPROFEN.

PERSONAL HISTORY: She is a nonsmoker. Does not consume alcohol. No history of recreational drug use.``` """},
]

#2.Answer Medical Questions
messages = [
    {"role": "system", "content": "You are an expert and experienced from the healthcare and biomedical domain with extensive medical knowledge and practical experience. Your name is OpenBioLLM, and you were developed by Saama AI Labs. who's willing to help answer the user's query with explanation. In your explanation, leverage your deep medical expertise such as relevant anatomical structures, physiological processes, diagnostic criteria, treatment guidelines, or other pertinent medical concepts. Use precise medical terminology while still aiming to make the explanation clear and accessible to a general audience."},
    {"role": "user", "content": """Question: A 35-year-old woman comes to the physician because of a 1-month history of double vision, difficulty climbing stairs, and weakness when trying to brush her hair. She reports that these symptoms are worse after she exercises and disappear after she rests for a few hours. Physical examination shows drooping of her right upper eyelid that worsens when the patient is asked to gaze at the ceiling for 2 minutes. There is diminished motor strength in the upper extremities. The remainder of the examination shows no abnormalities. Which of the most likely diagnosis?"""},
]

#3.Answer Medical Questions
messages = [
    {"role": "system", "content": "You are an expert and experienced from the healthcare and biomedical domain with extensive medical knowledge and practical experience. Your name is OpenBioLLM, and you were developed by Saama AI Labs. who's willing to help answer the user's query with explanation. In your explanation, leverage your deep medical expertise such as relevant anatomical structures, physiological processes, diagnostic criteria, treatment guidelines, or other pertinent medical concepts. Use precise medical terminology while still aiming to make the explanation clear and accessible to a general audience."},
    {"role": "user", "content": """
    Question: An investigator is studying the modification of newly formed polypeptides in plated eukaryotic cells.
After the polypeptides are released from the ribosome, a chemically-tagged protein attaches covalently to lysine residues on the polypeptide chain, forming a modified polypeptide. When a barrel-shaped complex is added to the cytoplasm, the modified polypeptide lyses, resulting in individual amino acids and the chemically-tagged proteins.
Which of the following post-translational modifications has most likely occurred?

A) Glycosylation
B) Phosphorylation
C) Carboxylation
D) Ubiquitination
"""},
]

#4. etc


## Clean Up

In [None]:
# - Delete the endpoint
sm_client.delete_endpoint(EndpointName=endpoint_name)

In [None]:
# - Even if endpoint creation failed, we still want to delete the endpoint config and the model
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)
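If one of the deletions fails, for example because the endpoint was never created, the remaining resources are left behind. A best-effort variant is sketched below: it attempts all three deletions and reports which ones failed. The function name `cleanup` and the fake client are hypothetical and only let the example run without AWS access; the three delete calls mirror the ones used above.

```python
def cleanup(sm_client, endpoint_name, endpoint_config_name, model_name) -> dict:
    """Attempt every deletion and return a mapping of failed steps to error messages."""
    steps = [
        ("endpoint", sm_client.delete_endpoint, {"EndpointName": endpoint_name}),
        ("endpoint_config", sm_client.delete_endpoint_config,
         {"EndpointConfigName": endpoint_config_name}),
        ("model", sm_client.delete_model, {"ModelName": model_name}),
    ]
    errors = {}
    for label, call, kwargs in steps:
        try:
            call(**kwargs)
        except Exception as exc:  # keep going so later resources are still removed
            errors[label] = str(exc)
    return errors


class _FakeClient:
    """Stand-in for a SageMaker client; the endpoint deletion fails on purpose."""

    def __init__(self):
        self.deleted = []

    def delete_endpoint(self, EndpointName):
        raise RuntimeError("endpoint not found")

    def delete_endpoint_config(self, EndpointConfigName):
        self.deleted.append(EndpointConfigName)

    def delete_model(self, ModelName):
        self.deleted.append(ModelName)


fake = _FakeClient()
errors = cleanup(fake, "demo-endpoint", "demo-config", "demo-model")
print(errors)
print(fake.deleted)
```

In the notebook this would be called as `cleanup(sm_client, endpoint_name, endpoint_config_name, model_name)`.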