## Testing the Falcon 40B model

- Falcon 40B model : https://huggingface.co/tiiuae/falcon-40b
- Instruction-tuned model : https://huggingface.co/tiiuae/falcon-40b-instruct
- Streaming example : https://github.com/andrewgcodes/FalconStreaming/blob/main/Falcon40B_Instruct_Streaming.ipynb


### With HuggingFace LLM container
- HF LLM container : https://huggingface.co/blog/sagemaker-huggingface-llm
- HF LLM inference server code : https://github.com/huggingface/text-generation-inference


### Deploy Falcon 40B on HF LLM container
- How to deploy it on SageMaker : https://www.philschmid.de/sagemaker-falcon-llm
  - Easy to use, but this approach cannot load a model stored in S3; it only pulls models from the HF model hub
- AWS blog : https://aws.amazon.com/ko/blogs/machine-learning/announcing-the-launch-of-new-hugging-face-llm-inference-containers-on-amazon-sagemaker/
- Documentation for `HuggingFaceModel` : https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html#hugging-face-model


In [None]:
# # TODO: remove once new version is released
# !pip install -q git+https://github.com/aws/sagemaker-python-sdk --upgrade

# # install latest sagemaker SDK
# !pip install "sagemaker==2.163.0" --upgrade --quiet

!pip install sagemaker --upgrade -q
!pip install -q transformers

In [None]:
import boto3
import json
import sagemaker
from sagemaker.utils import name_from_base
from sagemaker import image_uris

In [None]:
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
sm_client = sagemaker_session.sagemaker_client
sm_runtime_client = sagemaker_session.sagemaker_runtime_client
s3_client = boto3.client('s3')

In [None]:
from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="0.8.2"
)

# print ecr image uri
print(f"llm image uri: {llm_image}")


In [None]:
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.g5.12xlarge"
# instance_type = "ml.g5.2xlarge"
# instance_type = "ml.g4dn.12xlarge"

number_of_gpu = 4
# number_of_gpu = 1

health_check_timeout = 900

# TGI config
config = {
  'HF_MODEL_ID': "tiiuae/falcon-40b-instruct", # model_id from hf.co/models
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
  'MAX_INPUT_LENGTH': json.dumps(1024),  # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(2048),  # Max total tokens per request (input text plus generated text)
  # 'HF_MODEL_QUANTIZE': "bitsandbytes", # uncomment to quantize
}

# create HuggingFaceModel
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config
)
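The container reads its settings from environment variables, and a key it does not recognize has no effect, so the key names are worth checking before deployment. A minimal sketch of a builder that validates the two length limits follows. `build_tgi_env` is a hypothetical helper, not part of the SageMaker SDK, and the check assumes the input limit must be strictly smaller than the total token budget.

```python
import json


def build_tgi_env(model_id, num_gpus, max_input_length, max_total_tokens, quantize=None):
    """Build the container environment dict, with every value as a string."""
    if max_input_length >= max_total_tokens:
        # the total budget covers prompt plus generated tokens,
        # so it has to be larger than the prompt limit
        raise ValueError("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")
    env = {
        "HF_MODEL_ID": model_id,
        "SM_NUM_GPUS": json.dumps(num_gpus),
        "MAX_INPUT_LENGTH": json.dumps(max_input_length),
        "MAX_TOTAL_TOKENS": json.dumps(max_total_tokens),
    }
    if quantize is not None:
        # e.g. "bitsandbytes"; left out entirely when no quantization is wanted
        env["HF_MODEL_QUANTIZE"] = quantize
    return env


config = build_tgi_env("tiiuae/falcon-40b-instruct", 4, 1024, 2048)
print(config)
```

The returned dict can be passed as `env=config` to `HuggingFaceModel` in the same way as the hand-written one.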


In [None]:
# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    # volume_size=400, # If using an instance with local SSD storage, volume_size must be None, e.g. p4 but not p3
    container_startup_health_check_timeout=health_check_timeout, # 15 minutes (900 s) to be able to load the model
    # wait=False
)


In [None]:
user_utter = "How can I learn spear fishing in korea?"

In [None]:
# define payload
prompt = f"""You are a helpful assistant called Falcon. You know everything about AWS.

User: {user_utter}
Falcon:"""

# hyperparameters for llm
payload = {
  "inputs": prompt,
  "parameters": {
    "do_sample": True,
    "top_p": 0.9,
    "temperature": 0.8,
    "max_new_tokens": 1024,
    "repetition_penalty": 1.03,
    "stop": ["\nUser:","<|endoftext|>","</s>"]
  }
}
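The same prompt template and parameter block are reused for every request in this notebook, so they can be wrapped in two small functions. This is a sketch only: `build_prompt` and `build_payload` are hypothetical helper names, and the system sentence is a simplified version of the one used above.

```python
import json


def build_prompt(user_utter, assistant_name="Falcon"):
    """Wrap a user question in the single-turn template used in this notebook."""
    return (
        f"You are a helpful assistant called {assistant_name}.\n\n"
        f"User: {user_utter}\n"
        f"{assistant_name}:"
    )


def build_payload(prompt, max_new_tokens=1024, temperature=0.8, top_p=0.9):
    """Return a request body in the format sent to the endpoint."""
    return {
        "inputs": prompt,
        "parameters": {
            "do_sample": True,
            "top_p": top_p,
            "temperature": temperature,
            "max_new_tokens": max_new_tokens,
            "repetition_penalty": 1.03,
            # stop at the next user turn or at an end-of-sequence marker
            "stop": ["\nUser:", "<|endoftext|>", "</s>"],
        },
    }


example_payload = build_payload(build_prompt("How can I learn spear fishing in korea?"))
print(json.dumps(example_payload, indent=2))
```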


In [None]:
print(payload)

In [None]:
%%time

# send request to endpoint
response = llm.predict(payload)

# extract the assistant's response (the text after the echoed prompt)
assistant = response[0]["generated_text"][len(prompt):]

In [None]:
print(assistant)
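Slicing with `len(prompt)` assumes the endpoint echoes the prompt at the start of `generated_text`. A slightly more defensive sketch checks for the prefix first and also trims a trailing stop sequence if one is present. `extract_completion` is a hypothetical helper name, and the response shape (a list with one `generated_text` entry) is taken from the cell above.

```python
def extract_completion(response, prompt, stop=()):
    """Return only the newly generated text from a response list."""
    text = response[0]["generated_text"]
    # strip the echoed prompt only if it is really there
    if text.startswith(prompt):
        text = text[len(prompt):]
    # remove a stop sequence left at the end of the generation
    for marker in stop:
        if marker and text.endswith(marker):
            text = text[: -len(marker)]
            break
    return text.strip()


sample = [{"generated_text": "User: hi\nFalcon: hello there\nUser:"}]
print(extract_completion(sample, "User: hi\nFalcon:", stop=["\nUser:"]))
```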

### Invoke Falcon model using SageMaker Runtime client

- Invoking the model through the SageMaker SDK is convenient, but plain boto3 works as well
- Here we call the endpoint with the SageMaker runtime client (`invoke_endpoint`)

In [None]:
endpoint_name = "huggingface-pytorch-tgi-inference-2023-06-16-02-46-29-194"

In [None]:
user_utter = "How can I buy a great bluetooth earphone in pakistan?"

In [None]:
# define payload
prompt = f"""You are a helpful assistant called Falcon. You know everything about AWS.

User: {user_utter}
Falcon:"""

# hyperparameters for llm
payload = {
  "inputs": prompt,
  "parameters": {
    "do_sample": True,
    "top_p": 0.9,
    "temperature": 0.8,
    "max_new_tokens": 1024,
    "repetition_penalty": 1.03,
    "stop": ["\nUser:","<|endoftext|>","</s>"]
  }
}

print(payload)

In [None]:
%%time

response_model = sm_runtime_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(payload),
    ContentType="application/json",
)


In [None]:
raw_output = response_model["Body"].read().decode("utf8")

In [None]:
output = json.loads(raw_output)[0]["generated_text"][len(prompt):]

In [None]:
print(output)
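The invoke, read, and parse steps above can be folded into one function that takes the runtime client as an argument. The sketch below does that and exercises it with a fake client, so it runs without AWS credentials. `invoke_text_generation` and `FakeRuntimeClient` are hypothetical names; only `invoke_endpoint` and its `EndpointName`, `Body`, and `ContentType` arguments come from the notebook.

```python
import io
import json


def invoke_text_generation(runtime_client, endpoint_name, payload):
    """Send a JSON payload to an endpoint and return the parsed JSON response."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        Body=json.dumps(payload),
        ContentType="application/json",
    )
    # "Body" is a stream, so it is read once and then decoded
    return json.loads(response["Body"].read().decode("utf-8"))


class FakeRuntimeClient:
    """Stand-in for the SageMaker runtime client, for offline testing only."""

    def invoke_endpoint(self, EndpointName, Body, ContentType):
        prompt = json.loads(Body)["inputs"]
        body = json.dumps([{"generated_text": prompt + " (generated text)"}])
        return {"Body": io.BytesIO(body.encode("utf-8"))}


result = invoke_text_generation(FakeRuntimeClient(), "my-endpoint", {"inputs": "Hello"})
print(result[0]["generated_text"])
```

With a real endpoint, `sm_runtime_client` and `endpoint_name` from the cells above would be passed instead of the fake client.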

### Inference test result

Without quantization (the container's default precision):
- `g5.12xlarge` : 3~5 sec
- `g4dn.12xlarge` : OOM

For the int8 (quantization):
- `g5.12xlarge` (`$5.672`) : 6~15 sec (inference is slower with quantization)
- `g4dn.12xlarge` (`$3.912`) : 20 sec
- `g5.2xlarge` (`$1.212`): timeout
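
A rough weights-only estimate helps to read these results. The sketch below assumes about 40 billion parameters, 16-bit weights for the unquantized run (an assumption about what the container loads), and 1 byte per parameter for int8. Activations and the KV cache are ignored, so real usage is higher. The GPU memory figures are 4 x 24 GB for `g5.12xlarge`, 4 x 16 GB for `g4dn.12xlarge`, and 1 x 24 GB for `g5.2xlarge`.

```python
# total GPU memory per instance in GB (number of GPUs x memory per GPU)
GPU_MEMORY_GB = {
    "g5.12xlarge": 4 * 24,    # 4 x NVIDIA A10G
    "g4dn.12xlarge": 4 * 16,  # 4 x NVIDIA T4
    "g5.2xlarge": 1 * 24,     # 1 x NVIDIA A10G
}


def weight_memory_gb(num_params_billion, bytes_per_param):
    """Weights-only footprint in GB; activations and KV cache are ignored."""
    return num_params_billion * bytes_per_param


def fits(instance, num_params_billion, bytes_per_param):
    """True if the weights alone fit into the instance's total GPU memory."""
    return weight_memory_gb(num_params_billion, bytes_per_param) < GPU_MEMORY_GB[instance]


for label, nbytes in [("16-bit", 2), ("int8", 1)]:
    for instance in GPU_MEMORY_GB:
        verdict = "fits" if fits(instance, 40, nbytes) else "does not fit"
        print(f"{label} on {instance}: {weight_memory_gb(40, nbytes)} GB of weights, {verdict}")
```

Under these assumptions the 16-bit weights (about 80 GB) exceed the 64 GB of `g4dn.12xlarge`, and even the int8 weights (about 40 GB) exceed the single 24 GB GPU of `g5.2xlarge`, which is consistent with the OOM and timeout observed above.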

Deploying Falcon 40B by following the official guide works well. How about DJL?

### How to deploy it with DJL?

- SageMaker model type DJL : https://sagemaker.readthedocs.io/en/stable/frameworks/djl/using_djl.html
- Sample code for deploying Falcon model using DJL : https://github.com/aws/amazon-sagemaker-examples/blob/main/inference/generativeai/llm-workshop/lab10-falcon-40b-and-7b/falcon-40b-accelerate.ipynb


In [None]:
from huggingface_hub import snapshot_download
from pathlib import Path
import os

local_model_path = Path("./pretrained-models")
local_model_path.mkdir(exist_ok=True)
# model_name = "tiiuae/falcon-40b"
model_name = "tiiuae/falcon-40b-instruct"
allow_patterns = ["*.json", "*.pt", "*.bin", "*.txt", "*.model", "*.py"]

model_download_path = snapshot_download(
    repo_id=model_name,
    cache_dir=local_model_path,
    allow_patterns=allow_patterns,
)
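`allow_patterns` restricts the download to files whose names match at least one glob pattern. The sketch below approximates that filtering with the standard library's `fnmatch`, which is an assumption about the matching rules rather than the library's actual code, and the file listing is hypothetical. Note that `*.safetensors` is not in the pattern list, so such files would be skipped.

```python
from fnmatch import fnmatch

allow_patterns = ["*.json", "*.pt", "*.bin", "*.txt", "*.model", "*.py"]


def is_allowed(filename, patterns):
    """True if the file name matches at least one glob pattern."""
    return any(fnmatch(filename, pattern) for pattern in patterns)


# hypothetical repository listing, for illustration only
repo_files = [
    "config.json",
    "pytorch_model-00001-of-00009.bin",
    "model-00001-of-00009.safetensors",
    "tokenizer.json",
    "README.md",
]
selected = [name for name in repo_files if is_allowed(name, allow_patterns)]
print(selected)
```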

In [None]:
print(f"Model download path (Falcon 40B) : {model_download_path}")

In [None]:
s3_model_prefix = "llm/falcon/model"  # folder where model checkpoint will go

In [None]:
# base_model_s3 = f"{s3_model_prefix}/falcon-40b"
base_model_s3 = f"{s3_model_prefix}/falcon-40b-instruct"

In [None]:
# Run this cell only when you want to upload the model files
s3_model_artifact = sagemaker_session.upload_data(path=model_download_path, key_prefix=base_model_s3)

In [None]:
default_bucket = sagemaker_session.default_bucket()
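try:
    # s3_model_artifact exists only if the upload cell above was run
    print(f"Model s3 uri : {s3_model_artifact}")
except NameError:
    # upload was skipped: point at the location of a previous upload
    s3_model_artifact = f"s3://{default_bucket}/{base_model_s3}"

print(s3_model_artifact)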

In [None]:
framework_name = "djl-deepspeed"
inference_image_uri = image_uris.retrieve(
    framework=framework_name, region=sagemaker_session.boto_session.region_name, version="0.22.1"
)

print(f"Inference container uri: {inference_image_uri}")

In [None]:
# Accelerate version
src_dir_name = "falcon-40b-src"

# # DeepSpeed version
# src_dir_name = "falcon-40b-ds-src"

s3_target = f"s3://{sagemaker_session.default_bucket()}/llm/falcon-40b/code/"

In [None]:
!rm -f {src_dir_name}.tar.gz
# --exclude options go before the directory so that they apply to it
!tar zcvf {src_dir_name}.tar.gz --exclude ".ipynb_checkpoints" --exclude "__pycache__" {src_dir_name}
!aws s3 cp {src_dir_name}.tar.gz {s3_target}
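
The packaging step can also be done in Python, which avoids depending on the behavior of the local `tar` binary. This is a minimal sketch using the standard `tarfile` module; `package_source` and `skip_excluded` are hypothetical helper names, and the excluded folder names mirror the shell command above.

```python
import tarfile
from pathlib import Path

EXCLUDED_DIRS = {".ipynb_checkpoints", "__pycache__"}


def skip_excluded(tarinfo):
    """Drop any member whose path contains an excluded directory name."""
    if EXCLUDED_DIRS.intersection(Path(tarinfo.name).parts):
        return None
    return tarinfo


def package_source(src_dir, archive_path):
    """Create a gzip-compressed tarball of src_dir, keeping the top-level folder name."""
    src_dir = Path(src_dir)
    with tarfile.open(str(archive_path), "w:gz") as tar:
        tar.add(str(src_dir), arcname=src_dir.name, filter=skip_excluded)
    return str(archive_path)
```

For example, `package_source(src_dir_name, f"{src_dir_name}.tar.gz")` would produce the same archive name that the upload command expects.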

In [None]:
model_uri = f"{s3_target}{src_dir_name}.tar.gz"
print(model_uri)

In [None]:
model_name = name_from_base("falcon-40b-djl")
print(model_name)

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": inference_image_uri, "ModelDataUrl": model_uri},
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

In [None]:
instance_type = "ml.g5.12xlarge"

endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "ContainerStartupHealthCheckTimeoutInSeconds": 1200,
        },
    ],
)
print(endpoint_config_response)

In [None]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

In [None]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)
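
The loop above waits for as long as the endpoint stays in `Creating` and does not distinguish a failed endpoint from a healthy one. The sketch below adds a timeout and raises if the final status is not `InService`. `wait_for_endpoint` and `FakeSageMakerClient` are hypothetical names, and the fake client exists only so that the example runs offline. boto3 also offers a built-in waiter, `sm_client.get_waiter("endpoint_in_service")`, which may be a simpler option.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=60, timeout_seconds=1800):
    """Poll describe_endpoint until the endpoint leaves 'Creating' or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        print("Status: " + status)
        if status != "Creating":
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{endpoint_name} still in 'Creating' after {timeout_seconds}s")
        time.sleep(poll_seconds)
    if status != "InService":
        # a failed endpoint reports the cause in FailureReason
        raise RuntimeError(f"Endpoint ended in status {status}: {resp.get('FailureReason')}")
    return resp


class FakeSageMakerClient:
    """Stand-in client that reports 'Creating' twice, then 'InService'."""

    def __init__(self):
        self._statuses = iter(["Creating", "Creating", "InService"])

    def describe_endpoint(self, EndpointName):
        return {"EndpointStatus": next(self._statuses), "EndpointArn": f"arn:fake:{EndpointName}"}


resp = wait_for_endpoint(FakeSageMakerClient(), "demo-endpoint", poll_seconds=0)
print("Arn: " + resp["EndpointArn"])
```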

In [None]:
user_utter = "What is the best way to buy some gopro in pakistan?"

prompt = f"""You are a helpful assistant called Falcon.

User: {user_utter}
Falcon:"""

In [None]:
%%time

response_model = sm_runtime_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps({"text": prompt, "text_length": 150}),
    ContentType="application/json",
)


In [None]:
raw_output = response_model["Body"].read().decode("utf8")

In [None]:
output = json.loads(raw_output)["outputs"][0]["generated_text"][len(prompt):]

In [None]:
print(output)