<a href="https://colab.research.google.com/github/2003Yash/Falcon-40B-deployment-in-sagemaker/blob/main/Falcon_40B_deployment_in_Sagemaker_%2B_Prompt_Engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this lab, we'll host an LLM on Amazon SageMaker using the Hugging Face LLM Inference Container for Amazon SageMaker, which makes it easy to deploy the most popular open-source LLMs, including Falcon, StarCoder, BLOOM, GPT-NeoX, Llama, and T5.

---------------------------------------

Background and Details

We'll be working with Falcon-40B-Instruct, developed by the Technology Innovation Institute (TII). Falcon-40B-Instruct is a 40B-parameter causal decoder-only model built by TII, based on Falcon-40B and fine-tuned on a mixture of Baize. It is made available under the Apache 2.0 license.

----------------------------------------------------------

Install dependencies

In [None]:
!pip install --upgrade boto3 sagemaker # upgrades the boto3 and sagemaker libraries

Create a bucket

In [None]:
import sagemaker
import boto3
sess = sagemaker.Session()

sagemaker_session_bucket=None # sagemaker session bucket -> used for uploading data, models and logs
                              # sagemaker will automatically create this bucket if it does not exist


if sagemaker_session_bucket is None and sess is not None:
    sagemaker_session_bucket = sess.default_bucket() # set to default bucket if a bucket name is not given

Get the execution role, so we can use it to access models and buckets

In [None]:
try:
    role = sagemaker.get_execution_role() # try calling role directly

except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn'] # if the direct call fails, look the role up by name

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}") # we use roles to call models
print(f"sagemaker session region: {sess.boto_region_name}")

Get the Hugging Face LLM container image URI, which is used to pull models from the HF Hub and serve them in a container on SageMaker

In [None]:
from sagemaker.huggingface import get_huggingface_llm_image_uri # helper that retrieves the URI of the Hugging Face LLM Deep Learning Container
                                                                # the URI is later passed to the HuggingFaceModel class as image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="0.8.2"
)

# print ecr image uri
print(f"llm image uri: {llm_image}")

To deploy Falcon-40B-Instruct model to Amazon SageMaker, we create a HuggingFaceModel model class and define our endpoint configuration including the hf_model_id, and instance_type. We will use a g5.12xlarge instance type with 4 NVIDIA A10G GPUs and 96GB of GPU memory.
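As a rough sanity check on the instance choice, the arithmetic below estimates the memory needed for the model weights alone. It is a back-of-the-envelope sketch that assumes 16-bit weights (2 bytes per parameter), a round 40 billion parameters, and 24 GB per A10G GPU; the helper name `weight_memory_gb` is hypothetical, and the KV cache and activations are not counted.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

n_params = 40e9      # assumption: roughly 40 billion parameters
gpus = 4             # ml.g5.12xlarge
gpu_memory_gb = 24   # per A10G, so 4 x 24 = 96 GB in total

fp16_total = weight_memory_gb(n_params, bytes_per_param=2)
per_gpu = fp16_total / gpus

print(f"16-bit weights: {fp16_total:.0f} GB total, {per_gpu:.0f} GB per GPU")
print(f"left per GPU for KV cache and activations: {gpu_memory_gb - per_gpu:.0f} GB")
```

Under these assumptions the weights take most of the available memory, which is one reason to keep the token limits modest or to consider quantization.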

Create the model by importing it from the HF Hub

In [None]:
import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.g5.12xlarge"
number_of_gpu = 4

# TGI config
config = {
  'HF_MODEL_ID': "tiiuae/falcon-40b-instruct", # model id from hf.co/models
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
  'MAX_INPUT_LENGTH': json.dumps(1024),  # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(2048),  # Max length of the generation (including input text)
  # 'HF_MODEL_QUANTIZE': "bitsandbytes", # comment in to quantize
}

# create HuggingFaceModel
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image, # URI of the LLM container image retrieved above
  env=config
)
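The token limits in the config interact: the container expects MAX_INPUT_LENGTH to be smaller than MAX_TOTAL_TOKENS, and the prompt tokens plus max_new_tokens of a request must fit within MAX_TOTAL_TOKENS. The sketch below makes that arithmetic explicit. The helpers `build_tgi_env` and `max_new_tokens_budget` are hypothetical names introduced for illustration, not part of the SageMaker SDK.

```python
import json

def build_tgi_env(model_id, num_gpus, max_input_length, max_total_tokens):
    """Return the container environment as strings, after a basic sanity check."""
    if max_input_length >= max_total_tokens:
        raise ValueError("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")
    return {
        "HF_MODEL_ID": model_id,
        # environment variables must be strings, hence json.dumps on the integers
        "SM_NUM_GPUS": json.dumps(num_gpus),
        "MAX_INPUT_LENGTH": json.dumps(max_input_length),
        "MAX_TOTAL_TOKENS": json.dumps(max_total_tokens),
    }

def max_new_tokens_budget(prompt_tokens, max_total_tokens):
    """Largest max_new_tokens that still fits next to a prompt of the given length."""
    return max(max_total_tokens - prompt_tokens, 0)

env = build_tgi_env("tiiuae/falcon-40b-instruct", 4, 1024, 2048)
print(env)
print(max_new_tokens_budget(prompt_tokens=300, max_total_tokens=2048))
```

With the values used in this lab, a prompt at the full 1024-token input limit leaves room for at most 1024 new tokens.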

Deploy Model as EndPoint

In [None]:
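# deploy the model to a real-time endpoint; deploy() returns a predictor that we call `llm`
# the 600-second health-check timeout is an assumed value that leaves time to load the 40B weights
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=600,
)

# define payload - the prompt to send to the LLM and the parameters that control generation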
prompt = """You are a helpful assistant called Falcon, knowing everything about AWS.
User: Can you tell me something about Amazon SageMaker?
Falcon:"""

# hyperparameters for llm
payload = {
  "inputs": prompt,
  "parameters": {
    "do_sample": True,
    "top_p": 0.9,
    "temperature": 0.8,
    "max_new_tokens": 1024,
    "repetition_penalty": 1.03,
    "stop": ["\nUser:","<|endoftext|>","</s>"]
  }
}

# send request to endpoint
response = llm.predict(payload)

for seq in response:
    print(f"Result: {seq['generated_text']}")
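The same parameter dictionary is repeated in every request in this lab. One way to avoid the repetition is a small payload builder; the sketch below is an illustration only, and `DEFAULT_PARAMETERS` and `build_payload` are hypothetical names, not part of the SageMaker SDK.

```python
import copy

# Default generation parameters, copied from the request above
DEFAULT_PARAMETERS = {
    "do_sample": True,
    "top_p": 0.9,
    "temperature": 0.8,
    "max_new_tokens": 1024,
    "repetition_penalty": 1.03,
    "stop": ["\nUser:", "<|endoftext|>", "</s>"],
}

def build_payload(prompt, **overrides):
    """Return a request payload; keyword arguments override the defaults."""
    parameters = copy.deepcopy(DEFAULT_PARAMETERS)  # never mutate the shared defaults
    parameters.update(overrides)
    return {"inputs": prompt, "parameters": parameters}

payload = build_payload("User: Hello\nFalcon:", max_new_tokens=64)
print(payload["parameters"]["max_new_tokens"])
```

The resulting dictionary can be passed to llm.predict(payload) in the same way as the hand-written payload.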

Prompt Engineering

Prompt engineering is the practice of designing effective prompts for LLMs, with three goals: controlling the output, mitigating bias, and improving model efficiency.

In [None]:
# A prompt-engineered template has four parts:
#   1) Instruction - the specific task you want the model to perform
#   2) Context - external information that can steer the model to better responses
#   3) Input Data - the input or question we want a response for
#   4) Output Indicator - the type or format of the expected output
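The four parts of the template can be assembled programmatically. The function below is a minimal sketch; the name `build_prompt` and the "Context:" / "Question:" / "Answer:" labels are assumptions chosen for illustration, not a fixed format required by the model.

```python
def build_prompt(instruction, context, input_data, output_indicator="Answer:"):
    """Join instruction, context, input data and output indicator into one prompt."""
    parts = [
        instruction.strip(),                  # 1) Instruction
        f"Context: {context.strip()}",        # 2) Context
        f"Question: {input_data.strip()}",    # 3) Input Data
        output_indicator,                     # 4) Output Indicator
    ]
    return "\n\n".join(parts)

example = build_prompt(
    instruction="Answer the question based on the context below. Keep the answer short.",
    context="OKT3 was originally sourced from mice.",
    input_data="What was OKT3 originally sourced from?",
)
print(example)
```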

In [None]:
# Simple unstructured prompt
prompt = """
Teplizumab traces its roots to a New Jersey drug company called Ortho Pharmaceutical. There, scientists generated an early version of the antibody, dubbed OKT3. Originally sourced from mice, the molecule was able to bind to the surface of T cells and limit their cell-killing potential. In 1986, it was approved to help prevent organ rejection after kidney transplants, making it the first therapeutic antibody allowed for human use.

User: What was OKT3 originally sourced from?

Falcon:"""


# hyperparameters for llm
payload = {
  "inputs": prompt,
  "parameters": {
    "do_sample": True,
    "top_p": 0.9,
    "temperature": 0.8,
    "max_new_tokens": 1024,
    "repetition_penalty": 1.03,
    "stop": ["\nUser:","<|endoftext|>","</s>"]
  }
}

# send request to endpoint
response = llm.predict(payload)

for seq in response:
    print(f"Result: {seq['generated_text']}")

In [None]:
# Engineered prompt that follows the four-part template described above
prompt = """
Answer the question based on the context below. Keep the answer short and concise. Respond "Unsure about answer" if not sure about the answer.

Context: Teplizumab traces its roots to a New Jersey drug company called Ortho Pharmaceutical. There, scientists generated an early version of the antibody, dubbed OKT3. Originally sourced from mice, the molecule was able to bind to the surface of T cells and limit their cell-killing potential. In 1986, it was approved to help prevent organ rejection after kidney transplants, making it the first therapeutic antibody allowed for human use.

Question: What was OKT3 originally sourced from?

Answer:"""


# hyperparameters for llm
payload = {
  "inputs": prompt,
  "parameters": {
    "do_sample": True,
    "top_p": 0.9,
    "temperature": 0.8,
    "max_new_tokens": 1024,
    "repetition_penalty": 1.03,
    "stop": ["\nUser:","<|endoftext|>","</s>"]
  }
}

# send request to endpoint
response = llm.predict(payload)
for seq in response:
    print(f"Result: {seq['generated_text']}")
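Because the engineered prompt asks for a fixed fallback phrase, the response can be post-processed in code. The sketch below trims the text at the first stop sequence and detects the fallback. Whether the returned text still contains a stop sequence may depend on the container version, so treat that part as an assumption; `clean_generation` and `is_fallback` are hypothetical helper names.

```python
FALLBACK = "Unsure about answer"

def clean_generation(text, stop_sequences):
    """Cut the text at the earliest stop sequence and trim surrounding whitespace."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].strip()

def is_fallback(answer):
    """True if the answer is the fallback phrase, ignoring quotes, periods and case."""
    return answer.strip().strip('".').lower() == FALLBACK.lower()

stops = ["\nUser:", "<|endoftext|>", "</s>"]
answer = clean_generation(" Mice.\nUser: next question", stops)
print(answer, is_fallback(answer))
```

In the lab, the same helpers would be applied to seq['generated_text'] for each item in the response.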

Few-Shot Learning

Few-shot learning in prompt engineering involves giving the model a few examples (typically 2-5) of a task within the prompt to guide its understanding and response. This helps the model generalize and perform the task without any additional training.

In [None]:
# Zero-shot - no examples or references are provided, so the output relies only on the model's own interpretation

prompt = """
Tweet: "This new music video was incredible"
Sentiment:"""

# hyperparameters for llm
payload = {
  "inputs": prompt,
  "parameters": {
    "do_sample": True,
    "top_p": 0.9,
    "temperature": 0.8,
    "max_new_tokens": 1024,
    "repetition_penalty": 1.03,
    "stop": ["\nUser:","<|endoftext|>","</s>"]
  }
}

# send request to endpoint
response = llm.predict(payload)

for seq in response:
    print(f"Result: {seq['generated_text']}")

In [None]:
# With Few-shot technique
prompt = """
Tweet: "I hate it when my phone battery dies."
Sentiment: Negative
###
Tweet: "My day has been 👍"
Sentiment: Positive
###
Tweet: "This is the link to the article"
Sentiment: Neutral
###
Tweet: "This new music video was incredible"
Sentiment:"""

# hyperparameters for llm
payload = {
  "inputs": prompt,
  "parameters": {
    "do_sample": True,
    "top_p": 0.9,
    "temperature": 0.8,
    "max_new_tokens": 1024,
    "repetition_penalty": 1.03,
    "stop": ["\nUser:","<|endoftext|>","</s>"]
  }
}

# send request to endpoint
response = llm.predict(payload)
for seq in response:
    print(f"Result: {seq['generated_text']}")
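A few-shot prompt like the one above can be generated from a list of labeled examples, which makes it easy to swap examples in and out. This is a minimal sketch; `build_few_shot_prompt` is a hypothetical helper, and the "###" separator simply mirrors the hand-written prompt.

```python
def build_few_shot_prompt(examples, query, separator="###"):
    """Format (tweet, label) pairs followed by the unlabeled query tweet."""
    blocks = [f'Tweet: "{tweet}"\nSentiment: {label}' for tweet, label in examples]
    blocks.append(f'Tweet: "{query}"\nSentiment:')  # the model completes this line
    return f"\n{separator}\n".join(blocks)

examples = [
    ("I hate it when my phone battery dies.", "Negative"),
    ("My day has been 👍", "Positive"),
    ("This is the link to the article", "Neutral"),
]

few_shot_prompt = build_few_shot_prompt(examples, "This new music video was incredible")
print(few_shot_prompt)
```

For a classification task like this one, a much smaller max_new_tokens value than 1024 is usually enough, since the expected answer is a single word.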

Clean up resources

In [None]:
llm.delete_model()
llm.delete_endpoint()

Also manually check for and delete any leftover S3 data, then stop the SageMaker space and delete it once it has stopped.
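To confirm that no endpoint is left running (and billing), the endpoint names in the account can be listed after cleanup. The sketch below assumes a boto3 SageMaker client is passed in and relies on its list_endpoints call; the function name `list_endpoint_names` is hypothetical.

```python
def list_endpoint_names(sm_client):
    """Return the names of all SageMaker endpoints visible to the client."""
    names = []
    kwargs = {}
    while True:
        response = sm_client.list_endpoints(**kwargs)
        names.extend(e["EndpointName"] for e in response.get("Endpoints", []))
        token = response.get("NextToken")
        if not token:  # no further pages
            return names
        kwargs["NextToken"] = token

# Usage in the notebook (requires AWS credentials):
# import boto3
# print(list_endpoint_names(boto3.client("sagemaker")))
```

An empty list means no endpoints remain in the current region; other regions need to be checked separately.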