# Question Answering with Extracted Document

Many use cases, such as building a chatbot, require text-to-text generation models like **[BloomZ 7B1](https://huggingface.co/bigscience/bloomz-7b1)**, **[Flan T5 XXL](https://huggingface.co/google/flan-t5-xxl)**, and **[Flan UL2](https://huggingface.co/google/flan-ul2)** to answer user questions with insightful responses. These models picked up a lot of general knowledge during training, but we often need them to ingest and use a large library of more specific information.

In this notebook we will demonstrate:
- (1) How to deploy a large language model (LLM) in SageMaker JumpStart.
- (2) Common use cases of LLMs.
- (3) How to ask an LLM a question with or without providing context.

**Note**
- This notebook serves as a template: you can replace the example dataset with your own to build a custom question answering application.
- This lab takes about 15 minutes (10 minutes for deployment plus 5 minutes for testing the models).

## Step 1. Deploy large language model (LLM) in SageMaker JumpStart

To better illustrate the idea, let's first deploy all the models required for the demo. You can either deploy all three models (Flan T5 XL, BloomZ 7B1, and Flan UL2) as the large language model (LLM) to compare their performance, or select a **subset** of the models based on your preference. To do that, modify the `_MODEL_CONFIG_` Python dictionary defined below.

In [2]:
!pip install --upgrade sagemaker --quiet
!pip install ipywidgets==7.0.0 --quiet


In [3]:
import time
import sagemaker, boto3, json
from sagemaker.session import Session
from sagemaker.model import Model
from sagemaker import image_uris, model_uris, script_uris, hyperparameters
from sagemaker.predictor import Predictor
from sagemaker.utils import name_from_base

sagemaker_session = Session()
aws_role = sagemaker_session.get_caller_identity_arn()
aws_region = boto3.Session().region_name
sess = sagemaker.Session()
model_version = "*"

In [4]:
def query_endpoint_with_json_payload(encoded_json, endpoint_name, content_type="application/json"):
    # Send an already-encoded JSON payload to the endpoint and return the raw response.
    client = boto3.client("runtime.sagemaker")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name, ContentType=content_type, Body=encoded_json
    )
    return response


def parse_response_model_flan_t5(query_response):
    # Flan-T5 responses carry a list of strings under "generated_texts".
    model_predictions = json.loads(query_response["Body"].read())
    generated_text = model_predictions["generated_texts"]
    return generated_text


def parse_response_multiple_texts_bloomz(query_response):
    # BloomZ responses nest the candidates: collect "generated_text" from each entry.
    generated_text = []
    model_predictions = json.loads(query_response["Body"].read())
    for x in model_predictions[0]:
        generated_text.append(x["generated_text"])
    return generated_text
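
The parsing helpers above can be checked without a live endpoint. The following is a minimal sketch, assuming the response bodies have the shapes implied by the helpers themselves; `fake_response` is a hypothetical helper that imitates the file-like `Body` returned by `invoke_endpoint`, and the sample texts are made up for illustration.

```python
import io
import json


def parse_response_model_flan_t5(query_response):
    # Flan-T5 endpoints return {"generated_texts": [...]} in the response body.
    model_predictions = json.loads(query_response["Body"].read())
    return model_predictions["generated_texts"]


def parse_response_multiple_texts_bloomz(query_response):
    # BloomZ endpoints are assumed to return [[{"generated_text": ...}, ...]].
    model_predictions = json.loads(query_response["Body"].read())
    return [x["generated_text"] for x in model_predictions[0]]


def fake_response(body):
    # Hypothetical helper: wraps a Python object as JSON bytes in a file-like
    # "Body", imitating the streaming body returned by invoke_endpoint.
    return {"Body": io.BytesIO(json.dumps(body).encode("utf-8"))}


flan_texts = parse_response_model_flan_t5(
    fake_response({"generated_texts": ["A refund"]})
)
bloomz_texts = parse_response_multiple_texts_bloomz(
    fake_response([[{"generated_text": "first"}, {"generated_text": "second"}]])
)
print(flan_texts)
print(bloomz_texts)
```

Both helpers return a plain list of strings, which is why the later cells can index the result with `[0]` regardless of the model family.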

Uncomment the entries below if you want to deploy multiple LLMs to compare their performance.

| Model | Supported languages |
|---|---|
| flan-t5-small | English, Chinese, ... |
| bloomz-3b | English, Chinese, ... |

- [Flan-t5-small model spec](https://huggingface.co/google/flan-t5-small)
- [bloomz-3b model spec](https://huggingface.co/bigscience/bloomz-3b)

You may check the [available models](https://sagemaker.readthedocs.io/en/stable/doc_utils/pretrainedmodels.html) on Amazon SageMaker JumpStart for the list of available models, and [g5 pricing](https://aws.amazon.com/ec2/instance-types/g5/) to estimate your budget.

Note: deploying the two endpoints takes about 10 minutes (about 5 minutes each).

In [8]:
_MODEL_CONFIG_ = {
    "huggingface-text2text-flan-t5-small": {
        "instance type": "ml.g5.xlarge",
        "env": {"TS_DEFAULT_WORKERS_PER_MODEL": "1"},
        "parse_function": parse_response_model_flan_t5,
        "prompt": """Answer based on context:\n\n{context}\n\n{question}""",
    },
    "huggingface-text2text-flan-t5-base": {
        "instance type": "ml.g5.2xlarge",
        "env": {},
        "parse_function": parse_response_model_flan_t5,
        "prompt": """Answer based on context:\n\n{context}\n\n{question}""",
    }
}
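
The `prompt` entries in the configuration are templates with `{context}` and `{question}` placeholders. As a minimal sketch of how such a template could be filled to ask a question with or without context, the example below uses a made-up context and question (both are assumptions for illustration, not part of the notebook's data) and the `text_inputs` payload key used elsewhere in this notebook.

```python
import json

# Same template string as the "prompt" entries in _MODEL_CONFIG_.
prompt_template = """Answer based on context:\n\n{context}\n\n{question}"""

# Hypothetical context and question, used only for illustration.
context = "Orders can be returned within 30 days of delivery."
question = "How long do customers have to return an order?"

# Without context: send the bare question.
payload_without_context = {"text_inputs": question, "max_length": 100}

# With context: fill both placeholders in the template.
text_inputs = prompt_template.format(context=context, question=question)
payload_with_context = {"text_inputs": text_inputs, "max_length": 100}

# Endpoints in this notebook receive the payload as UTF-8 encoded JSON.
encoded_payload = json.dumps(payload_with_context).encode("utf-8")
print(text_inputs)
```

Keeping the template in the configuration lets each model use its own prompt wording while the calling code stays the same.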

In [9]:
newline, bold, unbold = "\n", "\033[1m", "\033[0m"

for model_id in _MODEL_CONFIG_.keys():
    print(model_id)
    endpoint_name = name_from_base(f"jp-{model_id}", short=False)
    inference_instance_type = _MODEL_CONFIG_[model_id]["instance type"]
    print(endpoint_name)
    # Retrieve the inference container uri. This is the base HuggingFace container image for the default model above.
    deploy_image_uri = image_uris.retrieve(
        region=None,
        framework=None,  # automatically inferred from model_id
        image_scope="inference",
        model_id=model_id,
        model_version=model_version,
        instance_type=inference_instance_type,
    )
    # Retrieve the model uri.
    model_uri = model_uris.retrieve(
        model_id=model_id, model_version=model_version, model_scope="inference"
    )
    model_inference = Model(
        image_uri=deploy_image_uri,
        model_data=model_uri,
        role=aws_role,
        predictor_cls=Predictor,
        name=endpoint_name,
        env=_MODEL_CONFIG_[model_id]["env"],
    )
    model_predictor_inference = model_inference.deploy(
        initial_instance_count=1,
        instance_type=inference_instance_type,
        predictor_cls=Predictor,
        endpoint_name=endpoint_name,
    )
    print(f"{bold}Model {model_id} has been deployed successfully.{unbold}{newline}")
    _MODEL_CONFIG_[model_id]["endpoint_name"] = endpoint_name
    print(endpoint_name)
    print("---")
    
print(_MODEL_CONFIG_)

huggingface-text2text-flan-t5-small
jp-huggingface-text2text-flan-t5-small-2023-07-31-17-55-15-884
--------!Model huggingface-text2text-flan-t5-small has been deployed successfully.

jp-huggingface-text2text-flan-t5-small-2023-07-31-17-55-15-884
---
huggingface-text2text-flan-t5-base
jp-huggingface-text2text-flan-t5-base-2023-07-31-17-59-48-981
--------!Model huggingface-text2text-flan-t5-base has been deployed successfully.

jp-huggingface-text2text-flan-t5-base-2023-07-31-17-59-48-981
---
{'huggingface-text2text-flan-t5-small': {'instance type': 'ml.g5.xlarge', 'env': {'TS_DEFAULT_WORKERS_PER_MODEL': '1'}, 'parse_function': <function parse_response_model_flan_t5 at 0x7fbdc4f793b0>, 'prompt': 'Answer based on context:\n\n{context}\n\n{question}', 'endpoint_name': 'jp-huggingface-text2text-flan-t5-small-2023-07-31-17-55-15-884'}, 'huggingface-text2text-flan-t5-base': {'instance type': 'ml.g5.2xlarge', 'env': {}, 'parse_function': <function parse_response_model_flan_t5 a

## Step 2. Common use cases of LLMs

- Text summarization
- Common sense reasoning
- Question answering
- Sentiment classification
- Translation
- Pronoun resolution
- Text generation based on an article
- Imaginary article based on a title

Here are the sample queries: [Zero-shot prompting for the Flan-T5 foundation model in Amazon SageMaker JumpStart](https://aws.amazon.com/blogs/machine-learning/zero-shot-prompting-for-the-flan-t5-foundation-model-in-amazon-sagemaker-jumpstart/)

In [11]:
question = "Here is what customer said in the call: 'Yes, I have received a defective product, and I am extremely angry about it! This is unacceptable, and I want it resolved immediately!' What does customer want?"

In [12]:
payload = {
    "text_inputs": question,
    "max_length": 100,
    "num_return_sequences": 1,
    "top_k": 10,
    "top_p": 0.95,
    "do_sample": True,
}


for model_id in _MODEL_CONFIG_:
    endpoint_name = _MODEL_CONFIG_[model_id]["endpoint_name"]
    query_response = query_endpoint_with_json_payload(
        json.dumps(payload).encode("utf-8"), endpoint_name=endpoint_name
    )
    generated_texts = _MODEL_CONFIG_[model_id]["parse_function"](query_response)
    print(f"For model: {model_id}, the generated output is: {generated_texts[0]}\n")

For model: huggingface-text2text-flan-t5-small, the generated output is: I am angry about it.

For model: huggingface-text2text-flan-t5-base, the generated output is: A refund



## Compare the result

Now test the two endpoints with these sample prompts, or prepare your own.

- sample_prompt_1 = ""
- sample_prompt_2 = ""
- sample_prompt_3 = ""
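
One way to run the comparison is to loop over every prompt and every configured model and collect the answers side by side. The sketch below is an assumption about how this could be organized: `compare_models`, `demo_config`, `demo_prompts`, and `echo_query` are hypothetical names, and `echo_query` only stands in for a real endpoint call so that the example runs offline.

```python
def compare_models(prompts, model_config, query_fn):
    """Return {prompt: {model_id: answer}} for every prompt/model pair."""
    results = {}
    for prompt in prompts:
        results[prompt] = {
            model_id: query_fn(prompt, cfg["endpoint_name"])
            for model_id, cfg in model_config.items()
        }
    return results


# Hypothetical stand-ins so the sketch runs without a deployed endpoint.
demo_config = {
    "model-a": {"endpoint_name": "endpoint-a"},
    "model-b": {"endpoint_name": "endpoint-b"},
}
demo_prompts = [
    "Translate to German: Good morning",
    "Is this review positive or negative? 'Great product.'",
]


def echo_query(prompt, endpoint_name):
    # Placeholder for a real call; it only echoes which endpoint got which prompt.
    return f"[{endpoint_name}] {prompt}"


comparison = compare_models(demo_prompts, demo_config, echo_query)
for prompt, answers in comparison.items():
    print(prompt)
    for model_id, answer in answers.items():
        print(f"  {model_id}: {answer}")
```

In the notebook, `query_fn` could wrap `query_endpoint_with_json_payload` together with the model's `parse_function`, and `model_config` could be `_MODEL_CONFIG_` once its `endpoint_name` entries have been filled in by the deployment cell.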

## Delete the endpoint
- Keep: huggingface-text2text-flan-t5-base
- Delete: huggingface-text2text-flan-t5-small

In [16]:
endpoint_name = "jp-huggingface-text2text-flan-t5-small-2023-07-31-17-55-15-884"
model_name = "huggingface-text2text-flan-t5-small"
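
The cell above only sets the names. As a minimal sketch of what the cleanup could look like, assuming the standard boto3 SageMaker client calls `describe_endpoint`, `delete_endpoint`, `delete_endpoint_config`, and `delete_model`, the helper below removes an endpoint together with its configuration. `delete_endpoint_resources` and `RecordingClient` are hypothetical names; the stub records calls instead of contacting AWS so the order of operations can be checked as a dry run first.

```python
def delete_endpoint_resources(sm_client, endpoint_name, model_name=None):
    """Delete an endpoint, its endpoint configuration, and optionally a model."""
    # Look up the configuration name first; it cannot be read once the endpoint is gone.
    description = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = description["EndpointConfigName"]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    if model_name is not None:
        sm_client.delete_model(ModelName=model_name)
    return config_name


class RecordingClient:
    """Hypothetical stub that records calls instead of contacting AWS (dry run)."""

    def __init__(self):
        self.calls = []

    def describe_endpoint(self, EndpointName):
        self.calls.append(("describe_endpoint", EndpointName))
        return {"EndpointConfigName": f"{EndpointName}-config"}

    def delete_endpoint(self, EndpointName):
        self.calls.append(("delete_endpoint", EndpointName))

    def delete_endpoint_config(self, EndpointConfigName):
        self.calls.append(("delete_endpoint_config", EndpointConfigName))

    def delete_model(self, ModelName):
        self.calls.append(("delete_model", ModelName))


dry_run_client = RecordingClient()
delete_endpoint_resources(dry_run_client, "demo-endpoint", model_name="demo-model")
for call in dry_run_client.calls:
    print(call)

# For a real deletion, pass a SageMaker client instead of the stub, e.g.:
# delete_endpoint_resources(boto3.client("sagemaker"), endpoint_name)
```

Note that the deployment cell creates each `Model` with `name=endpoint_name`, so the SageMaker model resource may share the endpoint's name rather than the JumpStart model ID; check the model name in your account before passing `model_name`.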