# Retrieval-Augmented Generation: Question Answering based on Custom Dataset

Many use cases, such as building a chatbot, require text (text2text) generation models like **[BloomZ 7B1](https://huggingface.co/bigscience/bloomz-7b1)**, **[Flan T5 XXL](https://huggingface.co/google/flan-t5-xxl)**, and **[Flan T5 UL2](https://huggingface.co/google/flan-ul2)** to respond to user questions with insightful answers. These models have picked up a lot of general knowledge during training, but we often need them to ingest and use a large library of more specific information.

<img src="rag.png" max-width="1080"/>

Retrieval-Augmented Generation (RAG) combines the power of pre-trained LLMs with information retrieval, enabling more accurate and context-aware responses.

1. The first process is ingestion. Your documents are run through an embedding model, and the resulting embeddings are stored in a vector store that serves as your knowledge base.

2. The second process is retrieval and generation. Relevant information is retrieved from the knowledge base, and the LLM generates a response based on the retrieved information and the input query.
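
The two processes above can be sketched in a few lines of plain Python. This is only a toy illustration: `embed` is a hypothetical word-count stand-in for a real embedding model, the three documents are made up, and the "vector store" is an in-memory list.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Process 1: ingestion. Embed every document once and store the vectors.
documents = [  # made-up documents, for illustration only
    "Spot training works with every supported instance type.",
    "The endpoint is deleted during clean up.",
    "Embeddings are stored in a vector index.",
]
knowledge_base = [(doc, embed(doc)) for doc in documents]


# Process 2: retrieval and generation. Rank documents against the query,
# then place the best matches in the prompt that is sent to the LLM.
def retrieve(query, k=1):
    query_vec = embed(query)
    ranked = sorted(
        knowledge_base, key=lambda item: cosine(query_vec, item[1]), reverse=True
    )
    return [doc for doc, _ in ranked[:k]]


question = "Which instance type works with spot training?"
retrieved = retrieve(question, k=1)
prompt = f"Context: {' '.join(retrieved)}\n\nQuestion: {question}\n\nAnswer:"
print(prompt)
```

The rest of the notebook follows the same shape, with a sentence-transformers model in place of `embed` and a FAISS index in place of the list.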

In this notebook we will demonstrate how to use **[Falcon-7B-instruct](https://huggingface.co/tiiuae/falcon-7b-instruct)** to answer questions using a library of documents as a reference, by using document embeddings and retrieval. The embeddings are generated from [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2), a [sentence-transformers](https://www.sbert.net/) embedding model.

**This notebook serves as a template, so you can easily replace the example dataset with your own to build a custom question answering application.**

This notebook was tested in SageMaker Studio with the following configuration:
- Image: PyTorch 1.13 Python 3.9 CPU Optimized
- Kernel: Python 3
- Instance type: ml.t3.medium

## Step 1. Deploy large language model (LLM) in SageMaker JumpStart

To better illustrate the idea, let's first deploy the LLM that is required for the demo.

In [2]:
!pip install --upgrade sagemaker --quiet
!pip install ipywidgets==7.0.0 --quiet

In [3]:
import time
import sagemaker, boto3, json
from sagemaker.session import Session
from sagemaker.model import Model
from sagemaker import image_uris, model_uris, script_uris, hyperparameters
from sagemaker.predictor import Predictor
from sagemaker.utils import name_from_base

sagemaker_session = Session()
aws_role = sagemaker_session.get_caller_identity_arn()
aws_region = boto3.Session().region_name
sess = sagemaker.Session()
model_version = "*"

In [4]:
model_id = 'huggingface-llm-falcon-7b-instruct-bf16'  # this is hard-coded
instance_type = 'ml.g5.2xlarge'

unix_time = int(time.time())
endpoint_name = name_from_base(f"jumpstart-example-rag-{model_id}-{unix_time}")
inference_instance_type = instance_type

# Retrieve the inference container uri. This is the base HuggingFace container image for the default model above.
deploy_image_uri = image_uris.retrieve(
    region=None,
    framework=None,  # automatically inferred from model_id
    image_scope="inference",
    model_id=model_id,
    model_version=model_version,
    instance_type=inference_instance_type,
)
# Retrieve the model uri.
model_uri = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="inference"
)
model_inference = Model(
    image_uri=deploy_image_uri,
    model_data=model_uri,
    role=aws_role,
    predictor_cls=Predictor,
    name=endpoint_name,
    env={
        'HF_MODEL_ID': '/opt/ml/model'
    }
)
model_predictor_inference = model_inference.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    predictor_cls=Predictor,
    endpoint_name=endpoint_name,
)
print(f"Model {model_id} has been deployed successfully.")

-----------------!Model huggingface-llm-falcon-7b-instruct-bf16 has been deployed successfully.


In [5]:
# endpoint_name = 'jumpstart-example-rag-huggingface-llm-f-2023-09-05-08-41-58-340'

In [6]:
def query_endpoint_with_json_payload(encoded_json, endpoint_name, content_type="application/json"):
    client = boto3.client("runtime.sagemaker")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name, ContentType=content_type, Body=encoded_json
    )
    return response

def parse_response(query_response):
    model_predictions = json.loads(query_response["Body"].read().decode("utf8"))    
    return model_predictions[0]['generated_text']
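
The two helpers above assume a particular request and response shape: the request is a JSON object with `inputs` and `parameters`, and the response body is a JSON list whose first element carries `generated_text`. The sketch below exercises that parsing logic offline. The response is fabricated for illustration, and `io.BytesIO` is only a stand-in for the streaming body that `invoke_endpoint` returns.

```python
import io
import json


def parse_response(query_response):
    # Same logic as the notebook helper, repeated so the sketch is self-contained.
    model_predictions = json.loads(query_response["Body"].read().decode("utf8"))
    return model_predictions[0]["generated_text"]


# Request body in the shape the notebook sends: a prompt plus generation parameters.
payload = json.dumps(
    {"inputs": "Question: What is RAG?\n\nAnswer:", "parameters": {"max_new_tokens": 300}}
).encode("utf-8")

# Hypothetical response: a file-like "Body" holding a JSON list of generations.
fake_response = {
    "Body": io.BytesIO(json.dumps([{"generated_text": " example answer"}]).encode("utf-8"))
}

answer = parse_response(fake_response)
print(answer)
```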

## Step 2. Ask a question to LLM without providing the context

To illustrate why we need a retrieval-augmented generation (RAG) based approach to solve the question answering problem, let's ask the model a question directly and see how it responds.

In [7]:
# These are inference-time generation parameters. They do not change the deployed model;
# they control how the endpoint generates text for each request, for example the
# maximum response length (max_new_tokens) and how tokens are sampled
# (do_sample, top_k, top_p, temperature).
# For this workshop, the values have been chosen for you.
# If you like, you can adjust them in the code below.
# They will affect the behavior of your LLM response.

parameters = {
    "max_new_tokens": 300,
    "num_return_sequences": 1,
    "top_k": 50,
    "top_p": 0.95,
    "do_sample": False,
    "return_full_text": False,
    "temperature": 0.2,
    # "stop": ["\n"]
}

question = "Which instances can I use with Managed Spot Training in SageMaker?"
prompt = f"Question: {question}\n\nAnswer:"
print(f"Question being asked is --> {prompt}")

payload = {"inputs": prompt, "parameters": parameters}
payload = json.dumps(payload).encode('utf-8')

response = query_endpoint_with_json_payload(payload, endpoint_name)
response = parse_response(response)
print(response)

Question being asked is --> Question: Which instances can I use with Managed Spot Training in SageMaker?

Answer:
 Managed Spot Training can be used with instances that are available in the AWS Marketplace. These instances include Amazon Linux, Ubuntu, and CentOS. You can also use instances that are available in the AWS Marketplace, such as the Amazon EMR clusters.

Note: Managed Spot Training is only available in the AWS Marketplace for the following instance types:
- Amazon Linux 2.x
- Ubuntu 2.x
- CentOS 7.x
- Amazon EMR clusters

If you are using an instance that is not available in the AWS Marketplace, you can use a managed instance from AWS Marketplace.
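
To give a feel for what `temperature`, `top_k`, and `top_p` in the `parameters` dictionary do, here is a toy sketch of one common formulation of these filters applied to a next-token distribution. The logits are made up, and this is not the endpoint's implementation; how the serving container combines these settings with `do_sample` is not shown here.

```python
import math


def filter_next_token_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return the renormalized distribution left after temperature, top-k and top-p."""
    # Temperature rescales the logits: values below 1 sharpen the distribution.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(
        ((tok, w / total) for tok, w in weights.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    # Top-k keeps only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    mass = sum(p for _, p in ranked)
    # Top-p keeps the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p / mass))
        cumulative += p / mass
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


toy_logits = {"all": 2.0, "m5": 1.0, "spot": 0.0, "emr": -1.0}  # made-up values
sharp = filter_next_token_probs(toy_logits, temperature=0.2, top_k=50, top_p=0.95)
flat = filter_next_token_probs(toy_logits, temperature=1.0, top_k=50, top_p=0.95)
print(list(sharp), list(flat))
```

In this toy case, the low temperature concentrates almost all probability on one token, while the default temperature leaves several candidates in play.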


## Step 3. Improve the answer to the same question using prompt engineering with insightful context

To answer the question better, we provide extra contextual information, combine it with a prompt, and send it to the model together with the question. Below is an example.

In [8]:
question = "Which instances can I use with Managed Spot Training in SageMaker?"
context = """Managed Spot Training can be used with all instances supported in Amazon SageMaker. Managed Spot Training is supported in all AWS Regions where Amazon SageMaker is currently available."""

prompt = f""""Context: {context}\n\nQuestion: {question}\n\nAnswer:"""
print(f"Question being asked is -- > {prompt}")

payload = {"inputs": prompt, "parameters": parameters}
payload = json.dumps(payload).encode('utf-8')

response = query_endpoint_with_json_payload(payload, endpoint_name)
response = parse_response(response)
print(response)

Question being asked is -- > "Context: Managed Spot Training can be used with all instances supported in Amazon SageMaker. Managed Spot Training is supported in all AWS Regions where Amazon SageMaker is currently available.

Question: Which instances can I use with Managed Spot Training in SageMaker?

Answer:


Managed Spot Training is currently supported on all instances supported in Amazon SageMaker. This includes the following instance types:

- `m5.large`
- `m5.xlarge`
- `m5.2xlarge`
- `m5.4xlarge`
- `m5.8xlarge`
- `m5.12xlarge`
- `m5.16xlarge`
- `m5.24xlarge`
- `m5.32xlarge`
- `m5.40xlarge`
- `m5.48xlarge`
- `m5.56xlarge`
- `m5.64xlarge`
- `m5.72xlarge`
- `m5.8xlarge`
- `m5.10xlarge`
- `m5.12xlarge`
- `m5.14xlarge`
- `m5.16xlarge`
- `m5.18xlarge`
- `m5.20xlarge`
- `m5.24xlarge`
- `m5.26xlarge`
- `m5.28xlarge`
- `m5.30xlarge`
- `


The output tells us that the chance of getting a correct response depends heavily on the context you send to the LLM.

Now the question becomes: where can we find relevant context for a given user query? The answer is to use a pre-stored knowledge base with retrieval-augmented generation, as shown below.

## Step 4. Use RAG based approach to identify the correct documents, and use them and question to query LLM

We plan to use document embeddings to fetch the most relevant documents in our document knowledge library and combine them with the prompt that we provide to the LLM.

To achieve that, we will do the following.

- Generate embeddings for each document in the knowledge library with the HuggingFace sentence-transformers embedding model.
- Identify the top K most relevant documents based on the user query.
  - For a query of your interest, generate the embedding of the query using the same embedding model.
  - Search for the indexes of the top K most relevant documents in the embedding space using a FAISS index.
  - Use the indexes to retrieve the corresponding documents.
- Combine the retrieved documents with the prompt and question and send them to the LLM.

Note: The retrieved document/text should be large enough to contain enough information to answer a question; but small enough to fit into the LLM prompt. 
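
One way to respect the size trade-off in the note above is to pack whole documents into a fixed character budget, most relevant first. The sketch below is an alternative to cutting the concatenated text at a fixed length; `pack_documents` is a hypothetical helper and the documents are placeholders.

```python
def pack_documents(documents, max_chars, separator="\n* "):
    """Greedily keep whole documents, in ranked order, while they fit in max_chars."""
    chosen, used = [], 0
    for doc in documents:
        cost = len(separator) + len(doc)
        if used + cost > max_chars:
            break  # the next document would overflow the budget
        chosen.append(separator + doc)
        used += cost
    return "".join(chosen)


# Placeholder documents, most relevant first (40 characters each).
ranked_docs = ["a" * 40, "b" * 40, "c" * 40]
context = pack_documents(ranked_docs, max_chars=100)
print(len(context))
```

Keeping whole documents avoids ending the context mid-sentence, at the cost of sometimes leaving part of the budget unused.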



### 4.1 Preparing the embedding model

We'll start by initializing our embedding model. In this notebook, we use the all-mpnet-base-v2 model.

If you want, you can experiment with other embedding models. For example, you can use the GPT-J-6B embedding model, which you can deploy with a few clicks from the SageMaker JumpStart UI.

In [9]:
pip install -U sentence-transformers

Note: you may need to restart the kernel to use updated packages.


In [10]:
import sentence_transformers

from sentence_transformers import SentenceTransformer
encoder = SentenceTransformer("all-mpnet-base-v2")

### 4.2 Generate embeddings for each document in the knowledge library with the embedding model

For the purpose of the demo, we will use the [Amazon SageMaker FAQs](https://aws.amazon.com/sagemaker/faqs/) as the knowledge library. The data is formatted as a CSV file with two columns, Question and Answer. We use only the Answer column as the documents of the knowledge library, from which relevant documents are retrieved based on a query.

Each row in the CSV dataset corresponds to a textual document. We pass the documents to the all-mpnet-base-v2 embedding model to get an embedding vector for each one. For your own use case, you can replace the example dataset with your own to build a custom question answering application.

In [11]:
import pandas as pd

df_knowledge = pd.read_csv("Amazon_SageMaker_FAQs.csv", header=None, usecols=[1], names=["Answer"])
df_knowledge.head(5)

| | Answer |
|---|---|
| 0 | Amazon SageMaker is a fully managed service to... |
| 1 | For a list of the supported Amazon SageMaker A... |
| 2 | Amazon SageMaker is designed for high availabi... |
| 3 | Amazon SageMaker stores code in ML storage vol... |
| 4 | Amazon SageMaker ensures that ML model artifac... |
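
If you swap in your own dataset, the layout the loading cell expects is a CSV without a header row, with the question in the first column and the answer in the second. The sketch below shows that layout with the standard-library `csv` module and two made-up rows. It illustrates the file format only and is not a replacement for the pandas call above.

```python
import csv
import io

# Made-up rows: no header line, question first, answer second.
# Fields containing commas must be quoted.
raw_csv = io.StringIO(
    '"What is X?","X is a made-up service."\n'
    '"Does X support Y?","Yes, X supports Y, including Z."\n'
)

# Keep only the second column, mirroring the Answer-only knowledge library.
answers = [row[1] for row in csv.reader(raw_csv)]
print(answers)
```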


In [12]:
vectors = encoder.encode(df_knowledge['Answer'])

In [13]:
vectors.shape[1]

768

### 4.3 Index the embedding knowledge library

We will store and match the embeddings using a vector index. In this notebook, we use [FAISS](https://github.com/facebookresearch/faiss), a similarity search library. The index here is transient and held in memory.

In real-world scenarios, [Amazon OpenSearch Service](https://aws.amazon.com/opensearch-service/) is a popular choice for a vector database. You may refer to this [blog](https://aws.amazon.com/blogs/machine-learning/build-a-powerful-question-answering-bot-with-amazon-sagemaker-amazon-opensearch-service-streamlit-and-langchain/) and follow the instructions to build your own RAG solution with OpenSearch. Alternatively, you may use [Amazon Kendra](https://aws.amazon.com/kendra/) for its built-in NLP capabilities and pre-trained domain knowledge. Refer to this [blog](https://aws.amazon.com/blogs/machine-learning/quickly-build-high-accuracy-generative-ai-applications-on-enterprise-data-using-amazon-kendra-langchain-and-large-language-models/) for guided instructions on how to set it up.

In [14]:
pip install faiss-cpu

Note: you may need to restart the kernel to use updated packages.


In [15]:
import faiss

vector_dimension = vectors.shape[1]
index = faiss.IndexFlatL2(vector_dimension)
faiss.normalize_L2(vectors)
index.add(vectors)
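
Because the vectors are L2-normalized before being added, the distances this index returns relate directly to cosine similarity. `IndexFlatL2` reports squared Euclidean distances, and for unit vectors the squared distance equals `2 - 2 * cosine`. The sketch below checks that identity with two arbitrary toy vectors in plain Python, then converts a distance cutoff of 0.5 (chosen here for illustration) into the equivalent cosine floor.

```python
import math


def normalize(v):
    """Scale a vector to unit length, as faiss.normalize_L2 does in place."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


a = normalize([1.0, 2.0, 3.0])  # arbitrary toy vectors
b = normalize([2.0, 1.0, 0.5])

d2 = squared_l2(a, b)
cos = dot(a, b)  # equals cosine similarity because both vectors have unit length
print(d2, 2 - 2 * cos)  # the two values agree up to floating-point error

# A squared-distance cutoff d corresponds to a cosine floor of 1 - d / 2.
distance_threshold = 0.5
cosine_floor = 1 - distance_threshold / 2
print(cosine_floor)
```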

### 4.4 Retrieve the most relevant documents

Given the embedding of a query, we will search the FAISS index to get the indexes of the top K most relevant documents and use those indexes to retrieve the corresponding textual documents.

Next, the textual documents are concatenated and truncated to at most MAX_SECTION_LEN characters. This makes sure the context we send in the prompt contains enough information without exceeding the model's capacity.

In [16]:
MAX_SECTION_LEN = 1500
THRESHOLD = 0.5
SEPARATOR = "\n* "


def construct_context(
    context_predictions_idx,
    context_prediction_dist,
    df_knowledge,
    threshold,
    context_length,
) -> tuple[str, int]:
    """Join the retrieved documents that pass the distance threshold into one context string."""
    chosen_sections = []

    # Results arrive sorted by ascending distance, so stop at the first one past the threshold.
    for doc_idx, dist in zip(context_predictions_idx, context_prediction_dist):
        if dist > threshold:
            break
        document_section = df_knowledge.loc[doc_idx]
        chosen_sections.append(SEPARATOR + document_section.replace("\n", " "))

    # Truncate by characters so the context stays within the prompt budget.
    concatenated_doc = "".join(chosen_sections)[:context_length]
    print(
        f"With maximum sequence length {context_length}, "
        f"selected top {len(chosen_sections)} document sections: {concatenated_doc}"
    )

    return concatenated_doc, len(chosen_sections)

Now we can convert the question to an embedding and search for the most relevant documents. In this example, we will retrieve the top 5 most relevant documents (k=5).

In [17]:
import numpy as np

question = 'Which instances can I use with Managed Spot Training in SageMaker?'
search_vector = encoder.encode(question)
_vector = np.array([search_vector])
faiss.normalize_L2(_vector)

In [18]:
k = 5
distances, indices = index.search(_vector, k=k)
distances, indices

(array([[0.23976025, 0.35568824, 0.4087937 , 0.51908976, 0.5701388 ]],
       dtype=float32),
 array([[90, 84, 91, 87, 85]]))
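
The search returns results sorted by ascending distance, so a distance threshold acts as a cutoff on how many of the k results are kept. The sketch below applies a threshold of 0.5 to the distances and indices printed above, using plain Python lists in place of the NumPy arrays.

```python
# Values copied from the search output above.
distances = [0.23976025, 0.35568824, 0.4087937, 0.51908976, 0.5701388]
indices = [90, 84, 91, 87, 85]
threshold = 0.5

selected = []
for doc_id, dist in zip(indices, distances):
    if dist > threshold:
        break  # results are sorted, so every later result is farther away
    selected.append(doc_id)

print(selected)  # [90, 84, 91]
```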

In [19]:
context_retrieve, num_context_doc = construct_context(
    indices[0], 
    distances[0], 
    df_knowledge["Answer"], 
    threshold=THRESHOLD,
    context_length=MAX_SECTION_LEN
)

With maximum sequence length 1500, selected top 3 document sections: 
* Managed Spot Training can be used with all instances supported in Amazon SageMaker.
* Managed Spot Training with Amazon SageMaker lets you train your ML models using Amazon EC2 Spot instances, while reducing the cost of training your models by up to 90%.
* Managed Spot Training is supported in all AWS Regions where Amazon SageMaker is currently available.


### 4.5 Combine the retrieved documents, prompt, and question to query the LLM

In [20]:
context = context_retrieve

prompt = f""""Context: {context}\n\nQuestion: {question}\n\nAnswer:"""
print(f"Question being asked is -- > {prompt}")

payload = {"inputs": prompt, "parameters": parameters}
payload = json.dumps(payload).encode('utf-8')

response = query_endpoint_with_json_payload(payload, endpoint_name)
response = parse_response(response)
print(response)

Question being asked is -- > "Context: 
* Managed Spot Training can be used with all instances supported in Amazon SageMaker.
* Managed Spot Training with Amazon SageMaker lets you train your ML models using Amazon EC2 Spot instances, while reducing the cost of training your models by up to 90%.
* Managed Spot Training is supported in all AWS Regions where Amazon SageMaker is currently available.

Question: Which instances can I use with Managed Spot Training in SageMaker?

Answer:


1. Amazon EC2 instances with a minimum of 1 vCPU and 2GB of memory can be used with Managed Spot Training in SageMaker.

2. Amazon EC2 instances with a minimum of 2 vCPUs and 4GB of memory can be used with Managed Spot Training in SageMaker.

3. Amazon EC2 instances with a minimum of 4 vCPUs and 8GB of memory can be used with Managed Spot Training in SageMaker.

4. Amazon EC2 instances with a minimum of 8 vCPUs and 16GB of memory can be used with Managed Spot Training in SageMaker.

5. Amazon EC2 instances
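
The same Context / Question / Answer layout is assembled by hand in several cells of this notebook. A small helper can keep the format consistent between the no-context and with-context cases. `build_prompt` below is a hypothetical convenience function, not part of the original notebook, and the inputs in the usage lines are placeholders.

```python
def build_prompt(question, context=""):
    """Assemble the Context/Question/Answer prompt; omit the context block when empty."""
    parts = []
    if context:
        parts.append(f"Context: {context}")
    parts.append(f"Question: {question}")
    parts.append("Answer:")
    return "\n\n".join(parts)


# Placeholder inputs, for illustration only.
plain_prompt = build_prompt("Q?")
rag_prompt = build_prompt("Q?", context="C")
print(rag_prompt)
```

The returned string can be passed as `inputs` in the request payload in the same way as the hand-built prompts.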

### 4.6 Clean Up

Uncomment the cell below to delete the endpoint after testing.

In [21]:
# model_predictor_inference.delete_endpoint()