# Retrieval Augmented Question & Answering with LangChain


### Context
Previously, the model told us how to change a tire, but we had to supply the relevant data and context ourselves. We explored using the models available under Bedrock to answer questions from the knowledge they learned during training, as well as from manually provided context. While that approach works for short documents or singleton applications, it does not scale to enterprise-level question answering, where large enterprise documents cannot all fit into the prompt sent to the model.

### Pattern
We can improve upon this process by implementing an architecture called Retrieval Augmented Generation (RAG). RAG retrieves data from outside the language model (non-parametric) and augments the prompt by adding the relevant retrieved data as context.

In this notebook we explain how to approach the Question Answering pattern: finding the relevant documents and using them to answer user questions.

### Challenges
- How to manage large document(s) that exceed the token limit
- How to find the document(s) relevant to the question being asked

### Proposal
To address the above challenges, this notebook proposes the following strategy:
#### Prepare documents
![Embeddings](./images/Embeddings_lang.png)

Before we can answer questions, the documents must be processed and stored in a document store index:
- Load the documents
- Process and split them into smaller chunks
- Create a numerical vector representation of each chunk using the Amazon Titan Embeddings model on Amazon Bedrock
- Create an index using the chunks and the corresponding embeddings
#### Ask question
![Question](./images/Chatbot_lang.png)

Once the document index is prepared, you are ready to ask questions; relevant documents will be fetched based on the question being asked. The following steps are executed:
- Create an embedding of the input question
- Compare the question embedding with the embeddings in the index
- Fetch the (top N) relevant document chunks
- Add those chunks as part of the context in the prompt
- Send the prompt to the model under Amazon Bedrock
- Get the contextual answer based on the documents retrieved
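The comparison and "top N" steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not how FAISS or LangChain implement search: the three-dimensional vectors, the chunk labels and the helper names `cosine_similarity` and `top_n_chunks` are all made up for the example, and real embeddings have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_n_chunks(query_vec, index, n=2):
    """Return the n chunk texts whose embeddings are most similar to the query."""
    scored = sorted(
        index,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in scored[:n]]

# Hypothetical index: (chunk label, toy 3-dimensional embedding) pairs.
toy_index = [
    ("chunk about medical conditions", [0.9, 0.1, 0.0]),
    ("chunk about flight delays", [0.1, 0.9, 0.1]),
    ("chunk about baggage", [0.0, 0.2, 0.9]),
]
toy_query_vec = [0.8, 0.2, 0.1]  # stands in for the embedded user question

top_chunks = top_n_chunks(toy_query_vec, toy_index, n=2)
print(top_chunks)
```

The chunks returned here are what would be added to the prompt as context in the next step.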

## Use Case
#### Dataset
To explain this architecture pattern we use a publicly available travel insurance policy document, downloaded as a PDF in the Data Preparation section. The example questions touch on topics such as:
- Exclusions for pre-existing medical conditions
- Cover for delayed flights

#### Persona
Let's assume the persona of a layperson who does not understand how travel insurance works or whether a particular situation is covered.

The model will try to answer from the documents in easy language.


## Implementation
To follow the RAG approach, this notebook uses the LangChain framework, which integrates with different services and tools that make it efficient to build patterns such as RAG. We will use the following tools:

- **LLM (Large Language Model)**: Amazon Titan Text Lite available through Amazon Bedrock (the configuration cell sets `model_id="amazon.titan-text-lite-v1"`; other Bedrock text models such as Anthropic Claude can be substituted)

  This model will be used to understand the document chunks and provide an answer in a human-friendly manner.
- **Embeddings Model**: Amazon Titan Embeddings available through Amazon Bedrock

  This model will be used to generate a numerical representation of the textual documents.
- **Document Loader**: PDF Loader available through LangChain

  This loader loads the documents from a source. For the sake of this notebook we load the sample files from a local path; this could easily be replaced with a loader that reads documents from enterprise internal systems.

- **Vector Store**: FAISS available through LangChain

  In this notebook we use this in-memory vector store to hold both the embeddings and the documents. In an enterprise context this could be replaced with a persistent store such as Amazon OpenSearch, Amazon RDS for PostgreSQL with pgvector, ChromaDB, Pinecone or Weaviate.
- **Index**: VectorIndex

  The index compares the input embedding with the document embeddings to find relevant documents.
- **Wrapper**: wraps the index, vector store, embeddings model and the LLM to abstract away the logic from the user.

## Setup


In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
%pip install "langchain>=0.1.11"
%pip install pypdf==4.1.0
%pip install langchain-community faiss-cpu==1.8.0 tiktoken==0.6.0 sqlalchemy==2.0.28


Note: you may need to restart the kernel to use updated packages.

In [3]:
import json
import os
import sys

import boto3
import botocore

boto3_bedrock = boto3.client('bedrock-runtime')

In [4]:
import warnings

from io import StringIO
import sys
import textwrap
import os
from typing import Optional

# External Dependencies:
import boto3
from botocore.config import Config

warnings.filterwarnings('ignore')

def print_ww(*args, width: int = 100, **kwargs):
    """Like print(), but wraps output to `width` characters (default 100)"""
    buffer = StringIO()
    try:
        _stdout = sys.stdout
        sys.stdout = buffer
        print(*args, **kwargs)
        output = buffer.getvalue()
    finally:
        sys.stdout = _stdout
    for line in output.splitlines():
        print("\n".join(textwrap.wrap(line, width=width)))
        

## Configure langchain

We begin by instantiating the LLM and the embeddings model. The cell below uses Amazon Titan Text Lite for text generation and Amazon Titan Embeddings for text embedding.

Note: It is possible to choose other models available with Bedrock. You can replace the `model_id` as follows to change the model.

`llm = Bedrock(model_id="amazon.titan-text-express-v1")`

Check the [documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/model-ids-arns.html) for the available text generation and embedding model IDs under Amazon Bedrock.

In [10]:
# We will be using the Titan Embeddings Model to generate our Embeddings.
from langchain.embeddings import BedrockEmbeddings
from langchain.llms.bedrock import Bedrock

# - create the text generation model (Amazon Titan Text Lite)
llm = Bedrock(model_id="amazon.titan-text-lite-v1", client=boto3_bedrock, model_kwargs={})
bedrock_embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0", client=boto3_bedrock)

`Note: As an exercise, if you have time, update the cell above so that it uses the "new/appropriate" version/style of code, so that the deprecation warning is resolved.`

## Data Preparation
Let's first download a file to build our document store. The cell below fetches a publicly available travel insurance policy PDF. Public IRS documents from [here](https://www.irs.gov/publications) are another suitable source.

In [30]:
from urllib.request import urlretrieve

os.makedirs("data", exist_ok=True)
files = ["https://static.aviva.io/content/dam/aviva-public/gb/pdfs/personal/insurance/travel/policy_document_insurance_travel_ntrtg10145_v34_092015.pdf"]
for url in files:
    file_path = os.path.join("data", url.rpartition("/")[2])
    urlretrieve(url, file_path)

After downloading we can load the documents with the help of [DirectoryLoader from PyPDF available under LangChain](https://python.langchain.com/en/latest/reference/modules/document_loaders.html) and split them into smaller chunks.

Note: Each retrieved chunk should be large enough to contain the information needed to answer a question, but small enough to fit into the LLM prompt. The embeddings model also limits its input to 8192 tokens, which roughly translates to ~32,000 characters. For this use case we create chunks of roughly 1000 characters with an overlap of 100 characters using [RecursiveCharacterTextSplitter](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/recursive_text_splitter.html).
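To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window splitter in plain Python. It is only a sketch: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs, newlines and spaces, which this version does not, and the function name `split_with_overlap` and the placeholder text are invented for the example.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=100):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample_text = "abcdefghij" * 250  # 2,500 characters of placeholder text
sample_chunks = split_with_overlap(sample_text, chunk_size=1000, chunk_overlap=100)
print([len(chunk) for chunk in sample_chunks])
```

The last 100 characters of each chunk are repeated at the start of the next one, so a sentence that straddles a boundary is still seen whole in at least one chunk.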

In [31]:
import numpy as np
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
#from langchain.document_loaders import PyPDFLoader, PyPDFDirectoryLoader
from langchain_community.document_loaders.pdf import PyPDFLoader, PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader("./data/")

documents = loader.load()
# - in our testing Character split works better with this PDF data set
text_splitter = RecursiveCharacterTextSplitter(
    # Chunks of roughly 1000 characters, with 100 characters of overlap between neighbours.
    chunk_size=1000,
    chunk_overlap=100,
)
docs = text_splitter.split_documents(documents)

In [32]:
avg_doc_length = lambda documents: sum([len(doc.page_content) for doc in documents])//len(documents)
avg_char_count_pre = avg_doc_length(documents)
avg_char_count_post = avg_doc_length(docs)
print(f'Average length among {len(documents)} documents loaded is {avg_char_count_pre} characters.')
print(f'After the split we have {len(docs)} documents more than the original {len(documents)}.')
print(f'Average length among {len(docs)} documents (after split) is {avg_char_count_post} characters.')

Average length among 28 documents loaded is 3229 characters.
After the split we have 111 documents more than the original 28.
Average length among 111 documents (after split) is 841 characters.


The single PDF was loaded as 28 page-level documents, which were then split into 111 smaller chunks.

Now we can see what a sample embedding looks like for one of those chunks.

In [33]:
try:
    sample_embedding = np.array(bedrock_embeddings.embed_query(docs[0].page_content))
    print("Sample embedding of a document chunk: ", sample_embedding)
    print("Size of the embedding: ", sample_embedding.shape)

except ValueError as error:
    if "AccessDeniedException" in str(error):
        print(f"\x1b[41m{error}\
        \nTo troubleshoot this issue please refer to the following resources.\
         \nhttps://docs.aws.amazon.com/IAM/latest/UserGuide/troubleshoot_access-denied.html\
         \nhttps://docs.aws.amazon.com/bedrock/latest/userguide/security-iam.html\x1b[0m\n")      
        class StopExecution(ValueError):
            def _render_traceback_(self):
                pass
        raise StopExecution        
    else:
        raise error

Sample embedding of a document chunk:  [-0.04845939 -0.01871589  0.03308285 ... -0.00302871  0.04442111
 -0.03230626]
Size of the embedding:  (1024,)


Following the same pattern, embeddings can be generated for the entire corpus and stored in a vector store.

This is easily done using the [FAISS](https://github.com/facebookresearch/faiss) implementation inside [LangChain](https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/faiss.html), which takes the embeddings model and the documents as input and creates the entire vector store. Using the index wrapper we can abstract away most of the heavy lifting, such as creating the prompt, getting embeddings of the query, sampling the relevant documents and calling the LLM. [VectorStoreIndexWrapper](https://python.langchain.com/en/latest/modules/indexes/getting_started.html#one-line-index-creation) helps us with that.

**⚠️⚠️⚠️ NOTE: it might take a few minutes to run the following cell ⚠️⚠️⚠️**

In [34]:
from langchain.chains.question_answering import load_qa_chain
from langchain.vectorstores import FAISS
from langchain.indexes import VectorstoreIndexCreator
from langchain.indexes.vectorstore import VectorStoreIndexWrapper

vectorstore_faiss = FAISS.from_documents(
    docs,
    bedrock_embeddings,
)

wrapper_store_faiss = VectorStoreIndexWrapper(vectorstore=vectorstore_faiss)

## Question Answering

Now that we have our vector store in place, we can start asking questions.

In [35]:
query = """Is it possible that I can have a claim rejected because I have a pre-existing condition?"""

The first step is to create an embedding of the query so that it can be compared with the documents.

In [36]:
query_embedding = vectorstore_faiss.embedding_function.embed_query(query)
np.array(query_embedding)

array([-0.00593043,  0.03191652, -0.00754783, ..., -0.02113391,
        0.06929623, -0.03392927])

We can use this embedding of the query to fetch relevant documents.
Now that our query is represented as an embedding, we can run a similarity search against our data store to find the most relevant information.

In [37]:
relevant_documents = vectorstore_faiss.similarity_search_by_vector(query_embedding)
print(f'{len(relevant_documents)} documents are fetched which are relevant to the query.')
print('----')
for i, rel_doc in enumerate(relevant_documents):
    print_ww(f'## Document {i+1}: {rel_doc.page_content}.......')
    print('---')

4 documents are fetched which are relevant to the query.
----
## Document 1: 17
3. Any claim for a medical condition if any of the following applied when you took out or renewed
your policy or
when you booked your trip (whichever is later). You :
a) had received advice, medication or treatment for any serious, chronic or recurring illness,
injury or disease in
the last 12 months unless the condition was disclosed to and accepted by us;
b) were under investigation or awaiting results for any diagnosed or undiagnosed condition unless
disclosed to
and accepted by us;
c) were on a waiting list for in-patient treatment or were aware of the need for in-patient
treatment for any
diagnosed or undiagnosed condition unless disclosed to and accepted by us;
d) had been told you have a terminal illness.
4. Any claim for a medical condition where you have been referred to a Consultant/Specialist,
attended A&E or
admitted to a hospital between booking your trip and the departure date unless disclosed

Now that we have the relevant documents, it's time to use the LLM to generate an answer based on them.

We take our initial prompt together with the relevant documents retrieved by the similarity search, and combine them into a single prompt that we send to the model. At this point the model should give us a well-informed answer grounded in the retrieved policy text.

LangChain provides an abstraction that makes this easy.

### Example #1: Using LangChain with RetrievalQA
You can use the wrapper provided by LangChain, which wraps around the vector store and takes the LLM as input.
This wrapper performs the following steps behind the scenes:
- Take the question as input
- Create question embedding
- Fetch relevant documents
- Stuff the documents and the question into a prompt
- Invoke the model with the prompt and generate the answer in a human readable manner.
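The "stuff" step in the list above amounts to concatenating the retrieved chunks and substituting them into a template. The sketch below shows that idea with plain string formatting; the template text, the blank-line separator and the helper name `stuff_prompt` are assumptions made for illustration and are not LangChain's own code.

```python
# Hypothetical template with the same two placeholders used in this notebook.
TOY_TEMPLATE = (
    "Use the following pieces of context to answer the question.\n"
    "<context>\n{context}\n</context>\n\n"
    "Question: {question}"
)

def stuff_prompt(chunks, question, template=TOY_TEMPLATE):
    """Concatenate all chunks into one context block and fill the template."""
    context = "\n\n".join(chunks)  # assumed separator: one blank line between chunks
    return template.format(context=context, question=question)

stuffed_prompt = stuff_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "Am I covered if there are flight delays?",
)
print(stuffed_prompt)
```

Because every retrieved chunk ends up in a single prompt, the number and size of chunks directly drive prompt length, which is why `k` and `chunk_size` matter for cost and latency.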

In [38]:
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
prompt_template = """

Human: Use the following pieces of context to provide a concise answer to the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
<context>
{context}
</context>

Question: {question}

Assistant:"""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

In [39]:
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore_faiss.as_retriever(
        search_type="similarity", search_kwargs={"k": 3}
    ),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)
answer = qa({"query": query})
print_ww(answer)

{'query': 'Is it possible that I can have a claim rejected because I have a pre-existing
condition?', 'result': ' Yes, if you do not declare your pre-existing condition, it could be
rejected.', 'source_documents': [Document(metadata={'source':
'data/policy_document_insurance_travel_ntrtg10145_v34_092015.pdf', 'page': 16}, page_content='17\n3.
Any claim for a medical condition if any of the following applied when you took out or renewed your
policy or \nwhen you booked your trip (whichever is later). You :\na) had received advice,
medication or treatment for any serious, chronic or recurring illness, injury or disease in \nthe
last 12 months unless the condition was disclosed to and accepted by us;\nb) were under
investigation or awaiting results for any diagnosed or undiagnosed condition unless disclosed to
\nand accepted by us;\nc) were on a waiting list for in-patient treatment or were aware of the need
for in-patient treatment for any \ndiagnosed or undiagnosed condition unless disc

`Note: As an exercise, if you have time, update the cell above so that it uses the "new/appropriate" version/style of code, so that the deprecation warning is resolved.`

That output shows the full response, which includes a lot of noise. Let's zero in on the natural language message that we might return to the end user:

In [40]:
print_ww(answer['result'])

 Yes, if you do not declare your pre-existing condition, it could be rejected.


This answer is consistent with the retrieved policy text, which excludes claims for certain medical conditions unless they were disclosed to and accepted by the insurer.

Let's ask a different question:

In [57]:
query_2 = "Am I covered if there are flight delays?"

In [58]:
answer_2 = qa({"query": query_2})
# show the full response
print_ww(answer_2)

{'query': 'Am I covered if there are flight delays?', 'result': ' If your flight is delayed for 24
hours or cancelled by the airline you can claim for abandonment of your trip.', 'source_documents':
[Document(metadata={'source': 'data/policy_document_insurance_travel_ntrtg10145_v34_092015.pdf',
'page': 12}, page_content='international return journey to the UK.\nCover does not apply for any
internal and/or onward connecting travel, including travel from and to the Channel \nIslands.\nIf
this happens… Am I covered?\nMy flight from Heathrow to Paris has been delayed due \nto bad weather
in France. Can I make a claim for the \ninconvenience?You can claim a benefit for delayed departure
only after \nyour flight has been delayed for 12 hours;\nIf your flight is delayed for 24 hours or
cancelled by the \nairline you can claim for abandonment of your trip.\nIf the scheduled departure
of the ship, aircraft or train on which you are booked to travel is delayed at the point of
\ninternational dep

That output shows the full response, which includes a lot of noise. Let's zero in on the natural language message that we might return to the end user:

In [59]:
print_ww(answer_2['result'])

 If your flight is delayed for 24 hours or cancelled by the airline you can claim for abandonment of
your trip.


This answer is only partially complete: according to the retrieved policy text, a delayed departure benefit can also be claimed once the flight has been delayed for 12 hours.

### Example #2
Now let's have another look at using [RetrievalQA](https://python.langchain.com/en/latest/modules/chains/index_examples/vector_db_qa.html), where you can customize how the fetched documents are added to the prompt using the `chain_type` parameter. If you want to control how many relevant documents are retrieved, change the `k` parameter in the cell below to see different outputs. In many scenarios you might want to know which source documents the LLM used to generate the answer; you can get those documents in the output using `return_source_documents`, which returns the documents that are added to the context of the LLM prompt. `RetrievalQA` also allows you to provide a custom [prompt template](https://python.langchain.com/en/latest/modules/prompts/prompt_templates/getting_started.html) which can be specific to the model.

In the cell below you see an example of how to control the prompt such that the LLM stays grounded and doesn't answer outside the context.

In [52]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

prompt_template = """

Human: Use the following pieces of context to provide a concise answer to the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
<context>
{context}
</context>

Question: {question}

Assistant:"""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore_faiss.as_retriever(
        search_type="similarity", search_kwargs={"k": 3}
    ),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)
query = "Is it possible that I can have a claim rejected because I have a pre-existing condition?"
result = qa({"query": query})
print_ww(result['result'])

 Yes, it is possible that your claim can be rejected if you have a pre-existing condition. This is
because pre-existing conditions are not covered by your travel insurance policy.


In [53]:
result['source_documents']

[Document(metadata={'source': 'data/policy_document_insurance_travel_ntrtg10145_v34_092015.pdf', 'page': 16}, page_content='17\n3. Any claim for a medical condition if any of the following applied when you took out or renewed your policy or \nwhen you booked your trip (whichever is later). You :\na) had received advice, medication or treatment for any serious, chronic or recurring illness, injury or disease in \nthe last 12 months unless the condition was disclosed to and accepted by us;\nb) were under investigation or awaiting results for any diagnosed or undiagnosed condition unless disclosed to \nand accepted by us;\nc) were on a waiting list for in-patient treatment or were aware of the need for in-patient treatment for any \ndiagnosed or undiagnosed condition unless disclosed to and accepted by us;\nd) had been told you have a terminal illness.\n4. Any claim for a medical condition where you have been referred to a Consultant/Specialist, attended A&E or \nadmitted to a hospital be

# Capstone Assignment Part 2

Don't panic! This is not as difficult as it might first seem.

Using the notebook above as a guide and/or starting point, consider and explore the following questions and tasks. As with most things in life, you'll get the most out of this exercise by putting in a reasonable amount of effort. That effort may be in research and/or in writing code.

Put all your answers into the notebook that you are going to submit as your completed assignment.
For each task, you are asked to consider the results and note your findings; please add these notes to the notebook.

This is a great opportunity to use or develop your markdown skills.


**Task 1** (for everyone) 

Change the base content used here (the PDFs in the data folder) to be content that is meaningful to you in some way. It might be that it relates to the business domain that you are interested in, or on a topic that interests you. 

For this task, your content should be approximately 15 pages of text. That could be in a single document or in multiple documents. Related to that content create 10 questions or so. These questions will be your initial test set that you will use to determine the quality of your RAG solution.

Upload your content and update the notebook to create a FAISS index of your documents, and update the question examples to use a subset of your questions.

Take a look at the outputs and consider the solution's accuracy. The notebook at this point is your baseline. You might want to make a snapshot of it before you progress further.

**Question 1** Baseline Evaluation 

We want to improve the quality of the output of the solution. We know that we have not done any tuning for the content so far, so our intuition is that we can do better. Before we start experimenting to improve the quality, we need to objectively measure the quality that we have.

You have a set of test questions (from task 1) to help drive your evaluation.

You decide that you will first test the retrieval aspect of the solution. You want a metric for whether the right chunks of content are being returned for your test queries.

A. Describe the ground-truth data you would create so that you can measure this. (hint: test query, document chunk)

Test Queries:  A set of sample queries that reflect realistic user intent across the full range of topics your system is meant to address. There are three types of queries that can be created:

1. Fact based: For information that is explicitly in the documents.
2. Reasoning: To test the system's ability to synthesize information across multiple chunks.
3. Paraphrased/Vague: To measure how well the system retrieves relevant information despite variations in query wording.

Document Chunks: Identify document chunks that would be most relevant in answering each query.

1. Annotation: Manually label document chunks that best address each query.
2. Relevance Scoring: Assign relevance scores to each document chunk based on how well it answers the query.
3. Negative Examples: Include some irrelevant or low-relevance document chunks in the set to test the system’s retrieval precision and ensure it doesn’t retrieve unrelated information.

B. Describe how you might determine success or failure of retrieval for a test query

Metrics such as the following can be used to determine success or failure:
1. Recall@k: Measures if the relevant documents appear in the top-k retrieved results.
2. Mean Reciprocal Rank (MRR): Focuses on how quickly the most relevant document appears in the ranking.
3. Mean Average Precision (MAP): Evaluates how well the system ranks relevant chunks across multiple test queries.
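
The three metrics listed above can be computed with a few small functions. This is a minimal sketch under stated assumptions: chunks are identified by made-up string IDs, relevance is binary, and the ground truth and retrieval results below are invented purely to exercise the functions.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant chunk, or 0.0 if none is retrieved."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@rank taken at each rank where a relevant chunk appears."""
    hits, precision_sum = 0, 0.0
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)

# Hypothetical ground truth and retrieval output, keyed by test query.
ground_truth = {"q1": {"c3"}, "q2": {"c1", "c7"}}
retrieved_ids = {"q1": ["c5", "c3", "c9"], "q2": ["c1", "c4", "c7"]}

n_queries = len(ground_truth)
mrr = sum(reciprocal_rank(retrieved_ids[q], ground_truth[q]) for q in ground_truth) / n_queries
mean_ap = sum(average_precision(retrieved_ids[q], ground_truth[q]) for q in ground_truth) / n_queries
print(f"MRR={mrr:.3f}  MAP={mean_ap:.3f}")
```

Averaging the per-query values over the whole test set gives MRR and MAP; Recall@k can be averaged the same way.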


C. Is your determination of success/failure binary (True/False) or graded 0.0 - 1.0 or both? Describe the reasoning for your choice.
In my example case of travel insurance, binary would be my choice, because I feel that this is a high-stakes case. Graded does provide more nuance, but if we are looking for a go/no-go outcome, binary is the better fit. There is also the possibility of combining both approaches.

D. Describe your approach for calculating an aggregate metric (or metrics) for measuring performance of your test set with this base configuration. And, outline the reasoning for your choice.

I would use a combination of MRR and MAP. MAP evaluates overall precision, and MRR captures how quickly a relevant document appears in the ranking.

This helps us because:
1. A combination of the two gives us a balanced perspective, and MAP is important for user-experience
2. Versatility - the combination can be applied to both single-answer and multi-answer systems
3. Error identification -  A low MAP may indicate issues with retrieving all relevant documents, while a low MRR may indicate that relevant documents are ranked too low, highlighting ranking improvements specifically

**Task 2** (strongly encouraged, but optional) 

Implement, and perhaps refine, your evaluation methods and capture your baseline metrics, by running your set of test cases.

Briefly summarize your results (the baseline metrics) and any intuitions that you have about the results.

**Question 2** Chunking

You know from your research that a critical factor in RAG solutions is how the document corpus is chunked (before an embedding is created for each chunk, etc.). Consider your content and the chunking options that are commonly used in RAG solutions.

Choose 2 or 3 chunking options/variants that you believe might be better than the baseline option that is configured in this notebook, and which are therefore worth experimenting with.

For each option that you choose, briefly outline why you think that might provide better results for your solution and particular document set.

For consideration: 

https://www.pinecone.io/learn/chunking-strategies/

https://python.langchain.com/docs/how_to/semantic-chunker/

https://blog.langchain.dev/a-chunk-by-any-other-name/

**Task 3** (strongly encouraged, but optional) 

Update the implementation to use one or more of your chunking options.

For the evaluation process, you may need to create a new ground-truth dataset for each chunking option. The good news is that we're only using 10 test cases or so.

Run evaluation methods and capture the revised metrics. 

Briefly summarize your results (the new metrics) and any intuitions that you have about the results.

For consideration: 

https://github.com/aws-samples/amazon-bedrock-claude-2-and-3-with-langchain-popular-use-cases/blob/main/Amazon%20Bedrock%20%26%20Langchain%20Sample%20Solutions.ipynb

**Question 3** Embeddings

You know that one of the major factors that will impact the accuracy of the retrieval solution is the embeddings model that is used to encode the document corpus and the questions that get asked. 

Your customer has asked you to select two alternative models to the one in the baseline solution, from the set of embeddings models available to you in Amazon Bedrock and from Voyage AI. Your goal is to suggest two that will provide better retrieval performance than the baseline.

Choose two models and describe your reasoning on why each of those models might provide better results.

As with much of generative AI, the best option will need testing with the content and likely cannot be fully determined from research/experience alone, but your goal is to provide a brief rationale for your recommendation.


**Task 4** (strongly encouraged, but optional)

Using your best (perhaps only) chunking strategy, experiment with applying the embeddings models that you recommended.

Run evaluation methods and capture the revised metrics. 

Briefly summarize your results (the new metrics) and any intuitions that you have about the results.

For consideration:

https://python.langchain.com/docs/integrations/text_embedding/voyageai/

https://docs.voyageai.com/docs/embeddings

**Question 4** K

The number of matching results, K, fed into your LLM (in the augmented prompt) has an impact on the cost, latency and quality of the output generated by the solution. In an ideal world, there would be only one chunk, K = 1, and that chunk would contain all the information needed for the user's question and would be right-sized, with little text that is not relevant to the question.

A. K is set to 3 for the baseline solution. If you set this to 1, 2, 4, or 5, how will this change your evaluation metrics?

B. If your retrieval metrics are well-designed, there is likely very little change if K is set to 3, 4 or 5. Why is this the case?

C. It was noted (above) that we do not want K to be large. How might we test what the best size for K is? Hint: it goes beyond just testing retrieval metrics.


**Task 5** (strongly encouraged, but optional)

Experiment with different values of K (1, 2, 3, 4). 

Evaluate and capture the retrieval metrics for each value of K.

Briefly summarize your results (the new metrics) and any intuitions that you have about the results.
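
One possible way to organise this experiment is to retrieve once with the largest K of interest and then score each smaller K by slicing the ranked list. The sketch below uses a binary hit metric; the chunk IDs, labels and function names (`hit_at_k`, `hit_rate_by_k`) are hypothetical and only illustrate the bookkeeping.

```python
def hit_at_k(ranked_ids, relevant_ids, k):
    """Binary success: True if any relevant chunk is within the top-k results."""
    return any(chunk_id in relevant_ids for chunk_id in ranked_ids[:k])

def hit_rate_by_k(ranked_by_query, relevant_by_query, k_values):
    """Map each k to the share of test queries with a relevant chunk in the top k."""
    rates = {}
    for k in k_values:
        hits = sum(
            hit_at_k(ranked_by_query[q], relevant_by_query[q], k)
            for q in relevant_by_query
        )
        rates[k] = hits / len(relevant_by_query)
    return rates

# Hypothetical rankings (retrieved once with K = 4) and labels for three test queries.
ranked_by_query = {
    "q1": ["c2", "c8", "c1", "c4"],
    "q2": ["c5", "c6", "c3", "c9"],
    "q3": ["c7", "c1", "c2", "c6"],
}
relevant_by_query = {"q1": {"c2"}, "q2": {"c3"}, "q3": {"c0"}}

hit_rates = hit_rate_by_k(ranked_by_query, relevant_by_query, k_values=[1, 2, 3, 4])
print(hit_rates)
```

In this toy data the hit rate stops improving once K reaches 3, which is the kind of plateau worth looking for in your own results.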


**Question 5** Re-ranking

Adding a re-ranking model to our RAG pipeline 1/ helps us provide a better set of document chunks for RAG, and 2/ may allow us to reduce the number of chunks used for augmenting the prompt.

Briefly explain how re-ranking helps with points 1 and 2 above.

Given the characteristics of your document content and the RAG pipeline we have here, can you recommend one or two re-ranking models to experiment with? Briefly outline the rationale for your recommendation.

For consideration:

https://blog.voyageai.com/2024/03/15/boosting-your-search-and-rag-with-voyages-rerankers/

https://developer.nvidia.com/blog/enhancing-rag-pipelines-with-re-ranking/

**Task 6** (encouraged, but optional)

Add re-ranking to the RAG solution. Using the Voyage AI models will likely be easiest. Experiment with a couple of re-ranking models.

After adding the re-ranker you will likely find that you get the best evaluation results by retrieving K = 3 or 4 chunks from the FAISS vector database and then keeping 2 or 3 of the re-ranked chunks.
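
The two-stage shape described above (retrieve K candidates, re-score them, keep the best few) can be sketched as follows. The scoring function here is a deliberately crude word-overlap score standing in for a real re-ranking model; it, the candidate texts and the helper names are assumptions for illustration only.

```python
def overlap_score(query, chunk):
    """Toy relevance score: share of query words that also occur in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words)

def rerank(query, candidates, score_fn=overlap_score, top_n=2):
    """Re-order first-stage candidates by score_fn and keep the best top_n."""
    ranked = sorted(candidates, key=lambda chunk: score_fn(query, chunk), reverse=True)
    return ranked[:top_n]

# Hypothetical first-stage results (e.g. K = 4 from the vector store), in retrieval order.
candidates = [
    "general conditions of the policy",
    "cover for delayed flight departure",
    "how to contact customer service",
    "claiming for flight delay cover",
]
kept_chunks = rerank("flight delay cover", candidates, top_n=2)
print(kept_chunks)
```

With a real re-ranker, `score_fn` would call the model on each (query, chunk) pair; the surrounding sort-and-truncate logic stays the same.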

Briefly summarize your results (the new metrics) and any intuitions that you have about the results.

For consideration:

https://python.langchain.com/docs/integrations/document_transformers/voyageai-reranker/

**Task 7** (encouraged, but optional)

The evaluation metrics that you have used thus far are likely not sensitive to the rank at which the relevant chunk(s) appear in the retrieved set. The order of the chunks, aka rank, is important because 1/ it will likely help the LLM produce a better output, and 2/ if the retrieval system consistently returns chunks in an optimal order, it allows us to more aggressively prune the retrieval results before giving the chunks to the LLM.

Add a metric to your evaluation set that rewards correct ordering. The new metric should produce the highest value when the correct chunk(s) appear in the top rank(s), and lower values when they appear in lower ranks. This metric better reflects our retrieval system objectives.
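
One commonly used rank-aware metric is normalised discounted cumulative gain (NDCG), shown here as one option among several. The sketch assumes binary relevance labels and normalises against the best ordering of the retrieved list only, which is a simplification; the label lists are invented for the example.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: relevance at rank r is divided by log2(r + 1)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )

def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering; 0.0 if nothing is relevant."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical binary labels for the retrieved chunks, in rank order (1 = relevant).
relevant_first = [1, 0, 0]
relevant_last = [0, 0, 1]

print(ndcg_at_k(relevant_first, k=3))
print(ndcg_at_k(relevant_last, k=3))
```

The same relevant chunk scores higher when it is ranked first than when it is ranked third, which is the behaviour this task asks for.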

Once you have this, redo task 6, and see what the optimal configuration is for K and the re-ranking configuration.

Briefly summarize your results (the new metrics) and any intuitions that you have about the results.

For consideration:

https://www.evidentlyai.com/ranking-metrics/evaluating-recommender-systems

https://towardsdatascience.com/ranking-evaluation-metrics-for-recommender-systems-263d0a66ef54

**Question 6** Distance

An aspect of RAG tuning that is easy to experiment with, but is often overlooked, is the distance measure used to compare the embedding of the query with the embeddings of the documents.

What is the distance measure/method that is used in the baseline implementation?

Your customer wants to experiment with one or two other distance measures for this pipeline. Which measures are you going to recommend? Provide a brief rationale for your recommendation.
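
When comparing candidates, it helps to see how the common measures relate to each other. The sketch below computes Euclidean (L2) distance, inner product and cosine similarity on two made-up 2-dimensional vectors, and checks the identity that for unit-length vectors the squared L2 distance equals 2 - 2 * cosine. It says nothing about which measure the baseline uses; that is left for you to look up.

```python
import math

def inner_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Inner product divided by the product of the vector lengths."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(inner_product(v, v))
    return [x / norm for x in v]

# Hypothetical 2-dimensional "embeddings", normalised to unit length.
query_vec = normalize([3.0, 4.0])
doc_vec = normalize([4.0, 3.0])

cos = cosine_similarity(query_vec, doc_vec)
ip = inner_product(query_vec, doc_vec)
l2_squared = l2_distance(query_vec, doc_vec) ** 2
print(cos, ip, l2_squared, 2 - 2 * cos)
```

For unit-length vectors the three measures therefore produce the same ranking; they only diverge when embeddings are not normalised, which is worth checking before expecting different results.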

Reference:

https://api.python.langchain.com/en/latest/vectorstores/langchain_community.vectorstores.faiss.FAISS.html

https://python.langchain.com/docs/integrations/vectorstores/faiss/

https://github.com/facebookresearch/faiss/wiki/MetricType-and-distances
https://www.pinecone.io/learn/series/faiss/faiss-tutorial/

**Task 8** (encouraged, but optional)

Experiment with setting the distance method for the vector database comparison to the alternatives that you suggested above.

Briefly summarize your results (the new metrics) and any intuitions that you have about the results.


**Question 7** Wrapping up the Retrieval System

At this point we have completed a full iteration of tuning of the data retrieval aspect of our RAG solution. 

A. Briefly outline two or three insights from this exercise

B. Do you have further suggestions for how the accuracy of the retrieval system could be tuned?