# Evaluation of RAG Using Ragas

In the following notebook we'll explore how to evaluate RAG pipelines using a powerful open-source tool called "Ragas". This will give us tools to evaluate component-wise metrics, as well as end-to-end metrics about the performance of our RAG pipelines.

In this notebook we'll complete the following tasks:

- 🤝 Breakout Room #1
  1. Task 1: Installing Required Libraries
  2. Task 2: Set Environment Variables
  3. Task 3: Creating a Simple RAG Pipeline with LangChain v0.2.0
  4. Task 4: Synthetic Dataset Generation for Evaluation using Ragas (Optional)

- 🤝 Breakout Room #2
  1. Task 1: Evaluating our Pipeline with Ragas
  2. Task 2: Testing OpenAI's Claim
  3. Task 3: Selecting an Advanced Retriever and Evaluating

The only way to get started is to get started - so let's grab our dependencies for the day!

> NOTE: Using this notebook as presented will incur a charge of ~$3USD from OpenAI usage.

> NOTE: This Notebook *does* contain a bonus challenge, outlined at the bottom of the notebook, which you can complete instead of the notebook for full marks on the assignment.

## Motivation

A claim, made by OpenAI, is that their `text-embedding-3-small` is better (generally) than their `text-embedding-ada-002` model.

Here are some passages from their [blog](https://openai.com/blog/new-embedding-models-and-api-updates) about the `text-embedding-3` release:

> `text-embedding-3-small` is our new highly efficient embedding model and provides a significant upgrade over its predecessor, the `text-embedding-ada-002` model...

> **Stronger performance.** Comparing `text-embedding-ada-002` to `text-embedding-3-small`, the average score on a commonly used benchmark for multi-language retrieval ([MIRACL](https://github.com/project-miracl/miracl)) has increased from 31.4% to 44.0%, while the average score on a commonly used benchmark for English tasks ([MTEB](https://github.com/embeddings-benchmark/mteb)) has increased from 61.0% to 62.3%.

Well, with a library like Ragas - we can put that claim to the test!

If what they claim is true - we should see an increase on related metrics by using the new embedding model!

# 🤝 Breakout Room Part #1

## Task 1: Installing Required Libraries

A reminder that one of the [key features](https://python.langchain.com/v0.2/docs/versions/v0_2/) of LangChain v0.2.0 is the compartmentalization of the various LangChain ecosystem packages!

So let's begin grabbing all of our LangChain related packages!

In [1]:
!pip install -U -q langchain langchain-openai langchain_core langchain-community langchainhub openai langchain-qdrant

We'll also get the "star of the show" today, which is Ragas!

In [2]:
!pip install -qU ragas

We'll be leveraging [Qdrant](https://qdrant.tech/) again as our LangChain `VectorStore`.

We'll also install `pymupdf` and its dependencies which will allow us to load PDFs using the `PyMuPDFLoader` in the `langchain-community` package!

In [3]:
!pip install -qU qdrant-client pymupdf pandas

## Task 2: Set Environment Variables

Let's set up our OpenAI API key so we can leverage their API later on.

In [4]:
import os
import openai
from getpass import getpass

openai.api_key = getpass("Please provide your OpenAI Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

## Task 3: Creating a Simple RAG Pipeline with LangChain v0.2.0

Building on what we've been learning, we'll be leveraging LangChain v0.2.0 and LCEL to build a simple RAG pipeline that we can baseline with Ragas.

## Building our RAG pipeline

Let's review the basic steps of RAG again:

- Create an Index
- Use retrieval to obtain pieces of context from our Index that are similar to our query
- Use a LLM to generate responses based on the retrieved context

Let's get started by creating our index.

> NOTE: We're going to start leaning on the term "index" to refer to our `VectorStore`, `VectorDatabase`, etc. We can think of "index" as the catch-all term, whereas `VectorStore` and the like relate to the specific technologies used to create, store, and interact with the index.
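
To make the "index, then retrieve" steps above concrete, here is a minimal sketch of an in-memory index using only the standard library. The `embed` function is a toy bag-of-words stand-in (an assumption for illustration, not a real embedding model), and `ToyIndex` is a hypothetical class, not part of LangChain or Qdrant.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyIndex:
    """Hypothetical index: stores (text, vector) pairs and ranks them by similarity."""

    def __init__(self):
        self.entries = []

    def add(self, texts):
        for text in texts:
            self.entries.append((text, embed(text)))

    def search(self, query, k=2):
        query_vector = embed(query)
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine(query_vector, entry[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


index = ToyIndex()
index.add([
    "pick an industry where change is happening",
    "get right to the center of the industry",
    "structured procrastination helps you get things done",
])
top = index.search("how do I pick an industry", k=1)
print(top)
```

The retrieved text would then be placed into the prompt for the generation step. A real vector store swaps the toy `embed` for a learned embedding model and the linear scan for an approximate nearest neighbor search.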

### Creating an Index

You'll notice that the largest changes (outside of some import changes) are that our old favourite chains are back to being bundled in an easily usable abstraction.

We can still create custom chains using LCEL - but we can also be more confident that our pre-packaged chains are created using LCEL under the hood.

#### Loading Data

Let's start by loading some data!

- [`PyMuPDFLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.pdf.PyMuPDFLoader.html)

> NOTE: You'll notice that we're using a document loader from the community package of LangChain. This is part of the v0.2.0 changes that make the base (`langchain-core`) package remain lightweight while still providing access to some of the more powerful community integrations.

In [5]:
from langchain_community.document_loaders import PyMuPDFLoader

PDF_LINK = "https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf"

loader = PyMuPDFLoader(
    PDF_LINK
)

documents = loader.load()

Let's look at the metadata for the first document.

In [6]:
documents[0].metadata

{'source': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf',
 'file_path': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf',
 'page': 0,
 'total_pages': 195,
 'format': 'PDF 1.3',
 'title': 'The Pmarca Blog Archives',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': '',
 'producer': 'Mac OS X 10.10 Quartz PDFContext',
 'creationDate': "D:20150110020418Z00'00'",
 'modDate': "D:20150110020418Z00'00'",
 'trapped': ''}

#### Transforming Data

Now that we've got our single document - let's split it into smaller pieces so we can more effectively leverage it with our retrieval chain!

We'll start with the classic: `RecursiveCharacterTextSplitter`.

- [`RecursiveCharacterTextSplitter`](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html#langchain-text-splitters-character-recursivecharactertextsplitter)

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

CHUNK_SIZE = 200
CHUNK_OVERLAP = 50

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
)

documents = text_splitter.split_documents(documents)

Let's confirm we've split our document.

In [8]:
len(documents)

1864
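
To build some intuition for how `chunk_size` and `chunk_overlap` interact, here is a simplified sketch of a fixed-window splitter. This is not how `RecursiveCharacterTextSplitter` works internally (it first tries to split on separators such as paragraphs, lines, and words); `split_with_overlap` is a hypothetical helper that only illustrates the sliding window idea.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.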

#### Loading OpenAI Embeddings Model

We'll need a process by which we can convert our text into vectors that allow us to compare to our query vector.

Let's use OpenAI's `text-embedding-ada-002` for this task!

- [`OpenAIEmbeddings`](https://api.python.langchain.com/en/latest/embeddings/langchain_openai.embeddings.base.OpenAIEmbeddings.html#langchain-openai-embeddings-base-openaiembeddings)

> NOTE: We are purposefully using an older embedding model to try and answer the guiding question: Is TE3 better than Ada-002?

In [9]:
from langchain_openai import OpenAIEmbeddings

EMBEDDING_MODEL = "text-embedding-ada-002"

embeddings = OpenAIEmbeddings(
    model=EMBEDDING_MODEL
)

#### Creating a Qdrant VectorStore

Now that we have documents - we'll need a place to store them alongside their embeddings.

- [`Qdrant`](https://api.python.langchain.com/en/latest/qdrant/langchain_qdrant.qdrant.QdrantVectorStore.html#langchain_qdrant.qdrant.QdrantVectorStore)

> NOTE: You'll need to provide the embedding dimension for Ada-002!
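
Since the collection's vector size has to match the embedding model's output dimension, a small guard can catch mismatches before any documents are added. The sketch below is a hypothetical helper (not part of LangChain or `qdrant-client`); the dimensions listed are the default output sizes OpenAI publishes for these models.

```python
# Default output dimensions for OpenAI embedding models.
DEFAULT_DIMENSIONS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def check_vector_size(model_name, vector_size):
    """Hypothetical helper: fail early if the collection size does not match the model."""
    expected = DEFAULT_DIMENSIONS[model_name]
    if vector_size != expected:
        raise ValueError(
            f"{model_name} produces {expected}-dimensional vectors, "
            f"but the collection was configured for {vector_size}"
        )
    return vector_size


size = check_vector_size("text-embedding-ada-002", 1536)
print(size)
```

Note that the `text-embedding-3` models can also return shortened vectors when a smaller dimension is requested, so this lookup only covers the defaults.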

In [12]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

LOCATION = ":memory:"
COLLECTION_NAME = "PMarca Blogs"
VECTOR_SIZE = 1536

In [13]:
qdrant_client = QdrantClient(LOCATION)

qdrant_client.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE)
)

qdrant_vector_store = QdrantVectorStore(
    client=qdrant_client,
    collection_name=COLLECTION_NAME,
    embedding=embeddings  # Reuse the Ada-002 embeddings object defined above
)

qdrant_vector_store.add_documents(documents)

['b06da23ceb0144749a55b8f0adbe231b',
 '1e41299cb46c481795f06d02df5d1731',
 '8dce787647db449c9e596fa21ce8d764',
 'b7c2c2bc5d76420385ac2d0594e885c7',
 '91e7b21273fa469ea8b0d445170029ef',
 '7340ec9804c245a8af065cc033880ab2',
 'a99b0c665bb140beb05ead47e26b19d9',
 '0cb420c2be7d4378a5a0c753f3e3054d',
 '761aded430df4d6cb401730fd409f309',
 '22e5382be3f84c6693f9f0227a7f5cf9',
 '26b9b161a1cc4c8b94d81b2c6480030d',
 '0d72ade5db654d1fa9e1990b7a369723',
 '077435f32beb456d82b0de49911b27d0',
 'bc15d4f7eacd4bb784005eaf09c5864d',
 '3cc7467a79d94beeaad2a2b3cfc422c6',
 '71fb171a1c504979b677c33e49ce1285',
 '0999ed04b8d941388bcc15a1d73faef7',
 'fc9fc44ccb88450083523ca728d9688b',
 'f9f89191d0f546d2be6a0827f88c9708',
 '98418957a8394f4cb410166ce39d05ef',
 'e861cc7a6c654c5a8977183b42d7f8c0',
 'd29fe5ebbc514f21a0a3fd314e7452a0',
 '9c7244d4dc3141379a15c5bf8719b20a',
 'd4c8ef4e2e974d5c954770d861b4ffad',
 '1a23bd84ca034ba9955f191b1aa9fe2c',
 'd15d5efc32d8411683df72b8237b37b2',
 '65bd9eda20d249d09cf577fa4bf44e84',
 

#### ❓ Question #1:

List out a few of the techniques that Qdrant uses that make it performant.

> NOTE: Check the [documentation](https://qdrant.tech/documentation/overview/) for more information about Qdrant!

#### ! Answer #1:

Qdrant is a vector database. Vector databases are designed to store and query high-dimensional vectors efficiently.

Qdrant uses Hierarchical Navigable Small World (HNSW) graphs for fast approximate nearest neighbor search. Organizing the vectors into a hierarchical graph allows efficient searching of large datasets, because it reduces the number of comparisons needed to find similar vectors.

Qdrant also supports quantization (including Product Quantization), which reduces the precision of stored vectors, usually with only a small impact on search accuracy. Quantization saves memory, allowing more vectors to fit in the same memory space, which can also improve search time.

Qdrant supports the optional addition of payloads alongside the point data. Payload filtering allows for more efficient and targeted queries: it narrows the set of candidate vectors (based on the passed criteria) before the similarity search is carried out.

Although Qdrant handles in-memory storage, it takes a hybrid approach between memory and disk, allowing less frequently accessed vectors to be stored on disk. This helps Qdrant be scalable and resource-efficient.

Qdrant supports multi-threading and parallel processing. It also allows real-time additions and updates to its indexes, and it supports several distance metrics, so the best metric can be chosen for a given application.
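
To illustrate the memory/accuracy trade-off behind quantization, here is a minimal sketch. It uses simple scalar quantization (each float mapped to one of 256 integer levels) rather than Product Quantization, because it fits in a few lines. It is an illustration of the general idea under that assumption, not Qdrant's actual implementation, and the sample vector is made up.

```python
def quantize(vector, levels=255):
    """Map each float to an integer code in [0, levels] using a linear scale."""
    lo, hi = min(vector), max(vector)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((x - lo) / scale) for x in vector]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Approximately recover the original floats from the integer codes."""
    return [lo + code * scale for code in codes]


vector = [0.12, -0.40, 0.33, 0.05, -0.08]  # made-up example values
codes, lo, scale = quantize(vector)
restored = dequantize(codes, lo, scale)
max_error = max(abs(a - b) for a, b in zip(vector, restored))

print(codes)
print(max_error)
```

Each component now needs one byte instead of four (for 32-bit floats), and the reconstruction error is bounded by half of one quantization step.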




#### Creating a Retriever

To complete our index, all that's left to do is expose our vectorstore as a retriever - which we can do the same way we would in previous versions of LangChain!

In [14]:
retriever = qdrant_vector_store.as_retriever()

#### Testing our Retriever

Now that we've gone through the trouble of creating our retriever - let's see it in action!

In [15]:
retrieved_documents = retriever.invoke("What is a rule of thumb for selecting an industry to invest in?")

In [19]:
for doc in retrieved_documents:
  print(f"content: {doc.page_content}")
  print(f"metadata: {doc.metadata}")
  print("---")

content: the existing order — and make sure that those forces of change
have a reasonable chance at succeeding.
Second rule of thumb:
Once you have picked an industry, get right to the center of it
metadata: {'source': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'file_path': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'page': 125, 'total_pages': 195, 'format': 'PDF 1.3', 'title': 'The Pmarca Blog Archives', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': 'Mac OS X 10.10 Quartz PDFContext', 'creationDate': "D:20150110020418Z00'00'", 'modDate': "D:20150110020418Z00'00'", 'trapped': '', '_id': 'e5e9617fae5e4183961e3f01ed47f359', '_collection_name': 'PMarca Blogs'}
---
content: Third rule:
In a rapidly changing Held like technology, the best place to
get experience when you’re starting out is in younger, high-
growth companies.
metadata: {'source': 'https://d1lamhf6l6yk6d.cloudfront.

### Creating a RAG Chain

Now that we have the "R" in RAG taken care of - let's look at creating the "AG"!

#### Creating a Prompt Template

There are a few different ways we could create our prompt template - we could create a custom template, as seen in the code below, or we could simply pull a prompt from the prompt hub! Let's look at an example of that!

In [20]:
from langchain import hub

retrieval_qa_prompt = hub.pull("langchain-ai/retrieval-qa-chat")


In [21]:
print(retrieval_qa_prompt.messages[0].prompt.template)

Answer any use questions based solely on the context below:

<context>
{context}
</context>


As you can see - the prompt template is simple (and has a small error) - so we'll create our own to be a bit more specific!

In [53]:
from langchain.prompts import ChatPromptTemplate

template = """
Given the provided context and question, you must answer the question based only on context.
Be concise.

If you cannot answer the question based on the context - you must say "I don't know".

Question:
{question}

Context:
{context}
"""

improved_retrieval_qa_prompt = ChatPromptTemplate.from_template(template)

#### Setting Up our Basic QA Chain

Now we can instantiate our basic RAG chain!

We'll use LCEL directly just to see an example of it - but you could just as easily use an abstraction here to achieve the same goal!

We'll also make sure to pass through our context - which is critical for Ragas.

In [54]:
from operator import itemgetter

from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

primary_qa_llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the retriever
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    # "context"  : re-assigned via RunnablePassthrough.assign so it is carried forward unchanged
    #              (taken from the "context" key of the previous step)
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": improved_retrieval_qa_prompt | primary_qa_llm, "context": itemgetter("context")}
)
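
The data flow of the chain above can be sketched in plain Python, which may make the shape of the output easier to see. Everything here is a stand-in: `fake_retriever`, `fake_llm`, `TEMPLATE`, and `run_chain` are hypothetical names that mimic the steps without calling LangChain or OpenAI.

```python
def fake_retriever(question):
    """Stand-in for `retriever`: returns a fixed list of context strings."""
    return ["Second rule of thumb: get right to the center of the industry."]


def fake_llm(prompt):
    """Stand-in for `primary_qa_llm`: just reports how long the prompt was."""
    return f"(answer generated from a {len(prompt)}-character prompt)"


TEMPLATE = "Question:\n{question}\n\nContext:\n{context}\n"


def run_chain(inputs):
    # Step 1: build {"context", "question"} from the incoming question.
    state = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: format the prompt, call the model, and return the context
    # next to the response so it can be evaluated later.
    prompt = TEMPLATE.format(
        question=state["question"],
        context="\n".join(state["context"]),
    )
    return {"response": fake_llm(prompt), "context": state["context"]}


result = run_chain({"question": "What is a rule of thumb?"})
print(result["response"])
print(result["context"])
```

The key design point is the final dictionary: returning `context` alongside `response` is what lets us collect the retrieved contexts for each question when we build the evaluation dataset.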

Let's test it out! Use a question that it should be able to answer.

In [26]:
question = "What is a rule of thumb for selecting an industry to invest in?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)
print(result["context"])

A rule of thumb for selecting an industry to invest in is to ensure that the forces of change within that industry have a reasonable chance of succeeding.
[Document(metadata={'source': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'file_path': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'page': 125, 'total_pages': 195, 'format': 'PDF 1.3', 'title': 'The Pmarca Blog Archives', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': 'Mac OS X 10.10 Quartz PDFContext', 'creationDate': "D:20150110020418Z00'00'", 'modDate': "D:20150110020418Z00'00'", 'trapped': '', '_id': 'e5e9617fae5e4183961e3f01ed47f359', '_collection_name': 'PMarca Blogs'}, page_content='the existing order — and make sure that those forces of change\nhave a reasonable chance at succeeding.\nSecond rule of thumb:\nOnce you have picked an industry, get right to the center of it'), Document(metadata={'source': 'https://d1lamhf6

Try it with a question that it shouldn't know the answer to.

In [27]:
question = "What did Pink Floyd have to say about how to proceed when investing in a new industry?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)
print(result["context"])

I don't know.
[Document(metadata={'source': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'file_path': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'page': 15, 'total_pages': 195, 'format': 'PDF 1.3', 'title': 'The Pmarca Blog Archives', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': 'Mac OS X 10.10 Quartz PDFContext', 'creationDate': "D:20150110020418Z00'00'", 'modDate': "D:20150110020418Z00'00'", 'trapped': '', '_id': '434857dc49eb4ef1b31fcf138c05c14d', '_collection_name': 'PMarca Blogs'}, page_content='ask if you can call them again if things change.\nTrust me — they’d much rather be saying “yes” than “no” —\nthey need all the good investments they can get.\nSecond, consider the environment.'), Document(metadata={'source': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf', 'file_path': 'https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-

We can already see that there are some improvements we could make here.

For now, let's switch gears to Ragas to see how we can leverage that tool to provide us insight into how our pipeline is performing!

## Task 4: Synthetic Dataset Generation for Evaluation using Ragas

Ragas is a powerful library that lets us evaluate our RAG pipeline by collecting input/output/context triplets and obtaining metrics relating to a number of different aspects of our RAG pipeline.

We'll be evaluating on every core metric today, but in order to do that - we'll need to create a test set. Luckily for us, Ragas can do that directly!

### Synthetic Test Set Generation

We can leverage Ragas' [`Synthetic Test Data generation`](https://docs.ragas.io/en/stable/concepts/testset_generation.html) functionality to generate our own synthetic question-context (QC) pairs - as well as a synthetic ground truth - quite easily!

In [28]:
loader = PyMuPDFLoader(
    "https://d1lamhf6l6yk6d.cloudfront.net/uploads/2021/08/The-pmarca-Blog-Archives.pdf",
)

eval_documents = loader.load()

text_splitter_eval = RecursiveCharacterTextSplitter(
    chunk_size = 600,
    chunk_overlap = 50
)

eval_documents = text_splitter_eval.split_documents(eval_documents)

#### ❓ Question #2:

Why is it important to split our documents using different parameters when creating our synthetic data?

#### ! Answer #2:

Different documents have different structures, and different splitting parameters let the retriever handle different kinds of text.

As we have found in earlier experiments, some queries work well with smaller chunks and others with larger chunks, depending on the style of the documents.

Chunk size also affects both the precision of retrieval and the amount of context. A smaller chunk may be very precise but lack important context, while a larger chunk may have the required context but be less precise, with a higher noise-to-signal ratio.

Overall, varying the parameters gives better performance and robustness.


In [29]:
len(eval_documents)

624

> NOTE: 🛑 Running this cell as presented will incur a charge of ~$3USD from OpenAI usage. Most of this cost is produced by the Synthetic Data Generation step. **YOU CAN SKIP THIS STEP BY LOADING THE `.csv` DIRECTLY FROM OUR REPOSITORY.** 🛑

#### Note
The data repository was downloaded instead, so I commented out the code and removed the output.

#### Optional: SDG for Evaluation

In [None]:
# from ragas.testset.generator import TestsetGenerator
# from ragas.testset.evolutions import simple, reasoning, multi_context
# from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# generator_llm = ChatOpenAI(model="gpt-3.5-turbo")
# critic_llm = ChatOpenAI(model="gpt-4o-mini")
# embeddings = OpenAIEmbeddings()

# generator = TestsetGenerator.from_langchain(
#     generator_llm,
#     critic_llm,
#     embeddings
# )

# distributions = {
#     simple: 0.5,
#     multi_context: 0.4,
#     reasoning: 0.1
# }

# num_qa_pairs = 20 # You can reduce the number of QA pairs to 5 if you're experiencing rate-limiting issues

# testset = generator.generate_with_langchain_docs(eval_documents, num_qa_pairs, distributions)
# testset.to_pandas()

Let's look at the output and see what we can learn about it!

In [None]:
# testset.test_data[0]

In [None]:
# testset_df = testset.to_pandas()
# testset_df.to_csv("testset.csv")

#### PREFERRED: Download `.csv` from DataRepository

In [36]:
# !git clone https://github.com/AI-Maker-Space/DataRepository.git

Cloning into 'DataRepository'...
remote: Enumerating objects: 87, done.
remote: Counting objects: 100% (79/79), done.
remote: Compressing objects: 100% (66/66), done.
remote: Total 87 (delta 24), reused 28 (delta 8), pack-reused 8 (from 1)
Receiving objects: 100% (87/87), 70.09 MiB | 3.56 MiB/s, done.
Resolving deltas: 100% (24/24), done.


In [37]:
# !mv DataRepository/testset.csv .

### Generating Responses with RAG Pipeline

Now that we have some QC pairs, and some ground truths, let's evaluate our RAG pipeline using Ragas.

The process is, again, quite straightforward - thanks to Ragas and LangChain!

Let's start by extracting our questions and ground truths from our created test set.

We can start by converting our test dataset into a Pandas DataFrame.

In [30]:
import pandas as pd

test_df = pd.read_csv("testset.csv")

In [31]:
test_df

Unnamed: 0.1,Unnamed: 0,question,contexts,ground_truth,evolution_type,metadata,episode_done
0,0,How does the tendency to avoid inconsistency c...,['Five: Inconsistency-Avoidance Tendency\n[Peo...,The tendency to avoid inconsistency contribute...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
1,1,What are some of the challenges faced by start...,['structure that any established company has.\...,"In a startup, it is easy for the code not to g...",simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
2,2,What factors should be considered when decidin...,['Part 2: Skills and education\n[Please read m...,The answer to given question is not present in...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
3,3,What should be valued when evaluating candidat...,"[""How to hire the best people you've\never wor...",The answer to given question is not present in...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
4,4,What are the consequences of not raising enoug...,['Here’s why you shouldn’t do that:\nWhat are ...,Not raising enough money risks the survival of...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
5,5,How does Structured Procrastination suggest us...,['like?\nStructured procrastination\nThis is a...,Structured Procrastination suggests that inste...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
6,6,What analogy is used to describe the layers of...,['as if it’s an onion. Just like you peel an o...,The analogy used to describe the layers of ris...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
7,7,How can Structured Procrastination be used to ...,['like?\nStructured procrastination\nThis is a...,Structured Procrastination suggests that inste...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
8,8,How is the quality of a startup's product defi...,['Let’s start by deXning terms.\nThe caliber o...,The quality of a startup's product in the tech...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True
9,9,What role can a campus computer lab play in he...,"['undergrads to do some of the work, and being...",A campus computer lab can play a role in helpi...,simple,[{'source': 'https://d1lamhf6l6yk6d.cloudfront...,True


In [33]:
test_questions = test_df["question"].values.tolist()
test_groundtruths = test_df["ground_truth"].values.tolist()

['How does the tendency to avoid inconsistency contribute to people being reluctant to change?', 'What are some of the challenges faced by startups in establishing necessary systems and routines for success?', 'What factors should be considered when deciding what to study in college?', 'What should be valued when evaluating candidates for a startup?', 'What are the consequences of not raising enough money for the survival of your company?', 'How does Structured Procrastination suggest using procrastination to your advantage in getting things done?', 'What analogy is used to describe the layers of risk in a startup investment?', "How can Structured Procrastination be used to one's advantage in getting things done?", "How is the quality of a startup's product defined in the context of the tech industry?", 'What role can a campus computer lab play in helping undergraduates gain real-world working experience before graduation?', "Who said, 'This task isn't easy'?", 'What significant event 

Now we'll generate responses from our RAG pipeline for the questions we've generated - we'll also need to collect our retrieved contexts for each question.

We'll do this in a simple loop to see exactly what's happening!

In [41]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_augmented_qa_chain.invoke({"question" : question})
  answers.append(response["response"].content)
  contexts.append([context.page_content for context in response["context"]])

Now we can wrap our information in a Hugging Face dataset for use in the Ragas library.

In [42]:
from datasets import Dataset

response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})
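
Before handing columns like these to an evaluator, it can help to confirm that they line up: one answer, one list of contexts, and one ground truth per question. The sketch below is a hypothetical pre-flight check in plain Python (`validate_columns` is not part of Ragas or `datasets`), run here on made-up toy values.

```python
def validate_columns(columns):
    """Check that all columns have the same length and contexts are lists of strings."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    for row in columns["contexts"]:
        if not isinstance(row, list) or not all(isinstance(c, str) for c in row):
            raise TypeError("each 'contexts' entry must be a list of strings")
    return next(iter(lengths.values()))


# Toy stand-ins for test_questions, answers, contexts, and test_groundtruths.
toy_columns = {
    "question": ["q1", "q2"],
    "answer": ["a1", "a2"],
    "contexts": [["c1", "c2"], ["c3"]],
    "ground_truth": ["g1", "g2"],
}
n_rows = validate_columns(toy_columns)
print(n_rows)
```

A check like this is cheap compared to an evaluation run, and it surfaces a skipped or failed question in the generation loop before any metric is computed.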

Let's take a peek and see what that looks like!

In [43]:
response_dataset[0]

{'question': 'How does the tendency to avoid inconsistency contribute to people being reluctant to change?',
 'answer': 'The tendency to avoid inconsistency contributes to people being reluctant to change because change represents a form of inconsistency, which people naturally resist. This resistance is reinforced by previous conclusions, loyalties, reputational identity, and commitments that individuals have established, making them less open to new ideas or identities.',
 'contexts': ['Five: Inconsistency-Avoidance Tendency\n[People are] reluctant to change, which is a form of inconsistency\navoidance. We see this in all human habits, constructive and',
  'less brain-blocked by its previous conclusions…\nOne corollary of Inconsistency-Avoidance Tendency is that a per-\nson making big sacriXces in the course of assuming a new identity',
  '[T]ending to be maintained in place by the anti-change tendency\nof the brain are one’s previous conclusions, human loyalties, repu-\ntational ide

# 🤝 Breakout Room Part #2

## Task 1: Evaluating our Pipeline with Ragas

Now that we have our response dataset - we can finally get into the "meat" of Ragas - evaluation!

First, we'll import the desired metrics, then we can use them to evaluate our created dataset!

Check out the specific metrics we'll be using in the Ragas documentation:

- [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html) 
- [Answer Relevancy](https://docs.ragas.io/en/stable/concepts/metrics/answer_relevance.html)
- [Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)
- [Context Recall](https://docs.ragas.io/en/stable/concepts/metrics/context_recall.html)
- [Answer Correctness](https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html)

See the accompanying presentation for more in-depth explanations about each of the metrics!
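
As a rough illustration of what a metric like Context Precision measures, here is a sketch of the rank-weighted precision formula as I understand it from the Ragas documentation: the mean of precision@k taken over the ranks where a relevant chunk appears. In Ragas the relevance judgments come from an LLM; in this sketch they are supplied by hand, and `context_precision_at_k` is a hypothetical helper, not a Ragas function.

```python
def context_precision_at_k(relevance):
    """relevance: list of 0/1 flags, one per retrieved chunk, in rank order."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    hits = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            score += hits / rank  # precision@rank, counted only at relevant ranks
    return score / total_relevant


# Four retrieved chunks where only the second one is judged irrelevant.
print(round(context_precision_at_k([1, 0, 1, 1]), 6))
# Same number of relevant chunks, but the irrelevant one is ranked first.
print(round(context_precision_at_k([0, 1, 1, 1]), 6))
```

The score rewards putting relevant chunks early: both examples retrieve three relevant chunks out of four, but the ranking with an irrelevant chunk at the top scores lower.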

In [44]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
)

metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
]

All that's left to do is call "evaluate" and away we go!

In [46]:
results = evaluate(response_dataset, metrics)

Evaluating:   0%|          | 0/95 [00:00<?, ?it/s]

In [47]:
results

{'faithfulness': 0.7009, 'answer_relevancy': 0.8656, 'context_recall': 0.6754, 'context_precision': 0.7164, 'answer_correctness': 0.5792}

#### Base Results
| Metric              |  text-embedding-ada-002 |
|---------------------|--------|
| Faithfulness        | 0.7009 |
| Answer Relevancy    | 0.8656 |
| Context Recall      | 0.6754 |
| Context Precision   | 0.7164 |
| Answer Correctness  | 0.5792 |

In [48]:
results_df = results.to_pandas()
results_df

Unnamed: 0,question,answer,contexts,ground_truth,faithfulness,answer_relevancy,context_recall,context_precision,answer_correctness
0,How does the tendency to avoid inconsistency c...,The tendency to avoid inconsistency contribute...,[Five: Inconsistency-Avoidance Tendency\n[Peop...,The tendency to avoid inconsistency contribute...,0.75,0.936709,0.25,0.805556,0.325213
1,What are some of the challenges faced by start...,Startups face challenges such as the lack of e...,[ied and determined. Sales calls get made. The...,"In a startup, it is easy for the code not to g...",1.0,0.924336,1.0,1.0,0.594483
2,What factors should be considered when decidin...,I don't know.,[including your formal education. So I will st...,The answer to given question is not present in...,0.0,0.0,1.0,0.0,0.195204
3,What should be valued when evaluating candidat...,"When evaluating candidates for a startup, the ...",[priate for your particular startup.\nWith a w...,The answer to given question is not present in...,1.0,0.963762,1.0,0.0,0.180777
4,What are the consequences of not raising enoug...,Not raising enough money risks the survival of...,[Here’s why you shouldn’t do that:\nWhat are t...,Not raising enough money risks the survival of...,0.5,0.991354,0.333333,0.833333,0.501363
5,How does Structured Procrastination suggest us...,Structured Procrastination suggests that inste...,[standing.)\nThe gist of Structured Procrastin...,Structured Procrastination suggests that inste...,1.0,0.957872,1.0,0.916667,0.995478
6,What analogy is used to describe the layers of...,The analogy used to describe the layers of ris...,[as if it’s an onion. Just like you peel an on...,The analogy used to describe the layers of ris...,1.0,1.0,1.0,0.916667,0.747353
7,How can Structured Procrastination be used to ...,Structured Procrastination can be used to one'...,[standing.)\nThe gist of Structured Procrastin...,Structured Procrastination suggests that inste...,1.0,0.987979,1.0,0.805556,0.841482
8,How is the quality of a startup's product defi...,The quality of a startup's product is defined ...,[The quality of a startup’s pr\nproduct\noduct...,The quality of a startup's product in the tech...,1.0,0.973781,0.0,1.0,0.57163
9,What role can a campus computer lab play in he...,A campus computer lab can provide undergraduat...,[What should I do while I’m in school?\nI’m a ...,A campus computer lab can play a role in helpi...,1.0,0.949112,0.333333,0.75,0.807536
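
Aggregate scores hide which questions are dragging a metric down, so it is worth ranking rows by a single metric. The sketch below does this in plain Python for the `answer_correctness` values of the ten rows shown above; `weakest_rows` is a hypothetical helper.

```python
# answer_correctness values copied from the ten rows shown above (row index 0-9).
answer_correctness = [
    0.325213, 0.594483, 0.195204, 0.180777, 0.501363,
    0.995478, 0.747353, 0.841482, 0.57163, 0.807536,
]


def weakest_rows(scores, n=3):
    """Return the indices of the n lowest-scoring rows, worst first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return ranked[:n]


worst = weakest_rows(answer_correctness)
print(worst)
```

Among these ten rows, the two lowest scores belong to rows 2 and 3, which are the rows whose synthetic ground truth states that the answer is not present in the context. That suggests part of the low score may come from the test set itself rather than from the pipeline.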


## Task 2: Testing OpenAI's Claim

Now that we've seen how our retriever can impact the performance of our RAG pipeline - let's see how changing our embedding model impacts performance.

#### 🏗️ Activity #1:

Please provide markdown, or code comments, to explain what each of the following steps is doing!

Use OpenAI's `text-embedding-3-small` model for the embeddings.

In [49]:
te3_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

Create qdrant vector store using the te3 embeddings
Add TE3 to the collection name to keep the vectors distinct
Use vector size of 1536 and cosine similarity for the similarity measurement
Create the qdrant vector store and add our original documents (chunking/vectorizing etc)
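
The vector size has to match the embedding dimension (1536 is the default output size of `text-embedding-3-small`), and cosine similarity compares the direction of two vectors while ignoring their length. A minimal pure-Python sketch of that measure, using made-up 3-dimensional vectors in place of real embeddings (an illustration only, not Qdrant's implementation):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (assumption): stand-ins for 1536-dimensional embeddings.
query = [1.0, 0.0, 1.0]
doc_close = [2.0, 0.0, 2.0]  # same direction as the query, different length
doc_far = [0.0, 1.0, 0.0]    # orthogonal to the query

print(cosine_similarity(query, doc_close))
print(cosine_similarity(query, doc_far))
```

Because only direction matters, `doc_close` scores as a perfect match even though it is twice as long as the query vector.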

In [50]:
qdrant_client.create_collection(
    collection_name=COLLECTION_NAME+"TE3",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

qdrant_vector_store = QdrantVectorStore(
    client=qdrant_client,
    collection_name=COLLECTION_NAME+"TE3",
    embedding=te3_embeddings,
)

qdrant_vector_store.add_documents(documents)

['b14b56bb6ac04434b0dbd2b4ac4d89e4',
 'af34e1986ee047378a94c7571f67f7f9',
 'ed8f9203df594266bd1c2ae3c2fbe58a',
 '979d6b61d3e24aa09b4e0576a36287f4',
 '5ae70c18b5824e3a9c6878c02943f155',
 '6f3e1e767c5e43818c0daad49337e34f',
 '99d310be450c40d587f5be007fb58f08',
 '8a24c163f29242a6b808a7bdf78c7e39',
 'a6d5f2a98b724244b546a1a0db7c5718',
 'bb767955c43643c3b03b7cf995047760',
 'e4dc12246d434a758926d54b3636daab',
 'd8bded4ec0d5447ca92498952fe29935',
 'a010d12a9f5d4de3a4977c796e3aee73',
 'c603a7da1ae7450d8030c216f2d6ebcb',
 '61a9f1505f6040499f9eb39fc7be19d3',
 'ec503745d2dc4dedaab125248e7c6fa0',
 'dad11c14434b495f9da1f92cac736fbf',
 'b7d88e31b835435f8733adbd8975388f',
 '54171b90a1ab4b1bb4b99327c3052bec',
 'ee23f9e0a78440dfb932a537104020c6',
 'fe85944a0abd4e859c300b50840310e7',
 'ab2e61091e3b44a7a59bd6821e756ce3',
 'ce69bc7a63804dec93d56902245880c6',
 '99cfa9e1d12d4336825df797f730200e',
 'ca4563639f55455cbe423242b693649e',
 '6e51e633db6e48338a77bf0e7be8941c',
 '41d09361e8cd494eb2a535bd992027c5',
 

Wrap the Qdrant vector store as a retriever (via `as_retriever()`) so it can find appropriate context for a question

In [51]:
te3_retriever = qdrant_vector_store.as_retriever()

Create a "stuff" document chain useful for question-answering.
Use the LLM (gpt-4o-mini) and the prompt previously defined

We can then get answers using this chain by passing in:
- question
- documents

#### NOTE: I swapped in the improved prompt here, since this step originally used the prompt derived directly from LangChain, which would have made direct comparisons with the earlier runs difficult.

In [52]:
from langchain.chains.combine_documents import create_stuff_documents_chain

document_chain = create_stuff_documents_chain(primary_qa_llm, improved_retrieval_qa_prompt)

Create a retrieval chain by connecting our new TE3 retriever to the front of the newly created document chain

This will first retrieve the documents relevant to the question.

Then it will process the question and the retrieved documents in the document chain

Note: we could test the retrieval chain using:
- result = te3_retrieval_chain.invoke({"input": "What are the key features of TE3 embeddings?"})

In [55]:
from langchain.chains import create_retrieval_chain

te3_retrieval_chain = create_retrieval_chain(te3_retriever, document_chain)

Using the new retrieval chain, let's ask our test questions and get the answers

Create a list of answers provided by the LLM

Create a list of the context documents retrieved for the given question

In [56]:
answers = []
contexts = []

for question in test_questions:
  response = te3_retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])

Convert the lists from above into a Hugging Face dataset

In [57]:
te3_response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

Take our data set created above and use Ragas to evaluate the answers based on the same metrics as before

In [58]:
te3_advanced_retrieval_results = evaluate(te3_response_dataset_advanced_retrieval, metrics)

Evaluating:   0%|          | 0/95 [00:00<?, ?it/s]

Display the results using the TE3 embedding model

In [59]:
te3_advanced_retrieval_results

{'faithfulness': 0.7414, 'answer_relevancy': 0.9725, 'context_recall': 0.6228, 'context_precision': 0.6243, 'answer_correctness': 0.6617}

#### Results of Run 2 - using text-embedding-3-small

| Metric             | Score   |
|--------------------|---------|
| Faithfulness        | 0.7414  |
| Answer Relevancy    | 0.9725  |
| Context Recall      | 0.6228  |
| Context Precision   | 0.6243  |
| Answer Correctness  | 0.6617  |

Let's compare our baseline results with the TE3 results

In [60]:
df_baseline = pd.DataFrame(list(results.items()), columns=['Metric', 'ADA'])
df_comparison = pd.DataFrame(list(te3_advanced_retrieval_results.items()), columns=['Metric', 'TE3'])

df_merged = pd.merge(df_baseline, df_comparison, on='Metric')

df_merged['Baseline -> TE3'] = df_merged['TE3'] - df_merged['ADA']

df_merged

|   | Metric             | ADA      | TE3      | Baseline -> TE3 |
|---|--------------------|----------|----------|-----------------|
| 0 | faithfulness       | 0.700877 | 0.741433 | 0.040555        |
| 1 | answer_relevancy   | 0.865586 | 0.972506 | 0.10692         |
| 2 | context_recall     | 0.675439 | 0.622807 | -0.052632       |
| 3 | context_precision  | 0.716374 | 0.624269 | -0.092105       |
| 4 | answer_correctness | 0.579235 | 0.661737 | 0.082502        |


#### ❓ Question #3:

Do you think, in your opinion, `text-embedding-3-small` is significantly better than `ada`?

This is a summary of the changes from ADA to TE3 (absolute changes in score, in percentage points):
- faithfulness - a slight increase of about 4 points - so the TE3 answer is slightly more factually consistent with the context
- answer_relevancy - an increase of about 11 points - TE3 answers are better aligned to the question than ADA
- context_recall - a decrease of about 5 points - TE3 loses out a bit on getting the most relevant context
- context_precision - a decrease of about 9 points - TE3 also loses out on ranking the relevant context highly
- answer_correctness - an increase of about 8 points - TE3 answers are closer to the ground truth

TE3 - better at creating relevant, factually consistent and correct answers

ADA - better at retrieving more relevant and precise chunks.
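
The differences discussed above are absolute differences between scores. A small sketch that recomputes them from the reported values and also shows the relative change, which is a different number; the `compare` helper is hypothetical, and the scores are copied from the comparison table:

```python
# Scores copied from the ADA vs. TE3 comparison table.
ada = {
    "faithfulness": 0.700877,
    "answer_relevancy": 0.865586,
    "context_recall": 0.675439,
    "context_precision": 0.716374,
    "answer_correctness": 0.579235,
}
te3 = {
    "faithfulness": 0.741433,
    "answer_relevancy": 0.972506,
    "context_recall": 0.622807,
    "context_precision": 0.624269,
    "answer_correctness": 0.661737,
}

def compare(baseline, candidate):
    """Return {metric: (absolute delta in points, relative change in %)}."""
    out = {}
    for metric, base in baseline.items():
        delta = candidate[metric] - base
        out[metric] = (round(delta * 100, 1), round(delta / base * 100, 1))
    return out

for metric, (points, relative) in compare(ada, te3).items():
    print(f"{metric:20s} {points:+6.1f} points  {relative:+6.1f}% relative")
```

Keeping the two columns apart avoids reading a 4-point gain on a 0.70 score as "4% better".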

## Task 5: Selecting an Advanced Retriever and Evaluating

#### 🏗️ Activity #2

While the changes that occurred due to modifying the embedding model were desirable - you're now tasked with improving `context_recall`, or `context_precision` (or both!).

You'll follow these steps:

1. Reason about this list of Advanced Retrieval methods:
  - [Contextual Compression (Reranker)](https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/contextual_compression/)
  - [MultiQueryRetriever](https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/MultiQueryRetriever/)
  - [Parent Document Retriever](https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/parent_document_retriever/)
2. Select the method you think will be the most performant.
3. Implement that method.
4. Create an LCEL chain that utilizes the new Retriever method.
5. Evaluate this LCEL and compare to the TE3 results.

> NOTE: We will spend more time in Session 14 diving into advanced retrieval methods, this activity is only to serve as a basic introduction to the idea of component-wise improvements and how they might impact metrics.

#### My Notes on the different methods

##### Contextual Compression (Reranker)
- aims at improving context precision, might boost context recall
- Compress the retrieved documents using the context of the given query
- Only relevant information returned
- Compress the contents of an individual document and filter out documents wholesale
- Document Compressor - takes a list of documents and shortens it 
- implement after initial retrieval process - between the retriever and the LLM
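
The compression idea in the notes above can be sketched without an LLM. This toy version is an assumption for illustration only: it uses word overlap where `LLMChainExtractor` would ask a model, and the helper names and sample texts are made up.

```python
import re

def _words(text):
    """Lowercased word set of a text."""
    return set(re.findall(r"[a-z']+", text.lower()))

def compress_documents(docs, query, min_overlap=2):
    """Toy compressor: keep sentences sharing at least min_overlap words with
    the query, and drop any document that has no surviving sentence."""
    query_words = _words(query)
    compressed = []
    for doc in docs:
        sentences = re.split(r"(?<=[.!?])\s+", doc)
        kept = [s for s in sentences if len(_words(s) & query_words) >= min_overlap]
        if kept:
            compressed.append(" ".join(kept))
    return compressed

# Hypothetical retrieved chunks and query.
docs = [
    "Startups need money. Not raising enough money risks the survival of the company.",
    "Structured procrastination reorders a task list. It has nothing to do with funding.",
]
query = "What happens when a startup is not raising enough money?"
print(compress_documents(docs, query))
```

Both behaviours from the notes show up here: the first document is shortened to its relevant sentence, and the second is filtered out wholesale.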

##### MultiQueryRetriever
- aims to improve context-recall, might decrease context precision
- generates a broader and more diverse set of relevant documents
- automates process of prompt tuning to create multiple queries from different perspectives for a given query
- retrieves a set of relevant documents and takes the unique union to get a larger set of documents
- overcomes some of limitations of distance based retrieval
- implement by replacing or modifying the retriever to use a MultiQueryRetriever  
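
The "unique union" step can be sketched in plain Python. Everything here is a stand-in: the query variants are written by hand rather than generated by an LLM, and `toy_retriever` ranks by word overlap instead of vector distance.

```python
def toy_retriever(query, corpus, k=1):
    """Stand-in retriever: rank documents by words shared with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def multi_query_retrieve(queries, corpus, k=1):
    """Run every query variant and return the order-preserving unique union."""
    seen = set()
    union = []
    for q in queries:
        for doc in toy_retriever(q, corpus, k):
            if doc not in seen:
                seen.add(doc)
                union.append(doc)
    return union

# Hypothetical corpus and hand-written query variants.
corpus = [
    "tommy lasorda said this job is not easy",
    "the quote about a hard job comes from a baseball manager",
    "structured procrastination orders tasks by importance",
    "raising money is hard for startups",
]
queries = [
    "who said this job is not easy",
    "which baseball manager gave the quote",
    "who made the statement about a hard job",
]

print(toy_retriever(queries[0], corpus))      # single query: one document
print(multi_query_retrieve(queries, corpus))  # three variants: two unique documents
```

The rephrased queries pull in a document the original wording missed, which is why this method targets recall; the extra documents are also why precision can suffer.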

##### Parent Document Retriever
- aims to improve context recall and context precision
- provides additional background information
- splits and stores small chunks of data
- when a small chunk is retrieved, it first looks up the parent ids and returns the larger documents
- implement by modifying the retriever to retrieve the parent document as well
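
A minimal sketch of the small-chunk-to-parent lookup described above. The helper names, the word-overlap ranking, and the sample essays are all assumptions; a real setup would use a vector store for the child chunks and a document store for the parents.

```python
def build_index(parents, chunk_size=6):
    """Split each parent into small word chunks and remember each chunk's parent id."""
    child_index = []  # list of (chunk_text, parent_id)
    for parent_id, text in parents.items():
        words = text.split()
        for start in range(0, len(words), chunk_size):
            chunk = " ".join(words[start:start + chunk_size])
            child_index.append((chunk, parent_id))
    return child_index

def retrieve_parents(query, child_index, parents, k=2):
    """Match the query against small chunks, then return de-duplicated parents."""
    query_words = set(query.lower().split())
    ranked = sorted(
        child_index,
        key=lambda item: len(query_words & set(item[0].lower().split())),
        reverse=True,
    )
    parent_ids = []
    for _, parent_id in ranked[:k]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parents[pid] for pid in parent_ids]

# Hypothetical parent documents.
parents = {
    "essay-1": "Not raising enough money risks the survival of the company and limits hiring",
    "essay-2": "Structured procrastination orders tasks by importance so that work still gets done",
}
child_index = build_index(parents)
result = retrieve_parents("raising enough money", child_index, parents)
print(result)
```

The query only matches one six-word chunk, but the caller receives the whole parent, including the "limits hiring" part that the matching chunk did not contain.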


#### Implementation of the Contextual Compression

In [62]:
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

In [67]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

uncompressed_docs = te3_retriever.invoke(
    "Who said, 'This task isn't easy'?"
)

compressor = LLMChainExtractor.from_llm(primary_qa_llm)
te3_compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=te3_retriever
)

compressed_docs = te3_compression_retriever.invoke(
    "Who said, 'This task isn't easy'?"
)
pretty_print_docs(uncompressed_docs)
pretty_print_docs(compressed_docs)

Document 1:

one
is
from
the
great Robert
Evans. (Hold
out
for
the audiobook — trust me.)
It’s really easy to get asked to do something — a new project, a
----------------------------------------------------------------------------------------------------
Document 2:

enough to get real high-quality work done.
This one is far easier to say than do. And it won’t be feasible
during projects where lots of updates during the day really are
----------------------------------------------------------------------------------------------------
Document 3:

As John says, “The list of tasks one has in mind will be ordered
by importance. Tasks that seem most urgent and important are
on top. But there are also worthwhile tasks to perform lower
----------------------------------------------------------------------------------------------------
Document 4:

have to do. But these are the 9 most important.
To quote the great Tommy Lasorda: “This fucking job ain’t that
fucking easy.”
Appendix for media 

Looks like the compression retriever is functioning, so hook it up and rerun the previous chain

- create a chain with the te3 compression retriever and the previously created document chain
- use the new chain and ask questions, saving the response and the retrieved context
- create a HuggingFace dataset with the questions, answers, context and ground truth
- use Ragas to evaluate the answers based on the metrics selected

In [68]:
from langchain.chains import create_retrieval_chain

answers = []
contexts = []

te3_compression_retrieval_chain = create_retrieval_chain(te3_compression_retriever, document_chain)

for question in test_questions:
  response = te3_compression_retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])

te3_compression_response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})
te3_compression_advanced_retrieval_results = evaluate(te3_compression_response_dataset_advanced_retrieval, metrics)

Evaluating:   0%|          | 0/95 [00:00<?, ?it/s]

In [69]:
te3_compression_advanced_retrieval_results

{'faithfulness': 0.6632, 'answer_relevancy': 0.9622, 'context_recall': 0.4386, 'context_precision': 0.7310, 'answer_correctness': 0.5384}

#### Results with Contextual Compression

| Metric             | Score   |
|--------------------|---------|
| Faithfulness        | 0.6632  |
| Answer Relevancy    | 0.9622  |
| Context Recall      | 0.4386  |
| Context Precision   | 0.7310  |
| Answer Correctness  | 0.5384  |

Compare the results


In [84]:
df_compression = pd.DataFrame(list(te3_compression_advanced_retrieval_results.items()), columns=['Metric', 'TE3-Compression'])
df_final_merged = pd.merge(df_merged, df_compression, on='Metric')
df_final_merged['Baseline -> TE3-Compression'] = df_final_merged['TE3-Compression'] - df_final_merged['ADA']
df_final_merged['TE3 -> TE3-Compression'] = df_final_merged['TE3-Compression'] - df_final_merged['TE3']
df_final_merged

|   | Metric             | ADA      | TE3      | Baseline -> TE3 | TE3-Compression | Baseline -> TE3-Compression | TE3 -> TE3-Compression |
|---|--------------------|----------|----------|-----------------|-----------------|-----------------------------|------------------------|
| 0 | faithfulness       | 0.700877 | 0.741433 | 0.040555        | 0.663221        | -0.037657                   | -0.078212              |
| 1 | answer_relevancy   | 0.865586 | 0.972506 | 0.10692         | 0.962194        | 0.096607                    | -0.010313              |
| 2 | context_recall     | 0.675439 | 0.622807 | -0.052632       | 0.438596        | -0.236842                   | -0.184211              |
| 3 | context_precision  | 0.716374 | 0.624269 | -0.092105       | 0.730994        | 0.01462                     | 0.106725               |
| 4 | answer_correctness | 0.579235 | 0.661737 | 0.082502        | 0.538383        | -0.040852                   | -0.123354              |


#### Analysis of Adding Contextual Compression

- Faithfulness for TE3-Compression dropped compared to both ADA and TE3
- Answer Relevancy for TE3-Compression increased against ADA but not as much as TE3
- Context Recall for TE3-Compression decreased substantially compared to both ADA and TE3
- Context Precision for TE3-Compression increased compared to ADA and TE3
- Answer Correctness for TE3-Compression decreased compared to ADA and TE3

So Contextual Compression improved context precision against both ADA and TE3, and answer relevancy against ADA only, at the cost of the other metrics. The claim that it improves context precision holds up, but as always it probably should have an asterisk too.
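
One way to see why precision can rise while recall falls is to compute simplified versions of the two metrics by hand. As I understand the Ragas documentation, context precision averages precision@k over the ranks that hold a relevant chunk, and context recall is the share of ground-truth statements that can be attributed to the retrieved context. Ragas uses an LLM to make those judgments; the 0/1 flags below are hypothetical, hand-written labels, so this is only a sketch of the arithmetic.

```python
def context_precision(relevance):
    """relevance: 0/1 flags for the retrieved chunks, in rank order."""
    hits = 0
    weighted = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / rank  # precision@rank, counted at relevant ranks only
    return weighted / hits if hits else 0.0

def context_recall(attributed):
    """attributed: 0/1 flags, one per ground-truth statement."""
    return sum(attributed) / len(attributed) if attributed else 0.0

# Hypothetical labels before compression: 4 chunks retrieved, 2 of them relevant,
# and 3 of 4 ground-truth statements supported by the context.
print(context_precision([1, 0, 1, 0]), context_recall([1, 1, 1, 0]))

# Hypothetical labels after compression: irrelevant chunks are gone, but trimming
# also removed the text that supported one ground-truth statement.
print(context_precision([1, 1]), context_recall([1, 1, 0, 0]))
```

Removing irrelevant chunks pushes precision up, while any supporting text lost during compression can only push recall down.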

#### MultiQuery Retriever implementation

Set up a LangChain multi-query retriever using the retriever from the TE3 chain and the same LLM

In [74]:
from langchain.retrievers.multi_query import MultiQueryRetriever

question = "Who said, 'This task isn't easy'?"

mqr_te3_retriever = MultiQueryRetriever.from_llm(
    retriever=te3_retriever, llm=primary_qa_llm
)

Let's test that this does in fact create multiple queries

In [75]:
import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)
unique_docs = mqr_te3_retriever.invoke(question)
len(unique_docs)

INFO:langchain.retrievers.multi_query:Generated queries: ["Who is the author of the quote, 'This task isn't easy'?", "Can you tell me who made the statement, 'This task isn't easy'?", "What individual is attributed with the phrase, 'This task isn't easy'?"]


6

OK - so we get 3 similar questions - so looks like it is working

Create the retrieval chain using the new mqr retriever and our original document chain

Loop through the test questions, passing the question into the chain. Save the answer and the context retrieved.

Convert the results into a HuggingFace dataset and have Ragas evaluate our results

In [76]:
mqr_te3_retrieval_chain = create_retrieval_chain(mqr_te3_retriever, document_chain)

answers = []
contexts = []

for question in test_questions:
  response = mqr_te3_retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])

mqr_te3_response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

mqr_te3_advanced_retrieval_results = evaluate(mqr_te3_response_dataset_advanced_retrieval, metrics)
mqr_te3_advanced_retrieval_results

INFO:langchain.retrievers.multi_query:Generated queries: ["What role does the desire to maintain consistency play in people's resistance to change?", "In what ways does the avoidance of inconsistency influence individuals' willingness to embrace change?", "How does the fear of inconsistency affect people's attitudes towards making changes in their lives?"]
INFO:langchain.retrievers.multi_query:Generated queries: ['What obstacles do startups encounter when trying to implement essential systems and processes for achieving success?', 'What difficulties do new businesses face in setting up the necessary frameworks and routines to ensure their success?', 'What are the key hurdles that startups must overcome to create effective systems and practices for sustainable growth?']
INFO:langchain.retrievers.multi_query:Generated queries: ['What key elements should influence my decision on a college major?  ', 'What considerations are important when choosing a field of study for higher education?  '

Evaluating:   0%|          | 0/95 [00:00<?, ?it/s]

{'faithfulness': 0.7401, 'answer_relevancy': 0.9664, 'context_recall': 0.7325, 'context_precision': 0.6257, 'answer_correctness': 0.5729}

#### Results of MultiQueryRetriever run

| Metric             | Value  |
|--------------------|--------|
| Faithfulness        | 0.7401 |
| Answer Relevancy    | 0.9664 |
| Context Recall      | 0.7325 |
| Context Precision   | 0.6257 |
| Answer Correctness  | 0.5729 |

Now let's compare all the results from all of the runs and analyze.

In [86]:
df_mqr = pd.DataFrame(list(mqr_te3_advanced_retrieval_results.items()), columns=['Metric', 'MQR'])
df_final_merged_2 = pd.merge(df_final_merged, df_mqr, on='Metric', how='inner')
df_final_merged_2['Baseline -> MQR'] = df_final_merged_2['MQR'] - df_final_merged_2['ADA']
df_final_merged_2['TE3 -> MQR'] = df_final_merged_2['MQR'] - df_final_merged_2['TE3']
df_final_merged_2['TE3-Compression -> MQR'] = df_final_merged_2['MQR'] - df_final_merged_2['TE3-Compression']
df_final_merged_2

|   | Metric             | ADA      | TE3      | Baseline -> TE3 | TE3-Compression | Baseline -> TE3-Compression | TE3 -> TE3-Compression | MQR      | Baseline -> MQR | TE3 -> MQR | TE3-Compression -> MQR |
|---|--------------------|----------|----------|-----------------|-----------------|-----------------------------|------------------------|----------|-----------------|------------|------------------------|
| 0 | faithfulness       | 0.700877 | 0.741433 | 0.040555        | 0.663221        | -0.037657                   | -0.078212              | 0.740102 | 0.039225        | -0.001331  | 0.076881               |
| 1 | answer_relevancy   | 0.865586 | 0.972506 | 0.10692         | 0.962194        | 0.096607                    | -0.010313              | 0.966397 | 0.10081         | -0.00611   | 0.004203               |
| 2 | context_recall     | 0.675439 | 0.622807 | -0.052632       | 0.438596        | -0.236842                   | -0.184211              | 0.732456 | 0.057018        | 0.109649   | 0.29386                |
| 3 | context_precision  | 0.716374 | 0.624269 | -0.092105       | 0.730994        | 0.01462                     | 0.106725               | 0.62575  | -0.090624       | 0.001481   | -0.105244              |
| 4 | answer_correctness | 0.579235 | 0.661737 | 0.082502        | 0.538383        | -0.040852                   | -0.123354              | 0.572889 | -0.006346       | -0.088848  | 0.034506               |


#### Analysis of MultiQueryRetriever Results

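- MQR vs. baseline (ADA): about 9 points worse on context precision, about 10 points better on answer relevancy, and about 6 points better on context recall
- MQR vs. TE3: about 9 points worse on answer correctness and about 11 points better on context recall
- MQR vs. TE3-Compression: about 11 points worse on context precision but about 29 points better on context recall

MQR is supposed to improve context recall, and it does: its context recall (0.7325) is the highest of all four runs. The surprising part is that answer correctness dropped compared to TE3.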

Please note I did run this on the te3_retriever.

#### 🚧 BONUS CHALLENGE 🚧

> NOTE: Completing this challenge will provide full marks on the assignment, regardless of the completion of the notebook. You do not need to complete this in the notebook for full marks.

##### **MINIMUM REQUIREMENTS**:

1. Baseline `LCEL RAG` Application using `NAIVE RETRIEVAL`
2. Baseline Evaluation using `RAGAS METRICS`
  - [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html)
  - [Answer Relevancy](https://docs.ragas.io/en/stable/concepts/metrics/answer_relevance.html)
  - [Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)
  - [Context Recall](https://docs.ragas.io/en/stable/concepts/metrics/context_recall.html)
  - [Answer Correctness](https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html)
3. Implement a `SEMANTIC CHUNKING STRATEGY`.
4. Create an `LCEL RAG` Application using `SEMANTIC CHUNKING` with `NAIVE RETRIEVAL`.
5. Compare and contrast results.

##### **SEMANTIC CHUNKING REQUIREMENTS**:

Chunk semantically similar (based on designed threshold) sentences, and then paragraphs, greedily, up to a maximum chunk size. Minimum chunk size is a single sentence.
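
One possible reading of the sentence-level part of this requirement, as a minimal sketch. The assumptions are loud ones: Jaccard word overlap stands in for embedding similarity, the threshold and size limit are arbitrary, and the paragraph-level pass is left out.

```python
import re

def jaccard(a, b):
    """Word-overlap similarity in [0, 1]; a stand-in for embedding similarity."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0

def semantic_chunks(text, threshold=0.2, max_chars=200, similarity=jaccard):
    """Greedily merge adjacent sentences while they stay similar to the previous
    sentence and the chunk stays within max_chars. A chunk is never smaller than
    one sentence, even if that sentence exceeds max_chars."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks = []
    current = []
    for sentence in sentences:
        if current:
            candidate = " ".join(current + [sentence])
            similar = similarity(current[-1], sentence) >= threshold
            if similar and len(candidate) <= max_chars:
                current.append(sentence)
                continue
            chunks.append(" ".join(current))  # close the chunk and start a new one
        current = [sentence]
    if current:
        chunks.append(" ".join(current))
    return chunks

# Hypothetical sample text.
text = (
    "Startups need money to survive. Startups need money to hire people. "
    "Procrastination can be structured. Structured procrastination orders tasks."
)
for chunk in semantic_chunks(text):
    print(repr(chunk))
```

Swapping `similarity` for a function that compares sentence embeddings would keep the same greedy loop while using real semantic similarity.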

Have fun!