<a href="https://colab.research.google.com/github/robertheubanks/newaiengbootcamp/blob/main/Eubanks_Week_4_Homework_7_Evaluation_of_RAG_using_Ragas_Assignment_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Evaluation of RAG Using Ragas

In this notebook we'll explore how to evaluate RAG pipelines using a powerful open-source tool called "Ragas". It gives us both component-wise metrics and end-to-end metrics for the performance of our RAG pipelines.

We'll complete the following tasks:

- 🤝 Breakout Room #1:
  1. Install required libraries
  2. Set Environment Variables
  3. Creating a simple RAG pipeline with [LangChain v0.1.0](https://blog.langchain.dev/langchain-v0-1-0/)
  

- 🤝 Breakout Room #2:
  1. Synthetic Dataset Generation for Evaluation using [Ragas](https://github.com/explodinggradients/ragas)
  2. Evaluating our pipeline with Ragas
  3. Making Adjustments to our RAG Pipeline
  4. Evaluating our Adjusted pipeline against our baseline
  5. Testing OpenAI's Claim

The only way to get started is to get started - so let's grab our dependencies for the day!

## Motivation

OpenAI claims that their `text-embedding-3-small` model is (generally) better than their `text-embedding-ada-002` model.

Here are some passages from their [blog](https://openai.com/blog/new-embedding-models-and-api-updates) about the `text-embedding-3` release:

> `text-embedding-3-small` is our new highly efficient embedding model and provides a significant upgrade over its predecessor, the `text-embedding-ada-002` model...

> **Stronger performance.** Comparing `text-embedding-ada-002` to `text-embedding-3-small`, the average score on a commonly used benchmark for multi-language retrieval ([MIRACL](https://github.com/project-miracl/miracl)) has increased from 31.4% to 44.0%, while the average score on a commonly used benchmark for English tasks ([MTEB](https://github.com/embeddings-benchmark/mteb)) has increased from 61.0% to 62.3%.

Well, with a library like Ragas - we can put that claim to the test!

If what they claim is true - we should see an increase on related metrics by using the new embedding model!

# 🤝 Breakout Room #1

## Task 1: Installing Required Libraries

A reminder that one of the [key features](https://blog.langchain.dev/langchain-v0-1-0/) of LangChain v0.1.0 is the compartmentalization of the various LangChain ecosystem packages!

So let's begin grabbing all of our LangChain related packages!

In [None]:
!pip install -U -q langchain langchain-openai langchain_core langchain-community langchainhub openai

We'll also get the "star of the show" today, which is Ragas!

In [None]:
!pip install -qU ragas

Also, instead of the remotely hosted solution that we used last week (Pinecone), we'll be leveraging Meta's [FAISS](https://github.com/facebookresearch/faiss) as the backend for our LangChain `VectorStore`.

We'll also install `pymupdf`, which will allow us to load PDFs using the `PyMuPDFLoader` in the `langchain-community` package, along with `pandas` for inspecting our results!

In [None]:
!pip install -qU faiss_cpu pymupdf pandas

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 0.22.0 requires pandas<2.1.4,>=1.5.0, but you have pandas 2.2.1 which is incompatible.
google-colab 1.0.0 requires pandas==1.5.3, but you have pandas 2.2.1 which is incompatible.

## Task 2: Set Environment Variables

Let's set up our OpenAI API key so we can leverage their API later on.

In [None]:
import os
import openai
from getpass import getpass

openai.api_key = getpass("Please provide your OpenAI Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

Please provide your OpenAI Key: ··········


## Task 3: Creating a Simple RAG Pipeline with LangChain v0.1.0

Building on what we learned last week, we'll be leveraging LangChain v0.1.0 and LCEL to build a simple RAG pipeline that we can baseline with Ragas.

## Building our RAG pipeline

Let's review the basic steps of RAG again:

- Create an Index
- Use retrieval to obtain pieces of context from our Index that are similar to our query
- Use an LLM to generate responses based on the retrieved context
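
The three steps above can be sketched without any libraries. This is a toy illustration only: word overlap stands in for embedding similarity, and `fake_llm` is a hypothetical stand-in for a real model call, so none of these names come from LangChain.

```python
# Toy RAG loop: index -> retrieve -> generate. All names here are illustrative.

def tokenize(text):
    # Lowercase and split on whitespace, dropping surrounding punctuation.
    return {word.strip(".,?!").lower() for word in text.split()}

def build_index(chunks):
    # "Index" each chunk by storing its token set next to the raw text.
    return [(chunk, tokenize(chunk)) for chunk in chunks]

def retrieve(index, query, k=2):
    # Rank chunks by how many query tokens they share and keep the top k.
    query_tokens = tokenize(query)
    ranked = sorted(index, key=lambda item: len(item[1] & query_tokens), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def fake_llm(prompt):
    # Hypothetical stand-in for an LLM call: it only reports the prompt size.
    return f"(answer generated from a {len(prompt)}-character prompt)"

chunks = [
    "The plaintiff filed a complaint in California.",
    "FAISS stores vectors for similarity search.",
    "The complaint alleges breach of fiduciary duty.",
]
question = "What does the complaint allege?"

index = build_index(chunks)
context = retrieve(index, question, k=2)
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + question
answer = fake_llm(prompt)

print(context)
print(answer)
```

The rest of the notebook replaces each toy piece with a real component: embeddings and a vector store for the index, a retriever for the lookup, and a chat model for generation.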

Let's get started by creating our index.

> NOTE: We're going to start leaning on the term "index" to refer to our `VectorStore`, `VectorDatabase`, etc. We can think of "index" as the catch-all term, whereas `VectorStore` and the like relate to the specific technologies used to create, store, and interact with the index.

### Creating an Index

You'll notice that the largest changes (outside of some import changes) are that our old favourite chains are back to being bundled in an easily usable abstraction.

We can still create custom chains using LCEL - but we can also be more confident that our pre-packaged chains are created using LCEL under the hood.

#### Loading Data

Let's start by loading some data!

> NOTE: You'll notice that we're using a document loader from the community package of LangChain. This is part of the v0.1.0 changes that keep the base (`langchain-core`) package lightweight while still providing access to some of the more powerful community integrations.

In [None]:
!git clone https://github.com/AI-Maker-Space/DataRepository

Cloning into 'DataRepository'...
remote: Enumerating objects: 50, done.
remote: Counting objects: 100% (42/42), done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 50 (delta 14), reused 20 (delta 7), pack-reused 8
Receiving objects: 100% (50/50), 51.19 MiB | 43.25 MiB/s, done.
Resolving deltas: 100% (14/14), done.


In [None]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(
    "DataRepository/MuskComplaint.pdf",
)

documents = loader.load()

In [None]:
documents[0].metadata

{'source': 'DataRepository/MuskComplaint.pdf',
 'file_path': 'DataRepository/MuskComplaint.pdf',
 'page': 0,
 'total_pages': 46,
 'format': 'PDF 1.7',
 'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': '',
 'producer': '',
 'creationDate': '',
 'modDate': '',
 'trapped': ''}

#### Transforming Data

Now that we've got our single document - let's split it into smaller pieces so we can more effectively leverage it with our retrieval chain!

We'll start with the classic: `RecursiveCharacterTextSplitter`.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 700,
    chunk_overlap = 50
)

documents = text_splitter.split_documents(documents)

Let's confirm we've split our document.

In [None]:
len(documents)

159
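
To build some intuition for `chunk_size` and `chunk_overlap`, here is a minimal sketch that uses plain fixed-width character windows. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks), and `split_with_overlap` is a hypothetical helper written only for this illustration.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    # Fixed-width character windows; each new window starts
    # (chunk_size - chunk_overlap) characters after the previous one.
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
```

Each piece repeats the last three characters of the previous one, which is the role `chunk_overlap` plays: text near a boundary still appears intact in at least one chunk.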

#### Loading OpenAI Embeddings Model

We'll need a way to convert our text into vectors so that we can compare each chunk against our query vector.

Let's use OpenAI's `text-embedding-ada-002` for this task!

In [None]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)

#### Creating a FAISS VectorStore

Now that we have documents - we'll need a place to store them alongside their embeddings.

In [None]:
from langchain_community.vectorstores import FAISS

vector_store = FAISS.from_documents(documents, embeddings)
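
Conceptually, the simplest kind of vector index compares the query against every stored vector. The sketch below shows that brute-force search in plain Python, assuming Euclidean (L2) distance; `flat_search` and the tiny two-dimensional vectors are illustrative stand-ins, not FAISS calls or real embeddings.

```python
import math

def l2_distance(a, b):
    # Euclidean distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def flat_search(vectors, query, k=2):
    # Compare the query against every stored vector and return the
    # (index, distance) pairs of the k closest ones.
    scored = [(i, l2_distance(v, query)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]

stored = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
hits = flat_search(stored, query=[1.0, 1.0], k=2)
print(hits)
```

This exact search is accurate but scales linearly with the number of vectors, which is why libraries like FAISS offer faster approximate alternatives.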

#### ❓ Question #1:

List out a few of the techniques that FAISS uses that make it performant.

> NOTE: Check the [repository](https://github.com/facebookresearch/faiss) for more information about FAISS!

ANSWER:


*   Use of Huge Memory Pages: FAISS can benefit from large memory pages (e.g., 2M or 1G pages) on x86-64 platforms to reduce the pressure on the TLB (Translation Lookaside Buffer) cache, potentially speeding up operations by up to 20%
*   GPU Acceleration: It leverages GPU resources for accelerated operations, significantly speeding up searches by offloading the computation to the GPU
*   Batch Queries: FAISS allows the processing of multiple query vectors simultaneously, improving throughput by handling searches in batches rather than one at a time
*   Index Partitioning: To improve scalability, FAISS supports partitioning the index into Voronoi cells, optimizing search performance for very large datasets
*   Automatic Tuning: FAISS includes an automatic tuning mechanism that optimizes search-time parameters to achieve the best balance between accuracy and search time, making it highly efficient for large-scale datasets
*   Optimized Distance Computations and Multi-threading: On the CPU side, FAISS utilizes BLAS libraries for efficient exact distance computations and multi-threading to perform parallel searches, exploiting multiple cores and GPUs for faster operations
*   Advanced Indexing Techniques: FAISS incorporates several indexing techniques such as Product Quantization (PQ), the inverted file index with Asymmetric Distance Computation (IVFADC), and the Hierarchical Navigable Small World (HNSW) graph method, among others. These methods provide a variety of options for trading off search accuracy against speed
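
The index partitioning idea from the list above can be sketched in a few lines. This is a simplified illustration of inverted-file style search, not FAISS code: the centroids are hand-picked here (FAISS learns them by clustering), and `build_cells` and `ivf_search` are hypothetical helper names.

```python
import math

def dist(a, b):
    # Euclidean distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_cells(vectors, centroids):
    # Assign every vector id to the cell of its nearest centroid.
    cells = {c: [] for c in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: dist(vec, centroids[c]))
        cells[nearest].append(vec_id)
    return cells

def ivf_search(vectors, centroids, cells, query, k=1, nprobe=1):
    # Visit only the nprobe cells whose centroids are closest to the query,
    # then rank the vectors found in those cells by distance.
    probe_order = sorted(range(len(centroids)), key=lambda c: dist(query, centroids[c]))
    candidates = [vec_id for c in probe_order[:nprobe] for vec_id in cells[c]]
    candidates.sort(key=lambda vec_id: dist(vectors[vec_id], query))
    return candidates[:k]

vectors = [[0.1, 0.2], [0.0, 0.9], [9.8, 10.1], [10.2, 9.7], [0.4, 0.1]]
centroids = [[0.0, 0.0], [10.0, 10.0]]  # hand-picked for the example
cells = build_cells(vectors, centroids)
nearest_ids = ivf_search(vectors, centroids, cells, query=[9.9, 9.9], k=2, nprobe=1)

print(cells)
print(nearest_ids)
```

With `nprobe=1` only one of the two cells is scanned, so the search touches two vectors instead of five. Raising `nprobe` trades speed for a lower chance of missing a true neighbour that sits in an adjacent cell.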

#### Creating a Retriever

To complete our index, all that's left to do is expose our vectorstore as a retriever - which we can do the same way we would in previous versions of LangChain!

In [None]:
retriever = vector_store.as_retriever()

#### Testing our Retriever

Now that we've gone through the trouble of creating our retriever - let's see it in action!

In [None]:
retrieved_documents = retriever.invoke("Who is the plaintiff?")

In [None]:
for doc in retrieved_documents:
  print(doc)

page_content='would be owned by the foundation and used ‘for the good of the world’[.]” Plaintiff \nreplied: “Agree on all.” Ex. 2 at 1.' metadata={'source': 'DataRepository/MuskComplaint.pdf', 'file_path': 'DataRepository/MuskComplaint.pdf', 'page': 27, 'total_pages': 46, 'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': '', 'creationDate': '', 'modDate': '', 'trapped': ''}
page_content='property and derivative works funded by those monies, Plaintiff is presently unable to ascertain his \ninterest in or the use, allocation, or distribution of assets without an accounting. Plaintiff is therefore \nentitled to an accounting.' metadata={'source': 'DataRepository/MuskComplaint.pdf', 'file_path': 'DataRepository/MuskComplaint.pdf', 'page': 32, 'total_pages': 46, 'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': '', 'creationDate': '', 'modDate': '', 'trapped': ''}
page_content='1

### Creating a RAG Chain

Now that we have the "R" in RAG taken care of - let's look at creating the "AG"!

#### Creating a Prompt Template

There are a few different ways we could create our prompt template - we could create a custom template, as seen in the code below, or we could simply pull a prompt from the prompt hub! Let's look at an example of that!

In [None]:
from langchain import hub

retrieval_qa_prompt = hub.pull("langchain-ai/retrieval-qa-chat")

In [None]:
print(retrieval_qa_prompt.messages[0].prompt.template)

Answer any use questions based solely on the context below:

<context>
{context}
</context>


As you can see - the prompt template is simple (and has a small error) - so we'll create our own to be a bit more specific!

In [None]:
from langchain.prompts import ChatPromptTemplate

template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

Context:
{context}

Question:
{question}
"""

prompt = ChatPromptTemplate.from_template(template)
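
To see what the model will eventually receive, here is a minimal sketch that fills the same template with plain `str.format`. The retrieved chunks are hypothetical placeholders, and `ChatPromptTemplate` does more than this (it wraps the text in a chat message), so treat it as an illustration of the variable substitution only.

```python
template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

Context:
{context}

Question:
{question}
"""

# Hypothetical retrieved chunks, standing in for the retriever's output.
retrieved_chunks = [
    "Plaintiff replied: 'Agree on all.'",
    "Plaintiff is therefore entitled to an accounting.",
]

filled = template.format(
    context="\n".join(retrieved_chunks),
    question="Who is the plaintiff?",
)
print(filled)
```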

#### Setting Up our Basic QA Chain

Now we can instantiate our basic RAG chain!

We'll use LCEL directly just to see an example of it - but you could just as easily use an abstraction here to achieve the same goal!

We'll also make sure to pass our context through to the output - Ragas needs it for evaluation.

In [None]:
from operator import itemgetter

from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

primary_qa_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and piping it into the retriever
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    # "context"  : re-assigned to itself with RunnablePassthrough.assign, so that both "question"
    #              and "context" are passed along unchanged to the next step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)

#### 🏗️ Activity #1:

Describe the pipeline shown above in simple terms. You can include a diagram if desired.

ANSWER:
The pipeline described is a simple example of a Retrieval-Augmented Generation (RAG) chain designed for answering questions using a two-step process: retrieval and generation. Here's a simplified explanation of each step in the pipeline:

**Retrieval Phase:** The process begins when a question is provided as input. The system uses this question as a key to retrieve relevant context from a pre-built index. This index is a collection of documents or passages that have been processed and stored in a way that they can be quickly searched to find relevant information related to the input question. The retrieval is performed by the retriever component, which is linked to the vector store created earlier. This phase results in a set of documents or passages that are believed to contain useful information to answer the question.

**Generation Phase:** Once relevant contexts have been retrieved, they are fed, along with the original question, into a large language model (LLM), specifically GPT-3.5 in this case. The language model uses the context to generate an answer to the question. The context provides specific information that the model uses to tailor its response, making it relevant to the query. If the context is insufficient to answer the question, the model is instructed to respond with "I don't know."
The system utilizes a custom prompt template that structures how the question and context are presented to the LLM, ensuring that the model understands it is supposed to answer the question based solely on the provided context.

Key Components:

**Retriever:** Searches the index for context relevant to the input question.

**LangChain Prompts and LCEL (LangChain Expression Language):** Tools used to create the custom prompt and to compose the steps of the RAG pipeline.

**ChatOpenAI:** The component responsible for sending the formatted prompt to OpenAI's GPT-3.5 model and obtaining the generated response.

**RunnablePassthrough and `itemgetter`:** Utilities that pass data, such as the retrieved context, through the stages of the pipeline unchanged. (`StrOutputParser` is imported but not used in this chain.)

Overall Process:

1) Input a question.

2) Retrieve relevant contexts based on the question.

3) Combine the context and question into a structured prompt.

4) Generate an answer using the structured prompt and a large language model.

5) Output the generated answer together with the retrieved context.

This RAG chain efficiently combines the strengths of both retrieval-based methods (for extracting relevant information from a large corpus of texts) and generative AI models (for synthesizing responses that are coherent and contextually appropriate based on the retrieved information).
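
The data flow described above can also be sketched with plain functions and dictionaries. Everything here is a hypothetical stand-in (`fake_retriever`, `fake_llm`, `format_prompt`, `run_chain`); the point is only to show which keys exist at each step, not to reproduce LCEL.

```python
def fake_retriever(question):
    # Hypothetical stand-in for `retriever.invoke(question)`.
    return [f"chunk related to: {question}"]

def fake_llm(prompt_text):
    # Hypothetical stand-in for the chat model call.
    return f"answer based on {len(prompt_text)} prompt characters"

def format_prompt(inputs):
    # Mirrors the role of the prompt template: combine context and question.
    return f"Context:\n{inputs['context']}\n\nQuestion:\n{inputs['question']}\n"

def run_chain(inputs):
    # Step 1: build {"context", "question"} from the incoming question.
    step_one = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: generate the response and carry the context through unchanged.
    return {
        "response": fake_llm(format_prompt(step_one)),
        "context": step_one["context"],
    }

result = run_chain({"question": "Who is the plaintiff?"})
print(result["response"])
print(result["context"])
```

Returning the context next to the response is what later lets us hand both to Ragas.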

Let's test it out!

In [None]:
question = "Who is the plaintiff?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)

Elon Musk


In [None]:
question = "What does this complaint pertain to?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)
print(result["context"])

The complaint pertains to breach of fiduciary duty, unfair business practices, accounting, and a demand for a jury trial.
[Document(page_content='1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \n12 \n13 \n14 \n15 \n16 \n17 \n18 \n19 \n20 \n21 \n22 \n23 \n24 \n25 \n26 \n27 \n28 \n \n \n– 31 – \nCOMPLAINT \n \nTHIRD CAUSE OF ACTION \nBreach of Fiduciary Duty  \nAgainst All Defendants \n133. \nPlaintiff realleges and incorporates by reference only paragraphs of this Complaint \nnecessary for his claim of Breach of Fiduciary Duty. \n134. \nUnder California law, Defendants owe fiduciary duties to Plaintiff, including a duty \nto use Plaintiff’s contributions for the purposes for which they were made. E.g., Cal. Bus. & Prof. \nCode § 17510.8. Defendants have repeatedly breached their fiduciary duties to Plaintiff, including \nby:', metadata={'source': 'DataRepository/MuskComplaint.pdf', 'file_path': 'DataRepository/MuskComplaint.pdf', 'page': 30, 'total_pages': 46, 'format': 'PDF 1.7', 'title':

We can already see that there are some improvements we could make here.

For now, let's switch gears to Ragas to see how we can leverage that tool to provide us insight into how our pipeline is performing!

# 🤝 Breakout Room #2

## Task 1: Synthetic Dataset Generation for Evaluation using Ragas

Ragas is a powerful library that lets us evaluate our RAG pipeline by collecting input/output/context triplets and obtaining metrics relating to a number of different aspects of our RAG pipeline.

We'll be evaluating on every core metric today, but in order to do that - we'll need to create a test set. Luckily for us, Ragas can do that directly!

### Synthetic Test Set Generation

We can leverage Ragas' [`Synthetic Test Data generation`](https://docs.ragas.io/en/stable/concepts/testset_generation.html) functionality to generate our own synthetic question/context (QC) pairs - as well as a synthetic ground truth - quite easily!

> NOTE: This process will use `gpt-3.5-turbo-16k` as the base generator and `gpt-4` as the critic - if you're attempting to create a lot of samples please be aware of cost, as well as rate limits.

In [None]:
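# `documents` now holds the 700-character chunks created earlier, and re-splitting
# those with a larger chunk_size would leave them unchanged, so reload the pages.
eval_documents = loader.load()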

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 400
)

eval_documents = text_splitter.split_documents(eval_documents)

#### ❓ Question #2:

Why is it important to split our documents using different parameters when creating our synthetic data?

ANSWER:
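Splitting the documents with different parameters, such as `chunk_size` and `chunk_overlap`, than the ones used to build the index matters mainly because it keeps the evaluation honest. If the synthetic questions were generated from exactly the same chunks that the retriever indexes, each question would line up neatly with a single indexed chunk and the retrieval metrics would likely be inflated. It also helps for several related reasons: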

*   Variability in Context: Different chunk sizes and overlaps ensure that the synthetic dataset encompasses a wide variety of context lengths and content overlaps. This variability better simulates the diverse nature of real-world queries, where the amount of relevant information can significantly vary from one question to another.
*   Robustness and Generalization: By testing the RAG pipeline against a synthetic dataset with varied context sizes, the evaluation can more accurately gauge the pipeline's robustness and its ability to generalize across different types of questions and contexts. This is especially important for understanding how the pipeline performs under less-than-ideal circumstances, such as when the relevant information is fragmented or spread out over a larger context.
*   Coverage and Recall Improvement: Varying the parameters for splitting documents can improve the coverage of the dataset, ensuring that more aspects of the documents are considered during the retrieval phase. This can lead to a better recall of relevant information, as different splits might highlight different facets of the data that could be pertinent to answering a given question.
*   Evaluation of Retrieval and Generation Quality: Synthetic test sets with varied context sizes allow for a comprehensive evaluation of both the retrieval and generation components of the RAG pipeline. It helps in assessing whether the retrieval component can accurately find relevant information across different context sizes and whether the generation component can effectively utilize this information to generate accurate and coherent answers.
*   Impact on Performance Metrics: The choice of parameters directly affects retrieval-oriented metrics such as context precision and context recall. Varying it allows for a more nuanced understanding of the pipeline's performance across different scenarios, providing insights into potential areas of improvement for both retrieval and generation stages.

In [None]:
len(documents)

159

In [None]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

generator = TestsetGenerator.with_openai()

# Generate from the separately split evaluation documents rather than the indexed chunks.
testset = generator.generate_with_langchain_docs(eval_documents, test_size=10, distributions={simple: 0.25, reasoning: 0.25, multi_context: 0.5})


#### ❓ Question #3:

`{simple: 0.5, reasoning: 0.25, multi_context: 0.25}`

What exactly does this mapping refer to?

> NOTE: Check out the Ragas documentation on this generation process [here](https://docs.ragas.io/en/stable/concepts/testset_generation.html).

ANSWER:
The mapping `{simple: 0.5, reasoning: 0.25, multi_context: 0.25}` is the distribution of question types (evolution strategies) that Ragas uses when generating a synthetic test set for evaluating a Retrieval-Augmented Generation (RAG) pipeline. Each value is the share of generated questions that should be of that type:

*   simple: Straightforward questions that can be answered by retrieving facts directly from the context, without complex reasoning or combining information from multiple sources. The value 0.5 indicates that 50% of the questions in the synthetic test set should be of this type.
*   reasoning: Questions that require some level of reasoning to answer, such as drawing inferences from the given context or synthesizing information from different parts of it. The value 0.25 means that 25% of the questions in the test set should involve reasoning.
*   multi_context: Questions that require information from multiple contexts or documents to answer. These test the RAG pipeline's ability to aggregate and synthesize information across different sources. The value 0.25 means that 25% of the questions should be multi-contextual.

Note that the generation cell above actually passes `{simple: 0.25, reasoning: 0.25, multi_context: 0.5}`, so roughly half of the generated questions are multi-context.
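
One way to read such a mapping is as a set of shares of `test_size`. The sketch below turns shares into whole-number question counts using largest-remainder rounding. This rounding rule is an assumption made for illustration (Ragas may allocate differently), `allocate` is a hypothetical helper, and plain strings stand in for the Ragas evolution objects.

```python
import math

def allocate(test_size, distribution):
    # Largest-remainder allocation: floor each share, then hand the
    # leftover questions to the types with the largest fractional parts.
    shares = {name: weight * test_size for name, weight in distribution.items()}
    counts = {name: math.floor(share) for name, share in shares.items()}
    leftover = test_size - sum(counts.values())
    by_remainder = sorted(shares, key=lambda name: shares[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts

counts = allocate(10, {"simple": 0.5, "reasoning": 0.25, "multi_context": 0.25})
print(counts)
```

Because 25% of 10 is not a whole number, some rounding is unavoidable, so the observed counts per type in a small test set will only approximate the requested distribution.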

Let's look at the output and see what we can learn about it!

In [None]:
testset.test_data[0]

DataRow(question="What was Mr. Musk's concern about artificial intelligence-systems in his conversation with Mr. Page?", contexts=['Page, then-CEO of Google’s parent company Alphabet, Inc. Mr. Musk would frequently raise the \ndangers of AI in his conversations with Mr. Page, but to Mr. Musk’s shock, Mr. Page was \nunconcerned. For example, in 2013, Mr. Musk had a passionate exchange with Mr. Page about the \ndangers of AI. He warned that unless safeguards were put in place, “artificial intelligence-systems \nmight replace humans, making our species irrelevant or even extinct.” Mr. Page responded that \nwould merely “be the next stage of evolution,” and claimed Mr. Musk was being a “specist”—that'], ground_truth="Mr. Musk's concern was that artificial intelligence-systems might replace humans, making our species irrelevant or even extinct.", evolution_type='simple')

### Generating Responses with RAG Pipeline

Now that we have some QC pairs, and some ground truths, let's evaluate our RAG pipeline using Ragas.

The process is, again, quite straightforward - thanks to Ragas and LangChain!

Let's start by extracting our questions and ground truths from the test set we created.

We can start by converting our test dataset into a Pandas DataFrame.

In [None]:
test_df = testset.to_pandas()

In [None]:
test_df

| | question | contexts | ground_truth | evolution_type | episode_done |
|---|---|---|---|---|---|
| 0 | What was Mr. Musk's concern about artificial i... | [Page, then-CEO of Google’s parent company Alp... | Mr. Musk's concern was that artificial intelli... | simple | True |
| 1 | How did OpenAI use reinforcement learning in t... | [77. \nInitial work at OpenAI followed much in... | OpenAI used reinforcement learning to play Dot... | simple | True |
| 2 | What is Microsoft's stance on OpenAI's potenti... | [Indeed, during an interview shortly after Mr.... | Microsoft is confident in their ability to con... | reasoning | True |
| 3 | How did OpenAI demonstrate their expertise in ... | [77. \nInitial work at OpenAI followed much in... | OpenAI demonstrated their expertise in a strat... | reasoning | True |
| 4 | How would the non-profit business model revolu... | [business model were valid, it would radically... | The non-profit business model would allow inve... | multi_context | True |
| 5 | "What strategy video game did OpenAI excel in,... | [77. \nInitial work at OpenAI followed much in... | OpenAI excelled in the strategy video game Dot... | multi_context | True |
| 6 | "What are OpenAI's AGI development principles ... | [profit developing AGI for the benefit of huma... | OpenAI's AGI development principles are to dev... | multi_context | True |
| 7 | "What technique showcased the reasoning abilit... | [implementation for others to build on. \n84. ... | chain-of-thought prompting | multi_context | True |
| 8 | What were Stephen Hawking's concerns about AGI... | [18. \nMr. Musk has long recognized that AGI p... | | multi_context | True |
| 9 | Which architecture did OpenAI use to develop t... | [those connections to the target language. \n7... | OpenAI used Google's Transformer architecture ... | reasoning | True |


In [None]:
test_questions = test_df["question"].values.tolist()
test_groundtruths = test_df["ground_truth"].values.tolist()

Now we'll run each generated question through our RAG pipeline to produce a response - we'll also need to collect the retrieved contexts for each question.

We'll do this in a simple loop to see exactly what's happening!

In [None]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_augmented_qa_chain.invoke({"question" : question})
  answers.append(response["response"].content)
  contexts.append([context.page_content for context in response["context"]])

Now we can wrap our information in a Hugging Face dataset for use in the Ragas library.

In [None]:
from datasets import Dataset

response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

Let's take a peek and see what that looks like!

In [None]:
response_dataset[0]

{'question': "What was Mr. Musk's concern about artificial intelligence-systems in his conversation with Mr. Page?",
 'answer': "Mr. Musk's concern was that artificial intelligence-systems might replace humans, making our species irrelevant or even extinct.",
 'contexts': ['Page, then-CEO of Google’s parent company Alphabet, Inc. Mr. Musk would frequently raise the \ndangers of AI in his conversations with Mr. Page, but to Mr. Musk’s shock, Mr. Page was \nunconcerned. For example, in 2013, Mr. Musk had a passionate exchange with Mr. Page about the \ndangers of AI. He warned that unless safeguards were put in place, “artificial intelligence-systems \nmight replace humans, making our species irrelevant or even extinct.” Mr. Page responded that \nwould merely “be the next stage of evolution,” and claimed Mr. Musk was being a “specist”—that',
  '1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \n12 \n13 \n14 \n15 \n16 \n17 \n18 \n19 \n20 \n21 \n22 \n23 \n24 \n25 \n26 \n27 \n28 \n \n \n– 10 – \n

## Task 2: Evaluating our Pipeline with Ragas

Now that we have our response dataset - we can finally get into the "meat" of Ragas - evaluation!

First, we'll import the desired metrics, then we can use them to evaluate our created dataset!

Check out the specific metrics we'll be using in the Ragas documentation:

- [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html)
- [Answer Relevancy](https://docs.ragas.io/en/stable/concepts/metrics/answer_relevance.html)
- [Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)
- [Context Recall](https://docs.ragas.io/en/stable/concepts/metrics/context_recall.html)
- [Answer Correctness](https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html)

See the accompanying presentation for more in-depth explanations about each of the metrics!
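
To make two of these metrics more concrete, here is a minimal sketch based on my reading of the Ragas documentation. Context precision is taken to be the mean of precision@k over the ranks that hold a relevant chunk, and faithfulness to be the fraction of answer claims supported by the context. In Ragas the relevance and support verdicts come from an LLM; here they are hard-coded, and both function names are hypothetical.

```python
def context_precision(relevance):
    # `relevance` lists, in rank order, whether each retrieved chunk was
    # judged relevant (1) or not (0).
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision@rank at a relevant position
    return precision_sum / hits if hits else 0.0

def supported_fraction(verdicts):
    # Share of answer claims judged as supported by the retrieved context.
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

top_heavy = context_precision([1, 1, 0, 1])   # relevant chunks mostly ranked first
buried = context_precision([0, 1, 0, 0])      # the only relevant chunk is ranked second
faithfulness_like = supported_fraction([1, 1, 0])

print(round(top_heavy, 4), round(buried, 4), round(faithfulness_like, 4))
```

Under this definition, a score drops when relevant chunks are ranked below irrelevant ones, even if the same chunks are retrieved.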

In [None]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
)

metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
]

All that's left to do is call "evaluate" and away we go!

In [None]:
results = evaluate(response_dataset, metrics)


In [None]:
results

{'faithfulness': 0.8519, 'answer_relevancy': 0.8306, 'context_recall': 0.9000, 'context_precision': 0.7583, 'answer_correctness': 0.7697}
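
A per-question score can come back as NaN, so it helps to know how that affects a summary number. The sketch below assumes the summary is a plain mean that skips missing values; that is an assumption about the aggregation, not something this notebook confirms, and the scores are made up for illustration.

```python
import math

def nan_mean(scores):
    # Average only the scores that are actual numbers.
    valid = [s for s in scores if not math.isnan(s)]
    return sum(valid) / len(valid) if valid else float("nan")

# Hypothetical per-question scores, one of which is missing.
hypothetical_scores = [1.0, 1.0, float("nan"), 0.0, 0.5]
summary = nan_mean(hypothetical_scores)
print(summary)
```

With only ten questions, a single missing or zero score moves the average noticeably, so it is worth inspecting the per-question results as well as the summary.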

In [None]:
results_df = results.to_pandas()
results_df

Unnamed: 0,question,answer,contexts,ground_truth,faithfulness,answer_relevancy,context_recall,context_precision,answer_correctness
0,What was Mr. Musk's concern about artificial i...,Mr. Musk's concern was that artificial intelli...,"[Page, then-CEO of Google’s parent company Alp...",Mr. Musk's concern was that artificial intelli...,1.0,0.953027,1.0,1.0,1.0
1,How did OpenAI use reinforcement learning in t...,OpenAI used reinforcement learning to compete ...,[77. \nInitial work at OpenAI followed much in...,OpenAI used reinforcement learning to play Dot...,1.0,0.971244,1.0,1.0,0.74696
2,What is Microsoft's stance on OpenAI's potenti...,Microsoft is confident in its ability to conti...,"[Indeed, during an interview shortly after Mr....",Microsoft is confident in their ability to con...,1.0,0.963952,1.0,1.0,0.61929
3,How did OpenAI demonstrate their expertise in ...,OpenAI demonstrated their expertise in a strat...,[77. \nInitial work at OpenAI followed much in...,OpenAI demonstrated their expertise in a strat...,1.0,0.950176,1.0,1.0,1.0
4,How would the non-profit business model revolu...,The non-profit business model would allow inve...,"[business model were valid, it would radically...",The non-profit business model would allow inve...,,0.905421,1.0,0.916667,0.661179
5,"""What strategy video game did OpenAI excel in,...",Dota 2,[a superhuman level of play in the games of ch...,OpenAI excelled in the strategy video game Dot...,0.0,0.888023,1.0,0.5,0.968325
6,"""What are OpenAI's AGI development principles ...",OpenAI's AGI development principles are to dev...,[to its mission to develop AGI for the benefit...,OpenAI's AGI development principles are to dev...,0.666667,0.0,1.0,0.25,0.542759
7,"""What technique showcased the reasoning abilit...",Chain-of-thought prompting,[implementation for others to build on. \n84. ...,chain-of-thought prompting,1.0,0.867652,1.0,0.916667,0.996956
8,What were Stephen Hawking's concerns about AGI...,Stephen Hawking's concerns about AGI in the wr...,[to its mission to develop AGI for the benefit...,,1.0,0.922451,0.0,0.0,0.186764
9,Which architecture did OpenAI use to develop t...,Google's Transformer architecture,[those connections to the target language. \n7...,OpenAI used Google's Transformer architecture ...,1.0,0.88427,1.0,1.0,0.974456
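The aggregate scores printed above appear to be per-question means that skip missing values. The sketch below checks this against the `faithfulness` and `answer_relevancy` columns of the table, using plain Python lists instead of the Ragas result object. The helper `mean_ignoring_nan` is a name chosen here for illustration.

```python
import math

# Per-question scores copied from the table above (row 4 has no faithfulness score).
faithfulness_scores = [1.0, 1.0, 1.0, 1.0, float("nan"), 0.0, 0.666667, 1.0, 1.0, 1.0]
answer_relevancy_scores = [
    0.953027, 0.971244, 0.963952, 0.950176, 0.905421,
    0.888023, 0.0, 0.867652, 0.922451, 0.88427,
]


def mean_ignoring_nan(scores):
    """Average the scores, leaving out any NaN entries."""
    valid = [score for score in scores if not math.isnan(score)]
    return sum(valid) / len(valid)


faithfulness_mean = mean_ignoring_nan(faithfulness_scores)
answer_relevancy_mean = mean_ignoring_nan(answer_relevancy_scores)

print(round(faithfulness_mean, 4))
print(round(answer_relevancy_mean, 4))
```

Both values agree with the summary (0.8519 and 0.8306). With only ten questions, a single zero score, such as `faithfulness` on row 5 or `answer_relevancy` on row 6, moves the average noticeably.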


## Task 3: Making Adjustments to our RAG Pipeline

Now that we have established a baseline - we can see how any changes impact our pipeline's performance!

Let's modify our retriever and see how that impacts our Ragas metrics!

In [None]:
from langchain.retrievers import MultiQueryRetriever

advanced_retriever = MultiQueryRetriever.from_llm(retriever=retriever, llm=primary_qa_llm)
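To make the idea behind multi-query retrieval concrete, here is a toy sketch that does not use LangChain at all: rephrase the question several ways, retrieve for each rephrasing, and keep the unique union of the results. The corpus, the keyword retriever, and `fake_variants` (a stand-in for the LLM call) are all hypothetical and exist only for this illustration.

```python
from typing import Callable, List

# Toy corpus; the notebook's real documents come from the complaint PDF.
CORPUS = [
    "Elon Musk is the plaintiff in this complaint.",
    "The complaint alleges breach of fiduciary duty.",
    "OpenAI trained agents to play Dota 2.",
]


def keyword_retrieve(query: str, k: int = 1) -> List[str]:
    """Toy retriever: rank documents by how many query words they share."""
    words = set(query.lower().replace("?", "").split())
    scored = [
        (len(words & set(doc.lower().replace(".", "").split())), doc)
        for doc in CORPUS
    ]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in ranked[:k] if score > 0]


def fake_variants(question: str) -> List[str]:
    """Stand-in for the LLM that rewrites the question from several angles."""
    return [
        "Who is the plaintiff?",
        "What breach of fiduciary duty is alleged?",
    ]


def multi_query_retrieve(
    question: str,
    generate_variants: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
) -> List[str]:
    """Retrieve for every variant and return the de-duplicated union."""
    merged: List[str] = []
    for variant in generate_variants(question):
        for doc in retrieve(variant):
            if doc not in merged:
                merged.append(doc)
    return merged


question = "Who is suing, and over what?"
single = keyword_retrieve(question)
multi = multi_query_retrieve(question, fake_variants, keyword_retrieve)
print(len(single), len(multi))
```

In this toy setup the single query surfaces one document while the rephrasings surface two, which is the kind of broader coverage the multi-query approach aims for. The trade-off is one extra LLM call plus several retrieval calls per question.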

We'll also re-create our RAG pipeline using the abstractions that come packaged with LangChain v0.1.0!

First, let's create a chain to "stuff" our documents into our context!

In [None]:
from langchain.chains.combine_documents import create_stuff_documents_chain

document_chain = create_stuff_documents_chain(primary_qa_llm, retrieval_qa_prompt)
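"Stuffing" simply means placing every retrieved chunk into a single prompt. The sketch below imitates that step in plain Python. The prompt text is a hypothetical placeholder (the notebook's `retrieval_qa_prompt` is defined elsewhere), and `stuff_documents` is an illustrative helper, not a LangChain function.

```python
from typing import List

# Hypothetical prompt; it only mirrors the usual `context` and `input` variables.
PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\n"
    "Question: {input}"
)


def stuff_documents(chunks: List[str], question: str, separator: str = "\n\n") -> str:
    """Join every retrieved chunk into one context string and fill the prompt."""
    context = separator.join(chunks)
    return PROMPT_TEMPLATE.format(context=context, input=question)


prompt = stuff_documents(
    ["Elon Musk is the plaintiff.", "The complaint alleges breach of fiduciary duty."],
    "Who is the plaintiff?",
)
print(prompt)
```

Because every chunk goes into one prompt, this approach only works while the combined text fits in the model's context window.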

Next, we'll create the retrieval chain!

In [None]:
from langchain.chains import create_retrieval_chain

retrieval_chain = create_retrieval_chain(advanced_retriever, document_chain)

In [None]:
response = retrieval_chain.invoke({"input": "Who is the plaintiff?"})

In [None]:
print(response["answer"])

The plaintiff is Elon Musk.


In [None]:
response = retrieval_chain.invoke({"input": "What does this complaint pertain to?"})

In [None]:
print(response["answer"])

The complaint pertains to a legal case involving Plaintiff Elon Musk alleging Breach of Fiduciary Duty, Unfair Business Practices, and Accounting against all Defendants. The complaint seeks remedies such as restitution, disgorgement of monies received, prejudgment interest, injunction against future activities, and specific performance. The Plaintiff has also demanded a jury trial for all issues, claims, and causes of action.


Well, just from those responses this chain *feels* better - but let's see how it performs on our eval!

Let's do the same process we did before to collect our pipeline's contexts and answers.

In [None]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])

Now we can convert this into a dataset, just like we did before.

In [None]:
response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

Let's evaluate on the same metrics we did for the first pipeline and see how it does!

In [None]:
advanced_retrieval_results = evaluate(response_dataset_advanced_retrieval, metrics)

Evaluating:   0%|          | 0/50 [00:00<?, ?it/s]

In [None]:
advanced_retrieval_results_df = advanced_retrieval_results.to_pandas()
advanced_retrieval_results_df

Unnamed: 0,question,answer,contexts,ground_truth,faithfulness,answer_relevancy,context_recall,context_precision,answer_correctness
0,What was Mr. Musk's concern about artificial i...,Mr. Musk's concern about artificial intelligen...,"[Page, then-CEO of Google’s parent company Alp...",Mr. Musk's concern was that artificial intelli...,1.0,1.0,1.0,1.0,0.991474
1,How did OpenAI use reinforcement learning in t...,OpenAI used reinforcement learning to compete ...,[77. \nInitial work at OpenAI followed much in...,OpenAI used reinforcement learning to play Dot...,1.0,0.872318,0.5,1.0,0.540233
2,What is Microsoft's stance on OpenAI's potenti...,"Microsoft's stance, as per Mr. Nadella's state...","[Indeed, during an interview shortly after Mr....",Microsoft is confident in their ability to con...,1.0,0.95472,1.0,1.0,0.608532
3,How did OpenAI demonstrate their expertise in ...,OpenAI demonstrated their expertise in a strat...,[77. \nInitial work at OpenAI followed much in...,OpenAI demonstrated their expertise in a strat...,1.0,0.971075,1.0,1.0,0.842019
4,How would the non-profit business model revolu...,The non-profit business model proposed by Open...,"[business model were valid, it would radically...",The non-profit business model would allow inve...,1.0,0.844621,1.0,0.916667,0.595228
5,"""What strategy video game did OpenAI excel in,...","OpenAI excelled in Dota 2, a strategy video ga...",[77. \nInitial work at OpenAI followed much in...,OpenAI excelled in the strategy video game Dot...,1.0,0.920092,1.0,1.0,0.74022
6,"""What are OpenAI's AGI development principles ...",OpenAI's AGI development principles were initi...,[Agreement. \n113. \nOpenAI’s conduct could ha...,OpenAI's AGI development principles are to dev...,1.0,0.915955,1.0,0.555556,0.647907
7,"""What technique showcased the reasoning abilit...",The technique that showcased the reasoning abi...,[implementation for others to build on. \n84. ...,chain-of-thought prompting,1.0,0.913904,1.0,0.916667,0.714533
8,What were Stephen Hawking's concerns about AGI...,Stephen Hawking's concerns about AGI falling i...,[to its mission to develop AGI for the benefit...,,0.75,0.926971,0.0,0.0,0.181587
9,Which architecture did OpenAI use to develop t...,OpenAI used the first half of Google's Transfo...,[those connections to the target language. \n7...,OpenAI used Google's Transformer architecture ...,1.0,0.996094,1.0,1.0,0.744506


## Task 4: Evaluating our Adjusted Pipeline Against Our Baseline

Now we can compare our results and see what directional changes occurred!

Let's refresh with our initial metrics.

In [None]:
results

{'faithfulness': 0.8519, 'answer_relevancy': 0.8306, 'context_recall': 0.9000, 'context_precision': 0.7583, 'answer_correctness': 0.7697}

And see how our advanced retrieval modified our chain!

In [None]:
advanced_retrieval_results

{'faithfulness': 0.9750, 'answer_relevancy': 0.9316, 'context_recall': 0.8500, 'context_precision': 0.8389, 'answer_correctness': 0.6606}

In [None]:
import pandas as pd

df_original = pd.DataFrame(list(results.items()), columns=['Metric', 'Baseline'])
df_comparison = pd.DataFrame(list(advanced_retrieval_results.items()), columns=['Metric', 'MultiQueryRetriever with Document Stuffing'])

df_merged = pd.merge(df_original, df_comparison, on='Metric')

df_merged['Delta'] = df_merged['MultiQueryRetriever with Document Stuffing'] - df_merged['Baseline']

df_merged

| | Metric | Baseline | MultiQueryRetriever with Document Stuffing | Delta |
|---|---|---|---|---|
| 0 | faithfulness | 0.851852 | 0.975 | 0.123148 |
| 1 | answer_relevancy | 0.830622 | 0.931575 | 0.100953 |
| 2 | context_recall | 0.9 | 0.85 | -0.05 |
| 3 | context_precision | 0.758333 | 0.838889 | 0.080556 |
| 4 | answer_correctness | 0.769669 | 0.660624 | -0.109045 |
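Aggregate deltas hide which questions drove the change. As a sketch, the snippet below recomputes the `answer_correctness` delta per question from the two per-question tables shown earlier, using plain Python lists instead of the DataFrames. The variable names are chosen here for illustration.

```python
# answer_correctness per question, copied from the two per-question tables.
baseline = [1.0, 0.74696, 0.61929, 1.0, 0.661179,
            0.968325, 0.542759, 0.996956, 0.186764, 0.974456]
advanced = [0.991474, 0.540233, 0.608532, 0.842019, 0.595228,
            0.74022, 0.647907, 0.714533, 0.181587, 0.744506]

# Positive delta: the MultiQueryRetriever pipeline scored higher on that question.
deltas = [adv - base for base, adv in zip(baseline, advanced)]

mean_delta = sum(deltas) / len(deltas)
improved = [i for i, delta in enumerate(deltas) if delta > 0]
largest_drops = sorted(range(len(deltas)), key=lambda i: deltas[i])[:3]

print(round(mean_delta, 6))
print(improved)
print(largest_drops)
```

The mean matches the aggregate delta of -0.109045. Only question 6 improved, and the three largest drops are questions 7, 9, and 5. Those are questions where the baseline answered tersely (for example "Dota 2") while the adjusted pipeline answered in a full sentence, which may explain part of the drop.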


## Task 5: Testing OpenAI's Claim

Now that we've seen how our retriever can impact the performance of our RAG pipeline - let's see how changing our embedding model impacts performance.

#### 🏗️ Activity #2:

Please provide markdown, or code comments, to explain what each of the following steps is doing!

In [None]:
new_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
### Initializes a new embeddings model using OpenAI's "text-embedding-3-small". This model will be used to convert text documents into vector embeddings that can be compared for similarity

In [None]:
vector_store = FAISS.from_documents(documents, new_embeddings)
### Creates a vector store using the FAISS library, which takes the documents processed by the `new_embeddings` model and organizes them into an efficient data structure for similarity search

In [None]:
new_retriever = vector_store.as_retriever()
### Converts the FAISS vector store into a retriever object that can be used to find documents similar to a given query based on the vector embeddings

In [None]:
new_advanced_retriever = MultiQueryRetriever.from_llm(retriever=new_retriever, llm=primary_qa_llm)
### Wraps the basic retriever in a MultiQueryRetriever, which uses a large language model (`primary_qa_llm`) to generate several rephrasings of each question, runs the underlying retriever for each rephrasing, and returns the unique union of the retrieved documents

In [None]:
new_retrieval_chain = create_retrieval_chain(new_advanced_retriever, document_chain)
### Builds a retrieval chain that passes each question to the advanced retriever and then hands the retrieved documents to `document_chain`, which generates the answer from them

In [None]:
answers = []
contexts = []

for question in test_questions:
  response = new_retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])
  ### Iterates over a set of test questions, using the `new_retrieval_chain` to generate responses. Each response's answer and associated contexts (specifically their page content) are collected and stored in the `answers` and `contexts` lists, respectively

In [None]:
new_response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})
### Compiles the responses from the retrieval chain into a structured dataset, including the original questions, generated answers, associated contexts, and ground truth answers for evaluation

In [None]:
new_advanced_retrieval_results = evaluate(new_response_dataset_advanced_retrieval, metrics)
### Evaluates the performance of the advanced retrieval system using a set of predefined metrics. This step calculates how well the system performed in terms of various metrics such as faithfulness, relevancy, and correctness of the answers, as well as precision and recall of the contexts

Evaluating:   0%|          | 0/50 [00:00<?, ?it/s]

In [None]:
new_advanced_retrieval_results
### Outputs the evaluation results, showing key performance metrics for the advanced retrieval system

{'faithfulness': 0.8893, 'answer_relevancy': 0.9414, 'context_recall': 0.8500, 'context_precision': 0.7533, 'answer_correctness': 0.7217}

In [None]:
df_baseline = pd.DataFrame(list(results.items()), columns=['Metric', 'Baseline'])
df_original = pd.DataFrame(list(advanced_retrieval_results.items()), columns=['Metric', 'ADA'])
df_comparison = pd.DataFrame(list(new_advanced_retrieval_results.items()), columns=['Metric', 'Text Embedding 3'])

df_merged = pd.merge(df_original, df_comparison, on='Metric')
df_merged = pd.merge(df_baseline, df_merged, on="Metric")

df_merged['Delta - TE3 -> ADA'] = df_merged['Text Embedding 3'] - df_merged['ADA']
df_merged['Delta - TE3 -> Baseline'] = df_merged['Text Embedding 3'] - df_merged['Baseline']

df_merged
### Builds DataFrames to compare the metrics of the three pipeline configurations (`Baseline`, `ADA`, and `Text Embedding 3`). Each delta is the `Text Embedding 3` score minus the other configuration's score, so a positive delta means `Text Embedding 3` scored higher.

| | Metric | Baseline | ADA | Text Embedding 3 | Delta - TE3 -> ADA | Delta - TE3 -> Baseline |
|---|---|---|---|---|---|---|
| 0 | faithfulness | 0.851852 | 0.975 | 0.889286 | -0.085714 | 0.037434 |
| 1 | answer_relevancy | 0.830622 | 0.931575 | 0.941413 | 0.009838 | 0.110791 |
| 2 | context_recall | 0.9 | 0.85 | 0.85 | 0.0 | -0.05 |
| 3 | context_precision | 0.758333 | 0.838889 | 0.753333 | -0.085556 | -0.005 |
| 4 | answer_correctness | 0.769669 | 0.660624 | 0.721652 | 0.061028 | -0.048017 |


#### ❓ Question #4:

In your opinion, is `text-embedding-3-small` significantly better than `ada`?

ANSWER:
Not on the evidence in this notebook. Whether `text-embedding-3-small` is significantly better than `ada` for a specific application depends on the nature of the task, the size and diversity of the dataset, and the balance required between accuracy, speed, and cost. Here are some considerations:

*   Results on This Dataset: The comparison is mixed. With the same MultiQueryRetriever pipeline, switching from `ada` to `text-embedding-3-small` raised answer_correctness (+0.061) and answer_relevancy (+0.010), left context_recall unchanged, and lowered faithfulness (-0.086) and context_precision (-0.086).
*   Sample Size: The test set has only ten questions, so one or two answers can shift an average by this much. The metrics are also scored by an LLM, so some of the difference may be run-to-run variation rather than a real effect.
*   Model Size and Computational Efficiency: OpenAI describes `text-embedding-3-small` as a highly efficient successor to `ada`. Nothing in this notebook measures latency or model size, so neither model should be assumed to be the larger or "deeper" one.
*   Use Case Specificity: A model that scores higher on general benchmarks such as MTEB or MIRACL will not necessarily retrieve better from a single legal complaint. The benchmark gains OpenAI reports do not guarantee gains on this document.
*   Evaluation Metrics: It's important to evaluate both models on metrics relevant to your specific use case, such as precision, recall, speed, and computational resource usage. Only empirical evaluation on a larger test set can show which model is "significantly better" for your needs.
*   Cost Considerations: Both models are accessed through an API, so their per-token pricing can also be a deciding factor, especially for applications with tight budget constraints.

## BONUS ACTIVITY: Showcase Multi-Context Performance Changes

Now that we've looked at a number of different examples - showcase the difference on the multi-context *specific* questions that were synthetically generated.

> NOTE: You have all the data you'll need already in the notebook if you made it to this step!

In [None]:
### YOUR CODE HERE