# Evaluation of RAG Using Ragas

In the following notebook we'll explore how to evaluate RAG pipelines using a powerful open-source tool called "Ragas". This will give us tools to evaluate component-wise metrics, as well as end-to-end metrics about the performance of our RAG pipelines.

In the following notebook we'll complete the following tasks:

- 🤝 Breakout Room #1:
  1. Install required libraries
  2. Set Environment Variables
  3. Creating a simple RAG pipeline with [LangChain v0.1.0](https://blog.langchain.dev/langchain-v0-1-0/)
  

- 🤝 Breakout Room #2:
  1. Synthetic Dataset Generation for Evaluation using the [Ragas](https://github.com/explodinggradients/ragas)
  2. Evaluating our pipeline with Ragas
  3. Making Adjustments to our RAG Pipeline
  4. Evaluating our Adjusted pipeline against our baseline
  5. Testing OpenAI's Claim

The only way to get started is to get started - so let's grab our dependencies for the day!

## Motivation

A claim, made by OpenAI, is that their `text-embedding-3-small` is better (generally) than their `text-embedding-ada-002` model.

Here's some passages from their [blog](https://openai.com/blog/new-embedding-models-and-api-updates) about the `text-embedding-3` release:

> `text-embedding-3-small` is our new highly efficient embedding model and provides a significant upgrade over its predecessor, the `text-embedding-ada-002` model...

> **Stronger performance.** Comparing `text-embedding-ada-002` to `text-embedding-3-small`, the average score on a commonly used benchmark for multi-language retrieval ([MIRACL](https://github.com/project-miracl/miracl)) has increased from 31.4% to 44.0%, while the average score on a commonly used benchmark for English tasks ([MTEB](https://github.com/embeddings-benchmark/mteb)) has increased from 61.0% to 62.3%.

Well, with a library like Ragas - we can put that claim to the test!

If what they claim is true - we should see an increase on related metrics by using the new embedding model!

# 🤝 Breakout Room #1

## Task 1: Installing Required Libraries

A reminder that one of the [key features](https://blog.langchain.dev/langchain-v0-1-0/) of LangChain v0.1.0 is the compartmentalization of the various LangChain ecosystem packages!

So let's begin grabbing all of our LangChain related packages!

In [1]:
!pip install -U -q langchain langchain-openai langchain_core langchain-community langchainhub openai

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m807.5/807.5 kB[0m [31m5.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m256.9/256.9 kB[0m [31m29.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.8/1.8 MB[0m [31m31.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m227.4/227.4 kB[0m [31m18.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m66.6/66.6 kB[0m [31m7.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.8/1.8 MB[0m [31m57.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m75.6/75.6 kB[0m [31m8.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m49.4/49.4 kB[0m [31m2.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━

We'll also get the "star of the show" today, which is Ragas!

In [2]:
!pip install -qU ragas

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m69.7/69.7 kB[0m [31m862.6 kB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m510.5/510.5 kB[0m [31m4.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m71.1/71.1 kB[0m [31m6.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m8.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m10.3 MB/s[0m eta [36m0:00:00[0m
[?25h

As well, instead of the remote hosted solution that we used last week (Pinecone), we'll be leveraging Meta's [FAISS](https://github.com/facebookresearch/faiss) as the backend for our LangChain `VectorStore`.

We'll also install `unstructured` (from [Unstructured-IO](https://github.com/Unstructured-IO/unstructured)) and its dependencies which will allow us to load PDFs using the `UnstructuredPDFLoader` in the `langchain-community` package!

In [3]:
!pip install -qU faiss_cpu pymupdf pandas

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m27.0/27.0 MB[0m [31m29.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m4.4/4.4 MB[0m [31m56.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.0/13.0 MB[0m [31m49.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m30.6/30.6 MB[0m [31m19.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m345.4/345.4 kB[0m [31m38.0 MB/s[0m eta [36m0:00:00[0m
[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 0.22.0 requires pandas<2.1.4,>=1.5.0, but you have pandas 2.2.1 which is incompatible.
google-colab 1.0.0 requires pandas==1.5.3, but you have pandas 2.2.1 which is incompatible.[0m[31m
[0m

## Task 2: Set Environment Variables

Let's set up our OpenAI API key so we can leverage their API later on.

In [4]:
import os
import openai
from getpass import getpass

openai.api_key = getpass("Please provide your OpenAI Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

Please provide your OpenAI Key: ··········


## Task 3: Creating a Simple RAG Pipeline with LangChain v0.1.0

Building on what we learned last week, we'll be leveraging LangChain v0.1.0 and LCEL to build a simple RAG pipeline that we can baseline with Ragas.

## Building our RAG pipeline

Let's review the basic steps of RAG again:

- Create an Index
- Use retrieval to obtain pieces of context from our Index that are similar to our query
- Use a LLM to generate responses based on the retrieved context

Let's get started by creating our index.

> NOTE: We're going to start leaning on the term "index" to refer to our `VectorStore`, `VectorDatabase`, etc. We can think of "index" as the catch-all term, whereas `VectorStore` and the like relate to the specific technologies used to create, store, and interact with the index.

### Creating an Index

You'll notice that the largest changes (outside of some import changes) are that our old favourite chains are back to being bundled in an easily usable abstraction.

We can still create custom chains using LCEL - but we can also be more confident that our pre-packaged chains are creating using LCEL under the hood.

#### Loading Data

Let's start by loading some data!

> NOTE: You'll notice that we're using a document loader from the community package of LangChain. This is part of the v0.1.0 changes that make the base (`langchain-core`) package remain lightweight while still providing access to some of the more powerful community integrations.

In [5]:
!git clone https://github.com/AI-Maker-Space/DataRepository

Cloning into 'DataRepository'...
remote: Enumerating objects: 54, done.[K
remote: Counting objects: 100% (46/46), done.[K
remote: Compressing objects: 100% (34/34), done.[K
remote: Total 54 (delta 15), reused 20 (delta 7), pack-reused 8[K
Receiving objects: 100% (54/54), 51.28 MiB | 42.69 MiB/s, done.
Resolving deltas: 100% (15/15), done.


In [6]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(
    "DataRepository/MuskComplaint.pdf",
)

documents = loader.load()

In [7]:
documents[0].metadata

{'source': 'DataRepository/MuskComplaint.pdf',
 'file_path': 'DataRepository/MuskComplaint.pdf',
 'page': 0,
 'total_pages': 46,
 'format': 'PDF 1.7',
 'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': '',
 'producer': '',
 'creationDate': '',
 'modDate': '',
 'trapped': ''}

#### Transforming Data

Now that we've got our single document - let's split it into smaller pieces so we can more effectively leverage it with our retrieval chain!

We'll start with the classic: `RecursiveCharacterTextSplitter`.

In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 700,
    chunk_overlap = 50
)

documents = text_splitter.split_documents(documents)

Let's confirm we've split our document.

In [9]:
len(documents)

159

#### Loading OpenAI Embeddings Model

We'll need a process by which we can convert our text into vectors that allow us to compare to our query vector.

Let's use OpenAI's `text-embedding-ada-002` for this task!

In [10]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)

#### Creating a FAISS VectorStore

Now that we have documents - we'll need a place to store them alongside their embeddings.

In [11]:
from langchain_community.vectorstores import FAISS

vector_store = FAISS.from_documents(documents, embeddings)

####❓ Question #1:

List out a few of the techniques that FAISS uses that make it performant.

> NOTE: Check the [repository](https://github.com/facebookresearch/faiss) for more information about FAISS!

FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors. It utilizes several techniques to achieve high performance:

1. Indexing Structures: FAISS implements various indexing structures such as Inverted File, Product Quantization, and Hierarchical Navigable Small World Graph (HNSW) to efficiently store and retrieve vectors.

2. Quantization: FAISS uses quantization techniques to reduce the dimensionality of vectors, making them more compact and faster to process. This helps in reducing memory usage and improving search speed.

3. GPU Support: FAISS provides GPU support, allowing for parallel processing and faster search operations on GPUs. This is particularly beneficial for large-scale similarity search tasks.

4. Multi-Probe Search: FAISS employs multi-probe search techniques to explore multiple regions of the index during search, improving recall without sacrificing efficiency.

5. SIMD Instructions: FAISS leverages Single Instruction, Multiple Data (SIMD) instructions to perform vector operations in parallel, maximizing computational efficiency.

6. Approximate Nearest Neighbor (ANN) Search: FAISS focuses on approximate nearest neighbor search, which trades off some accuracy for significant gains in search speed. This makes it suitable for large-scale similarity search tasks where exact results are not necessary.

#### Creating a Retriever

To complete our index, all that's left to do is expose our vectorstore as a retriever - which we can do the same way we would in previous version of LangChain!

In [12]:
retriever = vector_store.as_retriever()

#### Testing our Retriever

Now that we've gone through the trouble of creating our retriever - let's see it in action!

In [13]:
retrieved_documents = retriever.invoke("Who is the plantiff?")

In [14]:
for doc in retrieved_documents:
  print(doc)

page_content='would be owned by the foundation and used ‘for the good of the world’[.]” Plaintiff \nreplied: “Agree on all.” Ex. 2 at 1.' metadata={'source': 'DataRepository/MuskComplaint.pdf', 'file_path': 'DataRepository/MuskComplaint.pdf', 'page': 27, 'total_pages': 46, 'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': '', 'creationDate': '', 'modDate': '', 'trapped': ''}
page_content='property and derivative works funded by those monies, Plaintiff is presently unable to ascertain his \ninterest in or the use, allocation, or distribution of assets without an accounting. Plaintiff is therefore \nentitled to an accounting.' metadata={'source': 'DataRepository/MuskComplaint.pdf', 'file_path': 'DataRepository/MuskComplaint.pdf', 'page': 32, 'total_pages': 46, 'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': '', 'creationDate': '', 'modDate': '', 'trapped': ''}
page_content='1

### Creating a RAG Chain

Now that we have the "R" in RAG taken care of - let's look at creating the "AG"!

#### Creating a Prompt Template

There are a few different ways we could create our prompt template - we could create a custom template, as seen in the code below, or we could simply pull a prompt from the prompt hub! Let's look at an example of that!

In [15]:
from langchain import hub

retrieval_qa_prompt = hub.pull("langchain-ai/retrieval-qa-chat")

In [16]:
print(retrieval_qa_prompt.messages[0].prompt.template)

Answer any use questions based solely on the context below:

<context>
{context}
</context>


As you can see - the prompt template is simple (and has a small error) - so we'll create our own to be a bit more specific!

In [17]:
from langchain.prompts import ChatPromptTemplate

template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

Context:
{context}

Question:
{question}
"""

prompt = ChatPromptTemplate.from_template(template)

#### Setting Up our Basic QA Chain

Now we can instantiate our basic RAG chain!

We'll use LCEL directly just to see an example of it - but you could just as easily use an abstraction here to achieve the same goal!

We'll also ensure to pass-through our context - which is critical for RAGAS.

In [18]:
from operator import itemgetter

from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

primary_qa_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the base_retriever
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)

####🏗️ Activity #1:

Describe the pipeline shown above in simple terms. You can include a diagram if desired.

Let's test it out!

The code defines a pipeline for a retrieval-augmented question-answering system using the OpenAI GPT-3.5-turbo model. Here's a simplified explanation:

1. The pipeline starts with an input dictionary that contains a single key-value pair: `{"question": "<<SOME USER QUESTION>>"}`.

2. The `itemgetter("question")` function retrieves the question from the input dictionary. This question is then passed to the `retriever` (which is not defined in the provided code, but presumably retrieves relevant context for the question from a knowledge base or database).

3. The output from the retriever is combined with the original question to form a new dictionary: `{"context": <<RETRIEVED CONTEXT>>, "question": "<<ORIGINAL QUESTION>>"}`.

4. The `RunnablePassthrough.assign(context=itemgetter("context"))` step takes the context from the previous step and passes it through without any changes. This is useful for preserving the context for use in later steps of the pipeline.

5. The `prompt` function (which is not defined in the provided code, but presumably formats the context and question into a prompt suitable for the language model) takes the context and question from the previous step, formats them into a prompt, and passes this prompt to the `primary_qa_llm` (a GPT-3.5-turbo model).

6. The GPT-3.5-turbo model generates a response to the prompt, which is stored in a new dictionary along with the original context: `{"response": "<<MODEL RESPONSE>>", "context": "<<ORIGINAL CONTEXT>>"}`.

In summary, this pipeline takes a user's question, retrieves relevant context, formats the context and question into a prompt for the GPT-3.5-turbo model, generates a response from the model, and returns the response along with the original context.

In [19]:
question = "Who is the plantiff?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)

Elon Musk


In [20]:
question = "What does this complaint pertain to?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)
print(result["context"])

The complaint pertains to breach of fiduciary duty, unfair business practices, accounting, and a demand for a jury trial.
[Document(page_content='1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \n12 \n13 \n14 \n15 \n16 \n17 \n18 \n19 \n20 \n21 \n22 \n23 \n24 \n25 \n26 \n27 \n28 \n \n \n– 31 – \nCOMPLAINT \n \nTHIRD CAUSE OF ACTION \nBreach of Fiduciary Duty  \nAgainst All Defendants \n133. \nPlaintiff realleges and incorporates by reference only paragraphs of this Complaint \nnecessary for his claim of Breach of Fiduciary Duty. \n134. \nUnder California law, Defendants owe fiduciary duties to Plaintiff, including a duty \nto use Plaintiff’s contributions for the purposes for which they were made. E.g., Cal. Bus. & Prof. \nCode § 17510.8. Defendants have repeatedly breached their fiduciary duties to Plaintiff, including \nby:', metadata={'source': 'DataRepository/MuskComplaint.pdf', 'file_path': 'DataRepository/MuskComplaint.pdf', 'page': 30, 'total_pages': 46, 'format': 'PDF 1.7', 'title':

We can already see that there are some improvements we could make here.

For now, let's switch gears to RAGAS to see how we can leverage that tool to provide us insight into how our pipeline is performing!

# 🤝 Breakout Room #2

## Task 1: Synthetic Dataset Generation for Evaluation using Ragas

Ragas is a powerful library that lets us evaluate our RAG pipeline by collecting input/output/context triplets and obtaining metrics relating to a number of different aspects of our RAG pipeline.

We'll be evluating on every core metric today, but in order to do that - we'll need to creat a test set. Luckily for us, Ragas can do that directly!

### Synthetic Test Set Generation

We can leverage Ragas' [`Synthetic Test Data generation`](https://docs.ragas.io/en/stable/concepts/testset_generation.html) functionality to generate our own synthetic QC pairs - as well as a synthetic ground truth - quite easily!

> NOTE: This process will use `gpt-3.5-turbo-16k` as the base generator and `gpt-4` as the critic - if you're attempting to create a lot of samples please be aware of cost, as well as rate limits.

In [21]:
eval_documents = documents

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 400
)

eval_documents = text_splitter.split_documents(eval_documents)

####❓ Question #2:

Why is it important to split our documents using different parameters when creating our synthetic data?

When creating synthetic data, it is important to split our documents using different parameters for several reasons:

1. **Representativeness**: Splitting the data using different parameters helps ensure that the synthetic data accurately represents the original data. By considering various parameters, such as time, location, or user demographics, we can capture the diversity and patterns present in the real data.

2. **Generalization**: Splitting the data using different parameters allows us to create synthetic data that generalizes well. By including different subsets of the data, we can capture a wider range of scenarios and variations. This helps avoid overfitting, where the synthetic data is too specific to the original data and does not generalize well to new, unseen data.

3. **Evaluation and Testing**: Splitting the data using different parameters enables us to evaluate and test the quality of the synthetic data. By comparing the synthetic data generated from different subsets of the original data, we can assess how well the synthetic data captures the characteristics and patterns of the real data. This evaluation is crucial to ensure the reliability and usefulness of the synthetic data for downstream tasks, such as machine learning model training or data analysis.

4. **Privacy and Security**: Splitting the data using different parameters helps protect the privacy and security of sensitive information. By carefully selecting and anonymizing the parameters used for splitting, we can prevent the identification of individuals or sensitive data points in the synthetic data. This is particularly important when working with sensitive or confidential data, where privacy regulations and ethical considerations come into play.

Overall, splitting our documents using different parameters when creating synthetic data allows us to generate representative, generalized, and privacy-preserving data that can be used for various purposes, such as research, testing, or training machine learning models.

In [22]:
len(documents)

159

In [23]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

generator = TestsetGenerator.with_openai()

testset = generator.generate_with_langchain_docs(documents, test_size=10, distributions={simple: 0.25, reasoning: 0.25, multi_context: 0.5})

embedding nodes:   0%|          | 0/318 [00:00<?, ?it/s]



Generating:   0%|          | 0/10 [00:00<?, ?it/s]

####❓ Question #3:

`{simple: 0.5, reasoning: 0.25, multi_context: 0.25}`

What exactly does this mapping refer to?

> NOTE: Check out the Ragas documentation on this generation process [here](https://docs.ragas.io/en/stable/concepts/testset_generation.html).

The mapping `{simple: 0.5, reasoning: 0.25, multi_context: 0.25}` refers to the distribution of testset generation strategies used by Ragas.

- `simple` represents a simple generation strategy that focuses on generating straightforward questions and answers.
- `reasoning` represents a reasoning-based generation strategy that focuses on generating questions that require logical reasoning or inference.
- `multi_context` represents a multi-context generation strategy that focuses on generating questions that require understanding and integration of multiple contexts.

The numbers associated with each strategy represent the relative proportions of each strategy used in the testset generation process. In this case, `simple` is used 50% of the time, `reasoning` is used 25% of the time, and `multi_context` is used 25% of the time.

Let's look at the output and see what we can learn about it!

In [24]:
testset.test_data[0]

DataRow(question='What is the significance of the "175 billion parameters" in OpenAI\'s GPT-3 model?', contexts=['1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \n12 \n13 \n14 \n15 \n16 \n17 \n18 \n19 \n20 \n21 \n22 \n23 \n24 \n25 \n26 \n27 \n28 \n \n \n– 19 – \nCOMPLAINT \n \n82. \nTheir publication did prove to be useful to the developers of future, powerful models. \nEntire communities sprung up to enhance and extend the models released by OpenAI. These \ncommunities spread to open-source, grass-roots efforts and commercial entities alike. \n83. \nIn 2020, OpenAI announced a third version of its model, GPT-3. It used “175 billion \nparameters, 10x more than any previous non-sparse language model.” Again, OpenAI announced \nthe development of this model with the publication of a research paper describing its complete'], ground_truth='The significance of the "175 billion parameters" in OpenAI\'s GPT-3 model is that it is 10x more than any previous non-sparse language model.', evolution_t

### Generating Responses with RAG Pipeline

Now that we have some QC pairs, and some ground truths, let's evaluate our RAG pipeline using Ragas.

The process is, again, quite straightforward - thanks to Ragas and LangChain!

Let's start by extracting our questions and ground truths from our create testset.

We can start by converting our test dataset into a Pandas DataFrame.

In [25]:
test_df = testset.to_pandas()

In [26]:
test_df

Unnamed: 0,question,contexts,ground_truth,evolution_type,episode_done
0,"What is the significance of the ""175 billion p...",[1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \...,"The significance of the ""175 billion parameter...",simple,True
1,What strategy video game did OpenAI compete in?,[77. \nInitial work at OpenAI followed much in...,"OpenAI competed in Dota 2, a strategy video game.",simple,True
2,What are the two principles OpenAI is based on...,[profit developing AGI for the benefit of huma...,The two principles OpenAI is based on for AGI ...,reasoning,True
3,How did OpenAI use reinforcement learning in t...,[77. \nInitial work at OpenAI followed much in...,OpenAI used reinforcement learning to compete ...,reasoning,True
4,How would OpenAI's proposed business model imp...,"[business model were valid, it would radically...",OpenAI's proposed business model would impact ...,multi_context,True
5,"""What game did OpenAI use reinforcement learni...",[77. \nInitial work at OpenAI followed much in...,OpenAI used reinforcement learning in the game...,multi_context,True
6,"What was Alphabet, Inc.'s response to AI conce...","[Page, then-CEO of Google’s parent company Alp...",,multi_context,True
7,What was Mr. Page's response to Mr. Musk's con...,"[Page, then-CEO of Google’s parent company Alp...",Mr. Page responded that would merely “be the n...,multi_context,True
8,What were Stephen Hawking's concerns about AGI...,[18. \nMr. Musk has long recognized that AGI p...,Stephen Hawking's concerns about AGI in the wr...,multi_context,True
9,"""What is the role of the Transformer architect...",[those connections to the target language. \n7...,The Transformer architecture is used in both G...,multi_context,True


In [27]:
test_questions = test_df["question"].values.tolist()
test_groundtruths = test_df["ground_truth"].values.tolist()

Now we'll generate responses using our RAG pipeline using the questions we've generated - we'll also need to collect our retrieved contexts for each question.

We'll do this in a simple loop to see exactly what's happening!

In [28]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_augmented_qa_chain.invoke({"question" : question})
  answers.append(response["response"].content)
  contexts.append([context.page_content for context in response["context"]])

Now we can wrap our information in a Hugging Face dataset for use in the Ragas library.

In [29]:
from datasets import Dataset

response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

Let's take a peek and see what that looks like!

In [30]:
response_dataset[0]

{'question': 'What is the significance of the "175 billion parameters" in OpenAI\'s GPT-3 model?',
 'answer': 'The significance of the "175 billion parameters" in OpenAI\'s GPT-3 model is that it was 10 times more than any previous non-sparse language model, indicating a significant advancement in model complexity and capabilities.',
 'contexts': ['1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \n12 \n13 \n14 \n15 \n16 \n17 \n18 \n19 \n20 \n21 \n22 \n23 \n24 \n25 \n26 \n27 \n28 \n \n \n– 19 – \nCOMPLAINT \n \n82. \nTheir publication did prove to be useful to the developers of future, powerful models. \nEntire communities sprung up to enhance and extend the models released by OpenAI. These \ncommunities spread to open-source, grass-roots efforts and commercial entities alike. \n83. \nIn 2020, OpenAI announced a third version of its model, GPT-3. It used “175 billion \nparameters, 10x more than any previous non-sparse language model.” Again, OpenAI announced \nthe development of this model 

## Task 2: Evaluating our Pipeline with Ragas

Now that we have our response dataset - we can finally get into the "meat" of Ragas - evaluation!

First, we'll import the desired metrics, then we can use them to evaluate our created dataset!

Check out the specific metrics we'll be using in the Ragas documentation:

- [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html)
- [Answer Relevancy](https://docs.ragas.io/en/stable/concepts/metrics/answer_relevance.html)
- [Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)
- [Context Recall](https://docs.ragas.io/en/stable/concepts/metrics/context_recall.html)
- [Answer Correctness](https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html)

See the accompanied presentation for more in-depth explanations about each of the metrics!

In [31]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
)

metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
]

All that's left to do is call "evaluate" and away we go!

In [32]:
results = evaluate(response_dataset, metrics)

Evaluating:   0%|          | 0/50 [00:00<?, ?it/s]

In [33]:
results

{'faithfulness': 0.9630, 'answer_relevancy': 0.9574, 'context_recall': 0.7500, 'context_precision': 0.8472, 'answer_correctness': 0.6421}

In [34]:
results_df = results.to_pandas()
results_df

Unnamed: 0,question,answer,contexts,ground_truth,faithfulness,answer_relevancy,context_recall,context_precision,answer_correctness
0,"What is the significance of the ""175 billion p...","The significance of the ""175 billion parameter...",[1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \...,"The significance of the ""175 billion parameter...",1.0,0.999356,1.0,1.0,0.621103
1,What strategy video game did OpenAI compete in?,Dota 2,[77. \nInitial work at OpenAI followed much in...,"OpenAI competed in Dota 2, a strategy video game.",1.0,0.95385,1.0,1.0,0.716374
2,What are the two principles OpenAI is based on...,The two principles OpenAI is based on for AGI ...,[a key role in recruiting world-class talent t...,The two principles OpenAI is based on for AGI ...,1.0,1.0,1.0,0.833333,0.999059
3,How did OpenAI use reinforcement learning in t...,OpenAI used reinforcement learning to compete ...,[77. \nInitial work at OpenAI followed much in...,OpenAI used reinforcement learning to compete ...,1.0,0.903244,1.0,1.0,0.745112
4,How would OpenAI's proposed business model imp...,OpenAI's proposed business model would impact ...,[1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \...,OpenAI's proposed business model would impact ...,,0.985768,1.0,1.0,0.617759
5,"""What game did OpenAI use reinforcement learni...",OpenAI used reinforcement learning to compete ...,[77. \nInitial work at OpenAI followed much in...,OpenAI used reinforcement learning in the game...,1.0,0.927906,0.5,1.0,0.539314
6,"What was Alphabet, Inc.'s response to AI conce...","Alphabet, Inc.'s response to AI concerns raise...","[Page, then-CEO of Google’s parent company Alp...",,0.666667,0.999369,0.0,0.0,0.186539
7,What was Mr. Page's response to Mr. Musk's con...,Mr. Page responded that the potential replacem...,"[Page, then-CEO of Google’s parent company Alp...",Mr. Page responded that would merely “be the n...,1.0,0.913005,1.0,0.805556,0.598265
8,What were Stephen Hawking's concerns about AGI...,Stephen Hawking's concerns about AGI in the wr...,[18. \nMr. Musk has long recognized that AGI p...,Stephen Hawking's concerns about AGI in the wr...,1.0,0.96459,0.0,0.833333,0.724574
9,"""What is the role of the Transformer architect...",The Transformer architecture is used in the GP...,[those connections to the target language. \n7...,The Transformer architecture is used in both G...,1.0,0.927162,1.0,1.0,0.673377


## Task 3: Making Adjustments to our RAG Pipeline

Now that we have established a baseline - we can see how any changes impact our pipeline's performance!

Let's modify our retriever and see how that impacts our Ragas metrics!

In [35]:
from langchain.retrievers import MultiQueryRetriever

advanced_retriever = MultiQueryRetriever.from_llm(retriever=retriever, llm=primary_qa_llm)

We'll also re-create our RAG pipeline using the abstractions that come packaged with LangChain v0.1.0!

First, let's create a chain to "stuff" our documents into our context!

In [36]:
from langchain.chains.combine_documents import create_stuff_documents_chain

document_chain = create_stuff_documents_chain(primary_qa_llm, retrieval_qa_prompt)

Next, we'll create the retrieval chain!

In [37]:
from langchain.chains import create_retrieval_chain

retrieval_chain = create_retrieval_chain(advanced_retriever, document_chain)

In [38]:
response = retrieval_chain.invoke({"input": "Who is the plantiff?"})

In [39]:
print(response["answer"])

The plaintiff is Elon Musk.


In [40]:
response = retrieval_chain.invoke({"input": "What does this complaint pertain to?"})

In [41]:
print(response["answer"])

The complaint pertains to a legal case involving Plaintiff Elon Musk alleging breach of fiduciary duty, unfair business practices, and seeking an accounting, restitution, disgorgement of funds, and specific performance from the Defendants. The complaint also includes a demand for a jury trial.


Well, just from those responses this chain *feels* better - but lets see how it performs on our eval!

Let's do the same process we did before to collect our pipeline's contexts and answers.

In [42]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])

Now we can convert this into a dataset, just like we did before.

In [43]:
response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

Let's evaluate on the same metrics we did for the first pipeline and see how it does!

In [44]:
advanced_retrieval_results = evaluate(response_dataset_advanced_retrieval, metrics)

Evaluating:   0%|          | 0/50 [00:00<?, ?it/s]

In [45]:
advanced_retrieval_results_df = advanced_retrieval_results.to_pandas()
advanced_retrieval_results_df

Unnamed: 0,question,answer,contexts,ground_truth,faithfulness,answer_relevancy,context_recall,context_precision,answer_correctness
0,"What is the significance of the ""175 billion p...","The ""175 billion parameters"" in OpenAI's GPT-3...",[1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \...,"The significance of the ""175 billion parameter...",1.0,0.979156,1.0,1.0,0.616704
1,What strategy video game did OpenAI compete in?,"OpenAI competed in Dota 2, a strategy video ga...",[77. \nInitial work at OpenAI followed much in...,"OpenAI competed in Dota 2, a strategy video game.",1.0,1.0,1.0,1.0,0.741034
2,What are the two principles OpenAI is based on...,The two principles OpenAI is based on for AGI ...,[a key role in recruiting world-class talent t...,The two principles OpenAI is based on for AGI ...,1.0,1.0,1.0,0.833333,0.995852
3,How did OpenAI use reinforcement learning in t...,OpenAI used reinforcement learning to compete ...,[77. \nInitial work at OpenAI followed much in...,OpenAI used reinforcement learning to compete ...,1.0,0.933311,1.0,1.0,0.618101
4,How would OpenAI's proposed business model imp...,OpenAI's proposed business model could potenti...,[1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \...,OpenAI's proposed business model would impact ...,1.0,0.901728,1.0,0.75,0.490269
5,"""What game did OpenAI use reinforcement learni...",OpenAI used reinforcement learning in the stra...,[77. \nInitial work at OpenAI followed much in...,OpenAI used reinforcement learning in the game...,1.0,0.93784,0.5,1.0,0.536569
6,"What was Alphabet, Inc.'s response to AI conce...","Alphabet, Inc.'s response to AI concerns raise...","[Page, then-CEO of Google’s parent company Alp...",,1.0,1.0,0.0,0.0,0.184103
7,What was Mr. Page's response to Mr. Musk's con...,Mr. Page responded to Mr. Musk's concerns abou...,"[Page, then-CEO of Google’s parent company Alp...",Mr. Page responded that would merely “be the n...,1.0,0.974393,1.0,0.805556,0.828274
8,What were Stephen Hawking's concerns about AGI...,"Stephen Hawking, along with other luminaries l...",[18. \nMr. Musk has long recognized that AGI p...,Stephen Hawking's concerns about AGI in the wr...,1.0,0.924122,1.0,0.7,0.528843
9,"""What is the role of the Transformer architect...",The Transformer architecture plays a crucial r...,[those connections to the target language. \n7...,The Transformer architecture is used in both G...,1.0,0.940645,1.0,1.0,0.573755


## Task 4: Evaluating our Adjusted Pipeline Against Our Baseline

Now we can compare our results and see what directional changes occured!

Let's refresh with our initial metrics.

In [46]:
results

{'faithfulness': 0.9630, 'answer_relevancy': 0.9574, 'context_recall': 0.7500, 'context_precision': 0.8472, 'answer_correctness': 0.6421}

And see how our advanced retrieval modified our chain!

In [47]:
advanced_retrieval_results

{'faithfulness': 1.0000, 'answer_relevancy': 0.9591, 'context_recall': 0.8500, 'context_precision': 0.8089, 'answer_correctness': 0.6114}

In [48]:
import pandas as pd

df_original = pd.DataFrame(list(results.items()), columns=['Metric', 'Baseline'])
df_comparison = pd.DataFrame(list(advanced_retrieval_results.items()), columns=['Metric', 'MultiQueryRetriever with Document Stuffing'])

df_merged = pd.merge(df_original, df_comparison, on='Metric')

df_merged['Delta'] = df_merged['MultiQueryRetriever with Document Stuffing'] - df_merged['Baseline']

df_merged

Unnamed: 0,Metric,Baseline,MultiQueryRetriever with Document Stuffing,Delta
0,faithfulness,0.962963,1.0,0.037037
1,answer_relevancy,0.957425,0.95912,0.001694
2,context_recall,0.75,0.85,0.1
3,context_precision,0.847222,0.808889,-0.038333
4,answer_correctness,0.642148,0.61135,-0.030797


## Task 5: Testing OpenAI's Claim

Now that we've seen how our retriever can impact the performance of our RAG pipeline - let's see how changing our embedding model impacts performance.

####🏗️ Activity #2:

Please provide markdown, or code comments, to explain which each of the following steps are doing!

The following line of code new_embeddings = OpenAIEmbeddings(model="text-embedding-3-small") is creating an instance of the OpenAIEmbeddings class with the specified model "text-embedding-3-small".

In [49]:
new_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

This following line of code is creating a vector store using the FAISS (Facebook AI Similarity Search) library.

In [50]:
vector_store = FAISS.from_documents(documents, new_embeddings)

The following line of code new_retriever = vector_store.as_retriever() is creating a retriever object from the vector_store object.

In [51]:
new_retriever = vector_store.as_retriever()

The following line of code is creating an instance of the MultiQueryRetriever class using the from_llm class method.

In [52]:
new_advanced_retriever = MultiQueryRetriever.from_llm(retriever=new_retriever, llm=primary_qa_llm)

The following line of code new_retrieval_chain = create_retrieval_chain(new_advanced_retriever, document_chain) is calling a function named create_retrieval_chain with two arguments: new_advanced_retriever and document_chain.

In [53]:
new_retrieval_chain = create_retrieval_chain(new_advanced_retriever, document_chain)

The following lines of code is running a set of test questions through a retrieval chain and storing the answers and contexts.

In [54]:
answers = []
contexts = []

for question in test_questions:
  response = new_retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])

The following lines of code is creating a new dataset using the Dataset.from_dict method from the datasets library, which is a common library used in Natural Language Processing tasks.

In [55]:
new_response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

The following line of code is calling an evaluate function with two arguments: new_response_dataset_advanced_retrieval and metrics.

In [56]:
new_advanced_retrieval_results = evaluate(new_response_dataset_advanced_retrieval, metrics)

Evaluating:   0%|          | 0/50 [00:00<?, ?it/s]

The following line of code "variable new_advanced_retrieval_results" is the result of an evaluation function that was called previously in the previous line of code.

In [57]:
new_advanced_retrieval_results

{'faithfulness': 1.0000, 'answer_relevancy': 0.9522, 'context_recall': 0.8750, 'context_precision': 0.8700, 'answer_correctness': 0.6420}

The following lines of code is creating pandas DataFrames from dictionaries that contain evaluation results, merging these DataFrames, and calculating the differences in results between different models.

In [58]:
df_baseline = pd.DataFrame(list(results.items()), columns=['Metric', 'Baseline'])
df_original = pd.DataFrame(list(advanced_retrieval_results.items()), columns=['Metric', 'ADA'])
df_comparison = pd.DataFrame(list(new_advanced_retrieval_results.items()), columns=['Metric', 'Text Embedding 3'])

df_merged = pd.merge(df_original, df_comparison, on='Metric')
df_merged = pd.merge(df_baseline, df_merged, on="Metric")

df_merged['Delta - TE3 -> ADA'] = df_merged['Text Embedding 3'] - df_merged['ADA']
df_merged['Delta - TE3 -> Baseline'] = df_merged['Text Embedding 3'] - df_merged['Baseline']

df_merged

Unnamed: 0,Metric,Baseline,ADA,Text Embedding 3,Delta - TE3 -> ADA,Delta - TE3 -> Baseline
0,faithfulness,0.962963,1.0,1.0,0.0,0.037037
1,answer_relevancy,0.957425,0.95912,0.952203,-0.006917,-0.005222
2,context_recall,0.75,0.85,0.875,0.025,0.125
3,context_precision,0.847222,0.808889,0.87,0.061111,0.022778
4,answer_correctness,0.642148,0.61135,0.641967,0.030617,-0.00018


####❓ Question #4:

Do you think, in your opinion, `text-embedding-3-small` is significantly better than `ada`?

text-embedding-3-small is our new highly efficient embedding model and provides a significant upgrade over its predecessor, the text-embedding-ada-002 model released in December 2022.

Stronger performance. Comparing text-embedding-ada-002 to text-embedding-3-small, the average score on a commonly used benchmark for multi-language retrieval (MIRACL) has increased from 31.4% to 44.0%, while the average score on a commonly used benchmark for English tasks (MTEB) has increased from 61.0% to 62.3%.

Reduced price. text-embedding-3-small is also substantially more efficient than our previous generation text-embedding-ada-002 model. Pricing for text-embedding-3-small has therefore been reduced by 5X compared to text-embedding-ada-002, from a price per 1k tokens of $0.0001 to $0.00002.

## BONUS ACTIVITY: Showcase Multi-Context Perfomance Changes

Now that we've looked at a number of different examples - showcase the difference on the multi-context *specific* questions that were synthetically generated.

> NOTE: You have all the data you'll need already in the notebook if you made it to this step!

In [None]:
import pandas as pd

# Define the metrics to evaluate
metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
]

# Create a DataFrame to store the results
results_df = pd.DataFrame(columns=['Pipeline', 'Metric', 'Score'])

# Evaluate the baseline pipeline
baseline_results = evaluate(response_dataset, metrics)
baseline_df = baseline_results.to_pandas()
baseline_df['Pipeline'] = 'Baseline'
results_df = pd.concat([results_df, baseline_df])

# Evaluate the advanced retrieval pipeline
advanced_results = evaluate(response_dataset_advanced_retrieval, metrics)
advanced_df = advanced_results.to_pandas()
advanced_df['Pipeline'] = 'Advanced Retrieval'
results_df = pd.concat([results_df, advanced_df])

# Evaluate the pipeline with text-embedding-3-small
new_results = evaluate(new_response_dataset_advanced_retrieval, metrics)
new_df = new_results.to_pandas()
new_df['Pipeline'] = 'Text Embedding 3'
results_df = pd.concat([results_df, new_df])

# Pivot the DataFrame for better visualization
results_pivot = results_df.pivot(index='Metric', columns='Pipeline', values='Score')

results_pivot