<a href="https://colab.research.google.com/github/vprzybylo/AIE5/blob/main/Evaluating_RAG_with_Ragas_(2025)_AI_Makerspace.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using Ragas to Evaluate a RAG Application built with LangChain and LangGraph

In this notebook, we'll look at how [Ragas](https://github.com/explodinggradients/ragas) can help you evaluate your RAG applications!

While this example is rooted in LangChain/LangGraph - Ragas is framework agnostic (you don't even need to be using a framework!).

- 🤝 Breakout Room #1
  1. Task 1: Installing Required Libraries
  2. Task 2: Set Environment Variables
  3. Task 3: Synthetic Dataset Generation for Evaluation using Ragas
  4. Task 4: Evaluating our Pipeline with Ragas
  5. Task 5: Making Adjustments and Re-Evaluating

But first! Let's set some dependencies!

## Dependencies and API Keys:

> NOTE: Please skip the pip install commands if you are running the notebook locally.

In [9]:
!pip install -qU ragas==0.2.10

In [8]:
!pip install -qU langchain-community==0.3.14 langchain-openai==0.2.14 unstructured==0.16.12 langgraph==0.2.61 langchain-qdrant==0.2.0

We'll also need to provide our API keys.

First, OpenAI's for our LLM/embedding model combination!

In [14]:
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("Please enter your OpenAI API key!")

Please enter your OpenAI API key!··········


OPTIONALLY:

We can also provide a Ragas API key - which you can sign up for [here](https://app.ragas.io/).

In [15]:
os.environ["RAGAS_APP_TOKEN"] = getpass("Please enter your Ragas API key!")

Please enter your Ragas API key!··········


## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it will allow us to find out how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

First, we'll download the webpages that will serve as our data today.

These webpages are from [Simon Willison's](https://simonwillison.net/) yearly "AI learnings".

- [2023 Blog](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
- [2024 Blog](https://simonwillison.net/2024/Dec/31/llms-in-2024/)

Let's start by collecting our data into a useful pile!

In [4]:
!mkdir data

In [5]:
!curl https://simonwillison.net/2023/Dec/31/ai-in-2023/ -o data/2023_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 31427    0 31427    0     0  32810      0 --:--:-- --:--:-- --:--:-- 32804


In [6]:
!curl https://simonwillison.net/2024/Dec/31/llms-in-2024/ -o data/2024_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 70286    0 70286    0     0  71860      0 --:--:-- --:--:-- --:--:-- 71867


Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`.

In [16]:
from langchain_community.document_loaders import DirectoryLoader

path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge-graph-based approach to create data. This is extremely useful because it lets us create complex queries rather simply. The additional test set complexity lets us evaluate our system on harder problems, since systems tend to be very strong on simple evaluation tasks.
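
To make the idea concrete, here is a toy sketch of how a graph over document chunks can surface multi-hop question candidates. It is not Ragas's implementation: the chunk texts, the hand-written entity sets, and the `shared_entities` helper are all hypothetical, and Ragas derives entities, themes, and similarity edges with an LLM and embeddings instead.

```python
from itertools import combinations

# Hypothetical chunks with hand-labelled entities.
# (Assumption: a real pipeline would extract these automatically.)
chunks = {
    "a": {"text": "Claude 3 was released by Anthropic.", "entities": {"Claude", "Anthropic"}},
    "b": {"text": "Anthropic shipped Claude Artifacts.", "entities": {"Anthropic", "Artifacts"}},
    "c": {"text": "Llama runs on a laptop.", "entities": {"Llama"}},
}


def shared_entities(left: str, right: str) -> set:
    """Return the entities two chunks have in common."""
    return chunks[left]["entities"] & chunks[right]["entities"]


# An edge connects two chunks whenever they mention at least one common entity.
edges = {
    (left, right): shared_entities(left, right)
    for left, right in combinations(sorted(chunks), 2)
    if shared_entities(left, right)
}

# Every chunk can seed a single-hop question; every edge can seed a
# multi-hop question that needs both chunks to answer.
single_hop_candidates = sorted(chunks)
multi_hop_candidates = sorted(edges)

print(single_hop_candidates)
print(multi_hop_candidates)
```

Only chunks `a` and `b` are linked here, so they form the single multi-hop candidate, while `c` can only produce single-hop questions.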

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Abstracted SDG

Building the knowledge graph by hand is the full synthetic data generation (SDG) process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood and, from there, generate the personas and scenarios used to construct our queries.



In [17]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [18]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/12 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/26 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [19]:
dataset.to_pandas()

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | How has Anthropic's Claude series contributed ... | [Prompt driven app generation is a commodity a... | Anthropic's Claude series has made significant... | single_hop_specifc_query_synthesizer |
| 1 | Wht are the cost implications of using GPT-3.5... | [gets you OpenAI’s most expensive model, o1. G... | GPT-3.5 is significantly more expensive compar... | single_hop_specifc_query_synthesizer |
| 2 | Wht is the WebRTC API and how is it used in ap... | [feed with the model and talk about what you c... | The WebRTC API, announced by OpenAI in Decembe... | single_hop_specifc_query_synthesizer |
| 3 | Wht is Claud? | [dependent on AGI itself. A model that’s robus... | Claude is associated with Anthropic’s Amanda A... | single_hop_specifc_query_synthesizer |
| 4 | How have the training costs and environmental ... | [<1-hop>\n\nCode may be the best application T... | The training costs for large language models (... | multi_hop_abstract_query_synthesizer |
| 5 | How has OpenAI contributed to the development ... | [<1-hop>\n\nCode may be the best application T... | OpenAI has played a significant role in the de... | multi_hop_abstract_query_synthesizer |
| 6 | How have the training costs and environmental ... | [<1-hop>\n\nCode may be the best application T... | The training costs for Large Language Models (... | multi_hop_abstract_query_synthesizer |
| 7 | How does the black box nature of Large Languag... | [<1-hop>\n\nCode may be the best application T... | The black box nature of Large Language Models ... | multi_hop_abstract_query_synthesizer |
| 8 | How has Meta's approach to training data and m... | [<1-hop>\n\nAnother common technique is to use... | Meta's approach to training data and model acc... | multi_hop_specific_query_synthesizer |
| 9 | How have advancements in Large Language Models... | [<1-hop>\n\nSimon Willison’s Weblog Subscribe ... | In 2023, Large Language Models (LLMs) experien... | multi_hop_specific_query_synthesizer |


#### OPTIONAL:

If you've provided your Ragas API key - you can use this web interface to look at the created data!

In [20]:
dataset.upload()

Testset uploaded! View at https://app.ragas.io/dashboard/alignment/testset/98d0e874-152e-479d-b309-7613123b1769


'https://app.ragas.io/dashboard/alignment/testset/98d0e874-152e-479d-b309-7613123b1769'

## LangChain RAG

Now we'll construct our LangChain RAG, which we will evaluate using the test data created above!

### R - Retrieval

Let's start with building our retrieval pipeline, which will involve loading the same data we used to create our synthetic test set above.

> NOTE: We need to use the same data - as our test set is specifically designed for this data.

In [21]:
path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

Now that we have our data loaded, let's split it into chunks!

In [22]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_documents = text_splitter.split_documents(docs)
len(split_documents)

74

#### ❓ Question:

What is the purpose of the `chunk_overlap` parameter in the `RecursiveCharacterTextSplitter`?

The `chunk_overlap` parameter sets up to how many characters from the end of one chunk are repeated at the start of the next. This overlap preserves context across chunk boundaries, so a sentence or idea that straddles a split still appears whole in at least one chunk.
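
As a rough illustration of the overlap idea, here is a simplified fixed-size character chunker. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences); the `sliding_chunks` helper is hypothetical and exists only to show what the overlap looks like.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    # Each new chunk starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


text = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_chunks(text, chunk_size=10, chunk_overlap=3)

for piece in pieces:
    print(piece)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

The last three characters of each chunk (`hij`, `opq`, `vwx`) reappear at the start of the next one, which is the continuity that `chunk_overlap` buys.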


Next up, we'll need to provide an embedding model that we can use to construct our vector store.

In [23]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

Now we can build our in-memory Qdrant vector store.

In [25]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

client = QdrantClient(":memory:")

client.create_collection(
    collection_name="ai_across_years",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

vector_store = QdrantVectorStore(
    client=client,
    collection_name="ai_across_years",
    embedding=embeddings,
)

We can now add our documents to our vector store.

In [26]:
_ = vector_store.add_documents(documents=split_documents)

Let's define our retriever.

In [27]:
retriever = vector_store.as_retriever(search_kwargs={"k": 5})
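
Conceptually, a retriever with `k` set returns the `k` stored vectors most similar to the query vector. The sketch below shows that idea with plain Python and cosine similarity, the same distance metric configured for the collection above. The three-dimensional vectors, document names, and the `top_k` helper are made up for illustration; real embeddings here have 1536 dimensions, and Qdrant's actual search is more sophisticated.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int) -> list[str]:
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


# Hypothetical toy "embeddings".
store = [
    ("agents", [1.0, 0.0, 0.0]),
    ("pricing", [0.0, 1.0, 0.0]),
    ("agents-and-tools", [0.8, 0.0, 0.6]),
    ("ethics", [0.0, 0.0, 1.0]),
]

query = [1.0, 0.0, 0.2]
nearest = top_k(query, store, k=2)
print(nearest)  # ['agents', 'agents-and-tools']
```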

Now we can produce a node for retrieval!

In [28]:
def retrieve(state):
  retrieved_docs = retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

### Augmented

Let's create a simple RAG prompt!

In [29]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
You are a helpful assistant who answers questions based on provided context. You must only use the provided context, and cannot use your own knowledge.

### Question
{question}

### Context
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

### Generation

We'll also need an LLM to generate responses - we'll use `gpt-4o-mini` to avoid using the same model as our judge model.

In [30]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

Then we can create a `generate` node!

In [31]:
def generate(state):
  docs_content = "\n\n".join(doc.page_content for doc in state["context"])
  messages = rag_prompt.format_messages(question=state["question"], context=docs_content)
  response = llm.invoke(messages)
  return {"response" : response.content}
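
To see what the model actually receives, it can help to look at the prompt assembly on its own. In this minimal sketch, plain `str.format` stands in for `ChatPromptTemplate`, the template is shortened, and the `build_prompt` helper and passages are hypothetical.

```python
PROMPT = """\
### Question
{question}

### Context
{context}
"""


def build_prompt(question: str, passages: list[str]) -> str:
    """Join retrieved passages with blank lines and fill them into the template."""
    context = "\n\n".join(passages)
    return PROMPT.format(question=question, context=context)


prompt = build_prompt(
    "How are LLM agents useful?",
    ["Passage one.", "Passage two."],
)
print(prompt)
```

Separating passages with a blank line keeps chunk boundaries visible to the model without adding any extra markup.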

### Building RAG Graph with LangGraph

Let's create some state for our LangGraph RAG graph!

In [32]:
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
from langchain_core.documents import Document

class State(TypedDict):
  question: str
  context: List[Document]
  response: str

Now we can build our simple graph!

> NOTE: We're using `add_sequence` since we will always move from retrieval to generation. This is essentially building a chain in LangGraph.

In [33]:
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()
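
A helpful mental model for a sequence of nodes: each node receives the current state and returns a partial update, which is merged into the state before the next node runs. The sketch below mimics that behaviour in plain Python. It is a conceptual stand-in rather than LangGraph's implementation, and `run_sequence`, `toy_retrieve`, and `toy_generate` are hypothetical names.

```python
def run_sequence(nodes, state: dict) -> dict:
    """Run nodes in order, merging each node's partial update into the state."""
    state = dict(state)  # copy so the caller's dict is left untouched
    for node in nodes:
        state.update(node(state))
    return state


def toy_retrieve(state: dict) -> dict:
    # Stand-in for a retriever call: returns only the key it is responsible for.
    return {"context": [f"doc about {state['question']}"]}


def toy_generate(state: dict) -> dict:
    # Stand-in for an LLM call: reads the context written by the previous node.
    return {"response": f"Answer based on {len(state['context'])} document(s)."}


final_state = run_sequence([toy_retrieve, toy_generate], {"question": "agents"})
print(final_state)
```

This is why `retrieve` only returns `context` and `generate` only returns `response`: neither node needs to pass the full state along.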

Let's do a test to make sure it's doing what we'd expect.

In [None]:
response = graph.invoke({"question" : "How are LLM agents useful?"})

In [None]:
response["response"]

'LLM agents can be useful in a couple of ways, primarily through their ability to act on behalf of users and to solve problems using various tools. They can function similarly to traditional agents, like travel agents, by helping users navigate tasks or make decisions. Additionally, LLMs can be enhanced with access to tools, allowing them to run processes in loops to address specific challenges.\n\nDespite some skepticism about their utility, LLMs are relatively easy to build, requiring only a few hundred lines of code and a substantial amount of quality training data. While training an LLM is not feasible for hobbyists due to costs, it is becoming more accessible than it once was.\n\nMoreover, LLMs can be run on personal devices, which has become increasingly possible thanks to advancements in model accessibility. However, the technology does warrant criticism due to concerns such as hallucination (producing information not aligned with reality), environmental impact, and ethical cons

## Evaluating the App with Ragas

Now we can finally do our evaluation!

We'll start by running the queries we generated using SDG above through our application to get context and responses.

In [34]:
for test_row in dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [35]:
dataset.to_pandas()

| | user_input | retrieved_contexts | reference_contexts | response | reference | synthesizer_name |
|---|---|---|---|---|---|---|
| 0 | How has Anthropic's Claude series contributed ... | [Getting back to models that beat GPT-4: Anthr... | [Prompt driven app generation is a commodity a... | Anthropic's Claude series has made significant... | Anthropic's Claude series has made significant... | single_hop_specifc_query_synthesizer |
| 1 | Wht are the cost implications of using GPT-3.5... | [Today $30/mTok gets you OpenAI’s most expensi... | [gets you OpenAI’s most expensive model, o1. G... | The cost implications of using GPT-3.5 compare... | GPT-3.5 is significantly more expensive compar... | single_hop_specifc_query_synthesizer |
| 2 | Wht is the WebRTC API and how is it used in ap... | [These abilities are just a few weeks old at t... | [feed with the model and talk about what you c... | The WebRTC API is a tool that facilitates real... | The WebRTC API, announced by OpenAI in Decembe... | single_hop_specifc_query_synthesizer |
| 3 | Wht is Claud? | [Anthropic kicked this idea into high gear whe... | [dependent on AGI itself. A model that’s robus... | Claud refers to Claude, which is a series of m... | Claude is associated with Anthropic’s Amanda A... | single_hop_specifc_query_synthesizer |
| 4 | How have the training costs and environmental ... | [OpenAI are not the only game in town here. Go... | [<1-hop>\n\nCode may be the best application T... | The training costs and environmental impact of... | The training costs for large language models (... | multi_hop_abstract_query_synthesizer |
| 5 | How has OpenAI contributed to the development ... | [Since then, almost every major LLM (and most ... | [<1-hop>\n\nCode may be the best application T... | OpenAI has played a significant role in the de... | OpenAI has played a significant role in the de... | multi_hop_abstract_query_synthesizer |
| 6 | How have the training costs and environmental ... | [If you can gather the right data, and afford ... | [<1-hop>\n\nCode may be the best application T... | The training costs of Large Language Models (L... | The training costs for Large Language Models (... | multi_hop_abstract_query_synthesizer |
| 7 | How does the black box nature of Large Languag... | [Code may be the best application\n\nThe ethic... | [<1-hop>\n\nCode may be the best application T... | The black box nature of Large Language Models ... | The black box nature of Large Language Models ... | multi_hop_abstract_query_synthesizer |
| 8 | How has Meta's approach to training data and m... | [I wrote about how Large language models are h... | [<1-hop>\n\nAnother common technique is to use... | Meta's approach to training data and model acc... | Meta's approach to training data and model acc... | multi_hop_specific_query_synthesizer |
| 9 | How have advancements in Large Language Models... | [Law is not ethics. Is it OK to train models o... | [<1-hop>\n\nSimon Willison’s Weblog Subscribe ... | Advancements in Large Language Models (LLMs) i... | In 2023, Large Language Models (LLMs) experien... | multi_hop_specific_query_synthesizer |


Then we can convert that table into an `EvaluationDataset`, which will make the process of evaluation smoother.

In [36]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

We'll need to select a judge model - in this case we're using the same model that was used to generate our Synthetic Data.

In [37]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

Next up - we simply evaluate on our desired metrics!

In [38]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

ERROR:ragas.executor:Exception raised in Job[25]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-iGld80UX4JBP4962ymkitwd4 on tokens per min (TPM): Limit 30000, Used 29615, Requested 1680. Please try again in 2.59s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
ERROR:ragas.executor:Exception raised in Job[7]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-iGld80UX4JBP4962ymkitwd4 on tokens per min (TPM): Limit 30000, Used 29048, Requested 2252. Please try again in 2.6s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
ERROR:ragas.executor:Exception raised in Job[22]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-iGld80UX4JBP4962ymkitwd4

{'context_recall': 0.8000, 'faithfulness': 0.8214, 'factual_correctness': 0.4773, 'answer_relevancy': 0.9648, 'context_entity_recall': 0.4759, 'noise_sensitivity_relevant': 0.2003}

## Making Adjustments and Re-Evaluating

Now that we've got our baseline - let's make a change and see whether the system improves!

> NOTE: This will be using Cohere's Rerank model (which was updated fairly [recently](https://docs.cohere.com/v2/changelog/rerank-v3.5)) - please be sure to [sign up for an API key!](https://docs.cohere.com/reference/about)

In [39]:
os.environ["COHERE_API_KEY"] = getpass("Please enter your Cohere API key!")

Please enter your Cohere API key!··········


In [40]:
!pip install -qU cohere langchain_cohere


We'll first set our retriever to return more documents, which will allow us to take advantage of the reranking.

In [41]:
retriever = vector_store.as_retriever(search_kwargs={"k": 20})

Reranking, or contextual compression, is a technique that uses a reranker to compress the retrieved documents into a smaller set of documents.

This is essentially a slower, more accurate form of semantic similarity that we use on a smaller subset of our documents.
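
The two-stage pattern can be sketched without any external services. In the toy below, a cheap first stage casts a wide net and a more careful second stage reorders the candidates and keeps only the best few. Both scoring functions are hypothetical stand-ins: word overlap replaces vector search, and match density replaces a learned reranker such as Cohere's.

```python
def first_stage(query: str, docs: list[str], k: int) -> list[str]:
    """Cheap, recall-oriented pass: rank by number of distinct shared words."""
    query_words = set(query.lower().split())
    ranked = sorted(
        docs,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def rerank(query: str, candidates: list[str], top_n: int) -> list[str]:
    """Stand-in reranker: rank by the fraction of a document's words that match."""
    query_words = set(query.lower().split())

    def density(doc: str) -> float:
        words = doc.lower().split()
        return sum(word in query_words for word in words) / len(words)

    return sorted(candidates, key=density, reverse=True)[:top_n]


docs = [
    "tools for gardening and llm pricing and travel agents and many other topics",
    "llm agents call tools in a loop",
    "the weather is nice",
    "agents of change",
    "llm agents tools",
]

query = "llm agents tools"
candidates = first_stage(query, docs, k=4)   # wide net
reranked = rerank(query, candidates, top_n=2)  # compressed to the best two
print(reranked)  # ['llm agents tools', 'llm agents call tools in a loop']
```

The first stage cannot tell the rambling gardening document apart from the focused ones because all three share the same query words; the second stage can, which is why retrieving more candidates up front gives the reranker something to work with.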

In [42]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

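def retrieve_adjusted(state):
  # Rerank the 20 candidates from the base retriever with Cohere's reranker.
  compressor = CohereRerank(model="rerank-v3.5")
  # NOTE: `search_kwargs` does not appear to be a field of
  # ContextualCompressionRetriever, so it likely has no effect here; the number
  # of documents kept is governed by CohereRerank's `top_n` setting instead.
  compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever, search_kwargs={"k": 5}
  )
  retrieved_docs = compression_retriever.invoke(state["question"])
  return {"context" : retrieved_docs}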

We can simply rebuild our graph with the new retriever!

In [43]:
class State(TypedDict):
  question: str
  context: List[Document]
  response: str

graph_builder = StateGraph(State).add_sequence([retrieve_adjusted, generate])
graph_builder.add_edge(START, "retrieve_adjusted")
graph = graph_builder.compile()

In [None]:
response = graph.invoke({"question" : "How are LLM agents useful?"})
response["response"]

'LLM agents can be useful in certain contexts, particularly in writing code, as they demonstrate a strong capability in this area. The grammar rules of programming languages are less complex than natural languages, making it easier for LLMs to generate code effectively. However, their overall utility is questioned due to issues such as gullibility, where LLMs may struggle to distinguish truth from fiction. This skepticism is further amplified by the lack of real-world examples of LLM agents operating successfully in production environments, despite the excitement surrounding their potential. Thus, while LLM agents show promise, particularly in coding, concerns about their reliability and decision-making capabilities remain significant challenges that need to be addressed.'

In [44]:
import time

for test_row in dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
  time.sleep(2) # To try to avoid rate limiting.
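
A fixed sleep is the simplest way to slow requests down. A common alternative is to retry with exponential backoff only when a rate limit is actually hit. The sketch below shows the pattern in isolation; `RateLimited`, `call_with_backoff`, and `flaky` are hypothetical names, and in practice you would catch your provider's own rate-limit exception instead.

```python
import time


class RateLimited(Exception):
    """Hypothetical stand-in for a provider's HTTP 429 error."""


def call_with_backoff(fn, max_attempts: int = 5, base_delay: float = 1.0, sleep=time.sleep):
    """Call fn(), waiting base_delay * 2**attempt after each rate-limit failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)


# Demo: a function that is rate limited twice before succeeding.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RateLimited()
    return "ok"


delays = []  # record the waits instead of actually sleeping
outcome = call_with_backoff(flaky, sleep=delays.append)
print(outcome, delays)  # ok [1.0, 2.0]
```

Passing `sleep` as a parameter keeps the helper testable: the demo records the delays rather than pausing for them.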

In [45]:
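# Rebuild the evaluation dataset from the updated test set; the earlier
# `evaluation_dataset` was created before the reranked responses were recorded.
evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result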

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

ERROR:ragas.executor:Exception raised in Job[1]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-iGld80UX4JBP4962ymkitwd4 on tokens per min (TPM): Limit 30000, Used 29423, Requested 2374. Please try again in 3.594s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
ERROR:ragas.executor:Exception raised in Job[13]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-iGld80UX4JBP4962ymkitwd4 on tokens per min (TPM): Limit 30000, Used 28832, Requested 2249. Please try again in 2.162s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
ERROR:ragas.executor:Exception raised in Job[25]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-iGld80UX4JBP4962ymkit

{'context_recall': 0.7800, 'faithfulness': 0.7596, 'factual_correctness': 0.5291, 'answer_relevancy': 0.9673, 'context_entity_recall': 0.5386, 'noise_sensitivity_relevant': 0.1672}

#### ❓ Question:

Which system performed better, on what metrics, and why?

baseline: {'context_recall': 0.8000, 'faithfulness': 0.8214, 'factual_correctness': 0.4773, 'answer_relevancy': 0.9648, 'context_entity_recall': 0.4759, 'noise_sensitivity_relevant': 0.2003}

with reranking: {'context_recall': 0.7800, 'faithfulness': 0.7596, 'factual_correctness': 0.5291, 'answer_relevancy': 0.9673, 'context_entity_recall': 0.5386, 'noise_sensitivity_relevant': 0.1672}

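Neither system wins across the board. With reranking, factual correctness (0.4773 → 0.5291) and context entity recall (0.4759 → 0.5386) improved, and noise sensitivity dropped (0.2003 → 0.1672; lower is better), while context recall (0.8000 → 0.7800) and faithfulness (0.8214 → 0.7596) declined. Answer relevancy was essentially unchanged (0.9648 vs. 0.9673). The reranking system appears to have traded some context recall and faithfulness for improved factual accuracy and less susceptibility to irrelevant content. Keep in mind that the test set has only ten questions and some evaluation jobs hit rate limits, so these differences are best read as directional.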
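
The comparison is easier to read as per-metric deltas. This small sketch uses the two result dictionaries reported above; the `lower_is_better` set encodes the assumption that noise sensitivity is the only metric here where a smaller value is preferable.

```python
baseline = {
    "context_recall": 0.8000,
    "faithfulness": 0.8214,
    "factual_correctness": 0.4773,
    "answer_relevancy": 0.9648,
    "context_entity_recall": 0.4759,
    "noise_sensitivity_relevant": 0.2003,
}
reranked = {
    "context_recall": 0.7800,
    "faithfulness": 0.7596,
    "factual_correctness": 0.5291,
    "answer_relevancy": 0.9673,
    "context_entity_recall": 0.5386,
    "noise_sensitivity_relevant": 0.1672,
}

# Assumption: for this metric a lower score is better.
lower_is_better = {"noise_sensitivity_relevant"}

deltas = {name: round(reranked[name] - baseline[name], 4) for name in baseline}

# A metric improved if it moved in its preferred direction.
improved = sorted(
    name
    for name, delta in deltas.items()
    if (delta < 0) == (name in lower_is_better) and delta != 0
)

for name, delta in deltas.items():
    print(f"{name:28s} {delta:+.4f}")
print("improved:", improved)
```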

