# Using Ragas to Evaluate a RAG Application built with LangChain and LangGraph

In the following notebook, we'll be looking at how [Ragas](https://github.com/explodinggradients/ragas) can be helpful in a number of ways when looking to evaluate your RAG applications!

While this example is rooted in LangChain/LangGraph - Ragas is framework agnostic (you don't even need to be using a framework!).

- 🤝 Breakout Room #1
  1. Task 1: Installing Required Libraries
  2. Task 2: Set Environment Variables
  3. Task 3: Synthetic Dataset Generation for Evaluation using Ragas
  4. Task 4: Evaluating our Pipeline with Ragas
  5. Task 6: Making Adjustments and Re-Evaluating

But first! Let's set some dependencies!

## Dependencies and API Keys:

> NOTE: Please skip the pip install commands if you are running the notebook locally.

In [1]:
#!pip install -qU ragas==0.2.10

In [2]:
#!pip install -qU langchain-community==0.3.14 langchain-openai==0.2.14 unstructured==0.16.12 langgraph==0.2.61 langchain-qdrant==0.2.0

We'll also need to provide our API keys.

First, OpenAI's for our LLM/embedding model combination!

In [3]:
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("Please enter your OpenAI API key!")

## Generating Synthetic Test Data

We wil be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it will allow us to find out how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data - and download our webpages which we'll be using for our data today.

These webpages are from [Simon Willison's](https://simonwillison.net/) yearly "AI learnings".

- [2023 Blog](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
- [2024 Blog](https://simonwillison.net/2024/Dec/31/llms-in-2024/)

Let's start by collecting our data into a useful pile!

In [4]:
!mkdir data

mkdir: data: File exists


In [5]:
!curl https://simonwillison.net/2023/Dec/31/ai-in-2023/ -o data/2023_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 31554    0 31554    0     0  62590      0 --:--:-- --:--:-- --:--:-- 62607


In [6]:
!curl https://simonwillison.net/2024/Dec/31/llms-in-2024/ -o data/2024_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 70721    0 70721    0     0   413k      0 --:--:-- --:--:-- --:--:--  411k


Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`.

In [7]:
from langchain_community.document_loaders import DirectoryLoader

path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.
libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.


### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [8]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings

generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [9]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/12 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/26 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [10]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,"Which organizations, including Stability AI, h...",[We don’t yet know how to build GPT-4 Vibes Ba...,Organizations that have produced better-than-G...,single_hop_specifc_query_synthesizer
1,How does the use of Python by large language m...,[I’m surprised that no-one has beaten the now ...,"According to the context, writing code is one ...",single_hop_specifc_query_synthesizer
2,Wut is AI?,[Simon Willison’s Weblog Subscribe Stuff we fi...,"AI refers to Large Language Models, which are ...",single_hop_specifc_query_synthesizer
3,Whaat is the signifficance of OpenAI in the co...,[Microsoft over this issue. The 69 page PDF is...,"According to the blog's 2023 tag cloud, 'opena...",single_hop_specifc_query_synthesizer
4,Considering the applications of LLMs such as c...,[<1-hop>\n\nWe don’t yet know how to build GPT...,The rise of LLMs has significantly impacted so...,multi_hop_abstract_query_synthesizer
5,How have large language models impacted softwa...,[<1-hop>\n\nWe don’t yet know how to build GPT...,Large language models (LLMs) have significantl...,multi_hop_abstract_query_synthesizer
6,How does the use of synthetic training data pr...,[<1-hop>\n\nThe rise of inference-scaling “rea...,Synthetic training data offers several direct ...,multi_hop_abstract_query_synthesizer
7,How have improvements in LLM efficiency over t...,[<1-hop>\n\nThe rise of inference-scaling “rea...,Improvements in LLM efficiency over the past y...,multi_hop_abstract_query_synthesizer
8,How has ChatGPT been discussed in relation to ...,[<1-hop>\n\nof very bad decisions are being ma...,ChatGPT has been a recurring topic in blog pos...,multi_hop_specific_query_synthesizer
9,"Anthropic make LLMs, but what is problem with ...",[<1-hop>\n\nskeptical as to their utility base...,Anthropic is one of the organizations that has...,multi_hop_specific_query_synthesizer


In [11]:
dataset.to_csv('./data/dataset.csv')

## LangChain RAG

Now we'll construct our LangChain RAG, which we will be evaluating using the above created test data!

### R - Retrieval

Let's start with building our retrieval pipeline, which will involve loading the same data we used to create our synthetic test set above.

> NOTE: We need to use the same data - as our test set is specifically designed for this data.

In [12]:
path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.
libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.


Now that we have our data loaded, let's split it into chunks!

In [13]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_documents = text_splitter.split_documents(docs)
len(split_documents)

75

#### Question: 

What is the purpose of the `chunk_overlap` parameter in the `RecursiveCharacterTextSplitter`?

The `chunk_overlap` parameter in the `RecursiveCharacterTextSplitter` serves a crucial purpose in preserving context across adjacent chunks.

**Purpose of `chunk_overlap`:**

* **Maintains continuity** between adjacent text chunks
* **Reduces the risk of context loss** when splitting long documents
* Ensures that **overlapping information** (e.g., an important sentence or phrase that spans two chunks) is **present in both chunks**
* Improves **retrieval and generation quality** in RAG pipelines, especially when working with embeddings that rely on semantic coherence

**Example:**

If you have:

* `chunk_size = 1000`
* `chunk_overlap = 200`

Then:

* Chunk 1 spans characters 0–999
* Chunk 2 spans characters 800–1799 (i.e., 200-character overlap with Chunk 1)

This overlap helps ensure that the **semantic glue between chunks** (e.g., sentences that start at the end of one chunk and finish at the beginning of another) is preserved during embedding and retrieval.

**Factors to Consider**

+ **1. LLM Context Window**

  * Use chunk sizes **well below the model's token limit** to leave room for the prompt, context headers, and system messages.
  * For example:

    * `gpt-3.5-turbo`: \~4,096 tokens → Safe chunk size: **512–1,000 tokens**
    * `gpt-4o` / `gpt-4`: \~128k tokens → Can handle **larger chunks**, up to 2,000–4,000 tokens if needed

+ **2. Document Type**

  | Document Type         | Recommended Chunk Size | Chunk Overlap | Reason                                         |
  | --------------------- | ---------------------- | ------------- | ---------------------------------------------- |
  | Legal contracts       | 512–1,024 chars        | 100–200       | Dense, needs precision                         |
  | Web pages / Blogs     | 1,000–1,500 chars      | 200–300       | Semi-structured                                |
  | Research papers       | 800–1,200 chars        | 150–200       | Technical language                             |
  | Transcripts / Dialog  | 512–1,000 chars        | 100–300       | Important back-references                      |
  | Code / Technical Docs | 300–800 chars          | 100–200       | Functions or classes often span multiple lines |

+ **3. Embedding Model Limit**

  * Some models (e.g., `text-embedding-3-small`) max out at 8,192 tokens; prefer **shorter chunks (512–768 tokens)** to ensure reliable embedding quality.

---

**General Best Practices**

| Parameter         | Recommendation                                                               |
| ----------------- | ---------------------------------------------------------------------------- |
| `chunk_size`      | 500–1,500 characters for most cases                                          |
| `chunk_overlap`   | 10–30% of chunk\_size (e.g., 200 overlap for 1,000 chunk)                    |
| Ensure boundaries | Split on semantic boundaries (sentences or paragraphs) using smart splitters |

---

**Example Configuration in LangChain**

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ".", " ", ""]
)
```

Next up, we'll need to provide an embedding model that we can use to construct our vector store.

In [14]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

Now we can build our in memory QDrant vector store.

In [15]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

client = QdrantClient(":memory:")

client.create_collection(
    collection_name="ai_across_years",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

vector_store = QdrantVectorStore(
    client=client,
    collection_name="ai_across_years",
    embedding=embeddings,
)

We can now add our documents to our vector store.

In [16]:
_ = vector_store.add_documents(documents=split_documents)

Let's define our retriever.

In [17]:
retriever = vector_store.as_retriever(search_kwargs={"k": 5})

Now we can produce a node for retrieval!

In [18]:
def retrieve(state):
  retrieved_docs = retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

### Augmented

Let's create a simple RAG prompt!

In [19]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
You are a helpful assistant who answers questions based on provided context. You must only use the provided context, and cannot use your own knowledge.

### Question
{question}

### Context
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

### Generation

We'll also need an LLM to generate responses - we'll use `gpt-4.1-nano` to avoid using the same model as our judge model.

In [20]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-nano")

Then we can create a `generate` node!

In [21]:
def generate(state):
  docs_content = "\n\n".join(doc.page_content for doc in state["context"])
  messages = rag_prompt.format_messages(question=state["question"], context=docs_content)
  response = llm.invoke(messages)
  return {"response" : response.content}

### Building RAG Graph with LangGraph

Let's create some state for our LangGraph RAG graph!

In [22]:
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
from langchain_core.documents import Document

class State(TypedDict):
  question: str
  context: List[Document]
  response: str

Now we can build our simple graph!

> NOTE: We're using `add_sequence` since we will always move from retrieval to generation. This is essentially building a chain in LangGraph.

In [23]:
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()

Let's do a test to make sure it's doing what we'd expect.

In [24]:
response = graph.invoke({"question" : "How are LLM agents useful?"})

In [25]:
response["response"]

'LLM agents are useful because they can act on your behalf, such as functioning as a travel agent or digital assistant. They can also be given access to tools and run in loops to help solve problems. Additionally, they are quite easy to build if you have the right data and resources, and can be run on personal devices, making them accessible. Their effectiveness is notable, especially in tasks like code generation, where they can even execute and test their own output to improve accuracy. Despite concerns about their reliability and potential drawbacks, LLM agents demonstrate significant utility in automating tasks and assisting with problem-solving.'

## Evaluating the App with Ragas

Now we can finally do our evaluation!

We'll start by running the queries we generated using SDG above through our application to get context and responses.

In [26]:
for test_row in dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [27]:
dataset.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,synthesizer_name
0,"Which organizations, including Stability AI, h...",[I’m relieved that this has changed completely...,[We don’t yet know how to build GPT-4 Vibes Ba...,"Yes, according to the provided context, organi...",Organizations that have produced better-than-G...,single_hop_specifc_query_synthesizer
1,How does the use of Python by large language m...,[Code may be the best application\n\nThe ethic...,[I’m surprised that no-one has beaten the now ...,The use of Python in building and training lar...,"According to the context, writing code is one ...",single_hop_specifc_query_synthesizer
2,Wut is AI?,[A lot of people are excited about AI agents—a...,[Simon Willison’s Weblog Subscribe Stuff we fi...,"AI, based on the provided context, refers to a...","AI refers to Large Language Models, which are ...",single_hop_specifc_query_synthesizer
3,Whaat is the signifficance of OpenAI in the co...,[Law is not ethics. Is it OK to train models o...,[Microsoft over this issue. The 69 page PDF is...,According to the blog's 2023 tag cloud and tra...,"According to the blog's 2023 tag cloud, 'opena...",single_hop_specifc_query_synthesizer
4,Considering the applications of LLMs such as c...,[Code may be the best application\n\nThe ethic...,[<1-hop>\n\nWe don’t yet know how to build GPT...,The rise of LLMs has significantly impacted so...,The rise of LLMs has significantly impacted so...,multi_hop_abstract_query_synthesizer
5,How have large language models impacted softwa...,[Code may be the best application\n\nThe ethic...,[<1-hop>\n\nWe don’t yet know how to build GPT...,Large language models have significantly impac...,Large language models (LLMs) have significantl...,multi_hop_abstract_query_synthesizer
6,How does the use of synthetic training data pr...,[One of the best descriptions I’ve seen of thi...,[<1-hop>\n\nThe rise of inference-scaling “rea...,The use of synthetic training data offers seve...,Synthetic training data offers several direct ...,multi_hop_abstract_query_synthesizer
7,How have improvements in LLM efficiency over t...,"[If you can gather the right data, and afford ...",[<1-hop>\n\nThe rise of inference-scaling “rea...,Improvements in LLM efficiency over the past y...,Improvements in LLM efficiency over the past y...,multi_hop_abstract_query_synthesizer
8,How has ChatGPT been discussed in relation to ...,[How should we feel about this as software eng...,[<1-hop>\n\nof very bad decisions are being ma...,The discussion of ChatGPT in relation to pract...,ChatGPT has been a recurring topic in blog pos...,multi_hop_specific_query_synthesizer
9,"Anthropic make LLMs, but what is problem with ...",[LLMs need better criticism\n\nA lot of people...,[<1-hop>\n\nskeptical as to their utility base...,"The main problems with LLMs, according to the ...",Anthropic is one of the organizations that has...,multi_hop_specific_query_synthesizer


In [28]:
dataset.to_csv('./data/evaluation-dataset-baseline.csv')

Then we can convert that table into a `EvaluationDataset` which will make the process of evaluation smoother.

In [29]:
from ragas import EvaluationDataset

baseline_evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

We'll need to select a judge model - in this case we're using the same model that was used to generate our Synthetic Data.

In [30]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1"))

Next up - we simply evaluate on our desired metrics!

In [31]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

baseline_result = evaluate(
    dataset=baseline_evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
baseline_result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

Exception raised in Job[2]: APIConnectionError(Connection error.)
Exception raised in Job[13]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4.1 in organization org-ULKD0BfHsBV4junY2FxMu8EA on tokens per min (TPM): Limit 30000, Used 28829, Requested 2229. Please try again in 2.116s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Exception raised in Job[11]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4.1 in organization org-ULKD0BfHsBV4junY2FxMu8EA on tokens per min (TPM): Limit 30000, Used 29274, Requested 1702. Please try again in 1.951s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Exception raised in Job[22]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4.1 in organization org-ULKD0BfHsBV4ju

{'context_recall': 0.4722, 'faithfulness': 0.7415, 'factual_correctness': 0.4225, 'answer_relevancy': 0.9538, 'context_entity_recall': 0.4993, 'noise_sensitivity_relevant': 0.1831}

## Making Adjustments and Re-Evaluating

Now that we've got our baseline - let's make a change and see how the model improves or doesn't improve!

> NOTE: This will be using Cohere's Rerank model (which was updated fairly [recently](https://docs.cohere.com/v2/changelog/rerank-v3.5)) - please be sure to [sign-up for an API key!](https://docs.cohere.com/reference/about)

In [32]:
os.environ["COHERE_API_KEY"] = getpass("Please enter your Cohere API key!")

In [33]:
#!pip install -qU cohere langchain_cohere


We'll first set our retriever to return more documents, which will allow us to take advantage of the reranking.

In [34]:
retriever = vector_store.as_retriever(search_kwargs={"k": 20})

Reranking, or contextual compression, is a technique that uses a reranker to compress the retrieved documents into a smaller set of documents.

This is essentially a slower, more accurate form of semantic similarity that we use on a smaller subset of our documents.

In [35]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

def retrieve_adjusted(state):
  compressor = CohereRerank(model="rerank-v3.5")
  compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever, search_kwargs={"k": 5}
  )
  retrieved_docs = compression_retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

We can simply rebuild our graph with the new retriever!

In [36]:
class State(TypedDict):
  question: str
  context: List[Document]
  response: str

graph_builder = StateGraph(State).add_sequence([retrieve_adjusted, generate])
graph_builder.add_edge(START, "retrieve_adjusted")
graph = graph_builder.compile()

In [37]:
response = graph.invoke({"question" : "How are LLM agents useful?"})
response["response"]

'LLM agents are useful because they can act on your behalf, such as performing tasks like travel planning or research, by accessing and utilizing tools within a loop to solve problems. Additionally, writing code is one of the most effective applications of LLMs, as they excel at generating and understanding programming languages like Python and JavaScript.'

In [38]:
import time

for test_row in dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
  time.sleep(2) # To try to avoid rate limiting.

In [39]:
dataset.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,synthesizer_name
0,"Which organizations, including Stability AI, h...","[If you can gather the right data, and afford ...",[We don’t yet know how to build GPT-4 Vibes Ba...,Organizations that have produced better-than-G...,Organizations that have produced better-than-G...,single_hop_specifc_query_synthesizer
1,How does the use of Python by large language m...,[It’s still astonishing to me how effective th...,[I’m surprised that no-one has beaten the now ...,The use of Python by large language models (LL...,"According to the context, writing code is one ...",single_hop_specifc_query_synthesizer
2,Wut is AI?,[A lot of people are excited about AI agents—a...,[Simon Willison’s Weblog Subscribe Stuff we fi...,"Based on the provided context, AI refers to sy...","AI refers to Large Language Models, which are ...",single_hop_specifc_query_synthesizer
3,Whaat is the signifficance of OpenAI in the co...,[Law is not ethics. Is it OK to train models o...,[Microsoft over this issue. The 69 page PDF is...,According to the blog's 2023 tag cloud and tra...,"According to the blog's 2023 tag cloud, 'opena...",single_hop_specifc_query_synthesizer
4,Considering the applications of LLMs such as c...,[It’s still astonishing to me how effective th...,[<1-hop>\n\nWe don’t yet know how to build GPT...,The rise of LLMs in software engineering and p...,The rise of LLMs has significantly impacted so...,multi_hop_abstract_query_synthesizer
5,How have large language models impacted softwa...,[It’s still astonishing to me how effective th...,[<1-hop>\n\nWe don’t yet know how to build GPT...,Large language models have significantly impac...,Large language models (LLMs) have significantl...,multi_hop_abstract_query_synthesizer
6,How does the use of synthetic training data pr...,[One of the best descriptions I’ve seen of thi...,[<1-hop>\n\nThe rise of inference-scaling “rea...,The use of synthetic training data offers seve...,Synthetic training data offers several direct ...,multi_hop_abstract_query_synthesizer
7,How have improvements in LLM efficiency over t...,[Simon Willison’s Weblog\n\nSubscribe\n\nThing...,[<1-hop>\n\nThe rise of inference-scaling “rea...,Improvements in LLM efficiency over the past y...,Improvements in LLM efficiency over the past y...,multi_hop_abstract_query_synthesizer
8,How has ChatGPT been discussed in relation to ...,[Law is not ethics. Is it OK to train models o...,[<1-hop>\n\nof very bad decisions are being ma...,ChatGPT has been discussed extensively in rela...,ChatGPT has been a recurring topic in blog pos...,multi_hop_specific_query_synthesizer
9,"Anthropic make LLMs, but what is problem with ...",[LLMs need better criticism\n\nA lot of people...,[<1-hop>\n\nskeptical as to their utility base...,"The main problems with LLMs, as highlighted in...",Anthropic is one of the organizations that has...,multi_hop_specific_query_synthesizer


In [40]:
dataset.to_csv('./data/evaluation-dataset-using-cohere.csv')

In [41]:
evaluation_dataset_with_cohere = EvaluationDataset.from_pandas(dataset.to_pandas())

In [42]:
result_with_cohere = evaluate(
    dataset=evaluation_dataset_with_cohere,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result_with_cohere

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

Exception raised in Job[1]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4.1 in organization org-ULKD0BfHsBV4junY2FxMu8EA on tokens per min (TPM): Limit 30000, Used 29657, Requested 1778. Please try again in 2.87s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Exception raised in Job[13]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4.1 in organization org-ULKD0BfHsBV4junY2FxMu8EA on tokens per min (TPM): Limit 30000, Used 29498, Requested 1871. Please try again in 2.738s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Exception raised in Job[25]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4.1 in organization org-ULKD0BfHsBV4junY2FxMu8EA on tokens per min (TPM): Limit 30000, Used 29606, Request

{'context_recall': 0.5694, 'faithfulness': 0.6740, 'factual_correctness': 0.5580, 'answer_relevancy': 0.8825, 'context_entity_recall': 0.3670, 'noise_sensitivity_relevant': 0.1346}

#### **Baseline Result:**

In [43]:
baseline_result

{'context_recall': 0.4722, 'faithfulness': 0.7415, 'factual_correctness': 0.4225, 'answer_relevancy': 0.9538, 'context_entity_recall': 0.4993, 'noise_sensitivity_relevant': 0.1831}

#### **Result using Cohere:**

In [44]:
result_with_cohere

{'context_recall': 0.5694, 'faithfulness': 0.6740, 'factual_correctness': 0.5580, 'answer_relevancy': 0.8825, 'context_entity_recall': 0.3670, 'noise_sensitivity_relevant': 0.1346}

#### **Comparison between Baseline and after Re-ranking**

In [47]:
import plotly.graph_objects as go

# Extracted metrics from the image
baseline_result = {
    'context_recall': 0.4722,
    'faithfulness': 0.7415,
    'factual_correctness': 0.4225,
    'answer_relevancy': 0.9538,
    'context_entity_recall': 0.4993,
    'noise_sensitivity_relevant': 0.1831
}

cohere_result = {
    'context_recall': 0.5694,
    'faithfulness': 0.6740,
    'factual_correctness': 0.5500,
    'answer_relevancy': 0.8825,
    'context_entity_recall': 0.3670,
    'noise_sensitivity_relevant': 0.1346
}

# Common x-axis
metrics = list(baseline_result.keys())

# Bar chart
fig = go.Figure()

fig.add_trace(go.Bar(
    x=metrics,
    y=[baseline_result[m] for m in metrics],
    name='Baseline',
    textposition='auto'
))

fig.add_trace(go.Bar(
    x=metrics,
    y=[cohere_result[m] for m in metrics],
    name='Using Cohere',
    textposition='auto'
))

# Layout settings
fig.update_layout(
    title='RAG Evaluation Metrics: Baseline vs Using Cohere Reranker',
    yaxis_title='Score',
    barmode='group',
    xaxis_tickangle=-45
)

fig.show()

#### Question: 

Which system performed better, on what metrics, and why?

| **Metric**                | **Baseline** | **Using Cohere** | **Better System** | **Explanation**                                                                       |
| ------------------------- | ------------ | ---------------- | ----------------- | ------------------------------------------------------------------------------------- |
| **Context Recall**        | 0.4722       | **0.5694**       | ✅ Cohere          | Retrieved more relevant content; higher recall due to reranking.                      |
| **Faithfulness (↓)**          | **0.7415**   | 0.6740           | ✅ Baseline        | Answers better aligned with provided context in baseline.                             |
| **Factual Correctness**   | 0.4225       | **0.5500**       | ✅ Cohere          | Improved factual grounding with top reranked context.                                 |
| **Answer Relevancy (↓)**      | **0.9538**   | 0.8825           | ✅ Baseline        | Slightly more precise answers in baseline; less dilution.                             |
| **Context Entity Recall (↓)** | **0.4993**   | 0.3670           | ✅ Baseline        | More named entities preserved in baseline; Cohere compression may have filtered some. |
| **Noise Sensitivity (↓)** | 0.1831       | **0.1346**       | ✅ Cohere          | Less affected by noisy/unrelated context; better precision under Cohere.              |


**Conclusion**

* **Cohere reranking improved**:

  * **Context recall**, **factual correctness**, and **noise robustness**
  * Ideal for improving accuracy and grounding

* **Baseline performed better** in:

  * **Faithfulness**, **answer relevancy**, and **entity recall**
  * Better if retaining broader context fidelity is more important

---


Therefore, if **factual accuracy and robustness** are your goals, **Cohere** outperforms.  If you prioritize **tight alignment to reference context and high answer relevancy**, stick with the **Baseline**.
