# Using Ragas to Evaluate a RAG Application built with LangChain and LangGraph

In the following notebook, we'll be looking at how [Ragas](https://github.com/explodinggradients/ragas) can be helpful in a number of ways when looking to evaluate your RAG applications!

While this example is rooted in LangChain/LangGraph - Ragas is framework agnostic (you don't even need to be using a framework!).

- 🤝 Breakout Room #1
  1. Task 1: Installing Required Libraries
  2. Task 2: Set Environment Variables
  3. Task 3: Synthetic Dataset Generation for Evaluation using Ragas
  4. Task 4: Evaluating our Pipeline with Ragas
  5. Task 6: Making Adjustments and Re-Evaluating

But first! Let's set some dependencies!

## Dependencies and API Keys:

We'll also need to provide our API keys.

First, OpenAI's for our LLM/embedding model combination!

In [1]:
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("Please enter your OpenAI API key!")

## Generating Synthetic Test Data

We wil be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it will allow us to find out how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data - which should hopefull be familiar at this point since it's our Loan Data use-case!

Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`.

In [2]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader


path = "data/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [3]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

  for match in re.finditer('{0}\s*'.format(re.escape(sent)), self.original_text):
  txt = re.sub('(?<={0})\.'.format(am), '∯', txt)
  txt = re.sub('(?<={0})\.'.format(am), '∯', txt)


In [4]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs[:20], testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/18 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/20 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node


Applying SummaryExtractor:   0%|          | 0/27 [00:00<?, ?it/s]

Property 'summary' already exists in node '2c219b'. Skipping!
Property 'summary' already exists in node 'da573a'. Skipping!
Property 'summary' already exists in node '39b019'. Skipping!
Property 'summary' already exists in node '965590'. Skipping!
Property 'summary' already exists in node '1ff6db'. Skipping!
Property 'summary' already exists in node 'dd2303'. Skipping!
Property 'summary' already exists in node 'f12304'. Skipping!
Property 'summary' already exists in node 'b1e2f6'. Skipping!
Property 'summary' already exists in node 'b1e74f'. Skipping!


Applying CustomNodeFilter:   0%|          | 0/18 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/59 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node '2c219b'. Skipping!
Property 'summary_embedding' already exists in node '1ff6db'. Skipping!
Property 'summary_embedding' already exists in node 'b1e74f'. Skipping!
Property 'summary_embedding' already exists in node 'da573a'. Skipping!
Property 'summary_embedding' already exists in node 'f12304'. Skipping!
Property 'summary_embedding' already exists in node 'dd2303'. Skipping!
Property 'summary_embedding' already exists in node 'b1e2f6'. Skipping!
Property 'summary_embedding' already exists in node '965590'. Skipping!
Property 'summary_embedding' already exists in node '39b019'. Skipping!


Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [5]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,Why FAFSA keep changin and what new stuff we g...,[Application and Verification Guide Introducti...,FAFSA keep changin cause the FAFSA Simplificat...,single_hop_specifc_query_synthesizer
1,Wut is FAFSA?,[Chapter 1: The Application Process We removed...,FAFSA renewal functionality has been deferred ...,single_hop_specifc_query_synthesizer
2,What is an ISIR and how does a school receive it?,[The FPS also checks the application for possi...,The Institutional Student Information Record (...,single_hop_specifc_query_synthesizer
3,"As a Financial Aid Administrator, what are the...",[2. The disclosure of their FTI by the IRS to ...,"Federal student aid information, including Fed...",single_hop_specifc_query_synthesizer
4,"According to FAFSA guidelines, how is family s...",[<1-hop>\n\nequal the tax filer(s) plus depend...,Family size determination on the FAFSA for an ...,multi_hop_abstract_query_synthesizer
5,According to the FAFSA form and Title IV progr...,[<1-hop>\n\nSubmission of a court order or off...,To determine a student's independent status fo...,multi_hop_abstract_query_synthesizer
6,What specific types of Federal Tax Information...,[<1-hop>\n\n2. The disclosure of their FTI by ...,The specific types of Federal Tax Information ...,multi_hop_abstract_query_synthesizer
7,According to federal financial aid regulations...,"[<1-hop>\n\nOrphan, Ward of the Court, or in F...",The receipt of child support directly impacts ...,multi_hop_abstract_query_synthesizer
8,How does the ISIR reflect both the results of ...,[<1-hop>\n\nThe FPS also checks the applicatio...,The ISIR (Institutional Student Information Re...,multi_hop_specific_query_synthesizer
9,How does the IRS provide Federal Tax Informati...,[<1-hop>\n\n2. The disclosure of their FTI by ...,The IRS provides Federal Tax Information (FTI)...,multi_hop_specific_query_synthesizer


## LangChain RAG

Now we'll construct our LangChain RAG, which we will be evaluating using the above created test data!

### R - Retrieval

Let's start with building our retrieval pipeline, which will involve loading the same data we used to create our synthetic test set above.

> NOTE: We need to use the same data - as our test set is specifically designed for this data.

In [6]:
path = "data/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

Now that we have our data loaded, let's split it into chunks!

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_documents = text_splitter.split_documents(docs)
len(split_documents)

1102

#### ❓ Question: 

What is the purpose of the `chunk_overlap` parameter in the `RecursiveCharacterTextSplitter`?

Next up, we'll need to provide an embedding model that we can use to construct our vector store.

In [8]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

Now we can build our in memory QDrant vector store.

In [9]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

client = QdrantClient(":memory:")

client.create_collection(
    collection_name="loan_data",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

vector_store = QdrantVectorStore(
    client=client,
    collection_name="loan_data",
    embedding=embeddings,
)

We can now add our documents to our vector store.

In [10]:
_ = vector_store.add_documents(documents=split_documents)

Let's define our retriever.

In [11]:
retriever = vector_store.as_retriever(search_kwargs={"k": 5})

Now we can produce a node for retrieval!

In [12]:
def retrieve(state):
  retrieved_docs = retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

### Augmented

Let's create a simple RAG prompt!

In [13]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
You are a helpful assistant who answers questions based on provided context. You must only use the provided context, and cannot use your own knowledge.

### Question
{question}

### Context
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

### Generation

We'll also need an LLM to generate responses - we'll use `gpt-4o-nano` to avoid using the same model as our judge model.

In [14]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-nano")

Then we can create a `generate` node!

In [15]:
def generate(state):
  docs_content = "\n\n".join(doc.page_content for doc in state["context"])
  messages = rag_prompt.format_messages(question=state["question"], context=docs_content)
  response = llm.invoke(messages)
  return {"response" : response.content}

### Building RAG Graph with LangGraph

Let's create some state for our LangGraph RAG graph!

In [16]:
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
from langchain_core.documents import Document

class State(TypedDict):
  question: str
  context: List[Document]
  response: str

Now we can build our simple graph!

> NOTE: We're using `add_sequence` since we will always move from retrieval to generation. This is essentially building a chain in LangGraph.

In [17]:
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()

Let's do a test to make sure it's doing what we'd expect.

In [32]:
response = graph.invoke({"question" : "What are the different kinds of loans?"})

In [33]:
response["response"]

'Based on the provided context, the different kinds of loans mentioned are:\n\n1. **Direct Loan** – This includes various types of federal student loans such as the Direct Unsubsidized Loan, which the borrower may choose to pay interest on while in school.\n\nThe context also references the types of programs and loan periods associated with these loans but does not specify additional distinct loan types beyond the Direct Loan.\n\n**Note:** The document emphasizes the importance of understanding different repayment plans and options related to the Direct Loan, but it does not list other specific loan categories outside of the Direct Loan within this excerpt.'

## Evaluating the App with Ragas

Now we can finally do our evaluation!

We'll start by running the queries we generated usign SDG above through our application to get context and responses.

In [20]:
for test_row in dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [48]:
dataset.samples[0].eval_sample.response

"The FAFSA keeps changing because of efforts to simplify and improve the application process, making it easier for students and their families to access financial aid. One of the major updates is the implementation of the FAFSA Simplification Act, which aims to streamline how applicants provide their financial information.\n\nA significant change is the introduction of the FUTURE Act, which has led to the creation of the FUTURE Act Direct Data Exchange (FA-DDX). This system allows the IRS to directly share certain tax and income information with the Department of Education through a secure connection. As a result, the previous tool called the IRS Data Retrieval Tool (IRS DRT) was retired after the 2023-24 application cycle.\n\nWith the FA-DDX, most applicants no longer need to manually enter their income and tax data; instead, the IRS automatically provides this information with the applicant's consent. This process enhances accuracy and reduces the burden of self-reporting. However, a

Then we can convert that table into a `EvaluationDataset` which will make the process of evaluation smoother.

In [22]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

We'll need to select a judge model - in this case we're using the same model that was used to generate our Synthetic Data.

In [23]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"))

Next up - we simply evaluate on our desired metrics!

In [24]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

Exception raised in Job[23]: TimeoutError()
Exception raised in Job[35]: TimeoutError()


{'context_recall': 0.7816, 'faithfulness': 0.8478, 'factual_correctness': 0.5367, 'answer_relevancy': 0.9354, 'context_entity_recall': 0.5322, 'noise_sensitivity_relevant': 0.2411}

## Making Adjustments and Re-Evaluating

Now that we've got our baseline - let's make a change and see how the model improves or doesn't improve!

> NOTE: This will be using Cohere's Rerank model - please be sure to [sign-up for an API key!](https://docs.cohere.com/reference/about)

In [25]:
os.environ["COHERE_API_KEY"] = getpass("Please enter your Cohere API key!")


We'll first set our retriever to return more documents, which will allow us to take advantage of the reranking.

In [26]:
adjusted_example_retriever = vector_store.as_retriever(search_kwargs={"k": 20})

Reranking, or contextual compression, is a technique that uses a reranker to compress the retrieved documents into a smaller set of documents.

This is essentially a slower, more accurate form of semantic similarity that we use on a smaller subset of our documents.

In [27]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

def retrieve_adjusted(state):
  compressor = CohereRerank(model="rerank-v3.5")
  compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=adjusted_example_retriever, search_kwargs={"k": 5}
  )
  retrieved_docs = compression_retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

We can simply rebuild our graph with the new retriever!

In [28]:
class AdjustedState(TypedDict):
  question: str
  context: List[Document]
  response: str

adjusted_graph_builder = StateGraph(AdjustedState).add_sequence([retrieve_adjusted, generate])
adjusted_graph_builder.add_edge(START, "retrieve_adjusted")
adjusted_graph = adjusted_graph_builder.compile()

In [31]:
response = adjusted_graph.invoke({"question" : "What are the different kinds of loans?"})
response["response"]

'The context mentions different types of federal loans, specifically:\n\n1. Federal PLUS Loans\n2. Direct Subsidized Loans\n3. Direct Unsubsidized Loans\n4. Student Direct PLUS Loans (a specific type of PLUS Loan)\n\nThese are the loan types referenced in the provided information.'

In [39]:
import time
import copy

rerank_dataset = copy.deepcopy(dataset)

for test_row in rerank_dataset:
  response = adjusted_graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
  time.sleep(2) # To try to avoid rate limiting.

In [47]:
rerank_dataset.samples[0].eval_sample.response

"FAFSA keeps changing mainly to make the application process easier and more efficient for students and families. One of the key updates is the implementation of the FUTURE Act, which introduced new technology and policies to streamline how income and tax information are used in FAFSA.\n\nRegarding the FUTURE Act, it authorized a new secure data exchange called the FUTURE Act Direct Data Exchange (FA-DDX) with the IRS. This allows the Department of Education to directly access tax information from the IRS for FAFSA applicants and their spouses or parents, with the applicant's consent. This change replaces the older IRS Data Retrieval Tool (IRS DRT), which was used previously to import tax data into FAFSA.\n\nThe main reason for these changes is to reduce the need for applicants to manually report their income and tax details, making the process faster and more accurate. However, because the IRS now shares data directly through FA-DDX, applicants must give their consent for this exchang

In [40]:
rerank_evaluation_dataset = EvaluationDataset.from_pandas(rerank_dataset.to_pandas())

In [41]:
result = evaluate(
    dataset=rerank_evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

{'context_recall': 0.6892, 'faithfulness': 0.8610, 'factual_correctness': 0.5308, 'answer_relevancy': 0.9372, 'context_entity_recall': 0.4116, 'noise_sensitivity_relevant': 0.2407}

#### ❓ Question: 

Which system performed better, on what metrics, and why?