 # Using Ragas to Evaluate a RAG Application built with LangChain and LangGraph

In the following notebook, we'll be looking at how [Ragas](https://github.com/explodinggradients/ragas) can be helpful in a number of ways when looking to evaluate your RAG applications!

While this example is rooted in LangChain/LangGraph - Ragas is framework agnostic (you don't even need to be using a framework!).

- 🤝 Breakout Room #1
  1. Task 1: Installing Required Libraries
  2. Task 2: Set Environment Variables
  3. Task 3: Synthetic Dataset Generation for Evaluation using Ragas
  4. Task 4: Evaluating our Pipeline with Ragas
  5. Task 6: Making Adjustments and Re-Evaluating

But first! Let's set some dependencies!

## Dependencies and API Keys:

> NOTE: Please skip the pip install commands if you are running the notebook locally.

In [1]:
#!pip install -qU ragas==0.2.10

In [3]:
#!pip install -qU langchain-community==0.3.14 langchain-openai==0.2.14 unstructured==0.16.12 langgraph==0.2.61 langchain-qdrant==0.2.0

We'll also need to provide our API keys.

First, OpenAI's for our LLM/embedding model combination!

In [1]:
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("Please enter your OpenAI API key!")

OPTIONALLY:

We can also provide a Ragas API key - which you can sign-up for [here](https://app.ragas.io/).

In [2]:
os.environ["RAGAS_APP_TOKEN"] = getpass("Please enter your Ragas API key!")

## Generating Synthetic Test Data

We wil be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it will allow us to find out how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data - and download our webpages which we'll be using for our data today.

These webpages are from [Simon Willison's](https://simonwillison.net/) yearly "AI learnings".

- [2023 Blog](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
- [2024 Blog](https://simonwillison.net/2024/Dec/31/llms-in-2024/)

Let's start by collecting our data into a useful pile!

In [3]:
!mkdir data

In [4]:
!curl https://simonwillison.net/2023/Dec/31/ai-in-2023/ -o data/2023_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 31413    0 31413    0     0  86683      0 --:--:-- --:--:-- --:--:-- 86776


In [7]:
!curl https://simonwillison.net/2024/Dec/31/llms-in-2024/ -o data/2024_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 70272    0 70272    0     0   574k      0 --:--:-- --:--:-- --:--:--  571k


Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`.

In [3]:
from langchain_community.document_loaders import DirectoryLoader

path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [4]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

  for match in re.finditer('{0}\s*'.format(re.escape(sent)), self.original_text):
  txt = re.sub('(?<={0})\.'.format(am), '∯', txt)
  txt = re.sub('(?<={0})\.'.format(am), '∯', txt)


In [5]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/12 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/24 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [6]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What advancements has Mistral made in the fiel...,[Code may be the best application The ethics o...,The team behind Mistral is working to beat GPT...,single_hop_specifc_query_synthesizer
1,How do Large Language Models (LLMs) handle the...,[Based Development As a computer scientist and...,LLMs are more effective at handling the gramma...,single_hop_specifc_query_synthesizer
2,What significant advancements in AI were obser...,[Simon Willison’s Weblog Subscribe Stuff we fi...,2023 was the breakthrough year for Large Langu...,single_hop_specifc_query_synthesizer
3,What Bing say about harm and how many people r...,[easy to follow. The rest of the document incl...,"Bing said, 'I will not harm you unless you har...",single_hop_specifc_query_synthesizer
4,What are the ethical considerations and enviro...,[<1-hop>\n\nCode may be the best application T...,The ethical considerations of Large Language M...,multi_hop_abstract_query_synthesizer
5,How do the challenges of understanding LLMs as...,[<1-hop>\n\nCode may be the best application T...,The challenges of understanding LLMs as black ...,multi_hop_abstract_query_synthesizer
6,How have ethical considerations and environmen...,[<1-hop>\n\nCode may be the best application T...,The development and deployment of Large Langua...,multi_hop_abstract_query_synthesizer
7,How OpenAI and AI ethics related in context of...,[<1-hop>\n\nCode may be the best application T...,OpenAI was the first organization to release a...,multi_hop_abstract_query_synthesizer
8,What were the key advancements and challenges ...,[<1-hop>\n\neasy to follow. The rest of the do...,"In 2023, Large Language Models (LLMs) experien...",multi_hop_specific_query_synthesizer
9,How did the advancements in GPT-4 and Claude A...,[<1-hop>\n\nlive video. ChatGPT voice mode now...,"In 2023, advancements in GPT-4 and Claude Arti...",multi_hop_specific_query_synthesizer


#### OPTIONAL:

If you've provided your Ragas API key - you can use this web interface to look at the created data!

In [7]:
dataset.upload()

Testset uploaded! View at https://app.ragas.io/dashboard/alignment/testset/6cca9c56-7624-4bcf-bacb-8377c1deddb3


'https://app.ragas.io/dashboard/alignment/testset/6cca9c56-7624-4bcf-bacb-8377c1deddb3'

## LangChain RAG

Now we'll construct our LangChain RAG, which we will be evaluating using the above created test data!

### R - Retrieval

Let's start with building our retrieval pipeline, which will involve loading the same data we used to create our synthetic test set above.

> NOTE: We need to use the same data - as our test set is specifically designed for this data.

In [8]:
path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

Now that we have our data loaded, let's split it into chunks!

In [9]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_documents = text_splitter.split_documents(docs)
len(split_documents)

73

#### ❓ Question: 

What is the purpose of the `chunk_overlap` parameter in the `RecursiveCharacterTextSplitter`?

#### Answer:

`chunk_overlap` ensures we are preserving the context and infomation of the past chunk, while processing the new one.
This could make the model even more efficient.

Next up, we'll need to provide an embedding model that we can use to construct our vector store.

In [10]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

Now we can build our in memory QDrant vector store.

In [11]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

client = QdrantClient(":memory:")

client.create_collection(
    collection_name="ai_across_years",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

vector_store = QdrantVectorStore(
    client=client,
    collection_name="ai_across_years",
    embedding=embeddings,
)

We can now add our documents to our vector store.

In [12]:
_ = vector_store.add_documents(documents=split_documents)

Let's define our retriever.

In [13]:
retriever = vector_store.as_retriever(search_kwargs={"k": 5})

Now we can produce a node for retrieval!

In [14]:
def retrieve(state):
  retrieved_docs = retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

### Augmented

Let's create a simple RAG prompt!

In [15]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
You are a helpful assistant who answers questions based on provided context. You must only use the provided context, and cannot use your own knowledge.

### Question
{question}

### Context
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

### Generation

We'll also need an LLM to generate responses - we'll use `gpt-4o-mini` to avoid using the same model as our judge model.

In [16]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

Then we can create a `generate` node!

In [17]:
def generate(state):
  docs_content = "\n\n".join(doc.page_content for doc in state["context"])
  messages = rag_prompt.format_messages(question=state["question"], context=docs_content)
  response = llm.invoke(messages)
  return {"response" : response.content}

### Building RAG Graph with LangGraph

Let's create some state for our LangGraph RAG graph!

In [18]:
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
from langchain_core.documents import Document

class State(TypedDict):
  question: str
  context: List[Document]
  response: str

Now we can build our simple graph!

> NOTE: We're using `add_sequence` since we will always move from retrieval to generation. This is essentially building a chain in LangGraph.

In [19]:
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()

Let's do a test to make sure it's doing what we'd expect.

In [20]:
response = graph.invoke({"question" : "How are LLM agents useful?"})

In [21]:
response["response"]

'LLM agents can be useful in a few key ways:\n\n1. **Task Automation**: They can act on behalf of users, similar to a travel agent or a digital assistant, by automating various tasks and making decisions based on user input.\n\n2. **Problem-Solving**: LLMs can execute a series of tools in a loop to solve complex problems, indicating their potential for iterative problem-solving processes.\n\n3. **Ease of Development**: Building LLMs is surprisingly straightforward, requiring only a few hundred lines of code and a significant amount of quality training data. This accessibility allows more individuals and organizations to develop and utilize LLMs.\n\n4. **Local Execution**: Recent advancements have made it possible to run useful LLMs on personal devices, expanding their utility beyond centralized servers.\n\n5. **Code Generation**: LLMs are particularly effective in generating code, as they can test and refine their outputs through interactive processes, like using a Code Interpreter, wh

## Evaluating the App with Ragas

Now we can finally do our evaluation!

We'll start by running the queries we generated usign SDG above through our application to get context and responses.

In [22]:
for test_row in dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [23]:
dataset.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,synthesizer_name
0,What advancements has Mistral made in the fiel...,[I wrote about how Large language models are h...,[Code may be the best application The ethics o...,Mistral has made significant advancements in t...,The team behind Mistral is working to beat GPT...,single_hop_specifc_query_synthesizer
1,How do Large Language Models (LLMs) handle the...,[So training an LLM still isn’t something a ho...,[Based Development As a computer scientist and...,Large Language Models (LLMs) handle the gramma...,LLMs are more effective at handling the gramma...,single_hop_specifc_query_synthesizer
2,What significant advancements in AI were obser...,[OpenAI are not the only game in town here. Go...,[Simon Willison’s Weblog Subscribe Stuff we fi...,"In 2023, several significant advancements in A...",2023 was the breakthrough year for Large Langu...,single_hop_specifc_query_synthesizer
3,What Bing say about harm and how many people r...,[Article Visitors Pageviews Bing: “I will not ...,[easy to follow. The rest of the document incl...,"Bing stated, “I will not harm you unless you h...","Bing said, 'I will not harm you unless you har...",single_hop_specifc_query_synthesizer
4,What are the ethical considerations and enviro...,"[Since then, almost every major LLM (and most ...",[<1-hop>\n\nCode may be the best application T...,The ethical considerations and environmental i...,The ethical considerations of Large Language M...,multi_hop_abstract_query_synthesizer
5,How do the challenges of understanding LLMs as...,[LLMs need better criticism\n\nA lot of people...,[<1-hop>\n\nCode may be the best application T...,The challenges of understanding LLMs (Large La...,The challenges of understanding LLMs as black ...,multi_hop_abstract_query_synthesizer
6,How have ethical considerations and environmen...,[Code may be the best application\n\nThe ethic...,[<1-hop>\n\nCode may be the best application T...,The development and deployment of Large Langua...,The development and deployment of Large Langua...,multi_hop_abstract_query_synthesizer
7,How OpenAI and AI ethics related in context of...,"[Since then, almost every major LLM (and most ...",[<1-hop>\n\nCode may be the best application T...,OpenAI and AI ethics are intricately related i...,OpenAI was the first organization to release a...,multi_hop_abstract_query_synthesizer
8,What were the key advancements and challenges ...,[Simon Willison’s Weblog\n\nSubscribe\n\nStuff...,[<1-hop>\n\neasy to follow. The rest of the do...,"In 2023, significant advancements in Large Lan...","In 2023, Large Language Models (LLMs) experien...",multi_hop_specific_query_synthesizer
9,How did the advancements in GPT-4 and Claude A...,[Prompt driven app generation is a commodity a...,[<1-hop>\n\nlive video. ChatGPT voice mode now...,The advancements in GPT-4 and Claude Artifacts...,"In 2023, advancements in GPT-4 and Claude Arti...",multi_hop_specific_query_synthesizer


Then we can convert that table into a `EvaluationDataset` which will make the process of evaluation smoother.

In [24]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

We'll need to select a judge model - in this case we're using the same model that was used to generate our Synthetic Data.

In [25]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

Next up - we simply evaluate on our desired metrics!

In [27]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

Exception raised in Job[1]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-vrNSdyUA8IGU8hCWFWTPnENr on tokens per min (TPM): Limit 30000, Used 29691, Requested 2137. Please try again in 3.656s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Exception raised in Job[25]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-vrNSdyUA8IGU8hCWFWTPnENr on tokens per min (TPM): Limit 30000, Used 29618, Requested 1984. Please try again in 3.204s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Exception raised in Job[16]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-vrNSdyUA8IGU8hCWFWTPnENr on tokens per min (TPM): Limit 30000, Used 28876, Requested

{'context_recall': 0.6833, 'faithfulness': 0.7667, 'factual_correctness': 0.4991, 'answer_relevancy': 0.9516, 'context_entity_recall': 0.3308, 'noise_sensitivity_relevant': 0.2407}

## Making Adjustments and Re-Evaluating

Now that we've got our baseline - let's make a change and see how the model improves or doesn't improve!

> NOTE: This will be using Cohere's Rerank model (which was updated fairly [recently](https://docs.cohere.com/v2/changelog/rerank-v3.5)) - please be sure to [sign-up for an API key!](https://docs.cohere.com/reference/about)

In [28]:
os.environ["COHERE_API_KEY"] = getpass("Please enter your Cohere API key!")

In [29]:
#!pip install -qU cohere langchain_cohere


We'll first set our retriever to return more documents, which will allow us to take advantage of the reranking.

In [30]:
retriever = vector_store.as_retriever(search_kwargs={"k": 20})

Reranking, or contextual compression, is a technique that uses a reranker to compress the retrieved documents into a smaller set of documents.

This is essentially a slower, more accurate form of semantic similarity that we use on a smaller subset of our documents.

In [31]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

def retrieve_adjusted(state):
  compressor = CohereRerank(model="rerank-v3.5")
  compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever, search_kwargs={"k": 5}
  )
  retrieved_docs = compression_retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

We can simply rebuild our graph with the new retriever!

In [32]:
class State(TypedDict):
  question: str
  context: List[Document]
  response: str

graph_builder = StateGraph(State).add_sequence([retrieve_adjusted, generate])
graph_builder.add_edge(START, "retrieve_adjusted")
graph = graph_builder.compile()

In [33]:
response = graph.invoke({"question" : "How are LLM agents useful?"})
response["response"]

'LLM agents can be useful in various ways, primarily in two categories: \n\n1. **Acting on Behalf of Users**: Some people view LLM agents as tools that can act autonomously for users, similar to a travel agent. This perspective emphasizes the potential for these AI systems to handle tasks and make decisions without constant human intervention.\n\n2. **Problem Solving with Tools**: Others think of LLMs as systems that can utilize various tools in a loop to solve specific problems. This involves leveraging their capabilities to perform tasks methodically, such as writing code, which is noted to be one of their strong suits due to the simpler grammar rules of programming languages compared to natural languages.\n\nDespite these potential uses, there is skepticism regarding their overall utility. A significant concern is the inherent gullibility of LLMs, as they tend to believe any information provided to them. This raises questions about their reliability, especially when making meaningfu

In [34]:
import time

for test_row in dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
  time.sleep(2) # To try to avoid rate limiting.

In [35]:
result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

Exception raised in Job[19]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-vrNSdyUA8IGU8hCWFWTPnENr on tokens per min (TPM): Limit 30000, Used 29911, Requested 1997. Please try again in 3.816s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Exception raised in Job[7]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-vrNSdyUA8IGU8hCWFWTPnENr on tokens per min (TPM): Limit 30000, Used 29218, Requested 2582. Please try again in 3.6s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Exception raised in Job[25]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-vrNSdyUA8IGU8hCWFWTPnENr on tokens per min (TPM): Limit 30000, Used 28319, Requested 1

{'context_recall': 0.7833, 'faithfulness': 0.7261, 'factual_correctness': 0.4778, 'answer_relevancy': 0.9518, 'context_entity_recall': 0.3941, 'noise_sensitivity_relevant': 0.2086}

#### ❓ Question: 

Which system performed better, on what metrics, and why?

* context_recall: reranking helped better
* faithfulness: more documents and reranking could have interfered with faithfulness
* factual_correctness: almost similar
* answer_relevancy: almost similar
* context_entity_recall: better document selection through reranking could have improved the score
* noise_sensetivity_relevant: reranking on more documents slightly improved the score.