# Using Ragas to Evaluate a RAG Application built with LangChain and LangGraph

In this notebook, we'll look at how [Ragas](https://github.com/explodinggradients/ragas) can help you evaluate your RAG applications!

While this example is rooted in LangChain/LangGraph - Ragas is framework agnostic (you don't even need to be using a framework!).

- 🤝 Breakout Room #1
  1. Task 1: Installing Required Libraries
  2. Task 2: Set Environment Variables
  3. Task 3: Synthetic Dataset Generation for Evaluation using Ragas
  4. Task 4: Evaluating our Pipeline with Ragas
  5. Task 5: Making Adjustments and Re-Evaluating

But first! Let's set up some dependencies!

## Dependencies and API Keys:

> NOTE: Please skip the pip install commands if you are running the notebook locally.

In [1]:
#!pip install -qU ragas==0.2.10

In [3]:
#!pip install -qU langchain-community==0.3.14 langchain-openai==0.2.14 unstructured==0.16.12 langgraph==0.2.61 langchain-qdrant==0.2.0

We'll also need to provide our API keys.

First, OpenAI's for our LLM/embedding model combination!

In [1]:
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("Please enter your OpenAI API key!")

OPTIONALLY:

We can also provide a Ragas API key - which you can sign up for [here](https://app.ragas.io/).

In [2]:
os.environ["RAGAS_APP_TOKEN"] = getpass("Please enter your Ragas API key!")

## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it will allow us to find out how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll start by downloading the webpages we'll be using as our data today.

These webpages are from [Simon Willison's](https://simonwillison.net/) yearly "AI learnings".

- [2023 Blog](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
- [2024 Blog](https://simonwillison.net/2024/Dec/31/llms-in-2024/)

Let's start by collecting our data into a useful pile!

In [3]:
!mkdir data

mkdir: data: File exists


In [4]:
!curl https://simonwillison.net/2023/Dec/31/ai-in-2023/ -o data/2023_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 31427    0 31427    0     0   353k      0 --:--:-- --:--:-- --:--:--  369k


In [5]:
!curl https://simonwillison.net/2024/Dec/31/llms-in-2024/ -o data/2024_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 70286    0 70286    0     0   540k      0 --:--:-- --:--:-- --:--:--  553k


Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`.

In [19]:
import os
import nltk
from langchain_community.document_loaders import DirectoryLoader, UnstructuredHTMLLoader

# Uncomment these if the loader complains about missing NLTK resources.
# nltk.download('punkt_tab')
# nltk.download('averaged_perceptron_tagger_eng')

# Load every HTML file in data/ as a LangChain Document.
path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()
print("DOCS:", docs)

DOCS: [Document(metadata={'source': 'data/2023_llms.html'}, page_content="Simon Willison’s Weblog\n\nSubscribe\n\nStuff we figured out about AI in 2023\n\n31st December 2023\n\n2023 was the breakthrough year for Large Language Models (LLMs). I think it’s OK to call these AI—they’re the latest and (currently) most interesting development in the academic field of Artificial Intelligence that dates back to the 1950s.\n\nHere’s my attempt to round up the highlights in one place!\n\nLarge Language Models\n\nThey’re actually quite easy to build\n\nYou can run LLMs on your own devices\n\nHobbyists can build their own fine-tuned models\n\nWe don’t yet know how to build GPT-4\n\nVibes Based Development\n\nLLMs are really smart, and also really, really dumb\n\nGullibility is the biggest unsolved problem\n\nCode may be the best application\n\nThe ethics of this space remain diabolically complex\n\nMy blog in 2023\n\nHere’s the sequel to this post: Things we learned about LLMs in 2024.\n\nLarge La

### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.
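
To make the idea concrete, here is a toy sketch in plain Python. It is not Ragas's implementation: the chunk names, entity sets, and the `build_edges` helper are all made up for illustration. It only shows why a graph helps: once chunks are linked by what they share, every edge is a candidate for a question that needs more than one chunk to answer.

```python
from itertools import combinations

# Hypothetical chunks, each tagged with the entities it mentions.
chunks = {
    "chunk_a": {"GPT-4", "OpenAI"},
    "chunk_b": {"Gemini 1.5 Pro", "Google"},
    "chunk_c": {"GPT-4", "Gemini 1.5 Pro"},
    "chunk_d": {"Python"},
}

def build_edges(nodes, min_shared=1):
    """Link every pair of chunks that shares at least `min_shared` entities."""
    edges = []
    for (name_a, ents_a), (name_b, ents_b) in combinations(nodes.items(), 2):
        shared = ents_a & ents_b
        if len(shared) >= min_shared:
            edges.append((name_a, name_b, sorted(shared)))
    return edges

edges = build_edges(chunks)

# Each edge is a candidate multi-hop scenario: a question that needs both chunks.
for left, right, shared in edges:
    print(f"{left} <-> {right} via {shared}")
```

A chunk with no edges (like `chunk_d` here) can still be used for single-hop questions, but it can't anchor a multi-hop one.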

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Abstracted SDG

Building the knowledge graph by hand is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [20]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [21]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/12 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/26 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [22]:
dataset.to_pandas()

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | What role has EleutherAI played in the develop... | [Code may be the best application The ethics o... | EleutherAI is one of the organizations that ha... | single_hop_specifc_query_synthesizer |
| 1 | How does the use of Python in programming rela... | [Based Development As a computer scientist and... | The context highlights that writing code, part... | single_hop_specifc_query_synthesizer |
| 2 | Wut iz the main toopic of Simon Willison's Web... | [Simon Willison’s Weblog Subscribe Stuff we fi... | The main topic of Simon Willison’s Weblog in 2... | single_hop_specifc_query_synthesizer |
| 3 | What are some of the key discussions around LL... | [easy to follow. The rest of the document incl... | In 2023, key discussions around LLMs included ... | single_hop_specifc_query_synthesizer |
| 4 | How has OpenAI contributed to the development ... | [<1-hop>\n\nCode may be the best application T... | OpenAI has played a significant role in the de... | multi_hop_abstract_query_synthesizer |
| 5 | What advancements in LLMs were highlighted by ... | [<1-hop>\n\nCode may be the best application T... | Google's Gemini 1.5 Pro, released in February,... | multi_hop_abstract_query_synthesizer |
| 6 | How does the black box nature of AI impact the... | [<1-hop>\n\nCode may be the best application T... | The black box nature of AI significantly impac... | multi_hop_abstract_query_synthesizer |
| 7 | How did Google's Gemini 1.5 Pro and GPT-4 cont... | [<1-hop>\n\nCode may be the best application T... | Google's Gemini 1.5 Pro and GPT-4 played signi... | multi_hop_abstract_query_synthesizer |
| 8 | In what ways did the advancements in GPT-4 and... | [<1-hop>\n\nSimon Willison’s Weblog Subscribe ... | The advancements in GPT-4 and other leading mo... | multi_hop_specific_query_synthesizer |
| 9 | What advancements did Gemini 1.5 Pro introduce... | [<1-hop>\n\nneeds guidance. Those of us who un... | In 2024, Gemini 1.5 Pro introduced several adv... | multi_hop_specific_query_synthesizer |


#### OPTIONAL:

If you've provided your Ragas API key - you can use this web interface to look at the created data!

In [23]:
dataset.upload()

Testset uploaded! View at https://app.ragas.io/dashboard/alignment/testset/7a8f0606-e3c5-4871-b5b0-a800e5862b8a


'https://app.ragas.io/dashboard/alignment/testset/7a8f0606-e3c5-4871-b5b0-a800e5862b8a'

## LangChain RAG

Now we'll construct our LangChain RAG, which we will be evaluating using the test data we created above!

### R - Retrieval

Let's start with building our retrieval pipeline, which will involve loading the same data we used to create our synthetic test set above.

> NOTE: We need to use the same data - as our test set is specifically designed for this data.

In [24]:
path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

Now that we have our data loaded, let's split it into chunks!

In [25]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_documents = text_splitter.split_documents(docs)
len(split_documents)

73

#### ❓ Question: 

What is the purpose of the `chunk_overlap` parameter in the `RecursiveCharacterTextSplitter`?

- It determines how many characters or tokens are shared between consecutive chunks (chars in this case)
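
Here is a small sketch of what that overlap looks like in practice. Note that this is a simplified fixed-width splitter written for illustration (`split_with_overlap` is a made-up helper); the real `RecursiveCharacterTextSplitter` also tries to break on separators like paragraphs and sentences, so its chunks won't line up this neatly.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Fixed-width splitter: each chunk starts `chunk_size - chunk_overlap`
    characters after the previous one, so neighbours share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The last three characters of each chunk are repeated at the start of the next one, which is what keeps a sentence that straddles a chunk boundary from being lost to both chunks.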

Next up, we'll need to provide an embedding model that we can use to construct our vector store.

In [26]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

Now we can build our in-memory Qdrant vector store.

In [27]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

client = QdrantClient(":memory:")

client.create_collection(
    collection_name="ai_across_years",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

vector_store = QdrantVectorStore(
    client=client,
    collection_name="ai_across_years",
    embedding=embeddings,
)

We can now add our documents to our vector store.

In [28]:
_ = vector_store.add_documents(documents=split_documents)

Let's define our retriever.

In [29]:
retriever = vector_store.as_retriever(search_kwargs={"k": 5})
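
As a rough picture of what a cosine-distance collection with `k=5` is doing, here is a tiny plain-Python sketch. The three-dimensional vectors, document names, and the `top_k` helper are invented for illustration; the real store uses 1536-dimensional embeddings and Qdrant's own search, not this loop.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the names of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# Hypothetical 3-d "embeddings" standing in for real ones.
doc_vecs = {
    "agents": [1.0, 0.0, 0.0],
    "code": [0.0, 1.0, 0.0],
    "agents_and_code": [1.0, 1.0, 0.0],
    "ethics": [0.0, 0.0, 1.0],
}
query = [1.0, 0.2, 0.0]
nearest = top_k(query, doc_vecs, k=2)
print(nearest)
```

Raising `k` pulls in more chunks, which can help recall but also hands the generator more loosely related text.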

Now we can produce a node for retrieval!

In [30]:
def retrieve(state):
  retrieved_docs = retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

### Augmented

Let's create a simple RAG prompt!

In [31]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
You are a helpful assistant who answers questions based on provided context. You must only use the provided context, and cannot use your own knowledge.

### Question
{question}

### Context
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

### Generation

We'll also need an LLM to generate responses - we'll use `gpt-4o-mini` to avoid using the same model as our judge model.

In [32]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

Then we can create a `generate` node!

In [33]:
def generate(state):
  docs_content = "\n\n".join(doc.page_content for doc in state["context"])
  messages = rag_prompt.format_messages(question=state["question"], context=docs_content)
  response = llm.invoke(messages)
  return {"response" : response.content}

### Building RAG Graph with LangGraph

Let's create some state for our LangGraph RAG graph!

In [34]:
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
from langchain_core.documents import Document

class State(TypedDict):
  question: str
  context: List[Document]
  response: str

Now we can build our simple graph!

> NOTE: We're using `add_sequence` since we will always move from retrieval to generation. This is essentially building a chain in LangGraph.

In [35]:
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()
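
If the state handling feels a bit magical, this plain-Python sketch mimics the idea under a simplifying assumption: each node returns a partial update, and that update overwrites the matching keys in the shared state before the next node runs. The `run_sequence`, `fake_retrieve`, and `fake_generate` names are hypothetical stand-ins, not LangGraph APIs.

```python
def run_sequence(steps, initial_state):
    """Run each step in order, merging its returned dict into the shared state."""
    state = dict(initial_state)
    for step in steps:
        update = step(state)   # a node only returns the keys it wants to change
        state.update(update)   # later nodes see everything earlier nodes wrote
    return state

def fake_retrieve(state):
    # Stand-in for the real retrieve node: pretend we found one chunk.
    return {"context": [f"chunk about {state['question']}"]}

def fake_generate(state):
    # Stand-in for the real generate node: reads what retrieve wrote.
    return {"response": f"Answer built from {len(state['context'])} chunk(s)"}

final_state = run_sequence([fake_retrieve, fake_generate], {"question": "agents"})
print(final_state)
```

This is why `generate` can read `state["context"]` even though it was never passed in by the caller.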

Let's do a test to make sure it's doing what we'd expect.

In [36]:
response = graph.invoke({"question" : "How are LLM agents useful?"})

In [37]:
response["response"]

"LLM agents are useful in several ways, particularly in the context of problem-solving and automation. Here are some points based on the provided context:\n\n1. **Acting on Behalf of Users**: Some people view LLM agents as tools that can perform tasks on behalf of users, similar to a travel agent. This model suggests that LLMs can help manage and automate various activities, making them valuable in areas where decision-making and action are required.\n\n2. **Access to Tools**: LLMs can be integrated with various tools and run in loops to solve problems. This ability to interact with other systems and tools enhances their utility in practical scenarios.\n\n3. **Ease of Development**: The surprising ease of building LLMs is a significant advantage. With relatively simple code and the right training data, developers can create functional models. This democratizes access to LLM technology, making it more available to a wider range of users and developers.\n\n4. **Local Deployment**: The ab

## Evaluating the App with Ragas

Now we can finally do our evaluation!

We'll start by running the queries we generated using SDG above through our application to get context and responses.

In [38]:
for test_row in dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [39]:
dataset.to_pandas()

| | user_input | retrieved_contexts | reference_contexts | response | reference | synthesizer_name |
|---|---|---|---|---|---|---|
| 0 | What role has EleutherAI played in the develop... | [If you can gather the right data, and afford ... | [Code may be the best application The ethics o... | EleutherAI has played a significant role in th... | EleutherAI is one of the organizations that ha... | single_hop_specifc_query_synthesizer |
| 1 | How does the use of Python in programming rela... | [Code may be the best application\n\nThe ethic... | [Based Development As a computer scientist and... | The use of Python in programming relates to th... | The context highlights that writing code, part... | single_hop_specifc_query_synthesizer |
| 2 | Wut iz the main toopic of Simon Willison's Web... | [Simon Willison’s Weblog\n\nSubscribe\n\nStuff... | [Simon Willison’s Weblog Subscribe Stuff we fi... | The main topic of Simon Willison's Weblog in 2... | The main topic of Simon Willison’s Weblog in 2... | single_hop_specifc_query_synthesizer |
| 3 | What are some of the key discussions around LL... | [This is Things we learned about LLMs in 2024 ... | [easy to follow. The rest of the document incl... | In 2023, several key discussions around Large ... | In 2023, key discussions around LLMs included ... | single_hop_specifc_query_synthesizer |
| 4 | How has OpenAI contributed to the development ... | [Since then, almost every major LLM (and most ... | [<1-hop>\n\nCode may be the best application T... | OpenAI has made significant contributions to t... | OpenAI has played a significant role in the de... | multi_hop_abstract_query_synthesizer |
| 5 | What advancements in LLMs were highlighted by ... | [I’m relieved that this has changed completely... | [<1-hop>\n\nCode may be the best application T... | Google's Gemini 1.5 Pro highlighted several ad... | Google's Gemini 1.5 Pro, released in February,... | multi_hop_abstract_query_synthesizer |
| 6 | How does the black box nature of AI impact the... | [Another common technique is to use larger mod... | [<1-hop>\n\nCode may be the best application T... | The black box nature of AI significantly impac... | The black box nature of AI significantly impac... | multi_hop_abstract_query_synthesizer |
| 7 | How did Google's Gemini 1.5 Pro and GPT-4 cont... | [I wrote about this at the time in The killer ... | [<1-hop>\n\nCode may be the best application T... | Google's Gemini 1.5 Pro and GPT-4 have signifi... | Google's Gemini 1.5 Pro and GPT-4 played signi... | multi_hop_abstract_query_synthesizer |
| 8 | In what ways did the advancements in GPT-4 and... | [The most recent twist, again from December (D... | [<1-hop>\n\nSimon Willison’s Weblog Subscribe ... | The advancements in GPT-4 and other leading mo... | The advancements in GPT-4 and other leading mo... | multi_hop_specific_query_synthesizer |
| 9 | What advancements did Gemini 1.5 Pro introduce... | [I wrote about this at the time in The killer ... | [<1-hop>\n\nneeds guidance. Those of us who un... | Gemini 1.5 Pro introduced significant advancem... | In 2024, Gemini 1.5 Pro introduced several adv... | multi_hop_specific_query_synthesizer |


Then we can convert that table into an `EvaluationDataset` which will make the process of evaluation smoother.

In [40]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

We'll need to select a judge model - in this case we're using the same model that was used to generate our Synthetic Data.

In [41]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

Next up - we simply evaluate on our desired metrics!

In [43]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

Exception raised in Job[41]: TimeoutError()


{'context_recall': 0.7083, 'faithfulness': 0.7809, 'factual_correctness': 0.4883, 'answer_relevancy': 0.9583, 'context_entity_recall': 0.4868, 'noise_sensitivity_relevant': 0.2925}

## Making Adjustments and Re-Evaluating

Now that we've got our baseline - let's make a change and see whether the system improves or not!

> NOTE: This will be using Cohere's Rerank model (which was updated fairly [recently](https://docs.cohere.com/v2/changelog/rerank-v3.5)) - please be sure to [sign-up for an API key!](https://docs.cohere.com/reference/about)

In [44]:
os.environ["COHERE_API_KEY"] = getpass("Please enter your Cohere API key!")

In [45]:
#!pip install -qU cohere langchain_cohere


We'll first set our retriever to return more documents, which will allow us to take advantage of the reranking.

In [46]:
retriever = vector_store.as_retriever(search_kwargs={"k": 20})

Reranking, or contextual compression, is a technique that uses a reranker to compress the retrieved documents into a smaller set of documents.

This is essentially a slower, more accurate form of semantic similarity that we use on a smaller subset of our documents.
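
The two-stage shape can be sketched in a few lines of plain Python. Everything here is a stand-in: `cheap_score` plays the role of the fast vector search, `careful_score` plays the role of the reranker, and neither resembles how Cohere's model actually scores documents.

```python
def cheap_score(query, doc):
    """First stage: count distinct query words that appear anywhere in the doc."""
    doc_words = set(doc.lower().split())
    return len(set(query.lower().split()) & doc_words)

def careful_score(query, doc):
    """Second-stage stand-in: fraction of the doc's words that are query words,
    which rewards documents that are focused on the query."""
    query_words = set(query.lower().split())
    doc_words = doc.lower().split()
    return sum(word in query_words for word in doc_words) / len(doc_words)

def retrieve_then_rerank(query, docs, first_k, final_k):
    # Stage 1: cast a wide net with the cheap scorer.
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:first_k]
    # Stage 2: spend the expensive scorer only on the shortlisted candidates.
    return sorted(candidates, key=lambda d: careful_score(query, d), reverse=True)[:final_k]

docs = [
    "llm agents can act on behalf of users",
    "agents are useful but llm gullibility is a long unsolved problem for many people today",
    "the ethics of this space remain complex",
    "llm agents",
]
best = retrieve_then_rerank("llm agents", docs, first_k=3, final_k=2)
print(best)
```

The first stage can't tell the three matching documents apart, while the second stage reorders them, which is the whole point of fetching more candidates than you plan to keep.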

In [47]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

def retrieve_adjusted(state):
  # top_n caps how many reranked documents are kept. ContextualCompressionRetriever
  # has no search_kwargs option, so the limit is set on the reranker itself.
  compressor = CohereRerank(model="rerank-v3.5", top_n=5)
  compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
  )
  retrieved_docs = compression_retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

We can simply rebuild our graph with the new retriever!

In [48]:
class State(TypedDict):
  question: str
  context: List[Document]
  response: str

graph_builder = StateGraph(State).add_sequence([retrieve_adjusted, generate])
graph_builder.add_edge(START, "retrieve_adjusted")
graph = graph_builder.compile()

In [49]:
response = graph.invoke({"question" : "How are LLM agents useful?"})
response["response"]

'LLM agents are seen as potentially useful in two main ways. First, they can act on behalf of users, similar to a travel agent or digital assistant. However, there is skepticism about their utility due to the issue of gullibility, as LLMs struggle to distinguish truth from fiction. This raises concerns about the effectiveness of such agents in making meaningful decisions.\n\nSecond, LLMs have demonstrated particular strength in writing code. The simpler grammatical structure of programming languages compared to natural languages makes it less surprising that they excel in this area. Despite the skepticism surrounding their overall utility, the ability of LLMs to assist in coding tasks is recognized as a significant application.\n\nOverall, while LLM agents hold promise, especially in coding, there are considerable challenges and criticisms regarding their reliability and decision-making capabilities.'

In [50]:
import time

for test_row in dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
  time.sleep(2) # To try to avoid rate limiting.
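
A fixed `time.sleep(2)` is the simplest option. If you still hit rate limits, one common alternative is to retry with exponential backoff. The sketch below is only an illustration: `call_with_backoff` is a hypothetical helper, and `RuntimeError` stands in for whatever rate-limit exception your provider's client actually raises.

```python
import time

def call_with_backoff(fn, max_attempts=4, base_delay=0.01, sleep=time.sleep):
    """Call fn(), retrying on failure and doubling the wait after each attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RuntimeError:  # stand-in for a provider-specific rate-limit error
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)

# A fake call that fails twice before succeeding, to exercise the helper.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

delays = []  # record the waits instead of actually sleeping
outcome = call_with_backoff(flaky, sleep=delays.append)
print(outcome, delays)
```

In the loop above you could wrap the `graph.invoke(...)` call in a helper like this instead of sleeping after every row.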

In [51]:
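# Rebuild the evaluation dataset so it picks up the responses and contexts from the reranked run.
evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result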

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

{'context_recall': 0.7250, 'faithfulness': 0.7632, 'factual_correctness': 0.4550, 'answer_relevancy': 0.9592, 'context_entity_recall': 0.4638, 'noise_sensitivity_relevant': 0.2964}

#### ❓ Question: 

Which system performed better, on what metrics, and why?

RAG without reranker:
{'context_recall': 0.7083, 'faithfulness': 0.7809, 'factual_correctness': 0.4883, 'answer_relevancy': 0.9583, 'context_entity_recall': 0.4868, 'noise_sensitivity_relevant': 0.2925}


RAG with reranker:
{'context_recall': 0.7250, 'faithfulness': 0.7632, 'factual_correctness': 0.4550, 'answer_relevancy': 0.9592, 'context_entity_recall': 0.4638, 'noise_sensitivity_relevant': 0.2964}

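- The adjusted system with the reranker scored better only on context_recall (0.7083 → 0.7250). It scored slightly worse on faithfulness, factual_correctness, and context_entity_recall, and was essentially unchanged on answer_relevancy and noise_sensitivity_relevant.
- The compression done by the reranker seems to have slightly improved how much of the reference material gets retrieved, but that did not carry over into better responses: faithfulness and factual_correctness both dropped a little. Every difference is only a few hundredths on a small test set, so, in line with the note above that Ragas is best for directional changes, neither system is a clear winner here.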
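
To make the comparison concrete, the two score dictionaries above can be subtracted metric by metric. This is just arithmetic on the numbers already reported. One thing to keep in mind when reading the signs: as far as I understand the Ragas docs, noise sensitivity is a lower-is-better metric, so a positive delta there is not an improvement.

```python
baseline = {
    "context_recall": 0.7083,
    "faithfulness": 0.7809,
    "factual_correctness": 0.4883,
    "answer_relevancy": 0.9583,
    "context_entity_recall": 0.4868,
    "noise_sensitivity_relevant": 0.2925,
}
reranked = {
    "context_recall": 0.7250,
    "faithfulness": 0.7632,
    "factual_correctness": 0.4550,
    "answer_relevancy": 0.9592,
    "context_entity_recall": 0.4638,
    "noise_sensitivity_relevant": 0.2964,
}

# Positive delta means the reranked run scored higher on that metric.
deltas = {metric: round(reranked[metric] - baseline[metric], 4) for metric in baseline}

for metric, delta in deltas.items():
    print(f"{metric:28s} {delta:+.4f}")
```

None of the deltas is larger than about 0.03 in either direction, which is small enough that a rerun of the same evaluation could plausibly flip some of the signs.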
