# RAG evaluation with RAGAS

## 1. Theory: A Deep Dive into RAG Evaluation with RAGAS

**RAGAS** is an open-source framework for evaluating Retrieval-Augmented Generation (RAG) pipelines. A RAG pipeline's quality depends on several factors: Did it retrieve the right information? Is the generated answer faithful to that information? Is the answer itself correct? RAGAS provides a suite of metrics that score each part of this process separately.

This notebook demonstrates how to integrate these comprehensive RAGAS metrics into the **LangSmith** evaluation platform. This allows you to track, visualize, and compare sophisticated RAG metrics over time, all within a unified testing environment.

We will evaluate a simple RAG application using the following RAGAS metrics:

#### Generator-Focused Metrics
- `answer_correctness` (Labeled): Measures the factual accuracy of the generated answer compared to a ground-truth answer. This requires a labeled dataset.
- `answer_relevancy` (Reference-Free): Measures how directly the generated answer addresses the input question, penalizing incomplete or off-topic answers. It does not check factual accuracy.
- `faithfulness` (Reference-Free): Measures how much the generated answer is grounded in the provided context. It checks for hallucinations by breaking the answer into claims and verifying each one against the retrieved documents.
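
To make the claim-based scoring concrete, here is a minimal sketch of the arithmetic only. It assumes the claims have already been extracted from the answer and judged against the context (in RAGAS both steps are LLM calls). The `faithfulness_score` helper and the verdict list are hypothetical and exist only for illustration.

```python
from typing import List


def faithfulness_score(claim_supported: List[bool]) -> float:
    """Fraction of answer claims that the retrieved context supports.

    Hypothetical helper: each entry says whether one claim extracted from
    the answer was judged to be supported by the context.
    """
    if not claim_supported:
        # No claims to verify; returning 0.0 is an arbitrary choice for this sketch.
        return 0.0
    return sum(claim_supported) / len(claim_supported)


# Three claims in the answer, two of them backed by the retrieved documents.
verdicts = [True, True, False]
score = faithfulness_score(verdicts)
print(round(score, 3))
```

A single unsupported claim lowers the score in proportion to the total number of claims, so short answers are penalized more heavily per hallucinated claim than long ones.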

#### Retriever-Focused Metrics
- `context_relevancy` (Reference-Free): Measures how relevant the retrieved context is to the input question by identifying the sentences in the context that are needed to answer it. Its import is commented out in the evaluation cell below, so it is not computed in this run.
- `context_recall` (Labeled): Measures the retriever's ability to fetch all the necessary information required to answer the question, based on a ground-truth answer.
- `context_precision` (Labeled): Measures if the most relevant documents are ranked higher by the retriever. It assesses whether the truly important context appears at the top of the retrieval results.
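
The two labeled retriever metrics reduce to simple ratios once the relevance judgments exist. The sketch below shows only that arithmetic, assuming the per-chunk and per-statement verdicts are already available (RAGAS obtains them from an LLM). The function names and the boolean inputs are hypothetical, not the RAGAS API.

```python
from typing import List


def context_precision(relevant: List[bool]) -> float:
    """Rank-aware precision: average of precision@k over the relevant ranks."""
    hits = 0
    weighted = 0.0
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / rank  # precision@rank, counted only at relevant ranks
    return weighted / hits if hits else 0.0


def context_recall(attributed: List[bool]) -> float:
    """Fraction of ground-truth statements that the retrieved context covers."""
    return sum(attributed) / len(attributed) if attributed else 0.0


# The same two relevant chunks, ranked first versus ranked last.
good_ranking = context_precision([True, True, False, False])
bad_ranking = context_precision([False, False, True, True])
# Three of four ground-truth statements can be found in the retrieved context.
recall = context_recall([True, False, True, True])
print(good_ranking, bad_ranking, recall)
```

Both rankings retrieve the same chunks, but only the first places them at the top, which is why the precision scores differ while recall would not.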

By using these metrics together, we can get a holistic view of our RAG system's performance, identifying whether poor results stem from the retriever, the generator, or both.

## 2. Prerequisites

First, we'll install the necessary Python packages and configure our environment variables to connect to LangSmith and the OpenAI API.

In [None]:
# %%capture --no-stderr
# # The '%%capture --no-stderr' magic command prevents the output of this cell (except for errors) from being displayed.
# # The '%pip install' command installs python packages from the notebook.
# # -U flag ensures we get the latest versions of the specified libraries.
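# # langchain, python-dotenv, and requests are included because later cells import them.
# %pip install -U langsmith langchain ragas numpy openai python-dotenv requests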

In [1]:
import getpass # Import the getpass library to securely prompt for credentials.
import os # Import the 'os' module to interact with the operating system.

# Set the environment variable to enable LangSmith tracing.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
# # Prompt for the LangSmith API key and set it as an environment variable.
# os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LANGCHAIN_API_KEY")
# # Prompt for the OpenAI API key and set it as an environment variable.
# os.environ["OPENAI_API_KEY"] = getpass.getpass("OPENAI_API_KEY")

In [2]:
from dotenv import load_dotenv # Import function to load environment variables
import os # Import the 'os' module to interact with the operating system.

# Load environment variables from the .env file. The `override=True` argument
# ensures that variables from the .env file will overwrite existing environment variables.
load_dotenv(dotenv_path=".env", override=True)



# Update with your API URL if using a hosted instance of LangSmith.
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com" # Set the LangSmith API endpoint as an environment variable.
# Copy the API keys from the loaded values, failing early if one is missing
# (assigning None to os.environ raises a TypeError).
for target, source in [("LANGCHAIN_API_KEY", "LANGSMITH_API_KEY"), ("OPENAI_API_KEY", "OPENAI_API_KEY")]:
    value = os.getenv(source)
    if value is None:
        raise RuntimeError(f"{source} is not set; add it to .env or the environment.")
    os.environ[target] = value

## 3. Dataset

We will use a pre-existing public dataset hosted on LangSmith called **"BaseCamp Q&A"**. This dataset was created by scraping the 37signals (the makers of BaseCamp) employee handbook and then synthetically generating question-answer pairs based on the content. We will clone this dataset into our own LangSmith account to use for evaluation.

In [3]:
import langsmith # Import the langsmith library.

client = langsmith.Client() # Instantiate the LangSmith client.
# The URL of the public dataset we want to use.
dataset_url = (
    "https://smith.langchain.com/public/56fe54cd-b7d7-4d3b-aaa0-88d7a2d30931/d"
)
dataset_name = "BaseCamp Q&A" # The name for our local copy of the dataset.
# Use the client to clone the public dataset into your own LangSmith workspace.
client.clone_public_dataset(dataset_url)

Dataset(name='BaseCamp Q&A', description='Taken from: https://basecamp.com/handbook', data_type=<DataType.kv: 'kv'>, id=UUID('f6fc7700-571c-4710-ad3e-88afa82d6a17'), created_at=datetime.datetime(2025, 8, 8, 21, 51, 30, 198212, tzinfo=datetime.timezone.utc), modified_at=datetime.datetime(2025, 8, 8, 21, 51, 30, 198212, tzinfo=datetime.timezone.utc), example_count=0, session_count=0, last_session_start_time=None, inputs_schema=None, outputs_schema=None, transformations=None)

## 4. Define your RAG Pipeline

We will now build our RAG pipeline from scratch. This involves fetching the source documents, creating a retriever to search through them, and defining the main RAG bot that synthesizes answers.

### 4.1 Fetch Source Documents
First, we download the raw markdown files from the BaseCamp handbook, which will serve as our knowledge base.

In [4]:
import io # Import the io module for in-memory binary streams.
import os # Import the os module for interacting with the file system.
import zipfile # Import the zipfile module to handle zip archives.

import requests # Import the requests library to make HTTP requests.

# The URL of the zip file containing our source documents.
url = "https://storage.googleapis.com/benchmarks-artifacts/basecamp-data/basecamp-data.zip"

# Fetch the content from the URL.
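response = requests.get(url, timeout=60)
# Fail early on an HTTP error instead of trying to unzip an error page.
response.raise_for_status()
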
# Open the downloaded content as an in-memory zip file.
with io.BytesIO(response.content) as zipped_file:
    # Create a ZipFile object.
    with zipfile.ZipFile(zipped_file, "r") as zip_ref:
        # Extract all the contents into the current directory.
        zip_ref.extractall()

# Define the directory where the data was extracted.
data_dir = os.path.join(os.getcwd(), "data")
# Initialize an empty list to store the documents.
docs = []
# Iterate through all files in the data directory.
for filename in os.listdir(data_dir):
    # Check if the file is a markdown file.
    if filename.endswith(".md"):
        # Open and read the file content.
        with open(os.path.join(data_dir, filename), "r", encoding="utf-8") as file:
            # Append the file's name and content to our list of docs.
            docs.append({"file": filename, "content": file.read()})
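
As an alternative to extracting the archive to disk, the markdown files can be read directly from the in-memory zip. The sketch below is self-contained: it builds a tiny archive in memory to stand in for the downloaded bytes, and the file names and contents are made up for illustration.

```python
import io
import zipfile

# Build a small zip archive in memory to stand in for the downloaded bytes.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data/benefits.md", "# Benefits\n\nPaid time off.")
    archive.writestr("data/notes.txt", "not a markdown file")
payload = buffer.getvalue()

# Read the markdown members straight from the archive, without touching disk.
sample_docs = []
with zipfile.ZipFile(io.BytesIO(payload), "r") as archive:
    for name in archive.namelist():
        if name.endswith(".md"):
            text = archive.read(name).decode("utf-8")
            sample_docs.append({"file": name.split("/")[-1], "content": text})

print(sample_docs)
```

This produces the same `{"file": ..., "content": ...}` records as the cell above while leaving the working directory untouched.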

In [14]:
docs

[{'file': 'product-histories.md',
  'content': '# Product Histories\n\n## Basecamp\n\n[Basecamp](http://basecamp.com/). On our website, we say _Basecamp is a private, secure space online where people working together can organize and discuss everything they need to get a project done. See it, track it, discuss it, act on it. Tasks, discussions, deadlines, and files — everything’s predictably organized in Basecamp_. But we want it to be more than that. In mid-2016, we decided to focus on helping small business owners manage their businesses and help alleviate growth and organizational pains they’re facing. Project management was our bread and butter for over 15 years. But we were inspired by how we, internally, have used Basecamp to run 37signals, and we want to extend that philosophy to our current customers and scores of brand-new customers.\n\nBut let’s take a look back. In 2003, 37signals was a web design firm made up of 4 people. We always had work, but we were disorganized. With s

### 4.2 Create the Retriever
Next, we create a simple in-memory vector store retriever. This component will be responsible for:
1.  Converting all source documents into numerical vector embeddings.
2.  When given a query, converting the query into a vector.
3.  Finding the documents with embeddings most similar to the query's embedding.
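
The three steps above can be sketched with toy vectors before any API call is involved. This is a minimal illustration, not the retriever defined in the next cell: the 3-dimensional vectors are invented, and the `top_k` helper is a hypothetical name. It normalizes explicitly so that the dot product equals cosine similarity.

```python
import numpy as np

# Toy 3-dimensional "embeddings"; real embeddings come from an embedding model.
doc_vectors = np.array(
    [
        [1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.7, 0.7, 0.0],
        [0.0, 0.0, 1.0],
    ]
)
query_vector = np.array([1.0, 0.2, 0.0])


def top_k(query: np.ndarray, vectors: np.ndarray, k: int) -> list:
    """Return indices of the k rows most similar to the query, best first."""
    # Normalize so that the dot product equals cosine similarity.
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    scores = vectors @ query
    k = min(k, len(scores))  # Guard against asking for more rows than exist.
    # argpartition selects the k largest scores without a full sort;
    # argsort then orders only those k candidates.
    candidates = np.argpartition(scores, -k)[-k:]
    return candidates[np.argsort(-scores[candidates])].tolist()


ranked = top_k(query_vector, doc_vectors, k=2)
print(ranked)
```

Partitioning first and sorting only the selected candidates keeps the cost low when the number of documents is much larger than `k`.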

In [5]:
from typing import List # Import typing hints.

import numpy as np # Import numpy for numerical operations.
import openai # Import the openai library.
from langsmith import traceable # Import the traceable decorator for LangSmith.


# Define a custom class for our simple vector store retriever.
class VectorStoreRetriever:
    def __init__(self, docs: list, vectors: list, oai_client):
        self._arr = np.array(vectors) # The numpy array of document vectors.
        self._docs = docs # The list of original documents.
        self._client = oai_client # The OpenAI client.

    # A class method to create an instance asynchronously by embedding documents.
    @classmethod
    async def from_docs(cls, docs, oai_client):
        # Generate embeddings for all document contents.
        embeddings = await oai_client.embeddings.create(
            model="text-embedding-3-small", input=[doc["content"] for doc in docs]
        )
        # Extract the vector embeddings from the response.
        vectors = [emb.embedding for emb in embeddings.data]
        # Return a new instance of the class.
        return cls(docs, vectors, oai_client)

    # The traceable decorator ensures that calls to this method are logged to LangSmith.
    @traceable
    async def query(self, query: str, k: int = 5) -> List[dict]:
        # Generate an embedding for the input query.
        embed = await self._client.embeddings.create(
            model="text-embedding-3-small", input=[query]
        )
        # Calculate similarity scores with a dot product against every document vector.
        scores = np.array(embed.data[0].embedding) @ self._arr.T
        # Clamp k so argpartition does not fail when there are fewer than k documents.
        k = min(k, len(self._docs))
        # Find the indices of the top K most similar documents.
        top_k_idx = np.argpartition(scores, -k)[-k:]
        # Sort the top K indices by their scores in descending order.
        top_k_idx_sorted = top_k_idx[np.argsort(-scores[top_k_idx])]
        # Return the top K documents along with their similarity scores.
        return [
            {**self._docs[idx], "similarity": scores[idx]} for idx in top_k_idx_sorted
        ]

### 4.3 Create the RAG Bot

Now we define the main `NaiveRagBot` class. This class orchestrates the RAG process. Its `get_answer` method will first call the retriever to get relevant documents and then call an LLM with a prompt that includes these documents as context. 

A crucial detail is the format of the output. The RAGAS evaluators expect the final output to be a dictionary containing an `"answer"` key (the generated text) and a `"contexts"` key (a list of the document contents used). We must structure our output accordingly.
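
The retrieve-then-generate flow and the required output shape can be exercised offline with stand-ins. In this sketch, `fake_retrieve`, `fake_generate`, and `answer` are hypothetical stubs that replace the real retriever and LLM call, so it runs without API keys.

```python
import asyncio
from typing import Dict, List


async def fake_retrieve(question: str) -> List[Dict[str, str]]:
    # Stand-in for a retriever call: always returns one canned document.
    return [{"file": "benefits.md", "content": "Employees get paid time off."}]


async def fake_generate(question: str, docs: List[Dict[str, str]]) -> str:
    # Stand-in for the LLM call: echoes the first document instead of calling a model.
    return f"According to the handbook: {docs[0]['content']}"


async def answer(question: str) -> Dict[str, object]:
    docs = await fake_retrieve(question)        # 1. Retrieve.
    text = await fake_generate(question, docs)  # 2. Generate.
    # 3. Return the two keys described above.
    return {"answer": text, "contexts": [doc["content"] for doc in docs]}


result = asyncio.run(answer("How much time off do we get?"))
print(result)
```

Inside a notebook, where an event loop is already running, `await answer(...)` takes the place of `asyncio.run(...)`.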

In [6]:
from langsmith import traceable # Import the traceable decorator.
from langsmith.wrappers import wrap_openai # Import the wrapper to instrument the OpenAI client.


# Define the class for our RAG bot.
class NaiveRagBot:
    def __init__(self, retriever, model: str = "gpt-3.5-turbo"):
        self._retriever = retriever # The retriever instance.
        # Wrapping the client with `wrap_openai` automatically instruments all calls to the OpenAI API for LangSmith tracing.
        self._client = wrap_openai(openai.AsyncClient())
        self._model = model # The name of the LLM to use.

    # Decorate the main method with @traceable to create a root run in LangSmith.
    @traceable
    async def get_answer(self, question: str):
        # 1. Retrieve relevant documents.
        similar = await self._retriever.query(question)
        # 2. Call the LLM with the retrieved context and the question.
        response = await self._client.chat.completions.create(
            model=self._model,
            messages=[
                {
                    "role": "system",
                    # The system prompt instructs the model on how to use the context.
                    "content": "You are a helpful AI assistant."
                    " Use the following docs to help answer the user's question.\n\n"
                    f"## Docs\n\n{similar}",
                },
                {"role": "user", "content": question},
            ],
        )

        # 3. Format the output to be compatible with RAGAS evaluators.
        # This dictionary structure with 'answer' and 'contexts' keys is required.
        return {
            "answer": response.choices[0].message.content,
            # Pass only the document text, so the contexts are the contents described above.
            "contexts": [doc["content"] for doc in similar],
        }

### 4.4 Instantiate and Test the Pipeline
Now we create an instance of our retriever and RAG bot. We'll also do a quick test run to ensure it's working as expected.

In [7]:
# Asynchronously create an instance of our retriever from the documents.
retriever = await VectorStoreRetriever.from_docs(docs, openai.AsyncClient())
# Create an instance of our RAG bot, passing the retriever to it.
rag_bot = NaiveRagBot(retriever)

In [8]:
# Run a single prediction with a sample question.
response = await rag_bot.get_answer("How much time off do we get?")
# Display the first 150 characters of the answer to verify it's working.
response["answer"][:150]

'Employees at 37signals are provided with 18 days of paid time off (PTO) annually, along with 11 local holidays. Additionally, there is a sabbatical po'

## 5. Evaluate

Now it's time to define our evaluators. RAGAS provides a variety of metrics that we can use. To integrate them with LangSmith, we wrap each desired metric in `EvaluatorChain` from `ragas.integrations.langchain`, which makes it compatible with the LangSmith evaluation framework.

In [12]:
from langchain.smith import RunEvalConfig # Import the evaluation configuration class.
from ragas.integrations.langchain import EvaluatorChain # Import the RAGAS LangChain wrapper.
from ragas.metrics import ( # Import the specific metrics we want to use from RAGAS.
    answer_correctness,
    answer_relevancy,
    context_precision,
    context_recall,
    # context_relevancy,
    faithfulness,
)

# A list to hold our wrapped RAGAS evaluator chains.
evaluators = [
    # Wrap each RAGAS metric in an EvaluatorChain.
    EvaluatorChain(metric)
    for metric in [
        answer_correctness,
        answer_relevancy,
        context_precision,
        context_recall,
        faithfulness,
    ]
]
# Create an evaluation configuration that uses our list of RAGAS evaluators.
eval_config = RunEvalConfig(custom_evaluators=evaluators)

Finally, we run the evaluation. The LangSmith client's `arun_on_dataset` method asynchronously runs our `rag_bot` on every example in the dataset and then applies all the configured RAGAS evaluators to each run.

In [13]:
# Asynchronously run the evaluation on the dataset.
results = await client.arun_on_dataset(
    dataset_name=dataset_name, # The name of the dataset to test against.
    llm_or_chain_factory=rag_bot.get_answer, # A reference to the RAG bot's answer method.
    evaluation=eval_config, # The evaluation configuration with our RAGAS metrics.
)

View the evaluation results for project 'passionate-talk-41' at:
https://smith.langchain.com/o/0212d326-bd9d-42bb-9937-c063f40f2361/datasets/f6fc7700-571c-4710-ad3e-88afa82d6a17/compare?selectedSessions=003d27f0-ba6c-4733-9324-32a67f3a0cb0

View all tests for Dataset BaseCamp Q&A at:
https://smith.langchain.com/o/0212d326-bd9d-42bb-9937-c063f40f2361/datasets/f6fc7700-571c-4710-ad3e-88afa82d6a17
[----------->                                      ] 5/21

Error evaluating run 38d1810c-6740-4915-8214-abce2f2b6366 with EvaluatorChain
Traceback (most recent call last):
  File "/Users/tankwin08/Desktop/projects/professional/ai-agents-eval-techniques/.venv/lib/python3.11/site-packages/openai/_base_client.py", line 1526, in request
    response = await self._client.send(
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/tankwin08/Desktop/projects/professional/ai-agents-eval-techniques/.venv/lib/python3.11/site-packages/httpx/_client.py", line 1629, in send
    response = await self._send_handling_auth(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/tankwin08/Desktop/projects/professional/ai-agents-eval-techniques/.venv/lib/python3.11/site-packages/httpx/_client.py", line 1657, in _send_handling_auth
    response = await self._send_handling_redirects(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/tankwin08/Desktop/projects/professional/ai-agents-eval-techniques/.venv/lib/python3.11/site-packages/httpx/_

[---------------------------->                     ] 12/21

Error evaluating run 0f9fa6bb-88ae-41e7-b6b2-61766a38dc49 with EvaluatorChain

[----------------------------------->              ] 15/21

Error evaluating run a22d4bbd-80e5-4052-8611-5fe970486b9a with EvaluatorChain

[-------------------------------------------->     ] 19/21

Error evaluating run 4abccab6-9560-4ebc-9fd3-f4eee39aff57 with EvaluatorChain

[------------------------------------------------->] 21/21
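
Because some evaluator calls in the run above errored, per-metric averages should be computed only over the runs that actually produced a score. The sketch below shows one way to do that on plain dictionaries. The `feedback` rows and the `summarize` helper are hypothetical; they are not the structure returned by `arun_on_dataset`.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Optional

# Hypothetical per-run feedback; None marks an evaluator call that errored.
feedback: List[Dict[str, Optional[float]]] = [
    {"faithfulness": 1.0, "context_recall": 0.5},
    {"faithfulness": None, "context_recall": 1.0},
    {"faithfulness": 0.5, "context_recall": None},
]


def summarize(rows: List[Dict[str, Optional[float]]]) -> Dict[str, Dict[str, float]]:
    """Average each metric over the runs that produced a score."""
    scores = defaultdict(list)
    for row in rows:
        for metric, value in row.items():
            if value is not None:
                scores[metric].append(value)
    return {
        metric: {"mean": mean(values), "scored_runs": len(values)}
        for metric, values in scores.items()
    }


summary = summarize(feedback)
print(summary)
```

Reporting the number of scored runs next to each mean makes it visible when a metric's average rests on fewer examples than the dataset contains.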
