# RAG Evaluation using Fixed Sources


### 1. Theory: Component-wise Evaluation for RAG Systems

A Retrieval-Augmented Generation (RAG) pipeline is a multi-stage system, typically composed of at least two core components: a **Retriever** (which fetches relevant documents) and a **Response Generator** (which synthesizes an answer based on those documents). End-to-end evaluation is useful for measuring overall performance, but it makes the source of an error hard to pinpoint: is a bad answer due to poor retrieval or to faulty generation?

To get more actionable insights, it's highly beneficial to evaluate each component in isolation. This walkthrough focuses on a key technique for evaluating the **Response Generator**. The strategy is to "fix" the source documents for each question, thereby removing the retriever from the equation. We can then measure how well the generator performs its specific task: answering a question faithfully based on a given context.

We will create a dataset where the inputs include both the `question` and the `documents` the generator should use. We will then evaluate the generator on two criteria:
1.  **Correctness**: Is the final answer factually correct, based on a reference label?
2.  **Faithfulness**: Is the answer based *only* on the provided documents, without hallucinating or using outside knowledge?
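
To make the two criteria concrete before any LLM is involved, here is a minimal, library-free sketch. The names `toy_generator`, `correctness`, and `faithfulness` are hypothetical stand-ins (they are not part of LangChain or LangSmith), and the token-overlap heuristics are only crude proxies for the LLM-based judgments used later in this walkthrough.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_generator(question: str, documents: list[str]) -> str:
    """Hypothetical generator: return the document sharing the most tokens with the question."""
    question_tokens = tokenize(question)
    return max(documents, key=lambda d: len(question_tokens & tokenize(d)))


def correctness(answer: str, label: str) -> float:
    """Fraction of label tokens that appear in the answer (crude correctness proxy)."""
    label_tokens = tokenize(label)
    if not label_tokens:
        return 0.0
    return len(label_tokens & tokenize(answer)) / len(label_tokens)


def faithfulness(answer: str, documents: list[str]) -> float:
    """Fraction of answer tokens that appear in the documents (crude faithfulness proxy)."""
    answer_tokens = tokenize(answer)
    if not answer_tokens:
        return 0.0
    doc_tokens = set().union(*(tokenize(d) for d in documents))
    return len(answer_tokens & doc_tokens) / len(answer_tokens)


# The documents are fixed up front, so no retriever is involved.
docs = [
    "The lemonade stand earned 40 dollars in May.",
    "The stand is closed in winter.",
]
answer = toy_generator("How much did the stand earn in May?", docs)
print(answer)
print(correctness(answer, "40 dollars"), faithfulness(answer, docs))
```

Because the documents are supplied directly, any low score here can only come from the generator, which is the point of fixing the sources.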

The final evaluation results in LangSmith will look something like this:

![Custom Evaluator](./img/example_results.png)

### 2. Prerequisites and Setup

First, we'll install the necessary Python packages and configure our environment variables to connect to LangSmith and the model providers.

In [None]:
# The '%pip install' magic installs Python packages from within the notebook.
# The -U flag upgrades each package to the latest available version.
# %pip install -U langchain langchain-openai langsmith python-dotenv

In [1]:
import os # Import the 'os' module to interact with the operating system.
import uuid # Import the uuid library to generate unique identifiers.

# Update with your API URL if using a hosted instance of LangSmith.
# os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com" # Set the LangSmith API endpoint.
# os.environ["LANGCHAIN_API_KEY"] = "YOUR API KEY"  # Update with your API key.
uid = uuid.uuid4() # Generate a unique ID to keep dataset names unique.

In [2]:
from dotenv import load_dotenv # Import function to load environment variables
import os # Import the 'os' module to interact with the operating system.

# Load environment variables from the .env file. The `override=True` argument
# ensures that variables from the .env file will overwrite existing environment variables.
load_dotenv(dotenv_path=".env", override=True)



# Update with your API URL if using a hosted instance of LangSmith.
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com" # Set the LangSmith API endpoint as an environment variable.
# Copy the API keys loaded from .env into the variable names the libraries read.
os.environ["LANGCHAIN_API_KEY"] = os.getenv('LANGSMITH_API_KEY') # Set your LangSmith API key as an environment variable.
os.environ["OPENAI_API_KEY"] = os.getenv('OPENAI_API_KEY') # Set your OpenAI API key as an environment variable.
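
One caveat with the cell above: `os.getenv` returns `None` when a variable is missing, and assigning `None` into `os.environ` raises a `TypeError` that does not name the missing key. A small guard can make that failure easier to diagnose. The helper name `require_env` below is hypothetical, not part of any library; treat this as a sketch.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a descriptive error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; add it to your .env file."
        )
    return value


# Usage (commented out so the cell does not fail without real credentials):
# os.environ["LANGCHAIN_API_KEY"] = require_env("LANGSMITH_API_KEY")
# os.environ["OPENAI_API_KEY"] = require_env("OPENAI_API_KEY")
```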

### Step 1: Create a Dataset with Fixed Context

We will now create our evaluation dataset. The key feature of this dataset is its structure. For each example, the `inputs` dictionary will contain both the user's `question` and a list of `documents`. This list of documents represents the fixed context that our response generator will use. The `outputs` dictionary contains the `label`, which is the ground-truth answer for the correctness check.

The examples below are designed to test if the response generator can correctly extract information and whether it will ignore its pre-trained knowledge in favor of the provided (and sometimes counter-intuitive) context.

In [3]:
# A simple example dataset to illustrate the concept.
examples = [
    {
        "inputs": {
            "question": "What's the company's total revenue for q2 of 2022?",
            # The 'documents' are part of the input for the component we are testing.
            "documents": [
                {
                    "metadata": {},
                    "page_content": "In q1 the lemonade company made $4.95. In q2 revenue increased by a sizeable amount to just over $2T dollars.",
                }
            ],
        },
        "outputs": {
            # The 'label' is the ground-truth answer for correctness evaluation.
            "label": "2 trillion dollars",
        },
    },
    {
        "inputs": {
            "question": "Who is Lebron?",
            # This document provides a fictional, counter-intuitive context.
            "documents": [
                {
                    "metadata": {},
                    "page_content": "On Thursday, February 16, Lebron James was nominated as President of the United States.",
                }
            ],
        },
        "outputs": {
            "label": "Lebron James is the President of the USA.",
        },
    },
]
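
Since the evaluation depends on every example having this exact shape, it can help to check the structure before uploading. The following is a minimal sketch; `validate_example` is a hypothetical helper and the rules it enforces are assumptions drawn from the dataset description above, not a LangSmith requirement.

```python
def validate_example(example: dict) -> list[str]:
    """Return a list of problems found in one dataset example (empty if well-formed)."""
    problems = []
    inputs = example.get("inputs", {})
    outputs = example.get("outputs", {})

    question = inputs.get("question")
    if not isinstance(question, str) or not question:
        problems.append("inputs.question must be a non-empty string")

    documents = inputs.get("documents")
    if not isinstance(documents, list) or not documents:
        problems.append("inputs.documents must be a non-empty list")
    else:
        for i, doc in enumerate(documents):
            if not isinstance(doc, dict) or not isinstance(doc.get("page_content"), str):
                problems.append(f"inputs.documents[{i}] needs a string 'page_content'")

    if not isinstance(outputs.get("label"), str):
        problems.append("outputs.label must be a string")
    return problems


# A sample in the same shape as the dataset above.
sample = {
    "inputs": {
        "question": "Who is Lebron?",
        "documents": [{"metadata": {}, "page_content": "Lebron James was nominated."}],
    },
    "outputs": {"label": "Lebron James is the President of the USA."},
}
print(validate_example(sample))  # an empty list means no problems were found
```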

In [4]:
from langsmith import Client # Import the Client class to interact with LangSmith.

client = Client() # Instantiate the LangSmith client.

dataset_name = f"Faithfulness Example - {uid}" # Create a unique name for our dataset.
dataset = client.create_dataset(dataset_name=dataset_name) # Create the dataset on the LangSmith platform.
# Create the examples in the dataset.
client.create_examples(
    inputs=[e["inputs"] for e in examples], # Pass the list of input dictionaries.
    outputs=[e["outputs"] for e in examples], # Pass the list of output dictionaries.
    dataset_id=dataset.id, # Link these examples to the dataset we just created.
)

{'example_ids': ['4599fd5f-c2b4-43da-9c0e-b10f6dfc7f46',
  'c3ea196b-e756-4e4e-9a44-670c2ffa183c'],
 'count': 2}

### Step 2: Define the Chain Component

Next, we define our RAG system. We'll show the full chain for context, but we will clearly separate the **`response_synthesizer`** component. This synthesizer is the specific part of the chain that we will be evaluating. It takes a dictionary containing `documents` and a `question` and generates the final answer.

In [5]:
from langchain_core import prompts # Import the prompt templates module.
from langchain_core.documents import Document # Import the Document class.
from langchain_core.retrievers import BaseRetriever # Import the base retriever class.
from langchain_core.runnables import RunnablePassthrough # Import a passthrough runnable.
from langchain_openai import ChatOpenAI # OpenAI chat model wrapper.

# This is a placeholder retriever to illustrate the full chain. It will not be used in our evaluation.
class MyRetriever(BaseRetriever):
    def _get_relevant_documents(self, query, *, run_manager):
        return [Document(page_content="Example")]

model = "gpt-3.5-turbo"
# This is the specific component we will be evaluating.
response_synthesizer = prompts.ChatPromptTemplate.from_messages(
    [
        ("system", "Respond using the following documents as context:\n{documents}"),
        ("user", "{question}"),
    ]
) | ChatOpenAI(model=model, max_tokens=1000) # We pipe the prompt to an LLM.

# The full RAG chain is shown below for illustrative purposes only.
chain = {
    "documents": MyRetriever(),
    "question": RunnablePassthrough(), # The key must match the {question} placeholder in the prompt.
} | response_synthesizer
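
Note that the prompt's `{documents}` placeholder is filled with the string form of whatever is passed in, which for this dataset is a list of dictionaries. If you prefer the model to see cleaner context, one option is to render the documents as plain text first. The helper below is a hypothetical sketch (`format_documents` is not a LangChain function) and is not used in the rest of this walkthrough.

```python
def format_documents(documents: list[dict]) -> str:
    """Render documents as numbered plain-text passages (hypothetical helper)."""
    return "\n\n".join(
        f"[Document {i}]\n{doc['page_content']}"
        for i, doc in enumerate(documents, start=1)
    )


docs = [
    {"metadata": {}, "page_content": "In q1 the lemonade company made $4.95."},
    {"metadata": {}, "page_content": "In q2 revenue increased to just over $2T dollars."},
]
print(format_documents(docs))
```

If you adopt something like this, apply the same formatting wherever the documents are later passed to an evaluator as reference context, so the judge sees the same text as the generator.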

### Step 3: Define a Custom Faithfulness Evaluator

To measure faithfulness, we need an evaluator that checks if the model's `prediction` is consistent with the provided `documents`. Standard evaluators assume the reference context comes from the `outputs` of a dataset example. In our case, the context (the documents) is in the `inputs`.

To handle this, we'll create a custom `FaithfulnessEvaluator`. This class will wrap a standard LangChain scoring evaluator but will override the data mapping. It will tell the underlying evaluator to use:
- The model's generation as the `prediction`.
- The `question` from the run's inputs as the `input`.
- The `documents` from the *example's inputs* as the `reference` context.

This allows us to use an off-the-shelf LLM-based scoring mechanism with our custom dataset structure.

In [6]:
from langsmith.evaluation import RunEvaluator, EvaluationResult # Import the base classes for custom evaluation.
from langchain.evaluation import load_evaluator # Import a helper to load built-in evaluators.


# Define our custom evaluator class, inheriting from RunEvaluator.
class FaithfulnessEvaluator(RunEvaluator):
    def __init__(self):
        # Initialize a built-in 'labeled_score_string' evaluator.
        # This evaluator uses an LLM to score a prediction on a 1-10 scale based on given criteria.
        self.evaluator = load_evaluator(
            "labeled_score_string",
            criteria={
                "faithful": "How faithful is the submission to the reference context?"
            },
            normalize_by=10, # Divide the raw 1-10 score by 10, giving a value between 0.1 and 1.
        )

    # This is the core method that LangSmith will call for each run.
    def evaluate_run(self, run, example) -> EvaluationResult:
        # Call the underlying evaluator's 'evaluate_strings' method with custom-mapped fields.
        res = self.evaluator.evaluate_strings(
            prediction=next(iter(run.outputs.values())), # The LLM's generated answer.
            input=run.inputs["question"], # The user's question.
            # This is the key part: we use the 'documents' from the example's INPUTS as the reference context.
            reference=str(example.inputs["documents"]),
        )
        # Return the result in the standard EvaluationResult format.
        return EvaluationResult(key="labeled_criteria:faithful", **res)
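
The field mapping can be checked offline before spending LLM calls. The sketch below mirrors the mapping in `evaluate_run` using stand-ins: `SimpleNamespace` objects play the role of the LangSmith `Run` and `Example`, and `StubEvaluator` is a hypothetical replacement for the LLM-backed evaluator. The `"output"` key in `run.outputs` is an assumption for illustration.

```python
from types import SimpleNamespace


def map_fields(run, example) -> dict:
    """Mirror the field mapping used in FaithfulnessEvaluator.evaluate_run."""
    return {
        "prediction": next(iter(run.outputs.values())),  # first output value
        "input": run.inputs["question"],
        "reference": str(example.inputs["documents"]),  # context from example INPUTS
    }


class StubEvaluator:
    """Hypothetical stand-in for the LLM-backed evaluator."""

    def evaluate_strings(self, *, prediction, input, reference):
        # A real evaluator asks an LLM for a 1-10 score. Here we fake one with a
        # substring check and divide by 10, as normalize_by=10 does.
        raw_score = 10 if prediction in reference else 1
        return {"score": raw_score / 10, "reasoning": "stubbed"}


documents = [{"metadata": {}, "page_content": "Lebron James was nominated as President."}]
run = SimpleNamespace(
    inputs={"question": "Who is Lebron?", "documents": documents},
    outputs={"output": "Lebron James was nominated as President."},
)
example = SimpleNamespace(inputs={"question": "Who is Lebron?", "documents": documents})

fields = map_fields(run, example)
result = StubEvaluator().evaluate_strings(**fields)
print(fields["reference"])
print(result["score"])
```

Printing `fields["reference"]` shows exactly what the judge model would receive as context: the string form of the document list, metadata included.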

### Step 4: Run the Evaluation

Now we can run the evaluation. We will configure it to use two evaluators:
1. The standard `"qa"` evaluator, which will measure correctness against the `label` in our dataset outputs.
2. Our custom `FaithfulnessEvaluator`, which will measure how grounded the response is in the provided documents.

We will pass the `response_synthesizer` directly as the system to be tested.

In [7]:
from langchain.smith import RunEvalConfig # Import the evaluation configuration class.

# Create an evaluation configuration.
eval_config = RunEvalConfig(
    evaluators=["qa"], # Include the standard 'qa' correctness evaluator.
    custom_evaluators=[FaithfulnessEvaluator()], # Include our custom faithfulness evaluator.
    input_key="question", # Tell the 'qa' evaluator to use the 'question' field from the inputs.
)
# Run the evaluation on the dataset.
results = client.run_on_dataset(
    llm_or_chain_factory=response_synthesizer, # The specific component to be tested.
    dataset_name=dataset_name, # The name of our dataset in LangSmith.
    evaluation=eval_config, # The evaluation configuration.
)

View the evaluation results for project 'slight-cat-16' at:
https://smith.langchain.com/o/0212d326-bd9d-42bb-9937-c063f40f2361/datasets/6f4aa3b5-0f2b-4490-a3f0-66f09360fe81/compare?selectedSessions=eb5508b6-322f-48e9-b9ff-778d77bcde49

View all tests for Dataset Faithfulness Example - 84799d49-fed3-44f8-aeab-4b72b546db4d at:
https://smith.langchain.com/o/0212d326-bd9d-42bb-9937-c063f40f2361/datasets/6f4aa3b5-0f2b-4490-a3f0-66f09360fe81
[------------------------------------------------->] 2/2


You can now review the results in LangSmith by clicking the link in the output above. You will see scores for both correctness (`qa`) and faithfulness (`labeled_criteria:faithful`). Inspecting the trace for the faithfulness evaluator will show how the LLM judged the response against the provided documents.

[![](./img/example_score.png)](https://smith.langchain.com/public/9a4e6ee2-f26c-4bcd-a050-04766fbfd350/r)
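
When reading the two scores side by side, it can help to sort each result into a rough failure category. The sketch below is one way to do that; the `triage` function, its category wording, and the 0.5 threshold are all hypothetical choices, and the score pairs are made-up values rather than real results.

```python
def triage(correctness: float, faithfulness: float, threshold: float = 0.5) -> str:
    """Classify one result by which of the two scores fall below a threshold."""
    correct = correctness >= threshold
    faithful = faithfulness >= threshold
    if correct and faithful:
        return "ok"
    if correct:
        return "correct but ungrounded: the model may be relying on outside knowledge"
    if faithful:
        return "grounded but incorrect: check the documents or the reference label"
    return "incorrect and ungrounded: inspect the prompt and the model output"


# Hypothetical (correctness, faithfulness) pairs, not real evaluation results.
for scores in [(1.0, 0.9), (1.0, 0.2), (0.0, 0.8), (0.0, 0.1)]:
    print(scores, "->", triage(*scores))
```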

## Discussion

You have now successfully evaluated a RAG system's response generator in isolation, testing both its **correctness** and its **faithfulness** to the provided context. This is an effective way to debug and improve a specific component of your RAG pipeline.

The key technical insight was the use of a custom `RunEvaluator`. While most of LangChain's built-in evaluators are `StringEvaluator`s, which have a rigid expectation of where to find the `input`, `prediction`, and `reference` strings, the `RunEvaluator` interface gives you full control. It provides access to the entire `Run` and `Example` objects, allowing you to flexibly map any field from your run traces or dataset to the inputs of an underlying evaluator.

In our case, we used this flexibility to take the reference context (`documents`) from the example's **inputs**, whereas a standard `StringEvaluator` would have looked for it in the example's `outputs`. This powerful pattern enables a wide range of custom evaluation strategies tailored to your specific application and dataset structure.