# Comparing Q&A System Outputs


### 1. Theory: Absolute Grading vs. Pairwise Comparison

When comparing two versions of an LLM system (e.g., trying a new prompt, model, or RAG strategy), the standard approach is to benchmark both on a dataset and compare their aggregate metrics (like correctness or faithfulness). This **absolute grading** is a great starting point for understanding overall performance.

However, what happens when the aggregate scores are nearly identical? Does this mean the two systems are of equal quality? Not necessarily. One system might produce slightly more concise answers, while the other might be better at handling nuance. These subtle differences are often lost in aggregate scores.

This is where **pairwise comparison** becomes a powerful tool. Instead of asking an LLM judge, "Is this answer correct?" (a binary 0/1 score), we ask, "Given these two answers, which one is better?" This forces a preference and can reveal qualitative differences that absolute metrics miss. Choosing between two answers is also often an easier and more reliable task for an LLM judge than assigning an absolute grade.

In this tutorial, we will:
1.  Create a dataset and two versions of a RAG Q&A system (by varying the document chunk size).
2.  Benchmark both systems using a standard correctness evaluator.
3.  Run a **pairwise evaluator** to directly compare the outputs of the two systems for each question.
4.  Log the resulting preference scores back to LangSmith for analysis.

This will give us a much richer understanding of the impact of our changes.

### 2. Prerequisites and Setup

This tutorial uses OpenAI, ChromaDB, and LangChain. First, we will configure our environment variables to connect to the necessary services.

**Action Required**: You must replace `"YOUR API KEY"` with your actual key.

In [None]:
# import os # Import the 'os' module to interact with the operating system.

# # Update with your API URL if using a hosted instance of LangSmith.
# os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com" # Set the LangSmith API endpoint.
# os.environ["LANGCHAIN_API_KEY"] = "YOUR API KEY"  # Update with your API key.

In [1]:
from dotenv import load_dotenv # Import function to load environment variables
import os # Import the 'os' module to interact with the operating system.

# Load environment variables from the .env file. The `override=True` argument
# ensures that variables from the .env file will overwrite existing environment variables.
load_dotenv(dotenv_path=".env", override=True)



# Update with your API URL if using a hosted instance of LangSmith.
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com" # Set the LangSmith API endpoint as an environment variable.
# Copy the keys loaded from .env into the variable names used in this tutorial.
# Indexing os.environ raises a KeyError that names the variable if it is missing,
# whereas assigning the None returned by os.getenv() would raise a less helpful TypeError.
os.environ["LANGCHAIN_API_KEY"] = os.environ["LANGSMITH_API_KEY"] # Set your LangSmith API key as an environment variable.
os.environ["OPENAI_API_KEY"] = os.environ["OPENAI_API_KEY"] # Fail early if the OpenAI API key is not set.
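
A missing key in `.env` is a common cause of confusing errors later in the notebook. The following is a minimal sketch of a check that could be run before the assignments above. The helper name `missing_env_vars` is hypothetical and not part of LangChain or LangSmith; the variable names are the two read in the cell above.

```python
import os
from typing import Iterable, Mapping


def missing_env_vars(names: Iterable[str], env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in names if not env.get(name)]


# Report every missing key at once instead of failing on the first one.
required = ["LANGSMITH_API_KEY", "OPENAI_API_KEY"]
missing = missing_env_vars(required)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Passing the environment as a parameter keeps the helper easy to exercise with a plain dictionary.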

Next, we install the required Python packages and set the OpenAI API key.

**Action Required**: Replace `<YOUR-API-KEY>` with your actual OpenAI key.

In [None]:
# The '%pip install' command installs python packages from the notebook. '--quiet' suppresses the output.
# %pip install -U "langchain[openai]" --quiet
# %pip install chromadb --quiet
# %pip install lxml --quiet
# %pip install html2text --quiet
# %pip install pandas --quiet

In [None]:
# The '%env' magic command sets an environment variable for the notebook session.
# %env OPENAI_API_KEY=<YOUR-API-KEY>

## Step 1: Setup

#### a. Create a Dataset

First, we'll create our evaluation dataset. A good dataset is the cornerstone of any reliable testing process. We've provided a small, hard-coded set of question-answer pairs about LangSmith for this example. In a real-world scenario, you would want a much larger and more diverse set of examples.

In [2]:
# Define a list of tuples, where each tuple contains a (question, answer) pair.
examples = [
    (
        "What is LangChain?",
        "LangChain is an open-source framework for building applications using large language models. It is also the name of the company building LangSmith.",
    ),
    (
        "How might I query for all runs in a project?",
        "client.list_runs(project_name='my-project-name'), or in TypeScript, client.listRuns({projectName: 'my-project-name'})",
    ),
    (
        "What's a langsmith dataset?",
        "A LangSmith dataset is a collection of examples. Each example contains inputs and optional expected outputs or references for that data point.",
    ),
    (
        "How do I use a traceable decorator?",
        """The traceable decorator is available in the langsmith python SDK. To use, configure your environment with your API key, \
import the required function, decorate your function, and then call the function. Below is an example:
```python
from langsmith.run_helpers import traceable
@traceable(run_type="chain") # or "llm", etc.
def my_function(input_param):
    # Function logic goes here
    return output
result = my_function(input_param)
```""",
    ),
    (
        "Can I trace my Llama V2 llm?",
        "So long as you are using one of LangChain's LLM implementations, all your calls can be traced",
    ),
    (
        "Why do I have to set environment variables?",
        "Environment variables can tell your LangChain application to perform tracing and contain the information necessary to authenticate to LangSmith."
        " While there are other ways to connect, environment variables tend to be the simplest way to configure your application.",
    ),
    (
        "How do I move my project between organizations?",
        "LangSmith doesn't directly support moving projects between organizations.",
    ),
]

In [3]:
from langsmith import Client # Import the Client class to interact with LangSmith.

client = Client() # Instantiate the LangSmith client.

In [4]:
import uuid # Import the uuid library to generate unique identifiers.

dataset_name = f"Retrieval QA Questions {str(uuid.uuid4())}" # Create a unique name for the dataset.
dataset = client.create_dataset(dataset_name=dataset_name) # Create the dataset on the LangSmith platform.
# Iterate through our question-answer pairs.
for q, a in examples:
    # Create an example in our LangSmith dataset for each pair.
    client.create_example(
        inputs={"question": q}, outputs={"answer": a}, dataset_id=dataset.id
    )
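
Before uploading a larger dataset, it can help to check the pairs for obvious problems. The sketch below is one possible check, not part of the LangSmith SDK: `validate_examples` is a hypothetical helper that flags empty fields and repeated questions in a list shaped like `examples`.

```python
from collections import Counter


def validate_examples(pairs):
    """Return a list of human-readable problems found in (question, answer) pairs."""
    problems = []
    # Normalize questions so that case and surrounding whitespace do not hide duplicates.
    counts = Counter(question.strip().lower() for question, _ in pairs)
    for index, (question, answer) in enumerate(pairs):
        if not question.strip():
            problems.append(f"example {index}: empty question")
        if not answer.strip():
            problems.append(f"example {index}: empty answer")
        if question.strip() and counts[question.strip().lower()] > 1:
            problems.append(f"example {index}: duplicate question")
    return problems


# Illustrative data only; these pairs are not the tutorial's dataset.
sample = [
    ("What is LangChain?", "An open-source framework."),
    ("what is langchain? ", "A duplicate with different casing."),
    ("How do I trace?", ""),
]
for problem in validate_examples(sample):
    print(problem)
```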

#### b. Define Two RAG Q&A Systems to Compare

Now we'll define our RAG system. The experiment we want to run is to see how the retriever's **chunk size** affects the quality of the final answer. We will create two versions of our RAG chain that are identical in every way except for the `chunk_size` and `chunk_overlap` used when splitting the source documents.

First, we load and process the source documents.

In [5]:
from langchain_community.document_loaders import RecursiveUrlLoader # A loader for recursively scraping a website.
from langchain_community.document_transformers import Html2TextTransformer # A transformer to convert HTML to plain text.
from langchain_community.vectorstores import Chroma # The Chroma vector store implementation.
from langchain_text_splitters import TokenTextSplitter # A text splitter that splits based on token count.
from langchain_openai import OpenAIEmbeddings # The class for using OpenAI's embedding models.

api_loader = RecursiveUrlLoader("https://docs.smith.langchain.com") # Initialize a loader for the LangSmith docs.
doc_transformer = Html2TextTransformer() # Initialize the HTML to text transformer.
raw_documents = api_loader.load() # Load the raw documents.
transformed = doc_transformer.transform_documents(raw_documents) # Transform them into plain text.


# Define a factory function to create a retriever based on a given text splitter.
def create_retriever(transformed_documents, text_splitter):
    documents = text_splitter.split_documents(transformed_documents) # Split the documents.
    embeddings = OpenAIEmbeddings() # Initialize the embeddings model.
    vectorstore = Chroma.from_documents(documents, embeddings) # Create the vector store.
    return vectorstore.as_retriever(search_kwargs={"k": 4}) # Return the retriever.




In [11]:
print(transformed[0])

page_content='Skip to main content

**OurBuilding Ambient Agents with LangGraph course is now available on
LangChain Academy!**

API Reference

  * REST
  * Python
  * JS/TS

Search

Region

  * US
  * EU

Go to App

  * Get Started
  * Observability

  * Evaluation

  * Prompt Engineering

  * Deployment (LangGraph Platform)
  * * * *

  * Administration

  * Self-hosting

  * Pricing

  * * * *

  * Reference

    * Cloud architecture and scalability
    * Authz and Authn

      * Authentication methods
    * data_formats

    * Evaluation

      * Dataset transformations
    * Regions FAQ
    * sdk_reference

  *   * Get Started

On this page

# Get started with LangSmith

**LangSmith** is a platform for building production-grade LLM applications. It
allows you to closely monitor and evaluate your application, so you can ship
quickly and with confidence.

### Observability

Analyze traces in LangSmith and configure metrics, dashboards, alerts based on
these.

### Evals

Evaluate you

In [10]:
print(raw_documents[0])

page_content='<!doctype html>
<html lang="en" dir="ltr" class="docs-wrapper plugin-docs plugin-id-default docs-version-current docs-doc-page docs-doc-id-index" data-has-hydrated="false">
<head>
<meta charset="UTF-8">
<meta name="generator" content="Docusaurus v3.5.2">
<title data-rh="true">Get started with LangSmith | 🦜️🛠️ LangSmith</title><meta data-rh="true" name="viewport" content="width=device-width,initial-scale=1"><meta data-rh="true" name="twitter:card" content="summary_large_image"><meta data-rh="true" property="og:image" content="https://docs.smith.langchain.com/img/langsmith-preview.png"><meta data-rh="true" name="twitter:image" content="https://docs.smith.langchain.com/img/langsmith-preview.png"><meta data-rh="true" property="og:url" content="https://docs.smith.langchain.com/"><meta data-rh="true" property="og:locale" content="en"><meta data-rh="true" name="docusaurus_locale" content="en"><meta data-rh="true" name="docsearch:language" content="en"><meta data-rh="true" name="

Next, we'll define a factory function for our chain. It will take a retriever as an argument, allowing us to easily create different chain versions.

In [12]:
from datetime import datetime # Import datetime to include the current time in the prompt.
from operator import itemgetter # Import itemgetter for convenient data routing.
from langchain_core.output_parsers import StrOutputParser # Import the string output parser.
from langchain_core.prompts import ChatPromptTemplate # Import the chat prompt template class.
from langchain_openai import ChatOpenAI # Import the OpenAI chat model class.


# Define a factory function that creates a RAG chain from a given retriever.
def create_chain(retriever):
    prompt = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "You are a helpful documentation Q&A assistant, trained to answer"
                " questions from LangSmith's documentation."
                " LangChain is a framework for building applications using large language models."
                "\nThe current time is {time}.\n\nRelevant documents will be retrieved in the following messages.",
            ),
            ("system", "{context}"),
            ("human", "{question}"),
        ]
    ).partial(time=str(datetime.now()))

    model = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0)
    response_generator = prompt | model | StrOutputParser()
    # Define the final chain using LangChain Expression Language (LCEL).
    chain = (
        {
            "context": itemgetter("question")
            | retriever
            | (lambda docs: "\n".join([doc.page_content for doc in docs])),
            "question": itemgetter("question"),
        }
        | response_generator
    )
    return chain
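
The dictionary at the start of the chain routes the same input to two places: the question goes through the retriever to build `context`, and it is also passed through unchanged as `question`. The plain-Python sketch below mimics that data flow without LangChain. `Doc`, `fake_retriever`, and `build_prompt_inputs` are hypothetical stand-ins used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a retrieved document; only page_content is modeled."""
    page_content: str


def fake_retriever(question: str) -> list[Doc]:
    # Hypothetical retriever that returns canned documents regardless of the question.
    return [Doc("LangSmith traces runs."), Doc("Datasets hold examples.")]


def build_prompt_inputs(inputs: dict, retriever) -> dict:
    """Mimic the mapping step: retrieve context for the question and pass the question through."""
    question = inputs["question"]
    docs = retriever(question)
    # Same joining rule as the lambda in create_chain: one document per line.
    context = "\n".join(doc.page_content for doc in docs)
    return {"context": context, "question": question}


prompt_inputs = build_prompt_inputs({"question": "What is a dataset?"}, fake_retriever)
print(prompt_inputs["context"])
```

The resulting dictionary has exactly the two keys that the prompt template expects, `context` and `question`.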

Now we can create our two chains to compare. `chain_1` will use larger document chunks (2,000 tokens with a 200-token overlap), while `chain_2` will use smaller chunks (500 tokens with a 50-token overlap).

In [13]:
# Define the text splitter for our first chain (large chunks).
text_splitter = TokenTextSplitter(
    model_name="gpt-3.5-turbo",
    chunk_size=2000,
    chunk_overlap=200,
)
# Create the first retriever.
retriever = create_retriever(transformed, text_splitter)

# Create the first chain.
chain_1 = create_chain(retriever)

In [14]:
# Define the text splitter for our second chain (small chunks).
text_splitter_2 = TokenTextSplitter(
    model_name="gpt-3.5-turbo",
    chunk_size=500,
    chunk_overlap=50,
)
# Create the second retriever.
retriever_2 = create_retriever(transformed, text_splitter_2)

# Create the second chain.
chain_2 = create_chain(retriever_2)
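
To build intuition for how the two settings differ, here is a simplified sliding-window splitter. It is a sketch that assumes a document is already a list of token IDs; it is not `TokenTextSplitter`'s actual implementation, and `split_tokens` is a hypothetical name.

```python
def split_tokens(tokens, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # The window reached the end of the document.
        start += step
    return chunks


# A hypothetical 5,000-token document, represented as integer token IDs.
document = list(range(5000))
large = split_tokens(document, chunk_size=2000, chunk_overlap=200)
small = split_tokens(document, chunk_size=500, chunk_overlap=50)
print(len(large), len(small))
```

Because the retriever returns `k=4` chunks, `chain_1` can place up to 8,000 tokens of context in the prompt, while `chain_2` is limited to 2,000. Larger chunks give the model more surrounding text per hit; smaller chunks make each hit more focused.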

#### c. Evaluate the Chains with Absolute Grading

Now we'll run our initial benchmark. We will run both `chain_1` and `chain_2` on our dataset and use a standard correctness evaluator (`cot_qa`) to score each response. This will give us an initial impression of their performance.

In [15]:
from langchain.smith import RunEvalConfig # Import the evaluation configuration class.

eval_config = RunEvalConfig(
    # We will use the chain-of-thought Q&A correctness evaluator for a more robust grade.
    evaluators=["cot_qa"],
)

In [16]:
# Run the evaluation for the first chain.
results = client.run_on_dataset(
    dataset_name=dataset_name, llm_or_chain_factory=chain_1, evaluation=eval_config
)
# Store the project name for later retrieval.
project_name = results["project_name"]

View the evaluation results for project 'ample-dress-94' at:
https://smith.langchain.com/o/0212d326-bd9d-42bb-9937-c063f40f2361/datasets/75eeb2e6-1fa5-435c-8362-8387ca8781cc/compare?selectedSessions=c23bce47-03be-4657-a88f-b0389ab98df4

View all tests for Dataset Retrieval QA Questions 5e1748c5-deea-447f-9468-bf794a2768b1 at:
https://smith.langchain.com/o/0212d326-bd9d-42bb-9937-c063f40f2361/datasets/75eeb2e6-1fa5-435c-8362-8387ca8781cc
[------------------------------------------------->] 7/7


In [17]:
# Run the evaluation for the second chain.
results_2 = client.run_on_dataset(
    dataset_name=dataset_name, llm_or_chain_factory=chain_2, evaluation=eval_config
)
# Store the project name for the second run.
project_name_2 = results_2["project_name"]

View the evaluation results for project 'ample-match-34' at:
https://smith.langchain.com/o/0212d326-bd9d-42bb-9937-c063f40f2361/datasets/75eeb2e6-1fa5-435c-8362-8387ca8781cc/compare?selectedSessions=0c64b311-74a3-4cf4-be58-d84ebaf84078

View all tests for Dataset Retrieval QA Questions 5e1748c5-deea-447f-9468-bf794a2768b1 at:
https://smith.langchain.com/o/0212d326-bd9d-42bb-9937-c063f40f2361/datasets/75eeb2e6-1fa5-435c-8362-8387ca8781cc
[------------------------------------------------->] 7/7


You now have two test runs in LangSmith over the same dataset. You can view the aggregate metrics in the UI, or you can fetch the results and compare them directly in a pandas DataFrame.

In [19]:
import pandas as pd # Import the pandas library.

# Fetch all the runs from the first test project.
runs_1 = list(client.list_runs(project_name=project_name, execution_order=1))
# Fetch all the runs from the second test project.
runs_2 = list(client.list_runs(project_name=project_name_2, execution_order=1))


# Helper function to convert a list of runs into a DataFrame.
def get_project_df(runs):
    return pd.DataFrame(
        [
            # For each run, create a dictionary with its outputs and the average feedback scores.
            {**run.outputs, **{k: v.get("avg") for k, v in run.feedback_stats.items()}}
            for run in runs
        ],
        # Use the reference example ID as the index for easy joining.
        index=[run.reference_example_id for run in runs],
    )


runs_1_df = get_project_df(runs_1) # Create a DataFrame for the first run.
runs_2_df = get_project_df(runs_2) # Create a DataFrame for the second run.
# Join the two DataFrames on their index (the example ID).
joined_df = runs_1_df.join(runs_2_df, lsuffix="_1", rsuffix="_2")
# Reorder the columns for a clean side-by-side comparison.
columns_1 = [col for col in joined_df.columns if col.endswith("_1")]
columns_2 = [col for col in joined_df.columns if col.endswith("_2")]
new_columns_order = [col for pair in zip(columns_1, columns_2) for col in pair]
joined_df = joined_df[new_columns_order]

In [20]:
joined_df

| | output_1 | output_2 | cot contextual accuracy_1 | cot contextual accuracy_2 |
|---|---|---|---|---|
| 3af4f831-baba-4d3a-aa50-8ae04353fb2d | LangChain is a framework for building applicat... | LangChain is a framework for building applicat... | 1.0 | 1.0 |
| ac6ce8ce-a3c2-4971-b271-02139ca36014 | To query for all runs in a project using LangS... | To query for all runs in a project using LangS... | 0.0 | 0.0 |
| 7e59b921-674e-4ae5-ba61-04197275b4a5 | To use the `traceable` decorator in your LangC... | To use the `traceable` decorator in your LangC... | 0.0 | 0.0 |
| d640e433-64f5-4293-bb9b-e61797046c68 | At the moment, LangSmith does not support movi... | At the moment, LangSmith does not support movi... | 1.0 | 1.0 |
| 035c4797-6c30-4883-8d2c-717a11f194e9 | Setting environment variables is necessary in ... | Setting environment variables is a common prac... | 1.0 | 1.0 |
| b8d6d658-fec1-440d-98a0-5cacaa87fd55 | A LangSmith dataset refers to a collection of ... | A LangSmith dataset refers to a collection of ... | 1.0 | 1.0 |
| 3e64a644-dfaa-4f9a-b1fd-785da68a770e | To trace your Llama V2 language model using La... | To trace your Llama V2 language model using La... | 1.0 | 1.0 |


The two chains receive identical correctness scores on every example (5 of 7 graded correct), so the benchmark alone cannot separate them. This is a perfect scenario for using pairwise comparison to find more subtle differences.
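
The same check can be done programmatically. The sketch below copies the correctness scores from the table above into two lists, in row order, and reports the mean for each chain along with any rows where the grades differ. `summarize` is a hypothetical helper, not part of LangSmith.

```python
from statistics import mean

# Correctness scores copied from the joined DataFrame output, in row order.
scores_1 = [1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
scores_2 = [1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]


def summarize(scores_a, scores_b):
    """Return both means and the row positions where the two grades differ."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both score lists must cover the same examples")
    differing = [i for i, (a, b) in enumerate(zip(scores_a, scores_b)) if a != b]
    return mean(scores_a), mean(scores_b), differing


mean_1, mean_2, differing_rows = summarize(scores_1, scores_2)
print(f"chain_1: {mean_1:.2f}, chain_2: {mean_2:.2f}, differing rows: {differing_rows}")
```

An empty list of differing rows means the absolute grades carry no signal about which chain is better, even on individual examples.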

### Compare in LangSmith

In the LangSmith UI, you can navigate to the dataset, select the two test runs you just completed, and click "Compare" to get a detailed, side-by-side view.

![Compare Tests](./img/compare_tests_select.png)

Manual comparison is powerful, but it can be time-consuming. Below, we'll demonstrate how to automate this process using an LLM as a judge.

## Step 2: Pairwise Evaluation

Since the absolute scores were the same, we'll now run a pairwise evaluator to determine a preference between the two outputs for each question. We will define a helper function, `predict_preference`, to orchestrate this process for each example in our dataset. This function will:

1.  Fetch the completed runs for a given example from both of our test projects (`project_a` and `project_b`).
2.  **Randomize the order** of the two predictions (A and B). This is a crucial step to mitigate **positional bias**, where LLM judges may tend to prefer the first option they see.
3.  Call a pairwise evaluation chain, which asks an LLM to state its preference (A or B).
4.  Log the preference as feedback to the corresponding runs in LangSmith. The preferred run gets a score of `1`, and the other gets `0`.


In [21]:
import random # Import the random module for shuffling.
import logging # Import the logging module to handle potential errors.


# A helper function to fetch a specific run and its prediction.
def _get_run_and_prediction(example_id, project_name):
    # List runs, filtering by the reference example and project.
    run = next(
        client.list_runs(
            reference_example_id=example_id,
            project_name=project_name,
            execution_order=1,
        )
    )
    # Extract the output from the run.
    prediction = next(iter(run.outputs.values()))
    return run, prediction


# A helper function to log the preference feedback to LangSmith.
def _log_feedback(run_ids):
    # run_ids must be ordered (runner-up, preferred): enumerate() assigns score 0 to the
    # first ID and score 1 to the second, both logged under the 'preference' key.
    for score, run_id in enumerate(run_ids):
        client.create_feedback(run_id, key="preference", score=score)


# The main function to predict and log preference for a single example.
def predict_preference(example, project_a, project_b, eval_chain):
    example_id = example.id # Get the ID of the current example.
    print(example) # Print the example for progress tracking.
    # Fetch the runs and predictions for both projects (A and B).
    run_a, pred_a = _get_run_and_prediction(example_id, project_a)
    run_b, pred_b = _get_run_and_prediction(example_id, project_b)
    # Prepare the inputs for the evaluator.
    input_, answer = example.inputs["question"], example.outputs["answer"]
    result = {"input": input_, "answer": answer, "A": pred_a, "B": pred_b}

    # Randomly swap A and B to mitigate positional bias in the LLM judge.
    if random.random() < 0.5:
        result["A"], result["B"] = result["B"], result["A"]
        run_a, run_b = run_b, run_a
    try:
        # Call the pairwise evaluator.
        eval_res = eval_chain.evaluate_string_pairs(
            prediction=result["A"],
            prediction_b=result["B"],
            input=input_,
            reference=answer,
        )
    except Exception as e:
        # Log a warning if the evaluator fails.
        logging.warning(e)
        return result

    # If the evaluator returns a 'None' value (e.g., a tie), we don't log feedback.
    if eval_res["value"] is None:
        return result

    # Determine which run was preferred.
    preferred_run = (run_a.id, "A") if eval_res["value"] == "A" else (run_b.id, "B")
    runner_up_run = (run_b.id, "B") if eval_res["value"] == "A" else (run_a.id, "A")
    # Log the feedback (0 for runner-up, 1 for preferred).
    _log_feedback((runner_up_run[0], preferred_run[0]))
    # Add the preference to our results dictionary.
    result["Preferred"] = preferred_run[1]
    return result
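
Note that in `predict_preference` the value stored under `Preferred` is the position label ("A" or "B") shown to the judge after the random swap, while the feedback logged to LangSmith is attached to the correct run. If you want the local results to name the winning system directly, the verdict has to be mapped back through the shuffle. The sketch below shows one way to do that; `judge_with_random_order` and the length-based `prefers_shorter` judge are hypothetical stand-ins for the LLM evaluator.

```python
import random
from collections import Counter


def judge_with_random_order(output_1, output_2, judge, rng):
    """Show two outputs in random order and map the verdict back to the source system."""
    candidates = [("chain_1", output_1), ("chain_2", output_2)]
    rng.shuffle(candidates)  # Randomize which system appears in position A.
    (name_a, text_a), (name_b, text_b) = candidates
    verdict = judge(text_a, text_b)  # Expected to return "A", "B", or None.
    if verdict == "A":
        return name_a
    if verdict == "B":
        return name_b
    return None  # No preference.


def prefers_shorter(text_a, text_b):
    # Hypothetical stand-in for an LLM judge: prefer the shorter answer, no verdict on equal length.
    if len(text_a) == len(text_b):
        return None
    return "A" if len(text_a) < len(text_b) else "B"


rng = random.Random(0)  # Seeded only so the sketch is reproducible.
pairs = [("short", "much longer answer"), ("a long-winded reply", "brief"), ("same", "same")]
winners = Counter(judge_with_random_order(a, b, prefers_shorter, rng) for a, b in pairs)
print(winners)
```

Because the verdict is translated back to a system name, the tally does not depend on which position each output happened to land in.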

We will use LangChain's off-the-shelf `labeled_pairwise_string` evaluator. By default, this evaluator asks the LLM judge to choose its preference based on helpfulness, relevance, correctness, and depth. For a real application, you would likely want to customize these criteria to match your specific goals.

In [22]:
from langchain.evaluation import load_evaluator # Import the evaluator loading helper.

# Load the pre-built labeled pairwise string evaluator.
pairwise_evaluator = load_evaluator("labeled_pairwise_string")

Now we'll set up a runnable to execute our `predict_preference` function across the entire dataset.

In [23]:
import functools # Import functools to create partial functions.
from langchain_core.runnables import RunnableLambda # Import RunnableLambda to wrap our function.


# Create a partial function with our project names and evaluator pre-filled.
eval_func = functools.partial(
    predict_preference,
    project_a=project_name,
    project_b=project_name_2,
    eval_chain=pairwise_evaluator,
)


# Wrap our partial function in a RunnableLambda to get access to the convenient .batch() method.
runnable = RunnableLambda(eval_func)


In [26]:
# Fetch the list of examples from our dataset.
examples = list(client.list_examples(dataset_id=dataset.id))
# Run the evaluation in a batch across all examples.
values = runnable.batch(examples)

dataset_id=UUID('75eeb2e6-1fa5-435c-8362-8387ca8781cc') inputs={'question': 'Why do I have to set environment variables?'} outputs={'answer': 'Environment variables can tell your LangChain application to perform tracing and contain the information necessary to authenticate to LangSmith. While there are other ways to connect, environment variables tend to be the simplest way to configure your application.'} metadata={'dataset_split': ['base']} id=UUID('035c4797-6c30-4883-8d2c-717a11f194e9') created_at=datetime.datetime(2025, 8, 11, 16, 24, 31, 383719, tzinfo=datetime.timezone.utc) modified_at=datetime.datetime(2025, 8, 11, 16, 24, 31, 383719, tzinfo=datetime.timezone.utc) runs=[] source_run_id=None attachments={}dataset_id=UUID('75eeb2e6-1fa5-435c-8362-8387ca8781cc') inputs={'question': 'Can I trace my Llama V2 llm?'} outputs={'answer': "So long as you are using one of LangChain's LLM implementations, all your calls can be traced"} metadata={'dataset_split': ['base']} id=UUID('3e64a644-

By running the function above, the `"preference"` feedback was automatically logged to the test projects. You can now see these preference scores in the LangSmith UI, allowing you to quickly identify which version is performing better on a case-by-case basis.

![Preference Tags](img/with_preferences.png)

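We can also display the results in a DataFrame to see the side-by-side comparison and the final preference. Keep in mind that `A` and `B` are the positions shown to the judge after the random swap, so they do not map to a fixed chain across rows. A blank `Preferred` cell means the judge returned no preference or the evaluator call failed, and no feedback was logged for that example.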

In [27]:
import pandas as pd # Import pandas.

df = pd.DataFrame(values) # Create a DataFrame from our evaluation results.
df.head(10) # Display the first 10 rows of the DataFrame.

| | input | answer | A | B | Preferred |
|---|---|---|---|---|---|
| 0 | How do I move my project between organizations? | LangSmith doesn't directly support moving proj... | At the moment, LangSmith does not support movi... | At the moment, LangSmith does not support movi... | |
| 1 | Why do I have to set environment variables? | Environment variables can tell your LangChain ... | Setting environment variables is a common prac... | Setting environment variables is necessary in ... | B |
| 2 | Can I trace my Llama V2 llm? | So long as you are using one of LangChain's LL... | To trace your Llama V2 language model using La... | To trace your Llama V2 language model using La... | B |
| 3 | How do I use a traceable decorator? | The traceable decorator is available in the la... | To use the `traceable` decorator in your LangC... | To use the `traceable` decorator in your LangC... | |
| 4 | What's a langsmith dataset? | A LangSmith dataset is a collection of example... | A LangSmith dataset refers to a collection of ... | A LangSmith dataset refers to a collection of ... | A |
| 5 | How might I query for all runs in a project? | client.list_runs(project_name='my-project-name... | To query for all runs in a project using LangS... | To query for all runs in a project using LangS... | |
| 6 | What is LangChain? | LangChain is an open-source framework for buil... | LangChain is a framework for building applicat... | LangChain is a framework for building applicat... | |


## Conclusion

In this walkthrough, you compared two versions of a RAG Q&A chain by first running a standard benchmark and then running a pairwise comparison to determine preference scores. This approach provides a much richer signal than absolute grading alone, especially when aggregate metrics are close.

Pairwise evaluation is a flexible and powerful technique. You can enhance this method by:

- **Ensembling**: Using multiple LLM judges and taking a majority vote to get a more robust preference score.
- **Continuous Scores**: Instructing the model to output a continuous score (e.g., from 1 to 10) for each response instead of just a binary preference.
- **Custom Criteria**: Modifying the evaluator's prompt to judge based on criteria that are most important to your application, such as conciseness, tone, or creativity.
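
As an illustration of the ensembling idea, the sketch below combines the verdicts of several judges for one example. It assumes each judge returns "A", "B", or `None`, as the pairwise evaluator in this tutorial does; `majority_preference` is a hypothetical helper, not a LangChain function.

```python
from collections import Counter


def majority_preference(votes):
    """Return the most-voted label among non-None votes, or None if there is a tie or no votes."""
    counted = Counter(vote for vote in votes if vote is not None)
    if not counted:
        return None
    (top_label, top_count), *rest = counted.most_common()
    if rest and rest[0][1] == top_count:
        return None  # Tied vote: report no preference.
    return top_label


# Illustrative verdicts from three hypothetical judges.
print(majority_preference(["A", "B", "A"]))
```

Using an odd number of judges reduces the chance of a tie, although abstentions (`None`) can still produce one.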

For more examples of advanced evaluation techniques, check out the [evaluation examples](https://python.langchain.com/docs/guides/productionization/evaluation/examples) in the LangChain documentation.