# Evaluating Q&A Systems with Dynamic Data


### 1. Theory: The Challenge of Evaluating on Dynamic Data

In many real-world applications, a Q&A system doesn't operate on static documents; it connects to live, changing data sources like a production database or an external API. For example, a system answering "How many users signed up today?" needs to be correct even though the answer changes every minute. Standard evaluation, which relies on fixed, hand-labeled answers, breaks down in this scenario. A test that passes in the morning might fail in the afternoon simply because the data has changed.

To solve this, we can borrow a classic computer science concept: **indirection**. Instead of storing the ground-truth answers (the labels) as static values in our dataset, we store *references* that tell us how to fetch the correct answer at the moment of evaluation. In this notebook, our "references" will be executable Python code snippets. 
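To make the idea concrete, here is a minimal sketch that uses only the standard library. The `signups` list and the `resolve` helper are hypothetical stand-ins for a live data source and a dereferencing step; they are not part of LangSmith.

```python
# Hypothetical in-memory stand-in for a live data source.
signups = [{"user": "ada"}, {"user": "grace"}]

static_label = 2                   # a value frozen when the dataset was labeled
reference_label = "len(signups)"   # a recipe, resolved only at evaluation time


def resolve(reference: str, namespace: dict):
    """Evaluate a trusted reference expression against the current data."""
    # Only `len` is exposed as a builtin; the data is passed in explicitly.
    return eval(reference, {"__builtins__": {"len": len}}, namespace)


answer_before = resolve(reference_label, {"signups": signups})
signups.append({"user": "linus"})  # the underlying data changes
answer_after = resolve(reference_label, {"signups": signups})

print(static_label, answer_before, answer_after)
```

After the append, the static label is stale while the reference still resolves to the current count.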

This tutorial will walk you through the following steps:

1.  **Create a Dataset**: We will build a dataset where the inputs are questions and the "outputs" are code snippets that can retrieve the live answer from a data source.
2.  **Define the Q&A System**: We will set up a LangChain agent that can query a pandas DataFrame.
3.  **Run Evaluation with a Custom Evaluator**: We will design a custom evaluator in LangSmith that, at runtime, executes the code snippet from the dataset to get the current ground-truth answer before comparing it to the model's prediction.
4.  **Re-test the System Over Time**: We will simulate a change in the underlying data and re-run the evaluation to confirm that the evaluator still grades the system against the current ground truth.

> **Note**: We use a simple CSV file and pandas DataFrame to simulate a dynamic data source. This is for illustrative purposes; in a real-world scenario, this could be a SQL database, a GraphQL API, or any other data source.
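If the source were a SQL database, the reference could be a query string instead of a Python expression. The sketch below uses an in-memory SQLite database; the `signups` table, its columns, and the `dereference` helper are assumptions made purely for illustration.

```python
import sqlite3

# Hypothetical table standing in for a production database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signups (user_id INTEGER PRIMARY KEY, day TEXT)")
conn.executemany(
    "INSERT INTO signups (day) VALUES (?)",
    [("2024-01-01",), ("2024-01-01",)],
)

# The stored label is a query, not a number.
reference_sql = "SELECT COUNT(*) FROM signups WHERE day = '2024-01-01'"


def dereference(connection: sqlite3.Connection, sql: str) -> int:
    """Run a trusted reference query and return its single scalar result."""
    return connection.execute(sql).fetchone()[0]


count_t1 = dereference(conn, reference_sql)
conn.execute("INSERT INTO signups (day) VALUES ('2024-01-01')")  # new signup arrives
count_t2 = dereference(conn, reference_sql)
```

The same caution applies as with Python snippets: only run reference queries that you wrote and trust.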

### 2. Prerequisites and Setup

This tutorial requires OpenAI models and LangChain. First, we will configure our environment variables to connect to the necessary services.

- **`LANGCHAIN_ENDPOINT`**: This URL tells LangChain to send all tracing data to the LangSmith platform.
- **`LANGCHAIN_API_KEY`**: This is your secret key for authenticating with LangSmith.

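**Action Required**: Supply your LangSmith key in one of two ways: uncomment the first cell below and replace `"YOUR API KEY"` with your actual key, or store `LANGSMITH_API_KEY` and `OPENAI_API_KEY` in a `.env` file for the second cell to load.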

In [1]:
# import os # Import the 'os' module to interact with the operating system's environment variables.

# # Update with your API URL if using a hosted instance of Langsmith.
# os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com" # Set the API endpoint for LangSmith.
# os.environ["LANGCHAIN_API_KEY"] = "YOUR API KEY"  # Update with your personal LangSmith API key.

In [2]:
from dotenv import load_dotenv # Import function to load environment variables
import os # Import the 'os' module to interact with the operating system.

# Load environment variables from the .env file. The `override=True` argument
# ensures that variables from the .env file will overwrite existing environment variables.
load_dotenv(dotenv_path=".env", override=True)



# Update with your API URL if using a hosted instance of LangSmith.
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com" # Set the LangSmith API endpoint as an environment variable.

# Fail early with a clear message if a key is missing; assigning None to os.environ raises a TypeError.
for name in ("LANGSMITH_API_KEY", "OPENAI_API_KEY"):
    if not os.getenv(name):
        raise RuntimeError(f"{name} is not set. Add it to your .env file.")

# LangChain reads the LangSmith key from LANGCHAIN_API_KEY, so copy it over.
os.environ["LANGCHAIN_API_KEY"] = os.environ["LANGSMITH_API_KEY"]

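Next, we install the required Python packages and set the OpenAI API key. The code cells in this notebook also import `dotenv`, `langsmith`, `langchain_openai`, and `langchain_experimental`; if any of these is missing from your environment, install `python-dotenv`, `langsmith`, `langchain-openai`, and `langchain-experimental` as well.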

- `langchain[openai]`: Installs the core LangChain library and integrations for OpenAI models.
- `pandas`: The library for data manipulation and analysis, used here as our data source.

**Action Required**: Replace `<YOUR-API-KEY>` with your actual OpenAI key.

In [None]:
# The '%pip install' command installs python packages. '> /dev/null' suppresses the output.
# %pip install -U "langchain[openai]" > /dev/null
# %pip install pandas > /dev/null
# The '%env' magic command sets an environment variable for the notebook session.
# %env OPENAI_API_KEY=<YOUR-API-KEY>

## Step 1: Create a Dataset with Dynamic References

We will use the classic Titanic dataset as our data source. The key difference in our approach is how we define the labels. Instead of calculating the answers beforehand and storing them as static values, we will store Python code snippets that can be executed on the DataFrame to get the correct answer.

This is the principle of **indirection** in action. The label is not the answer itself, but a *recipe for finding the answer*. This ensures that our evaluation always compares against the most up-to-date data.

In [3]:
# Define a list of tuples, where each tuple is a (question, code_snippet) pair.
questions = [
    ("How many passengers were on the Titanic?", "len(df)"),
    ("How many passengers survived?", "df['Survived'].sum()"),
    ("What was the average age of the passengers?", "df['Age'].mean()"),
    ("How many male and female passengers were there?", "df['Sex'].value_counts()"),
    ("What was the average fare paid for the tickets?", "df['Fare'].mean()"),
    ("How many passengers were in each class?", "df['Pclass'].value_counts()"),
    (
        "What was the survival rate for each gender?",
        "df.groupby('Sex')['Survived'].mean()",
    ),
    (
        "What was the survival rate for each class?",
        "df.groupby('Pclass')['Survived'].mean()",
    ),
    (
        "Which port had the most passengers embark from?",
        "df['Embarked'].value_counts().idxmax()",
    ),
    (
        "How many children under the age of 18 survived?",
        "df[df['Age'] < 18]['Survived'].sum()",
    ),
]

Next, we'll create the dataset in LangSmith. This allows us to version our test cases, share them with team members, and associate multiple test runs with the same set of questions over time. We will upload our `questions` list, where the question is the input and the code snippet is the output (our dynamic label).

In [4]:
import uuid # Import the uuid library to generate unique identifiers.

from langsmith import Client # Import the Client class to interact with LangSmith.

client = Client() # Instantiate the LangSmith client.
# Define a unique name for the dataset using a short hex code from a UUID.
dataset_name = f"Dynamic Titanic CSV {uuid.uuid4().hex[:4]}"
# Create the dataset on the LangSmith platform.
dataset = client.create_dataset(
    dataset_name=dataset_name, # The name for the new dataset.
    description="Test QA over CSV", # An optional description for the dataset.
)

# Create all the examples in the dataset in a single API call for efficiency.
client.create_examples(
    # The inputs are a list of dictionaries, each with a 'question' key.
    inputs=[{"question": example[0]} for example in questions],
    # The outputs are a list of dictionaries, each with a 'code' key containing the reference snippet.
    outputs=[{"code": example[1]} for example in questions],
    dataset_id=dataset.id, # Link these examples to the dataset we just created.
)

{'example_ids': ['1f32646b-7bd3-4568-b756-96c21eb9a50d',
  'eb04bc28-e740-42cd-a269-deb9ae1fa138',
  'd43b0eef-f995-4e99-b6aa-a7a7831615d2',
  '85f6c394-2f60-4fa0-a1a7-468e773c3286',
  'd24d4150-48dd-4354-a5af-85646e3b748f',
  'b9518d7f-ea78-4cc0-b227-7c6432e24794',
  '6e88725e-20d4-4fdf-a8fd-d2b4ae59c343',
  'e0c497ae-0278-4d46-80fb-ce7e5dd4b539',
  '7c7658a5-5826-4c77-b371-1eecf4612c5c',
  '7047638c-a8f2-46eb-ba80-c5c89163e4f9'],
 'count': 10}

## Step 2: Define the Q&A System

With the dataset created, it's time to define our question-answering system. For this tutorial, we'll use a pre-built LangChain component: the **pandas dataframe agent**. This agent is specifically designed to answer questions about a pandas DataFrame by generating and executing Python code.

First, we load the Titanic data into a DataFrame. Then, we create a constructor function for our agent that we can pass to the evaluator.

In [6]:
import pandas as pd # Import the pandas library for data manipulation.

# The URL of the raw CSV file for the Titanic dataset.
titanic_path = "https://raw.githubusercontent.com/jorisvandenbossche/pandas-tutorial/master/data/titanic.csv"
# Read the CSV data from the URL into a pandas DataFrame.
df = pd.read_csv(titanic_path)

In [7]:
df.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


Now, we define the `predict` function. This function will be our "system under test". For each run, it initializes a new pandas dataframe agent with our designated LLM and the current state of the DataFrame `df`. It then invokes the agent with the user's question.

In [16]:
from langchain_experimental.agents import create_pandas_dataframe_agent # Import the agent constructor.
from langchain_openai import ChatOpenAI # Import the OpenAI chat model wrapper.

# Initialize the LLM. This run uses gpt-3.5-turbo; the commented-out gpt-4-turbo-preview is a stronger
# alternative for code generation. Temperature 0 makes the outputs as repeatable as possible.
llm = ChatOpenAI(
    # model="gpt-4-turbo-preview",
    model="gpt-3.5-turbo",
    temperature=0.0,
)


# Define the function to be evaluated.
def predict(inputs: dict):
    # Create a fresh pandas dataframe agent on each call so it sees the current contents of `df`.
    # `allow_dangerous_code=True` is required because the agent executes the Python it generates.
    agent = create_pandas_dataframe_agent(
        agent_type="openai-tools",
        llm=llm,
        df=df,
        allow_dangerous_code=True,
    )
    # Invoke the agent with the question from the input dictionary.
    return agent.invoke({"input": inputs["question"]})

In [17]:
# Run an example prediction to see the agent in action.
predict({"question": "How many passengers were on the Titanic?"})

{'input': 'How many passengers were on the Titanic?',
 'output': 'There were 891 passengers on the Titanic.'}
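
One detail of `predict` matters later in this walkthrough: it reads the module-level `df` each time it is called rather than capturing it once. The sketch below illustrates that lookup behavior with a plain list; `rows` and `count_rows` are hypothetical stand-ins for `df` and `predict`.

```python
rows = [1, 2, 3]  # hypothetical stand-in for the module-level DataFrame


def count_rows(inputs: dict) -> dict:
    # `rows` is looked up when the function runs, not when it is defined,
    # so the function always sees whatever the name currently points to.
    return {"output": len(rows)}


first = count_rows({"question": "How many rows?"})
rows = rows + rows  # rebind the name to new data, like reassigning `df`
second = count_rows({"question": "How many rows?"})
```

Because of this, reassigning `df` is enough for the next evaluation run to use the new data without redefining `predict`.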

## Step 3: Run Evaluation with a Custom Evaluator

This is the most critical part of our setup. We need an evaluator that understands our dynamic labels. We'll create a custom evaluator by inheriting from `LabeledCriteriaEvalChain`. This base class is an LLM-powered evaluator that assesses a prediction based on a given criterion (e.g., "correctness") and a reference label.

Our customization is simple but powerful: we will override the `_get_eval_input` method. This method is responsible for preparing the inputs that get passed to the evaluator's LLM. In our overridden version, we will first call the parent method to get the standard inputs, and then we will **execute the `reference` value (our code snippet) using Python's `eval()` function**. This replaces the code snippet with its live result.

The result is that the evaluator's LLM never sees the code; it only sees the prediction and the freshly fetched, up-to-the-minute correct answer.

> **Security Warning**: Using `eval()` on untrusted code is extremely dangerous as it can execute arbitrary commands. In this tutorial, we are only evaluating code that we have written ourselves in a controlled environment. **Never** use this `eval()` approach in a production system where the code snippets could come from untrusted users.
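
One inexpensive guardrail, sketched below, is to check each snippet before it is ever evaluated. This is an illustration under assumptions, not a sandbox: the `check_snippet` helper and its allow-list are hypothetical, and attribute access on `df` can still do harmful things, so the check only catches obvious mistakes.

```python
import ast

ALLOWED_NAMES = frozenset({"df", "len"})  # hypothetical allow-list for this dataset


def check_snippet(code: str) -> None:
    """Raise if `code` is not a single expression over the allowed names."""
    # mode="eval" rejects statements such as imports or assignments.
    tree = ast.parse(code, mode="eval")
    names = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    unexpected = names - ALLOWED_NAMES
    if unexpected:
        raise ValueError(f"Unexpected names in snippet: {sorted(unexpected)}")


check_snippet("df.groupby('Sex')['Survived'].mean()")  # passes silently

try:
    check_snippet("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
```

Running a check like this over every snippet before uploading the dataset gives early feedback on typos and stray names.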

In [18]:
from typing import Optional # Import typing hints.

from langchain.evaluation.criteria.eval_chain import LabeledCriteriaEvalChain # Import the base class for our custom evaluator.


# Define our custom evaluator by inheriting from the base class.
class CustomCriteriaEvalChain(LabeledCriteriaEvalChain):
    def _get_eval_input(
        self,
        prediction: str,
        reference: Optional[str],
        input: Optional[str],
    ) -> dict:
        # First, get the standard dictionary of inputs from the parent class.
        raw = super()._get_eval_input(prediction, reference, input)
        # This is the key step: we take the 'reference' (our code snippet) and execute it.
        # The result of the execution replaces the code snippet in the dictionary.
        # WARNING: This uses `eval`, which is a security risk with untrusted code.
        raw["reference"] = eval(raw["reference"])
        # Return the modified dictionary with the live, dereferenced answer.
        return raw
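
The override above relies on `eval` finding `df` in the notebook's global scope. A variation, sketched here with a hypothetical stand-in base class so that it runs without LangChain, passes the data in an explicit namespace and exposes only the builtins the snippets need. `BaseEvaluator`, `DereferencingEvaluator`, and `rows` are illustrative names, not LangChain APIs.

```python
from typing import Optional


class BaseEvaluator:
    """Hypothetical stand-in for the parent evaluator class."""

    def _get_eval_input(
        self, prediction: str, reference: Optional[str], input: Optional[str]
    ) -> dict:
        return {"input": input, "output": prediction, "reference": reference}


class DereferencingEvaluator(BaseEvaluator):
    def __init__(self, namespace: dict):
        self.namespace = namespace  # names the snippets may refer to

    def _get_eval_input(self, prediction, reference, input) -> dict:
        raw = super()._get_eval_input(prediction, reference, input)
        # Evaluate the trusted snippet with a minimal set of builtins.
        value = eval(raw["reference"], {"__builtins__": {"len": len}}, self.namespace)
        raw["reference"] = str(value)  # convert to text for the judge prompt
        return raw


rows = ["a", "b", "c"]  # stand-in for the DataFrame
evaluator = DereferencingEvaluator({"df": rows})
eval_input = evaluator._get_eval_input("There are 3 rows.", "len(df)", "How many rows?")
```

Restricting the builtins narrows what a snippet can reach, but it does not make `eval` safe for untrusted input.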

With our custom evaluator class defined, we can now configure and run the full evaluation using LangSmith's `evaluate` function. This will iterate through our dataset, run our `predict` function on each question, and use our `CustomCriteriaEvalChain` to score the correctness of each result against the live data.

In [19]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate # Import the necessary evaluation functions.

# Instantiate our custom evaluator. We'll use GPT-4 as the judge for high-quality grading.
base_evaluator = CustomCriteriaEvalChain.from_llm(
    criteria="correctness", 
    llm=ChatOpenAI(model="gpt-4", 
                   temperature=0.0
                   )
)


# Define a helper function to prepare the data format that our evaluator expects.
def prepare_inputs(run, example):
    return {
        # The agent returns both 'input' and 'output' keys, so select the answer explicitly
        # rather than taking the first value in the dictionary.
        "prediction": run.outputs["output"],
        "reference": example.outputs["code"], # Get the reference (our code snippet).
        "input": example.inputs["question"], # Get the original input question.
    }


# Wrap our custom evaluator in a LangChainStringEvaluator to make it compatible with the `evaluate` function.
criteria_evaluator = LangChainStringEvaluator(
    base_evaluator, prepare_data=prepare_inputs
)
# Run the evaluation.
chain_results = evaluate(
    predict, # The function representing our Q&A system.
    data=dataset_name, # The name of our dataset in LangSmith.
    evaluators=[criteria_evaluator], # The list of evaluators to apply.
    # The pandas agent does not currently support parallel execution.
    max_concurrency=1,
    metadata={
        "time": "T1", # Add metadata to tag this run as our first time point.
    },
)



View the evaluation results for experiment: 'new-wave-52' at:
https://smith.langchain.com/o/0212d326-bd9d-42bb-9937-c063f40f2361/datasets/56494b0e-42e9-4e60-a3aa-e12c549204dd/compare?selectedSessions=0d076163-4e07-4f13-83ac-033c8c703b18




10it [00:52,  5.30s/it]


Once the evaluation finishes, you can navigate to the linked experiment in LangSmith to review the agent's predictions and the feedback scores from our custom evaluator.
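
For references that resolve to a single number, a deterministic check can complement the LLM judge. The following is a sketch under stated assumptions: `numeric_match` is a hypothetical helper, the 1% tolerance is arbitrary, and it only handles plain decimal numbers written with optional thousands separators.

```python
import math
import re


def numeric_match(prediction: str, reference: float, rel_tol: float = 1e-2) -> bool:
    """Return True if any number in the prediction text is close to the reference."""
    # Find digit runs that may contain thousands separators and a decimal part.
    matches = re.findall(r"-?\d[\d,]*\.?\d*", prediction)
    numbers = [float(match.replace(",", "")) for match in matches]
    return any(math.isclose(number, reference, rel_tol=rel_tol) for number in numbers)


matches_t1 = numeric_match("There were 891 passengers on the Titanic.", 891)
matches_t2 = numeric_match("After the update there were 1,782 passengers.", 1782)
stale_answer = numeric_match("There were 891 passengers.", 1782)
```

A check like this cannot grade grouped answers such as per-class counts, which is where the LLM judge remains useful.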

## Step 4: Re-evaluate After Data Changes

Now, we'll demonstrate the power of our dynamic evaluation setup. While the Titanic dataset is static, we can simulate a data update in a real-world system. We will modify the DataFrame by duplicating all the rows and shuffling the `Age` and `Sex` columns. This changes the correct answer to many of the questions in our dataset: the counts double and the per-gender survival rates shift, while values such as the average fare and the per-class survival rates stay the same.

Because our dataset contains *instructions* for finding the answer, not the answers themselves, we can re-run the exact same evaluation on the new data and get a meaningful correctness score.

In [27]:
# Simulate a data update by doubling the number of rows.
df_doubled = pd.concat([df, df], ignore_index=True)
# Shuffle some of the columns to make the data changes less trivial.
df_doubled["Age"] = df_doubled["Age"].sample(frac=1).reset_index(drop=True)
df_doubled["Sex"] = df_doubled["Sex"].sample(frac=1).reset_index(drop=True)
# Overwrite the original DataFrame with the new, modified data.
df = df_doubled
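
It can help to reason about which reference answers a change like this actually moves. Duplicating rows doubles counts but leaves means and rates untouched, and shuffling a column preserves that column's own mean and value counts while changing anything grouped or filtered by it. The sketch below checks the first part with the five `Fare` values shown earlier, using plain lists rather than pandas.

```python
import math
import random
from statistics import mean

fares = [7.25, 71.2833, 7.925, 53.1, 8.05]  # first five Fare values shown earlier
doubled = fares + fares                     # same effect as concatenating the data with itself

count_doubles = len(doubled) == 2 * len(fares)
mean_unchanged = math.isclose(mean(doubled), mean(fares))

# A shuffled copy, analogous to sample(frac=1) on a single column.
shuffled = random.sample(doubled, k=len(doubled))
shuffle_keeps_mean = math.isclose(mean(shuffled), mean(doubled))
```
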

Now, we run the evaluation again. Note that the code is identical to our first evaluation run, except for the metadata tag, which we'll change to `"T2"` to signify the second time point.

In [28]:
# Re-run the evaluation on the modified DataFrame.
chain_results = evaluate(
    predict, # The same Q&A system function.
    data=dataset_name, # The same dataset of questions and code snippets.
    evaluators=[criteria_evaluator], # The same custom evaluator.
    max_concurrency=1, # The agent still doesn't support concurrent runs.
    metadata={
        "time": "T2", # Update the metadata to mark this as the second run.
    },
)

View the evaluation results for experiment: 'best-store-65' at:
https://smith.langchain.com/o/0212d326-bd9d-42bb-9937-c063f40f2361/datasets/56494b0e-42e9-4e60-a3aa-e12c549204dd/compare?selectedSessions=5eb4da16-3a41-417f-91ab-ba1089d9254a




10it [00:48,  4.84s/it]


#### Review the results

Now that you've tested twice on the "changing" data source, you can check out the results in LangSmith. If you navigate to the dataset's page and click on the "Examples" tab, you can select any question and see the predictions from both test runs side-by-side. 

![Examples Table Page](./img/dynamic_data_examples_list.png)
 
Let's inspect the example for the question, "How many male and female passengers were there?". The table of linked runs clearly shows two different predictions for our two test runs (`T1` and `T2`).

- In the first run, the agent correctly predicted 577 male and 314 female passengers.
- In the second run, after we doubled the data, it correctly predicted 1154 male and 628 female passengers.

Crucially, **both test runs were marked as correct**. This shows the evaluation setup working as intended: the agent's predictions changed to reflect the new data, and our evaluator fetched the new ground truth, confirming that both answers were correct *at the time they were generated*.

![Examples Page](./img/dynamic_data_example_page.png)

To be absolutely sure, we can inspect the traces of the evaluator itself. By clicking on the "correctness" feedback chips, we can see exactly what inputs the evaluator's LLM received. The screenshots below show the `reference` value that was passed to the LLM judge. You can see that for the `T1` run, the dereferenced value was `(577, 314)`, and for the `T2` run, it was `(1154, 628)`. This confirms our custom evaluator is successfully dereferencing the labels and fetching the live data before making its judgment.

![Evaluator Trace at T1](./img/dynamic_data_feedback_trace_t1.png)

![Evaluator Trace at T2](./img/dynamic_data_feedback_trace_t2.png)


## Conclusion

In this walkthrough, you have learned a powerful technique for evaluating a Q&A system connected to a dynamic, evolving data store. By using a custom evaluator that dynamically fetches the ground-truth answer based on a static reference (a code snippet), you can create a robust, end-to-end testing process that remains valid even as your data changes.

This approach directly tests the correctness of your system on up-to-date data and is excellent for periodic performance checks.

However, it's important to be aware of the trade-offs. This method is less suitable for A/B testing two different prompts or models, as a change in the underlying data between the two test runs could confound the results. Additionally, as highlighted, the use of `eval` requires extreme caution and should only be done in a secure environment with trusted code.

Other strategies to consider for this problem include:
- **Freezing or Mocking Data**: For A/B tests, you can use a static, frozen snapshot of your data source. This ensures that both models are tested on the exact same data, providing a fair comparison. The snapshot should be updated periodically to remain representative of the production environment.
- **Evaluating Query Generation**: Instead of testing the final answer, you could test the agent's ability to generate the *correct query* (e.g., the correct SQL or Python code). This isolates the agent's logic from the data itself but is a less end-to-end test.
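
The frozen-snapshot strategy can be sketched as follows. Everything here is illustrative: `live_rows`, `variant_a`, and `variant_b` are hypothetical stand-ins for a production data source and two systems under comparison.

```python
import copy

# Hypothetical live data source.
live_rows = [
    {"sex": "female", "survived": 1},
    {"sex": "male", "survived": 0},
    {"sex": "female", "survived": 1},
]


def variant_a(rows: list) -> int:
    """First system under comparison: sums the survival flags."""
    return sum(row["survived"] for row in rows)


def variant_b(rows: list) -> int:
    """Second system under comparison: counts the rows flagged as survived."""
    return len([row for row in rows if row["survived"] == 1])


# Freeze a deep copy so later writes to the live data cannot leak in.
snapshot = copy.deepcopy(live_rows)

result_a = variant_a(snapshot)
live_rows.append({"sex": "male", "survived": 1})  # the live source moves on
result_b = variant_b(snapshot)                    # still evaluated on the same data
```

Both variants are scored against identical data, so any difference between them is due to the systems rather than to drift in the source.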

We hope this gives you a solid framework for thinking about and implementing evaluations for your own dynamic Q&A systems! 

If you have questions or suggestions, please let us know at support@langchain.dev.