# Q&A System Correctness

### 1. Theory: The Challenge of Evaluating Q&A Systems

Evaluating a question-and-answer (Q&A) system, especially one powered by a Large Language Model (LLM), is a complex task. Unlike simple classification or regression models, a Q&A system's output is unstructured, free-form text. A single question can have multiple valid answers, each phrased differently.

Traditional text metrics like BLEU or ROUGE, which measure word overlap, often fail to capture the semantic correctness of a lengthy response. An answer could be factually correct but use different words, leading to a low score, or it could be factually wrong while using many of the same keywords as the reference answer. 
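To make this failure mode concrete, here is a minimal sketch using a simplified unigram-overlap F1 score. It is a rough stand-in for ROUGE-1, not the official implementation, and the function name and example sentences are illustrative assumptions.

```python
import re
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified word-overlap F1 (a rough stand-in for ROUGE-1)."""
    cand = Counter(re.findall(r"[a-z0-9']+", candidate.lower()))
    ref = Counter(re.findall(r"[a-z0-9']+", reference.lower()))
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "LangSmith does not support moving projects between organizations."
correct_paraphrase = "You cannot transfer a project to another organization in LangSmith."
wrong_answer = "LangSmith does support moving projects between organizations."

print(f"correct paraphrase: {unigram_f1(correct_paraphrase, reference):.2f}")
print(f"wrong answer:       {unigram_f1(wrong_answer, reference):.2f}")
```

The factually wrong answer scores much higher than the correct paraphrase, because it shares almost every word with the reference.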

To overcome this, we can leverage **LLM-assisted evaluation**. This approach uses another powerful LLM as an impartial "judge" to grade the Q&A system's response based on a reference answer. This allows for a more nuanced understanding of correctness, complementing human review and providing a scalable way to measure performance.
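The judge pattern can be sketched in a few lines. This is an illustrative outline only: the prompt wording, the `GRADE:` output convention, and the names `grade`, `parse_grade`, and `fake_judge` are assumptions made for this example, not the prompt or code that LangSmith uses internally. The judge is any callable that maps a prompt string to a response string, so a real LLM client could be swapped in for the stub.

```python
import re
from typing import Callable

GRADING_TEMPLATE = """You are grading a student's answer against a reference answer.
Question: {question}
Reference answer: {reference}
Student answer: {prediction}
Reply with a short justification, then a final line "GRADE: CORRECT" or "GRADE: INCORRECT"."""


def parse_grade(judge_output: str) -> int:
    """Map the judge's verdict to 1 (correct) or 0 (incorrect)."""
    match = re.search(r"GRADE:\s*(CORRECT|INCORRECT)", judge_output)
    if match is None:
        raise ValueError("Judge output did not contain a verdict")
    return 1 if match.group(1) == "CORRECT" else 0


def grade(judge: Callable[[str], str], question: str, reference: str, prediction: str) -> int:
    """Build the grading prompt, call the judge, and parse its verdict."""
    prompt = GRADING_TEMPLATE.format(
        question=question, reference=reference, prediction=prediction
    )
    return parse_grade(judge(prompt))


def fake_judge(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM call so the sketch runs offline.
    return "The student answer matches the reference.\nGRADE: CORRECT"


score = grade(
    fake_judge,
    question="What is LangChain?",
    reference="A framework for building LLM applications.",
    prediction="An open-source framework for LLM apps.",
)
print(score)
```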

In this walkthrough, we will use **LangSmith** to evaluate the correctness of a Retrieval-Augmented Generation (RAG) Q&A system. The process will follow these key steps:

1.  **Create a Dataset**: We will build a collection of question-and-answer pairs to serve as our ground truth.
2.  **Define the Q&A System**: We will construct a RAG pipeline that retrieves information from documentation and generates an answer.
3.  **Run Evaluation**: We will use LangSmith to run our system against the dataset and automatically grade the responses for correctness.
4.  **Iterate and Improve**: We will analyze the results to identify failures, modify our system, and re-evaluate to confirm the improvements.

The complete test run, including all feedback and traces, will be saved in a LangSmith project for easy analysis.

![test project](./img/test_project.png)

> **Note 1:** This walkthrough focuses on testing the end-to-end performance of the system. It's also crucial to evaluate individual components. For instance, the retriever can be tested separately using standard information retrieval metrics (e.g., hit rate, MRR) to ensure it's fetching relevant documents effectively.
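As a rough illustration of the retrieval metrics mentioned in Note 1, the sketch below computes hit rate at k and mean reciprocal rank (MRR) from ranked document IDs. The document IDs and relevance labels are made-up placeholders, and the helper names are assumptions for this example.

```python
from typing import Sequence, Set


def hit_rate(retrieved: Sequence[Sequence[str]], relevant: Sequence[Set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant document in the top k."""
    hits = sum(
        any(doc_id in rel for doc_id in docs[:k])
        for docs, rel in zip(retrieved, relevant)
    )
    return hits / len(retrieved)


def mean_reciprocal_rank(retrieved: Sequence[Sequence[str]], relevant: Sequence[Set[str]]) -> float:
    """Average of 1/rank of the first relevant document (0 when none is found)."""
    total = 0.0
    for docs, rel in zip(retrieved, relevant):
        for rank, doc_id in enumerate(docs, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(retrieved)


# Placeholder data: ranked results for three queries and their relevant IDs.
retrieved = [["d1", "d2", "d3"], ["d4", "d5", "d6"], ["d7", "d8", "d9"]]
relevant = [{"d2"}, {"d4"}, {"d0"}]

print(hit_rate(retrieved, relevant, k=3))
print(mean_reciprocal_rank(retrieved, relevant))
```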

> **Note 2:** If your knowledge base (the documents your system answers from) is constantly changing, your reference answers might become outdated. It's important to have a strategy to manage this, such as freezing the knowledge source during testing or regularly updating your evaluation dataset.

### 2. Prerequisites and Setup

First, we'll configure our environment variables. These are essential for connecting our code to the LangSmith and OpenAI services.

- **`LANGCHAIN_ENDPOINT`**: This URL tells LangChain to send all tracing data to the LangSmith platform.
- **`LANGCHAIN_API_KEY`**: This is your secret key for authenticating with LangSmith.
- **`LANGCHAIN_PROJECT`**: (Optional) Groups related runs in LangSmith under a named project. It is not set in the cell below, but setting it is highly recommended for organization.

This tutorial uses OpenAI models, ChromaDB for the vector store, and LangChain for building the RAG chain. You will also need to set your OpenAI API key.

In [1]:
from dotenv import load_dotenv # Import function to load environment variables
import os # Import the 'os' module to interact with the operating system.

# Load environment variables from the .env file. The `override=True` argument
# ensures that variables from the .env file will overwrite existing environment variables.
load_dotenv(dotenv_path=".env", override=True)



# Update with your API URL if using a hosted instance of LangSmith.
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"  # Set the LangSmith API endpoint as an environment variable.
# Update with your API keys. These assignments raise a TypeError if the corresponding variable is not set.
os.environ["LANGCHAIN_API_KEY"] = os.getenv("LANGSMITH_API_KEY")  # Set your LangSmith API key as an environment variable.
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY")  # Set your OpenAI API key as an environment variable.
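Because assigning `None` to `os.environ` fails with an unhelpful `TypeError`, it can be worth checking for missing keys first. The helper below is a small sketch of such a check; `missing_env_vars` is a hypothetical name, and the list of required variables simply mirrors the ones read in the cell above.

```python
import os
from typing import List, Mapping, Optional, Sequence


def missing_env_vars(names: Sequence[str], env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


required = ["LANGSMITH_API_KEY", "OPENAI_API_KEY"]
missing = missing_env_vars(required)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```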

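Next, we install the required Python packages. The `%pip` commands below install them directly within the notebook; they are commented out, so uncomment them on the first run. The later cells also import from `langsmith`, `langchain-community`, `langchain-openai`, `langchain-text-splitters`, and `python-dotenv`, so install those too if your environment lacks them.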

- `langchain[openai]`: Installs the core LangChain library along with the specific integrations for OpenAI models.
- `chromadb`: The vector database we will use to store and retrieve document embeddings.
- `lxml`: A robust parser for HTML and XML, used by our document loader.
- `html2text`: A utility to convert HTML into clean, readable plain text.

In [2]:
# The '%pip install' command installs python packages. The '> /dev/null' part suppresses the output for a cleaner notebook.
# %pip install -U "langchain[openai]" > /dev/null
# %pip install chromadb > /dev/null
# %pip install lxml > /dev/null
# %pip install html2text > /dev/null

If you did not provide your OpenAI API key through the `.env` file above, set it here. The key is required to use OpenAI's embedding and language models. Replace `<YOUR-API-KEY>` with your actual key.

In [3]:
# The '%env' magic command sets an environment variable for the notebook session.
# %env OPENAI_API_KEY=<YOUR-API-KEY>

## Step 1. Create a Dataset

A high-quality dataset is the foundation of any reliable evaluation. For our Q&A system, the dataset will consist of question-answer pairs. The questions represent typical user queries, and the answers are the "ground truth" or reference responses we expect the system to provide.

For this example, we'll create a dataset about LangSmith documentation. We have hard-coded a few examples below. For a real-world scenario, it's best to have a much larger dataset (e.g., >100 examples) to get statistically significant results. These examples should ideally be sourced from real user interactions to ensure they are representative.
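To see why dataset size matters, the sketch below computes a 95% Wilson score interval for an observed accuracy. The counts are hypothetical and chosen only to contrast a 7-example dataset with a 100-example one; they are not results from this walkthrough.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Hypothetical counts: similar accuracy, very different sample sizes.
for correct, total in [(6, 7), (86, 100)]:
    low, high = wilson_interval(correct, total)
    print(f"{correct}/{total}: accuracy plausibly between {low:.2f} and {high:.2f}")
```

With only 7 examples the interval is considerably wider than with 100, so a one-example change in the pass count says little about the system's true accuracy.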

In [2]:
# We define a list of tuples, where each tuple contains a (question, answer) pair.
examples = [
    (
        "What is LangChain?", # This is the input question.
        "LangChain is an open-source framework for building applications using large language models. It is also the name of the company building LangSmith.", # This is the reference answer.
    ),
    (
        "How might I query for all runs in a project?",
        "client.list_runs(project_name='my-project-name'), or in TypeScript, client.listRuns({projectName: 'my-project-name'})",
    ),
    (
        "What's a langsmith dataset?",
        "A LangSmith dataset is a collection of examples. Each example contains inputs and optional expected outputs or references for that data point.",
    ),
    (
        "How do I use a traceable decorator?",
        """The traceable decorator is available in the langsmith python SDK. To use, configure your environment with your API key, \
import the required function, decorate your function, and then call the function. Below is an example:
```python
from langsmith.run_helpers import traceable
@traceable(run_type="chain") # or "llm", etc.
def my_function(input_param):
    # Function logic goes here
    return output
result = my_function(input_param)
```""",
    ),
    (
        "Can I trace my Llama V2 llm?",
        "So long as you are using one of LangChain's LLM implementations, all your calls can be traced.",
    ),
    (
        "Why do I have to set environment variables?",
        "Environment variables can tell your LangChain application to perform tracing and contain the information necessary to authenticate to LangSmith."
        " While there are other ways to connect, environment variables tend to be the simplest way to configure your application.",
    ),
    (
        "How do I move my project between organizations?",
        "LangSmith doesn't directly support moving projects between organizations.",
    ),
]

Now, let's create a LangSmith client, which is our main entry point for interacting with the LangSmith platform.

In [3]:
from langsmith import Client # Import the Client class from the langsmith library.

client = Client() # Instantiate the client. It will automatically use the environment variables we set earlier.

Using the client, we will programmatically create a new dataset in LangSmith and populate it with our examples. We add a unique identifier (`uuid`) to the dataset name to prevent naming conflicts if we run this notebook multiple times.

In [4]:
import uuid # Import the uuid library to generate unique identifiers.

# Define a unique name for the dataset using a UUID to avoid collisions.
dataset_name = f"Retrieval QA Questions {str(uuid.uuid4())}"
# Create the dataset on the LangSmith platform and get back a dataset object.
dataset = client.create_dataset(dataset_name=dataset_name)
# Loop through our list of hard-coded question-answer pairs.
for q, a in examples:
    # For each pair, create an example in our LangSmith dataset.
    client.create_example(
        inputs={"question": q}, # The input dictionary must have keys that match what our chain expects.
        outputs={"answer": a}, # The output dictionary contains the ground truth reference answer.
        dataset_id=dataset.id # We specify which dataset to add this example to.
    )

In [5]:
dataset.id

UUID('d07a59ea-1e4b-43fb-a680-adc79ef5f87d')

In [6]:
dataset_name

'Retrieval QA Questions 6b1fd88c-9cce-4611-8729-3e6f28046f6e'

## Step 2. Define the RAG Q&A System

Now we'll build our Q&A system. We are using a **Retrieval-Augmented Generation (RAG)** architecture. This is a powerful pattern for building knowledgeable LLM systems. A RAG system works in two main stages:

1.  **Retrieval**: Given a user's question, the system first retrieves relevant information from a knowledge base. In our case, this knowledge base is the LangSmith documentation. This stage consists of:
    -   An **Embedding Model** (`OpenAIEmbeddings`): Converts both the documents and the user's question into numerical vectors (embeddings).
    -   A **Vector Store** (`Chroma`): A specialized database that stores the document vectors and allows for efficient searching to find vectors (and thus documents) that are most similar to the question vector.
    -   A **Retriever**: The component that orchestrates the search in the vector store and returns the most relevant documents.

2.  **Generation**: The retrieved documents are then passed to an LLM, along with the original question, to generate a final, synthesized answer. This stage consists of:
    -   A **Prompt Template** (`ChatPromptTemplate`): Structures the input for the LLM, combining the retrieved context and the user's question with instructions on how to answer.
    -   An **LLM** (`ChatOpenAI`): The language model that reads the prompt and generates the textual response.

We will use LangChain Expression Language (LCEL) to elegantly combine these components into a single, executable chain.
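Before wiring up the real components, the two stages can be illustrated with a toy, dependency-free sketch. Everything here is a simplification made for this example: word counts stand in for `OpenAIEmbeddings`, a sorted list stands in for `Chroma`, and `build_prompt` stands in for the prompt template. The sample documents are placeholders.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    # Toy "embedding": bag-of-words counts instead of a learned vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, docs: List[str], k: int) -> List[str]:
    """Retrieval stage: return the k documents most similar to the question."""
    query_vec = embed(question)
    ranked = sorted(docs, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, docs: List[str], k: int = 2) -> str:
    """Generation stage input: retrieved context plus the original question."""
    context = "\n".join(retrieve(question, docs, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


docs = [
    "LangSmith datasets are collections of examples",
    "The traceable decorator logs function calls as runs",
    "Feedback can be attached to a run",
]
print(build_prompt("What is a dataset of examples?", docs))
```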

First, let's load and process the documents to populate our vector store.

In [8]:
from langchain_community.document_loaders import RecursiveUrlLoader # A loader for recursively scraping a website.
from langchain_community.document_transformers import Html2TextTransformer # A transformer to convert HTML content to plain text.
from langchain_community.vectorstores import Chroma # The Chroma vector store implementation.
from langchain_text_splitters import TokenTextSplitter # A text splitter that splits based on token count.
from langchain_openai import OpenAIEmbeddings # The class for using OpenAI's embedding models.

# Initialize a loader that recursively fetches pages starting from the LangSmith documentation root (links are followed up to the loader's max_depth setting).
api_loader = RecursiveUrlLoader("https://docs.smith.langchain.com")
# Initialize a text splitter to break large documents into smaller chunks.
text_splitter = TokenTextSplitter(
    model_name="gpt-3.5-turbo", # The model used to count tokens for splitting.
    chunk_size=2000, # The maximum size of each chunk in tokens.
    chunk_overlap=200, # The number of tokens to overlap between consecutive chunks.
)
# Initialize a transformer to clean up the raw HTML.
doc_transformer = Html2TextTransformer()
# Load the raw documents from the URL.
raw_documents = api_loader.load()
# Transform the raw HTML documents into plain text.
transformed = doc_transformer.transform_documents(raw_documents)
# Split the transformed documents into smaller, manageable chunks.
documents = text_splitter.split_documents(transformed)
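The effect of `chunk_size` and `chunk_overlap` can be illustrated with a sliding window over a token list. This is a simplified sketch: it splits on whitespace rather than using a model tokenizer, and `chunk_tokens` is a hypothetical helper, not `TokenTextSplitter`'s actual implementation.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once a window has reached the end of the token list.
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = "a b c d e f g h i j".split()
chunks = chunk_tokens(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk begins with the last token of the previous one, which is how overlap preserves context that would otherwise be cut at a chunk boundary.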

With the documents processed, we can now create the vector store and the retriever. The vector store will embed and index our document chunks, and the retriever will provide the interface for searching them.

In [9]:
# Initialize the OpenAI embeddings model.
embeddings = OpenAIEmbeddings()
# Create a Chroma vector store from the documents, using the OpenAI embeddings model.
vectorstore = Chroma.from_documents(documents, embeddings)
# Create a retriever from the vector store, configured to return the top 4 most relevant documents.
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

Next, we define the response generation part of our RAG chain. This involves creating a prompt template that will be populated with the retrieved context and the user's question, and then an LLM to generate the final answer.

In [10]:
from datetime import datetime # Import datetime to include the current time in the prompt.
from langchain_core.output_parsers import StrOutputParser # Import the string output parser.
from langchain_core.prompts import ChatPromptTemplate # Import the chat prompt template class.
from langchain_openai import ChatOpenAI # Import the OpenAI chat model class.

# Define the prompt template. This structures the input for the LLM.
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system", # The system message provides high-level instructions to the model.
            "You are a helpful documentation Q&A assistant, trained to answer"
            " questions from LangSmith's documentation."
            " LangChain is a framework for building applications using large language models."
            "\nThe current time is {time}.\n\nRelevant documents will be retrieved in the following messages.",
        ),
        ("system", "{context}"), # A placeholder for the retrieved documents (context).
        ("human", "{question}"), # A placeholder for the user's question.
    ]
).partial(time=str(datetime.now())) # Pre-fill the 'time' variable with the current time.

# Initialize the LLM we'll use for generation. We use gpt-3.5-turbo with a large context window and low temperature for factual answers.
model = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0)
# Define the response generator part of the chain using LCEL. It pipes the prompt to the model, then to an output parser.
response_generator = prompt | model | StrOutputParser()

Finally, we assemble the full RAG chain using LangChain Expression Language (LCEL). This chain will seamlessly connect the retrieval and generation steps.

In [11]:
# The full chain combines the retriever and the response generator.
from operator import itemgetter # Import itemgetter for convenient data routing.

chain = (
    # A Runnable Map takes the input dictionary and prepares a new dictionary for the next step.
    {
        # The 'context' key is populated by a sub-chain: get the question, pass it to the retriever, and format the resulting documents.
        "context": itemgetter("question")
        | retriever
        | (lambda docs: "\n".join([doc.page_content for doc in docs])),
        # The 'question' key is passed through directly from the input.
        "question": itemgetter("question"),
    }
    | response_generator # The output of the map is piped into our response generator chain.
)
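The dictionary at the start of the chain may be easier to follow as plain Python: every key is a branch that receives the same input, and the results are collected into a new dictionary. The sketch below mimics that data flow only. `run_map` and `fake_retriever` are hypothetical stand-ins, and the branches here run one after another, whereas LCEL may run them concurrently.

```python
from operator import itemgetter
from typing import Any, Callable, Dict, List


def fake_retriever(question: str) -> List[str]:
    # Hypothetical stand-in for the vector store retriever.
    return [f"doc about: {question}", "another doc"]


def run_map(branches: Dict[str, Callable[[Any], Any]], inputs: Any) -> Dict[str, Any]:
    """Feed the same input to every branch and collect the results by key."""
    return {key: branch(inputs) for key, branch in branches.items()}


def context_branch(inputs: dict) -> str:
    # Same three steps as the chain: pick the question, retrieve, join the text.
    question = itemgetter("question")(inputs)
    docs = fake_retriever(question)
    return "\n".join(docs)


prompt_inputs = run_map(
    {"context": context_branch, "question": itemgetter("question")},
    {"question": "What is a dataset?"},
)
print(prompt_inputs)
```

The resulting dictionary has exactly the `context` and `question` keys that the prompt template expects.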

Let's do a quick test of our chain with a single question to see it in action before we run the full evaluation.

In [12]:
# We stream the output of the chain for a sample question.
for tok in chain.stream({"question": "How do I log user feedback to a run?"}):
    print(tok, end="", flush=True) # Print each token as it is generated for a real-time effect.

To log user feedback to a run in LangSmith, you can follow these steps:

1. **Create a RunTree Object**: First, you need to create a `RunTree` object to represent the run you want to log user feedback for. This object should include the necessary information such as the name of the component, the type of run, inputs, outputs, and any errors.

2. **Post the Run**: After creating the `RunTree` object, you need to post the run to LangSmith to initiate the logging process.

3. **End the Run with User Feedback**: Once the run is completed and you have collected user feedback, you can end the run by providing the user feedback as part of the outputs. This step allows you to log the user feedback along with the run data.

4. **Patch the Run**: Finally, you can patch the run to update any additional information or finalize the logging process.

Here is a code snippet demonstrating how you can log user feedback to a run using the LangSmith SDK:

```typescript
import { RunTree } from "langsmith"
```

(output truncated)

## Step 3. Evaluate the Chain

With our chain defined and our dataset ready, it's time to run the evaluation. We will use one of LangSmith's built-in, LLM-assisted evaluators called `"qa"`. This evaluator is specifically designed for Q&A tasks. For each example in our dataset, it will:
1.  Receive the generated answer from our RAG chain.
2.  Receive the reference answer from our dataset.
3.  Use an LLM to determine if the generated answer is a "correct" answer based on the reference. It returns a binary score (1 for correct, 0 for incorrect).

We configure this using the `RunEvalConfig` object.

In [13]:
from langchain.smith import RunEvalConfig # Import the evaluation configuration class.

# Create an evaluation configuration object.
eval_config = RunEvalConfig(
    # Specify the evaluators to use. 'qa' is a built-in evaluator for question-answering correctness.
    evaluators=["qa"],
    # You can optionally configure the LLM used for evaluation if you want to use a different model.
    # eval_llm=ChatAnthropic(model="claude-2", temperature=0)
)

Now we execute the evaluation. The `client.arun_on_dataset` function orchestrates the entire process. It iterates through each example in our dataset, runs our RAG chain on the input question, and then applies the `qa` evaluator to score the result. The `await` keyword is used because this is an asynchronous operation, which can run evaluations in parallel for greater efficiency.

In [14]:
# Asynchronously run the evaluation on the dataset.
_ = await client.arun_on_dataset(
    dataset_name=dataset_name, # The name of the dataset to test against.
    llm_or_chain_factory=lambda: chain, # A function that returns an instance of the chain to be tested.
    evaluation=eval_config, # The evaluation configuration we defined earlier.
)

View the evaluation results for project 'extraneous-sea-89' at:
https://smith.langchain.com/o/0212d326-bd9d-42bb-9937-c063f40f2361/datasets/d07a59ea-1e4b-43fb-a680-adc79ef5f87d/compare?selectedSessions=d64011af-44ac-4b1d-afd5-debc5d7e7035

View all tests for Dataset Retrieval QA Questions 6b1fd88c-9cce-4611-8729-3e6f28046f6e at:
https://smith.langchain.com/o/0212d326-bd9d-42bb-9937-c063f40f2361/datasets/d07a59ea-1e4b-43fb-a680-adc79ef5f87d
[------------------------------------------------->] 7/7
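The benefit of running examples concurrently can be sketched with `asyncio`. This is a generic illustration of bounded concurrency, not a description of how `arun_on_dataset` is implemented; `run_all` and `fake_chain` are hypothetical names, and the `max_concurrency` value is arbitrary. Inside a notebook you would write `await run_all(...)` instead of `asyncio.run(...)`, since a notebook already runs an event loop.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def run_all(
    questions: Sequence[str],
    answer_fn: Callable[[str], Awaitable[str]],
    max_concurrency: int = 3,
) -> List[str]:
    """Run answer_fn on every question, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(question: str) -> str:
        async with semaphore:
            return await answer_fn(question)

    # gather preserves the order of the inputs in its results.
    return list(await asyncio.gather(*(run_one(q) for q in questions)))


async def fake_chain(question: str) -> str:
    # Hypothetical stand-in for a chain call that waits on network I/O.
    await asyncio.sleep(0.01)
    return f"answer to: {question}"


answers = asyncio.run(run_all(["q1", "q2", "q3", "q4"], fake_chain))
print(answers)
```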


### Analyzing the Results in LangSmith

As the test progresses, you can click the link printed above to go to the LangSmith project. There, you can see real-time results, including the chain's outputs, the feedback scores from the evaluator, and detailed traces for each run.

To find problematic examples, you can filter the results. For example, to see all the runs that the `qa` evaluator marked as incorrect, you can filter for `"Correctness==0"`.

![Incorrect Examples](./img/filter_correctness.png)

Clicking on an individual run lets you inspect the full trace to understand what went wrong. The "Feedback" tab within the trace view shows the reasoning behind the evaluator's score. 

![Incorrect Example Trace](./img/see_trace.png)

Since LLM-assisted evaluations are themselves LLM runs, you can even inspect the trace of the evaluator itself. This is useful for auditing the evaluation process and ensuring the evaluator is behaving as expected. You can click the link highlighted in the image to see the evaluator's thought process.

![QA Eval Chain Run](./img/qa_eval_chain_run.png)

### Diagnosing and Fixing the Error
In this example, one of the traces was marked as "incorrect". By inspecting the trace, we might find that the model is "hallucinating", that is, making up information that wasn't present in the retrieved documents.

LangSmith's **Playground** is an interactive environment for debugging and improving prompts. By clicking on a specific LLM call in a trace, you can open it in the Playground to experiment with changes.

![Open in Playground](./img/open_in_playground.png)

To fix the hallucination, we can try making the prompt more robust. Let's add an explicit instruction telling the model to *only* use the provided documents and to admit when it doesn't know the answer. We can add a new system message:

> Respond as best as you can. If no documents are retrieved or if you do not see an answer in the retrieved documents, admit you do not know or that you don't see it being supported at the moment.

After adding this message in the Playground and resubmitting, we can see if the model's behavior improves.

![Change Prompt](./img/playground_prompt.png)

The new prompt seems to fix the issue for this specific example. However, we need to ensure this change doesn't negatively affect other examples (i.e., we're not overfitting to a single failure case). The next step is to re-run the entire evaluation with our improved chain.

## Step 4. Iterate and Re-Evaluate

Evaluation is not a one-time event; it's a cycle. We analyzed our initial results, identified a problem, and prototyped a fix in the Playground. Now, we'll implement that fix in our code and re-run the evaluation to measure the impact of our change across the entire dataset.

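Below, we define a new RAG chain (`chain_2`) whose prompt adds the system message that discourages hallucination. Two smaller differences from the first prompt are visible in the code: the sentence describing LangChain is dropped, and `time` is now supplied as a callable, so it is evaluated each time the prompt is formatted rather than once at definition.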

In [15]:
# Define the new, improved prompt template.
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful documentation Q&A assistant, trained to answer"
            " questions from LangSmith's documentation."
            "\nThe current time is {time}.\n\nRelevant documents will be retrieved in the following messages.",
        ),
        ("system", "{context}"),
        ("human", "{question}"),
        # Add the new system message here to make the model more cautious:
        (
            "system",
            "Respond as best as you can. If no documents are retrieved or if you do not see an answer in the retrieved documents,"
            " admit you do not know or that you don't see it being supported at the moment.",
        ),
    ]
).partial(time=lambda: str(datetime.now())) # Use a lambda to get the current time dynamically for each run.

# Re-initialize the model and response generator with the new prompt.
model = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0)
response_generator_2 = prompt | model | StrOutputParser()
# Assemble the second version of our RAG chain.
chain_2 = {
    "context": itemgetter("question")
    | retriever
    | (lambda docs: "\n".join([doc.page_content for doc in docs])),
    "question": itemgetter("question"),
} | response_generator_2
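The difference between passing a string and passing a callable for `time` can be shown without LangChain. The sketch below is a plain-Python analogy, not the library's implementation: `render` is a hypothetical helper that resolves callables when the template is filled, and `fake_now` is a deterministic stand-in for `datetime.now()`.

```python
import itertools
from typing import Callable, Dict, Union

Value = Union[str, Callable[[], str]]

_counter = itertools.count(1)


def fake_now() -> str:
    # Deterministic stand-in for a clock: returns t1, t2, t3, ...
    return f"t{next(_counter)}"


def render(template: str, variables: Dict[str, Value]) -> str:
    """Fill a template, calling any callable value at render time."""
    resolved = {k: (v() if callable(v) else v) for k, v in variables.items()}
    return template.format(**resolved)


template = "The current time is {time}."
eager = {"time": fake_now()}  # evaluated once, when the dict is built
lazy = {"time": fake_now}     # evaluated again on every render

eager_outputs = [render(template, eager) for _ in range(2)]
lazy_outputs = [render(template, lazy) for _ in range(2)]
print(eager_outputs)
print(lazy_outputs)
```

The eager version repeats the same timestamp on every render, while the lazy version produces a fresh one each time.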

Now we run the evaluation again, this time pointing to `chain_2`.

In [16]:
# Run the evaluation again with the updated chain factory.
_ = await client.arun_on_dataset(
    dataset_name=dataset_name, # Use the same dataset as before.
    llm_or_chain_factory=lambda: chain_2, # Point to the new, improved chain.
    evaluation=eval_config, # Use the same evaluation configuration.
)

View the evaluation results for project 'crushing-chain-51' at:
https://smith.langchain.com/o/0212d326-bd9d-42bb-9937-c063f40f2361/datasets/d07a59ea-1e4b-43fb-a680-adc79ef5f87d/compare?selectedSessions=d053012a-72c8-4d7d-9d45-e8b429683467

View all tests for Dataset Retrieval QA Questions 6b1fd88c-9cce-4611-8729-3e6f28046f6e at:
https://smith.langchain.com/o/0212d326-bd9d-42bb-9937-c063f40f2361/datasets/d07a59ea-1e4b-43fb-a680-adc79ef5f87d
[------------------------------------------------->] 7/7


### Comparing Results

After the second run is complete, we can go to our dataset page in LangSmith to compare the performance of the two chains. LangSmith automatically aggregates the feedback scores for each test run associated with a dataset.

![Datasets Page](./img/dataset_test_runs.png)

In this case, the change appears to have worked: the new chain now passes all seven examples. Keep in mind that this is a small, illustrative dataset; with a larger dataset, the goal is to see a positive trend in the aggregate correctness score.

LangSmith also makes it easy to compare outputs at the individual example level. In the "Examples" tab of the dataset, you can click on any row to see a side-by-side comparison of the responses from both test runs, making it easy to spot qualitative differences in their behavior.

![Example Page](./img/example.png)
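If you export per-example scores, the same comparison can be done in code. The sketch below assumes each run is represented as a mapping from example identifier to a 0/1 correctness score. The scores shown are placeholders for illustration only, not the results of the runs above, and `compare_runs` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def mean_score(scores: Dict[str, int]) -> float:
    """Aggregate correctness: the fraction of examples scored 1."""
    return sum(scores.values()) / len(scores)


def compare_runs(before: Dict[str, int], after: Dict[str, int]) -> Tuple[List[str], List[str]]:
    """Return (fixed, regressed) example identifiers between two runs."""
    fixed = [ex for ex in before if before[ex] == 0 and after.get(ex) == 1]
    regressed = [ex for ex in before if before[ex] == 1 and after.get(ex) == 0]
    return fixed, regressed


# Placeholder scores for illustration only.
run_1 = {"q1": 1, "q2": 0, "q3": 1}
run_2 = {"q1": 1, "q2": 1, "q3": 1}

fixed, regressed = compare_runs(run_1, run_2)
print(f"mean before: {mean_score(run_1):.2f}, after: {mean_score(run_2):.2f}")
print(f"fixed: {fixed}, regressed: {regressed}")
```

Tracking regressions separately from the aggregate score helps catch a prompt change that fixes one example while breaking another.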

## 5. Conclusion

Congratulations! You have successfully completed a full evaluation cycle for a RAG Q&A system. 

In this tutorial, you learned how to:
- Create a labeled dataset for evaluation in LangSmith.
- Build a complete RAG pipeline using LangChain.
- Use an LLM-assisted evaluator (`qa`) to automatically measure the correctness of your system's answers.
- Use the insights from LangSmith to diagnose a failure (hallucination), iterate on your prompt, and verify the improvement by re-running the evaluation.

This iterative process of testing, analyzing, and improving is fundamental to building reliable and high-quality LLM applications. 

Thanks for trying this out! If you have questions or suggestions, please open an issue on GitHub or reach out to us at support@langchain.dev.