# Exact Match Evaluation


### 1. Theory: Understanding Exact Match Evaluation

When evaluating Large Language Models (LLMs) or the systems built on top of them, we need a way to measure their performance. The goal is to determine if the model's output is "correct." The simplest and most stringent method for this is **Exact Match Evaluation**.

**What is it?**
Exact Match evaluation is a binary scoring method where a model's generated output is compared directly against a predefined, ground-truth answer (a "reference label"). The output is considered correct (Score: 1) if and only if it is identical, character for character, to the reference label. Any difference at all, even a single character, a punctuation mark, or a whitespace variation, makes the output incorrect (Score: 0).
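
As a minimal sketch of this rule, the comparison reduces to a single string equality check. The function name `exact_match_score` is illustrative only and is not part of LangChain or LangSmith:

```python
def exact_match_score(prediction: str, reference: str) -> int:
    """Return 1 if the two strings are identical, otherwise 0."""
    return int(prediction == reference)


# Identical strings score 1; any difference scores 0.
print(exact_match_score("Paris", "Paris"))   # 1
print(exact_match_score("Paris", "Paris."))  # 0 (extra punctuation)
print(exact_match_score("Paris", " paris"))  # 0 (whitespace and case differ)
```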

**When is it useful?**
This type of evaluation is most effective for tasks where there is a single, unambiguous, and correct answer. Examples include:
- **Fact-based Question Answering:** "What is the capital of France?" -> "Paris"
- **Data Extraction:** Extracting a specific date, number, or name from a text.
- **Structured Output Generation:** Generating a specific JSON key or a pre-defined command.

In this notebook, we will use **LangSmith**, a platform for LLM development and monitoring, to perform exact match evaluation. We will demonstrate this in two ways:
1.  Using LangChain's pre-built `"exact_match"` evaluator.
2.  Creating our own custom evaluator from scratch to replicate the same logic, which shows the flexibility of the platform.

You can preview the final results of this notebook on a public LangSmith run [here](https://smith.langchain.com/public/454c80b5-9809-4f4f-95ee-1f71d8e3ef53/d).

[![Test graph](./img/result_example.png)](https://smith.langchain.com/public/454c80b5-9809-4f4f-95ee-1f71d8e3ef53/d)

### 2. Setup: Installing Dependencies

This first code cell handles the installation of the necessary Python libraries. 
- `langchain`: The core library for building applications with LLMs.
- `langchain_openai`: Provides specific integrations for using OpenAI's models within the LangChain framework.

In [1]:
# The `%pip` command is used to install Python packages directly from a Jupyter cell.
# The `-U` flag ensures that the packages are upgraded to their latest versions.
# The `--quiet` flag suppresses the installation output for a cleaner notebook.
# %pip install -U --quiet langchain langchain_openai

### 3. Configuration: Setting Up Environment Variables

To connect our application to external services like LangSmith and OpenAI, we need to provide API keys. Storing these keys as environment variables is a security best practice, as it avoids hardcoding them directly in the script.

- **`LANGCHAIN_ENDPOINT`**: This tells LangChain where to send the logging and tracing data. We point it to the LangSmith API endpoint.
- **`LANGCHAIN_API_KEY`**: This is your personal key to authenticate with your LangSmith account, allowing you to create datasets and log evaluation runs.
- **`OPENAI_API_KEY`**: This is your key for the OpenAI API, which is required to make calls to models like `gpt-3.5-turbo`.

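**Action Required:** For this notebook to run, you must add your actual keys to a `.env` file next to the notebook, under the names `LANGSMITH_API_KEY` and `OPENAI_API_KEY`. The cells below load that file and copy the values into the environment variables described above.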

In [2]:
from dotenv import load_dotenv # Import function to load environment variables

# Load environment variables from the .env file. The `override=True` argument
# ensures that variables from the .env file will overwrite existing environment variables.
load_dotenv(dotenv_path=".env", override=True)

True

In [3]:
import os # Import the 'os' module to interact with the operating system.

# Update with your API URL if using a hosted instance of LangSmith.
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"  # Set the LangSmith API endpoint as an environment variable.
# Copy the LangSmith key loaded from .env into the variable name described above.
# Indexing os.environ raises a KeyError naming the variable if the key is missing.
os.environ["LANGCHAIN_API_KEY"] = os.environ["LANGSMITH_API_KEY"]  # Set your LangSmith API key as an environment variable.
os.environ["OPENAI_API_KEY"] = os.environ["OPENAI_API_KEY"]  # Confirm your OpenAI API key was loaded from .env.
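
Before moving on, it can help to confirm that all three variables are actually set. The following is a small sketch, not part of the original notebook; the helper `missing_env_vars` is a hypothetical name introduced here for illustration:

```python
import os


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# The three variables this notebook relies on.
required = ["LANGCHAIN_ENDPOINT", "LANGCHAIN_API_KEY", "OPENAI_API_KEY"]
missing = missing_env_vars(required)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Passing a plain dictionary as `environ` makes the helper easy to try out without touching the real environment.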

### 4. Create an Evaluation Dataset

An evaluation dataset is a crucial component for testing any LLM application. It consists of a collection of examples, where each example contains:
- **Inputs**: The data that will be fed into your model (e.g., a user's prompt).
- **Outputs (Reference Labels)**: The corresponding "ground truth" or expected answer that you want the model to produce.

Here, we will create a small dataset named `"Oracle of Exactness"` directly in LangSmith. It will contain two examples designed to test for precise outputs. We first check if the dataset already exists to avoid creating duplicates.

In [5]:
import langsmith # Import the LangSmith client library.

client = langsmith.Client() # Instantiate the LangSmith client to interact with the platform.
dataset_name = "Oracle of Exactness" # Define a name for our new dataset.

# Check if a dataset with this name already exists in your LangSmith project.
if not client.has_dataset(dataset_name=dataset_name):
    # If the dataset does not exist, create it.
    ds = client.create_dataset(dataset_name)
    # Add examples to the newly created dataset.
    client.create_examples(
        # 'inputs' is a list of dictionaries, each representing an input to the model.
        inputs=[
            {
                "prompt_template": "State the year of the declaration of independence. Respond with just the year in digits, nothign else"
            },
            {"prompt_template": "What's the average speed of an unladen swallow?"},
        ],
        # 'outputs' is a list of dictionaries with the corresponding expected or ground-truth answers.
        outputs=[{"output": "1776"}, {"output": "5"}],
        # 'dataset_id' links these examples to the dataset we created above.
        dataset_id=ds.id,
    )
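
The `inputs` and `outputs` lists above are matched by position, so the first input belongs with the first reference output. As a rough sketch in plain Python (no LangSmith calls), the pairing can be checked locally before uploading. The helper `pair_examples` and the shortened prompts are hypothetical and used only for illustration:

```python
def pair_examples(inputs, outputs):
    """Pair each input dict with its reference output, checking the lengths agree."""
    if len(inputs) != len(outputs):
        raise ValueError(f"{len(inputs)} inputs but {len(outputs)} outputs")
    return list(zip(inputs, outputs))


# Placeholder prompts standing in for the real dataset inputs.
inputs = [{"prompt_template": "Question A"}, {"prompt_template": "Question B"}]
outputs = [{"output": "1776"}, {"output": "5"}]

for example_input, example_output in pair_examples(inputs, outputs):
    print(example_input["prompt_template"], "->", example_output["output"])
```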

### 5. Define the System and Evaluators

Now we'll set up the components needed to run the evaluation. This involves three key parts:

1.  **The System Under Test (`predict_result`)**: This is the function that we want to evaluate. It takes an input dictionary (matching the structure of our dataset inputs), uses an OpenAI model to generate a response, and returns the result in a structured output dictionary.

2.  **A Custom Evaluator (`compare_label`)**: Although a pre-built `"exact_match"` evaluator is available, we define our own here to demonstrate how you can create custom evaluation logic. This function receives the model's run (`run`, which holds its outputs) and the dataset example (`example`, which holds the ground truth), compares the two output strings, and returns a structured `EvaluationResult`. The `@run_evaluator` decorator wraps this function so that it can be used as an evaluator in a LangSmith evaluation run.

3.  **The Evaluation Configuration (`RunEvalConfig`)**: This object bundles all the evaluators we want to apply to each model prediction. We include both the pre-built `"exact_match"` evaluator and our custom `compare_label` function. This lets us see their results side by side and confirm that they produce the same scores.
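
The comparison inside a custom evaluator can be tried out without LangSmith. The sketch below uses plain dictionaries in place of `run.outputs` and `example.outputs`. The function `label_matches` and its `normalize` option are hypothetical; they only illustrate how a custom evaluator could relax strict matching, here by ignoring surrounding whitespace:

```python
def label_matches(run_outputs: dict, example_outputs: dict, normalize: bool = False) -> bool:
    """Compare the 'output' fields of two dicts, optionally ignoring outer whitespace."""
    prediction = run_outputs.get("output") or ""
    target = example_outputs.get("output") or ""
    if normalize:
        # Assumption for illustration: only leading/trailing whitespace is ignored.
        prediction, target = prediction.strip(), target.strip()
    # An empty prediction never counts as a match.
    return bool(prediction) and prediction == target


print(label_matches({"output": "1776"}, {"output": "1776"}))                    # True
print(label_matches({"output": "1776\n"}, {"output": "1776"}))                  # False
print(label_matches({"output": "1776\n"}, {"output": "1776"}, normalize=True))  # True
print(label_matches({}, {"output": "1776"}))                                    # False
```

With `normalize=False` the function follows the strict rule used throughout this notebook. The relaxed mode is a design choice that trades stringency for tolerance of formatting noise.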

In [6]:
from langchain.smith import RunEvalConfig # Import the configuration class for evaluation runs.
from langchain_openai import ChatOpenAI # Import the ChatOpenAI class to interact with OpenAI's chat models.
from langsmith.evaluation import EvaluationResult, run_evaluator # Import classes for creating custom evaluators.

model = "gpt-3.5-turbo" # Specify the OpenAI model we want to use for our test.


# This is your model/system that you want to evaluate.
def predict_result(input_: dict) -> dict:
    # This function calls the OpenAI model with the provided prompt.
    response = ChatOpenAI(model=model).invoke(input_["prompt_template"])
    # It then returns the model's output in the standard dictionary format.
    return {"output": response.content}


# The '@run_evaluator' decorator wraps this function so it can be used as a LangSmith evaluator.
@run_evaluator
def compare_label(run, example) -> EvaluationResult:
    # Custom evaluators let you define how "exact" the match ought to be.
    # 'run' contains information about the model's execution, including its outputs.
    # 'example' contains information from the dataset, including the reference output.
    
    # Flexibly pick the fields to compare by accessing the dictionaries.
    prediction = run.outputs.get("output") or "" # Get the predicted output string from the run, defaulting to an empty string if not found.
    target = example.outputs.get("output") or "" # Get the target (reference) output string from the example.
    
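    # Perform the direct string comparison. bool() keeps the score a real boolean
    # even when the prediction is an empty string (which never counts as a match).
    match = bool(prediction) and prediction == target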
    
    # Return the result in the required EvaluationResult format.
    return EvaluationResult(key="matches_label", score=match)


# This defines how you generate metrics about the model's performance.
eval_config = RunEvalConfig(
    # Specify a list of built-in evaluators. `"exact_match"` performs the same logic as our custom one.
    evaluators=["exact_match"], 
    # Specify a list of custom evaluator functions to run.
    custom_evaluators=[compare_label],
)

# This is the main function that executes the evaluation.
client.run_on_dataset(
    dataset_name=dataset_name, # The name of the dataset in LangSmith to use for evaluation.
    llm_or_chain_factory=predict_result, # A reference to the function/chain that will be tested.
    evaluation=eval_config, # The evaluation configuration object we defined above.
    verbose=True, # Prints progress and links to the results in LangSmith.
    # Add any metadata to the project to help with tracking and organization.
    project_metadata={"version": "1.0.0", "model": model},
)

View the evaluation results for project 'bold-sink-95' at:
https://smith.langchain.com/o/0212d326-bd9d-42bb-9937-c063f40f2361/datasets/b0ea32a6-a6bc-4ee5-94a1-f0847d0c3fc8/compare?selectedSessions=eb5cec96-dd0f-4b46-9485-04ac4b17da94

View all tests for Dataset Oracle of Exactness at:
https://smith.langchain.com/o/0212d326-bd9d-42bb-9937-c063f40f2361/datasets/b0ea32a6-a6bc-4ee5-94a1-f0847d0c3fc8
[------------------------------------------------->] 2/2


{'project_name': 'bold-sink-95',
 'results': {'772c051d-c8ae-4438-a03f-97de3fb95d41': {'input': {'prompt_template': 'State the year of the declaration of independence. Respond with just the year in digits, nothign else'},
   'feedback': [EvaluationResult(key='exact_match', score=1, value=None, comment=None, correction=None, evaluator_info={'__run': RunInfo(run_id=UUID('20673b4c-b85a-4252-b1b9-2ca64e82f061'))}, feedback_config=None, source_run_id=None, target_run_id=None, extra=None),
    EvaluationResult(key='matches_label', score=True, value=None, comment=None, correction=None, evaluator_info={}, feedback_config=None, source_run_id=UUID('ba7767c4-ed36-410f-880a-b5731d50b45c'), target_run_id=None, extra=None)],
   'execution_time': 1.206973,
   'run_id': '6f05b482-937e-4667-8f42-75fc62346dc3',
   'output': {'output': '1776'},
   'reference': {'output': '1776'}},
  '843f6ddf-d63d-492b-af9d-a120d8169489': {'input': {'prompt_template': "What's the average speed of an unladen swallow?"},
 