# Unit 3

## Creating Metrics in DSPy

# Evaluation in DSPy: Creating Metrics

Welcome to the third lesson of the "Evaluation in DSPy" course. This lesson covers creating metrics: functions that score a system's outputs so that performance can be measured, compared across runs, and improved.

-----

## Basic Metric Functions

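A DSPy metric is a Python function that takes an `example` (the expected data), a prediction `pred`, and an optional `trace`, and returns a score such as a boolean or a number. The simplest metrics compare the predicted answer directly with the expected one.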

### `validate_answer`

The `validate_answer` function checks if a predicted answer is an exact match to the expected answer, ignoring case differences. It returns `True` for a match and `False` otherwise, making it useful for tasks requiring exact answers.

```python
def validate_answer(example, pred, trace=None):
    return example.answer.lower() == pred.answer.lower()
```

### `answer_exact_match` and `answer_passage_match`

DSPy also provides built-in functions like `answer_exact_match` and `answer_passage_match` that offer more flexibility.

  * `answer_exact_match`: Uses the helper `_answer_match` to check whether the prediction matches any of the expected answers, so `example.answer` can hold a single answer or a list of acceptable answers. Setting the `frac` parameter below 1.0 allows partial matches.
  * `answer_passage_match`: Checks whether any of the expected answers appears in the passages attached to the prediction (`pred.context`), using the helper `_passage_match`. This is useful for retrieval steps, where the question is whether the right passage was fetched rather than how the final answer is worded.
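
The matching behavior described above can be sketched in plain Python. This is an illustration rather than DSPy's implementation: it assumes `frac` acts as a threshold on token-level F1, the names `sketch_answer_match` and `sketch_passage_match` are hypothetical, `SimpleNamespace` objects stand in for `dspy.Example` and `dspy.Prediction`, and the built-ins' own text normalization may differ.

```python
import re
from collections import Counter
from types import SimpleNamespace


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def token_f1(prediction, answer):
    """Token-overlap F1 between two normalized strings."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(answer).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def as_list(answer):
    """Accept either a single answer string or a list of answers."""
    return answer if isinstance(answer, list) else [answer]


def sketch_answer_match(example, pred, trace=None, frac=1.0):
    """Exact match when frac >= 1.0; otherwise best token F1 must reach frac."""
    answers = as_list(example.answer)
    if frac >= 1.0:
        return any(normalize(pred.answer) == normalize(a) for a in answers)
    return max(token_f1(pred.answer, a) for a in answers) >= frac


def sketch_passage_match(example, pred, trace=None):
    """True if any expected answer occurs in any passage in pred.context."""
    answers = as_list(example.answer)
    return any(normalize(a) in normalize(p) for p in pred.context for a in answers)


example = SimpleNamespace(answer=["William Shakespeare", "Shakespeare"])
pred = SimpleNamespace(
    answer="the playwright William Shakespeare",
    context=["Romeo and Juliet was written by William Shakespeare."],
)

print(sketch_answer_match(example, pred))            # False: not an exact match
print(sketch_answer_match(example, pred, frac=0.6))  # True: token F1 is about 0.67
print(sketch_passage_match(example, pred))           # True: answer is in the passage
```

The substring test in `sketch_passage_match` is deliberately crude; it can match inside longer words, which a token-aware check would avoid.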

-----

## Completeness and Groundedness Evaluation

Evaluating a system's response for **completeness** and **groundedness** is vital for understanding its quality. The `CompleteAndGrounded` built-in class provides a structured way to perform this evaluation.

The `CompleteAndGrounded` class uses an F1 score to combine two key components: `AnswerCompleteness` and `AnswerGroundedness`.

  * **`AnswerCompleteness`**: This component estimates how well the system's response covers the "ground truth" or correct answer. It involves enumerating key ideas from both the ground truth and the system's response to discuss their overlap and report a completeness score.
  * **`AnswerGroundedness`**: This component assesses the degree to which a system's response is supported by "retrieved documents" and common sense reasoning. It involves enumerating claims made in the system's response and discussing how they are supported by the provided context.
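
To make the F1 combination concrete, here is a minimal sketch of how two scores in the range 0 to 1 can be merged with a harmonic mean. The function names are hypothetical, and the completeness and groundedness values are supplied by hand; in `CompleteAndGrounded` they would come from the language-model-backed components described above.

```python
def f1_score(a, b):
    """Harmonic mean of two scores, each clamped to [0, 1]."""
    a = max(0.0, min(1.0, a))
    b = max(0.0, min(1.0, b))
    if a + b == 0:
        return 0.0
    return 2 * a * b / (a + b)


def combine_complete_and_grounded(completeness, groundedness):
    """Combine the two component scores into one overall score."""
    return f1_score(groundedness, completeness)


# A complete but poorly grounded response is pulled down sharply:
# the harmonic mean is 0.45, while a plain average would be 0.60.
print(combine_complete_and_grounded(completeness=0.9, groundedness=0.3))
```

The harmonic mean rewards responses that do well on both dimensions, so a response cannot compensate for unsupported claims simply by covering more of the ground truth.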

-----

## Semantic Evaluation with F1 Score

Semantic evaluation compares the meaning of a response with the ground truth rather than its exact wording. The built-in `SemanticF1` class performs this evaluation using recall, precision, and an F1 score.

  * **Recall**: Measures the fraction of the ground truth covered by the system's response.
  * **Precision**: Measures the fraction of the system's response that is covered by the ground truth.
  * **F1 Score**: Combines precision and recall to provide a balanced evaluation metric.

The `SemanticF1` class can perform both standard and "decompositional" semantic evaluations, offering flexibility in assessing response quality.
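
A small worked example may help with the three quantities. The sketch below treats each response as a set of key ideas and uses exact set membership to decide coverage. That is a simplifying assumption: `SemanticF1` relies on a language model to judge whether ideas are covered, and the function name `semantic_prf` is hypothetical.

```python
def semantic_prf(ground_truth_ideas, response_ideas):
    """Return (precision, recall, f1) for two collections of key ideas."""
    truth = set(ground_truth_ideas)
    response = set(response_ideas)
    shared = truth & response

    # Recall: fraction of the ground truth covered by the response.
    recall = len(shared) / len(truth) if truth else 0.0
    # Precision: fraction of the response covered by the ground truth.
    precision = len(shared) / len(response) if response else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


ground_truth = [
    "Paris is the capital of France",
    "Paris lies on the Seine",
    "Paris hosts the Louvre",
]
response = [
    "Paris is the capital of France",
    "Paris lies on the Seine",
    "Paris is in Germany",
    "Paris was founded last year",
]

precision, recall, f1 = semantic_prf(ground_truth, response)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.50 recall=0.67 f1=0.57
```

Two of the three ground-truth ideas are covered (recall 2/3), but only two of the four response ideas are supported (precision 1/2), giving an F1 of 4/7.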

-----

## Practical Example: Evaluating a Tweet

You can create **custom metrics** to evaluate specific criteria. As a practical example, a custom metric can be used to evaluate a tweet's quality. The goal is to check if a generated tweet correctly answers a question, is engaging, and adheres to a character limit.

This custom metric uses an `Assess` class with a `dspy.Signature` to define an automatic assessment. The `metric` function then evaluates the tweet's correctness, engagement, and length, returning a score that reflects its overall quality.
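
The scoring logic of such a metric can be sketched without a language model by injecting the assessment step. Everything specific here is an assumption made for illustration: the `judge` callable stands in for the LM-backed `Assess` step, `keyword_judge` is a toy replacement for it, the tweet is read from a hypothetical `pred.output` field, the limit is taken to be 280 characters, and the scoring rule (correctness is required, engagement adds to the score) is one reasonable choice among several.

```python
from types import SimpleNamespace

MAX_TWEET_LENGTH = 280  # assumed character limit


def make_tweet_metric(judge):
    """Build a metric from judge(text, question) -> bool.

    `judge` is a hypothetical stand-in for an LM-backed assessment.
    """

    def metric(gold, pred, trace=None):
        tweet = pred.output  # assumed field name for the generated tweet
        correct_q = (
            f"The text should answer `{gold.question}` with `{gold.answer}`. "
            "Does the assessed text contain this answer?"
        )
        engaging_q = "Does the assessed text make for a self-contained, engaging tweet?"

        correct = bool(judge(tweet, correct_q))
        engaging = bool(judge(tweet, engaging_q))
        within_limit = len(tweet) <= MAX_TWEET_LENGTH

        # A wrong or over-long tweet scores 0 regardless of engagement.
        score = (correct + engaging) if correct and within_limit else 0

        # When a trace is supplied, return a strict pass/fail instead of a score.
        if trace is not None:
            return score >= 2
        return score / 2.0

    return metric


def keyword_judge(text, question):
    """Toy judge: '!' counts as engaging, mentioning 'paris' counts as correct."""
    if "engaging" in question:
        return text.endswith("!")
    return "paris" in text.lower()


gold = SimpleNamespace(question="What is the capital of France?", answer="Paris")
metric = make_tweet_metric(keyword_judge)

print(metric(gold, SimpleNamespace(output="Paris is the capital of France!")))  # 1.0
print(metric(gold, SimpleNamespace(output="Paris is the capital of France.")))  # 0.5
print(metric(gold, SimpleNamespace(output="London is lovely!")))                # 0.0
```

Passing the judge in as an argument keeps the scoring rule testable on its own; in a real setup the judge would wrap a call to a DSPy module built on the `Assess` signature.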

-----

## Summary and Preparation for Practice

In this lesson, you learned how to create and use metrics in DSPy to evaluate system outputs. This includes using basic functions, evaluating for completeness and groundedness, and performing semantic evaluation with F1 scores. These skills are fundamental for assessing the performance of DSPy systems and provide a foundation for more advanced topics.

## Implementing a Case Insensitive Metric

Now that you've learned about basic metric functions in DSPy, let's put that knowledge into practice! In this exercise, you'll implement a fundamental metric: a case-insensitive exact match validator.

The `validate_answer` function we discussed earlier is essential for many question-answering tasks. Your job is to complete this function in the provided file. The function should compare the predicted answer with the expected answer while ignoring differences in letter case.

To complete this exercise:

  * Implement the comparison logic in the `validate_answer` function.
  * Ensure it returns `True` when answers match (ignoring case) and `False` otherwise.
  * Run the provided test cases to verify your implementation works correctly.

This simple metric will serve as a building block for more complex evaluation methods you'll explore later in the course. Mastering these basic metrics is the first step toward creating sophisticated evaluation systems for your DSPy applications.

```python
from dspy import Example, Prediction


def validate_answer(example, pred, trace=None):
    """
    Validates if the predicted answer matches the expected answer, ignoring case.
    
    Args:
        example: The example containing the expected answer
        pred: The prediction containing the predicted answer
        trace: Optional trace information (not used in this function)
        
    Returns:
        True if the answers match (ignoring case), False otherwise
    """
    # TODO: Implement case-insensitive comparison between example.answer and pred.answer
    pass


# Test cases with matching answers (ignoring case)
matching_cases = [
    (Example(question="What is the capital of France?", answer="Paris"), 
     Prediction(answer="paris")),
    
    (Example(question="Who wrote Romeo and Juliet?", answer="william shakespeare"), 
     Prediction(answer="William Shakespeare")),
    
    (Example(question="What is the chemical symbol for gold?", answer="Au"), 
     Prediction(answer="au")),
]

# Test cases with non-matching answers
non_matching_cases = [
    (Example(question="What is the capital of France?", answer="Paris"), 
     Prediction(answer="London")),
    
    (Example(question="Who wrote Romeo and Juliet?", answer="William Shakespeare"), 
     Prediction(answer="Charles Dickens")),
    
    (Example(question="What is the chemical symbol for gold?", answer="Au"), 
     Prediction(answer="Ag")),
]


def run_tests():
    print("Testing matching cases (should all be True):")
    matching_results = []
    for i, (example, pred) in enumerate(matching_cases):
        result = validate_answer(example, pred)
        matching_results.append(result)
        print(f"  Test {i+1}: Expected '{example.answer}', Got '{pred.answer}', Match: {result}")
    
    print("\nTesting non-matching cases (should all be False):")
    non_matching_results = []
    for i, (example, pred) in enumerate(non_matching_cases):
        result = validate_answer(example, pred)
        non_matching_results.append(result)
        print(f"  Test {i+1}: Expected '{example.answer}', Got '{pred.answer}', Match: {result}")
    
    # Summary
    matching_success = all(matching_results)
    non_matching_success = not any(non_matching_results)
    overall_success = matching_success and non_matching_success
    
    print("\nSummary:")
    print(f"  Matching cases: {'All passed' if matching_success else 'Some failed'}")
    print(f"  Non-matching cases: {'All passed' if non_matching_success else 'Some failed'}")
    print(f"  Overall: {'All tests passed!' if overall_success else 'Some tests failed.'}")


if __name__ == "__main__":
    run_tests()

```

A completed version of the file, with the comparison implemented in `validate_answer`:

```python
from dspy import Example, Prediction


def validate_answer(example, pred, trace=None):
    """
    Validates if the predicted answer matches the expected answer, ignoring case.
    
    Args:
        example: The example containing the expected answer
        pred: The prediction containing the predicted answer
        trace: Optional trace information (not used in this function)
        
    Returns:
        True if the answers match (ignoring case), False otherwise
    """
    # Implement case-insensitive comparison between example.answer and pred.answer
    return example.answer.lower() == pred.answer.lower()


# Test cases with matching answers (ignoring case)
matching_cases = [
    (Example(question="What is the capital of France?", answer="Paris"), 
     Prediction(answer="paris")),
    
    (Example(question="Who wrote Romeo and Juliet?", answer="william shakespeare"), 
     Prediction(answer="William Shakespeare")),
    
    (Example(question="What is the chemical symbol for gold?", answer="Au"), 
     Prediction(answer="au")),
]

# Test cases with non-matching answers
non_matching_cases = [
    (Example(question="What is the capital of France?", answer="Paris"), 
     Prediction(answer="London")),
    
    (Example(question="Who wrote Romeo and Juliet?", answer="William Shakespeare"), 
     Prediction(answer="Charles Dickens")),
    
    (Example(question="What is the chemical symbol for gold?", answer="Au"), 
     Prediction(answer="Ag")),
]


def run_tests():
    print("Testing matching cases (should all be True):")
    matching_results = []
    for i, (example, pred) in enumerate(matching_cases):
        result = validate_answer(example, pred)
        matching_results.append(result)
        print(f"  Test {i+1}: Expected '{example.answer}', Got '{pred.answer}', Match: {result}")
    
    print("\nTesting non-matching cases (should all be False):")
    non_matching_results = []
    for i, (example, pred) in enumerate(non_matching_cases):
        result = validate_answer(example, pred)
        non_matching_results.append(result)
        print(f"  Test {i+1}: Expected '{example.answer}', Got '{pred.answer}', Match: {result}")
    
    # Summary
    matching_success = all(matching_results)
    non_matching_success = not any(non_matching_results)
    overall_success = matching_success and non_matching_success
    
    print("\nSummary:")
    print(f"  Matching cases: {'All passed' if matching_success else 'Some failed'}")
    print(f"  Non-matching cases: {'All passed' if non_matching_success else 'Some failed'}")
    print(f"  Overall: {'All tests passed!' if overall_success else 'Some tests failed.'}")


if __name__ == "__main__":
    run_tests()

```

## Flexible Answer Matching for Multiple Formats

## Passage Matching for Retrieval Systems

## Building a Holistic QA Evaluation Metric

## Semantic Evaluation with F1 Score