# Unit 3

# Evaluation in DSPy: Creating Metrics

Welcome to the third lesson of the "Evaluation in DSPy" course. This lesson focuses on creating metrics, which are crucial for quantifying the performance of system outputs in DSPy. Metrics provide a basis for improving and optimizing a system's performance.

-----

## Basic Metric Functions

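A metric in DSPy is an ordinary Python function that receives an example, a prediction, and an optional `trace`, and returns a score such as a boolean or a number. The simplest metrics compare the predicted answer directly with the expected one.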

### `validate_answer`

The `validate_answer` function checks if a predicted answer is an exact match to the expected answer, ignoring case differences. It returns `True` for a match and `False` otherwise, making it useful for tasks requiring exact answers.

```python
def validate_answer(example, pred, trace=None):
    return example.answer.lower() == pred.answer.lower()
```

### `answer_exact_match` and `answer_passage_match`

DSPy also provides built-in functions like `answer_exact_match` and `answer_passage_match` that offer more flexibility.

  * `answer_exact_match`: This function uses a helper function `_answer_match` to check whether a prediction matches any of the expected answers. With the default `frac=1.0` it requires an exact match; a lower `frac` accepts a prediction whose F1 overlap with an expected answer is at least `frac`.
  * `answer_passage_match`: This function checks whether any expected answer appears in the passages stored in the prediction's context. It uses a helper function `_passage_match` to perform the search.
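
To make the role of `frac` concrete, here is a minimal sketch of exact and partial matching. The helpers `exact_match`, `token_f1`, and `matches` are hypothetical stand-ins written for this illustration, not DSPy's own implementations, and the normalization (lowercasing and whitespace splitting) is deliberately simplified.

```python
from collections import Counter


def _tokens(text):
    # Simplified normalization (an assumption): lowercase and split on whitespace.
    return text.lower().split()


def exact_match(prediction, answers):
    """True if the prediction equals any answer after normalization."""
    return any(_tokens(prediction) == _tokens(ans) for ans in answers)


def token_f1(prediction, answers):
    """Best token-overlap F1 between the prediction and any answer."""
    best = 0.0
    pred_tokens = _tokens(prediction)
    for ans in answers:
        ans_tokens = _tokens(ans)
        # Multiset intersection counts each shared token at most as often as it appears in both.
        overlap = sum((Counter(pred_tokens) & Counter(ans_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(ans_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


def matches(prediction, answers, frac=1.0):
    # frac >= 1.0 demands an exact match; a lower frac accepts enough token overlap.
    if frac >= 1.0:
        return exact_match(prediction, answers)
    return token_f1(prediction, answers) >= frac


answers = ["The capital of France is Paris"]
print(matches("Paris is the capital", answers))            # False
print(matches("Paris is the capital", answers, frac=0.5))  # True
```

In this example the prediction shares four tokens with the six-token answer, which is not an exact match but is enough overlap to clear a threshold of 0.5.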

-----

## Completeness and Groundedness Evaluation

Evaluating a system's response for **completeness** and **groundedness** is vital for understanding its quality. The `CompleteAndGrounded` built-in class provides a structured way to perform this evaluation.

The `CompleteAndGrounded` class uses an F1 score to combine two key components: `AnswerCompleteness` and `AnswerGroundedness`.

  * **`AnswerCompleteness`**: This component estimates how well the system's response covers the "ground truth" or correct answer. It involves enumerating key ideas from both the ground truth and the system's response to discuss their overlap and report a completeness score.
  * **`AnswerGroundedness`**: This component assesses the degree to which a system's response is supported by "retrieved documents" and common sense reasoning. It involves enumerating claims made in the system's response and discussing how they are supported by the provided context.
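
The combination step can be illustrated with a small sketch. It assumes both component scores are already available as numbers between 0 and 1 (in DSPy they come from language-model judgments), and `combine_f1` is a hypothetical name used only for this illustration.

```python
def combine_f1(completeness, groundedness):
    """Harmonic mean of two scores in [0, 1]; returns 0.0 if both are 0."""
    if completeness + groundedness == 0:
        return 0.0
    return 2 * completeness * groundedness / (completeness + groundedness)


# A response that is complete but poorly grounded is pulled toward the lower score.
print(round(combine_f1(0.9, 0.3), 2))  # 0.45
print(round(combine_f1(0.8, 0.8), 2))  # 0.8
```

Because the harmonic mean penalizes imbalance, a response cannot score well by being strong on only one of the two dimensions.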

-----

## Semantic Evaluation with F1 Score

Semantic evaluation assesses the quality of a response based on its semantic content. The built-in `SemanticF1` class performs this evaluation using recall, precision, and an F1 score.

  * **Recall**: Measures the fraction of the ground truth covered by the system's response.
  * **Precision**: Measures the fraction of the system's response that is covered by the ground truth.
  * **F1 Score**: Combines precision and recall to provide a balanced evaluation metric.

The `SemanticF1` class can perform both standard and "decompositional" semantic evaluations, offering flexibility in assessing response quality.
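
To see how recall and precision are derived before they are combined, consider a toy sketch in which the key ideas of each text are already available as sets. This is a simplifying assumption: `SemanticF1` relies on a language model to judge coverage, whereas the hypothetical `idea_f1` below uses plain set overlap.

```python
def idea_f1(ground_truth_ideas, response_ideas):
    """Return (recall, precision, f1) computed from two sets of key ideas."""
    shared = ground_truth_ideas & response_ideas
    recall = len(shared) / len(ground_truth_ideas) if ground_truth_ideas else 0.0
    precision = len(shared) / len(response_ideas) if response_ideas else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0
    return recall, precision, 2 * precision * recall / (precision + recall)


truth = {"paris is the capital", "france is in europe"}
response = {"paris is the capital", "paris hosts the louvre", "paris is on the seine"}
recall, precision, f1 = idea_f1(truth, response)
print(recall, round(precision, 2), round(f1, 2))  # 0.5 0.33 0.4
```

Here the response covers one of the two ground-truth ideas (recall 0.5), while only one of its three ideas is backed by the ground truth (precision of about 0.33).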

-----

## Practical Example: Evaluating a Tweet

You can create **custom metrics** to evaluate specific criteria. As a practical example, a custom metric can be used to evaluate a tweet's quality. The goal is to check if a generated tweet correctly answers a question, is engaging, and adheres to a character limit.

This custom metric uses an `Assess` class with a `dspy.Signature` to define an automatic assessment. The `metric` function then evaluates the tweet's correctness, engagement, and length, returning a score that reflects its overall quality.
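
A minimal sketch of the scoring logic follows. It makes several assumptions: the `judge` callable is a hypothetical stand-in for a language-model assessment built on `Assess` and returns a plain boolean, the limit of 280 characters is illustrative, and the field names `question`, `answer`, and `output` are chosen for this example.

```python
from types import SimpleNamespace

MAX_TWEET_LENGTH = 280  # assumed character limit for this illustration


def tweet_metric(gold, pred, judge, trace=None):
    """Score a tweet on correctness, engagement, and length.

    `judge(text, question)` is a hypothetical callable returning True or False;
    in a DSPy program it would wrap a language-model assessment.
    """
    tweet = pred.output
    correct = judge(tweet, f"Does the text answer `{gold.question}` with `{gold.answer}`?")
    engaging = judge(tweet, "Is the text a self-contained, engaging tweet?")

    # An incorrect or over-long tweet scores zero regardless of engagement.
    score = (correct + engaging) if correct and len(tweet) <= MAX_TWEET_LENGTH else 0

    if trace is not None:
        # When a trace is passed, return a strict pass/fail instead of a graded score.
        return score >= 2
    return score / 2.0


def keyword_judge(text, question):
    # Toy judge for demonstration only: looks for a keyword instead of calling a model.
    return "paris" in text.lower()


gold = SimpleNamespace(question="What is the capital of France?", answer="Paris")
pred = SimpleNamespace(output="Paris is the capital of France, and it is lovely in spring!")
print(tweet_metric(gold, pred, keyword_judge))  # 1.0
```

Passing the judge in as an argument keeps the scoring rule testable without a configured language model; in practice the judge would be replaced by a predictor over the `Assess` signature.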

-----

## Summary and Preparation for Practice

In this lesson, you learned how to create and use metrics in DSPy to evaluate system outputs. This includes using basic match functions, evaluating completeness and groundedness, performing semantic evaluation with F1 scores, and writing a custom metric. These skills are fundamental for assessing the performance of DSPy systems and provide a foundation for more advanced topics.

## Implementing a Case-Insensitive Metric

Now that you've learned about basic metric functions in DSPy, let's put that knowledge into practice! In this exercise, you'll implement a fundamental metric: a case-insensitive exact match validator.

The `validate_answer` function we discussed earlier is essential for many question-answering tasks. Your job is to complete this function in the provided file. The function should compare the predicted answer with the expected answer while ignoring differences in letter case.

To complete this exercise:

  * Implement the comparison logic in the `validate_answer` function.
  * Ensure it returns `True` when answers match (ignoring case) and `False` otherwise.
  * Run the provided test cases to verify your implementation works correctly.

This simple metric will serve as a building block for more complex evaluation methods you'll explore later in the course. Mastering these basic metrics is the first step toward creating sophisticated evaluation systems for your DSPy applications.

```python
from dspy import Example, Prediction


def validate_answer(example, pred, trace=None):
    """
    Validates if the predicted answer matches the expected answer, ignoring case.
    
    Args:
        example: The example containing the expected answer
        pred: The prediction containing the predicted answer
        trace: Optional trace information (not used in this function)
        
    Returns:
        True if the answers match (ignoring case), False otherwise
    """
    # TODO: Implement case-insensitive comparison between example.answer and pred.answer
    pass


# Test cases with matching answers (ignoring case)
matching_cases = [
    (Example(question="What is the capital of France?", answer="Paris"), 
     Prediction(answer="paris")),
    
    (Example(question="Who wrote Romeo and Juliet?", answer="william shakespeare"), 
     Prediction(answer="William Shakespeare")),
    
    (Example(question="What is the chemical symbol for gold?", answer="Au"), 
     Prediction(answer="au")),
]

# Test cases with non-matching answers
non_matching_cases = [
    (Example(question="What is the capital of France?", answer="Paris"), 
     Prediction(answer="London")),
    
    (Example(question="Who wrote Romeo and Juliet?", answer="William Shakespeare"), 
     Prediction(answer="Charles Dickens")),
    
    (Example(question="What is the chemical symbol for gold?", answer="Au"), 
     Prediction(answer="Ag")),
]


def run_tests():
    print("Testing matching cases (should all be True):")
    matching_results = []
    for i, (example, pred) in enumerate(matching_cases):
        result = validate_answer(example, pred)
        matching_results.append(result)
        print(f"  Test {i+1}: Expected '{example.answer}', Got '{pred.answer}', Match: {result}")
    
    print("\nTesting non-matching cases (should all be False):")
    non_matching_results = []
    for i, (example, pred) in enumerate(non_matching_cases):
        result = validate_answer(example, pred)
        non_matching_results.append(result)
        print(f"  Test {i+1}: Expected '{example.answer}', Got '{pred.answer}', Match: {result}")
    
    # Summary
    matching_success = all(matching_results)
    non_matching_success = not any(non_matching_results)
    overall_success = matching_success and non_matching_success
    
    print("\nSummary:")
    print(f"  Matching cases: {'All passed' if matching_success else 'Some failed'}")
    print(f"  Non-matching cases: {'All passed' if non_matching_success else 'Some failed'}")
    print(f"  Overall: {'All tests passed!' if overall_success else 'Some tests failed.'}")


if __name__ == "__main__":
    run_tests()

```

One way to complete the exercise:

```python
from dspy import Example, Prediction


def validate_answer(example, pred, trace=None):
    """
    Validates if the predicted answer matches the expected answer, ignoring case.
    
    Args:
        example: The example containing the expected answer
        pred: The prediction containing the predicted answer
        trace: Optional trace information (not used in this function)
        
    Returns:
        True if the answers match (ignoring case), False otherwise
    """
    # Implement case-insensitive comparison between example.answer and pred.answer
    return example.answer.lower() == pred.answer.lower()


# Test cases with matching answers (ignoring case)
matching_cases = [
    (Example(question="What is the capital of France?", answer="Paris"), 
     Prediction(answer="paris")),
    
    (Example(question="Who wrote Romeo and Juliet?", answer="william shakespeare"), 
     Prediction(answer="William Shakespeare")),
    
    (Example(question="What is the chemical symbol for gold?", answer="Au"), 
     Prediction(answer="au")),
]

# Test cases with non-matching answers
non_matching_cases = [
    (Example(question="What is the capital of France?", answer="Paris"), 
     Prediction(answer="London")),
    
    (Example(question="Who wrote Romeo and Juliet?", answer="William Shakespeare"), 
     Prediction(answer="Charles Dickens")),
    
    (Example(question="What is the chemical symbol for gold?", answer="Au"), 
     Prediction(answer="Ag")),
]


def run_tests():
    print("Testing matching cases (should all be True):")
    matching_results = []
    for i, (example, pred) in enumerate(matching_cases):
        result = validate_answer(example, pred)
        matching_results.append(result)
        print(f"  Test {i+1}: Expected '{example.answer}', Got '{pred.answer}', Match: {result}")
    
    print("\nTesting non-matching cases (should all be False):")
    non_matching_results = []
    for i, (example, pred) in enumerate(non_matching_cases):
        result = validate_answer(example, pred)
        non_matching_results.append(result)
        print(f"  Test {i+1}: Expected '{example.answer}', Got '{pred.answer}', Match: {result}")
    
    # Summary
    matching_success = all(matching_results)
    non_matching_success = not any(non_matching_results)
    overall_success = matching_success and non_matching_success
    
    print("\nSummary:")
    print(f"  Matching cases: {'All passed' if matching_success else 'Some failed'}")
    print(f"  Non-matching cases: {'All passed' if non_matching_success else 'Some failed'}")
    print(f"  Overall: {'All tests passed!' if overall_success else 'Some tests failed.'}")


if __name__ == "__main__":
    run_tests()

```

## Flexible Answer Matching for Multiple Formats

Now that you've learned about basic metric functions and seen how `validate_answer` works, let's dive deeper into DSPy's evaluation capabilities! In this exercise, you'll enhance the `answer_exact_match` function to handle different types of answers.

The function needs to properly process both single string answers and lists of acceptable answers. Your task is to complete the implementation so that it correctly delegates to the `_answer_match` helper function in both cases.

To complete this exercise:

  * Implement the logic to handle when `example.answer` is a string.
  * Implement the logic to handle when `example.answer` is a list.
  * Ensure the `frac` parameter is passed correctly to allow for partial matching.

This enhanced metric will be valuable for real-world applications where multiple correct answers might exist. Building flexible evaluation metrics is a key skill for developing robust DSPy applications that can handle diverse response formats.

```python
from dspy import Example, Prediction
from utils import EM, F1


def _answer_match(prediction, answers, frac=1.0):
    """Returns True if the prediction matches any of the answers."""
    if frac >= 1.0:
        return EM(prediction, answers)

    return F1(prediction, answers) >= frac


def answer_exact_match(example, pred, trace=None, frac=1.0):
    """
    Checks if the predicted answer matches any of the expected answers.
    
    Args:
        example: The example containing the expected answer(s)
        pred: The prediction containing the predicted answer
        trace: Optional trace information (not used in this function)
        frac: The fraction threshold for partial matching (default: 1.0 for exact match)
        
    Returns:
        True if the prediction matches any of the expected answers, False otherwise
        
    Raises:
        ValueError: If the answer type is neither a string nor a list
    """
    # TODO: Handle the case when example.answer is a string
    
    # TODO: Handle the case when example.answer is a list
    
    raise ValueError(f"Invalid answer type: {type(example.answer)}")


def run_tests():
    print("Testing answer_exact_match function...")
    
    # Test case 1: String answer, exact match
    example = Example(answer="Paris")
    pred = Prediction(answer="Paris")
    result = answer_exact_match(example, pred)
    print(f"Test 1 (String, Exact Match): {'✓ Passed' if result else '✗ Failed'}")
    
    # Test case 2: String answer, case-insensitive match
    example = Example(answer="Paris")
    pred = Prediction(answer="paris")
    result = answer_exact_match(example, pred)
    print(f"Test 2 (String, Case-Insensitive): {'✓ Passed' if result else '✗ Failed'}")
    
    # Test case 3: String answer, no match
    example = Example(answer="Paris")
    pred = Prediction(answer="London")
    result = answer_exact_match(example, pred)
    print(f"Test 3 (String, No Match): {'✓ Passed' if not result else '✗ Failed'}")
    
    # Test case 4: List answer, exact match with first item
    example = Example(answer=["Paris", "City of Light"])
    pred = Prediction(answer="Paris")
    result = answer_exact_match(example, pred)
    print(f"Test 4 (List, Match First): {'✓ Passed' if result else '✗ Failed'}")
    
    # Test case 5: List answer, exact match with second item
    example = Example(answer=["City of Light", "Paris"])
    pred = Prediction(answer="Paris")
    result = answer_exact_match(example, pred)
    print(f"Test 5 (List, Match Second): {'✓ Passed' if result else '✗ Failed'}")
    
    # Test case 6: List answer, no match
    example = Example(answer=["Paris", "City of Light"])
    pred = Prediction(answer="London")
    result = answer_exact_match(example, pred)
    print(f"Test 6 (List, No Match): {'✓ Passed' if not result else '✗ Failed'}")
    
    # Test case 7: String answer, partial match with frac=0.5
    example = Example(answer="The capital of France is Paris")
    pred = Prediction(answer="Paris is the capital")
    result = answer_exact_match(example, pred, frac=0.5)
    print(f"Test 7 (String, Partial Match): {'✓ Passed' if result else '✗ Failed'}")
    
    # Test case 8: List answer, partial match with frac=0.5
    example = Example(answer=["The capital of France is Paris", "Paris is in France"])
    pred = Prediction(answer="Paris is the capital")
    result = answer_exact_match(example, pred, frac=0.5)
    print(f"Test 8 (List, Partial Match): {'✓ Passed' if result else '✗ Failed'}")
    
    # Test case 9: Invalid answer type
    try:
        example = Example(answer=123)  # Integer is not a valid type
        pred = Prediction(answer="Paris")
        answer_exact_match(example, pred)
        print("Test 9 (Invalid Type): ✗ Failed - No exception raised")
    except ValueError:
        print("Test 9 (Invalid Type): ✓ Passed - ValueError raised")
    except Exception as e:
        print(f"Test 9 (Invalid Type): ✗ Failed - Wrong exception: {type(e).__name__}")


if __name__ == "__main__":
    run_tests()
```

One way to complete the exercise:

```python
from dspy import Example, Prediction
from utils import EM, F1


def _answer_match(prediction, answers, frac=1.0):
    """Returns True if the prediction matches any of the answers."""
    if frac >= 1.0:
        return EM(prediction, answers)

    return F1(prediction, answers) >= frac


def answer_exact_match(example, pred, trace=None, frac=1.0):
    """
    Checks if the predicted answer matches any of the expected answers.
    
    Args:
        example: The example containing the expected answer(s)
        pred: The prediction containing the predicted answer
        trace: Optional trace information (not used in this function)
        frac: The fraction threshold for partial matching (default: 1.0 for exact match)
        
    Returns:
        True if the prediction matches any of the expected answers, False otherwise
        
    Raises:
        ValueError: If the answer type is neither a string nor a list
    """
    if isinstance(example.answer, str):
        answers = [example.answer]
        return _answer_match(pred.answer, answers, frac)
    
    if isinstance(example.answer, list):
        return _answer_match(pred.answer, example.answer, frac)
    
    raise ValueError(f"Invalid answer type: {type(example.answer)}")


def run_tests():
    print("Testing answer_exact_match function...")
    
    # Test case 1: String answer, exact match
    example = Example(answer="Paris")
    pred = Prediction(answer="Paris")
    result = answer_exact_match(example, pred)
    print(f"Test 1 (String, Exact Match): {'✓ Passed' if result else '✗ Failed'}")
    
    # Test case 2: String answer, case-insensitive match
    example = Example(answer="Paris")
    pred = Prediction(answer="paris")
    result = answer_exact_match(example, pred)
    print(f"Test 2 (String, Case-Insensitive): {'✓ Passed' if result else '✗ Failed'}")
    
    # Test case 3: String answer, no match
    example = Example(answer="Paris")
    pred = Prediction(answer="London")
    result = answer_exact_match(example, pred)
    print(f"Test 3 (String, No Match): {'✓ Passed' if not result else '✗ Failed'}")
    
    # Test case 4: List answer, exact match with first item
    example = Example(answer=["Paris", "City of Light"])
    pred = Prediction(answer="Paris")
    result = answer_exact_match(example, pred)
    print(f"Test 4 (List, Match First): {'✓ Passed' if result else '✗ Failed'}")
    
    # Test case 5: List answer, exact match with second item
    example = Example(answer=["City of Light", "Paris"])
    pred = Prediction(answer="Paris")
    result = answer_exact_match(example, pred)
    print(f"Test 5 (List, Match Second): {'✓ Passed' if result else '✗ Failed'}")
    
    # Test case 6: List answer, no match
    example = Example(answer=["Paris", "City of Light"])
    pred = Prediction(answer="London")
    result = answer_exact_match(example, pred)
    print(f"Test 6 (List, No Match): {'✓ Passed' if not result else '✗ Failed'}")
    
    # Test case 7: String answer, partial match with frac=0.5
    example = Example(answer="The capital of France is Paris")
    pred = Prediction(answer="Paris is the capital")
    result = answer_exact_match(example, pred, frac=0.5)
    print(f"Test 7 (String, Partial Match): {'✓ Passed' if result else '✗ Failed'}")
    
    # Test case 8: List answer, partial match with frac=0.5
    example = Example(answer=["The capital of France is Paris", "Paris is in France"])
    pred = Prediction(answer="Paris is the capital")
    result = answer_exact_match(example, pred, frac=0.5)
    print(f"Test 8 (List, Partial Match): {'✓ Passed' if result else '✗ Failed'}")
    
    # Test case 9: Invalid answer type
    try:
        example = Example(answer=123)  # Integer is not a valid type
        pred = Prediction(answer="Paris")
        answer_exact_match(example, pred)
        print("Test 9 (Invalid Type): ✗ Failed - No exception raised")
    except ValueError:
        print("Test 9 (Invalid Type): ✓ Passed - ValueError raised")
    except Exception as e:
        print(f"Test 9 (Invalid Type): ✗ Failed - Wrong exception: {type(e).__name__}")


if __name__ == "__main__":
    run_tests()
```

### Explanation of the Solution

The solution correctly handles both string and list inputs for the `example.answer` by using `isinstance` to check the data type.

  * When `example.answer` is a **string**, the code wraps it in a single-element list (`[example.answer]`). This is crucial because the `_answer_match` helper function is designed to work with a list of acceptable answers. This approach makes the code cleaner and avoids duplicating logic.
  * When `example.answer` is a **list**, the code directly passes this list to the `_answer_match` helper function.
  * In both cases, the `pred.answer` and the `frac` parameter are passed correctly to the `_answer_match` function, ensuring that both exact (`frac=1.0`) and partial (`frac < 1.0`) matching are handled correctly.

Finally, if the `example.answer` is neither a string nor a list, the function raises a `ValueError`, as specified in the instructions. This robust type-checking ensures the function behaves predictably and handles unexpected inputs gracefully. The provided test cases validate that the completed function works as expected for all scenarios, including exact matches, partial matches, and incorrect input types.
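
The string-versus-list handling can also be factored into a small helper, so that several metrics share one normalization step. The sketch below is one possible refactoring, not part of the exercise, and `as_answer_list` is a hypothetical name.

```python
def as_answer_list(answer):
    """Return the expected answer(s) as a list, or raise ValueError."""
    if isinstance(answer, str):
        # Wrap a single answer so callers can always iterate over a list.
        return [answer]
    if isinstance(answer, list):
        return answer
    raise ValueError(f"Invalid answer type: {type(answer)}")


print(as_answer_list("Paris"))                     # ['Paris']
print(as_answer_list(["Paris", "City of Light"]))  # ['Paris', 'City of Light']
```

With such a helper, the body of `answer_exact_match` could reduce to a single call: `_answer_match(pred.answer, as_answer_list(example.answer), frac)`.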

## Passage Matching for Retrieval Systems

Excellent work on implementing the flexible answer-matching function! Now let's build on your understanding of DSPy metrics by implementing the `answer_passage_match` function, which is crucial for retrieval-based question-answering systems.

This function checks whether any passage in the predicted context contains the expected answer, a key metric for evaluating whether retrieved information is actually useful for answering questions.

Your task is to complete the `answer_passage_match` function so that it handles both single-string answers and lists of acceptable answers. The helper function `_passage_match` is already implemented for you, so you'll need to:

  * Add code to handle when `example.answer` is a string.
  * Add code to handle when `example.answer` is a list of strings.
  * Properly call the `_passage_match` helper with the right parameters.

This metric is particularly valuable when working with retrieval-augmented generation systems, as it helps verify that your system is finding relevant information before generating answers. Mastering this function will give you a powerful tool for evaluating and improving information retrieval components in your DSPy applications.

```python
from dspy import Example, Prediction


def _passage_match(passages: list[str], answers: list[str]) -> bool:
    """Returns True if any of the passages contains the answer."""
    from utils import DPR_normalize, has_answer, normalize_text

    def passage_has_answers(passage: str, answers: list[str]) -> bool:
        """Returns True if the passage contains the answer."""
        return has_answer(
            tokenized_answers=[DPR_normalize(normalize_text(ans)) for ans in answers],
            text=normalize_text(passage),
        )

    return any(passage_has_answers(psg, answers) for psg in passages)


def answer_passage_match(example, pred, trace=None):
    """
    Checks if any passage in the predicted context contains the expected answer.
    
    Args:
        example: The example containing the expected answer(s)
        pred: The prediction containing the context passages
        trace: Optional trace information (not used in this function)
        
    Returns:
        True if any passage contains any acceptable answer, False otherwise
    """
    # TODO: Implement handling for when example.answer is a string
    # Hint: Call _passage_match with pred.context and a list containing the single answer
    
    # TODO: Implement handling for when example.answer is a list
    # Hint: Call _passage_match with pred.context and the list of answers
    
    raise ValueError(f"Invalid answer type: {type(example.answer)}")


def run_tests():
    # Test cases for string answers
    string_cases = [
        # Answer present in first passage
        (Example(question="What is the capital of France?", answer="Paris"), 
         Prediction(context=["Paris is the capital of France.", "France is in Europe."]), 
         True),
        
        # Answer present in second passage
        (Example(question="Who wrote Romeo and Juliet?", answer="Shakespeare"), 
         Prediction(context=["Romeo and Juliet is a famous play.", "It was written by Shakespeare in the 16th century."]), 
         True),
        
        # Answer not present in any passage
        (Example(question="What is the capital of Japan?", answer="Tokyo"), 
         Prediction(context=["Japan is an island nation.", "It has a rich cultural history."]), 
         False),
        
        # Case insensitive matching
        (Example(question="What element has symbol Au?", answer="Gold"), 
         Prediction(context=["The symbol Au represents gold on the periodic table."]), 
         True),
    ]
    
    # Test cases for list answers
    list_cases = [
        # One of the answers present
        (Example(question="Name a primary color", answer=["Red", "Blue", "Yellow"]), 
         Prediction(context=["The rainbow contains many colors.", "Red is considered a primary color."]), 
         True),
        
        # Multiple answers present
        (Example(question="Name a fruit", answer=["Apple", "Banana", "Orange"]), 
         Prediction(context=["Apples and oranges are popular fruits.", "Bananas are rich in potassium."]), 
         True),
        
        # No answers present
        (Example(question="Name a planet", answer=["Mercury", "Venus", "Mars"]), 
         Prediction(context=["The Earth is the third planet from the Sun.", "It is the only known planet with life."]), 
         False),
    ]
    
    # Edge cases
    edge_cases = [
        # Empty answer
        (Example(question="Empty answer", answer=""), 
         Prediction(context=["This is a test passage."]), 
         True),  # Empty string is considered to be in any text
        
        # Empty passage
        (Example(question="Empty passage", answer="Something"), 
         Prediction(context=[""]), 
         False),
        
        # Empty list of answers
        (Example(question="Empty list", answer=[]), 
         Prediction(context=["This is a test passage."]), 
         False),
        
        # Empty list of passages
        (Example(question="Empty passages", answer="Something"), 
         Prediction(context=[]), 
         False),
    ]
    
    # Run all test cases
    all_test_cases = {
        "String Answer Cases": string_cases,
        "List Answer Cases": list_cases,
        "Edge Cases": edge_cases,
    }
    
    all_passed = True
    
    for category, test_cases in all_test_cases.items():
        print(f"\nTesting {category}:")
        category_passed = True
        
        for i, (example, pred, expected) in enumerate(test_cases):
            try:
                result = answer_passage_match(example, pred)
                passed = result == expected
                category_passed = category_passed and passed
                
                status = "✓" if passed else "✗"
                print(f"  Test {i+1}: {status} Expected: {expected}, Got: {result}")
                if not passed:
                    if isinstance(example.answer, list):
                        answer_display = f"[{', '.join(example.answer)}]"
                    else:
                        answer_display = example.answer
                    print(f"    Example answer: {answer_display}")
                    print(f"    Prediction context: {pred.context}")
            except Exception as e:
                category_passed = False
                print(f"  Test {i+1}: ✗ Error: {str(e)}")
        
        category_status = "PASSED" if category_passed else "FAILED"
        print(f"{category}: {category_status}")
        all_passed = all_passed and category_passed
    
    print("\nOverall Result:", "ALL TESTS PASSED!" if all_passed else "SOME TESTS FAILED")
    return all_passed


if __name__ == "__main__":
    run_tests()
```

One way to complete the exercise:

```python
from dspy import Example, Prediction


def _passage_match(passages: list[str], answers: list[str]) -> bool:
    """Returns True if any of the passages contains the answer."""
    from utils import DPR_normalize, has_answer, normalize_text

    def passage_has_answers(passage: str, answers: list[str]) -> bool:
        """Returns True if the passage contains the answer."""
        return has_answer(
            tokenized_answers=[DPR_normalize(normalize_text(ans)) for ans in answers],
            text=normalize_text(passage),
        )

    return any(passage_has_answers(psg, answers) for psg in passages)


def answer_passage_match(example, pred, trace=None):
    """
    Checks if any passage in the predicted context contains the expected answer.
    
    Args:
        example: The example containing the expected answer(s)
        pred: The prediction containing the context passages
        trace: Optional trace information (not used in this function)
        
    Returns:
        True if any passage contains any acceptable answer, False otherwise
    """
    if isinstance(example.answer, str):
        answers = [example.answer]
        return _passage_match(pred.context, answers)
    
    if isinstance(example.answer, list):
        return _passage_match(pred.context, example.answer)
    
    raise ValueError(f"Invalid answer type: {type(example.answer)}")


def run_tests():
    # Test cases for string answers
    string_cases = [
        # Answer present in first passage
        (Example(question="What is the capital of France?", answer="Paris"), 
         Prediction(context=["Paris is the capital of France.", "France is in Europe."]), 
         True),
        
        # Answer present in second passage
        (Example(question="Who wrote Romeo and Juliet?", answer="Shakespeare"), 
         Prediction(context=["Romeo and Juliet is a famous play.", "It was written by Shakespeare in the 16th century."]), 
         True),
        
        # Answer not present in any passage
        (Example(question="What is the capital of Japan?", answer="Tokyo"), 
         Prediction(context=["Japan is an island nation.", "It has a rich cultural history."]), 
         False),
        
        # Case insensitive matching
        (Example(question="What element has symbol Au?", answer="Gold"), 
         Prediction(context=["The symbol Au represents gold on the periodic table."]), 
         True),
    ]
    
    # Test cases for list answers
    list_cases = [
        # One of the answers present
        (Example(question="Name a primary color", answer=["Red", "Blue", "Yellow"]), 
         Prediction(context=["The rainbow contains many colors.", "Red is considered a primary color."]), 
         True),
        
        # Multiple answers present
        (Example(question="Name a fruit", answer=["Apple", "Banana", "Orange"]), 
         Prediction(context=["Apples and oranges are popular fruits.", "Bananas are rich in potassium."]), 
         True),
        
        # No answers present
        (Example(question="Name a planet", answer=["Mercury", "Venus", "Mars"]), 
         Prediction(context=["The Earth is the third planet from the Sun.", "It is the only known planet with life."]), 
         False),
    ]
    
    # Edge cases
    edge_cases = [
        # Empty answer
        (Example(question="Empty answer", answer=""), 
         Prediction(context=["This is a test passage."]), 
         True),  # Empty string is considered to be in any text
        
        # Empty passage
        (Example(question="Empty passage", answer="Something"), 
         Prediction(context=[""]), 
         False),
        
        # Empty list of answers
        (Example(question="Empty list", answer=[]), 
         Prediction(context=["This is a test passage."]), 
         False),
        
        # Empty list of passages
        (Example(question="Empty passages", answer="Something"), 
         Prediction(context=[]), 
         False),
    ]
    
    # Run all test cases
    all_test_cases = {
        "String Answer Cases": string_cases,
        "List Answer Cases": list_cases,
        "Edge Cases": edge_cases,
    }
    
    all_passed = True
    
    for category, test_cases in all_test_cases.items():
        print(f"\nTesting {category}:")
        category_passed = True
        
        for i, (example, pred, expected) in enumerate(test_cases):
            try:
                result = answer_passage_match(example, pred)
                passed = result == expected
                category_passed = category_passed and passed
                
                status = "✓" if passed else "✗"
                print(f"  Test {i+1}: {status} Expected: {expected}, Got: {result}")
                if not passed:
                    if isinstance(example.answer, list):
                        answer_display = f"[{', '.join(example.answer)}]"
                    else:
                        answer_display = example.answer
                    print(f"    Example answer: {answer_display}")
                    print(f"    Prediction context: {pred.context}")
            except Exception as e:
                category_passed = False
                print(f"  Test {i+1}: ✗ Error: {str(e)}")
        
        category_status = "PASSED" if category_passed else "FAILED"
        print(f"{category}: {category_status}")
        all_passed = all_passed and category_passed
    
    print("\nOverall Result:", "ALL TESTS PASSED!" if all_passed else "SOME TESTS FAILED")
    return all_passed


if __name__ == "__main__":
    run_tests()
```

## Building a Holistic QA Evaluation Metric

You've done a fantastic job implementing both the basic `validate_answer` function and the more advanced `answer_exact_match` and `answer_passage_match` functions! Now, let's take your skills to the next level by combining these metrics into a more powerful evaluation tool.

In real-world question-answering systems, we often care about two things: whether the system produced the correct answer and whether that answer is supported by the retrieved context. This is where a composite metric becomes valuable.

Your task is to implement the `composite_metric` function that:

  * Checks if the predicted answer matches the expected answer using `answer_exact_match`
  * Verifies if the expected answer appears in the context using `answer_passage_match`
  * Combines these results into a single score between 0 and 1

The function should handle both string and list answers and properly weigh the importance of exact matches versus passage matches.

This composite metric will help you evaluate question-answering systems more holistically, ensuring they not only give correct answers but also find relevant supporting information. This is a key skill for building trustworthy AI systems that can explain their reasoning.

```python
from dspy import Example, Prediction
from dspy.evaluate import answer_exact_match, answer_passage_match


def composite_metric(example, pred, trace=None, exact_weight=0.6, passage_weight=0.4):
    """
    A composite metric that combines answer_exact_match and answer_passage_match.
    
    Args:
        example: The example containing the expected answer(s)
        pred: The prediction containing the predicted answer and context
        trace: Optional trace information (not used in this function)
        exact_weight: Weight for the exact match component (default: 0.6)
        passage_weight: Weight for the passage match component (default: 0.4)
        
    Returns:
        A score between 0 and 1, with higher values indicating better performance
        
    Raises:
        ValueError: If the answer type is neither a string nor a list
    """
    # TODO: Implement this function
    


def run_tests():
    print("Testing composite_metric function...")
    
    # Test cases where both conditions are met
    both_true_cases = [
        (Example(question="What is the capital of France?", answer="Paris"), 
         Prediction(answer="Paris", context=["Paris is the capital of France."]), 
         1.0),
        
        (Example(question="Who wrote Romeo and Juliet?", answer=["William Shakespeare", "Shakespeare"]), 
         Prediction(answer="Shakespeare", context=["Romeo and Juliet was written by Shakespeare."]), 
         1.0),
    ]
    
    # Test cases where only exact match is true
    only_exact_cases = [
        (Example(question="What is the capital of Japan?", answer="Tokyo"), 
         Prediction(answer="Tokyo", context=["Japan is an island nation in East Asia."]), 
         0.6),  # exact_weight
        
        (Example(question="What is the chemical symbol for gold?", answer="Au"), 
         Prediction(answer="Au", context=["Silver's chemical symbol is Ag."]), 
         0.6),  # exact_weight
    ]
    
    # Test cases where only passage match is true
    only_passage_cases = [
        (Example(question="What is the capital of Italy?", answer="Rome"), 
         Prediction(answer="Milan", context=["Rome is the capital of Italy."]), 
         0.4),  # passage_weight
        
        (Example(question="Who discovered penicillin?", answer="Alexander Fleming"), 
         Prediction(answer="Louis Pasteur", context=["Alexander Fleming discovered penicillin in 1928."]), 
         0.4),  # passage_weight
    ]
    
    # Test cases where neither condition is met
    neither_true_cases = [
        (Example(question="What is the capital of Germany?", answer="Berlin"), 
         Prediction(answer="Munich", context=["Germany is a country in Europe."]), 
         0.0),
        
        (Example(question="Who painted the Mona Lisa?", answer="Leonardo da Vinci"), 
         Prediction(answer="Michelangelo", context=["The Sistine Chapel was painted by Michelangelo."]), 
         0.0),
    ]
    
    # Edge cases
    edge_cases = [
        (Example(question="Empty answer", answer=""), 
         Prediction(answer="", context=["This is a test passage."]), 
         1.0),  # Both match
        
        (Example(question="Empty passage", answer="Something"), 
         Prediction(answer="Something", context=[""]), 
         0.6),  # Only exact match
        
        (Example(question="Empty list", answer="Empty list"), 
         Prediction(answer="", context=["This is a test passage."]), 
         0.0),  # Neither match
    ]
    
    # Run all test cases
    all_test_cases = {
        "Both Conditions Met": both_true_cases,
        "Only Exact Match": only_exact_cases,
        "Only Passage Match": only_passage_cases,
        "Neither Condition Met": neither_true_cases,
        "Edge Cases": edge_cases,
    }
    
    all_passed = True
    
    for category, test_cases in all_test_cases.items():
        print(f"\nTesting {category}:")
        category_passed = True
        
        for i, (example, pred, expected) in enumerate(test_cases):
            try:
                result = composite_metric(example, pred)
                # Allow for small floating-point differences
                passed = abs(result - expected) < 0.01
                category_passed = category_passed and passed
                
                status = "✓" if passed else "✗"
                print(f"  Test {i+1}: {status} Expected: {expected:.2f}, Got: {result:.2f}")
                if not passed:
                    if isinstance(example.answer, list):
                        answer_display = f"[{', '.join(example.answer)}]"
                    else:
                        answer_display = example.answer
                    print(f"    Example answer: {answer_display}")
                    print(f"    Prediction answer: {pred.answer}")
                    print(f"    Prediction context: {pred.context}")
            except Exception as e:
                category_passed = False
                print(f"  Test {i+1}: ✗ Error: {str(e)}")
        
        category_status = "PASSED" if category_passed else "FAILED"
        print(f"{category}: {category_status}")
        all_passed = all_passed and category_passed
    
    print("\nOverall Result:", "ALL TESTS PASSED!" if all_passed else "SOME TESTS FAILED")
    return all_passed


if __name__ == "__main__":
    run_tests()
```

```python
from dspy import Example, Prediction
from dspy.evaluate import answer_exact_match, answer_passage_match


def composite_metric(example, pred, trace=None, exact_weight=0.6, passage_weight=0.4):
    """
    A composite metric that combines answer_exact_match and answer_passage_match.
    
    Args:
        example: The example containing the expected answer(s)
        pred: The prediction containing the predicted answer and context
        trace: Optional trace information (not used in this function)
        exact_weight: Weight for the exact match component (default: 0.6)
        passage_weight: Weight for the passage match component (default: 0.4)
        
    Returns:
        A score between 0 and 1, with higher values indicating better performance
        
    Raises:
        ValueError: If the answer type is neither a string nor a list
    """
    # Check if the predicted answer matches the expected answer
    exact_match_score = int(answer_exact_match(example, pred))
    
    # Check if the expected answer appears in the context
    passage_match_score = int(answer_passage_match(example, pred))
    
    # Combine the scores using the given weights
    score = (exact_match_score * exact_weight) + (passage_match_score * passage_weight)
    
    return score


def run_tests():
    print("Testing composite_metric function...")
    
    # Test cases where both conditions are met
    both_true_cases = [
        (Example(question="What is the capital of France?", answer="Paris"), 
         Prediction(answer="Paris", context=["Paris is the capital of France."]), 
         1.0),
        
        (Example(question="Who wrote Romeo and Juliet?", answer=["William Shakespeare", "Shakespeare"]), 
         Prediction(answer="Shakespeare", context=["Romeo and Juliet was written by Shakespeare."]), 
         1.0),
    ]
    
    # Test cases where only exact match is true
    only_exact_cases = [
        (Example(question="What is the capital of Japan?", answer="Tokyo"), 
         Prediction(answer="Tokyo", context=["Japan is an island nation in East Asia."]), 
         0.6),  # exact_weight
        
        (Example(question="What is the chemical symbol for gold?", answer="Au"), 
         Prediction(answer="Au", context=["Silver's chemical symbol is Ag."]), 
         0.6),  # exact_weight
    ]
    
    # Test cases where only passage match is true
    only_passage_cases = [
        (Example(question="What is the capital of Italy?", answer="Rome"), 
         Prediction(answer="Milan", context=["Rome is the capital of Italy."]), 
         0.4),  # passage_weight
        
        (Example(question="Who discovered penicillin?", answer="Alexander Fleming"), 
         Prediction(answer="Louis Pasteur", context=["Alexander Fleming discovered penicillin in 1928."]), 
         0.4),  # passage_weight
    ]
    
    # Test cases where neither condition is met
    neither_true_cases = [
        (Example(question="What is the capital of Germany?", answer="Berlin"), 
         Prediction(answer="Munich", context=["Germany is a country in Europe."]), 
         0.0),
        
        (Example(question="Who painted the Mona Lisa?", answer="Leonardo da Vinci"), 
         Prediction(answer="Michelangelo", context=["The Sistine Chapel was painted by Michelangelo."]), 
         0.0),
    ]
    
    # Edge cases
    edge_cases = [
        (Example(question="Empty answer", answer=""), 
         Prediction(answer="", context=["This is a test passage."]), 
         1.0),  # Both match
        
        (Example(question="Empty passage", answer="Something"), 
         Prediction(answer="Something", context=[""]), 
         0.6),  # Only exact match
        
        (Example(question="Empty list", answer="Empty list"), 
         Prediction(answer="", context=["This is a test passage."]), 
         0.0),  # Neither match
    ]
    
    # Run all test cases
    all_test_cases = {
        "Both Conditions Met": both_true_cases,
        "Only Exact Match": only_exact_cases,
        "Only Passage Match": only_passage_cases,
        "Neither Condition Met": neither_true_cases,
        "Edge Cases": edge_cases,
    }
    
    all_passed = True
    
    for category, test_cases in all_test_cases.items():
        print(f"\nTesting {category}:")
        category_passed = True
        
        for i, (example, pred, expected) in enumerate(test_cases):
            try:
                result = composite_metric(example, pred)
                # Allow for small floating-point differences
                passed = abs(result - expected) < 0.01
                category_passed = category_passed and passed
                
                status = "✓" if passed else "✗"
                print(f"  Test {i+1}: {status} Expected: {expected:.2f}, Got: {result:.2f}")
                if not passed:
                    if isinstance(example.answer, list):
                        answer_display = f"[{', '.join(example.answer)}]"
                    else:
                        answer_display = example.answer
                    print(f"    Example answer: {answer_display}")
                    print(f"    Prediction answer: {pred.answer}")
                    print(f"    Prediction context: {pred.context}")
            except Exception as e:
                category_passed = False
                print(f"  Test {i+1}: ✗ Error: {str(e)}")
        
        category_status = "PASSED" if category_passed else "FAILED"
        print(f"{category}: {category_status}")
        all_passed = all_passed and category_passed
    
    print("\nOverall Result:", "ALL TESTS PASSED!" if all_passed else "SOME TESTS FAILED")
    return all_passed


if __name__ == "__main__":
    run_tests()
```

-----

### Solution Explanation

The `composite_metric` function combines two evaluation criteria to provide a more holistic score for a QA system's performance: the **correctness of the answer** and the **relevance of the retrieved context**.

The solution correctly implements this by:

1.  **Evaluating Answer Correctness:** It calls `answer_exact_match(example, pred)` to check if the predicted answer matches the expected answer. The boolean result is converted to an integer (`True` becomes `1`, `False` becomes `0`).
2.  **Evaluating Context Relevance:** It calls `answer_passage_match(example, pred)` to check if the expected answer is present in the provided context. This result is also converted to an integer.
3.  **Combining the Scores:** The function then uses the provided weights (`exact_weight` and `passage_weight`) to create a final score. The formula used is a weighted sum: `(exact_match_score * exact_weight) + (passage_match_score * passage_weight)`.

This approach provides a flexible and customizable metric. For example, if you consider a correct answer more important than the presence of the answer in the context, you can assign `exact_weight` a higher value, like the default `0.6`. Conversely, if the focus is more on the retrieval component, the weights can be adjusted accordingly. Because the default weights sum to 1, the result stays between 0 and 1; custom weights should keep that sum if the result is to be read as a score in that range. This kind of composite metric is useful for evaluating a Retrieval-Augmented Generation (RAG) system, as it tests both the retrieval and generation stages.
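To make the weighting concrete, here is a minimal sketch that reproduces the weighted sum on plain booleans, without calling DSPy. The helper name `weighted_score` and its `normalize` option are assumptions for illustration and are not part of the exercise; normalizing is just one possible way to keep the result in the 0 to 1 range when custom weights do not sum to 1.

```python
def weighted_score(exact_match, passage_match,
                   exact_weight=0.6, passage_weight=0.4, normalize=False):
    """Hypothetical helper: weighted sum of two boolean checks."""
    if normalize:
        total = exact_weight + passage_weight
        if total <= 0:
            raise ValueError("weights must sum to a positive value")
        # Rescale so the two weights sum to 1
        exact_weight, passage_weight = exact_weight / total, passage_weight / total
    # int(True) == 1 and int(False) == 0, so each check contributes its weight or nothing
    return int(exact_match) * exact_weight + int(passage_match) * passage_weight


# The four possible outcomes with the default weights
for exact, passage in [(True, True), (True, False), (False, True), (False, False)]:
    print(exact, passage, round(weighted_score(exact, passage), 2))
```

With the defaults, the four outcomes correspond to the four expected values in the test suite above: both checks pass, only the exact match passes, only the passage match passes, or neither does.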

## Semantic Evaluation with F1 Score

Now that you've worked with exact matching and passage matching, let's explore semantic evaluation, a more nuanced way to assess language model outputs. In this exercise, you'll implement a `SemanticEvaluator` class that measures how well a system's response captures the meaning of ground truth answers.

Unlike exact matching, semantic evaluation focuses on the overlap of key ideas between responses, making it ideal for evaluating longer, more complex outputs. Your evaluator will calculate precision (how much of the system's response is supported by ground truth) and recall (how much of the ground truth is covered by the system's response), then combine them into an F1 score.
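As a rough illustration of these definitions, the sketch below computes precision, recall, and F1 over hand-labeled sets of key ideas. This is only a toy stand-in: in the exercise, a language model estimates precision and recall, and the function name `overlap_scores` and the sample idea sets are assumptions made up for this example.

```python
def overlap_scores(truth_ideas, response_ideas):
    """Toy precision/recall/F1 over hand-labeled key ideas (sets of strings)."""
    truth, response = set(truth_ideas), set(response_ideas)
    shared = truth & response
    # Precision: share of the response's ideas that also appear in the ground truth
    precision = len(shared) / len(response) if response else 0.0
    # Recall: share of the ground truth's ideas that the response covers
    recall = len(shared) / len(truth) if truth else 0.0
    # F1: harmonic mean of the two, defined as 0 when both are 0
    f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


truth = {"heart health", "muscle strength", "mental wellbeing"}
response = {"heart health", "muscle strength", "better sleep", "weight loss"}
precision, recall, f1 = overlap_scores(truth, response)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.50 recall=0.67 f1=0.57
```

The response covers two of the three ground-truth ideas (recall 2/3) but only two of its own four ideas are supported (precision 1/2), so the F1 score of 4/7 sits between the two and closer to the lower one.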

To complete this exercise:

  * Implement the constructor to support both standard and decompositional evaluation modes.
  * Complete the `forward` method to calculate semantic similarity scores.
  * Handle the `threshold` parameter correctly for binary evaluation.

This metric will be particularly valuable for evaluating summarization, question answering, and other tasks where the exact wording matters less than capturing the right concepts. By implementing this evaluator, you'll add a powerful tool to your DSPy evaluation toolkit.

```python
import dspy
from dspy import Example, Prediction
import os


def f1_score(precision, recall):
    """Calculate F1 score from precision and recall values."""
    precision, recall = max(0.0, min(1.0, precision)), max(0.0, min(1.0, recall))
    return 0.0 if precision + recall == 0 else 2 * (precision * recall) / (precision + recall)


class SemanticEvaluator(dspy.Module):
    """
    A module that evaluates the semantic similarity between ground truth and system responses.
    
    This evaluator calculates precision (how much of the system response is supported by ground truth),
    recall (how much of the ground truth is covered by the system response), and F1 score.
    """
    
    def __init__(self, threshold=0.66, decompositional=False):
        """
        Initialize the semantic evaluator.
        
        Args:
            threshold: The threshold for considering a response good enough (default: 0.66)
            decompositional: Whether to use decompositional evaluation (default: False)
        """
        super().__init__()
        self.threshold = threshold
        
        # TODO: Set self.module based on the decompositional parameter
        # If decompositional is True, use DecompositionalSemanticRecallPrecision
        # Otherwise, use SemanticRecallPrecision
        # Wrap the chosen signature with dspy.ChainOfThought
    
    def forward(self, example, pred, trace=None):
        """
        Evaluate the semantic similarity between the example's ground truth and the prediction.
        
        Args:
            example: The example containing the question and ground truth response
            pred: The prediction containing the system response
            trace: Optional trace information
            
        Returns:
            If trace is None: A score between 0 and 1 representing semantic similarity
            If trace is not None: True if the score meets or exceeds the threshold, False otherwise
        """
        # TODO: Call the appropriate module to get precision and recall
        # The module should receive question, ground_truth, and system_response
        
        # TODO: Calculate F1 score from precision and recall
        
        # TODO: Return appropriate value based on trace parameter
        # If trace is not None, return whether score meets or exceeds the threshold
        # Otherwise, return the raw score
        pass


class SemanticRecallPrecision(dspy.Signature):
    """
    Compare a system's response to the ground truth to compute its recall and precision.
    If asked to reason, enumerate key ideas in each response, and whether they are present in the other response.
    """

    question: str = dspy.InputField()
    ground_truth: str = dspy.InputField()
    system_response: str = dspy.InputField()
    recall: float = dspy.OutputField(desc="fraction (out of 1.0) of ground truth covered by the system response")
    precision: float = dspy.OutputField(desc="fraction (out of 1.0) of system response covered by the ground truth")


class DecompositionalSemanticRecallPrecision(dspy.Signature):
    """
    Compare a system's response to the ground truth to compute recall and precision of key ideas.
    You will first enumerate key ideas in each response, discuss their overlap, and then report recall and precision.
    """

    question: str = dspy.InputField()
    ground_truth: str = dspy.InputField()
    system_response: str = dspy.InputField()
    ground_truth_key_ideas: str = dspy.OutputField(desc="enumeration of key ideas in the ground truth")
    system_response_key_ideas: str = dspy.OutputField(desc="enumeration of key ideas in the system response")
    discussion: str = dspy.OutputField(desc="discussion of the overlap between ground truth and system response")
    recall: float = dspy.OutputField(desc="fraction (out of 1.0) of ground truth covered by the system response")
    precision: float = dspy.OutputField(desc="fraction (out of 1.0) of system response covered by the ground truth")
    
    
def run_tests():
    # Create test cases
    perfect_match = (
        Example(
            question="What are the benefits of exercise?",
            ground_truth="Regular exercise improves cardiovascular health, builds muscle strength, and enhances mental wellbeing."
        ),
        Prediction(
            response="Regular exercise improves cardiovascular health, builds muscle strength, and enhances mental wellbeing."
        ),
        1.0  # Expected score
    )
    
    high_overlap = (
        Example(
            question="What are the benefits of exercise?",
            ground_truth="Regular exercise improves cardiovascular health, builds muscle strength, and enhances mental wellbeing."
        ),
        Prediction(
            response="Exercise has many benefits including better heart health, stronger muscles, and improved mood."
        ),
        0.8  # Expected score (approximate)
    )
    
    partial_overlap = (
        Example(
            question="What are the benefits of exercise?",
            ground_truth="Regular exercise improves cardiovascular health, builds muscle strength, and enhances mental wellbeing."
        ),
        Prediction(
            response="Exercise is good for your heart and can help you build stronger muscles."
        ),
        0.5  # Expected score (approximate)
    )
    
    low_overlap = (
        Example(
            question="What are the benefits of exercise?",
            ground_truth="Regular exercise improves cardiovascular health, builds muscle strength, and enhances mental wellbeing."
        ),
        Prediction(
            response="Physical activity is an important part of a healthy lifestyle."
        ),
        0.2  # Expected score (approximate)
    )
    
    no_overlap = (
        Example(
            question="What are the benefits of exercise?",
            ground_truth="Regular exercise improves cardiovascular health, builds muscle strength, and enhances mental wellbeing."
        ),
        Prediction(
            response="The weather forecast predicts rain tomorrow."
        ),
        0.0  # Expected score
    )
    
    # Test both standard and decompositional modes
    test_cases = [
        ("Standard Mode", False, [perfect_match, high_overlap, partial_overlap, low_overlap, no_overlap]),
        ("Decompositional Mode", True, [perfect_match, high_overlap, partial_overlap, low_overlap, no_overlap])
    ]
    
    # Run tests
    for mode_name, decompositional, cases in test_cases:
        print(f"\nTesting {mode_name}:")
        evaluator = SemanticEvaluator(threshold=0.6, decompositional=decompositional)
        
        for i, (example, pred, expected) in enumerate(cases):
            score = evaluator(example, pred)
            threshold_result = evaluator(example, pred, trace={})
            
            # For approximate scores, check if within reasonable range
            if expected in (0.0, 1.0):
                score_ok = abs(score - expected) < 0.1
            else:
                # For approximate scores, we're more lenient
                score_ok = abs(score - expected) < 0.3
                
            expected_threshold = expected >= 0.6
            
            print(f"  Test {i+1}: Score: {score:.2f} (Expected ~{expected:.1f}) - {'✓' if score_ok else '✗'}")
            print(f"    Threshold check: {threshold_result} (Expected {expected_threshold}) - {'✓' if threshold_result == expected_threshold else '✗'}")
            
            # Print details for failed tests
            if not score_ok:
                print(f"    Question: {example.question}")
                print(f"    Ground truth: {example.ground_truth}")
                print(f"    Response: {pred.response}")


if __name__ == "__main__":
    # Set up a language model
    lm = dspy.LM('openai/gpt-4o-mini', api_key=os.environ['OPENAI_API_KEY'], api_base=os.environ['OPENAI_BASE_URL'])
    dspy.configure(lm=lm)
    run_tests()

```

```python
import dspy
from dspy import Example, Prediction
import os


def f1_score(precision, recall):
    """Calculate F1 score from precision and recall values."""
    precision, recall = max(0.0, min(1.0, precision)), max(0.0, min(1.0, recall))
    return 0.0 if precision + recall == 0 else 2 * (precision * recall) / (precision + recall)


class SemanticEvaluator(dspy.Module):
    """
    A module that evaluates the semantic similarity between ground truth and system responses.
    
    This evaluator calculates precision (how much of the system response is supported by ground truth),
    recall (how much of the ground truth is covered by the system response), and F1 score.
    """
    
    def __init__(self, threshold=0.66, decompositional=False):
        """
        Initialize the semantic evaluator.
        
        Args:
            threshold: The threshold for considering a response good enough (default: 0.66)
            decompositional: Whether to use decompositional evaluation (default: False)
        """
        super().__init__()
        self.threshold = threshold
        
        if decompositional:
            self.module = dspy.ChainOfThought(DecompositionalSemanticRecallPrecision)
        else:
            self.module = dspy.ChainOfThought(SemanticRecallPrecision)
    
    def forward(self, example, pred, trace=None):
        """
        Evaluate the semantic similarity between the example's ground truth and the prediction.
        
        Args:
            example: The example containing the question and ground truth response
            pred: The prediction containing the system response
            trace: Optional trace information
            
        Returns:
            If trace is None: A score between 0 and 1 representing semantic similarity
            If trace is not None: True if the score meets or exceeds the threshold, False otherwise
        """
        result = self.module(
            question=example.question,
            ground_truth=example.ground_truth,
            system_response=pred.response
        )
        
        precision = result.precision
        recall = result.recall
        
        score = f1_score(precision, recall)
        
        if trace is not None:
            return score >= self.threshold
        else:
            return score


class SemanticRecallPrecision(dspy.Signature):
    """
    Compare a system's response to the ground truth to compute its recall and precision.
    If asked to reason, enumerate key ideas in each response, and whether they are present in the other response.
    """

    question: str = dspy.InputField()
    ground_truth: str = dspy.InputField()
    system_response: str = dspy.InputField()
    recall: float = dspy.OutputField(desc="fraction (out of 1.0) of ground truth covered by the system response")
    precision: float = dspy.OutputField(desc="fraction (out of 1.0) of system response covered by the ground truth")


class DecompositionalSemanticRecallPrecision(dspy.Signature):
    """
    Compare a system's response to the ground truth to compute recall and precision of key ideas.
    You will first enumerate key ideas in each response, discuss their overlap, and then report recall and precision.
    """

    question: str = dspy.InputField()
    ground_truth: str = dspy.InputField()
    system_response: str = dspy.InputField()
    ground_truth_key_ideas: str = dspy.OutputField(desc="enumeration of key ideas in the ground truth")
    system_response_key_ideas: str = dspy.OutputField(desc="enumeration of key ideas in the system response")
    discussion: str = dspy.OutputField(desc="discussion of the overlap between ground truth and system response")
    recall: float = dspy.OutputField(desc="fraction (out of 1.0) of ground truth covered by the system response")
    precision: float = dspy.OutputField(desc="fraction (out of 1.0) of system response covered by the ground truth")
    
    
def run_tests():
    # Create test cases
    perfect_match = (
        Example(
            question="What are the benefits of exercise?",
            ground_truth="Regular exercise improves cardiovascular health, builds muscle strength, and enhances mental wellbeing."
        ),
        Prediction(
            response="Regular exercise improves cardiovascular health, builds muscle strength, and enhances mental wellbeing."
        ),
        1.0  # Expected score
    )
    
    high_overlap = (
        Example(
            question="What are the benefits of exercise?",
            ground_truth="Regular exercise improves cardiovascular health, builds muscle strength, and enhances mental wellbeing."
        ),
        Prediction(
            response="Exercise has many benefits including better heart health, stronger muscles, and improved mood."
        ),
        0.8  # Expected score (approximate)
    )
    
    partial_overlap = (
        Example(
            question="What are the benefits of exercise?",
            ground_truth="Regular exercise improves cardiovascular health, builds muscle strength, and enhances mental wellbeing."
        ),
        Prediction(
            response="Exercise is good for your heart and can help you build stronger muscles."
        ),
        0.5  # Expected score (approximate)
    )
    
    low_overlap = (
        Example(
            question="What are the benefits of exercise?",
            ground_truth="Regular exercise improves cardiovascular health, builds muscle strength, and enhances mental wellbeing."
        ),
        Prediction(
            response="Physical activity is an important part of a healthy lifestyle."
        ),
        0.2  # Expected score (approximate)
    )
    
    no_overlap = (
        Example(
            question="What are the benefits of exercise?",
            ground_truth="Regular exercise improves cardiovascular health, builds muscle strength, and enhances mental wellbeing."
        ),
        Prediction(
            response="The weather forecast predicts rain tomorrow."
        ),
        0.0  # Expected score
    )
    
    # Test both standard and decompositional modes
    test_cases = [
        ("Standard Mode", False, [perfect_match, high_overlap, partial_overlap, low_overlap, no_overlap]),
        ("Decompositional Mode", True, [perfect_match, high_overlap, partial_overlap, low_overlap, no_overlap])
    ]
    
    # Run tests
    for mode_name, decompositional, cases in test_cases:
        print(f"\nTesting {mode_name}:")
        evaluator = SemanticEvaluator(threshold=0.6, decompositional=decompositional)
        
        for i, (example, pred, expected) in enumerate(cases):
            score = evaluator(example, pred)
            threshold_result = evaluator(example, pred, trace={})
            
            # For approximate scores, check if within reasonable range
            if expected in (0.0, 1.0):
                score_ok = abs(score - expected) < 0.1
            else:
                # For approximate scores, we're more lenient
                score_ok = abs(score - expected) < 0.3
                
            expected_threshold = expected >= 0.6
            
            print(f"  Test {i+1}: Score: {score:.2f} (Expected ~{expected:.1f}) - {'✓' if score_ok else '✗'}")
            print(f"    Threshold check: {threshold_result} (Expected {expected_threshold}) - {'✓' if threshold_result == expected_threshold else '✗'}")
            
            # Print details for failed tests
            if not score_ok:
                print(f"    Question: {example.question}")
                print(f"    Ground truth: {example.ground_truth}")
                print(f"    Response: {pred.response}")


if __name__ == "__main__":
    # Set up a language model
    lm = dspy.LM('openai/gpt-4o-mini', api_key=os.environ['OPENAI_API_KEY'], api_base=os.environ['OPENAI_BASE_URL'])
    dspy.configure(lm=lm)
    run_tests()
```

-----

### Solution Explanation

The `SemanticEvaluator` class shows how to build a custom DSPy evaluation metric that compares meaning rather than matching keywords. Here's a breakdown of the completed implementation:

### 1. The `__init__` Method

The constructor is responsible for setting up the core logic based on the `decompositional` parameter.

  * It checks the value of `decompositional`.
  * If `True`, it initializes `self.module` with the **`DecompositionalSemanticRecallPrecision`** signature, which first enumerates the key ideas in each text and then compares them, giving a more granular comparison.
  * If `False`, it uses the simpler **`SemanticRecallPrecision`** signature.
  * In both cases, the signature is wrapped in a `dspy.ChainOfThought` module, which prompts the language model (LM) to write out step-by-step reasoning before it gives its final recall and precision scores. Reasoning first tends to make the scores more consistent and leaves a rationale you can inspect when a score looks wrong.

### 2. The `forward` Method

This method orchestrates the evaluation process.

  * It calls `self.module`, passing the **`question`**, **`ground_truth`**, and **`system_response`** fields from the `example` and `pred` objects.
  * It extracts the `precision` and `recall` values from the module's output. These are the LM's own estimates, produced after its reasoning step, not values computed deterministically by code.
  * It then passes them to the provided `f1_score` helper, which combines them into a single F1 score: the harmonic mean of precision and recall, which is high only when both are high.
  * Finally, it checks the `trace` parameter. If `trace` is provided, it returns a boolean indicating whether the F1 score meets or exceeds `self.threshold`, which suits pass/fail evaluation. If `trace` is not provided, it returns the raw F1 score, which is useful for quantitative analysis and ranking.
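
The last two steps can be sketched without an LM call. The snippet below is a minimal illustration, not the lesson's actual helper: this `f1_score` is an assumed implementation of the harmonic mean, and `score_or_verdict` is a hypothetical name for the final branch of `forward`. It assumes "provided" means `trace is not None`, which is why the test harness's `trace={}` would still select the pass/fail branch.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed implementation)."""
    # Clamp LM-produced values into [0, 1] in case they drift out of range.
    precision = max(0.0, min(1.0, precision))
    recall = max(0.0, min(1.0, recall))
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both are 0
    return 2 * precision * recall / (precision + recall)


def score_or_verdict(precision, recall, threshold=0.6, trace=None):
    """Hypothetical stand-in for the end of `forward`.

    Returns the raw F1 when trace is None, otherwise a pass/fail boolean.
    """
    score = f1_score(precision, recall)
    # `is None` rather than truthiness: an empty dict still counts as provided.
    return score if trace is None else score >= threshold


if __name__ == "__main__":
    print(score_or_verdict(0.9, 0.5))            # raw F1 score
    print(score_or_verdict(0.9, 0.5, trace={}))  # pass/fail against threshold
```

Because the harmonic mean is pulled toward the smaller value, a response with high precision but low recall (or the reverse) scores well below the arithmetic average of the two.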

This composite approach, in which an LM performs the semantic reasoning and a standard F1 score aggregates its judgments, makes the `SemanticEvaluator` a versatile tool for assessing the quality of generative AI outputs.