# Unit 3

## Improving Evaluation: Fuzzy Answer Matching

## Introduction to Evaluation Challenges

Welcome back! In the previous lesson, we explored different prompting styles and their impact on the performance of large language models (LLMs) in question-answering (QA) tasks. We learned how zero-shot, one-shot, and few-shot prompting can influence the accuracy of model responses. As a reminder, these prompting styles help provide context to the model, which can significantly affect its ability to generate accurate answers.

In this lesson, we will address a common challenge in evaluating QA systems: the limitations of exact match evaluation. Correct answers often go uncounted because of minor variations in wording or phrasing. For example, if the expected answer is "New York City" and the model responds with "NYC," an exact match evaluation marks the response as incorrect, even though it is valid. To reduce this problem, we will introduce the concept of **fuzzy matching**, which scores how similar two strings are instead of requiring them to be identical. Character-level similarity is most forgiving of small differences such as casing, punctuation, plurals, and typos; abbreviations like "NYC" share few characters with the full name, so they remain a hard case that the exercises let you explore.

For this unit, you will not need to call the model, as we will prepare the results for you. Your task will be to evaluate these results using the techniques discussed.

## Understanding Similarity Scoring

**Similarity scoring** is a technique used to measure how closely two pieces of text resemble each other. This approach is particularly useful in QA evaluations, where minor variations in wording can lead to incorrect assessments. By using similarity scoring, we can increase the reliability of our evaluations and ensure that valid responses are recognized.

In Python, one of the tools we can use for similarity scoring is the `SequenceMatcher` from the `difflib` library. This tool compares two strings and returns a ratio indicating their similarity. A ratio closer to 1 means the strings are very similar, while a ratio closer to 0 indicates they are quite different. By setting a threshold, we can determine what level of similarity is acceptable for considering two responses as equivalent.
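To make the ratio concrete, here is a minimal sketch. The `difflib` documentation defines `ratio()` as `2.0 * M / T`, where `M` is the number of matched characters and `T` is the combined length of both strings. The helper name `similarity` and the sample strings are illustrative choices, not part of the lesson's files.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return the SequenceMatcher ratio of two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# "apple" and "apples" share 5 matched characters out of 11 in total: 2 * 5 / 11
plural = similarity("apple", "apples")

# "cat" and "dog" have no characters in common, so nothing matches
unrelated = similarity("cat", "dog")

threshold = 0.75  # the same default used later in this lesson
for label, score in [("apple/apples", plural), ("cat/dog", unrelated)]:
    print(f"{label}: ratio={score:.3f}, similar={score > threshold}")
```

Because the ratio depends on the combined length of both strings, a short answer embedded in a long response tends to score low even when the answer itself is present.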

## Implementing Fuzzy Matching in Python

Let's dive into the implementation of fuzzy matching using Python. We will use the `SequenceMatcher` from the `difflib` library to compare the model's response with the expected answer. Additionally, we'll include a simple visualization to represent the fuzzy accuracy as a percentage bar. Here's the code snippet from the `solution.py` file:

```python
from difflib import SequenceMatcher
import csv

def is_similar(a, b, threshold=0.75):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() > threshold

def create_visual_bar(percentage, max_length=20):
    """Create a simple visual bar to represent a percentage."""
    filled_length = int(percentage / 100 * max_length)
    bar = '█' * filled_length + '░' * (max_length - filled_length)
    return bar

with open("triviaqa.csv") as f:
    qa_pairs = list(csv.DictReader(f))

correct = 0
for q in qa_pairs:
    response = q['model_response']  # Pre-prepared model response
    if is_similar(response, q['answer']):
        correct += 1

fuzzy_accuracy = (correct / len(qa_pairs)) * 100
print(f"Fuzzy Accuracy: {correct}/{len(qa_pairs)}")
print(f"Visual Representation: {create_visual_bar(fuzzy_accuracy)}")
```

In this code, we define a function `is_similar` that takes two strings, `a` and `b`, and a similarity `threshold`. The function uses `SequenceMatcher` to calculate the similarity ratio between the two strings, ignoring case differences. If the ratio exceeds the threshold, the function returns `True`, indicating that the strings are similar enough to be considered equivalent; a ratio exactly equal to the threshold does not count, because the comparison is a strict `>`. We then iterate over the question-answer pairs from the `TriviaQA` dataset, comparing each pre-prepared model response with its expected answer. The fuzzy accuracy is the number of responses judged similar divided by the total number of pairs, expressed as a percentage; the script prints the raw count and passes the percentage to the visual bar.

To visually represent the fuzzy accuracy, we use the `create_visual_bar` function, which generates a simple bar chart to illustrate the percentage of correct responses. This visual aid helps in quickly assessing the model's performance.
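As a quick illustration, the sketch below repeats the lesson's `create_visual_bar` and calls it with a few hand-picked percentages (the values are arbitrary, chosen only for demonstration). With the default `max_length=20`, each filled block stands for 5 percentage points, and `int()` truncates any partial block.

```python
def create_visual_bar(percentage, max_length=20):
    """Create a simple visual bar to represent a percentage."""
    filled_length = int(percentage / 100 * max_length)
    return '█' * filled_length + '░' * (max_length - filled_length)

# Arbitrary sample percentages; the bar is always max_length characters wide
for pct in (0, 25, 50, 100):
    print(f"{pct:>3}% {create_visual_bar(pct)}")
```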

## Example: Evaluating Trivia QA with Fuzzy Matching

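Let's walk through an example of evaluating the `TriviaQA` dataset using fuzzy matching. Suppose the question is "What is the capital of France?" and the expected answer is "Paris". If the model responds with "paris." (lowercase, with a trailing period), an exact match evaluation marks it as incorrect, while `is_similar` returns `True`: after lowercasing, 5 characters match out of 11 in total, so the ratio is 2 × 5 / 11 ≈ 0.91, well above the 0.75 threshold. Note that character-level matching is sensitive to length. A full-sentence response such as "The capital of France is Paris" scores only about 0.29 against "Paris" (2 × 5 / 35), so `is_similar` returns `False` for it at the default threshold even though the answer is right. Fuzzy matching with `SequenceMatcher` therefore helps most when responses are short and close to the expected wording.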

By running the provided code, you will see an output indicating the fuzzy accuracy of the model. This metric reflects the number of responses that were considered similar to the expected answers, providing a more reliable assessment of the model's performance.
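The following sketch contrasts exact matching with fuzzy matching on a handful of rows. The rows are hypothetical stand-ins for the prepared results (they are not taken from the dataset), and an in-memory string replaces the CSV file so the example runs on its own; the column names follow the lesson snippet.

```python
import csv
import io
from difflib import SequenceMatcher

def is_similar(a, b, threshold=0.75):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() > threshold

# Hypothetical rows, invented for illustration only
SAMPLE_CSV = """question,answer,model_response
What is the capital of Japan?,Tokyo,Tokyo
What is the capital of France?,Paris,paris.
Who wrote Hamlet?,William Shakespeare,William Shakespear
What is the capital of Germany?,Berlin,Munich
"""

qa_pairs = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Exact match requires identical strings; fuzzy match tolerates small differences
exact = sum(q["model_response"] == q["answer"] for q in qa_pairs)
fuzzy = sum(is_similar(q["model_response"], q["answer"]) for q in qa_pairs)

print(f"Exact Match Accuracy: {exact}/{len(qa_pairs)}")
print(f"Fuzzy Accuracy: {fuzzy}/{len(qa_pairs)}")
```

On these sample rows, the casing, punctuation, and typo variants are recovered by fuzzy matching, while the genuinely wrong answer is still rejected.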

## Summary and Preparation for Practice

In this lesson, we addressed the limitations of exact match evaluation in QA systems and introduced the concept of fuzzy matching. We explored how similarity scoring can improve evaluation reliability by considering variations in phrasing. By implementing fuzzy matching in Python, we demonstrated how to evaluate model responses more accurately.

As you move forward, practice these concepts with the exercises provided. Experiment with different similarity thresholds and observe how they affect the evaluation accuracy. This hands-on experience will reinforce your understanding and prepare you for more advanced evaluation techniques in future lessons. Remember, mastering fuzzy matching is key to enhancing the reliability of QA evaluations and achieving better results in real-world applications.

## Building Your First Fuzzy Matcher

Now that we've learned about the limitations of exact match evaluation, let's build our first fuzzy matching function! In this exercise, you'll implement the core component that makes flexible answer evaluation possible.

Your task is to complete the `is_similar` function using Python's `SequenceMatcher`. This function needs to:

- Convert both input strings to lowercase.
- Calculate the similarity ratio between them.
- Return `True` if the ratio exceeds the threshold; `False` otherwise.

The test cases included will show you how this function behaves with different types of string pairs, from abbreviations to reordered sentences. Pay attention to the similarity ratios that get printed, as they'll help you understand the nuances of fuzzy matching and why it's so valuable for QA evaluation.

Mastering this function is your first step toward building more reliable evaluation systems that can recognize correct answers even when they're phrased differently!

```python
from difflib import SequenceMatcher

def is_similar(a, b, threshold=0.75):
    # TODO: Use SequenceMatcher to calculate the similarity ratio between
    # the lowercase versions of strings a and b, then return True if the
    # ratio exceeds the threshold, False otherwise
    pass  # placeholder so the file runs; replace with your implementation

# Test cases
test_cases = [
    ("New York City", "NYC", "City abbreviation"),
    ("apple", "apples", "Singular vs plural"),
    ("The president lives in the White House", "The White House is where the president lives", "Reordered sentence"),
    ("cat", "dog", "Different words"),
    ("Python programming", "Python coding", "Similar concept, different wording")
]

print("Testing fuzzy matching with different string pairs:")
print("-" * 60)

for str1, str2, description in test_cases:
    similarity = SequenceMatcher(None, str1.lower(), str2.lower()).ratio()
    result = is_similar(str1, str2)
    
    print(f"Test: {description}")
    print(f"String 1: '{str1}'")
    print(f"String 2: '{str2}'")
    print(f"Similarity ratio: {similarity:.4f}")
    print(f"Similar enough (threshold={0.75})? {result}")
    print("-" * 60)
```

```python
from difflib import SequenceMatcher

def is_similar(a, b, threshold=0.75):
    # Compare the lowercase versions of a and b and return True only if
    # the similarity ratio exceeds the threshold
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() > threshold

# Test cases
test_cases = [
    ("New York City", "NYC", "City abbreviation"),
    ("apple", "apples", "Singular vs plural"),
    ("The president lives in the White House", "The White House is where the president lives", "Reordered sentence"),
    ("cat", "dog", "Different words"),
    ("Python programming", "Python coding", "Similar concept, different wording")
]

print("Testing fuzzy matching with different string pairs:")
print("-" * 60)

for str1, str2, description in test_cases:
    similarity = SequenceMatcher(None, str1.lower(), str2.lower()).ratio()
    result = is_similar(str1, str2)
    
    print(f"Test: {description}")
    print(f"String 1: '{str1}'")
    print(f"String 2: '{str2}'")
    print(f"Similarity ratio: {similarity:.4f}")
    print(f"Similar enough (threshold={0.75})? {result}")
    print("-" * 60)
```

## Evaluating Model Responses with Fuzzy Matching

Excellent work on building your fuzzy matching function! Now let's put it to practical use by evaluating actual model responses against expected answers.

In this exercise, you'll use your `is_similar()` function to analyze results from a zero-shot prompting experiment. Your tasks are to:

- Read the question-answer pairs and model responses from the provided CSV file.
- Compare each model response to its expected answer using fuzzy matching.
- Count how many responses are similar enough to be considered correct.
- Calculate and display the final accuracy score.

This hands-on application will show you the real value of fuzzy matching in evaluation scenarios. By the end, you'll see firsthand how this approach provides a fairer assessment of model performance than strict exact matching would!

```python
from difflib import SequenceMatcher
import csv

def is_similar(a, b, threshold=0.75):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() > threshold

# TODO: Open and read the CSV file

# TODO: Initialize a counter for correct answers

# TODO: Loop through each question-answer pair in the CSV
    # TODO: Extract the model response and expected answer
    
    # TODO: Use the is_similar function to check if the response matches the expected answer
    
    # TODO: Increment the counter if the answer is correct

# TODO: Print the final accuracy score (format: "Fuzzy Accuracy: X/Y")
```

For this exercise, the file to open is `results_zero_shot.csv`, not the `triviaqa.csv` used in the lesson snippet, and the relevant column names are `model_answer` and `expected_answer`.

A solution that uses these names:

```python
from difflib import SequenceMatcher
import csv

def is_similar(a, b, threshold=0.75):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() > threshold

# TODO: Open and read the CSV file
with open("results_zero_shot.csv") as f:
    qa_pairs = list(csv.DictReader(f))

# TODO: Initialize a counter for correct answers
correct_answers_count = 0

# TODO: Loop through each question-answer pair in the CSV
for q in qa_pairs:
    # TODO: Extract the model response and expected answer
    response = q['model_answer']
    expected_answer = q['expected_answer']
    
    # Use the is_similar function to check if the response matches the expected
    # answer, and increment the counter if it does
    if is_similar(response, expected_answer):
        correct_answers_count += 1

# TODO: Print the final accuracy score (format: "Fuzzy Accuracy: X/Y")
total_questions = len(qa_pairs)
print(f"Fuzzy Accuracy: {correct_answers_count}/{total_questions}")
```

Using the file name and column headers that actually appear in the data avoids the `KeyError` that is raised when a row is indexed with a header that does not exist.

## Finding the Perfect Similarity Threshold

Now that you've successfully built and applied your fuzzy matching function, let's take it to the next level! In real-world evaluation scenarios, finding the right similarity threshold is crucial — too low, and you'll accept incorrect answers; too high, and you'll reject valid ones.
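The trade-off can be seen on just two pairs. In the sketch below, the pairs and the helper names `similarity` and `accepted` are invented for illustration: one response is a valid abbreviation, the other is a different country whose name happens to be spelled similarly.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# (model response, expected answer, is the response actually right?)
pairs = [
    ("Mt. Everest", "Mount Everest", True),   # valid abbreviation
    ("Australia", "Austria", False),          # wrong country, similar spelling
]

def accepted(threshold):
    """Return the responses that a given threshold would count as correct."""
    return [resp for resp, expected, _ in pairs if similarity(resp, expected) > threshold]

for resp, expected, valid in pairs:
    print(f"{resp!r} vs {expected!r}: ratio={similarity(resp, expected):.3f}, valid={valid}")

for threshold in (0.5, 0.85, 0.9):
    print(f"threshold={threshold}: accepted={accepted(threshold)}")
```

Here the wrong answer scores higher than the valid one, so no threshold separates them. Threshold tuning reduces errors on average but cannot remove them for every pair.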

In this exercise, you'll enhance your evaluation system by experimenting with different threshold values. Your tasks are to:

- Create a function called `calculate_accuracy` that calculates accuracy using a given threshold. Important: This function must use results loaded from the `results_one_shot.csv` file.
- Test a range of thresholds (0.5 to 0.9) on our one-shot prompting results.
- Visualize how accuracy changes with different thresholds.
- Observe the trade-off between strictness and flexibility.

This exploration will give you practical insight into tuning evaluation systems for different use cases. By the end, you'll understand how to balance precision and recall in your QA evaluations, a key skill for anyone working with language models!



```python
from difflib import SequenceMatcher
import csv

def is_similar(a, b, threshold=0.75):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() > threshold

# TODO: Create a function called calculate_accuracy that takes qa_pairs and a threshold
# parameter, then returns the number of correct answers and total questions

# Open and read the CSV file
with open("results_one_shot.csv") as f:
    qa_pairs = list(csv.DictReader(f))

# TODO: Create a loop to test different threshold values (0.5, 0.6, 0.7, 0.8, 0.9)

# TODO: For each threshold, calculate and print the accuracy
# Format should show threshold value, correct/total count, and percentage

# TODO: Add a visual element (like a simple text bar) to help visualize the differences
```

The error `KeyError: 'Generated Answer'` occurs when the code indexes rows with column names that do not match the actual column headers in the CSV file.

The correct column names from the CSV are `model_answer` and `expected_answer`, which the script below uses. The output that follows comes from running the script against a `results_one_shot.csv` file created from the shared sample data.

Below is the fixed code, along with the output from its execution.

### Fixed Code

```python
from difflib import SequenceMatcher
import csv

def is_similar(a, b, threshold=0.75):
    """
    Compares two strings for similarity using SequenceMatcher.
    
    Args:
        a (str): The first string.
        b (str): The second string.
        threshold (float): The minimum similarity ratio to consider them similar.
        
    Returns:
        bool: True if the similarity ratio is greater than the threshold, False otherwise.
    """
    if not a or not b:
        return False
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() > threshold

def calculate_accuracy(qa_pairs, threshold):
    """
    Calculates the accuracy of a set of QA pairs using a given similarity threshold.

    Args:
        qa_pairs (list): A list of dictionaries, where each dictionary represents a QA pair.
        threshold (float): The similarity threshold to use for evaluation.

    Returns:
        tuple: A tuple containing the number of correct answers and the total number of questions.
    """
    correct_answers = 0
    total_questions = len(qa_pairs)
    
    for pair in qa_pairs:
        # Corrected: Use the actual column names from the CSV file
        if is_similar(pair['model_answer'], pair['expected_answer'], threshold=threshold):
            correct_answers += 1
            
    return correct_answers, total_questions

# Open and read the CSV file
with open("results_one_shot.csv") as f:
    qa_pairs = list(csv.DictReader(f))

# Define the range of thresholds to test
thresholds_to_test = [0.5, 0.6, 0.7, 0.8, 0.9]

print("--- Evaluating Accuracy at Different Similarity Thresholds ---")
for threshold in thresholds_to_test:
    correct, total = calculate_accuracy(qa_pairs, threshold)
    # Guard against an empty results file to avoid dividing by zero
    accuracy_percentage = (correct / total) * 100 if total else 0.0

    # Create a simple text bar for visualization
    bar = "█" * int(accuracy_percentage / 5)  # Each block represents 5%

    print(f"\nThreshold: {threshold:.1f}")
    print(f"Correct: {correct}/{total}")
    print(f"Accuracy: {accuracy_percentage:.2f}%")
    print(f"Visualization: [{bar:<20}]") # Pad the bar to a length of 20
print("\n--- End of Evaluation ---")
```

### Output

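Running the corrected code produced the following results. The accuracy is the same (1/4) at every tested threshold, which means no response in this four-row sample has a similarity ratio between 0.5 and 0.9: one response scores above 0.9, and the other three score 0.5 or lower.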

```
--- Evaluating Accuracy at Different Similarity Thresholds ---

Threshold: 0.5
Correct: 1/4
Accuracy: 25.00%
Visualization: [█████               ]

Threshold: 0.6
Correct: 1/4
Accuracy: 25.00%
Visualization: [█████               ]

Threshold: 0.7
Correct: 1/4
Accuracy: 25.00%
Visualization: [█████               ]

Threshold: 0.8
Correct: 1/4
Accuracy: 25.00%
Visualization: [█████               ]

Threshold: 0.9
Correct: 1/4
Accuracy: 25.00%
Visualization: [█████               ]

--- End of Evaluation ---
```
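When accuracy does not move across thresholds, printing the per-row ratios usually explains why. The sketch below does this on hypothetical rows that were invented for illustration; they are not the contents of `results_one_shot.csv`, and the helper name `row_ratio` is an assumption.

```python
from difflib import SequenceMatcher

# Hypothetical rows, invented for illustration; NOT the real results file
rows = [
    {"model_answer": "Paris", "expected_answer": "Paris"},
    {"model_answer": "I believe the answer is Jupiter", "expected_answer": "Jupiter"},
    {"model_answer": "The answer is 42", "expected_answer": "42"},
    {"model_answer": "Munich", "expected_answer": "Berlin"},
]

def row_ratio(row):
    """Similarity ratio between a row's model answer and expected answer."""
    return SequenceMatcher(
        None, row["model_answer"].lower(), row["expected_answer"].lower()
    ).ratio()

ratios = [row_ratio(row) for row in rows]

# List rows from most to least similar to see where the ratios cluster
for row, ratio in sorted(zip(rows, ratios), key=lambda pair: pair[1], reverse=True):
    print(f"{ratio:.3f}  {row['model_answer']!r} vs {row['expected_answer']!r}")

# Count how many rows pass at each threshold
counts = {t: sum(r > t for r in ratios) for t in (0.5, 0.6, 0.7, 0.8, 0.9)}
print(counts)
```

In this made-up sample, the verbose responses contain the right answer but score below 0.5 because of their length, so every threshold in the tested range produces the same count.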

## Prompting Strategies Showdown with Fuzzy Matching


You've made impressive progress with your fuzzy matching skills! Now, let's take a big step forward by comparing different prompting strategies head-to-head using your evaluation techniques.

In this exercise, you'll build a comprehensive evaluation system that processes results from two different prompting approaches: one-shot and few-shot. Your system will:

- Process CSV files containing QA results from each prompting strategy.
- Calculate and compare accuracy scores using fuzzy matching.
- Generate a clear leaderboard showing which strategy performs best.
- Visualize the differences with simple text-based graphics.

This comparison will provide you with concrete evidence about which prompting strategy works best for your specific QA task. The insights you gain will help you make informed decisions about which approach to use in your own applications!

```python
from difflib import SequenceMatcher
import csv

def is_similar(a, b, threshold=0.75):
    """Compare two strings and return True if they're similar enough."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() > threshold

# TODO: Create a function called evaluate_file that takes a filename and threshold
# parameter, then returns the number of correct answers, total questions, and accuracy percentage

def create_visual_bar(percentage, max_length=20):
    """Create a simple visual bar to represent a percentage."""
    # TODO: Calculate the filled length based on the percentage (hint: use percentage/100 * max_length)
    # TODO: Create a bar using filled and empty characters (█ for filled, ░ for empty)
    return bar

def compare_prompting_strategies():
    # List of files to evaluate
    files = [
        ("results_one_shot.csv", "One-shot"),
        ("results_few_shot.csv", "Few-shot")
    ]
    
    # Set the similarity threshold
    threshold = 0.75
    
    # TODO: Create an empty list to store results for each file
    
    # TODO: Loop through each file, evaluate it, and store the results
    # Format: (prompt_type, correct, total, accuracy)
    
    # TODO: Sort results by accuracy (highest first)
    
    # Print the leaderboard header
    print("\n" + "="*60)
    print(f"PROMPTING STRATEGY LEADERBOARD (threshold={threshold})")
    print("="*60)
    print(f"{'Rank':<6}{'Strategy':<12}{'Accuracy':<12}{'Score':<15}{'Performance'}")
    print("-"*60)
    
    # TODO: Print each result with rank, including the visual bar
    
    print("="*60)
    
    # TODO: Print the winner and calculate improvement over second place
    
compare_prompting_strategies()
```

```python
from difflib import SequenceMatcher
import csv

def is_similar(a, b, threshold=0.75):
    """Compare two strings and return True if they're similar enough."""
    # Ensure strings are not empty before comparison
    if not a or not b:
        return False
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() > threshold

def evaluate_file(filename, threshold):
    """
    Evaluates the accuracy of QA pairs in a CSV file using a given threshold.

    Args:
        filename (str): The name of the CSV file.
        threshold (float): The similarity threshold for fuzzy matching.
    
    Returns:
        tuple: A tuple containing the number of correct answers, total questions, 
               and the accuracy percentage. Returns a default tuple on error.
    """
    try:
        with open(filename, 'r', encoding='utf-8') as f:
            reader = csv.DictReader(f)
            qa_pairs = list(reader)
    except FileNotFoundError:
        print(f"Error: The file '{filename}' was not found.")
        return 0, 0, 0.0

    correct_answers = 0
    total_questions = len(qa_pairs)

    if total_questions == 0:
        return 0, 0, 0.0

    try:
        for pair in qa_pairs:
            # Use the column names that appear in the results files
            if is_similar(pair['model_answer'], pair['expected_answer'], threshold):
                correct_answers += 1
    except KeyError as e:
        # A KeyError can only occur here, when a row is indexed by a missing header
        print(f"Error: Missing key in file '{filename}'. Check column headers. ({e})")
        return 0, 0, 0.0

    accuracy = (correct_answers / total_questions) * 100

    return correct_answers, total_questions, accuracy

def create_visual_bar(percentage, max_length=20):
    """Create a simple visual bar to represent a percentage."""
    # Clamp the percentage to the 0-100 range
    percentage = max(0, min(100, percentage))
    
    filled_length = int(percentage / 100 * max_length)
    empty_length = max_length - filled_length
    
    bar = "█" * filled_length + "░" * empty_length
    return bar

def compare_prompting_strategies():
    # List of files to evaluate
    files = [
        ("results_one_shot.csv", "One-shot"),
        ("results_few_shot.csv", "Few-shot")
    ]
    
    # Set the similarity threshold
    threshold = 0.75
    
    # Create an empty list to store results for each file
    results = []
    
    # Loop through each file, evaluate it, and store the results
    for filename, prompt_type in files:
        correct, total, accuracy = evaluate_file(filename, threshold)
        if total > 0: # Only add if file was processed successfully
            results.append((prompt_type, correct, total, accuracy))
    
    # Sort results by accuracy (highest first)
    results.sort(key=lambda x: x[3], reverse=True)
    
    # Print the leaderboard header
    print("\n" + "="*60)
    print(f"PROMPTING STRATEGY LEADERBOARD (threshold={threshold})")
    print("="*60)
    print(f"{'Rank':<6}{'Strategy':<12}{'Accuracy':<12}{'Score':<15}{'Performance'}")
    print("-"*60)
    
    # Print each result with rank, including the visual bar
    for i, (prompt_type, correct, total, accuracy) in enumerate(results):
        rank = i + 1
        score = f"{correct}/{total}"
        bar = create_visual_bar(accuracy)
        print(f"{rank:<6}{prompt_type:<12}{accuracy:<12.2f}{score:<15}{bar}")
        
    print("="*60)
    
    # Print the winner and calculate improvement over second place
    if len(results) > 1:
        winner = results[0]
        runner_up = results[1]
        
        winner_strategy, _, _, winner_accuracy = winner
        runner_up_strategy, _, _, runner_up_accuracy = runner_up

        if runner_up_accuracy > 0:
            improvement = (winner_accuracy - runner_up_accuracy) / runner_up_accuracy * 100
            print(f"🥇 The winner is the {winner_strategy} strategy, which showed a {improvement:.2f}% relative improvement over {runner_up_strategy}.")
        else:
            print(f"🥇 The winner is the {winner_strategy} strategy. The runner-up had 0% accuracy.")
    elif len(results) == 1:
        winner_strategy, _, _, _ = results[0]
        print(f"🥇 The only strategy evaluated was {winner_strategy}.")
    else:
        print("No results to compare.")

compare_prompting_strategies()
```
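The improvement figure in the solution is a relative change, not a difference in percentage points. A short worked example, using hypothetical accuracies chosen only to keep the arithmetic simple and an illustrative helper name `relative_improvement`:

```python
def relative_improvement(winner_accuracy, runner_up_accuracy):
    """Percent gain of the winner relative to the runner-up (both in percent)."""
    return (winner_accuracy - runner_up_accuracy) / runner_up_accuracy * 100

# Hypothetical accuracies, not measured results
winner, runner_up = 75.0, 50.0

print(f"Absolute gap: {winner - runner_up:.2f} percentage points")
print(f"Relative improvement: {relative_improvement(winner, runner_up):.2f}%")
```

With these numbers the gap is 25 percentage points, but the relative improvement is 50%, so it is worth stating which of the two a report refers to.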