# Lesson 7 Exercise: Evaluate and Analyze Q&A Model Performance

**Estimated Time:** 17 minutes

## Scenario

Your Q&A model achieves **78% F1**, but management keeps receiving user complaints about wrong answers.

**Questions to answer:**
- Are mistakes concentrated on long contexts?
- Are they concentrated on unanswerable questions?
- Do they cluster around specific question types?

**Goal:** Perform systematic error analysis to prioritize improvements!

---

## Setup

In [None]:
# Install if needed
# !pip install transformers datasets evaluate pandas matplotlib torch

import json
import random
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
random.seed(42)

print("✓ Setup complete!")

---

# Part A: Generate Predictions and Compute Metrics (6 minutes)

## Step 1: Load BERT Q&A Model

We'll use the same BERT architecture from Lesson 6, now fine-tuned for extractive Q&A on SQuAD 2.0!

In [None]:
# TODO: Load BERT Q&A model using HuggingFace pipeline
# Hint: Use pipeline("question-answering") with model="deepset/bert-base-cased-squad2"
# Hint: Check if GPU is available with torch.cuda.is_available() and set device accordingly

from transformers import pipeline
import torch

print("Loading BERT Q&A model...")
print("(First run will download ~400MB model)\n")

# TODO: Complete this section
device = None  # TODO: Set to 0 if GPU available, else -1
qa_pipeline = None  # TODO: Load the pipeline

print(f"✓ BERT Q&A model loaded on {'GPU' if device == 0 else 'CPU'}")
print("  This is the same BERT architecture from Lesson 6!")
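
One possible way to complete the loading step is sketched below. The helper names `pick_device` and `load_qa_pipeline` are hypothetical and not part of the exercise; the model name and the 0 / -1 device convention come from the hints above. The heavy imports are kept inside the function so the device logic can be checked on its own.

```python
def pick_device(cuda_available: bool) -> int:
    """Return the pipeline device index: 0 for the first GPU, -1 for CPU."""
    return 0 if cuda_available else -1


def load_qa_pipeline(model_name: str = "deepset/bert-base-cased-squad2"):
    """Load the question-answering pipeline on GPU when one is available."""
    # Imported lazily so the sketch can be read without the libraries installed.
    import torch
    from transformers import pipeline

    device = pick_device(torch.cuda.is_available())
    qa = pipeline("question-answering", model=model_name, device=device)
    return qa, device
```

In the notebook this would be called as `qa_pipeline, device = load_qa_pipeline()`.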

## Step 2: Load SQuAD 2.0 Dataset

In [None]:
# TODO: Load SQuAD 2.0 validation set
# Hint: Use load_dataset("squad_v2", split="validation[:250]")

from datasets import load_dataset

print("Loading SQuAD 2.0 validation set...")
dataset = None  # TODO: Load the dataset
print(f"✓ Loaded {len(dataset)} examples")

# Show sample
print("\nSample question:")
print(f"Q: {dataset[0]['question']}")
print(f"Context: {dataset[0]['context'][:150]}...")
print(f"Answer: {dataset[0]['answers']['text'][0] if dataset[0]['answers']['text'] else '[Unanswerable]'}")
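
The sample printout relies on the SQuAD 2.0 record layout, where an unanswerable question has an empty `answers['text']` list. A minimal sketch of that check follows; the two records are made-up illustrations rather than rows from the dataset, and the helper names are hypothetical.

```python
def reference_answer(example: dict) -> str:
    """Return the first reference answer, or a marker for unanswerable questions."""
    texts = example["answers"]["text"]
    return texts[0] if texts else "[Unanswerable]"


def is_unanswerable(example: dict) -> bool:
    """An empty answer list marks a question as unanswerable."""
    return len(example["answers"]["text"]) == 0


# Hypothetical records in the SQuAD 2.0 layout (not taken from the dataset).
answerable_example = {
    "id": "toy-1",
    "question": "What is the capital of France?",
    "context": "Paris is the capital of France.",
    "answers": {"text": ["Paris"], "answer_start": [0]},
}
unanswerable_example = {
    "id": "toy-2",
    "question": "Who designed the city?",
    "context": "Paris is the capital of France.",
    "answers": {"text": [], "answer_start": []},
}

print(reference_answer(answerable_example))
print(reference_answer(unanswerable_example))
```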

## Step 3: Generate Predictions with BERT

We'll get top-5 predictions for ranking metrics later!

In [None]:
# TODO: Implement prediction generation with BERT
# Hint: For each example, call qa_pipeline with question, context, and top_k=5
# Hint: Handle both single results and list of results (when top_k > 1)
# Hint: Store prediction_text, score, and top_k_predictions

from collections import Counter
import re
import string

# Helper functions for metrics
def normalize_answer(s):
    """Normalize answer text."""
    def remove_articles(text):
        return re.sub(r'\b(a|an|the)\b', ' ', text)
    def white_space_fix(text):
        return ' '.join(text.split())
    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)
    def lower(text):
        return text.lower()
    return white_space_fix(remove_articles(remove_punc(lower(s))))

def compute_exact_match(prediction, ground_truth):
    return float(normalize_answer(prediction) == normalize_answer(ground_truth))

def compute_f1(prediction, ground_truth):
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(ground_truth).split()
    
    if len(pred_tokens) == 0 or len(truth_tokens) == 0:
        return float(pred_tokens == truth_tokens)
    
    common = Counter(pred_tokens) & Counter(truth_tokens)
    num_common = sum(common.values())
    
    if num_common == 0:
        return 0.0
    
    precision = num_common / len(pred_tokens)
    recall = num_common / len(truth_tokens)
    f1 = 2 * (precision * recall) / (precision + recall)
    
    return f1

print("Generating predictions with BERT...")
print("(This will take 2-3 minutes on CPU)\n")

predictions = []

# TODO: Loop through dataset and generate predictions
# TODO: For each example, call qa_pipeline with top_k=5
# TODO: Store all results in predictions list

print(f"\n✓ Generated {len(predictions)} predictions!")

# Show statistics
total = len(predictions)
correct = sum(1 for p in predictions if p['em_score'] == 1.0)
errors = sum(1 for p in predictions if p['f1_score'] < 1.0)
print(f"\nQuick stats:")
print(f"  Correct (EM=100%): {correct} ({100*correct/total:.1f}%)")
print(f"  Errors (F1<100%):  {errors} ({100*errors/total:.1f}%)")
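
The prediction loop could be sketched as follows. This is one possible shape, not the reference solution: `generate_predictions` is a hypothetical helper, the record keys are chosen to match the fields used later in the notebook, and `is_impossible` is derived from an empty answer list. The scoring functions are passed in so the sketch stays self-contained.

```python
def generate_predictions(dataset, qa_fn, em_fn, f1_fn, top_k=5):
    """Run qa_fn on every example and store the top-k answers with scores."""
    predictions = []
    for example in dataset:
        raw = qa_fn(question=example["question"],
                    context=example["context"],
                    top_k=top_k)
        # A single result comes back as a dict, several results as a list.
        candidates = [raw] if isinstance(raw, dict) else list(raw)
        best = candidates[0]

        gold_texts = example["answers"]["text"]
        # Unanswerable questions are scored against the empty string.
        references = list(gold_texts) if gold_texts else [""]

        predictions.append({
            "id": example["id"],
            "question": example["question"],
            "context": example["context"],
            "answers": example["answers"],
            "prediction_text": best["answer"],
            "score": best["score"],
            "top_k_predictions": [c["answer"] for c in candidates],
            "ground_truth": references[0],
            "all_ground_truths": references,
            "is_impossible": len(gold_texts) == 0,
            # Score against the best-matching reference answer.
            "em_score": max(em_fn(best["answer"], r) for r in references),
            "f1_score": max(f1_fn(best["answer"], r) for r in references),
        })
    return predictions
```

With the helpers defined in this cell, the call would look like `predictions = generate_predictions(dataset, qa_pipeline, compute_exact_match, compute_f1)`.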

## Step 4: Compute Ranking Metrics (P@K, R@K, MRR)

Since we have top-5 predictions, let's compute retrieval metrics!

In [None]:
# TODO: Implement ranking metric functions
# Hint: Implement compute_precision_at_k, compute_recall_at_k, compute_reciprocal_rank
# Hint: Precision@K = what fraction of top-K are correct
# Hint: Recall@K = what fraction of correct answers appear in top-K
# Hint: Reciprocal rank = 1/rank of the first correct answer (0 if none is correct); MRR is its mean over all questions

def compute_precision_at_k(ranked_predictions, ground_truths, k=3):
    """TODO: What fraction of top-K are correct?"""
    pass

def compute_recall_at_k(ranked_predictions, ground_truths, k=3):
    """TODO: What fraction of correct answers appear in top-K?"""
    pass

def compute_reciprocal_rank(ranked_predictions, ground_truths):
    """TODO: Return 1/rank of first correct answer."""
    pass

# TODO: Compute ranking metrics for answerable questions
# TODO: Loop through answerable predictions and compute P@3, R@3, MRR

print("\n" + "="*70)
print("RANKING METRICS (Top-5 Predictions)")
print("="*70)
print(f"Precision@3:  {avg_precision_at_3:.2%}")
print(f"Recall@3:     {avg_recall_at_3:.2%}")
print(f"MRR:          {avg_mrr:.3f}")
print("="*70)
print("\n💡 These metrics measure retrieval quality:")
print("   - P@3: What % of top-3 are correct (quality)")
print("   - R@3: Do we get all correct answers in top-3 (coverage)")
print("   - MRR: How high is the first correct answer ranked (UX)")
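
A minimal sketch of the three ranking functions follows, under two assumptions that the exercise does not fix: a prediction counts as correct when it equals a reference answer after lowercasing and whitespace cleanup (the hypothetical `_norm` helper; the notebook's `normalize_answer` could be swapped in), and Precision@K divides by `k` even when fewer than `k` predictions are returned.

```python
def _norm(text):
    """Lowercase and collapse whitespace (a simplified stand-in for normalize_answer)."""
    return " ".join(text.lower().split())


def compute_precision_at_k(ranked_predictions, ground_truths, k=3):
    """Fraction of the top-k predictions that match a reference answer."""
    gold = {_norm(g) for g in ground_truths}
    hits = sum(1 for p in ranked_predictions[:k] if _norm(p) in gold)
    return hits / k


def compute_recall_at_k(ranked_predictions, ground_truths, k=3):
    """Fraction of distinct reference answers that appear in the top-k."""
    gold = {_norm(g) for g in ground_truths}
    if not gold:
        return 0.0
    found = {_norm(p) for p in ranked_predictions[:k]} & gold
    return len(found) / len(gold)


def compute_reciprocal_rank(ranked_predictions, ground_truths):
    """1/rank of the first correct prediction, or 0.0 if none is correct."""
    gold = {_norm(g) for g in ground_truths}
    for rank, prediction in enumerate(ranked_predictions, start=1):
        if _norm(prediction) in gold:
            return 1.0 / rank
    return 0.0
```

Averaging each function over the answerable questions gives `avg_precision_at_3`, `avg_recall_at_3`, and `avg_mrr`. Note that with a single reference answer and distinct candidate spans, P@3 cannot exceed 1/3 under this definition.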

## Step 5: Use HuggingFace Evaluate for EM and F1

In [None]:
# TODO: Use official SQuAD v2 metric from evaluate library
# Hint: Load "squad_v2" metric with evaluate.load()
# Hint: Format predictions and references in SQuAD v2 format
# Hint: Call squad_metric.compute() with predictions and references

import evaluate

squad_metric = None  # TODO: Load the squad_v2 metric

# TODO: Format predictions and references for the official metric
formatted_predictions = []  # TODO: Create list of formatted predictions
formatted_references = []   # TODO: Create list of formatted references

results = None  # TODO: Compute metrics

print("\n" + "="*70)
print("OVERALL METRICS (Official SQuAD v2)")
print("="*70)
print(f"Exact Match (EM): {results['exact']:.2f}%")
print(f"F1 Score:         {results['f1']:.2f}%")
print("="*70)
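
The official metric expects one dictionary per example on each side. The sketch below shows that layout with a hypothetical helper, `format_for_squad_v2`; it assumes each prediction record carries `id`, `prediction_text`, and the original `answers` dictionary, and it uses `0.0` as a placeholder for `no_answer_probability`.

```python
def format_for_squad_v2(predictions):
    """Build the prediction and reference lists used by the squad_v2 metric."""
    formatted_predictions = [
        {
            "id": p["id"],
            "prediction_text": p["prediction_text"],
            # Placeholder value: this sketch does not estimate a no-answer probability.
            "no_answer_probability": 0.0,
        }
        for p in predictions
    ]
    formatted_references = [
        {"id": p["id"], "answers": p["answers"]}
        for p in predictions
    ]
    return formatted_predictions, formatted_references
```

The two lists can then be passed to `squad_metric.compute(predictions=formatted_predictions, references=formatted_references)`.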

## Step 6: Create Comprehensive Metrics Table

In [None]:
# TODO: Create summary DataFrame with all metrics
# Hint: Include EM, F1, P@3, R@3, MRR, and number of examples

summary_data = {
    'Metric': [
        'Exact Match',
        'F1 Score',
        'Precision@3',
        'Recall@3',
        'MRR',
        'Number of Examples'
    ],
    'Score': [
        # TODO: Fill in metric values
    ]
}

summary_df = pd.DataFrame(summary_data)

print("\nComprehensive Metrics Summary:")
print(summary_df.to_string(index=False))

## Step 7: Metrics by Question Type

In [None]:
# TODO: Compute metrics separately for answerable vs unanswerable questions
# Hint: Filter predictions by is_impossible field
# Hint: Format and compute metrics for each group separately

answerable_preds = None  # TODO: Filter answerable
unanswerable_preds = None  # TODO: Filter unanswerable

# TODO: Format and compute answerable_results
answerable_results = None

# TODO: Format and compute unanswerable_results
unanswerable_results = None

# Print breakdown
print("\n" + "="*70)
print("METRICS BY QUESTION TYPE")
print("="*70)
print(f"\nAnswerable Questions ({len(answerable_preds)} examples):")
print(f"  Exact Match: {answerable_results['exact']:.2f}%")
print(f"  F1 Score:    {answerable_results['f1']:.2f}%")
print(f"  P@3:         {avg_precision_at_3:.2%}")
print(f"  R@3:         {avg_recall_at_3:.2%}")
print(f"  MRR:         {avg_mrr:.3f}")
print(f"\nUnanswerable Questions ({len(unanswerable_preds)} examples):")
print(f"  Exact Match: {unanswerable_results['exact']:.2f}%")
print(f"  F1 Score:    {unanswerable_results['f1']:.2f}%")
print("="*70)
print("\n💡 Compare the two groups: does the model perform worse on unanswerable questions?")
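
Splitting the predictions by question type could look like the sketch below. Both helpers are hypothetical, and the records are assumed to carry an `is_impossible` flag plus per-example scores such as `f1_score`.

```python
def split_by_answerability(predictions):
    """Return (answerable, unanswerable) lists based on the is_impossible flag."""
    answerable = [p for p in predictions if not p["is_impossible"]]
    unanswerable = [p for p in predictions if p["is_impossible"]]
    return answerable, unanswerable


def mean_score(predictions, key):
    """Average a per-example score such as 'f1_score'; 0.0 for an empty group."""
    if not predictions:
        return 0.0
    return sum(p[key] for p in predictions) / len(predictions)
```

Each group can be formatted and passed to the official metric separately, or `mean_score(group, "f1_score")` can serve as a quick cross-check.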

## Checkpoint: Part A Complete! ✓

We now have:
- ✓ Real BERT predictions generated
- ✓ EM and F1 metrics computed
- ✓ Ranking metrics (P@3, R@3, MRR) computed
- ✓ Breakdown by question type

---

# Part B: Categorize and Visualize Errors (8 minutes)

## Step 8: Filter Incorrect Predictions

In [None]:
# TODO: Filter predictions where F1 < 1.0
errors = None  # TODO: Filter for errors

print(f"Found {len(errors)} errors (F1 < 100%)")
print(f"This is {100*len(errors)/len(predictions):.1f}% of all predictions")

## Step 9: Sample 30 Errors

In [None]:
# TODO: Randomly sample 30 errors for analysis
# Hint: Use random.sample(errors, min(30, len(errors)))

n_to_analyze = 30
sample_errors = None  # TODO: Sample errors

print(f"Sampled {len(sample_errors)} errors for manual categorization")
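
Steps 8 and 9 can be combined into one small helper, sketched here as the hypothetical `sample_errors_for_review`. It uses its own seeded `random.Random` instance so the sample is reproducible regardless of earlier random calls; the seed value 42 mirrors the setup cell.

```python
import random


def sample_errors_for_review(predictions, n=30, seed=42):
    """Return (all_errors, sampled_errors) where an error has F1 below 1.0."""
    errors = [p for p in predictions if p["f1_score"] < 1.0]
    rng = random.Random(seed)
    # min() guards against asking for more samples than there are errors.
    sampled = rng.sample(errors, min(n, len(errors)))
    return errors, sampled
```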

## Step 10: Helper Function

In [None]:
def display_error(error, index):
    """Display error for inspection."""
    print("="*80)
    print(f"ERROR #{index + 1}")
    print("="*80)
    print(f"Question: {error['question']}")
    print(f"\nContext: {error['context'][:200]}...")
    print(f"\nModel Predicted:  '{error['prediction_text']}'")
    print(f"Ground Truth:      '{error['ground_truth']}'")
    print(f"\nF1 Score: {error['f1_score']:.2%}")
    print(f"Question Type: {'Unanswerable' if error['is_impossible'] else 'Answerable'}")
    print("="*80)

# Test it
display_error(sample_errors[0], 0)

## Step 11: Categorize Errors

Inspect the sampled BERT errors and assign each one a category:

In [None]:
# TODO: Categorize errors based on patterns
# Hint: Categories could be: unanswerable_error, hallucination, wrong_span, partial_answer
# Hint: Examine each error's characteristics to assign a category

error_categories = []  # TODO: Assign categories to each error

print(f"✓ Categorized {len(error_categories)} errors")
print(f"\nCategories assigned:")
for cat, count in Counter(error_categories).items():
    print(f"  {cat}: {count}")
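
As a starting point before manual review, the four suggested categories can be assigned with simple rules. The rules below are only one possible reading of the category names, not the exercise's official definitions, and `categorize_error` is a hypothetical helper that assumes its input is already an error (F1 below 1.0).

```python
def categorize_error(error):
    """Assign a coarse category to one error record (heuristic, assumed rules)."""
    predicted = error["prediction_text"].strip()

    # Assumed rule: any answer given to an unanswerable question is invented.
    if error["is_impossible"]:
        return "hallucination"
    # Assumed rule: the model abstained although an answer exists.
    if not predicted:
        return "unanswerable_error"
    # Some token overlap with the reference, but not a full match.
    if error["f1_score"] > 0.0:
        return "partial_answer"
    # An answer was given, but it shares no tokens with the reference.
    return "wrong_span"
```

Applied to the sample, this gives `error_categories = [categorize_error(e) for e in sample_errors]`; the labels should still be checked by hand with `display_error`.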

## Step 12: Count by Category

In [None]:
# TODO: Count error categories and calculate percentages
category_counts = None  # TODO: Count categories
total_categorized = len(error_categories)

category_percentages = None  # TODO: Calculate percentages

print("\n" + "="*70)
print("ERROR DISTRIBUTION BY CATEGORY")
print("="*70)
for category, count in sorted(category_counts.items(), key=lambda x: x[1], reverse=True):
    pct = category_percentages[category]
    print(f"{category:20s}: {count:3d} ({pct:.1f}%)")
print("="*70)
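
Counting the categories and converting the counts to percentages can be done with `collections.Counter`, as in this sketch. The helper name `summarize_categories` is hypothetical; returning a `Counter` keeps `most_common()` available for the later step.

```python
from collections import Counter


def summarize_categories(error_categories):
    """Return (counts, percentages) for a list of category labels."""
    counts = Counter(error_categories)
    total = sum(counts.values())
    # Avoid dividing by zero when no errors were categorized.
    percentages = (
        {category: 100 * n / total for category, n in counts.items()}
        if total else {}
    )
    return counts, percentages
```

In the notebook: `category_counts, category_percentages = summarize_categories(error_categories)`.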

## Step 13: Visualize

In [None]:
# TODO: Create bar plot of error distribution
# Hint: Use plt.bar() with categories and counts
# Hint: Add value labels on bars

fig, ax = plt.subplots(figsize=(10, 6))

categories = None  # TODO: Get categories
counts = None  # TODO: Get counts

# TODO: Create bar plot
ax.set_xlabel('Error Category', fontsize=12, fontweight='bold')
ax.set_ylabel('Count', fontsize=12, fontweight='bold')
ax.set_title('Error Distribution by Category (30 Sampled Errors)', 
             fontsize=14, fontweight='bold')
ax.grid(axis='y', alpha=0.3)

plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

## Step 14: Save to CSV

In [None]:
# TODO: Save categorized errors to CSV
# Hint: Create list of dicts with id, question, prediction, ground_truth, f1_score, is_impossible, error_type
# Hint: Convert to DataFrame and save with to_csv()

categorized_data = []  # TODO: Build categorized data list

categorized_df = pd.DataFrame(categorized_data)
categorized_df.to_csv('categorized_errors.csv', index=False)

print("✓ Saved categorized errors to categorized_errors.csv")
print(f"✓ Saved {len(categorized_df)} categorized errors")
print("\n" + categorized_df.head(10).to_string())
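
The rows for the CSV can be built by pairing each sampled error with its category. The sketch below uses the column names from the hint; `build_categorized_rows` is a hypothetical helper, and the input keys are assumed to match the prediction records used earlier.

```python
def build_categorized_rows(sample_errors, error_categories):
    """Pair each sampled error with its category as a flat dict per row."""
    rows = []
    for error, category in zip(sample_errors, error_categories):
        rows.append({
            "id": error["id"],
            "question": error["question"],
            "prediction": error["prediction_text"],
            "ground_truth": error["ground_truth"],
            "f1_score": error["f1_score"],
            "is_impossible": error["is_impossible"],
            "error_type": category,
        })
    return rows
```

The result can be assigned to `categorized_data` before the DataFrame is created.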

## Checkpoint: Part B Complete! ✓

---

# Part C: Identify Patterns and Document Findings (3 minutes)

## Step 15: Identify Top 2 Error Types

In [None]:
# TODO: Get the 2 most common error types
# Hint: Use category_counts.most_common(2)

top_2_errors = None  # TODO: Get top 2

print("Top 2 Error Types:")
for i, (category, count) in enumerate(top_2_errors, 1):
    pct = 100 * count / total_categorized
    print(f"{i}. {category}: {count} ({pct:.1f}%)")

## Step 16-17: Inspect Examples

In [None]:
# TODO: Select 3 examples of each top error type
# Hint: Filter sample_errors by error_categories to get examples of each type

top_1_category = None  # TODO: Get the most common category
top_2_category = None  # TODO: Get the second most common category

top_1_examples = []  # TODO: Get 3 examples of top 1
top_2_examples = []  # TODO: Get 3 examples of top 2

print(f"\nTop error type 1: {top_1_category}")
print(f"Selected {len(top_1_examples)} examples")

print(f"\nTop error type 2: {top_2_category}")
print(f"Selected {len(top_2_examples)} examples")
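
Selecting examples for the most common error types could be sketched as below. `top_error_examples` is a hypothetical helper; it assumes `sample_errors` and `error_categories` are parallel lists, and it returns a dictionary ordered from most to least common category.

```python
from collections import Counter


def top_error_examples(sample_errors, error_categories, n_types=2, n_examples=3):
    """Map each of the n_types most common categories to up to n_examples errors."""
    counts = Counter(error_categories)
    selected = {}
    for category, _ in counts.most_common(n_types):
        matches = [
            error
            for error, label in zip(sample_errors, error_categories)
            if label == category
        ]
        selected[category] = matches[:n_examples]
    return selected
```

The keys of the returned dictionary give `top_1_category` and `top_2_category`, and the values give the example lists to pass to `display_error`.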

In [None]:
# TODO: Display examples from each top error type
# Hint: Loop through top_1_examples and top_2_examples
# Hint: Print question, context, prediction, ground truth, F1 score for each

print("\n" + "="*80)
print(f"TOP ERROR TYPE 1: {top_1_category.upper()}")
print("="*80)

# TODO: Display top_1_examples

print("\n" + "="*80)
print(f"TOP ERROR TYPE 2: {top_2_category.upper()}")
print("="*80)

# TODO: Display top_2_examples

## Step 18: Document Findings

---

### FINDINGS:

**1. Top 2 Error Types:**
- TODO: What are the top 2 error types you found?
- TODO: What percentage of errors does each represent?

**2. Patterns Observed:**

*Top Error Type 1:*
- TODO: After inspecting 3 examples, what patterns do you notice?
- TODO: When/why does this error occur?
- TODO: What contextual clues cause the model to make this mistake?

*Top Error Type 2:*
- TODO: After inspecting 3 examples, what patterns do you notice?
- TODO: When/why does this error occur?
- TODO: What could be improved to prevent this error?

**3. Proposed Improvement:**

Based on your top error type:
- TODO: What specific change would you make to reduce this error?
- TODO: Would you add more training data, change the model, or modify preprocessing?
- TODO: What's your confidence that this would improve performance?

---

## 🎉 Exercise Complete!


### Key Takeaway

**Metrics + Error Analysis = Complete Evaluation**

You now understand:
- WHAT the performance is (F1, EM)
- HOW retrieval quality looks (P@K, R@K, MRR)
- WHY failures occur (error categorization)