# Running Benchmarks & Evaluation

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/your-org/neurosym-kg/blob/main/notebooks/04_evaluation.ipynb)

This notebook demonstrates how to evaluate reasoners on standard benchmarks:

- **WebQSP** - WebQuestions Semantic Parsing
- **CWQ** - ComplexWebQuestions
- **MetaQA** - Multi-hop movie QA
- **SimpleQuestions** - Single-relation questions (listed for reference; not exercised in the cells below)

**Note**: A full benchmark evaluation requires downloading the datasets (see Section 11). The built-in demo examples are enough for quick testing.

## 1. Setup

In [None]:
# Install dependencies
!pip install -q pydantic pydantic-settings httpx tenacity networkx numpy tqdm

# For Colab Pro with GPU (if using HuggingFace models)
# !pip install -q torch transformers accelerate

In [None]:
import json
from datetime import datetime

from neurosym_kg import (
    InMemoryKG,
    Triple,
    MockLLMBackend,
    ThinkOnGraphReasoner,
)
from neurosym_kg.evaluation import (
    WebQSP,
    CWQ,
    MetaQA,
    BenchmarkRunner,
    RunConfig,
    MetricsCalculator,
    exact_match,
    f1_score,
    normalize_answer,
)

print("✅ Imports successful!")

## 2. Explore Available Benchmarks

NeuroSym-KG includes demo examples for each benchmark. For full evaluation, download the actual datasets.

In [None]:
# Load benchmarks (uses built-in demo examples)
benchmarks = {
    "WebQSP": WebQSP(),
    "CWQ": CWQ(),
    "MetaQA-1hop": MetaQA(hops=1),
    "MetaQA-2hop": MetaQA(hops=2),
}

print("📚 Available Benchmarks:")
print("=" * 60)
for name, bench in benchmarks.items():
    print(f"\n{name}:")
    print(f"  Examples: {len(bench)}")
    
    # Show sample questions
    for ex in list(bench)[:2]:
        print(f"  • Q: {ex.question[:50]}...")
        print(f"    A: {ex.answers[:3]}")

## 3. Understanding Metrics

In [None]:
# Demonstrate metric calculations
print("📏 Metric Examples:")
print("=" * 50)

test_cases = [
    ("Paris", "Paris"),
    ("PARIS", "Paris"),
    ("The Eiffel Tower", "Eiffel Tower"),
    ("Christopher Nolan", "Nolan"),
    ("London, UK", "London"),
    ("wrong answer", "correct answer"),
]

print(f"{'Prediction':<25} {'Ground Truth':<20} {'EM':>6} {'F1':>6}")
print("-" * 60)

for pred, gt in test_cases:
    em = exact_match(pred, gt)
    f1 = f1_score(pred, gt)
    print(f"{pred:<25} {gt:<20} {em:>6.2f} {f1:>6.2f}")

In [None]:
# Answer normalization
print("\n🔄 Answer Normalization:")
print("-" * 40)

examples = [
    "The United States of America",
    "Christopher Nolan",
    "  PARIS, FRANCE  ",
]

for ex in examples:
    normalized = normalize_answer(ex)
    print(f"'{ex}' → '{normalized}'")
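
For reference, EM and token-level F1 are commonly computed SQuAD-style, after normalizing both strings. The sketch below is a standalone approximation under that assumption. The names `sketch_normalize`, `sketch_em`, and `sketch_f1` are hypothetical, and the library's `normalize_answer`, `exact_match`, and `f1_score` may differ in detail (for example, in how punctuation or articles are handled).

```python
import re
import string
from collections import Counter


def sketch_normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def sketch_em(prediction: str, ground_truth: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(sketch_normalize(prediction) == sketch_normalize(ground_truth))


def sketch_f1(prediction: str, ground_truth: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = sketch_normalize(prediction).split()
    gt_tokens = sketch_normalize(ground_truth).split()
    if not pred_tokens or not gt_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gt_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gt_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)


print(sketch_em("The Eiffel Tower", "Eiffel Tower"))
print(round(sketch_f1("Christopher Nolan", "Nolan"), 2))
print(sketch_f1("wrong answer", "correct answer"))
```

Under this definition, "Christopher Nolan" against "Nolan" scores an EM of 0 but an F1 of about 0.67, and "wrong answer" against "correct answer" still earns an F1 of 0.5 because the two strings share one token. This is why EM and F1 are usually reported together.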

## 4. Create Test Knowledge Graph

In [None]:
def create_demo_kg() -> InMemoryKG:
    """Create a KG that can answer demo benchmark questions."""
    kg = InMemoryKG(name="Demo KG")
    
    # Facts for WebQSP demo questions
    kg.add_triples([
        Triple("France", "capital", "Paris"),
        Triple("Paris", "located_in", "France"),
        Triple("Paris", "type", "City"),
        
        Triple("Inception", "director", "Christopher_Nolan"),
        Triple("Christopher_Nolan", "born_in", "London"),
        Triple("Christopher_Nolan", "nationality", "British"),
        
        Triple("Eiffel_Tower", "located_in", "Paris"),
        Triple("Paris", "country", "France"),
        Triple("France", "official_language", "French"),
        
        Triple("Barack_Obama", "spouse", "Michelle_Obama"),
        Triple("Michelle_Obama", "spouse", "Barack_Obama"),
        
        Triple("Tesla", "CEO", "Elon_Musk"),
        Triple("Elon_Musk", "education", "University_of_Pennsylvania"),
    ])
    
    return kg

kg = create_demo_kg()
print(kg.summary())

## 5. Configure Reasoner

In [None]:
# Create mock LLM with generic responses
llm = MockLLMBackend()

# Generic patterns for entity extraction
llm.add_response(r".*Extract.*France.*", "France")
llm.add_response(r".*Extract.*Inception.*", "Inception")
llm.add_response(r".*Extract.*Obama.*", "Barack_Obama")
llm.add_response(r".*Extract.*Eiffel.*", "Eiffel_Tower")
llm.add_response(r".*Extract.*Tesla.*", "Tesla")
llm.add_response(r".*Extract.*Nolan.*", "Christopher_Nolan")
llm.add_response(r".*Extract.*", "Unknown_Entity")  # Fallback

# Patterns for relation selection, sufficiency checks, and answer generation
llm.add_response(r".*relevant.*", "capital\nlocated_in\ndirector\nspouse")
llm.add_response(r".*enough.*", "YES")
llm.add_response(r".*answer.*capital.*France.*", "Paris")
llm.add_response(r".*answer.*director.*Inception.*", "Christopher Nolan")
llm.add_response(r".*answer.*spouse.*Obama.*", "Michelle Obama")
llm.add_response(r".*answer.*", "Unknown")  # Fallback

# Create reasoner
reasoner = ThinkOnGraphReasoner(
    kg=kg,
    llm=llm,
    max_depth=2,
    beam_width=3,
    verbose=False,
)

print("✅ Reasoner configured")
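
The `# Fallback` comments in the cell above suggest that patterns are tried in registration order, with the catch-all registered last. The following minimal sketch shows that idea in isolation. `PatternResponder` is a hypothetical stand-in, and its first-match-wins behaviour, use of `re.search`, and `re.DOTALL` flag are assumptions, not a description of how `MockLLMBackend` is actually implemented.

```python
import re


class PatternResponder:
    """Hypothetical mock: returns the response of the first matching pattern."""

    def __init__(self, default=""):
        self._rules = []  # list of (compiled regex, response) in registration order
        self._default = default

    def add_response(self, pattern, response):
        # DOTALL lets ".*" span multi-line prompts.
        self._rules.append((re.compile(pattern, re.DOTALL), response))

    def generate(self, prompt):
        for regex, response in self._rules:
            if regex.search(prompt):
                return response
        return self._default


responder = PatternResponder()
responder.add_response(r".*Extract.*France.*", "France")
responder.add_response(r".*Extract.*", "Unknown_Entity")  # catch-all, registered last

print(responder.generate("Extract the entity: What is the capital of France?"))
print(responder.generate("Extract the entity: Who founded Apple?"))
```

If matching really is order-dependent, registering the catch-all before the specific patterns would shadow them, so the specific-first ordering used above matters.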

## 6. Run Evaluation with BenchmarkRunner

In [None]:
# Create benchmark runner
runner = BenchmarkRunner(reasoner)

# Configure evaluation
config = RunConfig(
    subset_size=5,      # Evaluate on 5 examples (use None for full dataset)
    random_seed=42,     # For reproducibility
    max_retries=1,
    verbose=False,
)

print("📊 Running WebQSP Evaluation...")
report = runner.evaluate(WebQSP(), config=config)

# Print summary
print(report.summary())
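
The `subset_size` and `random_seed` options are what make a partial run repeatable. As a rough illustration of the idea, the sketch below draws a seeded sample with the standard library. `sample_subset` is a hypothetical helper, and `BenchmarkRunner` may select its subset differently.

```python
import random


def sample_subset(examples, subset_size=None, random_seed=42):
    """Return a reproducible random subset; None keeps the full list."""
    examples = list(examples)
    if subset_size is None or subset_size >= len(examples):
        return examples
    rng = random.Random(random_seed)  # local RNG, leaves global random state untouched
    return rng.sample(examples, subset_size)


questions = [f"question-{i}" for i in range(20)]
first = sample_subset(questions, subset_size=5, random_seed=42)
second = sample_subset(questions, subset_size=5, random_seed=42)

print(len(first))
print(first == second)
```

Using the same seed across reasoners means each one is scored on the same questions, which keeps comparisons fair.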

In [None]:
# Detailed results
print("\n📋 Per-Question Results:")
print("=" * 70)

for pred in report.results.predictions:
    status = "✅" if pred.exact_match else "❌"
    print(f"\n{status} Q: {pred.question}")
    print(f"   Predicted: {pred.prediction}")
    print(f"   Expected: {pred.ground_truth}")
    print(f"   EM: {pred.exact_match:.0f}, F1: {pred.f1:.2f}")

## 7. Compare Multiple Benchmarks

In [None]:
# Run on multiple benchmarks
benchmark_results = {}

for bench_name, benchmark in benchmarks.items():
    print(f"\n📊 Evaluating on {bench_name}...")
    
    try:
        report = runner.evaluate(benchmark, config=config)
        benchmark_results[bench_name] = report
        print(f"   Accuracy: {report.results.accuracy:.1%}")
    except Exception as e:
        print(f"   ❌ Error: {e}")

# Summary table
print("\n" + "=" * 60)
print("                  BENCHMARK RESULTS")
print("=" * 60)
print(f"{'Benchmark':<20} {'Accuracy':>10} {'F1':>10} {'Samples':>10}")
print("-" * 60)

for name, report in benchmark_results.items():
    metrics = report.results.metrics
    print(f"{name:<20} {metrics['accuracy']:>10.1%} {metrics['f1']:>10.1%} {int(metrics['num_samples']):>10}")

## 8. Manual Evaluation Loop

In [None]:
# For more control, use MetricsCalculator directly
calculator = MetricsCalculator(dataset_name="Custom")

custom_questions = [
    {"q": "What is the capital of France?", "a": ["Paris"]},
    {"q": "Who directed Inception?", "a": ["Christopher Nolan"]},
    {"q": "Who is Obama's spouse?", "a": ["Michelle Obama"]},
]

print("Running custom evaluation...\n")

for item in custom_questions:
    result = reasoner.reason(item["q"])
    
    calculator.add_prediction(
        prediction=result.answer,
        ground_truth=item["a"],
        question=item["q"],
        confidence=result.confidence,
        latency_ms=result.latency_ms,
    )
    
    em = exact_match(result.answer, item["a"])
    status = "✅" if em else "❌"
    print(f"{status} {item['q']}")
    print(f"   → {result.answer}")

# Get aggregated results
final_results = calculator.compute()
print(f"\nFinal Accuracy: {final_results.accuracy:.1%}")
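
Each custom question above carries a list of acceptable answers. One common convention, assumed here and not confirmed for `MetricsCalculator`, is to count a prediction as correct if it matches any of them and then average over questions. The helper names and the sample records below are hypothetical and purely illustrative.

```python
def simple_normalize(text):
    # Assumed normalization: lowercase, underscores as spaces, collapsed whitespace.
    return " ".join(text.lower().replace("_", " ").split())


def em_any(prediction, gold_answers):
    """1.0 if the prediction matches any acceptable answer, else 0.0."""
    pred = simple_normalize(prediction)
    return float(any(pred == simple_normalize(gold) for gold in gold_answers))


# Illustrative (prediction, acceptable answers) pairs, not real reasoner output.
records = [
    ("Paris", ["Paris"]),
    ("Christopher_Nolan", ["Christopher Nolan", "Chris Nolan"]),
    ("Unknown", ["Michelle Obama"]),
]

scores = [em_any(pred, gold) for pred, gold in records]
accuracy = sum(scores) / len(scores)
print(f"Accuracy: {accuracy:.1%}")
```

With two of the three illustrative predictions matching, the averaged accuracy comes out at 66.7%.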

## 9. Error Analysis

In [None]:
# Analyze errors from the evaluation
if benchmark_results:
    report = list(benchmark_results.values())[0]
    
    print("🔍 Error Analysis")
    print("=" * 60)
    
    errors = report.results.error_analysis(top_n=5)
    
    if errors:
        print(f"\nTop {len(errors)} errors (by confidence):")
        for i, err in enumerate(errors, 1):
            print(f"\n{i}. Q: {err.question}")
            print(f"   Predicted: {err.prediction}")
            print(f"   Expected: {err.ground_truth}")
            print(f"   Confidence: {err.confidence:.2f} (high confidence = more problematic)")
    else:
        print("\n✅ No errors found!")
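
Ranking errors by confidence surfaces the cases where the reasoner was most sure and still wrong. A minimal sketch of that ranking follows. `PredictionRecord` and `top_confident_errors` are hypothetical names, the sample values are made up, and the library's `error_analysis` may apply different criteria.

```python
from dataclasses import dataclass


@dataclass
class PredictionRecord:  # hypothetical record type, not the library's
    question: str
    prediction: str
    confidence: float
    correct: bool


def top_confident_errors(records, top_n=5):
    """Return wrong predictions, most confident first."""
    wrong = [r for r in records if not r.correct]
    return sorted(wrong, key=lambda r: r.confidence, reverse=True)[:top_n]


# Made-up values for illustration only.
records = [
    PredictionRecord("q1", "Paris", 0.95, True),
    PredictionRecord("q2", "Unknown", 0.40, False),
    PredictionRecord("q3", "London", 0.90, False),
    PredictionRecord("q4", "Berlin", 0.65, False),
]

for rank, err in enumerate(top_confident_errors(records, top_n=2), 1):
    print(f"{rank}. {err.question}: {err.prediction} (confidence {err.confidence:.2f})")
```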

## 10. Save Results

In [None]:
# Save evaluation report
if benchmark_results:
    report = list(benchmark_results.values())[0]
    
    # Save to JSON
    output_path = "evaluation_results.json"
    report.save(output_path)
    print(f"✅ Results saved to {output_path}")
    
    # Preview saved content
    with open(output_path) as f:
        data = json.load(f)
    
    print("\nSaved data preview:")
    print(json.dumps({k: v for k, v in data.items() if k != 'predictions'}, indent=2))
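
The save-then-preview pattern can be tried without the library. The sketch below writes a small report dictionary to a temporary file and reads it back, hiding the bulky `predictions` list in the preview. The dictionary layout and its values are assumptions for illustration and may not match the file that `report.save` produces.

```python
import json
import tempfile
from pathlib import Path

# Assumed report layout with made-up values; the real file's schema may differ.
report_data = {
    "dataset": "demo",
    "metrics": {"accuracy": 0.5, "f1": 0.6},
    "predictions": [{"question": "q1", "prediction": "a1"}],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "report.json"
    path.write_text(json.dumps(report_data, indent=2), encoding="utf-8")
    loaded = json.loads(path.read_text(encoding="utf-8"))

# Preview everything except the per-question predictions.
header = {key: value for key, value in loaded.items() if key != "predictions"}
print(json.dumps(header, indent=2))
```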

## 11. Running Full Benchmarks

To run on the complete benchmarks (not just demo examples):

### Download Datasets

```bash
# WebQSP
wget https://download.microsoft.com/download/0/7/5/0755490B-8F8F-4DB3-9B34-D4C0C8B3E3F4/WebQSP.zip
unzip WebQSP.zip -d data/webqsp/

# CWQ (ComplexWebQuestions)
wget https://www.tau-nlp.org/compwebq/ComplexWebQuestions_1.1.zip
unzip ComplexWebQuestions_1.1.zip -d data/cwq/

# MetaQA
git clone https://github.com/yuyuz/MetaQA.git data/metaqa/
```

In [None]:
# Example with full dataset (remove the surrounding triple quotes once you have the data)
"""
from pathlib import Path

# Load full WebQSP dataset
webqsp_full = WebQSP(data_dir=Path("data/webqsp"))
print(f"Full WebQSP: {len(webqsp_full)} questions")

# Run with larger subset
config = RunConfig(
    subset_size=500,  # Or None for all
    random_seed=42,
)

report = runner.evaluate(webqsp_full, config=config)
print(report.summary())
"""
print("💡 Remove the triple quotes around the code above after downloading the datasets")

## 📊 Summary

This notebook covered:

1. **Available benchmarks**: WebQSP, CWQ, MetaQA, SimpleQuestions
2. **Metrics**: Exact Match (EM), F1 Score, answer normalization
3. **BenchmarkRunner**: Automated evaluation with reporting
4. **MetricsCalculator**: Manual evaluation for custom datasets
5. **Error analysis**: Understanding model failures
6. **Result persistence**: Saving reports for later analysis

### Next Steps

- Download full benchmark datasets for comprehensive evaluation
- Compare different reasoners on the same benchmark
- Analyze performance by question type (1-hop vs multi-hop)
- Swap `MockLLMBackend` for a real LLM backend to get meaningful scores
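
As a starting point for the per-question-type analysis, accuracy can be grouped by a label such as hop count. This is a minimal sketch with a hypothetical `accuracy_by_group` helper and made-up records. How the group label is obtained from a benchmark example is left open, since it depends on the dataset.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """records: iterable of (group_label, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        hits[group] += int(is_correct)
    return {group: hits[group] / totals[group] for group in totals}


# Illustrative values only, not real evaluation results.
records = [
    ("1-hop", True),
    ("1-hop", True),
    ("2-hop", True),
    ("2-hop", False),
]

by_group = accuracy_by_group(records)
for group, acc in sorted(by_group.items()):
    print(f"{group:<8} {acc:.1%}")
```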