# QC: Benchmark Framework Validation

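This notebook validates the `starfinder.benchmark` module by exercising each component and checking that the reported timing and memory measurements are reasonable:
1. Testing the `measure()` function
2. Testing the `@benchmark` decorator
3. Testing `run_comparison()` across methods
4. Testing report generation (table, CSV, JSON)
5. Testing `BenchmarkSuite` result collection and summary statistics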

In [1]:
import sys
sys.path.insert(0, "../src/python")

import numpy as np
from pathlib import Path

from starfinder.benchmark import (
    BenchmarkResult,
    BenchmarkSuite,
    benchmark,
    measure,
    run_comparison,
    print_table,
    save_csv,
    save_json,
    SIZE_PRESETS,
)

print("Benchmark module loaded successfully!")
print(f"Available size presets: {list(SIZE_PRESETS.keys())}")

Benchmark module loaded successfully!
Available size presets: ['tiny', 'small', 'medium', 'large', 'xlarge', 'tissue']


## 1. Test `measure()` Function

Verify that `measure()` correctly captures execution time and memory.

In [2]:
import time

def slow_function():
    """Function that takes ~100ms."""
    time.sleep(0.1)
    return 42

result, elapsed, memory = measure(slow_function)
print(f"Result: {result}")
print(f"Elapsed time: {elapsed:.3f}s (expected ~0.1s)")
print(f"Memory: {memory:.2f} MB")

assert result == 42, "Return value should be preserved"
assert 0.08 < elapsed < 0.15, f"Time should be ~0.1s, got {elapsed:.3f}s"
print("✓ measure() works correctly")

Result: 42
Elapsed time: 0.100s (expected ~0.1s)
Memory: 0.00 MB
✓ measure() works correctly
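
The `Memory: 0.00 MB` reading is plausible for a function that only sleeps. The sketch below shows one way a `measure()`-style helper could capture time and peak memory, using `time.perf_counter` and `tracemalloc` from the standard library. `measure_sketch`, `sleeper`, and `allocator` are hypothetical names, and whether `starfinder.benchmark.measure` works this way is an assumption, not something this notebook confirms.

```python
import time
import tracemalloc


def measure_sketch(func, *args, **kwargs):
    """Hypothetical stand-in for measure(): returns (result, seconds, peak_mb)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes since start().
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak_bytes / (1024 * 1024)


def sleeper():
    time.sleep(0.05)  # waits without allocating anything notable
    return 42


def allocator():
    return bytearray(4 * 1024 * 1024)  # allocates about 4 MB


value, sleep_time, sleep_mb = measure_sketch(sleeper)
buffer, alloc_time, alloc_mb = measure_sketch(allocator)
print(f"sleeper:   {sleep_time:.3f}s, {sleep_mb:.2f} MB")
print(f"allocator: {alloc_time:.3f}s, {alloc_mb:.2f} MB")
```

Under this approach the sleeping function reports close to zero memory while the allocating one reports roughly the size of its buffer, so a memory check is more informative when the test function allocates something of known size.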


## 2. Test `@benchmark` Decorator

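Verify the decorator captures timing and extracts the size from the argument named by `size_arg`. The decorated call returns a result object; the wrapped function's own return value is stored under `metrics["return_value"]`.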

In [3]:
@benchmark(method="numpy", operation="sum", size_arg="arr")
def array_sum(arr):
    return arr.sum()

test_arr = np.ones((10, 256, 256))
result = array_sum(test_arr)

print(f"Method: {result.method}")
print(f"Operation: {result.operation}")
print(f"Size: {result.size}")
print(f"Time: {result.time_seconds:.6f}s")
print(f"Memory: {result.memory_mb:.2f} MB")
print(f"Return value: {result.metrics['return_value']}")

assert result.size == (10, 256, 256), "Size should be extracted from array"
assert result.metrics["return_value"] == 10 * 256 * 256, "Return value incorrect"
print("✓ @benchmark decorator works correctly")

Method: numpy
Operation: sum
Size: (10, 256, 256)
Time: 0.000621s
Memory: 0.06 MB
Return value: 655360.0
✓ @benchmark decorator works correctly
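
For readers curious how a decorator can find the `size_arg` argument regardless of how it is passed, here is a minimal sketch using `inspect.signature`. `benchmark_sketch`, `ResultSketch`, `Grid`, and `count_cells` are all hypothetical; the field names simply mirror those printed above, and the real `@benchmark` implementation may differ.

```python
import functools
import inspect
import time
from dataclasses import dataclass, field


@dataclass
class ResultSketch:
    """Hypothetical record; field names mirror those printed above."""
    method: str
    operation: str
    size: tuple
    time_seconds: float
    metrics: dict = field(default_factory=dict)


def benchmark_sketch(method, operation, size_arg):
    """Hypothetical decorator: times the call and reads .shape from `size_arg`."""
    def decorator(func):
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Bind so the argument is found whether passed by position or keyword.
            bound = signature.bind(*args, **kwargs)
            bound.apply_defaults()
            size = tuple(getattr(bound.arguments[size_arg], "shape", ()))
            start = time.perf_counter()
            value = func(*args, **kwargs)
            elapsed = time.perf_counter() - start
            return ResultSketch(method, operation, size, elapsed,
                                {"return_value": value})
        return wrapper
    return decorator


class Grid:
    """Tiny array stand-in so the sketch needs no third-party packages."""
    def __init__(self, shape):
        self.shape = shape


@benchmark_sketch(method="demo", operation="count", size_arg="grid")
def count_cells(grid):
    total = 1
    for dim in grid.shape:
        total *= dim
    return total


by_position = count_cells(Grid((10, 256, 256)))
by_keyword = count_cells(grid=Grid((2, 3)))
print(by_position.size, by_position.metrics["return_value"])
print(by_keyword.size, by_keyword.metrics["return_value"])
```

As with `array_sum` above, the decorated function hands back the result record rather than the raw value, which is why callers read the value from `metrics["return_value"]`.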


## 3. Test `run_comparison()`

Compare multiple methods on the same inputs.

In [4]:
def numpy_sum(arr):
    return np.sum(arr)

def loop_sum(arr):
    total = 0.0
    for val in arr.flat:
        total += val
    return total

# Create test arrays
small_arr = np.ones((5, 64, 64))
medium_arr = np.ones((10, 128, 128))

results = run_comparison(
    methods={"numpy": numpy_sum, "loop": loop_sum},
    inputs=[small_arr, medium_arr],
    operation="sum",
    n_runs=3,
    warmup=True,
)

print(f"Collected {len(results)} results")
print_table(results)

# Verify numpy is faster than loop
numpy_times = [r.time_seconds for r in results if r.method == "numpy"]
loop_times = [r.time_seconds for r in results if r.method == "loop"]
assert np.mean(numpy_times) < np.mean(loop_times), "NumPy should be faster than the loop"
print(f"NumPy avg: {np.mean(numpy_times):.6f}s")
print(f"Loop avg: {np.mean(loop_times):.6f}s")
print(f"NumPy is {np.mean(loop_times) / np.mean(numpy_times):.1f}x faster")
print("✓ run_comparison() works correctly")

Collected 4 results

| Method | Operation | Size | Time (s) | Memory (MB) |
|--------|-----------|------|----------|-------------|
| numpy  | sum       | 5x64x64 |   0.0001 |         0.1 |
| loop   | sum       | 5x64x64 |   0.0187 |         0.0 |
| numpy  | sum       | 10x128x128 |   0.0001 |         0.1 |
| loop   | sum       | 10x128x128 |   0.1207 |         0.0 |

NumPy avg: 0.000068s
Loop avg: 0.069731s
NumPy is 1020.7x faster
✓ run_comparison() works correctly
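
The table lists one row per input and method, with inputs in the outer position. A comparison loop with that shape might look like the sketch below. `compare_sketch` and `builtin_sum` are hypothetical names, and averaging the timed runs (rather than taking the minimum or median) is an assumption about what `n_runs` means. Plain lists stand in for arrays so the sketch needs only the standard library.

```python
import time


def compare_sketch(methods, inputs, n_runs=3, warmup=True):
    """Hypothetical loop: one row per (input, method), mean time over n_runs."""
    rows = []
    for data in inputs:
        for name, func in methods.items():
            if warmup:
                func(data)  # untimed call so one-off setup costs are excluded
            timings = []
            for _ in range(n_runs):
                start = time.perf_counter()
                func(data)
                timings.append(time.perf_counter() - start)
            rows.append({"method": name, "size": len(data),
                         "time_seconds": sum(timings) / n_runs})
    return rows


def builtin_sum(values):
    return sum(values)


def loop_sum(values):
    total = 0.0
    for value in values:
        total += value
    return total


rows = compare_sketch(
    methods={"builtin": builtin_sum, "loop": loop_sum},
    inputs=[[1.0] * 1_000, [1.0] * 10_000],
)
for row in rows:
    print(f"{row['method']:8s} n={row['size']:<6d} {row['time_seconds']:.6f}s")
```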


## 4. Test Report Generation

Save results to CSV and JSON.

In [5]:
import json

output_dir = Path("benchmark_output")
output_dir.mkdir(exist_ok=True)

# Save CSV
csv_path = output_dir / "test_results.csv"
save_csv(results, csv_path)
print(f"Saved CSV to: {csv_path}")
print(csv_path.read_text()[:500])

# Save JSON
json_path = output_dir / "test_results.json"
save_json(results, json_path)
print(f"\nSaved JSON to: {json_path}")

data = json.loads(json_path.read_text())
print(f"JSON contains {len(data)} results")
assert len(data) == len(results), "JSON should contain one entry per result"
print("✓ Report generation works correctly")

Saved CSV to: benchmark_output/test_results.csv
method,operation,size,time_seconds,memory_mb
numpy,sum,5x64x64,5.678584178288778e-05,0.06342315673828125
loop,sum,5x64x64,0.018738817423582077,0.0027882258097330728
numpy,sum,10x128x128,7.984352608521779e-05,0.06342315673828125
loop,sum,10x128x128,0.12072226156791051,0.0027720133463541665


Saved JSON to: benchmark_output/test_results.json
JSON contains 4 results
✓ Report generation works correctly
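
Once results are on disk they can be read back for further analysis. As an illustration, the sketch below parses the CSV rows shown above with the standard `csv` module and computes the speedup separately for each input size. The `1020.7x` figure reported earlier is a ratio of mean times across both sizes, so it is weighted toward the larger input; the per-size ratios come out at roughly 330x for `5x64x64` and roughly 1500x for `10x128x128`. The grouping logic here is illustrative and not part of `starfinder.benchmark`.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the CSV output above.
CSV_TEXT = """\
method,operation,size,time_seconds,memory_mb
numpy,sum,5x64x64,5.678584178288778e-05,0.06342315673828125
loop,sum,5x64x64,0.018738817423582077,0.0027882258097330728
numpy,sum,10x128x128,7.984352608521779e-05,0.06342315673828125
loop,sum,10x128x128,0.12072226156791051,0.0027720133463541665
"""

# Group times as {size: {method: seconds}}; csv yields strings, so cast.
times = defaultdict(dict)
for row in csv.DictReader(io.StringIO(CSV_TEXT)):
    times[row["size"]][row["method"]] = float(row["time_seconds"])

speedups = {size: by_method["loop"] / by_method["numpy"]
            for size, by_method in times.items()}
for size, ratio in speedups.items():
    print(f"{size}: numpy is {ratio:.0f}x faster than loop")
```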


## 5. Test BenchmarkSuite

Collect results and compute summary statistics.

In [6]:
suite = BenchmarkSuite(name="sum_benchmark")
for r in results:
    suite.add(r)

print(f"Suite '{suite.name}' has {len(suite.results)} results")

stats = suite.summary()
print("\nSummary statistics:")
for key, value in stats.items():
    print(f"  {key}: {value:.6f}")

# Filter by method
numpy_results = suite.filter(method="numpy")
print(f"\nNumPy results: {len(numpy_results)}")
print("✓ BenchmarkSuite works correctly")

Suite 'sum_benchmark' has 4 results

Summary statistics:
  mean_time: 0.034899
  min_time: 0.000057
  max_time: 0.120722
  std_time: 0.050133
  mean_memory: 0.033102

NumPy results: 2
✓ BenchmarkSuite works correctly
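
The summary values can be cross-checked by hand from the four `time_seconds` values in the CSV output earlier in this notebook. The sketch below uses the standard `statistics` module; it suggests that `std_time` is a population standard deviation (dividing by n, which is also NumPy's default), though how `suite.summary()` computes it internally is an assumption.

```python
import statistics

# time_seconds values copied from the CSV output earlier in this notebook.
times = [
    5.678584178288778e-05,
    0.018738817423582077,
    7.984352608521779e-05,
    0.12072226156791051,
]

mean_time = statistics.fmean(times)
population_std = statistics.pstdev(times)  # divides by n
sample_std = statistics.stdev(times)       # divides by n - 1

print(f"mean_time:      {mean_time:.6f}")
print(f"population std: {population_std:.6f}")
print(f"sample std:     {sample_std:.6f}")
```

The population figure agrees with the reported `std_time` of 0.050133, while the sample figure is noticeably larger. With only four results spanning two methods and two sizes, either value mostly reflects the gap between NumPy and the loop rather than run-to-run noise.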


## Summary

All benchmark framework components validated:
- [x] `measure()` - captures time and memory
- [x] `@benchmark` decorator - wraps functions with timing and records input size
- [x] `run_comparison()` - compares multiple methods
- [x] `print_table()` - formatted output
- [x] `save_csv()` / `save_json()` - file output
- [x] `BenchmarkSuite` - result collection and stats