# 02 - Evaluation and Analysis

This notebook demonstrates how to evaluate a trained model and analyze the results.

## What you'll learn:
- How to evaluate a model using `alt.evaluate()`
- How to interpret evaluation metrics
- How to export results to CSV/JSON
- How to visualize prediction quality

In [None]:
import altair as alt
import numpy as np
from pathlib import Path

## Basic Evaluation

Evaluate a trained model on the validation dataset:

In [None]:
# Evaluate using the run ID (uses validation data from config)
# results = alt.evaluate("run_abc123")

# Or evaluate on a specific test set
# results = alt.evaluate("run_abc123", data="path/to/test")

## Understanding Metrics

The evaluation results include the metrics below; which ones apply depends on whether the task is binary or multi-class segmentation:

In [None]:
# Example metrics explanation
metrics_info = """
=== Binary Segmentation Metrics ===
IoU (Jaccard Index): Intersection / Union of prediction and ground truth
    - Range: [0, 1], higher is better
    - IoU = TP / (TP + FP + FN)

Dice Coefficient: Equivalent to the F1 score for binary masks; counts the overlap (TP) twice
    - Range: [0, 1], higher is better
    - Dice = 2*TP / (2*TP + FP + FN)

Precision: Fraction of predicted positives that are correct
    - Precision = TP / (TP + FP)

Recall: Fraction of actual positives that are found
    - Recall = TP / (TP + FN)

=== Multi-class Segmentation Metrics ===
mIoU: Mean IoU across all classes
mDice: Mean Dice across all classes
Pixel Accuracy: Percentage of correctly classified pixels
Per-class IoU: IoU for each individual class

=== Advanced Metrics (with segmentation-evaluation) ===
soft_pq: Soft Panoptic Quality - handles instance segmentation
mAP: Mean Average Precision
"""
print(metrics_info)
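
The binary formulas above can be checked by hand on a small mask pair. The following is a minimal NumPy sketch, not the library's own implementation: `binary_segmentation_metrics` and the `eps` smoothing term are assumptions introduced here for illustration.

```python
import numpy as np

def binary_segmentation_metrics(prediction, ground_truth, eps=1e-7):
    """Compute IoU, Dice, precision and recall for two binary masks.

    Hypothetical helper for illustration. `eps` avoids division by zero,
    so two empty masks score 0 here (other conventions score them 1).
    """
    pred = np.asarray(prediction).astype(bool)
    gt = np.asarray(ground_truth).astype(bool)

    tp = np.logical_and(pred, gt).sum()   # predicted 1, actually 1
    fp = np.logical_and(pred, ~gt).sum()  # predicted 1, actually 0
    fn = np.logical_and(~pred, gt).sum()  # predicted 0, actually 1

    return {
        "IoU": float(tp / (tp + fp + fn + eps)),
        "Dice": float(2 * tp / (2 * tp + fp + fn + eps)),
        "Precision": float(tp / (tp + fp + eps)),
        "Recall": float(tp / (tp + fn + eps)),
    }

# Toy masks: 2 true positives, 1 false positive, 1 false negative
pred = np.array([[1, 1, 0], [1, 0, 0]])
gt = np.array([[1, 1, 1], [0, 0, 0]])
toy_metrics = binary_segmentation_metrics(pred, gt)
for name, value in toy_metrics.items():
    print(f"{name}: {value:.4f}")
```

With TP = 2, FP = 1 and FN = 1, IoU is 2/4 and Dice is 4/6, which shows that Dice is never lower than IoU on the same masks.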

In [None]:
# Access metrics from results
# (assuming 'results' is available from evaluation)

# Raw string so that the "\n" inside the snippet is printed literally
example_code = r"""
# Aggregate metrics
print("=== Aggregate Metrics ===")
for name, value in results.metrics.items():
    print(f"{name}: {value:.4f}")

# Per-class metrics (for multi-class)
print("\n=== Per-class IoU ===")
for cls, iou in results.per_class_metrics.get('IoU', {}).items():
    print(f"Class {cls}: {iou:.4f}")

# Access specific metrics
miou = results['mIoU']  # or results.get('mIoU', 0.0)
dice = results['mDice']
"""
print(example_code)
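
To make the per-class IoU, mIoU and pixel accuracy values more concrete, here is a rough sketch of how they can be derived from a confusion matrix. It is an illustration under assumptions: `multiclass_metrics` is a hypothetical helper, labels are assumed to be integers in `[0, num_classes)`, and the library may aggregate differently (for example per image rather than over all pixels).

```python
import numpy as np

def multiclass_metrics(prediction, ground_truth, num_classes):
    """Per-class IoU, mIoU and pixel accuracy from two integer label maps."""
    pred = np.asarray(prediction, dtype=np.int64).ravel()
    gt = np.asarray(ground_truth, dtype=np.int64).ravel()

    # Confusion matrix: rows are ground-truth classes, columns are predictions
    conf = np.bincount(gt * num_classes + pred, minlength=num_classes ** 2)
    conf = conf.reshape(num_classes, num_classes)

    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    denom = tp + fp + fn

    # Classes absent from both maps get NaN so they do not distort the mean
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)
    return {
        "per_class_iou": {cls: float(v) for cls, v in enumerate(iou)},
        "mIoU": float(np.nanmean(iou)),
        "pixel_accuracy": float(tp.sum() / conf.sum()),
    }

# Toy 2x3 label maps with three classes; one pixel of class 2 is predicted as 1
pred_labels = np.array([[0, 1, 1], [2, 2, 0]])
gt_labels = np.array([[0, 1, 2], [2, 2, 0]])
mc = multiclass_metrics(pred_labels, gt_labels, num_classes=3)
print(mc)
```

A single misclassified pixel lowers the IoU of two classes at once (a false positive for class 1 and a false negative for class 2), which is why mIoU tends to drop faster than pixel accuracy.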

## Per-Sample Analysis

Analyze performance on individual samples:

In [None]:
# Per-sample metrics are useful for:
# - Finding difficult samples
# - Identifying failure cases
# - Understanding model behavior

example_analysis = """
# Sort samples by IoU (worst first)
sorted_samples = sorted(
    results.per_sample_metrics,
    key=lambda x: x.get('IoU', x.get('mIoU', 0))
)

# Print worst 5 samples
print("=== Worst Performing Samples ===")
for sample in sorted_samples[:5]:
    print(f"Image: {sample.get('image_path', 'N/A')}")
    print(f"  IoU: {sample.get('IoU', sample.get('mIoU', 0)):.4f}")
    print()

# Print best 5 samples
print("=== Best Performing Samples ===")
for sample in sorted_samples[-5:]:
    print(f"Image: {sample.get('image_path', 'N/A')}")
    print(f"  IoU: {sample.get('IoU', sample.get('mIoU', 0)):.4f}")
"""
print(example_analysis)
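
Beyond a fixed top-5 and bottom-5 listing, it can help to flag every sample below a score threshold. The sketch below assumes per-sample records are plain dicts with an `image_path` and either an `IoU` or `mIoU` key, as in the snippet above. The records, `sample_score` and `find_failures` are hypothetical placeholders, not library output or API.

```python
# Placeholder records shaped like the per-sample dicts used above
per_sample_metrics = [
    {"image_path": "img_001.png", "IoU": 0.91},
    {"image_path": "img_002.png", "IoU": 0.34},
    {"image_path": "img_003.png", "mIoU": 0.78},
    {"image_path": "img_004.png", "IoU": 0.12},
]

def sample_score(sample):
    """Return IoU if present, else mIoU, else 0.0."""
    return sample.get("IoU", sample.get("mIoU", 0.0))

def find_failures(samples, threshold=0.5):
    """Return samples scoring below `threshold`, worst first."""
    failures = [s for s in samples if sample_score(s) < threshold]
    return sorted(failures, key=sample_score)

failures = find_failures(per_sample_metrics, threshold=0.5)
for sample in failures:
    print(f"{sample['image_path']}: {sample_score(sample):.4f}")
```

The threshold of 0.5 is arbitrary; a value chosen from the score distribution of your own validation set is usually more informative.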

## Exporting Results

Save results for further analysis or reporting:

In [None]:
# Save to CSV (per-sample metrics)
# results.to_csv("evaluation_results.csv")

# Save to JSON (all metrics)
# results.to_json("evaluation_results.json")

# Print summary
# results.print_summary()
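
If you need a custom export layout, per-sample records can also be written with the standard library. This is a sketch under the assumption that per-sample metrics are a list of dicts. `export_per_sample_csv` and `export_json` are hypothetical helpers and do not reproduce the file format of `results.to_csv()` or `results.to_json()`.

```python
import csv
import json
import tempfile
from pathlib import Path

def export_per_sample_csv(samples, path):
    """Write one row per sample; missing metrics become empty cells."""
    # Union of all keys, in first-seen order, so no column is dropped
    fieldnames = list(dict.fromkeys(key for s in samples for key in s))
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(samples)

def export_json(aggregate, samples, path):
    """Write aggregate and per-sample metrics into a single JSON file."""
    with open(path, "w") as f:
        json.dump({"metrics": aggregate, "per_sample": samples}, f, indent=2)

# Placeholder values for illustration only
samples = [
    {"image_path": "img_001.png", "IoU": 0.91},
    {"image_path": "img_002.png", "IoU": 0.34, "Dice": 0.51},
]
aggregate = {"IoU": 0.625}

out_dir = Path(tempfile.mkdtemp())
csv_path = out_dir / "evaluation_results.csv"
json_path = out_dir / "evaluation_results.json"
export_per_sample_csv(samples, csv_path)
export_json(aggregate, samples, json_path)
print(csv_path.read_text())
```

Collecting the union of keys matters when some samples carry metrics that others lack; `csv.DictWriter` would otherwise raise an error on unexpected fields.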

## Visualizing Results

Use visualization utilities to inspect predictions:

In [None]:
from altair.utils import (
    create_overlay,
    create_comparison,
    create_error_map,
    visualize_prediction,
    SampleExporter,
)

In [None]:
# Create overlay visualization
example_viz = """
from PIL import Image
import numpy as np

# Load an image and its prediction
image = np.array(Image.open("test_image.png").convert("RGB"))
prediction = np.array(Image.open("prediction_mask.png"))
ground_truth = np.array(Image.open("ground_truth.png"))

# Create overlay (mask on image)
overlay = create_overlay(image, prediction, alpha=0.5)
Image.fromarray(overlay).save("overlay.png")

# Create comparison (image | GT | prediction)
comparison = create_comparison(image, ground_truth, prediction)
Image.fromarray(comparison).save("comparison.png")

# Create error map (green=correct, red=error)
error_map = create_error_map(ground_truth, prediction)
Image.fromarray(error_map).save("error_map.png")
"""
print(example_viz)
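
For intuition about what an overlay and an error map contain, the following NumPy sketch builds both by hand. It is only an approximation of the idea: `blend_overlay` and `simple_error_map` are hypothetical names, and the actual `create_overlay` and `create_error_map` utilities may use different colors, blending or signatures.

```python
import numpy as np

def blend_overlay(image, mask, color=(255, 0, 0), alpha=0.5):
    """Alpha-blend `color` onto the pixels of an RGB image where mask is nonzero."""
    overlay = image.astype(np.float32)  # astype returns a copy
    fg = np.asarray(mask).astype(bool)
    overlay[fg] = (1 - alpha) * overlay[fg] + alpha * np.array(color, dtype=np.float32)
    return overlay.round().astype(np.uint8)

def simple_error_map(ground_truth, prediction):
    """Return an RGB image: green where labels agree, red where they differ."""
    correct = np.asarray(ground_truth) == np.asarray(prediction)
    error_map = np.zeros(correct.shape + (3,), dtype=np.uint8)
    error_map[correct] = (0, 255, 0)
    error_map[~correct] = (255, 0, 0)
    return error_map

# Toy inputs: a uniform gray 2x2 image and two small masks
toy_image = np.full((2, 2, 3), 100, dtype=np.uint8)
toy_gt = np.array([[1, 0], [0, 1]])
toy_pred = np.array([[1, 1], [0, 0]])

toy_overlay = blend_overlay(toy_image, toy_pred, alpha=0.5)
toy_error_map = simple_error_map(toy_gt, toy_pred)
print(toy_overlay[0, 0], toy_error_map[0, 1])
```

Blending in floating point and converting back to `uint8` at the end avoids the integer overflow that would occur if the arithmetic were done directly on `uint8` arrays.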

In [None]:
# Full visualization with matplotlib
viz_code = """
# Visualize a single prediction
visualize_prediction(
    image=image,
    ground_truth=ground_truth,
    prediction=prediction,
    show_error_map=True,
    save_path="visualization.png"
)
"""
print(viz_code)

## Export Sample Visualizations

Export multiple samples with visualizations for review:

In [None]:
# Use SampleExporter for batch visualization export
export_code = """
# Create exporter
exporter = SampleExporter(
    output_dir="eval_samples/",
    max_samples=20,
    alpha=0.5,
)

# Add samples from evaluation
for i, sample in enumerate(results.per_sample_metrics[:20]):
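    # Load image and masks. load_image / load_mask are not imported in this
    # notebook; substitute your own loading helpers.
    image = load_image(sample['image_path'])
    prediction = results.predictions[i]  # Only available if predictions were stored
    ground_truth = load_mask(sample['image_path'])  # Load the GT mask corresponding to this image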
    
    exporter.add_sample(
        image=image,
        prediction=prediction,
        ground_truth=ground_truth,
        metrics=sample,
        image_path=sample['image_path'],
    )

# Save summary and grid
exporter.save_summary()
exporter.save_grid(cols=4)
"""
print(export_code)

## Evaluation with Sample Export (All-in-One)

You can also export samples directly during evaluation:

In [None]:
# Evaluate and export samples in one step
# (`run` and `checkpoint` are assumed to be defined earlier in your session)
# from altair.engine.evaluator import Evaluator
# 
# evaluator = Evaluator(run, checkpoint, store_predictions=True)
# results = evaluator.evaluate(
#     export_samples=True,
#     export_dir="eval_samples/",
#     n_export_samples=20,
# )

## Comparing Multiple Runs

Compare results across different experiments:

In [None]:
compare_code = """
# Evaluate multiple runs
run_ids = ["run_abc123", "run_def456", "run_ghi789"]
all_results = {}

for run_id in run_ids:
    results = alt.evaluate(run_id, data="path/to/test")
    all_results[run_id] = results.metrics

# Compare key metrics
print("=== Comparison ===")
print(f"{'Run ID':<15} {'mIoU':<10} {'mDice':<10}")
print("-" * 37)
for run_id, metrics in all_results.items():
    print(f"{run_id:<15} {metrics['mIoU']:<10.4f} {metrics['mDice']:<10.4f}")
"""
print(compare_code)
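
Once several runs are collected, ranking them by a chosen metric is a small extension of the comparison table. The sketch below assumes `all_results` maps each run ID to a dict of metric values, as in the snippet above. `rank_runs` is a hypothetical helper, and the numbers are placeholders rather than real results.

```python
def rank_runs(all_results, metric="mIoU"):
    """Return (run_id, metrics) pairs sorted by `metric`, best first.

    Runs that lack the metric are placed last.
    """
    return sorted(
        all_results.items(),
        key=lambda item: item[1].get(metric, float("-inf")),
        reverse=True,
    )

# Placeholder values for illustration only
all_results = {
    "run_abc123": {"mIoU": 0.71, "mDice": 0.80},
    "run_def456": {"mIoU": 0.76, "mDice": 0.84},
    "run_ghi789": {"mIoU": 0.68},
}

ranking = rank_runs(all_results, metric="mIoU")
best_run_id, best_metrics = ranking[0]
print(f"Best run by mIoU: {best_run_id} ({best_metrics['mIoU']:.4f})")
```

Using `.get()` with a fallback keeps the ranking from failing when a run was evaluated without a particular metric.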

## Next Steps

- **03_inference.ipynb**: Run predictions on new images
- **04_export.ipynb**: Export your model for deployment