# Tutorial 1: Metrics Basics

**Comprehensive guide to evaluation metrics for change point detection**

This notebook demonstrates:
1. Point-based metrics (precision, recall, F-beta)
2. Distance-based metrics (Hausdorff, annotation error)
3. Segmentation metrics (Adjusted Rand Index)
4. Multi-annotator metrics (covering metric)
5. Comprehensive evaluation (`evaluate_all`)
6. Visualization techniques

---

In [None]:
# Import required packages
import numpy as np
import matplotlib.pyplot as plt

from fastcpd.metrics import (
    precision_recall,
    f_beta_score,
    hausdorff_distance,
    annotation_error,
    adjusted_rand_index,
    covering_metric,
    evaluate_all
)
from fastcpd.visualization import plot_detection

%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')

## 1. Point-Based Metrics: Precision & Recall

Precision and recall are fundamental metrics for evaluating change point detection:

- **Precision**: What fraction of detected change points are correct?
- **Recall**: What fraction of true change points were detected?

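A tolerance `margin` allows "close enough" matches: a predicted change point counts as correct when it lies within `margin` samples of a true change point.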
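
To make the matching rule concrete, here is a minimal pure-Python sketch of margin-based precision and recall. It is an illustration, not fastcpd's implementation: the helper names (`match_with_margin`, `precision_recall_sketch`), the greedy one-to-one matching, the inclusive `<= margin` test, and the 0.0 returned for empty inputs are all assumptions made for this example.

```python
def match_with_margin(true_cps, pred_cps, margin):
    """Greedily pair each true CP with its nearest unused prediction within margin."""
    unused = sorted(pred_cps)
    pairs = []
    for t in sorted(true_cps):
        candidates = [p for p in unused if abs(p - t) <= margin]
        if candidates:
            best = min(candidates, key=lambda p: abs(p - t))
            pairs.append((t, best))
            unused.remove(best)  # each prediction can match at most one true CP
    return pairs, unused


def precision_recall_sketch(true_cps, pred_cps, margin):
    """Return (precision, recall) from the number of matched pairs."""
    pairs, _ = match_with_margin(true_cps, pred_cps, margin)
    tp = len(pairs)
    precision = tp / len(pred_cps) if pred_cps else 0.0
    recall = tp / len(true_cps) if true_cps else 0.0
    return precision, recall


p, r = precision_recall_sketch([100, 200], [100, 150, 200, 250], margin=10)
print(p, r)  # 0.5 1.0
```

Two of the four predictions match, so precision is 0.5, while both true change points are found, so recall is 1.0.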

In [None]:
# Example 1: Perfect detection
true_cps = [100, 200, 300]
pred_cps = [100, 200, 300]

result = precision_recall(true_cps, pred_cps, margin=10)

print("Perfect Detection:")
print(f"  Precision: {result['precision']:.3f}")
print(f"  Recall:    {result['recall']:.3f}")
print(f"  F1 Score:  {result['f1_score']:.3f}")
print(f"\nBreakdown:")
print(f"  True Positives:  {result['true_positives']}")
print(f"  False Positives: {result['false_positives']}")
print(f"  False Negatives: {result['false_negatives']}")

In [None]:
# Example 2: Detection with small errors (within margin)
true_cps = [100, 200, 300]
pred_cps = [98, 205, 295]  # All within margin=10

result = precision_recall(true_cps, pred_cps, margin=10)

print("Detection with Small Errors:")
print(f"  Precision: {result['precision']:.3f}")
print(f"  Recall:    {result['recall']:.3f}")
print(f"\nMatched Pairs (true_cp, pred_cp):")
for true_cp, pred_cp in result['matched_pairs']:
    print(f"  {true_cp} → {pred_cp} (error: {abs(true_cp - pred_cp)})")

In [None]:
# Example 3: Over-detection (false positives)
true_cps = [100, 200]
pred_cps = [100, 150, 200, 250]  # Extra CPs at 150 and 250

result = precision_recall(true_cps, pred_cps, margin=10)

print("Over-Detection (False Positives):")
print(f"  Precision: {result['precision']:.3f} (2 correct out of 4 predictions)")
print(f"  Recall:    {result['recall']:.3f} (found all true CPs)")
print(f"\nUnmatched Predictions (False Positives):")
print(f"  {result['unmatched_pred']}")

In [None]:
# Example 4: Under-detection (false negatives)
true_cps = [100, 200, 300, 400]
pred_cps = [100, 200]  # Missed 300 and 400

result = precision_recall(true_cps, pred_cps, margin=10)

print("Under-Detection (False Negatives):")
print(f"  Precision: {result['precision']:.3f} (all predictions correct)")
print(f"  Recall:    {result['recall']:.3f} (found 2 out of 4)")
print(f"\nUnmatched True CPs (False Negatives):")
print(f"  {result['unmatched_true']}")

## 2. F-Beta Score: Weighting Precision vs Recall

The F-beta score allows you to weight precision and recall differently:

- **β = 1** (F1): Equal weight
- **β < 1** (F0.5): Favor precision (penalize false positives more)
- **β > 1** (F2): Favor recall (penalize false negatives more)
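
The weighting follows the standard formula F_β = (1 + β²) · P · R / (β² · P + R). The short sketch below evaluates it directly. The function name `f_beta` and the input values (precision 0.75, recall 1.0, which is what three true change points and four predictions with one false positive would give) are chosen for illustration and do not call fastcpd.

```python
def f_beta(precision, recall, beta):
    """Standard F-beta score; returns 0.0 when both inputs are zero."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


precision, recall = 0.75, 1.0
for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: {f_beta(precision, recall, beta):.4f}")
# beta=0.5: 0.7895
# beta=1.0: 0.8571
# beta=2.0: 0.9375
```

Because recall is the stronger of the two here, raising β moves the score toward recall and increases it, while lowering β moves it toward precision and decreases it.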

In [None]:
# Example: When missing CPs is worse than false alarms
true_cps = [100, 200, 300]
pred_cps = [100, 200, 250, 300]  # One false positive at 250

result = f_beta_score(true_cps, pred_cps, beta=1.0, margin=10)

print("F-beta Scores:")
print(f"  F1  (β=1.0): {result['f1_score']:.4f} (balanced)")
print(f"  F2  (β=2.0): {result['f2_score']:.4f} (favor recall)")
print(f"  F0.5(β=0.5): {result['f0_5_score']:.4f} (favor precision)")
print(f"\nInterpretation:")
print(f"  Precision: {result['precision']:.3f}")
print(f"  Recall:    {result['recall']:.3f}")
print(f"  \n  F2 > F1: Weighting recall more raises the score because recall exceeds precision")
print(f"  F0.5 < F1: Weighting precision more lowers the score because precision is the weaker of the two")

## 3. Distance-Based Metrics

### 3.1 Hausdorff Distance

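Measures the worst-case distance between two CP sets: the largest distance from any change point in one set to its nearest neighbor in the other set. Because it is a maximum, it is sensitive to outliers.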
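
As a rough illustration of the definition, the sketch below computes the two directed distances and takes their maximum. The helper names (`directed_distance`, `hausdorff_sketch`) are hypothetical and the code assumes both sets are non-empty. It does not reproduce fastcpd's return structure.

```python
def directed_distance(source, target):
    """Largest distance from any point in source to its nearest point in target."""
    return max(min(abs(s - t) for t in target) for s in source)


def hausdorff_sketch(a, b):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(directed_distance(a, b), directed_distance(b, a))


true_cps = [100, 200, 300]
pred_cps = [100, 200]  # the change point at 300 is missed

print(directed_distance(true_cps, pred_cps))  # 100 (300 is 100 away from 200)
print(directed_distance(pred_cps, true_cps))  # 0 (every prediction is exact)
print(hausdorff_sketch(true_cps, pred_cps))   # 100
```

The two directions can differ sharply: every prediction is exact, yet a single missed change point sets the overall distance.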

In [None]:
true_cps = [100, 200, 300]
pred_cps = [105, 200, 350]  # Errors: 5, 0, 50

result = hausdorff_distance(true_cps, pred_cps)

print("Hausdorff Distance:")
print(f"  Hausdorff distance: {result['hausdorff']:.1f}")
print(f"  Forward (true→pred): {result['forward_distance']:.1f}")
print(f"  Backward (pred→true): {result['backward_distance']:.1f}")
print(f"\nClosest Pairs:")
for cp1, cp2, dist in result['closest_pairs']:
    print(f"  {cp1} → {cp2} (distance: {dist:.1f})")
print(f"\nNote: Hausdorff = 50 because CP at 300 is 50 away from nearest pred (350)")

### 3.2 Annotation Error

Measures average localization accuracy using optimal matching.
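
The sketch below shows how MAE and RMSE can be computed once true and predicted change points are paired. It assumes both lists have the same length and pairs them in sorted order, which minimizes the total absolute error for points on a line. The helper name `annotation_errors` is hypothetical, and fastcpd's own matching may handle unequal lengths differently.

```python
import math


def annotation_errors(true_cps, pred_cps):
    """Absolute error per change point, pairing the two sets in sorted order."""
    if len(true_cps) != len(pred_cps):
        raise ValueError("this sketch assumes equally many true and predicted CPs")
    return [abs(t - p) for t, p in zip(sorted(true_cps), sorted(pred_cps))]


errors = annotation_errors([100, 200, 300], [105, 195, 310])
mae = sum(errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print(errors)                     # [5, 5, 10]
print(f"{mae:.2f} {rmse:.2f}")    # 6.67 7.07
```

RMSE is larger than MAE here because squaring gives the single error of 10 more weight than the two errors of 5.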

In [None]:
true_cps = [100, 200, 300]
pred_cps = [105, 195, 310]  # Errors: 5, 5, 10

result_mae = annotation_error(true_cps, pred_cps, method='mae')
result_rmse = annotation_error(true_cps, pred_cps, method='rmse')

print("Annotation Error:")
print(f"  MAE:  {result_mae['error']:.2f}")
print(f"  RMSE: {result_rmse['error']:.2f}")
print(f"  Median: {result_mae['median_error']:.2f}")
print(f"  Max:    {result_mae['max_error']:.2f}")
print(f"  Std:    {result_mae['std_error']:.2f}")
print(f"\nErrors per CP: {result_mae['errors_per_cp']}")

## 4. Segmentation Metrics: Adjusted Rand Index

Measures overall similarity of segmentations, accounting for chance agreement.
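
One way to see what the index measures is to convert each set of change points into per-sample segment labels and compare the two labelings. The sketch below does this with the standard pair-counting ARI formula. It assumes that a change point at index `c` starts a new segment at sample `c`. The helper names (`labels_from_cps`, `adjusted_rand`) are hypothetical, and this is not fastcpd's implementation.

```python
from collections import Counter
from math import comb


def labels_from_cps(cps, n_samples):
    """Assign each sample the index of the segment it falls in."""
    boundaries = sorted(cps)
    labels, seg = [], 0
    for i in range(n_samples):
        while seg < len(boundaries) and i >= boundaries[seg]:
            seg += 1
        labels.append(seg)
    return labels


def adjusted_rand(labels_a, labels_b):
    """Adjusted Rand Index from the contingency counts of two labelings."""
    n = len(labels_a)
    sum_pairs = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:               # degenerate case: a single segment
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)


truth = labels_from_cps([100, 200, 300], 400)
ari_perfect = adjusted_rand(truth, labels_from_cps([100, 200, 300], 400))
ari_offset = adjusted_rand(truth, labels_from_cps([105, 195, 305], 400))
print(f"{ari_perfect:.3f} {ari_offset:.3f}")  # 1.000 0.902
```

Shifting every boundary by 5 samples relabels only 15 of the 400 samples, so the index stays close to 1.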

In [None]:
# Perfect vs. slightly offset segmentation
true_cps = [100, 200, 300]
pred_cps_perfect = [100, 200, 300]
pred_cps_offset = [105, 195, 305]  # Slightly offset

result_perfect = adjusted_rand_index(true_cps, pred_cps_perfect, n_samples=400)
result_offset = adjusted_rand_index(true_cps, pred_cps_offset, n_samples=400)

print("Adjusted Rand Index:")
print(f"\nPerfect Segmentation:")
print(f"  ARI: {result_perfect['ari']:.4f}")
print(f"  Rand Index: {result_perfect['rand_index']:.4f}")

print(f"\nSlightly Offset (by 5):")
print(f"  ARI: {result_offset['ari']:.4f}")
print(f"  Rand Index: {result_offset['rand_index']:.4f}")

print(f"\nInterpretation:")
print(f"  ARI = 1.0: Perfect agreement")
print(f"  ARI = 0.0: Agreement by chance")
print(f"  ARI < 0.0: Worse than random")

## 5. Multi-Annotator Metrics: Covering Score

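**Unique to fastcpd.** Measures how well predictions agree with each annotator: recall is computed against every annotator's labels separately, and the covering score is the mean of those recalls.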
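
The idea of averaging recall over annotators can be sketched in a few lines. This is a simplified illustration and not fastcpd's implementation: the helper names are hypothetical, the annotator labels are made up, and a true change point counts as found whenever any prediction lies within `margin` (no one-to-one matching).

```python
def recall_with_margin(true_cps, pred_cps, margin):
    """Fraction of true CPs that have at least one prediction within margin."""
    found = sum(1 for t in true_cps if any(abs(t - p) <= margin for p in pred_cps))
    return found / len(true_cps)


def mean_recall_across_annotators(annotators, pred_cps, margin):
    """Per-annotator recalls and their mean."""
    recalls = [recall_with_margin(ann, pred_cps, margin) for ann in annotators]
    return sum(recalls) / len(recalls), recalls


# Hypothetical labels from four annotators for the same signal
annotators = [
    [100, 200, 300],
    [98, 204, 297],
    [100, 200],        # this annotator did not mark the third change
    [105, 215, 300],   # 215 is more than 10 samples from any prediction
]
score, recalls = mean_recall_across_annotators(annotators, [100, 200, 300], margin=10)
print([f"{r:.2f}" for r in recalls])  # ['1.00', '1.00', '1.00', '0.67']
print(f"{score:.3f}")                 # 0.917
```

Note that the third annotator still yields a recall of 1.0: recall does not penalize predictions that an annotator left unmarked.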

In [None]:
# Simulate multiple annotators
from fastcpd.datasets import add_annotation_noise

true_cps = [100, 200, 300]
annotators = add_annotation_noise(true_cps, n_annotators=5, 
                                   noise_std=5.0, agreement_rate=0.8, seed=42)

print("Multiple Annotators:")
for i, ann_cps in enumerate(annotators, 1):
    print(f"  Annotator {i}: {ann_cps}")

# Evaluate predictions
pred_cps = [100, 200, 300]
result = covering_metric(annotators, pred_cps, margin=10)

print(f"\nCovering Metric Results:")
print(f"  Covering Score: {result['covering_score']:.3f} (mean recall across annotators)")
print(f"  Std Recall:     {result['std_recall']:.3f}")
print(f"  Min Recall:     {result['min_recall']:.3f}")
print(f"  Max Recall:     {result['max_recall']:.3f}")
print(f"\nRecall per Annotator: {[f'{r:.2f}' for r in result['recall_per_annotator']]}")

## 6. Comprehensive Evaluation

Use `evaluate_all()` for one-stop evaluation with all metrics.

In [None]:
true_cps = [100, 200, 300]
pred_cps = [98, 205, 295]

result = evaluate_all(true_cps, pred_cps, n_samples=400, margin=10)

# Print formatted summary
print(result['summary'])

In [None]:
# Access individual metrics
print("\nPoint Metrics:")
for key, value in result['point_metrics'].items():
    print(f"  {key}: {value}")

print("\nDistance Metrics:")
for key, value in result['distance_metrics'].items():
    print(f"  {key}: {value:.2f}" if isinstance(value, float) else f"  {key}: {value}")

print("\nSegmentation Metrics:")
for key, value in result['segmentation_metrics'].items():
    print(f"  {key}: {value:.4f}")

## 7. Visualization Examples

Visualize detection results with metrics overlay.

In [None]:
# Generate sample data
from fastcpd.datasets import make_mean_change

data_dict = make_mean_change(n_samples=500, n_changepoints=3, 
                             noise_std=1.0, seed=42)

# Simulate detection (add some noise to true CPs)
true_cps = data_dict['changepoints']
rng = np.random.default_rng(42)  # seeded so the example is reproducible
pred_cps = [int(cp + rng.integers(-10, 11)) for cp in true_cps]  # offsets in [-10, 10]

# Evaluate
metrics = evaluate_all(true_cps, pred_cps, n_samples=500, margin=10)

# Plot
fig, axes = plot_detection(data_dict['data'], true_cps, pred_cps, 
                          metric_result=metrics, 
                          title="Mean Change Detection Example")
plt.show()

## 8. Practical Tips

### Choosing the Right Metric

| Use Case | Recommended Metric | Why? |
|----------|-------------------|------|
| **Binary decision** (detected or not) | Precision/Recall | Clear interpretation |
| **False alarms are costly** | F0.5 (β=0.5) | Emphasizes precision |
| **Missing CPs is costly** | F2 (β=2.0) | Emphasizes recall |
| **Localization accuracy** | Annotation Error | Direct distance measure |
| **Worst-case analysis** | Hausdorff | Sensitive to outliers |
| **Overall segmentation quality** | Adjusted Rand Index | Accounts for chance |
| **Multiple ground truths** | Covering Metric | Evaluates agreement with each |

### Choosing the Margin

- **Small margin** (1-5 samples): Strict localization
- **Medium margin** (10-20 samples): Reasonable tolerance
- **Large margin** (50+ samples): Focus on detection, not localization
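
To see how the choice plays out, the sketch below sweeps the margin for one fixed set of predictions. The data are made up, the helper `recall_with_margin` is hypothetical, and the matching is simplified: a true change point counts as found when any prediction lies within the margin.

```python
def recall_with_margin(true_cps, pred_cps, margin):
    """Fraction of true CPs that have at least one prediction within margin."""
    found = sum(1 for t in true_cps if any(abs(t - p) <= margin for p in pred_cps))
    return found / len(true_cps)


true_cps = [100, 200, 300]
pred_cps = [102, 188, 340]  # localization errors of 2, 12, and 40 samples

sweep = {m: recall_with_margin(true_cps, pred_cps, m) for m in (1, 5, 10, 20, 50)}
for margin, recall in sweep.items():
    print(f"margin={margin:>2}: recall={recall:.2f}")
# margin= 1: recall=0.00
# margin= 5: recall=0.33
# margin=10: recall=0.33
# margin=20: recall=0.67
# margin=50: recall=1.00
```

The same predictions score anywhere from 0 to 1 depending on the margin, so it helps to report the margin alongside any point-based metric.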

### Best Practices

1. **Always report multiple metrics** - A single metric can be misleading
2. **Use evaluate_all()** - Comprehensive evaluation
3. **Visualize results** - Plots reveal patterns metrics might miss
4. **Consider your application** - Choose metrics that match your goals

## Summary

This tutorial covered:

✅ **Point-based metrics**: Precision, Recall, F-beta  
✅ **Distance metrics**: Hausdorff, Annotation Error  
✅ **Segmentation metrics**: Adjusted Rand Index  
✅ **Multi-annotator metrics**: Covering Score (UNIQUE!)  
✅ **Comprehensive evaluation**: evaluate_all()  
✅ **Visualization**: plot_detection()  

### Next Steps

- **Tutorial 2**: Dataset Generation
- **Tutorial 3**: End-to-End Benchmarking

---

**fastcpd-python** provides the most comprehensive evaluation metrics for change point detection! 🎉