---

## Evaluation

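After training is complete, evaluate on the test set. The cells below optionally reload the best checkpoint, then load the test predictions saved for the best epoch.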


In [None]:
# Load best model checkpoint (optional - if you want to evaluate a saved model)
# checkpoint_path = CHECKPOINTS_PATH + "best_model.pth"
# model.load_state_dict(torch.load(checkpoint_path, map_location=DEVICE))
# print(f"Loaded checkpoint: {checkpoint_path}")


In [None]:
predictions_path = (
    CHECKPOINTS_PATH
    + f"eval_results/epoch_{history['best_epoch']}_test_predictions.json"
)
with open(predictions_path) as f:
    predictions = json.load(f)

print(f"Generated {len(predictions)} captions\n")
print("Sample predictions:")
for pred in predictions[:10]:
    print(f"  Image {pred['image_id']}: {pred['caption']}")
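
Before scoring, it can help to sanity-check the loaded predictions. The sketch below assumes each entry is a dict with `image_id` and `caption` keys, as the loop above implies. `check_predictions` is a hypothetical helper written for illustration, not part of the project's evaluation code.

```python
from collections import Counter


def check_predictions(predictions):
    """Return (duplicate_ids, empty_ids) for a list of prediction dicts.

    Hypothetical helper: assumes each item has 'image_id' and 'caption' keys.
    """
    counts = Counter(p["image_id"] for p in predictions)
    # Image IDs that appear more than once in the predictions list
    duplicate_ids = sorted(i for i, n in counts.items() if n > 1)
    # Image IDs whose caption is empty or whitespace only
    empty_ids = sorted(
        p["image_id"] for p in predictions if not p["caption"].strip()
    )
    return duplicate_ids, empty_ids


# Small hand-written sample in the same shape as the predictions file
sample = [
    {"image_id": 1, "caption": "a dog sitting on the grass"},
    {"image_id": 2, "caption": "   "},
    {"image_id": 1, "caption": "a dog on grass"},
]
duplicate_ids, empty_ids = check_predictions(sample)
print(f"Duplicate image IDs: {duplicate_ids}")
print(f"Image IDs with empty captions: {empty_ids}")
```

Calling `check_predictions(predictions)` on the real list would flag entries worth inspecting before trusting the metrics.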

---

## Quick Test: Evaluation Functions with Mock Data

Smoke-test the evaluation functions on small mock inputs; no real data is required.


In [None]:
# Test compute_caption_metrics with mock data
print("Testing compute_caption_metrics()...")

mock_predictions = {
    1: ["a dog sitting on the grass"],
    2: ["a cat sleeping on a couch"],
    3: ["a person riding a bicycle"],
}

mock_references = {
    1: [
        "a dog is sitting on green grass",
        "dog on the grass",
        "a brown dog sits on grass",
    ],
    2: [
        "a cat is sleeping on the sofa",
        "cat napping on couch",
        "a sleeping cat on a couch",
    ],
    3: ["a man rides a bike", "person on a bicycle", "someone cycling on the road"],
}

metrics = compute_caption_metrics(mock_predictions, mock_references)
print(f"✅ Success! Metrics: {metrics}")

In [None]:
# Test evaluate_captions with a mock COCO-format annotations file
import json
import os
import tempfile

print("Testing evaluate_captions()...")

mock_coco = {
    "images": [
        {"id": 1, "file_name": "img1.jpg"},
        {"id": 2, "file_name": "img2.jpg"},
    ],
    "annotations": [
        {"image_id": 1, "id": 1, "caption": "a dog sitting on grass"},
        {"image_id": 1, "id": 2, "caption": "dog on the green grass"},
        {"image_id": 2, "id": 3, "caption": "a cat on a couch"},
        {"image_id": 2, "id": 4, "caption": "cat sleeping on sofa"},
    ],
}

mock_pred_list = [
    {"image_id": 1, "caption": "a dog sitting on the grass"},
    {"image_id": 2, "caption": "a cat sleeping on the couch"},
]

with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as f:
    json.dump(mock_coco, f)
    temp_path = f.name

try:
    metrics = evaluate_captions(mock_pred_list, temp_path)
    print(f"✅ Success! Metrics: {metrics}")
finally:
    os.unlink(temp_path)
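
The two tests use different input shapes: `compute_caption_metrics` takes `{image_id: [captions]}` dictionaries, while `evaluate_captions` takes a list of predictions plus a COCO-format annotations file. How the project converts between them is not shown in this notebook; the sketch below is one plausible conversion, and both function names are hypothetical.

```python
from collections import defaultdict


def group_references_by_image(coco):
    """Map each image_id to its reference captions (hypothetical helper)."""
    references = defaultdict(list)
    for ann in coco["annotations"]:
        references[ann["image_id"]].append(ann["caption"])
    return dict(references)


def predictions_to_dict(pred_list):
    """Map each image_id to a one-element list holding its generated caption."""
    return {p["image_id"]: [p["caption"]] for p in pred_list}


# Same shapes as mock_coco and mock_pred_list in the cell above
example_coco = {
    "annotations": [
        {"image_id": 1, "id": 1, "caption": "a dog sitting on grass"},
        {"image_id": 1, "id": 2, "caption": "dog on the green grass"},
        {"image_id": 2, "id": 3, "caption": "a cat on a couch"},
    ]
}
example_preds = [{"image_id": 1, "caption": "a dog sitting on the grass"}]

refs = group_references_by_image(example_coco)
preds = predictions_to_dict(example_preds)
print(refs)
print(preds)
```

The resulting dictionaries have the same shape as `mock_references` and `mock_predictions` from the first test.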

---

## FiftyOne Visualization

Interactive visualization of generated captions vs. ground truth.


In [None]:
# Create FiftyOne dataset from evaluation results
dataset = create_captioning_dataset(
    images_dir=VAL_PATH,
    # Reuse the predictions file path defined in the Evaluation section
    predictions_path=predictions_path,
    annotations_path=ANNOTATIONS_PATH + "captions_val2017.json",
    dataset_name="caption_eval",
)

print(dataset)

In [None]:
# Print 5 randomly selected samples (pass seed=... to take() for a reproducible draw)
for sample in dataset.take(5):
    print(f"\nImage ID: {sample.image_id}")
    print(f"Generated: {sample.generated_caption}")
    print(f"References: {sample.reference_captions[:2]}...")  # Show first 2

In [None]:
# Launch FiftyOne app for interactive exploration
# This opens a web browser at http://localhost:5151

# Uncomment to launch:
# launch_app(dataset)

# Or launch without blocking:
# session = fo.launch_app(dataset)
# session.show()


---

## Plot Metrics Over Training


In [None]:
# Plot validation metrics from training history (if available)
val_history = history.get("val_metrics") or []
if val_history:
    epochs = [m["epoch"] for m in val_history]

    # Extract metrics; a missing value becomes NaN so it leaves a gap
    # in the curve instead of being plotted as a score of 0
    bleu4 = [m.get("BLEU-4", float("nan")) for m in val_history]
    cider = [m.get("CIDEr", float("nan")) for m in val_history]
    meteor = [m.get("METEOR", float("nan")) for m in val_history]

    fig, ax = plt.subplots(figsize=(10, 5))
    ax.plot(epochs, bleu4, "b-o", label="BLEU-4")
    ax.plot(epochs, cider, "r-s", label="CIDEr")
    ax.plot(epochs, meteor, "g-^", label="METEOR")
    ax.set_xlabel("Epoch")
    ax.set_ylabel("Score")
    ax.set_title("Validation Metrics Over Training")
    ax.legend()
    ax.grid(True)
    plt.show()
else:
    print("No validation metrics available - train for more epochs to see trends")
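
How `history['best_epoch']` was selected is not shown in this notebook. As a cross-check against the plot, the sketch below picks the epoch with the highest value of a given metric, assuming each `val_metrics` entry is a dict with an `epoch` key and metric keys such as `CIDEr`. `best_epoch_by` is a hypothetical helper, and the numbers are illustrative only.

```python
def best_epoch_by(val_history, metric="CIDEr"):
    """Return the epoch whose entry has the highest `metric`, or None."""
    # Ignore entries that did not record this metric
    scored = [m for m in val_history if metric in m]
    if not scored:
        return None
    return max(scored, key=lambda m: m[metric])["epoch"]


# Illustrative values only, not results from a real training run
toy_history = [
    {"epoch": 1, "BLEU-4": 0.10, "CIDEr": 0.30},
    {"epoch": 2, "BLEU-4": 0.15, "CIDEr": 0.50},
    {"epoch": 3, "BLEU-4": 0.16, "CIDEr": 0.45},
]
best_cider_epoch = best_epoch_by(toy_history)
best_bleu_epoch = best_epoch_by(toy_history, "BLEU-4")
print(f"Best epoch by CIDEr: {best_cider_epoch}")
print(f"Best epoch by BLEU-4: {best_bleu_epoch}")
```

Different metrics can peak at different epochs, as in the toy data here, so it is worth knowing which one the checkpoint selection used.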

---

## Summary

### Evaluation Functions:
- `compute_caption_metrics(preds, refs)` - Low-level metric computation on `{image_id: [captions]}` dictionaries
- `evaluate_captions(predictions, annotations_path)` - Evaluate a list of predictions against a COCO-format annotations file
- `generate_and_evaluate(model, dataset, ...)` - Generate + evaluate in one call
- `evaluate_epoch(model, dataset, ...)` - Full epoch eval with file saving

### Visualization Functions:
- `create_captioning_dataset(...)` - Build FiftyOne dataset
- `create_comparison_dataset(...)` - Compare multiple models
- `get_low_score_view(...)` / `get_high_score_view(...)` - Filter samples
- `launch_app(dataset)` - Interactive web visualization

### Metrics:
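- **BLEU-1/2/3/4**: Clipped n-gram precision (n = 1 to 4) against the references, with a brevity penalty for short captions
- **METEOR**: Unigram matching that also credits stems and synonyms, combining precision and recall
- **CIDEr**: TF-IDF weighted n-gram similarity to the consensus of the reference captions (commonly treated as the primary metric for captioning)
- **ROUGE-L**: F-measure based on the longest common subsequence with the references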
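
To make "n-gram precision" concrete, here is a minimal sketch of the clipped (modified) n-gram precision that BLEU is built on, applied to the first mock example from the quick test. It uses plain whitespace tokenization and leaves out the brevity penalty and the geometric mean over n, so its values will not match what `compute_caption_metrics` reports.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against several references."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the maximum count seen in any single reference
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref.split(), n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # Each candidate n-gram is credited at most as often as a reference has it
    clipped = sum(
        min(count, max_ref_counts[gram]) for gram, count in cand_counts.items()
    )
    return clipped / sum(cand_counts.values())


candidate = "a dog sitting on the grass"
references = [
    "a dog is sitting on green grass",
    "dog on the grass",
    "a brown dog sits on grass",
]
p1 = modified_precision(candidate, references, 1)
p2 = modified_precision(candidate, references, 2)
print(f"Unigram precision: {p1:.2f}")  # 1.00: every word occurs in some reference
print(f"Bigram precision:  {p2:.2f}")  # 0.80: 'dog sitting' occurs in no reference
```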
