# Step 4: Train Baseline Detector + Evaluate

Now we train a model on our human labels and evaluate it properly. This step teaches you to:
1. Export data in YOLO format for training
2. Train YOLOv8 on your annotated batch
3. Run inference and evaluate with FiftyOne
4. Analyze failure modes: FP, FN, class confusion, localization errors

**Why This Matters**: Evaluation isn't just a number. Understanding *where* and *why* your model fails tells you what to label next.

## Install Dependencies

In [None]:
!pip install -U ultralytics

## Load the Dataset

In [None]:
import fiftyone as fo
from fiftyone import ViewField as F

# Load dataset
dataset = fo.load_dataset("kitti_annotation_tutorial")

# Get annotated samples (our training data)
train_view = dataset.match_tags("annotated:v0")

# Get validation data (from pool, not yet annotated)
# We'll use a portion of remaining pool for validation
pool_remaining = dataset.match_tags("split:pool").match(F("annotation_status") != "annotated")

print(f"Training samples (annotated): {len(train_view)}")
print(f"Pool remaining: {len(pool_remaining)}")

In [None]:
# For evaluation, we'll use ground_truth on a held-out portion
# Note: In production, you'd have human labels on validation too
# For this tutorial, we use a subset of remaining pool with ground_truth

import random
random.seed(42)

val_ids = random.sample(list(pool_remaining.values("id")), min(50, len(pool_remaining)))
val_view = dataset.select(val_ids)
val_view.tag_samples("split:val_v0")

print(f"Validation samples: {len(val_view)}")
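
The sampling step above can be checked in isolation. The sketch below uses made-up ID strings (hypothetical, not real FiftyOne sample IDs) and a hypothetical helper, `sample_val_ids`, to show that a seeded local RNG returns the same validation subset on every run, provided the pool IDs arrive in the same order.

```python
import random

def sample_val_ids(pool_ids, k, seed=42):
    """Return a reproducible random subset of at most k IDs from pool_ids."""
    rng = random.Random(seed)  # local RNG, leaves the global random state alone
    pool_ids = list(pool_ids)
    return rng.sample(pool_ids, min(k, len(pool_ids)))

# Hypothetical IDs, for illustration only
train_ids = {f"train_{i}" for i in range(20)}
pool_ids = [f"pool_{i}" for i in range(200)]

val_a = sample_val_ids(pool_ids, 50)
val_b = sample_val_ids(pool_ids, 50)
print(len(val_a), val_a == val_b, train_ids.isdisjoint(val_a))  # 50 True True
```

Keeping the validation IDs disjoint from the training IDs is what makes the later metrics meaningful; the final check makes that explicit.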

## Export Data for YOLOv8 Training

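YOLOv8 expects images and per-image label files in the YOLO directory layout, plus a YAML file that lists the splits and class names. FiftyOne's `YOLOv5Dataset` exporter writes this layout, which YOLOv8 also reads.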
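
To make the label format concrete, here is a minimal sketch of how one box maps to one line of a YOLO label file. It assumes the FiftyOne convention of relative `[x, y, w, h]` boxes anchored at the top-left corner. The helper name `to_yolo_line` and the six-decimal formatting are illustrative choices, not the exporter's actual implementation.

```python
def to_yolo_line(label, bounding_box, classes):
    """Format one box as a YOLO label line: `<class_idx> <cx> <cy> <w> <h>`.

    bounding_box is [x, y, w, h] with (x, y) the top-left corner and all
    values relative to the image size, in [0, 1].
    """
    x, y, w, h = bounding_box
    cx, cy = x + w / 2, y + h / 2  # YOLO stores the box center, not the corner
    return f"{classes.index(label)} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

classes = ["Car", "Pedestrian"]  # example class list
line = to_yolo_line("Pedestrian", [0.25, 0.5, 0.5, 0.25], classes)
print(line)  # 1 0.500000 0.625000 0.500000 0.250000
```

The class index is the position of the label in `classes`, which is why the same list must be used for both exports and for the YAML file.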

In [None]:
import os

# Create export directory
export_dir = "/tmp/kitti_yolo_v0"
os.makedirs(export_dir, exist_ok=True)

# Get unique classes from human_labels
classes = train_view.distinct("human_labels.detections.label")
print(f"Classes for training: {classes}")

In [None]:
# Export training data in YOLOv5 format (compatible with YOLOv8)
train_view.export(
    export_dir=os.path.join(export_dir, "train"),
    dataset_type=fo.types.YOLOv5Dataset,
    label_field="human_labels",
    classes=classes,
)

print(f"Exported training data to {export_dir}/train")

In [None]:
# Export validation data (using ground_truth for now)
val_view.export(
    export_dir=os.path.join(export_dir, "val"),
    dataset_type=fo.types.YOLOv5Dataset,
    label_field="ground_truth",
    classes=classes,
)

print(f"Exported validation data to {export_dir}/val")

In [None]:
# Create the YAML config file for YOLOv8
yaml_content = f"""path: {export_dir}
train: train/images
val: val/images

names:
"""

for i, cls in enumerate(classes):
    yaml_content += f"  {i}: {cls}\n"

yaml_path = os.path.join(export_dir, "dataset.yaml")
with open(yaml_path, "w") as f:
    f.write(yaml_content)

print(f"Created {yaml_path}")
print("\nYAML content:")
print(yaml_content)
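
The string concatenation above works for simple class names. If a name ever contained a YAML-special character such as a colon, the file would no longer parse as intended. The sketch below is one possible hardening, using a hypothetical helper `build_dataset_yaml` that quotes values with `json.dumps` (YAML accepts JSON-style double-quoted strings). The class name `"Person: sitting"` is invented purely to exercise the quoting.

```python
import json

def build_dataset_yaml(root, classes):
    """Build a YOLO dataset YAML string with quoted path and class names."""
    lines = [
        f"path: {json.dumps(root)}",
        "train: train/images",
        "val: val/images",
        "",
        "names:",
    ]
    # json.dumps emits a double-quoted string, which YAML also accepts
    lines += [f"  {i}: {json.dumps(name)}" for i, name in enumerate(classes)]
    return "\n".join(lines) + "\n"

yaml_text = build_dataset_yaml("/tmp/kitti_yolo_v0", ["Car", "Person: sitting"])
print(yaml_text)
```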

## Train YOLOv8

We'll train YOLOv8n, the smallest YOLOv8 variant, to keep training fast. In production, consider a larger variant and more epochs.

In [None]:
from ultralytics import YOLO

# Load a pretrained YOLOv8n model
model = YOLO('yolov8n.pt')

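# Train on our data
# Note: epochs=10 is just for demo; use more epochs for real training
# exist_ok=True reuses the run directory on reruns; without it, Ultralytics
# writes to a new suffixed directory and the hardcoded checkpoint path
# used later would point at a stale run
results = model.train(
    data=yaml_path,
    epochs=10,
    imgsz=640,
    batch=8,
    name='kitti_v0',
    project='/tmp/yolo_runs',
    exist_ok=True,
)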

print("\nTraining complete!")

In [None]:
# Get the best model path
best_model_path = '/tmp/yolo_runs/kitti_v0/weights/best.pt'
print(f"Best model saved at: {best_model_path}")

## Run Inference on Validation Set

Now we add predictions to our FiftyOne dataset for evaluation.

In [None]:
# Load trained model
model = YOLO(best_model_path)

# Get filepaths for inference
filepaths = val_view.values("filepath")
print(f"Running inference on {len(filepaths)} validation images...")

In [None]:
# Run inference and add predictions to FiftyOne
# autosave=True batches the writes instead of saving sample by sample
for sample in val_view.iter_samples(autosave=True, progress=True):
    # Run inference; the model returns one result object per image
    result = model(sample.filepath, verbose=False)[0]

    # Convert to FiftyOne detections
    detections = []
    if result.boxes is not None:
        for box in result.boxes:
            # Get normalized corner coordinates
            x1, y1, x2, y2 = box.xyxyn[0].tolist()
            conf = box.conf[0].item()
            cls_idx = int(box.cls[0].item())
            label = classes[cls_idx] if cls_idx < len(classes) else f"class_{cls_idx}"

            # Convert to FiftyOne format [x, y, w, h] (top-left corner, relative)
            det = fo.Detection(
                label=label,
                bounding_box=[x1, y1, x2 - x1, y2 - y1],
                confidence=conf,
            )
            detections.append(det)

    sample["predictions_v0"] = fo.Detections(detections=detections)

print(f"Added predictions_v0 to {len(val_view)} samples")
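
The coordinate conversion inside the loop is easy to get subtly wrong, so it can help to pull it out into a small pure function that is testable without a model. This is a sketch under assumptions: the helper name `xyxy_to_xywh` is hypothetical, and the clamping to `[0, 1]` is an extra guard that the loop above does not perform.

```python
def xyxy_to_xywh(x1, y1, x2, y2):
    """Convert normalized corner coordinates to [x, y, w, h] (top-left + size).

    Values are clamped to [0, 1] first, so slightly out-of-range boxes
    still produce a valid relative box (an added guard, not in the loop above).
    """
    x1, y1, x2, y2 = (min(max(v, 0.0), 1.0) for v in (x1, y1, x2, y2))
    return [x1, y1, x2 - x1, y2 - y1]

print(xyxy_to_xywh(0.25, 0.25, 0.75, 0.5))  # [0.25, 0.25, 0.5, 0.25]
```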

## Evaluate with FiftyOne

FiftyOne's evaluation computes mAP and provides per-sample TP/FP/FN counts for analysis.

In [None]:
# Run evaluation
eval_results = val_view.evaluate_detections(
    "predictions_v0",
    gt_field="ground_truth",
    eval_key="eval_v0",
    compute_mAP=True
)

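print("Evaluation Results:")
print(f"  mAP: {eval_results.mAP():.3f}")

# mAP@50: re-run the evaluation at a single IoU threshold.
# No eval_key is passed, so the per-sample eval_v0 fields are left untouched.
eval_50 = val_view.evaluate_detections(
    "predictions_v0",
    gt_field="ground_truth",
    compute_mAP=True,
    iou_threshs=[0.5],
)
print(f"  mAP@50: {eval_50.mAP():.3f}")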

In [None]:
# Print per-class metrics
eval_results.print_report()
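
To see where the per-sample TP/FP/FN counts come from, the sketch below implements a simplified greedy matcher: predictions are visited in order of decreasing confidence, and each is matched to the best-overlapping unmatched ground-truth box of the same class. It illustrates the general idea only and is not FiftyOne's implementation, which handles additional cases such as crowd annotations. The function names and example boxes are hypothetical.

```python
def iou(a, b):
    """IoU of two [x, y, w, h] boxes (top-left corner plus size)."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def count_matches(preds, gts, iou_thresh=0.5):
    """Greedy matching. preds: (label, box, confidence); gts: (label, box).

    Returns (tp, fp, fn). Each ground-truth box is matched at most once.
    """
    unmatched = list(range(len(gts)))
    tp = 0
    for label, box, _ in sorted(preds, key=lambda p: p[2], reverse=True):
        best, best_iou = None, iou_thresh
        for j in unmatched:
            overlap = iou(box, gts[j][1])
            if gts[j][0] == label and overlap >= best_iou:
                best, best_iou = j, overlap
        if best is not None:
            unmatched.remove(best)
            tp += 1
    return tp, len(preds) - tp, len(unmatched)

gts = [("Car", [0.0, 0.0, 0.5, 0.5]), ("Car", [0.5, 0.5, 0.5, 0.5])]
preds = [("Car", [0.0, 0.0, 0.5, 0.5], 0.9), ("Car", [0.0, 0.5, 0.25, 0.25], 0.6)]
print(count_matches(preds, gts))  # (1, 1, 1)
```

The first prediction overlaps the first ground-truth box exactly (one TP). The second overlaps nothing (one FP), and the second ground-truth box is never matched (one FN).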

## Analyze Failure Modes

Now the important part: understanding *where* the model fails. We'll analyze:
1. **False Negatives (FN)**: Objects the model missed
2. **False Positives (FP)**: Detections that don't match ground truth
3. **Class Confusion**: Correct localization but wrong class
4. **Localization Errors**: Right class but poor IoU
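
The four categories above can be expressed as a small decision rule over a prediction and its best-overlapping ground-truth box. This is a sketch with a hypothetical helper, `failure_mode`. The `loc_floor` cutoff is an illustrative assumption rather than a FiftyOne setting, and false negatives are not covered here because they are ground-truth boxes left without any matching prediction.

```python
def failure_mode(pred_label, gt_label, overlap, iou_thresh=0.5, loc_floor=0.1):
    """Categorize one prediction against its best-overlapping ground-truth box.

    gt_label is None when no ground-truth box overlaps the prediction.
    loc_floor is an assumed cutoff below which overlap is treated as noise.
    """
    if gt_label is None or overlap < loc_floor:
        return "false_positive"  # nothing plausible underneath the box
    if overlap >= iou_thresh:
        return "true_positive" if pred_label == gt_label else "class_confusion"
    # Some overlap, but below the matching threshold
    return "localization_error" if pred_label == gt_label else "false_positive"

print(failure_mode("Car", "Car", 0.8))   # true_positive
print(failure_mode("Car", "Van", 0.8))   # class_confusion
print(failure_mode("Car", "Car", 0.3))   # localization_error
print(failure_mode("Car", None, 0.0))    # false_positive
```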

In [None]:
# Launch the App with evaluation results
session = fo.launch_app(val_view)

In [None]:
# Find high-FN samples (model missed many objects)
high_fn_view = val_view.sort_by("eval_v0_fn", reverse=True).limit(10)

print("Top 10 samples by False Negatives:")
for sample in high_fn_view:
    fn_count = sample.eval_v0_fn if hasattr(sample, 'eval_v0_fn') else 0
    print(f"  {sample.filepath.split('/')[-1]}: {fn_count} FN")

In [None]:
# Find high-FP samples (model hallucinated detections)
high_fp_view = val_view.sort_by("eval_v0_fp", reverse=True).limit(10)

print("\nTop 10 samples by False Positives:")
for sample in high_fp_view:
    fp_count = sample.eval_v0_fp if hasattr(sample, 'eval_v0_fp') else 0
    print(f"  {sample.filepath.split('/')[-1]}: {fp_count} FP")

In [None]:
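# Confusion matrix analysis
# confusion_matrix() returns a NumPy array, so print it directly.
# Note: with the default classwise matching, predictions are only matched to
# ground truth of the same class. Pass classwise=False to
# evaluate_detections() if you want cross-class confusions to show up.
confusion = eval_results.confusion_matrix()
print("\nConfusion Matrix:")
print(confusion)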

In [None]:
# Plot confusion matrix
plot = eval_results.plot_confusion_matrix()
plot.show()

## Tag Failure Cases for Next Iteration

We'll tag samples that represent different failure modes. These will guide our next batch selection.

In [None]:
# Tag samples with high FN (recall issues)
fn_threshold = 3  # More than 3 missed objects
high_fn_samples = val_view.match(F("eval_v0_fn") > fn_threshold)
high_fn_samples.tag_samples("failure:high_fn")
print(f"Tagged {len(high_fn_samples)} samples with 'failure:high_fn'")

# Tag samples with high FP (precision issues)
fp_threshold = 3
high_fp_samples = val_view.match(F("eval_v0_fp") > fp_threshold)
high_fp_samples.tag_samples("failure:high_fp")
print(f"Tagged {len(high_fp_samples)} samples with 'failure:high_fp'")
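
Absolute count thresholds tend to flag crowded scenes, since an image with many objects has more opportunities for misses. A complementary option is to look at per-sample precision and recall, computed from the `eval_v0_tp`, `eval_v0_fp`, and `eval_v0_fn` counts that the evaluation stores on each sample. The sketch below shows the arithmetic with hypothetical counts; the helper name is illustrative.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall); each is None when its denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall

# Hypothetical per-sample counts, for illustration only
print(precision_recall(tp=6, fp=2, fn=4))  # (0.75, 0.6)
```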

In [None]:
# Analyze failures by class
# count_values() aggregates label counts in the database, which avoids
# iterating over every sample once per class
gt_counts = val_view.count_values("ground_truth.detections.label")
pred_counts = val_view.count_values("predictions_v0.detections.label")

print("\nPer-class failure analysis:")
for cls in classes:
    total_gt = gt_counts.get(cls, 0)
    total_pred = pred_counts.get(cls, 0)
    print(f"  {cls}: GT={total_gt}, Pred={total_pred}, Diff={total_pred - total_gt}")

## Save Evaluation Artifacts

In [None]:
# Save a view of failure cases
failure_view = val_view.match_tags(["failure:high_fn", "failure:high_fp"])
dataset.save_view("eval_v0_failures", failure_view)

print(f"Saved view 'eval_v0_failures' with {len(failure_view)} samples")

In [None]:
# Store evaluation metrics in dataset info
# Cast to a plain float so the value serializes cleanly
dataset.info["eval_v0"] = {
    "mAP": float(eval_results.mAP()),
    "train_samples": len(train_view),
    "val_samples": len(val_view),
    "model_path": best_model_path
}
dataset.save()

print("Evaluation metrics saved to dataset.info['eval_v0']")

## Summary

In this step, you:

1. **Exported data for YOLOv8** - Converted FiftyOne dataset to YOLO format
2. **Trained a baseline model** - YOLOv8n on your annotated batch
3. **Ran inference** - Added `predictions_v0` to validation samples
4. **Evaluated thoroughly**:
   - mAP and per-class metrics
   - Confusion matrix analysis
   - FP/FN breakdown per sample
5. **Tagged failure cases** - `failure:high_fn`, `failure:high_fp` for next iteration

**Key Insight**: Don't just look at mAP. The confusion matrix and per-sample failures tell you *what to label next*.

**Artifacts Created**:
- `predictions_v0` field on validation samples
- `eval_v0` evaluation key with metrics
- Failure tags for targeted selection
- Model checkpoint at `/tmp/yolo_runs/kitti_v0/weights/best.pt`

**Next up**: Step 5 - Iteration: Hybrid Acquisition Loop