# Step 1: Setup - Flatten Dataset and Create Splits

In this step, we'll load the KITTI-style `quickstart-groups` dataset, flatten it to a single camera view for annotation, and establish the three critical data splits that make iterative annotation actually work.

**Why This Matters**: Without proper splits, you'll contaminate your evaluation and think you're improving when you're not. This step is non-negotiable.

## Install Dependencies

In [None]:
!pip install -U fiftyone

## Load the Grouped Dataset

The `quickstart-groups` dataset contains 200 KITTI scenes with multiple sensor modalities:
- `left`: Left camera images with 2D detections
- `right`: Right camera images
- `pcd`: Point cloud data with 3D annotations

Let's load it and explore the structure.

In [None]:
import fiftyone as fo
import fiftyone.zoo as foz

# Load the grouped dataset
dataset = foz.load_zoo_dataset("quickstart-groups")

print(f"Dataset: {dataset.name}")
print(f"Media type: {dataset.media_type}")
print(f"Num groups: {len(dataset)}")
print(f"Group slices: {dataset.group_slices}")
print("\nDataset summary:")
print(dataset)

In [None]:
# Launch the App to explore the grouped data
session = fo.launch_app(dataset)

## Flatten to Left Camera Slice

Human annotation in FiftyOne works best with standard image datasets. We'll flatten the grouped dataset to work with just the left camera images, which have 2D bounding box annotations.

This approach:
1. Preserves the 2D detection workflow (our focus for this guide)
2. Avoids UI complexity with grouped data
3. Gives us a clean image dataset for YOLOv8 training

In [None]:
# Select only the left camera slice and flatten to a standard image dataset
left_view = dataset.select_group_slices(["left"], flat=True)

print(f"Flattened view: {len(left_view)} samples")
print(f"Media type: {left_view.media_type}")
print(f"\nSample fields:")
for field_name, field in left_view.get_field_schema().items():
    print(f"  {field_name}: {field}")

In [None]:
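# Clone to a new dataset for our annotation workflow
# This ensures we don't modify the original zoo dataset
# Note: clone() raises an error if a dataset with this name already exists
# (e.g., on a rerun); call fo.delete_dataset() first only if you want to start over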
annotation_dataset = left_view.clone("kitti_annotation_tutorial")
annotation_dataset.persistent = True

print(f"Created: {annotation_dataset.name}")
print(f"Num samples: {len(annotation_dataset)}")

## Create the Three Critical Splits

Now we establish the three splits that the rest of the workflow depends on:

| Split | Size | Purpose |
|-------|------|----------|
| **test** | 15% | Frozen. Never used for selection. Final evaluation only. |
| **golden** | 5% | Heavily reviewed QA set. Detects label drift. |
| **pool** | 80% | Active learning pool. All new labels come from here. |

We'll use FiftyOne tags to mark these splits.

In [None]:
import random

# Set seed for reproducibility
random.seed(42)

# Get all sample IDs and shuffle
sample_ids = list(annotation_dataset.values("id"))
random.shuffle(sample_ids)

# Calculate split sizes
n_total = len(sample_ids)
n_test = int(0.15 * n_total)      # 15% for frozen test
n_golden = int(0.05 * n_total)    # 5% for golden QA
n_pool = n_total - n_test - n_golden  # Remainder for active pool

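# Report the actual proportions; int() truncates, so they can differ slightly from the targets
print(f"Total samples: {n_total}")
print(f"Test split: {n_test} samples ({n_test / n_total:.0%})")
print(f"Golden split: {n_golden} samples ({n_golden / n_total:.0%})")
print(f"Pool split: {n_pool} samples ({n_pool / n_total:.0%})")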

In [None]:
# Assign splits using tags
test_ids = sample_ids[:n_test]
golden_ids = sample_ids[n_test:n_test + n_golden]
pool_ids = sample_ids[n_test + n_golden:]

# Tag the samples
annotation_dataset.select(test_ids).tag_samples("split:test")
annotation_dataset.select(golden_ids).tag_samples("split:golden")
annotation_dataset.select(pool_ids).tag_samples("split:pool")

print("Splits assigned!")
print(f"  split:test - {len(annotation_dataset.match_tags('split:test'))} samples")
print(f"  split:golden - {len(annotation_dataset.match_tags('split:golden'))} samples")
print(f"  split:pool - {len(annotation_dataset.match_tags('split:pool'))} samples")
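
Since everything downstream assumes the three splits never overlap, it can be worth asserting that directly. The following is a minimal sketch in plain Python: `check_splits` is a hypothetical helper (not a FiftyOne API), and the stand-in IDs substitute for the `sample_ids`, `test_ids`, `golden_ids`, and `pool_ids` lists built above.

```python
def check_splits(all_ids, **splits):
    """Return per-split sizes, or raise ValueError if the splits overlap
    or do not cover exactly the IDs in all_ids.

    Hypothetical helper for illustration; not part of FiftyOne.
    """
    assigned = set()
    sizes = {}
    for name, ids in splits.items():
        unique_ids = set(ids)
        overlap = assigned & unique_ids
        if overlap:
            raise ValueError(
                f"'{name}' overlaps an earlier split on {len(overlap)} ID(s)"
            )
        assigned |= unique_ids
        sizes[name] = len(unique_ids)
    if assigned != set(all_ids):
        raise ValueError("splits do not cover exactly the IDs in all_ids")
    return sizes


# Stand-in IDs for demonstration; in the notebook, pass sample_ids, test_ids,
# golden_ids, and pool_ids from the cells above instead.
demo_ids = [f"id{i}" for i in range(200)]
sizes = check_splits(
    demo_ids,
    test=demo_ids[:30],
    golden=demo_ids[30:40],
    pool=demo_ids[40:],
)
print(sizes)
```

Running the same check again after any later re-tagging is a cheap way to catch a sample that has leaked from the pool into the test set.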

## Create Label Fields for Human Annotations

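We'll keep the original `ground_truth` field intact and store human annotations in a separate `human_labels` field, which is created later during annotation. Keeping the two apart maintains provenance and lets us compare original vs. human-corrected labels. In this step we only inspect the existing classes and add an `annotation_status` field for tracking.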
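
Once `human_labels` has been populated in a later step, the comparison against the original labels could look roughly like the sketch below. It relies on FiftyOne's `exists()` and `evaluate_detections()` methods. The helper name `compare_with_original`, the `eval_key` value, and the choice to treat `ground_truth` as the reference side are assumptions made for illustration.

```python
def compare_with_original(
    dataset,
    human_field="human_labels",
    original_field="ground_truth",
    eval_key="human_vs_original",  # hypothetical key name
):
    """Evaluate human labels against the original labels.

    Hypothetical helper; `dataset` is expected to be a FiftyOne dataset
    or view that already contains `human_field`.
    """
    # Only samples that have human labels can be compared
    labeled = dataset.exists(human_field)
    return labeled.evaluate_detections(
        human_field,
        gt_field=original_field,
        eval_key=eval_key,
    )


# Usage, after annotation has produced `human_labels`:
#   results = compare_with_original(annotation_dataset)
#   results.print_report()
```

Nothing here can run in Step 1 itself, because `human_labels` does not exist yet. The point is only that keeping the two fields separate makes this kind of comparison a single call later on.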

In [None]:
# The original annotations are in 'ground_truth'
# 'human_labels' will be added later, during annotation

# First, let's see what classes exist in the original annotations
classes = annotation_dataset.distinct("ground_truth.detections.label")
print(f"Original classes: {classes}")

# Count detections per class with a single aggregation
label_counts = annotation_dataset.count_values("ground_truth.detections.label")

print("\nDetection counts by class:")
for label, count in sorted(label_counts.items(), key=lambda x: -x[1]):
    print(f"  {label}: {count}")

In [None]:
# Add a field to track annotation status
annotation_dataset.add_sample_field("annotation_status", fo.StringField)

# Initialize all samples as 'unlabeled'
annotation_dataset.set_values("annotation_status", ["unlabeled"] * len(annotation_dataset))

print("Added annotation_status field (all samples start as 'unlabeled')")

## Create Helper Views

We'll save views for easy access to each split during the annotation workflow.

In [None]:
# Create and save views for each split
test_view = annotation_dataset.match_tags("split:test")
golden_view = annotation_dataset.match_tags("split:golden")
pool_view = annotation_dataset.match_tags("split:pool")

# overwrite=True keeps this cell safe to rerun
annotation_dataset.save_view("test_set", test_view, overwrite=True)
annotation_dataset.save_view("golden_qa", golden_view, overwrite=True)
annotation_dataset.save_view("active_pool", pool_view, overwrite=True)

print("Saved views:")
for name in annotation_dataset.list_saved_views():
    print(f"  - {name}")
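
Because the dataset is persistent, later steps can reopen it and pull a split back by its saved view name. The sketch below assumes the view names saved above. `SPLIT_VIEWS` and `load_split` are hypothetical helpers, while `load_saved_view()` and `fo.load_dataset()` are FiftyOne calls.

```python
# Saved view names from this step, keyed by split (hypothetical helper mapping)
SPLIT_VIEWS = {
    "test": "test_set",
    "golden": "golden_qa",
    "pool": "active_pool",
}


def load_split(dataset, split):
    """Return the saved view for a split name such as "pool".

    Hypothetical helper; `dataset` is expected to be a FiftyOne dataset
    that contains the saved views created in this step.
    """
    try:
        view_name = SPLIT_VIEWS[split]
    except KeyError:
        raise ValueError(
            f"Unknown split {split!r}; expected one of {sorted(SPLIT_VIEWS)}"
        ) from None
    return dataset.load_saved_view(view_name)


# Usage in a later notebook session:
#   import fiftyone as fo
#   dataset = fo.load_dataset("kitti_annotation_tutorial")
#   pool_view = load_split(dataset, "pool")
```

Routing every split lookup through one mapping keeps later notebooks from mistyping a view name and silently working on the wrong split.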

In [None]:
# Launch the App with the pool view (where we'll select samples to annotate)
session.dataset = annotation_dataset
session.view = pool_view
session.show()

## Verify the Setup

Let's confirm everything is correctly configured before moving on.

In [None]:
print("="*50)
print("SETUP VERIFICATION")
print("="*50)
print(f"\nDataset: {annotation_dataset.name}")
print(f"Total samples: {len(annotation_dataset)}")
print(f"\nSplits:")
print(f"  Test (frozen):  {len(test_view)} samples")
print(f"  Golden (QA):    {len(golden_view)} samples")
print(f"  Pool (active):  {len(pool_view)} samples")
print(f"\nLabel fields:")
print(f"  ground_truth: Original KITTI annotations")
print(f"  human_labels: (Will be created during annotation)")
print(f"\nSaved views: {annotation_dataset.list_saved_views()}")
print(f"\nClasses: {classes}")
print("\n" + "="*50)
print("Ready for Step 2: Bootstrap Selection!")
print("="*50)

## Summary

In this step, you:

1. **Loaded the quickstart-groups dataset** - Multi-modal KITTI data with camera + LiDAR
2. **Flattened to left camera images** - Created a clean image dataset for annotation
3. **Created three critical splits**:
   - **Test (15%)**: Frozen, never touched during active learning
   - **Golden (5%)**: Small QA set to detect label drift
   - **Pool (80%)**: Where all new labels will come from
4. **Set up a tracking field** - `annotation_status` records progress, while `ground_truth` stays intact for provenance

**Key Insight**: These splits aren't optional. Without them, you'll contaminate your evaluation and build a model that only looks good on paper.

**Next up**: Step 2 - Bootstrap Selection with Embeddings + ZCore