# Step 2: Set Up Data Splits

Before iterating on annotations, you need proper data splits. Without them, you'll contaminate your evaluation and build a model that only looks good on paper.

**This step creates:**
- **Test set (15%)** - Frozen. Never used for selection or training. Final evaluation only.
- **Validation set (15%)** - For iteration decisions. Used to evaluate between training rounds.
- **Golden QA set (5%)** - Small, heavily reviewed. Detects label drift.
- **Pool (65%)** - Active learning pool. All new labels come from here.
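
To make the percentages concrete, the arithmetic can be sketched as follows. This is only an illustration: `split_sizes` is a hypothetical helper (not part of FiftyOne), and the 203-sample count is an assumed example. It truncates each fixed-size split with `int()`, so the pool absorbs whatever rounding leaves over.

```python
def split_sizes(n, test=0.15, val=0.15, golden=0.05):
    """Return per-split sample counts; the pool takes the remainder."""
    n_test = int(test * n)      # truncates toward zero
    n_val = int(val * n)
    n_golden = int(golden * n)
    n_pool = n - n_test - n_val - n_golden
    return {"test": n_test, "val": n_val, "golden": n_golden, "pool": n_pool}


sizes = split_sizes(203)  # 203 is an assumed sample count
print(sizes)  # {'test': 30, 'val': 30, 'golden': 10, 'pool': 133}
```

Because of the truncation, the pool can end up slightly larger than 65% on datasets whose size is not a multiple of 20.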

In [None]:
import fiftyone as fo
import fiftyone.zoo as foz
import random

DATASET_NAME = "annotation_tutorial"

In [None]:
# Load or create the dataset (idempotent - safe to rerun)
if DATASET_NAME in fo.list_datasets():
    print(f"Loading existing dataset: {DATASET_NAME}")
    dataset = fo.load_dataset(DATASET_NAME)
    
    # Check if splits already exist
    existing_views = dataset.list_saved_views()
    if "pool" in existing_views:
        print("Splits already exist. Skipping creation.")
        SPLITS_EXIST = True
    else:
        SPLITS_EXIST = False
else:
    print(f"Creating new dataset: {DATASET_NAME}")
    dataset = foz.load_zoo_dataset("quickstart", dataset_name=DATASET_NAME)
    dataset.persistent = True
    SPLITS_EXIST = False

print(f"\nDataset: {dataset.name}")
print(f"Samples: {len(dataset)}")

In [None]:
# Create splits (only if they don't exist)
if not SPLITS_EXIST:
    random.seed(42)
    sample_ids = list(dataset.values("id"))
    random.shuffle(sample_ids)

    n = len(sample_ids)
    n_test = int(0.15 * n)      # 15% frozen test
    n_val = int(0.15 * n)       # 15% validation
    n_golden = int(0.05 * n)    # 5% golden QA
    # Remainder is pool

    test_ids = sample_ids[:n_test]
    val_ids = sample_ids[n_test:n_test + n_val]
    golden_ids = sample_ids[n_test + n_val:n_test + n_val + n_golden]
    pool_ids = sample_ids[n_test + n_val + n_golden:]

    print("Creating splits:")
    print(f"  Test:   {len(test_ids)} ({100*len(test_ids)/n:.0f}%)")
    print(f"  Val:    {len(val_ids)} ({100*len(val_ids)/n:.0f}%)")
    print(f"  Golden: {len(golden_ids)} ({100*len(golden_ids)/n:.0f}%)")
    print(f"  Pool:   {len(pool_ids)} ({100*len(pool_ids)/n:.0f}%)")

    # Tag samples by split
    dataset.select(test_ids).tag_samples("split:test")
    dataset.select(val_ids).tag_samples("split:val")
    dataset.select(golden_ids).tag_samples("split:golden")
    dataset.select(pool_ids).tag_samples("split:pool")

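    # Save views for easy access. overwrite=True keeps this cell rerunnable
    # if a previous run stopped after saving only some of the views.
    dataset.save_view("test_set", dataset.match_tags("split:test"), overwrite=True)
    dataset.save_view("val_set", dataset.match_tags("split:val"), overwrite=True)
    dataset.save_view("golden_qa", dataset.match_tags("split:golden"), overwrite=True)
    dataset.save_view("pool", dataset.match_tags("split:pool"), overwrite=True)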

    print("\nSplits created and saved as views.")
else:
    print("Using existing splits.")

In [None]:
# Add annotation tracking field (idempotent)
if "annotation_status" not in dataset.get_field_schema():
    dataset.add_sample_field("annotation_status", fo.StringField)
    dataset.set_values("annotation_status", ["unlabeled"] * len(dataset))
    print("Added annotation_status field (all samples start as 'unlabeled')")
else:
    print("annotation_status field already exists.")

## Why These Splits Matter

| Split | Purpose | Rules |
|-------|---------|-------|
| **Test** | Final evaluation | Never touch. Never use for selection. Check only at the end. |
| **Val** | Iteration decisions | Evaluate after each training round. Guides what to label next. |
| **Golden** | Label quality check | Heavily reviewed. Re-check after each annotation round for drift. |
| **Pool** | Labeling source | All new annotations come from here. |
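
The golden-set drift check in the table can be sketched as a simple agreement rate. This is a minimal illustration under stated assumptions, not the tutorial's own method: it assumes one class label per sample, a trusted reference label for each golden sample, and a fresh label from the latest annotation round. `label_agreement`, the sample IDs, and the 0.9 threshold are all hypothetical.

```python
def label_agreement(reference, latest):
    """Fraction of shared sample IDs whose latest label matches the reference."""
    shared = reference.keys() & latest.keys()
    if not shared:
        raise ValueError("no golden samples were re-annotated")
    matches = sum(reference[sid] == latest[sid] for sid in shared)
    return matches / len(shared)


# Hypothetical labels keyed by sample ID
reference = {"a": "cat", "b": "dog", "c": "cat", "d": "bird"}
latest = {"a": "cat", "b": "dog", "c": "dog", "d": "bird"}

agreement = label_agreement(reference, latest)
print(f"Golden agreement: {agreement:.2f}")  # Golden agreement: 0.75

if agreement < 0.9:  # assumed threshold; tune for your task
    print("Possible label drift: review the annotation guidelines")
```

A falling agreement rate across rounds suggests that annotators (or the guidelines) have shifted, which is the drift the golden set exists to catch.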

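**If you skip these splits:** Samples that influenced what you labeled or trained on leak into evaluation, so your metrics overstate real performance and you end up overfitting to your own selection strategy.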
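
One way to guard against leakage between splits is to verify that they are pairwise disjoint and together cover every sample. The sketch below works on plain lists of IDs. `check_splits` is a hypothetical helper, and the toy IDs stand in for real ones; in this notebook you could pass in the output of calls such as `dataset.match_tags("split:test").values("id")`.

```python
from itertools import combinations


def check_splits(splits, all_ids):
    """Raise ValueError if any two splits overlap or any sample is unassigned."""
    for (name_a, ids_a), (name_b, ids_b) in combinations(splits.items(), 2):
        overlap = set(ids_a) & set(ids_b)
        if overlap:
            raise ValueError(
                f"{name_a} and {name_b} share {len(overlap)} sample(s)"
            )
    assigned = set().union(*splits.values())
    missing = set(all_ids) - assigned
    if missing:
        raise ValueError(f"{len(missing)} sample(s) not assigned to any split")


# Toy IDs standing in for real sample IDs
all_ids = [f"id{i}" for i in range(20)]
splits = {
    "test": all_ids[:3],
    "val": all_ids[3:6],
    "golden": all_ids[6:7],
    "pool": all_ids[7:],
}
check_splits(splits, all_ids)  # passes silently when the splits are clean
```

Running a check like this after every change to the split tags is cheap and catches accidental overlap before it reaches your metrics.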

In [None]:
# Verify setup
print("Saved views:", dataset.list_saved_views())
print(f"\nTest set: {len(dataset.load_saved_view('test_set'))} samples")
print(f"Val set: {len(dataset.load_saved_view('val_set'))} samples")
print(f"Golden QA: {len(dataset.load_saved_view('golden_qa'))} samples")
print(f"Pool: {len(dataset.load_saved_view('pool'))} samples")

## Summary

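You created four data splits, each tagged and saved as a view, and added an `annotation_status` field that starts every sample as `unlabeled`:
- **Test** (frozen)
- **Val** (iteration)
- **Golden** (QA)
- **Pool** (labeling source)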

**Next:** Step 3 - Smart Sample Selection