# Step 2: Bootstrap Selection - Embeddings + ZCore

Before we have a model, we can't chase failures. So our first batch must be selected for **coverage** - we want diverse samples that represent the full distribution of our data.

This step teaches you to:
1. Compute embeddings to understand your dataset's structure
2. Use the ZCore operator to select a coverage-optimized batch
3. Visualize the selection to verify diversity

**Why This Matters**: Random sampling wastes labels on redundant near-duplicates, because most draws land in the densest regions of the data. ZCore is designed to pick a first batch that covers the data manifold with far less redundancy.

## Load the Dataset from Step 1

In [None]:
import fiftyone as fo

# Load the dataset we created in Step 1
dataset = fo.load_dataset("kitti_annotation_tutorial")

# Load the pool view (where we select samples from)
pool_view = dataset.load_saved_view("active_pool")

print(f"Dataset: {dataset.name}")
print(f"Pool size: {len(pool_view)} samples available for selection")

## Compute Image Embeddings

Embeddings map images to a vector space where similar images are close together. We'll use these to:
1. Understand the structure of our dataset
2. Select diverse samples for annotation
3. Identify clusters and outliers
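
As a toy illustration of "similar images are close together", the sketch below compares hand-made 3-dimensional vectors using cosine similarity. The vectors and scene names are invented for illustration. Real embeddings come from a model and typically have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: two highway scenes and one intersection scene
highway_1 = [0.9, 0.1, 0.0]
highway_2 = [0.8, 0.2, 0.1]
intersection = [0.1, 0.2, 0.9]

sim_same = cosine_similarity(highway_1, highway_2)
sim_diff = cosine_similarity(highway_1, intersection)
print(f"highway_1 vs highway_2:    {sim_same:.2f}")
print(f"highway_1 vs intersection: {sim_diff:.2f}")
```

The two highway vectors score much closer to 1.0 than the highway/intersection pair. That gap is the signal a diversity-based selection relies on.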

FiftyOne's Brain module provides easy access to embedding computation.

In [None]:
import fiftyone.brain as fob

# Compute embeddings using a pre-trained model
# This may take a few minutes depending on your hardware
results = fob.compute_visualization(
    dataset,
    embeddings="embeddings",  # Store embeddings in this field
    brain_key="img_viz",      # Name for this brain run
    verbose=True
)

print("\nEmbeddings computed and stored in the 'embeddings' field")
print("Brain run saved as: img_viz")

In [None]:
# Launch the App to visualize the embeddings
session = fo.launch_app(dataset)

In the FiftyOne App, click on the **Embeddings** panel to see the 2D projection of your dataset. You'll notice:
- **Clusters**: Groups of similar images (same scene type, lighting, etc.)
- **Outliers**: Unusual samples at the edges
- **Dense regions**: Areas with many similar samples (redundant for labeling)

Random sampling would over-sample dense regions. We want to spread across clusters.
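
To see the over-sampling effect in numbers, here is a small sketch on a made-up pool. The 900/100 split and the scene names are assumptions chosen for illustration, not properties of the KITTI pool.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical pool: 900 near-duplicate "highway" frames
# and 100 rarer "intersection" frames
pool = ["highway"] * 900 + ["intersection"] * 100

# Uniform random sample of 50 frames, without replacement
sample = random.sample(pool, 50)
from_dense = sample.count("highway")
from_sparse = sample.count("intersection")
print(f"highway: {from_dense}, intersection: {from_sparse}")
```

Because about 90% of this toy pool is highway frames, roughly 90% of a random sample will be too. The rare scene type gets only a handful of labels.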

## ZCore: Zero-Shot Coreset Selection

ZCore selects a **coreset** - a small subset that represents the full dataset. It works by:
1. Computing pairwise distances in embedding space
2. Selecting samples that maximize coverage (minimize redundancy)
3. Ensuring selected samples span the full distribution

The result: labeling fewer samples while maintaining dataset coverage.
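
To make the coverage idea concrete, the sketch below uses greedy farthest-point (k-center) selection on made-up 2D points. This is a stand-in technique chosen for brevity, not ZCore's actual algorithm. `farthest_point_selection` is a hypothetical helper, not part of FiftyOne.

```python
import math
import random

def farthest_point_selection(points, k, start=0):
    """Greedy k-center: repeatedly add the point farthest from the current selection."""
    k = min(k, len(points))
    selected = [start]
    # Distance from every point to its nearest selected point so far
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        next_idx = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(next_idx)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[next_idx]))
    return selected

random.seed(0)
# Hypothetical 2D "embeddings": one dense cluster and two small ones
centers_and_sizes = [((0.0, 0.0), 80), ((5.0, 5.0), 10), ((-5.0, 5.0), 10)]
points = [
    (cx + random.gauss(0, 0.3), cy + random.gauss(0, 0.3))
    for (cx, cy), n in centers_and_sizes
    for _ in range(n)
]

coreset = farthest_point_selection(points, k=3)
print([tuple(round(c, 1) for c in points[i]) for i in coreset])
```

Although 80% of the points sit in the dense cluster, the three selected points land in three different clusters. This is the behavior a coverage-driven selection aims for.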

### Using ZCore via the Operators Panel

In the FiftyOne App:
1. Press **`** (backtick) to open the Operators panel
2. Search for **"zcore"** or **"coreset"**
3. Configure the operator:
   - **embeddings_field**: `embeddings`
   - **target_size**: Number of samples to select (we'll use ~20% of pool)
   - **output_tag**: `to_annotate_v0`

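Alternatively, we can approximate the selection programmatically. The cells below do not call ZCore itself; they use FiftyOne Brain's uniqueness scores as a proxy for it: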

In [None]:
# Calculate target batch size (20% of pool for first iteration)
batch_size = int(0.20 * len(pool_view))
print(f"Target batch size: {batch_size} samples (20% of pool)")

In [None]:
# Compute uniqueness scores as a proxy for ZCore
# Samples with high uniqueness are less redundant
fob.compute_uniqueness(
    pool_view,
    uniqueness_field="uniqueness",
    embeddings="embeddings"
)

print("Uniqueness scores computed!")

In [None]:
# Select the samples with the highest uniqueness scores (least redundant)
# Top-k uniqueness favors spread across the embedding space, but unlike a
# true coreset method it does not guarantee coverage
batch_v0_view = pool_view.sort_by("uniqueness", reverse=True).limit(batch_size)

print(f"Selected {len(batch_v0_view)} samples for Batch v0")

In [None]:
# Tag the selected samples
batch_v0_view.tag_samples("batch:v0")
batch_v0_view.tag_samples("to_annotate")

# Update annotation status
batch_v0_ids = list(batch_v0_view.values("id"))
dataset.select(batch_v0_ids).set_values("annotation_status", ["selected"] * len(batch_v0_ids))

print(f"Tagged {len(batch_v0_ids)} samples with 'batch:v0' and 'to_annotate'")

In [None]:
# Save this selection as a view for easy access
# overwrite=True lets this cell be re-run without a "view already exists" error
batch_v0 = dataset.match_tags("batch:v0")
dataset.save_view("batch_v0_to_annotate", batch_v0, overwrite=True)

print(f"Saved view: batch_v0_to_annotate ({len(batch_v0)} samples)")

## Visualize the Selection

Let's verify that our selection provides good coverage by looking at the embeddings visualization.

In [None]:
# View the selected samples in the App
session.view = batch_v0
session.show()

In the **Embeddings** panel, the selected samples should be **spread across clusters** rather than concentrated in one area. If they are, the selection is doing its job.

### What to Look For:
- **Good**: Selected samples span multiple regions of the embedding space
- **Bad**: Selected samples clump in one area (the selection is not covering the pool, or something went wrong in the selection code)
- **Check**: Compare selected vs. unselected in the embeddings view
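
One way to put a number on "good" versus "bad" is the coverage radius: the largest distance from any pool point to its nearest selected point. The sketch below computes it on hand-typed 2D points. `coverage_radius` is a hypothetical helper, not a FiftyOne function, and the coordinates are invented for illustration.

```python
import math

def coverage_radius(pool_points, selected_points):
    """Largest distance from any pool point to its nearest selected point."""
    return max(
        min(math.dist(p, s) for s in selected_points)
        for p in pool_points
    )

# Hypothetical 2D embedding coordinates forming three clusters
pool_points = [
    (0.0, 0.0), (0.2, 0.1), (0.1, 0.3),   # cluster A
    (5.0, 5.0), (5.1, 4.8),               # cluster B
    (-5.0, 5.0), (-4.9, 5.2),             # cluster C
]

spread = [(0.0, 0.0), (5.0, 5.0), (-5.0, 5.0)]   # one pick per cluster
clumped = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3)]   # all picks in cluster A

spread_radius = coverage_radius(pool_points, spread)
clumped_radius = coverage_radius(pool_points, clumped)
print(f"spread selection radius:  {spread_radius:.2f}")
print(f"clumped selection radius: {clumped_radius:.2f}")
```

A smaller radius means every pool sample has a selected neighbor nearby. The clumped selection leaves clusters B and C far from any labeled sample, so its radius is much larger.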

In [None]:
# Compare selection statistics with full pool
print("Selection Statistics:")
print(f"  Pool size: {len(pool_view)}")
print(f"  Selected: {len(batch_v0)} ({100*len(batch_v0)/len(pool_view):.1f}%)")
print(f"  Remaining: {len(pool_view) - len(batch_v0)}")

# Check class distribution in selected batch
from collections import Counter

selected_labels = []
for sample in batch_v0:
    if sample.ground_truth:
        selected_labels.extend([det.label for det in sample.ground_truth.detections])

print("\nClass distribution in Batch v0:")
for label, count in sorted(Counter(selected_labels).items(), key=lambda x: -x[1]):
    print(f"  {label}: {count}")

## Summary

In this step, you:

1. **Computed embeddings** - Mapped images to a vector space for similarity analysis
2. **Ran diversity selection** - Used uniqueness scoring as a proxy for ZCore to select a diverse, coverage-oriented batch
3. **Created Batch v0** - Selected ~20% of the pool for initial annotation
4. **Tagged and saved** - Samples are tagged `batch:v0` and `to_annotate` for tracking

**Key Insight**: This first batch maximizes coverage, not model performance. We're seeding the loop with diverse examples. Model-driven selection comes after we have a trained model.

**Artifacts Created**:
- `embeddings` field on all samples
- `uniqueness` field on pool samples
- `batch:v0` and `to_annotate` tags on selected samples
- `batch_v0_to_annotate` saved view

**Next up**: Step 3 - Human Annotation Pass + QA