# Step 3: Smart Sample Selection

Random sampling wastes labels on redundant near-duplicates. This step uses **diversity-based selection** to pick high-value samples that cover your data distribution efficiently.

We'll use **ZCore (Zero-Shot Coreset Selection)** to score samples based on:
- **Coverage**: How much of the embedding space does this sample represent?
- **Redundancy**: How many near-duplicates exist?

High ZCore score = valuable for labeling. Low score = redundant, skip it.

> **Reference:** ZCore is from [Voxel51's research](https://github.com/voxel51/zcore). The implementation below is simplified for understanding. For production use with large datasets, see the official repository.

In [None]:
import fiftyone as fo
import fiftyone.brain as fob
import numpy as np

dataset = fo.load_dataset("annotation_tutorial")
pool = dataset.load_saved_view("pool")

print(f"Pool size: {len(pool)} samples")

## Compute Embeddings

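Diversity selection needs embeddings to understand dataset structure. We'll compute them using FiftyOne Brain: the call below stores them in the `embeddings` field and also builds the 2D projection (brain key `img_viz`) shown later in the App's **Embeddings** panel.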

In [None]:
# Compute embeddings (takes a few minutes)
fob.compute_visualization(
    dataset,
    embeddings="embeddings",
    brain_key="img_viz",
    verbose=True
)

print("Embeddings computed.")

## ZCore: Zero-Shot Coreset Selection

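ZCore scores each sample by repeating three steps:
1. Sample a random point in a random low-dimensional slice of embedding space (`sample_dim` dimensions at a time)
2. Find the data point nearest to it and reward that point (coverage bonus)
3. Penalize that point's nearest neighbors, with closer neighbors penalized more (redundancy penalty)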

The result: samples covering unique regions score high; redundant samples score low.
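
To make the three steps concrete, here is a single iteration worked by hand on a toy 1-D dataset. It is only a sketch: the point values and the fixed `probe` are made up for illustration, plain Python lists stand in for NumPy arrays, and every other point is treated as a neighbor instead of only the closest `redund_nn`. The penalty weighting (inverse distance raised to `redund_exp`, normalized to sum to 1) mirrors the simplified implementation in this step.

```python
# Toy 1-D "embeddings" (hypothetical values): three near-duplicates and one isolated point.
points = [0.00, 0.01, 0.02, 1.00]
scores = [0.0] * len(points)
redund_exp = 4

# Step 1: a random probe in embedding space (fixed here so the result is reproducible).
probe = 0.012

# Step 2: the data point nearest to the probe earns the coverage bonus.
winner = min(range(len(points)), key=lambda i: abs(points[i] - probe))
scores[winner] += 1.0

# Step 3: the winner's neighbors share a total penalty of 1, weighted by inverse distance.
neighbors = [i for i in range(len(points)) if i != winner]
weights = [
    1.0 / (abs(points[i] - points[winner]) ** redund_exp + 1e-8)
    for i in neighbors
]
total = sum(weights)
for i, w in zip(neighbors, weights):
    scores[i] -= w / total

for p, s in zip(points, scores):
    print(f"point={p:.2f}  score={s:+.4f}")
```

The two near-duplicates of the winning point absorb almost all of the penalty, while the isolated point at 1.00 is barely touched. Repeated over thousands of probes, this is what pushes redundant samples toward low scores.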

> **Note:** This is a simplified reference implementation. It works well for datasets up to a few thousand samples. For larger datasets, use the optimized version at [github.com/voxel51/zcore](https://github.com/voxel51/zcore).

In [None]:
def zcore_score(embeddings, n_sample=10000, sample_dim=2, redund_nn=100, redund_exp=4, seed=42):
    """
    Compute ZCore scores for coverage-based sample selection.
    
    Reference implementation from https://github.com/voxel51/zcore
    For production use with large datasets, use the official package.
    
    Args:
        embeddings: np.array of shape (n_samples, embedding_dim)
        n_sample: Number of random samples to draw
        sample_dim: Number of dimensions to sample at a time
        redund_nn: Number of nearest neighbors for redundancy penalty
        redund_exp: Exponent for distance-based redundancy penalty
        seed: Random seed for reproducibility
    
    Returns:
        Normalized scores (0-1) where higher = more valuable for labeling
    """
    np.random.seed(seed)
    
    n = len(embeddings)
    n_dim = embeddings.shape[1]
    
    # Embedding statistics
    emb_min = np.min(embeddings, axis=0)
    emb_max = np.max(embeddings, axis=0)
    emb_med = np.median(embeddings, axis=0)
    
    # Initialize scores
    scores = np.random.uniform(0, 1, n)
    
    for i in range(n_sample):
        if i % 2000 == 0:
            print(f"  ZCore progress: {i}/{n_sample}")
        
        # Random embedding dimensions
        dim = np.random.choice(n_dim, min(sample_dim, n_dim), replace=False)
        
        # Sample point using triangular distribution (biased toward median)
        sample = np.random.triangular(emb_min[dim], emb_med[dim], emb_max[dim])
        
        # Coverage: find nearest sample to random point
        embed_dist = np.sum(np.abs(embeddings[:, dim] - sample), axis=1)
        idx = np.argmin(embed_dist)
        scores[idx] += 1  # Reward coverage
        
        # Redundancy: penalize nearby neighbors
        cover_sample = embeddings[idx, dim]
        nn_dist = np.sum(np.abs(embeddings[:, dim] - cover_sample), axis=1)
        nn = np.argsort(nn_dist)[1:]  # Exclude self
        
        if nn_dist[nn[0]] == 0:
            # Exact duplicate
            scores[nn[0]] -= 1
        else:
            # Distance-weighted penalty for neighbors
            nn = nn[:redund_nn]
            dist_penalty = 1 / (nn_dist[nn] ** redund_exp + 1e-8)
            dist_penalty /= np.sum(dist_penalty)
            scores[nn] -= dist_penalty
    
    # Normalize to [0, 1]
    scores = (scores - np.min(scores)) / (np.max(scores) - np.min(scores) + 1e-8)
    
    return scores.astype(np.float32)
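
The triangular draw inside `zcore_score` places more probes near the median of each dimension than near its extremes. A quick way to see that bias is sketched below with the standard library's `random.triangular`; note that its argument order is `(low, high, mode)`, unlike NumPy's `(left, mode, right)`. The bounds are made-up values, not statistics from the tutorial dataset.

```python
import random

random.seed(0)

# Hypothetical min, median, and max of one embedding dimension.
low, mode, high = 0.0, 2.0, 10.0

# Standard-library signature: triangular(low, high, mode).
draws = [random.triangular(low, high, mode) for _ in range(10_000)]

# Compare two windows of equal width: one centered on the mode, one at the far end.
near_mode = sum(1 for d in draws if mode - 1.0 <= d <= mode + 1.0) / len(draws)
top_band = sum(1 for d in draws if high - 2.0 <= d <= high) / len(draws)

print(f"share of draws within 1.0 of the mode: {near_mode:.2f}")
print(f"share of draws in the top 2.0 of the range: {top_band:.2f}")
```

Far more draws land in the window around the mode than in the equally wide window at the top of the range, so coverage probes concentrate where the data is centered.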

In [None]:
# Get embeddings for pool samples, skipping any sample without one
ids = pool.values("id")
embs = pool.values("embeddings")
valid = [(i, e) for i, e in zip(ids, embs) if e is not None]
valid_ids = [i for i, _ in valid]
embeddings = np.array([e for _, e in valid])

print(f"Computing ZCore for {len(embeddings)} samples...")
print(f"Embedding dimension: {embeddings.shape[1]}")

In [None]:
# Compute ZCore scores
# n_sample=5000 is fast for tutorials; increase for larger datasets
scores = zcore_score(
    embeddings,
    n_sample=5000,
    sample_dim=2,
    redund_nn=50,
    redund_exp=4,
    seed=42
)

print("\nZCore scores computed!")
print(f"Score range: {scores.min():.3f} - {scores.max():.3f}")
print(f"Mean: {scores.mean():.3f}, Std: {scores.std():.3f}")

In [None]:
# Add ZCore scores to samples
for sample_id, score in zip(valid_ids, scores):
    sample = dataset[sample_id]
    sample["zcore"] = float(score)
    sample.save()

print(f"Added 'zcore' field to {len(valid_ids)} samples")

## Select Your First Batch

**How many samples to label?**

A good starting point:
- **50-200 samples** for the initial batch (1-2 hours of labeling)
- Size subsequent batches based on your labeling throughput
- Typical iteration: half a day to one day of labeling per batch

For this tutorial, with ~130 pool samples, we'll select up to 30 (a little under 25% of the pool).
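
The cap-or-fraction sizing used here can be sketched as a small helper: take a fixed cap or a fraction of the pool, whichever is smaller. The function name and its defaults are hypothetical; adjust both to your own labeling capacity.

```python
def pick_batch_size(pool_size, cap=30, fraction=0.25):
    """Return the smaller of `cap` and `fraction` of the pool (hypothetical helper)."""
    return min(cap, int(fraction * pool_size))


for pool_size in (40, 130, 1000):
    print(pool_size, "->", pick_batch_size(pool_size))
# 40 -> 10
# 130 -> 30
# 1000 -> 30
```

For small pools the fraction wins; once the pool grows past 120 samples the cap of 30 takes over.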

In [None]:
# Reload pool to see new zcore field
pool = dataset.load_saved_view("pool")

# Select top samples by ZCore score
# Adjust batch_size based on your labeling capacity
batch_size = min(30, int(0.25 * len(pool)))  # at most 30 samples, or 25% of the pool if that is smaller
batch_v0 = pool.sort_by("zcore", reverse=True).limit(batch_size)

# bounds() returns (min, max) without loading every sample
zcore_min, zcore_max = batch_v0.bounds("zcore")

print(f"Selected {len(batch_v0)} samples for Batch 0")
print(f"ZCore range of selected: {zcore_min:.3f} - {zcore_max:.3f}")

In [None]:
# Tag the selection
batch_v0.tag_samples("batch:v0")
batch_v0.tag_samples("to_annotate")
batch_v0.set_values("annotation_status", ["selected"] * len(batch_v0))

# Save as view
dataset.save_view("batch_v0", dataset.match_tags("batch:v0"))

print("Tagged and saved view: batch_v0")

In [None]:
# Visualize in the App
session = fo.launch_app(dataset)

In the App:
1. Open the **Embeddings** panel to see the 2D projection
2. Color by the `zcore` field to see the score distribution
3. Filter by the `batch:v0` tag to see your selection
4. Verify high-ZCore samples are spread across clusters (good coverage)
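
If you want a number to go with the visual check, one rough option is to compare the mean nearest-neighbor distance of a selection against that of a tightly packed set: a well-spread batch should score higher. This is a sketch on hand-made 2-D coordinates, not part of ZCore or FiftyOne. `mean_nn_distance` is a hypothetical helper, and in practice you would pass it the batch's embeddings or projected points.

```python
import math


def mean_nn_distance(points):
    """Average Euclidean distance from each point to its nearest other point."""
    nearest = [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]
    return sum(nearest) / len(nearest)


# Hand-made examples: one selection spread over the space, one packed into a corner.
spread = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
clustered = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]

print(f"spread:    {mean_nn_distance(spread):.2f}")
print(f"clustered: {mean_nn_distance(clustered):.2f}")
```

A batch whose value is close to that of a random subset of the same size is probably not buying much extra coverage.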

## Why Diversity Sampling Beats Random

| Method | What it does | Result |
|--------|-------------|--------|
| **Random** | Picks samples uniformly | Over-samples dense regions, misses rare cases |
| **ZCore** | Balances coverage vs redundancy | Maximizes diversity, fewer wasted labels |

Research shows diversity-based selection can significantly reduce labeling requirements while maintaining model performance. See the [ZCore paper](https://arxiv.org/pdf/2411.15349) for detailed benchmarks on ImageNet.
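
The "misses rare cases" entry in the table can be made concrete with a little counting. Suppose, purely as an illustration, that 5 of the 130 pool samples belong to a rare case and you draw a uniformly random batch of 30. The chance that the batch contains none of them follows from the hypergeometric distribution; the count of 5 is an assumption, not a property of the tutorial dataset.

```python
from math import comb

pool_size = 130  # approximate pool size used in this tutorial
rare = 5         # assumed number of rare-case samples (illustrative only)
batch = 30       # batch size used in this tutorial

# P(no rare sample in a uniformly random batch) = C(pool - rare, batch) / C(pool, batch)
p_miss = comb(pool_size - rare, batch) / comb(pool_size, batch)

print(f"P(random batch misses every rare sample) = {p_miss:.3f}")
```

Under these assumed numbers, roughly one random batch in four would contain no rare-case sample at all.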

## Summary

You selected a diverse batch using ZCore:
- Computed embeddings to map images into a vector space
- Ran the ZCore algorithm to score coverage vs redundancy
- Selected the top samples by ZCore score
- Tagged them as `batch:v0` and `to_annotate`

**Artifacts:**
- `embeddings` field on all samples
- `zcore` field with selection scores
- `batch_v0` saved view

**Next:** Step 4 - Annotation + QA