feat: training dataset export pipeline to YOLO-seg format #415

## Problem / Motivation

To train the pottery-specific YOLO-seg model on Modal (milestone [#2](https://github.com/shaoster/glaze/milestone/2), sub-task #6), we need a reproducible way to assemble a training dataset from two heterogeneous sources:

1. `CropAnnotation` rows produced by the in-app supervised labeling UI ([#420](https://github.com/shaoster/glaze/issues/420)).
2. The Label Studio seed export (~100 hand-labeled images) from the bootstrap effort (sub-task #4).

Without a unified, versioned, deterministic export pipeline:

- Training runs are not reproducible (we cannot say "model v3 was trained on dataset v7").
- We cannot enforce, or later demonstrate, that only privacy-eligible imagery enters the training corpus.
- Splits drift between runs, contaminating eval results.
- Schema mismatches between the two sources (in-app polygons vs. Label Studio JSON) silently corrupt training data.

## Proposed Solution

A Bazel-runnable Python tool (e.g. `tools/export_crop_dataset.py` with a `py_binary` target) that materializes both sources into a single **YOLO-seg format** dataset (polygon-style instance masks, **not** COCO), written to local disk and/or pushed to a Modal Volume or S3 bucket.

### Inputs

- `CropAnnotation` rows from the Glaze DB (joined to `Image`, `Piece`, and the annotator).
- Label Studio export bundle from sub-task #4 (JSON + image assets).
- A config (CLI flags or YAML) controlling: output root, target backend (local dir / Modal Volume / S3), split ratios, RNG seed, optional source filters, dataset version tag.
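
As a rough illustration of the config surface, the sketch below parses the listed options with `argparse`. Only `--seed` appears elsewhere in this issue; every other flag name, the backend choices, and the `parse_config` helper are hypothetical placeholders, not a settled interface.

```python
import argparse
from typing import List, Optional

SOURCES = ["crop_annotation", "label_studio"]


def build_parser() -> argparse.ArgumentParser:
    # All flag names except --seed are assumptions made for this sketch.
    parser = argparse.ArgumentParser(prog="export_crop_dataset")
    parser.add_argument("--output-root", required=True)
    parser.add_argument("--backend", choices=["local", "modal", "s3"], default="local")
    parser.add_argument(
        "--split-ratios", type=float, nargs=3, default=[0.8, 0.1, 0.1],
        metavar=("TRAIN", "VAL", "TEST"),
    )
    parser.add_argument("--seed", type=int, required=True)
    parser.add_argument("--source", action="append", choices=SOURCES,
                        help="restrict the export to this source (repeatable)")
    parser.add_argument("--dataset-version", required=True)
    return parser


def parse_config(argv: Optional[List[str]] = None) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    # Reject ratios that do not describe a full partition of the data.
    if abs(sum(args.split_ratios) - 1.0) > 1e-9:
        raise ValueError(f"split ratios must sum to 1.0, got {args.split_ratios}")
    if any(r < 0 for r in args.split_ratios):
        raise ValueError("split ratios must be non-negative")
    # No --source filter means "export from every source".
    args.source = args.source or list(SOURCES)
    return args
```

Requiring `--seed` and `--dataset-version` rather than defaulting them is one way to keep an export from being produced without the values the dataset card needs to record.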

### Outputs

YOLO-seg layout:

```
dataset-v<version>/
  data.yaml            # class names, paths to train/val/test
  images/
    train/  val/  test/
  labels/              # one .txt per image: "<class> x1 y1 x2 y2 ... xN yN" (normalized polygon)
    train/  val/  test/
  dataset_card.json    # see below
  manifest.csv         # one row per image (included or excluded) with provenance and exclusion reason
```
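
To make the layout above concrete, here is a minimal sketch of how one label line and the `data.yaml` body might be rendered. The helper names and the `pottery` class name are illustrative assumptions; the `path` / `train` / `val` / `test` / `names` keys follow the dataset YAML convention used by Ultralytics, and how `path` should be set depends on where the training job mounts the dataset.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def format_label_line(class_id: int, polygon: Sequence[Point]) -> str:
    """Render one instance as '<class> x1 y1 ... xN yN' with normalized coordinates."""
    coords = " ".join(f"{x:.6f} {y:.6f}" for x, y in polygon)
    return f"{class_id} {coords}"


def render_data_yaml(dataset_root: str, class_names: Sequence[str]) -> str:
    """Render data.yaml by hand so the sketch has no YAML dependency."""
    lines = [
        f"path: {dataset_root}",
        "train: images/train",
        "val: images/val",
        "test: images/test",
        "names:",
    ]
    lines += [f"  {index}: {name}" for index, name in enumerate(class_names)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(format_label_line(0, [(0.1, 0.2), (0.5, 0.2), (0.5, 0.8)]))
    print(render_data_yaml(".", ["pottery"]))  # "pottery" is a placeholder class name
```

A fixed six-decimal format keeps label files byte-identical across reruns, which helps the determinism check in the acceptance criteria.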

### Dataset Card

`dataset_card.json` must capture:

- Dataset version + git SHA the export was built from.
- Total counts, per-split counts, per-source counts (`crop_annotation` vs `label_studio`), per-labeler counts.
- RNG seed used for splits.
- Input-set hash (e.g. SHA-256 over the sorted image IDs) so that reruns with the same inputs can be verified to produce the same dataset.
- List of excluded images and exclusion reason (privacy gate, missing mask, degenerate polygon, etc.).
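
The fields above could be assembled roughly as follows. This is a sketch under assumptions: the row shape (`key`, `split`, `source`, `labeler`), the `"<source>:<id>"` key convention, and the choice to hash every considered image (included and excluded) are all illustrative rather than decided.

```python
import hashlib
import json
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def input_set_hash(keys: Iterable[str]) -> str:
    """SHA-256 over the sorted, de-duplicated image keys; independent of input order."""
    digest = hashlib.sha256()
    for key in sorted(set(keys)):
        digest.update(key.encode("utf-8"))
        digest.update(b"\n")  # separator so ["ab"] and ["a", "b"] hash differently
    return digest.hexdigest()


def build_dataset_card(
    rows: List[Dict[str, str]],
    excluded: List[Tuple[str, str]],  # (image key, exclusion reason)
    *,
    version: str,
    git_sha: str,
    seed: int,
) -> dict:
    considered = [r["key"] for r in rows] + [key for key, _ in excluded]
    return {
        "version": version,
        "git_sha": git_sha,
        "seed": seed,
        "input_set_hash": input_set_hash(considered),
        "total": len(rows),
        "per_split": dict(Counter(r["split"] for r in rows)),
        "per_source": dict(Counter(r["source"] for r in rows)),
        "per_labeler": dict(Counter(r["labeler"] for r in rows)),
        "excluded": [{"key": key, "reason": reason} for key, reason in excluded],
    }


def dump_card(card: dict) -> str:
    # sort_keys makes the serialized card stable across runs.
    return json.dumps(card, sort_keys=True, indent=2)
```

Keying images as `"<source>:<id>"` avoids collisions if the two sources happen to reuse the same numeric IDs.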

### Splits

- Deterministic `train` / `val` / `test` split (default 80/10/10, configurable) keyed on a stable identifier (e.g. `Image.id`) and a CLI-provided seed.
- Splits must be **stratified by source** so the Label Studio seed corpus does not all fall into one split.
- Same `(seed, input set)` -> identical split assignment across machines.
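
One way to meet these three requirements is to rank images within each source by a hash of `(seed, source, id)` and cut the ranked list by ratio. The sketch below assumes images are identified by `(source, image_id)` string pairs and uses hypothetical function names; it swaps a seeded RNG for hashing so that the result does not depend on input order or on any particular random-number implementation.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str]  # (source, image_id), both as strings


def _rank(seed: int, key: Key) -> str:
    source, image_id = key
    return hashlib.sha256(f"{seed}:{source}:{image_id}".encode("utf-8")).hexdigest()


def assign_splits(
    keys: Iterable[Key],
    seed: int,
    val_ratio: float = 0.1,
    test_ratio: float = 0.1,
) -> Dict[Key, str]:
    """Assign each image to train/val/test, stratified by source."""
    by_source: Dict[str, List[Key]] = defaultdict(list)
    for key in set(keys):
        by_source[key[0]].append(key)

    assignment: Dict[Key, str] = {}
    for members in by_source.values():
        # Order depends only on (seed, key), never on the order keys arrived in.
        ordered = sorted(members, key=lambda k: (_rank(seed, k), k))
        n_val = round(len(ordered) * val_ratio)
        n_test = round(len(ordered) * test_ratio)
        for index, key in enumerate(ordered):
            if index < n_val:
                assignment[key] = "val"
            elif index < n_val + n_test:
                assignment[key] = "test"
            else:
                assignment[key] = "train"
    return assignment
```

With the default ratios, a source contributing fewer than about five images rounds to zero `val` and `test` members, so very small strata would need a minimum-count rule if every split must be populated.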

### Sharing / PII Gating

Only eligible images flow through:

- Public-library images (`user=NULL`) are always eligible.
- Potter-owned images are eligible only if a sharing/consent flag permits training use (mechanism to be defined alongside this work — likely a per-user `allow_training_use` boolean defaulting to the Privacy Policy's stated terms).
- Surface ineligible-image counts in the dataset card and in the manifest exclusion log.
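
The gate itself can stay small. The sketch below assumes the consent mechanism ends up as a per-user boolean (the `allow_training_use` idea mentioned above, which is not yet defined) and uses a hypothetical `ImageRecord` shape instead of the real Django models.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class ImageRecord:
    image_id: int
    owner_id: Optional[int]  # None models a public-library image (user=NULL)
    owner_allows_training: bool = False  # assumed per-user consent flag


def exclusion_reason(record: ImageRecord) -> Optional[str]:
    """Return None if the image is eligible, else a reason for the manifest."""
    if record.owner_id is None:
        return None  # public-library images are always eligible
    if record.owner_allows_training:
        return None
    return "privacy_gate"


def apply_gate(
    records: Iterable[ImageRecord],
) -> Tuple[List[ImageRecord], List[Tuple[ImageRecord, str]]]:
    included: List[ImageRecord] = []
    excluded: List[Tuple[ImageRecord, str]] = []
    for record in records:
        reason = exclusion_reason(record)
        if reason is None:
            included.append(record)
        else:
            excluded.append((record, reason))
    return included, excluded
```

Defaulting `owner_allows_training` to `False` means a missing or unreadable flag excludes the image rather than admitting it.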

### Schema Normalization

Both sources must be normalized into YOLO-seg polygons before write:

- `CropAnnotation` polygons -> normalized `[0,1]` coordinates relative to image dims.
- Label Studio export -> parsed via the labeling-interface config from sub-task #4; converted to the same normalized polygon format.
- Validate that every polygon has >=3 distinct vertices and lies within `[0,1]`; reject and log degenerate masks.
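
A minimal sketch of the normalization and validation steps follows. It assumes `CropAnnotation` stores polygon vertices in pixel coordinates and that the Label Studio export gives polygon points as percentages of the image dimensions; both assumptions should be checked against the real schemas from sub-tasks #3 and #4. The function names and reason strings are illustrative.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]


def normalize_pixel_polygon(points: Sequence[Point], width: int, height: int) -> List[Point]:
    """Pixel coordinates -> [0, 1] relative to image dimensions."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    return [(x / width, y / height) for x, y in points]


def normalize_percent_polygon(points: Sequence[Point]) -> List[Point]:
    """Percent-of-image coordinates (0 to 100) -> [0, 1]."""
    return [(x / 100.0, y / 100.0) for x, y in points]


def rejection_reason(polygon: Sequence[Point]) -> Optional[str]:
    """Return None for a usable polygon, else a reason to log."""
    if len(set(polygon)) < 3:
        return "fewer_than_3_distinct_vertices"
    if any(not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0) for x, y in polygon):
        return "out_of_bounds"
    return None


if __name__ == "__main__":
    poly = normalize_pixel_polygon([(0, 0), (100, 0), (100, 50)], width=200, height=100)
    print(poly, rejection_reason(poly))
```

Rejecting out-of-range polygons instead of clamping them keeps the "never silently included" guarantee from the acceptance criteria; clamping would be a separate, explicit decision.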

## Acceptance Criteria

- [ ] `tools/export_crop_dataset.py` exists and is exposed as a `py_binary` runnable via `bazel run //tools:export_crop_dataset -- <args>`.
- [ ] Reads from both `CropAnnotation` (sub-task #3) and the Label Studio export (sub-task #4); fails loudly if either source schema is unrecognized.
- [ ] Emits YOLO-seg-formatted polygon labels (not COCO, not bbox-only).
- [ ] Produces a `data.yaml` consumable by Ultralytics YOLO-seg training.
- [ ] Train/val/test splits are deterministic given the same `--seed` and input set; verified by a test that runs the exporter twice and diffs the manifests.
- [ ] Splits are stratified by source so Label Studio seed data is distributed across all three splits.
- [ ] Emits `dataset_card.json` with version, git SHA, counts, source breakdown, labeler distribution, seed, and input-set hash.
- [ ] Emits `manifest.csv` listing every included image and every excluded image with reason.
- [ ] Sharing/PII gate excludes potter-owned images that lack training consent; public-library images pass through.
- [ ] Degenerate or malformed polygons are dropped with a logged reason — never silently included.
- [ ] Supports writing to a local directory, a Modal Volume, or an S3 bucket (at least one remote backend implemented; others stubbed with clear `NotImplementedError`).
- [ ] **Privacy Policy verification**: confirm in writing (in the PR description and as a checked Definition of Done box) that the current Glaze Privacy Policy permits using potter-uploaded piece imagery to improve product features (the milestone's stated assumption). If it does not, **stop and surface a blocker on this issue** rather than shipping the exporter.
- [ ] Unit tests cover: split determinism, source stratification, polygon normalization from both source schemas, PII gate behavior, malformed-polygon rejection.
- [ ] Sample export of a small fixture dataset committed under `tools/testdata/` for tests.
- [ ] README or docstring at the top of the tool documents CLI flags, output layout, and how to plug the result into sub-task #6's training script.
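
For the backend criterion, one possible shape is a small interface with the local backend implemented and the remote ones stubbed. The class and method names below are hypothetical, and the sketch assumes the exporter first stages the dataset in a temporary directory and then publishes it.

```python
import shutil
from abc import ABC, abstractmethod
from pathlib import Path


class ExportBackend(ABC):
    @abstractmethod
    def publish(self, staged_dir: Path) -> str:
        """Copy a fully staged dataset directory to its destination; return its location."""


class LocalDirBackend(ExportBackend):
    def __init__(self, output_root: Path) -> None:
        self.output_root = Path(output_root)

    def publish(self, staged_dir: Path) -> str:
        destination = self.output_root / Path(staged_dir).name
        # copytree raises FileExistsError if the destination exists, so an
        # already-published dataset version is never overwritten in place.
        shutil.copytree(staged_dir, destination)
        return str(destination)


class S3Backend(ExportBackend):
    def publish(self, staged_dir: Path) -> str:
        raise NotImplementedError("S3 backend is not implemented yet; use the local backend")


class ModalVolumeBackend(ExportBackend):
    def publish(self, staged_dir: Path) -> str:
        raise NotImplementedError("Modal Volume backend is not implemented yet; use the local backend")
```

Note that the criterion asks for at least one remote backend to be implemented, so in the real tool one of the two stubs would carry an actual upload.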

## Dependencies

- Blocked on sub-task #3 (CropAnnotation model + UI).
- Blocked on sub-task #4 (Label Studio seed export bundle and labeling-interface config).
- Unblocks sub-task #6 (YOLO-seg training on Modal).

## Out of Scope

- The training script itself (sub-task #6).
- Adding new fields to `CropAnnotation` beyond what sub-task #3 already defines.
- Building the per-user `allow_training_use` consent UI — this exporter consumes the flag; the UI for setting it can land separately if needed.
- COCO-format export, bbox-only export, or non-segmentation label formats.
- Active-learning sample selection or class-balanced resampling (future work).



## Milestone Cross-References

- Depends on [#420 — Supervised Crop Annotation UI + Per-Potter Padding](https://github.com/shaoster/glaze/issues/420)
- Depends on [#414 — Label Studio Seed-Labeling Bootstrap](https://github.com/shaoster/glaze/issues/414)
- Blocks [#416 — YOLO-seg Training Script on Modal](https://github.com/shaoster/glaze/issues/416)

Part of milestone [#2 — Custom Pottery Crop Model](https://github.com/shaoster/glaze/milestone/2).
