feat: training dataset export pipeline to YOLO-seg format #415

@shaoster

Problem / Motivation

To train the pottery-specific YOLO-seg model on Modal (milestone #2, sub-task #6), we need a reproducible way to assemble a training dataset from two heterogeneous sources:

  1. CropAnnotation rows produced by the in-app supervised labeling UI (#420).
  2. The Label Studio seed export (~100 hand-labeled images) from the bootstrap effort (sub-task #4).

Without a unified, versioned, deterministic export pipeline:

  • Training runs are not reproducible (we cannot say "model v3 was trained on dataset v7").
  • We cannot defensibly enforce that only privacy-eligible imagery enters the training corpus.
  • Splits drift between runs, contaminating eval results.
  • Schema mismatches between the two sources (in-app polygons vs. Label Studio JSON) silently corrupt training data.

Proposed Solution

A Bazel-runnable Python tool — e.g. tools/export_crop_dataset.py with a py_binary target — that materializes both sources into a single YOLO-seg format dataset (polygon-style instance masks, not COCO) on disk and/or pushed to Modal Volume / S3.

Inputs

  • CropAnnotation rows from the Glaze DB (joined to Image, Piece, and the annotator).
  • Label Studio export bundle from sub-task #4 (JSON + image assets).
  • A config (CLI flags or YAML) controlling: output root, target backend (local dir / Modal Volume / S3), split ratios, RNG seed, optional source filters, dataset version tag.
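
A minimal sketch of the CLI-flag variant of that config, using only argparse. Only --seed is named elsewhere in this issue; every other flag name below is a placeholder assumption and should be renamed to whatever the implementation settles on.

```python
import argparse


def parse_ratios(text):
    """Parse 'train,val,test' ratios such as '0.8,0.1,0.1'."""
    parts = tuple(float(p) for p in text.split(","))
    if len(parts) != 3 or any(p < 0 for p in parts) or abs(sum(parts) - 1.0) > 1e-9:
        raise argparse.ArgumentTypeError(
            "expected three non-negative ratios summing to 1, e.g. 0.8,0.1,0.1"
        )
    return parts


def build_parser():
    # Flag names other than --seed are hypothetical.
    parser = argparse.ArgumentParser(prog="export_crop_dataset")
    parser.add_argument("--output-root", required=True)
    parser.add_argument("--backend", choices=["local", "modal", "s3"], default="local")
    parser.add_argument("--split-ratios", type=parse_ratios, default=(0.8, 0.1, 0.1))
    parser.add_argument("--seed", type=int, required=True)
    # Repeatable source filter; omitted means "all sources".
    parser.add_argument(
        "--source", action="append", choices=["crop_annotation", "label_studio"]
    )
    parser.add_argument("--dataset-version", required=True)
    return parser
```

Making --seed and the version tag required (rather than defaulted) is a deliberate choice in this sketch, so that no export can be produced without recording both.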

Outputs

YOLO-seg layout:

dataset-v<version>/
  data.yaml            # class names, paths to train/val/test
  images/
    train/  val/  test/
  labels/              # one .txt per image: "<class> x1 y1 x2 y2 ... xN yN" (normalized polygon)
    train/  val/  test/
  dataset_card.json    # see below
  manifest.csv         # one row per included image with provenance
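
As a rough illustration of the two text artifacts in that layout, the sketch below formats one label line and renders a data.yaml body. The function names and the "pottery" class name are assumptions for illustration; the data.yaml keys (path, train, val, test, names) follow the layout Ultralytics documents for segmentation datasets, and the real exporter may prefer a YAML library over string building.

```python
def format_label_line(class_id, polygon):
    """One YOLO-seg label line: '<class> x1 y1 x2 y2 ... xN yN'.

    `polygon` is a sequence of (x, y) pairs already normalized to [0, 1].
    """
    coords = " ".join(f"{value:.6f}" for point in polygon for value in point)
    return f"{class_id} {coords}"


def render_data_yaml(dataset_root, class_names):
    """Render data.yaml text pointing at images/{train,val,test}.

    Hand-built YAML; assumes the root path and class names contain no
    characters that need YAML quoting.
    """
    lines = [
        f"path: {dataset_root}",
        "train: images/train",
        "val: images/val",
        "test: images/test",
        "names:",
    ]
    lines += [f"  {index}: {name}" for index, name in enumerate(class_names)]
    return "\n".join(lines) + "\n"
```

For example, format_label_line(0, [(0.1, 0.2), (0.5, 0.2), (0.3, 0.8)]) yields "0 0.100000 0.200000 0.500000 0.200000 0.300000 0.800000".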

Dataset Card

dataset_card.json must capture:

  • Dataset version + git SHA the export was built from.
  • Total counts, per-split counts, per-source counts (crop_annotation vs label_studio), per-labeler counts.
  • RNG seed used for splits.
  • Input-set hash (e.g. SHA256 over sorted image IDs) so that reruns with the same inputs can be verified to produce the same dataset.
  • List of excluded images and exclusion reason (privacy gate, missing mask, degenerate polygon, etc.).
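
One possible shape for the "SHA256 over sorted image IDs" fingerprint, as a sketch; the function name is hypothetical. Note that it fingerprints the set of IDs, not the pixel content, so a re-uploaded image under the same ID would not change the hash.

```python
import hashlib


def input_set_hash(image_ids):
    """Order-independent SHA256 fingerprint of the set of input image IDs."""
    digest = hashlib.sha256()
    for image_id in sorted(str(i) for i in image_ids):
        digest.update(image_id.encode("utf-8"))
        # Delimiter so that IDs ("1", "23") and ("12", "3") hash differently.
        digest.update(b"\n")
    return digest.hexdigest()
```

Two runs whose dataset cards carry the same hash and the same seed should then produce identical manifests, which is what the determinism test can assert.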

Splits

  • Deterministic train / val / test split (default 80/10/10, configurable) keyed on a stable identifier (e.g. Image.id) and a CLI-provided seed.
  • Splits must be stratified by source so the seed corpus does not all fall into one split.
  • Same (seed, input set) -> identical split assignment across machines.
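
A sketch of one way to satisfy these three properties together: within each source, order images by a SHA256 of (seed, Image.id) and slice the ordered list by the ratios. Hash-based ordering is swapped in here instead of a seeded shuffle so the result does not depend on the Python version or on input order; the function names are hypothetical.

```python
import hashlib
from collections import defaultdict


def _rank(seed, image_id):
    # Stable pseudo-random sort key derived only from the seed and the ID.
    return hashlib.sha256(f"{seed}:{image_id}".encode("utf-8")).hexdigest()


def assign_splits(items, seed, ratios=(0.8, 0.1, 0.1)):
    """Map each image ID to 'train', 'val', or 'test', stratified by source.

    `items` is an iterable of (image_id, source) pairs.
    """
    by_source = defaultdict(list)
    for image_id, source in items:
        by_source[source].append(image_id)

    assignment = {}
    for source in sorted(by_source):
        ordered = sorted(by_source[source], key=lambda i: (_rank(seed, i), str(i)))
        n_train = round(len(ordered) * ratios[0])
        n_val = round(len(ordered) * ratios[1])
        for index, image_id in enumerate(ordered):
            if index < n_train:
                assignment[image_id] = "train"
            elif index < n_train + n_val:
                assignment[image_id] = "val"
            else:
                assignment[image_id] = "test"
    return assignment
```

With the ~100-image seed corpus and the default ratios this yields 80/10/10 for that source. Very small strata (a handful of images) can still leave val or test empty, so the exporter may want to warn in that case.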

Sharing / PII Gating

Only eligible images flow through:

  • Public-library images (user=NULL) are always eligible.
  • Potter-owned images are eligible only if a sharing/consent flag permits training use (mechanism to be defined alongside this work — likely a per-user allow_training_use boolean defaulting to the Privacy Policy's stated terms).
  • Surface ineligible-image counts in the dataset card and in the manifest exclusion log.
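
The gate above could look roughly like the following. The ImageRecord shape is a stand-in for the real Image/user join, and allow_training_use is the flag this issue proposes but has not yet defined; defaulting it to False here is a conservative assumption, not a statement of what the Privacy Policy requires.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ImageRecord:
    # Hypothetical flattened view of an image and its owner's consent flag.
    image_id: int
    user_id: Optional[int]  # None means a public-library image
    allow_training_use: bool = False


def exclusion_reason(record):
    """Return None if the image is eligible, else a reason string."""
    if record.user_id is None:
        return None  # public-library images are always eligible
    if record.allow_training_use:
        return None  # potter-owned, consent flag permits training use
    return "privacy_gate"


def apply_gate(records):
    """Split records into (included, excluded), keeping a reason per exclusion."""
    included, excluded = [], []
    for record in records:
        reason = exclusion_reason(record)
        if reason is None:
            included.append(record)
        else:
            excluded.append((record.image_id, reason))
    return included, excluded
```

The excluded list feeds both the dataset card counts and the manifest exclusion log, so no image is dropped without a recorded reason.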

Schema Normalization

Both sources must be normalized into YOLO-seg polygons before write:

  • CropAnnotation polygons -> normalized [0,1] coordinates relative to image dims.
  • Label Studio export -> parsed via the labeling-interface config from sub-task #4; converted to the same normalized polygon format.
  • Validate that every polygon has >=3 distinct vertices and lies within [0,1]; reject and log degenerate masks.
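
A sketch of the normalization and validation steps, under two assumptions: that CropAnnotation stores polygon vertices in pixel coordinates (the schema is not shown in this issue), and that the Label Studio export stores polygon points as percentages of image width and height, which is how Label Studio documents its polygon results. Function names and reason strings are hypothetical.

```python
def normalize_pixel_polygon(points, width, height):
    """Pixel-space (x, y) vertices -> [0, 1] coordinates relative to image dims."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    return [(x / width, y / height) for x, y in points]


def normalize_percent_polygon(points):
    """Percent-space (0-100) vertices, as assumed for Label Studio -> [0, 1]."""
    return [(x / 100.0, y / 100.0) for x, y in points]


def polygon_rejection_reason(polygon):
    """Return None for a usable polygon, else a reason to log in the manifest."""
    if len(set(map(tuple, polygon))) < 3:
        return "degenerate_polygon"  # fewer than 3 distinct vertices
    if any(not (0.0 <= value <= 1.0) for point in polygon for value in point):
        return "out_of_bounds"
    return None
```

This implements only the stated rule (at least 3 distinct vertices, all within [0,1]); three distinct but collinear vertices would still pass, so a zero-area check may be worth adding if such masks show up in practice.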

Acceptance Criteria

  • tools/export_crop_dataset.py exists and is exposed as a py_binary runnable via bazel run //tools:export_crop_dataset -- <args>.
  • Reads from both CropAnnotation (sub-task #3) and the Label Studio export (sub-task #4); fails loudly if either source schema is unrecognized.
  • Emits YOLO-seg-formatted polygon labels (not COCO, not bbox-only).
  • Produces a data.yaml consumable by Ultralytics YOLO-seg training.
  • Train/val/test splits are deterministic given the same --seed and input set; verified by a test that runs the exporter twice and diffs the manifests.
  • Splits are stratified by source so Label Studio seed data is distributed across all three splits.
  • Emits dataset_card.json with version, git SHA, counts, source breakdown, labeler distribution, seed, and input-set hash.
  • Emits manifest.csv listing every included image and every excluded image with reason.
  • Sharing/PII gate excludes potter-owned images that lack training consent; public-library images pass through.
  • Degenerate or malformed polygons are dropped with a logged reason — never silently included.
  • Supports writing to a local directory, a Modal Volume, or an S3 bucket (at least one remote backend implemented; the others stubbed with a clear NotImplementedError).
  • Privacy Policy verification: confirm in writing (in the PR description and as a checked DoD box) that the current Glaze Privacy Policy permits using potter-uploaded piece imagery to improve product features (the milestone's stated assumption). If it does not, stop and surface a blocker on this issue rather than shipping the exporter.
  • Unit tests cover: split determinism, source stratification, polygon normalization from both source schemas, PII gate behavior, malformed-polygon rejection.
  • Sample export of a small fixture dataset committed under tools/testdata/ for tests.
  • README or docstring at the top of the tool documents CLI flags, output layout, and how to plug the result into sub-task #6's training script.
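
To illustrate the "stubbed with a clear NotImplementedError" criterion, here is a minimal sketch of a backend interface in which only the local directory target is implemented. Class and method names are hypothetical, and the criteria above still require at least one of the remote stubs to be replaced with a real implementation.

```python
import shutil
from pathlib import Path


class LocalBackend:
    def publish(self, staged_dir, destination):
        """Copy a fully staged dataset directory to its final location."""
        target = Path(destination)
        # copytree raises FileExistsError if the target exists, so an
        # already-published dataset version is never overwritten in place.
        shutil.copytree(staged_dir, target)
        return str(target)


class ModalVolumeBackend:
    def publish(self, staged_dir, destination):
        raise NotImplementedError("Modal Volume backend is not implemented yet")


class S3Backend:
    def publish(self, staged_dir, destination):
        raise NotImplementedError("S3 backend is not implemented yet")


BACKENDS = {"local": LocalBackend, "modal": ModalVolumeBackend, "s3": S3Backend}
```

Staging the whole dataset locally and publishing it in one step keeps the card, manifest, and labels consistent with each other regardless of the target.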

Dependencies

Out of Scope

  • The training script itself (sub-task #6).
  • Adding new fields to CropAnnotation beyond what sub-task #3 already defines.
  • Building the per-user allow_training_use consent UI — this exporter consumes the flag; the UI for setting it can land separately if needed.
  • COCO-format export, bbox-only export, or non-segmentation label formats.
  • Active-learning sample selection or class-balanced resampling (future work).

Milestone Cross-References

Part of milestone #2 — Custom Pottery Crop Model.
