Problem / Motivation
To train the pottery-specific YOLO-seg model on Modal (milestone #2, sub-task #6), we need a reproducible way to assemble a training dataset from two heterogeneous sources:
CropAnnotation rows produced by the in-app supervised labeling UI (#420).
The Label Studio seed export (~100 hand-labeled images) from the bootstrap effort (sub-task #4).
Without a unified, versioned, deterministic export pipeline:
Training runs are not reproducible (we cannot say "model v3 was trained on dataset v7").
We cannot defensibly enforce that only privacy-eligible imagery enters the training corpus.
Splits drift between runs, contaminating eval results.
Schema mismatches between the two sources (in-app polygons vs. Label Studio JSON) silently corrupt training data.
Proposed Solution
A Bazel-runnable Python tool — e.g. tools/export_crop_dataset.py with a py_binary target — that materializes both sources into a single YOLO-seg format dataset (polygon-style instance masks, not COCO) on disk and/or pushed to Modal Volume / S3.
Inputs
CropAnnotation rows from the Glaze DB (joined to Image, Piece, and the annotator).
A config (CLI flags or YAML) controlling: output root, target backend (local dir / Modal Volume / S3), split ratios, RNG seed, optional source filters, dataset version tag.
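A minimal sketch of what the CLI surface could look like, assuming plain argparse. Only `--seed` is named elsewhere in this issue; every other flag name, the `ExportConfig` class, and the backend identifiers are hypothetical placeholders for discussion.

```python
import argparse
from dataclasses import dataclass

ALL_SOURCES = ("crop_annotation", "label_studio")


@dataclass(frozen=True)
class ExportConfig:
    """Hypothetical container for the exporter's settings."""
    output_root: str
    backend: str          # "local", "modal", or "s3" (placeholder names)
    split_ratios: tuple   # (train, val, test)
    seed: int
    version: str
    sources: tuple        # subset of ALL_SOURCES


def parse_args(argv=None) -> ExportConfig:
    p = argparse.ArgumentParser(description="Export a YOLO-seg dataset (sketch).")
    p.add_argument("--output-root", required=True)
    p.add_argument("--backend", choices=["local", "modal", "s3"], default="local")
    p.add_argument("--split", nargs=3, type=float, default=[0.8, 0.1, 0.1],
                   metavar=("TRAIN", "VAL", "TEST"))
    p.add_argument("--seed", type=int, required=True)
    p.add_argument("--version", required=True)
    # Optional source filter; may be given more than once.
    p.add_argument("--source", action="append", choices=ALL_SOURCES, default=None)
    ns = p.parse_args(argv)

    if any(r < 0 for r in ns.split) or abs(sum(ns.split) - 1.0) > 1e-9:
        p.error("split ratios must be non-negative and sum to 1.0")

    return ExportConfig(
        output_root=ns.output_root,
        backend=ns.backend,
        split_ratios=tuple(ns.split),
        seed=ns.seed,
        version=ns.version,
        sources=tuple(ns.source) if ns.source else ALL_SOURCES,
    )
```

Making `--seed` and `--version` required (rather than defaulted) is one way to keep accidental unversioned exports from happening; a YAML loader could populate the same dataclass.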
Outputs
YOLO-seg layout:
dataset-v<version>/
  data.yaml            # class names, paths to train/val/test
  images/
    train/ val/ test/
  labels/              # one .txt per image: "<class> x1 y1 x2 y2 ... xN yN" (normalized polygon)
    train/ val/ test/
  dataset_card.json    # see below
  manifest.csv         # one row per included image with provenance
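To make the layout concrete, here is a rough sketch of creating the directory skeleton, formatting one label line, and writing `data.yaml`. The function names and the single `pottery` class in the usage note are hypothetical. The `path` / `train` / `val` / `test` / `names` keys follow the Ultralytics dataset YAML convention; writing `path` as an absolute directory is a simplification and probably needs revisiting for Modal Volume / S3 targets.

```python
from pathlib import Path

SPLITS = ("train", "val", "test")


def make_layout(root: Path) -> None:
    """Create images/<split> and labels/<split> under the dataset root."""
    for kind in ("images", "labels"):
        for split in SPLITS:
            (root / kind / split).mkdir(parents=True, exist_ok=True)


def format_label_line(class_id: int, polygon) -> str:
    """Render '<class> x1 y1 ... xN yN' from normalized (x, y) pairs."""
    coords = " ".join(f"{v:.6f}" for xy in polygon for v in xy)
    return f"{class_id} {coords}"


def write_data_yaml(root: Path, class_names) -> Path:
    """Write a minimal data.yaml by hand (no YAML library needed)."""
    lines = [
        f"path: {root.resolve()}",  # assumption: absolute path, local use only
        "train: images/train",
        "val: images/val",
        "test: images/test",
        "names:",
    ]
    lines += [f"  {i}: {name}" for i, name in enumerate(class_names)]
    out = root / "data.yaml"
    out.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out
```

For example, `format_label_line(0, [(0.1, 0.2), (0.5, 0.2), (0.3, 0.9)])` yields one line for a triangle of class 0, and `write_data_yaml(root, ["pottery"])` would emit a one-class config.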
Dataset Card
dataset_card.json must capture:
Dataset version + git SHA the export was built from.
Total counts, per-split counts, per-source counts (crop_annotation vs label_studio), per-labeler counts.
RNG seed used for splits.
Source-image content hash (e.g. SHA-256 over sorted image IDs) so that reruns with the same inputs can be verified to produce the same dataset.
List of excluded images and exclusion reason (privacy gate, missing mask, degenerate polygon, etc.).
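A possible shape for the card builder, as a sketch only. The row dictionaries (`image_id`, `split`, `source`, `labeler`, `reason`) and the JSON key names are assumptions, not a settled schema. The hash here covers both included and excluded image IDs so that it describes the full input set; hashing only included IDs would be an equally valid reading of the requirement.

```python
import hashlib
import json
from collections import Counter


def input_set_hash(image_ids) -> str:
    """SHA-256 over the sorted image IDs, independent of input order."""
    digest = hashlib.sha256()
    for image_id in sorted(str(i) for i in image_ids):
        digest.update(image_id.encode("utf-8"))
        digest.update(b"\n")  # separator so "1","23" differs from "12","3"
    return digest.hexdigest()


def build_dataset_card(version, git_sha, seed, included, excluded) -> dict:
    """included: dicts with image_id/split/source/labeler.
    excluded: dicts with image_id/reason."""
    all_ids = [r["image_id"] for r in included] + [r["image_id"] for r in excluded]
    return {
        "version": version,
        "git_sha": git_sha,
        "seed": seed,
        "input_set_hash": input_set_hash(all_ids),
        "total": len(included),
        "per_split": dict(Counter(r["split"] for r in included)),
        "per_source": dict(Counter(r["source"] for r in included)),
        "per_labeler": dict(Counter(r["labeler"] for r in included)),
        "excluded": [
            {"image_id": r["image_id"], "reason": r["reason"]} for r in excluded
        ],
    }


def card_to_json(card: dict) -> str:
    """Serialize with sorted keys so identical cards give identical bytes."""
    return json.dumps(card, sort_keys=True, indent=2)
```

Sorting keys at serialization time keeps the card itself diffable between two runs, which fits the exporter-twice determinism test in the acceptance criteria.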
Splits
Deterministic train / val / test split (default 80/10/10, configurable) keyed on a stable identifier (e.g. Image.id) and a CLI-provided seed.
Splits must be stratified by source so the seed corpus does not all fall into one split.
Same (seed, input set) -> identical split assignment across machines.
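One way to get these three properties together, sketched under assumptions: rank each image by a SHA-256 of `"<seed>:<image id>"` rather than relying on a PRNG, and cut the ranked list per source. The function name and the `(image_id, source)` input shape are hypothetical. Because assignment is positional within each source, adding or removing an image can move other images between splits; that is acceptable if the guarantee is only "same seed and same input set".

```python
import hashlib
from collections import defaultdict


def _rank_key(seed: int, image_id) -> str:
    """Stable per-image key; does not depend on platform or input order."""
    return hashlib.sha256(f"{seed}:{image_id}".encode("utf-8")).hexdigest()


def assign_splits(items, seed: int, ratios=(0.8, 0.1, 0.1)) -> dict:
    """items: iterable of (image_id, source). Returns {image_id: split}."""
    if any(r < 0 for r in ratios) or abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must be non-negative and sum to 1.0")

    by_source = defaultdict(list)
    for image_id, source in items:
        by_source[source].append(image_id)

    assignment = {}
    for source in sorted(by_source):
        # Order within a source depends only on (seed, image_id).
        ids = sorted(by_source[source], key=lambda i: _rank_key(seed, i))
        n = len(ids)
        cut_train = round(n * ratios[0])
        cut_val = round(n * (ratios[0] + ratios[1]))
        for pos, image_id in enumerate(ids):
            if pos < cut_train:
                assignment[image_id] = "train"
            elif pos < cut_val:
                assignment[image_id] = "val"
            else:
                assignment[image_id] = "test"
    return assignment
```

With very small sources the rounding can leave a split empty (three images at 80/10/10 give no val image), so the exporter may want to warn or fail when a source contributes nothing to some split.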
Sharing / PII Gating
Only eligible images flow through:
Public-library images (user=NULL) are always eligible.
Potter-owned images are eligible only if a sharing/consent flag permits training use (mechanism to be defined alongside this work — likely a per-user allow_training_use boolean defaulting to the Privacy Policy's stated terms).
Surface ineligible-image counts in the dataset card and in the manifest exclusion log.
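The gate itself can be a small pure function, which keeps it easy to unit test. This is a sketch that assumes the consent mechanism ends up as the per-user `allow_training_use` boolean floated above, flattened onto each image record; `ImageRecord`, its field names, and the `privacy_gate` reason string are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ImageRecord:
    """Hypothetical flattened view of an Image row plus its owner's consent."""
    image_id: int
    user_id: Optional[int]            # None means public-library image
    allow_training_use: bool = False  # assumed default: no consent recorded


def exclusion_reason(rec: ImageRecord) -> Optional[str]:
    """Return None if the image may be exported, else a reason string."""
    if rec.user_id is None:
        return None  # public-library images are always eligible
    if rec.allow_training_use:
        return None  # potter-owned with explicit consent
    return "privacy_gate"


def apply_gate(records):
    """Split records into (eligible, excluded) where excluded carries reasons."""
    eligible, excluded = [], []
    for rec in records:
        reason = exclusion_reason(rec)
        if reason is None:
            eligible.append(rec)
        else:
            excluded.append({"image_id": rec.image_id, "reason": reason})
    return eligible, excluded
```

Defaulting the flag to `False` in the sketch means a missing or unknown consent value fails closed; whether that matches the Privacy Policy's stated terms is exactly what the verification item in the acceptance criteria has to settle.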
Schema Normalization
Both sources must be normalized into YOLO-seg polygons before write:
CropAnnotation polygons -> normalized [0,1] coordinates relative to image dims.
Label Studio export -> parsed via the labeling-interface config from sub-task #4; converted to the same normalized polygon format.
Validate that every polygon has >=3 distinct vertices and lies within [0,1]; reject and log degenerate masks.
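A sketch of the normalization and validation step. It assumes `CropAnnotation` stores polygon vertices in pixel coordinates (not confirmed in this issue) and that the Label Studio export stores polygon points as percentages of image size, which is how Label Studio polygon results are usually represented. The function names and reason strings are hypothetical.

```python
from typing import Optional


def normalize_pixels(points, width: int, height: int):
    """Pixel (x, y) pairs -> [0, 1] coordinates relative to image dims."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    return [(x / width, y / height) for x, y in points]


def normalize_percent(points):
    """Percent (0-100) pairs, as assumed for Label Studio, -> [0, 1]."""
    return [(x / 100.0, y / 100.0) for x, y in points]


def validate_polygon(polygon) -> Optional[str]:
    """Return None if the polygon is usable, else a rejection reason."""
    if len(set(polygon)) < 3:
        return "degenerate_polygon"  # fewer than 3 distinct vertices
    if any(not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0) for x, y in polygon):
        return "out_of_bounds"
    return None
```

Returning a reason string instead of raising lets the caller route rejected masks into the manifest exclusion log. Whether slightly out-of-range points should be clamped rather than rejected is left open here; the text above says reject.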
Acceptance Criteria
tools/export_crop_dataset.py exists and is exposed as a py_binary runnable via bazel run //tools:export_crop_dataset -- <args>.
Emits YOLO-seg-formatted polygon labels (not COCO, not bbox-only).
Produces a data.yaml consumable by Ultralytics YOLO-seg training.
Train/val/test splits are deterministic given the same --seed and input set; verified by a test that runs the exporter twice and diffs the manifests.
Splits are stratified by source so Label Studio seed data is distributed across all three splits.
Emits dataset_card.json with version, git SHA, counts, source breakdown, labeler distribution, seed, and input-set hash.
Emits manifest.csv listing every included image and every excluded image with reason.
Sharing/PII gate excludes potter-owned images that lack training consent; public-library images pass through.
Degenerate or malformed polygons are dropped with a logged reason — never silently included.
Supports writing to a local directory, a Modal Volume, or an S3 bucket (at least one remote backend implemented; others stubbed with clear NotImplementedError).
Privacy Policy verification: confirm in writing (in the PR description and as a checked DoD box) that the current Glaze Privacy Policy permits using potter-uploaded piece imagery to improve product features (the milestone's stated assumption). If it does not, stop and surface a blocker on this issue rather than shipping the exporter.
Unit tests cover: split determinism, source stratification, polygon normalization from both source schemas, PII gate behavior, malformed-polygon rejection.
Sample export of a small fixture dataset committed under tools/testdata/ for tests.
Reads both CropAnnotation (sub-task #3) and the Label Studio export (sub-task #4); fails loudly if either source schema is unrecognized.
Dependencies
Out of Scope
Changes to CropAnnotation beyond what sub-task #3 already defines.
The allow_training_use consent UI: this exporter consumes the flag; the UI for setting it can land separately if needed.
Milestone Cross-References
Part of milestone #2 — Custom Pottery Crop Model.