feat: YOLO-seg training script for pottery crop model (local GPU)

## Problem / Motivation

Milestone [#2](https://github.com/shaoster/glaze/milestone/2) replaces the off-the-shelf rembg/u2net pipeline backing `piece_image_crop_service` with a pottery-tuned Ultralytics YOLO-seg model. The training dataset will be materialized by [#415](https://github.com/shaoster/glaze/issues/415) in YOLO-seg format with deterministic splits and a versioned dataset card. We need a script that can consume that dataset and produce a trained foreground-segmentation checkpoint **on a local developer GPU** (Modal GPU remains an optional path for collaborators without local hardware), ready for the evaluation harness in [#417](https://github.com/shaoster/glaze/issues/417) to grade.

This issue STOPS at "can produce a checkpoint with passable Ultralytics val metrics." The rembg-baseline comparison, locked success-criteria check, and production go/no-go report all happen in [#417](https://github.com/shaoster/glaze/issues/417).

## Proposed Solution

Add `tools/train_crop_model.py` wrapping the Ultralytics YOLO-seg trainer, with two execution modes:

- **Local (primary)**: runs `ultralytics` directly on the maintainer's local GPU against a dataset on disk. This is the gating path — full training runs and the checkpoint that feeds [#417](https://github.com/shaoster/glaze/issues/417) come from here.
- **Modal (optional)**: same entrypoint, deployable as a Modal GPU function for collaborators without local hardware. Pulls the dataset from the Modal Volume / S3 location materialized by [#415](https://github.com/shaoster/glaze/issues/415) and writes checkpoints to a versioned Modal Volume path that [#419](https://github.com/shaoster/glaze/issues/419)'s backend registry can later resolve. Not required to ship this issue, but the abstraction should be cleanly factored so it isn't a rewrite to add.
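
One way to keep the two modes cleanly factored is a single mode-agnostic core that both the local CLI and a Modal function call. The sketch below is a minimal illustration under stated assumptions: `build_train_kwargs` and `run_training` are hypothetical names, and the config keys (`model`, `epochs`, `batch`, `imgsz`) are assumed rather than taken from an existing schema. The `YOLO(...).train(...)` call and its `data`, `epochs`, `batch`, `imgsz`, `project`, `name`, and `exist_ok` arguments are standard Ultralytics usage.

```python
from pathlib import Path
from typing import Any


def build_train_kwargs(cfg: dict[str, Any], dataset_yaml: Path,
                       output_root: Path, run_id: str) -> dict[str, Any]:
    """Translate a resolved config (assumed keys) into YOLO.train() kwargs."""
    return {
        "data": str(dataset_yaml),
        "epochs": cfg["epochs"],
        "batch": cfg["batch"],
        "imgsz": cfg["imgsz"],
        "project": str(output_root),  # Ultralytics writes under <project>/<name>/
        "name": run_id,
        "exist_ok": True,  # reuse the run directory instead of auto-suffixing it
    }


def run_training(cfg: dict[str, Any], dataset_yaml: Path,
                 output_root: Path, run_id: str):
    """Mode-agnostic core: a local CLI and a Modal function could both call this."""
    # Imported lazily so config and run-id tests do not need the GPU stack.
    from ultralytics import YOLO

    model = YOLO(cfg["model"])  # e.g. "yolov8n-seg.pt"
    return model.train(**build_train_kwargs(cfg, dataset_yaml, output_root, run_id))
```

Ultralytics places weights under `<project>/<name>/weights/`, so producing the flat `crop-models/<run_id>/` layout described under "Checkpoint upload" would likely need a small copy step after training.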

### Model output

A foreground segmentation mask (per the locked design decision in milestone [#2](https://github.com/shaoster/glaze/milestone/2)). Pottery is a single-class problem here — one foreground class, no maker-mark or sub-part heads. The tight bbox and any padding are derived downstream and are explicitly **out of scope** for this issue.

### Configuration

Training is driven by a YAML config (e.g. `tools/configs/crop_model/<name>.yaml`) covering at minimum:

- Base model (e.g. `yolov8n-seg.pt`, `yolov8s-seg.pt`) and image size.
- Epochs, batch size, optimizer, LR schedule, augmentation toggles.
- Dataset version pointer (the versioned dataset card from [#415](https://github.com/shaoster/glaze/issues/415)).
- Output directory layout (under the local `--output-dir` or the Modal Volume).

A small set of CLI overrides (`--epochs`, `--batch`, `--resume`, `--config`, `--run-id`) is enough — no full hyperparameter sweep machinery in this ticket.
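
The override set above could be wired with `argparse` roughly as follows. This is a sketch, not the final CLI: `build_parser` and `apply_overrides` are hypothetical helper names, and the config is assumed to already be loaded into a plain dict (YAML loading is left out to avoid assuming a parser dependency).

```python
import argparse
from typing import Any


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="train_crop_model")
    parser.add_argument("--config", required=True, help="Path to the YAML config")
    parser.add_argument("--epochs", type=int, default=None)
    parser.add_argument("--batch", type=int, default=None)
    parser.add_argument("--run-id", default=None)
    # Bare --resume means "latest checkpoint"; --resume <path> names one explicitly.
    parser.add_argument("--resume", nargs="?", const=True, default=None)
    return parser


def apply_overrides(cfg: dict[str, Any], args: argparse.Namespace) -> dict[str, Any]:
    """Return a copy of cfg with explicitly passed CLI values layered on top."""
    resolved = dict(cfg)
    for key in ("epochs", "batch"):
        value = getattr(args, key)
        if value is not None:  # None means the flag was not given
            resolved[key] = value
    return resolved
```

Defaulting every override to `None` keeps "flag not passed" distinguishable from any real value, so the YAML config stays the source of truth unless the user explicitly overrides it.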

### Checkpoint upload

Checkpoints land at a versioned path. Local mode writes to a directory under the repo (gitignored) or a configurable `--output-dir`; Modal mode writes the same layout to a Modal Volume.

```
crop-models/<run_id>/
  best.pt
  last.pt
  args.yaml          # full resolved Ultralytics args
  run.json           # run_id, dataset version, git sha, hparams, final val metrics
  results.csv        # Ultralytics per-epoch log
```
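
A layout check along these lines could back the upload-path tests and guard against shipping an incomplete run. It is a minimal sketch: `EXPECTED_ARTIFACTS` and `missing_artifacts` are hypothetical names, and the file list simply mirrors the tree above.

```python
from pathlib import Path

# Mirrors the crop-models/<run_id>/ layout shown above.
EXPECTED_ARTIFACTS = ("best.pt", "last.pt", "args.yaml", "run.json", "results.csv")


def missing_artifacts(run_dir: Path) -> list[str]:
    """Return the expected artifact names that are absent from run_dir."""
    return [name for name in EXPECTED_ARTIFACTS if not (run_dir / name).is_file()]
```

An upload step could refuse to proceed when the returned list is non-empty.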

`run_id` is a deterministic slug (e.g. `<config-name>-<utc-timestamp>-<git-shortsha>`). [#419](https://github.com/shaoster/glaze/issues/419) will read this structure regardless of whether the checkpoint was first written to local disk or directly to a Modal Volume; this issue only has to write it consistently. A small `--upload-to <modal-volume|s3>` flag (or a separate subcommand) is sufficient for shipping local checkpoints to a remote location where serving can pick them up.
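
A possible shape for the slug generator, assuming the example format above. `make_run_id` is a hypothetical name, and the exact timestamp format and 7-character short sha are illustrative choices, not decided conventions.

```python
import re
from datetime import datetime, timezone


def make_run_id(config_name: str, git_sha: str, now: datetime) -> str:
    """Build <config-name>-<utc-timestamp>-<git-shortsha> from explicit inputs."""
    # Lowercase and collapse anything outside [a-z0-9] into single hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", config_name.lower()).strip("-")
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{slug}-{stamp}-{git_sha[:7]}"
```

Passing the clock and sha in as arguments, rather than reading them inside the function, keeps the result reproducible for a given set of inputs and makes the run-id unit test trivial to pin.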

### Resumable runs

`--resume` with no value resumes the latest checkpoint for the given `run_id`/output dir; `--resume <path>` resumes from an explicit checkpoint. Local resume must round-trip a killed-and-restarted process; if Modal mode is exercised, resume must also round-trip a preemption.
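
The two `--resume` forms could be resolved to a concrete checkpoint path as sketched below. `resolve_resume` is a hypothetical helper, and it assumes `last.pt` sits at the top of the run directory as in the layout under "Checkpoint upload".

```python
from pathlib import Path
from typing import Optional, Union


def resolve_resume(resume: Union[bool, str, None], run_dir: Path) -> Optional[Path]:
    """Map the --resume value to a checkpoint path, or None for a fresh run."""
    if resume is None or resume is False:
        return None
    # True means bare --resume: use the latest checkpoint in the run directory.
    candidate = run_dir / "last.pt" if resume is True else Path(resume)
    if not candidate.is_file():
        raise FileNotFoundError(f"No checkpoint to resume from at {candidate}")
    return candidate
```

With a resolved path in hand, the usual Ultralytics pattern is to load that checkpoint with `YOLO(path)` and call `.train(resume=True)`.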

### Experiment tracking

Minimal, in-repo: every run writes `run.json` containing run id, resolved hparams, dataset version, git sha, Ultralytics version, and final val metrics (mAP50, mAP50-95, mask IoU on val). No W&B / MLflow integration in this ticket; keep dependencies tight. A small `tools/list_crop_model_runs.py` (or equivalent subcommand) that enumerates `crop-models/*/run.json` in the local output directory or on the Volume is a nice-to-have but not required.
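
A stdlib-only sketch of the `run.json` writer follows. The function names and JSON key names are assumptions for illustration; the metrics are passed in as a plain dict because how they are extracted from the Ultralytics results object is left to the implementation.

```python
import json
from importlib import metadata
from pathlib import Path
from typing import Any


def ultralytics_version() -> str:
    """Installed ultralytics version, or 'unknown' if the package is absent."""
    try:
        return metadata.version("ultralytics")
    except metadata.PackageNotFoundError:
        return "unknown"


def write_run_json(run_dir: Path, *, run_id: str, dataset_version: str,
                   git_sha: str, hparams: dict[str, Any],
                   val_metrics: dict[str, float]) -> Path:
    """Write run.json into run_dir and return its path."""
    record = {
        "run_id": run_id,
        "dataset_version": dataset_version,
        "git_sha": git_sha,
        "ultralytics_version": ultralytics_version(),
        "hparams": hparams,
        "val_metrics": val_metrics,  # e.g. mAP50, mAP50-95, mask IoU
    }
    run_dir.mkdir(parents=True, exist_ok=True)
    path = run_dir / "run.json"
    path.write_text(json.dumps(record, indent=2, sort_keys=True) + "\n")
    return path
```

Sorted keys and a trailing newline keep the file diff-friendly if runs are ever compared by hand.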

### Bazel wiring

`tools/train_crop_model.py` is exposed as a `py_binary` (and the Modal entrypoint as a separate target if cleaner). Use the existing patterns in `tools/` — no new Python deps without a follow-up (`ultralytics` lands via the standard dep-add flow; flag if it isn't already declared).

## Acceptance Criteria

- [ ] `tools/train_crop_model.py` exists with a documented CLI and is runnable as a `py_binary`.
- [ ] At least one example config under `tools/configs/crop_model/` (e.g. `yolov8n-seg-default.yaml`) is checked in and used by tests/docs.
- [ ] Runs locally against a tiny fixture dataset (smoke test) AND against the full [#415](https://github.com/shaoster/glaze/issues/415) dataset on the maintainer's GPU.
- [ ] Modal entrypoint exists and can be invoked (`modal run …`) — full Modal training is **optional**; it must work, but the gating training run for this milestone is local.
- [ ] Trained checkpoints, `args.yaml`, `run.json`, and `results.csv` land in a versioned `crop-models/<run_id>/` directory locally; `--upload-to` (or equivalent) ships the same layout to a Modal Volume / S3 path.
- [ ] `--resume` resumes the latest checkpoint after a simulated process kill and produces a continuous `results.csv`.
- [ ] `run.json` captures: run id, dataset version, git sha, Ultralytics version, resolved hparams, final Ultralytics val metrics (mAP50, mAP50-95, mask IoU).
- [ ] A README or top-of-file docstring documents: how to launch a local training run, where checkpoints land, how to optionally run on Modal, and how to resume.
- [ ] At least one full training run completes end-to-end on the [#415](https://github.com/shaoster/glaze/issues/415) dataset (locally) and produces a checkpoint with **non-trivial Ultralytics val metrics** (i.e. clearly above random — exact thresholds are NOT gated here; [#417](https://github.com/shaoster/glaze/issues/417) sets the production bar).
- [ ] Lints + tests pass (`rtk bazel build --config=lint //...`, `rtk bazel test //...`). Tests cover config parsing, run-id generation, and the upload-path layout — full GPU training is exercised manually, not in CI.
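
For the `--resume` criterion, "continuous `results.csv`" could be checked mechanically after a kill-and-restart. The sketch below assumes the file has an `epoch` column (as Ultralytics' per-epoch log does) and treats "continuous" as "each row's epoch is exactly one more than the previous row's"; `epochs_are_continuous` is a hypothetical helper name.

```python
import csv
import io


def epochs_are_continuous(csv_text: str) -> bool:
    """True if the epoch column increases by exactly 1 from row to row."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if len(rows) < 2:
        return False  # header only, or empty file
    # Strip header names in case the writer pads columns with whitespace.
    header = [name.strip() for name in rows[0]]
    col = header.index("epoch")
    epochs = [int(float(row[col])) for row in rows[1:]]
    return all(later - earlier == 1 for earlier, later in zip(epochs, epochs[1:]))
```

A duplicated or skipped epoch after a restart makes the check fail, which is the symptom a broken resume would most likely produce.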

## Out of Scope

- Comparing the trained model against rembg, evaluating on the flagged-bad corpus from [#1](https://github.com/shaoster/glaze/issues/1)/[#2](https://github.com/shaoster/glaze/issues/2), or making a go/no-go call against the locked success criteria: all of that lives in [#417](https://github.com/shaoster/glaze/issues/417).
- Serving / inference endpoints, the backend registry, and per-version Modal apps — [#419](https://github.com/shaoster/glaze/issues/419).
- A/B routing, shadow mode, dashboards — [#418](https://github.com/shaoster/glaze/issues/418).
- Padding logic (per-potter setting, downstream of the model) — [#420](https://github.com/shaoster/glaze/issues/420) / [#419](https://github.com/shaoster/glaze/issues/419).
- Hyperparameter sweeps, NAS, distillation, multi-GPU/DDP — explicitly deferred. One config + one GPU is enough to clear this ticket.
- W&B / MLflow / external experiment trackers.
- Any changes to `piece_image_crop_service.py` or the existing rembg path.

## Dependencies

- Depends on [#415](https://github.com/shaoster/glaze/issues/415) (Training Dataset Export Pipeline) — needs the YOLO-seg-formatted dataset and dataset version card.
- Unblocks [#417](https://github.com/shaoster/glaze/issues/417) (Evaluation Harness & Baseline Report) and [#419](https://github.com/shaoster/glaze/issues/419) (Per-Model-Version Modal Apps + Backend Registry).



## Milestone Cross-References

- Depends on [#415 — Training Dataset Export Pipeline](https://github.com/shaoster/glaze/issues/415)
- Blocks [#417 — Evaluation Harness & Baseline Report](https://github.com/shaoster/glaze/issues/417)

Part of milestone [#2 — Custom Pottery Crop Model](https://github.com/shaoster/glaze/milestone/2).


