Problem / Motivation
Once the pottery-tuned YOLO-seg checkpoint produced by the training script (#416) exists, we need an objective, reproducible way to decide whether it is good enough to promote to production as a registered backend in piece_image_crop_service. Without a locked evaluation harness:
We cannot tell whether the trained model actually beats rembg/u2net on the failure modes that motivated this milestone.
We cannot tell whether the checkpoint regresses on cases rembg already handles well.
Promotion to piece_image_crop_service via the per-model-version Modal apps (#419) and A/B routing (#418) becomes a gut call instead of a measured decision.
This is the production-gating ticket for the milestone: every checkpoint promoted to production must pass through this harness, and the harness must emit an explicit go/no-go recommendation against the success criteria locked in #413.
Proposed Solution
Build a Bazel-runnable evaluation script (e.g. tools/eval_crop_model.py) that:
Loads a YOLO-seg checkpoint produced by #416's training pipeline (from a Modal Volume or local path).
Loads two evaluation corpora:
The held-out test split materialized by #415's dataset export pipeline.
The flagged corpus — CropRun rows where source.type = "human" (human-submitted corrections from #421), filtered to those that also have a ground-truth CropAnnotation mask from #420 or a hand-label from #414.
Runs both backends on the exact same images:
Trained YOLO-seg checkpoint → predicted mask.
Rembg/u2net baseline (via the existing piece_image_crop_service code path, or a direct rembg call configured identically) → predicted foreground mask.
Scores mask-vs-mask, not bbox-vs-bbox. The primary metric is mask IoU between each backend's predicted mask and the ground-truth annotation mask. Bbox IoU is a derived secondary metric only — the model output of record is the mask, per the locked design decisions in milestone #2. Also compute:
Per-image mask IoU distribution (mean, median, p10, p90).
Mask recall and precision at the pixel level.
Pass rate = fraction of images with mask IoU above the threshold locked in #413.
Per-failure-class breakdown using the taxonomy defined in #413 — the classes are derived from the analysis there, not from a fixed enum in the data model. Map images to classes using the human-submitted CropRun.notes and CropRun.source fields accumulated by #421.
Compares against the success criteria from #413. The thresholds (e.g. minimum pass rate on the flagged corpus, minimum mean IoU on the held-out test split, maximum regression on any failure class vs. rembg) are inputs to this harness, not invented here. The harness reads them from a config artifact owned by #413.
Emits two artifacts:
A machine-readable metrics table (JSON or CSV) keyed by (source_backend, corpus, failure_class) with all per-image and aggregate metrics. Suitable for archival and trend comparison across future checkpoints.
A human-readable report (Markdown) summarizing the comparison, including a side-by-side mask IoU table (trained vs. rembg) per corpus and per failure class, plus a small gallery of best-improvement and worst-regression cases.
Emits an explicit go/no-go recommendation. The final line of the human-readable report, and a top-level field in the JSON, is recommendation: "go" | "no-go", with the deciding criterion cited. This is the headline deliverable. "Go" means the checkpoint is cleared for promotion via #419 and #418. "No-go" means the checkpoint must not be registered as a production backend, and the report names every criterion it failed.
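The go/no-go step above could be sketched roughly as follows. This is a minimal illustration, not the harness design: the `Criterion` class, the `recommend` function, the criterion names, and the numbers in the usage note are all hypothetical. The real thresholds come from the config artifact owned by #413.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One success criterion read from the #413 config (field names are assumptions)."""
    name: str
    observed: float
    threshold: float
    higher_is_better: bool = True  # False for "maximum regression" style limits

    def passed(self) -> bool:
        if self.higher_is_better:
            return self.observed >= self.threshold
        return self.observed <= self.threshold


def recommend(criteria):
    """Return the top-level JSON fields: the recommendation plus every failed criterion."""
    failed = [c.name for c in criteria if not c.passed()]
    return {
        "recommendation": "no-go" if failed else "go",
        "failed_criteria": failed,
        "checked_criteria": [c.name for c in criteria],
    }
```

For example, `recommend([Criterion("flagged_pass_rate", 0.82, 0.80), Criterion("max_class_regression", 0.07, 0.05, higher_is_better=False)])` yields a "no-go" that cites `max_class_regression`. Those values are placeholders only.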
The harness should be runnable both locally (against a downloaded checkpoint and a local dataset snapshot) and on Modal (against a checkpoint in the Modal Volume), and should be deterministic given a fixed checkpoint + fixed dataset version.
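One way to make the determinism requirement checkable is to serialize the metrics table in a canonical form and compare digests across runs. The sketch below assumes rows are plain dicts keyed by the (source_backend, corpus, failure_class) fields named above. The function names and the six-decimal rounding are assumptions, and it only covers output serialization, not nondeterminism inside model inference itself.

```python
import hashlib
import json


def serialize_metrics(rows):
    """Canonical JSON: rows sorted by key tuple, keys sorted, floats rounded."""
    ordered = sorted(
        rows,
        key=lambda r: (r["source_backend"], r["corpus"], r["failure_class"]),
    )
    rounded = [
        {k: round(v, 6) if isinstance(v, float) else v for k, v in row.items()}
        for row in ordered
    ]
    return json.dumps(rounded, sort_keys=True, indent=2)


def metrics_digest(payload: str) -> str:
    """SHA-256 of the serialized table, for comparing two runs."""
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two runs against the same checkpoint and dataset version would then be expected to produce the same digest, whatever order the images were processed in.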
Data Model Reference
The flagged corpus is queried from CropRun (introduced in #421):
# Human-flagged images (source.type="human" means a human submitted a corrected crop)
flagged = CropRun.objects.filter(source__type="human").select_related("image")
# Join to ground-truth annotations from #420 or #414 where available
CropRun.source is a JSONField with schema {type, backend, deployment, version}. CropRun.notes holds free-form reporter context used to map images to failure classes from #413.
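As a rough illustration of how free-form notes might be mapped to failure classes, here is a keyword-based sketch. The class names and keywords are invented placeholders, not the taxonomy from #413, and plain strings and dicts stand in for CropRun.notes and CropRun.source so the snippet runs without Django.

```python
# Hypothetical taxonomy: the real classes and matching rules are defined by #413.
HYPOTHETICAL_CLASS_KEYWORDS = {
    "glare": ("glare", "reflection"),
    "cluttered_background": ("clutter", "background"),
}
UNCLASSIFIED = "unclassified"


def is_human_flagged(source: dict) -> bool:
    """True when the source JSON marks a human-submitted correction."""
    return source.get("type") == "human"


def classify_notes(notes, class_keywords=HYPOTHETICAL_CLASS_KEYWORDS) -> str:
    """Return the first class whose keyword appears in the notes (case-insensitive)."""
    text = (notes or "").lower()
    for failure_class, keywords in class_keywords.items():
        if any(keyword in text for keyword in keywords):
            return failure_class
    return UNCLASSIFIED
```

First-match ordering keeps the mapping deterministic. Rows that match nothing land in an explicit `unclassified` bucket rather than being dropped, so they still count toward the corpus-level metrics.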
Acceptance Criteria
Eval script lives in tools/ and is wired as a Bazel target so it can be invoked reproducibly.
Loads a YOLO-seg checkpoint produced by #416 (Modal Volume path and local path both supported).
Evaluates on both the held-out test split (from #415) and the flagged corpus (human-submitted CropRun rows from #421, restricted to those with ground-truth annotations).
Runs rembg/u2net as the baseline backend on the exact same images, using the same code path / configuration as production piece_image_crop_service.
Primary metric is mask IoU (mask-vs-mask comparison), not bbox IoU. Bbox metrics, if reported, are derived and clearly labeled as secondary.
Reports mask IoU, mask recall, mask precision, pass rate at the locked threshold, and a per-failure-class breakdown using the taxonomy from #413.
Reads success-criteria thresholds from the artifact owned by #413 — does not hardcode them.
Emits a machine-readable metrics table (JSON or CSV) suitable for archival and cross-checkpoint comparison.
Emits a human-readable Markdown report with side-by-side trained-vs-rembg comparisons per corpus and per failure class.
Emits an explicit go or no-go recommendation, with the deciding criterion cited, in both the JSON output and the Markdown report. This is the production-promotion gate for #419 and #418.
Deterministic given a fixed checkpoint and a fixed dataset version (same inputs → identical metrics output).
First baseline run is executed against the initial checkpoint from #416, and the resulting report + go/no-go is attached to this issue before closing.
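The mask-level metrics named in the criteria above can be sketched as follows. This uses plain Python over rows of 0/1 pixels for portability; a real harness would more likely operate on array masks. The function names, the zero-valued convention for empty masks, and the strict "above threshold" comparison are assumptions, and the threshold itself is an input from #413.

```python
from statistics import mean, median, quantiles


def mask_metrics(pred, truth):
    """Pixel-level IoU, precision, and recall for two equal-sized binary masks."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    union = tp + fp + fn
    return {
        # Empty denominators score 0.0 here; that convention is an assumption.
        "iou": tp / union if union else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def summarize_ious(ious, threshold):
    """Aggregate per-image IoUs: mean, median, p10, p90, and pass rate."""
    deciles = quantiles(ious, n=10, method="inclusive")  # needs at least 2 values
    return {
        "mean": mean(ious),
        "median": median(ious),
        "p10": deciles[0],
        "p90": deciles[-1],
        "pass_rate": sum(iou > threshold for iou in ious) / len(ious),
    }
```

Running `mask_metrics` once per backend per image, then `summarize_ious` per (backend, corpus, failure class) group, would produce the side-by-side numbers the report needs.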
Out of Scope
Defining or modifying the success criteria themselves — those are locked in #413 and consumed here as an input.
Refactoring piece_image_crop_service for per-model-version Modal apps — that is #419.
A/B routing, shadow mode, or the comparison dashboard — that is #418.
Annotation UI, labeling rubric, or per-potter padding — those are #420 and #414.
Dataset export format and split logic — owned by #415; this issue is a consumer.
Continuous evaluation on a schedule / nightly job — this issue ships the harness; cron wiring can be a follow-up if useful.
Dependencies
#413 — Rembg Failure-Mode Analysis: provides the locked success criteria (thresholds, per-class targets) that this harness scores against.
#416 — YOLO-seg Training Script: provides the checkpoint format and Modal Volume location this harness loads from.
#445 — W&B evaluation: experiment tracking artifacts may inform run comparison.
Milestone Cross-References
Part of milestone #2 — Custom Pottery Crop Model.