
feat: evaluation harness + go/no-go report for trained crop model #417

@shaoster

Problem / Motivation

Once the pottery-tuned YOLO-seg checkpoint produced by the training script (#416) exists, we need an objective, reproducible way to decide whether it is good enough to promote to production as a registered backend in piece_image_crop_service. Without a locked evaluation harness:

  • We cannot tell whether the trained model actually beats rembg/u2net on the failure modes that motivated this milestone.
  • We cannot tell whether the checkpoint regresses on cases rembg already handles well.
  • Promotion to piece_image_crop_service via the per-model-version Modal apps (#419) and A/B routing (#418) becomes a gut call instead of a measured decision.

This is the production-gating ticket for the milestone: every checkpoint promoted to production must pass through this harness, and the harness must emit an explicit go/no-go recommendation against the success criteria locked in #413.

Proposed Solution

Build a Bazel-runnable evaluation script (e.g. tools/eval_crop_model.py) that:

  1. Loads a YOLO-seg checkpoint produced by #416's training pipeline (from a Modal Volume or local path).
  2. Loads two evaluation corpora:
    • The held-out test split materialized by #415's dataset export pipeline.
    • The flagged corpus: CropRun rows where source.type = "human" (human-submitted corrections from #421), filtered to those that also have a ground-truth CropAnnotation mask from #420 or a hand-label from #414.
  3. Runs both backends end-to-end on each image:
    • Trained YOLO-seg checkpoint → predicted foreground mask.
    • Rembg/u2net baseline (via the existing piece_image_crop_service code path, or a direct rembg call configured identically) → predicted foreground mask.
  4. Scores mask-vs-mask, not bbox-vs-bbox. The primary metric is mask IoU between each backend's predicted mask and the ground-truth annotation mask. Bbox IoU is a derived secondary metric only — the model output of record is the mask, per the locked design decisions in milestone #2. Also compute:
    • Per-image mask IoU distribution (mean, median, p10, p90).
    • Mask recall and precision at the pixel level.
    • Pass rate = fraction of images with mask IoU above the threshold locked in #413.
    • Per-failure-class breakdown using the taxonomy defined in #413 — the classes are derived from the analysis there, not from a fixed enum in the data model. Map images to classes using the human-submitted CropRun.notes and CropRun.source fields accumulated by #421.
  5. Compares against the success criteria from #413. The thresholds (e.g. minimum pass rate on the flagged corpus, minimum mean IoU on the held-out test split, maximum regression on any failure class vs. rembg) are inputs to this harness, not invented here. The harness reads them from a config artifact owned by #413.
  6. Emits two artifacts:
    • A machine-readable metrics table (JSON or CSV) keyed by (source_backend, corpus, failure_class) with all per-image and aggregate metrics. Suitable for archival and trend comparison across future checkpoints.
    • A human-readable report (Markdown) summarizing the comparison, including a side-by-side mask IoU table (trained vs. rembg) per corpus and per failure class, plus a small gallery of best-improvement and worst-regression cases.
  7. Emits an explicit go/no-go recommendation. The final line of the human-readable report, and a top-level field in the JSON, is recommendation: "go" | "no-go", with the deciding criterion cited. This is the headline deliverable. "Go" means the checkpoint is cleared for promotion via #419 and #418. "No-go" means the checkpoint must not be registered as a production backend, and the report names the criterion or criteria it failed.
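
To make the scoring in step 4 concrete, here is a minimal sketch of the pixel-level metrics and the per-image IoU summary. It is illustrative only: the function names (`mask_scores`, `percentile`, `summarize`) are hypothetical, masks are plain nested lists of booleans rather than whatever array type the harness ends up using (most likely NumPy), the nearest-rank percentile is one of several reasonable definitions, and treating an empty denominator as a score of 1.0 is an assumption that #413 may or may not share.

```python
import math
from statistics import mean, median

Mask = list[list[bool]]  # row-major binary mask; True marks foreground


def _ratio(numerator: int, denominator: int) -> float:
    # Assumption: an empty denominator scores 1.0 (nothing to get wrong).
    return numerator / denominator if denominator else 1.0


def mask_scores(pred: Mask, truth: Mask) -> dict[str, float]:
    """Pixel-level IoU, precision, and recall for two same-shaped masks."""
    if len(pred) != len(truth) or any(len(p) != len(t) for p, t in zip(pred, truth)):
        raise ValueError("masks must have identical shapes")
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1  # foreground in both masks
            elif p:
                fp += 1  # predicted foreground, truth background
            elif t:
                fn += 1  # missed foreground
    return {
        "iou": _ratio(tp, tp + fp + fn),
        "precision": _ratio(tp, tp + fp),
        "recall": _ratio(tp, tp + fn),
    }


def percentile(values: list[float], q: float) -> float:
    """Nearest-rank percentile for q in [0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(ious: list[float], threshold: float) -> dict[str, float]:
    """Aggregate per-image IoU into the distribution stats listed in step 4."""
    if not ious:
        raise ValueError("need at least one per-image IoU")
    return {
        "mean": mean(ious),
        "median": median(ious),
        "p10": percentile(ious, 10),
        "p90": percentile(ious, 90),
        # "Above the threshold" is read here as strictly greater than.
        "pass_rate": sum(iou > threshold for iou in ious) / len(ious),
    }
```

Running `mask_scores` once per backend per image and feeding the resulting IoU lists to `summarize` would give the trained-vs-rembg numbers for the side-by-side table. The threshold argument is expected to come from the #413 artifact, not from a literal in the script.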

The harness should be runnable both locally (against a downloaded checkpoint and a local dataset snapshot) and on Modal (against a checkpoint in the Modal Volume), and should be deterministic given a fixed checkpoint + fixed dataset version.
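
One part of the determinism requirement that is easy to sketch is the serialization of the metrics artifact: if rows are sorted by their (source_backend, corpus, failure_class) key and JSON keys are emitted in sorted order, two runs that compute the same numbers produce byte-identical files that can be compared by hash. The sketch below assumes a layout with a top-level `recommendation` field, as described above; the `deciding_criterion` and `rows` field names and all function names are hypothetical. It does not address nondeterminism inside model inference itself.

```python
import hashlib
import json

ALLOWED_RECOMMENDATIONS = ("go", "no-go")


def build_metrics_document(rows: list[dict], recommendation: str, deciding_criterion: str) -> dict:
    """Assemble the metrics artifact with a stable row order."""
    if recommendation not in ALLOWED_RECOMMENDATIONS:
        raise ValueError(f"recommendation must be one of {ALLOWED_RECOMMENDATIONS}")
    ordered = sorted(
        rows,
        key=lambda r: (r["source_backend"], r["corpus"], r["failure_class"]),
    )
    return {
        "recommendation": recommendation,
        "deciding_criterion": deciding_criterion,  # hypothetical field name
        "rows": ordered,  # hypothetical field name
    }


def serialize(document: dict) -> str:
    """Render JSON with sorted keys so equal documents give equal bytes."""
    return json.dumps(document, sort_keys=True, indent=2, ensure_ascii=False) + "\n"


def fingerprint(text: str) -> str:
    """SHA-256 of the serialized artifact, for comparing two runs."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

A reproducibility check could then run the harness twice on the same checkpoint and dataset version and assert that the two fingerprints match.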

Data Model Reference

The flagged corpus is queried from CropRun (introduced in #421):

# Human-flagged images (source.type="human" means a human submitted a corrected crop)
flagged = CropRun.objects.filter(source__type="human").select_related("image")

# Join to ground-truth annotations from #420 or #414 where available

CropRun.source is a JSONField with schema {type, backend, deployment, version}. CropRun.notes holds free-form reporter context used to map images to failure classes from #413.
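
As a rough illustration of the notes-to-class mapping, the sketch below groups human-flagged runs by failure class using keyword rules. Everything in it is an assumption made for illustration: the real taxonomy lives in #413 and is not reproduced here, so the class names and keywords are placeholders, plain dicts stand in for CropRun rows, the `image_id` key is hypothetical, and keyword matching is only one possible way to apply the taxonomy.

```python
from collections import defaultdict

# Placeholder rules. The real classes and matching logic come from #413.
EXAMPLE_RULES = {
    "example_class_a": ("shadow", "dark"),
    "example_class_b": ("cut off", "clipped"),
}


def failure_class(notes, rules, default="unclassified"):
    """Return the first class whose keywords appear in the notes (case-insensitive)."""
    text = (notes or "").lower()
    for name, keywords in rules.items():  # dict order decides ties
        if any(keyword in text for keyword in keywords):
            return name
    return default


def group_flagged_runs(runs, rules):
    """Group image ids of human-submitted runs by failure class."""
    grouped = defaultdict(list)
    for run in runs:
        # Mirrors the source__type="human" filter in the query above.
        if run["source"].get("type") != "human":
            continue
        grouped[failure_class(run.get("notes"), rules)].append(run["image_id"])
    return dict(grouped)


if __name__ == "__main__":
    sample = [
        {"image_id": 1, "source": {"type": "human"}, "notes": "Handle was cut off"},
        {"image_id": 2, "source": {"type": "auto"}, "notes": "shadow"},  # "auto" is a placeholder value
        {"image_id": 3, "source": {"type": "human"}, "notes": None},
    ]
    print(group_flagged_runs(sample, EXAMPLE_RULES))
```

Keeping an explicit "unclassified" bucket makes it visible in the report how many flagged images the mapping could not place, which is worth knowing before trusting a per-class regression number.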

Acceptance Criteria

  • Eval script lives in tools/ and is wired as a Bazel target so it can be invoked reproducibly.
  • Loads a YOLO-seg checkpoint produced by #416 (Modal Volume path and local path both supported).
  • Evaluates on both the held-out test split (from #415) and the flagged corpus (human-submitted CropRun rows from #421, restricted to those with ground-truth annotations).
  • Runs rembg/u2net as the baseline backend on the exact same images, using the same code path / configuration as production piece_image_crop_service.
  • Primary metric is mask IoU (mask-vs-mask comparison), not bbox IoU. Bbox metrics, if reported, are derived and clearly labeled as secondary.
  • Reports mask IoU, mask recall, mask precision, pass rate at the locked threshold, and a per-failure-class breakdown using the taxonomy from #413.
  • Reads success-criteria thresholds from the artifact owned by #413 — does not hardcode them.
  • Emits a machine-readable metrics table (JSON or CSV) suitable for archival and cross-checkpoint comparison.
  • Emits a human-readable Markdown report with side-by-side trained-vs-rembg comparisons per corpus and per failure class.
  • Emits an explicit go or no-go recommendation, with the deciding criterion cited, in both the JSON output and the Markdown report. This is the production-promotion gate for #419 and #418.
  • Deterministic given a fixed checkpoint and a fixed dataset version (same inputs → identical metrics output).
  • First baseline run is executed against the initial checkpoint from #416, and the resulting report + go/no-go is attached to this issue before closing.
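
The go/no-go criterion above could be implemented along these lines. This is a sketch under stated assumptions: the three threshold names mirror the examples given in the proposed solution (minimum pass rate on the flagged corpus, minimum mean IoU on the held-out test split, maximum per-class regression vs. rembg), but the actual keys and file format of the #413 artifact are not specified in this issue, so `Criteria`, `load_criteria`, `recommend`, and the `failed_criteria` field are all hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Criteria:
    # Hypothetical key names; the real ones are owned by #413.
    min_flagged_pass_rate: float
    min_test_mean_iou: float
    max_class_regression: float


def load_criteria(text: str) -> Criteria:
    """Parse thresholds from the JSON text of the config artifact."""
    return Criteria(**json.loads(text))


def recommend(criteria, flagged_pass_rate, test_mean_iou, class_mean_iou):
    """Return a recommendation plus the criteria that failed.

    class_mean_iou maps failure class -> (trained_mean_iou, rembg_mean_iou).
    """
    failed = []
    if flagged_pass_rate < criteria.min_flagged_pass_rate:
        failed.append(
            f"flagged pass rate {flagged_pass_rate:.3f} "
            f"< {criteria.min_flagged_pass_rate:.3f}"
        )
    if test_mean_iou < criteria.min_test_mean_iou:
        failed.append(
            f"test mean IoU {test_mean_iou:.3f} < {criteria.min_test_mean_iou:.3f}"
        )
    for name in sorted(class_mean_iou):  # sorted for a stable report order
        trained, rembg = class_mean_iou[name]
        regression = rembg - trained  # positive means the trained model is worse
        if regression > criteria.max_class_regression:
            failed.append(
                f"{name}: regression {regression:.3f} "
                f"> {criteria.max_class_regression:.3f}"
            )
    return {
        "recommendation": "no-go" if failed else "go",
        "failed_criteria": failed,
    }
```

Because the thresholds arrive as data, changing a criterion in #413 changes the verdict without touching harness code, which is the point of the "does not hardcode them" requirement.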

Out of Scope

  • Defining or modifying the success criteria themselves — those are locked in #413 and consumed here as an input.
  • Refactoring piece_image_crop_service for per-model-version Modal apps — that is #419.
  • A/B routing, shadow mode, or the comparison dashboard — that is #418.
  • Annotation UI, labeling rubric, or per-potter padding — those are #420 and #414.
  • Dataset export format and split logic — owned by #415; this issue is a consumer.
  • Continuous evaluation on a schedule / nightly job — this issue ships the harness; cron wiring can be a follow-up if useful.

Dependencies

  • #413 — Rembg Failure-Mode Analysis: provides the locked success criteria (thresholds, per-class targets) that this harness scores against.
  • #416 — YOLO-seg Training Script: provides the checkpoint format and Modal Volume location this harness loads from.
  • #445 — W&B evaluation: experiment tracking artifacts may inform run comparison.
  • Implicit upstream: #421 (flagged corpus), #420/#414 (ground-truth masks), #415 (held-out test split).

Milestone Cross-References

Part of milestone #2 — Custom Pottery Crop Model.
