feat: evaluation harness + go/no-go report for trained crop model

## Problem / Motivation

Once the pottery-tuned YOLO-seg checkpoint produced by the training script ([#416](https://github.com/shaoster/glaze/issues/416)) exists, we need an objective, reproducible way to decide whether it is **good enough to promote to production** as a registered backend in `piece_image_crop_service`. Without a locked evaluation harness:

- We cannot tell whether the trained model actually beats rembg/u2net on the failure modes that motivated this milestone.
- We cannot tell whether the checkpoint regresses on cases rembg already handles well.
- Promotion to `piece_image_crop_service` via the per-model-version Modal apps ([#419](https://github.com/shaoster/glaze/issues/419)) and A/B routing ([#418](https://github.com/shaoster/glaze/issues/418)) becomes a gut call instead of a measured decision.

This is the **production-gating ticket** for the milestone: every checkpoint promoted to production must pass through this harness, and the harness must emit an explicit **go/no-go recommendation** against the success criteria locked in [#413](https://github.com/shaoster/glaze/issues/413).

## Proposed Solution

Build a Bazel-runnable evaluation script (e.g. `tools/eval_crop_model.py`) that:

1. **Loads a YOLO-seg checkpoint** produced by [#416](https://github.com/shaoster/glaze/issues/416)'s training pipeline (from a Modal Volume or local path).
2. **Loads two evaluation corpora**:
   - The **held-out test split** materialized by [#415](https://github.com/shaoster/glaze/issues/415)'s dataset export pipeline.
   - The **flagged corpus** — `CropRun` rows where `source.type = "human"` (human-submitted corrections from [#421](https://github.com/shaoster/glaze/issues/421)), filtered to those that also have a ground-truth `CropAnnotation` mask from [#420](https://github.com/shaoster/glaze/issues/420) or a hand-label from [#414](https://github.com/shaoster/glaze/issues/414).
3. **Runs both backends end-to-end on each image**:
   - Trained YOLO-seg checkpoint → predicted foreground mask.
   - Rembg/u2net baseline (via the existing `piece_image_crop_service` code path, or a direct rembg call configured identically) → predicted foreground mask.
4. **Scores mask-vs-mask, not bbox-vs-bbox.** The primary metric is **mask IoU** between each backend's predicted mask and the ground-truth annotation mask. Bbox IoU is a derived secondary metric only — the model output of record is the mask, per the locked design decisions in milestone #2. Also compute:
   - **Per-image mask IoU** distribution (mean, median, p10, p90).
   - **Mask recall and precision** at the pixel level.
   - **Pass rate** = fraction of images with mask IoU above the threshold locked in [#413](https://github.com/shaoster/glaze/issues/413).
   - **Per-failure-class breakdown** using the taxonomy defined in [#413](https://github.com/shaoster/glaze/issues/413) — the classes are derived from the analysis there, not from a fixed enum in the data model. Map images to classes using the human-submitted `CropRun.notes` and `CropRun.source` fields accumulated by [#421](https://github.com/shaoster/glaze/issues/421).
5. **Compares against the success criteria from [#413](https://github.com/shaoster/glaze/issues/413).** The thresholds (e.g. minimum pass rate on the flagged corpus, minimum mean IoU on the held-out test split, maximum regression on any failure class vs. rembg) are inputs to this harness, not invented here. The harness reads them from a config artifact owned by [#413](https://github.com/shaoster/glaze/issues/413).
6. **Emits two artifacts**:
   - A **machine-readable metrics table** (JSON or CSV) keyed by `(source_backend, corpus, failure_class)` with all per-image and aggregate metrics. Suitable for archival and trend comparison across future checkpoints.
   - A **human-readable report** (Markdown) summarizing the comparison, including a side-by-side mask IoU table (trained vs. rembg) per corpus and per failure class, plus a small gallery of best-improvement and worst-regression cases.
7. **Emits an explicit go/no-go recommendation.** The final line of the human-readable report, and a top-level field in the JSON, is `recommendation: "go" | "no-go"`, with the deciding criterion cited. This is the headline deliverable. "Go" means the checkpoint is cleared for promotion via [#419](https://github.com/shaoster/glaze/issues/419) and [#418](https://github.com/shaoster/glaze/issues/418). "No-go" means the checkpoint must not be registered as a production backend, and the report names every criterion it failed.
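
As a rough illustration of the scoring in step 4, the sketch below computes per-image mask IoU, precision, and recall, then the distribution and pass rate. It is a minimal stdlib-only sketch, not the harness itself: the function names, the list-of-rows mask representation, the empty-mask convention, and the nearest-rank percentile are all assumptions, and a real implementation would more likely operate on array masks. The pass threshold is whatever [#413](https://github.com/shaoster/glaze/issues/413) locks; whether the comparison is inclusive is also for that issue to decide.

```python
import math
from statistics import mean, median

Mask = list[list[bool]]  # row-major binary mask, True = foreground pixel


def mask_metrics(pred: Mask, truth: Mask) -> dict[str, float]:
    """Pixel-level IoU, precision, and recall of `pred` against `truth`."""
    if len(pred) != len(truth) or any(len(p) != len(t) for p, t in zip(pred, truth)):
        raise ValueError("masks must have the same shape")
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    both_empty = (tp + fp + fn) == 0

    def ratio(num: int, den: int) -> float:
        # Assumed convention: two empty masks agree perfectly; any other
        # undefined ratio (zero denominator) counts as 0.
        if den:
            return num / den
        return 1.0 if both_empty else 0.0

    return {
        "iou": ratio(tp, tp + fp + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
    }


def percentile(values: list[float], q: int) -> float:
    """Nearest-rank percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(ious: list[float], pass_threshold: float) -> dict[str, float]:
    """Aggregate per-image IoUs into the distribution and pass rate."""
    return {
        "mean": mean(ious),
        "median": median(ious),
        "p10": percentile(ious, 10),
        "p90": percentile(ious, 90),
        # "Above the threshold" is read here as strictly greater than.
        "pass_rate": sum(iou > pass_threshold for iou in ious) / len(ious),
    }
```

Running `mask_metrics` once per backend on the same image and ground-truth mask gives the paired numbers that the side-by-side comparison needs.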

The harness should be runnable both locally (against a downloaded checkpoint and a local dataset snapshot) and on Modal (against a checkpoint in the Modal Volume), and should be deterministic given a fixed checkpoint + fixed dataset version.
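
One possible shape for the go/no-go decision and a byte-stable JSON output is sketched below. Every key name here (`min_flagged_pass_rate`, `min_held_out_mean_iou`, `max_class_regression`, and the layout of the metrics dicts) is hypothetical; the real threshold names and values come from the config artifact owned by [#413](https://github.com/shaoster/glaze/issues/413).

```python
import json


def recommend(trained: dict, baseline: dict, thresholds: dict) -> dict:
    """Return a go/no-go verdict plus the criteria that decided it."""
    failed = []
    if trained["flagged"]["pass_rate"] < thresholds["min_flagged_pass_rate"]:
        failed.append("min_flagged_pass_rate")
    if trained["held_out"]["mean_iou"] < thresholds["min_held_out_mean_iou"]:
        failed.append("min_held_out_mean_iou")
    # Regression = how far the trained model falls below rembg on a class.
    # Classes are visited in sorted order so the output is stable.
    for failure_class in sorted(baseline["per_class_mean_iou"]):
        drop = (
            baseline["per_class_mean_iou"][failure_class]
            - trained["per_class_mean_iou"][failure_class]
        )
        if drop > thresholds["max_class_regression"]:
            failed.append(f"max_class_regression[{failure_class}]")
    return {
        "recommendation": "no-go" if failed else "go",
        "deciding_criteria": failed or ["all criteria met"],
    }


def to_json(report: dict) -> str:
    # sort_keys makes the serialized bytes independent of dict insertion
    # order, which helps with the "same inputs, identical output" goal.
    return json.dumps(report, sort_keys=True, indent=2)
```

Collecting every failed criterion, rather than stopping at the first, lets a no-go report name all of them in one run.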

## Data Model Reference

The flagged corpus is queried from `CropRun` (introduced in [#421](https://github.com/shaoster/glaze/issues/421)):

```python
# Human-flagged images (source.type="human" means a human submitted a corrected crop)
flagged = CropRun.objects.filter(source__type="human").select_related("image")

# Still to do: restrict `flagged` to rows that have a ground-truth mask
# (a CropAnnotation from #420 or a hand-label from #414). The join is not shown here.
```

`CropRun.source` is a JSONField with schema `{type, backend, deployment, version}`. `CropRun.notes` holds free-form reporter context used to map images to failure classes from [#413](https://github.com/shaoster/glaze/issues/413).
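
To make the notes-to-class mapping concrete, here is a minimal keyword-matching sketch. The class names and keywords are invented placeholders, not the taxonomy from [#413](https://github.com/shaoster/glaze/issues/413), and substring matching is only one possible strategy; the helper names are likewise assumptions.

```python
# HYPOTHETICAL taxonomy: these class names and keywords are placeholders.
# The real failure classes come from the analysis in #413.
KEYWORDS_BY_CLASS = {
    "shadow": ("shadow",),
    "background_clutter": ("clutter", "background"),
}
UNCLASSIFIED = "unclassified"

# Keys of the CropRun.source JSON schema described above.
SOURCE_KEYS = {"type", "backend", "deployment", "version"}


def is_human_flagged(source: dict) -> bool:
    """True when the CropRun.source payload marks a human-submitted crop."""
    return source.get("type") == "human"


def failure_classes(notes: str) -> list[str]:
    """Return every placeholder class with a keyword found in the notes."""
    text = notes.lower()
    matched = [
        name
        for name, keywords in KEYWORDS_BY_CLASS.items()
        if any(keyword in text for keyword in keywords)
    ]
    # Images whose notes match nothing still need a bucket in the breakdown.
    return matched or [UNCLASSIFIED]
```

An image can land in more than one class under this scheme, so per-class counts may sum to more than the corpus size.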

## Acceptance Criteria

- [ ] Eval script lives in `tools/` and is wired as a Bazel target so it can be invoked reproducibly.
- [ ] Loads a YOLO-seg checkpoint produced by [#416](https://github.com/shaoster/glaze/issues/416) (Modal Volume path and local path both supported).
- [ ] Evaluates on **both** the held-out test split (from [#415](https://github.com/shaoster/glaze/issues/415)) **and** the flagged corpus (human-submitted `CropRun` rows from [#421](https://github.com/shaoster/glaze/issues/421), restricted to those with ground-truth annotations).
- [ ] Runs rembg/u2net as the baseline backend on the exact same images, using the same code path / configuration as production `piece_image_crop_service`.
- [ ] **Primary metric is mask IoU (mask-vs-mask comparison), not bbox IoU.** Bbox metrics, if reported, are derived and clearly labeled as secondary.
- [ ] Reports mask IoU, mask recall, mask precision, pass rate at the locked threshold, and a per-failure-class breakdown using the taxonomy from [#413](https://github.com/shaoster/glaze/issues/413).
- [ ] Reads success-criteria thresholds from the artifact owned by [#413](https://github.com/shaoster/glaze/issues/413) — does not hardcode them.
- [ ] Emits a machine-readable metrics table (JSON or CSV) suitable for archival and cross-checkpoint comparison.
- [ ] Emits a human-readable Markdown report with side-by-side trained-vs-rembg comparisons per corpus and per failure class.
- [ ] **Emits an explicit `go` or `no-go` recommendation**, with the deciding criterion cited, in both the JSON output and the Markdown report. This is the production-promotion gate for [#419](https://github.com/shaoster/glaze/issues/419) and [#418](https://github.com/shaoster/glaze/issues/418).
- [ ] Deterministic given a fixed checkpoint and a fixed dataset version (same inputs → identical metrics output).
- [ ] First baseline run is executed against the initial checkpoint from [#416](https://github.com/shaoster/glaze/issues/416), and the resulting report + go/no-go is attached to this issue before closing.
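
For the Markdown report criterion, the side-by-side table and closing recommendation line could be rendered along these lines. This is a sketch only: the column set, the row tuple layout, and the `render_report` name are assumptions, and the gallery of best-improvement and worst-regression cases is omitted.

```python
def render_report(
    rows: list[tuple[str, str, float, float]],
    recommendation: str,
    deciding_criterion: str,
) -> str:
    """Render rows of (corpus, failure_class, trained_iou, rembg_iou)."""
    lines = [
        "| Corpus | Failure class | Trained mean IoU | Rembg mean IoU | Delta |",
        "| --- | --- | --- | --- | --- |",
    ]
    # Sorting keeps the table order stable regardless of input order.
    for corpus, failure_class, trained, rembg in sorted(rows):
        lines.append(
            f"| {corpus} | {failure_class} | {trained:.3f} "
            f"| {rembg:.3f} | {trained - rembg:+.3f} |"
        )
    lines.append("")
    # The recommendation is the final line of the report.
    lines.append(f'recommendation: "{recommendation}" ({deciding_criterion})')
    return "\n".join(lines)
```

Keeping the recommendation on the last line means a reviewer, or a script, can read the verdict without parsing the table.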

## Out of Scope

- Defining or modifying the success criteria themselves — those are locked in [#413](https://github.com/shaoster/glaze/issues/413) and consumed here as an input.
- Refactoring `piece_image_crop_service` for per-model-version Modal apps — that is [#419](https://github.com/shaoster/glaze/issues/419).
- A/B routing, shadow mode, or the comparison dashboard — that is [#418](https://github.com/shaoster/glaze/issues/418).
- Annotation UI, labeling rubric, or per-potter padding — those are [#420](https://github.com/shaoster/glaze/issues/420) and [#414](https://github.com/shaoster/glaze/issues/414).
- Dataset export format and split logic — owned by [#415](https://github.com/shaoster/glaze/issues/415); this issue is a consumer.
- Continuous evaluation on a schedule / nightly job — this issue ships the harness; cron wiring can be a follow-up if useful.

## Dependencies

- **[#413](https://github.com/shaoster/glaze/issues/413)** — Rembg Failure-Mode Analysis: provides the locked success criteria (thresholds, per-class targets) that this harness scores against.
- **[#416](https://github.com/shaoster/glaze/issues/416)** — YOLO-seg Training Script: provides the checkpoint format and Modal Volume location this harness loads from.
- **[#445](https://github.com/shaoster/glaze/issues/445)** — W&B evaluation: experiment tracking artifacts may inform run comparison.
- Implicit upstream: **[#421](https://github.com/shaoster/glaze/issues/421)** (flagged corpus), **[#420](https://github.com/shaoster/glaze/issues/420)/[#414](https://github.com/shaoster/glaze/issues/414)** (ground-truth masks), **[#415](https://github.com/shaoster/glaze/issues/415)** (held-out test split).

## Milestone Cross-References

- Depends on [#413 — Rembg Failure-Mode Analysis (mask-level)](https://github.com/shaoster/glaze/issues/413)
- Depends on [#416 — YOLO-seg Training Script on Modal](https://github.com/shaoster/glaze/issues/416)
- Depends on [#445 — Evaluate W&B for experiment tracking](https://github.com/shaoster/glaze/issues/445)
- Blocks [#418 — A/B Routing, Versioning & Comparison Dashboard](https://github.com/shaoster/glaze/issues/418)

Part of milestone [#2 — Custom Pottery Crop Model](https://github.com/shaoster/glaze/milestone/2).
