feat: A/B routing, paired CropRun persistence, and backend comparison dashboard

## Problem / Motivation

[#419](https://github.com/shaoster/glaze/issues/419) introduces a backend registry that lets `piece_image_crop_service` dispatch to any registered model version via its own Modal app. What's still missing is the apparatus to **safely compare two backend versions in production** and decide whether one should replace another.

Concretely we need to answer questions like:

- Does `yolo-seg-v2` agree with `yolo-seg-v1` on the masks it produces in real traffic?
- Where do they disagree, and are those disagreements systematic (e.g., translucent glaze, holloware)?
- Is the new version faster? More reliable? Worth promoting?

This must work for **any two registered backends** — `rembg-u2net` vs. `yolo-seg-v1`, `yolo-seg-v1` vs. `yolo-seg-v2`, or a hypothetical future `sam-finetune` vs. anything else. The mechanism is generalized backend-version comparison, not a one-off rembg-vs-trained shim.

We also want a closed feedback loop: when two backends disagree meaningfully on the same image, that image is exactly the sort of edge case our annotation surface from [#420](https://github.com/shaoster/glaze/issues/420) should be re-labeling.

## Proposed Solution

Build A/B routing, paired-run persistence, and an admin comparison dashboard on top of the registry from [#419](https://github.com/shaoster/glaze/issues/419).

### Routing modes (both required, admin-toggleable)

Add an admin-controlled `RoutingPolicy` (Django model + admin UI) that pins one **primary backend version** (the version whose mask is returned to the end user) and selects between two modes:

1. **Shadow mode**: every inference request runs the primary AND every other enabled backend in parallel. Only the primary's result is returned to the caller; the other runs are persisted as `CropRun`s for offline comparison. Maximizes signal per request, at N× the compute for N enabled backends.
2. **Probabilistic split**: each request is routed to exactly one backend by weighted random draw from a configurable `{backend: weight}` map. The chosen backend's result is returned and persisted. Cheaper, but a request yields paired data only if the same image is later run through another backend, for example via the explicit backfill job described below.

A single admin toggle switches between modes. The weights and the enabled-backend set are configurable per policy. The routing layer lives in `piece_image_crop_service` (or its Django-side caller in `api/tasks.py:detect_subject_crop`) and consults the active `RoutingPolicy` on every request.
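A minimal sketch of the per-request decision, to make the two modes concrete. Everything here is an assumption for illustration: the real `RoutingPolicy` would be a Django model, and the dataclass fields, the `select_backends` helper, and the fallback to the primary when no backend has a positive weight are all hypothetical.

```python
import random
from dataclasses import dataclass, field


@dataclass
class RoutingPolicy:
    """Hypothetical in-memory stand-in for the proposed Django model."""
    primary: str
    mode: str  # "shadow" or "probabilistic"
    weights: dict = field(default_factory=dict)   # {backend: weight}
    enabled: set = field(default_factory=set)     # enabled backend names


def select_backends(policy, rng):
    """Return (backend whose mask is served, list of backends to run)."""
    if policy.mode == "shadow":
        # Run the primary plus every other enabled backend; serve only the primary.
        others = sorted(policy.enabled - {policy.primary})
        return policy.primary, [policy.primary, *others]
    if policy.mode == "probabilistic":
        # Weighted draw over enabled backends that carry a positive weight.
        candidates = sorted(
            b for b in policy.enabled if policy.weights.get(b, 0) > 0
        )
        if not candidates:
            # Assumed fallback: with nothing to draw from, serve the primary.
            return policy.primary, [policy.primary]
        chosen = rng.choices(
            candidates, weights=[policy.weights[b] for b in candidates], k=1
        )[0]
        return chosen, [chosen]
    raise ValueError(f"unknown routing mode: {policy.mode!r}")
```

Passing the `random.Random` instance in explicitly keeps the weighted draw testable with a fixed seed.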

### Paired-run persistence

Extend the `CropRun` model from [#421](https://github.com/shaoster/glaze/issues/421) with a `pair_group` UUID field so that all `CropRun`s produced for a single source image under shadow mode share a key. The `CropRun.source` JSONField (schema: `{type, backend, deployment, version}`) already identifies which backend produced each run — use `source.backend` and `source.version` for per-backend aggregation in the dashboard.

Persist for every run: `source`, `mask_asset`, `latency_ms`, `error` (if any), and the relative `{x, y, w, h}` bbox derivation. This is the substrate the dashboard reads from.

For probabilistic split, provide a Bazel-runnable **backfill job** that takes a list of source images and re-runs them through any chosen secondary backend to synthesize pairs (so we can still build agreement statistics retroactively).
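The shared-key idea can be sketched as follows. This is not the `CropRun` model from #421: `CropRunRecord`, `persist_shadow_runs`, and the `infer` callable (assumed to return `(mask_asset, latency_ms)` or raise) are hypothetical stand-ins, and the loop runs backends sequentially for brevity where the proposal calls for parallel execution.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class CropRunRecord:
    """Hypothetical plain-Python stand-in for a persisted CropRun row."""
    pair_group: uuid.UUID
    source: dict                  # {type, backend, deployment, version}
    mask_asset: Optional[str]
    latency_ms: Optional[float]
    error: Optional[str]


def persist_shadow_runs(image_id, sources, infer):
    """Run every source backend on one image and tag all runs with one key."""
    pair_group = uuid.uuid4()  # shared by every run for this source image
    records = []
    for source in sources:
        try:
            mask_asset, latency_ms = infer(image_id, source)
            records.append(
                CropRunRecord(pair_group, source, mask_asset, latency_ms, None)
            )
        except Exception as exc:
            # A failed run is still recorded so error rates stay measurable.
            records.append(
                CropRunRecord(pair_group, source, None, None, str(exc))
            )
    return records
```

Recording failures as rows rather than dropping them is a design choice in this sketch: it keeps a failing secondary backend visible in the error-rate comparison instead of silently shrinking the sample.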

### Comparison dashboard (admin)

A new admin view, parameterized by two backend identifiers `(A, B)` (matched against `CropRun.source.backend`), that surfaces:

- **Agreement rate**: distribution of **mask IoU** across all paired `CropRun`s, plus summary statistics (median, p10, p90).
- **Latency comparison**: per-backend `latency_ms` distribution, including p50/p95.
- **Error rate**: percentage of runs where `status = "error"` or `mask_asset` is null.
- **Disagreement drill-down**: paginated list of paired runs below a configurable IoU threshold, side-by-side mask preview (rendered from `mask_asset` Cloudinary references), link to the source image and (if any) the linked `Piece`.
- **Sample size + confidence** indicators (e.g., the number of paired runs behind each statistic) so promote/retire decisions are not made on noise.
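For reference, the agreement metric and its summary statistics could be computed along these lines. The sketch assumes masks are already decoded into same-shaped rows of 0/1 pixels; the `mask_iou`, `percentile`, and `summarize` names, the linear-interpolation percentile, and the convention that two empty masks score 1.0 are all assumptions, not part of the proposal.

```python
def mask_iou(a, b):
    """Intersection over union of two same-shaped binary masks."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("masks must have the same shape")
    inter = union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if union == 0 else inter / union


def percentile(values, q):
    """Linearly interpolated percentile for q in [0, 100]."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def summarize(ious):
    """Sample size plus the p10 / median / p90 summary named above."""
    return {
        "n": len(ious),
        "p10": percentile(ious, 10),
        "median": percentile(ious, 50),
        "p90": percentile(ious, 90),
    }
```

Reporting `n` next to the percentiles keeps the sample size visible wherever the agreement numbers are shown.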

The dashboard is **generalized**: pick any two backends from the registry and the dashboard recomputes. It is not hardcoded to rembg-vs-trained.
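One way to keep the query generic over `(A, B)` is to group runs by `pair_group` and keep only groups that contain both backends. In this sketch runs are plain dicts with `pair_group` and `source` keys; the `paired_runs` helper and the "last run wins" rule for duplicate backends within a group are assumptions for illustration.

```python
from collections import defaultdict


def paired_runs(runs, backend_a, backend_b):
    """Return (run_a, run_b) for each pair_group containing both backends."""
    if backend_a == backend_b:
        raise ValueError("pick two different backends to compare")
    by_group = defaultdict(dict)
    for run in runs:
        backend = run["source"]["backend"]
        if backend in (backend_a, backend_b):
            # Assumption: if a backend repeats within a group, the last run wins.
            by_group[run["pair_group"]][backend] = run
    return [
        (group[backend_a], group[backend_b])
        for group in by_group.values()
        if backend_a in group and backend_b in group
    ]
```

Because nothing in the function names a specific backend, the same code path serves `rembg-u2net` vs. `yolo-seg-v1` and any later pairing from the registry.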

### Closed feedback loop — disagreement → annotation queue

When a paired comparison's mask IoU falls below a configurable threshold (admin setting on `RoutingPolicy`), the source image is automatically enqueued into the annotation surface from [#420](https://github.com/shaoster/glaze/issues/420) for human re-labeling. The resulting `CropAnnotation` then becomes ground truth that feeds back into the next training run via the dataset export pipeline ([#415](https://github.com/shaoster/glaze/issues/415)). This is the closed loop that makes A/B routing valuable beyond just "pick a winner."

Deduplicate aggressively — the same image disagreeing across multiple runs should be queued once, not N times.
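The threshold check and deduplication might look like the sketch below. `images_to_enqueue` is a hypothetical helper that takes `(source_image_id, iou)` tuples plus the set of image ids already in the queue; the actual enqueue call into the #420 annotation surface is not shown because its interface is not specified here.

```python
def images_to_enqueue(pair_ious, threshold, already_queued=()):
    """Return new image ids whose IoU fell below threshold, first-seen order."""
    seen = set(already_queued)
    new_ids = []
    for image_id, iou in pair_ious:
        # Strictly below the threshold counts as a disagreement.
        if iou < threshold and image_id not in seen:
            seen.add(image_id)       # guards against repeats within this batch
            new_ids.append(image_id)
    return new_ids
```

For example, `images_to_enqueue([("img-1", 0.4), ("img-1", 0.3), ("img-2", 0.9)], 0.5)` yields `["img-1"]`: the image is queued once despite disagreeing twice. In a real deployment a uniqueness constraint on the queue's image reference would likely be a more robust guard against concurrent enqueues than an in-memory set.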

## Acceptance Criteria

- [ ] `RoutingPolicy` model + admin: pin primary backend version, choose mode (`shadow` | `probabilistic`), configure per-backend weights and enabled set, set disagreement IoU threshold
- [ ] Single active `RoutingPolicy` consulted on every crop-service request
- [ ] **Shadow mode**: all enabled backends run in parallel per request; primary's mask returned; all results persisted as paired `CropRun`s sharing a `pair_group`
- [ ] **Probabilistic split**: weighted random routing to exactly one backend per request; result returned and persisted
- [ ] Routing layer is backend-agnostic — works for any pair of registered versions from [#419](https://github.com/shaoster/glaze/issues/419)
- [ ] Bazel-runnable backfill tool to synthesize pairs for probabilistic-split traffic
- [ ] Admin dashboard view parameterized by `(backend_a, backend_b)` (matched against `CropRun.source.backend`), surfacing: mask-IoU distribution + summary stats, latency comparison, error rates, paginated disagreement drill-down with side-by-side mask previews
- [ ] Paired runs below the configured IoU threshold are auto-enqueued into the annotation surface from [#420](https://github.com/shaoster/glaze/issues/420), deduplicated per source image
- [ ] Admin can promote a backend (make it primary) or retire one (mark deprecated in the registry) directly from the dashboard
- [ ] Tests: routing-mode selection, paired persistence under shadow, weighted draw under probabilistic split, IoU computation, threshold-triggered annotation enqueue, dashboard query correctness
- [ ] DoD checklist from `.agents/skills/github-pr/SKILL.md` satisfied

## Out of Scope

- Multi-armed bandit or other adaptive routing — explicitly out; weights are statically configured per policy.
- Automatic promote/retire — promotion remains a human decision driven by dashboard evidence.
- Per-user cohort routing (e.g., "potter X always gets v2") — interesting, but not in this slice.
- The actual annotation UI itself — that ships in [#420](https://github.com/shaoster/glaze/issues/420); this issue only wires the enqueue side.
- Training a new model — covered in [#416](https://github.com/shaoster/glaze/issues/416); this issue assumes a v2 checkpoint exists per [#417](https://github.com/shaoster/glaze/issues/417).

## Dependencies

- [#421](https://github.com/shaoster/glaze/issues/421) — `CropRun` model (with `source` JSONField and `mask_asset`); this issue adds `pair_group` to it
- [#417](https://github.com/shaoster/glaze/issues/417) — a v2 checkpoint must exist so the dashboard has something meaningful to compare against the rembg baseline
- [#419](https://github.com/shaoster/glaze/issues/419) — backend registry + per-version Modal apps

## Milestone Cross-References

- Depends on [#421 — Bad-Crop Tagging & Inference Run Persistence](https://github.com/shaoster/glaze/issues/421)
- Depends on [#417 — Evaluation Harness & Baseline Report](https://github.com/shaoster/glaze/issues/417)
- Depends on [#419 — Per-Model-Version Modal Apps + Backend Registry](https://github.com/shaoster/glaze/issues/419)

Part of milestone [#2 — Custom Pottery Crop Model](https://github.com/shaoster/glaze/milestone/2).
