Problem / Motivation
#419 introduces a backend registry that lets piece_image_crop_service dispatch to any registered model version via its own Modal app. What's still missing is the apparatus to safely compare two backend versions in production and decide whether one should replace another.
Concretely we need to answer questions like:
Does yolo-seg-v2 agree with yolo-seg-v1 on the masks it produces in real traffic?
Where do they disagree, and are those disagreements systematic (e.g., translucent glaze, holloware)?
Is the new version faster? More reliable? Worth promoting?
This must work for any two registered backends — rembg-u2net vs. yolo-seg-v1, yolo-seg-v1 vs. yolo-seg-v2, or a hypothetical future sam-finetune vs. anything else. The mechanism is generalized backend-version comparison, not a one-off rembg-vs-trained shim.
We also want a closed feedback loop: when two backends disagree meaningfully on the same image, that image is exactly the sort of edge case our annotation surface from #420 should be re-labeling.
Proposed Solution
Build A/B routing, paired-run persistence, and an admin comparison dashboard on top of the registry from #419.
Routing modes (both required, admin-toggleable)
Add an admin-controlled RoutingPolicy (Django model + admin UI) that pins one primary backend version (the version whose mask is returned to the end user) and selects between two modes:
Shadow mode — every inference request runs the primary AND every other enabled backend in parallel. Only the primary's result is returned to the caller; the other runs are persisted as CropRuns for offline comparison. Maximizes signal per request; costs N× compute.
Probabilistic split — each request is routed to exactly one backend by weighted random draw from a configurable {backend: weight} map. The chosen backend's result is returned and persisted. Cheaper, but only yields paired data when the same image is later run through another backend (or via an explicit backfill job — see below).
A single admin toggle switches between modes. Weights and enabled-backend set are configurable per policy. The routing layer lives in piece_image_crop_service (or its Django-side caller in api/tasks.py:detect_subject_crop) and consults the active RoutingPolicy per request.
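The two routing modes can be sketched in plain Python as follows. This is only an illustration under assumptions: `RoutingPolicySketch`, `route`, and the `Backend` callable signature are hypothetical names that do not come from the codebase, and the real policy would be a Django model consulted per request. A failing backend is recorded as `None` here so that a shadow run cannot break the response built from the primary.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

# Hypothetical backend signature: image bytes in, mask reference out.
Backend = Callable[[bytes], str]


@dataclass
class RoutingPolicySketch:
    """Plain-Python stand-in for the proposed RoutingPolicy Django model."""
    primary: str
    mode: str  # "shadow" or "probabilistic"
    weights: Dict[str, float] = field(default_factory=dict)


def _safe_call(fn: Backend, image: bytes) -> Optional[str]:
    """Run one backend; a failure yields None instead of propagating."""
    try:
        return fn(image)
    except Exception:
        return None


def route(
    policy: RoutingPolicySketch,
    backends: Dict[str, Backend],
    image: bytes,
    rng: Optional[random.Random] = None,
) -> Tuple[str, Dict[str, Optional[str]]]:
    """Return (name of the backend whose result is served, results of every run)."""
    rng = rng if rng is not None else random.Random()
    if policy.mode == "shadow":
        # Every enabled backend runs in parallel; only the primary is served.
        with ThreadPoolExecutor(max_workers=len(backends)) as pool:
            futures = {
                name: pool.submit(_safe_call, fn, image)
                for name, fn in backends.items()
            }
            results = {name: fut.result() for name, fut in futures.items()}
        return policy.primary, results
    if policy.mode == "probabilistic":
        # Weighted random draw over backends with a positive weight.
        names = [n for n in backends if policy.weights.get(n, 0.0) > 0.0]
        chosen = rng.choices(
            names, weights=[policy.weights[n] for n in names], k=1
        )[0]
        return chosen, {chosen: _safe_call(backends[chosen], image)}
    raise ValueError(f"unknown routing mode: {policy.mode!r}")
```

Passing a seeded `random.Random` as `rng` makes the weighted draw deterministic, which may be convenient for the "weighted draw under probabilistic split" test listed in the acceptance criteria.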
Paired-run persistence
Extend the CropRun model from #421 with a pair_group UUID field so that all CropRuns produced for a single source image under shadow mode share a key. The CropRun.source JSONField (schema: {type, backend, deployment, version}) already identifies which backend produced each run — use source.backend and source.version for per-backend aggregation in the dashboard.
Persist for every run: source, mask_asset, latency_ms, error (if any), and the relative {x, y, w, h} bbox derivation. This is the substrate the dashboard reads from.
For probabilistic split, provide a Bazel-runnable backfill job that takes a list of source images and re-runs them through any chosen secondary backend to synthesize pairs (so we can still build agreement statistics retroactively).
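A minimal sketch of the record shape described above, assuming a dataclass in place of the real CropRun Django model. `CropRunRecord` and `run_pair_group` are hypothetical names; the `type` and `deployment` keys of `source` are omitted because their values are not specified here, and backends run sequentially to keep the example short.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class CropRunRecord:
    """Illustrative subset of the fields persisted for every run."""
    pair_group: uuid.UUID
    source: Dict[str, str]
    mask_asset: Optional[str]
    latency_ms: float
    error: Optional[str]


def run_pair_group(
    image: bytes,
    backends: Dict[Tuple[str, str], Callable[[bytes], str]],
) -> List[CropRunRecord]:
    """Run one image through several backends, tagging every run with one pair_group."""
    pair_group = uuid.uuid4()  # shared key for all runs of this source image
    records: List[CropRunRecord] = []
    for (backend, version), fn in backends.items():
        started = time.perf_counter()
        mask_asset: Optional[str] = None
        error: Optional[str] = None
        try:
            mask_asset = fn(image)
        except Exception as exc:
            # Failures are persisted as well, so error rates can be computed later.
            error = repr(exc)
        latency_ms = (time.perf_counter() - started) * 1000.0
        records.append(
            CropRunRecord(
                pair_group=pair_group,
                source={"backend": backend, "version": version},
                mask_asset=mask_asset,
                latency_ms=latency_ms,
                error=error,
            )
        )
    return records
```

A backfill job could plausibly reuse the same shape, looking up the existing `pair_group` for an image instead of generating a new one.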
Comparison dashboard (admin)
A new admin view, parameterized by two backend identifiers (A, B) (matched against CropRun.source.backend), that surfaces:
Agreement rate: distribution of mask IoU across all paired CropRuns, plus summary statistics (median, p10, p90).
Latency comparison: per-backend latency_ms distribution, including p50/p95.
Error rate: percentage of runs where status = "error" or mask_asset is null.
Disagreement drill-down: paginated list of paired runs below a configurable IoU threshold, side-by-side mask preview (rendered from mask_asset Cloudinary references), link to the source image and (if any) the linked Piece.
Sample size + confidence indicators so promote/retire decisions are not made on noise.
The dashboard is generalized: pick any two backends from the registry and the dashboard recomputes. It is not hardcoded to rembg-vs-trained.
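The agreement metric and its summary statistics can be sketched as below. This is a pure-Python illustration that assumes masks are equally sized 2D grids of 0/1 values; a real implementation would more likely operate on array types. `mask_iou` and `summarize_ious` are hypothetical names, and treating two empty masks as perfect agreement (IoU 1.0) is an assumption, not something the issue specifies.

```python
import statistics
from typing import Dict, List, Sequence


def mask_iou(a: Sequence[Sequence[int]], b: Sequence[Sequence[int]]) -> float:
    """Intersection over union of two binary masks of identical shape."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("masks must have the same shape")
    intersection = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            if pa and pb:
                intersection += 1
            if pa or pb:
                union += 1
    # Assumption: two empty masks count as full agreement.
    return 1.0 if union == 0 else intersection / union


def summarize_ious(ious: List[float]) -> Dict[str, float]:
    """Median, p10 and p90 of a list of IoU values, plus the sample size."""
    if len(ious) < 2:
        raise ValueError("need at least two paired runs to summarize")
    deciles = statistics.quantiles(ious, n=10, method="inclusive")
    return {
        "n": len(ious),
        "median": statistics.median(ious),
        "p10": deciles[0],
        "p90": deciles[-1],
    }
```

Reporting `n` next to the percentiles is one simple way to surface sample size in the dashboard, so that a high median over a handful of pairs is not mistaken for strong evidence.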
Closed feedback loop: disagreement → annotation queue
When a paired comparison's mask IoU falls below a configurable threshold (admin setting on RoutingPolicy), the source image is automatically enqueued into the annotation surface from #420 for human re-labeling. The resulting CropAnnotation then becomes ground truth that feeds back into the next training run via the dataset export pipeline (#415). This is the closed loop that makes A/B routing valuable beyond just "pick a winner."
Deduplicate aggressively — the same image disagreeing across multiple runs should be queued once, not N times.
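The threshold check and per-image deduplication could look roughly like this. It is a sketch under assumptions: `select_for_annotation` is a hypothetical helper, source images are identified by string IDs, and the in-memory set stands in for whatever persistent uniqueness guarantee the annotation queue from #420 actually provides.

```python
from typing import Iterable, List, Set, Tuple


def select_for_annotation(
    paired: Iterable[Tuple[str, float]],
    threshold: float,
    already_queued: Set[str],
) -> List[str]:
    """Return source image IDs to enqueue, each at most once.

    `paired` yields (source_image_id, mask_iou) for every paired comparison.
    `already_queued` is updated in place so later calls also skip duplicates.
    """
    newly_queued: List[str] = []
    for image_id, iou in paired:
        if iou < threshold and image_id not in already_queued:
            already_queued.add(image_id)
            newly_queued.append(image_id)
    return newly_queued
```

For example, with a threshold of 0.8, an image that disagrees in three separate pair groups is returned once on first sight and skipped afterwards.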
Acceptance Criteria
RoutingPolicy model + admin: pin primary backend version, choose mode (shadow | probabilistic), configure per-backend weights and enabled set, set disagreement IoU threshold
Single active RoutingPolicy consulted on every crop-service request
Shadow mode: all enabled backends run in parallel per request; primary's mask returned; all results persisted as paired CropRuns sharing a pair_group
Probabilistic split: weighted random routing to exactly one backend per request; result returned and persisted
Routing layer is backend-agnostic — works for any pair of registered versions from #419
Bazel-runnable backfill tool to synthesize pairs for probabilistic-split traffic
Admin dashboard view parameterized by (backend_a, backend_b) (matched against CropRun.source.backend), surfacing: mask-IoU distribution + summary stats, latency comparison, error rates, paginated disagreement drill-down with side-by-side mask previews
Paired runs below the configured IoU threshold are auto-enqueued into the annotation surface from #420, deduplicated per source image
Admin can promote a backend (make it primary) or retire one (mark deprecated in the registry) directly from the dashboard
Tests: routing-mode selection, paired persistence under shadow, weighted draw under probabilistic split, IoU computation, threshold-triggered annotation enqueue, dashboard query correctness
DoD checklist from .agents/skills/github-pr/SKILL.md satisfied
Out of Scope
Multi-armed bandit or other adaptive routing — explicitly out; weights are statically configured per policy.
Automatic promote/retire — promotion remains a human decision driven by dashboard evidence.
Per-user cohort routing (e.g., "potter X always gets v2") — interesting, but not in this slice.
The actual annotation UI itself — that ships in #420; this issue only wires the enqueue side.
Training a new model — covered in #416; this issue assumes a v2 checkpoint exists per #417.
Dependencies
#421 — CropRun model (with source JSONField and mask_asset); this issue adds pair_group to it
#417 — a v2 checkpoint must exist so the dashboard has something meaningful to compare against the rembg baseline
Milestone Cross-References
Part of milestone #2 — Custom Pottery Crop Model.