feat(router): v0.54 cluster artifact, TAU2-weighted off v0.52 by steventohme · Pull Request #235 · workweave/router

steventohme · 2026-05-22T23:50:14Z

Summary

v0.54 retrains the cluster scorer off v0.52 to penalize models that benchmark poorly on tool use. Production v0.52 was routing tool-heavy Claude Code requests (has_tools=true) to xiaomi/mimo-v2.5, which can't drive the tool loop.

Three knob changes vs v0.52:

TAU2_BENCH_TELECOM tier weight 0.3 → 2.0 (the tool-use signal; lives in workweave's eval/aa_metrics.py)
aa_evidence_scale 1.0 → 3.0 (lift AA priors out of the noise floor)
alpha 0.53 → 0.70 (tilts the α-blend toward quality over cost; this is an explicit cost trade)
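To see how the tier weight and alpha interact, here is a minimal sketch. It assumes (the scorer itself is not shown in this PR) that quality is a tier-weighted mean of benchmark scores and that the final score is a linear α-blend of quality and cheapness. The function names, the second benchmark, the two model names, and every score and normalized cost are hypothetical; only the weights 0.3 → 2.0 and alpha 0.53 → 0.70 come from this PR.

```python
from typing import Dict

# Tier weights: only the TAU2_BENCH_TELECOM values come from the PR.
# OTHER_BENCH and its weight of 1.0 are hypothetical placeholders.
OLD_WEIGHTS = {"TAU2_BENCH_TELECOM": 0.3, "OTHER_BENCH": 1.0}
NEW_WEIGHTS = {"TAU2_BENCH_TELECOM": 2.0, "OTHER_BENCH": 1.0}

# Hypothetical models: benchmark scores in [0, 1] and a normalized cost in [0, 1].
MODELS = {
    "cheap_model": {
        "scores": {"TAU2_BENCH_TELECOM": 0.2, "OTHER_BENCH": 0.8},
        "cost_norm": 0.02,
    },
    "strong_model": {
        "scores": {"TAU2_BENCH_TELECOM": 0.9, "OTHER_BENCH": 0.8},
        "cost_norm": 1.0,
    },
}


def weighted_quality(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Tier-weighted mean of benchmark scores (assumed form, not the real scorer)."""
    total_weight = sum(weights[bench] for bench in scores)
    return sum(scores[bench] * weights[bench] for bench in scores) / total_weight


def blended_score(quality: float, cost_norm: float, alpha: float) -> float:
    """Linear blend: alpha weights quality, (1 - alpha) weights cheapness."""
    return alpha * quality + (1.0 - alpha) * (1.0 - cost_norm)


def top1(weights: Dict[str, float], alpha: float) -> str:
    """Name of the model with the highest blended score."""
    def score(name: str) -> float:
        model = MODELS[name]
        quality = weighted_quality(model["scores"], weights)
        return blended_score(quality, model["cost_norm"], alpha)

    return max(MODELS, key=score)


if __name__ == "__main__":
    print("v0.52-style knobs:", top1(OLD_WEIGHTS, alpha=0.53))
    print("v0.54-style knobs:", top1(NEW_WEIGHTS, alpha=0.70))
```

With these made-up inputs, the cheap model with weak tool-use scores wins under the old knobs and loses under the new ones, which is the kind of flip the retrain is aiming for.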

Also: AA priors were regenerated without --registry so that deployed models actually receive AA evidence (the NATIVE_OVERRIDE_BENCHMARKS default drops every benchmark for deployed models, which made an initial run a no-op).

Impact

Prod-bad clusters from the routing log (top_p=[2,8,9,12]):

Cluster	mimo-v2.5 rank v0.52 → v0.54	New top-1
2	1st → 2nd	gemini-3.1-pro-preview
8	1st → 7th	gpt-5.5
9	10th → 11th	qwen3-235b-a22b-2507
12	7th → 11th	claude-opus-4-7

The cost trade is explicit:

Cluster 12 top-1: qwen3.6-35b-a3b ($0.0004/1k) → claude-opus-4-7 ($0.034/1k)
Cluster 8 top-1: mimo ($0.0009) → gpt-5.5 ($0.015)
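The size of the trade can be checked with simple arithmetic. The sketch below computes the price multiple for each top-1 swap, assuming the cluster 8 prices use the same per-1k unit as the cluster 12 prices; the helper name `cost_multiple` is made up for illustration.

```python
def cost_multiple(old_price: float, new_price: float) -> float:
    """How many times more expensive the new top-1 is than the old one."""
    return new_price / old_price


# (old top-1 price, new top-1 price), copied from the list above.
TOP1_PRICE_CHANGES = {
    "cluster 12": (0.0004, 0.034),  # qwen3.6-35b-a3b -> claude-opus-4-7
    "cluster 8": (0.0009, 0.015),   # mimo -> gpt-5.5
}

if __name__ == "__main__":
    for cluster, (old, new) in TOP1_PRICE_CHANGES.items():
        print(f"{cluster}: {cost_multiple(old, new):.1f}x")
```

That works out to roughly 85x for cluster 12 and roughly 17x for cluster 8.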

Roster unchanged from v0.52 (19 deployed models).

Test plan

Workweave follow-up PR bumps the gitlink, updates eval/aa_metrics.py (TAU2 0.3 → 2.0), and includes the regenerated aa_quality_priors.json
Verify the latest pointer reads v0.54 after merge
Smoke test: send a has_tools=true Claude Code request locally against the new bundle, and confirm the routing decision logs a non-mimo model for a cluster where mimo previously won
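The smoke-test check could be automated along these lines. This is only a sketch: the router's real log format is not shown in this PR, so the JSON shape and the field names `cluster`, `has_tools`, and `model` are hypothetical, as is the sample line. The set of clusters where mimo previously won (2 and 8) is taken from the Impact table.

```python
import json

# Clusters where mimo-v2.5 ranked 1st under v0.52, per the Impact table.
MIMO_WON_CLUSTERS = {2, 8}


def routed_away_from_mimo(log_line: str) -> bool:
    """Check one routing-decision record (hypothetical JSON format).

    Returns True when a tool-using request in a cluster that mimo used to
    win was routed to some other model.
    """
    record = json.loads(log_line)
    if not record.get("has_tools"):
        raise ValueError("smoke test expects a has_tools=true request")
    if record["cluster"] not in MIMO_WON_CLUSTERS:
        raise ValueError("request did not land in a cluster mimo previously won")
    return "mimo" not in record["model"]


if __name__ == "__main__":
    # Made-up sample record, not real router output.
    sample = '{"cluster": 8, "has_tools": true, "model": "gpt-5.5"}'
    print(routed_away_from_mimo(sample))
```

Raising on requests that miss the target clusters keeps a passing result from being mistaken for coverage of the clusters that actually regressed.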

🤖 Generated with Claude Code

v0.54 retrains the cluster scorer off v0.52 with three changes aimed at penalizing models that benchmark poorly on tool-use:

- aa_evidence_scale 1.0 -> 3.0 (lift AA priors out of the noise floor)
- alpha 0.53 -> 0.70 (tilt α-blend toward quality vs cost)
- TAU2_BENCH_TELECOM tier weight 0.3 -> 2.0 (tool-use signal; carried via the priors file regenerated WITHOUT --registry so deployed models receive AA evidence; tier weight bump lives in workweave's eval/aa_metrics.py)

Motivation: production v0.52 was routing tool-heavy Claude Code requests (has_tools=true) to xiaomi/mimo-v2.5, which can't drive the tool loop and produced empty-tool-result loops on the client.

Impact on the four prod-bad clusters (top_p=[2,8,9,12]):

mimo-v2.5 rank v0.52 -> v0.54:
cluster 2: 1st -> 2nd (gemini-3.1-pro-preview takes 1st)
cluster 8: 1st -> 7th (gpt-5.5 takes 1st)
cluster 9: 10th -> 11th
cluster 12: 7th -> 11th (claude-opus-4-7 takes 1st)

Cost trade-off (explicit, from alpha=0.70):
cluster 12 top-1: qwen3.6-35b-a3b ($0.0004/1k) -> claude-opus-4-7 ($0.034/1k)
cluster 8 top-1: mimo ($0.0009) -> gpt-5.5 ($0.015)

Roster unchanged from v0.52 (19 deployed models).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

steventohme merged commit 335a6fa into main May 22, 2026
7 checks passed

steventohme deleted the steven/router-v0-54-tau2-weighted branch May 22, 2026 23:52

steventohme mentioned this pull request May 23, 2026

feat(router): v0.54 retrain — drop TAU2 lever #236

Merged

3 tasks

steventohme merged 1 commit into main from steven/router-v0-54-tau2-weighted
