feat(router): v0.54 cluster artifact, TAU2-weighted off v0.52 #235
Merged
v0.54 retrains the cluster scorer off v0.52 with three changes aimed at
penalizing models that benchmark poorly on tool-use:
- aa_evidence_scale 1.0 -> 3.0 (lift AA priors out of the noise floor)
- alpha 0.53 -> 0.70 (tilt α-blend toward quality vs cost)
- TAU2_BENCH_TELECOM tier weight 0.3 -> 2.0 (tool-use signal; carried
via the priors file regenerated WITHOUT --registry so deployed
models receive AA evidence; tier weight bump lives in workweave's
eval/aa_metrics.py)
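
To illustrate why raising alpha can flip a cluster's top-1, here is a minimal sketch that assumes the α-blend is a convex combination of a normalized quality score and a normalized cheapness score. The scorer's actual formula is not shown in this PR, and the `blend` function, model names, and scores below are all hypothetical.

```python
# Hypothetical sketch: the real scorer's blend formula is not shown in this PR.
# Assumption: both inputs are normalized to [0, 1], where 1.0 is best quality
# or cheapest price respectively.

def blend(quality: float, cheapness: float, alpha: float) -> float:
    """Convex blend: alpha weights quality, (1 - alpha) weights cheapness."""
    return alpha * quality + (1.0 - alpha) * cheapness


# Two made-up models; these are illustrative values, not measured ones.
MODELS = {
    "strong_but_pricey": {"quality": 0.90, "cheapness": 0.10},
    "cheap_but_weak": {"quality": 0.45, "cheapness": 0.95},
}


def top1(alpha: float) -> str:
    """Return the model name with the highest blended score for this alpha."""
    return max(
        MODELS,
        key=lambda name: blend(
            MODELS[name]["quality"], MODELS[name]["cheapness"], alpha
        ),
    )


if __name__ == "__main__":
    for alpha in (0.53, 0.70):
        print(f"alpha={alpha}: top-1 = {top1(alpha)}")
```

With these made-up scores the cheap model wins at alpha=0.53 and the stronger model wins at alpha=0.70. That is the same kind of flip the alpha change is meant to produce, though the real crossover point depends on the actual scores.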
Motivation: production v0.52 was routing tool-heavy Claude Code
requests (has_tools=true) to xiaomi/mimo-v2.5, which can't drive the
tool loop and produced empty-tool-result loops on the client.
Impact on the four prod-bad clusters (top_p=[2,8,9,12]):
mimo-v2.5 rank v0.52 -> v0.54:
  cluster 2:  1st  -> 2nd  (gemini-3.1-pro-preview takes 1st)
  cluster 8:  1st  -> 7th  (gpt-5.5 takes 1st)
  cluster 9:  10th -> 11th
  cluster 12: 7th  -> 11th (claude-opus-4-7 takes 1st)
Cost trade-off (explicit, from alpha=0.70):
  cluster 12 top-1: qwen3.6-35b-a3b ($0.0004/1k) -> claude-opus-4-7 ($0.034/1k)
  cluster 8 top-1:  mimo ($0.0009/1k) -> gpt-5.5 ($0.015/1k)
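
The size of that trade is easier to see as a multiplier. The sketch below only does arithmetic on the four prices quoted above. It assumes the cluster 8 prices use the same per-1k unit as the cluster 12 ones, and `cost_multiplier` is an illustrative helper, not part of the router.

```python
# Prices in USD per 1k, copied from the cost trade-off lines above.
# Assumption: the cluster 8 prices share the per-1k unit of the cluster 12 ones.
TOP1_CHANGES = {
    "cluster 12": (0.0004, 0.034),  # qwen3.6-35b-a3b -> claude-opus-4-7
    "cluster 8": (0.0009, 0.015),   # mimo -> gpt-5.5
}


def cost_multiplier(old_price: float, new_price: float) -> float:
    """How many times more expensive the new top-1 is than the old one."""
    return new_price / old_price


multipliers = {
    cluster: round(cost_multiplier(old, new), 1)
    for cluster, (old, new) in TOP1_CHANGES.items()
}

if __name__ == "__main__":
    for cluster, factor in multipliers.items():
        print(f"{cluster}: top-1 is ~{factor}x more expensive")
```

At these prices the new top-1 costs roughly 85x more for cluster 12 and roughly 17x more for cluster 8.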
Roster unchanged from v0.52 (19 deployed models).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
v0.54 retrains the cluster scorer off v0.52 to penalize models that benchmark poorly on tool-use, in response to prod routing tool-heavy Claude Code requests to xiaomi/mimo-v2.5, which can't drive the tool loop.

Three knob changes vs v0.52:
- TAU2_BENCH_TELECOM tier weight 0.3 → 2.0 (the tool-use signal; lives in workweave's eval/aa_metrics.py)
- aa_evidence_scale 1.0 → 3.0 (lift AA priors out of the noise floor)
- alpha 0.53 → 0.70 (tilt α-blend toward quality vs cost; explicit cost trade)

Also: AA priors regenerated without --registry so deployed models actually receive AA evidence (the NATIVE_OVERRIDE_BENCHMARKS default drops every bench for deployed models, which made an initial run a no-op).

Impact
Prod-bad clusters from the routing log (top_p=[2,8,9,12]), mimo-v2.5 rank v0.52 → v0.54:
- cluster 2: 1st → 2nd (gemini-3.1-pro-preview takes 1st)
- cluster 8: 1st → 7th (gpt-5.5 takes 1st)
- cluster 9: 10th → 11th
- cluster 12: 7th → 11th (claude-opus-4-7 takes 1st)

Cost trade explicit:
- cluster 12 top-1: qwen3.6-35b-a3b ($0.0004/1k) → claude-opus-4-7 ($0.034/1k)
- cluster 8 top-1: mimo ($0.0009) → gpt-5.5 ($0.015)

Roster unchanged from v0.52 (19 deployed models).
Test plan
- eval/aa_metrics.py (TAU2 0.3 → 2.0) + regenerated aa_quality_priors.json
- latest pointer reads v0.54 after merge
- has_tools=true Claude Code request locally against the new bundle; confirm the routing decision logs a non-mimo model for a cluster where mimo previously won

🤖 Generated with Claude Code