feat(router): v0.54 cluster artifact, TAU2-weighted off v0.52 #235

Merged
steventohme merged 1 commit into main from steven/router-v0-54-tau2-weighted on May 22, 2026
Conversation

@steventohme
Collaborator

Summary

v0.54 retrains the cluster scorer off v0.52 to penalize models that benchmark poorly on tool use. This responds to prod routing tool-heavy Claude Code requests to xiaomi/mimo-v2.5, which can't drive the tool loop.

Three knob changes vs v0.52:

  • TAU2_BENCH_TELECOM tier weight 0.3 → 2.0 (the tool-use signal; lives in workweave's eval/aa_metrics.py)
  • aa_evidence_scale 1.0 → 3.0 (lift AA priors out of the noise floor)
  • alpha 0.53 → 0.70 (tilt α-blend toward quality vs cost — explicit cost trade)
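To illustrate why the alpha change alone can flip a cluster's top-1, here is a minimal sketch assuming a linear α-blend of normalized quality and cost. The formula, the function name `blended_score`, and the candidate numbers are all assumptions for illustration; the scorer's actual blend is not shown in this PR.

```python
# Hypothetical sketch: the real scorer's blend formula is not shown in this PR.
def blended_score(quality: float, cost_norm: float, alpha: float) -> float:
    """Linear alpha-blend: alpha weights quality, (1 - alpha) weights cheapness.

    Both quality and cost_norm are assumed to be normalized to [0, 1].
    """
    return alpha * quality + (1.0 - alpha) * (1.0 - cost_norm)


# Made-up numbers for illustration only (not measurements from the router).
candidates = {
    "cheap-model": {"quality": 0.50, "cost_norm": 0.05},
    "strong-model": {"quality": 0.90, "cost_norm": 0.90},
}


def top1(alpha: float) -> str:
    """Return the candidate with the highest blended score at this alpha."""
    return max(
        candidates,
        key=lambda name: blended_score(alpha=alpha, **candidates[name]),
    )


if __name__ == "__main__":
    for alpha in (0.53, 0.70):
        print(alpha, top1(alpha))
```

With these made-up inputs the cheap model wins at alpha=0.53 and the strong model wins at alpha=0.70, which is the kind of flip the knob change is meant to produce.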

Also: AA priors were regenerated without --registry so that deployed models actually receive AA evidence (the NATIVE_OVERRIDE_BENCHMARKS default drops every benchmark for deployed models, which made an initial run a no-op).
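One way to picture the other two knobs is a weighted benchmark average feeding a prior that is then mixed with observed quality. The sketch below assumes exactly that; `weighted_prior`, `posterior_quality`, `OTHER_BENCH`, `prior_strength`, and every number are hypothetical, and the real logic in eval/aa_metrics.py may differ.

```python
# Hypothetical sketch: names and numbers are illustrative, not from the repo.
def weighted_prior(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-benchmark scores using per-benchmark tier weights."""
    total_weight = sum(weights[bench] for bench in scores)
    return sum(scores[bench] * weights[bench] for bench in scores) / total_weight


def posterior_quality(
    observed_mean: float,
    n_observed: float,
    prior: float,
    prior_strength: float,
    evidence_scale: float,
) -> float:
    """Shrink observed quality toward the prior.

    evidence_scale multiplies the prior's pseudo-count, so a larger value
    gives the prior more pull against the observed data.
    """
    pseudo = prior_strength * evidence_scale
    return (observed_mean * n_observed + prior * pseudo) / (n_observed + pseudo)


# A made-up model that is weak on tool use but fine elsewhere.
scores = {"TAU2_BENCH_TELECOM": 0.2, "OTHER_BENCH": 0.8}

prior_old = weighted_prior(scores, {"TAU2_BENCH_TELECOM": 0.3, "OTHER_BENCH": 1.0})
prior_new = weighted_prior(scores, {"TAU2_BENCH_TELECOM": 2.0, "OTHER_BENCH": 1.0})

q_old = posterior_quality(0.7, 10, prior_old, prior_strength=1.0, evidence_scale=1.0)
q_new = posterior_quality(0.7, 10, prior_new, prior_strength=1.0, evidence_scale=3.0)
```

Under these assumptions, raising the TAU2 weight lowers the prior for a tool-weak model, and raising the evidence scale lets that lower prior move the final quality estimate further.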

Impact

Prod-bad clusters from the routing log (top_p=[2,8,9,12]):

| Cluster | mimo-v2.5 rank (v0.52 → v0.54) | New top-1 |
| --- | --- | --- |
| 2 | 1st → 2nd | gemini-3.1-pro-preview |
| 8 | 1st → 7th | gpt-5.5 |
| 9 | 10th → 11th | qwen3-235b-a22b-2507 |
| 12 | 7th → 11th | claude-opus-4-7 |

The cost trade-off is explicit:

  • Cluster 12 top-1: qwen3.6-35b-a3b ($0.0004/1k) → claude-opus-4-7 ($0.034/1k)
  • Cluster 8 top-1: mimo ($0.0009) → gpt-5.5 ($0.015)
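For scale, the price multipliers implied by the two top-1 changes above can be worked out directly. This is plain arithmetic on the quoted prices; the helper name `cost_multiplier` is illustrative, and each ratio only assumes both prices in a pair use the same unit.

```python
def cost_multiplier(old_price: float, new_price: float) -> float:
    """How many times more expensive the new top-1 is than the old one."""
    return new_price / old_price


# Prices as quoted in the PR description.
cluster_12 = cost_multiplier(0.0004, 0.034)  # qwen3.6-35b-a3b -> claude-opus-4-7
cluster_8 = cost_multiplier(0.0009, 0.015)   # mimo -> gpt-5.5

if __name__ == "__main__":
    print(f"cluster 12: {cluster_12:.1f}x")
    print(f"cluster  8: {cluster_8:.1f}x")
```

That works out to roughly 85x for cluster 12 and roughly 16.7x for cluster 8.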

Roster unchanged from v0.52 (19 deployed models).

Test plan

  • Workweave PR follow-up bumps gitlink + updates eval/aa_metrics.py (TAU2 0.3 → 2.0) + regenerated aa_quality_priors.json
  • Verify latest pointer reads v0.54 after merge
  • Smoke test: send a has_tools=true Claude Code request locally against the new bundle, confirm routing decision logs a non-mimo model for a cluster where mimo previously won
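The smoke-test check could be scripted along these lines. This is a sketch only: the decision-record shape (`cluster`, `model`, `has_tools` keys) is an assumption, since the routing log format is not shown in this PR, and the sample records are invented.

```python
from typing import Iterable

MIMO = "xiaomi/mimo-v2.5"
# Clusters where mimo-v2.5 ranked 1st under v0.52, per the Impact section.
MIMO_WON_CLUSTERS = {2, 8}


def mimo_regressions(decisions: Iterable[dict]) -> list[dict]:
    """Return tool-bearing decisions in those clusters that still chose mimo.

    Each decision is assumed (hypothetically) to be a dict with the keys
    'cluster', 'model', and 'has_tools'.
    """
    return [
        d
        for d in decisions
        if d.get("has_tools")
        and d.get("cluster") in MIMO_WON_CLUSTERS
        and d.get("model") == MIMO
    ]


# Invented sample records standing in for parsed routing-decision logs.
sample_ok = [
    {"cluster": 8, "model": "gpt-5.5", "has_tools": True},
    {"cluster": 2, "model": "gemini-3.1-pro-preview", "has_tools": True},
]
sample_bad = sample_ok + [{"cluster": 8, "model": MIMO, "has_tools": True}]

if __name__ == "__main__":
    print(len(mimo_regressions(sample_ok)), len(mimo_regressions(sample_bad)))
```

An empty result means no tool-bearing request in a previously mimo-won cluster was routed to mimo.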

🤖 Generated with Claude Code

v0.54 retrains the cluster scorer off v0.52 with three changes aimed at
penalizing models that benchmark poorly on tool-use:

  - aa_evidence_scale 1.0 -> 3.0 (lift AA priors out of the noise floor)
  - alpha 0.53 -> 0.70 (tilt α-blend toward quality vs cost)
  - TAU2_BENCH_TELECOM tier weight 0.3 -> 2.0 (tool-use signal; carried
    via the priors file regenerated WITHOUT --registry so deployed
    models receive AA evidence; tier weight bump lives in workweave's
    eval/aa_metrics.py)

Motivation: production v0.52 was routing tool-heavy Claude Code
requests (has_tools=true) to xiaomi/mimo-v2.5, which can't drive the
tool loop and produced empty-tool-result loops on the client.

Impact on the four prod-bad clusters (top_p=[2,8,9,12]):

  mimo-v2.5 rank v0.52 -> v0.54:
    cluster  2:  1st -> 2nd   (gemini-3.1-pro-preview takes 1st)
    cluster  8:  1st -> 7th   (gpt-5.5 takes 1st)
    cluster  9: 10th -> 11th
    cluster 12:  7th -> 11th  (claude-opus-4-7 takes 1st)

Cost trade-off (explicit, from alpha=0.70):
    cluster 12 top-1: qwen3.6-35b-a3b ($0.0004/1k) -> claude-opus-4-7 ($0.034/1k)
    cluster  8 top-1: mimo ($0.0009)              -> gpt-5.5 ($0.015)

Roster unchanged from v0.52 (19 deployed models).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@steventohme steventohme merged commit 335a6fa into main May 22, 2026
7 checks passed
@steventohme steventohme deleted the steven/router-v0-54-tau2-weighted branch May 22, 2026 23:52