Skip to content

v0.6.70 — Cloud routing restored (#500)

Choose a tag to compare

@raullenchai raullenchai released this 01 Jun 13:57

Fixes

  • #500--cloud-model routing silently never fires. Two regressions traced to #155 (SimpleEngine deletion) had silently disabled cloud routing for ~6 weeks:
    1. BatchedEngine.build_prompt was missing; the call site was guarded by hasattr(engine, "build_prompt") which turned False against BatchedEngine and skipped the cloud branch with no log line.
    2. engine.model.estimate_new_tokens was missing too; even after the hasattr was removed, the cloud branch reached the line and fell through a try/except ("falling back to local").

Both methods are now part of the engine contract on BatchedEngine. Regression gates in tests/test_cloud_router.py::TestEngineCloudRoutingContract pin the surface, including AST-grade scans that prevent the original silent-skip patterns from reappearing.

End-to-end verified: rapid-mlx serve qwen3.5-4b --cloud-model deepseek/deepseek-chat --cloud-threshold 10 now routes the request to deepseek-v4-flash with a visible [CLOUD ROUTE] 30 new tokens > threshold 10, routing to deepseek/deepseek-chat log line.

Notes

BatchedEngine.estimate_new_tokens returns (total, total) — conservative vs the SimpleEngine version which peeked at per-engine cached token IDs. BatchedEngine's prefix cache is shared across requests so there is no single "cached prefix" to compare; correctness is preserved (threshold semantics still hold), only an optional perf optimization is deferred.

Validation

Release-smoke (clean-room PyPI install+import), CLI/config audit, make smoke (lint + audit + 1933 unit tests), make stress 8/8, microbench (sub-second roundtrip at 10/100/1000-word prompts), latency 77-124 tok/s mean 103.5 over 10 sequential requests, integrations anthropic_sdk 5/5 + pydantic_ai 6/6 + smolagents 4/4.

🤖 Generated with Claude Code