v0.6.70 — Cloud routing restored (#500)
Fixes
- #500: --cloud-model routing silently never fires. Two regressions traced to #155 (SimpleEngine deletion) had silently disabled cloud routing for ~6 weeks:
  - BatchedEngine.build_prompt was missing; the call site was guarded by hasattr(engine, "build_prompt"), which evaluated to False against BatchedEngine and skipped the cloud branch with no log line.
  - engine.model.estimate_new_tokens was missing too; even after the hasattr guard was removed, the cloud branch reached that call and fell through a try/except ("falling back to local").
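The first failure mode can be reproduced in isolation. The following is a minimal sketch, not the project's routing code: the two engine classes and both `route_*` functions are hypothetical stand-ins, used only to show why a hasattr guard hides a missing method while a direct call surfaces it.

```python
class EngineWithoutPrompt:
    """Hypothetical engine lacking build_prompt (the regressed state)."""


class EngineWithPrompt:
    """Hypothetical engine that provides build_prompt."""

    def build_prompt(self, messages):
        return "\n".join(m["content"] for m in messages)


def route_guarded(engine, messages):
    # Guarded pattern: when the method is missing the guard is False,
    # so the cloud branch is skipped and nothing is logged.
    if hasattr(engine, "build_prompt"):
        engine.build_prompt(messages)
        return "cloud"
    return "local"


def route_contract(engine, messages):
    # Contract pattern: call the method directly, so a missing
    # implementation raises AttributeError instead of going local.
    engine.build_prompt(messages)
    return "cloud"
```

With the guarded version, an engine that loses `build_prompt` in a refactor simply returns "local" forever; with the direct call, the same refactor fails on the first request.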
Both methods are now part of the engine contract on BatchedEngine. Regression gates in tests/test_cloud_router.py::TestEngineCloudRoutingContract pin the surface, including AST-grade scans that prevent the original silent-skip patterns from reappearing.
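An AST scan of this kind could look roughly like the sketch below. It is an illustration under assumptions, not the contents of tests/test_cloud_router.py: the helper name `find_hasattr_guards` and the sample source are hypothetical, and only the standard-library `ast` module is used.

```python
import ast


def find_hasattr_guards(source: str, attr_name: str) -> list:
    """Return line numbers of hasattr(<expr>, "<attr_name>") calls in source."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "hasattr"
            and len(node.args) == 2
            and isinstance(node.args[1], ast.Constant)
            and node.args[1].value == attr_name
        ):
            hits.append(node.lineno)
    return sorted(hits)


# Hypothetical call site containing the pattern the gate should reject.
SAMPLE = '''
def handle(engine, messages):
    if hasattr(engine, "build_prompt"):
        return engine.build_prompt(messages)
    return None
'''
```

A regression test built on this idea would read the server module's source and assert that the returned list is empty, so reintroducing the guard fails the test suite rather than silently disabling routing.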
End-to-end verified: rapid-mlx serve qwen3.5-4b --cloud-model deepseek/deepseek-chat --cloud-threshold 10 now routes the request to deepseek-v4-flash with a visible [CLOUD ROUTE] 30 new tokens > threshold 10, routing to deepseek/deepseek-chat log line.
Notes
BatchedEngine.estimate_new_tokens returns (total, total). This is conservative compared with the SimpleEngine version, which peeked at per-engine cached token IDs. BatchedEngine's prefix cache is shared across requests, so there is no single "cached prefix" to compare against. Correctness is preserved (threshold semantics still hold); only an optional perf optimization is deferred.
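The conservative estimate can be sketched as follows. This is an assumption-laden illustration rather than the shipped method: the free-function form, the reading of the tuple as (new tokens, total tokens), and the `should_route_to_cloud` helper are all hypothetical; the strict `>` comparison is taken from the "30 new tokens > threshold 10" log line.

```python
def estimate_new_tokens(prompt_token_ids):
    """Conservative estimate: treat every prompt token as new.

    Assumed tuple meaning: (new_tokens, total_tokens). With no per-request
    cached prefix to subtract, both values equal the prompt length.
    """
    total = len(prompt_token_ids)
    return total, total


def should_route_to_cloud(new_tokens, threshold):
    # Hypothetical helper: strict comparison, as in the log line.
    return new_tokens > threshold
```

Because the estimate can only overstate the number of new tokens, a request that an exact count would route to the cloud is still routed; the cost is that some requests with a warm prefix go to the cloud when they could have stayed local.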
Validation
Release-smoke (clean-room PyPI install+import), CLI/config audit, make smoke (lint + audit + 1933 unit tests), make stress 8/8, microbench (sub-second roundtrip at 10/100/1000-word prompts), latency 77-124 tok/s mean 103.5 over 10 sequential requests, integrations anthropic_sdk 5/5 + pydantic_ai 6/6 + smolagents 4/4.
🤖 Generated with Claude Code