v0.6.68
v0.6.68 — chat launch polish + MTP guard
First release since v0.6.66 (0.6.67 was skipped because its squash-commit subject didn't match the auto-release regex).
Highlights
- 🛡️ MTP injection no longer crashes on hybrid VLM models (#477, #483). `rapid-mlx serve <Qwen3.6-VL-MTP-model> --enable-mtp --force-spec-decode` previously crashed twice: once on the outer VLM args lookup, once on the hybrid Gated-DeltaNet `_step` missing. Both paths now fail cleanly with a single warning each; the request continues without MTP. Proper VLM+MTP support is tracked as a follow-up.
- 💬 `rapid-mlx chat` pre-launch UX polish (#482):
  - Auto-bump `max_tokens` to 4096 when `--think` is set (empty-answer fix on small reasoning models)
  - Atexit zombie reap for spawned `serve` subprocesses + SIGTERM handler
  - `--port` range validator + pre-flight TCP probe
  - `rapid-mlx run` alias for chat; `rapid-mlx ls` for cached models
  - Download confirmation gate (`[y/N]` for ≥10 GiB models, `RAPID_MLX_AUTO_PULL=1` to skip)
  - `/bye` / `?` REPL aliases; `info` box truncation; first-launch codex tip
  - New `vllm_mlx/_download_gate.py` module
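For readers curious how a confirmation gate like the one described above might behave, here is a minimal sketch. It is not the contents of `vllm_mlx/_download_gate.py`: the function name `should_download`, its parameters, and the prompt wording are hypothetical. Only the 10 GiB threshold, the `[y/N]` default, and the `RAPID_MLX_AUTO_PULL=1` override come from the notes above.

```python
import os

GIB = 1024 ** 3
# Threshold taken from the release note: prompt for models of 10 GiB or more.
CONFIRM_THRESHOLD_BYTES = 10 * GIB


def should_download(size_bytes, env=None, ask=input):
    """Return True if the download may proceed (hypothetical helper).

    `env` and `ask` are injectable so the gate can be exercised
    without touching the real environment or stdin.
    """
    env = os.environ if env is None else env
    # Small models are pulled without asking.
    if size_bytes < CONFIRM_THRESHOLD_BYTES:
        return True
    # Opt-out switch named in the release note.
    if env.get("RAPID_MLX_AUTO_PULL") == "1":
        return True
    reply = ask(f"Model is {size_bytes / GIB:.1f} GiB. Download? [y/N] ")
    # Anything other than an explicit yes keeps the default "N".
    return reply.strip().lower() in ("y", "yes")
```

Defaulting to "no" on an empty reply matches the capitalised `N` in the `[y/N]` prompt.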
Bug fixes
- `fix(api): honor max_completion_tokens on chat completions` (#459)
- `fix(api): honor parallel_tool_calls=false by capping response to 1 call` (#464)
- `fix(api): honor legacy functions/function_call by normalizing to tools` (#465)
- `fix(anthropic): forward stop_sequences to engine on /v1/messages` (#462)
- `fix(chat): preserve channel split on logprobs non-stream path` (#460)
- `fix(routes): reject audio_url on text-only models (mirror image/video gate)` (#466)
- `fix(mllm): propagate VLM image fetch errors to HTTP 400` (#458)
- `fix(engine): propagate per-token logprobs through OutputRouter` (#456)
- `fix(streaming): populate reasoning_tokens in usage chunk for OutputRouter models` (#454)
- `fix(usage): proportional reasoning_tokens split when content non-empty` (#453)
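Two of the API fixes above (#464, #465) concern OpenAI-style request fields. The sketch below illustrates the general idea only and is not the project's code: `normalize_legacy_functions` and `cap_tool_calls` are hypothetical names, and the mapping follows the public OpenAI Chat Completions request shape, where each legacy `functions` entry becomes a `tools` entry of type `function` and `function_call` becomes `tool_choice`.

```python
def normalize_legacy_functions(request: dict) -> dict:
    """Map legacy `functions`/`function_call` onto `tools`/`tool_choice`.

    Hypothetical helper; returns a shallow copy and leaves the input untouched.
    """
    out = dict(request)
    functions = out.pop("functions", None)
    function_call = out.pop("function_call", None)

    # Explicit `tools` / `tool_choice` take precedence over the legacy fields.
    if functions and "tools" not in out:
        out["tools"] = [{"type": "function", "function": f} for f in functions]
    if function_call is not None and "tool_choice" not in out:
        if isinstance(function_call, str):
            # "auto" and "none" carry over unchanged.
            out["tool_choice"] = function_call
        else:
            out["tool_choice"] = {
                "type": "function",
                "function": {"name": function_call["name"]},
            }
    return out


def cap_tool_calls(tool_calls: list, parallel_tool_calls: bool = True) -> list:
    """Keep only the first call when the client sent parallel_tool_calls=false."""
    return list(tool_calls) if parallel_tool_calls else list(tool_calls[:1])
```

Giving explicit `tools` precedence over legacy fields is a design assumption of this sketch; the notes do not say how the server resolves a request that sends both.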
Tooling
- `feat(pr_validate): integrate Google eng-practices code-review tiering` (#474)
- `docs(benchmarks): add community DFlash bench for Qwen3.6-35B-A3B-8bit on M3 Ultra` (#473)
Install
```bash
brew upgrade rapid-mlx
# or
pip install -U rapid-mlx==0.6.68
```