
v0.6.68


@raullenchai released this 28 May 15:54

v0.6.68 — chat launch polish + MTP guard

First release since v0.6.66. 0.6.67 was skipped because its squash-commit subject didn't match the auto-release regex.

Highlights

  • 🛡️ MTP injection no longer crashes on hybrid VLM models (#477, #483). rapid-mlx serve <Qwen3.6-VL-MTP-model> --enable-mtp --force-spec-decode previously crashed at two points: first on the outer VLM args lookup, then on the missing hybrid Gated-DeltaNet _step. Both paths now fail cleanly with a single warning each, and the request continues without MTP. Proper VLM+MTP support is tracked as a follow-up.

  • 💬 rapid-mlx chat pre-launch UX polish (#482):

    • Auto-bump max_tokens to 4096 when --think is set (empty-answer fix on small reasoning models)
    • Atexit zombie reap for spawned serve subprocesses + SIGTERM handler
    • --port range validator + pre-flight TCP probe
    • rapid-mlx run alias for chat; rapid-mlx ls for cached models
    • Download confirmation gate ([y/N] for ≥10 GiB models, RAPID_MLX_AUTO_PULL=1 to skip)
    • /bye /? REPL aliases; info box truncation; first-launch codex tip
    • New vllm_mlx/_download_gate.py module
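
As a rough illustration of the pre-flight checks listed above, the sketch below shows one way a --port range validator, a TCP probe, and a size-based download gate could be written. It is not the code shipped in #482: the function names (validate_port, port_in_use, needs_confirmation), the 1..65535 range, the loopback probe via connect_ex, and the exact handling of RAPID_MLX_AUTO_PULL are all assumptions made for this example. Only the 10 GiB threshold and the variable name come from the notes.

```python
import os
import socket

# Threshold taken from the release notes (>= 10 GiB asks for confirmation).
TEN_GIB = 10 * 1024**3


def validate_port(value: str) -> int:
    """Parse a --port value and require it to be in 1..65535 (assumed range)."""
    try:
        port = int(value)
    except ValueError:
        raise ValueError(f"port must be an integer, got {value!r}") from None
    if not 1 <= port <= 65535:
        raise ValueError(f"port must be in 1..65535, got {port}")
    return port


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Hypothetical pre-flight probe: True if something already accepts TCP on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


def needs_confirmation(size_bytes: int, env=None) -> bool:
    """Hypothetical gate: ask [y/N] for large models unless auto-pull is enabled."""
    env = os.environ if env is None else env
    if env.get("RAPID_MLX_AUTO_PULL") == "1":
        return False
    return size_bytes >= TEN_GIB
```

Probing before spawning the server lets the CLI report a busy port up front instead of surfacing a bind error from the subprocess later.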

Bug fixes

  • fix(api): honor max_completion_tokens on chat completions (#459)
  • fix(api): honor parallel_tool_calls=false by capping response to 1 call (#464)
  • fix(api): honor legacy functions/function_call by normalizing to tools (#465)
  • fix(anthropic): forward stop_sequences to engine on /v1/messages (#462)
  • fix(chat): preserve channel split on logprobs non-stream path (#460)
  • fix(routes): reject audio_url on text-only models (mirror image/video gate) (#466)
  • fix(mllm): propagate VLM image fetch errors to HTTP 400 (#458)
  • fix(engine): propagate per-token logprobs through OutputRouter (#456)
  • fix(streaming): populate reasoning_tokens in usage chunk for OutputRouter models (#454)
  • fix(usage): proportional reasoning_tokens split when content non-empty (#453)
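
To make the three request-compatibility fixes (#459, #464, #465) concrete, here is a minimal sketch of the kind of normalization they describe, based on the public OpenAI chat completions request shape. The helper names normalize_chat_request and cap_tool_calls are hypothetical, and the precedence rules (max_completion_tokens wins over max_tokens, explicit tools and tool_choice win over the legacy fields) are assumptions rather than the project's confirmed behavior.

```python
from typing import Any


def normalize_chat_request(body: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the request with legacy/alias fields folded into modern ones."""
    req = dict(body)

    # max_completion_tokens is the newer name; assume it overrides max_tokens.
    completion_cap = req.pop("max_completion_tokens", None)
    if completion_cap is not None:
        req["max_tokens"] = completion_cap

    # Legacy `functions` become `tools` entries of type "function".
    functions = req.pop("functions", None)
    if functions and not req.get("tools"):
        req["tools"] = [{"type": "function", "function": fn} for fn in functions]

    # Legacy `function_call` becomes `tool_choice`.
    function_call = req.pop("function_call", None)
    if function_call is not None and "tool_choice" not in req:
        if isinstance(function_call, str):  # "none" or "auto"
            req["tool_choice"] = function_call
        else:  # {"name": "..."} selects one specific function
            req["tool_choice"] = {
                "type": "function",
                "function": {"name": function_call["name"]},
            }
    return req


def cap_tool_calls(tool_calls: list, parallel_tool_calls: bool = True) -> list:
    """With parallel_tool_calls=False, keep only the first tool call."""
    return tool_calls[:1] if parallel_tool_calls is False else tool_calls
```

Normalizing once at the API boundary means the engine only ever sees tools, tool_choice, and max_tokens, regardless of which field names the client sent.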

Tooling

  • feat(pr_validate): integrate Google eng-practices code-review tiering (#474)
  • docs(benchmarks): add community DFlash bench for Qwen3.6-35B-A3B-8bit on M3 Ultra (#473)

Install

```bash
brew upgrade rapid-mlx

# or

pip install -U rapid-mlx==0.6.68
```