afm v0.9.13
OpenAI-compatible local LLM inference for Apple Silicon (MLX + Apple Foundation Models).
Highlights since v0.9.12 (73 commits)
New models
- cohere2_moe — Cohere North-Mini-Code (30B-A3B MoE). Correct across streaming, non-streaming, prefix-cache, and concurrency (#139).
⚡ Speculative decoding (quality-preserving)
- --mtp: Qwen3.6 self-speculative decoding via the in-model MTP head → ~+52% decode.
- --eagle3 <drafter>: dense Gemma4-31B EAGLE3 drafter → ~+30% decode.
- Both work streaming and non-streaming. Bit-exact to greedy on short generations, near-greedy on long ones.
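To illustrate why greedy speculative decoding can be quality-preserving, here is a minimal toy sketch of the draft-then-verify loop. It is not afm's implementation: `target`, `drafter`, and both decode functions are hypothetical stand-ins, and the toy models are plain deterministic functions rather than neural networks. It only covers the exact-match case and does not model whatever causes the token-level drift on long generations.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # maps a token context to its greedy next token


def target(ctx: List[int]) -> int:
    """Toy 'large' model (hypothetical): a deterministic function of the context."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def drafter(ctx: List[int]) -> int:
    """Toy 'small' model (hypothetical): disagrees with the target on every 4th position."""
    guess = target(ctx)
    return (guess + 1) % 50 if len(ctx) % 4 == 0 else guess


def greedy_decode(model: Model, prompt: List[int], n: int) -> List[int]:
    """Plain greedy decoding: one model call per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out[len(prompt):]


def speculative_decode(target: Model, drafter: Model, prompt: List[int],
                       n: int, k: int = 4) -> List[int]:
    """Draft k tokens, keep the prefix the target agrees with, fix the first miss."""
    out = list(prompt)
    while len(out) - len(prompt) < n:
        # 1. Draft k tokens with the cheap model.
        draft: List[int] = []
        for _ in range(k):
            draft.append(drafter(out + draft))
        # 2. Verify. A real system scores all k positions in one target pass;
        #    here the target is simply called once per position.
        for tok in draft:
            expected = target(out)
            if tok != expected:
                out.append(expected)  # reject: commit the target's own token
                break
            out.append(tok)  # accept: the draft matched the greedy choice
    return out[len(prompt):len(prompt) + n]


prompt = [1, 2, 3]
print(speculative_decode(target, drafter, prompt, 20) == greedy_decode(target, prompt, 20))
```

Every committed token is the target's greedy choice for its context, so in this idealized setting the output matches plain greedy decoding token for token. Any speedup comes from verifying several drafted tokens per target pass.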
APIs & agent-friendliness
- /v1/embeddings on the main server (Apple NaturalLanguage) (#132, #133).
- Mid-stream cancel + /v1/tokenize, /v1/count_tokens, /openapi.json, and /docs (#126).
- vLLM-namespaced /metrics + Grafana dashboard (#122).
- Apple-native Vision OCR and Speech transcription HTTP APIs (thanks @jesserobbins).
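As a rough sketch of calling the new embeddings endpoint from Python: the request body below follows the OpenAI embeddings schema (`model`, `input`), which is an assumption based on afm describing itself as OpenAI-compatible. The base URL, port, and model name are placeholders, not values from these notes, and `build_embeddings_request` and `embed` are hypothetical helpers.

```python
import json
import urllib.request
from typing import List


def build_embeddings_request(base_url: str, text: str, model: str) -> urllib.request.Request:
    """Build a POST to <base_url>/v1/embeddings (OpenAI-style body, assumed)."""
    payload = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(base_url: str, text: str, model: str) -> List[float]:
    """Send the request and return the first embedding vector.

    Assumes an OpenAI-style response: {"data": [{"embedding": [...]}]}.
    """
    req = build_embeddings_request(base_url, text, model)
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.load(resp)
    return body["data"][0]["embedding"]


# Example (requires a running server; URL and model name are placeholders):
# vector = embed("http://localhost:8080", "hello world", "placeholder-model")
```

The server's own /openapi.json and /docs are the authoritative source for the actual request and response fields.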
Performance & platform
- Backported mlx-swift 0.31.3 adaptive-block SDPA → ~+10% decode @16k (pin stays 0.30.3).
- Eager <think>-tag streaming + Metal-kernel prewarm. Swift 6 language mode migration.
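One plausible reading of eager <think>-tag streaming is that reasoning and answer text are forwarded as soon as they are unambiguous, holding back only a trailing fragment that might be the start of a tag. The sketch below shows that general technique in Python. It is not afm's Swift implementation, and `split_think_stream` and `_held_suffix` are hypothetical names.

```python
from typing import Iterable, Iterator, Tuple

OPEN, CLOSE = "<think>", "</think>"


def _held_suffix(buf: str, tag: str) -> int:
    """Length of the longest proper prefix of `tag` that ends `buf`."""
    for n in range(min(len(tag) - 1, len(buf)), 0, -1):
        if buf.endswith(tag[:n]):
            return n
    return 0


def split_think_stream(chunks: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield ("reasoning" | "content", text) pieces without waiting for the closing tag."""
    buf, in_think = "", False
    for chunk in chunks:
        buf += chunk
        while buf:
            tag = CLOSE if in_think else OPEN
            kind = "reasoning" if in_think else "content"
            idx = buf.find(tag)
            if idx >= 0:
                # Full tag found: flush the text before it, then switch mode.
                if idx:
                    yield kind, buf[:idx]
                buf = buf[idx + len(tag):]
                in_think = not in_think
                continue
            # No full tag: emit everything except a possible partial tag at the end.
            hold = _held_suffix(buf, tag)
            emit = buf[:len(buf) - hold]
            if emit:
                yield kind, emit
            buf = buf[len(buf) - hold:]
            break
    if buf:
        # Stream ended mid-tag: flush the leftover as ordinary text.
        yield ("reasoning" if in_think else "content"), buf


pieces = list(split_think_stream(["<thi", "nk>plan", " a</th", "ink>Hi", " there"]))
print(pieces)
# [('reasoning', 'plan'), ('reasoning', ' a'), ('content', 'Hi'), ('content', ' there')]
```

Holding back only the ambiguous suffix keeps latency low while still handling tags that are split across chunk boundaries.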
Fixes
- --no-think / server-default enable_thinking=false now actually disables thinking on reasoning models (was a silent no-op).
- MTP reject path retains the committed token in the KV/GDN cache (fixes garbled output on longer generations).
Known limitations
- --no-think + high --concurrent can corrupt output (#140). Default behavior unaffected; use lower concurrency or omit --no-think.
- MTP is bit-exact to greedy on short generations; longer ones stay greedy-quality but may differ token-for-token.
Install
brew tap scouzi1966/afm && brew install scouzi1966/afm/afm # or: brew upgrade afm
pip install macafm
SHA256 (afm-v0.9.13-arm64.tar.gz): 443bf74650fece15f7ce02663f6d5dd14a7b638c937f80262e426903a6abf42b
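If you download the tarball manually, a small script like the following can check it against the published SHA256. This is a minimal sketch: the local file path is an assumption (the archive sitting in the current directory), and `sha256_of` and `verify` are hypothetical helper names.

```python
import hashlib

# Checksum published in these release notes for afm-v0.9.13-arm64.tar.gz.
EXPECTED = "443bf74650fece15f7ce02663f6d5dd14a7b638c937f80262e426903a6abf42b"


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify(path: str, expected: str = EXPECTED) -> bool:
    """Return True when the file's SHA256 matches the expected hex digest."""
    return sha256_of(path) == expected.strip().lower()


# Example (assumes the archive is in the current directory):
# print(verify("afm-v0.9.13-arm64.tar.gz"))
```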