afm v0.9.13

scouzi1966 released this 21 Jun 19:12

dedad02

OpenAI-compatible local LLM inference for Apple Silicon (MLX + Apple Foundation Models).

Highlights since v0.9.12 (73 commits)

New models

cohere2_moe — Cohere North-Mini-Code (30B-A3B MoE). Correct across streaming, non-streaming, prefix-cache, and concurrency (#139).

⚡ Speculative decoding (quality-preserving)

--mtp — Qwen3.6 self-speculative decoding via the in-model MTP head → ~+52% decode.
--eagle3 <drafter> — dense Gemma4-31B EAGLE3 drafter → ~+30% decode.
Both work streaming and non-streaming. Bit-exact to greedy on short generations, near-greedy on long ones.
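
To illustrate why greedy speculative decoding can be bit-exact, here is a toy sketch (not afm's implementation). The "models" are plain Python functions, and the names `greedy_decode`, `speculative_decode`, and `k` are hypothetical. A real engine would verify the whole draft in one batched forward pass; this version checks tokens one at a time for clarity and does not model whatever causes long generations to drift from greedy.

```python
from typing import Callable, List

# A toy "model": maps the token prefix to its next greedy token.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], n: int) -> List[int]:
    """Reference decoder: ask the target model for every token."""
    out = list(prompt)
    for _ in range(n):
        out.append(target(out))
    return out[len(prompt):]


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n: int, k: int = 4
) -> List[int]:
    """Draft proposes up to k tokens; the target keeps the matching prefix."""
    out = list(prompt)
    produced = 0
    while produced < n:
        # 1. The cheap drafter guesses a short continuation.
        proposal: List[int] = []
        ctx = list(out)
        for _ in range(min(k, n - produced)):
            token = draft(ctx)
            proposal.append(token)
            ctx.append(token)
        # 2. The target checks each guess. On the first mismatch, the
        #    target's own token is committed and the rest is discarded.
        for guess in proposal:
            expected = target(out)
            out.append(expected)
            produced += 1
            if expected != guess:
                break
    return out[len(prompt):]
```

Because every committed token is the target's own greedy choice, the output of this sketch matches plain greedy decoding regardless of how often the drafter is wrong; the drafter only affects how many target steps can be batched.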

APIs & agent-friendliness

/v1/embeddings on the main server (Apple NaturalLanguage) (#132, #133).
Mid-stream cancel + /v1/tokenize / /v1/count_tokens + /openapi.json & /docs (#126).
vLLM-namespaced /metrics + Grafana dashboard (#122).
Apple-native Vision OCR and Speech transcription HTTP APIs (thanks @jesserobbins).
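
As a rough sketch of calling /v1/embeddings, the snippet below assumes the endpoint follows the OpenAI embeddings request and response shape (a JSON body with `model` and `input`, and a `data` list of `{index, embedding}` items). The base URL, port, and model name are placeholders, not values confirmed by these notes; check /docs or /openapi.json on your server for the actual schema.

```python
import json
import math
import urllib.request
from typing import List

BASE_URL = "http://localhost:9999"  # assumption: adjust to your afm host/port


def build_embeddings_request(
    texts: List[str], model: str = "placeholder-model", base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style embeddings POST request."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings(response_json: dict) -> List[List[float]]:
    """Return vectors in input order, assuming the OpenAI response shape."""
    items = sorted(response_json["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def fetch_embeddings(texts: List[str]) -> List[List[float]]:
    """Send the request to a running server and return the vectors."""
    with urllib.request.urlopen(build_embeddings_request(texts)) as resp:
        return parse_embeddings(json.load(resp))
```

With a server running, `cosine(*fetch_embeddings(["hello", "hi there"]))` would give a similarity score for the two inputs.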

Performance & platform

Backported the adaptive-block SDPA from mlx-swift 0.31.3 → ~+10% decode at 16k context (the mlx-swift pin stays at 0.30.3).
Eager <think>-tag streaming + Metal-kernel prewarm. Swift 6 language mode migration.

Fixes

--no-think / server-default enable_thinking=false now actually disables thinking on reasoning models (was a silent no-op).
MTP reject path retains the committed token in the KV/GDN cache (fixes garbled output on longer generations).

Known limitations

--no-think + high --concurrent can corrupt output (#140). Default behavior unaffected; use lower concurrency or omit --no-think.
MTP is bit-exact to greedy on short generations; longer ones stay greedy-quality but may differ token-for-token.

Install

brew tap scouzi1966/afm && brew install scouzi1966/afm/afm   # or: brew upgrade afm
pip install macafm

SHA256 (afm-v0.9.13-arm64.tar.gz): 443bf74650fece15f7ce02663f6d5dd14a7b638c937f80262e426903a6abf42b
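
If you download the tarball manually, one way to check it against the published digest is sketched below. The helper names and the local file path are illustrative; only the expected hash comes from these notes.

```python
import hashlib
import hmac

# Digest published above for afm-v0.9.13-arm64.tar.gz.
EXPECTED_SHA256 = "443bf74650fece15f7ce02663f6d5dd14a7b638c937f80262e426903a6abf42b"


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large archives are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Compare the file's digest with the expected hex string (case-insensitive)."""
    return hmac.compare_digest(sha256_of_file(path), expected.lower())
```

For example, `matches_expected("afm-v0.9.13-arm64.tar.gz")` should return `True` for an intact download sitting in the current directory.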

Contributors

jesserobbins and 16k
