afm v0.9.13

@scouzi1966 released this 21 Jun 19:12

OpenAI-compatible local LLM inference for Apple Silicon (MLX + Apple Foundation Models).

Highlights since v0.9.12 (73 commits)

New models

  • cohere2_moe — Cohere North-Mini-Code (30B-A3B MoE). Correct across streaming, non-streaming, prefix-cache, and concurrency (#139).

⚡ Speculative decoding (quality-preserving)

  • --mtp — Qwen3.6 self-speculative decoding via the in-model MTP head → ~+52% decode.
  • --eagle3 <drafter> — dense Gemma4-31B EAGLE3 drafter → ~+30% decode.
  • Both work streaming and non-streaming. Bit-exact to greedy on short generations, near-greedy on long ones.
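Why greedy speculative decoding can be output-preserving is easiest to see in a toy model. The sketch below is a generic illustration, not afm's MTP or EAGLE3 implementation: `target` and `draft` are hypothetical stand-in functions that map a token sequence to the next token, and the verification loop calls the target one position at a time where a real engine would score all drafted positions in a single batched forward pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # hypothetical model interface: sequence -> next token


def greedy_decode(target: NextToken, prompt: List[int], n: int) -> List[int]:
    """Plain greedy decoding with the target model only."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(target(seq))
    return seq[len(prompt):]


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: draft k tokens, keep the prefix the target agrees with."""
    seq = list(prompt)
    out: List[int] = []
    while len(out) < n:
        # 1. The cheap drafter proposes k tokens autoregressively.
        ctx = list(seq)
        proposal = []
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The target verifies each proposed position in order.
        for tok in proposal:
            expected = target(seq)
            # Always commit the target's token: on a match it equals the draft,
            # on a mismatch it is the correction and the rest of the draft is dropped.
            seq.append(expected)
            out.append(expected)
            if expected != tok or len(out) >= n:
                break
    return out


# Toy models: the drafter agrees with the target except after token 5.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] * 3 + 1) % 7


def toy_draft(seq: List[int]) -> int:
    return 0 if seq[-1] == 5 else toy_target(seq)
```

Because every committed token comes from the target's own greedy choice, the toy output matches plain greedy decoding exactly. Real engines can still drift on long generations, for example through numerical differences between batched and single-token passes, which is consistent with the "near-greedy on long ones" note above.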

APIs & agent-friendliness

  • /v1/embeddings on the main server (Apple NaturalLanguage) (#132, #133).
  • Mid-stream cancel, /v1/tokenize and /v1/count_tokens, plus /openapi.json and /docs (#126).
  • vLLM-namespaced /metrics + Grafana dashboard (#122).
  • Apple-native Vision OCR and Speech transcription HTTP APIs (thanks @jesserobbins).
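As a rough sketch of calling the new /v1/embeddings endpoint from Python, the helpers below build the request and parse the reply. They assume the endpoint follows the OpenAI embeddings schema (a `model` and `input` in the request, a `data` list of `{index, embedding}` objects in the response); the base URL, port, and model name are placeholders, not values confirmed by these notes, so check /docs or /openapi.json on a running server for the real schema.

```python
import json
import urllib.request
from typing import Dict, List

BASE_URL = "http://localhost:9999"   # assumption: adjust to wherever your afm server listens
MODEL = "placeholder-model"          # hypothetical model name, not taken from the release notes


def build_embeddings_request(texts: List[str], base_url: str = BASE_URL,
                             model: str = MODEL) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style embeddings POST request."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings(response_json: Dict) -> List[List[float]]:
    """Return embeddings ordered by their 'index' field (assumed OpenAI-style response)."""
    items = sorted(response_json["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```

To send the request against a running server, pass it to `urllib.request.urlopen` and feed the decoded JSON body to `parse_embeddings`.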

Performance & platform

  • Backported mlx-swift 0.31.3 adaptive-block SDPA → ~+10% decode @16k (pin stays 0.30.3).
  • Eager <think>-tag streaming + Metal-kernel prewarm. Swift 6 language mode migration.

Fixes

  • --no-think / server-default enable_thinking=false now actually disables thinking on reasoning models (was a silent no-op).
  • MTP reject path retains the committed token in the KV/GDN cache (fixes garbled output on longer generations).

Known limitations

  • --no-think + high --concurrent can corrupt output (#140). Default behavior unaffected; use lower concurrency or omit --no-think.
  • MTP is bit-exact to greedy on short generations; longer ones stay greedy-quality but may differ token-for-token.
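To check how far an MTP run stays bit-exact against a plain greedy run on your own prompts, one option is to compare the two token-ID lists and report the first position where they differ. The helper below is a small hypothetical utility, not part of afm; how you obtain the token IDs (for example via /v1/tokenize) is left to the caller.

```python
from itertools import zip_longest
from typing import Optional, Sequence

_MISSING = object()  # sentinel so a shorter sequence counts as diverging at its end


def first_divergence(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if the sequences are identical."""
    for i, (x, y) in enumerate(zip_longest(a, b, fillvalue=_MISSING)):
        if x != y:
            return i
    return None
```

A result of `None` means the two generations are token-for-token identical; any integer is the length of the shared prefix.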

Install

brew tap scouzi1966/afm && brew install scouzi1966/afm/afm   # or: brew upgrade afm
pip install macafm

SHA256 (afm-v0.9.13-arm64.tar.gz): 443bf74650fece15f7ce02663f6d5dd14a7b638c937f80262e426903a6abf42b
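
If you download the tarball manually, the published digest can be checked before unpacking. A minimal sketch using only the Python standard library; the local file path is an assumption about where you saved the archive.

```python
import hashlib

# Digest published in these release notes for afm-v0.9.13-arm64.tar.gz.
EXPECTED_SHA256 = "443bf74650fece15f7ce02663f6d5dd14a7b638c937f80262e426903a6abf42b"


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Return True when the file's SHA-256 matches the expected hex digest."""
    return sha256_of_file(path) == expected.lower()
```

For example, `verify("afm-v0.9.13-arm64.tar.gz")` should return `True` for an intact download saved in the current directory.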