Self-hosted Claude Code on a Mac Studio with vllm-mlx — full recipe #534

vinayvobbili · 2026-05-14T17:54:42Z

vinayvobbili
May 14, 2026

Updates — 2026-05-16 ⚙️

The recipe has shifted since this post went up. Summary of what's changed and where to look inline:

§2 model choice → swapped to Qwen3-Coder-30B-A3B-Instruct-8bit. Native 256K context (no YaRN/linear hacks), parser qwen3_coder (alias of qwen3_xml), enable_thinking still false because filename contains "Coder". Same prefix-stability properties verified under the multi-turn fixture; cache snapshot drops from ~67 MB (dense 32B) to ~25 MB (A3B MoE) for the same prompt.

§3 streaming buffer → the qwen3_coder parser actually emits typed tool_calls deltas natively (verified via direct SSE probe). The shim's buffer is now preserved as the proven path, not a hard requirement — could be dropped to enable true streaming on the tool-call route.

§6 prefix-KV cache → feat: extend system-prompt KV cache to pure-LLM stream_chat path #523's single-slot snapshot now extended to a 4-slot LRU keyed by system-prefix hash. Driver: Claude Code's main-agent (~28K tokens) + sub-agent (~6–7K tokens) dispatch was thrashing the single slot. Follow-up PR feat: multi-slot LRU for system-KV cache + hit-ratio counters (follow-up to #523) #541 just opened; measured ~170× TTFT speedup on warm prefixes in A,A,B,A,B traffic (cold 144s → warm 0.85s). Capacity tunable via VLLM_MLX_SYSTEM_KV_SLOTS env var; =1 reverts to feat: extend system-prompt KV cache to pure-LLM stream_chat path #523 behavior.

§Open / known-bad → context-length section is obsolete. Qwen3-Coder-8bit is native 256K; the recipe now caps at 128K via --max-request-tokens 131072 (chose 128K over 256K for snapshot-RAM headroom on a 96 GB Mac).

Two client-side defaults discovered after the original post:

CLAUDE_CODE_DISABLE_1M_CONTEXT=1 — Claude Code v2.1.x sends the context-1m-2025-08-07 beta header by default, which (a) misleads users when the backend caps below 1M and (b) shows up as a [1m] tag on the model name in the status line. Sixth env var to add to §"the five env vars". Documented in the Claude Code changelog as a real off-switch, not a hack.

--default-temperature 0.3 on the vllm-mlx CLI. Claude Code sends temperature=None, so without a server-side default the model would run unbounded tool cascades on simple greetings (8+ tool rounds investigating "Good morning"). The 0.3 default keeps cascades terminating at ~4 rounds with the model offering a clarifying question instead.

Rest of the post (shim architecture, ccr wiring, billing-header strip, direct-bind + bearer auth) is unchanged. — @vinayvobbili

Following up on #521 — here's the end-to-end recipe for running Claude Code (the CLI agent, not the chat UI) against vllm-mlx on a Mac. Sharing because the inference-server side is well covered in this repo's docs but the rest of the chain (Anthropic Messages API translation, model-pick rationale, prefix-cache hygiene) tends to live in scattered gists.

What runs where

Mac Studio (96 GB unified memory): vllm-mlx serving mlx-community/Qwen2.5-Coder-32B-Instruct-8bit on port 8003, bound to 0.0.0.0 with --api-key. --tool-call-parser hermes, no reasoning parser, ~34 GB resident.
Linux box on the same LAN: two small services in front of the Mac:
- claude-code-router (ccr, port 8050) — Anthropic Messages API ↔ OpenAI translation, the existing OSS project.
- A 60-line shim (port 8051) — exposes /v1/models with claude-* aliases so vanilla claude clients accept it as an Anthropic endpoint, proxies /v1/messages to ccr, and buffers SSE for the tool-call path (more on that below).
Laptop / workstation: claude CLI configured with five env vars pointing at the shim. That's it on the client.

The five env vars (no surprises if you've configured Claude Code against any non-Anthropic backend before):

export ANTHROPIC_BASE_URL=http://shim-host:8051
export ANTHROPIC_AUTH_TOKEN=<bearer>
export ANTHROPIC_MODEL=claude-opus-4-7
export ANTHROPIC_SMALL_FAST_MODEL=claude-haiku-4-5
export DISABLE_NONESSENTIAL_TRAFFIC=1

Six things that aren't obvious

1. Why vllm-mlx (vs Ollama / LM Studio / native scripts)

OpenAI-compatible API with real tool-call parsing for Qwen / Hermes / GLM / DeepSeek / Llama families, configurable per-model. Ollama's tool-call story is a moving target and historically didn't preserve the message structure Claude Code expects; LM Studio's API is fine but doesn't expose the parser knobs. vllm-mlx's --tool-call-parser flag is the cleanest path I've found for getting tool-using agents to behave on Apple Silicon.

2. Why Qwen2.5-Coder-32B over GLM-4.7-Flash / Qwen3-32B for the coding agent

Tested all three at 8-bit on the same hardware. Two reasons Qwen2.5-Coder won:

Parser maturity. GLM and Qwen3 use thinking-mode templates with <think> blocks and enable_thinking — great for chat, awkward for an agent that re-renders the same tool list every turn (the think suffix gets re-emitted on intermediate messages, which doesn't compose with prefix caching). Qwen2.5-Coder has none of that.
Template stability across turns. Verified by rendering the chat template under a multi-turn fixture (pure chat / +tools / single tool-call / parallel tool-calls / tool-response-as-final). Every previous-turn render is a strict prefix of the next-turn render — meaning the KV cache from turn N is reusable on turn N+1 without invalidation. Qwen3 and the GLM thinking variants don't satisfy this; the rendered prefix shifts at the boundary in subtle ways (last_query_index, retroactive-rewrite branches, etc.).

If you only do single-turn completions, model choice matters less. If you're running an agent with long tool-rich contexts, prefix-cache friendliness is the lever.

3. The shim's streaming buffer (a vllm-mlx tool-parser quirk worth knowing)

vllm-mlx's tool-call parsers emit the tool-call JSON as content deltas in the streaming response, then re-tag the buffered text as a tool call at the end. This is correct for OpenAI's streaming spec, but the Anthropic Messages SSE protocol expects tool calls in their own typed events (content_block_start with type: tool_use). If you proxy ccr's stream straight through, Claude Code sees the tool call as text mid-stream and either renders the JSON to the user or breaks the conversation flow.

The fix in the shim: detect the alias path, buffer the upstream stream until completion, then synthesize a clean Anthropic SSE sequence (message_start → content_block_start for text → text deltas → content_block_stop → optional content_block_start for tool_use → input_json deltas → content_block_stop → message_stop). ~80 lines. Worth the latency cost (you lose token-by-token streaming on the tool-call path) for correctness.

4. Strip the `x-anthropic-billing-header` (cited from #277)

Claude Code rotates a header like x-anthropic-billing-header: cch=<digits> per turn. If your shim or ccr happens to fold any request header into the cache key (some do, for "isolation"), this rotates the cache key on every turn and busts the prefix cache. PR #277 strips it on /v1/messages only — measured 13–15× speedup on turn-2 latency in this stack once it landed. If you're rolling your own shim, drop the header on the way in.

5. Direct-bind + bearer auth instead of an SSH tunnel

Old setup tunneled ssh -R 8027:localhost:8003 from the Mac into the Linux box and dialed localhost:8027 from ccr. New setup: bind vllm-mlx to 0.0.0.0 on the Mac with --api-key <bearer>, keep the Mac on a private LAN, dial <mac-host>:8003 from ccr with the bearer. No tunnel session to babysit, no reconnect storms when the Mac suspends. The launchctl plist owns the service end-to-end.

<key>ProgramArguments</key>
<array>
  <string>/Users/<you>/.venvs/vllm-mlx/bin/python</string>
  <string>-m</string><string>vllm_mlx.server</string>
  <string>--model</string><string>mlx-community/Qwen2.5-Coder-32B-Instruct-8bit</string>
  <string>--host</string><string>0.0.0.0</string>
  <string>--port</string><string>8003</string>
  <string>--tool-call-parser</string><string>hermes</string>
  <string>--api-key</string><string>$BEARER</string>
</array>
<key>RunAtLoad</key><true/>
<key>KeepAlive</key><true/>

6. The prefix-KV cache (cited from #523)

By default the stream_chat path on the pure-LLM engine re-prefills the full system+tools prefix every turn — for a long Claude Code system prompt that's ~23K tokens of redundant prefill on turn 2+. PR #523 extends the existing MLLM-path system-prompt KV cache (single-slot, hash-keyed, snapshot/restore) into the stream_chat path. Detection is by probe-divergence (render the template with two different user contents, take the shared prefix) so it's model-agnostic. Falls back to uncached generation on any mismatch.

Measured locally on this stack:

Turn 1: ~100 s (full prefill) — same as before.
Turn 2 (no cache): ~100 s — same.
Turn 2 (with cache): ~7 s — about 14× on the prefill-bound part.

Numbers are bigger when the system prompt is bigger. Claude Code's tool-rich system prompt is around the upper end of where this matters.

Putting it together

End-to-end happy path on turn 2 of a Claude Code conversation:

claude sends Anthropic-formatted request to shim:8051.
Shim drops x-anthropic-billing-header, forwards to ccr:8050.
ccr translates to OpenAI format, sends to mac-host:8003.
vllm-mlx hashes the system prefix, hits cache, restores KV, prefills only the suffix, streams.
ccr translates SSE chunks to Anthropic events; shim buffers tool-call payloads and re-emits clean Anthropic SSE.
Claude Code consumes the stream as if it came from api.anthropic.com.

Open / known-bad

Context length on Qwen2.5-Coder-8bit caps at 32K in the current mlx-lm. YaRN extension crashes on first decode; linear 2x cratered quality even on small prompts. Watching mlx-lm for an upstream fix; for now, long contexts spill back to the cloud model in router config.
No streaming on the tool-call path because of the shim's buffer (point 3). Acceptable trade-off for correctness; might be reworkable as a streaming state machine if anyone wants to take it on.

Happy to expand on any piece. Code refs to ccr, the shim, the launchctl plist, and the multi-turn template-stability test are reproducible from this writeup if anyone wants to replicate.

— @vinayvobbili

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Self-hosted Claude Code on a Mac Studio with vllm-mlx — full recipe #534

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 0 comments

Select a reply

Uh oh!

Self-hosted Claude Code on a Mac Studio with vllm-mlx — full recipe #534

Uh oh!

Uh oh!

vinayvobbili May 14, 2026

What runs where

Six things that aren't obvious

1. Why vllm-mlx (vs Ollama / LM Studio / native scripts)

2. Why Qwen2.5-Coder-32B over GLM-4.7-Flash / Qwen3-32B for the coding agent

3. The shim's streaming buffer (a vllm-mlx tool-parser quirk worth knowing)

4. Strip the x-anthropic-billing-header (cited from #277)

5. Direct-bind + bearer auth instead of an SSH tunnel

6. The prefix-KV cache (cited from #523)

Putting it together

Open / known-bad

Replies: 0 comments

vinayvobbili
May 14, 2026

4. Strip the `x-anthropic-billing-header` (cited from #277)