Self-hosted Claude Code on a Mac Studio with vllm-mlx — full recipe #534
vinayvobbili
started this conversation in
Show and tell
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
Following up on #521 — here's the end-to-end recipe for running Claude Code (the CLI agent, not the chat UI) against vllm-mlx on a Mac. Sharing because the inference-server side is well covered in this repo's docs but the rest of the chain (Anthropic Messages API translation, model-pick rationale, prefix-cache hygiene) tends to live in scattered gists.
What runs where
mlx-community/Qwen2.5-Coder-32B-Instruct-8biton port 8003, bound to0.0.0.0with--api-key.--tool-call-parser hermes, no reasoning parser, ~34 GB resident.claude-code-router(ccr, port 8050) — Anthropic Messages API ↔ OpenAI translation, the existing OSS project./v1/modelswithclaude-*aliases so vanillaclaudeclients accept it as an Anthropic endpoint, proxies/v1/messagesto ccr, and buffers SSE for the tool-call path (more on that below).claudeCLI configured with five env vars pointing at the shim. That's it on the client.The five env vars (no surprises if you've configured Claude Code against any non-Anthropic backend before):
Six things that aren't obvious
1. Why vllm-mlx (vs Ollama / LM Studio / native scripts)
OpenAI-compatible API with real tool-call parsing for Qwen / Hermes / GLM / DeepSeek / Llama families, configurable per-model. Ollama's tool-call story is a moving target and historically didn't preserve the message structure Claude Code expects; LM Studio's API is fine but doesn't expose the parser knobs. vllm-mlx's
--tool-call-parserflag is the cleanest path I've found for getting tool-using agents to behave on Apple Silicon.2. Why Qwen2.5-Coder-32B over GLM-4.7-Flash / Qwen3-32B for the coding agent
Tested all three at 8-bit on the same hardware. Two reasons Qwen2.5-Coder won:
<think>blocks andenable_thinking— great for chat, awkward for an agent that re-renders the same tool list every turn (the think suffix gets re-emitted on intermediate messages, which doesn't compose with prefix caching). Qwen2.5-Coder has none of that.last_query_index, retroactive-rewrite branches, etc.).If you only do single-turn completions, model choice matters less. If you're running an agent with long tool-rich contexts, prefix-cache friendliness is the lever.
3. The shim's streaming buffer (a vllm-mlx tool-parser quirk worth knowing)
vllm-mlx's tool-call parsers emit the tool-call JSON as content deltas in the streaming response, then re-tag the buffered text as a tool call at the end. This is correct for OpenAI's streaming spec, but the Anthropic Messages SSE protocol expects tool calls in their own typed events (
content_block_startwithtype: tool_use). If you proxy ccr's stream straight through, Claude Code sees the tool call as text mid-stream and either renders the JSON to the user or breaks the conversation flow.The fix in the shim: detect the alias path, buffer the upstream stream until completion, then synthesize a clean Anthropic SSE sequence (
message_start→content_block_startfor text → text deltas →content_block_stop→ optionalcontent_block_startfortool_use→ input_json deltas →content_block_stop→message_stop). ~80 lines. Worth the latency cost (you lose token-by-token streaming on the tool-call path) for correctness.4. Strip the
x-anthropic-billing-header(cited from #277)Claude Code rotates a header like
x-anthropic-billing-header: cch=<digits>per turn. If your shim or ccr happens to fold any request header into the cache key (some do, for "isolation"), this rotates the cache key on every turn and busts the prefix cache. PR #277 strips it on/v1/messagesonly — measured 13–15× speedup on turn-2 latency in this stack once it landed. If you're rolling your own shim, drop the header on the way in.5. Direct-bind + bearer auth instead of an SSH tunnel
Old setup tunneled
ssh -R 8027:localhost:8003from the Mac into the Linux box and dialedlocalhost:8027from ccr. New setup: bind vllm-mlx to0.0.0.0on the Mac with--api-key <bearer>, keep the Mac on a private LAN, dial<mac-host>:8003from ccr with the bearer. No tunnel session to babysit, no reconnect storms when the Mac suspends. The launchctl plist owns the service end-to-end.6. The prefix-KV cache (cited from #523)
By default the
stream_chatpath on the pure-LLM engine re-prefills the full system+tools prefix every turn — for a long Claude Code system prompt that's ~23K tokens of redundant prefill on turn 2+. PR #523 extends the existing MLLM-path system-prompt KV cache (single-slot, hash-keyed, snapshot/restore) into thestream_chatpath. Detection is by probe-divergence (render the template with two different user contents, take the shared prefix) so it's model-agnostic. Falls back to uncached generation on any mismatch.Measured locally on this stack:
Numbers are bigger when the system prompt is bigger. Claude Code's tool-rich system prompt is around the upper end of where this matters.
Putting it together
End-to-end happy path on turn 2 of a Claude Code conversation:
claudesends Anthropic-formatted request to shim:8051.x-anthropic-billing-header, forwards to ccr:8050.api.anthropic.com.Open / known-bad
Happy to expand on any piece. Code refs to ccr, the shim, the launchctl plist, and the multi-turn template-stability test are reproducible from this writeup if anyone wants to replicate.
— @vinayvobbili
Beta Was this translation helpful? Give feedback.
All reactions