langchain-failover — keeping a two-Mac vllm-mlx setup online across host blips (tool-calling intact) #588
vinayvobbili started this conversation in Show and tell
Follow-on to my self-hosted recipe in #534. That post runs the whole thing on one Mac; this is the piece that keeps it alive when one Mac isn't enough. ⚙️
The problem. A single vllm-mlx box is a single point of failure. When it goes down (OOM, a model swap, or just the macOS memory compressor freezing a cold ~30 GB process overnight), every LangChain call that depends on it stalls. The obvious fix is a second box with a backup model and client-side failover. (It doesn't have to be the same model: mine runs GLM-4.7-Flash as primary and a Qwen3 model as the fallback; the wrapper just needs two chat models.) But every off-the-shelf LangChain option I tried (`.with_fallbacks()` and friends) had two issues for an agent workload:
… >=1.4, where `BaseChatModel.bind_tools` raises by default), naive wrappers break at bind time, and an agent that loses its tools mid-loop is useless.
What I built.
`FailoverChatModel`, a small wrapper around two chat models:
- `bind_tools` binds both legs and returns another `FailoverChatModel`, so tool-calling keeps working across a failover.
- … `__cause__`/`__context__` chain), so a bad prompt still raises instead of silently retrying on a second endpoint.
- For `stream()`, it only fails over before the first token, so you never get duplicated half-streamed output.
It's provider-agnostic (any OpenAI-compatible endpoint), but it's a natural fit here because the convenience constructor auto-discovers the served model from `/models`. You can also wrap your own `ChatOpenAI` instances and bind tools as usual.
One honest caveat: a different backup model still has to satisfy your tool and output-schema expectations. Failover keeps the endpoint alive; it doesn't guarantee the backup reasons or formats identically. Worth a quick check that your prompts behave on both legs.
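For anyone who wants the shape of the idea without reading the repo, here is a minimal stdlib-only sketch of the three behaviors above: tools bound on both legs, failover only on connection-level errors found by walking the `__cause__`/`__context__` chain, and streaming failover only before the first token. It is not the actual langchain-failover code. `StubModel`, `FailoverSketch`, `is_connection_failure`, and `CONNECTION_ERRORS` are hypothetical names, and treating `ConnectionError`/`TimeoutError` as the failover triggers is an assumption.

```python
# Assumption: these exception types mean "the host is unreachable".
CONNECTION_ERRORS = (ConnectionError, TimeoutError)


def is_connection_failure(exc):
    """Walk the __cause__/__context__ chain looking for a connection-level error."""
    seen, stack = set(), [exc]
    while stack:
        err = stack.pop()
        if err is None or id(err) in seen:
            continue
        seen.add(id(err))
        if isinstance(err, CONNECTION_ERRORS):
            return True
        stack.extend([err.__cause__, err.__context__])
    return False


class StubModel:
    """Stand-in for a chat model (hypothetical, not a LangChain class)."""

    def __init__(self, name, error=None, fail_after=None, tools=()):
        self.name = name
        self.error = error            # raised on every call when set
        self.fail_after = fail_after  # drop the connection after N streamed tokens
        self.tools = tuple(tools)

    def bind_tools(self, tools):
        return StubModel(self.name, self.error, self.fail_after, tools)

    def invoke(self, prompt):
        if self.error is not None:
            raise self.error
        return f"{self.name}:{prompt}"

    def stream(self, prompt):
        if self.error is not None:
            raise self.error
        for i, token in enumerate(prompt.split()):
            if self.fail_after is not None and i >= self.fail_after:
                raise ConnectionError(f"{self.name} dropped mid-stream")
            yield f"{self.name}:{token}"


class FailoverSketch:
    """Two-leg wrapper: try the primary, use the fallback on connection failures."""

    def __init__(self, primary, fallback):
        self.primary = primary
        self.fallback = fallback

    def bind_tools(self, tools):
        # Bind on BOTH legs so the fallback can still call tools.
        return FailoverSketch(self.primary.bind_tools(tools),
                              self.fallback.bind_tools(tools))

    def invoke(self, prompt):
        try:
            return self.primary.invoke(prompt)
        except Exception as exc:
            if not is_connection_failure(exc):
                raise  # e.g. a bad prompt: surface it, don't retry elsewhere
            return self.fallback.invoke(prompt)

    def stream(self, prompt):
        tokens = iter(self.primary.stream(prompt))
        use_fallback = False
        try:
            first = next(tokens)  # failover is only allowed up to this point
        except StopIteration:
            return
        except Exception as exc:
            if not is_connection_failure(exc):
                raise
            use_fallback = True
        if use_fallback:
            yield from self.fallback.stream(prompt)
            return
        yield first
        yield from tokens  # errors after the first token propagate to the caller


# Usage: the primary is unreachable, so the tool-bound fallback answers.
model = FailoverSketch(
    StubModel("primary", error=ConnectionError("connection refused")),
    StubModel("backup"),
).bind_tools(["search"])
reply = model.invoke("hello")
```

The mid-stream case is deliberately left to raise: once a token has reached the caller, restarting on the other leg would replay the beginning of the answer.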
Where it came from. Extracted from the failover layer behind a multi-agent AI SOC I run on local LLMs — the same two-box vllm-mlx pattern from #534, scaled to keep a 24×7 agent cascade online through host blips. Full context/writeup: https://vinayvobbili.github.io/posts/building-soc-in-a-box/
Code + tests: https://github.com/vinayvobbili/langchain-failover
Sharing in case anyone else here runs vllm-mlx across two Macs and wants HA without losing tool-calling. Feedback and PRs welcome 🙏