langchain-failover — keeping a two-Mac vllm-mlx setup online across host blips (tool-calling intact) #588

vinayvobbili · 2026-05-30T22:35:32Z


Follow-on to my self-hosted recipe in #534. That post runs the whole thing on one Mac; this is the piece that keeps it alive when one Mac isn't enough. ⚙️

The problem. A single vllm-mlx box is a single point of failure. When it goes down — OOM, a model swap, or just the macOS memory compressor freezing a cold ~30 GB process overnight — every LangChain call that depends on it stalls. The obvious fix is a second box with a backup model and a client-side failover. (It doesn't have to be the same model — mine runs GLM-4.7-Flash as primary and a Qwen3 model as the fallback; the wrapper just needs two chat models.) But every off-the-shelf LangChain option I tried (.with_fallbacks() and friends) had two issues for an agent workload:

Tool-calling didn't survive the failover. With strict langchain-core (>=1.4, where BaseChatModel.bind_tools raises by default), naive wrappers break at bind time — and an agent that loses its tools mid-loop is useless.
Stateless retry. They retry the (still-dead) primary on every call instead of remembering that it is down.

What I built. FailoverChatModel — a small wrapper around two chat models:

bind_tools binds both legs and returns another FailoverChatModel, so tool-calling keeps working across a failover.
Connection-aware — only fails over on connection/network errors (walks the exception __cause__/__context__ chain), so a bad prompt still raises instead of silently retrying on a second endpoint.
Stateful — remembers which leg it's on, logs the flip both ways, and switches back the moment the primary answers again.
Mid-stream safe — during stream(), only fails over before the first token, so you never get duplicated half-streamed output.
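
To make the connection-aware, stateful, and mid-stream-safe behaviours concrete, here is a minimal stdlib-only sketch. It is not the package's code: `TwoLegSketch`, `is_connection_error`, the `retry_after` cooldown, and the choice of `ConnectionError`/`TimeoutError` as the "network error" set are all assumptions made for illustration, and the legs are plain callables that return an iterator of text chunks rather than real chat models.

```python
import time

# Assumption: these built-ins stand in for "connection/network errors".
# Real HTTP clients raise their own exception classes as well.
CONNECTION_ERRORS = (ConnectionError, TimeoutError)


def is_connection_error(exc):
    """Walk the __cause__/__context__ chain looking for a network-level error."""
    seen, stack = set(), [exc]
    while stack:
        current = stack.pop()
        if current is None or id(current) in seen:
            continue
        seen.add(id(current))
        if isinstance(current, CONNECTION_ERRORS):
            return True
        stack.extend((current.__cause__, current.__context__))
    return False


class TwoLegSketch:
    """Hypothetical two-leg wrapper: remembers a dead primary, re-probes later."""

    def __init__(self, primary, secondary, retry_after=30.0, clock=time.monotonic):
        self.primary = primary          # callable: prompt -> iterable of chunks
        self.secondary = secondary
        self.retry_after = retry_after  # assumed cooldown before re-probing
        self.clock = clock
        self.primary_down_since = None  # None means "currently on the primary"

    def _primary_allowed(self):
        if self.primary_down_since is None:
            return True
        return self.clock() - self.primary_down_since >= self.retry_after

    def stream(self, prompt):
        if self._primary_allowed():
            try:
                chunks = iter(self.primary(prompt))
                first = next(chunks)  # failover is only possible up to here
            except StopIteration:
                self.primary_down_since = None  # empty but healthy response
                return
            except Exception as exc:
                if not is_connection_error(exc):
                    raise  # bad prompt, validation error, etc.: surface it
                self.primary_down_since = self.clock()
            else:
                self.primary_down_since = None  # primary answered: switch back
                yield first
                yield from chunks  # mid-stream errors propagate, no replay
                return
        yield from self.secondary(prompt)
```

Once the first chunk has been yielded, any later error propagates to the caller instead of triggering a replay on the other leg, which is what prevents duplicated half-streamed output. The cooldown is only one way to avoid hammering a dead primary; a background health probe would be another.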

It's provider-agnostic (any OpenAI-compatible endpoint), but it's a natural fit here because the convenience constructor auto-discovers the served model from /models:

from langchain_failover import create_failover_llm

llm = create_failover_llm(
    primary_url="http://mac-a:8000/v1",    # vllm-mlx box A — primary model
    secondary_url="http://mac-b:8000/v1",  # vllm-mlx box B — a different backup model is fine
)
# discovers each served model id from /models — no hardcoding, mismatched models OK
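
For readers curious what discovery against `/models` can look like, here is a rough stdlib sketch. The helper names `parse_model_id` and `discover_model_id` are hypothetical, and picking the first listed model is an assumption; the package may choose differently. The response shape (`{"object": "list", "data": [{"id": ...}]}`) is the standard OpenAI-compatible model list.

```python
import json
from urllib.request import urlopen


def parse_model_id(payload):
    """Return the first model id from an OpenAI-style /models response."""
    models = payload.get("data") or []
    if not models:
        raise ValueError("endpoint reports no served models")
    return models[0]["id"]


def discover_model_id(base_url, timeout=5.0):
    """GET {base_url}/models and pick the first served model (assumed policy)."""
    url = f"{base_url.rstrip('/')}/models"
    with urlopen(url, timeout=timeout) as response:
        return parse_model_id(json.load(response))
```

Calling `discover_model_id("http://mac-a:8000/v1")` would then return whatever id that box reports, so each leg can serve a different model without any hardcoding.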

Or wrap your own ChatOpenAI instances and bind tools as usual:

from langchain_failover import FailoverChatModel
# primary and backup are your own ChatOpenAI instances; my_tools is your usual list of tools
llm = FailoverChatModel(primary=primary, secondary=backup).bind_tools(my_tools)

One honest caveat: a different backup model still has to satisfy your tool and output-schema expectations — failover keeps the endpoint alive, it doesn't guarantee the backup reasons or formats identically. Worth a quick check that your prompts behave on both legs.
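
One way to run that check is to send the same prompt to each leg and compare the resulting tool call against what the agent expects. The sketch below is an illustration under assumptions: `check_tool_call` and `compare_legs` are hypothetical helpers, each leg is a plain callable returning a dict with `name` and `args` keys (the shape LangChain uses for entries in `tool_calls`), and the tool name in the usage note is made up.

```python
def check_tool_call(tool_call, expected_name, required_args):
    """Return a list of problems with one tool call; empty means it looks fine."""
    problems = []
    if tool_call.get("name") != expected_name:
        problems.append(
            f"expected tool {expected_name!r}, got {tool_call.get('name')!r}"
        )
    missing = sorted(set(required_args) - set(tool_call.get("args") or {}))
    if missing:
        problems.append(f"missing args: {missing}")
    return problems


def compare_legs(legs, prompt, expected_name, required_args):
    """Run the same prompt on every leg and collect problems per leg."""
    return {
        leg_name: check_tool_call(call(prompt), expected_name, required_args)
        for leg_name, call in legs.items()
    }
```

With real models, each callable could invoke one leg with tools bound and return the first entry of the response's `tool_calls`; an empty problem list for both legs on a prompt that should trigger, say, a hypothetical `lookup_ip` tool is a reasonable smoke test before relying on the backup.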

Where it came from. Extracted from the failover layer behind a multi-agent AI SOC I run on local LLMs — the same two-box vllm-mlx pattern from #534, scaled to keep a 24×7 agent cascade online through host blips. Full context/writeup: https://vinayvobbili.github.io/posts/building-soc-in-a-box/

pip install langchain-failover

Code + tests: https://github.com/vinayvobbili/langchain-failover

Sharing in case anyone else here runs vllm-mlx across two Macs and wants HA without losing tool-calling. Feedback and PRs welcome 🙏
