Parallel multi-provider CLI dispatcher for Claude Code plugins. Dispatches prompts to codex, gemini, and cursor-agent in parallel, routing each model to its native CLI when possible (separate rate-limit buckets) and falling back to cursor-agent for everything else. Includes a 3-stage peer-review "council" pattern inspired by karpathy/llm-council, adapted for CLI providers with strict JSON parsing.
No API keys required. The library talks to CLI tools (codex, gemini, cursor-agent) which carry their own OAuth flows. There's no OPENAI_API_KEY or ANTHROPIC_API_KEY anywhere.
No Claude Code plugin. This is a Python library with three console scripts. It's designed to be consumed as a dependency by other Python-based Claude Code plugins (e.g. gw, aisci) via uv pip install.
Rate-bucket spreading. When all three CLIs are installed, a 3-way panel dispatches across 3 independent rate buckets: Codex/OpenAI OAuth + Google AI OAuth + Cursor subscription. One hot bucket doesn't throttle the rest.
Via uv from git:

```shell
uv pip install "claude-multi-model @ git+https://github.com/stmailabs/claude-multi-model.git"
```

Or add it as a dependency in a consuming project's `pyproject.toml`:

```toml
dependencies = [
    "claude-multi-model @ git+https://github.com/stmailabs/claude-multi-model.git",
]
```

The library needs at least one of these installed on `$PATH`:
| CLI | Install | Native models | OAuth |
|---|---|---|---|
| `codex` | `npm install -g @openai/codex` | GPT-5.x, o3, o4, Codex family | `codex login` |
| `gemini` | `npm install -g @google/gemini-cli` | Gemini 3.x family | `gemini auth` |
| `cursor-agent` | cursor.com — install Cursor, accept CLI prompt | 85 models across 7 families (Claude, GPT, Gemini, Grok, Kimi, Composer) | Cursor subscription |
Check what you have:
```shell
uv run mm-detect
```

Output:

```
✓ codex        codex-cli 0.118.0
  path: /Users/c/.npm-global/bin/codex
✓ gemini       0.37.0
  path: /Users/c/.npm-global/bin/gemini
✓ cursor-agent 2026.04.08-a41fba1
  path: /Users/c/.local/bin/cursor-agent
```
- GPT / Codex family → `codex exec` (OpenAI OAuth bucket)
- Gemini family → `gemini -p` (Google OAuth bucket)
- Grok / Kimi / Composer / anything else → `cursor-agent` (Cursor bucket)
- Claude family → refused by default. From inside Claude Code, use the Agent tool instead. Opt-in override: pass `--allow-cursor-claude` to route via cursor-agent.
Why Claude models are refused: inside Claude Code, you have the Agent tool primitive that spawns a Claude subagent in-process (no subprocess, no auth, shared session budget, structured return value). Routing Claude through cursor-agent would add a subprocess spawn, a separate rate bucket, and output parsing. If the caller really wants that — maybe to use cursor-agent as a second Claude source for parallelism — they pass --allow-cursor-claude and acknowledge the trade-off.
Why cursor-agent is the fallback for everything else: cursor-agent exposes 85 models from 7 provider families via a single CLI with native JSON output. It's the widest reach for any model we can't route to a direct CLI.
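Taken together, the routing rules above amount to a small substring match on the model ID. A minimal sketch, assuming simple name-based matching (the function name and exact match rules here are illustrative, not the library's actual `routing.py`):

```python
def route_model(model: str, allow_cursor_claude: bool = False) -> str:
    """Map a model ID to the CLI that should serve it (illustrative sketch)."""
    m = model.lower()
    if "claude" in m:
        if not allow_cursor_claude:
            raise ValueError(
                f"{model}: Claude models are refused by default; "
                "use the Agent tool, or opt in to cursor routing"
            )
        return "cursor-agent"  # opt-in override
    if "gpt" in m or "codex" in m or m.startswith(("o3", "o4")):
        return "codex"         # OpenAI OAuth bucket
    if "gemini" in m:
        return "gemini"        # Google OAuth bucket
    return "cursor-agent"      # Cursor bucket (Grok, Kimi, Composer, ...)
```

The key design property is that the fallback branch is last: anything without a dedicated CLI lands on cursor-agent, while Claude is special-cased up front so the refusal fires before any fallback logic.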
```python
from multi_model import dispatch

responses = dispatch(
    prompt="Verify these 30 citations and return JSON: [...]",
    models=["gpt-5.4-high", "gemini-3.1-pro", "grok-4-20-thinking"],
    timeout=300,
)
for r in responses:
    print(f"{r.model} via {r.cli_used}: {r.text[:80]}...")
    # All fields: model, text, cli_used, exit_code, duration_s, error, metadata
```

Dispatch runs the three models in parallel through three different CLIs (and three separate rate buckets) via `concurrent.futures.ThreadPoolExecutor`. Results come back in the same order as the input `models` list.
```python
from multi_model.council import run_council

result = run_council(
    prompt="What's the best approach to verifying citations in a grant proposal?",
    panel=["gpt-5.4-high", "gemini-3.1-pro", "grok-4-20-thinking"],
    chairman="gpt-5.4-high",
)
print(result.stage3_synthesis)   # Chairman's final synthesized answer
print(result.aggregate_ranks)    # {model: avg_peer_rank_position}
print(result.stage2_rankings)    # {reviewer: [model1, model2, ...]}
print(result.stage1_responses)   # {model: text}
```

The council protocol:
- Stage 1 — Parallel Open Answer. All panel models answer the prompt independently.
- Stage 2 — Blind Peer Review. Responses are anonymized as "Response A", "Response B", etc. Each reviewer sees the full anonymized set and ranks them best-to-worst, returning structured JSON. The anonymization prevents models from favoring their own responses.
- Stage 3 — Chairman Synthesis. A designated chairman model receives the de-anonymized stage 1 responses + stage 2 rankings and produces a single final answer.
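The Stage 2 anonymization can be sketched as a simple relabeling pass (the helper name is hypothetical; shown only to illustrate the idea):

```python
import string

def anonymize(responses: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    """Relabel model responses as 'Response A', 'Response B', ...

    Returns (anonymized responses, label -> model legend). The legend is kept
    aside so the chairman can de-anonymize in Stage 3.
    """
    anon: dict[str, str] = {}
    legend: dict[str, str] = {}
    for letter, (model, text) in zip(string.ascii_uppercase, responses.items()):
        label = f"Response {letter}"
        anon[label] = text
        legend[label] = model
    return anon, legend

anon, legend = anonymize({"gpt": "answer 1", "gemini": "answer 2"})
# anon   == {"Response A": "answer 1", "Response B": "answer 2"}
# legend == {"Response A": "gpt", "Response B": "gemini"}
```

Reviewers only ever see `anon`; the legend stays server-side until synthesis, which is what prevents self-favoring.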
Aggregate rankings are computed across all reviewers: each model's average rank position across every reviewer, lower = better peer-perceived quality.
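That aggregation can be sketched as follows, assuming 0-based rank positions (function name and exact convention are illustrative, not necessarily the library's internals):

```python
def aggregate_ranks(rankings: dict[str, list[str]]) -> dict[str, float]:
    """Average each model's rank position across all reviewers (lower = better)."""
    positions: dict[str, list[int]] = {}
    for reviewer, ordered in rankings.items():
        for pos, model in enumerate(ordered):  # 0 = ranked best by this reviewer
            positions.setdefault(model, []).append(pos)
    return {m: sum(p) / len(p) for m, p in positions.items()}

ranks = aggregate_ranks({
    "gpt":    ["gemini", "grok", "gpt"],
    "gemini": ["gemini", "gpt", "grok"],
    "grok":   ["gpt", "gemini", "grok"],
})
# gemini averages (0 + 0 + 1) / 3, the lowest score, so it is the
# best peer-perceived response in this toy example
```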
If you're writing a Claude Code skill that wants to mix Claude workers (via the Agent tool) with external workers (via this library), inject a custom dispatch_fn:
```python
from multi_model.council import run_council, run_stage1, run_stage2, run_stage3
from multi_model.types import Response

def skill_dispatch(prompt: str, models: list[str]) -> list[Response]:
    """Hybrid dispatch: Claude via Agent tool, others via multi_model."""
    from multi_model import dispatch

    claude_models = [m for m in models if "claude" in m.lower()]
    other_models = [m for m in models if "claude" not in m.lower()]

    # Claude via Agent tool (skill-side, not in this function — see below)
    claude_responses = [
        Response(
            model=m,
            text=agent_tool_result_for(m),  # filled in by the skill orchestrator
            cli_used="agent-tool",
            exit_code=0,
            duration_s=...,
        )
        for m in claude_models
    ]
    # Other providers via mm-ask
    other_responses = dispatch(prompt, other_models) if other_models else []
    return claude_responses + other_responses

result = run_council(
    prompt="...",
    panel=["claude-opus", "gpt-5.4-high", "gemini-3.1-pro"],
    chairman="claude-opus",
    dispatch_fn=skill_dispatch,
)
```

The `dispatch_fn` hook is how gw and aisci skills combine in-process Agent tool workers with subprocess CLI workers.
```shell
uv run mm-ask \
  --models gpt-5.4-high,gemini-3.1-pro,grok-4-20-thinking \
  --prompt "Verify these citations: ..." \
  --output results.json \
  --timeout 300
```

Flags:

- `--models a,b,c` — comma-separated model IDs (required)
- `--prompt TEXT` / `--prompt-file FILE` / stdin — prompt input (exactly one)
- `--output FILE` / stdout — JSON output destination
- `--timeout N` — per-model timeout in seconds (default 300)
- `--allow-cursor-claude` — opt in to routing Claude via cursor-agent
- `--verbose` / `-v` — print routing summary to stderr
Exit codes: 0 = ok, 2 = bad args, 3 = Claude rejected, 4 = unreachable model, 5 = all dispatches failed.
```shell
uv run mm-council \
  --panel gpt-5.4-high,gemini-3.1-pro,grok-4-20-thinking \
  --chairman gpt-5.4-high \
  --prompt-file prompt.txt \
  --output council.json
```

Flags:

- `--panel a,b,c` — panel models (default: `gpt-5.4-high,gemini-3.1-pro,grok-4-20-thinking`)
- `--chairman M` — stage-3 synthesizer (default: `gpt-5.4-high`)
- `--prompt` / `--prompt-file` / stdin
- `--output FILE` — JSON output
- `--timeout N` — per-model timeout
- `--allow-cursor-claude`, `--verbose`
Output JSON has the full `CouncilResult.to_dict()` shape: `stage1_responses`, `stage2_rankings`, `aggregate_ranks`, `stage3_synthesis`, `chairman_model`, `metadata`.
```shell
uv run mm-detect              # human-readable
uv run mm-detect --json       # JSON
uv run mm-detect --check-auth # also probe auth state (slower)
```

```
src/multi_model/
├── __init__.py          # public API
├── types.py             # Response dataclass
├── constants.py         # DEFAULT_PANEL, DEFAULT_CHAIRMAN, DEFAULT_TIMEOUT
├── detect.py            # CLI availability + auth probe
├── routing.py           # model → CLI routing rules
├── dispatch.py          # ThreadPoolExecutor parallel dispatch
├── council.py           # 3-stage council pattern
├── cli.py               # mm-ask, mm-council, mm-detect entry points
└── providers/
    ├── __init__.py
    ├── _subprocess.py   # shared subprocess wrapper
    ├── codex.py         # codex exec dispatcher + stdout cleaner
    ├── gemini.py        # gemini -p dispatcher + ANSI/noise strip
    └── cursor_agent.py  # cursor-agent --print + JSON parser
```
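The shared subprocess wrapper in `providers/_subprocess.py` likely follows the standard pattern below. This is a minimal sketch under that assumption, not the library's actual code; the function name and return shape are illustrative:

```python
import subprocess

def run_cli(argv: list[str], prompt: str, timeout: int = 300) -> tuple[int, str, str]:
    """Run a provider CLI, piping the prompt on stdin.

    Returns (exit_code, stdout, stderr); a timeout is reported as
    exit code -1 rather than raising into the dispatch loop.
    """
    try:
        proc = subprocess.run(
            argv,
            input=prompt,       # prompt goes in on stdin
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return proc.returncode, proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        return -1, "", f"timed out after {timeout}s"

# 'cat' stands in for a provider CLI: it echoes stdin back on stdout
code, out, err = run_cli(["cat"], "hello")
```

Catching `TimeoutExpired` at this layer keeps per-model timeouts from aborting the whole parallel dispatch: a slow provider becomes one failed `Response`, not a crashed run.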
Total core library: ~1,200 lines of Python, stdlib-only (no runtime dependencies). Compare with claude-octopus at ~14,400 lines of bash for the same "parallel multi-provider dispatch" feature.
We evaluated forking or depending on claude-octopus and concluded it was the wrong shape for this use case:
- Footprint: octopus is ~14,400 lines of bash across 52 `lib/` files, with 855 source files total. We use ~4% of its surface area (the dispatch primitives from `lib/dispatch.sh` and `lib/workflows.sh`).
- Tight coupling: the core `probe_single_agent()` function depends on 19 helper functions spanning 15 other bash files, plus ~30 global env vars. Extracting it cleanly is impractical.
- Interactive gates: `/octo:research` blocks on `AskUserQuestion` asking "how thorough?" before dispatching — incompatible with headless/CI pipelines.
- Mandatory banners: every invocation emits `🐙 CLAUDE OCTOPUS ACTIVATED` headers.
- Output location: octopus writes to `~/.claude-octopus/debates/<session>/` — not where our consumers want their state.
- Plugin coupling: consumers invoke `/octo:debate` as a slash command, passing prompts through free-form text and parsing free-form responses. A Python library + import is strictly cleaner for Python consumers.
In contrast, claude-multi-model:
- 1,200 lines of Python, stdlib-only, easy to read
- Zero interactive gates, fully headless
- Outputs land wherever the caller writes them
- Clean Python API with typed dataclasses + CLI entry points
- Ships as a regular Python dependency via `uv pip install`
The 3-stage council pattern (parallel answer → blind peer review → chairman synthesis) is adapted from karpathy/llm-council. We re-implement the same idea with CLI providers instead of OpenRouter, stricter JSON parsing instead of regex-matching on "FINAL RANKING:" text markers, and polymorphic dispatch so skills can mix in-process Agent tool calls with subprocess dispatches.
The provider CLI knowledge (especially codex exec flags, gemini stdin piping, auth retry patterns, and macOS keychain workarounds) was informed by reading claude-octopus's bash source. We don't copy any code, but octopus's well-commented dispatch logic is a useful reference for the sharp edges of each CLI.
MIT. See LICENSE.