A multi-agent orchestrator that builds software autonomously by looping between Claude Code (implementer) and OpenAI Codex CLI (adversarial reviewer). Confidence thresholds across 9 quality dimensions act as the objective function. When all thresholds pass, the code auto-merges to main.
Inspired by Karpathy's autoresearch (modify → execute → evaluate → decide) and Cline Kanban (parallel agent management with visual dashboard).
┌──────────────┐ handoff ┌──────────────────┐
│ Claude Code │ ──────────> │ Codex CLI │
│ implementer │ <────────── │ adversarial │
│ │ scores │ reviewer │
└──────┬───────┘ └────────┬──────────┘
│ │
▼ ▼
┌─────────────────────────────────────────────┐
│ .orchestrator/tracker.md │
│ (single source of truth) │
└─────────────────────────────────────────────┘
- Claude Code CLI (`claude`)
- Codex CLI (`codex`) — `npm i -g @openai/codex`
- Node.js 18+
- Git
# Clone into Claude Code plugins directory
git clone https://github.com/snakezilla/ringleader.git ~/.claude/plugins/local/orchestrator
# Install dashboard dependencies
cd ~/.claude/plugins/local/orchestrator/dashboard && npm install
# Add the CLI command
mkdir -p ~/.local/bin
ln -sf ~/.claude/plugins/local/orchestrator/bin/ringleader ~/.local/bin/ringleader
# Ensure ~/.local/bin is in your PATH
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.zshrc
cd ~/Projects/my-app
# Inside Claude Code:
/ringleader
This runs an 8-step interview that generates:
- `.orchestrator/north-star.md` — immutable goal document
- `.orchestrator/confidence-thresholds.yaml` — quality gates
- `.orchestrator/plan-of-attack.md` — iteration schedule
# Dashboard mode (Ink TUI with kanban board)
ringleader ~/Projects/my-app
# Legacy spinner mode
ringleader --legacy ~/Projects/my-app
# Custom iteration count
ringleader ~/Projects/my-app 30

RingLeader auto-detects existing projects and adapts:
cd ~/Projects/existing-app
/ringleader   # detects code, asks improvement goals

It measures a baseline, then improves the codebase toward your thresholds.
Each iteration runs:
- Claude Code implements (with `--dangerously-skip-permissions`)
- Real tools measure what they can (vitest, tsc, npm audit)
- Codex CLI answers 105 binary checklist questions with evidence
- Scores are computed deterministically from checklist answers + tool output
- EMA smoothing filters noise (±15 raw variance → ±3 final)
- If all thresholds pass → auto-merge to main
The optimal stopping problem applied to iteration budgeting:
- First 37% of iterations: Planning. Explore approaches broadly, don't commit.
- Remaining 63%: Execution. Lock into the best approach and build.
For 20 iterations: 7 planning, 13 execution.
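The split can be computed directly. A minimal sketch (illustrative, not the orchestrator's actual code):

```typescript
// Split an iteration budget per the 37% optimal-stopping rule:
// planning gets ~37% of iterations (rounded), execution the rest.
function phaseSplit(totalIterations: number): { planning: number; execution: number } {
  const planning = Math.round(totalIterations * 0.37);
  return { planning, execution: totalIterations - planning };
}

console.log(phaseSplit(20)); // → { planning: 7, execution: 13 }
```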
| Dimension | Threshold | What It Measures |
|---|---|---|
| cybersecurity | 80 | OWASP, secrets, injection, auth |
| crash_resistance | 85 | Error handling, degradation, timeouts |
| code_coverage | 80 | Unit + integration tests |
| type_safety | 90 | Strict TS, Zod boundaries |
| hci_design | 75 | Accessibility, responsive, states |
| performance | 75 | Algorithms, queries, bundle size |
| api_compliance | 85 | Response format, status codes, schema |
| documentation | 70 | README, JSDoc, ADRs |
| test_effectiveness | 60 | Mutation testing, meaningful assertions |
Thresholds adjust based on project type (CLI doesn't need hci_design) and risk profile (financial data raises cybersecurity to 95).
Three layers work together to suppress LLM scoring variance:
Layer 1 — Binary Checklists. 105 yes/no questions across 9 dimensions. Codex answers each with evidence. Score = items_passed / total × 100. Computed by us, not the LLM.
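The arithmetic is simple enough to pin down in a sketch (type and function names here are illustrative, not the engine's actual API):

```typescript
interface ChecklistAnswer {
  question: string;
  passed: boolean;   // Codex's yes/no verdict
  evidence: string;  // file:line citation backing the verdict
}

// Deterministic score: the LLM only supplies booleans; we do the division.
function checklistScore(answers: ChecklistAnswer[]): number {
  const passed = answers.filter((a) => a.passed).length;
  return (passed / answers.length) * 100;
}
```

With 8 of 10 items passed, the dimension scores 80 regardless of how the model phrased its reasoning.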
Layer 2 — Tool Anchoring. Real tool output provides hard measurements:
- `vitest --coverage` → code_coverage (90% weight)
- `tsc --strict --noEmit` → type_safety (60% weight)
- `npm audit` → cybersecurity (40% weight)
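One way to picture the anchoring (a sketch; the linear blend is an assumption — only the weight values come from the list above):

```typescript
// Blend a hard tool measurement with the checklist score for one dimension.
// `weight` is how strongly the tool output anchors the result (0.9 for coverage).
function anchoredScore(toolScore: number, checklistScore: number, weight: number): number {
  return weight * toolScore + (1 - weight) * checklistScore;
}

// code_coverage: vitest reports 72% coverage, checklist says 80
// → 0.9 × 72 + 0.1 × 80 = 72.8, dominated by the real measurement.
```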
Layer 3 — EMA + Winsorization. Smooths remaining variance:
smoothed = 0.4 × raw + 0.6 × previous
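Combined with the ±20 deviation cap listed under guardrails, the smoothing step might look like this (a sketch under those assumptions):

```typescript
// Winsorize the raw score to within ±20 of the previous smoothed value,
// then apply the EMA: smoothed = 0.4 × raw + 0.6 × previous.
function smooth(raw: number, previous: number, cap = 20): number {
  const clamped = Math.min(previous + cap, Math.max(previous - cap, raw));
  return 0.4 * clamped + 0.6 * previous;
}

// A noisy jump from 50 to 95 is first capped at 70, then smoothed to 58.
```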
Claude builds. Codex reviews. Codex never modifies code — it only scores and critiques. This creator-critic separation prevents self-approval bias (different model, different weights, different blind spots).
Codex is prompted as a "naysayer" with instructions to be harsh and cite file:line evidence for every finding.
Early iterations have lower targets so the loop can make incremental progress:
| Progress | Target Multiplier |
|---|---|
| 0-25% | 30% of threshold |
| 25-50% | 50% |
| 50-75% | 75% |
| 75-100% | Full threshold |
For existing codebases, targets start from the measured baseline, not from zero.
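Read as a step function, the ramp can be sketched like this (illustrative; exact boundary handling is an assumption):

```typescript
// Map loop progress (0–1) to the fraction of each final threshold
// the current iteration must hit: 30% → 50% → 75% → 100%.
function targetMultiplier(progress: number): number {
  if (progress < 0.25) return 0.30;
  if (progress < 0.50) return 0.50;
  if (progress < 0.75) return 0.75;
  return 1.0;
}

// At 60% progress, cybersecurity's interim target is 0.75 × 80 = 60.
```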
The Ink TUI dashboard shows real-time progress:
┌─ Status Bar ────────────────────────────────────────────────┐
│ RingLeader │ Iter 11/23 │ execution: core features │
├─ Kanban ──────────┬─ Agents ──────────┬─ Scores ───────────┤
│ BACKLOG │ security-reviewer│ cyber ▁▃▅▄▅▆ 28 │
│ ├ hci_design 6 │ ├ Edit auth.ts │ crash ▁▂▄▃▃▃ 24 │
│ └ docs 12 │ ├ 3m 22s · ~12k │ cover ▁▁▁▁▂▂ 20 │
│ IN PROGRESS │ └ cybersecurity │ types ▁▁▂▂▃▃ 29 │
│ ├ cyber 28 │ │ hci ▂▅▅▂▁▁ 6 │
│ └ types 29 │ tdd-guide │ perf ▂▄▄▂▃▂ 18 │
│ DONE │ ├ Bash: pnpm test│ api ▂▅▄▂▃▃ 21 │
│ └ (none yet) │ └ code_coverage │ docs ▃▆▆▅▂▂ 12 │
├─ Findings ──────────────────────────────────────────────────┤
│ CRITICAL: SQL interpolation at validate.ts:95 │
├─ Git Log ───────────────────────────────────────────────────┤
│ a281abd iter-010 codex review (composite: 21.0) │
└─────────────────────────────────────────────────────────────┘
orchestrator/
├── ARCHITECTURE.md # v1 architecture
├── ARCHITECTURE-v2.md # v2 architecture (deterministic scoring, existing codebases)
├── plugin.json # Claude Code plugin manifest
├── skills/
│ ├── north-star/SKILL.md # /north-star — project interview
│ └── orchestrate/SKILL.md # /orchestrate — per-iteration brain
├── agents/
│ ├── iteration-planner.md # Secretary problem phase/team selector
│ └── codex-critic.md # Scoring prompt assembler
├── bin/
│ ├── loop.sh # Legacy shell loop
│ └── ringleader # CLI entry point
├── lib/
│ ├── scoring.sh # Shell-based scoring (legacy)
│ ├── handoff.sh # Shell-based prompts (legacy)
│ └── guardrails.sh # Deadlock, budget, timeout
├── templates/
│ ├── tracker.md # Tracker doc template
│ ├── confidence-thresholds.yaml
│ ├── plan-of-attack.md
│ ├── codex-system-prompt.md # Adversarial reviewer prompt (v2 checklist)
│ └── checklists/ # 9 YAML files, 105 binary questions
├── dashboard/
│ ├── index.tsx # Ink TUI entry point
│ ├── package.json
│ ├── components/ # StatusBar, KanbanBoard, AgentPanel, ScorePanel, etc.
│ ├── engine/ # orchestrator, tool-runner, checklist-scorer, etc.
│ └── state/ # store, types, file-watcher
RingLeader creates .orchestrator/ in your project:
.orchestrator/
├── north-star.md # Immutable goal (what to build)
├── tracker.md # Living state doc
├── confidence-thresholds.yaml # Quality gates
├── plan-of-attack.md # Iteration schedule
├── baseline.json # Baseline scores (existing codebases)
├── handoffs/ # Per-iteration agent handoff notes
├── logs/
│ ├── scores.jsonl # Score history (one JSON line per iteration)
│ ├── tool-results-iter-NNN.json
│ ├── claude-iter-NNN.log
│ └── codex-iter-NNN.log
├── decisions/ # Architecture decision records
└── scratch/ # Disposable planning artifacts
| Guardrail | Behavior |
|---|---|
| Max iterations | Hard stop at N |
| Timeouts | 10min planning, 15min execution |
| Deadlock detection | 3 identical composites → agent switch → human escalation |
| Budget tracking | Cumulative tokens in cost.json |
| Human checkpoint | Pause at midpoint (N/2) |
| Score regression | EMA + winsorization caps ±20 deviation |
| Non-regression | Existing codebases: no dimension drops >5 from high-water |
| Git safety | Every iteration committed, fully revertable |
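The deadlock rule above reduces to a pure check over the composite-score history; a minimal sketch (names illustrative):

```typescript
// Three identical composite scores in a row means the loop is stuck:
// trigger an agent switch, then human escalation if it persists.
function isDeadlocked(composites: number[]): boolean {
  if (composites.length < 3) return false;
  const [a, b, c] = composites.slice(-3);
  return a === b && b === c;
}

// isDeadlocked([18, 21, 21, 21]) → true; isDeadlocked([20, 21, 21]) → false
```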
- Karpathy's autoresearch — modify → execute → evaluate → decide. Single objective metric. Human-editable strategy doc.
- Secretary problem — Optimal stopping for planning vs execution budget.
- Cline Kanban — Visual board for parallel agent management.
- RocketEval (ICLR 2025) — Binary checklist decomposition for stable LLM scoring.
- G-Eval — Chain-of-thought before scoring for consistency.
- gstack — Cognitive gearing and review checklists.
MIT