A multi-agent orchestrator that builds software autonomously by looping between Claude Code (implementer) and OpenAI Codex CLI (adversarial reviewer). Confidence thresholds across 9 quality dimensions act as the objective function. When all thresholds pass, the code auto-merges to main.
Inspired by Karpathy's autoresearch (modify → execute → evaluate → decide) and Cline Kanban (parallel agent management with visual dashboard).
┌──────────────┐ handoff ┌──────────────────┐
│ Claude Code │ ──────────> │ Codex CLI │
│ implementer │ <────────── │ adversarial │
│ │ scores │ reviewer │
└──────┬───────┘ └────────┬──────────┘
│ │
▼ ▼
┌─────────────────────────────────────────────┐
│ .orchestrator/tracker.md │
│ (single source of truth) │
└─────────────────────────────────────────────┘
- Claude Code CLI (`claude`)
- Codex CLI (`codex`) — `npm i -g @openai/codex`
- Node.js 18+
- Git
# Clone into Claude Code plugins directory
git clone https://github.com/snakezilla/ringleader.git ~/.claude/plugins/local/orchestrator
# Install dashboard dependencies
cd ~/.claude/plugins/local/orchestrator/dashboard && npm install
# Add the CLI command
mkdir -p ~/.local/bin
ln -sf ~/.claude/plugins/local/orchestrator/bin/ringleader ~/.local/bin/ringleader
# Ensure ~/.local/bin is in your PATH
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.zshrc
cd ~/Projects/my-app
# Inside Claude Code:
/ringleader
This runs an 8-step interview that generates:
- `.orchestrator/north-star.md` — immutable goal document
- `.orchestrator/confidence-thresholds.yaml` — quality gates
- `.orchestrator/plan-of-attack.md` — iteration schedule
# Dashboard mode (Ink TUI with kanban board)
ringleader ~/Projects/my-app
# Legacy spinner mode
ringleader --legacy ~/Projects/my-app
# Custom iteration count
ringleader ~/Projects/my-app 30

RingLeader auto-detects existing projects and adapts:
cd ~/Projects/existing-app
/ringleader   # detects code, asks improvement goals

It measures a baseline, then improves the codebase toward your thresholds.
Each iteration runs:
- Claude Code implements (with `--dangerously-skip-permissions`)
- Real tools measure what they can (vitest, tsc, npm audit)
- Codex CLI answers 105 binary checklist questions with evidence
- Scores are computed deterministically from checklist answers + tool output
- EMA smoothing filters noise (±15 raw variance → ±3 final)
- If all thresholds pass → auto-merge to main
The optimal stopping problem applied to iteration budgeting:
- First 37% of iterations: Planning. Explore approaches broadly, don't commit.
- Remaining 63%: Execution. Lock into the best approach and build.
For 20 iterations: 7 planning, 13 execution.
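The split can be computed directly. A minimal sketch (illustrative, not the orchestrator's actual code):

```typescript
// Split an iteration budget per the 37% optimal-stopping rule:
// planning gets ~37% of iterations (rounded), execution the rest.
function phaseSplit(totalIterations: number): { planning: number; execution: number } {
  const planning = Math.round(totalIterations * 0.37);
  return { planning, execution: totalIterations - planning };
}

console.log(phaseSplit(20)); // → { planning: 7, execution: 13 }
```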
| Dimension | Threshold | What It Measures |
|---|---|---|
| cybersecurity | 80 | OWASP, secrets, injection, auth |
| crash_resistance | 85 | Error handling, degradation, timeouts |
| code_coverage | 80 | Unit + integration tests |
| type_safety | 90 | Strict TS, Zod boundaries |
| hci_design | 75 | Accessibility, responsive, states |
| performance | 75 | Algorithms, queries, bundle size |
| api_compliance | 85 | Response format, status codes, schema |
| documentation | 70 | README, JSDoc, ADRs |
| test_effectiveness | 60 | Mutation testing, meaningful assertions |
Thresholds adjust based on project type (CLI doesn't need hci_design) and risk profile (financial data raises cybersecurity to 95).
Three layers work together to suppress LLM scoring variance:
Layer 1 — Binary Checklists. 105 yes/no questions across 9 dimensions. Codex answers each with evidence. Score = items_passed / total × 100. Computed by us, not the LLM.
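The arithmetic is simple enough to pin down in a sketch (type and function names here are illustrative, not the engine's actual API):

```typescript
interface ChecklistAnswer {
  question: string;
  passed: boolean;   // Codex's yes/no verdict
  evidence: string;  // file:line citation backing the verdict
}

// Deterministic score: the LLM only supplies booleans; we do the division.
function checklistScore(answers: ChecklistAnswer[]): number {
  const passed = answers.filter((a) => a.passed).length;
  return (passed / answers.length) * 100;
}
```

With 8 of 10 items passed, the dimension scores 80 regardless of how the model phrased its reasoning.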
Layer 2 — Tool Anchoring. Real tool output provides hard measurements:
- `vitest --coverage` → code_coverage (90% weight)
- `tsc --strict --noEmit` → type_safety (60% weight)
- `npm audit` → cybersecurity (40% weight)
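One way to picture the anchoring (a sketch; the linear blend is an assumption — only the weight values come from the list above):

```typescript
// Blend a hard tool measurement with the checklist score for one dimension.
// `weight` is how strongly the tool output anchors the result (0.9 for coverage).
function anchoredScore(toolScore: number, checklistScore: number, weight: number): number {
  return weight * toolScore + (1 - weight) * checklistScore;
}

// code_coverage: vitest reports 72% coverage, checklist says 80
// → 0.9 × 72 + 0.1 × 80 = 72.8, dominated by the real measurement.
```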
Layer 3 — EMA + Winsorization. Smooths remaining variance:
smoothed = 0.4 × raw + 0.6 × previous
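Combined with the ±20 deviation cap listed under guardrails, the smoothing step might look like this (a sketch under those assumptions):

```typescript
// Winsorize the raw score to within ±20 of the previous smoothed value,
// then apply the EMA: smoothed = 0.4 × raw + 0.6 × previous.
function smooth(raw: number, previous: number, cap = 20): number {
  const clamped = Math.min(previous + cap, Math.max(previous - cap, raw));
  return 0.4 * clamped + 0.6 * previous;
}

// A noisy jump from 50 to 95 is first capped at 70, then smoothed to 58.
```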
Claude builds. Codex reviews. Codex never modifies code — it only scores and critiques. This creator-critic separation prevents self-approval bias (different model, different weights, different blind spots).
Codex is prompted as a "naysayer" with instructions to be harsh and cite file:line evidence for every finding.
Early iterations have lower targets so the loop can make incremental progress:
| Progress | Target Multiplier |
|---|---|
| 0-25% | 30% of threshold |
| 25-50% | 50% |
| 50-75% | 75% |
| 75-100% | Full threshold |
For existing codebases, targets start from the measured baseline, not from zero.
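Read as a step function, the ramp can be sketched like this (illustrative; exact boundary handling is an assumption):

```typescript
// Map loop progress (0–1) to the fraction of each final threshold
// the current iteration must hit: 30% → 50% → 75% → 100%.
function targetMultiplier(progress: number): number {
  if (progress < 0.25) return 0.30;
  if (progress < 0.50) return 0.50;
  if (progress < 0.75) return 0.75;
  return 1.0;
}

// At 60% progress, cybersecurity's interim target is 0.75 × 80 = 60.
```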
The Ink TUI dashboard shows real-time progress:
┌─ Status Bar ────────────────────────────────────────────────┐
│ RingLeader │ Iter 11/23 │ execution: core features │
├─ Kanban ──────────┬─ Agents ──────────┬─ Scores ───────────┤
│ BACKLOG │ security-reviewer│ cyber ▁▃▅▄▅▆ 28 │
│ ├ hci_design 6 │ ├ Edit auth.ts │ crash ▁▂▄▃▃▃ 24 │
│ └ docs 12 │ ├ 3m 22s · ~12k │ cover ▁▁▁▁▂▂ 20 │
│ IN PROGRESS │ └ cybersecurity │ types ▁▁▂▂▃▃ 29 │
│ ├ cyber 28 │ │ hci ▂▅▅▂▁▁ 6 │
│ └ types 29 │ tdd-guide │ perf ▂▄▄▂▃▂ 18 │
│ DONE │ ├ Bash: pnpm test│ api ▂▅▄▂▃▃ 21 │
│ └ (none yet) │ └ code_coverage │ docs ▃▆▆▅▂▂ 12 │
├─ Findings ──────────────────────────────────────────────────┤
│ CRITICAL: SQL interpolation at validate.ts:95 │
├─ Git Log ───────────────────────────────────────────────────┤
│ a281abd iter-010 codex review (composite: 21.0) │
└─────────────────────────────────────────────────────────────┘
orchestrator/
├── ARCHITECTURE.md # v1 architecture
├── ARCHITECTURE-v2.md # v2 architecture (deterministic scoring, existing codebases)
├── plugin.json # Claude Code plugin manifest
├── skills/
│ ├── north-star/SKILL.md # /north-star — project interview
│ └── orchestrate/SKILL.md # /orchestrate — per-iteration brain
├── agents/
│ ├── iteration-planner.md # Secretary problem phase/team selector
│ └── codex-critic.md # Scoring prompt assembler
├── bin/
│ ├── loop.sh # Legacy shell loop
│ └── ringleader # CLI entry point
├── lib/
│ ├── scoring.sh # Shell-based scoring (legacy)
│ ├── handoff.sh # Shell-based prompts (legacy)
│ └── guardrails.sh # Deadlock, budget, timeout
├── templates/
│ ├── tracker.md # Tracker doc template
│ ├── confidence-thresholds.yaml
│ ├── plan-of-attack.md
│ ├── codex-system-prompt.md # Adversarial reviewer prompt (v2 checklist)
│ └── checklists/ # 9 YAML files, 105 binary questions
├── dashboard/
│ ├── index.tsx # Ink TUI entry point
│ ├── package.json
│ ├── components/ # StatusBar, KanbanBoard, AgentPanel, ScorePanel, etc.
│ ├── engine/ # orchestrator, tool-runner, checklist-scorer, etc.
│ └── state/ # store, types, file-watcher
RingLeader creates .orchestrator/ in your project:
.orchestrator/
├── north-star.md # Immutable goal (what to build)
├── tracker.md # Living state doc
├── confidence-thresholds.yaml # Quality gates
├── plan-of-attack.md # Iteration schedule
├── baseline.json # Baseline scores (existing codebases)
├── handoffs/ # Per-iteration agent handoff notes
├── logs/
│ ├── scores.jsonl # Score history (one JSON line per iteration)
│ ├── tool-results-iter-NNN.json
│ ├── claude-iter-NNN.log
│ └── codex-iter-NNN.log
├── decisions/ # Architecture decision records
└── scratch/ # Disposable planning artifacts
| Guardrail | Behavior |
|---|---|
| Max iterations | Hard stop at N |
| Timeouts | 10min planning, 15min execution |
| Deadlock detection | 3 identical composites → agent switch → human escalation |
| Budget tracking | Cumulative tokens in cost.json |
| Human checkpoint | Pause at midpoint (N/2) |
| Score regression | EMA + winsorization caps ±20 deviation |
| Non-regression | Existing codebases: no dimension drops >5 from high-water |
| Git safety | Every iteration committed, fully revertable |
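The deadlock rule above reduces to a pure check over the composite-score history; a minimal sketch (names illustrative):

```typescript
// Three identical composite scores in a row means the loop is stuck:
// trigger an agent switch, then human escalation if it persists.
function isDeadlocked(composites: number[]): boolean {
  if (composites.length < 3) return false;
  const [a, b, c] = composites.slice(-3);
  return a === b && b === c;
}

// isDeadlocked([18, 21, 21, 21]) → true; isDeadlocked([20, 21, 21]) → false
```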
- Karpathy's autoresearch — modify → execute → evaluate → decide. Single objective metric. Human-editable strategy doc.
- Secretary problem — Optimal stopping for planning vs execution budget.
- Cline Kanban — Visual board for parallel agent management.
- RocketEval (ICLR 2025) — Binary checklist decomposition for stable LLM scoring.
- G-Eval — Chain-of-thought before scoring for consistency.
- gstack — Cognitive gearing and review checklists.
MIT