RingLeader

A multi-agent orchestrator that builds software autonomously by looping between Claude Code (implementer) and OpenAI Codex CLI (adversarial reviewer). Confidence thresholds across 9 quality dimensions act as the objective function. When all thresholds pass, the code auto-merges to main.

Inspired by Karpathy's autoresearch (modify → execute → evaluate → decide) and Cline Kanban (parallel agent management with visual dashboard).

┌───────────────┐    handoff    ┌───────────────────┐
│  Claude Code  │ ────────────> │  Codex CLI        │
│  implementer  │ <──────────── │  adversarial      │
│               │    scores     │  reviewer         │
└───────┬───────┘               └─────────┬─────────┘
        │                                 │
        ▼                                 ▼
┌─────────────────────────────────────────────────┐
│            .orchestrator/tracker.md             │
│            (single source of truth)             │
└─────────────────────────────────────────────────┘

Quick Start

Prerequisites

  • Claude Code (the claude CLI)
  • OpenAI Codex CLI (codex)
  • Node.js with npm (for the dashboard)
  • git

Install

# Clone into Claude Code plugins directory
git clone https://github.com/snakezilla/ringleader.git ~/.claude/plugins/local/orchestrator

# Install dashboard dependencies
cd ~/.claude/plugins/local/orchestrator/dashboard && npm install

# Add the CLI command
mkdir -p ~/.local/bin
ln -sf ~/.claude/plugins/local/orchestrator/bin/ringleader ~/.local/bin/ringleader

# Ensure ~/.local/bin is in your PATH
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.zshrc

Initialize a Project

cd ~/Projects/my-app

# Inside Claude Code:
/ringleader

This runs an 8-step interview that generates:

  • .orchestrator/north-star.md — immutable goal document
  • .orchestrator/confidence-thresholds.yaml — quality gates
  • .orchestrator/plan-of-attack.md — iteration schedule

Run

# Dashboard mode (Ink TUI with kanban board)
ringleader ~/Projects/my-app

# Legacy spinner mode
ringleader --legacy ~/Projects/my-app

# Custom iteration count
ringleader ~/Projects/my-app 30

Existing Codebases

RingLeader auto-detects existing projects and adapts:

cd ~/Projects/existing-app
/ringleader    # detects code, asks improvement goals

It measures a baseline, then improves the codebase toward your thresholds.

How It Works

The Loop

Each iteration runs:

  1. Claude Code implements (with --dangerously-skip-permissions)
  2. Real tools measure what they can (vitest, tsc, npm audit)
  3. Codex CLI answers 105 binary checklist questions with evidence
  4. Scores are computed deterministically from checklist answers + tool output
  5. EMA smoothing filters noise (±15 raw variance → ±3 final)
  6. If all thresholds pass → auto-merge to main
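The six steps above can be sketched as a single function. Every name here is illustrative, not taken from the actual engine (which lives in dashboard/engine/):

```typescript
type Scores = Record<string, number>;

// One-iteration sketch of the loop. Each dependency mirrors a numbered step.
async function iterate(deps: {
  implement(): Promise<void>;       // 1. Claude Code edits code
  runTools(): Promise<Scores>;      // 2. vitest / tsc / npm audit measurements
  review(): Promise<Scores>;        // 3-4. Codex checklist answers -> computed scores
  smooth(raw: Scores): Scores;      // 5. EMA noise filtering
  thresholds: Scores;
  merge(): Promise<void>;           // 6. auto-merge to main
}): Promise<boolean> {
  await deps.implement();
  // Tool-anchored measurements take precedence over checklist-derived scores.
  const raw = { ...(await deps.review()), ...(await deps.runTools()) };
  const final = deps.smooth(raw);
  const allPass = Object.entries(deps.thresholds).every(
    ([dim, t]) => (final[dim] ?? 0) >= t,
  );
  if (allPass) await deps.merge();
  return allPass;
}
```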

Secretary Problem Scheduling

The optimal stopping problem applied to iteration budgeting:

  • First 37% of iterations: Planning. Explore approaches broadly, don't commit.
  • Remaining 63%: Execution. Lock into the best approach and build.

For 20 iterations: 7 planning, 13 execution.
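As a sketch (the function name is made up), the 37/63 split is just a floor:

```typescript
// Secretary-problem budget split: the first ~37% of iterations are planning,
// the remainder are execution. Illustrative, not the engine's code.
function splitIterations(total: number): { planning: number; execution: number } {
  const planning = Math.floor(total * 0.37);
  return { planning, execution: total - planning };
}
```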

9 Confidence Dimensions

Dimension           Threshold   What It Measures
cybersecurity       80          OWASP, secrets, injection, auth
crash_resistance    85          Error handling, degradation, timeouts
code_coverage       80          Unit + integration tests
type_safety         90          Strict TS, Zod boundaries
hci_design          75          Accessibility, responsive, states
performance         75          Algorithms, queries, bundle size
api_compliance      85          Response format, status codes, schema
documentation       70          README, JSDoc, ADRs
test_effectiveness  60          Mutation testing, meaningful assertions

Thresholds adjust based on project type (CLI doesn't need hci_design) and risk profile (financial data raises cybersecurity to 95).

Deterministic Scoring (v2)

Three layers eliminate LLM scoring variance:

Layer 1 — Binary Checklists. 105 yes/no questions across 9 dimensions. Codex answers each with evidence. Score = items_passed / total × 100. Computed by us, not the LLM.
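A minimal sketch of that computation (the interface is an assumption, not the actual schema):

```typescript
// Hypothetical shape of a single checklist answer.
interface ChecklistItem {
  question: string;
  passed: boolean;   // Codex's yes/no verdict
  evidence: string;  // file:line citation
}

// Layer 1: the score is plain arithmetic over the answers,
// so the LLM never emits a number directly.
function checklistScore(items: ChecklistItem[]): number {
  const passed = items.filter((i) => i.passed).length;
  return (passed / items.length) * 100;
}
```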

Layer 2 — Tool Anchoring. Real tool output provides hard measurements:

  • vitest --coverage → code_coverage (90% weight)
  • tsc --strict --noEmit → type_safety (60% weight)
  • npm audit → cybersecurity (40% weight)

Layer 3 — EMA + Winsorization. Smooths remaining variance:

smoothed = 0.4 × raw + 0.6 × previous
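Combined with the ±20 winsorization cap from the guardrails, the smoothing step might look like this (illustrative sketch, not the engine's implementation):

```typescript
// Layer 3 sketch: clamp the raw score to within ±cap of the previous smoothed
// value (winsorization), then blend 40% raw / 60% previous (EMA).
function smoothScore(raw: number, previous: number, cap = 20): number {
  const winsorized = Math.min(previous + cap, Math.max(previous - cap, raw));
  return 0.4 * winsorized + 0.6 * previous;
}
```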

Adversarial Review

Claude builds. Codex reviews. Codex never modifies code — it only scores and critiques. This creator-critic separation prevents self-approval bias (different model, different weights, different blind spots).

Codex is prompted as a "naysayer" with instructions to be harsh and cite file:line evidence for every finding.

Progressive Thresholds

Early iterations have lower targets so the loop can make incremental progress:

Progress   Target Multiplier
0–25%      30% of threshold
25–50%     50%
50–75%     75%
75–100%    Full threshold

For existing codebases, targets start from the measured baseline, not from zero.
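The table reduces to a small step function; the baseline floor for existing codebases is shown as an assumption (names invented):

```typescript
// Progressive target sketch: progress is the fraction of the run completed
// (0..1). For existing codebases, the measured baseline acts as a floor so
// targets never start below it. Illustrative, not the engine's code.
function progressiveTarget(threshold: number, progress: number, baseline = 0): number {
  const multiplier =
    progress < 0.25 ? 0.3 :
    progress < 0.5  ? 0.5 :
    progress < 0.75 ? 0.75 : 1.0;
  return Math.max(baseline, threshold * multiplier);
}
```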

Dashboard

The Ink TUI dashboard shows real-time progress:

┌─ Status Bar ────────────────────────────────────────────────┐
│  RingLeader  │  Iter 11/23  │  execution: core features    │
├─ Kanban ──────────┬─ Agents ──────────┬─ Scores ───────────┤
│  BACKLOG          │  security-reviewer│  cyber  ▁▃▅▄▅▆ 28  │
│  ├ hci_design 6   │  ├ Edit auth.ts   │  crash  ▁▂▄▃▃▃ 24  │
│  └ docs      12   │  ├ 3m 22s · ~12k  │  cover  ▁▁▁▁▂▂ 20  │
│  IN PROGRESS      │  └ cybersecurity  │  types  ▁▁▂▂▃▃ 29  │
│  ├ cyber    28    │                   │  hci    ▂▅▅▂▁▁  6  │
│  └ types    29    │  tdd-guide        │  perf   ▂▄▄▂▃▂ 18  │
│  DONE             │  ├ Bash: pnpm test│  api    ▂▅▄▂▃▃ 21  │
│  └ (none yet)     │  └ code_coverage  │  docs   ▃▆▆▅▂▂ 12  │
├─ Findings ──────────────────────────────────────────────────┤
│  CRITICAL: SQL interpolation at validate.ts:95              │
├─ Git Log ───────────────────────────────────────────────────┤
│  a281abd  iter-010 codex review (composite: 21.0)           │
└─────────────────────────────────────────────────────────────┘

Project Structure

orchestrator/
├── ARCHITECTURE.md              # v1 architecture
├── ARCHITECTURE-v2.md           # v2 architecture (deterministic scoring, existing codebases)
├── plugin.json                  # Claude Code plugin manifest
├── skills/
│   ├── north-star/SKILL.md      # /north-star — project interview
│   └── orchestrate/SKILL.md     # /orchestrate — per-iteration brain
├── agents/
│   ├── iteration-planner.md     # Secretary problem phase/team selector
│   └── codex-critic.md          # Scoring prompt assembler
├── bin/
│   ├── loop.sh                  # Legacy shell loop
│   └── ringleader               # CLI entry point
├── lib/
│   ├── scoring.sh               # Shell-based scoring (legacy)
│   ├── handoff.sh               # Shell-based prompts (legacy)
│   └── guardrails.sh            # Deadlock, budget, timeout
├── templates/
│   ├── tracker.md               # Tracker doc template
│   ├── confidence-thresholds.yaml
│   ├── plan-of-attack.md
│   ├── codex-system-prompt.md   # Adversarial reviewer prompt (v2 checklist)
│   └── checklists/              # 9 YAML files, 105 binary questions
└── dashboard/
    ├── index.tsx                # Ink TUI entry point
    ├── package.json
    ├── components/              # StatusBar, KanbanBoard, AgentPanel, ScorePanel, etc.
    ├── engine/                  # orchestrator, tool-runner, checklist-scorer, etc.
    └── state/                   # store, types, file-watcher

Per-Project State

RingLeader creates .orchestrator/ in your project:

.orchestrator/
├── north-star.md                # Immutable goal (what to build)
├── tracker.md                   # Living state doc
├── confidence-thresholds.yaml   # Quality gates
├── plan-of-attack.md            # Iteration schedule
├── baseline.json                # Baseline scores (existing codebases)
├── handoffs/                    # Per-iteration agent handoff notes
├── logs/
│   ├── scores.jsonl             # Score history (one JSON line per iteration)
│   ├── tool-results-iter-NNN.json
│   ├── claude-iter-NNN.log
│   └── codex-iter-NNN.log
├── decisions/                   # Architecture decision records
└── scratch/                     # Disposable planning artifacts

Guardrails

Guardrail           Behavior
Max iterations      Hard stop at N
Timeouts            10 min planning, 15 min execution
Deadlock detection  3 identical composites → agent switch → human escalation
Budget tracking     Cumulative tokens in cost.json
Human checkpoint    Pause at midpoint (N/2)
Score regression    EMA + winsorization cap deviation at ±20
Non-regression      Existing codebases: no dimension drops >5 from its high-water mark
Git safety          Every iteration committed, fully revertable
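For instance, the deadlock check reduces to comparing the last three composite scores (sketch; names invented):

```typescript
// Deadlock detection sketch: three identical composite scores in a row means
// the loop is stuck, which triggers an agent switch and, if that fails,
// human escalation.
function isDeadlocked(compositeHistory: number[]): boolean {
  if (compositeHistory.length < 3) return false;
  const [a, b, c] = compositeHistory.slice(-3);
  return a === b && b === c;
}
```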

Design Influences

  • Karpathy's autoresearch — modify → execute → evaluate → decide. Single objective metric. Human-editable strategy doc.
  • Secretary problem — Optimal stopping for planning vs execution budget.
  • Cline Kanban — Visual board for parallel agent management.
  • RocketEval (ICLR 2025) — Binary checklist decomposition for stable LLM scoring.
  • G-Eval — Chain-of-thought before scoring for consistency.
  • gstack — Cognitive gearing and review checklists.

License

MIT
