turangenesis/headroom

Headroom

A human-in-the-loop firewall for AI coding agents — and a measurement study of when to trust the human.

OpenClaw gives agents hands. Headroom gives them brakes.

A thought experiment, before the architecture.

Two guards watch the same agent.

  • Guard A escalates 5 actions/day to a human reviewer.
  • Guard B escalates 500/day.

Most people pick B: more oversight, more safety. Here is where that breaks:

                                        Guard A (5/day)          Guard B (500/day)
Reviewer by approval #300               reads all 5 carefully    fatigued, rubber-stamping
Malicious action buried in the stream   gets caught              sails through on a tired yes

Guard B has more oversight and the worse outcome.

The reframe: escalations aren't free. Each one spends a finite pool of human attention — the same pool the dangerous action needs. So oversight isn't only a classification problem (which action is risky?); it's a resource-allocation problem (the escalation policy spends a budget the threat is competing for).

This is the lens, not a discovery: the fatiguing-reviewer model is prior art (FALCON, DeCCaF). What this repo does is bring it to LLM-agent guards and put a curve on it.

AI agents now run code — deploy, delete, push to main. The usual safety answer is a human-approval gate. But a pause button is the easy part; the hard part is the judgment (which actions to stop) and the fact that the human reviewer is subjective and gets tired. Headroom is both a working gate and the apparatus that measures that judgment instead of guessing at it.

Two things at once, on purpose:

  • 🛠️ The system — a worker LLM agent proposes actions; a guardian (deterministic rules + LLM risk judgment) classifies each safe / approval-required / blocked; risky ones pause the agent mid-task (LangGraph interrupt()) and wait for a human; every decision is audit-logged on a live dashboard with an interactive calibration dial.
  • 📊 The measurement — we frame the guard as selective classification under asymmetric cost with noisy labels and a fatiguing reviewer, and measure it: a calibration curve, a noise floor (Fleiss' κ = 0.52), the safety-vs-oversight inverted-U, a flooding attack, and model-dependence — 4 figures + a measured noise floor (3 regenerate key-free; the 2 LLM-scored ones ship as committed artifacts and need an Anthropic key to regenerate). (Scope: an applied/measurement project. The mechanisms — fatigue-aware deferral, flooding attacks — are prior art we cite; the contribution is the open-source system + the measurement. See docs/DRAFT.md.)

Anyone can stop an agent. Headroom measures when to — and shows the cost of every setting.

▶ Live demo — block a rm -rf, approve a deploy, drag the calibration dial. No install, no API key.

Or run it locally in 60 seconds (no API key): uvicorn headroom.api:app → open the dashboard → Run demo (watch a rm -rf get blocked and a prod deploy pause for approval) → scroll to Calibration explorer and drag the dial.

The problem (why this isn't "just a pause button")

As agents get real hands — deploy, delete, spend money, touch prod — the bottleneck isn't can we stop them; frameworks already do that. The bottleneck is can we trust the thing deciding when to stop them. Over-gate and humans rubber-stamp every alert until the guard is useless; under-gate and something blows up. Today that line is usually set by vibes. Headroom sets it with data.

How it works

worker agent → proposes action → guardian (rules → LLM) ─┬─ SAFE     ▸ run
                                                         ├─ BLOCKED  ▸ deny
                                                         └─ APPROVAL ▸ pause (interrupt)
                                                                       → human approve / reject
                                                                       → resume / abort
                                          → append-only audit log + live dashboard

The gate is the substrate; the layer above it — measuring whether the gate's judgment is any good — is the contribution.
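The rules-then-LLM flow in the diagram can be sketched roughly as follows. This is a minimal illustration, not the repo's implementation: the `RULES` patterns, the `ALLOW_BELOW` / `BLOCK_AT_OR_ABOVE` thresholds, the `classify` function, and the `stub_scorer` stand-in for the LLM judge are all hypothetical names and values chosen for the example.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Verdict(Enum):
    SAFE = "safe"
    APPROVAL = "approval-required"
    BLOCKED = "blocked"


@dataclass
class Decision:
    verdict: Verdict
    reason: str


# Hypothetical deterministic rules, checked before any model call.
RULES = [
    (re.compile(r"\brm\s+-rf\b"), Verdict.BLOCKED, "recursive force delete"),
    (re.compile(r"\bgit\s+push\b.*\bmain\b"), Verdict.APPROVAL, "push to main"),
]

# Hypothetical cut-offs on the 0-100 risk score (assumed, not from the repo).
ALLOW_BELOW = 30
BLOCK_AT_OR_ABOVE = 90


def classify(action: str, score_risk: Callable[[str], int]) -> Decision:
    """Apply rules first; consult the risk scorer only when no rule matches."""
    for pattern, verdict, reason in RULES:
        if pattern.search(action):
            return Decision(verdict, f"rule: {reason}")
    score = score_risk(action)
    if score < ALLOW_BELOW:
        return Decision(Verdict.SAFE, f"score {score} below {ALLOW_BELOW}")
    if score >= BLOCK_AT_OR_ABOVE:
        return Decision(Verdict.BLOCKED, f"score {score} at or above {BLOCK_AT_OR_ABOVE}")
    return Decision(Verdict.APPROVAL, f"score {score} in the escalation band")


def stub_scorer(action: str) -> int:
    """Stand-in for the LLM judge: a fixed lookup that defaults to 10."""
    return {"kubectl apply -f prod.yaml": 65, "ls -la": 5}.get(action, 10)


if __name__ == "__main__":
    for cmd in ("ls -la", "rm -rf /tmp/x", "kubectl apply -f prod.yaml"):
        print(cmd, "->", classify(cmd, stub_scorer))
```

Putting the deterministic rules first keeps the obvious cases cheap and reproducible; only actions that no rule recognizes spend a model call.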

The calibration eval

The guardian emits a 0–100 risk score per action. Sweeping the auto-allow-vs-escalate threshold produces the safety/utility tradeoff curve — missed-danger rate vs false-alarm rate — under an asymmetric cost matrix (auto-allowing danger is catastrophic; a false alarm is annoyance).

python -m eval.calibrate            # prints the sweep + cost-min & Neyman-Pearson points + AURC
python -m eval.calibrate --plot     # also writes eval/calibration.png

Figure: Headroom calibration curve (safety/utility tradeoff and expected cost vs threshold).
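As a rough illustration of what such a sweep computes, the sketch below walks a threshold over toy scores and reports the missed-danger rate, false-alarm rate, and expected cost at each point. The cost values, the `sweep` and `cost_minimizing` helpers, and the eight made-up scores are assumptions for the example; `eval.calibrate` may define these differently.

```python
from dataclasses import dataclass

# Hypothetical asymmetric costs: auto-allowing danger is far worse
# than an unnecessary escalation.
COST_MISS = 100.0
COST_FALSE_ALARM = 1.0


@dataclass
class OperatingPoint:
    threshold: int
    miss_rate: float         # dangerous actions auto-allowed / all dangerous
    false_alarm_rate: float  # benign actions escalated / all benign
    expected_cost: float     # mean cost per action


def sweep(scores, dangerous, thresholds=range(0, 102)):
    """Escalate when score >= threshold, auto-allow otherwise.

    Assumes the set contains at least one dangerous and one benign action.
    """
    n_danger = sum(dangerous)
    n_benign = len(dangerous) - n_danger
    points = []
    for t in thresholds:
        misses = sum(1 for s, d in zip(scores, dangerous) if d and s < t)
        alarms = sum(1 for s, d in zip(scores, dangerous) if not d and s >= t)
        cost = (COST_MISS * misses + COST_FALSE_ALARM * alarms) / len(scores)
        points.append(OperatingPoint(t, misses / n_danger, alarms / n_benign, cost))
    return points


def cost_minimizing(points):
    """Lowest expected cost; ties resolve to the lowest threshold."""
    return min(points, key=lambda p: p.expected_cost)


# Made-up toy data, for illustration only (not the 125-action set).
SCORES = [5, 10, 20, 40, 55, 35, 80, 95]
DANGEROUS = [False, False, False, False, False, True, True, True]

if __name__ == "__main__":
    print(cost_minimizing(sweep(SCORES, DANGEROUS)))
```

In this toy set one dangerous action scores lower than two benign ones, so the cheapest policy still escalates those two benign actions: a small-scale version of the tradeoff the curve shows.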

On a 125-action hand-labeled set deliberately weighted toward hard cases (ambiguous-middle actions; obfuscated or adversarial ones such as base64 rm -rf, homoglyph maín, path traversal, and exfil; and scary-but-benign traps), the curve tells the story. The set is small and reported as such; it is not a published benchmark. Against adversarial or ambiguous inputs the guard cannot safely auto-decide: reaching a 0% dangerous-miss rate forces escalating almost everything (the cost-minimizing policy is ≈ "ask the human about everything"). That is both the artifact and the finding: the guard's real operating limits can be read off the curve, and on hard inputs it cannot filter safely, which is exactly what forces high escalation onto a human (→ the research direction below).

The noise floor — why a single "ground truth" is a lie. "Is this action risky?" is subjective: even careful reviewers disagree, so a guard can't be scored against one objective label. python -m eval.noise_floor has three LLM-persona reviewers (cautious / pragmatic / strict-compliance) label the 125-action set and reports Fleiss' κ ≈ 0.52 — only moderate agreement, and the pragmatic reviewer labeled 87 actions SAFE vs the cautious one's 45 (that gap is the risk-tolerance axis). That's the irreducible disagreement, and it's the yardstick: a guard that agrees with reviewers as often as they agree with each other is at "human" level. (Personas are a proxy for human annotators — reported as such, not the true human floor.)
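For readers unfamiliar with the statistic, here is a minimal sketch of the standard Fleiss' κ formula for a fixed number of raters per item. The `fleiss_kappa` function and the four-item example are illustrative assumptions; `eval.noise_floor` may compute it differently (for example through a library).

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for `ratings`: one list of labels per item.

    Every item must carry the same number of ratings (at least 2). The result
    is undefined (division by zero) if all ratings fall in a single category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]

    # Observed agreement: per-item share of rater pairs that agree, averaged.
    p_items = [
        (sum(c[k] ** 2 for k in categories) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement: sum of squared overall category proportions.
    p_category = [sum(c[k] for c in counts) / (n_items * n_raters) for k in categories]
    p_chance = sum(p * p for p in p_category)

    return (p_observed - p_chance) / (1 - p_chance)


if __name__ == "__main__":
    labels = ["SAFE", "APPROVAL", "BLOCKED"]
    # Made-up ratings from three reviewers on four actions.
    toy = [
        ["SAFE", "SAFE", "SAFE"],
        ["SAFE", "SAFE", "APPROVAL"],
        ["APPROVAL", "APPROVAL", "BLOCKED"],
        ["BLOCKED", "BLOCKED", "BLOCKED"],
    ]
    print(round(fleiss_kappa(toy, labels), 3))
```

κ is 1 when reviewers always agree and 0 when they agree no more often than chance would predict, which is why a value near 0.52 reads as only moderate agreement.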

Precision note (so the claim is exact): the curve is operating-point analysis under asymmetric cost (selective classification) — not yet formal calibration in the ECE/reliability sense. "Calibration" is the theme; the claim is precisely the measured tradeoff + noise floor above. Formal calibration metrics (ECE, Brier, reliability diagrams), an adversarial/evasion set, published benchmarks (AgentDojo, InjecAgent), and frontier methods (conformal prediction, trajectory-level guarding) are deeper rigor on the roadmap — see Stage 1.
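To make "operating-point analysis" concrete, the sketch below shows one common way to pick a Neyman-Pearson point (the fewest escalations that keep the miss rate under a cap) and to compute AURC, the area under the risk-coverage curve. Both definitions are assumptions drawn from the selective-classification literature, and the function names are hypothetical; the repo's `eval.calibrate` may differ in detail.

```python
def neyman_pearson_threshold(scores, dangerous, max_miss_rate=0.0):
    """Highest threshold whose miss rate stays within the cap.

    Actions with score >= threshold are escalated. The miss rate never
    decreases as the threshold rises, so the highest admissible threshold
    is also the one with the fewest escalations. Assumes at least one
    dangerous action.
    """
    n_danger = sum(dangerous)
    best = 0  # threshold 0 escalates everything, so it always meets the cap
    for t in range(0, 102):
        misses = sum(1 for s, d in zip(scores, dangerous) if d and s < t)
        if misses / n_danger <= max_miss_rate:
            best = t
    return best


def aurc(scores, dangerous):
    """Mean selective risk as coverage grows one action at a time.

    Coverage here means auto-allowed actions, taken least-risky first;
    risk is the share of covered actions that are actually dangerous.
    Lower is better.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    risks, n_bad = [], 0
    for covered, i in enumerate(order, start=1):
        n_bad += int(dangerous[i])
        risks.append(n_bad / covered)
    return sum(risks) / len(risks)


if __name__ == "__main__":
    # Made-up toy data, for illustration only.
    scores = [5, 10, 20, 40, 55, 35, 80, 95]
    dangerous = [False, False, False, False, False, True, True, True]
    print("NP threshold at 0% misses:", neyman_pearson_threshold(scores, dangerous))
    print("AURC:", round(aurc(scores, dangerous), 4))
```

The cost-minimizing point depends on the cost matrix you choose, whereas the Neyman-Pearson point only needs a tolerated miss rate, which is often easier to agree on.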

The throughline: Stopping an agent is a framework feature. Knowing when to stop it — selective classification under asymmetric cost with label noise — is the problem, and here's the curve that shows the tradeoff and makes the operating point a measured choice.

Research direction — "Oversight Has a Capacity"

The deepest version of the thesis (full detail → docs/RESEARCH.md). Agent safety is usually modeled as a perfect, infinite-capacity human checking a fallible agent against a ground-truth "safe," with every error costing the same. All three assumptions are false: the label is subjective (no ground truth; measured Fleiss' κ ≈ 0.52), the human is endogenous (escalation fatigues them, so the guard degrades its own oracle), and the cost is asymmetric. So the optimal when-to-escalate policy must be load-aware, and realized safety is an inverted-U:

more human oversight can make a system less safe — the safety-optimal guard escalates below the human's capacity.

That's selective classification under asymmetric cost with noisy labels and an endogenous expert — the last clause is prior art (FALCON / DeCCaF), which we demonstrate in the LLM-agent setting via a simulated inverted-U experiment.
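A toy model shows how an inverted-U can arise. It assumes that escalating more actions raises the chance a dangerous one reaches the reviewer, while every extra escalation lowers the chance the reviewer catches it. The functional forms, the `HALF_RECALL` and `CAPACITY` parameters, and all function names are illustrative assumptions, not the model used by `eval.inverted_u`.

```python
import math

# Illustrative parameters (assumed, not measured).
HALF_RECALL = 20   # escalations/day at which half the dangerous actions get surfaced
CAPACITY = 100     # escalations/day scale over which reviewer attention decays


def recall(k: int) -> float:
    """Chance a dangerous action is among the k escalated (saturates as k grows)."""
    return k / (k + HALF_RECALL)


def reviewer_accuracy(k: int) -> float:
    """Chance a reviewer handling k escalations/day rejects a dangerous one."""
    return math.exp(-k / CAPACITY)


def realized_safety(k: int) -> float:
    """A dangerous action is stopped only if it is escalated AND caught."""
    return recall(k) * reviewer_accuracy(k)


def safest_rate(max_rate: int = 500) -> int:
    """Escalation rate (per day) that maximizes realized safety."""
    return max(range(1, max_rate + 1), key=realized_safety)


if __name__ == "__main__":
    for k in (5, safest_rate(), CAPACITY, 500):
        print(f"{k:>3} escalations/day -> realized safety {realized_safety(k):.3f}")
```

Under these assumptions the peak sits well below `CAPACITY`, and 500 escalations a day scores worse than 5, matching the Guard A / Guard B thought experiment in shape, though not in any measured number.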

Scope: this matters only where the judgment is subjective with delayed outcomes (agent oversight, content moderation, alert triage) — not where there's objective ground truth (e.g. banking fraud, where you just use the better predictor). A novelty check confirmed the core mechanisms are prior art — the endogenous-fatiguing-reviewer + load-aware deferral is FALCON / DeCCaF, the flooding attack is SOC alert-fatigue, trajectory guarding is ShieldAgent et al. This is an applied / measurement / systems project, not a novel-theory paper — the contribution is the open-source firewall + the measurement that brings these ideas together for LLM agents.

The working paper draft is docs/DRAFT.md (built from the figures + real numbers).

Getting Started

pip install -r requirements.txt        # or: uv sync
cp .env.example .env                    # add ANTHROPIC_API_KEY (and optional LangSmith key)

Usage

uvicorn headroom.api:app              # dashboard → http://localhost:8000 (Run demo needs no key)

The dashboard shows a live activity feed, pending approvals with the guardian's reasoning, approve/reject, and a per-run cost line. The Run demo button drives a full SAFE → BLOCKED → APPROVAL flow with no API key. A Calibration explorer lets you drag the guard's aggressiveness and watch the missed-danger vs false-alarm tradeoff recolor across all 125 actions in real time — the curve made tangible, replaying saved scores (no API calls). Run python -m eval.calibrate once to populate it.

Plug it into your own agent (MCP)

Headroom is also an MCP server, so your own agent — Claude Code, Cursor, or a custom one — can route its actions through the guard. The agent doesn't open this app; it calls two tools (submit_action_for_review, check_review) before it acts. Add ~4 lines to your MCP client config:

{
  "mcpServers": {
    "headroom": { "command": "python", "args": ["-m", "headroom.mcp_server"] }
  }
}

Now when your agent wants to git push or rm -rf, it asks Headroom first — which returns allow / blocked, or pending (queued for a human to approve on the dashboard). See it without a real client:

python scripts/mcp_demo.py               # an external agent submits 3 actions over MCP → verdicts
python -m headroom.mcp_server          # run the MCP server itself (stdio)

This is a cooperative integration: it works because the agent is configured to ask first. True no-bypass enforcement (gateway, host hook, or sandbox) is the next rung on the enforcement ladder above this.
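From the agent's side, the ask-before-acting loop might look like the sketch below. Only the two tool names (submit_action_for_review, check_review) and the allow / blocked / pending verdicts come from this README. The argument names (`action`, `review_id`), the reply shape, the `ToolClient` interface, and the `FakeHeadroom` stand-in are assumptions for illustration; check the server's actual tool schemas before relying on them.

```python
import time
from typing import Any, Protocol


class ToolClient(Protocol):
    """Anything that can call an MCP tool by name (hypothetical interface)."""

    def call_tool(self, name: str, arguments: dict[str, Any]) -> dict[str, Any]: ...


def ask_headroom(client: ToolClient, action: str,
                 poll_seconds: float = 1.0, max_polls: int = 30) -> str:
    """Return 'allow' or 'blocked', polling while the verdict is 'pending'.

    Fails closed: if no human decides within max_polls, treat it as blocked.
    """
    reply = client.call_tool("submit_action_for_review", {"action": action})
    review_id = reply.get("review_id")
    for _ in range(max_polls):
        if reply["verdict"] != "pending":
            return reply["verdict"]
        time.sleep(poll_seconds)
        reply = client.call_tool("check_review", {"review_id": review_id})
    return reply["verdict"] if reply["verdict"] != "pending" else "blocked"


class FakeHeadroom:
    """In-memory stand-in for the server: blocks rm -rf, approves after 2 polls."""

    def __init__(self) -> None:
        self.polls_left = 2

    def call_tool(self, name: str, arguments: dict[str, Any]) -> dict[str, Any]:
        if name == "submit_action_for_review":
            if "rm -rf" in arguments["action"]:
                return {"verdict": "blocked", "review_id": None}
            return {"verdict": "pending", "review_id": "r1"}
        self.polls_left -= 1
        verdict = "pending" if self.polls_left > 0 else "allow"
        return {"verdict": verdict, "review_id": "r1"}


if __name__ == "__main__":
    print(ask_headroom(FakeHeadroom(), "rm -rf /", poll_seconds=0))
    print(ask_headroom(FakeHeadroom(), "git push origin main", poll_seconds=0))
```

Treating a timeout as "blocked" is a deliberate choice in this sketch: an agent that proceeds when nobody answers would turn reviewer fatigue into silent approval.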

Development

pytest                                  # unit + integration (no API key needed)
python -m eval.run_eval                 # guardian confusion matrix + recall / precision
python -m eval.calibrate --plot         # the calibration curve (cost matrix, sweep, NP point, AURC) + PNG
python -m eval.noise_floor              # inter-annotator kappa — the noise floor (LLM-persona proxy)
bash scripts/smoke-check.sh             # key-file checks + pytest

Evaluation is cost-aware by design — prompt caching, the Message Batches API, pre-recorded worker traces, and stratified sampling, with a built-in judge cost/cache meter (GET /api → judge_cost). Methodology and targets → docs/EVAL.md.

Reproduce the results

git clone … && cd headroom && pip install -r requirements.txt

# Key-free (replay committed scores / pure simulation) — exact paper numbers:
python -m eval.inverted_u                # the inverted-U (Figure 2)  [reads committed calibration.json]
python -m eval.fatigue_attack            # the flooding attack (Figure 4)

# Need an ANTHROPIC_API_KEY (LLM scoring; a few cents each):
python -m eval.calibrate --plot          # calibration curve + per-action scores (Figure 1)
python -m eval.noise_floor               # the κ noise floor       [committed: eval/noise_floor.json]
python -m eval.compare_models            # Haiku vs Sonnet (Figure 3) [committed: eval/model_comparison.json]
python -m eval.nseed --temp 0 --n 3      # deployed-setting AURC spread

The four figures live in eval/ (calibration.png, inverted_u.png, model_comparison.png, fatigue_attack.png); the κ noise floor is a number, not a plot. The two LLM-scored artifacts (noise_floor.json, model_comparison.json) are committed, so the cited numbers ship even without a key.

Stack

Python 3.12 · LangGraph (interrupt() HITL + SqliteSaver) · langchain-anthropic (Claude) · FastAPI · SQLite · MCP (FastMCP) · LangSmith · Tailwind-CDN dashboard.

License

MIT — see LICENSE.
