Headroom

A human-in-the-loop firewall for AI coding agents — and a measurement study of when to trust the human.

OpenClaw gives agents hands. Headroom gives them brakes.

A thought experiment, before the architecture.

Two guards watch the same agent.

Guard A escalates 5 actions/day to a human reviewer.
Guard B escalates 500/day.

Most people pick B: more oversight, more safety. Here is where that intuition breaks:

	Guard A (5/day)	Guard B (500/day)
Reviewer by approval #300	reads all 5 carefully	fatigued, rubber-stamping
Malicious action buried in the stream	gets caught	sails through on a tired yes

Guard B has more oversight and the worse outcome.

The reframe: escalations aren't free. Each one spends a finite pool of human attention — the same pool the dangerous action needs. So oversight isn't only a classification problem (which action is risky?); it's a resource-allocation problem (the escalation policy spends a budget the threat is competing for).

This is the lens, not a discovery: the fatiguing-reviewer model is prior art (FALCON, DeCCaF). What this repo does is bring it to LLM-agent guards and put a curve on it.

AI agents now run code — deploy, delete, push to main. The usual safety answer is a human-approval gate. But a pause button is the easy part; the hard part is the judgment (which actions to stop) and the fact that the human reviewer is subjective and gets tired. Headroom is both a working gate and the apparatus that measures that judgment instead of guessing at it.

Two things at once, on purpose:

🛠️ The system — a worker LLM agent proposes actions; a guardian (deterministic rules + LLM risk judgment) classifies each safe / approval-required / blocked; risky ones pause the agent mid-task (LangGraph interrupt()) and wait for a human; every decision is audit-logged on a live dashboard with an interactive calibration dial.
📊 The measurement: we frame the guard as selective classification under asymmetric cost, with noisy labels and a fatiguing reviewer, and then measure it. The outputs are a calibration curve, a noise floor (Fleiss' κ ≈ 0.52), the safety-vs-oversight inverted-U, a flooding attack, and model-dependence: 4 figures plus a measured noise floor (3 regenerate key-free; the 2 LLM-scored ones ship as committed artifacts and need an Anthropic key to regenerate). Scope: this is an applied/measurement project. The mechanisms (fatigue-aware deferral, flooding attacks) are prior art we cite; the contribution is the open-source system plus the measurement. See docs/DRAFT.md.

Anyone can stop an agent. Headroom measures when to — and shows the cost of every setting.

▶ Live demo: block an rm -rf, approve a deploy, drag the calibration dial. No install, no API key.

Or run it locally in 60 seconds (no API key): uvicorn headroom.api:app → open the dashboard → Run demo (watch an rm -rf get blocked and a prod deploy pause for approval) → scroll to Calibration explorer and drag the dial.

The problem (why this isn't "just a pause button")

As agents get real hands — deploy, delete, spend money, touch prod — the bottleneck isn't can we stop them; frameworks already do that. The bottleneck is can we trust the thing deciding when to stop them. Over-gate and humans rubber-stamp every alert until the guard is useless; under-gate and something blows up. Today that line is usually set by vibes. Headroom sets it with data.

How it works

worker agent → proposes action → guardian (rules → LLM) ─┬─ SAFE     ▸ run
                                                         ├─ BLOCKED  ▸ deny
                                                         └─ APPROVAL ▸ pause (interrupt)
                                                                       → human approve / reject
                                                                       → resume / abort
                                          → append-only audit log + live dashboard

The gate is the substrate; the layer above it — measuring whether the gate's judgment is any good — is the contribution.
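The rules-then-score flow in the diagram can be sketched in a few lines. This is a minimal illustration, not the repo's guardian: the rule patterns, the Verdict type, the toy_scorer stand-in for the LLM judge, and the escalate_at threshold are all hypothetical names and values chosen for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical rule tables; the repo's real rules are not shown in this README.
BLOCK_PATTERNS = [re.compile(r"\brm\s+-rf\b")]
SAFE_PATTERNS = [re.compile(r"^(ls|pwd|git status|git diff)\b")]


@dataclass
class Verdict:
    label: str   # "SAFE", "BLOCKED", or "APPROVAL"
    reason: str


def classify(action: str, risk_scorer: Callable[[str], int],
             escalate_at: int = 20) -> Verdict:
    """Apply deterministic rules first, then fall back to a 0-100 risk score."""
    # Block rules run before allow rules so "ls && rm -rf /" cannot slip through.
    for pattern in BLOCK_PATTERNS:
        if pattern.search(action):
            return Verdict("BLOCKED", f"rule: {pattern.pattern}")
    for pattern in SAFE_PATTERNS:
        if pattern.search(action):
            return Verdict("SAFE", f"rule: {pattern.pattern}")
    # No rule matched: the score decides between auto-allow and escalation.
    score = risk_scorer(action)
    if score < escalate_at:
        return Verdict("SAFE", f"score {score} < {escalate_at}")
    return Verdict("APPROVAL", f"score {score} >= {escalate_at}")


def toy_scorer(action: str) -> int:
    """Stand-in for the LLM judge (assumption): crude keyword heuristics."""
    return 60 if ("deploy" in action or "push" in action) else 10


if __name__ == "__main__":
    for cmd in ["git status", "rm -rf /tmp/build", "kubectl deploy prod"]:
        print(cmd, "->", classify(cmd, toy_scorer))
```

In this sketch only the rules can block outright; the score only chooses between running and pausing, which mirrors the single auto-allow-vs-escalate threshold swept in the calibration eval.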

The calibration eval

The guardian emits a 0–100 risk score per action. Sweeping the auto-allow-vs-escalate threshold produces the safety/utility tradeoff curve — missed-danger rate vs false-alarm rate — under an asymmetric cost matrix (auto-allowing danger is catastrophic; a false alarm is annoyance).

python -m eval.calibrate            # prints the sweep + cost-min & Neyman-Pearson points + AURC
python -m eval.calibrate --plot     # also writes eval/calibration.png
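To make the sweep concrete, here is a small self-contained sketch of the idea: try every threshold, count missed dangers and false alarms, and price them with an asymmetric cost matrix. The toy scores, the 50:1 cost ratio, and the function names are assumptions for illustration; they are not the repo's data or the internals of eval.calibrate.

```python
from typing import NamedTuple


class Point(NamedTuple):
    threshold: int
    miss_rate: float          # dangerous actions auto-allowed / all dangerous
    false_alarm_rate: float   # benign actions escalated / all benign
    cost: float


def sweep(scored, cost_miss=50.0, cost_false_alarm=1.0):
    """scored: list of (risk_score 0-100, is_dangerous). Both classes must be present."""
    n_danger = sum(1 for _, dangerous in scored if dangerous)
    n_benign = len(scored) - n_danger
    points = []
    # Policy at threshold t: auto-allow if score < t, otherwise escalate.
    # t=0 escalates everything; t=101 auto-allows everything.
    for t in range(0, 102):
        misses = sum(1 for s, d in scored if d and s < t)
        false_alarms = sum(1 for s, d in scored if not d and s >= t)
        cost = cost_miss * misses + cost_false_alarm * false_alarms
        points.append(Point(t, misses / n_danger, false_alarms / n_benign, cost))
    return points


def cost_min(points):
    """Operating point with the lowest total cost."""
    return min(points, key=lambda p: p.cost)


def neyman_pearson(points, max_miss_rate=0.0):
    """Fewest false alarms subject to a cap on the missed-danger rate."""
    feasible = [p for p in points if p.miss_rate <= max_miss_rate]
    return min(feasible, key=lambda p: p.false_alarm_rate)


# Toy data (assumption): one dangerous action is scored low, as an obfuscated one might be.
TOY = [(5, False), (10, False), (15, True), (30, False),
       (40, False), (70, True), (85, True), (95, True)]

if __name__ == "__main__":
    pts = sweep(TOY)
    print("cost-min:", cost_min(pts))
    print("NP (0% miss):", neyman_pearson(pts))
```

In the toy data, the single low-scored dangerous action pins the threshold down: catching it means escalating half of the benign actions too. That is the same effect the real curve shows on hard inputs, at a much smaller scale.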

The evaluation set is 125 hand-labeled actions, deliberately weighted toward hard cases: ambiguous-middle actions, obfuscated or adversarial ones (base64 rm -rf, homoglyph maín, path traversal, exfil), and scary-but-benign traps. The set is small and reported as such; it is not a published benchmark. On it the curve tells the story: against adversarial or ambiguous inputs the guard cannot safely auto-decide. Reaching a 0% dangerous-miss rate forces it to escalate almost everything, so the cost-minimizing policy is approximately "ask the human about everything". That is both the artifact and the finding. You read the guard's real operating limits off a measured curve, and on hard inputs it cannot filter safely, which is exactly what pushes high escalation onto a human (→ the research direction below).
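The calibrate command also reports AURC, the area under the risk-coverage curve. The README does not spell out the repo's exact definition, so the following is only one common selective-classification reading adapted to a guard: auto-allow actions in ascending risk-score order, and at each coverage level record the share of auto-allowed actions that were actually dangerous. Lower is better. The function name and toy data are assumptions.

```python
def aurc(scored):
    """Area under the risk-coverage curve for an auto-allow policy.

    scored: list of (risk_score, is_dangerous).
    Actions are auto-allowed lowest score first; at coverage k/n the selective
    risk is the fraction of the k auto-allowed actions that are dangerous.
    The area is approximated as the mean risk over all n coverage levels.
    """
    ordered = sorted(scored, key=lambda pair: pair[0])
    dangerous_so_far = 0
    risks = []
    for k, (_, is_dangerous) in enumerate(ordered, start=1):
        dangerous_so_far += int(is_dangerous)
        risks.append(dangerous_so_far / k)
    return sum(risks) / len(risks)


# Toy examples (assumption): a guard that ranks danger last vs one that ranks it first.
WELL_RANKED = [(10, False), (20, False), (80, True), (90, True)]
BADLY_RANKED = [(10, True), (20, True), (80, False), (90, False)]

if __name__ == "__main__":
    print("well ranked :", aurc(WELL_RANKED))
    print("badly ranked:", aurc(BADLY_RANKED))
```

A guard whose scores push dangerous actions to the end of the auto-allow order keeps selective risk near zero over most of the coverage range, which is what a low AURC summarizes in a single number.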

The noise floor — why a single "ground truth" is a lie. "Is this action risky?" is subjective: even careful reviewers disagree, so a guard can't be scored against one objective label. python -m eval.noise_floor has three LLM-persona reviewers (cautious / pragmatic / strict-compliance) label the 125-action set and reports Fleiss' κ ≈ 0.52 — only moderate agreement, and the pragmatic reviewer labeled 87 actions SAFE vs the cautious one's 45 (that gap is the risk-tolerance axis). That's the irreducible disagreement, and it's the yardstick: a guard that agrees with reviewers as often as they agree with each other is at "human" level. (Personas are a proxy for human annotators — reported as such, not the true human floor.)
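For readers who want to see what the κ number measures, here is a short sketch of the standard Fleiss' kappa formula for a fixed number of raters per item. It is not the code in eval.noise_floor, and the four example items are made up for illustration.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa. ratings: one list of labels per item, same rater count each.

    Undefined (division by zero) when every rating is the same single label.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]
    categories = sorted({label for item in ratings for label in item})

    # Observed agreement: share of rater pairs that agree, averaged over items.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall label proportions.
    proportions = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_chance = sum(p * p for p in proportions)

    return (p_observed - p_chance) / (1 - p_chance)


# Made-up example: three reviewers, four actions.
EXAMPLE = [
    ["SAFE", "SAFE", "SAFE"],
    ["SAFE", "APPROVAL", "APPROVAL"],
    ["BLOCKED", "BLOCKED", "BLOCKED"],
    ["SAFE", "APPROVAL", "BLOCKED"],
]

if __name__ == "__main__":
    print(round(fleiss_kappa(EXAMPLE), 3))
```

Kappa is 1 when reviewers always agree and 0 when they agree no more often than chance would predict from the label frequencies, which is why a value around 0.52 reads as only moderate agreement.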

Precision note (so the claim is exact): the curve is operating-point analysis under asymmetric cost (selective classification) — not yet formal calibration in the ECE/reliability sense. "Calibration" is the theme; the claim is precisely the measured tradeoff + noise floor above. Formal calibration metrics (ECE, Brier, reliability diagrams), an adversarial/evasion set, published benchmarks (AgentDojo, InjecAgent), and frontier methods (conformal prediction, trajectory-level guarding) are deeper rigor on the roadmap — see Stage 1.

The throughline: Stopping an agent is a framework feature. Knowing when to stop it — selective classification under asymmetric cost with label noise — is the problem, and here's the curve that shows the tradeoff and makes the operating point a measured choice.

Research direction — "Oversight Has a Capacity"

The deepest version of the thesis (full detail → docs/RESEARCH.md). Agent safety is usually modeled as a perfect, infinite human checking a fallible agent against a ground-truth "safe." All three are false: the label is subjective (no ground truth; measured Fleiss' κ ≈ 0.52), the human is endogenous (escalation fatigues them — the guard degrades its own oracle), and the cost is asymmetric. So the optimal when-to-escalate policy must be load-aware, and realized safety is an inverted-U:

more human oversight can make a system less safe — the safety-optimal guard escalates below the human's capacity.

That's selective classification under asymmetric cost with noisy labels and an endogenous expert — the last clause is prior art (FALCON / DeCCaF), which we demonstrate in the LLM-agent setting via a simulated inverted-U experiment.
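A toy model shows how the inverted-U can arise. Assume the guard escalates the k highest-scored actions, the reviewer catches an escalated dangerous action with a probability that decays with load k, and anything not escalated runs unchecked. The exponential decay, the 0.95 fresh accuracy, the fatigue scale of 40, and the ranks of the dangerous actions are all illustrative assumptions; this is not the model used by eval.inverted_u.

```python
import math

N_ACTIONS = 100
# Positions of the truly dangerous actions when sorted by descending risk score
# (assumption: the guard ranks most, but not all, of them near the top).
DANGEROUS_RANKS = [1, 2, 3, 4, 5, 7, 10, 15, 30, 60]


def reviewer_accuracy(load, fresh=0.95, fatigue_scale=40.0):
    """Chance the reviewer rejects a dangerous action, decaying with daily load."""
    return fresh * math.exp(-load / fatigue_scale)


def realized_safety(k, dangerous_ranks=DANGEROUS_RANKS):
    """Expected share of dangerous actions stopped when the top-k are escalated."""
    escalated_dangerous = sum(1 for rank in dangerous_ranks if rank <= k)
    return escalated_dangerous * reviewer_accuracy(k) / len(dangerous_ranks)


def safety_curve(n_actions=N_ACTIONS):
    """Realized safety for every escalation volume from 0 to n_actions."""
    return {k: realized_safety(k) for k in range(n_actions + 1)}


if __name__ == "__main__":
    curve = safety_curve()
    best_k = max(curve, key=curve.get)
    for k in (0, 5, best_k, 30, 60, 100):
        print(f"escalate {k:3d}/day -> safety {curve[k]:.3f}")
```

Under these assumptions safety rises while escalation is still picking up dangerous actions faster than the reviewer tires, then falls once each extra escalation costs more attention than it buys. Escalating everything does worse than escalating a small, well-chosen fraction.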

Scope: this matters only where the judgment is subjective with delayed outcomes (agent oversight, content moderation, alert triage) — not where there's objective ground truth (e.g. banking fraud, where you just use the better predictor). A novelty check confirmed the core mechanisms are prior art — the endogenous-fatiguing-reviewer + load-aware deferral is FALCON / DeCCaF, the flooding attack is SOC alert-fatigue, trajectory guarding is ShieldAgent et al. This is an applied / measurement / systems project, not a novel-theory paper — the contribution is the open-source firewall + the measurement that brings these ideas together for LLM agents.

The working paper draft is docs/DRAFT.md (built from the figures + real numbers).

Getting Started

pip install -r requirements.txt        # or: uv sync
cp .env.example .env                    # add ANTHROPIC_API_KEY (and optional LangSmith key)

Usage

uvicorn headroom.api:app              # dashboard → http://localhost:8000 (Run demo needs no key)

The dashboard shows a live activity feed, pending approvals with the guardian's reasoning, approve/reject, and a per-run cost line. The Run demo button drives a full SAFE → BLOCKED → APPROVAL flow with no API key. A Calibration explorer lets you drag the guard's aggressiveness and watch the missed-danger vs false-alarm tradeoff recolor across all 125 actions in real time — the curve made tangible, replaying saved scores (no API calls). Run python -m eval.calibrate once to populate it.

Plug it into your own agent (MCP)

Headroom is also an MCP server, so your own agent — Claude Code, Cursor, or a custom one — can route its actions through the guard. The agent doesn't open this app; it calls two tools (submit_action_for_review, check_review) before it acts. Add ~4 lines to your MCP client config:

{ "mcpServers": { "headroom": { "command": "python", "args": ["-m", "headroom.mcp_server"] } } }

Now when your agent wants to git push or rm -rf, it asks Headroom first — which returns allow / blocked, or pending (queued for a human to approve on the dashboard). See it without a real client:

python scripts/mcp_demo.py               # an external agent submits 3 actions over MCP → verdicts
python -m headroom.mcp_server          # run the MCP server itself (stdio)

This is cooperative integration: the agent is configured to ask, and nothing forces it to. True no-bypass enforcement (gateway, host hook, sandbox) is the next rung of the enforcement ladder above this.
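The ask-before-acting loop can be sketched without an MCP client. The two tool names, submit_action_for_review and check_review, come from this README; everything else is an assumption made for the example, including the FakeHeadroom stand-in, the review_id and status fields, and treating a rejection as "blocked". A real client would call the tools over MCP and sleep between polls.

```python
import itertools


class FakeHeadroom:
    """In-memory stand-in for the two MCP tools; return shapes are assumed."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._reviews = {}

    def submit_action_for_review(self, action: str) -> dict:
        review_id = next(self._ids)
        if "rm -rf" in action:
            status = "blocked"
        elif "push" in action or "deploy" in action:
            status = "pending"   # queued for a human on the dashboard
        else:
            status = "allow"
        self._reviews[review_id] = status
        return {"review_id": review_id, "status": status}

    def check_review(self, review_id: int) -> dict:
        return {"review_id": review_id, "status": self._reviews[review_id]}

    def human_decides(self, review_id: int, approve: bool) -> None:
        """Simulates the approve/reject click on the dashboard."""
        self._reviews[review_id] = "allow" if approve else "blocked"


def guarded_run(action, guard, execute, max_polls=3, on_wait=lambda review_id: None):
    """Ask the guard first; execute only on an explicit allow."""
    verdict = guard.submit_action_for_review(action)
    polls = 0
    while verdict["status"] == "pending" and polls < max_polls:
        on_wait(verdict["review_id"])   # a real agent would sleep here
        verdict = guard.check_review(verdict["review_id"])
        polls += 1
    if verdict["status"] == "allow":
        execute(action)
        return True
    return False   # blocked, rejected, or still pending: fail closed


if __name__ == "__main__":
    guard, ran = FakeHeadroom(), []
    approve = lambda review_id: guard.human_decides(review_id, True)
    print(guarded_run("ls", guard, ran.append))
    print(guarded_run("rm -rf /", guard, ran.append))
    print(guarded_run("git push origin main", guard, ran.append, on_wait=approve))
    print(ran)
```

The one design choice worth copying is the last line of guarded_run: if the review is still pending when the agent gives up waiting, the action does not run.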

Development

pytest                                  # unit + integration (no API key needed)
python -m eval.run_eval                 # guardian confusion matrix + recall / precision
python -m eval.calibrate --plot         # the calibration curve (cost matrix, sweep, NP point, AURC) + PNG
python -m eval.noise_floor              # inter-annotator kappa — the noise floor (LLM-persona proxy)
bash scripts/smoke-check.sh             # key-file checks + pytest

Evaluation is cost-aware by design — prompt caching, the Message Batches API, pre-recorded worker traces, and stratified sampling, with a built-in judge cost/cache meter (GET /api → judge_cost). Methodology and targets → docs/EVAL.md.

Reproduce the results

git clone … && cd Headroom && pip install -r requirements.txt

# Key-free (replay committed scores / pure simulation) — exact paper numbers:
python -m eval.inverted_u                # the inverted-U (Figure 2)  [reads committed calibration.json]
python -m eval.fatigue_attack            # the flooding attack (Figure 4)

# Need an ANTHROPIC_API_KEY (LLM scoring; a few cents each):
python -m eval.calibrate --plot          # calibration curve + per-action scores (Figure 1)
python -m eval.noise_floor               # the κ noise floor       [committed: eval/noise_floor.json]
python -m eval.compare_models            # Haiku vs Sonnet (Figure 3) [committed: eval/model_comparison.json]
python -m eval.nseed --temp 0 --n 3      # deployed-setting AURC spread

The four figures live in eval/ (calibration.png, inverted_u.png, model_comparison.png, fatigue_attack.png); the κ noise floor is a number, not a plot. The two LLM-scored artifacts (noise_floor.json, model_comparison.json) are committed, so the cited numbers ship even without a key.

Stack

Python 3.12 · LangGraph (interrupt() HITL + SqliteSaver) · langchain-anthropic (Claude) · FastAPI · SQLite · MCP (FastMCP) · LangSmith · Tailwind-CDN dashboard.

License

MIT — see LICENSE.
