DeadBranchBench is a benchmark for measuring wasted agent work.
DeadBranchBench captures agent work as events, derives trace skeletons, and measures where agent compute dies after review.
DeadBranchBench is observability for wasted agent work.
It does not assume a pruning strategy. It does not assume a theory of agent planning. It asks one business question:
Is expensive dead-branch materialization a large enough bill?
The first asset is the event schema plus a boring CLI.
Every observed run records objective events first. Branches, costs, and DBR are derived later after review.
No pruning. No optimizer. No agent intelligence yet.
deadbranchbench/
├── src/deadbranchbench/
├── tasks/
├── traces/
├── metrics/
├── reports/
├── runners/
└── examples/
Use these labels carefully:
- live: directly contributed to the final solution.
- support: failed or ended, but produced information later used by a live branch.
- deferred: not used in this run, but intentionally preserved for plausible future use.
- dead: consumed cost and produced no measurable contribution.
- unknown: not yet labeled.
The enemy is not every dead branch. Cheap exploration is fine. The enemy is expensive dead branches.
Dead Branch Cost (DBC) = sum(cost(branch)) for branches labeled dead
Dead Branch Ratio (DBR) = DBC / total branch cost
Branch ROI = contribution_value / branch cost
Time To Death (TTD) = ended_at - started_at for dead/abandoned/failed/rolled_back branches
Support/Failure Ratio = support branches with failed execution / all failed branches
Success/Dead Ratio = dead branches with succeeded execution / all succeeded branches
Failed-Task Spend = total cost of runs that failed an external task evaluator / total cost
Outcome Waste Floor = DBC for passing or unevaluated runs; total run cost for runs that fail an external evaluator
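As a minimal sketch of the first three definitions above, the following computes DBC, DBR, and Branch ROI from a list of labeled branches. The `Branch` dataclass and its field names are assumptions made for illustration; they are not the actual trace schema, and per-branch cost is taken as already computed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Branch:
    # Hypothetical minimal record; the real trace schema may differ.
    branch_id: str
    cost: float
    contribution_status: str  # live, support, deferred, dead, or unknown
    contribution_value: Optional[float] = None


def dead_branch_cost(branches: List[Branch]) -> float:
    """DBC: summed cost of branches labeled dead."""
    return sum(b.cost for b in branches if b.contribution_status == "dead")


def dead_branch_ratio(branches: List[Branch]) -> float:
    """DBR: DBC divided by total branch cost (0.0 when nothing was spent)."""
    total = sum(b.cost for b in branches)
    return dead_branch_cost(branches) / total if total else 0.0


def branch_roi(branch: Branch) -> Optional[float]:
    """Branch ROI: contribution value per unit cost, or None when undefined."""
    if branch.contribution_value is None or branch.cost == 0:
        return None
    return branch.contribution_value / branch.cost


# Illustrative numbers only, not measured data.
branches = [
    Branch("b1", 1000.0, "support"),
    Branch("b2", 3000.0, "dead"),
    Branch("b3", 6000.0, "live", contribution_value=12000.0),
]
dbc = dead_branch_cost(branches)
dbr = dead_branch_ratio(branches)
print(f"DBC={dbc} DBR={dbr:.2%}")
```

Because only branches labeled dead enter the numerator, a trace whose branches are all still unknown yields a DBR of zero under this sketch.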
The default cost model is intentionally simple and trace-local. Each trace can override the weights:
{
  "token": 1.0,
  "tool_call": 1000.0,
  "retry": 500.0,
  "rollback": 2000.0,
  "edit": 100.0,
  "test_run": 500.0,
  "wall_time_second": 0.0
}
From this folder:
python3 -m pip install -e .
deadbranchbench validate examples/minimal_trace.json
deadbranchbench compute examples/minimal_trace.json
deadbranchbench report examples/minimal_trace.json --html --output reports/minimal_trace.html
For the public benchmark release and the 10-minute external run path, see:
Reproduce the current validated cohort bracket:
python3 runners/report_waste_bracket.py --bundle validated_debug_bundle_20260619
Current bracket:
1.71% provenance floor <= human-reviewed waste pending <= 31.81% failed-task ceiling
The HTML report shows:
- DBR
- DBC
- branch table
- cost breakdown
- branch tree
- top waste branches
Capture a command as objective JSONL telemetry:
deadbranchbench observe -- python3 -c "print('hello')"
By default this writes:
runs/<run_id>/run.jsonl
The event stream captures:
- command_start
- command_end
- command_failed
- file_snapshot_before
- file_snapshot_after
- file_changed
- stdout_summary
- stderr_summary
Events never claim dead, live, support, or deferred.
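The rule that events carry no contribution labels can be checked mechanically. The sketch below reads JSONL lines, counts event types, and rejects any event whose type is outside the objective set or whose string values include a contribution label. The `event_type` key is an assumption for illustration; the real event schema may use a different field name.

```python
import json

OBJECTIVE_EVENT_TYPES = {
    "command_start", "command_end", "command_failed",
    "file_snapshot_before", "file_snapshot_after", "file_changed",
    "stdout_summary", "stderr_summary",
}
CONTRIBUTION_LABELS = {"dead", "live", "support", "deferred"}


def check_events(lines, type_key="event_type"):
    """Count event types and fail on anything that looks like a label.

    `type_key` is a hypothetical field name, not confirmed by the schema.
    """
    counts = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines between events
        event = json.loads(line)
        kind = event.get(type_key)
        if kind not in OBJECTIVE_EVENT_TYPES:
            raise ValueError(f"line {number}: unexpected event type {kind!r}")
        string_values = {v for v in event.values() if isinstance(v, str)}
        if string_values & CONTRIBUTION_LABELS:
            raise ValueError(f"line {number}: event carries a contribution label")
        counts[kind] = counts.get(kind, 0) + 1
    return counts


# Hand-written sample lines, not real tool output.
sample = [
    '{"event_type": "command_start", "command": "python3 -c pass"}',
    '{"event_type": "file_changed", "path": "app.py"}',
    '{"event_type": "command_end", "exit_code": 0}',
]
event_counts = check_events(sample)
print(event_counts)
```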
Privacy defaults:
- ignored directories include .git, node_modules, .venv, venv, runs, caches, and build outputs
- ignored files include .env, .npmrc, .pypirc, SSH keys, and common credential filenames
- ignored paths include names containing api_key, access_token, password, private_key, secret, or credential
- ignored suffixes include common key/certificate, database, archive, image, audio, video, and PDF formats
- files larger than 1 MiB are skipped before hashing
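The defaults above amount to a path filter applied before a file is hashed. A minimal sketch of such a filter follows; the sets are illustrative subsets, the suffix list in particular is an assumed sample, and `should_skip` is a hypothetical helper rather than the tool's own function.

```python
from pathlib import PurePosixPath

# Illustrative subsets only; the tool's real ignore lists are longer.
IGNORED_DIRS = {".git", "node_modules", ".venv", "venv", "runs"}
IGNORED_FILES = {".env", ".npmrc", ".pypirc"}
SENSITIVE_FRAGMENTS = (
    "api_key", "access_token", "password", "private_key", "secret", "credential",
)
IGNORED_SUFFIXES = {".pem", ".key", ".sqlite", ".zip", ".png", ".pdf"}  # assumed sample
MAX_BYTES = 1024 * 1024  # 1 MiB


def should_skip(path: str, size_bytes: int) -> bool:
    """Return True when a file should be excluded from snapshots."""
    p = PurePosixPath(path)
    # Any ignored directory anywhere above the file excludes it.
    if any(part in IGNORED_DIRS for part in p.parts[:-1]):
        return True
    if p.name in IGNORED_FILES:
        return True
    # Sensitive name fragments are matched case-insensitively on the whole path.
    lowered = path.lower()
    if any(fragment in lowered for fragment in SENSITIVE_FRAGMENTS):
        return True
    if p.suffix.lower() in IGNORED_SUFFIXES:
        return True
    # Size is checked last, mirroring "skipped before hashing".
    return size_bytes > MAX_BYTES


print(should_skip("src/app.py", 2048))
print(should_skip("config/db_password.txt", 10))
```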
Convert objective events into an unknown-labeled trace:
deadbranchbench build-trace runs/<run_id>/run.jsonl --output traces/<run_id>.json
deadbranchbench label traces/<run_id>.json --interactive
deadbranchbench compute traces/<run_id>.json
build-trace keeps contribution_status as unknown. Labels create DBR;
events do not.
Traces can include an optional top-level evaluation object:
{
  "success": false,
  "evaluator": "external-tests-v1",
  "artifact_path": "runs/artifacts/final.py",
  "detail": "hidden assertion failed"
}
When present, evaluation.success is treated as task-level ground truth. The
classic DBR fields still mean exactly what they meant before: only branches
reviewed or labeled dead count as dead-branch cost. Outcome fields sit beside
DBR so failed-but-executed work is visible:
- failed_task_cost: whole-run cost when the external evaluator fails.
- failed_task_undetected_cost: failed-task cost not already counted as dead branch cost.
- outcome_waste_floor_cost: dead branch cost for passing/unevaluated runs, whole-run cost for externally failed runs.
This is the benchmark hook for confidently wrong agent work: a run can have
dead_branch_ratio == 0 and still have failed_task_ratio == 1 if it passed
its own checks but failed external ground truth.
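The outcome fields can be derived from three inputs: total run cost, dead branch cost, and the optional evaluation.success value. The sketch below follows the field descriptions above; `outcome_fields` is a hypothetical helper, and treating failed_task_ratio as failed-task cost over total cost for a single run is an assumption.

```python
from typing import Optional


def outcome_fields(
    total_cost: float,
    dead_branch_cost: float,
    evaluation_success: Optional[bool],
) -> dict:
    """Derive outcome fields for one run.

    evaluation_success is None when the trace has no evaluation object.
    """
    failed = evaluation_success is False
    failed_task_cost = total_cost if failed else 0.0
    return {
        "dead_branch_ratio": dead_branch_cost / total_cost if total_cost else 0.0,
        "failed_task_cost": failed_task_cost,
        # Cost of a failed run that review had not already counted as dead.
        "failed_task_undetected_cost": max(failed_task_cost - dead_branch_cost, 0.0),
        # Dead cost for passing or unevaluated runs, whole-run cost on failure.
        "outcome_waste_floor_cost": total_cost if failed else dead_branch_cost,
        "failed_task_ratio": failed_task_cost / total_cost if total_cost else 0.0,
    }


# A confidently wrong run: nothing labeled dead, external evaluator fails.
confidently_wrong = outcome_fields(5000.0, 0.0, False)
# A passing run with some reviewed dead work.
passing = outcome_fields(5000.0, 1000.0, True)
print(confidently_wrong)
print(passing)
```

The first call reproduces the case described above: dead_branch_ratio is 0 while failed_task_ratio is 1.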
Observe a multi-step command script:
deadbranchbench observe-script demo_debug_session/commands.json --run-id observed-debug-session-001 --output runs/observed-debug-session-001/run.jsonl --cwd demo_debug_session
Use output redaction when terminal streams may contain sensitive data:
deadbranchbench observe --redact-stdout --redact-stderr -- COMMAND...
deadbranchbench observe-script commands.json --redact-stdout --redact-stderr
This repository includes a small observed run where a command writes a failing test and exits nonzero:
deadbranchbench observe --run-id observed-real-task-001 --output runs/observed-real-task-001/run.jsonl --cwd demo_agent_task -- python3 create_failing_test.py
deadbranchbench build-trace runs/observed-real-task-001/run.jsonl --output traces/observed_real_task_unlabeled.json
deadbranchbench report traces/observed_real_task_unlabeled.json --html --output reports/observed_real_task_unlabeled.html
Before review:
contribution_status: unknown
unknown_ratio: 100%
DBR: 0%
TTD: present because execution_status is failed
After review, the labeled copy at traces/observed_real_task_labeled.json
marks the branch dead, producing DBR and top-waste output.
The stronger proof is a five-command debugging session:
deadbranchbench observe-script demo_debug_session/commands.json --run-id observed-debug-session-001 --output runs/observed-debug-session-001/run.jsonl --cwd demo_debug_session --redact-stdout
deadbranchbench build-trace runs/observed-debug-session-001/run.jsonl --output traces/observed_debug_session_unlabeled.json
deadbranchbench report traces/observed_debug_session_labeled.json --html --output reports/observed_debug_session_labeled.html
It demonstrates:
- failed test output labeled support
- a bad source edit labeled dead
- a final source edit and passing test labeled live
- DBR only after review
The core thesis in five rows:
| Branch | Execution Result | Contribution |
|---|---|---|
| b1 | Failed | Support |
| b2 | Succeeded | Dead |
| b3 | Failed | Support |
| b4 | Succeeded | Live |
| b5 | Succeeded | Live |
Agent execution status and contribution status are different dimensions.
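The five rows above are enough to work through the Support/Failure and Success/Dead ratios by hand. A small sketch, with the rows transcribed as tuples and a hypothetical `ratio` helper:

```python
# (branch, execution result, contribution) transcribed from the table.
rows = [
    ("b1", "failed", "support"),
    ("b2", "succeeded", "dead"),
    ("b3", "failed", "support"),
    ("b4", "succeeded", "live"),
    ("b5", "succeeded", "live"),
]


def ratio(rows, execution, contribution):
    """Share of branches with a given execution result that carry a given label."""
    with_execution = [r for r in rows if r[1] == execution]
    if not with_execution:
        return 0.0
    hits = [r for r in with_execution if r[2] == contribution]
    return len(hits) / len(with_execution)


# Support branches with failed execution / all failed branches.
support_failure_ratio = ratio(rows, "failed", "support")
# Dead branches with succeeded execution / all succeeded branches.
success_dead_ratio = ratio(rows, "succeeded", "dead")
print(support_failure_ratio, success_dead_ratio)
```

Both failed branches turned out to be support, and one of the three succeeded branches turned out to be dead, which is why execution status alone cannot stand in for contribution.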
Reviewed labels can include:
- reviewer
- reviewed_at
- confidence
- notes
Manual wrapper API:
from deadbranchbench import Trace
trace = Trace(
    run_id="task-001",
    task_id="fix-failing-test",
    task_title="Fix failing test",
    task_category="debugging",
)
with trace.branch("Inspect failing test output") as branch:
    branch.add_tokens(input_tokens=800, output_tokens=250)
    branch.add_tool_call(2)
    branch.add_test("tests/test_auth.py::test_login")
    branch.mark_success(support=True, evidence="Found fixture mismatch")
trace.completed = True
trace.write("traces/task-001.json")
deadbranchbench import codex-log ./run.log --output traces/imported.json
deadbranchbench import langgraph ./trace.json --output traces/langgraph.json
deadbranchbench label traces/imported.json --interactive
Imported traces default to unknown contribution labels. Do not use them for DBR
claims until they are labeled.
For real LangGraph runs, attach the callback handler and import the JSONL it writes:
from deadbranchbench import LangGraphCallbackHandler
callback = LangGraphCallbackHandler(
    "runs/langgraph/raw/run_001.jsonl",
    trace_run_id="run_001",
)
result = graph.invoke(
    {"input": "fix the failing test"},
    config={"callbacks": [callback]},
)
Then convert the raw callback events into a reviewable trace:
deadbranchbench import langgraph runs/langgraph/raw/run_001.jsonl --output traces/langgraph/run_001.json --run-id run_001
deadbranchbench label traces/langgraph/run_001.json --interactive
deadbranchbench compute traces/langgraph/run_001.json
The callback records execution telemetry only. Contribution labels still come
from review. By default it avoids storing full callback payloads; pass
include_payload=True only when the run data is safe to retain.
To make the dead-vs-support rubric objective, record data-flow provenance with
stable artifact ids. A producer branch emits produces or writes; a consumer
branch emits consumes or reads. During import, matching artifacts populate
the producer branch's used_by_branches.
callback.record_provenance(
    "diagnose-node-run-id",
    produces=["diagnostic:fixture-mismatch"],
)
callback.record_provenance(
    "fix-node-run-id",
    consumes=["diagnostic:fixture-mismatch"],
)
callback.record_provenance("patch-node-run-id", writes=["app.py"])
callback.record_provenance("test-node-run-id", reads=["app.py"])
Callback metadata can also carry provenance:
graph.invoke(
    inputs,
    config={
        "callbacks": [callback],
        "metadata": {"produces": ["doc:auth-api-v2"]},
    },
)
Raw LangGraph ordering is only control-flow. It is not counted as downstream use unless explicit provenance links a consumed artifact to an earlier produced artifact.
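The matching rule can be sketched independently of the importer. In the sketch below, each record names a branch and the artifact ids it produces, writes, consumes, or reads; a consumer is linked to a producer only when the artifact was produced by an earlier record. The record layout and the `link_provenance` helper are assumptions for illustration, not the importer's actual code.

```python
from collections import defaultdict


def link_provenance(records):
    """Map each producer branch to the branches that used its artifacts.

    `records` is an ordered list of dicts with a 'branch' key and optional
    'produces', 'writes', 'consumes', and 'reads' lists of artifact ids.
    """
    producers = {}  # artifact id -> branch that most recently produced it
    used_by = defaultdict(list)
    for record in records:
        branch = record["branch"]
        # Resolve inputs before registering this record's own outputs,
        # so only earlier producers count as upstream.
        for artifact in record.get("consumes", []) + record.get("reads", []):
            producer = producers.get(artifact)
            if producer is None or producer == branch:
                continue
            if branch not in used_by[producer]:
                used_by[producer].append(branch)
        for artifact in record.get("produces", []) + record.get("writes", []):
            producers[artifact] = branch
    return dict(used_by)


# Hypothetical branch names; artifact ids reuse the examples above.
records = [
    {"branch": "diagnose", "produces": ["diagnostic:fixture-mismatch"]},
    {"branch": "explore", "produces": ["doc:auth-api-v2"]},
    {"branch": "fix", "consumes": ["diagnostic:fixture-mismatch"]},
    {"branch": "patch", "writes": ["app.py"]},
    {"branch": "test", "reads": ["app.py"]},
]
links = link_provenance(records)
print(links)
```

The explore branch ran before fix, patch, and test, yet it has no entry because nothing consumed its artifact. Ordering alone creates no link.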
The next evidence gate is not a new feature. It is a dataset:
100 reviewed LangGraph runs
average DBR
median DBR
average TTD
support/failure ratio
success/dead ratio
top dead-work patterns
Protocol: tasks/langgraph_100_run_protocol.md.
The seed fixtures cover:
- examples/minimal_trace.json for debugging
- examples/research_trace.json
- examples/refactor_trace.json
- examples/api_integration_trace.json
- examples/migration_trace.json
Generated HTML reports are in reports/.
Summarize multiple traces:
deadbranchbench summarize traces/*.json --output reports/proof_run_summary.json
deadbranchbench summarize traces/*.json --html --output reports/proof_run_summary.html
The summary report includes:
- average DBR
- median DBR
- portfolio DBR
- total dead branch cost
- top 10 waste patterns
- top 10 dead branches
- DBR by task category
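Average, median, and portfolio DBR can diverge when runs differ in size. The sketch below assumes that average and median are taken over per-run ratios, while portfolio DBR pools costs across runs (total dead cost over total cost). The run records are made-up numbers, not results from any bundle.

```python
from statistics import mean, median

# Hypothetical per-run totals for illustration only.
runs = [
    {"run_id": "r1", "dead_cost": 100.0, "total_cost": 1000.0},
    {"run_id": "r2", "dead_cost": 300.0, "total_cost": 1000.0},
    {"run_id": "r3", "dead_cost": 4000.0, "total_cost": 8000.0},
]

# Per-run DBR treats every run equally, regardless of how much it cost.
per_run_dbr = [r["dead_cost"] / r["total_cost"] for r in runs]
average_dbr = mean(per_run_dbr)
median_dbr = median(per_run_dbr)

# Portfolio DBR weights each run by its cost, so large runs dominate.
portfolio_dbr = sum(r["dead_cost"] for r in runs) / sum(r["total_cost"] for r in runs)

print(f"average={average_dbr:.2%} median={median_dbr:.2%} portfolio={portfolio_dbr:.2%}")
```

Here the single expensive run pulls the portfolio figure well above the average, which is the case that matters when the question is the size of the bill rather than the typical run.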
The first proof bundle is a set of 10 manually reconstructed Codex build-session traces:
traces/codex_01_scaffold_inspection.json
...
traces/codex_10_summary_command.json
Important limitation:
These are real Codex session reconstructions with approximate costs.
They are not provider-native telemetry exports.
Generated outputs:
reports/codex_*.html
reports/proof_run_summary.json
reports/proof_run_summary.html
Headline from the first proof bundle:
runs: 10
average_dbr: 17.72%
median_dbr: 17.21%
portfolio_dbr: 17.44%
total_dead_branch_cost: 19,540.0
total_cost: 112,070.0
Do not add pruning, scoring, or optimization until baseline DBR is measured.