DeadBranchBench

DeadBranchBench is a benchmark for measuring wasted agent work.

DeadBranchBench captures agent work as events, derives trace skeletons, and measures where agent compute dies after review.

DeadBranchBench is observability for wasted agent work.

It does not assume a pruning strategy. It does not assume a theory of agent planning. It asks one business question:

Is expensive dead-branch materialization a large enough bill?

Week 1 Scope

The first asset is the event schema plus a boring CLI.

Every observed run records objective events first. Branches, costs, and the Dead Branch Ratio (DBR) are derived later, after review.

No pruning. No optimizer. No agent intelligence yet.

Repository Shape

deadbranchbench/
├── src/deadbranchbench/
├── tasks/
├── traces/
├── metrics/
├── reports/
├── runners/
└── examples/

Contribution Labels

Use these labels carefully:

live: directly contributed to the final solution.
support: failed or ended, but produced information later used by a live branch.
deferred: not used in this run, but intentionally preserved for plausible future use.
dead: consumed cost and produced no measurable contribution.
unknown: not yet labeled.

The enemy is not every dead branch. Cheap exploration is fine. The enemy is expensive dead branches.
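One way to make "expensive dead branches" concrete is to filter by label and then rank by cost. The sketch below assumes branches are plain dicts with hypothetical `id`, `contribution_status`, and `cost` keys, and that `min_cost` is a threshold the reviewer picks. None of these names are confirmed by the trace schema.

```python
from enum import Enum


class Contribution(str, Enum):
    """The five contribution labels listed above."""
    LIVE = "live"
    SUPPORT = "support"
    DEFERRED = "deferred"
    DEAD = "dead"
    UNKNOWN = "unknown"


def expensive_dead_branches(branches, min_cost):
    """Return dead branches costing at least min_cost, most expensive first."""
    dead = [
        b for b in branches
        if b["contribution_status"] == Contribution.DEAD and b["cost"] >= min_cost
    ]
    return sorted(dead, key=lambda b: b["cost"], reverse=True)


# Illustrative data only; the ids and costs are made up.
branches = [
    {"id": "b1", "contribution_status": "dead", "cost": 40.0},     # cheap exploration
    {"id": "b2", "contribution_status": "dead", "cost": 9000.0},
    {"id": "b3", "contribution_status": "live", "cost": 12000.0},  # expensive but useful
    {"id": "b4", "contribution_status": "dead", "cost": 2500.0},
]
top_waste = expensive_dead_branches(branches, min_cost=1000.0)
```

Here `b1` is dead but cheap, so it is left out, and `b3` is expensive but live, so it is not waste.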

Primary Metrics

Dead Branch Cost (DBC) = sum(cost(branch)) for branches labeled dead
Dead Branch Ratio (DBR) = DBC / total branch cost
Branch ROI = contribution_value / branch cost
Time To Death (TTD) = ended_at - started_at for dead/abandoned/failed/rolled_back branches
Support/Failure Ratio = support branches with failed execution / all failed branches
Success/Dead Ratio = dead branches with succeeded execution / all succeeded branches
Failed-Task Spend = total cost of runs that failed an external task evaluator / total cost
Outcome Waste Floor = dead branch cost for passing or unevaluated runs; total run cost for runs that fail an external evaluator
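The first two definitions and TTD reduce to a few lines of arithmetic. This is a minimal sketch rather than the package's implementation: it assumes each branch is a dict with hypothetical `cost`, `contribution_status`, `started_at`, and `ended_at` keys, with timestamps as ISO 8601 strings.

```python
from datetime import datetime


def dead_branch_cost(branches):
    """DBC: summed cost of branches labeled dead."""
    return sum(b["cost"] for b in branches if b["contribution_status"] == "dead")


def dead_branch_ratio(branches):
    """DBR: DBC divided by total branch cost (0.0 for an empty trace)."""
    total = sum(b["cost"] for b in branches)
    return dead_branch_cost(branches) / total if total else 0.0


def time_to_death_seconds(branch):
    """TTD: ended_at minus started_at, in seconds."""
    started = datetime.fromisoformat(branch["started_at"])
    ended = datetime.fromisoformat(branch["ended_at"])
    return (ended - started).total_seconds()


# Illustrative branches; all values are made up.
branches = [
    {"cost": 100.0, "contribution_status": "dead",
     "started_at": "2025-01-01T10:00:00", "ended_at": "2025-01-01T10:02:30"},
    {"cost": 300.0, "contribution_status": "live",
     "started_at": "2025-01-01T10:03:00", "ended_at": "2025-01-01T10:05:00"},
    {"cost": 100.0, "contribution_status": "unknown",
     "started_at": "2025-01-01T10:06:00", "ended_at": "2025-01-01T10:06:30"},
]

dbc = dead_branch_cost(branches)
dbr = dead_branch_ratio(branches)
ttd = time_to_death_seconds(branches[0])
```

Note that the `unknown` branch adds to total cost but not to DBC, so unlabeled work lowers DBR rather than raising it.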

The default cost model is intentionally simple and trace-local. Each trace can override the weights:

{
  "token": 1.0,
  "tool_call": 1000.0,
  "retry": 500.0,
  "rollback": 2000.0,
  "edit": 100.0,
  "test_run": 500.0,
  "wall_time_second": 0.0
}
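As a rough illustration of how such weights might turn into a branch cost, the sketch below treats cost as a weighted sum of per-branch counts. The weights are copied from the default model above. The function name `branch_cost`, the `counts` dict, and the weighted-sum formula itself are assumptions, not the package's confirmed behavior.

```python
DEFAULT_WEIGHTS = {
    "token": 1.0,
    "tool_call": 1000.0,
    "retry": 500.0,
    "rollback": 2000.0,
    "edit": 100.0,
    "test_run": 500.0,
    "wall_time_second": 0.0,
}


def branch_cost(counts, overrides=None):
    """Weighted sum of counts; per-trace overrides replace default weights."""
    weights = {**DEFAULT_WEIGHTS, **(overrides or {})}
    return sum(weight * counts.get(name, 0) for name, weight in weights.items())


# Hypothetical counts for one branch: 1050 tokens, 2 tool calls, 1 test run.
counts = {"token": 1050, "tool_call": 2, "test_run": 1, "wall_time_second": 12}

default_cost = branch_cost(counts)                                # 3550.0
timed_cost = branch_cost(counts, {"wall_time_second": 10.0})      # 3670.0
```

With the default weights, wall time contributes nothing, so two tool calls dominate the cost of this branch.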

Install

From this folder:

python3 -m pip install -e .

CLI Quickstart

deadbranchbench validate examples/minimal_trace.json
deadbranchbench compute examples/minimal_trace.json
deadbranchbench report examples/minimal_trace.json --html --output reports/minimal_trace.html

For the public benchmark release and the 10-minute external run path, see PUBLIC_BENCHMARK_RELEASE.md and RUN_ON_YOUR_AGENT_10_MIN.md.

Reproduce the current validated cohort bracket:

python3 runners/report_waste_bracket.py --bundle validated_debug_bundle_20260619

Current bracket:

1.71% provenance floor <= human-reviewed waste (pending) <= 31.81% failed-task ceiling

The HTML report shows:

DBR
DBC
branch table
cost breakdown
branch tree
top waste branches

Event Capture

Capture a command as objective JSONL telemetry:

deadbranchbench observe -- python3 -c "print('hello')"

By default this writes:

runs/<run_id>/run.jsonl

The event stream captures:

command_start
command_end
command_failed
file_snapshot_before
file_snapshot_after
file_changed
stdout_summary
stderr_summary

Events never claim dead, live, support, or deferred.

Privacy defaults:

ignored directories include .git, node_modules, .venv, venv, runs, caches, and build outputs
ignored files include .env, .npmrc, .pypirc, SSH keys, and common credential filenames
ignored paths include names containing api_key, access_token, password, private_key, secret, or credential
ignored suffixes include common key/certificate, database, archive, image, audio, video, and PDF formats
files larger than 1 MiB are skipped before hashing
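The defaults above amount to a skip predicate applied to each path before it is snapshotted. The following sketch shows one way to express that. The function name `should_skip` is hypothetical, and the directory, file, and suffix sets are short samples chosen for illustration, not the tool's actual lists.

```python
from pathlib import PurePosixPath

# Sample values only; the real lists are longer and are not reproduced here.
IGNORED_DIRS = {".git", "node_modules", ".venv", "venv", "runs"}
IGNORED_FILES = {".env", ".npmrc", ".pypirc"}
SENSITIVE_FRAGMENTS = (
    "api_key", "access_token", "password", "private_key", "secret", "credential",
)
IGNORED_SUFFIXES = {".pem", ".key", ".sqlite", ".zip", ".png", ".pdf"}
MAX_BYTES = 1024 * 1024  # 1 MiB


def should_skip(path, size_bytes):
    """Return True if a file should be excluded from snapshots and hashing."""
    p = PurePosixPath(path)
    if any(part in IGNORED_DIRS for part in p.parts[:-1]):
        return True  # lives under an ignored directory
    if p.name in IGNORED_FILES:
        return True  # exact ignored filename
    if any(fragment in path.lower() for fragment in SENSITIVE_FRAGMENTS):
        return True  # path name hints at credentials
    if p.suffix.lower() in IGNORED_SUFFIXES:
        return True  # key, database, archive, or media format
    return size_bytes > MAX_BYTES  # too large to hash
```

For example, `should_skip("src/app.py", 2048)` is false, while `should_skip("config/api_key.txt", 10)` is true because of the name fragment. Checking size last means a large file is rejected without ever being opened.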

Build Trace Skeletons

Convert objective events into an unknown-labeled trace:

deadbranchbench build-trace runs/<run_id>/run.jsonl --output traces/<run_id>.json
deadbranchbench label traces/<run_id>.json --interactive
deadbranchbench compute traces/<run_id>.json

build-trace keeps contribution_status as unknown. Labels create DBR; events do not.
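To show why events alone cannot produce DBR, here is a simplified sketch of a skeleton builder. It assumes each JSONL line is an object with a hypothetical `event` key naming the event type and a hypothetical `command_id` key, and it maps one command to one branch. The real event schema and branching rules may differ.

```python
import json


def build_skeleton(jsonl_text):
    """Group events by command into branches labeled unknown."""
    branches = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        branch = branches.setdefault(
            event["command_id"],
            {
                "id": event["command_id"],
                "execution_status": "running",
                # Events never decide contribution; review does.
                "contribution_status": "unknown",
            },
        )
        if event["event"] == "command_failed":
            branch["execution_status"] = "failed"
        elif event["event"] == "command_end" and branch["execution_status"] != "failed":
            branch["execution_status"] = "succeeded"
    return list(branches.values())


# Made-up event stream for two commands.
sample = "\n".join([
    '{"event": "command_start", "command_id": "c1"}',
    '{"event": "command_failed", "command_id": "c1"}',
    '{"event": "command_start", "command_id": "c2"}',
    '{"event": "command_end", "command_id": "c2"}',
])
skeleton = build_skeleton(sample)
```

Every branch in `skeleton` has an execution status taken from events and a contribution status of `unknown`, so DBC stays at zero until someone labels the trace.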

External Task Outcomes

Traces can include an optional top-level evaluation object:

{
  "success": false,
  "evaluator": "external-tests-v1",
  "artifact_path": "runs/artifacts/final.py",
  "detail": "hidden assertion failed"
}

When present, evaluation.success is treated as task-level ground truth. The classic DBR fields still mean exactly what they meant before: only branches reviewed or labeled dead count as dead-branch cost. Outcome fields sit beside DBR so failed-but-executed work is visible:

failed_task_cost: whole-run cost when the external evaluator fails.
failed_task_undetected_cost: failed-task cost not already counted as dead branch cost.
outcome_waste_floor_cost: dead branch cost for passing/unevaluated runs, whole-run cost for externally failed runs.

This is the benchmark hook for confidently wrong agent work: a run can have dead_branch_ratio == 0 and still have failed_task_ratio == 1 if it passed its own checks but failed external ground truth.
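The three outcome fields follow directly from the definitions above. The sketch below is one reading of them, assuming the per-run `failed_task_ratio` is failed-task cost over total cost. The function name `outcome_fields` and its parameters are hypothetical.

```python
def outcome_fields(total_cost, dead_branch_cost, evaluation=None):
    """Compute outcome fields next to classic DBR for a single run."""
    # Only an explicit success == False counts as an external failure.
    failed = evaluation is not None and evaluation.get("success") is False
    failed_task_cost = total_cost if failed else 0.0
    return {
        "dead_branch_ratio": dead_branch_cost / total_cost if total_cost else 0.0,
        "failed_task_cost": failed_task_cost,
        "failed_task_undetected_cost": (
            failed_task_cost - dead_branch_cost if failed else 0.0
        ),
        "outcome_waste_floor_cost": total_cost if failed else dead_branch_cost,
        "failed_task_ratio": failed_task_cost / total_cost if total_cost else 0.0,
    }


# A confidently wrong run: no dead branches, but the external evaluator fails.
wrong = outcome_fields(5000.0, 0.0, {"success": False, "evaluator": "external-tests-v1"})

# A passing run with some dead work.
passing = outcome_fields(5000.0, 1000.0, {"success": True})
```

In the first case `dead_branch_ratio` is 0 while `failed_task_ratio` is 1, which is the situation the paragraph above describes. An unevaluated run (`evaluation=None`) is treated like a passing run.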

Observe a multi-step command script:

deadbranchbench observe-script demo_debug_session/commands.json --run-id observed-debug-session-001 --output runs/observed-debug-session-001/run.jsonl --cwd demo_debug_session

Use output redaction when terminal streams may contain sensitive data:

deadbranchbench observe --redact-stdout --redact-stderr -- COMMAND...
deadbranchbench observe-script commands.json --redact-stdout --redact-stderr

Observed Failure Proof

This repository includes a small observed run where a command writes a failing test and exits nonzero:

deadbranchbench observe --run-id observed-real-task-001 --output runs/observed-real-task-001/run.jsonl --cwd demo_agent_task -- python3 create_failing_test.py
deadbranchbench build-trace runs/observed-real-task-001/run.jsonl --output traces/observed_real_task_unlabeled.json
deadbranchbench report traces/observed_real_task_unlabeled.json --html --output reports/observed_real_task_unlabeled.html

Before review:

contribution_status: unknown
unknown_ratio: 100%
DBR: 0%
TTD: present because execution_status is failed

After review, the labeled copy at traces/observed_real_task_labeled.json marks the branch dead, producing DBR and top-waste output.

Observed Debugging Proof

The stronger proof is a five-command debugging session:

deadbranchbench observe-script demo_debug_session/commands.json --run-id observed-debug-session-001 --output runs/observed-debug-session-001/run.jsonl --cwd demo_debug_session --redact-stdout
deadbranchbench build-trace runs/observed-debug-session-001/run.jsonl --output traces/observed_debug_session_unlabeled.json
deadbranchbench report traces/observed_debug_session_labeled.json --html --output reports/observed_debug_session_labeled.html

It demonstrates:

failed test output labeled support
a bad source edit labeled dead
a final source edit and passing test labeled live
DBR only after review

The core thesis in five rows:

| Branch | Execution Result | Contribution |
|--------|------------------|--------------|
| `b1`   | Failed           | Support      |
| `b2`   | Succeeded        | Dead         |
| `b3`   | Failed           | Support      |
| `b4`   | Succeeded        | Live         |
| `b5`   | Succeeded        | Live         |

Agent execution status and contribution status are different dimensions.
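The two cross-dimension ratios from Primary Metrics can be read straight off these five rows. The helper below is an illustrative sketch; the name `ratio` and the tuple layout are assumptions.

```python
# (branch, execution result, contribution) for the five rows above.
rows = [
    ("b1", "failed", "support"),
    ("b2", "succeeded", "dead"),
    ("b3", "failed", "support"),
    ("b4", "succeeded", "live"),
    ("b5", "succeeded", "live"),
]


def ratio(rows, execution, contribution):
    """Share of branches with a given execution result that carry a given label."""
    matching = [r for r in rows if r[1] == execution]
    if not matching:
        return 0.0
    return sum(1 for r in matching if r[2] == contribution) / len(matching)


# Support/Failure Ratio: failed branches that still supported the solution.
support_failure = ratio(rows, "failed", "support")   # 2 of 2 failed branches

# Success/Dead Ratio: succeeded branches that contributed nothing.
success_dead = ratio(rows, "succeeded", "dead")      # 1 of 3 succeeded branches
```

In this session every failure was useful and one in three successes was wasted, which is why execution status cannot stand in for contribution status.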

Reviewed labels can include:

reviewer
reviewed_at
confidence
notes

Trace Recorder

Manual wrapper API:

from deadbranchbench import Trace

trace = Trace(
    run_id="task-001",
    task_id="fix-failing-test",
    task_title="Fix failing test",
    task_category="debugging",
)

with trace.branch("Inspect failing test output") as branch:
    branch.add_tokens(input_tokens=800, output_tokens=250)
    branch.add_tool_call(2)
    branch.add_test("tests/test_auth.py::test_login")
    branch.mark_success(support=True, evidence="Found fixture mismatch")

trace.completed = True
trace.write("traces/task-001.json")

Import And Label

deadbranchbench import codex-log ./run.log --output traces/imported.json
deadbranchbench import langgraph ./trace.json --output traces/langgraph.json
deadbranchbench label traces/imported.json --interactive

Imported traces default to unknown contribution labels. Do not use them for DBR claims until they are labeled.

Native LangGraph Telemetry

For real LangGraph runs, attach the callback handler and import the JSONL it writes:

from deadbranchbench import LangGraphCallbackHandler

callback = LangGraphCallbackHandler(
    "runs/langgraph/raw/run_001.jsonl",
    trace_run_id="run_001",
)

result = graph.invoke(
    {"input": "fix the failing test"},
    config={"callbacks": [callback]},
)

Then convert the raw callback events into a reviewable trace:

deadbranchbench import langgraph runs/langgraph/raw/run_001.jsonl --output traces/langgraph/run_001.json --run-id run_001
deadbranchbench label traces/langgraph/run_001.json --interactive
deadbranchbench compute traces/langgraph/run_001.json

The callback records execution telemetry only. Contribution labels still come from review. By default it avoids storing full callback payloads; pass include_payload=True only when the run data is safe to retain.

To make the dead-vs-support rubric objective, record data-flow provenance with stable artifact ids. A producer branch emits produces or writes; a consumer branch emits consumes or reads. During import, matching artifacts populate the producer branch's used_by_branches.

callback.record_provenance(
    "diagnose-node-run-id",
    produces=["diagnostic:fixture-mismatch"],
)

callback.record_provenance(
    "fix-node-run-id",
    consumes=["diagnostic:fixture-mismatch"],
)

callback.record_provenance("patch-node-run-id", writes=["app.py"])
callback.record_provenance("test-node-run-id", reads=["app.py"])

Callback metadata can also carry provenance:

graph.invoke(
    inputs,
    config={
        "callbacks": [callback],
        "metadata": {"produces": ["doc:auth-api-v2"]},
    },
)

Raw LangGraph ordering is only control-flow. It is not counted as downstream use unless explicit provenance links a consumed artifact to an earlier produced artifact.
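The matching rule can be sketched as a single pass over provenance records in execution order. This is an assumed simplification of the import step: the function name `link_provenance` and the record layout (a `branch` key plus optional `produces`, `writes`, `consumes`, `reads` lists) are hypothetical.

```python
def link_provenance(records):
    """Map each branch to the later branches that consumed its artifacts."""
    producers = {}  # artifact id -> most recent producing branch
    used_by = {}    # branch -> list of consumer branches
    for record in records:
        branch = record["branch"]
        used_by.setdefault(branch, [])
        # Resolve consumption first so a branch only links to earlier producers.
        for artifact in record.get("consumes", []) + record.get("reads", []):
            producer = producers.get(artifact)
            if producer is None or producer == branch:
                continue  # never produced earlier: no downstream use recorded
            if branch not in used_by[producer]:
                used_by[producer].append(branch)
        for artifact in record.get("produces", []) + record.get("writes", []):
            producers[artifact] = branch
    return used_by


records = [
    {"branch": "diagnose", "produces": ["diagnostic:fixture-mismatch"]},
    {"branch": "fix", "consumes": ["diagnostic:fixture-mismatch"]},
    {"branch": "patch", "writes": ["app.py"]},
    {"branch": "test", "reads": ["app.py"]},
]
links = link_provenance(records)
```

Although `fix` runs immediately before `patch`, no link is created between them, because ordering alone is control-flow and neither consumes an artifact from the other.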

LangGraph 100-Run Gate

The next evidence gate is not a new feature. It is a dataset:

100 reviewed LangGraph runs
average DBR
median DBR
average TTD
support/failure ratio
success/dead ratio
top dead-work patterns

Protocol: tasks/langgraph_100_run_protocol.md.

Synthetic Fixtures

The seed fixtures cover:

examples/minimal_trace.json for debugging
examples/research_trace.json
examples/refactor_trace.json
examples/api_integration_trace.json
examples/migration_trace.json

Generated HTML reports are in reports/.

Cross-Run Summary

Summarize multiple traces:

deadbranchbench summarize traces/*.json --output reports/proof_run_summary.json
deadbranchbench summarize traces/*.json --html --output reports/proof_run_summary.html

The summary report includes:

average DBR
median DBR
portfolio DBR
total dead branch cost
top 10 waste patterns
top 10 dead branches
DBR by task category
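The three DBR aggregates answer different questions, so it helps to see them side by side. In this sketch, average and median DBR are taken over per-run ratios, and portfolio DBR is assumed to be total dead cost over total cost across all runs. That definition is not stated in this README, but it is consistent with the First Proof Bundle headline. The function name `summarize` and the `dead_cost` and `total_cost` keys are hypothetical.

```python
from statistics import mean, median


def summarize(runs):
    """Aggregate per-run dead cost and total cost into summary ratios."""
    ratios = [r["dead_cost"] / r["total_cost"] for r in runs if r["total_cost"] > 0]
    total_dead = sum(r["dead_cost"] for r in runs)
    total = sum(r["total_cost"] for r in runs)
    return {
        "average_dbr": mean(ratios) if ratios else 0.0,
        "median_dbr": median(ratios) if ratios else 0.0,
        "portfolio_dbr": total_dead / total if total else 0.0,
        "total_dead_branch_cost": total_dead,
    }


# Illustrative runs; the numbers are made up.
runs = [
    {"dead_cost": 100.0, "total_cost": 1000.0},  # DBR 0.10
    {"dead_cost": 0.0, "total_cost": 500.0},     # DBR 0.00
    {"dead_cost": 900.0, "total_cost": 1500.0},  # DBR 0.60
]
summary = summarize(runs)

# Consistency check against the First Proof Bundle totals quoted in this README.
headline_portfolio_pct = round(19540.0 / 112070.0 * 100, 2)  # 17.44
```

Portfolio DBR weights expensive runs more heavily than the average does, so a single costly run with high waste moves it the most.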

First Proof Bundle

The first proof bundle is a set of 10 manually reconstructed Codex build-session traces:

traces/codex_01_scaffold_inspection.json
...
traces/codex_10_summary_command.json

Important limitation:

These are real Codex session reconstructions with approximate costs.
They are not provider-native telemetry exports.

Generated outputs:

reports/codex_*.html
reports/proof_run_summary.json
reports/proof_run_summary.html

Headline from the first proof bundle:

runs: 10
average_dbr: 17.72%
median_dbr: 17.21%
portfolio_dbr: 17.44%
total_dead_branch_cost: 19,540.0
total_cost: 112,070.0

Rule

Do not add pruning, scoring, or optimization until baseline DBR is measured.
