Wisoba/deadbranchbench

DeadBranchBench

DeadBranchBench is a benchmark for measuring wasted agent work.

DeadBranchBench captures agent work as events, derives trace skeletons, and measures where agent compute dies after review.

DeadBranchBench is observability for wasted agent work.

It does not assume a pruning strategy. It does not assume a theory of agent planning. It asks one business question:

Is expensive dead-branch materialization a large enough bill to matter?

Week 1 Scope

The first asset is the event schema plus a boring CLI.

Every observed run records objective events first. Branches, costs, and the Dead Branch Ratio (DBR) are derived later, after review.

No pruning. No optimizer. No agent intelligence yet.

Repository Shape

deadbranchbench/
├── src/deadbranchbench/
├── tasks/
├── traces/
├── metrics/
├── reports/
├── runners/
└── examples/

Contribution Labels

Use these labels carefully:

  • live: directly contributed to the final solution.
  • support: failed or ended, but produced information later used by a live branch.
  • deferred: not used in this run, but intentionally preserved for plausible future use.
  • dead: consumed cost and produced no measurable contribution.
  • unknown: not yet labeled.

The enemy is not every dead branch. Cheap exploration is fine. The enemy is expensive dead branches.

Primary Metrics

Dead Branch Cost (DBC) = sum(cost(branch)) for branches labeled dead
Dead Branch Ratio (DBR) = DBC / total branch cost
Branch ROI = contribution_value / branch cost
Time To Death (TTD) = ended_at - started_at for dead/abandoned/failed/rolled_back branches
Support/Failure Ratio = support branches with failed execution / all failed branches
Success/Dead Ratio = dead branches with succeeded execution / all succeeded branches
Failed-Task Spend = total cost of runs that failed an external task evaluator / total cost
Outcome Waste Floor = dead branch cost for passing or unevaluated runs; whole-run cost when an external evaluator fails
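As a rough illustration of the first two definitions, the sketch below computes DBC and DBR over a list of labeled branches. The `Branch` dataclass, its field names, and the sample costs are assumptions made for this example; they are not the package's actual data model.

```python
from dataclasses import dataclass


@dataclass
class Branch:
    # Hypothetical shape for illustration only.
    branch_id: str
    cost: float
    contribution_status: str  # live, support, deferred, dead, or unknown


def dead_branch_cost(branches):
    """DBC: summed cost of branches labeled dead."""
    return sum(b.cost for b in branches if b.contribution_status == "dead")


def dead_branch_ratio(branches):
    """DBR: DBC divided by total branch cost (0.0 for an empty or zero-cost run)."""
    total = sum(b.cost for b in branches)
    return dead_branch_cost(branches) / total if total else 0.0


branches = [
    Branch("b1", 1500.0, "support"),
    Branch("b2", 2500.0, "dead"),
    Branch("b3", 6000.0, "live"),
]

print(dead_branch_cost(branches))   # 2500.0
print(dead_branch_ratio(branches))  # 0.25
```

Only the dead label contributes to the numerator, so an unlabeled trace (all unknown) reports a DBR of zero until it is reviewed.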

The default cost model is intentionally simple and trace-local. Each trace can override the weights:

{
  "token": 1.0,
  "tool_call": 1000.0,
  "retry": 500.0,
  "rollback": 2000.0,
  "edit": 100.0,
  "test_run": 500.0,
  "wall_time_second": 0.0
}
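One plausible reading of these weights is a plain weighted sum over per-branch counters. The sketch below assumes that reading; the `branch_cost` helper, the counter names as dictionary keys, and the idea that input and output tokens share the single `token` weight are all assumptions, not confirmed behavior of the CLI.

```python
DEFAULT_WEIGHTS = {
    "token": 1.0,
    "tool_call": 1000.0,
    "retry": 500.0,
    "rollback": 2000.0,
    "edit": 100.0,
    "test_run": 500.0,
    "wall_time_second": 0.0,
}


def branch_cost(counts, overrides=None):
    """Weighted sum of a branch's counters (hypothetical helper).

    `overrides` stands in for trace-local weights; an unknown counter name
    raises KeyError rather than being silently ignored.
    """
    weights = {**DEFAULT_WEIGHTS, **(overrides or {})}
    return sum(weights[unit] * amount for unit, amount in counts.items())


# 1050 tokens, 2 tool calls, 1 test run: 1050 + 2000 + 500
counts = {"token": 1050, "tool_call": 2, "test_run": 1}
print(branch_cost(counts))  # 3550.0

# A trace that decides wall time should count toward cost.
timed = {**counts, "wall_time_second": 10}
print(branch_cost(timed, overrides={"wall_time_second": 0.5}))  # 3555.0
```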

Install

From this folder:

python3 -m pip install -e .

CLI Quickstart

deadbranchbench validate examples/minimal_trace.json
deadbranchbench compute examples/minimal_trace.json
deadbranchbench report examples/minimal_trace.json --html --output reports/minimal_trace.html

For the public benchmark release and the 10-minute external run path, see:

Reproduce the current validated cohort bracket:

python3 runners/report_waste_bracket.py --bundle validated_debug_bundle_20260619

Current bracket:

1.71% provenance floor <= human-reviewed waste pending <= 31.81% failed-task ceiling

The HTML report shows:

  • DBR
  • DBC
  • branch table
  • cost breakdown
  • branch tree
  • top waste branches

Event Capture

Capture a command as objective JSONL telemetry:

deadbranchbench observe -- python3 -c "print('hello')"

By default this writes:

runs/<run_id>/run.jsonl

The event stream captures:

  • command_start
  • command_end
  • command_failed
  • file_snapshot_before
  • file_snapshot_after
  • file_changed
  • stdout_summary
  • stderr_summary

Events never claim dead, live, support, or deferred.

Privacy defaults:

  • ignored directories include .git, node_modules, .venv, venv, runs, caches, and build outputs
  • ignored files include .env, .npmrc, .pypirc, SSH keys, and common credential filenames
  • ignored paths include names containing api_key, access_token, password, private_key, secret, or credential
  • ignored suffixes include common key/certificate, database, archive, image, audio, video, and PDF formats
  • files larger than 1 MiB are skipped before hashing
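The defaults above can be pictured as a single skip predicate applied before a file is hashed. This is a minimal sketch under stated assumptions: the `should_skip` function is hypothetical, the sets below are small illustrative subsets (the exact suffix and filename lists are not given in this README), and matching sensitive fragments against the whole lowercased path is a guess at the rule.

```python
from pathlib import PurePosixPath

# Illustrative subsets only; the real lists are longer and may differ.
IGNORED_DIRS = {".git", "node_modules", ".venv", "venv", "runs"}
IGNORED_FILES = {".env", ".npmrc", ".pypirc"}
SENSITIVE_FRAGMENTS = (
    "api_key", "access_token", "password", "private_key", "secret", "credential",
)
IGNORED_SUFFIXES = {".pem", ".key", ".sqlite", ".zip", ".png", ".pdf"}
MAX_HASH_BYTES = 1024 * 1024  # 1 MiB


def should_skip(path: str, size_bytes: int) -> bool:
    """Return True when a file should be excluded from snapshots."""
    p = PurePosixPath(path)
    # Any ignored directory among the parent components excludes the file.
    if any(part in IGNORED_DIRS for part in p.parts[:-1]):
        return True
    if p.name in IGNORED_FILES:
        return True
    if any(fragment in path.lower() for fragment in SENSITIVE_FRAGMENTS):
        return True
    if p.suffix.lower() in IGNORED_SUFFIXES:
        return True
    # Size is checked last here, but still before any hashing would happen.
    return size_bytes > MAX_HASH_BYTES


print(should_skip("src/app.py", 2048))               # False
print(should_skip(".git/config", 120))               # True
print(should_skip("config/secret_token.txt", 40))    # True
print(should_skip("data/big.csv", 2 * 1024 * 1024))  # True
```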

Build Trace Skeletons

Convert objective events into an unknown-labeled trace:

deadbranchbench build-trace runs/<run_id>/run.jsonl --output traces/<run_id>.json
deadbranchbench label traces/<run_id>.json --interactive
deadbranchbench compute traces/<run_id>.json

build-trace keeps contribution_status as unknown. Labels create DBR; events do not.

External Task Outcomes

Traces can include an optional top-level evaluation object:

{
  "success": false,
  "evaluator": "external-tests-v1",
  "artifact_path": "runs/artifacts/final.py",
  "detail": "hidden assertion failed"
}

When present, evaluation.success is treated as task-level ground truth. The classic DBR fields still mean exactly what they meant before: only branches reviewed or labeled dead count as dead-branch cost. Outcome fields sit beside DBR so failed-but-executed work is visible:

  • failed_task_cost: whole-run cost when the external evaluator fails.
  • failed_task_undetected_cost: failed-task cost not already counted as dead branch cost.
  • outcome_waste_floor_cost: dead branch cost for passing/unevaluated runs, whole-run cost for externally failed runs.

This is the benchmark hook for confidently wrong agent work: a run can have dead_branch_ratio == 0 and still have failed_task_ratio == 1 if it passed its own checks but failed external ground truth.
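The three outcome fields can be sketched for a single run as follows. The `outcome_costs` helper and its signature are hypothetical, and treating a missing `evaluation` object the same as a passing run follows the "passing/unevaluated" wording above rather than any inspected source code.

```python
def outcome_costs(total_cost, dead_branch_cost, evaluation=None):
    """Per-run outcome fields; `evaluation` mirrors the optional top-level object."""
    # Only an explicit success == False counts as an external failure.
    failed = evaluation is not None and evaluation.get("success") is False
    failed_task_cost = total_cost if failed else 0.0
    return {
        "dead_branch_ratio": dead_branch_cost / total_cost if total_cost else 0.0,
        "failed_task_ratio": failed_task_cost / total_cost if total_cost else 0.0,
        "failed_task_cost": failed_task_cost,
        # Failed-task cost that review had not already labeled dead.
        "failed_task_undetected_cost": failed_task_cost - dead_branch_cost if failed else 0.0,
        # Dead cost normally; the whole run when external ground truth fails.
        "outcome_waste_floor_cost": total_cost if failed else dead_branch_cost,
    }


# Confidently wrong: no dead branches, yet the external evaluator fails.
confidently_wrong = outcome_costs(
    5000.0, 0.0, {"success": False, "evaluator": "external-tests-v1"}
)
print(confidently_wrong["dead_branch_ratio"])  # 0.0
print(confidently_wrong["failed_task_ratio"])  # 1.0

# Passing run: the floor falls back to classic dead branch cost.
passing = outcome_costs(5000.0, 1200.0, {"success": True})
print(passing["outcome_waste_floor_cost"])  # 1200.0
```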

Observe a multi-step command script:

deadbranchbench observe-script demo_debug_session/commands.json --run-id observed-debug-session-001 --output runs/observed-debug-session-001/run.jsonl --cwd demo_debug_session

Use output redaction when terminal streams may contain sensitive data:

deadbranchbench observe --redact-stdout --redact-stderr -- COMMAND...
deadbranchbench observe-script commands.json --redact-stdout --redact-stderr

Observed Failure Proof

This repository includes a small observed run where a command writes a failing test and exits nonzero:

deadbranchbench observe --run-id observed-real-task-001 --output runs/observed-real-task-001/run.jsonl --cwd demo_agent_task -- python3 create_failing_test.py
deadbranchbench build-trace runs/observed-real-task-001/run.jsonl --output traces/observed_real_task_unlabeled.json
deadbranchbench report traces/observed_real_task_unlabeled.json --html --output reports/observed_real_task_unlabeled.html

Before review:

contribution_status: unknown
unknown_ratio: 100%
DBR: 0%
TTD: present because execution_status is failed

After review, the labeled copy at traces/observed_real_task_labeled.json marks the branch dead, producing DBR and top-waste output.

Observed Debugging Proof

The stronger proof is a five-command debugging session:

deadbranchbench observe-script demo_debug_session/commands.json --run-id observed-debug-session-001 --output runs/observed-debug-session-001/run.jsonl --cwd demo_debug_session --redact-stdout
deadbranchbench build-trace runs/observed-debug-session-001/run.jsonl --output traces/observed_debug_session_unlabeled.json
deadbranchbench report traces/observed_debug_session_labeled.json --html --output reports/observed_debug_session_labeled.html

It demonstrates:

  • failed test output labeled support
  • a bad source edit labeled dead
  • a final source edit and passing test labeled live
  • DBR only after review

The core thesis in five rows:

Branch   Execution Result   Contribution
b1       Failed             Support
b2       Succeeded          Dead
b3       Failed             Support
b4       Succeeded          Live
b5       Succeeded          Live

Agent execution status and contribution status are different dimensions.
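Applying the Support/Failure and Success/Dead ratio definitions to the five rows gives a quick worked example. The tuple layout and variable names are illustrative, and the ratios are counted per branch, which is how the metric definitions read.

```python
# (branch, execution result, contribution) for the five-command session.
rows = [
    ("b1", "failed", "support"),
    ("b2", "succeeded", "dead"),
    ("b3", "failed", "support"),
    ("b4", "succeeded", "live"),
    ("b5", "succeeded", "live"),
]

failed = [r for r in rows if r[1] == "failed"]
succeeded = [r for r in rows if r[1] == "succeeded"]

# Share of failed branches that still produced useful information.
support_failure_ratio = sum(r[2] == "support" for r in failed) / len(failed)

# Share of succeeded branches that contributed nothing.
success_dead_ratio = sum(r[2] == "dead" for r in succeeded) / len(succeeded)

print(support_failure_ratio)         # 1.0
print(round(success_dead_ratio, 4))  # 0.3333
```

In this session every failure was useful and one in three successes was waste, which is why execution status alone cannot stand in for contribution status.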

Reviewed labels can include:

  • reviewer
  • reviewed_at
  • confidence
  • notes

Trace Recorder

Manual wrapper API:

from deadbranchbench import Trace

trace = Trace(
    run_id="task-001",
    task_id="fix-failing-test",
    task_title="Fix failing test",
    task_category="debugging",
)

with trace.branch("Inspect failing test output") as branch:
    branch.add_tokens(input_tokens=800, output_tokens=250)
    branch.add_tool_call(2)
    branch.add_test("tests/test_auth.py::test_login")
    branch.mark_success(support=True, evidence="Found fixture mismatch")

trace.completed = True
trace.write("traces/task-001.json")

Import And Label

deadbranchbench import codex-log ./run.log --output traces/imported.json
deadbranchbench import langgraph ./trace.json --output traces/langgraph.json
deadbranchbench label traces/imported.json --interactive

Imported traces default to unknown contribution labels. Do not use them for DBR claims until they are labeled.

Native LangGraph Telemetry

For real LangGraph runs, attach the callback handler and import the JSONL it writes:

from deadbranchbench import LangGraphCallbackHandler

callback = LangGraphCallbackHandler(
    "runs/langgraph/raw/run_001.jsonl",
    trace_run_id="run_001",
)

result = graph.invoke(
    {"input": "fix the failing test"},
    config={"callbacks": [callback]},
)

Then convert the raw callback events into a reviewable trace:

deadbranchbench import langgraph runs/langgraph/raw/run_001.jsonl --output traces/langgraph/run_001.json --run-id run_001
deadbranchbench label traces/langgraph/run_001.json --interactive
deadbranchbench compute traces/langgraph/run_001.json

The callback records execution telemetry only. Contribution labels still come from review. By default it avoids storing full callback payloads; pass include_payload=True only when the run data is safe to retain.

To make the dead-vs-support rubric objective, record data-flow provenance with stable artifact ids. A producer branch emits produces or writes; a consumer branch emits consumes or reads. During import, matching artifacts populate the producer branch's used_by_branches.

callback.record_provenance(
    "diagnose-node-run-id",
    produces=["diagnostic:fixture-mismatch"],
)

callback.record_provenance(
    "fix-node-run-id",
    consumes=["diagnostic:fixture-mismatch"],
)

callback.record_provenance("patch-node-run-id", writes=["app.py"])
callback.record_provenance("test-node-run-id", reads=["app.py"])

Callback metadata can also carry provenance:

graph.invoke(
    inputs,
    config={
        "callbacks": [callback],
        "metadata": {"produces": ["doc:auth-api-v2"]},
    },
)

Raw LangGraph ordering is only control-flow. It is not counted as downstream use unless explicit provenance links a consumed artifact to an earlier produced artifact.
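The matching rule can be sketched as a single pass over provenance records in emit order. Everything here is an assumption made for illustration: the `link_provenance` function, the record dictionaries, and the choice to treat `produces`/`writes` and `consumes`/`reads` as one shared artifact namespace. The importer's real logic may differ.

```python
def link_provenance(records):
    """Map each branch to the later branches that used one of its artifacts.

    `records` is a list of dicts in emit order, each with a "branch_id" and
    optional "produces", "writes", "consumes", and "reads" lists.
    """
    latest_producer = {}  # artifact id -> branch that most recently emitted it
    used_by = {record["branch_id"]: [] for record in records}
    for record in records:
        branch = record["branch_id"]
        # A consumer links only to an artifact that was produced earlier.
        for artifact in record.get("consumes", []) + record.get("reads", []):
            producer = latest_producer.get(artifact)
            if producer and producer != branch and branch not in used_by[producer]:
                used_by[producer].append(branch)
        for artifact in record.get("produces", []) + record.get("writes", []):
            latest_producer[artifact] = branch
    return used_by


records = [
    {"branch_id": "diagnose-node-run-id", "produces": ["diagnostic:fixture-mismatch"]},
    {"branch_id": "fix-node-run-id", "consumes": ["diagnostic:fixture-mismatch"]},
    {"branch_id": "patch-node-run-id", "writes": ["app.py"]},
    {"branch_id": "test-node-run-id", "reads": ["app.py"]},
]

links = link_provenance(records)
print(links["diagnose-node-run-id"])  # ['fix-node-run-id']
print(links["patch-node-run-id"])     # ['test-node-run-id']
print(links["fix-node-run-id"])       # []
```

A branch that merely runs after another gets no link in this sketch, which mirrors the point that ordering alone is control flow, not downstream use.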

LangGraph 100-Run Gate

The next evidence gate is not a new feature. It is a dataset:

100 reviewed LangGraph runs
average DBR
median DBR
average TTD
support/failure ratio
success/dead ratio
top dead-work patterns

Protocol: tasks/langgraph_100_run_protocol.md.

Synthetic Fixtures

The seed fixtures cover:

  • examples/minimal_trace.json for debugging
  • examples/research_trace.json
  • examples/refactor_trace.json
  • examples/api_integration_trace.json
  • examples/migration_trace.json

Generated HTML reports are in reports/.

Cross-Run Summary

Summarize multiple traces:

deadbranchbench summarize traces/*.json --output reports/proof_run_summary.json
deadbranchbench summarize traces/*.json --html --output reports/proof_run_summary.html

The summary report includes:

  • average DBR
  • median DBR
  • portfolio DBR
  • total dead branch cost
  • top 10 waste patterns
  • top 10 dead branches
  • DBR by task category
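The three DBR aggregates answer different questions, and a small sketch makes the difference visible. The run data below is made up, and the portfolio definition (total dead cost over total cost across all runs) is an assumption, although it is consistent with the headline figures reported for the first proof bundle.

```python
from statistics import mean, median

# Hypothetical runs: two cheap ones and one expensive, wasteful one.
runs = [
    {"run_id": "r1", "dead_cost": 100.0, "total_cost": 1000.0},
    {"run_id": "r2", "dead_cost": 300.0, "total_cost": 1000.0},
    {"run_id": "r3", "dead_cost": 4000.0, "total_cost": 8000.0},
]

per_run_dbr = [r["dead_cost"] / r["total_cost"] for r in runs]

average_dbr = mean(per_run_dbr)   # every run weighted equally
median_dbr = median(per_run_dbr)  # robust to one outlier run
portfolio_dbr = sum(r["dead_cost"] for r in runs) / sum(r["total_cost"] for r in runs)

print(f"average:   {average_dbr:.2%}")    # average:   30.00%
print(f"median:    {median_dbr:.2%}")     # median:    30.00%
print(f"portfolio: {portfolio_dbr:.2%}")  # portfolio: 44.00%
```

Portfolio DBR weights each run by its cost, so one expensive run with heavy waste moves it far more than it moves the average.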

First Proof Bundle

The first proof bundle is a set of 10 manually reconstructed Codex build-session traces:

traces/codex_01_scaffold_inspection.json
...
traces/codex_10_summary_command.json

Important limitation:

These are real Codex session reconstructions with approximate costs.
They are not provider-native telemetry exports.

Generated outputs:

reports/codex_*.html
reports/proof_run_summary.json
reports/proof_run_summary.html

Headline from the first proof bundle:

runs: 10
average_dbr: 17.72%
median_dbr: 17.21%
portfolio_dbr: 17.44%
total_dead_branch_cost: 19,540.0
total_cost: 112,070.0
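As a sanity check, the reported portfolio figure matches total dead branch cost divided by total cost, assuming that is how portfolio_dbr is defined:

```python
total_dead_branch_cost = 19_540.0
total_cost = 112_070.0

# Cost-weighted ratio across all 10 runs.
portfolio_dbr = total_dead_branch_cost / total_cost
print(f"{portfolio_dbr:.2%}")  # 17.44%
```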

Rule

Do not add pruning, scoring, or optimization until baseline DBR is measured.

About

Benchmark for measuring wasted AI-agent work: dead-end and confidently wrong work that normal observability misses.
