Tracking the blast radius benchmark that will back the next deep-dive article.
Status (2026-04-23)
Track A — scream test (graph vs reality)
Done. Standalone Python harness that mutates a function, runs Django's test suite, and compares what actually breaks against `/v1/analysis/impact` predictions.
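The loop itself is simple; a minimal self-contained sketch of the idea (toy module and test names are stand-ins, not the real harness in `benchmark/scream/`):

```python
import ast

def mutate_function(source: str, func_name: str) -> str:
    """Replace the body of `func_name` with a raise — the "scream":
    any test that actually exercises this function should now fail."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == func_name:
            node.body = [ast.parse(f"raise RuntimeError('scream: {func_name}')").body[0]]
    return ast.unparse(tree)  # needs Python 3.9+

def broken_tests(module_source: str, tests: dict) -> set:
    """Exec the module, run each named test callable, collect failures."""
    namespace = {}
    exec(compile(module_source, "<mutant>", "exec"), namespace)
    failures = set()
    for name, test in tests.items():
        try:
            test(namespace)
        except Exception:
            failures.add(name)
    return failures

# Toy module standing in for something like django/contrib/auth/hashers.py
SRC = """
def make_password(raw):
    return "hashed:" + raw

def unrelated():
    return 42
"""

TESTS = {
    "test_make_password": lambda ns: ns["make_password"]("pw"),
    "test_unrelated": lambda ns: ns["unrelated"](),
}

mutant = mutate_function(SRC, "make_password")
print(sorted(broken_tests(mutant, TESTS)))  # → ['test_make_password']
```

The real harness does the same thing per target, except "run each test callable" is a real Django test-suite invocation and the failure set is diffed against the API's prediction.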
- Scope: `django/contrib/auth/` — 18 measured targets
- Macro F1: 56.9% (precision 44.2%, recall 94.9%)
- Micro F1: 59.9% (precision 44.2%, recall 93.0%) — TP 159, FP 201, FN 12
- Strongest performers: `hashers.make_password` / `hashers.get_hasher` — 100% recall, 85% precision, 92% F1
- Known gap: 12 FN from `login_required` / `user_passes_test` decorators — parser appears to miss test methods decorated at call time
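One plausible shape of that gap (illustrative only, not the actual parser code): a visitor that only inspects `decorator_list` sees syntactic `@login_required`, but is blind to call-time wrapping of the same function:

```python
import ast

def decorated_functions(source: str, decorator: str) -> set:
    """Find functions carrying `@decorator` syntactically — the easy case."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            for dec in node.decorator_list:
                # handle both bare @login_required and @login_required(...)
                target = dec.func if isinstance(dec, ast.Call) else dec
                if isinstance(target, ast.Name) and target.id == decorator:
                    found.add(node.name)
    return found

SRC = """
@login_required
def profile(request):
    ...

def dashboard(request):
    ...

# call-time decoration — invisible to decorator_list
dashboard = login_required(dashboard)
"""

print(decorated_functions(SRC, "login_required"))  # → {'profile'} — misses dashboard
```

If the parser's decorator handling looks anything like this, the fix is to also walk `ast.Assign` nodes whose value is a call to the decorator.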
Artifacts in `benchmark/scream/`:
- `scream_test.py` — AST-based mutation loop
- `api_predict.py` — batched call to `/v1/analysis/impact`
- `compare.py` — precision/recall/F1 (scoped + test-case-only filtering)
- `summarize.py` — metrics → markdown
- `results/metrics_fair.json` + `results/README.md`
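The micro scoring in `compare.py` reduces to pooling TP/FP/FN across all 18 targets and scoring once (macro instead averages per-target F1). A sketch that reproduces the micro row above:

```python
def prf1(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall, F1 from pooled confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Micro: pool counts across all targets, then score once.
p, r, f1 = prf1(tp=159, fp=201, fn=12)
print(f"precision {p:.1%}  recall {r:.1%}  F1 {f1:.1%}")
# → precision 44.2%  recall 93.0%  F1 59.9%
```

The recall-heavy, precision-light profile is exactly what you'd expect from an over-broad impact graph: it rarely misses a truly affected test (12 FN) but flags many that don't break (201 FP).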
Track B — agent comparison (blast radius context → refactor outcome)
Pilot complete, scope problem identified.
Setup: two Docker containers (`bench-br-naked`, `bench-br-supermodel`) built on Django 5.0.6 + Claude Code + Opus 4.7 (pinned via `--model claude-opus-4-7`). Task: make `authenticate(request=None, **credentials)` require `request`; keep tests green.
Pilot run #1 (test scope = `auth_tests` only):

| condition  | verdict | turns | tools | cost  | duration | files |
|------------|---------|-------|-------|-------|----------|-------|
| naked      | PASS    | 58    | 41    | $1.56 | 205s     | 7     |
| supermodel | PASS    | 59    | 43    | $1.77 | 203s     | 7     |
Effectively tied. Opus 4.7 is capable enough to grep-and-iterate within a single subsystem. The `BLAST_RADIUS.md` context added cost (bigger input) without changing the outcome.
Pilot run #2 in progress (test scope expanded to 13 Django subsystems, 3,293 tests — auth_tests, admin_views, admin_changelist, admin_utils, admin_inlines, sessions_tests, middleware_exceptions, generic_views, forms_tests, contenttypes_tests, handlers, view_tests, test_client_regress). Hypothesis: the naked agent will now miss callers outside `auth/`.
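That hypothesis is checkable mechanically: cross-subsystem call sites of `authenticate` are exactly what the naked agent has to rediscover by hand. A toy sketch of that scan (file names are stand-ins, not Django's actual layout):

```python
import ast
import tempfile
from pathlib import Path

def call_sites(root: Path, func: str):
    """Yield (file name, line) for every syntactic call to `func` under root."""
    for path in sorted(root.rglob("*.py")):
        tree = ast.parse(path.read_text(), filename=str(path))
        for node in ast.walk(tree):
            if isinstance(node, ast.Call):
                target = node.func
                # cover both authenticate(...) and backend.authenticate(...)
                name = target.attr if isinstance(target, ast.Attribute) else getattr(target, "id", None)
                if name == func:
                    yield path.name, node.lineno

# Toy tree standing in for callers spread across subsystems.
root = Path(tempfile.mkdtemp())
(root / "auth_views.py").write_text(
    "def login(request, u, p):\n    return authenticate(request, username=u, password=p)\n"
)
(root / "admin_sites.py").write_text(
    "def check(request, **kw):\n    return authenticate(request, **kw)\n"
)
print(list(call_sites(root, "authenticate")))
# → [('admin_sites.py', 2), ('auth_views.py', 2)]
```

Running something like this over the real tree before and after the refactor would give a ground-truth caller list to score both agents against.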
Artifacts in `benchmark/`:
- `Dockerfile.br-naked`, `Dockerfile.br-supermodel`
- `entrypoint.br.sh`
- `CLAUDE.br-naked.md`, `CLAUDE.br-supermodel.md`
- `br_task.md` — refactor spec
- `BLAST_RADIUS.md` — pre-computed from Track A's `predicted.json`
- `run-br.sh` — orchestrator
- `compare-br.sh` — results table
- `results/br/`
Known issues to resolve
Article plan
Deep-dive: "Your agent thinks it's editing 3 files. It's editing 145." — receipt-worthy "things that will break" number, Jenga/ripple analogy, push `/v1/analysis/impact` + `supermodel blast-radius`.
Standalone pullout (4th-grade brainrot): "Before you move one block, check who's standing on it."