fix(bench): EOPS depth scored on best checkpoint, not final state (autopsy reverses the result)#202

Merged: drewstone merged 1 commit into main from fix/eops-depth-best-scoring on Jun 8, 2026

@drewstone

Autopsy finding

/autopsy of the "depth loses to breadth on EOPS" result (.evolve/autopsies/2026-06-08-eops-depth-breadth.md). Root cause: a design flaw in my own harness. The two arms were scored asymmetrically:

  • breadth = best-of-K (max over K independent shots, verifier-selected)
  • depth = final state after K sequential shots

So depth alone paid for late-shot degradation — a steer that makes the model re-touch the DB and undo correct work. Ground-truth artifacts showed the signature: depth ending 0/N on tasks breadth solved (2/2→0, 5/7→0).
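
The asymmetry can be made concrete with a small sketch. The function names and the per-shot scores below are hypothetical, chosen only to illustrate the pattern described above; they are not taken from the harness or from the measured runs.

```python
from typing import Sequence


def breadth_score(shot_scores: Sequence[float]) -> float:
    """Best-of-K: max verifier score over K independent shots."""
    return max(shot_scores)


def depth_final_score(checkpoint_scores: Sequence[float]) -> float:
    """Old depth metric: score of the state left after the last shot."""
    return checkpoint_scores[-1]


def depth_best_score(checkpoint_scores: Sequence[float]) -> float:
    """Symmetric depth metric: max verifier score over all checkpoints."""
    return max(checkpoint_scores)


# Hypothetical verifier scores for one task (illustrative, not measured).
breadth_shots = [0.0, 1.0, 0.5]
# Depth reaches a full solve at shot 2, then a late steer undoes it.
depth_checkpoints = [0.5, 1.0, 0.0]

if __name__ == "__main__":
    print("breadth (best-of-K):", breadth_score(breadth_shots))
    print("depth-FINAL:", depth_final_score(depth_checkpoints))
    print("depth-BEST:", depth_best_score(depth_checkpoints))
```

With these made-up numbers, final-state scoring reports a loss for depth on a task where both arms reached a full solve at some point.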

Fix

Score the DB state after every depth shot; report depth-BEST (max checkpoint, symmetric with breadth's best-of-K) alongside depth-FINAL. Checkpointing is deployable — snapshot the artifact, keep the best-verifying state (exactly what breadth does across its K).
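
A minimal sketch of the keep-best checkpoint policy, assuming the artifact can be deep-copied and that a verifier returns a numeric score. `run_depth_with_checkpoints`, `DepthResult`, and the shot/verifier signatures are hypothetical names for illustration, not the harness's actual API.

```python
import copy
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class DepthResult:
    best_state: Any          # snapshot of the best-verifying checkpoint
    best_score: float        # depth-BEST
    final_score: float       # depth-FINAL
    trajectory: List[float]  # verifier score after every shot

    @property
    def degradation(self) -> float:
        """How much score was reached and then undone (best - final)."""
        return self.best_score - self.final_score


def run_depth_with_checkpoints(
    state: Any,
    shots: Sequence[Callable[[Any], Any]],
    verify: Callable[[Any], float],
) -> DepthResult:
    """Apply shots sequentially, scoring and snapshotting after each one."""
    if not shots:
        raise ValueError("need at least one shot")
    best_state, best_score = None, float("-inf")
    trajectory: List[float] = []
    for shot in shots:
        state = shot(state)  # a shot may mutate the state in place
        score = verify(state)
        trajectory.append(score)
        if score > best_score:  # strict: ties keep the earliest checkpoint
            best_score = score
            best_state = copy.deepcopy(state)  # snapshot survives later edits
    return DepthResult(best_state, best_score, trajectory[-1], trajectory)
```

The deep copy is the design choice that matters here: without a snapshot, a later shot that mutates the artifact in place would also overwrite the "best" state.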

The result reverses (S0 generic, n=24, gpt-4.1, paired bootstrap)

| metric | value |
|---|---|
| depth-FINAL − breadth (the old, biased number) | −0.1pp, n.s. (the prior −9.9pp was noise + degradation) |
| depth-BEST − breadth, score | +6.0pp, CI [−0.4, +13.1] |
| depth-BEST − breadth, resolved | +12.5pp, CI [0.0, +25.0] |
| degradation = best − final | +6.2pp (steering reached better states, then undid them) |
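
For readers unfamiliar with the method, a paired bootstrap over per-task differences can be sketched as follows. This is an assumption about the general technique, not the harness's code: the PR does not say which bootstrap variant, resample count, or confidence level was used, so the percentile interval, `n_resamples`, `alpha`, and `seed` below are all illustrative choices.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    depth: Sequence[float],
    breadth: Sequence[float],
    n_resamples: int = 10_000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, CI low, CI high) for depth - breadth.

    Tasks are resampled with replacement; each resample keeps the
    depth and breadth scores of a task together (that is the pairing).
    """
    if len(depth) != len(breadth) or not depth:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [d - b for d, b in zip(depth, breadth)]
    n = len(diffs)
    rng = random.Random(seed)
    # Mean difference of each resampled task set, sorted for percentiles.
    boot = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    low = boot[int((alpha / 2) * (n_resamples - 1))]
    high = boot[int((1 - alpha / 2) * (n_resamples - 1))]
    return sum(diffs) / n, low, high
```

A difference is usually read as not significant at the chosen level when the interval contains zero.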

Within-run steering does NOT lose on EOPS. Depth beats breadth even with a generic steer, once scored fairly. The earlier "steering loses everywhere" conclusion was a scoring artifact and should not drive strategy.

The HumanEval gates used best-of-K on both arms (no asymmetry), so those results stand — the boundary is intact: breadth wins on stateless codegen, depth(+checkpoint) wins on stateful agentic ops. Two independent engineering takeaways: (1) always checkpoint the artifact and keep the best-verifying state — final-state scoring undersells steering; (2) steering can actively degrade, so the keep-best policy is load-bearing, not cosmetic.

Test

Typecheck is clean. The instrumented n=24 run completed against the live gym with 0 tasks excluded, and depth-best, depth-final, and the per-shot trajectory are all reported.

…topsy)

Autopsy of the "depth loses to breadth" result (.evolve/autopsies/2026-06-08):
the comparison was asymmetric. breadth = best-of-K (max over K independent shots,
verifier-selected); depth = FINAL state after K shots. So depth alone paid for
late-shot degradation — a steer that makes the model re-touch the DB and undo
correct work. Artifacts showed the signature: depth ending 0/N on tasks breadth
solved (2/2->0, 5/7->0).

Fix: score the DB state after EVERY depth shot; report depth-BEST (max checkpoint,
symmetric with breadth's best-of-K) alongside depth-FINAL. Checkpointing is
deployable (snapshot the artifact, keep the best-verifying state).

Re-run (S0 generic, n=24, gpt-4.1): the -9.9pp REVERSES.
  depth-FINAL - breadth   -0.1pp  n.s.   (the -9.9pp was noise + degradation)
  depth-BEST  - breadth   +6.0pp  CI [-0.4,+13.1]  score
  depth-BEST  - breadth  +12.5pp  CI [ 0.0,+25.0]  resolved
  degradation = best - final = +6.2pp (steering reached better states, then undid them)

So within-run steering does NOT lose on EOPS — depth beats breadth even with a
GENERIC steer, once scored fairly. The HumanEval gates used best-of-K on BOTH arms
(no asymmetry) so those results are unaffected. depth-best is now the headline metric.

@drewstone merged commit b81d6ef into main on Jun 8, 2026
1 check passed
@drewstone deleted the fix/eops-depth-best-scoring branch on June 8, 2026, 23:36