feat(bench): EOPS analyst-steered depth (S1) — de-confound steer quality #201

Merged
drewstone merged 1 commit into main from feat/eops-analyst-steer on Jun 8, 2026
@drewstone

Contributor

What

Isolates steer quality as the variable in the EOPS depth-vs-breadth gate. The prior depth-win (+13.4pp) used a trace-analyst on a different harness, so steer type, harness, and n were all confounded vs the n=24 generic-steer result (−9.9pp). This varies only the steerer — same eops-gate harness, same domain, same n=24, same breadth control.

  • routerToolLoop now returns toolTrace (each tool call + result) — what a trace-analyst reads (behavior, never the verdict).
  • STEER=analyst wires an inline agent-eval-style trace-analyst as the depth steer: reads the agent's tool-call trace, diagnoses the remaining gap, issues one concrete corrective instruction. Firewalled — never sees the verifiers/expected values. STEER=generic keeps the fixed nudge.
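
The two bullets above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `ToolTraceEntry`, `buildAnalystPrompt`, `nextSteer`, `GENERIC_NUDGE`, and the `analyst` callback are hypothetical names, and the field layout of a trace entry is assumed. The point it shows is the firewall: the analyst's input is built from the task and the tool-call trace only, so verifiers and expected values have no path into the prompt.

```typescript
// Hypothetical shape: the PR only says toolTrace holds each tool call and its result.
interface ToolTraceEntry {
  tool: string;    // name of the tool the agent called
  args: unknown;   // arguments passed to the tool
  result: unknown; // what the tool returned
}

type SteerMode = "generic" | "analyst";

// Placeholder for the fixed nudge; the real wording is not given in the PR.
const GENERIC_NUDGE = "Review your work so far and continue toward the goal.";

// Builds the analyst's input from behavior only. There is no parameter for
// verifiers or expected values, so they cannot leak into the prompt.
function buildAnalystPrompt(task: string, trace: ToolTraceEntry[]): string {
  const lines = trace.map(
    (e, i) =>
      `${i + 1}. ${e.tool}(${JSON.stringify(e.args)}) -> ${JSON.stringify(e.result)}`
  );
  return [
    `Task: ${task}`,
    "Tool-call trace:",
    ...lines,
    "Diagnose the remaining gap and give exactly one concrete corrective instruction.",
  ].join("\n");
}

// Picks the steer message for the next depth round.
// `analyst` is a hypothetical callback standing in for the model call.
async function nextSteer(
  mode: SteerMode,
  task: string,
  trace: ToolTraceEntry[],
  analyst: (prompt: string) => Promise<string>
): Promise<string> {
  if (mode === "generic") return GENERIC_NUDGE;
  return analyst(buildAnalystPrompt(task, trace));
}
```

Keeping the firewall in the function signature rather than in a runtime filter means a later change cannot pass verifier data to the analyst without visibly widening the interface.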

Result (n=24, gpt-4.1, paired bootstrap) — the clean de-confound

| steerer | depth − breadth (score) | verdict |
| --- | --- | --- |
| generic (S0) | −9.9pp | SIGNIF negative |
| agent-eval analyst (S1) | −7.2pp | n.s. (CI [−19.7, +4.6], disc 9) |

A real inline analyst moves depth in the right direction (−9.9 → −7.2, out of significance) but still does not beat breadth. So a good steerer helps directionally, but is not enough to flip depth>breadth on EOPS in this off-box harness — the prior +13.4pp does not replicate under clean paired-bootstrap, even with an analyst.
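
For readers unfamiliar with the statistic, here is a minimal sketch of a paired bootstrap on per-task score differences. It assumes a percentile 95% interval and a seeded PRNG for reproducibility. The PR does not state which CI method, iteration count, or seed the harness uses, so `pairedBootstrap`, `mulberry32`, and the defaults are illustrative only. The verdict follows the table's convention: significant when the interval excludes zero, n.s. otherwise.

```typescript
// Small seeded PRNG (mulberry32) so the resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface PairedResult {
  meanDiff: number; // mean of (depth - breadth) over tasks
  ciLow: number;    // 2.5th percentile of the resampled means
  ciHigh: number;   // 97.5th percentile of the resampled means
  verdict: "significant negative" | "significant positive" | "n.s.";
}

// depth[i] and breadth[i] must be scores for the same task (that is the pairing).
function pairedBootstrap(
  depth: number[],
  breadth: number[],
  iters = 10000,
  seed = 1
): PairedResult {
  if (depth.length !== breadth.length || depth.length === 0) {
    throw new Error("depth and breadth must be non-empty and the same length");
  }
  const n = depth.length;
  const diffs = depth.map((d, i) => d - breadth[i]);
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let b = 0; b < iters; b++) {
    // Resample task indices with replacement, keeping each pair intact.
    let sum = 0;
    for (let i = 0; i < n; i++) sum += diffs[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const ciLow = means[Math.floor(0.025 * (iters - 1))];
  const ciHigh = means[Math.ceil(0.975 * (iters - 1))];
  const verdict =
    ciHigh < 0 ? "significant negative" : ciLow > 0 ? "significant positive" : "n.s.";
  const meanDiff = diffs.reduce((s, x) => s + x, 0) / n;
  return { meanDiff, ciLow, ciHigh, verdict };
}
```

Resampling the differences rather than the two arms independently is what makes the test paired: task-level difficulty cancels out within each pair.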

What's next (the matrix, mapped to the repo's "analyst = 3 runtimes" F3)

The richest remaining steerer is HALO (the recursive/parallel fanout trace-analysis engine, on PATH) — the last shot at flipping it. It connects via the router; feeding it needs OpenInference/OTLP-shaped span traces (build pending). S3 = HALO + agent-eval combined.
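
Since the span build is still pending, the following is only a guess at what the adapter might look like, not the planned implementation. `SpanLike` is a simplified, hypothetical shape rather than the exact OTLP wire format. The `startMs`/`endMs` fields are an assumption (the PR does not say toolTrace carries timestamps). The attribute keys follow my understanding of the OpenInference semantic conventions and should be checked against that spec and against whatever HALO actually expects.

```typescript
// Hypothetical trace entry; timestamps are assumed, not confirmed by the PR.
interface ToolTraceEntry {
  tool: string;
  args: unknown;
  result: unknown;
  startMs: number; // wall-clock start, milliseconds since epoch
  endMs: number;   // wall-clock end, milliseconds since epoch
}

// Simplified span shape for illustration; real OTLP encodes attributes as
// typed key/value lists and nests spans under resource and scope records.
interface SpanLike {
  traceId: string;
  spanId: string;
  name: string;
  startTimeUnixNano: string;
  endTimeUnixNano: string;
  attributes: Record<string, string>;
}

// Integer milliseconds to a nanosecond string, without needing BigInt.
function msToNanoString(ms: number): string {
  return String(Math.trunc(ms)) + "000000";
}

// Maps each tool call to one span; span IDs are derived from the call index.
function toSpans(traceId: string, trace: ToolTraceEntry[]): SpanLike[] {
  return trace.map((e, i) => ({
    traceId,
    spanId: ("0000000000000000" + (i + 1).toString(16)).slice(-16),
    name: e.tool,
    startTimeUnixNano: msToNanoString(e.startMs),
    endTimeUnixNano: msToNanoString(e.endMs),
    attributes: {
      "openinference.span.kind": "TOOL",
      "tool.name": e.tool,
      "input.value": JSON.stringify(e.args),
      "output.value": JSON.stringify(e.result),
    },
  }));
}
```

As with the inline analyst, only behavior (calls and results) is exported here, which would keep the same firewall in place for an external engine.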

Test

Typecheck is clean. STEER=analyst ran n=24 with 0 excluded against the live gym, and the analyst steer was confirmed firing.

The n=24 EOPS gate showed GENERIC-steered depth losing to breadth (-9.9pp). But
the prior depth-WIN (+13.4pp) used a trace-analyst on a different harness — so steer
type, harness, and n were all confounded. This isolates the steerer: same eops-gate
harness, same domain, same n, same breadth control — only STEER varies.

- routerToolLoop now returns toolTrace (each tool call + result) — what a trace
  analyst reads (behavior, never the verdict).
- STEER=analyst wires an inline agent-eval-style trace-analyst as the depth steer:
  it reads the agent's tool-call trace, diagnoses the remaining gap, and issues one
  concrete corrective instruction. FIREWALLED — never sees the verifiers/expected
  values. STEER=generic keeps the fixed nudge (the -9.9pp control).

The decisive comparison: depth@analyst vs breadth, vs depth@generic vs breadth. If
the analyst flips depth from significantly-losing to winning where the generic steer
lost — same harness/n/domain — steer QUALITY is the operative variable, not depth.
Maps to the repo's "analyst = 3 runtimes" F3 (inline / Halo-cli / sandboxed-fanout).
drewstone merged commit 6d90502 into main on Jun 8, 2026
1 check passed
drewstone deleted the feat/eops-analyst-steer branch on June 8, 2026 23:21