feat(driver): error-grounded reflection — gepaDriver targets real failures (0.70.0) by tangletools · Pull Request #146 · tangle-network/agent-eval

tangletools · 2026-05-31T01:40:46Z

The conjunct-2 fix. Adversarial verification on legal + tax (two worker models) showed that the gepaDriver's candidates regressed the baseline. The driver reflected on per-scenario scores only: the judge's notes (the 'why') were dropped before reflection, so it proposed generic rewrites that hurt a capable model.

Threads judge notes through generically: campaignBreakdown → scenarios[].notes → buildEvidence → TrialTrace.failureNote → a 'Why it scored low' block in the reflection prompt.
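A minimal sketch of that data flow, for readers who want to see its shape. Only the names `campaignBreakdown`, `buildEvidence`, `TrialTrace.failureNote`, `buildReflectionPrompt`, and the 'Why it scored low' heading come from this PR; every field, signature, the `JudgedTrial` type, the `0.5` low-score threshold, and the sample data are assumptions made for illustration and may not match the real code.

```typescript
// Hypothetical shapes: field names and signatures are assumptions, not the repo's API.
interface JudgedTrial { scenarioId: string; score: number; notes?: string }
interface ScenarioBreakdown { scenarioId: string; score: number; notes: string[] }
interface TrialTrace { scenarioId: string; score: number; failureNote?: string }

// Aggregate trials per scenario: mean score plus deduplicated judge notes.
function campaignBreakdown(trials: JudgedTrial[]): ScenarioBreakdown[] {
  const byScenario = new Map<string, { scores: number[]; notes: Set<string> }>();
  for (const t of trials) {
    const entry = byScenario.get(t.scenarioId) ?? { scores: [], notes: new Set<string>() };
    entry.scores.push(t.score);
    const note = t.notes?.trim();
    if (note) entry.notes.add(note); // a Set drops repeated notes
    byScenario.set(t.scenarioId, entry);
  }
  return Array.from(byScenario.entries()).map(([scenarioId, e]) => ({
    scenarioId,
    score: e.scores.reduce((a, b) => a + b, 0) / e.scores.length,
    notes: Array.from(e.notes),
  }));
}

// Carry the notes of low-scoring scenarios into the trace (threshold is assumed).
function buildEvidence(scenarios: ScenarioBreakdown[], threshold = 0.5): TrialTrace[] {
  return scenarios.map((s) => {
    const trace: TrialTrace = { scenarioId: s.scenarioId, score: s.score };
    if (s.score < threshold && s.notes.length > 0) trace.failureNote = s.notes.join("; ");
    return trace;
  });
}

// Render scores, then a 'Why it scored low' block when any failure note exists.
function buildReflectionPrompt(traces: TrialTrace[]): string {
  const sections = ["Scores:", ...traces.map((t) => `- ${t.scenarioId}: score ${t.score.toFixed(2)}`)];
  const why = traces
    .filter((t) => t.failureNote !== undefined)
    .map((t) => `- ${t.scenarioId}: ${t.failureNote}`);
  if (why.length > 0) sections.push("", "Why it scored low:", ...why);
  return sections.join("\n");
}

// Made-up sample data: the note states a general failure pattern, not an answer.
const sampleTrials: JudgedTrial[] = [
  { scenarioId: "tax-01", score: 0.2, notes: "Cites a rule without checking whether it applies" },
  { scenarioId: "tax-01", score: 0.4, notes: "Cites a rule without checking whether it applies" },
  { scenarioId: "legal-01", score: 0.9 },
];
const prompt = buildReflectionPrompt(buildEvidence(campaignBreakdown(sampleTrials)));
console.log(prompt);
```

The point of the sketch is that the note travels alongside the score at every hop, so the reflection step sees a reason rather than a bare number.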

Anti-overfit (Drew's guardrail): by contract, notes are generalizable failure patterns, never case ground truth (that would be memorization); the held-out gate is the structural backstop. The change is generic: any agent that emits informative judge notes benefits. 3 tests; full suite 1645 green.

…r (0.70.0)

Adversarial verification on TWO domains (legal + tax, two worker models): the gepaDriver's candidates REGRESSED the baseline (gate correctly held, nothing improved). Root cause: it reflected on per-scenario SCORES only. The judge's notes (the 'why it failed') were computed but DROPPED before the reflection, so it proposed generic rewrites a capable model already knows.

Thread judge notes through generically: campaignBreakdown collects per-scenario notes (deduped) -> GenerationCandidate.scenarios[].notes -> gepaDriver buildEvidence -> TrialTrace.failureNote -> buildReflectionPrompt renders a 'Why it scored low' block. The optimizer now targets the real failure pattern.

Anti-overfit: notes are GENERALIZABLE patterns by contract (not case ground truth, which would be memorization), and the held-out gate is the structural backstop (overfit can't clear the paired-bootstrap CI on unseen cases). Generic: any agent benefits by emitting informative judge notes. 3 tests; suite 1645 green.
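To illustrate why an overfit candidate struggles to clear a paired-bootstrap gate, here is a rough sketch of such a check. It is not the repo's gate: the function names, the 95% percentile interval, the 2000 resamples, the seeded generator, and the "lower bound above zero" acceptance rule are all assumptions chosen for the example.

```typescript
// Small seeded PRNG (mulberry32) so the resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Percentile bootstrap CI for the mean per-case difference (candidate - baseline).
// Scores are paired: index i in both arrays refers to the same held-out case.
function pairedBootstrapCI(
  baseline: number[],
  candidate: number[],
  resamples = 2000,
  alpha = 0.05,
  seed = 1,
): { lower: number; upper: number } {
  if (baseline.length === 0 || baseline.length !== candidate.length) {
    throw new Error("scores must be paired and non-empty");
  }
  const diffs = candidate.map((c, i) => c - baseline[i]);
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    // Resample cases with replacement, keeping each pair together.
    for (let i = 0; i < diffs.length; i++) {
      sum += diffs[Math.floor(rand() * diffs.length)];
    }
    means.push(sum / diffs.length);
  }
  means.sort((x, y) => x - y);
  return {
    lower: means[Math.floor((alpha / 2) * (resamples - 1))],
    upper: means[Math.ceil((1 - alpha / 2) * (resamples - 1))],
  };
}

// Assumed acceptance rule: promote only if the whole interval sits above zero.
function clearsHeldOutGate(baseline: number[], candidate: number[]): boolean {
  return pairedBootstrapCI(baseline, candidate).lower > 0;
}

// Made-up held-out scores for four cases.
const heldOutBaseline = [0.5, 0.6, 0.4, 0.7];
const improved = clearsHeldOutGate(heldOutBaseline, [0.7, 0.8, 0.6, 0.9]);
const unchanged = clearsHeldOutGate(heldOutBaseline, heldOutBaseline);
const regressed = clearsHeldOutGate(heldOutBaseline, [0.4, 0.5, 0.3, 0.6]);
console.log({ improved, unchanged, regressed });
```

Because the interval is computed on cases the optimizer never reflected on, a candidate that merely memorized training scenarios has no consistent per-case gain to push the lower bound above zero.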

tangletools merged commit 28367b3 into main May 31, 2026

tangletools deleted the feat/error-grounded-reflection branch May 31, 2026 01:40
