feat(driver): error-grounded reflection — gepaDriver targets real failures (0.70.0)#146

Merged: tangletools merged 1 commit into main from feat/error-grounded-reflection on May 31, 2026.


@tangletools
Contributor

The conjunct-2 fix. Adversarial verification on legal + tax (two worker models) showed that the gepaDriver's candidates regressed the baseline. The driver reflected on per-scenario scores only; the judge's notes (the 'why') were dropped before reflection, so it proposed generic rewrites that hurt a capable model.

Threads judge notes through generically: campaignBreakdown → scenarios[].notes → buildEvidence → TrialTrace.failureNote → a 'Why it scored low' block in the reflection prompt.
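The first hop of that chain (collecting deduplicated judge notes per scenario) can be pictured as follows. This is a minimal TypeScript sketch, assuming each judged run carries a scenario id, a score and an optional free-text note. Only the names campaignBreakdown and scenarios[].notes come from this PR; the JudgedRun and ScenarioBreakdown types, their fields, and the trim-then-dedupe rule are hypothetical.

```typescript
// Hypothetical record of one judged run; field names are illustrative assumptions.
interface JudgedRun {
  scenarioId: string;
  score: number; // judge score, assumed to be in [0, 1]
  notes?: string; // judge's free-text "why" (a generalizable failure pattern)
}

// Hypothetical per-scenario summary mirroring scenarios[].notes.
interface ScenarioBreakdown {
  scenarioId: string;
  meanScore: number;
  notes: string[]; // deduplicated, first-seen order
}

// Group runs by scenario, average their scores, and keep each distinct
// (trimmed) judge note once so repeated comments don't crowd the prompt.
function campaignBreakdown(runs: JudgedRun[]): ScenarioBreakdown[] {
  const byScenario = new Map<string, { scores: number[]; notes: Set<string> }>();
  for (const run of runs) {
    let entry = byScenario.get(run.scenarioId);
    if (entry === undefined) {
      entry = { scores: [], notes: new Set<string>() };
      byScenario.set(run.scenarioId, entry);
    }
    entry.scores.push(run.score);
    const note = run.notes?.trim();
    if (note) entry.notes.add(note); // skips missing and blank notes
  }
  // Map and Set both iterate in insertion order, so output order is stable.
  return Array.from(byScenario.entries()).map(([scenarioId, e]) => ({
    scenarioId,
    meanScore: e.scores.reduce((a, b) => a + b, 0) / e.scores.length,
    notes: Array.from(e.notes),
  }));
}
```

Deduplicating at collection time keeps the reflection prompt short when several runs of the same scenario fail for the same reason.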

Anti-overfit (Drew's guardrail): by contract, notes are generalizable failure patterns, never case ground truth (that would be memorization), and the held-out gate is the structural backstop. The change is generic: any agent benefits. 3 tests; full suite (1645) green.


Adversarial verification on TWO domains (legal + tax, two worker models): the
gepaDriver's candidates REGRESSED the baseline (gate correctly held, nothing
improved). Root cause: it reflected on per-scenario SCORES only — the judge's
notes (the 'why it failed') were computed but DROPPED before the reflection, so
it proposed generic rewrites a capable model already knows.

Thread judge notes through generically: campaignBreakdown collects per-scenario
notes (deduped) -> GenerationCandidate.scenarios[].notes -> gepaDriver
buildEvidence -> TrialTrace.failureNote -> buildReflectionPrompt renders a 'Why
it scored low' block. The optimizer now targets the real failure pattern.
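The downstream hops (buildEvidence → TrialTrace.failureNote → buildReflectionPrompt) might look roughly like this. It is an illustrative sketch, not the driver's code: the ScenarioResult type, the "; " separator used to fold notes into one failureNote, the rule that every scenario with notes gets a line in the block, and the prompt wording are all assumptions.

```typescript
// Hypothetical input: one scored scenario with its deduplicated judge notes.
interface ScenarioResult {
  scenarioId: string;
  score: number;
  notes: string[];
}

// Hypothetical trace shape; only the failureNote field name comes from the PR.
interface TrialTrace {
  scenarioId: string;
  score: number;
  failureNote?: string;
}

// Fold each scenario's notes into a single failureNote (omitted when there are none).
function buildEvidence(scenarios: ScenarioResult[]): TrialTrace[] {
  return scenarios.map((s) => ({
    scenarioId: s.scenarioId,
    score: s.score,
    ...(s.notes.length > 0 ? { failureNote: s.notes.join("; ") } : {}),
  }));
}

// Render scores plus a "Why it scored low" block listing the judge's reasons,
// so the optimizer targets the actual failure pattern rather than a bare number.
function buildReflectionPrompt(currentInstructions: string, traces: TrialTrace[]): string {
  const lines: string[] = ["Current instructions:", currentInstructions, "", "Scores:"];
  for (const t of traces) {
    lines.push(`- ${t.scenarioId}: ${t.score.toFixed(2)}`);
  }
  const explained = traces.filter((t) => t.failureNote !== undefined);
  if (explained.length > 0) {
    lines.push("", "Why it scored low:");
    for (const t of explained) {
      lines.push(`- ${t.scenarioId}: ${t.failureNote}`);
    }
  }
  lines.push(
    "",
    "Revise the instructions to address these general failure patterns;",
    "do not hard-code answers for individual cases.",
  );
  return lines.join("\n");
}
```

When no scenario has notes, the block is simply omitted and the prompt degrades to the old score-only form.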

Anti-overfit: notes are GENERALIZABLE patterns by contract (not case ground
truth — that's memorization), and the held-out gate is the structural backstop
(overfit can't clear the paired-bootstrap CI on unseen cases). Generic — any
agent benefits by emitting informative judge notes. 3 tests; suite 1645 green.
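To make the backstop concrete, here is a minimal sketch of a paired-bootstrap gate, assuming per-case scores for the baseline and the candidate on the same held-out cases. The function names, the 2,000 resamples, the 95% percentile interval and the seeded mulberry32 PRNG are illustrative choices, not the project's implementation.

```typescript
// Small deterministic PRNG (mulberry32) so the gate is reproducible; returns [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Paired bootstrap: resample held-out case indices with replacement, keeping each
// case's (candidate - baseline) difference paired, and return the lower percentile
// bound of the resampled mean differences.
function pairedBootstrapLowerBound(
  baseline: number[],
  candidate: number[],
  iterations = 2000,
  alpha = 0.05,
  seed = 42,
): number {
  if (baseline.length !== candidate.length || baseline.length === 0) {
    throw new Error("need equal-length, non-empty paired scores");
  }
  const diffs = candidate.map((c, i) => c - baseline[i]);
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let b = 0; b < iterations; b++) {
    let sum = 0;
    for (let k = 0; k < diffs.length; k++) {
      sum += diffs[Math.floor(rand() * diffs.length)];
    }
    means.push(sum / diffs.length);
  }
  means.sort((x, y) => x - y);
  return means[Math.floor((alpha / 2) * iterations)];
}

// Gate: accept the candidate only if the whole interval for the improvement is above zero.
function passesHeldOutGate(baseline: number[], candidate: number[]): boolean {
  return pairedBootstrapLowerBound(baseline, candidate) > 0;
}
```

Pairing each case with itself cancels per-case difficulty, so the interval reflects only the candidate's effect. A candidate that merely echoes training-case specifics would show no consistent gain on unseen cases, leaving the lower bound at or below zero and the gate closed.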
@tangletools tangletools merged commit 28367b3 into main May 31, 2026
@tangletools tangletools deleted the feat/error-grounded-reflection branch May 31, 2026 01:40