feat(eval-campaign): shared self-improvement scaffold + integration skill by drewstone · Pull Request #13 · tangle-network/agent-app

drewstone · 2026-06-05T23:01:38Z

A curated app-shell surface (@tangle-network/agent-app/eval-campaign) so any product agent wires its measure → optimize → gate → ship loop from one import, instead of each hand-rolling the loop harness around the substrate.

What the substrate already owns (and this does NOT rebuild)

selfImprove (@tangle-network/agent-eval/contract) already owns the whole cycle: train/holdout split, the GEPA driver, the held-out production gate, durable provenance + hosted ingest, every default. The duplication across product agents is the harness wrapped around it — the identical runImprovementLoop({...}) + emitLoopProvenance({...}) blocks and a re-implemented judge-ensemble reducer.

What this adds

buildEnsembleJudge — turns a per-rubric scoreOne into a JudgeConfig that fans out N uncorrelated judge calls and reduces them via the substrate's aggregateJudgeVerdicts (agent-eval 0.81.0): survivor-mean per dimension, inter-rater disagreement spread, cost sum. Fail-loud — one judge failing is recorded and dropped (never a zero); ALL failing throws → a failed cell, never a silent zero.
Curated re-exports — selfImprove, defaultProductionGate, paretoSignificanceGate, gepaDriver, evolutionaryDriver, runCampaign, and the types, so a product imports from one module rather than three agent-eval subpaths.
SKILL.md (auto-discovered as /eval-campaign) — the Flue-style integration breadcrumb: the three things a product owns (scenarios, agent, judge), a copy-paste minimal wiring, every knob + default, the layering rule, the fail-loud contract, and the anti-patterns it deletes.

Layering

The scaffold lives in the consumer (agent-app) and composes the substrate downward — no upward import. aggregateJudgeVerdicts/selfImprove/gates/drivers all come from agent-eval.

Note for review

Peer floor on @tangle-network/agent-eval moves >=0.50.0 → >=0.81.0 (the version exposing aggregateJudgeVerdicts). Only consumers that upgrade to this agent-app are affected — and they upgrade precisely to get this surface. Flagging in case you'd prefer a narrower floor.

Verification

pnpm typecheck clean (against published agent-eval 0.81.0)
pnpm build clean — dist/eval-campaign/{index.js,index.d.ts} emitted; new ./eval-campaign export wired in package.json + tsup
pnpm test — 181 passing (full suite + 5 new buildEnsembleJudge cases: fan-out, JudgeScore shape, one-rep-fail-survives, all-fail-throws, empty-guards)

Follow-up (separate PRs, after this publishes)

Product agents collapse their hand-rolled loop harness onto selfImprove + buildEnsembleJudge.

…kill A curated app-shell surface so any product agent wires its eval/self-improve loop from one import instead of hand-rolling the loop harness. - buildEnsembleJudge: turns a per-rubric scoreOne into a JudgeConfig that fans out N uncorrelated judge calls and reduces them via the substrate's aggregateJudgeVerdicts (survivor-mean, inter-rater spread, fail-loud on all-failed — a single judge failing never fails the cell, all failing throws a failed cell rather than a silent zero). - Re-exports selfImprove + the gates + drivers + types, so a product does not reach across three agent-eval subpaths. - SKILL.md (auto-discovered as /eval-campaign): the integration contract — the three things a product owns (scenarios, agent, judge), every knob + default, the layering rule, the fail-loud contract, and the boilerplate it deletes. selfImprove already owns the loop/split/gate/provenance; this surface adds only the judge constructor + curation. Peer floor on @tangle-network/agent-eval moves to >=0.81.0 (the version exposing aggregateJudgeVerdicts). typecheck + build green; 5 new tests; full suite 181 passing.

tangletools

Thin app-shell façade over the substrate (selfImprove + aggregateJudgeVerdicts); buildEnsembleJudge is fail-loud (one judge fails → dropped, all fail → failed cell not zero). SKILL.md is the integration contract. 181 tests green, typechecks against published 0.81.0. Approving.

* feat(skills): the Improve skill family (agentic, self-evolving) Five skills that encode HOW an agent builds + runs a self-improvement loop for a product it has never seen — distilled from repairing legal-agent's gepaDriver loop end-to-end. They sit above the eval-campaign engine (#13/#14): the engine optimizes; these skills are the judgment that makes the optimization trustworthy. - eval-architect measure the REAL deliverable, not a proxy (the empty-string / wrong-channel failure) - measurement-validation prove the metric is sound before spending; fail loud on incomplete/unpaired evidence (the fake +47) - surface-evolution run the gated loop; promote without offline/online drift; never regress a guarded dimension - improve-conductor the user-facing Improve button: calibrated, evidence- gated promotion — trust over lift - skill-evolution the meta: each skill is a measured hypothesis (frozen invariants + an evolvable judgment surface optimized by its own meta-eval). The agent-builder north-star: the produced eval yields real held-out lift on the agent it built; the fleet is the training distribution. Every skill follows a 4-part agentic contract — Invariant (frozen, human-owned) / Judgment (wide, loop-owned) / Self-test (a checkable result) / Evolves-by — so it stays adaptive without drifting. Grounded in this session's concrete failures as worked examples. * feat(skills): eval-bootstrap — build the apparatus for real at cold start Closes the hole in the Improve family: the prior skills assumed the measurement was buildable on request. They didn't answer the two hardest cold-start questions — WHAT is the right thing to improve (or the agent perfects a proxy), and WHO builds the apparatus when none exists (the improver must construct it, not tune thin air). Without these, the improver confidently ships a toy. - eval-bootstrap: the two-loop architecture (BUILD a validated, externally- grounded harness — often via a delegated agent-runtime loop — THEN optimize), with the anti-toy / anti-circular invariants: no spend until the target is user-confirmed + tied to product value + the gold is grounded in EXTERNAL truth (never gold the agent invents and grades itself against) + the harness passes measurement-validation (it RUNS, not just compiles). Self-tests: "would the user agree with these scores?", the mutation test, the non-circularity check. - improve-conductor: added the cold-start gate — invariant #4 (no optimization spend before a confirmed target + validated measurement; dispatch eval-bootstrap first) and the explicit two-step framing.

tangletools approved these changes Jun 5, 2026

View reviewed changes

drewstone merged commit f781c07 into main Jun 5, 2026
1 check passed

drewstone deleted the feat/eval-campaign branch June 5, 2026 23:03

drewstone mentioned this pull request Jun 6, 2026

feat(skills): the Improve skill family (agentic, self-evolving) #15

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(eval-campaign): shared self-improvement scaffold + integration skill#13

feat(eval-campaign): shared self-improvement scaffold + integration skill#13
drewstone merged 1 commit into
mainfrom
feat/eval-campaign

drewstone commented Jun 5, 2026

Uh oh!

tangletools left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

drewstone commented Jun 5, 2026

What the substrate already owns (and this does NOT rebuild)

What this adds

Layering

Note for review

Verification

Follow-up (separate PRs, after this publishes)

Uh oh!

tangletools left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants