feat(eval-campaign): shared self-improvement scaffold + integration skill #13
Merged
…kill

A curated app-shell surface so any product agent wires its eval/self-improve loop from one import instead of hand-rolling the loop harness.

- buildEnsembleJudge: turns a per-rubric scoreOne into a JudgeConfig that fans out N uncorrelated judge calls and reduces them via the substrate's aggregateJudgeVerdicts (survivor-mean, inter-rater spread, fail-loud on all-failed: a single judge failing never fails the cell; all failing throws a failed cell rather than a silent zero).
- Re-exports selfImprove + the gates + drivers + types, so a product does not reach across three agent-eval subpaths.
- SKILL.md (auto-discovered as /eval-campaign): the integration contract, covering the three things a product owns (scenarios, agent, judge), every knob + default, the layering rule, the fail-loud contract, and the boilerplate it deletes.

selfImprove already owns the loop/split/gate/provenance; this surface adds only the judge constructor + curation.

Peer floor on @tangle-network/agent-eval moves to >=0.81.0 (the version exposing aggregateJudgeVerdicts). typecheck + build green; 5 new tests; full suite 181 passing.
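The fail-loud reduction described for buildEnsembleJudge can be sketched as below. This is a minimal, self-contained illustration and not the substrate's aggregateJudgeVerdicts: the names `JudgeVerdict`, `JudgeOutcome`, `EnsembleScore`, and `reduceVerdicts` are hypothetical, spread is assumed to mean max minus min per dimension, and cost is assumed to be summed over surviving judges only.

```typescript
// Illustrative shapes only: the real JudgeScore and aggregateJudgeVerdicts
// types in @tangle-network/agent-eval may differ.
type JudgeVerdict = { dimensions: Record<string, number>; costUsd: number };
type JudgeOutcome =
  | { kind: "ok"; verdict: JudgeVerdict }
  | { kind: "failed"; error: string };

type EnsembleScore = {
  mean: Record<string, number>;   // survivor-mean per dimension
  spread: Record<string, number>; // max - min per dimension (assumed definition)
  costUsd: number;                // summed over surviving judges (assumption)
  survivors: number;              // how many judges returned a verdict
  dropped: string[];              // errors from judges that failed
};

function reduceVerdicts(outcomes: JudgeOutcome[]): EnsembleScore {
  if (outcomes.length === 0) {
    throw new Error("no judge outcomes to aggregate");
  }
  const survivors: JudgeVerdict[] = [];
  const dropped: string[] = [];
  for (const outcome of outcomes) {
    // A failed judge is recorded and dropped; it never contributes a zero.
    if (outcome.kind === "ok") survivors.push(outcome.verdict);
    else dropped.push(outcome.error);
  }
  if (survivors.length === 0) {
    // All judges failed: throw so the caller records a failed cell.
    throw new Error(`all ${outcomes.length} judges failed: ${dropped.join("; ")}`);
  }

  // Group scores by dimension across the surviving judges.
  const byDimension: Record<string, number[]> = {};
  let costUsd = 0;
  for (const verdict of survivors) {
    costUsd += verdict.costUsd;
    for (const dim of Object.keys(verdict.dimensions)) {
      (byDimension[dim] = byDimension[dim] || []).push(verdict.dimensions[dim]);
    }
  }

  const mean: Record<string, number> = {};
  const spread: Record<string, number> = {};
  for (const dim of Object.keys(byDimension)) {
    const values = byDimension[dim];
    mean[dim] = values.reduce((a, b) => a + b, 0) / values.length;
    spread[dim] = Math.max(...values) - Math.min(...values);
  }
  return { mean, spread, costUsd, survivors: survivors.length, dropped };
}
```

Throwing on the all-failed case keeps an outage in the judge layer from being scored as a genuinely bad answer, which is the distinction the fail-loud contract is meant to preserve.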
tangletools approved these changes on Jun 5, 2026 and left a comment:
Thin app-shell façade over the substrate (selfImprove + aggregateJudgeVerdicts); buildEnsembleJudge is fail-loud (one judge fails → dropped, all fail → failed cell not zero). SKILL.md is the integration contract. 181 tests green, typechecks against published 0.81.0. Approving.
drewstone added a commit that referenced this pull request on Jun 6, 2026
* feat(skills): the Improve skill family (agentic, self-evolving)

Five skills that encode HOW an agent builds + runs a self-improvement loop for a product it has never seen, distilled from repairing legal-agent's gepaDriver loop end-to-end. They sit above the eval-campaign engine (#13/#14): the engine optimizes; these skills are the judgment that makes the optimization trustworthy.

- eval-architect: measure the REAL deliverable, not a proxy (the empty-string / wrong-channel failure)
- measurement-validation: prove the metric is sound before spending; fail loud on incomplete/unpaired evidence (the fake +47)
- surface-evolution: run the gated loop; promote without offline/online drift; never regress a guarded dimension
- improve-conductor: the user-facing Improve button; calibrated, evidence-gated promotion; trust over lift
- skill-evolution: the meta; each skill is a measured hypothesis (frozen invariants + an evolvable judgment surface optimized by its own meta-eval). The agent-builder north-star: the produced eval yields real held-out lift on the agent it built; the fleet is the training distribution.

Every skill follows a 4-part agentic contract, Invariant (frozen, human-owned) / Judgment (wide, loop-owned) / Self-test (a checkable result) / Evolves-by, so it stays adaptive without drifting. Grounded in this session's concrete failures as worked examples.

* feat(skills): eval-bootstrap - build the apparatus for real at cold start

Closes the hole in the Improve family: the prior skills assumed the measurement was buildable on request. They didn't answer the two hardest cold-start questions: WHAT is the right thing to improve (or the agent perfects a proxy), and WHO builds the apparatus when none exists (the improver must construct it, not tune thin air). Without these, the improver confidently ships a toy.

- eval-bootstrap: the two-loop architecture (BUILD a validated, externally-grounded harness, often via a delegated agent-runtime loop, THEN optimize), with the anti-toy / anti-circular invariants: no spend until the target is user-confirmed + tied to product value + the gold is grounded in EXTERNAL truth (never gold the agent invents and grades itself against) + the harness passes measurement-validation (it RUNS, not just compiles). Self-tests: "would the user agree with these scores?", the mutation test, the non-circularity check.
- improve-conductor: added the cold-start gate, invariant #4 (no optimization spend before a confirmed target + validated measurement; dispatch eval-bootstrap first), and the explicit two-step framing.
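Two of the invariants named in these commit messages, failing loud on incomplete or unpaired evidence and never regressing a guarded dimension, can be sketched as a single promotion check. This is an illustrative assumption rather than the skills' or agent-eval's actual gate: `HeldOutCell`, `GateDecision`, and `guardedGate` are hypothetical names, and the check compares plain mean deltas without the significance testing a production gate would presumably apply.

```typescript
// Hypothetical types for illustration; not the agent-eval gate API.
type DimScores = Record<string, number>;
type HeldOutCell = { scenarioId: string; baseline?: DimScores; candidate?: DimScores };
type GateDecision = { promote: boolean; lift: number; reason: string };

function guardedGate(
  cells: HeldOutCell[],
  target: string,     // the dimension being optimized
  guarded: string[],  // dimensions that must not regress
  tolerance = 0,      // allowed mean drop on a guarded dimension
): GateDecision {
  if (cells.length === 0) {
    throw new Error("no held-out evidence: refusing to decide");
  }
  const dims = [target].concat(guarded);
  const pairs: { baseline: DimScores; candidate: DimScores }[] = [];
  for (const cell of cells) {
    const { baseline, candidate } = cell;
    // Fail loud: a scenario scored on only one side cannot support a lift claim.
    if (!baseline || !candidate) {
      throw new Error(`unpaired evidence for scenario ${cell.scenarioId}`);
    }
    for (const dim of dims) {
      if (typeof baseline[dim] !== "number" || typeof candidate[dim] !== "number") {
        throw new Error(`incomplete evidence: ${cell.scenarioId} is missing "${dim}"`);
      }
    }
    pairs.push({ baseline, candidate });
  }

  // Mean paired difference (candidate minus baseline) on one dimension.
  const meanDelta = (dim: string): number =>
    pairs.reduce((sum, p) => sum + (p.candidate[dim] - p.baseline[dim]), 0) / pairs.length;

  const lift = meanDelta(target);
  for (const dim of guarded) {
    if (meanDelta(dim) < -tolerance) {
      return { promote: false, lift, reason: `regressed guarded dimension "${dim}"` };
    }
  }
  if (lift <= 0) {
    return { promote: false, lift, reason: "no held-out lift on target" };
  }
  return { promote: true, lift, reason: "held-out lift with no guarded regression" };
}
```

Checking the guarded dimensions before the target lift mirrors the "trust over lift" ordering: a candidate that improves the target but harms a guarded dimension is rejected regardless of how large the lift is.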
A curated app-shell surface (@tangle-network/agent-app/eval-campaign) so any product agent wires its measure → optimize → gate → ship loop from one import, instead of each hand-rolling the loop harness around the substrate.

What the substrate already owns (and this does NOT rebuild)

selfImprove (@tangle-network/agent-eval/contract) already owns the whole cycle: train/holdout split, the GEPA driver, the held-out production gate, durable provenance + hosted ingest, every default. The duplication across product agents is the harness wrapped around it: the identical runImprovementLoop({...}) + emitLoopProvenance({...}) blocks and a re-implemented judge-ensemble reducer.

What this adds

- buildEnsembleJudge: turns a per-rubric scoreOne into a JudgeConfig that fans out N uncorrelated judge calls and reduces them via the substrate's aggregateJudgeVerdicts (agent-eval 0.81.0): survivor-mean per dimension, inter-rater disagreement spread, cost sum. Fail-loud: one judge failing is recorded and dropped (never a zero); ALL failing throws → a failed cell, never a silent zero.
- Re-exports selfImprove, defaultProductionGate, paretoSignificanceGate, gepaDriver, evolutionaryDriver, runCampaign, and the types, so a product imports from one module rather than three agent-eval subpaths.
- SKILL.md (auto-discovered as /eval-campaign): the Flue-style integration breadcrumb, covering the three things a product owns (scenarios, agent, judge), a copy-paste minimal wiring, every knob + default, the layering rule, the fail-loud contract, and the anti-patterns it deletes.

Layering

The scaffold lives in the consumer (agent-app) and composes the substrate downward, with no upward import. aggregateJudgeVerdicts / selfImprove / gates / drivers all come from agent-eval.

Note for review

Peer floor on @tangle-network/agent-eval moves >=0.50.0 → >=0.81.0 (the version exposing aggregateJudgeVerdicts). Only consumers that upgrade to this agent-app are affected, and they upgrade precisely to get this surface. Flagging in case you'd prefer a narrower floor.

Verification

- pnpm typecheck clean (against published agent-eval 0.81.0)
- pnpm build clean; dist/eval-campaign/{index.js,index.d.ts} emitted; new ./eval-campaign export wired in package.json + tsup
- pnpm test: 181 passing (full suite + 5 new buildEnsembleJudge cases: fan-out, JudgeScore shape, one-rep-fail-survives, all-fail-throws, empty-guards)

Follow-up (separate PRs, after this publishes)
selfImprove+buildEnsembleJudge.