chore(release): agent-app 0.2.0 — perfect the eval-campaign surface#14
Merged
Conversation
Bumps the @tangle-network/agent-eval floor to >=0.82.0 (the version where selfImprove forwards the full loop surface + defaults expectUsage to 'assert'), so a product agent collapses its entire loop harness onto one selfImprove call with no regression. SKILL.md documents the now-complete knob set (reps, promoteTopK, labeledStore/captureSource, analyzeGeneration) and the fail-loud default. Minor bump (0.1.x → 0.2.0): the peer-floor tightening is breaking for consumers below 0.82.0.
tangletools
approved these changes
Jun 5, 2026
tangletools
left a comment
There was a problem hiding this comment.
Cuts agent-app 0.2.0: peer floor → agent-eval 0.82.0 (complete selfImprove surface), skill documents the full knob set + fail-loud default. Typecheck/build/181 tests green against published 0.82.0. Approving the cut.
drewstone
added a commit
that referenced
this pull request
Jun 6, 2026
* feat(skills): the Improve skill family (agentic, self-evolving) Five skills that encode HOW an agent builds + runs a self-improvement loop for a product it has never seen — distilled from repairing legal-agent's gepaDriver loop end-to-end. They sit above the eval-campaign engine (#13/#14): the engine optimizes; these skills are the judgment that makes the optimization trustworthy. - eval-architect measure the REAL deliverable, not a proxy (the empty-string / wrong-channel failure) - measurement-validation prove the metric is sound before spending; fail loud on incomplete/unpaired evidence (the fake +47) - surface-evolution run the gated loop; promote without offline/online drift; never regress a guarded dimension - improve-conductor the user-facing Improve button: calibrated, evidence- gated promotion — trust over lift - skill-evolution the meta: each skill is a measured hypothesis (frozen invariants + an evolvable judgment surface optimized by its own meta-eval). The agent-builder north-star: the produced eval yields real held-out lift on the agent it built; the fleet is the training distribution. Every skill follows a 4-part agentic contract — Invariant (frozen, human-owned) / Judgment (wide, loop-owned) / Self-test (a checkable result) / Evolves-by — so it stays adaptive without drifting. Grounded in this session's concrete failures as worked examples. * feat(skills): eval-bootstrap — build the apparatus for real at cold start Closes the hole in the Improve family: the prior skills assumed the measurement was buildable on request. They didn't answer the two hardest cold-start questions — WHAT is the right thing to improve (or the agent perfects a proxy), and WHO builds the apparatus when none exists (the improver must construct it, not tune thin air). Without these, the improver confidently ships a toy. - eval-bootstrap: the two-loop architecture (BUILD a validated, externally- grounded harness — often via a delegated agent-runtime loop — THEN optimize), with the anti-toy / anti-circular invariants: no spend until the target is user-confirmed + tied to product value + the gold is grounded in EXTERNAL truth (never gold the agent invents and grades itself against) + the harness passes measurement-validation (it RUNS, not just compiles). Self-tests: "would the user agree with these scores?", the mutation test, the non-circularity check. - improve-conductor: added the cold-start gate — invariant #4 (no optimization spend before a confirmed target + validated measurement; dispatch eval-bootstrap first) and the explicit two-step framing.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Cuts the agent-app release that makes the
/eval-campaignself-improvement surface consumable by the product agents, on a now-complete substrate.Why 0.2.0 (not 0.1.x)
Bumps the
@tangle-network/agent-evalpeer floor>=0.81.0→>=0.82.0— the version whereselfImproveforwards the full loop surface (reps,promoteTopK,labeledStore,captureSource,expectUsage,analyzeGeneration,findings) and defaultsexpectUsage: 'assert'. A peer-floor tightening is breaking for consumers below it, so this is a minor bump in 0.x semver. Only consumers that upgrade are affected — and they upgrade to get this surface.What this unblocks
With 0.2.0 published, a product agent collapses its entire hand-rolled
runImprovementLoop+emitLoopProvenanceharness onto a singleselfImprovecall (plusbuildEnsembleJudgefor multi-model judges) with zero regression — capture, replicates, the analyst loop, and the fail-loud integrity guard all carry through.Changed
SKILL.mdconfig table documents the now-complete knob set + the'assert'fail-loud default.Verification
pnpm typecheck+pnpm buildclean against published agent-eval 0.82.0pnpm test— 181 passing