chore(release): agent-app 0.2.0 — perfect the eval-campaign surface by drewstone · Pull Request #14 · tangle-network/agent-app

drewstone · 2026-06-05T23:30:36Z

Cuts the agent-app release that makes the /eval-campaign self-improvement surface consumable by the product agents, on a now-complete substrate.

Why 0.2.0 (not 0.1.x)

Bumps the @tangle-network/agent-eval peer floor >=0.81.0 → >=0.82.0 — the version where selfImprove forwards the full loop surface (reps, promoteTopK, labeledStore, captureSource, expectUsage, analyzeGeneration, findings) and defaults expectUsage: 'assert'. A peer-floor tightening is breaking for consumers below it, so this is a minor bump in 0.x semver. Only consumers that upgrade are affected — and they upgrade to get this surface.
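To make the breaking-change reasoning concrete, here is a minimal sketch of a floor check in TypeScript. The helpers `parseVersion` and `satisfiesFloor` are hypothetical names introduced for illustration; they are not part of agent-app or agent-eval. The sketch handles only plain `x.y.z` versions and ignores pre-release tags, so it is a simplification of what a package manager actually does when resolving a `>=` peer range.

```typescript
// Hypothetical helpers for illustration only; not part of agent-app or agent-eval.
// Handles plain "x.y.z" versions and ignores pre-release/build metadata.

function parseVersion(v: string): [number, number, number] {
  const m = /^(\d+)\.(\d+)\.(\d+)$/.exec(v.trim());
  if (!m) {
    throw new Error(`not a plain x.y.z version: ${v}`);
  }
  return [Number(m[1]), Number(m[2]), Number(m[3])];
}

// Returns true when `installed` is at or above `floor`, i.e. it would satisfy
// a ">=floor" range under this simplified comparison.
function satisfiesFloor(installed: string, floor: string): boolean {
  const [aMaj, aMin, aPat] = parseVersion(installed);
  const [bMaj, bMin, bPat] = parseVersion(floor);
  if (aMaj !== bMaj) return aMaj > bMaj;
  if (aMin !== bMin) return aMin > bMin;
  return aPat >= bPat;
}

// A consumer pinned to agent-eval 0.81.0 met the old floor but not the new one.
const meetsOldFloor = satisfiesFloor("0.81.0", "0.81.0"); // true
const meetsNewFloor = satisfiesFloor("0.81.0", "0.82.0"); // false
```

Under this sketch, a consumer still on 0.81.x satisfies the old floor but fails the new one, which is why the tightening is treated as breaking and released as a minor bump in the 0.x series.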

What this unblocks

With 0.2.0 published, a product agent collapses its entire hand-rolled runImprovementLoop + emitLoopProvenance harness onto a single selfImprove call (plus buildEnsembleJudge for multi-model judges) with zero regression — capture, replicates, the analyst loop, and the fail-loud integrity guard all carry through.

Changed

Peer + dev floor on agent-eval → 0.82.0.
SKILL.md config table documents the now-complete knob set + the 'assert' fail-loud default.

Verification

pnpm typecheck + pnpm build clean against published agent-eval 0.82.0
pnpm test: 181 passing

This release also carries the previously-merged store/runtime features on main (createAgentRuntime, KVStore, createDatabaseProvider).

Bumps the @tangle-network/agent-eval floor to >=0.82.0 (the version where selfImprove forwards the full loop surface + defaults expectUsage to 'assert'), so a product agent collapses its entire loop harness onto one selfImprove call with no regression. SKILL.md documents the now-complete knob set (reps, promoteTopK, labeledStore/captureSource, analyzeGeneration) and the fail-loud default. Minor bump (0.1.x → 0.2.0): the peer-floor tightening is breaking for consumers below 0.82.0.

tangletools

Cuts agent-app 0.2.0: peer floor → agent-eval 0.82.0 (complete selfImprove surface), skill documents the full knob set + fail-loud default. Typecheck/build/181 tests green against published 0.82.0. Approving the cut.

* feat(skills): the Improve skill family (agentic, self-evolving)

Five skills that encode HOW an agent builds + runs a self-improvement loop for a product it has never seen, distilled from repairing legal-agent's gepaDriver loop end-to-end. They sit above the eval-campaign engine (#13/#14): the engine optimizes; these skills are the judgment that makes the optimization trustworthy.

- eval-architect: measure the REAL deliverable, not a proxy (the empty-string / wrong-channel failure)
- measurement-validation: prove the metric is sound before spending; fail loud on incomplete/unpaired evidence (the fake +47)
- surface-evolution: run the gated loop; promote without offline/online drift; never regress a guarded dimension
- improve-conductor: the user-facing Improve button: calibrated, evidence-gated promotion; trust over lift
- skill-evolution: the meta: each skill is a measured hypothesis (frozen invariants + an evolvable judgment surface optimized by its own meta-eval). The agent-builder north-star: the produced eval yields real held-out lift on the agent it built; the fleet is the training distribution.

Every skill follows a 4-part agentic contract, Invariant (frozen, human-owned) / Judgment (wide, loop-owned) / Self-test (a checkable result) / Evolves-by, so it stays adaptive without drifting. Grounded in this session's concrete failures as worked examples.

* feat(skills): eval-bootstrap - build the apparatus for real at cold start

Closes the hole in the Improve family: the prior skills assumed the measurement was buildable on request. They didn't answer the two hardest cold-start questions: WHAT is the right thing to improve (or the agent perfects a proxy), and WHO builds the apparatus when none exists (the improver must construct it, not tune thin air). Without these, the improver confidently ships a toy.

- eval-bootstrap: the two-loop architecture (BUILD a validated, externally-grounded harness, often via a delegated agent-runtime loop, THEN optimize), with the anti-toy / anti-circular invariants: no spend until the target is user-confirmed + tied to product value + the gold is grounded in EXTERNAL truth (never gold the agent invents and grades itself against) + the harness passes measurement-validation (it RUNS, not just compiles). Self-tests: "would the user agree with these scores?", the mutation test, the non-circularity check.
- improve-conductor: added the cold-start gate: invariant #4 (no optimization spend before a confirmed target + validated measurement; dispatch eval-bootstrap first) and the explicit two-step framing.

tangletools approved these changes Jun 5, 2026

drewstone merged commit d7d2d93 into main Jun 5, 2026
1 check passed

drewstone deleted the chore/eval-campaign-release branch June 5, 2026 23:32

drewstone mentioned this pull request Jun 6, 2026

feat(skills): the Improve skill family (agentic, self-evolving) #15

Merged

drewstone merged 1 commit into main from chore/eval-campaign-release