feat(skills): the Improve skill family (agentic, self-evolving) #15
Five skills that encode HOW an agent builds and runs a self-improvement loop for a product it has never seen, distilled from repairing legal-agent's gepaDriver loop end-to-end. They sit above the eval-campaign engine (#13/#14): the engine optimizes; these skills are the judgment that makes the optimization trustworthy.

- eval-architect: measure the REAL deliverable, not a proxy (the empty-string / wrong-channel failure)
- measurement-validation: prove the metric is sound before spending; fail loud on incomplete/unpaired evidence (the fake +47)
- surface-evolution: run the gated loop; promote without offline/online drift; never regress a guarded dimension
- improve-conductor: the user-facing Improve button: calibrated, evidence-gated promotion; trust over lift
- skill-evolution: the meta: each skill is a measured hypothesis (frozen invariants + an evolvable judgment surface optimized by its own meta-eval). The agent-builder north-star: the produced eval yields real held-out lift on the agent it built; the fleet is the training distribution.

Every skill follows a 4-part agentic contract, Invariant (frozen, human-owned) / Judgment (wide, loop-owned) / Self-test (a checkable result) / Evolves-by, so it stays adaptive without drifting. Grounded in this session's concrete failures as worked examples.
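The surface-evolution rule (promote only on real lift, never regress a guarded dimension) can be sketched as a small decision function. This is a minimal illustration, not the skill's or the engine's actual API: `Scores`, `PromotionDecision`, and `decidePromotion` are hypothetical names, and "higher score is better" is an assumption.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not from the PR.
// Assumption: every dimension is scored so that higher is better.
type Scores = { readonly [dimension: string]: number | undefined };

interface PromotionDecision {
  promote: boolean;
  reason: string;
}

function decidePromotion(
  baseline: Scores,
  candidate: Scores,
  target: string,
  guarded: readonly string[],
): PromotionDecision {
  // A guarded dimension must be measured on both sides and must not drop.
  for (const dim of guarded) {
    const before = baseline[dim];
    const after = candidate[dim];
    if (before === undefined || after === undefined) {
      return { promote: false, reason: `missing score for guarded dimension "${dim}"` };
    }
    if (after < before) {
      return { promote: false, reason: `regression on guarded dimension "${dim}"` };
    }
  }

  // Only after the guards hold do we look at lift on the target itself.
  const targetBefore = baseline[target];
  const targetAfter = candidate[target];
  if (targetBefore === undefined || targetAfter === undefined) {
    return { promote: false, reason: `missing score for target "${target}"` };
  }
  if (targetAfter <= targetBefore) {
    return { promote: false, reason: `no lift on target "${target}"` };
  }
  return { promote: true, reason: `lift on "${target}" with no guarded regression` };
}
```

Checking the guards before the target means a large lift can never mask a regression, which is the "trust over lift" ordering the commit message describes.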
tangletools
left a comment
✅ Approved — 2 non-blocking findings — cd15036b
Full multi-shot audit completed 1/1 planned shots over 5 changed files. Global verifier still owns final merge decision.
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-06-06T21:30:48Z · immutable trace
…tart

Closes the hole in the Improve family: the prior skills assumed the measurement was buildable on request. They didn't answer the two hardest cold-start questions: WHAT is the right thing to improve (or the agent perfects a proxy), and WHO builds the apparatus when none exists (the improver must construct it, not tune thin air). Without these, the improver confidently ships a toy.

- eval-bootstrap: the two-loop architecture (BUILD a validated, externally-grounded harness, often via a delegated agent-runtime loop, THEN optimize), with the anti-toy / anti-circular invariants: no spend until the target is user-confirmed + tied to product value + the gold is grounded in EXTERNAL truth (never gold the agent invents and grades itself against) + the harness passes measurement-validation (it RUNS, not just compiles). Self-tests: "would the user agree with these scores?", the mutation test, the non-circularity check.
- improve-conductor: added the cold-start gate: invariant #4 (no optimization spend before a confirmed target + validated measurement; dispatch eval-bootstrap first) and the explicit two-step framing.
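The cold-start gate described above is a conjunction of four preconditions, and a sketch makes the "no spend until" semantics concrete. The field and function names below (`ColdStartState`, `coldStartGate`) are hypothetical and chosen for illustration; the skills themselves are prose, so this is one possible encoding and not the PR's implementation.

```typescript
// Hypothetical sketch of the cold-start gate; field names are illustrative.
interface ColdStartState {
  targetConfirmedByUser: boolean; // the user agreed on WHAT to improve
  tiedToProductValue: boolean; // the target maps to real product value
  goldSource: "external" | "agent-invented"; // where the gold answers come from
  harnessValidated: boolean; // the harness RAN and passed measurement-validation
}

type GateResult = { ok: true } | { ok: false; blockers: string[] };

function coldStartGate(state: ColdStartState): GateResult {
  const blockers: string[] = [];
  if (!state.targetConfirmedByUser) blockers.push("target not user-confirmed");
  if (!state.tiedToProductValue) blockers.push("target not tied to product value");
  // Anti-circularity: gold the agent invents and grades itself against is rejected.
  if (state.goldSource !== "external") blockers.push("gold is not externally grounded");
  if (!state.harnessValidated) blockers.push("harness has not passed measurement-validation");
  // Optimization spend is allowed only when every precondition holds.
  return blockers.length === 0 ? { ok: true } : { ok: false, blockers };
}
```

Returning the full list of blockers, instead of stopping at the first one, gives the conductor everything it needs to decide whether to dispatch eval-bootstrap first.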
tangletools
left a comment
✅ Refreshed approval after new commits — c0033867
A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-06T21:31:39Z
✅ No Blockers —
| | deepseek | glm | aggregate |
|---|---|---|---|
| Readiness | 92 | 89 | 89 |
| Confidence | 65 | 65 | 65 |
| Correctness | 92 | 89 | 89 |
| Security | 92 | 89 | 89 |
| Testing | 92 | 89 | 89 |
| Architecture | 92 | 89 | 89 |
Full multi-shot audit completed 1/1 planned shots over 6 changed files. Global verifier still owns final merge decision.
🟡 LOW eval-bootstrap references knowledge-loop subpath without version constraint — .claude/skills/eval-bootstrap/SKILL.md
Line 24 references @tangle-network/agent-app/knowledge-loop's source-grounded acquisition. The subpath exists and is valid in this tree, but unlike eval-campaign (which documents a peer-dep floor agent-eval >= 0.81.0), eval-bootstrap doesn't state whether any minimum version is required. Low risk since the reference is descriptive (skill prose), not importable code.
🟡 LOW skill-evolution enumerates governed skills but omits eval-bootstrap — .claude/skills/skill-evolution/SKILL.md
Line 10: 'It governs eval-architect, measurement-validation, surface-evolution, and improve-conductor', but eval-bootstrap is also a member of the Improve family that follows the 4-part contract and is cross-referenced by improve-conductor. The list should include it for completeness, or be rewritten as a non-exhaustive reference. No functional impact (skill-loading doesn't depend on this), but it's an internal consistency gap.
🟡 LOW Documentation: runImprovementLoop not actually re-exported — .claude/skills/surface-evolution/SKILL.md
Line 10 states runImprovementLoop is among the symbols re-exported via @tangle-network/agent-app/eval-campaign. src/eval-campaign/index.ts:119-125 re-exports runCampaign (not runImprovementLoop). The eval-campaign module deliberately avoids re-exporting runImprovementLoop (line 8 comment: 'A product should NOT hand-roll runImprovementLoop'). Fix: replace runImprovementLoop with runCampaign in the parenthetical list.
tangletools · 2026-06-06T21:35:57Z · trace
What
Five agent-facing skills (.claude/skills/*, mirroring the existing eval-campaign skill) that encode how an agent builds and runs a self-improvement loop for a product it has never seen, and does it trustworthily. They are the judgment layer above the eval-campaign engine shipped in #13/#14: the engine optimizes; these skills are what keep the optimization from perfecting a fiction.

Distilled directly from repairing legal-agent's gepaDriver loop end-to-end this session: every skill's worked example is a real failure we hit.

- eval-architect
- measurement-validation (… +47)
- surface-evolution
- improve-conductor
- skill-evolution

Why this shape (agentic, not a rulebook)
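Every skill follows a 4-part contract: Invariant (frozen, human-owned), Judgment (wide, loop-owned), Self-test (a checkable result), and Evolves-by.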
Few frozen invariants hold the line; judgment is broad and loop-owned; outcomes are measured; the judgment surface self-revises. That split is how a skill stays adaptive to an unforeseen product without drifting into either a brittle checklist or unaccountable vibes.
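The 4-part contract (Invariant / Judgment / Self-test / Evolves-by) could be expressed as a type in which only the judgment surface is writable. This is a hedged sketch: `SkillContract` and `reviseJudgment` are hypothetical names, and the skills in this PR are SKILL.md prose, not typed objects.

```typescript
// Hypothetical sketch of the 4-part contract; not the SKILL.md schema.
interface SkillContract<Evidence> {
  readonly invariants: readonly string[]; // frozen, human-owned
  readonly judgment: string; // wide, loop-owned: the evolvable surface
  readonly selfTest: (evidence: Evidence) => boolean; // a checkable result
  readonly evolvesBy: string; // the signal that is allowed to revise the judgment
}

// The loop may produce a new judgment, but the invariants, self-test,
// and evolves-by rule are carried over untouched.
function reviseJudgment<Evidence>(
  skill: SkillContract<Evidence>,
  nextJudgment: string,
): SkillContract<Evidence> {
  return { ...skill, judgment: nextJudgment };
}

// Illustrative instance; the wording is an assumption paraphrasing the PR text.
const evalArchitect: SkillContract<{ scoredRealDeliverable: boolean }> = {
  invariants: ["measure the real deliverable, not a proxy"],
  judgment: "score the output the user actually receives",
  selfTest: (evidence) => evidence.scoredRealDeliverable,
  evolvesBy: "held-out lift with no critical regression",
};

const revised = reviseJudgment(evalArchitect, "score the final channel the user reads");
```

Keeping revision as a pure function over a readonly record means a loop-driven change can never edit an invariant by accident.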
The recursion / north-star
skill-evolution points the same loop the skills describe at the skills themselves: a skill's judgment surface is optimized by the verifiable reward "did following this produce an eval that yielded real held-out lift, no critical regression?", which is exactly the agent-builder north-star: the produced eval must yield real held-out lift on the agent it built. The fleet (legal/tax/gtm/creative/insurance) is the training distribution; legal-agent's repaired loop is dogfood data point #1.

Worked failures baked in as examples
- … (→ eval-architect)
- … (→ measurement-validation)
- heldOutLift=+47 that was two different personas because 2 of 4 holdout cells errored (→ measurement-validation, and the consumer-side guard now landing in legal-agent #155)

Follow-up (not in this PR)
An @tangle-network/agent-app/improve module that wires these skills to a typed defineImproveTarget + a scaffold_eval app-tool + budget-bounded runImprove, mirroring the knowledge-loop declarative→running mapper. The skills describe the contract; the module would codify the seam.
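The heldOutLift=+47 failure listed among the worked examples (a lift computed while 2 of 4 holdout cells had errored) suggests a "fail loud" guard on paired evidence. The sketch below is an assumption about how such a guard might look: `HoldoutCell` and `pairedHeldOutLift` are hypothetical names and do not reflect the guard landing in legal-agent #155.

```typescript
// Hypothetical sketch of a paired held-out lift that refuses incomplete evidence.
interface HoldoutCell {
  id: string;
  baseline: number | null; // null means the baseline run errored
  candidate: number | null; // null means the candidate run errored
}

function pairedHeldOutLift(cells: readonly HoldoutCell[]): number {
  if (cells.length === 0) {
    throw new Error("no holdout cells: lift is undefined");
  }
  const errored: string[] = [];
  let totalDelta = 0;
  for (const cell of cells) {
    if (cell.baseline === null || cell.candidate === null) {
      errored.push(cell.id);
      continue;
    }
    // Each delta compares the same cell under both arms, so the pairs stay aligned.
    totalDelta += cell.candidate - cell.baseline;
  }
  // Fail loud: averaging over the surviving cells would compare different populations.
  if (errored.length > 0) {
    throw new Error(
      `unpaired evidence: ${errored.length} of ${cells.length} holdout cells errored (${errored.join(", ")})`,
    );
  }
  return totalDelta / cells.length;
}
```

Throwing instead of silently averaging over the surviving cells is the design choice that matters here: a partial holdout set yields no number at all, so it cannot be mistaken for a lift.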