Releases: vukkt/token-warden

v0.17.0

16 Jun 13:18

Quality hardening — no plugin behavior change; this release is about making the
codebase provably tested and tight, with CI guards that can't silently slip.

  • 90% line coverage (78% branch), CI-gated. Added @vitest/coverage-v8 with
    a ratchet-floor threshold; the new coverage pipeline stage fails the build on
    any regression. Coverage rose from ~66% to 90% by unit-testing the
    subprocess/stdin CLIs (collect, gate, distill, evolve, modelbench,
    promptbench) with mocked child_process/stdin boundaries — real orchestration
    tests (fail-open contracts, verdict decisions, anomaly alerts), not padding. The
    untestable invokedDirectly entry shims are honestly excluded via v8 ignore.
  • Dead-code gate. knip (unused files/exports/deps) is wired into CI and the
    module API surface was tightened (8 internal-only exports un-exported). Zero
    unused SQL fields.
  • Component-integration + performance tests. test/integration.test.ts wires
    the real modules end-to-end (collection → distill trigger → selector → receipts
    → status) through one DB; test/perf.test.ts holds hot-path budgets — transcript
    parser ~39 MB/s (2 MB in ~50 ms vs the 2 s Stop-hook budget), 50k tool events
    attributed in ~24 ms, a 2k-session rollup in ~1.3 ms.
  • 361 tests, green on Node 22 and 24.
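
The ratchet-floor gate described above reduces to comparing measured coverage against a committed floor and failing on any shortfall. The following is a minimal sketch of that idea, not the project's pipeline code: `CoverageSummary` and `coverageRegressions` are hypothetical names, and only the 90/78 floor comes from these notes.

```typescript
// Hypothetical sketch of a ratchet-floor coverage gate; not token-warden's actual CI code.
interface CoverageSummary {
  lines: number;    // percent of lines covered, 0..100
  branches: number; // percent of branches covered, 0..100
}

// Floor taken from the v0.17.0 notes: 90% lines, 78% branches.
const FLOOR: CoverageSummary = { lines: 90, branches: 78 };

// Returns one message per metric that fell below the floor.
// An empty array means the gate passes.
function coverageRegressions(
  measured: CoverageSummary,
  floor: CoverageSummary = FLOOR,
): string[] {
  const failures: string[] = [];
  for (const metric of ["lines", "branches"] as const) {
    if (measured[metric] < floor[metric]) {
      failures.push(`${metric}: ${measured[metric]}% is below the ${floor[metric]}% floor`);
    }
  }
  return failures;
}
```

A CI step built on this would fail the build whenever the returned array is non-empty; raising the floor after a coverage gain is what makes it a ratchet.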

v0.16.0

16 Jun 12:17

Rule receipts — the per-rule verdict card (community-suggested).

  • New /warden-receipt command (npx tsx src/receipt.ts [--agent <name>] [--json]) renders the evidence behind each keep/evict decision as one card:
    token savings vs. context rent (with variance and ROI multiple), the model and
    golden-suite hash it was measured under, per-task pass/fail with vs. without
    the rule, and the tool-call / file-reread activity profile with vs. without
    (shown as a signed % so a reviewer can see whether a "cheap" rule did less
    work). Read-only; the natural payload for sharing a rule — "my delta is
    evidence, not authority for your repo."
  • The selector now records a receipt snapshot (rule_receipts table, migration
    #9) at every decision — initial and each re-audit, so a rule has an audit
    trail. The keep/evict verdict logic is unchanged; receipts are additive
    capture. RunResult now carries tool-call / file-reread counts; bench.ts
    gains goldenSuiteHash for suite provenance.
  • The safety axis is surfaced, not auto-judged: a big activity drop is usually
    the point of an efficiency rule, so the receipt shows the numbers and leaves
    the call to a human — the binding safety gate remains the per-task pass/fail
    regression, which evicts on its own.
  • 292 tests, green on Node 22 and 24.
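
Two of the receipt figures above are simple ratios, sketched here under stated assumptions. The helper names are hypothetical, and the ROI multiple is assumed to mean tokens saved per token of context rent; the real src/receipt.ts may define either differently.

```typescript
// Hypothetical helpers; the real receipt.ts may compute these differently.

// Signed percent change of an activity count with the rule, relative to without it.
// Returns null when the baseline is zero, since a percentage is undefined there.
function signedPercent(withRule: number, withoutRule: number): string | null {
  if (withoutRule === 0) return null;
  const pct = ((withRule - withoutRule) / withoutRule) * 100;
  const sign = pct > 0 ? "+" : ""; // negative values already carry "-"
  return `${sign}${pct.toFixed(1)}%`;
}

// ASSUMPTION: ROI multiple = tokens saved per token of context rent.
function roiMultiple(tokenSavings: number, contextRent: number): number | null {
  return contextRent > 0 ? tokenSavings / contextRent : null;
}
```

Under these definitions, a rule that cuts tool calls from 10 to 6 shows as a 40% drop, which the receipt surfaces for a human to judge rather than scoring automatically.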

v0.15.0

16 Jun 10:09

Tooling and docs — no plugin behavior change.

  • Staged CI/CD pipeline. .github/workflows/ci.yml is now a dependent-stage
    pipeline — quality (lint, typecheck, manifest version consistency) →
    test (Node 22 + 24) and fixture in parallel → validate (plugin-manifest
    validation + a CLI smoke run) → release. The release stage runs only on a
    vX.Y.Z tag: it verifies the tag matches the manifests and publishes the
    GitHub release with notes from CHANGELOG.md. Tag-push is now the whole
    deploy step.
  • Release helper scripts (scripts/check-versions.mjs,
    scripts/changelog-section.mjs) — version-consistency guard and changelog
    extraction, reused by CI and runnable locally (npm run check:versions).
  • Standard project docs: CONTRIBUTING.md (setup, the pipeline, the release
    flow, the design invariants) and SECURITY.md (reporting + the security
    model). README gains a Quickstart at the top of "Getting started".
  • A sweep of every source file found the codebase clean (no TODO/FIXME, no
    any types, no stray debug output, no non-text bytes). 275 tests, green on Node 22 and 24.
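
The release stage's tag check can be pictured as follows. This is a sketch only, not the contents of scripts/check-versions.mjs: `versionMismatches` is a hypothetical name, and the manifest file names passed in are whatever the caller supplies.

```typescript
// Hypothetical sketch of a tag/manifest consistency check;
// not the actual scripts/check-versions.mjs.

// Returns one message per manifest whose version disagrees with the tag.
// An empty array means everything is consistent.
// Throws if the tag is not of the form vX.Y.Z.
function versionMismatches(tag: string, manifests: Record<string, string>): string[] {
  if (!/^v\d+\.\d+\.\d+$/.test(tag)) {
    throw new Error(`not a release tag: ${tag}`);
  }
  const expected = tag.slice(1); // drop the leading "v"
  return Object.entries(manifests)
    .filter(([, version]) => version !== expected)
    .map(([file, version]) => `${file} is ${version}, tag expects ${expected}`);
}
```

Rejecting malformed tags up front keeps a stray tag such as "0.15.0" or "v0.15" from ever reaching the publish step.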

v0.14.1 — verdict-math boundary tests

16 Jun 09:41

Test-only hardening — no behavior or API change.

  • Locked the assessDelta degenerate-input boundaries that protect a keep/evict verdict from a divide-by-zero NaN: a single comparable task yields a finite point estimate with null standard error (the savings.length >= 2 guard), and no comparable task yields a null delta rather than NaN. An audit confirmed the verdict math is otherwise free of divide-by-zero / NaN paths.
  • 275 tests, green on Node 22 and 24.
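
The two boundaries above can be illustrated with a small sketch. It assumes the delta is a mean of per-task savings with a standard error of the mean; `assessDeltaSketch` and `DeltaEstimate` are hypothetical names, and the real assessDelta may differ in shape and statistics.

```typescript
// Hypothetical sketch of the boundaries described above; the real assessDelta may differ.
interface DeltaEstimate {
  mean: number | null;          // mean per-task token savings, null when nothing is comparable
  standardError: number | null; // null unless at least two tasks are comparable
}

function assessDeltaSketch(savings: number[]): DeltaEstimate {
  const n = savings.length;
  // No comparable task: report "no estimate" instead of computing 0 / 0.
  if (n === 0) return { mean: null, standardError: null };

  const mean = savings.reduce((sum, s) => sum + s, 0) / n;
  // The sample variance divides by (n - 1), so it is only defined for n >= 2.
  if (n < 2) return { mean, standardError: null };

  const variance = savings.reduce((sum, s) => sum + (s - mean) ** 2, 0) / (n - 1);
  return { mean, standardError: Math.sqrt(variance / n) };
}
```

Returning null rather than NaN matters because NaN compares false against every threshold, so a verdict built on it could silently fall through to the wrong branch.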

Full changelog: v0.14.0...v0.14.1

v0.14.0 — gate injection hardening + hygiene

16 Jun 09:32

Hardening and simplification release. No new commands; existing behavior is unchanged except the inter-agent approval prompt is now injection-proof. Bundles the work from a focused optimization pass over the codebase.

Security

  • gate.ts approval prompt is sanitized. The PreToolUse prompt for an inter-agent SendMessage interpolated the sender, recipient, and message body. A hostile teammate message could embed ANSI/control sequences to forge or obscure the line the user approves. Every interpolated field now passes through the shared sanitizer (control/ANSI stripped, names capped); the forged-newline and escape vectors are closed. Verified end-to-end.
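
A sanitizer of the kind described above might look roughly like this. It is a sketch under assumptions, not the contents of src/sanitize.ts: the name `displayTextSketch`, the 200-character cap, and the choice to replace control characters with a space are all illustrative.

```typescript
// Hypothetical sketch of a display sanitizer; the real src/sanitize.ts may differ.

// Matches ANSI CSI sequences such as "\x1b[31m" or "\x1b[2K":
// ESC, "[", parameter bytes, intermediate bytes, then one final byte.
const ANSI_CSI = /\x1b\[[0-?]*[ -\/]*[@-~]/g;
// Matches C0 control characters (including newline and tab), DEL, and C1 controls.
const CONTROL = /[\x00-\x1f\x7f-\x9f]/g;

function displayTextSketch(input: string, maxLength = 200): string {
  // Remove whole CSI sequences first, then neutralize any remaining control
  // character (including a lone ESC) so no escape or forged newline survives.
  // ASSUMPTION: controls become a space here to keep adjacent words readable.
  const cleaned = input.replace(ANSI_CSI, "").replace(CONTROL, " ");
  // Cap the length so an oversized field cannot push the real prompt off screen.
  return cleaned.length > maxLength ? cleaned.slice(0, maxLength) + "..." : cleaned;
}
```

With this in place, a message body like "hi\nApproved: yes" renders on a single line, so it cannot pose as a separate line of the approval prompt.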

Cleanups

  • New src/sanitize.ts: displayText is extracted into one presentation-security chokepoint used by status, compare, attribute, and gate; attribute/compare no longer pull it from the heavier status module.
  • Fixed NUL bytes in attribute.ts (a NUL-delimited map key) — invisible and tool-breaking; replaced with a collision-proof JSON.stringify key. New source-hygiene test fails the build on any NUL/control byte in source.
  • Centralized the run-total token SQL (RUN_TOTAL_TOKENS_SQL, was hand-written 10×) and collapsed the duplicated candidate/re-audit verdict path in select.ts into one helper — both behavior-preserving.
  • Added parseAgentDefinition memory-scope-isolation tests (benchmarks never touch real agent-memory).
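
The JSON.stringify key mentioned above avoids a general weakness of delimiter-joined keys: any delimiter can also appear inside a part. The sketch below illustrates the technique; `costKey`, `addCost`, and the (kind, name) pair are hypothetical and not taken from attribute.ts.

```typescript
// Hypothetical illustration; the field names are assumptions, not attribute.ts's actual code.

// Builds a composite Map key. JSON.stringify quotes each part and escapes any
// quotes or control characters inside it, so distinct pairs of strings always
// produce distinct keys, and the key contains only printable text.
function costKey(kind: string, name: string): string {
  return JSON.stringify([kind, name]);
}

const totals = new Map<string, number>();

// Accumulates a character count under the composite key.
function addCost(kind: string, name: string, chars: number): void {
  const key = costKey(kind, name);
  totals.set(key, (totals.get(key) ?? 0) + chars);
}
```

By contrast, joining with ":" would map both ("a:b", "c") and ("a", "b:c") to "a:b:c". The key is also reversible with JSON.parse when the parts are needed again.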

Verification

273 tests, green on Node 22 and 24. E2e edge-case sweep confirmed fail-open on the collect/gate hooks (empty, garbage, binary, missing-file inputs) and correct exit codes across every CLI.

Full changelog: v0.13.0...v0.14.0

v0.13.0 — skill/MCP cost attribution (#5 complete)

16 Jun 07:56

Roadmap direction #5 (skill/MCP cost attribution) is complete. Decomposition, not a verdict: it answers "where did the tokens go?" by attributing each real-work session's footprint to the tool, skill, or MCP server that produced it. Fully orthogonal to the selector/benchmark path: it never promotes, evicts, or measures a rule.

What's new

  • npx tsx src/attribute.ts (new /warden-attribute command) renders a cross-session rollup of tool/skill/MCP cost, or a single transcript with --transcript. Filters: --agent, --kind builtin|mcp|skill, --limit, --json.
  • transcript.ts now joins each tool_use to its tool_result by id in the existing single streaming pass, capturing the input chars the model generated and the result chars the tool injected back into context. The hot Stop-hook budget is unchanged (one pass, O(tool calls)).
  • db.ts migration #8 adds a tool_costs table; collect.ts persists per-session costs inside the existing fail-open block (real-work only — golden runs are never attributed). /warden-status gains a top-costs section.
  • Footprint is measured in characters (exact, deterministic); a rough ≈tokens figure (chars ÷ 4) is shown for intuition, not as a billed token count.
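
The id join and the chars ÷ 4 figure above can be sketched as a single pass over transcript blocks. The block shapes here are simplified assumptions (for instance, result content is a plain string), and `attributeCosts`, `ToolCost`, and `approxTokens` are hypothetical names rather than transcript.ts's real API.

```typescript
// Hypothetical sketch; block shapes are simplified assumptions, not transcript.ts's real types.
type Block =
  | { type: "tool_use"; id: string; name: string; input: unknown }
  | { type: "tool_result"; tool_use_id: string; content: string };

interface ToolCost {
  name: string;
  inputChars: number;  // characters the model generated as tool input
  resultChars: number; // characters the tool injected back into context
}

// Single pass: remember each tool_use by id, then fill in its result when the
// matching tool_result arrives. Work is proportional to the number of tool blocks.
function attributeCosts(blocks: readonly Block[]): ToolCost[] {
  const byId = new Map<string, ToolCost>();
  for (const block of blocks) {
    if (block.type === "tool_use") {
      byId.set(block.id, {
        name: block.name,
        // JSON.stringify returns undefined for an undefined input, hence the fallback.
        inputChars: (JSON.stringify(block.input) ?? "").length,
        resultChars: 0,
      });
    } else {
      const cost = byId.get(block.tool_use_id);
      if (cost) cost.resultChars += block.content.length; // orphan results are ignored
    }
  }
  return Array.from(byId.values());
}

// Rough intuition only: about four characters per token, not a billed count.
const approxTokens = (chars: number): number => Math.round(chars / 4);
```

Character counts stay exact and reproducible across runs; the token figure is derived only at display time.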

Hardening

From an adversarial review: a tool_result content array with an odd sibling (a bare string, an image block) no longer zeroes the whole result's footprint — each element is read defensively. A cross-feature regression audit confirmed fail-open is preserved, the verdict path is untouched, and the migration is additive.
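
Reading each element defensively, as described above, could look like the following sketch. The element shapes and the decision to count non-text blocks as zero characters are assumptions made for illustration; `resultChars` is a hypothetical name.

```typescript
// Hypothetical sketch; the element shapes are assumptions based on the description above.

// Counts the characters a tool_result injects into context. Each element is
// inspected on its own, so one unexpected sibling cannot zero the whole result.
function resultChars(content: unknown): number {
  if (typeof content === "string") return content.length;
  if (!Array.isArray(content)) return 0;
  let total = 0;
  for (const element of content) {
    if (typeof element === "string") {
      total += element.length; // bare string sibling
    } else if (
      typeof element === "object" &&
      element !== null &&
      typeof (element as { text?: unknown }).text === "string"
    ) {
      total += (element as { text: string }).text.length; // text block
    }
    // ASSUMPTION: anything else (for example an image block) adds zero characters.
  }
  return total;
}
```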

219 tests, green on Node 22 and 24.

Roadmap

Of the six directions, #1, #2, #3, #4, #5 (plus automated prompt evolution) are shipped. Only #6 (rule marketplaces) remains.

Full changelog: v0.12.0...v0.13.0

v0.12.0 — team-shared rule ledgers complete (CI gate)

15 Jun 18:17

Roadmap #3, increment 3: the CI gate — #3 is now complete.

What's new

  • npx tsx src/verify-ledger.ts [file...] validates committed .warden/*.rules.md ledgers and exits non-zero if any is corrupt or hand-edited, so a CI job can gate the PR. Deterministic and offline — no model tokens, no secrets; reuses increment 2's parseLedgerFile. Verified: a valid ledger passes (exit 0), a corrupted one fails (exit 1).
  • A deeper gate that re-benchmarks each rule's claimed delta in CI is possible but needs a model-token budget and credentials, so it is a documented deployment choice rather than a default.

Team-shared rule ledgers — the full arc

/warden-share (export) → /warden-adopt (import as candidates, re-measured locally, foreign delta never trusted) → verify-ledger (CI structural gate). Memory review becomes code review.

180 tests, green on Node 22 and 24.

Roadmap

Of the six directions, #1 (model-bench), #2 (prompt A/B), #3 (team-shared ledgers), #4 (cost-anomaly alerting) are shipped, plus automated prompt evolution. Two remain: #5 skill/MCP cost attribution and #6 rule marketplaces.

Full changelog: v0.11.0...v0.12.0

v0.11.0 — team-shared rule ledgers (import + re-verify)

15 Jun 18:11

Roadmap #3, increment 2: import a shared ledger and re-verify each rule locally — never trusting a foreign delta.

What's new

  • /warden-adopt --from <path> (and src/adopt.ts) reads a shared ledger (from /warden-share) and queues its rules as candidates locally. The foreign measured delta is discarded and the context rent is recomputed locally, so by invariant #1 an adopted rule is never injected into memory until the local selector re-measures it on this machine's golden suite. Near-duplicates of any existing rule (active/candidate/evicted) are skipped; re-adopting is idempotent.
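
The adoption rules above (discard the foreign delta, recompute rent, skip duplicates, stay idempotent) can be sketched as a pure function. Everything named here is hypothetical: the real src/adopt.ts works against the database, its near-duplicate test is not specified in these notes, and `localRent` is a placeholder estimate.

```typescript
// Hypothetical sketch of the adoption step; names and the duplicate test are assumptions.
interface SharedRule { body: string; measuredDelta: number; contextRent: number }
interface CandidateRule {
  body: string;
  status: "candidate";
  contextRent: number;
  measuredDelta: null;
}

// Stand-in for near-duplicate detection: compare case- and whitespace-normalized bodies.
const normalize = (body: string): string =>
  body.trim().replace(/\s+/g, " ").toLowerCase();

// Placeholder for the local rent computation, which these notes do not specify.
const localRent = (body: string): number => Math.ceil(body.length / 4);

function adoptRules(shared: SharedRule[], existingBodies: string[]): CandidateRule[] {
  const seen = new Set(existingBodies.map(normalize));
  const adopted: CandidateRule[] = [];
  for (const rule of shared) {
    const key = normalize(rule.body);
    if (seen.has(key)) continue; // already known in any state: skip
    seen.add(key);
    adopted.push({
      body: rule.body,
      status: "candidate",               // never active until measured locally
      contextRent: localRent(rule.body), // recomputed; the shared value is ignored
      measuredDelta: null,               // foreign delta discarded
    });
  }
  return adopted;
}
```

Because the output type has no way to carry a foreign delta or an active status, the "measured, not claimed" invariant is enforced by construction in this sketch.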

Why this is safe

This is the increment that writes to the rule ledger, so it was the one to handle carefully. The safety comes from a single decision: an adopted rule is just a candidate, so it flows through the existing selector untouched — the whole variance-conservative verdict path re-measures it on your own suite. There is no new trust path. The ledger JSON is zod-validated; control-char bodies and malformed blocks are rejected.

Verified end-to-end: real rules exported → adopted into a fresh DB as candidates (rent recomputed, foreign delta discarded) → re-adopt correctly skipped as duplicates. 174 tests, green on Node 22 and 24.

Full changelog: v0.10.0...v0.11.0

v0.10.0 — team-shared rule ledgers (export)

15 Jun 13:25

Roadmap #3, increment 1: export measured rules to a committed, reviewable artifact so a team can version and review agent memory like code.

What's new

  • /warden-share <agent> (and src/share.ts) writes an agent's active rules — body, measured token delta, context rent, and provenance — to .warden/<agent>.rules.md: a human-readable bullet list plus a machine-readable JSON block that round-trips. A PR adding a rule arrives with its proof, and a later import can re-verify it.
  • Read-only and zero-coupling by design. It only reads the rule ledger and writes a file, so it cannot affect the collect/distill/select loop. Verified: nothing else imports it, and it carries the invoked-directly guard from the start.
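
To make the "bullet list plus a JSON block that round-trips" idea concrete, here is a sketch with an invented layout. The actual .warden/<agent>.rules.md format is not shown in these notes, so the heading, bullet wording, field names, and both function names below are assumptions.

```typescript
// Hypothetical ledger format; the real .warden/<agent>.rules.md layout may differ.
interface LedgerRule { body: string; tokenDelta: number; contextRent: number }
interface Ledger { agent: string; rules: LedgerRule[] }

const FENCE = "`".repeat(3); // avoids writing a literal fence inside this example

// Human-readable bullets first, then the machine-readable JSON block.
function renderLedger(agent: string, rules: LedgerRule[]): string {
  const bullets = rules.map(
    (r) => `- ${r.body} (delta ${r.tokenDelta} tokens, rent ${r.contextRent} tokens)`,
  );
  const json = JSON.stringify({ agent, rules }, null, 2);
  return [`# ${agent} rules`, "", ...bullets, "", `${FENCE}json`, json, FENCE, ""].join("\n");
}

// Returns the parsed payload, or null if the JSON block is missing or malformed.
// This sketch does not validate the payload's shape.
function parseLedger(text: string): Ledger | null {
  const opener = `${FENCE}json\n`;
  const start = text.indexOf(opener);
  if (start === -1) return null;
  const bodyStart = start + opener.length;
  const end = text.indexOf(`\n${FENCE}`, bodyStart);
  if (end === -1) return null;
  try {
    return JSON.parse(text.slice(bodyStart, end)) as Ledger;
  } catch {
    return null;
  }
}
```

Only the JSON block is authoritative in this layout; the bullets exist for reviewers, so a parser never has to interpret prose.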

Why export first

The risky part of team-shared ledgers is import, not export — a foreign rule claims a measured delta, and the project's one inviolable rule is "measured, not claimed." So import must re-pass the importer's own golden suite (reusing the existing selector; no new trust path). Shipping export first locks the artifact format before any import logic touches the selector.

166 tests, all green on Node 22 and 24.

Full changelog: v0.9.1...v0.10.0

v0.9.1 — documentation fixes

15 Jun 08:20

Documentation-only release (no code changes).

  • Roadmap de-drifted. Model-migration benchmarking, prompt A/B testing, and cost-anomaly alerting were still listed as future "bigger directions" while already shipped (v0.5/v0.6/v0.9). Removed them, and collapsed the ever-growing "shipped since v0.1.0" list into a one-line pointer to the CHANGELOG — the canonical record of what shipped — so the two stop drifting.
  • Testing section wording corrected: the CI badge shows pass/fail, not a test count; the prose now gives an approximate count and says so.

Full changelog: v0.9.0...v0.9.1