diff --git a/.agents/rules/agents-tier-system.md b/.agents/rules/agents-tier-system.md index 1053c21..d2910c6 100644 --- a/.agents/rules/agents-tier-system.md +++ b/.agents/rules/agents-tier-system.md @@ -50,7 +50,7 @@ Today's Tier-2 rules: Pure intent-triggered. The skill description is detailed enough that Cursor surfaces it on relevant phrases. No always-on cost. -Skills stay rule-less when the work is **explicitly invoked** by the user, not pattern-triggered (e.g. `audit-pr-architecture`, `docs-lifecycle-sweep` in this repo; `improve-codebase-architecture`, `gritql-codemods`, `ubiquitous-language` in larger codebases). +Skills stay rule-less when the work is **explicitly invoked** by the user, not pattern-triggered. Today: `audit-pr-architecture`, `diagnose`, `docs-governance`, `docs-lifecycle-sweep`, `grill-me`, `improve-codebase-architecture`, `write-a-skill`. (Skills like `gritql-codemods` and `ubiquitous-language` would also fit this tier if adopted.) ## Authoring guidelines diff --git a/.agents/skills/diagnose/SKILL.md b/.agents/skills/diagnose/SKILL.md new file mode 100644 index 0000000..50278d5 --- /dev/null +++ b/.agents/skills/diagnose/SKILL.md @@ -0,0 +1,116 @@ +--- +name: diagnose +description: Disciplined diagnosis loop for hard bugs and performance regressions. Reproduce → minimise → hypothesise → instrument → fix → regression-test. Use when user says "diagnose this" / "debug this", reports a bug, says something is broken/throwing/failing, or describes a performance regression. +--- + +# Diagnose + +A discipline for hard bugs. Skip phases only when explicitly justified. + +When exploring the codebase, query [`codemap`](../codemap/SKILL.md) (the structural SQLite index) before reaching for `Grep` or `Read` per the [`codemap` rule](../../rules/codemap.md) — symbol-shaped questions ("where is X defined?", "what calls X?") have direct answers in the `symbols` / `calls` tables. 
Read the relevant section of [`docs/architecture.md`](../../../docs/architecture.md) to ground the mental model of layering, and check [`docs/glossary.md`](../../../docs/glossary.md) for canonical domain terms (file types, recipe ids, schema columns). + +## Phase 1 — Build a feedback loop + +**This is the skill.** Everything else is mechanical. If you have a fast, deterministic, agent-runnable pass/fail signal for the bug, you will find the cause — bisection, hypothesis-testing, and instrumentation all just consume that signal. If you don't have one, no amount of staring at code will save you. + +Spend disproportionate effort here. **Be aggressive. Be creative. Refuse to give up.** + +### Ways to construct one — try them in roughly this order + +1. **Failing test** at whatever seam reaches the bug — unit, integration, e2e. Codemap convention: `src/**/.test.ts` for unit + integration; `fixtures/golden/` for query-shape regressions; `bun test ` runs them. +2. **CLI invocation** with a fixture input, diffing stdout against a known-good snapshot. Examples: `bun src/index.ts query --json …` against `fixtures/minimal/`, golden runner under `scripts/query-golden.ts`. +3. **Replay a captured trace.** Save a real `.codemap.db` / config / fixture file to disk; replay it through the code path in isolation. +4. **Throwaway harness.** Spin up a minimal subset (one parser, one DB connection) that exercises the bug code path with a single function call. +5. **Property / fuzz loop.** If the bug is "sometimes wrong output", run 1000 random inputs and look for the failure mode. +6. **Bisection harness.** If the bug appeared between two known states (commit, dataset, version), automate "boot at state X, check, repeat" so you can `git bisect run` it. +7. **Differential loop.** Run the same input through old-version vs new-version (or two configs) and diff outputs. The B.6 baseline machinery (`codemap query --save-baseline` / `--baseline`) is built for exactly this — use it. +8. 
**HITL bash script.** Last resort. If a human must click or copy a value out of the IDE, drive _them_ with [`scripts/hitl-loop.template.sh`](scripts/hitl-loop.template.sh) so the loop is still structured. Captured output feeds back to you. + +Build the right feedback loop, and the bug is 90% fixed. + +### Iterate on the loop itself + +Treat the loop as a product. Once you have _a_ loop, ask: + +- Can I make it faster? (Cache setup, skip unrelated init, narrow the test scope.) +- Can I make the signal sharper? (Assert on the specific symptom, not "didn't crash".) +- Can I make it more deterministic? (Pin time, seed RNG, isolate filesystem, freeze network.) + +A 30-second flaky loop is barely better than no loop. A 2-second deterministic loop is a debugging superpower. + +### Non-deterministic bugs + +The goal is not a clean repro but a **higher reproduction rate**. Loop the trigger 100×, parallelise, add stress, narrow timing windows, inject sleeps. A 50%-flake bug is debuggable; 1% is not — keep raising the rate until it's debuggable. + +### When you genuinely cannot build a loop + +Stop and say so explicitly. List what you tried. Ask the user for: (a) access to whatever environment reproduces it, (b) a captured artifact (HAR file, log dump, core dump, screen recording with timestamps, broken `.codemap.db`), or (c) permission to add temporary instrumentation. Do **not** proceed to hypothesise without a loop. + +Do not proceed to Phase 2 until you have a loop you believe in. + +## Phase 2 — Reproduce + +Run the loop. Watch the bug appear. + +Confirm: + +- [ ] The loop produces the failure mode the **user** described — not a different failure that happens to be nearby. Wrong bug = wrong fix. +- [ ] The failure is reproducible across multiple runs (or, for non-deterministic bugs, reproducible at a high enough rate to debug against). 
+- [ ] You have captured the exact symptom (error message, wrong output, slow timing) so later phases can verify the fix actually addresses it. + +Do not proceed until you reproduce the bug. + +## Phase 3 — Hypothesise + +Generate **3–5 ranked hypotheses** before testing any of them. Single-hypothesis generation anchors on the first plausible idea. + +Each hypothesis must be **falsifiable**: state the prediction it makes. + +> Format: "If `` is the cause, then `` will make the bug disappear / `` will make it worse." + +If you cannot state the prediction, the hypothesis is a vibe — discard or sharpen it. + +**Show the ranked list to the user before testing.** They often have domain knowledge that re-ranks instantly ("we just changed #3"), or know hypotheses they've already ruled out. Cheap checkpoint, big time saver. Don't block on it — proceed with your ranking if the user is AFK. + +## Phase 4 — Instrument + +Each probe must map to a specific prediction from Phase 3. **Change one variable at a time.** + +Tool preference: + +1. **Debugger / REPL inspection** if the env supports it. One breakpoint beats ten logs. +2. **Targeted logs** at the boundaries that distinguish hypotheses. +3. Never "log everything and grep". + +**Tag every debug log** with a unique prefix, e.g. `[DEBUG-a4f2]`. Cleanup at the end becomes a single grep. Untagged logs survive; tagged logs die. + +**Perf branch.** For performance regressions, logs are usually wrong. Instead: establish a baseline measurement (timing harness, `performance.now()`, profiler, query plan, `--performance` flag for index runs), then bisect. Measure first, fix second. + +## Phase 5 — Fix + regression test + +Write the regression test **before the fix** — but only if there is a **correct seam** for it (per the [`improve-codebase-architecture`](../improve-codebase-architecture/SKILL.md) vocabulary). + +A correct seam is one where the test exercises the **real bug pattern** as it occurs at the call site. 
If the only available seam is too shallow (single-caller test when the bug needs multiple callers, unit test that can't replicate the chain that triggered the bug), a regression test there gives false confidence. + +**If no correct seam exists, that itself is the finding.** Note it. The codebase architecture is preventing the bug from being locked down. Flag this for the next phase. + +If a correct seam exists: + +1. Turn the minimised repro into a failing test at that seam. +2. Watch it fail. +3. Apply the fix. +4. Watch it pass. +5. Re-run the Phase 1 feedback loop against the original (un-minimised) scenario. + +## Phase 6 — Cleanup + post-mortem + +Required before declaring done: + +- [ ] Original repro no longer reproduces (re-run the Phase 1 loop) +- [ ] Regression test passes (or absence of seam is documented) +- [ ] All `[DEBUG-…]` instrumentation removed (`grep` the prefix) +- [ ] Throwaway prototypes deleted (or moved to a clearly-marked debug location) +- [ ] The hypothesis that turned out correct is stated in the commit / PR message — so the next debugger learns +- [ ] If the post-mortem yields a permanent insight, append a one-line entry to [`.agents/lessons.md`](../../lessons.md) per the lessons-rule discipline + +**Then ask: what would have prevented this bug?** If the answer involves architectural change (no good test seam, tangled callers, hidden coupling) hand off to [`improve-codebase-architecture`](../improve-codebase-architecture/SKILL.md) with the specifics. Make the recommendation **after** the fix is in, not before — you have more information now than when you started. diff --git a/.agents/skills/diagnose/scripts/hitl-loop.template.sh b/.agents/skills/diagnose/scripts/hitl-loop.template.sh new file mode 100755 index 0000000..b67c86b --- /dev/null +++ b/.agents/skills/diagnose/scripts/hitl-loop.template.sh @@ -0,0 +1,41 @@ +#!/usr/bin/env bash +# Human-in-the-loop reproduction loop. +# Copy this file, edit the steps below, and run it. 
+# The agent runs the script; the user follows prompts in their terminal. +# +# Usage: +# bash hitl-loop.template.sh +# +# Two helpers: +# step "" → show instruction, wait for Enter +# capture VAR "" → show question, read response into VAR +# +# At the end, captured values are printed as KEY=VALUE for the agent to parse. + +set -euo pipefail + +step() { + printf '\n>>> %s\n' "$1" + read -r -p " [Enter when done] " _ +} + +capture() { + local var="$1" question="$2" answer + printf '\n>>> %s\n' "$question" + read -r -p " > " answer + printf -v "$var" '%s' "$answer" +} + +# --- edit below --------------------------------------------------------- + +step "Open the app at http://localhost:3000 and sign in." + +capture ERRORED "Click the 'Export' button. Did it throw an error? (y/n)" + +capture ERROR_MSG "Paste the error message (or 'none'):" + +# --- edit above --------------------------------------------------------- + +printf '\n--- Captured ---\n' +printf 'ERRORED=%s\n' "$ERRORED" +printf 'ERROR_MSG=%s\n' "$ERROR_MSG" diff --git a/.agents/skills/grill-me/SKILL.md b/.agents/skills/grill-me/SKILL.md new file mode 100644 index 0000000..3345f3c --- /dev/null +++ b/.agents/skills/grill-me/SKILL.md @@ -0,0 +1,12 @@ +--- +name: grill-me +description: Interview the user relentlessly about a plan or design until reaching shared understanding, resolving each branch of the decision tree. Use when user wants to stress-test a plan, get grilled on their design, or mentions "grill me". +--- + +Interview me relentlessly about every aspect of this plan until we reach a shared understanding. Walk down each branch of the design tree, resolving dependencies between decisions one-by-one. For each question, provide your recommended answer. + +Ask the questions one at a time, waiting for feedback before continuing. + +If a question can be answered by exploring the codebase, explore the codebase instead. 
In this repo, that means querying [`codemap`](../codemap/SKILL.md) (the structural index) before reaching for `Grep` or `Read` — see the [`codemap` rule](../../rules/codemap.md). + +When agreement crystallises on a question that affects an in-flight `docs/plans/.md`, write the answer into the plan inline as you go — don't batch them up. The plan doc is the durable record; the chat transcript is not. diff --git a/.agents/skills/improve-codebase-architecture/DEEPENING.md b/.agents/skills/improve-codebase-architecture/DEEPENING.md new file mode 100644 index 0000000..c52fdfd --- /dev/null +++ b/.agents/skills/improve-codebase-architecture/DEEPENING.md @@ -0,0 +1,37 @@ +# Deepening + +How to deepen a cluster of shallow modules safely, given its dependencies. Assumes the vocabulary in [LANGUAGE.md](LANGUAGE.md) — **module**, **interface**, **seam**, **adapter**. + +## Dependency categories + +When assessing a candidate for deepening, classify its dependencies. The category determines how the deepened module is tested across its seam. + +### 1. In-process + +Pure computation, in-memory state, no I/O. Always deepenable — merge the modules and test through the new interface directly. No adapter needed. + +### 2. Local-substitutable + +Dependencies that have local test stand-ins (PGLite for Postgres, in-memory filesystem). Deepenable if the stand-in exists. The deepened module is tested with the stand-in running in the test suite. The seam is internal; no port at the module's external interface. + +### 3. Remote but owned (Ports & Adapters) + +Your own services across a network boundary (microservices, internal APIs). Define a **port** (interface) at the seam. The deep module owns the logic; the transport is injected as an **adapter**. Tests use an in-memory adapter. Production uses an HTTP/gRPC/queue adapter. 
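As a minimal sketch of this shape (all names hypothetical — not from this codebase), the port is the interface at the seam, the deep module owns the logic, and the transport is an injected adapter:

```typescript
// Port: the interface at the seam. The deep module depends on this type,
// never on a concrete transport.
interface NotificationPort {
  send(userId: string, message: string): Promise<void>;
}

// Deep module: owns the logic; the transport is injected as an adapter.
class SignupService {
  constructor(private readonly notifier: NotificationPort) {}

  async register(userId: string): Promise<string> {
    // ...registration logic would live here...
    await this.notifier.send(userId, "Welcome!");
    return userId;
  }
}

// Test adapter: in-memory, records calls instead of crossing the network.
class InMemoryNotifier implements NotificationPort {
  readonly sent: Array<{ userId: string; message: string }> = [];
  async send(userId: string, message: string): Promise<void> {
    this.sent.push({ userId, message });
  }
}

// Production would wire an HTTP/gRPC/queue adapter implementing the same
// port at composition time.
```

Tests exercise `SignupService` through `InMemoryNotifier`; nothing in the module's interface reveals which adapter is wired in.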
+ +Recommendation shape: _"Define a port at the seam, implement an HTTP adapter for production and an in-memory adapter for testing, so the logic sits in one deep module even though it's deployed across a network."_ + +### 4. True external (Mock) + +Third-party services (Stripe, Twilio, etc.) you don't control. The deepened module takes the external dependency as an injected port; tests provide a mock adapter. + +## Seam discipline + +- **One adapter means a hypothetical seam. Two adapters means a real one.** Don't introduce a port unless at least two adapters are justified (typically production + test). A single-adapter seam is just indirection. +- **Internal seams vs external seams.** A deep module can have internal seams (private to its implementation, used by its own tests) as well as the external seam at its interface. Don't expose internal seams through the interface just because tests use them. + +## Testing strategy: replace, don't layer + +- Old unit tests on shallow modules become waste once tests at the deepened module's interface exist — delete them. +- Write new tests at the deepened module's interface. The **interface is the test surface**. +- Tests assert on observable outcomes through the interface, not internal state. +- Tests should survive internal refactors — they describe behaviour, not implementation. If a test has to change when the implementation changes, it's testing past the interface. diff --git a/.agents/skills/improve-codebase-architecture/INTERFACE-DESIGN.md b/.agents/skills/improve-codebase-architecture/INTERFACE-DESIGN.md new file mode 100644 index 0000000..7d69c40 --- /dev/null +++ b/.agents/skills/improve-codebase-architecture/INTERFACE-DESIGN.md @@ -0,0 +1,44 @@ +# Interface Design + +When the user wants to explore alternative interfaces for a chosen deepening candidate, use this parallel sub-agent pattern. Based on "Design It Twice" (Ousterhout) — your first idea is unlikely to be the best. 
+ +Uses the vocabulary in [LANGUAGE.md](LANGUAGE.md) — **module**, **interface**, **seam**, **adapter**, **leverage**. + +## Process + +### 1. Frame the problem space + +Before spawning sub-agents, write a user-facing explanation of the problem space for the chosen candidate: + +- The constraints any new interface would need to satisfy +- The dependencies it would rely on, and which category they fall into (see [DEEPENING.md](DEEPENING.md)) +- A rough illustrative code sketch to ground the constraints — not a proposal, just a way to make the constraints concrete + +Show this to the user, then immediately proceed to Step 2. The user reads and thinks while the sub-agents work in parallel. + +### 2. Spawn sub-agents + +Spawn 3+ sub-agents in parallel using the Agent / Task tool. Each must produce a **radically different** interface for the deepened module. + +Prompt each sub-agent with a separate technical brief (file paths, coupling details, dependency category from [DEEPENING.md](DEEPENING.md), what sits behind the seam). The brief is independent of the user-facing problem-space explanation in Step 1. Give each agent a different design constraint: + +- Agent 1: "Minimize the interface — aim for 1–3 entry points max. Maximise leverage per entry point." +- Agent 2: "Maximise flexibility — support many use cases and extension." +- Agent 3: "Optimise for the most common caller — make the default case trivial." +- Agent 4 (if applicable): "Design around ports & adapters for cross-seam dependencies." + +Include both [LANGUAGE.md](LANGUAGE.md) vocabulary and [`docs/glossary.md`](../../../docs/glossary.md) vocabulary in the brief so each sub-agent names things consistently with the architecture language and the project's domain language. + +Each sub-agent outputs: + +1. Interface (types, methods, params — plus invariants, ordering, error modes) +2. Usage example showing how callers use it +3. What the implementation hides behind the seam +4. 
Dependency strategy and adapters (see [DEEPENING.md](DEEPENING.md)) +5. Trade-offs — where leverage is high, where it's thin + +### 3. Present and compare + +Present designs sequentially so the user can absorb each one, then compare them in prose. Contrast by **depth** (leverage at the interface), **locality** (where change concentrates), and **seam placement**. + +After comparing, give your own recommendation: which design you think is strongest and why. If elements from different designs would combine well, propose a hybrid. Be opinionated — the user wants a strong read, not a menu. diff --git a/.agents/skills/improve-codebase-architecture/LANGUAGE.md b/.agents/skills/improve-codebase-architecture/LANGUAGE.md new file mode 100644 index 0000000..dd9b60f --- /dev/null +++ b/.agents/skills/improve-codebase-architecture/LANGUAGE.md @@ -0,0 +1,53 @@ +# Language + +Shared vocabulary for every suggestion this skill makes. Use these terms exactly — don't substitute "component," "service," "API," or "boundary." Consistent language is the whole point. + +## Terms + +**Module** +Anything with an interface and an implementation. Deliberately scale-agnostic — applies equally to a function, class, package, or tier-spanning slice. +_Avoid_: unit, component, service. + +**Interface** +Everything a caller must know to use the module correctly. Includes the type signature, but also invariants, ordering constraints, error modes, required configuration, and performance characteristics. +_Avoid_: API, signature (too narrow — those refer only to the type-level surface). + +**Implementation** +What's inside a module — its body of code. Distinct from **Adapter**: a thing can be a small adapter with a large implementation (a Postgres repo) or a large adapter with a small implementation (an in-memory fake). Reach for "adapter" when the seam is the topic; "implementation" otherwise. 
+ +**Depth** +Leverage at the interface — the amount of behaviour a caller (or test) can exercise per unit of interface they have to learn. A module is **deep** when a large amount of behaviour sits behind a small interface. A module is **shallow** when the interface is nearly as complex as the implementation. + +**Seam** _(from Michael Feathers)_ +A place where you can alter behaviour without editing in that place. The _location_ at which a module's interface lives. Choosing where to put the seam is its own design decision, distinct from what goes behind it. +_Avoid_: boundary (overloaded with DDD's bounded context). + +**Adapter** +A concrete thing that satisfies an interface at a seam. Describes _role_ (what slot it fills), not substance (what's inside). + +**Leverage** +What callers get from depth. More capability per unit of interface they have to learn. One implementation pays back across N call sites and M tests. + +**Locality** +What maintainers get from depth. Change, bugs, knowledge, and verification concentrate at one place rather than spreading across callers. Fix once, fixed everywhere. + +## Principles + +- **Depth is a property of the interface, not the implementation.** A deep module can be internally composed of small, mockable, swappable parts — they just aren't part of the interface. A module can have **internal seams** (private to its implementation, used by its own tests) as well as the **external seam** at its interface. +- **The deletion test.** Imagine deleting the module. If complexity vanishes, the module wasn't hiding anything (it was a pass-through). If complexity reappears across N callers, the module was earning its keep. +- **The interface is the test surface.** Callers and tests cross the same seam. If you want to test _past_ the interface, the module is probably the wrong shape. +- **One adapter means a hypothetical seam. Two adapters means a real one.** Don't introduce a seam unless something actually varies across it. 
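A toy illustration of these principles (hypothetical code, not from this codebase) — the caller learns one entry point, while tokenising and evaluation stay behind it as implementation:

```typescript
// Shallow shape: callers must learn three functions and their ordering —
//   const tokens = tokenize(raw); const ast = parse(tokens); evaluate(ast);
// Deep shape: one entry point; everything else is implementation.
function evaluateExpression(raw: string): number {
  // Tokenising and left-to-right evaluation of "+" and "*" are invisible
  // at the interface — callers only know: string in, number out.
  const tokens = raw.match(/\d+|[+*]/g) ?? [];
  let acc = Number(tokens[0] ?? 0);
  for (let i = 1; i + 1 < tokens.length; i += 2) {
    const rhs = Number(tokens[i + 1]);
    acc = tokens[i] === "+" ? acc + rhs : acc * rhs;
  }
  return acc;
}
```

Deleting `evaluateExpression` would scatter tokenising and evaluation across every caller — it passes the deletion test. A pass-through wrapper around a single library call would not.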
+ +## Relationships + +- A **Module** has exactly one **Interface** (the surface it presents to callers and tests). +- **Depth** is a property of a **Module**, measured against its **Interface**. +- A **Seam** is where a **Module**'s **Interface** lives. +- An **Adapter** sits at a **Seam** and satisfies the **Interface**. +- **Depth** produces **Leverage** for callers and **Locality** for maintainers. + +## Rejected framings + +- **Depth as ratio of implementation-lines to interface-lines** (Ousterhout): rewards padding the implementation. We use depth-as-leverage instead. +- **"Interface" as the TypeScript `interface` keyword or a class's public methods**: too narrow — interface here includes every fact a caller must know. +- **"Boundary"**: overloaded with DDD's bounded context. Say **seam** or **interface**. diff --git a/.agents/skills/improve-codebase-architecture/SKILL.md b/.agents/skills/improve-codebase-architecture/SKILL.md new file mode 100644 index 0000000..91d53c2 --- /dev/null +++ b/.agents/skills/improve-codebase-architecture/SKILL.md @@ -0,0 +1,80 @@ +--- +name: improve-codebase-architecture +description: Find deepening opportunities in the codebase, informed by the domain language in docs/glossary.md and the architecture in docs/architecture.md. Use when the user wants to improve architecture, find refactoring opportunities, consolidate tightly-coupled modules, or make a codebase more testable and AI-navigable. +--- + +# Improve Codebase Architecture + +Surface architectural friction and propose **deepening opportunities** — refactors that turn shallow modules into deep ones. The aim is testability and AI-navigability. + +## Glossary + +Use these terms exactly in every suggestion. Consistent language is the point — don't drift into "component," "service," "API," or "boundary." Full definitions in [LANGUAGE.md](LANGUAGE.md). + +- **Module** — anything with an interface and an implementation (function, class, package, slice). 
+- **Interface** — everything a caller must know to use the module: types, invariants, error modes, ordering, config. Not just the type signature. +- **Implementation** — the code inside. +- **Depth** — leverage at the interface: a lot of behaviour behind a small interface. **Deep** = high leverage. **Shallow** = interface nearly as complex as the implementation. +- **Seam** — where an interface lives; a place behaviour can be altered without editing in place. (Use this, not "boundary.") +- **Adapter** — a concrete thing satisfying an interface at a seam. +- **Leverage** — what callers get from depth. +- **Locality** — what maintainers get from depth: change, bugs, knowledge concentrated in one place. + +Key principles (see [LANGUAGE.md](LANGUAGE.md) for the full list): + +- **Deletion test**: imagine deleting the module. If complexity vanishes, it was a pass-through. If complexity reappears across N callers, it was earning its keep. +- **The interface is the test surface.** +- **One adapter = hypothetical seam. Two adapters = real seam.** + +This skill is _informed_ by the project's domain model. The domain language in [`docs/glossary.md`](../../../docs/glossary.md) gives names to good seams; the layering described in [`docs/architecture.md`](../../../docs/architecture.md) records the structural decisions the skill should not re-litigate. + +## Process + +### 1. Explore + +Read [`docs/glossary.md`](../../../docs/glossary.md) (canonical domain terms) and the relevant section of [`docs/architecture.md`](../../../docs/architecture.md) (canonical layering / wiring) first. + +Then walk the codebase via [`codemap`](../codemap/SKILL.md) — the structural SQLite index. 
Per the [`codemap` rule](../../rules/codemap.md), querying the index beats grepping for symbol-shaped questions: + +```bash +codemap query --json "SELECT name, signature, file_path FROM symbols WHERE file_path LIKE 'src/cli/%' AND kind = 'function'" +codemap query --json "SELECT from_path, COUNT(*) AS deps FROM dependencies GROUP BY from_path ORDER BY deps DESC LIMIT 10" +codemap query --json -r barrel-files +``` + +Don't follow rigid heuristics — explore organically and note where you experience friction: + +- Where does understanding one concept require bouncing between many small modules? +- Where are modules **shallow** — interface nearly as complex as the implementation? +- Where have pure functions been extracted just for testability, but the real bugs hide in how they're called (no **locality**)? +- Where do tightly-coupled modules leak across their seams? +- Which parts of the codebase are untested, or hard to test through their current interface? + +Apply the **deletion test** to anything you suspect is shallow: would deleting it concentrate complexity, or just move it? A "yes, concentrates" is the signal you want. + +### 2. Present candidates + +Present a numbered list of deepening opportunities. For each candidate: + +- **Files** — which files/modules are involved +- **Problem** — why the current architecture is causing friction +- **Solution** — plain English description of what would change +- **Benefits** — explained in terms of locality and leverage, and also in how tests would improve + +**Use [`docs/glossary.md`](../../../docs/glossary.md) vocabulary for the domain, and [LANGUAGE.md](LANGUAGE.md) vocabulary for the architecture.** If the glossary defines `barrel file`, talk about "the barrel-file detection module" — not "the FooBarHandler," and not "the barrel service." 
+ +**Architecture conflicts**: if a candidate contradicts [`docs/architecture.md` § Layering](../../../docs/architecture.md#layering), only surface it when the friction is real enough to warrant revisiting that layering. Mark it clearly (e.g. _"contradicts architecture.md § Layering — but worth reopening because…"_). Don't list every theoretical refactor the layering forbids. + +Do NOT propose interfaces yet. Ask the user: "Which of these would you like to explore?" + +### 3. Grilling loop + +Once the user picks a candidate, drop into a grilling conversation (per [`grill-me`](../grill-me/SKILL.md)). Walk the design tree with them — constraints, dependencies, the shape of the deepened module, what sits behind the seam, what tests survive. + +Side effects happen inline as decisions crystallize: + +- **Naming a deepened module after a concept not in `docs/glossary.md`?** Add the term to the glossary right there per [`docs/README.md` Rule 9](../../../docs/README.md). Disambiguations (TS shape vs SQL table, etc.) take priority. +- **Sharpening a fuzzy term during the conversation?** Update `docs/glossary.md` right there. +- **Surfacing a structural decision worth recording?** If the candidate becomes a planned refactor, draft `docs/plans/.md` per [`docs/README.md` Rule 3](../../../docs/README.md). Codemap doesn't ship ADRs — decisions of record lift into [`docs/architecture.md`](../../../docs/architecture.md) on ship per [`docs/README.md` Rule 2](../../../docs/README.md), and the plan file is deleted. +- **Want to explore alternative interfaces for the deepened module?** See [INTERFACE-DESIGN.md](INTERFACE-DESIGN.md). +- **Sub-rules for what counts as a "deepening" candidate**: see [DEEPENING.md](DEEPENING.md). 
diff --git a/.agents/skills/write-a-skill/SKILL.md b/.agents/skills/write-a-skill/SKILL.md new file mode 100644 index 0000000..a0f9611 --- /dev/null +++ b/.agents/skills/write-a-skill/SKILL.md @@ -0,0 +1,176 @@ +--- +name: write-a-skill +description: Create new agent skills with proper structure, progressive disclosure, and bundled resources. Use when user wants to create, write, or build a new skill (or asks "how do I write a skill?", "draft a SKILL.md for X"). +--- + +# Writing Skills + +Discipline for authoring `.agents/skills//SKILL.md` files in this repo. + +## Repo conventions you must respect + +Before drafting any skill in codemap, internalise these (they trump anything in this skill): + +- **File layout** — [`agents-first-convention`](../../rules/agents-first-convention.md): the source-of-truth file is `.agents/skills//SKILL.md`; the `.cursor/skills/` entry is a **symlink** back. Never put original content under `.cursor/`. +- **Tier choice** — [`agents-tier-system`](../../rules/agents-tier-system.md): every new skill is Tier 1 (always-on, paired with a rule), Tier 2 (auto-attached to a glob, paired with a rule), or Tier 3 (discoverable, no rule). **Skills with `NEVER` / `ALWAYS` clauses deserve a rule pairing.** Pure intent-trigger skills (no hard "must" clauses) stay Tier 3. +- **Maintainer-only vs shipped** — `.agents/skills/` is the dev-side mirror; `templates/agents/skills/` is what `codemap agents init` ships to npm consumers. The bundled template surface today is **only** the `codemap` skill — every other skill in `.agents/skills/` is maintainer-only (precedent: PR #25). Don't add a skill to `templates/agents/` unless it's something every consumer of the published package would want. + +## Process + +### 1. Gather requirements + +Ask the user: + +- What task / domain does the skill cover? +- What specific use cases should it handle? +- Does it need executable scripts (under `scripts/`) or just instructions? +- Any reference materials to include? 
+- **Tier choice**: does the skill have always-on principles (any `NEVER` / `ALWAYS` clauses)? If yes, it deserves a Tier-1 or Tier-2 rule pairing per [`agents-tier-system`](../../rules/agents-tier-system.md). + +### 2. Draft the skill + +Create: + +- `SKILL.md` with concise instructions (under 100 lines if possible — see "When to split" below) +- Companion files (`LANGUAGE.md`, `REFERENCE.md`, `EXAMPLES.md`, etc.) when content exceeds 100 lines or has distinct domains +- `scripts/.{sh,ts}` when a deterministic operation is invoked repeatedly (saves tokens vs generated code) + +Use [`grill-me`](../grill-me/SKILL.md) on yourself to surface decisions before you write — what's the trigger phrase shape? What's the boundary with adjacent skills? What's the durability test (does this skill still read correctly six months from now)? + +### 3. Wire the file layout + +```bash +# Source of truth +.agents/skills//SKILL.md + +# Cursor symlink (per agents-first-convention) +ln -s ../../.agents/skills/ .cursor/skills/ +``` + +### 4. Update the tier list + +Add the skill to the relevant list in [`agents-tier-system.md`](../../rules/agents-tier-system.md) so the inventory stays accurate. + +### 5. Review + +Ask the user: + +- Does this cover your use cases? +- Anything missing or unclear? +- Should any section be more / less detailed? + +Run the [Review checklist](#review-checklist) before declaring done. + +## Skill structure + +```text +.agents/skills// +├── SKILL.md # Main instructions (required) +├── LANGUAGE.md # Vocabulary the skill enforces (if any) +├── REFERENCE.md # Detailed docs (if SKILL.md exceeds ~100 lines) +├── EXAMPLES.md # Usage examples (if needed) +└── scripts/ # Utility scripts (if needed) + └── helper.sh +``` + +## SKILL.md template + +```md +--- +name: skill-name +description: Brief description of capability. Use when [specific triggers — verbs and nouns the user is likely to say, plus contexts where the skill applies]. 
+--- + +# Skill Name + +## Quick start + +[Minimal working example — what the user does on first invocation] + +## Workflows + +[Step-by-step processes with checklists for complex tasks] + +## Advanced features + +[Link to companion files: See [REFERENCE.md](REFERENCE.md) / [LANGUAGE.md](LANGUAGE.md)] +``` + +## Description requirements + +The description is **the only thing the agent sees** when deciding which skill to load. It's surfaced in the discoverable-skills list alongside every other installed skill. Get this right or your skill never fires. + +**Goal**: Give the agent just enough info to know: + +1. What capability this skill provides +2. When / why to trigger it (specific keywords, contexts, file types) + +**Format**: + +- Max ~1024 chars +- Write in third person +- First sentence: what it does +- Second sentence: "Use when [specific triggers]" +- Include the verbs and nouns the user is likely to say (per [`agents-tier-system` § Tier 3 description](../../rules/agents-tier-system.md)) + +**Good example**: + +```text +Triage and fact-check PR review comments against the actual codebase, project rules, and skills. Use when the user asks to address PR comments, respond to reviewer feedback, check if a comment is correct, fact-check a reviewer's claim, decide which comments to push back on, or sort hallucinated suggestions from real ones. Triggers on phrases like "check PR comments", "are these comments right". +``` + +**Bad example**: + +```text +Helps with PRs. +``` + +The bad example gives the agent no way to distinguish this from any other PR-adjacent skill. + +## When to add scripts + +Add utility scripts under `scripts/` when: + +- Operation is deterministic (validation, formatting, bisection harness) +- Same code would be generated repeatedly across invocations +- Errors need explicit handling that's tedious to re-derive + +Scripts save tokens and improve reliability vs generated code. 
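+
+A sketch of what such a script could look like — a hypothetical `scripts/check-description.ts` (the name and the exact rules are illustrative, not a script that exists in this repo; it just encodes the description requirements above as a pure check):
+
+```ts
+// Hypothetical helper — encodes two description requirements as a pure check:
+// stay under ~1024 chars, and include an explicit "Use when …" trigger sentence.
+export function checkDescription(description: string): string[] {
+  const problems: string[] = [];
+  if (description.length > 1024) {
+    problems.push(`too long: ${description.length} chars (max ~1024)`);
+  }
+  if (!/use when/i.test(description)) {
+    problems.push('missing a "Use when …" trigger sentence');
+  }
+  return problems;
+}
+```
+
+Wired into step 5 (Review) or a pre-commit hook, this is exactly the "same code would be generated repeatedly" case the bullet list describes.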
+ +## When to split files + +Split into companion files when: + +- `SKILL.md` exceeds ~100 lines +- Content has distinct domains (vocabulary vs process vs templates) +- Advanced features are rarely needed and would balloon the main file + +Cite codemap precedents: + +- [`improve-codebase-architecture`](../improve-codebase-architecture/SKILL.md) splits into `LANGUAGE.md` (vocab), `DEEPENING.md` (sub-rules), `INTERFACE-DESIGN.md` (parallel-sub-agent pattern). +- [`pr-comment-fact-check`](../pr-comment-fact-check/SKILL.md) stays single-file because every section is in-flow process. + +## Durability discipline + +Per [`agents-tier-system` § Authoring discipline: durability](../../rules/agents-tier-system.md): + +- **Don't cite specific audit / plan / research filenames as canonical examples.** Plans are mortal under [`docs-lifecycle-sweep`](../docs-lifecycle-sweep/SKILL.md). Use shape placeholders (`.md`) instead. +- **Don't cite specific commit hashes or PR numbers as the only path to context.** Summarise inline. +- **Don't cite source-code line numbers.** Reference symbols by name. + +If the skill still reads correctly six months from now after every doc you didn't write got rewritten, it's durable. 
+ +## Review checklist + +After drafting, verify: + +- [ ] Description includes triggers ("Use when…") +- [ ] `SKILL.md` under 100 lines OR has split companion files +- [ ] No time-sensitive info (no "as of 2026-04…") +- [ ] Consistent terminology — drift kills clarity +- [ ] Concrete examples included +- [ ] Cross-references one level deep (don't chain `SKILL.md → REFERENCE.md → DEEP-DIVE.md → REFERENCE2.md`) +- [ ] File layout follows [`agents-first-convention`](../../rules/agents-first-convention.md) (`.agents/` source + `.cursor/` symlink) +- [ ] Tier choice documented per [`agents-tier-system`](../../rules/agents-tier-system.md); rule pairing added if the skill has `NEVER` / `ALWAYS` clauses +- [ ] Skill listed in the appropriate tier section of `agents-tier-system.md` +- [ ] Decision recorded in the PR description: maintainer-only (`.agents/` only) vs shipped (`templates/agents/` too) diff --git a/.cursor/skills/diagnose b/.cursor/skills/diagnose new file mode 120000 index 0000000..7d4b7c9 --- /dev/null +++ b/.cursor/skills/diagnose @@ -0,0 +1 @@ +../../.agents/skills/diagnose \ No newline at end of file diff --git a/.cursor/skills/grill-me b/.cursor/skills/grill-me new file mode 120000 index 0000000..eea91a8 --- /dev/null +++ b/.cursor/skills/grill-me @@ -0,0 +1 @@ +../../.agents/skills/grill-me \ No newline at end of file diff --git a/.cursor/skills/improve-codebase-architecture b/.cursor/skills/improve-codebase-architecture new file mode 120000 index 0000000..be3dac9 --- /dev/null +++ b/.cursor/skills/improve-codebase-architecture @@ -0,0 +1 @@ +../../.agents/skills/improve-codebase-architecture \ No newline at end of file diff --git a/.cursor/skills/write-a-skill b/.cursor/skills/write-a-skill new file mode 120000 index 0000000..8e09e46 --- /dev/null +++ b/.cursor/skills/write-a-skill @@ -0,0 +1 @@ +../../.agents/skills/write-a-skill \ No newline at end of file diff --git a/docs/plans/codemap-audit.md b/docs/plans/codemap-audit.md new file mode 100644 
index 0000000..75bef4b --- /dev/null +++ b/docs/plans/codemap-audit.md @@ -0,0 +1,342 @@ +# Plan — `codemap audit` + +> Two-snapshot structural-drift verdict for a PR / branch. **v1 ships `--baseline <name>`** (diff against a B.6 saved baseline); **v1.x adds `--base <ref>`** (worktree+reindex). Adopted from [`docs/research/fallow.md` § Tier B B.5](../research/fallow.md) — explicitly the "single highest-leverage candidate" of that scan. + +**Status:** Open — design pass; not yet implemented. +**Cross-refs:** [`docs/research/fallow.md`](../research/fallow.md) (motivation) · [`docs/architecture.md` § CLI usage](../architecture.md#cli-usage) (where wiring lands) · [`.agents/lessons.md`](../../.agents/lessons.md) (changesets bump policy). + +--- + +## 1. Goal + +One command returns the structural deltas between a saved snapshot (or a git ref) and the current `HEAD` index: + +```text +codemap audit --baseline <name> # diff vs a B.6-style saved baseline (v1) +codemap audit --base <ref> # diff vs a worktree+reindex of <ref> (v1.x) +↓ +{ + "base": { "source": "baseline" | "ref", "name": "...", "sha": "<sha>", "indexed_at": <epoch-ms> }, + "head": { "sha": "<sha>", "indexed_at": <epoch-ms> }, + "deltas": { + "files": { "added": [...], "removed": [...] }, + "dependencies": { "added": [...], "removed": [...] }, + "deprecated": { "added": [...], "removed": [...] } + } +} +``` + +**v1 ships raw deltas only** — no `verdict` field, exit 0 on success regardless of delta size. A native verdict (`pass | warn | fail` with `codemap.config.audit` thresholds) is a v1.x slice; until then, consumers compose `--json` + `jq` for CI exit codes (one-liner). Rationale in [§5 Verdict shape](#5-verdict-shape). + +**v1 auto-runs an incremental index before every audit** so `head` reflects the current source tree. `--no-index` opts out (audit a frozen DB). Rationale in [§7 CLI surface](#7-cli-surface). + +Wraps existing recipes; doesn't grow a new analysis layer.
Stays consistent with codemap's structural-index thesis ([`docs/why-codemap.md` § What Codemap is not](../why-codemap.md#what-codemap-is-not)). + +## 2. Non-goals (v1) + +- **Dead-code / duplication / complexity verdicts.** Those are fallow's territory and a non-goal per [`docs/roadmap.md` § Non-goals (v1)](../roadmap.md#non-goals-v1). +- **Code-quality scoring / grading.** No "code health 87/100" output. +- **Auto-fix / SARIF output.** Separate concerns — SARIF is B.8, auto-fix is explicitly out (D.14 in the research note). +- **Cross-repo audit** (audit `origin/main` of project A from a checkout of project B). Out of scope; reuse `--root` for the simpler "audit a different tree" case. +- **Continuous mode.** One-shot CLI, same as `codemap query`. + +## 3. Snapshot strategy — two modes, ship Option B first + +The verdict is a diff between two indexed snapshots. There are two valid sources for the "before" snapshot, and they solve subtly different problems — **so codemap audit ships both modes** (mutex, pick one per invocation). + +| Mode | Best at | CLI | +| ------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------- | +| **B — baseline reuse** | "What's drifted vs a snapshot I deliberately took **then**" — fast, no cold reindex, reproducible because the snapshot is frozen in `.codemap.db` | `codemap audit --baseline ` | +| **A — worktree+reindex** | "What's drifted vs an arbitrary ref I name **now**" — no pre-baseline needed, but spawns a worktree + full reindex per audit, and is sensitive to clone staleness (`origin/main` may be hours behind the actual remote) | `codemap audit --base ` | + +### Decision: ship **Option B first** (v1), Option A in v1.x + +Reasons: + +1. 
**Cheaper to ship.** Option B reuses the B.6 `query_baselines` table verbatim — no worktree code, no cold-reindex perf concern, no `git fetch` staleness handling. +2. **Most acute pain is delta-against-saved-state.** Real workflow: `codemap query --save-baseline <name> -r <recipe>` on `main` → branch → refactor → `codemap audit --baseline <name>`. This is what B.6 was built for; audit just collapses recipe-by-recipe baselines into one verdict. +3. **`--base <ref>` is genuinely a different shape.** It needs a fetch-or-fail prelude, a worktree spawn, a temp `.codemap.db` build, and cleanup. Each adds CLI surface and bug surface; deferring lets us validate the verdict / threshold / delta shape under B before committing to the worktree path. +4. **Cache benefit of Option A only matters at scale.** Codemap-sized projects index in sub-second; the cache benefit of `<ref> → /tmp/codemap-audit-<sha>/.codemap.db` only pays back on multi-thousand-file repos. Defer until a real consumer hits it. + +### Option C: dropped + +Earlier draft included a third "on-demand snapshot table" hybrid. Killed during planning: it's a mini-indexer that doesn't transfer to other use cases and adds the code-volume of Option A without its conceptual simplicity. Revisit only if both A and B prove insufficient. + +### v1 `--baseline` mechanics + +- The baseline must already exist in `query_baselines` (saved by `codemap query --save-baseline`). If not, exit 1 with `codemap: no baseline named "<name>". Use --baselines to list.` (same error shape as `codemap query --baseline`). +- Audit doesn't introduce its own baseline-save side effect — the user explicitly opts in via `codemap query --save-baseline`. Single source of truth for "snapshot lives here" stays the B.6 surface. +- The verdict's `base.source` is `"baseline"`; `base.name` is the baseline name; `base.sha` is the baseline's recorded `git_ref`; `base.indexed_at` is the baseline's `created_at`.
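+
+That field mapping is mechanical; as a sketch (the row type is assumed from this plan's description of a B.6 baseline, not copied from `db.ts`):
+
+```ts
+// Assumed shape of a query_baselines row — illustrative, not the real schema.
+type BaselineRow = {
+  name: string;
+  git_ref: string;    // sha recorded at --save-baseline time
+  created_at: number; // epoch ms
+};
+
+// Maps a loaded baseline row to the envelope's `base` field.
+function toBaseMeta(row: BaselineRow) {
+  return {
+    source: "baseline" as const,
+    name: row.name,
+    sha: row.git_ref,
+    indexed_at: row.created_at,
+  };
+}
+```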
+ +### v1.x `--base ` mechanics (when shipped later) + +- Spawn a worktree under `.codemap.audit-/` (gitignored by the existing `.codemap.*` glob). +- `codemap --full --root .codemap.audit-` builds the temp DB. +- Diff queries run cross-DB; results pasted into the same verdict shape with `base.source = "ref"`. +- Cleanup removes the worktree (cache decision deferred — see open questions §9). +- `--base` and `--baseline` are mutex (one snapshot source per invocation). + +## 4. Built-in deltas (v1) + +Each delta wraps an existing query / recipe. All structural — no new analysis layer. **v1 ships three deltas only**; the rest are deferred (each carries an explicit trigger so we don't re-litigate from scratch). + +| Delta key | What it surfaces | Baseline source contract | +| -------------- | -------------------------------------------------------- | ----------------------------------------------------------------------------------------------------- | +| `files` | New / deleted indexed files | Baseline must come from `SELECT path FROM files` (or `--recipe files-hashes` — same `path` column). | +| `dependencies` | New / deleted edges in the file-to-file dependency graph | Baseline must come from `SELECT from_path, to_path FROM dependencies` (no `DISTINCT` — composite PK). | +| `deprecated` | New / removed `@deprecated` symbols | Baseline must come from `--recipe deprecated-symbols`. | + +### Delta function shape + +Each delta defines its own **canonical projection** (a fixed `SELECT … ORDER BY …`) and runs that projection on both sides of the diff. The baseline's stored `sql` is informational — **not replayed**. This isolates the audit from underlying-table schema drift (e.g. SCHEMA_VERSION 4 → 5 added `symbols.visibility`; baselines saved before the bump must still diff cleanly). 
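+
+In code, the project-then-diff step could look like the following — a self-contained sketch; the real `diffRows` helper lives in `cmd-query.ts` and is only reimplemented here for illustration:
+
+```ts
+type Row = Record<string, unknown>;
+
+// Project rows to the canonical column subset; extra columns are dropped.
+// Building each object from a fixed column order keeps JSON.stringify stable.
+function project(rows: Row[], columns: string[]): Row[] {
+  return rows.map((row) => Object.fromEntries(columns.map((c) => [c, row[c]])));
+}
+
+// Multiset diff: identity is JSON.stringify over the projected row.
+function diffRows(before: Row[], after: Row[]) {
+  const count = (rows: Row[]) => {
+    const m = new Map<string, number>();
+    for (const r of rows) {
+      const k = JSON.stringify(r);
+      m.set(k, (m.get(k) ?? 0) + 1);
+    }
+    return m;
+  };
+  const b = count(before);
+  const a = count(after);
+  const added: Row[] = [];
+  const removed: Row[] = [];
+  for (const [k, n] of a) for (let i = b.get(k) ?? 0; i < n; i++) added.push(JSON.parse(k));
+  for (const [k, n] of b) for (let i = a.get(k) ?? 0; i < n; i++) removed.push(JSON.parse(k));
+  return { added, removed };
+}
+```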
+ +Per-delta canonical projection: + +| Delta | Canonical SQL (run on both baseline-projection AND current DB) | +| -------------- | ----------------------------------------------------------------------------------------------------------- | +| `files` | `SELECT path FROM files ORDER BY path` | +| `dependencies` | `SELECT from_path, to_path FROM dependencies ORDER BY from_path, to_path` | +| `deprecated` | `SELECT name, kind, file_path FROM symbols WHERE doc_comment LIKE '%@deprecated%' ORDER BY file_path, name` | + +Each delta function: + +1. Loads the named baseline via `getQueryBaseline(db, name)` (B.6 helper from `db.ts`). +2. Parses `rows_json` to row objects. +3. **Validates baseline column-set membership.** The delta's canonical projection has a fixed required-columns list (e.g. `dependencies` requires `from_path`, `to_path`). If any required column is missing from the baseline rows, surface a clean error: + + ``` + codemap audit: baseline "<name>" is missing required columns + for delta "<delta>": got [<columns>], need [<columns>]. + Re-save with: codemap query --save-baseline=<name> -r <recipe> + ``` + +4. **Projects baseline rows** to the canonical column subset (extra columns are dropped — agents can still inspect the full baseline via `codemap query --baselines`). +5. Runs the canonical SQL against the current DB. +6. Set-diffs via the existing `diffRows` helper from `cmd-query.ts` (multiset, identity = canonical `JSON.stringify(row)` over the projected columns). +7. Returns `{added: [...], removed: [...]}` — projected rows only. + +This means a baseline saved from `--recipe deprecated-symbols` (which returns 6 columns) and a baseline saved from a leaner ad-hoc `SELECT name, kind, file_path FROM symbols WHERE doc_comment LIKE '%@deprecated%'` both work — as long as the required column set is satisfied. Schema bumps that add columns also keep working — the projection drops the new columns.
Schema bumps that remove a required column would break the delta — that's the intended behaviour (the delta's contract has changed). + +### Deferred — add later when needed + +| Delta | Why deferred (v1) | Trigger to revisit | +| -------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- | +| `visibility` | Already covered by `codemap query --baseline visibility-tags` from B.6 directly; v1 audit doesn't add much on top. | A consumer wants visibility deltas in the same JSON envelope as `files` / `dependencies`. | +| `barrels` | "Top-N membership change" has fuzzy threshold semantics ("rank movement" vs "joined / left top-20"). Defer until a clear semantic emerges from real use. | Two consumers ask for "this file just became a barrel" as a verdict-shaping signal. | +| `hot_files` | Same fuzzy-threshold problem as `barrels` (fan-in / fan-out top-N movement). | Same. | +| `cycles` | Needs cycle detection on `dependencies`; not a recipe today. | Cycle detection lands as a recipe (or PRAGMA-driven SQL); audit consumes it. | +| `boundary_crossings` | Needs a project-supplied glob list (the [`audit-pr-architecture`](../../.agents/skills/audit-pr-architecture/SKILL.md) skill's § 2 territory); no canonical source today. | The `audit-pr-architecture` skill formalises a per-repo "boundaries" config codemap can read. | +| `markers` | TODO / FIXME drift is noisy and project-specific. | A consumer asks for it explicitly. | +| `css_*` deltas | Narrow audience. | Same. | + +**Adding a delta later is mechanical** (one delta function + one threshold-config field + one test + one doc note). **Removing one is harder** (consumer config has thresholds for it; removing breaks user setups). Defer-by-default. + +## 5. 
Verdict shape + +**v1 ships no `verdict` field.** Exit 0 on success regardless of delta size. The output envelope is `{base, head, deltas}` — adding `verdict` later is purely additive and forward-compatible. + +### Why no verdict in v1 + +1. **Honesty about what we know.** Structural deltas don't have a universally-meaningful threshold ("how many new dependency edges is too many?" depends entirely on the project). Inventing defaults or shipping a placeholder both pretend we do. +2. **Real consumers shape the config, not me guessing.** When two consumers ship `jq`-based CI scripts with similar threshold shapes, that pattern becomes the v1.x schema. Until then, no schema commitment. +3. **fallow already covers the code-quality verdict use case.** A consumer who wants `pass/warn/fail` on dead code, dupes, or complexity runs `fallow audit --base origin/main` — that's fallow's product class ([`docs/roadmap.md` § Non-goals](../roadmap.md#non-goals-v1)). Codemap audit's job is the **structural-delta** signal fallow can't see (new dependency edges, new files, new `@deprecated` drift). +4. **Cheap consumer-side bridge.** `codemap audit --baseline X --json | jq -e '.deltas.dependencies.added | length <= 50'` exits 1 when the threshold trips. CI-driven thresholds work today without codemap shipping the verdict. + +### v1.x trigger to revisit + +Add the native verdict + threshold config when **either** of: + +- Two consumers independently ship `jq`-based threshold scripts with similar shapes (the pattern crystallises the config schema). +- One consumer asks for native thresholds with a concrete config sketch. 
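+
+Until then, the consumer-side bridge from point 4 is equally cheap in a few lines of TypeScript — a sketch over the `--summary --json` count envelope (shape per §7.1; the threshold map belongs to the consumer, nothing codemap ships):
+
+```ts
+type Counts = { added: number; removed: number };
+type SummaryEnvelope = { deltas: Record<string, Counts> };
+
+// Returns the delta keys whose `added` count exceeds the consumer's ceiling.
+// Exit non-zero in CI when the list is non-empty — same contract as the jq gate.
+function exceedsThresholds(
+  env: SummaryEnvelope,
+  maxAdded: Record<string, number>, // e.g. { dependencies: 50, deprecated: 0 }
+): string[] {
+  return Object.entries(maxAdded)
+    .filter(([key, max]) => (env.deltas[key]?.added ?? 0) > max)
+    .map(([key]) => key);
+}
+```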
+ +### Sketch (informational, not v1 commitment) + +When the trigger fires, the shape will likely look like: + +```ts +// codemap.config.ts (v1.x — NOT shipped in v1) +export default defineConfig({ + audit: { + deltas: { + dependencies: { added_max: 50, action: "warn" }, + deprecated: { added_max: 0, action: "fail" }, // any new @deprecated fails + }, + // verdict reduction: highest action wins (fail > warn > pass) + }, +}); +``` + +Validated via existing `codemapUserConfigSchema` (Zod) — see [`docs/architecture.md` § User config](../architecture.md#user-config). Schema additions are minor changesets per [`.agents/lessons.md` "changesets bump policy"](../../.agents/lessons.md) (no `.codemap.db` impact). Exit codes 0/1/2 ship together with `verdict` — never half-shipped. + +## 6. Composition with existing flags + +| Flag | Behaviour with `audit` | +| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `--json` | Emits the `{base, head, deltas}` envelope. See [§7.1 Output shapes](#71-output-shapes) for the terminal-mode (no `--json`) layout. | +| `--summary` | Collapses every delta in the output to counts: with `--json` → `deltas..{added: N, removed: N}`; without → a single line. See [§7.1](#71-output-shapes). | +| `--baseline ` | **Snapshot source** — diff against the named B.6 baseline. v1 default mode. | +| `--base ` | **Snapshot source** — diff against a worktree+reindex of ``. v1.x. **Mutex with `--baseline`** (one snapshot source per invocation). | +| `--save-baseline` | **N/A** — audit doesn't save baselines. Use `codemap query --save-baseline -r ` first, then `codemap audit --baseline `. Single source of truth for snapshots stays B.6. | +| `--changed-since` | **Mutex** — `audit` is itself a "changed-since" operation; combining would be confusing. 
| `--group-by` | **Mutex** — output shape is already structured; bucketing is the consumer's job on the output JSON. | +| `--no-index` | **Skip the auto-incremental-index prelude.** Default is to re-index first so `head` is fresh; `--no-index` audits the DB as-is. | +| `--recipe` | N/A — `audit` isn't a `query` subcommand. The v1 deltas internally pin canonical SQL (per §4) — not user-selectable. | + +## 7. CLI surface + +```text +# v1 (ships first): +codemap audit --baseline <name> [--json] [--summary] [--no-index] [--root <path>] [--config <path>] + +# v1.x (ships after v1 validates the delta shape): +codemap audit --base <ref> [--json] [--summary] [--no-index] [--root <path>] [--config <path>] +``` + +- `--baseline <name>` — v1. Required (or `--base <ref>` once shipped). Name must exist in `query_baselines`; saved by `codemap query --save-baseline`. +- `--base <ref>` — v1.x. Any committish (`origin/main`, `HEAD~5`, sha, tag). +- **`--baseline` and `--base` are mutex** — exactly one snapshot source per invocation. +- `--no-index` — skip the auto-incremental-index prelude (see below). Default audits a fresh `head` snapshot. +- `--root` / `--config` / `--help` / `-h` — same shape as the rest of the CLI (handled by `bootstrap`). +- **Exit codes (v1):** `0` on success, `1` on bootstrap / DB / baseline-not-found errors. No verdict-driven exit codes until v1.x ships `verdict`. + +### Auto-incremental-index prelude + +Before computing deltas, `runAuditCmd` calls `runCodemapIndex({ mode: "incremental" })` (the same code path as a bare `codemap` invocation). Reasons: + +1. **Same discipline as the codemap rule.** Agents are already told "After completing a step that modified source files, re-index before making any further queries." The audit is a query consumer; auto-indexing treats it the same way. +2. **Cheap when there's nothing to do.** Incremental indexing is sub-second when no source has changed since last index — git-diff narrows the set to zero. +3.
**Avoids silent staleness.** Without the prelude, an agent that runs `audit` after editing source but before re-indexing would get a `head` snapshot that's older than the changes it just made. The deltas would lie. +4. **`--no-index` escape hatch** for the rare case of "audit a frozen DB without touching files" (e.g. CI fetches a pre-built `.codemap.db` artifact and just wants the diff). + +The prelude reuses `runCodemapIndex` from `application/run-index.ts` — no new code for the indexing step itself, just a single-call wrapper in `cmd-audit.ts`. + +### 7.1 Output shapes + +Mirrors `git status` — terse on the common (no-drift) case, expressive when there's actual signal. Three output modes from the same data: + +**Terminal mode (no `--json`), no drift:** + +```text +audit "pre-refactor" (saved 2 days ago @ abc1234, 152 rows) + → no drift across files / dependencies / deprecated. +``` + +**Terminal mode (no `--json`), with drift:** + +```text +audit "pre-refactor" (saved 2 days ago @ abc1234, 152 rows) + → drift: files +1/-0, dependencies +3/-2, deprecated +1/-0 + + files (+1): + ┌─────────┬──────────────────────────┐ + │ (index) │ path │ + ├─────────┼──────────────────────────┤ + │ 0 │ src/cli/cmd-audit.ts │ + └─────────┴──────────────────────────┘ + + dependencies (+3 / -2): + [console.table here] + + deprecated (+1): + [console.table here] +``` + +`console.table` blocks are emitted **only for deltas with rows** — empty deltas don't print a `(no results)` placeholder (would be three of them in the no-drift case, all noise). + +**`--summary` (no `--json`):** + +```text +audit "pre-refactor" (saved 2 days ago @ abc1234, 152 rows) + → drift: files +1/-0, dependencies +3/-2, deprecated +1/-0 +``` + +Same one-line summary as terminal mode's drift header — no per-delta tables. 
+ +**`--summary --json`:** + +```json +{ + "base": { + "source": "baseline", + "name": "pre-refactor", + "sha": "abc1234", + "indexed_at": 1714557600000 + }, + "head": { "sha": "def5678", "indexed_at": 1714560000000 }, + "deltas": { + "files": { "added": 1, "removed": 0 }, + "dependencies": { "added": 3, "removed": 2 }, + "deprecated": { "added": 1, "removed": 0 } + } +} +``` + +Counts replace the row arrays; envelope is otherwise identical to the full `--json` shape. + +## 8. Tracer-bullet sequence + +Per [`.agents/rules/tracer-bullets`](../../.agents/rules/tracer-bullets.md), commit each slice end-to-end. **v1 ships only `--baseline ` (Option B).** `--base ` (Option A) ships in a separate v1.x PR. + +### File layout + +The audit splits along codemap's existing `cli/` ↔ `application/` seam — same shape as `cmd-index.ts` ↔ `application/index-engine.ts`: + +| File | Responsibility | +| -------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `src/cli/cmd-audit.ts` | argv parse (`--baseline`, `--json`, `--summary`), delegation to `runAudit`, terminal-mode renderer (per §7.1). | +| `src/application/audit-engine.ts` | Delta registry (key → canonical SQL + required columns), baseline column-set validation, per-delta diff functions, the `{base, head, deltas}` envelope assembly. Exported entry point: `runAudit({db, baselineName})`. | +| `src/cli/cmd-audit.test.ts` | argv → option-bag tests (parser shape, mutex errors, etc.). | +| `src/application/audit-engine.test.ts` | Engine tests — exercise `runAudit` against in-memory DB + canned baselines; assert envelope shape and the column-set-validation error path. | + +The split: + +- **Mirrors existing layering** (`cli/cmd-index.ts` ↔ `application/index-engine.ts`) — architectural consistency. 
+- **Makes the engine testable independent of CLI shape** — `audit-engine.test.ts` doesn't care about argv. +- **Makes the v1.x `--base ` slice mechanical** — worktree+reindex code lives in `cmd-audit.ts` (CLI orchestration); the engine just gets a different `db` handle pointing at the temp DB. +- **Forward-compatible with a programmatic `Codemap.audit()` method** if `api.ts` ever exposes it. + +### v1 tracer-bullet sequence — `--baseline ` + +1. **CLI scaffold** — `cmd-audit.ts` + `audit-engine.ts` skeletons. `codemap audit --help` works; `--baseline ` and `--no-index` parsed; auto-incremental-index prelude wired (calls `runCodemapIndex({ mode: "incremental" })` unless `--no-index`); `runAudit` returns `{base: {source: "baseline", ...}, head: {...}, deltas: {}}` stub. Smoke + commit. +2. **Delta registry + first delta — `files`** — engine grows the canonical-projection registry (`{key, sql, requiredColumns}`); `files` delta implements load-baseline → validate-columns → project → diff via `diffRows`. CLI renders one terminal-mode block. Commit. +3. **Remaining deltas** — `dependencies`, `deprecated` — each as a separate commit. Each adds one registry entry + one delta function + tests. Renderer extends naturally. +4. **Terminal-mode polish** — implement the no-drift / drift / `--summary` output shapes from §7.1; `cmd-audit.test.ts` covers all three. +5. **Docs + agents update** — `architecture.md § Audit wiring`, glossary entry, README CLI block, rule + skill across `.agents/` and `templates/agents/` (Rule 10). Commit. +6. **Changeset** — patch (no schema bump; reuses existing `query_baselines` table). Commit. + +Estimated total: ~1 day end-to-end across ~6 commits. The threshold-config / verdict step is **explicitly out** of v1 (see §5). + +### v1.x — `--base ` (separate PR) + +1. Worktree spawn + temp-DB build (`codemap --full --root .codemap.audit-`). +2. Cross-DB delta queries (same delta definitions as v1, swap snapshot source). +3. 
Cleanup + cache decision (see open question §9). +4. Docs + Rule 10 update. +5. Changeset. + +Defers until: (a) v1 validates the delta shape under real use, AND (b) at least one consumer asks for "audit against an arbitrary ref I haven't pre-baselined." + +### v1.x — `verdict` + threshold config (separate PR, separate trigger) + +Independent slice from `--base `. Triggers and shape sketched in [§5 Verdict shape](#5-verdict-shape). + +## 9. Open questions (v1.x) + +These all defer to v1.x or later — none block the v1 ship. + +- **Worktree location for `--base `** — `.codemap.audit-/` (project-local; gitignored by the existing `.codemap.*` glob) vs `/tmp/codemap-audit-` (system-temp; auto-cleaned but loses cache across reboots). **Lean: project-local, named to match the gitignore.** Settled when v1.x ships. +- **`actions` per delta key** — recipe `actions` (Tier A.1) attach to row sets; an audit delta is a higher-level concept. v1 doesn't include `actions` at all (no verdict either — see §5). v1.x can add `audit.actions: { dependencies: "review-coupling-spike" }` if patterns emerge. +- **Cross-snapshot performance ceiling for `--base `** — at what project size does the worktree+full-reindex path become unacceptable (>30s)? Needs a benchmark fixture; defer until a real consumer hits the wall. + +### Settled during the design pass + +- **Should `audit` warn when `` and `HEAD` are identical?** **No.** The renderer's metadata header (`baseline "X" (saved 2 days ago @ abc1234, 152 rows)`) already exposes the baseline's `git_ref`; the user can spot a same-SHA mistake from the existing output. Adding a warning would be noise in the common case (zero deltas after a small change is exactly what you want) and heuristic-driven in the edge cases ("divergent baseline" requires merge-base inspection — meaningful code for a low-signal warning). Reconsider only if a real consumer reports losing time to it. + +## 10. 
References + +- Motivation: [`docs/research/fallow.md` § Tier B B.5](../research/fallow.md) ("single highest-leverage candidate"). +- Snapshot primitive prior art: PR #30 — `query_baselines` table + `--save-baseline` / `--baseline`. +- Composition: PR #26 — Tier A flags (`--summary` / `--changed-since` / `--group-by` / per-row `actions`). +- Visibility column prior art: PR #28 — `symbols.visibility` (B.7). +- CLI conventions: [`docs/architecture.md` § CLI usage](../architecture.md#cli-usage). +- Doc lifecycle: this file follows the **Plan** type per [`docs/README.md` § Document Lifecycle](../README.md#document-lifecycle) — **delete on ship**, lift the canonical bits into `architecture.md` per Rule 2. diff --git a/docs/roadmap.md b/docs/roadmap.md index a8819df..91c3c61 100644 --- a/docs/roadmap.md +++ b/docs/roadmap.md @@ -36,6 +36,7 @@ Codemap stays a structural-index primitive that other tools can consume. Out of ## Backlog +- [ ] **`codemap audit --base `** — two-snapshot structural-drift verdict for a PR / branch (new files / deps / `@deprecated` / visibility / barrel / hot-file deltas; `pass`/`warn`/`fail` exit codes). Plan: [`plans/codemap-audit.md`](./plans/codemap-audit.md). Builds on B.6 (snapshot primitive), B.7 (`visibility`), Tier A flags (composition). - [ ] **MCP** server wrapping `query` — single stdio tool first (`query` SQL string → JSON rows), then expand to `recipe`, `list_recipes`, `schema`, `index`. Resources expose the bundled `SKILL.md` and recipe catalog - [ ] **HTTP API** — `codemap serve [--port] [--host 127.0.0.1]` exposing `POST /query`, `GET /recipes`, `GET /recipes/:id`, `GET /schema`, `GET /context`. Bind to loopback by default; reject non-loopback unless `--host` overridden. Unblocks tools that don't speak MCP yet - [ ] **Recipes-as-content registry** — pair every bundled recipe in `src/cli/query-recipes.ts` with a sibling `.md` (or YAML frontmatter) describing _when to use, follow-up SQL_; surface in `--recipes-json`. 
Plus **project-local recipes** loaded from `.codemap/recipes/*.{sql,md}` so teams can ship internal SQL without an adapter API