diff --git a/CLAUDE.md b/CLAUDE.md
index b7720dc..3f7aeb3 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -18,7 +18,7 @@ Current version: **2.2.0** (see [CHANGELOG.md](./CHANGELOG.md)).
 │   ├── .claude-plugin/
 │   │   └── plugin.json
 │   ├── agents/         # 21 agent definitions (.md with frontmatter)
-│   ├── skills/         # 15 skill directories, each with SKILL.md + references/
+│   ├── skills/         # 17 skill directories, each with SKILL.md + references/
 │   └── references/     # Cross-skill reference files (e.g. yagni-rule.md)
 ├── docs/               # Operator-facing documentation
 │   ├── writing-voice.md   # Voice profile every doc follows
@@ -27,7 +27,7 @@ Current version: **2.2.0** (see [CHANGELOG.md](./CHANGELOG.md)).
 │   ├── sizing.md
 │   ├── yagni.md
 │   ├── agents/         # Long-form docs for all 21 agents, plus README
-│   ├── skills/         # Long-form docs for all 15 skills, plus README
+│   ├── skills/         # Long-form docs for all 17 skills, plus README
 │   ├── guidance/       # Contributor-facing authoring guidance
 │   └── templates/      # Templates and coverage rule for long-form docs
 └── images/             # Banner and graphics for README
@@ -56,11 +56,13 @@ The plugin is shipped from `plugin/`; documentation lives in `docs/`. Long-form
 
 ### Skill catalog (`docs/skills/`)
 
-- **[docs/skills/README.md](./docs/skills/README.md).** Index of all 15 skills grouped by purpose (planning, investigation, review, discovery, conventions, reporting). Start here when looking for the right slash command.
+- **[docs/skills/README.md](./docs/skills/README.md).** Index of all 17 skills grouped by purpose (planning, building, investigation, review, discovery, conventions, reporting). Start here when looking for the right slash command.
 - **[docs/skills/plan-a-feature.md](./docs/skills/plan-a-feature.md).** Spec a feature from scratch through an evidence-based interview that walks the design tree and dispatches specialist reviewers.
 - **[docs/skills/plan-implementation.md](./docs/skills/plan-implementation.md).** Turn a feature specification into an implementation plan through a project-manager-led team conversation.
 - **[docs/skills/plan-a-phased-build.md](./docs/skills/plan-a-phased-build.md).** Split a body of context (gap analysis, PRD, design doc) into a numbered sequence of vertical-slice phases, each independently demoable.
 - **[docs/skills/iterative-plan-review.md](./docs/skills/iterative-plan-review.md).** Stress-test an existing plan through multiple codebase-grounded review passes. Edits the plan in place and records every finding.
+- **[docs/skills/tdd.md](./docs/skills/tdd.md).** Drive a feature or behavior through a BDD-framed red-green-refactor loop with an enforced observed-failure gate. The plugin's only execution skill: it writes code, applying coding standards and ADRs in green and refactor.
+- **[docs/skills/issue-triage.md](./docs/skills/issue-triage.md).** Classify a vague issue or bug report, identify missing information, assess severity and reproducibility, and recommend the right next skill.
 - **[docs/skills/investigate.md](./docs/skills/investigate.md).** Evidence-based investigation of bugs, failures, and unexpected behavior, with adversarial validation of the proposed fix.
 - **[docs/skills/code-review.md](./docs/skills/code-review.md).** Comprehensive code review of the current branch or specified files. Dispatches a domain-aware roster that scales with sizing.
 - **[docs/skills/gh-pr-review.md](./docs/skills/gh-pr-review.md).** Run `/code-review` against a GitHub PR and post the review as comments after a clarity check.
@@ -123,4 +125,4 @@ Subdirectories:
 - **Every long-form doc links up.** The first bullet of the "Related Documentation" section always points back to the README at the repo root.
 - **Voice is uniform.** Every doc follows [docs/writing-voice.md](./docs/writing-voice.md). No em-dashes, direct second person, no flattery or hype.
 - **YAGNI applies to docs too.** Don't add speculative sections, for-future-flexibility warnings, or examples for behavior the skill doesn't have. The same evidence rule that gates plan steps gates docs.
-- **Counts to verify when editing indexes.** 21 agents in `plugin/agents/`; 15 skills in `plugin/skills/`; 21 long-form agent docs in `docs/agents/`; 15 long-form skill docs in `docs/skills/`.
+- **Counts to verify when editing indexes.** 21 agents in `plugin/agents/`; 17 skills in `plugin/skills/`; 21 long-form agent docs in `docs/agents/`; 17 long-form skill docs in `docs/skills/`.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index f4661d8..9e5333b 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -29,7 +29,8 @@ Read these once:
    - Body: numbered steps, `${CLAUDE_SKILL_DIR}` paths for script references, extracted references under `references/`.
 3. Copy [the skill template](./docs/templates/skill-long-form-template.md) into `docs/skills/{name}.md` and fill it in. Every skill gets a long-form doc.
 4. Add the skill to the [Skills Index](./docs/skills/README.md) with a one-sentence scent line and a link.
-5. Update the marketplace registry at [`.claude-plugin/marketplace.json`](./.claude-plugin/marketplace.json) if needed.
+5. Update the skill counts and catalog so they stay accurate: the skill catalog and "Counts to verify when editing indexes" line in [Root CLAUDE.md](./CLAUDE.md), the count in [Concepts](./docs/concepts.md) ("What does the plugin include?"), and the counts in the [README](./README.md). If the skill belongs to a new category, add it to the category lists too.
+6. Update the marketplace registry at [`.claude-plugin/marketplace.json`](./.claude-plugin/marketplace.json) if needed.
 
 ## Adding an agent
 
diff --git a/README.md b/README.md
index 46a2416..87eb912 100644
--- a/README.md
+++ b/README.md
@@ -2,28 +2,28 @@
 
 <img src="images/han-banner.png">
 
-Han is a suite of AI skills and agents for solo (or small-team) product engineers. It combines evidence-based planning, full documentation maintenance, deep code review, and architectural analysis into a team of specialists you can dispatch from Claude Code.
+Han is a suite of AI skills and agents for solo (or small-team) product engineers. It combines evidence-based planning, test-driven implementation, full documentation maintenance, deep code review, and architectural analysis into a team of specialists you can dispatch from Claude Code.
 
 ## What this plugin does
 
-Han turns planning, review, and documentation work that would normally take a team into a set of deterministic skills you run from Claude Code. Each skill dispatches specialist agents (project managers, adversarial reviewers, investigators, architectural analysts, testing and security specialists) to do the judgment-heavy work, then folds their findings into an artifact you can trust.
+Han turns planning, implementation, review, and documentation work that would normally take a team into a set of deterministic skills you run from Claude Code. Each skill dispatches specialist agents (project managers, adversarial reviewers, investigators, architectural analysts, testing and security specialists) to do the judgment-heavy work, then folds their findings into an artifact you can trust.
 
-The skills are designed to compose. You can plan a feature, then plan its implementation, then iterate on the plan, then review the resulting code, then write the PR description. All through named skills that hand off to each other cleanly.
+The skills are designed to compose. You can plan a feature, then plan its implementation, then iterate on the plan, then build it test-first, then review the resulting code, then write the PR description. All through named skills that hand off to each other cleanly.
 
 Read [Concepts](./docs/concepts.md) for the skill-and-agent model that runs through the whole plugin.
 
 ## Which path are you on?
 
 - **New to han?** → Start with [Concepts](./docs/concepts.md), then the [Quickstart](./docs/quickstart.md).
-- **Looking for a specific skill?** → [Skills Index](./docs/skills/README.md). 16 skills grouped by purpose.
+- **Looking for a specific skill?** → [Skills Index](./docs/skills/README.md). 17 skills grouped by purpose.
 - **Looking for a specific agent?** → [Agents Index](./docs/agents/README.md). 21 agents grouped by role.
-- **Wondering how the agent swarms scale?** → [Sizing](./docs/sizing.md). The small / medium / large dispatch model used by `/code-review`, `/gap-analysis`, `/iterative-plan-review`, `/plan-a-feature`, and `/plan-implementation`.
+- **Wondering how the agent swarms scale?** → [Sizing](./docs/sizing.md). The small / medium / large dispatch model used by `/architectural-analysis`, `/code-review`, `/gap-analysis`, `/iterative-plan-review`, `/plan-a-feature`, and `/plan-implementation`.
 - **Wondering why a skill said "YAGNI"?** → [YAGNI](./docs/yagni.md). The evidence-based rule every planning, review, and architecture skill applies before committing items to an artifact.
 - **Writing or editing a skill or agent?** → [Contributing](./CONTRIBUTING.md).
 
 ## Skills
 
-Sixteen skills, grouped by the moment you reach for them. Each category links to the full long-form docs through the [Skills Index](./docs/skills/README.md).
+Seventeen skills, grouped by the moment you reach for them. Each category links to the full long-form docs through the [Skills Index](./docs/skills/README.md).
 
 ### Planning
 
@@ -34,11 +34,17 @@ Spec what to build, plan how to build it, sequence it into phases, and stress-te
 - **[`/plan-a-phased-build`](./docs/skills/plan-a-phased-build.md).** Split a body of work into a numbered sequence of vertical-slice phases, each independently demoable.
 - **[`/iterative-plan-review`](./docs/skills/iterative-plan-review.md).** Stress-test an existing plan through multiple codebase-grounded review passes.
 
+### Building
+
+Write the code, test-first, through a disciplined red-green-refactor loop.
+
+- **[`/tdd`](./docs/skills/tdd.md).** Drive a feature or behavior through a BDD-framed red-green-refactor loop with an enforced observed-failure gate, applying coding standards and ADRs in green and refactor.
+
 ### Investigation & root cause
 
 Find out *why* something is broken, with evidence to back it.
 
-- **[`/issue-triage`](docs/skills/issue-triage.md).** Classify a vague issue or bug report, identify missing information, assess severity and reproducibility, and recommend the right next skill.
+- **[`/issue-triage`](./docs/skills/issue-triage.md).** Classify a vague issue or bug report, identify missing information, assess severity and reproducibility, and recommend the right next skill.
 - **[`/investigate`](./docs/skills/investigate.md).** Evidence-based investigation of bugs, failures, and unexpected behavior, with adversarial validation of the proposed fix.
 
 ### Review & analysis
@@ -86,7 +92,7 @@ Add the Test Double skills marketplace to Claude Code, then install the plugin:
 
 - [Concepts](./docs/concepts.md). Skill vs. agent, and how they compose. Read once before using the plugin.
 - [Quickstart](./docs/quickstart.md). Four paths for four common situations. Each path is a short sequence of skills.
-- [Skills Index](./docs/skills/README.md). All 16 skills, grouped by purpose.
+- [Skills Index](./docs/skills/README.md). All 17 skills, grouped by purpose.
 - [Agents Index](./docs/agents/README.md). All 21 agents, grouped by role.
 - [Sizing](./docs/sizing.md). The small / medium / large model that decides how many agents the swarming skills dispatch.
 - [YAGNI](./docs/yagni.md). The evidence-based "You Aren't Gonna Need It" rule every planning, review, and architecture skill applies.
diff --git a/docs/concepts.md b/docs/concepts.md
index 2f070a6..8b1f811 100644
--- a/docs/concepts.md
+++ b/docs/concepts.md
@@ -66,7 +66,7 @@ Every skill that dispatches an agent swarm classifies the work as **small**, **m
 
 - **Default is small.** Every sizing-aware skill starts the classification at small and only escalates when concrete signals require it.
 - **Auto-classified, with a `$size` override.** Skills read signals (file count, subsystems touched, security/data/infra surface) and announce the chosen size with a one-line justification. Pass `small`, `medium`, or `large` as the first positional argument to override (`/code-review medium`, `/plan-a-feature large "describe the feature"`).
-- **Five sizing-aware skills.** [`/code-review`](./skills/code-review.md), [`/gap-analysis`](./skills/gap-analysis.md), [`/iterative-plan-review`](./skills/iterative-plan-review.md), [`/plan-a-feature`](./skills/plan-a-feature.md), [`/plan-implementation`](./skills/plan-implementation.md).
+- **Six sizing-aware skills.** [`/architectural-analysis`](./skills/architectural-analysis.md), [`/code-review`](./skills/code-review.md), [`/gap-analysis`](./skills/gap-analysis.md), [`/iterative-plan-review`](./skills/iterative-plan-review.md), [`/plan-a-feature`](./skills/plan-a-feature.md), [`/plan-implementation`](./skills/plan-implementation.md).
 
 Read the full [Sizing](./sizing.md) reference for the bands, the auto-classification process, and the per-skill rules.
 
@@ -92,7 +92,7 @@ Direct invocation uses the `Agent` tool with `subagent_type: han:{agent-name}` (
 
 ## What does the plugin include?
 
-- **15 skills.** The [skills index](./skills/README.md) groups them by purpose (planning, investigation, review, discovery, conventions, reporting).
+- **17 skills.** The [skills index](./skills/README.md) groups them by purpose (planning, building, investigation, review, discovery, conventions, reporting).
 - **21 agents.** The [agents index](./agents/README.md) groups them by role (planning and facilitation, adversarial reviewers, investigation, architecture, testing, gap and content).
 
 Skim the indexes after you read this page. Pick the one skill you need right now. Come back later to learn the rest.
@@ -108,6 +108,6 @@ Skim the indexes after you read this page. Pick the one skill you need right now
 
 ## Related reading
 
-- [`docs/plugin-entity-taxonomy.md`](./guidance/plugin-entity-taxonomy.md). The taxonomy this plugin follows. Applies across all plugins in this repo.
+- [`docs/guidance/plugin-entity-taxonomy.md`](./guidance/plugin-entity-taxonomy.md). The taxonomy this plugin follows. Applies across all plugins in this repo.
 - [Claude Code Skills reference](https://code.claude.com/docs/en/skills). How skills are defined and invoked in Claude Code itself.
 - [Claude Code Subagents reference](https://code.claude.com/docs/en/sub-agents). How agents are dispatched from inside skills.
diff --git a/docs/quickstart.md b/docs/quickstart.md
index 3b88d20..41b2b19 100644
--- a/docs/quickstart.md
+++ b/docs/quickstart.md
@@ -1,12 +1,12 @@
 # Quickstart
 
-New to the han plugin? Pick the path that matches what you are trying to do right now. Each path is a short sequence (two or three skills) that compose into a useful result. You can follow one path end to end, or jump off at any step.
+New to the han plugin? Pick the path that matches what you are trying to do right now. Each path is a short sequence (a few skills) that compose into a useful result. You can follow one path end to end, or jump off at any step.
 
 > See also: [Plugin landing page](../README.md) · [Concepts](./concepts.md) · [Skills](./skills/README.md) · [Agents](./agents/README.md) · [Sizing](./sizing.md) · [YAGNI](./yagni.md)
 
 ## Which path are you on?
 
-- **[Plan a new feature](#path-a--plan-a-new-feature).** You have an idea for a feature and need to figure out what it should do and how to build it.
+- **[Plan a new feature](#path-a--plan-a-new-feature).** You have an idea for a feature and need to figure out what it should do, how to build it, and then build it test-first.
 - **[Investigate a bug or failure](#path-b--investigate-a-bug-or-failure).** Something is broken or behaving oddly and you need a root cause.
 - **[Review code or architecture](#path-c--review-code-or-architecture).** You want a second set of eyes on a branch, a PR, or an existing module.
 - **[Set up a project for everything else](#path-d--set-up-a-project-for-everything-else).** You want to document your project, formalize standards, and give every other skill richer context.
@@ -23,8 +23,9 @@ You have a feature idea and want a specification grounded in evidence, then a pl
 2. **[`/plan-a-phased-build`](./skills/plan-a-phased-build.md)** *(optional).* When the feature is large enough to ship in slices rather than all at once, split the spec into a numbered sequence of vertical-slice phases, each independently demoable to a real person.
 3. **[`/plan-implementation`](./skills/plan-implementation.md).** Turn the specification (or a single phase from the phased build) into an implementation plan through a project-manager-led team conversation.
 4. **[`/iterative-plan-review`](./skills/iterative-plan-review.md)** *(optional).* Stress-test either plan through multiple codebase-grounded review passes before committing to it.
+5. **[`/tdd`](./skills/tdd.md)** *(when you build it).* Implement the plan test-first through a BDD-framed red-green-refactor loop. The specification becomes the behavior test list; the skill enforces an observed-failure gate and applies your coding standards and ADRs in green and refactor.
 
-**You are done when:** you have a `feature-specification.md` and a `feature-implementation-plan.md` in the same folder, each with a cross-referenced decision log and review findings. If the feature was large enough to phase, you also have a `build-phase-outline.md` that orders the work into demoable vertical slices.
+**You are done when:** you have a `feature-specification.md` and a `feature-implementation-plan.md` in the same folder, each with a cross-referenced decision log and review findings. If the feature was large enough to phase, you also have a `build-phase-outline.md` that orders the work into demoable vertical slices. When you build it, the code lands behavior by behavior through `/tdd`, with tests leading.
 
 ---
 
@@ -78,12 +79,13 @@ You can reference multiple skills in one prompt and Claude runs them in sequence
 - *"Scan this repo, document the auth system, and create a coding standard for how we handle tokens."* → [`/project-discovery`](./skills/project-discovery.md) → [`/project-documentation`](./skills/project-documentation.md) → [`/coding-standard`](./skills/coding-standard.md).
 - *"Review my branch, then create an ADR for any architectural decisions in the diff."* → [`/code-review`](./skills/code-review.md) → [`/architectural-decision-record`](./skills/architectural-decision-record.md).
 - *"Plan the retry feature, then plan the implementation, then create a test plan for it."* → [`/plan-a-feature`](./skills/plan-a-feature.md) → [`/plan-implementation`](./skills/plan-implementation.md) → [`/test-planning`](./skills/test-planning.md).
+- *"Spec the discount engine, then build it test-first."* → [`/plan-a-feature`](./skills/plan-a-feature.md) → [`/tdd`](./skills/tdd.md) → [`/code-review`](./skills/code-review.md).
 - *"Compare the auth implementation to the auth spec, then plan how to close the gaps."* → [`/gap-analysis`](./skills/gap-analysis.md) → [`/plan-implementation`](./skills/plan-implementation.md).
 - *"Compare the share v1 implementation to the share v2 spec, split the gaps into a phased rollout, then plan implementation for the first phase."* → [`/gap-analysis`](./skills/gap-analysis.md) → [`/plan-a-phased-build`](./skills/plan-a-phased-build.md) → [`/plan-implementation`](./skills/plan-implementation.md).
 
 ## A note on sizing
 
-Five skills (`/code-review`, `/gap-analysis`, `/iterative-plan-review`, `/plan-a-feature`, `/plan-implementation`) classify the work as **small**, **medium**, or **large** before dispatching agents, default to small, and scale the team and iteration depth to the chosen band. Pass the size as the first positional argument to override (`/code-review medium`, `/plan-a-feature large "describe the feature"`). See [Sizing](./sizing.md) for the full model.
+Six skills (`/architectural-analysis`, `/code-review`, `/gap-analysis`, `/iterative-plan-review`, `/plan-a-feature`, `/plan-implementation`) classify the work as **small**, **medium**, or **large** before dispatching agents, default to small, and scale the team and iteration depth to the chosen band. Pass the size as the first positional argument to override (`/code-review medium`, `/plan-a-feature large "describe the feature"`). See [Sizing](./sizing.md) for the full model.
 
 ## A note on YAGNI
 
diff --git a/docs/skills/README.md b/docs/skills/README.md
index 3bc1195..53b12bc 100644
--- a/docs/skills/README.md
+++ b/docs/skills/README.md
@@ -17,6 +17,12 @@ Skills for specifying *what* a feature does, planning *how* to build it, and str
 - **[`/plan-a-phased-build`](./plan-a-phased-build.md).** Split a body of context (gap analysis, PRD, design doc, feature spec, requirements list) into a numbered sequence of vertical-slice build phases, each independently demoable to a real person and each building on the prior. Dispatches `information-architect` against the rendered outline to verify findability, EPPO standalone-ness of phase entries, and progressive comprehension.
 - **[`/iterative-plan-review`](./iterative-plan-review.md).** Stress-test an already-written plan through multiple codebase-grounded review passes.
 
+## Building
+
+Write the code itself, test-first, through a disciplined loop.
+
+- **[`/tdd`](./tdd.md).** Drive a feature or behavior through a BDD-framed red-green-refactor loop. Builds a behavior test list, enforces an observed-failure gate (no production code until a test has been run and seen to fail), works outside-in for user-facing behavior, and applies the project's coding standards and ADRs in green (correctness) and refactor (full conformance plus YAGNI). The plugin's only execution skill: it writes code, not a document.
+
 ## Investigation & root cause
 
 Skills for finding out *why* something is broken, with evidence to back it.
@@ -68,6 +74,7 @@ Every planning, review, and standards skill in the plugin applies an evidence-ba
 
 - **Planning.** [`/plan-a-feature`](./plan-a-feature.md), [`/plan-implementation`](./plan-implementation.md), [`/plan-a-phased-build`](./plan-a-phased-build.md), [`/iterative-plan-review`](./iterative-plan-review.md).
 - **Review and standards.** [`/code-review`](./code-review.md) (advisory-only), [`/coding-standard`](./coding-standard.md), [`/test-planning`](./test-planning.md), [`/architectural-decision-record`](./architectural-decision-record.md) (forcing-function requirement).
+- **Building.** [`/tdd`](./tdd.md) (enforcing in the refactor step and the test list).
 
 See [YAGNI](../yagni.md) for the two gates, the acceptable-evidence list, the named anti-patterns, and the deferral format.
 
diff --git a/docs/skills/tdd.md b/docs/skills/tdd.md
new file mode 100644
index 0000000..aeb2ba3
--- /dev/null
+++ b/docs/skills/tdd.md
@@ -0,0 +1,134 @@
+# /tdd
+
+Operator documentation for the `/tdd` skill in the han plugin. This document helps you decide *when* and *how* to use the skill. For what the skill does internally, read the skill definition at [`plugin/skills/tdd/SKILL.md`](../../plugin/skills/tdd/SKILL.md).
+
+> See also: [Plugin landing page](../../README.md) · [All skills](./README.md) · [All agents](../agents/README.md) · [YAGNI](../yagni.md)
+
+## TL;DR
+
+- **What it does.** Drives writing code through a disciplined, BDD-framed red-green-refactor loop, one behavior at a time, with a gate that refuses production code until a test has been run and seen to fail.
+- **When to use it.** You want a feature or behavior implemented test-first, the right way, instead of code with tests bolted on after.
+- **What you get back.** Working, tested code in your tree, grown behavior by behavior, with the test list, the standards applied, and the verification output shown at the end.
+
+## Key concepts
+
+- **Execution skill.** Every other han skill produces a markdown document. This one modifies your source tree. It writes the tests and the production code. That is the point, and it is why the skill reports scope and recommends a branch before it starts, so a watching human can see what it will do.
+- **The observed-failure gate.** No production code changes unless a test has been run and watched to fail for the intended reason in that loop. A test that passes the first time it runs means red was never seen, which is a process violation, not a success.
+- **Two hats.** Making a test pass and improving structure are different jobs done at different times. The skill never refactors while a test is red. Make it run, then make it right.
+- **BDD framing.** Tests describe observable behavior, named in your project's existing convention, asserting outcomes through the public interface, never private state. For user-facing behavior the skill works outside-in: a failing acceptance test on the outside, red-green-refactor on the inside.
+- **Standards split across green and refactor.** Going green obeys only the standards that govern correctness and where code is allowed to live (an ADR boundary you cross is wrong code, not deferrable mess). Full stylistic and structural conformance, plus YAGNI, happen in refactor.
+
+## When to use it
+
+**Invoke when:**
+
+- You have a feature, behavior, or function to build and you want it driven test-first through a correct red-green-refactor cycle.
+- You want to grow code behavior by behavior with the tests leading, not write code and add tests afterward.
+- You are working from a specification or plan and want the implementation built with TDD discipline rather than in one pass.
+
+**Do not invoke for:**
+
+- **Producing a test plan without writing code.** Use [`/test-planning`](./test-planning.md) instead. It analyzes coverage gaps and prioritizes what to test; it does not implement.
+- **Reviewing or auditing code that already exists.** Use [`/code-review`](./code-review.md) instead.
+- **Deciding what a feature should do.** Use [`/plan-a-feature`](./plan-a-feature.md) to specify behavior first, then bring the spec here.
+- **Finding the root cause of a bug.** Use [`/investigate`](./investigate.md). Once you have a fix in mind, you can drive it back in through `/tdd`.
+
+## How to invoke it
+
+Run `/tdd` in Claude Code.
+
+Give it:
+
+1. **What to build.** A behavior, a feature, or a path to a specification or plan. A sharp version names the observable behavior ("the fee calculator rounds half-up to the cent"); a thin version ("build the fee thing") still works because Step 2 turns it into a behavior list you review before any code is written.
+2. **Any context to respect.** A `feature-specification.md`, a linked issue, or a plan. The skill reads it as the source of behaviors for the test list.
+3. **Nothing about the test framework.** The skill resolves the test, lint, and build commands from your project itself (see *What you get back*). You do not need to pass them.
+
+The skill runs autonomously after your initial request. Before the loop it reports scope (the behavior to build, the resolved test, lint, and build commands, the standards and ADRs it found, the current branch, a branch recommendation if you are on the default branch), then proceeds without waiting. That report is informational, not a gate. The one exception: if your request or the provided context explicitly says you want to review, verify, or approve the plan or test list before implementation, the skill builds the test list, presents it with the scope report, and waits for your approval before writing any code. The only input that can otherwise block it is a test command it cannot resolve or infer, because there is no way to run tests without one.
+
+Example prompts:
+
+- `/tdd`. *"Implement the discount engine from docs/specs/discount/feature-specification.md test-first."*
+- `/tdd`. *"Drive a new `parseDuration` function red-green-refactor: it should accept `1h30m` style strings and reject garbage."*
+
+## What you get back
+
+Code in your working tree, not a report. Specifically:
+
+- **A test list**, shown to you before the loop starts and updated as behaviors are completed and as new scenarios are discovered (discovered scenarios are deferred, never built mid-loop).
+- **Tests and production code**, grown one behavior per cycle. Each cycle shows you the real test-runner output for red (the test failing for the intended reason) and green (the new test passing, all prior tests still passing).
+- **A final summary**: behaviors implemented, the state of the test list including any deferred items with their reopen triggers, which coding standards and ADRs were applied and where, any YAGNI deferrals from refactor, and the final test, lint, and build status with output shown rather than asserted.
+
+The skill resolves your test, lint, and build commands from CLAUDE.md's `## Project Discovery` section, falling back to `project-discovery.md`, falling back to a one-time discovery script that infers them from your manifest files (package.json, pyproject.toml, go.mod, Cargo.toml, Gemfile, mix.exs, pom.xml, gradle, .csproj, or a Makefile test target). Commands the script infers are treated as best-effort suggestions, surfaced in the scope report so you can correct them if you are watching, not trusted blindly. If none of those resolve the test command, the skill asks you for it before the loop starts, because the loop cannot run without it. That is the only input that can block an otherwise autonomous run.
+
+## How to get the most out of it
+
+- **Bring a specification when you have one.** `/tdd` builds a better test list from a `feature-specification.md` than from a one-line prompt, because the behaviors and edge cases are already named. Run [`/plan-a-feature`](./plan-a-feature.md) first for anything non-trivial.
+- **Have your standards and ADRs discoverable.** The green and refactor steps apply your coding standards and architectural decisions. If they live in `docs/coding-standards/` or `docs/adr/`, or are recorded by [`/coding-standard`](./coding-standard.md) and [`/architectural-decision-record`](./architectural-decision-record.md), the skill finds and applies them. If they do not exist, it infers conventions from surrounding code, which is weaker.
+- **Let the list be the scope signal.** If the open test list grows past about ten items, the skill flags a scope warning and keeps going, then recommends splitting the work in its final summary. Take that warning seriously: a ballooning list usually means the feature wanted to be planned, not grown in one sitting.
+- **Read the red output.** The skill pastes real runner output for every red. Glancing at it is how you catch a test that fails for the wrong reason before it drives wrong code.
+- **Pair with `/code-review` next.** TDD produces self-testing code; it does not replace a second set of eyes. Run [`/code-review`](./code-review.md) on the branch when the list is empty.
+
+## YAGNI
+
+`/tdd` produces two things YAGNI gates: the test list and the code that comes out of refactor.
+
+- **The test list.** A scenario earns a place only with evidence it is needed now (a user-described need, a named dependency, an existing code path that breaks, an applicable regulation, a real incident). Scenarios that fail the evidence test, or that exist for symmetry or completeness, are deferred with the trigger that would reopen them, not padded onto the list.
+- **The refactor step.** Removing duplication is the job. Adding an interface with one implementation, a configuration knob no caller sets, or a generalization from a single example is not. "Duplication is a hint, not a command": the skill abstracts only when two or more concrete examples force it, which is the Rule of Three and also Beck's Triangulate. Speculative structure introduced for future flexibility is a YAGNI candidate, deferred with a named reopen trigger and surfaced to you, never silently added and never silently dropped.
+
+The rule is enforcing in refactor (speculative structure is deferred by default), and the deferrals appear in the final summary. See [YAGNI](../yagni.md) for the two gates, the acceptable-evidence list, the named anti-patterns, and the deferral format.
+
+## Cost and latency
+
+`/tdd` runs on the main agent. It dispatches no sub-agents and is not a sizing-aware skill, so there is no fan-out cost. The cost is the loop itself: a multi-turn, tight iteration where each behavior is three phases (red, green, refactor) and each phase runs your test command. The most expensive single factor is the number of test list items multiplied by your suite's runtime, since the full suite runs at green and after every refactor. This is a tight-loop skill built to run while you build, not an infrequent high-signal report. Keeping the test list scoped (the skill flags lists past ~10 items) is the main lever on total cost.
+
+## In more detail
+
+The skill is structurally modeled on two existing skills. The loop and its stop condition follow [`/iterative-plan-review`](./iterative-plan-review.md): front-loaded constraints, a deterministic per-cycle process, a bounded list. The project and command discovery follows [`/test-planning`](./test-planning.md): resolve from CLAUDE.md's `## Project Discovery`, fall back to `project-discovery.md`, fall back to a one-time script.
+
+One design decision is worth knowing. Classic TDD says green should "commit whatever sins are necessary" and clean up in refactor. Taken literally, that would mean ignoring coding standards while going green. The skill splits the difference: standards that govern *correctness and architectural placement* (which boundary code must go through, where it is allowed to live, which contract it honors) are obeyed in green, because violating an ADR boundary is not a temporary sin you tidy later, it is the wrong code. Stylistic and structural standards, the kind you genuinely can defer, are the refactor hat. This keeps the green step minimal (the Three Laws still hold) while making sure the code that survives the cycle respects the project's architecture.
+
+The hardest honest limitation: the observed-failure gate is enforced by discipline and shown evidence (pasted runner output, the first-run-pass stop rule, strict step sequencing), not by a mechanism that can physically prevent a premature write. No skill in the plugin model can enforce a "you must have observed X before doing Y" constraint with certainty. The skill makes the failure visible and diagnosable instead, which is the strongest available guarantee. If you watch one thing while it runs, watch that the red output is real and fails for the reason intended.
+
+## Sources
+
+The skill's protocols and vocabulary are grounded in the primary TDD and BDD literature. Each source is cited because the skill draws a specific, named artifact from it.
+
+### Kent Beck, *Test-Driven Development: By Example*, 2002; and "Canon TDD", 2023
+
+The test list pattern, the red-green-refactor mantra, the two-hats rule, and the implementation gears (Fake It, Triangulate, Obvious Implementation) come from here. The skill's loop is the Canon TDD five-step loop.
+
+URL: https://tidyfirst.substack.com/p/canon-tdd
+
+### Robert C. Martin, "The Three Rules of TDD" and "The Cycles of TDD"
+
+The verbatim Three Laws the observed-failure gate is built on, including "compilation failures are failures" and "the one failing unit test".
+
+URL: https://blog.cleancoder.com/uncle-bob/2014/12/17/TheCyclesOfTDD.html
+
+### Martin Fowler, "Test Driven Development", "Mocks Aren't Stubs", "GivenWhenThen"
+
+The refactor-is-the-most-skipped-step warning, the classicist default for test doubles, the stub-query / mock-command rule, and Given-When-Then mapped onto Arrange-Act-Assert.
+
+URL: https://martinfowler.com/articles/mocksArentStubs.html
+
+### Dan North, "Introducing BDD"
+
+Behavior over "test", the should-sentence framing, and the "should it? really?" failure triage.
+
+URL: https://dannorth.net/introducing-bdd/
+
+### Steve Freeman & Nat Pryce, *Growing Object-Oriented Software, Guided by Tests*
+
+The outside-in double loop: an outer acceptance test, an inner red-green-refactor loop, collaborator interfaces discovered via mocks at the call site, acceptance test green only with real implementations.
+
+URL: https://growing-object-oriented-software.com/
+
+## Related documentation
+
+- [Plugin landing page](../../README.md). The front door. Start here if you arrived from outside the docs tree.
+- [YAGNI](../yagni.md). The evidence-based "You Aren't Gonna Need It" rule the refactor step and test list apply. The two gates, the acceptable-evidence list, the named anti-patterns, and the deferral format.
+- [`/test-planning`](./test-planning.md). Plan what to test without writing code. Use it before `/tdd` to enumerate behaviors, or instead of it when you want analysis rather than implementation.
+- [`/plan-a-feature`](./plan-a-feature.md). Specify behavior first; the spec becomes the test list `/tdd` builds from.
+- [`/code-review`](./code-review.md). Run it on the branch once the list is empty. TDD produces self-testing code; it does not replace review.
+- [`/coding-standard`](./coding-standard.md) and [`/architectural-decision-record`](./architectural-decision-record.md). The standards and ADRs `/tdd` applies in green and refactor come from here.
+- [Skill building guidance](../guidance/skill-building-guidance/). The progressive disclosure, description frontmatter, and bash-permission rules this skill follows.
diff --git a/docs/yagni.md b/docs/yagni.md
index 9d7239d..684c635 100644
--- a/docs/yagni.md
+++ b/docs/yagni.md
@@ -86,6 +86,7 @@ YAGNI applies in two postures: **producing** (when a skill drafts an artifact) a
 | [`/coding-standard`](./skills/coding-standard.md) | A standard is justified only when the project does the thing the standard governs *today* and the standard solves a real, concrete problem the team is currently hitting. |
 | [`/test-planning`](./skills/test-planning.md) | A YAGNI sweep removes speculative tests: for code paths that don't exist, hypothetical adversaries, or branches the change doesn't touch. |
 | [`/architectural-decision-record`](./skills/architectural-decision-record.md) | An ADR requires a **forcing function** today: a real decision being made now, with consequences. ADRs about decisions that don't have a forcing function are YAGNI. |
+| [`/tdd`](./skills/tdd.md) | Enforcing in the refactor step and the test list. A scenario joins the test list only with evidence; speculative scenarios are deferred with a reopen trigger. Refactor removes duplication but defers speculative abstraction (the Rule of Three). Correctness-and-placement standards still apply in green. |
 | [`project-manager`](./agents/project-manager.md) | The YAGNI Evidence Gate protocol. Every committed proposal in a facilitated discussion must cite evidence; uncited proposals are challenged or deferred. |
 | [`junior-developer`](./agents/junior-developer.md) | The YAGNI Evidence Sweep protocol. Flags hidden assumptions and uncited additions during stress-testing. |
 | [`software-architect`](./agents/software-architect.md) | Architectural recommendations cite the change-history or coupling evidence that justifies the recommendation. Speculative abstractions are deferred. |
diff --git a/plugin/skills/tdd/SKILL.md b/plugin/skills/tdd/SKILL.md
new file mode 100644
index 0000000..40382f5
--- /dev/null
+++ b/plugin/skills/tdd/SKILL.md
@@ -0,0 +1,237 @@
+---
+name: tdd
+description: >
+  Write code through a disciplined, BDD-framed Test-Driven Development loop:
+  build a behavior test list, then drive each behavior through
+  red-green-refactor with an enforced observed-failure gate. Use when the user
+  wants to implement, build, or write code test-first, "do TDD", follow
+  "red-green-refactor", drive code from tests, or grow a feature
+  behavior-by-behavior with tests leading. This skill writes and changes code;
+  it does not produce a test plan document (use test-planning), review or audit
+  existing code (use code-review), specify what a feature should do (use
+  plan-a-feature), or find the root cause of a bug (use investigate). It applies
+  the project's coding standards and ADRs during the green and refactor steps,
+  and enforces YAGNI during refactor.
+argument-hint: "[what to build, a behavior to drive, or a path to a spec/plan]"
+allowed-tools: Read, Write, Edit, Glob, Grep, Agent, Bash(git *), Bash(find *), Bash(npm *), Bash(npx *), Bash(pnpm *), Bash(yarn *), Bash(pytest *), Bash(python3 *), Bash(go *), Bash(cargo *), Bash(make *), Bash(bundle *), Bash(rake *), Bash(mix *), Bash(dotnet *), Bash(gradle *), Bash(mvn *)
+---
+
+## Project Context
+
+- git installed: !`git --version 2>/dev/null`
+- current branch: !`git branch --show-current 2>/dev/null`
+- CLAUDE.md: !`find . -maxdepth 1 -name "CLAUDE.md" -type f`
+- project-discovery.md: !`find . -maxdepth 3 -name "project-discovery.md" -type f`
+
+## Constraints (read before anything else)
+
+This skill writes production and test code in your working tree. It is an
+execution skill, not a document generator. These constraints shape every step
+and override any instinct to move faster.
+
+- **The observed-failure gate is load-bearing.** You may not write or change a
+  line of production code unless a test has been run and you have *seen it
+  fail for the intended reason* in this loop. A test that passes the first time
+  it is ever run means red was never observed: stop and diagnose, do not
+  proceed. This single rule is what separates real TDD from TDD-flavored
+  code. The verbatim Three Laws and Canon TDD steps this rule comes from are in
+  [references/tdd-loop.md](references/tdd-loop.md) — read that file before
+  Step 3.
+- **Two hats, never worn at once.** Making a test pass (green) and improving
+  structure (refactor) are different jobs. Never refactor while any test is
+  red. Make it run, then make it right.
+- **One behavior at a time.** Exactly one test list item becomes one runnable
+  test per loop. Newly discovered scenarios are written to the list and
+  deferred, never implemented in the current loop.
+- **BDD framing.** Tests describe observable behavior, named in the project's
+  existing test-naming convention, asserting outcomes through the public
+  interface — never private state. The protocol is in
+  [references/bdd-framing.md](references/bdd-framing.md).
+- **You will be tempted to fake this.** The specific ways an agent fakes TDD,
+  and the discipline that catches each, are in
+  [references/failure-modes.md](references/failure-modes.md). Read it.
+- **YAGNI governs the refactor step and the test list.** Apply the rule in
+  [../../references/yagni-rule.md](../../references/yagni-rule.md): remove
+  duplication, but do not add abstractions, configuration, or indirection
+  without evidence. Speculative structure added "for flexibility" during
+  refactor is a YAGNI candidate. Speculative scenarios on the test list are
+  deferred with a reopen trigger, never silently added.
+
+# Test-Driven Development
+
+## Step 1: Resolve Project Config and Confirm Scope
+
+**Resolve commands.** Read CLAUDE.md's `## Project Discovery` section for the
+test command (under `### Commands and Tests`, not `### Frameworks and
+Tooling`), the lint command, the build command, language, and framework. If
+absent, fall back to `project-discovery.md`. If still absent, run
+`${CLAUDE_SKILL_DIR}/scripts/detect-tdd-context.sh` and parse its output for
+git state and manifest-inferred commands. Store the resolved test, lint, and
+build commands for use in every later step.
+
+**Resolve standards and decisions.** Resolve the coding-standards directory and
+ADR directory the same way: read CLAUDE.md's `## Project Discovery` section;
+fall back to `project-discovery.md`; fall back to Glob defaults (`docs/`,
+`docs/adr/`, `docs/coding-standards/`, `docs/decisions/`). Also check
+`CLAUDE.md` and `AGENTS.md` for inline standards. Read what you find — these
+govern the green and refactor steps. If none exist, state that plainly and plan
+to infer conventions from the surrounding code instead.
+
+**Report scope, then proceed (no gate).** This skill runs autonomously after
+the initial request: it does not stop for confirmation. State to the user, in a
+few lines: the behavior or feature to be built, the resolved test/lint/build
+commands, the standards and ADRs found (or that none were), the current branch,
+and that the skill will now write code in a red-green-refactor loop. If
+`current branch` from Project Context is the repository's default branch
+(`main` or `master`), recommend working on a branch, but do not wait for an
+answer. This is a report the user reads while the work runs, not a gate.
+Continue immediately to Step 2 without waiting for a response.
+
+**The one exception.** If the initial request or the provided context
+explicitly states the human wants to review, verify, or approve the plan or
+test list before implementation, then this becomes a gate: build the test list
+in Step 2, present it together with this scope report, and wait for approval
+before starting the Step 3 loop. Absent an explicit request like that, the
+skill runs to completion without further human input.
+
+The one input that can still block is a missing test command: if it could not
+be resolved from CLAUDE.md, `project-discovery.md`, the discovery script, or
+manifest inference, ask the user for it, because TDD is impossible without a
+way to run tests. Exhaust inference before asking; this is a hard dependency,
+not a discretionary checkpoint.
+
+## Step 2: Build the BDD Test List
+
+Turn the requested feature or behavior into a test list (Kent Beck's "test
+list" pattern). Each item is one observable behavior, phrased as a behavior
+sentence, not as an implementation note. "Returns the unrounded fee for a
+sub-dollar charge" is a list item; "use a BigDecimal" is not. Follow
+[references/bdd-framing.md](references/bdd-framing.md) for how to phrase and
+name behaviors, and which test-naming convention to adopt (the project's
+existing convention and any discovered coding standard win over a literal
+"should" default).
+
+Order the list outside-in by user value: the next item is the most important
+thing the system does not yet do. For an item that is **user-observable
+behavior at a system boundary**, write the outer acceptance test for it *first*
+(it will be red until its inner behaviors exist) and record it as the outer
+loop for that item. For internal or utility behavior with no meaningful system
+boundary, the outer acceptance test is optional; the inner loop alone is
+correct.
+
+Apply YAGNI to the list itself. A scenario earns a place only with evidence it
+is needed now (a user-described need, a named dependency, an existing code path
+that breaks, a regulation, a real incident). Scenarios that fail the evidence
+test go to a deferred list with the trigger that would reopen them. Do not pad
+the list for symmetry or completeness.
+
+Report the test list to the user. Unless the verify-plan exception from Step 1
+applies, continue to Step 3 immediately without waiting for approval. When that
+exception applies, present the test list together with the Step 1 scope report
+and wait for approval before entering the loop.
+
+## Step 3: The Red-Green-Refactor Loop
+
+Pick exactly one item from the list. Choose one that teaches you something and
+that you are confident you can implement in one cycle (Beck's "one step
+test"). Then run these three phases in order. Do not collapse them.
+
+### Red
+
+Write exactly one test for the chosen behavior. Name it for the behavior in the
+project's convention. Assert an observable outcome through the public interface
+(Given = arrange the state before; When = the one action under test; Then =
+assert the observable result). Write no more of the test than is sufficient to
+fail; a compilation failure is a failure.
+
+Run the resolved test command directly with Bash. **Paste the actual runner
+output into your response.** Confirm the test fails, and that it fails for the
+reason you intended (the assertion or the missing symbol you expect, not an
+unrelated error).
+
+If the test passes on its first run, the observed-failure gate has tripped.
+Stop. Diagnose: the test is not exercising the behavior, or the behavior
+already exists. If the behavior already exists, cross the item off and pick the
+next one. Do not write production code off an unobserved red.
+
+### Green
+
+Write the minimum production code that makes this one test pass. Use the
+smallest gear that works: Obvious Implementation when you are certain, Fake It
+(return a constant, generalize later) when you are not, Triangulate (force the
+abstraction with a second example) only when you are really unsure. Gears are
+described in [references/tdd-loop.md](references/tdd-loop.md).
+
+While going green, respect the coding standards and ADRs that govern
+*correctness and architectural placement*: where this code is allowed to live,
+which boundary or client it must go through, which contract it must honor.
+Violating an ADR boundary is not a sin you clean up later — it is the wrong
+code. Do **not** apply stylistic or structural polish here (naming sweeps,
+extraction, formatting passes). That is the refactor hat, and wearing it now
+violates "no more code than is sufficient to pass the test."
+
+Run the full test suite with Bash and paste the output. The gate to leave green
+is: the new test passes and every previously passing test still passes. If a
+prior test broke, you are not green — fix it before refactoring.
+
+### Refactor (non-skippable)
+
+Only with every test green. Neglecting this step is the most common way to
+ruin TDD, so it is not optional: either you change something, or you state
+explicitly "no duplication, structure, or standards issue this cycle" and move
+on.
+
+Eliminate the duplication you just created. Bring the code into full
+conformance with the resolved coding standards and ADRs — this is the home for
+the stylistic and structural standards you deliberately skipped in green.
+
+Apply YAGNI here as a first-class concern, per
+[../../references/yagni-rule.md](../../references/yagni-rule.md). Removing
+duplication is the job; adding speculative abstraction is not. One concrete
+implementation beats an interface with one implementation. "Duplication is a
+hint, not a command" — abstract only when two or more concrete examples force
+it (the Rule of Three). Structure added for future flexibility with no evidence
+is a YAGNI candidate: defer it with the trigger that would reopen it, and tell
+the user. Never silently add it, never silently drop it.
+
+Change no behavior. Re-run the full suite after the refactor and paste the
+output; it must stay green. If a refactor reddened a test, revert it — a
+refactor that changes behavior is a defect, not a refactor.
+
+### Close the cycle
+
+Cross the completed item off the list. Append any scenarios you discovered
+while implementing (deferred, with their reopen trigger if speculative), but do
+not implement them now. If the open list has grown past roughly ten items, do
+not stop for input: flag it prominently as a scope warning, keep going, and
+record in the final summary that the work exceeded the recommended size and
+should be split next time. A runaway list is a scope signal, not a reason to
+pause for a human.
+
+Return to the top of Step 3 with the next item. Continue until the list is
+empty.
+
+## Step 4: Close the Outer Loop
+
+For any item that had an outer acceptance test (Step 2), run that test now. It
+should pass only because its inner behaviors are all implemented with real code
+(not mocks). If it is still red, the gap is a missing inner behavior: add the
+missing scenario to the test list and return to Step 3. The acceptance test
+going green is the signal the user-facing behavior is actually delivered.
+
+## Step 5: Final Verification and Summary
+
+Run the full test suite, then the lint command, then the build command, using
+the resolved commands from Step 1. Paste the results. If lint or build fails,
+that is in scope — fix it (a lint or build break is not a "pre-existing
+error" to wave off) and re-run.
+
+Summarize for the user:
+
+- Behaviors implemented, and the state of the test list (done, and any
+  deferred items with their reopen triggers).
+- Which coding standards and ADRs were applied, and where they shaped the code.
+- Any YAGNI deferrals from refactor, each with its reopen trigger.
+- A scope warning if the test list exceeded roughly ten open items, with a
+  recommendation to split future work.
+- The final test, lint, and build status, with output shown, not asserted.
diff --git a/plugin/skills/tdd/references/bdd-framing.md b/plugin/skills/tdd/references/bdd-framing.md
new file mode 100644
index 0000000..eac0ad1
--- /dev/null
+++ b/plugin/skills/tdd/references/bdd-framing.md
@@ -0,0 +1,119 @@
+# BDD Framing for the TDD Loop
+
+BDD is the reason this skill frames every test as a *behavior*. Dan North
+started replacing the word "test" with "behaviour" because almost every
+misunderstanding of TDD traced back to the word "test" — where to start, what
+to test, how much, what to name it, why a failure matters. The framing below is
+how `/tdd` answers those questions mechanically.
+
+## Name tests for behavior, in the project's convention
+
+A test name describes an observable behavior of the unit, not a method or an
+implementation detail. North's diagnostic: if you cannot phrase the name as a
+sentence about what the thing *should do*, the behavior may belong elsewhere.
+"should" also keeps the premise challengeable — when a behavior test fails, ask
+"should it? really?": is the code wrong, or is the asserted behavior now out of
+date?
+
+**Surface syntax follows the project, not a literal "should".** BDD governs the
+*focus* (observable behavior), not the spelling. If the project uses
+`it("...")`, `test_...`, `describe/it`, xUnit `TestXxx`, Go `TestXxx` with
+`t.Run("when ...")`, or a discovered coding standard that mandates a naming
+pattern, use that. The discovered convention and any coding standard always win
+over the word "should". What does not change: the name states a behavior and an
+expected outcome, never an implementation step.
+
+## Given-When-Then maps to Arrange-Act-Assert
+
+A scenario is one test. The three clauses are the three phases of that test:
+
+- **Given** — the state of the world before the behavior. The arrange/setup.
+  Typically something that already happened. Commands that establish state.
+- **When** — the one event or action under test. The act. Exactly one per
+  scenario. If you need an "and" in the When, you probably have two scenarios.
+- **Then** — the expected, observable outcome. The assert. It must be on an
+  observable output: a return value, a visible effect, a message or record
+  that leaves the unit. Never on internal or private state.
+
+Keep scenarios declarative, not imperative. "When the customer requests cash"
+is a behavior. "When the user clicks #submit then waits 200ms" is an
+implementation procedure that will break on every UI change without the
+behavior changing. Declarative scenarios survive refactoring; imperative ones
+do not.
+
+## Outside-in: the double loop
+
+BDD-framed TDD is two nested loops:
+
+- **Outer loop** — a failing acceptance test for a user-observable behavior at
+  a system boundary, written from the perspective of someone using the system.
+  Slow loop (a feature's worth of work). It goes green only when every inner
+  behavior is implemented with real code.
+- **Inner loop** — ordinary red-green-refactor that makes the acceptance test
+  progressively pass.
+
+The procedure for a user-facing item: write the failing acceptance test,
+identify the entry point that gets called first, start implementing it, and
+when it needs a collaborator, introduce that collaborator as a test double at
+the call site. The collaborator's interface is *discovered by what the caller
+needs*, not designed up front. When the entry point's test passes, drop down
+and make the next mocked collaborator real. The acceptance test passing last,
+against real implementations, is the proof no collaborator was forgotten.
+
+Not every list item is user-facing. A pure utility or an internal algorithm has
+no meaningful acceptance boundary — the inner loop alone is correct there.
+Making the outer loop conditional on whether the behavior is user-observable is
+correct scaling, not a shortcut.
+
+## Test doubles: mock commands, stub queries
+
+The five doubles (Meszaros, via Fowler): dummy (passed, never used), fake
+(working but shortcut implementation), stub (canned answers to queries), spy
+(a stub that records calls), mock (pre-programmed with expectations that form a
+specification). Only mocks insist on behavior verification; the rest use state
+verification.
+
+The working rule for the loop:
+
+- **Stub a query.** The collaborator only feeds data the unit needs to produce
+  its outcome. Assert on the resulting state/output, not on the call.
+- **Mock a command / required collaboration.** The interaction *is* the
+  behavior under test (the unit must tell a collaborator to do something).
+  Assert the interaction.
+
+Default to real objects unless using the real thing is awkward (the classicist
+default). Over-mocking couples the test to the implementation: mockist tests
+break on refactor even when behavior did not change, because they specify exact
+calls and parameters that are not the behavior under test. If a mock is only
+there to feed a value, it should have been a stub. If exact call order or
+parameters are not the behavior, do not assert on them.
+
+## BDD-flavored failure modes
+
+- **Gherkin/test that describes UI mechanics** ("clicks the button, fills the
+  field") instead of behavior. Rewrite functionally: name the intent, not the
+  clicks.
+- **Asserting internal state** in the Then. Assert observable output only.
+- **Imperative scenario** that encodes how instead of what. Make it
+  declarative; logic changes should touch step definitions, not scenario text.
+- **Testing the mock instead of the behavior** — asserting call mechanics that
+  are not the behavior under test, so the test breaks on refactor with no
+  behavior change. Demote the mock to a stub, or assert the observable outcome
+  instead.
+- **Back-filling scenarios from code.** Scenarios come from concrete examples
+  of intended behavior, decided before the code, not reverse-engineered from
+  what was written.
+
+## Sources
+
+- Dan North, "Introducing BDD" (dannorth.net): behaviour over "test", the
+  should-sentence template, "should it? really?", "what's the next most
+  important thing the system doesn't do?".
+- Martin Fowler, "GivenWhenThen", "Mocks Aren't Stubs", "UnitTest"
+  (martinfowler.com): GWT mapped to Arrange-Act-Assert; stub vs mock; classicist
+  vs mockist; coupling risk of over-mocking.
+- Steve Freeman & Nat Pryce, *Growing Object-Oriented Software, Guided by
+  Tests*: the outer/inner double loop, discovering interfaces via mocks at the
+  call site, acceptance test green only with real implementations.
+- Cucumber Gherkin reference and "Writing better Gherkin": Given/When/Then
+  semantics, observable-output rule, declarative over imperative.
diff --git a/plugin/skills/tdd/references/failure-modes.md b/plugin/skills/tdd/references/failure-modes.md
new file mode 100644
index 0000000..1e3c105
--- /dev/null
+++ b/plugin/skills/tdd/references/failure-modes.md
@@ -0,0 +1,125 @@
+# How an Agent Fakes TDD, and the Discipline That Catches It
+
+An unguided coding agent reliably fakes TDD. The failure is rarely malice — it
+is the model optimizing for "tests are green at the end" instead of "the tests
+drove the code". Each failure mode below has a symptom you can observe and a
+gate in the SKILL body that catches it. When you feel the pull toward any of
+these, that pull is the signal the discipline exists to resist.
+
+## 1. Writing the test and the production code together
+
+**Symptom.** The test file and the production file are created or edited in the
+same step, then the suite is run once and is green on the first run. Red was
+never observed.
+
+**Why it happens.** It is faster and reads as efficient. The model "knows" the
+implementation, so writing the failing test first feels like theater.
+
+**Discipline.** The observed-failure gate. Write only the test. Run it. Paste
+the real failure output. Only then write production code. A first-run pass is a
+hard stop, not a success — diagnose why the test did not fail.
+
+## 2. Never seeing red
+
+**Symptom.** No test run is shown between writing a test and writing the code
+that satisfies it. The transcript jumps from "here is the test" to "here is the
+implementation".
+
+**Why it happens.** Running the suite mid-cycle feels like overhead when the
+outcome seems obvious.
+
+**Discipline.** Every cycle shows two pasted runner outputs at minimum: the red
+run (test fails for the intended reason) and the green run (new test passes,
+all prior tests still pass). Output is shown, never asserted from memory.
+
+## 3. Whole-feature steps
+
+**Symptom.** One cycle implements the entire feature, then a batch of tests is
+written to cover it. Or the test list is converted into many tests at once and
+then made to pass together.
+
+**Why it happens.** The model can hold the whole feature at once, so
+decomposing into one-behavior steps feels artificially slow.
+
+**Discipline.** Exactly one list item becomes exactly one test per loop. No
+more production code than that one test requires. Everything else discovered
+goes on the list, deferred. A long red bar (many failing tests at once) is the
+symptom; the cure is the one-step rule.
+
+## 4. Skipping the refactor
+
+**Symptom.** Cycles go red, green, red, green. Duplication and structure debt
+accumulate. Coding standards are never applied because the test is already
+green and the model has moved on.
+
+**Why it happens.** Green feels like done. This is the single most common way
+TDD is ruined, per Fowler.
+
+**Discipline.** Refactor is a non-skippable named phase. Either you change
+something (remove duplication, apply the standards deferred from green, conform
+to ADRs) or you state explicitly "no duplication, structure, or standards issue
+this cycle". Silence is not allowed; "green, moving on" is the failure.
+
+## 5. Asserting on implementation detail
+
+**Symptom.** Tests assert internal state, exact private call sequences, or
+specific collaborator parameters that are not the behavior under test. The
+tests pass review and then break on the next refactor with no behavior change.
+
+**Why it happens.** Mocking everything and asserting calls is mechanical and
+looks thorough.
+
+**Discipline.** Assert observable behavior through the public interface. Stub
+queries, mock only genuine required collaborations. If a refactor that changes
+no behavior breaks a test, the test was asserting implementation — that is a
+test defect, not a code defect.
+
+## 6. Applying standards while going green
+
+**Symptom.** During the green phase the model runs naming sweeps, extracts
+helpers, and reformats — adding code beyond what the one test needs.
+
+**Why it happens.** "Write clean code" is a strong prior and fires constantly.
+
+**Discipline.** Green obeys only correctness and architectural-placement
+constraints (where the code lives, which boundary it must use, which contract
+it must honor — violating these is wrong code, not deferrable mess). Stylistic
+and structural standards are the refactor hat. Wearing it during green violates
+"no more code than is sufficient to pass the test".
+
+## 7. Keeping no test list
+
+**Symptom.** Scenarios are implemented as they occur to the model, or forgotten
+entirely. Scope drifts. Speculative cases get built because they came to mind.
+
+**Why it happens.** The model holds context in the conversation instead of in
+an explicit artifact, so the list feels redundant.
+
+**Discipline.** The test list is a first-class, visible artifact. Discovered
+scenarios are appended and deferred, never implemented in the current loop.
+Speculative scenarios are deferred with a reopen trigger (YAGNI), not built.
+The list draining is the progress signal; the list ballooning past ~10 open
+items is a scope warning the skill flags and records in its summary while
+continuing autonomously, not a reason to keep silently grinding through
+unbounded scope.
+
+## 8. Refactoring into speculative abstraction
+
+**Symptom.** The refactor step introduces an interface with one implementation,
+a configuration knob no caller sets, or a generalization from a single example
+— all justified as "for future flexibility".
+
+**Why it happens.** Refactor is read as "make it sophisticated" rather than
+"remove the duplication you just made".
+
+**Discipline.** YAGNI is first-class in refactor. Duplication is a hint, not a
+command. Abstract only when two or more concrete examples force it (Rule of
+Three / Triangulate). Speculative structure is a YAGNI candidate: defer it with
+the trigger that would reopen it and tell the user. Refactor removes
+duplication; it does not add speculation.
+
+## The one check that catches most of these
+
+Before every production-code edit, you must be able to point to a specific test
+that you ran and watched fail for the intended reason in this loop. If you
+cannot, you are in one of the failure modes above. Stop and get back to red.
diff --git a/plugin/skills/tdd/references/tdd-loop.md b/plugin/skills/tdd/references/tdd-loop.md
new file mode 100644
index 0000000..a71df21
--- /dev/null
+++ b/plugin/skills/tdd/references/tdd-loop.md
@@ -0,0 +1,115 @@
+# The TDD Loop: Canon, Laws, and Gears
+
+This is the canonical reference for the red-green-refactor mechanics the `/tdd`
+skill enforces. It is the single source for the verbatim rules — the SKILL body
+links here rather than restating them.
+
+## Robert C. Martin's Three Laws of TDD
+
+Use this phrasing. It resolves the compile-vs-assertion question and the
+one-test-at-a-time question explicitly:
+
+1. You are not allowed to write any production code unless it is to make a
+   failing unit test pass.
+2. You are not allowed to write any more of a unit test than is sufficient to
+   fail; and compilation failures are failures.
+3. You are not allowed to write any more production code than is sufficient to
+   pass the one failing unit test.
+
+The laws operate second-by-second. You iterate them roughly a dozen times
+before a single test is complete. You are never more than a few unwritten lines
+from a run, and never more than one change from a green or red bar.
+
+## Canon TDD (Kent Beck)
+
+The reference loop. Deviations should be deliberate, not accidental:
+
+1. Write a list of the test scenarios you want to cover.
+2. Turn exactly one item on the list into an actual, concrete, runnable test.
+3. Change the code to make the test (and all previous tests) pass, adding items
+   to the list as you discover them.
+4. Optionally refactor to improve the implementation design.
+5. Until the list is empty, go back to step 2.
+
+The test list is a first-class artifact. It starts with every operation you
+know you need and the null/degenerate version of each. As you make tests pass,
+the implementation implies new tests and refactorings — write them on the list,
+do not implement them now. Items left when the session ends are handled
+explicitly: continue them next session, or move clearly out-of-scope
+refactorings to a "later" list. A test you can think of that might not work is
+never moved to "later".
+
+## The mantra
+
+- **Red** — write a small test that does not work, and perhaps does not even
+  compile at first.
+- **Green** — make the test work quickly, committing whatever sins are
+  necessary in the process (duplication, hard-coded values, ugliness).
+- **Refactor** — eliminate the duplication created in just getting the test to
+  work.
+
+"Committing sins" means duplication and poor structure are tolerated
+*temporarily*, to be removed in refactor. It does not mean writing code that
+violates an architectural boundary or a correctness contract — that is not a
+sin you clean up later, it is the wrong code. This is why the `/tdd` skill
+honors correctness and placement standards during green but defers stylistic
+and structural conformance to refactor.
+
+## Two hats
+
+Making a test pass and refactoring are different activities done at different
+times. "Make it run, then make it right." Mixing refactoring into the
+make-it-pass step is the classic two-hats mistake. Refactor only when every
+test is green. A structural change while a test is red is not a refactor.
+
+## The observed-failure gate
+
+The single highest-value invariant, derived from Law 1 + Law 2 + Beck's
+test-first ordering: no production-code change is permitted unless a test has
+been run and observed to fail for the intended reason in this loop. A test that
+passes the first time it is ever run was never red — that is a process
+violation signal, not a success. Diagnose it: the test does not exercise the
+behavior, or the behavior already exists.
+
+## Implementation gears (choosing step size)
+
+TDD is not blind rule-following; it is choosing step size and feedback by
+conditions. Three gears, smallest step last:
+
+- **Obvious Implementation (second gear).** You are certain how to implement
+  the operation. Type the real code. Guardrail: if you start getting surprised
+  by red bars, downshift immediately and take smaller steps.
+- **Fake It (small step).** You have a broken test and are not certain. Return
+  a constant. Once green, generalize the constant into an expression. The
+  duplication between test and code is removed in refactor — this is why Fake
+  It does not violate "no unneeded code".
+- **Triangulate (smallest step).** You are really unsure of the right
+  abstraction. Force it: add a second concrete example with different values so
+  one parameter or branch cannot satisfy both, and let the two examples drive
+  the generalization. Use only when genuinely unsure; otherwise Obvious
+  Implementation or Fake It.
+
+The tougher or less familiar the problem, the smaller the step. Reserve Obvious
+Implementation for genuinely simple, certain operations.
+
+## How much to test
+
+Test count is driven by the confidence the code needs, not by a coverage
+number. If knowledge of the implementation gives real confidence in the absence
+of a test, that test may not be worth writing. Tests that execute code but
+assert nothing meaningful are not TDD tests — start each test from the
+assertion (Assert First) and derive the expected value independently from the
+specification, never by pasting back the program's own output.
+
+## Sources
+
+- Kent Beck, *Test-Driven Development: By Example* (2002): Test List,
+  Implementation Strategies (Fake It, Triangulate, Obvious Implementation),
+  Assert First, step size and feedback.
+- Kent Beck, "Canon TDD" (tidyfirst.substack.com): the five-step loop and the
+  named mistakes (two-hats, batch-converting the list, abstracting too soon).
+- Robert C. Martin, "The Three Rules of TDD" and "The Cycles of TDD"
+  (blog.cleancoder.com): the Three Laws and the nested cycle timescales.
+- Martin Fowler, "Test Driven Development" and "Self Testing Code"
+  (martinfowler.com): red-green-refactor summary; refactor is the most-skipped
+  step; run the whole suite frequently.
diff --git a/plugin/skills/tdd/scripts/detect-tdd-context.sh b/plugin/skills/tdd/scripts/detect-tdd-context.sh
new file mode 100755
index 0000000..efc82bd
--- /dev/null
+++ b/plugin/skills/tdd/scripts/detect-tdd-context.sh
@@ -0,0 +1,95 @@
+#!/usr/bin/env bash
+# Detect git state and infer test/lint/build commands from project manifests.
+# Run ONCE in Step 1, only when CLAUDE.md / project-discovery.md did not
+# resolve the commands. Per-cycle test runs do NOT use this script — they call
+# the resolved test command directly so the loop is not interrupted by a
+# per-invocation approval every cycle.
+#
+# Inference is a best-effort fallback. Emitted commands are suggestions for the
+# user to confirm, not authority. Output is line-oriented key: value pairs.
+
+# --- git state ---------------------------------------------------------------
+if command -v git &>/dev/null && git rev-parse --is-inside-work-tree &>/dev/null; then
+  echo "git-available: true"
+  BRANCH=$(git branch --show-current)
+  echo "branch: ${BRANCH:-none}"
+  if git symbolic-ref --short refs/remotes/origin/HEAD &>/dev/null; then
+    echo "default-branch: $(git symbolic-ref --short refs/remotes/origin/HEAD)"
+  else
+    echo "default-branch: none"
+  fi
+else
+  echo "git-available: false"
+  echo "branch: none"
+  echo "default-branch: none"
+fi
+
+# --- manifest-inferred commands ----------------------------------------------
+# First match wins per category. These are suggestions; the user confirms.
+TEST_CMD=""
+LINT_CMD=""
+BUILD_CMD=""
+
+if [ -f package.json ]; then
+  echo "manifest: package.json"
+  grep -q '"test"[[:space:]]*:' package.json && TEST_CMD="npm test"
+  grep -q '"lint"[[:space:]]*:' package.json && LINT_CMD="npm run lint"
+  grep -q '"build"[[:space:]]*:' package.json && BUILD_CMD="npm run build"
+fi
+
+if [ -z "$TEST_CMD" ] && { [ -f pyproject.toml ] || [ -f pytest.ini ] || [ -f setup.cfg ] || [ -f tox.ini ]; }; then
+  echo "manifest: python (pyproject/pytest)"
+  TEST_CMD="pytest"
+fi
+
+if [ -z "$TEST_CMD" ] && [ -f go.mod ]; then
+  echo "manifest: go.mod"
+  TEST_CMD="go test ./..."
+  BUILD_CMD="go build ./..."
+fi
+
+if [ -z "$TEST_CMD" ] && [ -f Cargo.toml ]; then
+  echo "manifest: Cargo.toml"
+  TEST_CMD="cargo test"
+  BUILD_CMD="cargo build"
+fi
+
+if [ -z "$TEST_CMD" ] && { [ -f Gemfile ] || [ -f Rakefile ] || [ -d spec ]; }; then
+  echo "manifest: ruby (Gemfile/Rakefile)"
+  if [ -d spec ]; then TEST_CMD="bundle exec rspec"; else TEST_CMD="bundle exec rake test"; fi
+fi
+
+if [ -z "$TEST_CMD" ] && [ -f mix.exs ]; then
+  echo "manifest: mix.exs"
+  TEST_CMD="mix test"
+fi
+
+if [ -z "$TEST_CMD" ] && { [ -f pom.xml ]; }; then
+  echo "manifest: pom.xml"
+  TEST_CMD="mvn test"
+  BUILD_CMD="mvn package"
+fi
+
+if [ -z "$TEST_CMD" ] && { [ -f build.gradle ] || [ -f build.gradle.kts ]; }; then
+  echo "manifest: gradle"
+  TEST_CMD="gradle test"
+  BUILD_CMD="gradle build"
+fi
+
+if [ -z "$TEST_CMD" ] && [ -n "$(find . -maxdepth 3 -name '*.csproj' -print -quit 2>/dev/null)" ]; then
+  echo "manifest: dotnet (.csproj)"
+  TEST_CMD="dotnet test"
+  BUILD_CMD="dotnet build"
+fi
+
+# A Makefile with a test target is a common override regardless of language.
+if [ -f Makefile ] && grep -qE '^test:' Makefile; then
+  echo "manifest: Makefile (test target)"
+  [ -z "$TEST_CMD" ] && TEST_CMD="make test"
+  grep -qE '^lint:' Makefile && [ -z "$LINT_CMD" ] && LINT_CMD="make lint"
+  grep -qE '^build:' Makefile && [ -z "$BUILD_CMD" ] && BUILD_CMD="make build"
+fi
+
+echo "inferred-test-command: ${TEST_CMD:-none}"
+echo "inferred-lint-command: ${LINT_CMD:-none}"
+echo "inferred-build-command: ${BUILD_CMD:-none}"
diff --git a/plugin/skills/test-planning/SKILL.md b/plugin/skills/test-planning/SKILL.md
index 7fcb11b..a860478 100644
--- a/plugin/skills/test-planning/SKILL.md
+++ b/plugin/skills/test-planning/SKILL.md
@@ -5,7 +5,8 @@ description: >
   edge cases. Use when you need to create, generate, or draft a test plan for a
   branch, need to analyze test coverage, or need to identify what tests to write
   for specific files or directories. Does not write test code — produces a plan
-  document only. Does not refine or iterate on existing plans — use
+  document only; use tdd to implement behavior test-first through a
+  red-green-refactor loop. Does not refine or iterate on existing plans — use
   iterative-plan-review to improve a previously drafted work plan. Does not review
   code quality, security, or style — use code-review for full code review.
   Does not evaluate architectural testability or structural coupling — use