diff --git a/CHANGELOG.md b/CHANGELOG.md
index 419952c..002f4f1 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,55 @@
 # Han Release Notes
+## v2.3.0
+
+The `/code-review` skill is recalibrated so its first pass produces the output the user previously got only by running a manual second-pass reclassification: severity inflation is removed at the structural level, user-provided focus areas and branch-level context reach every dispatched sub-agent, and contradictory same-file findings are detected internally rather than landing unflagged for the human to adjudicate.
+
+### Calibration
+
+- The agent-finding classification rubric in `plugin/skills/code-review/references/agent-finding-classification.md` no longer carries a "Most findings land here" WARN floor across seven of the nine agent rubrics. The rubric defines each severity; size-based demotion is governed by `SKILL.md` Step 3.3, the new authoritative home.
+- `SKILL.md` Step 3.3 is now the single source of truth for size-based demotion. The Review Constraints rule for manual findings (line 24), the Step 7.2 demotion gate for agent findings, the size-aware rubric, and the YAGNI two-pass procedure all reference Step 3.3 by name rather than restating its content.
+- `SKILL.md` Step 7 is restructured into three numbered sub-steps. 7.1 reads agent output; 7.2 applies the merged reachability phrase-match demotion gate (CRIT → WARN → SUGG → omitted) when a finding's rationale contains `theoretical`, `hypothetical`, `defense-in-depth`, `effectively impossible`, `in case the upstream`, `could happen`, `should never happen`, or `edge case that does not occur`; 7.3 classifies the surviving findings using the size-aware rubric. Security findings are exempt from the gate because the security agent's evidence standard already requires a demonstrated exploit path.
+
+### Context plumbing
+
+- New `Step 1.5: Load Branch Context` runs after Step 1 (Mode A and Mode B only).
It attempts to load the PR description via `gh pr view`, local `pr-body` files, branch commit messages, and an implementation plan from the planning directory (resolved via the `plans:` key in CLAUDE.md or by Glob fallback). The loaded summary binds to `$branch_context`. When nothing loads, the skill warns once and binds `$branch_context` to `none provided`.
+- `$focus_areas` and `$branch_context` are explicit named bindings. Step 1 binds the user's free-form argument to `$focus_areas` (defaulting to `none provided` when empty); Step 1.5 binds the loader output to `$branch_context`. Every Step 3.5 agent prompt includes both bindings verbatim so the agents can deprioritize work the team has already deferred or resolved.
+- `Bash(gh *)` is added to the skill's `allowed-tools` frontmatter so Step 1.5 can call `gh pr view`.
+
+### Per-agent dispatcher tailoring at Step 3.5
+
+- `structural-analyst` and `behavioral-analyst` receive a default-SUGG dispatcher directive: every finding starts at SUGG; escalation to WARN or CRIT requires the change to actively introduce or worsen the issue. The agents' general behavior outside `/code-review` is unchanged.
+- `junior-developer` receives a file-list scoping directive: outward reads are for context only; findings must concern code on the scoped file list. The agent's general behavior outside `/code-review` is unchanged.
+- `edge-case-explorer` receives a narrower file-list directive that preserves Protocol 1's caller-read pattern: callers can be read as evidence, but the failure-mode target of every finding stays on the file list.
+
+### YAGNI two-pass procedure
+
+- `references/review-checklist.md`, the Step 3.3 calibration directive's YAGNI block, and the Review Constraints YAGNI rule are all rewritten to run YAGNI in two passes: Pass 1 evidence test against `yagni-rule.md` Gate 1, then Pass 2 named anti-pattern match.
Each YAGNI finding's body names the failing evidence type, the matched anti-pattern, and the simpler form considered. The YAGNI section's verbatim opening statement is preserved. +- In Mode B (uncommitted changes) and Mode C (no git), the YAGNI checklist is skipped unless the user explicitly requests it via `$focus_areas`, since the diff signal that separates introduced code from pre-existing code is absent. + +### Self-consistency check + +- New `Step 9.0: Self-consistency check` runs before structural verification. An extraction pass collects `{task-id, file-path, line-range, recommended-action-summary}` tuples for every finding, then a comparison pass flags overlapping-line-range pairs whose recommendations prescribe opposite actions on the same code. Both findings are demoted by one severity and each receives a `Tension with {other-task-id}:` note for the human reviewer. Cross-file semantic contradictions are out of scope. + +### Premise verification before standards-compliance findings + +- Step 5 now requires reading at least one architectural file in the codebase that demonstrates a standard's premise before raising a "violates standard X" finding. When the file does not confirm the premise (e.g., the standard assumes SPA-style company switching but the codebase uses full-page redirects), the finding is omitted with a logged note. The "infer the premise from the standard's own examples" path is now a reason to omit, not a forward path to raise. + +### Documentation + +- [`docs/skills/code-review.md`](./docs/skills/code-review.md) is updated to mirror the new step structure (Step 1.5, the Step 7 sub-steps, Step 9.0), the per-agent dispatcher tailoring, the size-based demotion model, the YAGNI two-pass procedure, the full agent task ID format set, and the new YAGNI section in the output description. 
+- The four affected agent docs ([`docs/agents/structural-analyst.md`](./docs/agents/structural-analyst.md), [`docs/agents/behavioral-analyst.md`](./docs/agents/behavioral-analyst.md), [`docs/agents/junior-developer.md`](./docs/agents/junior-developer.md), [`docs/agents/edge-case-explorer.md`](./docs/agents/edge-case-explorer.md)) each carry a one-paragraph note explaining the `/code-review` Step 3.5 dispatcher tailoring and confirming the agents' default behavior in other skills is unchanged. +- [`docs/yagni.md`](./docs/yagni.md) `/code-review` table row is updated to reflect the two-pass procedure and the Mode B / Mode C YAGNI skip. +- [`docs/skills/gh-pr-review.md`](./docs/skills/gh-pr-review.md) gains a Key Concept noting that the wrapped `/code-review` Step 1.5 plumbs the PR description into every agent's `$branch_context`. + +### Deferred (YAGNI) + +- A dedicated S12 mode flag for default-SUGG suppression is deferred. The size-aware rubric (Pair A) plus the merged Step 7.2 demotion gate (Pair B) plus the rewritten Review Constraints rule subsume the workaround the user has been running manually. +- A structured "directly introduced" field in agent output formats is deferred in favor of phrase-matching at Step 7.2. +- Cross-file semantic contradiction detection in Step 9.0 is deferred; only single-file overlapping-line-range contradictions are checked. +- An automated test harness, per-agent unit tests, and Mode C standalone tests are deferred. Validation runs against three real PR bundles in `tmp/gearjot-v2-web-pr-{299,307,339}/`. +- Edits to the four affected agent definition files are deferred; `/code-review`'s tailoring lives in Step 3.5 dispatcher directives so the agents remain general-purpose for other callers. 
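The Step 9.0 comparison pass described above reduces to an interval-overlap scan over the extracted tuples. A minimal sketch in Python, assuming dict-shaped tuples and inclusive line ranges; the function and field names are illustrative, not the skill's actual identifiers, and the real step flags a pair only when the two recommendations also prescribe opposite actions on the same code:

```python
def ranges_overlap(a, b):
    # Inclusive (start, end) ranges overlap when neither range
    # ends before the other begins.
    return a[0] <= b[1] and b[0] <= a[1]

def find_tensions(findings):
    # findings: extracted {task-id, file-path, line-range} tuples.
    # Returns the task-id pairs that would each receive a
    # "Tension with {other-task-id}:" note and a one-step demotion.
    pairs = []
    for i, f in enumerate(findings):
        for g in findings[i + 1:]:
            same_file = f["file_path"] == g["file_path"]
            if same_file and ranges_overlap(f["line_range"], g["line_range"]):
                pairs.append((f["task_id"], g["task_id"]))
    return pairs
```

The sketch deliberately omits the recommended-action comparison, which is a semantic judgment the skill makes in prose; cross-file semantic contradictions stay out of scope, as noted above.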
+ ## v2.2.0 The `/gap-analysis` swarm flips from opt-in to opt-out, `junior-developer` is promoted to a required swarm role at every size to run an explicit actor-perspective sweep, and `project-manager` joins the swarm at medium and large to consolidate Section 4 of the report. diff --git a/docs/agents/behavioral-analyst.md b/docs/agents/behavioral-analyst.md index 09fc522..8aa1ada 100644 --- a/docs/agents/behavioral-analyst.md +++ b/docs/agents/behavioral-analyst.md @@ -17,6 +17,7 @@ Operator documentation for the `behavioral-analyst` agent in the han plugin. Thi - **Error paths are not optional.** The agent walks try/catch blocks, error returns, and failure paths explicitly. Happy-path-only analysis is an anti-pattern. - **Implicit state counts.** Closures, module-level singletons, memoization caches, and thread-local state are flagged alongside explicit variables and databases. - **Discovers findings, does not synthesize.** Recommendations belong to `software-architect`. Risk assessment belongs to `risk-analyst`. Bug investigation belongs to `evidence-based-investigator`. +- **`/code-review` adds a default-SUGG dispatcher directive at Step 3.5.** When dispatched from `/code-review` (version 2.3.0+), the skill appends an instruction to default the severity of every finding to SUGG and escalate to WARN or CRIT only when the change actively introduces or worsens the issue. This is `/code-review`'s tailoring; the agent's general behavior outside `/code-review` is unchanged. Other callers (`/architectural-analysis`, `/investigate`) receive the agent's default skeptical posture. ## When to use it diff --git a/docs/agents/edge-case-explorer.md b/docs/agents/edge-case-explorer.md index 7851c60..7178ebb 100644 --- a/docs/agents/edge-case-explorer.md +++ b/docs/agents/edge-case-explorer.md @@ -17,6 +17,7 @@ Operator documentation for the `edge-case-explorer` agent in the han plugin. 
Thi - **Trace inputs to the immediate caller, deeper at boundaries.** Internal function-to-function chains are trusted unless a clear external-data or type-coercion signal appears. Exhaustive mode traces to origin. - **Code location per finding.** Every `EC#` cites the affected `file:line` and references the input it touches. Untraceable edge cases are dropped. - **Discovers and catalogs, does not write tests.** Output is a prioritization plan. `test-engineer` or your team writes the tests. +- **`/code-review` adds a failure-mode-target dispatcher directive at Step 3.5.** When dispatched from `/code-review` (version 2.3.0+), the skill appends an instruction that findings must ultimately trace to a failure mode in code on the scoped file list, even when callers outside the file list provide the evidence for that failure mode. The agent's Protocol 1 caller-read still applies; the file-list scope is on the failure-mode target, not the evidence source. This is `/code-review`'s tailoring; the agent's general behavior outside `/code-review` is unchanged. ## When to use it diff --git a/docs/agents/junior-developer.md b/docs/agents/junior-developer.md index 50ac1a1..025db70 100644 --- a/docs/agents/junior-developer.md +++ b/docs/agents/junior-developer.md @@ -17,6 +17,7 @@ Operator documentation for the `junior-developer` agent in the han plugin. This - **Adversarial toward the artifact, never toward people.** Every clarifying question traces back to a location in the artifact, the conversation, or the codebase. - **Defers to specialists.** Flags where UX, DevOps, security, architecture, or testing depth is needed and names the specialist to dispatch. This agent does not claim their expertise. - **Open Questions as first-class output.** The questions the team has not yet answered are surfaced before specialists are dispatched so the specialists can aim their review. 
+- **`/code-review` adds a file-list scoping dispatcher directive at Step 3.5.** When dispatched from `/code-review` (version 2.3.0+), the skill appends an instruction that outward reads (adjacent code, callers) are for context only and findings must concern code on the scoped file list. A finding about code outside the file list is permitted only when it directly demonstrates that the changed code on the file list cannot be safely interpreted without the out-of-scope context. This is `/code-review`'s tailoring; the agent's general behavior outside `/code-review` is unchanged. ## Summary @@ -138,7 +139,7 @@ URL: https://www.edge.org/conversation/daniel_kahneman-adversarial-collaboration ### Hunt and Thomas: The Pragmatic Programmer (Rubber-Duck Debugging) -Andy Hunt and Dave Thomas introduced the "rubber duck" practice: explaining a problem out loud in plain language to surface the gaps in your own reasoning. The agent's Plain-Language Reframing protocol (Protocol 7) is the rubber duck applied to plans, designs, and standards documents. Restating the artifact in the thirty-second-whiteboard version often exposes the load-bearing jargon, the missing step, or the hidden assumption that the author could not see. +Andy Hunt and Dave Thomas introduced the "rubber duck" practice: explaining a problem out loud in plain language to surface the gaps in your own reasoning. The agent's Plain-Language Reframing protocol (Protocol 8) is the rubber duck applied to plans, designs, and standards documents. Restating the artifact in the thirty-second-whiteboard version often exposes the load-bearing jargon, the missing step, or the hidden assumption that the author could not see. 
URL: https://pragprog.com/titles/tpp20/the-pragmatic-programmer-20th-anniversary-edition/ diff --git a/docs/agents/structural-analyst.md b/docs/agents/structural-analyst.md index e629aa8..a43231a 100644 --- a/docs/agents/structural-analyst.md +++ b/docs/agents/structural-analyst.md @@ -17,6 +17,7 @@ Operator documentation for the `structural-analyst` agent in the han plugin. Thi - **Coupling has texture.** The agent distinguishes afferent (who depends on this?) from efferent (what does this depend on?), and stable-dependency from volatile-dependency, rather than counting imports as a single number. - **Negative results are valuable.** When a dimension surfaces no issues, the agent says so explicitly. *"Well-structured"* is a finding too. - **Discovers findings, does not synthesize.** Recommendations belong to `software-architect`. Risk assessment belongs to `risk-analyst`. +- **`/code-review` adds a default-SUGG dispatcher directive at Step 3.5.** When dispatched from `/code-review` (version 2.3.0+), the skill appends an instruction to default the severity of every finding to SUGG and escalate to WARN or CRIT only when the change actively introduces or worsens the issue. This is `/code-review`'s tailoring; the agent's general behavior outside `/code-review` is unchanged. Other callers (such as `/architectural-analysis`) receive the agent's default skeptical posture. ## When to use it diff --git a/docs/plans/code-review-guardrails/artifacts/.discovery-notes.md b/docs/plans/code-review-guardrails/artifacts/.discovery-notes.md new file mode 100644 index 0000000..81a8c7a --- /dev/null +++ b/docs/plans/code-review-guardrails/artifacts/.discovery-notes.md @@ -0,0 +1,92 @@ +# Discovery Notes: Code-Review Guardrails Implementation + +These notes are the single source of truth for project context for every specialist engaged in this implementation plan. Read these first; do not re-grep for items already documented here. 
+ +## Tech stack and project type + +- **Project type:** Claude Code plugin (`han`), distributed via Test Double marketplace. +- **Implementation surface:** Markdown files with YAML frontmatter only. No compiled code, no JS/TS, no runtime. No test runner, no build step, no lint. "Implementation" here means edits to Markdown skill bodies, agent definitions, and reference files. +- **Repository layout:** `plugin/` is the shipped plugin; `docs/` is canonical operator-facing documentation. See `CLAUDE.md` for the full map. +- **Version:** 2.2.0 (`CHANGELOG.md`). + +## ADRs + +- **None found.** No `docs/adr/`, no `docs/architecture/decisions/`, no `*adr*.md` files anywhere in the repo. Recorded as a gap. + +## Coding standards + +- **`docs/writing-voice.md`** — canonical voice profile every documentation file must follow. No em-dashes. Direct second person. Plainspoken mentor tone. No flattery, no hype, no "production-ready"-style boilerplate. Named voice violations to avoid. +- **`docs/yagni.md`** — operator-facing explanation of the evidence-based YAGNI rule. +- **`plugin/references/yagni-rule.md`** — the canonical YAGNI rule referenced by skills and agents. Defines two gates (evidence test, simpler-version test) and the named anti-patterns. +- **`docs/concepts.md`** — skill-vs-agent mental model. Required reading for understanding what a "skill" is versus an "agent". +- **`docs/sizing.md`** — small/medium/large dispatch model used by swarming skills. + +## Authoring guidance (relevant to this work) + +- **`docs/guidance/agent-building-guidelines/`** — five files: `agent-domain-focus.md`, `agent-external-files.md`, `agent-model-selection.md`, `graceful-degradation.md`, `multi-agent-economics.md`. Agents that change other agents' behavior need to honor these. 
+- **`docs/guidance/skill-building-guidance/`** — 22 files covering description frontmatter, progressive disclosure, context hygiene, bash permissions, script execution, naming conventions, success-criteria-and-testing, troubleshooting, and skill-composition. The `success-criteria-and-testing.md` file is directly relevant: the investigation's "what the investigation does not cover" section explicitly calls for a test plan, and that test plan must follow this guidance.
+
+## Code touch points
+
+The investigation names the exact files to edit. Each touch point is paired with the long-form doc that mirrors it under `docs/` — both must be kept in sync per the project map.
+
+| Plugin file (implementation) | Mirror under `docs/` (operator-facing) | Solutions touching it |
+|---|---|---|
+| `plugin/skills/code-review/SKILL.md` (301 lines) | `docs/skills/code-review.md` | S2, S5, S6, S7, S8, S9, S10, S11, S12, S13 |
+| `plugin/skills/code-review/references/agent-finding-classification.md` (81 lines) | (no mirror — reference-only) | S1 |
+| `plugin/skills/code-review/references/review-checklist.md` (78 lines) | (no mirror — reference-only) | S7 |
+| `plugin/skills/code-review/references/template.md` (99 lines) | (no mirror — reference-only) | (no change required) |
+| `plugin/agents/structural-analyst.md` (94 lines) | `docs/agents/structural-analyst.md` | S3 |
+| `plugin/agents/behavioral-analyst.md` (98 lines) | `docs/agents/behavioral-analyst.md` | S3 |
+| `plugin/agents/junior-developer.md` (345 lines) | `docs/agents/junior-developer.md` | S4 |
+| `plugin/agents/edge-case-explorer.md` (217 lines) | `docs/agents/edge-case-explorer.md` | S4 |
+
+## Existing convention precedents
+
+- **Calibration directive lives only in `SKILL.md` Step 3.3.** Confirmed by reading. The directive's text is reproduced verbatim into agent prompts at Step 3.5; it is not referenced by `agent-finding-classification.md`.
+- **"Most findings land here" floor.** Confirmed in `agent-finding-classification.md` lines 6, 14, 34, 42, 50, 58, 68: seven of the nine agent rubrics (test-engineer, edge-case-explorer, structural-analyst, behavioral-analyst, concurrency-analyst, data-engineer, and devops-engineer) carry the phrase; the security rubric does not use it, and the junior-developer rubric uses "SUGG" as its expected landing zone.
+- **Review-wide "prefer the higher severity" rule** at `SKILL.md` line 24, in the Review Constraints section. Governs Steps 4–6 manual review.
+- **`SKILL.md` already has `Step 3.5` agent prompts** that include the calibration directive but no `{focus areas}` or `{user context}` placeholders. Confirmed.
+- **"Skeptic-default" rules** at `structural-analyst.md:87,91` and `behavioral-analyst.md:90,95` — exact text matches the investigation. The investigation S3 explicitly preserves the "include when in doubt" rule but lowers the default output severity.
+- **`junior-developer.md` Protocol 4** at line 117 includes "patterns in code adjacent to what the artifact will change". This is the adjacent-read protocol S4 targets.
+- **`edge-case-explorer.md` Protocol 1** at line 37 instructs "use Grep to search for every call site of the target code's public functions. Read the callers". This is the callers protocol S4 targets.
+- **Mode A / Mode B / Mode C** are first-class concepts in `SKILL.md` Step 1's "Detect review context" block. S11 modifies behavior conditional on mode.
+- **`SKILL.md` Step 1** notes focus areas "for use in Step 4". The note appears at the end of the Step 1 block; S5 plumbs this through to Step 3.5 and S6 adds Step 1.5 to load PR/branch context.
+- **Long-form docs mirror the plugin.** `docs/skills/code-review.md` and the four affected `docs/agents/*.md` files exist and must be updated alongside the plugin changes (CLAUDE.md: "Long-form docs in `docs/skills/{name}.md` and `docs/agents/{name}.md` are the canonical operator-facing source for every skill and every agent"). +- **CHANGELOG.md** records version bumps. A change of this scope (skill behavior + four agent edits) is a minor version bump under `docs/guidance/semantic-versioning.md` (not read in full here; specialists may read on demand). + +## Recent activity in target files (last 90 days) + +`git log --since="90 days ago" --name-only` returns ONLY the files this plan will edit: + +- `plugin/skills/code-review/SKILL.md` +- `plugin/skills/code-review/references/agent-finding-classification.md` +- `plugin/skills/code-review/references/review-checklist.md` +- `plugin/skills/code-review/references/template.md` +- `plugin/skills/code-review/scripts/detect-review-context.sh` +- `plugin/agents/structural-analyst.md` +- `plugin/agents/behavioral-analyst.md` +- `plugin/agents/junior-developer.md` +- `plugin/agents/edge-case-explorer.md` + +This is recent churn on the exact surface the plan changes. The skill and its agents are under active iteration; no co-edits to unrelated systems are expected. + +## Existing planning artifacts + +- **`docs/plans/code-review-guardrails/investigation.md`** is the source spec. It already names all 13 solutions (S1–S13) with file locations and the C# causes each solution addresses. +- **`docs/plans/code-review-guardrails/artifacts/`** contains four evidence files (`pr-299-evidence.md`, `pr-307-evidence.md`, `pr-339-evidence.md`, `skill-behavioral-trace.md`) that ground every cause and solution. +- **No `feature-specification.md` produced by `plan-a-feature` exists.** The investigation is serving as the source spec. No `decision-log.md`, `team-findings.md`, or `feature-technical-notes.md` from a prior `plan-a-feature` run. 
Per Step 1 of `plan-implementation`, this means T#-related sentences are omitted from agent briefs. +- **No prior implementation plan exists in this folder.** This run creates the first. + +## Explicit gaps (things searched for and not found) + +- **No ADRs anywhere in the repo.** Specialists should not cite ADR violations as evidence for this plan; ADR-shaped concerns route to documentation-compliance instead, or get noted as a Deferred (YAGNI) item if no concrete decision requires recording. +- **No `feature-specification.md` for this work.** The investigation document is the source spec; this is acknowledged in the plan's `Source Specification` section. +- **No `AGENTS.md` at the repo root.** The skill-and-agent guidance lives entirely under `docs/guidance/`. +- **No automated test suite for the plugin.** No way to assert behavior automatically; verification happens by re-running `/code-review` against the three sample PR bundles (PR 299, PR 307, PR 339), per the investigation's "what the investigation does not cover" section. +- **No `feature-technical-notes.md`.** No committed mechanics from a prior spec stage — all mechanics in the investigation are open for the implementation plan to shape. + +## Other relevant context + +- **Counts to verify when editing indexes** (from CLAUDE.md): 21 agents in `plugin/agents/`, 15 skills in `plugin/skills/`, 21 long-form agent docs in `docs/agents/`, 15 long-form skill docs in `docs/skills/`. No counts change in this plan; no index files (`docs/skills/README.md`, `docs/agents/README.md`) need editing. +- **Derivative skill `gh-pr-review`** depends on `/code-review`'s output. The investigation calls out the blast radius is "contained to the code-review pipeline and its derivative `gh-pr-review`". The `gh-pr-review` skill is not edited directly by any solution, but its behavior changes transitively. 
diff --git a/docs/plans/code-review-guardrails/artifacts/implementation-decision-log.md b/docs/plans/code-review-guardrails/artifacts/implementation-decision-log.md new file mode 100644 index 0000000..394e578 --- /dev/null +++ b/docs/plans/code-review-guardrails/artifacts/implementation-decision-log.md @@ -0,0 +1,297 @@ +# Implementation Decision Log: Code-Review Skill Guardrails + +This file records every implementation decision committed while planning the code-review guardrails work. Behavioral and implementation statements live in [../feature-implementation-plan.md](../feature-implementation-plan.md); this file captures the question, rationale, evidence, and rejected alternatives for each decision. Round-by-round history lives in [implementation-iteration-history.md](implementation-iteration-history.md). + +The investigation (`../investigation.md`) already committed S1–S13 as solutions; those commitments are recorded below as trivial decisions. The full decisions below concern *how* to implement those commitments: where authoritative rules live, how Step 7 is structured, how merged sub-steps fire, what scope S3/S4 take, sequencing constraints, scope additions for docs and versioning, and the named-binding plumbing the agents require. + +## Trivial decisions + +- D-19: Commit S1 (size-aware rubric in `agent-finding-classification.md`), replace the "Most findings land here" WARN floor in seven rubrics with size-aware language; investigation S1 already commits the change. Referenced in plan: Implementation Approach (Runtime Behavior), Decomposition and Sequencing. +- D-20: Commit S5 (add `{focus areas}` and `{branch context}` placeholders to every Step 3.5 agent prompt template), investigation S5 already commits the change. Referenced in plan: Implementation Approach (Runtime Behavior), Decomposition and Sequencing. +- D-21: Commit S11 (document Mode B and Mode C scope limitations in Step 4), investigation S11 already commits the change. 
Referenced in plan: Implementation Approach (Runtime Behavior), Decomposition and Sequencing. + +## Full decisions + +### D-1: Step 3.3 is the authoritative home for size-based demotion + +- **Question:** Where does size-based demotion live so that every consumer site references one source of truth instead of restating the rule with drift? +- **Decision:** Step 3.3 of `plugin/skills/code-review/SKILL.md` is the single authoritative source for the size-based demotion rules. The rewritten line 24 (S13), the new Step 7 sub-steps (S2 merged with S10), the rewritten rubric sections in `agent-finding-classification.md` (S1), and the new YAGNI two-pass procedure (S7) all reference Step 3.3 by name rather than restating its content. +- **Rationale:** The investigation's central diagnosis (C1, C2, C12) is that calibration is defined once and bypassed everywhere else. Multiple authoritative sites is the structural failure; centralizing on Step 3.3 fixes the cause rather than papering over its symptoms. Centralization also makes future calibration tuning a one-file edit. +- **Evidence:** SA-1; investigation C1, C2, C12 (`../investigation.md`); `plugin/skills/code-review/SKILL.md` Step 3.3 already carries the directive text. +- **Rejected alternatives:** + - Restate the size-based rules at each consumer site (line 24, rubric, Step 7, YAGNI). Rejected because this reproduces the failure mode the investigation documents (drift between sites); evidence: investigation C2 ("Step 7 reclassifies agent output without applying Step 3.3's size-based demotion") shows the drift is the active bug. + - Move the authoritative home to `agent-finding-classification.md`. Rejected because the rubric is consulted only at Step 7, but the directive must also govern line 24 and the agent prompts at Step 3.5; Step 3.3 is already in scope for both. +- **Specialist owner:** `software-architect`. 
+- **Revisit criterion:** A future change asks consumer sites to encode different size logic from Step 3.3, in which case the rule has stopped being authoritative and the centralization should be re-examined. +- **Dissent (if any):** None. +- **Driven by rounds:** R1. +- **Dependent decisions:** D-2, D-3, D-11, D-15. +- **Referenced in plan:** Implementation Approach (Architecture and Integration Points, Runtime Behavior), Decomposition and Sequencing. + +### D-2: Step 7 takes a numbered sub-step structure (7.1 read, 7.2 merged demote, 7.3 rubric) + +- **Question:** Where in Step 7 do the new mechanics from S2, S10, and S1 land, and in what order? +- **Decision:** Step 7 is rewritten as three numbered sub-steps. **7.1 Read agent output files** preserves the current behavior of loading each agent's findings. **7.2 Apply size-based and reachability demotion** runs the merged S2 + S10 phrase-match gate (see D-3) against every finding before classification. **7.3 Classify with the rubric** then applies the size-aware rubric from `agent-finding-classification.md` (D-19 / S1). Original investigation language separated S2 from S10; this plan merges them per D-3 and collapses two sub-steps into one. +- **Rationale:** SA-2 named the explicit numbering as a precondition for S2 and S10 to land without ambiguity. The merge of 7.2 and 7.3 (originally proposed as separate sub-steps) tracks BA-4's finding that S2 and S10 share a phrase-list demotion signal and should not be two sequential demote passes that risk double-demotion (V5 in the investigation's adversarial validation). +- **Evidence:** SA-2; BA-4; investigation's "Note on the S1 and S2 interaction" warning about double-demotion. +- **Rejected alternatives:** + - Leave Step 7 as a single unnumbered block with the new sub-steps inserted inline. Rejected because the order of "demote then classify" versus "classify then demote" determines the final severity and must be explicit (SA-2 evidence). 
+ - Keep S2 and S10 as separate sub-steps 7.2 and 7.3, with the rubric becoming 7.4. Rejected because BA-4 demonstrated that both signals demote on rationale-phrase matches and a single merged gate satisfies both with no double-demotion risk. +- **Specialist owner:** `software-architect`. +- **Revisit criterion:** Post-ship validation against PR 299 shows S2 and S10 needed to fire on different signals after all, requiring the merged gate to split back into two sub-steps. +- **Dissent (if any):** None. +- **Driven by rounds:** R1. +- **Dependent decisions:** D-3. +- **Referenced in plan:** Implementation Approach (Runtime Behavior), Decomposition and Sequencing. + +### D-3: Merge S2 and S10 into a single Step 7.2 reachability phrase-match gate + +- **Question:** Should S2 fire on a "directly introduced" signal (investigation's original framing) or on a reachability phrase list (S10's signal)? +- **Decision:** S2 and S10 merge into a single Step 7.2 sub-step that demotes findings on a documented reachability phrase list. The phrase list, fixed in the SKILL.md body so it can be tuned in one place, is: `theoretical`, `hypothetical`, `defense-in-depth`, `effectively impossible`, `in case the upstream`, `could happen`, `should never happen`, `edge case that does not occur`. When a finding's rationale text contains any of these phrases, the gate demotes the finding by one severity (CRIT → WARN, WARN → SUGG, SUGG → omitted). The "directly introduced" condition from the investigation's original S2 is dropped because Step 7.3's size-aware rubric already encodes change-relatedness. +- **Rationale:** OQ-3 resolution from R1. The investigation's original "skip when directly introduced" rule (the V5 interaction note) required a structured field in agent output that no agent emits; phrase-matching is the simpler version that satisfies the same evidence (PR 299 E6, E10, E13, E14, E15, E16, all of which contain at least one of the listed phrases verbatim). 
BA-4 supplied the phrase list; BA-5 supplied the documented limitation (this is a phrase-match gate only, not a semantic reachability analyzer). +- **Evidence:** BA-4, BA-5; PR 299 evidence file confirms the phrases appear in raw agent output for the demotion targets; investigation Note after S13 (the V5 interaction warning). +- **Rejected alternatives:** + - Implement S2's "directly introduced" condition as a structured agent-output field. Rejected because no agent currently emits the field, and adding it across nine agents is a larger change with no evidence the structured form is needed (Gate 2 simpler-version test, recorded as YAGNI candidate in the plan's deferred section). + - Keep S2 and S10 as separate signals firing in sequence. Rejected because both demote on rationale text and sequential firing risks double-demotion per V5 (investigation's adversarial validation). +- **Specialist owner:** `behavioral-analyst`. +- **Revisit criterion:** Post-ship validation surfaces a finding the phrase list misses that a structured "directly introduced" field would have caught, or surfaces a false-positive demotion the phrase list cannot reject without semantic analysis. +- **Dissent (if any):** None at synthesis. The R1 dispute between SA-3 (anchor to Step 3.3 vocabulary) and BA-4 (use reachability phrase list) was resolved by the merge, which references Step 3.3 for criteria when applicable while using the phrase list as the trigger. +- **Driven by rounds:** R1. +- **Dependent decisions:** None downstream; this is the leaf decision for the merged mechanic. +- **Referenced in plan:** Implementation Approach (Runtime Behavior), Decomposition and Sequencing, RAID Log (assumption on phrase-list completeness). 
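Under D-3, the merged gate is a plain substring scan over the rationale text. A minimal sketch in Python; the phrase list is the one committed above, while the function shape, the `agent` parameter, and case-insensitive matching are illustrative assumptions (the authoritative list and wording live in the SKILL.md body):

```python
REACHABILITY_PHRASES = (
    "theoretical", "hypothetical", "defense-in-depth",
    "effectively impossible", "in case the upstream", "could happen",
    "should never happen", "edge case that does not occur",
)

# One-step demotion ladder; None means the finding is omitted.
DEMOTE = {"CRIT": "WARN", "WARN": "SUGG", "SUGG": None}

def gate(severity, rationale, agent):
    # Security findings are exempt: that agent's evidence standard
    # already requires a demonstrated exploit path.
    if agent == "security":
        return severity
    text = rationale.lower()
    if any(phrase in text for phrase in REACHABILITY_PHRASES):
        return DEMOTE[severity]
    return severity
```

As BA-5's documented limitation states, this is a phrase-match gate only, not a semantic reachability analyzer: a rationale that paraphrases "theoretical" without using a listed phrase passes through unchanged.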
+ +### D-4: S3 and S4 ship as Step 3.5 dispatcher directives, not as edits to the four agent definition files + +- **Question:** Should S3 (default-severity adjustment for structural-analyst and behavioral-analyst) and S4 (scoped-file-list constraint for junior-developer and edge-case-explorer) be implemented as edits to the four agent definition files, or as additions to the Step 3.5 dispatch prompts in SKILL.md? +- **Decision:** Both S3 and S4 land in SKILL.md Step 3.5 as additions to the relevant agent dispatch prompts. The four agent definition files (`plugin/agents/structural-analyst.md`, `plugin/agents/behavioral-analyst.md`, `plugin/agents/junior-developer.md`, `plugin/agents/edge-case-explorer.md`) are not edited. +- **Rationale:** OQ-5 resolution from R1, with consensus across SA-4, JD-007, and JD-014. The four agents are dispatched by skills outside `/code-review` (gap-analysis, plan-implementation, plan-a-feature, investigate). Edits to the agent bodies change behavior for every caller; edits to Step 3.5 dispatch prompts change behavior only for `/code-review`. The investigation's adversarial validation (V4, V8) already accepts that the agents' general behavior is correct; the bug is that `/code-review` was not tailoring the dispatch. +- **Evidence:** SA-4; JD-007; JD-014; investigation V4 ("outward reads in junior-developer and edge-case-explorer are for context, not findings, S4's framing remains correct"); investigation V8 ("lowered default severity without removing 'include when in doubt'"). +- **Rejected alternatives:** + - Edit the four agent definition files as the investigation originally proposed. Rejected because the blast radius extends beyond `/code-review` to every skill that dispatches these agents, with no evidence the other skills exhibit the same calibration failure. + - Implement S3 in agent bodies but S4 as a Step 3.5 directive. 
Rejected because the specialist consensus was symmetric across all four agents: the same scope reasoning applies to both. +- **Specialist owner:** `software-architect`. +- **Revisit criterion:** A cross-project validation run (TP-17) or a user report shows a non-`/code-review` skill exhibits the same severity-inflation or out-of-scope-read failure mode, in which case the global agent-body edit becomes the correct scope. +- **Dissent (if any):** None. +- **Driven by rounds:** R1. +- **Dependent decisions:** D-18. +- **Referenced in plan:** Implementation Approach (Architecture and Integration Points, Runtime Behavior), Decomposition and Sequencing. + +### D-5: Atomic shipping pairs and sequencing + +- **Question:** Which solutions must ship in the same commit to avoid partial-state regressions, and what is the overall sequence? +- **Decision:** Three atomic pairs are enforced. **Pair A: S1 + S13 ship together** (the rubric and the Review Constraints line are the two halves of the same size-based severity rule; shipping one without the other leaves either agent findings or manual findings unchanged). **Pair B: S2-and-S10-merged + the Step 7 sub-step structure (D-2) ship together** (the merged gate needs the numbered sub-step structure to be inserted into). **Pair C: S5 + S6 + the dispatcher-directive form of S3 and S4 (D-4) ship together** (all four touch Step 3.5 dispatch prompts, and S3 only has observable effect when S1 has shipped per BA-7, so Pair A must precede Pair C). The overall sequence is **Pair A → Pair B → Pair C → S7 → S8 → S9 → S11 → docs sync + version bump + CHANGELOG**. +- **Rationale:** SA-5 named the three atomic pairs and the precondition that "authoritative home" (D-1) and "Step 7 sub-step structure" (D-2) must be pre-decided before any edits begin. BA-7 added the S3-requires-S1 dependency. The investigation's "Implementation sequence" note (which JD-011 promoted into the plan) corroborates the leverage ordering. 
+- **Evidence:** SA-5; BA-7; investigation "Implementation sequence" note in "What the investigation does not cover". +- **Rejected alternatives:** + - Ship every solution in a single large commit. Rejected because Pair A delivers value with no dependencies on the rest, and decoupling validates the calibration fix on PR 299 before the context-forwarding changes land. + - Ship S3 and S4 before Pair A. Rejected because S3 has no observable effect until the rubric (S1) has been rewritten (BA-7). +- **Specialist owner:** `software-architect`. +- **Revisit criterion:** A pair fails validation independently, in which case the pair definition is wrong and the sequence is re-examined. +- **Dissent (if any):** None. +- **Driven by rounds:** R1. +- **Dependent decisions:** D-6, D-7. +- **Referenced in plan:** Decomposition and Sequencing. + +### D-6: Version bump is minor (2.2.0 → 2.3.0) + +- **Question:** Does this branch ship as a minor or major version bump, and is a CHANGELOG entry required? +- **Decision:** Minor bump from 2.2.0 to 2.3.0 in `plugin/.claude-plugin/plugin.json`. A CHANGELOG entry is required under `## 2.3.0` summarizing the four calibration changes (size-aware rubric, merged Step 7 demotion gate, focus-area plumbing, dispatcher directives), the docs sync, and the deferred items. +- **Rationale:** OQ-2 resolution from R1. The skill name, argument signature, and output-format structure are unchanged; only behavior is calibrated. `docs/guidance/semantic-versioning.md` reserves major bumps for backward-incompatible changes to those surfaces. +- **Evidence:** JD-002; `docs/guidance/semantic-versioning.md`; `CHANGELOG.md` precedent (the 2.2.0 entry covers behavioral changes without API change). +- **Rejected alternatives:** + - Major bump (2.2.0 → 3.0.0). Rejected because no published interface changes; users invoke `/code-review` with the same arguments and receive the same output schema. + - Patch bump (2.2.0 → 2.2.1). 
Rejected because the calibration changes are behavior-shaping and a user upgrading without reading the CHANGELOG would observe different outputs. +- **Specialist owner:** `project-manager`. +- **Revisit criterion:** A post-ship user report demonstrates the behavior change broke a workflow that depended on the prior calibration, in which case the bump should retroactively have been major. +- **Dissent (if any):** None. +- **Driven by rounds:** R1. +- **Dependent decisions:** None. +- **Referenced in plan:** Operational Readiness, Definition of Done. + +### D-7: Long-form `docs/` mirror updates are in scope + +- **Question:** Does this branch update the long-form operator-facing docs (`docs/skills/code-review.md` and the four affected `docs/agents/*.md` files), or defer them to a follow-up branch? +- **Decision:** In scope. Five long-form docs are updated alongside the plugin edits: `docs/skills/code-review.md`, `docs/agents/structural-analyst.md`, `docs/agents/behavioral-analyst.md`, `docs/agents/junior-developer.md`, `docs/agents/edge-case-explorer.md`. The agent docs receive lighter edits than originally implied because D-4 keeps the agent definitions themselves unchanged; the docs note that `/code-review` now tailors dispatch via Step 3.5 directives. +- **Rationale:** OQ-1 resolution from R1. The repo's CLAUDE.md explicitly states that "long-form docs in `docs/skills/{name}.md` and `docs/agents/{name}.md` are the canonical operator-facing source." Shipping plugin behavior without the doc mirror leaves the canonical source describing behavior that no longer matches the implementation. +- **Evidence:** JD-001; `CLAUDE.md` lines on the docs convention. +- **Rejected alternatives:** + - Defer docs to a follow-up branch. Rejected because the canonical source would be stale on `main` between the two branches, and the repo convention treats the long-form doc as authoritative for what the skill does. 
+ - Update only `docs/skills/code-review.md`, leave the agent docs unchanged. Rejected because D-4's dispatcher-directive scope is a notable shift in how `/code-review` uses the four agents and the agent docs need a one-paragraph note about it. +- **Specialist owner:** `project-manager`. +- **Revisit criterion:** A future branch demonstrates that splitting docs from plugin edits is operationally cheaper, in which case the convention should be re-examined. +- **Dissent (if any):** None. +- **Driven by rounds:** R1. +- **Dependent decisions:** None. +- **Referenced in plan:** Decomposition and Sequencing, Definition of Done. + +### D-8: Add `Bash(gh *)` to SKILL.md `allowed-tools` frontmatter + +- **Question:** Does S6's `gh pr view` call require a frontmatter change to `plugin/skills/code-review/SKILL.md`? +- **Decision:** Yes. Add `Bash(gh *)` to the `allowed-tools` field at SKILL.md line 6. The new value reads `allowed-tools: Bash(git *), Bash(gh *), Bash(make *), Bash(npm *), Read, Grep, Glob, Agent`. +- **Rationale:** OQ-4 resolution from R1. JD-013 verified by grep that the current frontmatter does not include `Bash(gh *)`; without the permission, S6's branch-context loader cannot call `gh pr view` and fails the Mode A path silently. +- **Evidence:** JD-013; `plugin/skills/code-review/SKILL.md:6` (verified during R1). +- **Rejected alternatives:** + - Skip `gh pr view` and read PR descriptions only from local `pr-body` files. Rejected because Mode A (the dominant user path) routinely has no local `pr-body` file; the live PR description is what the user wants loaded. + - Use a shell-out via `Bash(*)`. Rejected because the plugin's other skills scope tools narrowly and `Bash(*)` is a broader grant than needed. +- **Specialist owner:** `devops-engineer` (callable on dispatch); the change itself is a frontmatter edit. 
+- **Revisit criterion:** A user environment surfaces where `gh` is not installed. S6's fail-open behavior (D-9) covers that gap, so no further change is needed unless the failure mode proves silent and undiagnosable, in which case revisit. +- **Dissent (if any):** None. +- **Driven by rounds:** R1. +- **Dependent decisions:** D-9. +- **Referenced in plan:** Implementation Approach (Architecture and Integration Points), Definition of Done. + +### D-9: S6 fail-open behavior prints a visible warning when no PR or branch context loads + +- **Question:** When S6's branch-context loader cannot retrieve a PR description (no `gh` available, no PR open, no local `pr-body` file, no planning doc), does the skill proceed silently or warn the user? +- **Decision:** Fail open with a visible warning. When none of the four context sources (PR description via `gh pr view`, local `pr-body` file, commit messages, planning-directory implementation plan) returns content, Step 1.5 emits a single-line warning to the orchestrator's output: `Branch Context: no PR or planning artifact found; agents will run without branch-level context.` The skill proceeds with `$branch_context` bound to the literal string `none provided`. +- **Rationale:** BA-2 surfaced this. A silent fail-open reproduces the C5/C6 failure (context not forwarded) without telling the user the loader was tried and missed; a hard fail blocks the whole review when the loader is non-essential. +- **Evidence:** BA-2; investigation C5, C6. +- **Rejected alternatives:** + - Silent fail-open. Rejected because the user has no signal that context loading was attempted and missed; the failure mode is the same as no S6 at all. + - Hard fail. Rejected because Mode B and Mode C legitimately have no PR or planning artifact; treating that as a blocker breaks the local-review path. +- **Specialist owner:** `behavioral-analyst`. 
+- **Revisit criterion:** A user reports the warning is noisy in Mode B/C, in which case the warning can be gated on Mode A. +- **Dissent (if any):** None. +- **Driven by rounds:** R1. +- **Dependent decisions:** None. +- **Referenced in plan:** Implementation Approach (Runtime Behavior), Definition of Done. + +### D-10: S6 planning-directory lookup uses a structured CLAUDE.md key with Glob fallback + +- **Question:** How does S6 locate an implementation plan or design doc for the current branch? +- **Decision:** S6's planning lookup reads CLAUDE.md for a structured key, then falls back to a Glob search. The lookup order: (a) parse `## Project Discovery` for a `plans:` or `planning:` key naming the planning directory (e.g., `plans: docs/plans/`); (b) if no key, Glob `docs/plans/*/feature-implementation-plan.md` and `plans/*/feature-implementation-plan.md`; (c) if Glob returns multiple, match the branch name (with `-` and `_` interchangeable) against the directory names; (d) if no match, log "no planning artifact found for branch {name}" to the agent's output and proceed. +- **Rationale:** BA-8 named the structured-key-with-fallback pattern as the closest match to the existing Step 1 project-config resolution. The two-tier approach gives projects with a CLAUDE.md a deterministic answer and degrades gracefully for projects without one. +- **Evidence:** BA-8; existing Step 1 project-config resolution pattern in SKILL.md. +- **Rejected alternatives:** + - Glob only, no CLAUDE.md key. Rejected because projects with non-standard planning directories (e.g., `internal-docs/proposals/`) have no signal to the loader. + - Require the CLAUDE.md key, hard fail without it. Rejected because most repos don't yet have the key; hard fail blocks adoption. +- **Specialist owner:** `behavioral-analyst`. +- **Revisit criterion:** Three or more user reports show the Glob fallback hits wrong directories, in which case the key becomes required. +- **Dissent (if any):** None. 
+- **Driven by rounds:** R1. +- **Dependent decisions:** None. +- **Referenced in plan:** Implementation Approach (Runtime Behavior). + +### D-11: S7 updates three YAGNI-bearing locations in SKILL.md, not two + +- **Question:** Where does the rewritten YAGNI two-pass procedure land, in `review-checklist.md` only, or in additional SKILL.md sites? +- **Decision:** Three locations are updated. (1) `plugin/skills/code-review/references/review-checklist.md` YAGNI section: the bullet list becomes a two-pass procedure (evidence test, then anti-pattern check). (2) `plugin/skills/code-review/SKILL.md` Step 3.3 calibration directive: replace the existing YAGNI block with a reference to the two-pass procedure and a reference to Step 3.3's size-based demotion (D-1). (3) `plugin/skills/code-review/SKILL.md` Review Constraints section (lines 29–41): the YAGNI-related bullets are rewritten to reference the two-pass procedure rather than restating it. +- **Rationale:** JD-010 surfaced the third location (Review Constraints) that the investigation's S7 missed. Leaving it un-rewritten reproduces the "calibration is defined once and bypassed everywhere else" failure pattern that D-1 targets. +- **Evidence:** JD-010; investigation S7; `plugin/skills/code-review/SKILL.md` Review Constraints section. +- **Rejected alternatives:** + - Two-location rewrite (review-checklist.md + Step 3.3 only). Rejected because the Review Constraints section also carries YAGNI-shaped rules and would drift. + - Single-location rewrite (review-checklist.md only). Rejected because Step 3.3 governs the agent prompts at Step 3.5 and the procedure must apply there as well. +- **Specialist owner:** `junior-developer`. +- **Revisit criterion:** A future review surfaces YAGNI-shaped findings firing at one of the un-rewritten sites, indicating the rewrite missed a location. +- **Dissent (if any):** None. +- **Driven by rounds:** R1. +- **Dependent decisions:** None. 
+- **Referenced in plan:** Implementation Approach (Runtime Behavior), Decomposition and Sequencing. + +### D-12: S8 runs an extraction pass before pair comparison, scoped to overlapping line ranges only + +- **Question:** How does S8 (self-consistency check on findings) actually detect contradictory findings, and what is its scope? +- **Decision:** S8 runs in two passes inside Step 9. **Extraction pass:** for every finding, extract a tuple `{task-id, file-path, line-range, recommended-action-summary}`. **Comparison pass:** for every pair of tuples on the same `file-path` with overlapping `line-range`, check whether the two `recommended-action-summary` values prescribe opposite actions on the same code. If yes, both findings are demoted by one severity and each carries a `Tension with {other-task-id}:` note that the human must adjudicate. Scope is overlapping line ranges in a single file only; cross-file semantic contradictions are deferred (see plan's Deferred section). +- **Rationale:** BA-6 named the extraction pass as a precondition for the comparison; JD-015 named the scope limitation. The PR 339 WARN-002/WARN-003 case (the bundle's "most instructive single moment" per the investigation) falls inside the overlapping-line-range scope. +- **Evidence:** BA-6; JD-015; PR 339 E10–E13. +- **Rejected alternatives:** + - Direct pairwise comparison without extraction. Rejected because BA-6 noted the unstructured finding text resists naive comparison. + - Cross-file semantic contradiction detection. Rejected as YAGNI (Gate 1 evidence test fails; no documented incident beyond the PR 339 single-file class). +- **Specialist owner:** `behavioral-analyst`. +- **Revisit criterion:** A real review surfaces a cross-file contradictory finding pair the line-range detector misses, in which case the scope widens. +- **Dissent (if any):** None. +- **Driven by rounds:** R1. +- **Dependent decisions:** None. 
+- **Referenced in plan:** Implementation Approach (Runtime Behavior), Decomposition and Sequencing, Deferred (YAGNI). + +### D-13: S9 requires reading at least one architectural file before raising a "violates standard X" finding + +- **Question:** How is S9's premise-verification actually enforced, by inferring premises from a standard's own examples (the investigation's original framing), or by a stronger requirement? +- **Decision:** Before raising any "violates standard X" finding in Step 5, the orchestrator (or the dispatched agent for documentation-compliance work) must read at least one architectural file demonstrating the standard's premise. The file class depends on the standard's topic: an entry-point file for runtime-shape standards, a router or navigation surface for routing standards, a config file for configuration standards. The "infer from examples" fallback becomes an explicit "premise not verified; finding omitted" log line rather than a forward path to raising the finding. +- **Rationale:** OQ-7 resolution from R1. JD-012 identified that "infer from examples" is the failure mechanism, not a fix: it lets the agent fabricate a premise that does not hold in the codebase (PR 307's SPA-style company switch finding against a full-page-redirect codebase). +- **Evidence:** JD-012; PR 307 E2; PR 299 E1. +- **Rejected alternatives:** + - Keep "infer from examples" as a forward path. Rejected because PR 307's WARN-003 demonstrates this path produces false positives. + - Require reading three architectural files. Rejected because one architectural file is sufficient to confirm or deny the premise; three is operational drag without evidence the marginal files reduce false positives. +- **Specialist owner:** `software-architect`. +- **Revisit criterion:** A post-ship review still raises a standards-compliance finding against a codebase whose architecture does not match the standard's premise, despite the architectural-file read having been logged. 
+- **Dissent (if any):** None. +- **Driven by rounds:** R1. +- **Dependent decisions:** None. +- **Referenced in plan:** Implementation Approach (Runtime Behavior), Decomposition and Sequencing. + +### D-14: `$focus_areas` and `$branch_context` are explicit named bindings populated by Steps 1 and 1.5 + +- **Question:** How are focus-area text and branch-context text actually plumbed from Step 1 / Step 1.5 to Step 3.5's agent prompts? +- **Decision:** Two named bindings carry the values. `$focus_areas` is populated at Step 1 from the user's invocation arguments. `$branch_context` is populated at Step 1.5 by S6's branch-context loader. Both default to the literal string `none provided` when empty. Every Step 3.5 agent prompt template references the bindings by name in the new "Focus areas" and "PR / branch context" blocks added by S5. +- **Rationale:** BA-1 (and JD-003, JD-004 corroborating) noted that the plan as written relied on LLM context-window retention to carry values across three steps. Named bindings make the dependency explicit and survive context compaction. +- **Evidence:** BA-1; JD-003; JD-004. +- **Rejected alternatives:** + - Implicit context-window retention. Rejected because the plan cannot guarantee Step 1's notes survive to Step 3.5 across long agent dispatches. + - A single combined `$context` binding. Rejected because focus areas (user-authored) and branch context (loader-authored) have different provenance and the prompt blocks need to label them separately. +- **Specialist owner:** `behavioral-analyst`. +- **Revisit criterion:** A test run shows the bindings are still being lost across step boundaries, in which case a stronger plumbing mechanism is needed. +- **Dissent (if any):** None. +- **Driven by rounds:** R1. +- **Dependent decisions:** None. +- **Referenced in plan:** Implementation Approach (Runtime Behavior), Decomposition and Sequencing. 
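
The D-14 binding rule can be illustrated with a short sketch. The function names and exact block labels are assumptions for the illustration; the skill states the rule in SKILL.md prose, and S5 defines the real prompt blocks.

```python
# Sketch of the D-14 named bindings. "none provided" is the literal
# default from the decision; bind() and build_agent_prompt() are
# illustrative names, not functions that exist in the plugin.

def bind(value):
    """Step 1 / Step 1.5 rule: an empty value becomes 'none provided'."""
    return value.strip() if value and value.strip() else "none provided"

def build_agent_prompt(agent_brief, focus_areas=None, branch_context=None):
    """Render the two labeled blocks added to every Step 3.5 dispatch."""
    return (
        f"{agent_brief}\n\n"
        f"Focus areas: {bind(focus_areas)}\n"
        f"PR / branch context: {bind(branch_context)}\n"
    )
```

Passing both values by name into every dispatch prompt is what removes the reliance on context-window retention that BA-1 flagged.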
+ +### D-15: `{size}` value is read from Step 3.1; all five consumer sites reference Step 3.1 explicitly + +- **Question:** Where does the `{size}` variable used by the new size-aware rubric, the rewritten line 24, Step 7's sub-steps, and the YAGNI two-pass procedure come from? +- **Decision:** `{size}` is set once at Step 3.1 (size detection) and the five consumer sites that read it (the rewritten line 24, Step 3.3 directive, Step 3.5 agent prompts, Step 7.2 sub-step, and the rewritten rubric in `agent-finding-classification.md`) all reference Step 3.1 explicitly as the source. No site reads `{size}` from ambient context. +- **Rationale:** BA-3 named the single-source-of-truth requirement. Sites reading `{size}` from ambient context risk drift when the detection logic changes. +- **Evidence:** BA-3; `plugin/skills/code-review/SKILL.md` Step 3.1. +- **Rejected alternatives:** + - Implicit ambient `{size}`. Rejected because the rubric file is consulted across step boundaries and ambient context is unreliable. +- **Specialist owner:** `behavioral-analyst`. +- **Revisit criterion:** A future change moves size detection to a different step, in which case the reference is updated globally. +- **Dissent (if any):** None. +- **Driven by rounds:** R1. +- **Dependent decisions:** None. +- **Referenced in plan:** Implementation Approach (Runtime Behavior). + +### D-16: Defer S12 (default suppression for small changes) as YAGNI + +- **Question:** Does the user-facing workaround "WARN-justified, SUGG suppressed per reviewer instruction" become an explicit default mode flag (S12)? +- **Decision:** Defer S12. Once S1 + S2-merged-with-S10 + S13 ship and the rubric is size-aware, S12's behavior is subsumed: small changes already produce no Suggestions per Step 3.3, and warnings already demote on reachability phrases. The dedicated suppression flag becomes redundant. +- **Rationale:** OQ-6 resolution from R1, with consensus across SA-6, BA-9, and JD-006. 
Gate 2 simpler-version test: S1 + S2 + S13 satisfy the same evidence (PR 299 E17's "WARN-justified, SUGG suppressed" workaround), and the simpler version is "fix the calibration so the workaround is unnecessary". +- **Evidence:** SA-6, BA-9, JD-006; PR 299 E17. +- **Rejected alternatives:** + - Ship S12 as a dedicated mode flag. Rejected because the simpler version satisfies the same evidence. +- **Specialist owner:** `project-manager`. +- **Revisit criterion:** Post-ship validation against PR 299 still produces severity inflation that requires a SUGG-suppress mode flag, in which case S12 is reopened. +- **Dissent (if any):** None. +- **Driven by rounds:** R1. +- **Dependent decisions:** None. +- **Referenced in plan:** Deferred (YAGNI). + +### D-17: Em-dash strip policy on verbatim copy from the investigation + +- **Question:** The investigation document uses em-dashes throughout; can verbatim text be copied into shipped files (SKILL.md, agent prompts, classification rubric, review checklist) as-is? +- **Decision:** No. Every verbatim text copy from the investigation into a shipped file under `plugin/` or `docs/` must have em-dashes stripped per `docs/writing-voice.md`. The replacement uses commas, semicolons, parentheses, or restructured sentences depending on the local rhythm. A pre-commit visual review for em-dashes is added to the Definition of Done. +- **Rationale:** JD-009 surfaced the hazard. `docs/writing-voice.md` is the canonical voice rule; the investigation does not follow it because the investigation is not shipped. +- **Evidence:** JD-009; `docs/writing-voice.md`. +- **Rejected alternatives:** + - Allow em-dashes in shipped files when the surrounding paragraph is verbatim from the investigation. Rejected because the voice rule applies to shipped content regardless of source. +- **Specialist owner:** `project-manager`. +- **Revisit criterion:** None; the voice rule is project-wide. +- **Dissent (if any):** None. +- **Driven by rounds:** R1. 
+- **Dependent decisions:** None. +- **Referenced in plan:** Definition of Done. + +### D-18: S4's constraint is narrower for `edge-case-explorer` than for `junior-developer` + +- **Question:** When S4 (scoped-file-list constraint) lands as a Step 3.5 dispatcher directive, is the wording identical for both `junior-developer` and `edge-case-explorer`, or differentiated? +- **Decision:** Differentiated. For `junior-developer`, the dispatcher adds the unmodified investigation rule: outward reads are for context, findings must concern files on the scoped list. For `edge-case-explorer`, the dispatcher adds a narrower rule: findings must ultimately trace to a failure mode in code on the scoped file list, even when callers outside the list provide the evidence. The narrower wording for `edge-case-explorer` preserves the agent's Protocol 1 caller-read pattern while making the scope rule about the failure-mode target, not the evidence source. +- **Rationale:** JD-008 surfaced that the agent's Protocol 1 explicitly reads callers as evidence for the target code's edge cases; a flat "findings must concern files on the scoped list" rule would suppress legitimate findings traced from caller behavior to scoped-file failure modes. +- **Evidence:** JD-008; `plugin/agents/edge-case-explorer.md` Protocol 1. +- **Rejected alternatives:** + - Identical wording for both agents. Rejected because the agents' protocols are not symmetric; one reads adjacent patterns as context, the other reads callers as evidence. +- **Specialist owner:** `junior-developer`. +- **Revisit criterion:** An `edge-case-explorer` dispatch under the narrower rule still produces out-of-scope findings or suppresses legitimate ones. +- **Driven by rounds:** R1. +- **Dependent decisions:** None. +- **Referenced in plan:** Implementation Approach (Runtime Behavior), Decomposition and Sequencing. 
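
The asymmetry D-18 commits to can be sketched as a pair of scope predicates. The `Finding` record and its field names are hypothetical; in the shipped skill, findings are free text and both rules live as prose in the Step 3.5 dispatch prompts.

```python
# Sketch of why D-18 narrows the rule for edge-case-explorer.
# Finding, target_file, and evidence_files are illustrative names.
from dataclasses import dataclass, field

@dataclass
class Finding:
    target_file: str  # where the failure mode lives
    evidence_files: list[str] = field(default_factory=list)  # callers read as support

def flat_rule(finding: Finding, scoped: set[str]) -> bool:
    # The rejected identical wording: everything the finding touches must be scoped.
    return finding.target_file in scoped and all(
        f in scoped for f in finding.evidence_files
    )

def narrower_rule(finding: Finding, scoped: set[str]) -> bool:
    # D-18: off-list callers may supply evidence (Protocol 1); only the
    # failure-mode target must sit on the scoped file list.
    return finding.target_file in scoped
```

A finding whose failure mode sits on a scoped file but whose evidence came from an off-list caller survives the narrower rule and is suppressed by the flat one, which is exactly the legitimate-finding loss JD-008 flagged.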
diff --git a/docs/plans/code-review-guardrails/artifacts/implementation-iteration-history.md b/docs/plans/code-review-guardrails/artifacts/implementation-iteration-history.md new file mode 100644 index 0000000..560077f --- /dev/null +++ b/docs/plans/code-review-guardrails/artifacts/implementation-iteration-history.md @@ -0,0 +1,73 @@ +# Implementation Iteration History: Code-Review Skill Guardrails + +This file records how the implementation plan for the code-review guardrails evolved across discussion rounds. Committed decisions live in [implementation-decision-log.md](implementation-decision-log.md) and the primary plan lives in [../feature-implementation-plan.md](../feature-implementation-plan.md). + +The iteration loop is capped at two rounds for this medium-sized plan. The deterministic next-step recommendation after Round 1 was "go to synthesis" — all Open Questions were resolved via evidence, reframing, or reasonable-call user defaults under the user's "work without stopping" directive — so no Round 2 was needed. + +## R1: Parallel specialist review + +- **Specialists engaged:** software-architect, behavioral-analyst, test-engineer, junior-developer. Each received a domain-scoped brief plus a path to `artifacts/.discovery-notes.md` and was instructed to read the discovery notes first and only grep further for what the discovery notes did not cover. + +- **New input provided:** Initial source spec (`investigation.md`), discovery notes (`artifacts/.discovery-notes.md`), and the three PR evidence bundles plus the behavioral trace. R1 is the initial sweep. + +- **Claim ledger:** + +| Claim | State | Sources | Spec-maturity | +|---|---|---|---| +| Size-based demotion has no authoritative home — defined in Step 3.3 but contradicted at line 24, in the classification rubric, and in two agent bodies. 
| Evidenced | SA-1 | plan-level | +| Step 7 needs explicit numbered sub-steps (7.1 read, 7.2 demote, 7.3 reachability, 7.4 rubric) before S2 and S10 are inserted into it. | Evidenced | SA-2 | plan-level | +| S1/S2 interaction rule (when does S2's demotion fire?) is the most contested mechanic in the plan. | Disputed | SA-3 (anchor to Step 3.3 "directly introduced" vocabulary), BA-4 (use reachability phrase list, same as S10), JD-005 (any phrase-match is fragile; needs structured signal) | plan-level | +| S3 and S4 should be scoped as Step 3.5 dispatcher directives, not global agent-body edits, because the four affected agents are dispatched by other skills outside `/code-review`. | Evidenced | SA-4, JD-007, JD-008, JD-014 | plan-level | +| S1 and S13 must ship in the same commit; S2 and the Step 7 sub-step structure must ship in the same commit; S5/S6 and the dispatcher-directive form of S3/S4 must ship in the same commit. | Evidenced | SA-5 | plan-level | +| S12 ("Optional: replace mode flags with explicit defaults") is redundant after S1 + S2 + S13 ship. | Evidenced | SA-6, BA-9, JD-006 | plan-level (YAGNI candidate) | +| `$focus_areas` and `$branch_context` need explicit named bindings; the plan currently relies on LLM context-window retention across Step 1, Step 1.5, Step 3.5. | Evidenced | BA-1, JD-003, JD-004 | plan-level | +| S6's `gh pr view` call requires `Bash(gh *)` in `SKILL.md`'s `allowed-tools` frontmatter, which is not currently listed. | Evidenced | JD-013 (verified by grep — SKILL.md:6 currently reads `allowed-tools: Bash(git *), Bash(make *), Bash(npm *), Read, Grep, Glob, Agent`) | plan-level | +| S6's lookup must declare fail-open behavior explicitly and warn the user when no context could be loaded. | Evidenced | BA-2 | plan-level | +| S6's planning-directory lookup needs a structured CLAUDE.md key + fallback like the existing Step 1 project-config resolution. 
| Evidenced | BA-8 | plan-level | +| `{size}` is read from one write point (Step 3.1) and consumed by five later sites; the plan must make the single-source-of-truth explicit. | Evidenced | BA-3 | plan-level | +| S2 and S10 both target Step 7 and both demote findings on rationale signals. The cleanest implementation merges them into one Step 7.2 sub-step using the reachability phrase list (BA-4). | Evidenced (resolves the SA-3/BA-4/JD-005 dispute) | BA-4, BA-5 | plan-level | +| S3 alone has no observable severity-reduction effect — Step 7's rubric re-promotes agent self-labeled SUGG to WARN. S3 only matters if S1 ships in the same version. | Evidenced | BA-7 | plan-level | +| S7 must update three YAGNI-bearing locations in SKILL.md, not two (Review Constraints lines 29–41, Step 3.3 calibration directive, review-checklist.md). | Evidenced | JD-010 | plan-level | +| S8 needs an explicit extraction pass producing `{task-id, file-path, line-range, recommended-action-summary}` tuples before pair comparison can run. S8's scope is structural contradictions (overlapping line ranges) only; semantic cross-file contradictions are deferred. | Evidenced | BA-6, JD-015 | plan-level | +| S9's "infer the premise from the standard's examples" fallback is the failure mechanism, not a fix. S9 must require reading at least one architectural file demonstrating the premise before raising the finding. | Evidenced | JD-012 | plan-level | +| Long-form docs in `docs/skills/code-review.md` and the four affected `docs/agents/*.md` must be updated alongside the plugin changes per CLAUDE.md convention. The investigation does not name these as edits; the plan must. | Evidenced | JD-001 | plan-level (scope item) | +| A `CHANGELOG.md` entry and a `plugin.json` version bump are required for this branch per `docs/guidance/semantic-versioning.md`. The investigation does not capture this. 
| Evidenced | JD-002 | plan-level (scope item) | +| Verbatim text copied from the investigation (which uses em-dashes throughout) into shipped files must have em-dashes stripped per `docs/writing-voice.md`. | Evidenced | JD-009 | plan-level (style constraint) | +| The implementation sequence buried in the investigation's "What the investigation does not cover" section should be promoted into the implementation plan as a first-class section. | Evidenced | JD-011, SA-5 | plan-level | +| Test plan: 5 P0 acceptance tests (PR 299/307/339 outcomes + SEC cross-reference regression + data-isolation regression), 12 P1 per-solution behavior tests, 2 P2 cross-project/Mode-B tests. Manual execution only; automated harness is YAGNI. | Evidenced | TP-1 through TP-21 + TP YAGNI candidates | plan-level | + +- **Open Questions raised:** + + - **OQ-1: Are `docs/` mirror updates in scope for this branch?** (JD-001) → Resolution source: user-input via reasonable-call default. Resolved as: yes, in scope. The five long-form docs (`docs/skills/code-review.md`, `docs/agents/structural-analyst.md`, `docs/agents/behavioral-analyst.md`, `docs/agents/junior-developer.md`, `docs/agents/edge-case-explorer.md`) are listed as a work unit in the plan. User may redirect. + - **OQ-2: Minor or major version bump?** (JD-002) → Resolution source: user-input via reasonable-call default. Resolved as: minor (2.2.0 → 2.3.0). The skill name, argument signature, and output-format structure are unchanged; only behavior is calibrated. CHANGELOG entry required. User may redirect. + - **OQ-3: How does S2's pre-classification demotion gate fire?** (SA-3 / BA-4 / JD-005 dispute) → Resolution source: junior-developer reframing combined with BA-4's mechanical analysis. 
Resolved as: merge S2 and S10 into a single Step 7.2 sub-step that demotes on a documented reachability phrase list ("theoretical", "hypothetical", "defense-in-depth", "effectively impossible", "in case the upstream", "could happen", "should never happen", "edge case that does not occur"). This replaces the investigation's "skip when directly introduced" rule. SA-3's vocabulary-anchoring concern is addressed by the merged sub-step also reading Step 3.3's calibration directive for criteria when applicable. JD-005's structured-signal concern is deferred as YAGNI; reopen if phrase-matching proves insufficient in post-ship validation. + - **OQ-4: Does SKILL.md `allowed-tools` need `Bash(gh *)` added for S6?** (JD-013) → Resolution source: evidence. Confirmed via grep: SKILL.md:6 currently reads `allowed-tools: Bash(git *), Bash(make *), Bash(npm *), Read, Grep, Glob, Agent`. `Bash(gh *)` is not present. S6 must add it. Resolved. + - **OQ-5: S3/S4 as dispatcher directives or global agent-body edits?** (SA-4 / JD-007 / JD-014 consensus) → Resolution source: specialist consensus across three agents plus evidence (V4 already says outward reads are context-only, not findings; V8 already supports lowered default severity without removing the "include when in doubt" rule). Resolved as: scope all four edits as Step 3.5 dispatch-prompt additions in SKILL.md, not as edits to the four agent definition files. The agent definitions remain general-purpose for use by other skills. + - **OQ-6: Is S12 redundant after S1+S2 ship?** (SA-6 / BA-9 / JD-006 consensus) → Resolution source: specialist consensus. Resolved as: defer S12 to `## Deferred (YAGNI)` with reopen trigger "post-ship validation against PR 299 still shows severity inflation requiring a SUGG-suppress mode flag." + - **OQ-7: Does S9's "infer the premise from examples" fallback actually fix C9?** (JD-012) → Resolution source: junior-developer reframing. 
Resolved as: strengthen S9 to require reading at least one architectural file (entry-point, config, router, navigation surface — whichever is most relevant to the standard's topic) before raising the finding. The "infer from examples" path becomes a fallback that triggers an explicit "premise not verified — finding omitted" note in the agent output, not a forward path. + - **OQ-8: Is S8's file-path + line-range overlap sufficient detection?** (JD-015) → Resolution source: scope acknowledgment. Resolved as: yes for the WARN-002/WARN-003 contradiction class on overlapping line ranges in a single file. S8's documented limitation: it does not detect semantic contradictions across non-overlapping file regions or across files. Cross-file semantic contradiction detection is deferred as YAGNI; reopen if a real review exhibits this failure mode post-ship. + +- **Spec-maturity tags:** + - `plan-level`: 23 findings (everything raised by all four specialists). + - `spec-level`: 0 findings. No specialist tagged any claim as "spec is silent" or "undefined behavior in the spec." The investigation already commits to S1–S13 and to the C# causes; all R1 findings concern *how* to implement those commitments. + - `T#-contradiction`: 0 (no `feature-technical-notes.md` exists for this work — the source is an investigation, not a `plan-a-feature` spec — so the T# concept does not apply). + - **Spec-maturity gate did NOT trip.** The gate requires ≥ 2 `T#`-contradictions raised by ≥ 2 specialists, or ≥ 5 `spec-level` findings raised by ≥ 3 specialists. Neither condition holds. No PM facilitation-gate-trip pass was made. 
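The merged Step 7.2 reachability gate resolved in OQ-3 is mechanical enough to sketch. A minimal illustration in Python, with hypothetical function and field names (the shipped gate is SKILL.md prose, not code); the phrase list and the CRIT → WARN → SUGG → omitted ladder are taken from the resolution above, and the security exemption from D-3's rationale:

```python
# Sketch of the merged S2+S10 reachability demotion gate (Step 7.2).
# Names are illustrative; the real gate is expressed as SKILL.md instructions.

REACHABILITY_PHRASES = [
    "theoretical", "hypothetical", "defense-in-depth", "effectively impossible",
    "in case the upstream", "could happen", "should never happen",
    "edge case that does not occur",
]

# One-step demotion ladder: CRIT -> WARN -> SUGG -> omitted (None).
DEMOTE = {"CRIT": "WARN", "WARN": "SUGG", "SUGG": None}

def apply_reachability_gate(severity, rationale, category):
    """Demote one step when the rationale phrase-matches; security findings are exempt."""
    if category == "security":
        # The security agent's evidence standard already requires a demonstrated exploit path.
        return severity
    text = rationale.lower()
    if any(phrase in text for phrase in REACHABILITY_PHRASES):
        return DEMOTE.get(severity, severity)
    return severity
```

JD-005's fragility concern applies directly to a sketch like this: substring matching fires on any rationale that quotes the phrases, which is why the structured-signal alternative stays parked as a YAGNI reopen trigger.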
+ +- **Resolution source per question:** + +| OQ | Resolved by | +|---|---| +| OQ-1 (docs scope) | user-input default — reasonable call under user's "work without stopping" directive | +| OQ-2 (version bump) | user-input default — reasonable call | +| OQ-3 (S2 mechanic) | reframing (junior-developer's restatement of the question) + specialist analysis (BA-4) | +| OQ-4 (allowed-tools) | evidence (`grep -n "allowed-tools" plugin/skills/code-review/SKILL.md` confirmed the gap) | +| OQ-5 (S3/S4 scope) | specialist consensus (SA-4 + JD-007 + JD-014 + corroborating V4/V8 from the source spec) | +| OQ-6 (S12 YAGNI) | specialist consensus (SA-6 + BA-9 + JD-006) + YAGNI Gate 2 simpler-version test | +| OQ-7 (S9 fix) | reframing (junior-developer identified inference-from-examples as the failure mechanism, not a fix) | +| OQ-8 (S8 scope) | scope acknowledgment (the proposed mechanism handles the named PR 339 contradiction class; broader is YAGNI) | + +- **Decisions produced:** D-1 through D-21. Full decisions: D-1 (Step 3.3 authoritative home), D-2 (Step 7 sub-step structure), D-3 (merged S2+S10 reachability gate), D-4 (S3/S4 as dispatcher directives), D-5 (atomic shipping pairs and sequencing), D-6 (minor version bump 2.2.0 to 2.3.0), D-7 (docs mirror in scope), D-8 (allowed-tools gh grant), D-9 (S6 fail-open warning), D-10 (planning-directory lookup format), D-11 (S7 three-location rewrite), D-12 (S8 extraction-pass scope), D-13 (S9 architectural-file read), D-14 ($focus_areas and $branch_context named bindings), D-15 ({size} single source of truth at Step 3.1), D-16 (defer S12 as YAGNI), D-17 (em-dash strip policy), D-18 (narrower S4 wording for edge-case-explorer). Trivial decisions: D-19 (commit S1), D-20 (commit S5), D-21 (commit S11). 
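D-12's extraction-then-pair-comparison scope for S8 can be sketched the same way. A minimal illustration, assuming hypothetical field names for the `{task-id, file-path, line-range, recommended-action-summary}` tuples; per OQ-8, only same-file overlapping line ranges are flagged, and cross-file semantic contradictions stay deferred:

```python
# Sketch of S8's structural-contradiction detection (D-12 scope).
# Field names are illustrative stand-ins for the extraction-pass tuples.

def ranges_overlap(a, b):
    """True when two inclusive (start, end) line ranges intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

def find_structural_contradictions(findings):
    """Pairwise-compare extracted tuples; flag same-file overlapping line ranges."""
    flagged = []
    for i, f1 in enumerate(findings):
        for f2 in findings[i + 1:]:
            same_file = f1["file_path"] == f2["file_path"]
            if same_file and ranges_overlap(f1["line_range"], f2["line_range"]):
                flagged.append((f1["task_id"], f2["task_id"]))
    return flagged
```

On the PR 339 contradiction class this is exactly the WARN-002/WARN-003 shape: two remedies aimed at overlapping lines of one test file.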
+ +- **Changed in plan:** All plan sections written for the first time in this round: Source Specification, Outcome, Context, Team Composition and Participation, Implementation Approach (Architecture and Integration Points, Runtime Behavior; Data Model and External Interfaces marked not applicable), Decomposition and Sequencing, RAID Log, Testing Strategy, Security Posture (not applicable), Operational Readiness, Definition of Done, Specialist Handoffs for Implementation, Deferred (YAGNI), Open Items, Summary. R1 is the initial sweep; everything in the plan derives from R1 findings and the deterministic resolution. + +- **Project-manager next-step recommendation (deterministic):** **Go to synthesis.** Rationale: spec-maturity gate did not trip; all OQs resolved via evidence, reframing, or user-input default; no specialist named a needed handoff to a specialist not already engaged; the next round would produce no new findings because the dispute (OQ-3) is resolved and no specialist requested re-engagement with new context. diff --git a/docs/plans/code-review-guardrails/artifacts/pr-299-evidence.md b/docs/plans/code-review-guardrails/artifacts/pr-299-evidence.md new file mode 100644 index 0000000..009f2bb --- /dev/null +++ b/docs/plans/code-review-guardrails/artifacts/pr-299-evidence.md @@ -0,0 +1,156 @@ +# Evidence: PR #299 (gearjot-v2-web, gear map foundation) + +Source bundle: `tmp/gearjot-v2-web-pr-299/` (May 7, 2026). + +Two `han:code-review` passes ran on this PR. The second pass opened with the explicit statement that "7 of the prior 11 warnings were downgraded to suggestions on a second pass". The existence of the second pass is itself the single strongest piece of evidence in this investigation: the skill's first-pass output was so over-classified that the user had to re-run it with a reclassification mode. + +Each finding below is keyed E1...E23 and cited verbatim where possible. Findings are grouped by which of the four symptoms they demonstrate. 
+ +--- + +## Symptom 1: Goes too deep / off the rails + +### E1: Structural agent flagged code outside the change surface +Source: `tmp/gearjot-v2-web-pr-299/artifacts/github/pr-299-review-2.md:155` +> Structural agent surfaced what is now SUGG-005/SUGG-006/SUGG-007: duplication, mock-type drift, SDK-type leak: none of which block merge. + +The three findings were filed as warnings on the first pass against code outside the touched diff. Reclassification was needed only because they did not concern code introduced by this change. + +### E2: SUGG-005 cited a defect case that the team had already deferred +Source: `pr-299-review-2.md:117` +> SUGG-005: Three duplicated announcement stores (`announcement-store.svelte.ts` plus 2 siblings). Real duplication; the cited defect-propagation case is the deferred `_interruptingMessage` issue, itself accepted as intentional. Refactoring opportunity. + +The skill examined two sibling stores not touched by the branch, and the only defect path it could attach to the duplication had already been documented as intentionally deferred. + +### E3: SUGG-007 flagged an architectural concern with acknowledged zero current impact +Source: `pr-299-review-2.md:119` +> SUGG-007: `google.maps.Map` as popover prop (`GearMarkerPopover.svelte:13-14, 71`). Architectural concern with no current user impact. Refactor candidate. + +Filed as warning on the first pass even though the review itself, on reclassification, acknowledged "no current user impact". + +### E4: SUGG-001 was filed against a theoretical race +Source: `pr-299-review-2.md:113` +> SUGG-001: `loadSDK` silent skip when `mapEl` unbound. In Svelte 5, `bind:this` resolves before `onMount`; the navigate-away-back hazard writes to instance-scoped `$state` (harmless). The race window is theoretical. + +The first pass treated framework-internal sequencing on minimally-changed code as a warning. Reclassification labeled the race window "theoretical". 
+ +### E5: Hidden depth in suppressed suggestions +Source: `pr-299-review-1.md:124` +> Suppressed per reviewer instruction. (Junior-developer raised toast omission, `DEMO_MAP_ID` comment, `sed`-key escaping; behavioral raised silent `mapEl` skip: all dropped.) + +The first pass generated at least four extra findings that were suppressed only because the user explicitly passed a SUGG-suppress flag. Without that flag, three of them would have appeared as warnings about code clearly outside the change surface. The default mode produces deeper findings than the user wants. + +### E6: SUGG-003 was filed even though its originating agent flagged it as unreachable +Source: `pr-299-review-2.md:115` +> SUGG-003: `await goto()` inside `catch` racing rapid retry. The concurrency agent itself noted this is "effectively impossible in production" because `goto` removes the page from the DOM before re-click is feasible. Structural deviation from the `finally` standard, not a reachable bug. + +The agent that produced the finding documented that it knew the failure was unreachable. The classification step did not honor that signal and the finding still came through as a warning. + +### E7: Feature overview directly names off-scope review comments +Source: `feature-overview-plain-language.md:114` +> It is not slice 2, 3, 4, 5, or 6. Several review comments were "you should add filtering / a list pane / a current-location button" and the team consistently said: that's a different slice. + +The author of the overview explicitly recorded that the review surfaced features belonging to future slices. The scope context "this is slice 1, not slices 2 to 6" did not constrain the review. + +--- + +## Symptom 2: YAGNI under-applies + +### E8: One YAGNI finding in the first pass despite many candidates +Source: `process-conversation-log.md:51` +> 2026-05-07 20:41Z | First `han:code-review` posted | 6 warnings + 1 YAGNI; all but WARN-002 resolved in working tree. 
+ +The first pass produced exactly one YAGNI finding. SUGG-001 through SUGG-007 (later reclassified) are all candidates by the YAGNI rule's criteria but were issued as warnings, not as YAGNI items. + +### E9: YAGNI was applied to obvious dead code but not to speculative or refactor-only findings +Source: `pr-299-review-1.md:130-144` +> YAGNI-001: `maybeAnnouncePhase2` / `phase2Announced` is dead-or-misordered code. +> Trigger that would justify keeping it: A documented UX requirement that the cat-4 "Loading gear locations" announcement is distinct from the skeleton's `aria-busy`/`aria-label` and must fire even after data resolves first. + +The skill correctly applied YAGNI to dead code and named a reopen trigger. The same rigor was not applied to SUGG-005 (three stores, no documented evidence for three), SUGG-007 (architectural choice with no documented requirement), or SUGG-006 (mock-type drift with no failing behavior). All three meet the YAGNI evidence test for deferral. None were YAGNI-classified. + +### E10: Process log explicitly names the first-pass gap +Source: `process-conversation-log.md:244` +> 5. Reachability gate in adversarial review. The second-pass reclassification was the right call: explicitly demoting "theoretical races" and "defensive refinements" to suggestions. Bake this into the first pass instead of needing a second pass to catch it. + +The user's own post-mortem identifies the absence of a first-pass reachability gate as a root cause. + +--- + +## Symptom 3: Severity inflation + +### E11: The second review exists because the first pass over-classified +Source: `pr-299-review-2.md:3` +> Supersedes the previous review comment. Re-run with reclassification: 7 of the prior 11 warnings were downgraded to suggestions on a second pass: they were defensive refinements or refactoring opportunities, not reachable defects. + +7 of 11 warnings were inflated. 64 percent of warnings on the first pass were not warnings at all. 
+ +### E12: Process log confirms a wider initial sweep produced unrationalized warnings +Source: `process-conversation-log.md:170-171` +> Mode: Re-run with reclassification: 7 of the prior 11 warnings (from a wider initial sweep, presumably) were downgraded to suggestions on a second pass; they were "defensive refinements or refactoring opportunities, not reachable defects." + +"From a wider initial sweep, presumably" admits the user could not reconstruct exactly how the first pass produced 11 warnings. The classification path was opaque. + +### E13: SUGG-003 was a first-pass warning even though the agent labeled it unreachable +See E6. + +### E14: SUGG-004 was a first-pass warning despite a hypothetical threat +Source: `pr-299-review-2.md:116` +> SUGG-004: `sed` redact with unescaped API key. Google-issued keys never contain sed metacharacters; the operator-passes-bad-key path is hypothetical. Defense-in-depth. + +A finding whose threat model is "hypothetical" and whose category is "defense-in-depth" should not have been a warning. + +### E15: Process log calls out reachability gate as the missing first-pass step +Source: `process-conversation-log.md:183` +> Lesson for `han` improvement: A second pass that demotes warnings is as important as the first pass that creates them. The reclassification rationale was specific and traceable in every case. Without it, the user would either (a) waste time fixing things that aren't real bugs, or (b) lose trust in the agent's signal-to-noise ratio. Good adversarial review must include a "is this actually reachable in production?" gate. + +### E16: Process log #5 explicitly asks for first-pass reachability gate +Source: `process-conversation-log.md:244` +See E10. + +### E17: Default mode produces severity inflation; users compensate via mode flags +Source: `pr-299-review-1.md:3` +> Size: Medium (8 files, +1009/-107). Mode: WARN-justified, SUGG suppressed per reviewer instruction. 
+ +The user passed an explicit "SUGG suppress" instruction. Without it, SUGG-001 through SUGG-007 would have been warnings. The default behavior is to inflate. + +--- + +## Symptom 4: Context not forwarded to sub-agents + +### E18: Mode flag is the only user context the review references +Source: `process-conversation-log.md:150-151` +> Mode: WARN-justified, SUGG suppressed per reviewer instruction. + +There is no record of scope context, focus areas, PR description, or feature overview being passed to the sub-agents. + +### E19: Feature overview records reviewers ignored available scope context +Source: `feature-overview-plain-language.md:112-115` +> ## What this PR is not trying to be +> - It is not slice 2, 3, 4, 5, or 6. Several review comments were "you should add filtering / a list pane / a current-location button" and the team consistently said: that's a different slice. + +The PR description and feature overview both clearly scoped the work. The agents flagged out-of-scope features anyway. + +### E20: PR description documented deferred items the agents still flagged +Source: `pr-299-description.md:28` +> T31 / G2.7: Sentry PII stripping for `/v1/gear/locations` and `/gear/map` deferred 2026-05-07 to a follow-up. + +The PR description has a Deferred section listing Sentry PII work as out of scope. The security section of the review re-raised it. + +### E21: Both review passes independently re-discovered the deferred Sentry gap +Source: `pr-299-review-1.md:163-165` and `pr-299-review-2.md:142-143` +> The deferred Sentry PII stripping for `/v1/gear/locations` (T31/G2.7) remains the only known PII gap. Explicitly tracked outside this PR; gates first deploy per Operational Readiness gate item 6. + +Both passes re-surfaced the deferral. Neither could read the PR's own Deferred section and skip the finding. The deferral context did not reach the security sub-agent. 
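E18 through E21 all describe context that existed at review time but never reached the sub-agents. The fix planned as Step 1.5 and D-9 reduces to an ordered, fail-open lookup. A minimal sketch of the binding logic, with hypothetical names (the real loader shells out to `gh pr view` and reads `pr-body` files, branch commits, and the planning directory):

```python
# Sketch of Step 1.5's fail-open $branch_context binding (D-9).
# Source names are illustrative stand-ins for the real lookups.

def bind_branch_context(candidates):
    """Bind the first non-empty loaded source to $branch_context.

    candidates: ordered list of (source_name, loaded_text_or_None).
    Returns (context, warned): warned is True when nothing loaded and
    the fail-open default applies.
    """
    for _name, text in candidates:
        if text and text.strip():
            return text.strip(), False
    # Fail open: warn once, bind the documented default, and proceed.
    return "none provided", True
```

The point of the tuple return is that the warning is a single, explicit event rather than a silent empty binding, which is the behavior BA-2 asked for.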
+ +### E22: Test-gap chaining requires context the agents don't have +Source: `process-conversation-log.md:242-243` +> 4. Test-gap chaining in code review. When a code-review fix adds new code, automatically check whether the new code is covered by tests. Three of four second-pass warnings here were exactly this pattern. + +Three of four second-pass warnings rediscovered test gaps for code added as a fix to first-pass findings. The new agents had no record that the new code was a fix; the test-coverage analysis ran from scratch. + +### E23: Triple-corroboration language confirms agents do not share intermediate findings +Source: `process-conversation-log.md:247` +> 8. Triple-corroboration framing in review output. Keep the practice of explicitly naming "agents X, Y, Z independently converged on this finding." It's signal the user can act on. + +Agents converge independently. They cannot consult each other's outputs nor a prior review's outputs. diff --git a/docs/plans/code-review-guardrails/artifacts/pr-307-evidence.md b/docs/plans/code-review-guardrails/artifacts/pr-307-evidence.md new file mode 100644 index 0000000..7c35e41 --- /dev/null +++ b/docs/plans/code-review-guardrails/artifacts/pr-307-evidence.md @@ -0,0 +1,147 @@ +# Evidence: PR #307 (gearjot-v2-web, sidebar shell + viewport flag) + +Source bundle: `tmp/gearjot-v2-web-pr-307/` (May 8, 2026). + +PR #307 raised 4 WARN findings and 1 YAGNI. The user adjudicated 3 of 4 WARN findings as "Won't fix" with multi-paragraph evidence pointing at code the review never examined: a separate repository, a database CHECK constraint, an entire navigation model. The 3:1 ratio of won't-fix to accepted WARN findings is direct evidence that the review's classification logic over-fires. + +Findings are grouped by symptom and keyed E1...E22. 
+ +--- + +## Symptom 1: Goes too deep / off the rails + +### E1: WARN-002 raised a failure mode the architecture makes unrepresentable +Source: `pr-307-review.md:13-15` +> If the API ever returns a gear record with a missing or invalid latitude/longitude, the map quietly fails to center instead of skipping the bad record and showing the rest. The user sees a default "zoomed-out US" view with no explanation. This violates our internal "errors must never be silently swallowed" rule. + +The constraints that make this finding moot (database CHECK, non-nullable Go DTO, JSON serializer behavior) live in a separate repository (`gearjot-v2/`) that the review never examined. + +### E2: WARN-003 applied a standard whose premise does not hold in this app +Source: `pr-307-review.md:17-19` +> The "user already moved the map" flag lives at the module level and isn't cleared when a user switches companies. So if a user pans the map at Company A and then switches to Company B, the map for Company B may refuse to auto-center on Company B's fleet. We have a written standard requiring this kind of state to be cleared on company switch, and it isn't being followed here. + +The standard cited (`svelte-store-company-switch-clear.md`) targets SPA-style company switching with module-level Map/Set caches. The user's resolution comment cites four code locations proving every company-switch path in this app is a full-document redirect, which tears down all module-level state, including this flag. + +### E3: Process log names the pattern +Source: `process-conversation-log.md:184-189` +> WARN-002 and WARN-003 are textbook applications of "errors must never be silently swallowed" and "module-level state must clear on company switch." Both rules are real and appropriate elsewhere in the codebase. 
Both fail to apply here, and the resolution comments prove it with evidence the review either didn't gather or didn't weigh: +> - WARN-002 didn't check whether the upstream API can produce the failure mode at all. +> - WARN-003 didn't check whether company-switch in this app is even an SPA navigation. +> +> The review was defensible against a generic frontend but not against this specific frontend. + +### E4: User adjudication of WARN-003 cites four files the review never touched +Source: `pr-307-resolutions.md:22-27` +> Every switch entry point uses full-document navigation, not SvelteKit `goto`: `src/routes/+layout.svelte:150` (`window.location.href`), `src/routes/select-org/+page.svelte:9`, `src/routes/+layout.server.ts:49` (302 redirect), and `src/routes/api/switch-org/+server.ts` (post-WorkOS redirect). A full-document redirect tears down the JS bundle, wiping all module-level state including `userViewportFlag`. + +None of these four files were changed by PR #307. The finding rests on assumptions about navigation behavior that the review never verified by reading these files. + +### E5: User adjudication of WARN-002 cites cross-repo files the review never examined +Source: `pr-307-resolutions.md:12-17` +> - Database CHECK constraint `gear_locations_coords_valid` (see `gearjot-v2/db/client.go:154` and the migration in `ensureGearLocationsCoordsCheck`) rejects any row where `lat` is outside [-90, 90], `lng` is outside [-180, 180], or both are exactly 0. Bad coordinates can't be stored. +> - API contract is non-nullable: `GearLocationItem.Lat` / `.Lng` are `float64` (not pointers) in `gearjot-v2/gear/dto.go:155-156`... + +`gearjot-v2/db/client.go` and `gearjot-v2/gear/dto.go` live in a different repository. + +### E6: Process log names cross-repo blindness +Source: `process-conversation-log.md:209-210` +> Cross-repo evidence for "silent failure" findings. WARN-002 hinged on whether the backend can produce the bad state. The review didn't look at `gearjot-v2/`. 
The resolution comment cited a CHECK constraint, a Go DTO, and Go's JSON serializer: three pieces of cross-repo evidence the review should have weighed before posting. + +### E7: The won't-fix pattern is consistent across three of four findings +Source: `process-conversation-log.md:133-135` +> Pattern that emerges: the review fires generic "this looks like the standard you should follow" warnings, and the user pushes back with load-bearing architectural evidence the review didn't have access to. All four pushbacks cite specific files, line numbers, or system-level facts: not opinion. + +--- + +## Symptom 2: YAGNI under-applies + +### E8: YAGNI-001 was correct in principle but the review then accepted the developer's rationale in the same comment +Source: `pr-307-review.md:25-33` +> Finding: The sidebar shell itself needs to stay: the auto-centering math depends on that space being reserved on screen. But the placeholder copy inside it ("More map tools coming soon.") is user-visible work-in-progress messaging that takes up 40% of the screen on mobile and adds zero value today. +> +> Decision: not accepted: the placeholder stays. +> +> Rationale: The app needs to be demoable end-to-end while the centering controls are still being built. + +The YAGNI rule's "zero value today" test does not account for demoware. The review identified a candidate, then dismissed it in the same comment by accepting a rationale the YAGNI test should have solicited from a project-manager-style agent before firing. + +### E9: Process log names the missing demo-posture question +Source: `process-conversation-log.md:212-213` +> YAGNI findings should ask the demo-posture question first. "This text adds zero value today" is true for product code; for in-progress demoware it's wrong. Either a `han:user-experience-designer` or a `han:project-manager` agent should weigh in on whether the surface is internal/demo before YAGNI fires. 
+ +### E10: PR description pre-acknowledged the placeholder rationale 21 seconds before YAGNI fired +Source: `pr-307-description.md:39` and `process-conversation-log.md:84-86` +> Adds the Gear Map sidebar shell (placeholder copy is intentional: see commit `9e97505`) + +PR description named the intent. The review fired YAGNI-001 anyway. + +### E11: WARN-002 and WARN-003 are themselves YAGNI-shaped findings classified as warnings +Source: `pr-307-resolutions.md:28-29` +> Adding `clearForCompanySwitch()` would be defensive code for a code path that does not exist: same reasoning applied to WARN-002. + +Both findings ask the team to add defensive code at a trusted internal boundary for a failure mode that cannot occur. This is the YAGNI rule's named anti-pattern "Defensive guard at a trusted internal boundary the caller fully controls". They were filed as WARN instead. + +--- + +## Symptom 3: Severity inflation + +### E12: WARN-002 rated WARN when the failure mode is architecturally unrepresentable +Source: `pr-307-review.md:13` +> WARN-002: Bad coordinates from the server cause a silently broken map + +### E13: WARN-003 rated WARN when the navigation model assumed does not exist in this app +Source: `pr-307-review.md:17` +> WARN-003: Switching companies can leak the previous company's map state + +### E14: WARN-004 is rated WARN solely because it depends on WARN-003 +Source: `pr-307-review.md:21-23` +> WARN-004: A test is missing that would catch a regression in the cleanup above +> The test for page cleanup verifies most things get cleared on unmount, but doesn't verify the user-viewport flag itself gets cleared. + +WARN-004 is a derivative finding. The user's adjudication: "Moot, given WARN-003 won't-fix; there is no cleanup line to guard against regression." + +### E15: Confidence labels are missing +Source: `process-conversation-log.md:211` +> Confidence labels on findings. 
Each WARN should carry a confidence indicator: did the reviewer find evidence the failure can occur, or did they pattern-match a smell? The current output reads as equally confident across all four, when in fact WARN-001 was real and reproducible (it got fixed) and WARN-002/003 were generic-frontend pattern matches that didn't apply. + +### E16: 3 of 4 WARNs were won't-fixed +Source: `process-conversation-log.md:128-134` +> | WARN-001 | Done | +> | WARN-002 | Won't fix | +> | WARN-003 | Won't fix | +> | WARN-004 | Won't fix | + +A 3:1 won't-fix ratio on WARN findings is direct quantitative evidence of severity inflation. + +--- + +## Symptom 4: Context not forwarded to sub-agents + +### E17: PR description pre-acknowledged context, review fired YAGNI anyway +See E10. + +### E18: Implementation plan in issue #304 explicitly resolved WARN-003 before code was written +Source: `tmp/gearjot-v2-web-pr-307/artifacts/github/issue-304.md`, "Reusable patterns" section +> Module-level `$state` store with `_resetForTest()`: established by `src/routes/gear/map/announcement-store.svelte.ts:1` and the `clearForCompanySwitch()` standard. The viewport flag does not need `clearForCompanySwitch()` because a company switch hard-navigates; it just needs `_resetForTest()`. + +The argument the user later wrote into the resolution comment was already in the implementation plan. The review's sub-agent had no record of the plan. + +### E19: Process log names premise-checking as the missing step +Source: `process-conversation-log.md:208-210` +> Before raising a "violates standard X" warning, prove the standard's premise applies. The standard `svelte-store-company-switch-clear.md` exists for SPA-style company switches with module-level Map/Set caches. The reviewer needs to verify the app has SPA-style switches and that the offending state is a session-scoped cache. WARN-003 raised the standard without checking either premise. 
+ +### E20: Process log names the multi-repo CLAUDE.md as missing context +Source: `process-conversation-log.md:223-224` +> Skill prompts should know this is a multi-repo project. WARN-002's premise required reading `gearjot-v2/`. The reviewer either didn't have access or didn't reach across. The CLAUDE.md cross-repo rules document is exactly the kind of input the reviewer needed; whether it was loaded is unknown from the PR alone. + +### E21: Review fired 21 seconds after PR open +Source: `process-conversation-log.md:84-86` +> The first review comment was posted at 19:42:56Z: 21 seconds later: which strongly suggests the review was queued to fire immediately on PR open. + +21 seconds is consistent with the agents starting before the PR description's context (including the explicit "placeholder copy is intentional" note) could reach them. + +### E22: Plan's "Decisions this plan locks in" section was the context the review needed +Source: `tmp/gearjot-v2-web-pr-307/artifacts/github/issue-304.md`, "Decisions this plan locks in" section +> Plan-D3: Sidebar component is empty in this slice. Just a simple "coming soon" text placeholder. + +This and the surrounding plan content was the context that would have prevented WARN-002, WARN-003, and YAGNI-001 from firing. It existed at PR-open time and was linked from the PR. It did not reach the review sub-agents. diff --git a/docs/plans/code-review-guardrails/artifacts/pr-339-evidence.md b/docs/plans/code-review-guardrails/artifacts/pr-339-evidence.md new file mode 100644 index 0000000..35a0d15 --- /dev/null +++ b/docs/plans/code-review-guardrails/artifacts/pr-339-evidence.md @@ -0,0 +1,126 @@ +# Evidence: PR #339 (gearjot-v2-web, Task Detail General Information section) + +Source bundle: `tmp/gearjot-v2-web-pr-339/` (May 13, 2026). + +PR #339 raised 1 CRIT and 3 WARN findings. The CRIT finding was correct and led to the only user-visible fix. 
Two of the three WARN findings (WARN-002 and WARN-003) prescribed contradictory remedies for the same category of issue: WARN-002 said add an italic-class assertion, WARN-003 said drop a Tailwind-utility-class assertion. The user resolved the contradiction by removing both assertions, which the bundle's README labels "the most instructive single moment". + +Findings are keyed E1...E17 and grouped by symptom. + +--- + +## Symptom 1: Goes too deep / off the rails + +### E1: WARN-001 flagged a stale data-testid not introduced by this branch +Source: `pr-339-review.md:71` +> While editing, correct the unrelated stale `data-testid="gear-subtitle"` reference on line 52 to `task-subtitle` (real code at `+page.svelte:250` uses `task-subtitle`; not introduced by this branch but adjacent). + +The review explicitly labels the issue "not introduced by this branch but adjacent". This is the textbook signature of going beyond the change surface. + +### E2: Decision log post-hoc accepts the out-of-scope flag as a drive-by +Source: `02-conversation-and-decision-log.md:207` +> Removing the stale `gear-subtitle` reference is a small drive-by improvement of the kind the project's documentation guidelines encourage when already editing a file. + +The log accepts the drive-by but does not change the fact that the review surfaced a finding outside its declared scope. + +### E3: Review's own scope statement names 2 files; WARN-001 reaches a third +Source: `pr-339-review.md:10-11` +> Scope: 2 files (`src/routes/tasks/[id]/+page.svelte`, `src/routes/tasks/[id]/page.test.ts`). Adds a "General Information" section. + +The reviewer declared scope, then immediately raised a finding against `docs/project-documentation/tasks-detail-page.md`. The skill correctly identifies the scope and then ignores it. 
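+
+The gap E1 through E3 document reduces to a filter the skill never applies. A sketch with illustrative names, not the skill's actual mechanism:
+
```python
# Sketch: separate findings whose target file is inside the review's own
# declared scope from those outside it. Shapes here are illustrative.

def partition_by_scope(findings, scoped_files):
    in_scope = [f for f in findings if f["file"] in scoped_files]
    out_of_scope = [f for f in findings if f["file"] not in scoped_files]
    return in_scope, out_of_scope

scope = {
    "src/routes/tasks/[id]/+page.svelte",
    "src/routes/tasks/[id]/page.test.ts",
}
findings = [
    {"id": "WARN-001", "file": "docs/project-documentation/tasks-detail-page.md"},
    {"id": "CRIT-001", "file": "src/routes/tasks/[id]/+page.svelte"},
]
kept, dropped = partition_by_scope(findings, scope)
# WARN-001 lands in `dropped`: at most a drive-by Suggestion, never a Warning.
```
+
+The review computed `scope` correctly in its own scope statement and then never ran anything like the partition.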
+ +### E4: Review missed an in-scope token bug; the user found it in their own pass +Source: `02-conversation-and-decision-log.md:241-247` +> While reviewing the file end-to-end, Aaron noticed an unrelated bug on the same `+page.svelte`: the "Share unavailable" warning banner used `text-warning-foreground` (resolves to `#ffffff`), rendering white text on a tinted cream background. + +The same root cause pattern as CRIT-001 (a `-foreground` token misinterpreted) was visible in the same file in the same diff. The review spent attention on out-of-scope documentation lines and missed an in-scope, identical-pattern bug. Going deep in the wrong direction starves attention from in-scope work. + +--- + +## Symptom 2: YAGNI under-applies + +### E5: YAGNI section reports zero findings despite a planned, explicit YAGNI rejection +Source: `pr-339-review.md:106-110` +> None identified. The three new data points (Description, ID, UUID) all map to existing fields on `TaskItem` already used elsewhere; the empty-state guard handles a real product state; no speculative abstractions were introduced. + +### E6: The plan explicitly logged a YAGNI rejection the review did not surface +Source: `artifacts/planning/implementation-decision-log.md:84-100` +> D-5: No `` component extraction +> Decision: No. Render both rows inline in `+page.svelte` using the existing labeled-row pattern. +> Rationale: No `` component exists in the codebase today. Inline duplication of the labeled-row pattern is the established practice... Per the YAGNI rule, an abstraction needs three concrete uses or cited evidence. + +The plan made a conscious YAGNI call with reopen criteria. The review did not validate this call against the implementation, did not surface it, and did not check whether the implementation honored the decision. YAGNI ran without reading the plan. 
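+
+A plan-aware YAGNI pass of the kind E5 and E6 call for might look like the following sketch. The decision-log shape and all names are assumptions for illustration, not the skill's format:
+
```python
# Sketch: before YAGNI reports "none identified", cross-reference each
# candidate abstraction against the plan's decision log, so a conscious
# rejection like D-5 is surfaced and validated rather than silently skipped.

def yagni_report(candidates, plan_decisions):
    decided = {d["topic"]: d["id"] for d in plan_decisions}
    lines = []
    for topic in candidates:
        if topic in decided:
            lines.append(f"{topic}: plan decision {decided[topic]} already "
                         "adjudicated this; verify the implementation honors it")
        else:
            lines.append(f"{topic}: no plan decision found; run the evidence test")
    return lines

plan = [{"id": "D-5", "topic": "labeled-row component extraction"}]
report = yagni_report(["labeled-row component extraction"], plan)
```
+
+Either branch produces a line in the review output, which is exactly what the "None identified" section failed to do.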
+
+### E7: Feature overview confirms the deliberate non-abstraction
+Source: `01-feature-overview.md:46`
+> The team explicitly decided not to extract a reusable "labeled row" component, because no third use case justified the abstraction yet.
+
+### E8: WARN-002 contradicts YAGNI by recommending added implementation-coupled assertions
+Source: `pr-339-review.md:73-83`
+> Add to T6 (and T5 for symmetry):
+> ```ts
+> expect(body.className).toMatch(/italic/);
+> ```
+
+The recommendation pins a test to a Tailwind utility class. WARN-003 in the same review correctly identifies this exact pattern as implementation coupling. Catching it was YAGNI's job: flag the speculative coupling itself, not recommend more of it.
+
+---
+
+## Symptom 3: Severity inflation
+
+### E9: WARN-001 (doc not updated) treated as Warning despite the doc not being a file of interest
+Source: `pr-339-description.md:47-48`
+> Files of interest
+> - `src/routes/tasks/[id]/+page.svelte`
+> - `src/routes/tasks/[id]/page.test.ts`
+
+The author scoped attention to two files. The docs file was not among them. A missing documentation update at this size is a Suggestion at most.
+
+### E10: WARN-002 and WARN-003 are co-equal Warnings but prescribe opposite remedies
+Source: `pr-339-review.md:73-96`
+> WARN-002: T6 (whitespace-only body) doesn't assert italic class, unlike T4. Recommendation: Add `expect(body.className).toMatch(/italic/);`.
+> WARN-003: T2 `className` assertion is implementation-coupled. Recommendation: Drop the `className` line; keep the `textContent` assertion.
+
+Two co-equal Warnings, one prescribing addition and one prescribing removal of the same pattern.
+
+### E11: Decision log names the contradiction as a reviewer failure to self-check
+Source: `02-conversation-and-decision-log.md:194-196`
+> WARN-002 and WARN-003 are subtly contradictory. WARN-003 says "don't couple tests to Tailwind utility classes." WARN-002 says "add a `className.toMatch(/italic/)` assertion to T6."
Both can't be right, and the reviewer didn't flag the inconsistency. + +### E12: User resolved WARN-002 by doing the opposite of what it recommended +Source: `pr-339-resolutions.md:25-29` +> Resolved by removing the italic assertion from T4 rather than adding parallel assertions to T5/T6. +> Reasoning: Asserting on the `italic` class is the same category of implementation coupling flagged in WARN-003. + +A Warning that is resolved by doing the opposite of its own recommendation is not a Warning. It was at most a Suggestion needing human adjudication. + +### E13: README confirms this is the bundle's most instructive single moment +Source: `tmp/gearjot-v2-web-pr-339/README.md:46` +> Most instructive single moment: Aaron resolved the contradiction between WARN-002 and WARN-003 by removing an assertion rather than adding one: a self-consistency call the code-review skill could have made on its own. + +--- + +## Symptom 4: Context not forwarded to sub-agents + +### E14: Review acknowledges no human-supplied focus areas +Source: `pr-339-review.md:1-13` + +The review's scope statement contains no "Context provided by user", no focus-area acknowledgment, and no reference to the PR description's "What to look at first" section. + +### E15: PR description named three specific concerns; none surfaced in the review +Source: `pr-339-description.md:31-33` +> - The conditional logic for the description (`typeof task.body === 'string' && task.body.trim()`): this follows the project's non-empty string check standard; worth confirming the empty-state copy ("No description provided.") matches product expectations. +> - DOM ordering: the new `
` sits above the existing `
` and the two share a `border-t` divider: scan the diff to confirm the visual separation reads correctly at the boundary. +> - The `uuid` field is now surfaced in read-only UI for the first time; the Tasks Detail Page project doc notes `uuid` is used for share links, so confirm this exposure is intentional for this page. + +None of these three concerns are addressed in the review's findings. + +### E16: Edge-case agent surfaced a UUID concern but not the human's UUID concern +Source: `pr-339-review.md:114-118` +> EC1 (edge-case-explorer, Low): `task.uuid` has no runtime null guard. Typed as non-nullable `string` in `TaskItem`; no realistic production path produces a missing UUID. Per small-change calibration, Low findings not directly introduced by this change are omitted. + +The agent found a UUID null-guard concern (unrelated to the human's framing) and the skill correctly omitted it under calibration. But the human's actual question (is surfacing UUID intentional for this page?) was never addressed because it never reached the agents. + +### E17: DOM ordering concern from the PR description was never reviewed +Source: cross-reference of `pr-339-description.md:32` and `pr-339-review.md:27-31` + +The PR description asks reviewers to "scan the diff to confirm the visual separation reads correctly at the boundary". The review summary contains four findings; none address visual separation or DOM ordering at the new section boundary. The human's focus area did not reach the sub-agents responsible for those domains (structural-analyst, behavioral-analyst, user-experience-designer if dispatched). 
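+
+Symptom 4 reduces to a missing substitution: the dispatcher prompts have no slot for the human's context. A minimal sketch of the shape such a slot could take; the two field names mirror the `$focus_areas` and `$branch_context` bindings the guardrails work introduces, and everything else is illustrative:
+
```python
from string import Template

# Sketch of a dispatcher prompt with explicit context slots. Without the
# last two fields, the agents never see the PR description's concerns.
AGENT_PROMPT = Template(
    "Domain question: $question\n"
    "File list: $files\n"
    "Focus areas from the user: $focus_areas\n"
    "PR / branch context: $branch_context\n"
)

prompt = AGENT_PROMPT.substitute(
    question="edge cases in the General Information section",
    files="src/routes/tasks/[id]/+page.svelte",
    focus_areas="is surfacing uuid in read-only UI intentional for this page?",
    branch_context="none provided",
)
```
+
+With a slot like this populated, the edge-case agent in E16 would have received the human's actual UUID question instead of inventing its own.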
diff --git a/docs/plans/code-review-guardrails/artifacts/skill-behavioral-trace.md b/docs/plans/code-review-guardrails/artifacts/skill-behavioral-trace.md new file mode 100644 index 0000000..a3a3b4e --- /dev/null +++ b/docs/plans/code-review-guardrails/artifacts/skill-behavioral-trace.md @@ -0,0 +1,176 @@ +# Behavioral trace of the code-review skill + +This artifact traces how the four reported symptoms flow through the code-review skill body and the dispatched agents. Findings are keyed B1...B8 and reference verbatim lines from the skill and agent definitions. + +Files in scope: + +- `plugin/skills/code-review/SKILL.md` +- `plugin/skills/code-review/references/review-checklist.md` +- `plugin/skills/code-review/references/agent-finding-classification.md` +- `plugin/skills/code-review/references/template.md` +- `plugin/references/yagni-rule.md` +- All nine dispatched agents under `plugin/agents/` + +--- + +## B1: Agent prompt templates have no placeholder for user focus areas + +Dimension: data flow. +Files: `plugin/skills/code-review/SKILL.md`. + +Step 1 of the skill ends with: +> If the user provided focus areas in their arguments, note them for use in Step 4. + +Step 3.5 then lists nine agent prompt templates. None of them carry a `{focus areas}`, `{user context}`, `{branch description}`, or `{PR description}` placeholder. The text directly says: +> Each agent's prompt has three parts: the domain-specific question, the calibration directive verbatim from Step 3.3, and the domain-scoped file list from Step 3.4. + +Focus areas are not one of those three parts. The user's stated context reaches only Step 4's manual review pass. The eight or nine specialist agents receive no instruction to weight or prioritize the focus surface. + +Impact: when a user writes `/code-review focus on auth middleware`, the agents review the full file list with equal weight. 
This explains PR 307 E17 to E22 (the implementation plan's WARN-003 resolution did not reach the reviewer) and PR 339 E14 to E17 (three named focus areas in the PR description were not addressed).
+
+---
+
+## B2: structural-analyst and behavioral-analyst have skeptic-default rules that conflict with the calibration directive
+
+Dimension: data flow.
+Files: `plugin/agents/structural-analyst.md`, `plugin/agents/behavioral-analyst.md`.
+
+`structural-analyst.md` says verbatim:
+> Default posture is skeptical: assume structural problems exist until proven otherwise.
+> When in doubt about whether something is a structural issue, include it: a false positive is cheaper than a missed risk.
+
+`behavioral-analyst.md` carries the same rule with "behavioral" substituted for "structural".
+
+The calibration directive from Step 3.3 says:
+> When uncertain about severity, prefer the lower severity. If the worst-case impact is "an operator sees an error and retries," that is not Critical.
+> Do not raise theoretical concerns the change does not touch.
+> Do not raise pre-existing best-practice gaps the change did not make worse.
+
+The agent body and the calibration directive collide. The agent body's rule sits in the agent's "Rules" section, the last words of the prompt, and is absolute; the calibration directive lives in the middle of the prompt and depends on the agent honoring conditional logic about change size. When two instructions conflict, the more absolute, more recent, more prominently positioned one wins.
+
+Impact: structural and behavioral findings flag pre-existing concerns that the change did not introduce. This is the primary structural mechanism behind PR 299 E1 to E7 (the structural-agent finding flood that produced SUGG-005, SUGG-006, SUGG-007).
+
+---
+
+## B3: Seven of nine agent rubrics in agent-finding-classification.md set "Most findings land here = WARN"
+
+Dimension: data flow.
+File: `plugin/skills/code-review/references/agent-finding-classification.md`.
+ +The rubric contains the phrase "Most findings land here" eight times. In seven of those, the phrase points at WARN: + +- test-engineer: "WARN: ... Most findings land here." +- edge-case-explorer: "WARN: ... Most findings land here." +- structural-analyst: "WARN: ... Most findings land here." +- behavioral-analyst: "WARN: ... Most findings land here." +- concurrency-analyst: "WARN: ... Most findings land here." +- data-engineer: "WARN: ... Most findings land here." +- devops-engineer: "WARN: ... Most findings land here." + +Only junior-developer puts "Most findings land here" at SUGG. + +The phrase creates a severity floor. When a finding is ambiguous between SUGG and WARN, the rubric pulls it up to WARN because WARN is the rubric's documented expected landing zone. This applies regardless of change size or calibration. + +Impact: across PRs 299, 307, and 339, almost every reclassification or won't-fix concerned a WARN finding that the user demoted to SUGG (PR 299 E11 to E17), would have demoted (PR 307 E12 to E16), or that should have been flagged for inconsistency (PR 339 E10 to E13). The severity inflation is structural. + +--- + +## B4: The calibration directive's "prefer lower severity" rule travels in the agent prompt but not into the classification step + +Dimension: data flow. +Files: `plugin/skills/code-review/SKILL.md`, `plugin/skills/code-review/references/agent-finding-classification.md`. + +Step 3.3 embeds in each agent's prompt: +> Severity calibration scales with size: +> - Small change: only Critical findings escalate. Raise Warnings only when the finding is directly introduced by this change. Omit Suggestions entirely. +> - Medium change: Critical and Warning findings escalate. Raise Suggestions only when directly introduced by this change. +> - Large change: all severities are in scope. + +Step 7 then says: +> Classify agent findings using the rubrics at [agent-finding-classification.md](references/agent-finding-classification.md). 
+ +The rubric does not reference Step 3.3. It does not apply the size-based demotion. Once an agent writes a finding to its output file, Step 7's rubric reclassifies it without applying the change-size demotion logic. The "Small change: omit Suggestions entirely" rule is never enforced at the step where the final severity is assigned. + +Impact: for small and medium changes, suggestions get classified as warnings because the rubric's "Most findings land here = WARN" applies without the size filter that would have demoted them. This compounds with B3. + +--- + +## B5: Step 4 applies the review checklist to whole files in Modes B and C + +Dimension: data flow. +File: `plugin/skills/code-review/SKILL.md`. + +Step 4 sub-step 4: +> Examine the diff to understand what changed. If no diff is available (Mode B uncommitted review or Mode C non-git review from Step 1), skip this sub-step: the full file read from sub-step 3 provides all necessary context. Apply the review checklist to the entire file content. + +In Mode B (uncommitted) and Mode C (no git), the checklist is explicitly applied to the entire file. The checklist has no clause limiting any category to changed lines. + +Impact: in non-Mode-A reviews, the skill cannot distinguish introduced code from pre-existing code, so every category of the checklist applies to everything in the file. This is the structural source of "goes too deep" for non-branch reviews. + +--- + +## B6: YAGNI is applied as pattern-match against named anti-patterns, not as the evidence-test gate the rule mandates + +Dimension: data flow. +Files: `plugin/skills/code-review/references/review-checklist.md`, `plugin/references/yagni-rule.md`. 
+
+The review checklist enumerates anti-patterns as bullet triggers:
+> New abstraction (interface, base class, port, adapter) introduced for code with one current concrete implementation
+> Defensive guard (null check, type check, validation) added at a trusted internal boundary the caller fully controls
+> ...
+
+The yagni-rule.md defines YAGNI through two gates, where Gate 1 is the evidence test:
+> Any committed item must cite at least one piece of evidence that it is needed now. Acceptable evidence: a user-described need, a named direct dependency, an existing production code path or contract that will break without it, a regulatory or compliance rule that demonstrably applies, a documented incident or measured metric.
+
+The checklist does not embed the affirmative evidence test. A reviewer who pattern-matches an anti-pattern without running the evidence test flags items that should not be flagged (the user described the demoability need in PR 307 but YAGNI fired anyway, see PR 307 E8 to E11) and misses items that should be flagged (PR 339 E5, E6 where the plan's explicit YAGNI rejection went unsurfaced).
+
+Impact: YAGNI over-applies in one direction (legitimate items are flagged because they match a structural anti-pattern yet would pass the evidence test) and under-applies in the other (speculative items that don't match a named anti-pattern slip through).
+
+---
+
+## B7: junior-developer and edge-case-explorer agents have protocol instructions to read code beyond the scoped file list
+
+Dimension: integration boundaries.
+Files: `plugin/agents/junior-developer.md`, `plugin/agents/edge-case-explorer.md`.
+
+`junior-developer.md` Protocol 4 says:
+> Read, in this order: CLAUDE.md at repo root, any project-discovery.md or equivalent, coding standards, ADRs, and patterns in code adjacent to what the artifact will change.
+
+"Patterns in code adjacent" is unbounded.
+
+`edge-case-explorer.md` Protocol 1 says:
+> Find callers and consumers.
Use Grep to search for every call site of the target code's public functions. Read the callers to understand what values they actually pass. + +This is a repo-wide grep. The agent's file list scopes what it analyzes, but the protocol pushes it to follow call sites outward. + +Impact: out-of-scope findings flow back into the review as if they were findings about the changed files. This combined with B2 explains how the skill produces findings about untouched code surfaces. + +--- + +## B8: YAGNI can be applied to entire file contents in Mode C, including code that predates the change + +Dimension: data flow. +Files: `plugin/skills/code-review/SKILL.md`, `plugin/skills/code-review/references/review-checklist.md`, `plugin/references/yagni-rule.md`. + +SKILL.md Review Constraints section: +> Apply the evidence-based YAGNI rule from references/yagni-rule.md to every change in the diff. + +But in Mode C there is no diff. Step 4 still applies the checklist (including the YAGNI bullet) to the entire file. yagni-rule.md says: +> A YAGNI finding identifies code introduced by this change that has no evidence of being needed now. + +Without a diff, the skill cannot distinguish introduced code from pre-existing code. The checklist's YAGNI bullet fires on whatever pattern-matches, regardless of when the code was written. + +Impact: YAGNI surfaces for code that predates the current change in Mode C. Combined with B6, this means YAGNI both under-applies (by missing legitimate candidates) and over-applies (against pre-existing code). 
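+
+The two-pass ordering B6 and B8 argue for (Gate 1 evidence test first, named anti-pattern match second) can be sketched as follows. The evidence categories paraphrase the `yagni-rule.md` quote above; the item shape and function are illustrative:
+
```python
# Sketch: Pass 1 runs the evidence test from yagni-rule.md Gate 1; only
# items with no acceptable evidence proceed to Pass 2's anti-pattern match.
ACCEPTABLE_EVIDENCE = {
    "user-described need",
    "named direct dependency",
    "breaking production path or contract",
    "applicable compliance rule",
    "documented incident or measured metric",
}

def yagni_finding(item):
    # Pass 1: any cited acceptable evidence clears the item outright.
    if any(e in ACCEPTABLE_EVIDENCE for e in item.get("evidence", [])):
        return None
    # Pass 2: evidence-free items are flagged only on a named anti-pattern.
    if item.get("anti_pattern"):
        return (f"YAGNI: {item['name']} matches anti-pattern "
                f"'{item['anti_pattern']}' with no evidence of current need")
    return None

demoable = {"name": "placeholder sidebar", "evidence": ["user-described need"],
            "anti_pattern": "speculative placeholder"}
speculative = {"name": "single-use adapter", "evidence": [],
               "anti_pattern": "abstraction with one concrete implementation"}
```
+
+Run in this order, the PR 307 placeholder (user-described need) is cleared in Pass 1 and never reaches the anti-pattern match, while an evidence-free single-implementation abstraction is still flagged.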
+ +--- + +## Summary of structural failures + +| Symptom | Direct cause | Findings | +|---------|--------------|----------| +| Goes too deep | Skeptic-default agent rules (B2) + protocol-driven scope expansion (B7) + whole-file checklist application in Modes B and C (B5, B8) | PR 299 E1-E7; PR 307 E1-E7; PR 339 E1-E3 | +| YAGNI under-applies | Pattern-match-only application bypasses the evidence test (B6, B8) | PR 299 E8-E10; PR 307 E8-E11; PR 339 E5-E8 | +| Severity inflation | "Most findings land here = WARN" floor (B3) + Step 7 classification ignores the calibration directive (B4) | PR 299 E11-E17; PR 307 E12-E16; PR 339 E9-E13 | +| Context not forwarded | No focus-area placeholder in any agent prompt template (B1) | PR 299 E18-E23; PR 307 E17-E22; PR 339 E14-E17 | + +Each symptom traces to at least one specific section of the skill where the structural cause is locatable. diff --git a/docs/plans/code-review-guardrails/feature-implementation-plan.md b/docs/plans/code-review-guardrails/feature-implementation-plan.md new file mode 100644 index 0000000..1e01081 --- /dev/null +++ b/docs/plans/code-review-guardrails/feature-implementation-plan.md @@ -0,0 +1,276 @@ +# Feature Implementation Plan: Code-Review Skill Guardrails + +This plan implements the four-symptom calibration fix to the `/code-review` skill. The implementation posture is **a single feature branch landing the changes in three sequenced commits (one per atomic pair)**, validated against three real PR bundles and shipped as a minor version bump. + +## Source Specification + +- **Feature specification:** [investigation.md](investigation.md). The investigation document acts as the source specification for this branch. It commits S1 through S13 with file locations and the C# causes each solution addresses. +- **Specification decision log:** none. 
No prior `plan-a-feature` decision log exists for this work; the investigation captures its own adversarial-validation findings (V1 through V9) inside the same document. +- **Specification team findings:** none. No prior `plan-a-feature` team-findings file exists. +- **Specification decisions this plan inherits:** S1 through S13 from `investigation.md` (recorded as trivial decisions D-19, D-20, D-21 in the [decision log](artifacts/implementation-decision-log.md) and as work units in Decomposition and Sequencing below). +- **Specification open items this plan must respect or resolve:** none. The investigation's "What the investigation does not cover" section names three deferred topics (test plan, cross-project validation, implementation sequence). All three are resolved in this plan: the test plan lives in Testing Strategy, cross-project validation is captured as test TP-17, and the implementation sequence is captured in Decomposition and Sequencing. + +## Outcome + +When this plan is executed, `/code-review` no longer requires a second-pass reclassification run to produce calibrated output. The first pass produces ≤ 4 warnings on PR 299 (matching the second-pass result), 1 warning and 1 YAGNI question on PR 307, and 1 critical plus at most 1 warning on PR 339, with internal detection of the WARN-002/WARN-003 contradiction. User-provided focus areas and branch-level context (PR description, planning artifacts, commit messages) reach every sub-agent. `plugin/.claude-plugin/plugin.json` reports version 2.3.0 and `CHANGELOG.md` records the calibration changes. + +## Context + +- **Driving constraint:** the user has been running a manual second-pass reclassification on every `/code-review` invocation, with explicit mode flags ("WARN-justified, SUGG suppressed per reviewer instruction"), because the first-pass default behavior produces severity inflation. 
The workaround is dated, documented, and unsustainable; the fix needs to land at the skill level ([D-1](artifacts/implementation-decision-log.md#d-1-step-33-is-the-authoritative-home-for-size-based-demotion)). +- **Stakeholders:** + - The single primary user who invokes `/code-review` and the derivative `/gh-pr-review` skill. Success looks like the first pass producing the calibrated output the second pass currently produces. + - Future operators of any project that adopts the `han` plugin. Success looks like the calibration rules holding across projects, not just the one project that supplied the three PR bundles (validated by TP-17). + - Maintainers of the four agents that dispatch from skills other than `/code-review` (gap-analysis, plan-implementation, plan-a-feature, investigate). Success looks like the agent definitions remaining unchanged, with `/code-review`'s tailoring carried in its own dispatcher prompts ([D-4](artifacts/implementation-decision-log.md#d-4-s3-and-s4-ship-as-step-35-dispatcher-directives-not-as-edits-to-the-four-agent-definition-files)). +- **Future-state concern:** the merged Step 7.2 reachability phrase-match gate ([D-3](artifacts/implementation-decision-log.md#d-3-merge-s2-and-s10-into-a-single-step-72-reachability-phrase-match-gate)) depends on agents continuing to use the listed phrases in their finding rationale. If agent prompts evolve to suppress these phrases, the gate stops firing. Watch for this drift in post-ship validation runs. +- **Out-of-scope boundary:** + - No edits to the four affected agent definition files. The agents remain general-purpose for callers other than `/code-review` ([D-4](artifacts/implementation-decision-log.md#d-4-s3-and-s4-ship-as-step-35-dispatcher-directives-not-as-edits-to-the-four-agent-definition-files)). + - No automated test harness. The test plan is manual and uses the three existing PR bundles as fixtures (see Testing Strategy and the Deferred section). 
+ - No cross-file semantic contradiction detection in S8; only overlapping line ranges in a single file are checked ([D-12](artifacts/implementation-decision-log.md#d-12-s8-runs-an-extraction-pass-before-pair-comparison-scoped-to-overlapping-line-ranges-only)). + - No structured "directly introduced" field added to agent output formats; phrase-matching is the implementation ([D-3](artifacts/implementation-decision-log.md#d-3-merge-s2-and-s10-into-a-single-step-72-reachability-phrase-match-gate)). + - No S12 mode flag for default SUGG suppression; the rubric and Step 7 gate subsume it ([D-16](artifacts/implementation-decision-log.md#d-16-defer-s12-default-suppression-for-small-changes-as-yagni)). + +## Team Composition and Participation + +| Specialist | Status | Key Input | +|------------|--------|-----------| +| `project-manager` | Coordinator | Facilitated R1 with four specialists in parallel; resolved eight Open Questions via evidence, reframing, and reasonable-call defaults; synthesized this plan. | +| `software-architect` | Active | SA-1 through SA-6: established Step 3.3 as the authoritative home (D-1), specified Step 7 sub-step structure (D-2), scoped S3/S4 as dispatcher directives (D-4), defined atomic shipping pairs (D-5), and flagged S12 as YAGNI candidate (D-16). | +| `behavioral-analyst` | Active | BA-1 through BA-9: defined named bindings (D-14), specified S6 fail-open warning (D-9) and planning-directory lookup (D-10), supplied the reachability phrase list for the merged Step 7.2 gate (D-3), and defined the S8 extraction pass (D-12). | +| `test-engineer` | Active | TP-1 through TP-21: produced the P0/P1/P2 manual test catalog (covered in Testing Strategy); deferred automated harness, per-agent unit tests, and Mode C standalone tests as YAGNI. 
| +| `junior-developer` | Reframer | JD-001 through JD-015: surfaced docs-mirror scope (D-7), version-bump scope (D-6), the `allowed-tools` frontmatter gap (D-8), the three-location S7 rewrite (D-11), the narrower S4 wording for `edge-case-explorer` (D-18), the em-dash strip policy (D-17), the architectural-file-read requirement in S9 (D-13), and the reframing that resolved OQ-3 and OQ-7. | +| `adversarial-validator` | Stood down | Investigation already includes V1 through V9 adversarial findings; no further validation pass needed at plan stage. | +| `devops-engineer` | Not engaged | Plugin is markdown-only; no observability, SLO, or rollout surface ([Operational Readiness](#operational-readiness) is thin). | +| `adversarial-security-analyst` | Not engaged | No auth, PII, IO, or supply-chain surface in markdown skill edits. | +| `user-experience-designer` | Not engaged | No user-facing UX surface; the skill is a CLI-invoked plugin. | + +Full round-by-round detail lives in [artifacts/implementation-iteration-history.md](artifacts/implementation-iteration-history.md). + +## Implementation Approach + +### Architecture and Integration Points + +The plugin is markdown-only, distributed via the Test Double marketplace, with no runtime or build step. "Implementation" means edits to Markdown skill bodies, reference files, and operator docs. Eight plugin files and five operator docs are touched. + +**Plugin surface:** +- `plugin/skills/code-review/SKILL.md`, primary skill body. 
Receives edits at: frontmatter `allowed-tools` (add `Bash(gh *)`, [D-8](artifacts/implementation-decision-log.md#d-8-add-bashgh--to-skillmd-allowed-tools-frontmatter)); Review Constraints line 24 (rewrite to size-aware per S13); Step 1 (record `$focus_areas` binding, [D-14](artifacts/implementation-decision-log.md#d-14-focus_areas-and-branch_context-are-explicit-named-bindings-populated-by-steps-1-and-15)); Step 1.5 (new, S6 with [D-9](artifacts/implementation-decision-log.md#d-9-s6-fail-open-behavior-prints-a-visible-warning-when-no-pr-or-branch-context-loads) fail-open and [D-10](artifacts/implementation-decision-log.md#d-10-s6-planning-directory-lookup-uses-a-structured-claudemd-key-with-glob-fallback) planning lookup); Step 3.1 (size detection, the authoritative `{size}` write per [D-15](artifacts/implementation-decision-log.md#d-15-size-value-is-read-from-step-31-all-five-consumer-sites-reference-step-31-explicitly)); Step 3.3 (calibration directive becomes the authoritative home per [D-1](artifacts/implementation-decision-log.md#d-1-step-33-is-the-authoritative-home-for-size-based-demotion); also receives the rewritten YAGNI two-pass procedure per [D-11](artifacts/implementation-decision-log.md#d-11-s7-updates-three-yagni-bearing-locations-in-skillmd-not-two)); Step 3.5 (add `$focus_areas` and `$branch_context` blocks per S5, plus the dispatcher directives for S3 and S4 per [D-4](artifacts/implementation-decision-log.md#d-4-s3-and-s4-ship-as-step-35-dispatcher-directives-not-as-edits-to-the-four-agent-definition-files) and [D-18](artifacts/implementation-decision-log.md#d-18-s4s-constraint-is-narrower-for-edge-case-explorer-than-for-junior-developer)); Step 4 (Mode B/C scope note per S11); Step 5 (premise-verification per [D-13](artifacts/implementation-decision-log.md#d-13-s9-requires-reading-at-least-one-architectural-file-before-raising-a-violates-standard-x-finding)); Step 7 (three-sub-step rewrite per 
[D-2](artifacts/implementation-decision-log.md#d-2-step-7-takes-a-numbered-sub-step-structure-71-read-72-merged-demote-73-rubric) with merged 7.2 gate per [D-3](artifacts/implementation-decision-log.md#d-3-merge-s2-and-s10-into-a-single-step-72-reachability-phrase-match-gate)); Step 9 (S8 self-consistency check with extraction pass per [D-12](artifacts/implementation-decision-log.md#d-12-s8-runs-an-extraction-pass-before-pair-comparison-scoped-to-overlapping-line-ranges-only)). +- `plugin/skills/code-review/references/agent-finding-classification.md`, rubric. Receives the size-aware rewrite per S1 / [D-19](artifacts/implementation-decision-log.md#trivial-decisions) across seven sections that currently read "Most findings land here." +- `plugin/skills/code-review/references/review-checklist.md`, receives the two-pass YAGNI rewrite per S7 / [D-11](artifacts/implementation-decision-log.md#d-11-s7-updates-three-yagni-bearing-locations-in-skillmd-not-two). +- `plugin/.claude-plugin/plugin.json`, version bump from 2.2.0 to 2.3.0 per [D-6](artifacts/implementation-decision-log.md#d-6-version-bump-is-minor-220--230). +- `CHANGELOG.md`, new `## 2.3.0` entry per [D-6](artifacts/implementation-decision-log.md#d-6-version-bump-is-minor-220--230). + +**Docs surface** (per [D-7](artifacts/implementation-decision-log.md#d-7-long-form-docs-mirror-updates-are-in-scope)): +- `docs/skills/code-review.md`, full mirror of the skill behavior changes. +- `docs/agents/structural-analyst.md`, `docs/agents/behavioral-analyst.md`, `docs/agents/junior-developer.md`, `docs/agents/edge-case-explorer.md`, one-paragraph note each, explaining that `/code-review` now tailors dispatch via Step 3.5 directives; the agent's default behavior outside `/code-review` is unchanged. 
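+
+The merged Step 7.2 gate referenced above can be sketched as follows. The phrase list and the CRIT to WARN to SUGG to omitted demotion chain come from D-3; the function shape and the single-step demotion are illustrative:
+
```python
# Sketch of the Step 7.2 reachability phrase-match demotion gate (D-3):
# a matching rationale demotes the finding one step along
# CRIT -> WARN -> SUGG -> omitted. Security findings are exempt because
# that agent's evidence standard already requires a demonstrated exploit
# path. The function shape is illustrative, not the skill's prose.
REACHABILITY_PHRASES = (
    "theoretical", "hypothetical", "defense-in-depth",
    "effectively impossible", "in case the upstream", "could happen",
    "should never happen", "edge case that does not occur",
)
DEMOTE = {"CRIT": "WARN", "WARN": "SUGG", "SUGG": None}  # None = omitted

def step_7_2(severity, rationale, agent):
    if agent == "adversarial-security-analyst":
        return severity  # exempt from the gate
    if any(p in rationale.lower() for p in REACHABILITY_PHRASES):
        return DEMOTE.get(severity, severity)
    return severity
```
+
+The gate's weakness is the one named in the future-state concern above: it fires only while agents keep using these phrases in their rationales.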
+ +**Files explicitly not touched:** +- `plugin/agents/structural-analyst.md`, `plugin/agents/behavioral-analyst.md`, `plugin/agents/junior-developer.md`, `plugin/agents/edge-case-explorer.md`, agent definitions remain as-is per [D-4](artifacts/implementation-decision-log.md#d-4-s3-and-s4-ship-as-step-35-dispatcher-directives-not-as-edits-to-the-four-agent-definition-files). +- `plugin/skills/code-review/references/template.md`, no change required. +- The `/gh-pr-review` skill, no direct edits; it depends transitively on `/code-review` output and inherits the behavior change. + +### Data Model and Persistence + +Not applicable. The plugin is markdown-only with no persistent data and no schema. + +### Runtime Behavior + +The new control flow inside `/code-review`: + +1. **Step 1: Detect review context.** Reads invocation arguments; binds `$focus_areas` from the user's free-form argument string (defaults to `none provided` when empty, [D-14](artifacts/implementation-decision-log.md#d-14-focus_areas-and-branch_context-are-explicit-named-bindings-populated-by-steps-1-and-15)). +2. **Step 1.5: Load branch context.** New. Attempts to load PR description (via `gh pr view`, requiring [D-8](artifacts/implementation-decision-log.md#d-8-add-bashgh--to-skillmd-allowed-tools-frontmatter)), local `pr-body` file, commit messages since the default branch, and an implementation plan from the planning directory (looked up via the structured CLAUDE.md key with Glob fallback, [D-10](artifacts/implementation-decision-log.md#d-10-s6-planning-directory-lookup-uses-a-structured-claudemd-key-with-glob-fallback)). Summarizes loaded content into a Branch Context block of at most 200 words and binds it to `$branch_context`. When nothing loads, prints the visible warning per [D-9](artifacts/implementation-decision-log.md#d-9-s6-fail-open-behavior-prints-a-visible-warning-when-no-pr-or-branch-context-loads) and binds `$branch_context` to `none provided`. +3. 
**Step 3.1: Detect change size.** Existing step; this is the authoritative write point for `{size}` per [D-15](artifacts/implementation-decision-log.md#d-15-size-value-is-read-from-step-31-all-five-consumer-sites-reference-step-31-explicitly). Every later consumer references Step 3.1 by name. +4. **Step 3.3: Calibration directive.** Rewritten as the authoritative home for size-based demotion per [D-1](artifacts/implementation-decision-log.md#d-1-step-33-is-the-authoritative-home-for-size-based-demotion). Contains: the size-based demotion rules (Small omits Suggestions, Medium demotes non-directly-introduced findings, Large keeps the rubric); the merged reachability phrase list referenced by Step 7.2; the two-pass YAGNI procedure (evidence test, then anti-pattern check) referenced by `review-checklist.md` per [D-11](artifacts/implementation-decision-log.md#d-11-s7-updates-three-yagni-bearing-locations-in-skillmd-not-two). +5. **Step 3.5: Dispatch agents.** Every agent prompt template gains two named blocks: `**Focus areas from the user.** $focus_areas.` and `**PR / branch context.** $branch_context.` Per [D-4](artifacts/implementation-decision-log.md#d-4-s3-and-s4-ship-as-step-35-dispatcher-directives-not-as-edits-to-the-four-agent-definition-files), the dispatch prompts for `structural-analyst` and `behavioral-analyst` add a default-severity-to-SUGG calibration line (S3 form); the dispatch prompts for `junior-developer` and `edge-case-explorer` add the file-list scoping line, with the narrower wording for `edge-case-explorer` per [D-18](artifacts/implementation-decision-log.md#d-18-s4s-constraint-is-narrower-for-edge-case-explorer-than-for-junior-developer). +6. **Step 4: Manual file-by-file review.** Receives the Mode B / Mode C scope note per S11. +7. 
**Step 5: Documentation compliance.** Before raising any "violates standard X" finding, reads at least one architectural file demonstrating the standard's premise per [D-13](artifacts/implementation-decision-log.md#d-13-s9-requires-reading-at-least-one-architectural-file-before-raising-a-violates-standard-x-finding). When the file read does not confirm the premise, logs "premise not verified; finding omitted" and proceeds. +8. **Step 7: Classify agent output.** Restructured into three numbered sub-steps per [D-2](artifacts/implementation-decision-log.md#d-2-step-7-takes-a-numbered-sub-step-structure-71-read-72-merged-demote-73-rubric): + - 7.1 Read agent output files. + - 7.2 Apply the merged reachability-phrase-match demotion gate. For each finding, scan the rationale text for any of the documented phrases (`theoretical`, `hypothetical`, `defense-in-depth`, `effectively impossible`, `in case the upstream`, `could happen`, `should never happen`, `edge case that does not occur`). Demote one severity on match; omit when SUGG would demote ([D-3](artifacts/implementation-decision-log.md#d-3-merge-s2-and-s10-into-a-single-step-72-reachability-phrase-match-gate)). + - 7.3 Apply the size-aware rubric from `agent-finding-classification.md` ([D-19](artifacts/implementation-decision-log.md#trivial-decisions) / S1), reading `{size}` from Step 3.1. +9. **Step 9: Verification.** Adds the S8 self-consistency check per [D-12](artifacts/implementation-decision-log.md#d-12-s8-runs-an-extraction-pass-before-pair-comparison-scoped-to-overlapping-line-ranges-only): extraction pass produces `{task-id, file-path, line-range, recommended-action-summary}` tuples, then a comparison pass flags overlapping-line-range pairs with contradictory recommended actions, demoting both and adding `Tension with {other-task-id}:` notes. + +### External Interfaces + +Not applicable. 
The plugin has no APIs, events, queues, or third-party integrations beyond the existing `gh` CLI dependency added by S6 ([D-8](artifacts/implementation-decision-log.md#d-8-add-bashgh--to-skillmd-allowed-tools-frontmatter)). The `gh` dependency is a tool grant in the skill frontmatter, not a network or service integration. + +## Decomposition and Sequencing + +Work units are grouped into the three atomic pairs from [D-5](artifacts/implementation-decision-log.md#d-5-atomic-shipping-pairs-and-sequencing), then additive guardrails, then docs and version bump. Each unit ships as one commit. The pair groupings are load-bearing: shipping half of a pair is a known regression. + +| # | Work Unit | Delivers | Depends On | Verification | +|---|-----------|----------|------------|--------------| +| 1 | **Pair A: calibrate severity baseline.** Rewrite `agent-finding-classification.md` per S1 ([D-19](artifacts/implementation-decision-log.md#trivial-decisions)) and SKILL.md line 24 per S13. Establish Step 3.3 as authoritative home ([D-1](artifacts/implementation-decision-log.md#d-1-step-33-is-the-authoritative-home-for-size-based-demotion)). Add the `{size}` cross-reference to Step 3.1 ([D-15](artifacts/implementation-decision-log.md#d-15-size-value-is-read-from-step-31-all-five-consumer-sites-reference-step-31-explicitly)). | Size-aware rubric for agent findings; size-aware Review Constraints rule for manual findings. | none | TP-3 (PR 307 1 WARN + 1 YAGNI on rerun); TP-7 (S1 size-aware rubric); TP-21 (caps and template preservation). | +| 2 | **Pair B: merged demotion gate.** Restructure Step 7 into 7.1/7.2/7.3 sub-steps ([D-2](artifacts/implementation-decision-log.md#d-2-step-7-takes-a-numbered-sub-step-structure-71-read-72-merged-demote-73-rubric)). Insert the merged S2+S10 reachability phrase-match gate at 7.2 ([D-3](artifacts/implementation-decision-log.md#d-3-merge-s2-and-s10-into-a-single-step-72-reachability-phrase-match-gate)). 
| First-pass severity output matches second-pass output for PR 299. | Pair A (rubric must be size-aware before Step 7.3 calls it). | TP-1 (PR 299 ≤ 4 WARN on rerun); TP-8 (no double-demotion between 7.2 and 7.3); TP-15 (reachability demotion fires on phrase list). | +| 3 | **Pair C: context plumbing and dispatcher directives.** Add `$focus_areas` and `$branch_context` named bindings ([D-14](artifacts/implementation-decision-log.md#d-14-focus_areas-and-branch_context-are-explicit-named-bindings-populated-by-steps-1-and-15)). Add Step 1.5 with the S6 loader ([D-9](artifacts/implementation-decision-log.md#d-9-s6-fail-open-behavior-prints-a-visible-warning-when-no-pr-or-branch-context-loads), [D-10](artifacts/implementation-decision-log.md#d-10-s6-planning-directory-lookup-uses-a-structured-claudemd-key-with-glob-fallback)). Add `Bash(gh *)` to `allowed-tools` ([D-8](artifacts/implementation-decision-log.md#d-8-add-bashgh--to-skillmd-allowed-tools-frontmatter)). Add the focus-area and branch-context blocks to every Step 3.5 prompt template ([D-20](artifacts/implementation-decision-log.md#trivial-decisions) / S5). Add the dispatcher directives for S3 (default SUGG for structural/behavioral) and S4 (file-list scoping with [D-18](artifacts/implementation-decision-log.md#d-18-s4s-constraint-is-narrower-for-edge-case-explorer-than-for-junior-developer) narrower wording for edge-case-explorer) per [D-4](artifacts/implementation-decision-log.md#d-4-s3-and-s4-ship-as-step-35-dispatcher-directives-not-as-edits-to-the-four-agent-definition-files). | User-provided focus areas and branch-level context reach every sub-agent; `structural-analyst` and `behavioral-analyst` default to SUGG output; `junior-developer` and `edge-case-explorer` constrain findings to the scoped file list. | Pair A (S3 has no observable effect until S1's rubric is size-aware per BA-7). 
| TP-2 (PR 299 Sentry deferred item not re-raised); TP-5 (focus areas in agent prompts); TP-6 (branch context block populates); TP-11 (S3 default SUGG); TP-12 (S4 file-list scope). | +| 4 | **S7 two-pass YAGNI.** Rewrite the YAGNI section in `review-checklist.md`, Step 3.3 calibration directive, and SKILL.md Review Constraints section (lines 29–41) per [D-11](artifacts/implementation-decision-log.md#d-11-s7-updates-three-yagni-bearing-locations-in-skillmd-not-two). | YAGNI runs evidence test first, then anti-pattern check, at all three sites. | Pair A (Review Constraints rewrite must align with the size-aware rule landed in Pair A). | TP-13 (S7 two-pass YAGNI); TP-20 (YAGNI verbatim opening preserved). | +| 5 | **S8 self-consistency check.** Add the S8 extraction pass and the overlapping-line-range comparison pass per [D-12](artifacts/implementation-decision-log.md#d-12-s8-runs-an-extraction-pass-before-pair-comparison-scoped-to-overlapping-line-ranges-only). | Internal detection of contradictory same-file findings. | Pair B (the demotion gate must fire first; otherwise S8 sees pre-demotion severities). | TP-4 (PR 339 1 CRIT + contradiction detected); TP-14 (S8 tension notes emitted). | +| 6 | **S9 premise verification.** Add the architectural-file-read requirement to Step 5 per [D-13](artifacts/implementation-decision-log.md#d-13-s9-requires-reading-at-least-one-architectural-file-before-raising-a-violates-standard-x-finding). | Standards-compliance findings only raised when an architectural file confirms the standard's premise. | none | TP-9 (S9 premise verification). | +| 7 | **S11 Mode B/C scope note.** Add the Mode B / Mode C scope-limitation note to Step 4 per [D-21](artifacts/implementation-decision-log.md#trivial-decisions). | YAGNI skipped in Mode B and Mode C unless requested. | none | TP-16 (Mode B/C YAGNI skip). 
| +| 8 | **Docs sync.** Update `docs/skills/code-review.md` and the four `docs/agents/*.md` files per [D-7](artifacts/implementation-decision-log.md#d-7-long-form-docs-mirror-updates-are-in-scope). Apply the em-dash strip policy per [D-17](artifacts/implementation-decision-log.md#d-17-em-dash-strip-policy-on-verbatim-copy-from-the-investigation). | Operator-facing docs match shipped plugin behavior. | Pairs A, B, C, plus units 4 through 7 (the docs describe the final behavior). | Visual review against the final SKILL.md and against `docs/writing-voice.md`. | +| 9 | **Version bump and CHANGELOG.** Bump `plugin/.claude-plugin/plugin.json` from 2.2.0 to 2.3.0 ([D-6](artifacts/implementation-decision-log.md#d-6-version-bump-is-minor-220--230)). Add `## 2.3.0` entry to `CHANGELOG.md` summarizing the four calibration changes, docs sync, and deferred items. | Released artifact under 2.3.0. | All previous units. | Spot-check the CHANGELOG entry matches the actual edits. | + +## RAID Log + +### Risks + +| ID | Risk | Likelihood | Severity | Blast Radius | Reversibility | Owner | Mitigation | +|----|------|------------|----------|--------------|---------------|-------|------------| +| R1 | The phrase-match gate at Step 7.2 ([D-3](artifacts/implementation-decision-log.md#d-3-merge-s2-and-s10-into-a-single-step-72-reachability-phrase-match-gate)) misses real reachability cases that a structured "directly introduced" field would have caught. | Medium | Medium | `/code-review` and `/gh-pr-review` outputs. | Reversible by widening the phrase list or adding the structured field. | `behavioral-analyst` | Post-ship validation on PR 299, 307, 339; reopen-trigger documented in [D-3](artifacts/implementation-decision-log.md#d-3-merge-s2-and-s10-into-a-single-step-72-reachability-phrase-match-gate). | +| R2 | Calibration mechanisms validated only on one project (gearjot-v2-web). Cross-project failure modes possible. | Medium | Medium | All users of `/code-review`. 
| Reversible by adjusting Step 3.3 rules. | `project-manager` | TP-17 (cross-project validation on a different project's PR) is a P2 test that gates a confidence upgrade, not the initial ship. | +| R3 | The `gh` CLI is not installed in some user environments, so S6's loader fails to retrieve PR descriptions. | High | Low | Mode A users without `gh` installed. | Already mitigated by D-9's fail-open warning. | `behavioral-analyst` | [D-9](artifacts/implementation-decision-log.md#d-9-s6-fail-open-behavior-prints-a-visible-warning-when-no-pr-or-branch-context-loads) prints visible warning and proceeds. | +| R4 | Agents stop using the phrases in the Step 7.2 list as their prompts evolve. The gate stops firing without anyone noticing. | Low | Medium | Severity inflation regresses silently. | Reversible by re-tuning the phrase list. | `behavioral-analyst` | Captured as the Future-state concern in Context; post-ship validation rounds should spot-check the phrase distribution in agent output. | + +### Assumptions + +| ID | Assumption | What Changes If Wrong | Verifier | Status | +|----|------------|-----------------------|----------|--------| +| A1 | The reachability phrase list ([D-3](artifacts/implementation-decision-log.md#d-3-merge-s2-and-s10-into-a-single-step-72-reachability-phrase-match-gate)) covers the PR 299, 307, 339 demotion targets. | Some demotion targets remain at WARN on rerun. | TP-1, TP-3, TP-4 acceptance runs. | Unverified pending unit-2 completion. | +| A2 | The structured CLAUDE.md key `plans:` or `planning:` under `## Project Discovery` ([D-10](artifacts/implementation-decision-log.md#d-10-s6-planning-directory-lookup-uses-a-structured-claudemd-key-with-glob-fallback)) is a convention this repo and at least one other project will adopt. | Glob fallback carries more weight than expected. | Cross-project validation (TP-17). | Unverified. 
| +| A3 | The five docs mirror updates fit within one commit's worth of writing time after the plugin edits land. | Unit 8 takes longer than budgeted. | Spot-check after unit 7 completes. | Unverified. | +| A4 | The PR bundles in `tmp/gearjot-v2-web-pr-{299,307,339}/` remain accessible during validation runs and are not modified between units. | Tests run against shifted ground truth. | Visual check before each rerun. | Verified at synthesis. | + +### Issues + +| ID | Issue | Owner | Next Step | +|----|-------|-------|-----------| +| (none open at synthesis) | n/a | n/a | n/a | + +### Dependencies + +| ID | Dependency | Owner | Status | +|----|------------|-------|--------| +| Dep1 | `gh` CLI installed in user environment for S6's PR-description load. | User environment. | Optional dependency mitigated by [D-9](artifacts/implementation-decision-log.md#d-9-s6-fail-open-behavior-prints-a-visible-warning-when-no-pr-or-branch-context-loads) fail-open. | +| Dep2 | The three `tmp/gearjot-v2-web-pr-*/` fixture directories. | This branch. | Present in working directory; used for TP-1 through TP-4 acceptance tests. | +| Dep3 | `docs/writing-voice.md` voice rule for the em-dash strip policy. | Repo-wide. | Already canonical. | + +## Testing Strategy + +All tests are manual. The fixtures are the existing PR bundles under `tmp/gearjot-v2-web-pr-{299,307,339}/`; no new fixture files are created. The automated test harness, per-agent unit tests, and Mode C standalone tests are deferred as YAGNI (see Deferred section). + +- **Observable behaviors to test:** + - Severity inflation eliminated on the three PR bundles. + - Focus areas and branch-level context reach every sub-agent prompt. + - Step 7.2 reachability gate fires on the documented phrases. + - S8 self-consistency check detects same-file overlapping-line-range contradictions. + - S9 premise verification suppresses standards-compliance findings when the architectural-file read does not confirm the premise. 
+ - Mode B and Mode C skip YAGNI unless explicitly requested. +- **Test doubles posture:** none. All tests run the live `/code-review` skill against real PR bundles. +- **Edge cases requiring coverage:** the WARN-002/WARN-003 contradiction class (TP-4); the SEC-finding cross-reference preservation across the new gates (TP-18); data-isolation regression where focus areas from one invocation leak to another (TP-19); template structure preservation (TP-21). + +**P0 acceptance tests (5):** +- TP-1: rerun `/code-review` against PR 299; assert ≤ 4 WARN findings. +- TP-2: rerun against PR 299; assert the Sentry-deferred PII item from the PR description's Deferred section is not re-raised. +- TP-3: rerun against PR 307; assert 1 WARN + 1 YAGNI finding. +- TP-4: rerun against PR 339; assert 1 CRIT + at most 1 WARN, and assert the WARN-002/WARN-003 contradiction is internally detected with `Tension with ...:` notes. +- TP-18: rerun against PR 299; assert the SEC-finding cross-reference structure is preserved through the new gates. + +**P1 behavior tests (12):** +- TP-5 through TP-15: focus areas appear in prompts; branch context block populates; S1 size-aware rubric fires; S2/S10 merged gate produces no double-demotion; S3 default SUGG observed; S4 file-list scope observed (including the narrower [D-18](artifacts/implementation-decision-log.md#d-18-s4s-constraint-is-narrower-for-edge-case-explorer-than-for-junior-developer) wording for edge-case-explorer); S7 two-pass YAGNI; S8 tension notes emitted; S9 premise verification logs "premise not verified; finding omitted" when applicable; S10 reachability demotion on phrase list; one reserved. +- TP-19: data-isolation regression, two consecutive invocations with different `$focus_areas` produce non-overlapping focus-area content in their respective agent prompts. +- TP-20: the YAGNI verbatim opening preserved across the three-location rewrite. +- TP-21: template structure (caps, agent dispatch format, output schema) preserved. 
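
TP-8 and TP-15 both probe the mechanics of the Step 7.2 gate. As a minimal sketch of the logic under test, assuming findings arrive as dicts with `severity`, `rationale`, and `category` keys (these names are illustrative; the shipped gate is a prose rule in SKILL.md, not code):

```python
# Sketch of the Step 7.2 reachability phrase-match demotion gate.
# Data shape is a hypothetical: dicts with "severity", "rationale",
# and "category" keys standing in for the agent output files.

REACHABILITY_PHRASES = [
    "theoretical", "hypothetical", "defense-in-depth",
    "effectively impossible", "in case the upstream",
    "could happen", "should never happen",
    "edge case that does not occur",
]

SEVERITY_ORDER = ["CRIT", "WARN", "SUGG"]  # demotion moves rightward

def apply_gate(finding):
    """Demote one severity step on a phrase match; omit past SUGG.

    Security findings pass through untouched: the security agent's
    evidence standard already requires a demonstrated exploit path.
    Returns the (possibly demoted) finding, or None when omitted.
    """
    if finding.get("category") == "security":
        return finding
    rationale = finding["rationale"].lower()
    if not any(phrase in rationale for phrase in REACHABILITY_PHRASES):
        return finding
    idx = SEVERITY_ORDER.index(finding["severity"])
    if idx + 1 >= len(SEVERITY_ORDER):
        return None  # a SUGG match would demote off the scale: omit
    demoted = dict(finding)
    demoted["severity"] = SEVERITY_ORDER[idx + 1]
    return demoted
```

The gate fires at most once per finding here; any further adjustment belongs to the 7.3 rubric, which is the separation TP-8's no-double-demotion check is meant to guard.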
+ +**P2 tests (2):** +- TP-16: Mode B / Mode C YAGNI skip behavior. +- TP-17: cross-project validation, run the calibrated skill against a PR from a project other than gearjot-v2-web; document any divergence from the three PR bundles' behavior. + +**Test levels:** integration (the skill is the unit under test). No unit-level tests of individual rubric sections, individual agent prompts, or Step 7 sub-steps as isolated functions; these are exercised by the integration runs. + +## Security Posture + +No applicable security surface. The implementation is markdown edits to a Claude Code plugin with no auth, no PII flow, no IO sinks, and no supply-chain change beyond the existing in-repo references. The only new tool grant is `Bash(gh *)` ([D-8](artifacts/implementation-decision-log.md#d-8-add-bashgh--to-skillmd-allowed-tools-frontmatter)), which is a narrowly scoped delegation to a CLI the user already has installed for their workflow. + +## Operational Readiness + +The plugin is markdown-only with no runtime, no observability surface, no SLO touchpoints, and no traditional rollout / rollback. Operational readiness is thin and limited to release mechanics: + +- **Observability:** not applicable. No metrics, logs, or traces; the skill's "observability" is the rendered output of each `/code-review` run, validated by the manual tests above. +- **SLO impact:** not applicable. No SLOs exist for the plugin. +- **Feature flag:** not applicable. The behavior change is shipped directly; the user has been compensating with mode flags for the unwanted prior behavior, and the fix removes the need for the workaround. +- **Rollout:** ship the changes in the nine work-unit sequence above on a feature branch, validate against the three PR bundles, merge to `main`, bump version, update CHANGELOG. +- **Rollback:** revert the merge commit. The plugin has no migration to unwind. +- **Cost and scale:** not applicable. 
The new Step 1.5 makes at most one `gh pr view` call and one Glob, both of which are sub-second operations. +- **Transitive dependency callout:** the derivative `/gh-pr-review` skill inherits the behavior change. No edits to that skill are required; it consumes `/code-review`'s output through the same interface. + +## Definition of Done + +- [ ] TP-1 passes: `/code-review` against PR 299 produces ≤ 4 WARN findings on rerun. +- [ ] TP-2 passes: PR 299's Sentry-deferred PII item is not re-raised. +- [ ] TP-3 passes: PR 307 produces 1 WARN + 1 YAGNI finding. +- [ ] TP-4 passes: PR 339 produces 1 CRIT + at most 1 WARN, with the WARN-002/WARN-003 contradiction internally detected via the S8 tension notes ([D-12](artifacts/implementation-decision-log.md#d-12-s8-runs-an-extraction-pass-before-pair-comparison-scoped-to-overlapping-line-ranges-only)). +- [ ] TP-18 passes: SEC-finding cross-reference structure preserved through the new gates. +- [ ] Step 3.3 is the single authoritative home for size-based demotion; line 24, Step 7.2, the rubric, and the YAGNI two-pass procedure all reference Step 3.3 by name ([D-1](artifacts/implementation-decision-log.md#d-1-step-33-is-the-authoritative-home-for-size-based-demotion)). +- [ ] Step 7 is structured as 7.1 / 7.2 / 7.3 sub-steps with the merged demotion gate at 7.2 ([D-2](artifacts/implementation-decision-log.md#d-2-step-7-takes-a-numbered-sub-step-structure-71-read-72-merged-demote-73-rubric), [D-3](artifacts/implementation-decision-log.md#d-3-merge-s2-and-s10-into-a-single-step-72-reachability-phrase-match-gate)). +- [ ] `Bash(gh *)` added to SKILL.md frontmatter `allowed-tools` ([D-8](artifacts/implementation-decision-log.md#d-8-add-bashgh--to-skillmd-allowed-tools-frontmatter)). 
+- [ ] `$focus_areas` and `$branch_context` named bindings present and referenced by every Step 3.5 agent prompt template ([D-14](artifacts/implementation-decision-log.md#d-14-focus_areas-and-branch_context-are-explicit-named-bindings-populated-by-steps-1-and-15)). +- [ ] Step 1.5 fail-open warning fires when no PR or branch context loads ([D-9](artifacts/implementation-decision-log.md#d-9-s6-fail-open-behavior-prints-a-visible-warning-when-no-pr-or-branch-context-loads)). +- [ ] All four agent definition files (`plugin/agents/{structural-analyst, behavioral-analyst, junior-developer, edge-case-explorer}.md`) are unmodified ([D-4](artifacts/implementation-decision-log.md#d-4-s3-and-s4-ship-as-step-35-dispatcher-directives-not-as-edits-to-the-four-agent-definition-files)). +- [ ] Five long-form docs updated: `docs/skills/code-review.md` and the four `docs/agents/*.md` files ([D-7](artifacts/implementation-decision-log.md#d-7-long-form-docs-mirror-updates-are-in-scope)). +- [ ] No em-dashes appear in any shipped file under `plugin/` or `docs/` for this branch ([D-17](artifacts/implementation-decision-log.md#d-17-em-dash-strip-policy-on-verbatim-copy-from-the-investigation)). +- [ ] `plugin/.claude-plugin/plugin.json` reports version 2.3.0 ([D-6](artifacts/implementation-decision-log.md#d-6-version-bump-is-minor-220--230)). +- [ ] `CHANGELOG.md` has a `## 2.3.0` entry summarizing the four calibration changes, docs sync, and deferred items. +- [ ] The deferred S12 mode flag is not present in SKILL.md ([D-16](artifacts/implementation-decision-log.md#d-16-defer-s12-default-suppression-for-small-changes-as-yagni)). 
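
The TP-4 checkbox above rests on the S8 extraction-then-comparison procedure. A sketch under stated assumptions: the tuple fields and the `contradicts` predicate are illustrative stand-ins, since in the shipped skill the comparison is the model's own judgment over prose findings, not string matching.

```python
# Sketch of the S8 self-consistency check: Pass 1 extracts tuples,
# Pass 2 flags same-file, overlapping-line-range pairs whose
# recommended actions contradict. Field names are hypothetical.

def extract(findings):
    """Pass 1: reduce each finding to the comparison tuple."""
    return [
        (f["task_id"], f["file_path"], f["line_range"], f["action_summary"])
        for f in findings
    ]

def ranges_overlap(a, b):
    # Inclusive (start, end) ranges overlap when each starts
    # before the other ends.
    return a[0] <= b[1] and b[0] <= a[1]

def find_tensions(findings, contradicts):
    """Pass 2: return (task_id, task_id) pairs to demote and
    annotate with 'Tension with {other-task-id}:' notes.

    `contradicts` is a caller-supplied predicate on the two
    recommended-action summaries.
    """
    tuples = extract(findings)
    tensions = []
    for i in range(len(tuples)):
        for j in range(i + 1, len(tuples)):
            a, b = tuples[i], tuples[j]
            if a[1] != b[1]:
                continue  # scoped to the same file
            if not ranges_overlap(a[2], b[2]):
                continue  # and to overlapping line ranges only
            if contradicts(a[3], b[3]):
                tensions.append((a[0], b[0]))
    return tensions
```

Scoping the pair comparison to overlapping line ranges is what keeps the check cheap relative to a full cross-file semantic pass, which is the deferred alternative recorded under D-12.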
+ +## Specialist Handoffs for Implementation + +- **`behavioral-analyst`**, dispatch during Pair B implementation when the Step 7.2 phrase-match gate is being written; needs the phrase list from [D-3](artifacts/implementation-decision-log.md#d-3-merge-s2-and-s10-into-a-single-step-72-reachability-phrase-match-gate) and the agent output samples from `tmp/gearjot-v2-web-pr-299/` to spot-check phrase coverage. Also dispatch during Pair C for the S6 loader implementation; needs the four-source order (gh, pr-body, commit messages, planning doc) and the [D-10](artifacts/implementation-decision-log.md#d-10-s6-planning-directory-lookup-uses-a-structured-claudemd-key-with-glob-fallback) lookup format. +- **`software-architect`**, dispatch during Pair A to confirm the Step 3.3 authoritative-home rewrite holds against every consumer site (rubric, line 24, Step 7.2, Step 3.5, YAGNI two-pass); needs [D-1](artifacts/implementation-decision-log.md#d-1-step-33-is-the-authoritative-home-for-size-based-demotion) and the list of five consumer references. +- **`test-engineer`**, dispatch after each unit completes to run the corresponding TP-# tests; needs the test catalog from Testing Strategy. +- **`junior-developer`**, dispatch during the docs sync (unit 8) to apply the [D-17](artifacts/implementation-decision-log.md#d-17-em-dash-strip-policy-on-verbatim-copy-from-the-investigation) em-dash strip policy and to confirm the [D-18](artifacts/implementation-decision-log.md#d-18-s4s-constraint-is-narrower-for-edge-case-explorer-than-for-junior-developer) narrower wording in the Step 3.5 dispatcher directive for `edge-case-explorer`. + +## Deferred (YAGNI) + +Items considered during planning but deferred under the YAGNI rule ([../../../plugin/references/yagni-rule.md](../../../plugin/references/yagni-rule.md)). + +### S12 default-suppression mode flag for small changes +- **Why deferred:** Gate 2 simpler-version test. 
S1 + S2-merged-with-S10 + S13 satisfy the same evidence (the PR 299 E17 workaround "WARN-justified, SUGG suppressed per reviewer instruction") because the size-aware rubric already omits Suggestions on Small changes and demotes warnings on reachability phrases. +- **Reopen when:** post-ship validation against PR 299 still produces severity inflation requiring a SUGG-suppress mode flag. +- **Source:** R1, SA-6 / BA-9 / JD-006 consensus; recorded under [D-16](artifacts/implementation-decision-log.md#d-16-defer-s12-default-suppression-for-small-changes-as-yagni). + +### Structured "directly introduced" field in agent output format +- **Why deferred:** Gate 1 evidence test. No documented incident demands a structured field; phrase-matching on the reachability list is the simpler version that handles the PR 299 demotion targets. +- **Reopen when:** post-ship validation shows the reachability phrase-match misses real cases that a structured field would have caught. +- **Source:** R1, JD-005 (recorded as the rejected alternative in [D-3](artifacts/implementation-decision-log.md#d-3-merge-s2-and-s10-into-a-single-step-72-reachability-phrase-match-gate)). + +### Cross-file semantic contradiction detection in S8 +- **Why deferred:** Gate 1 evidence test. No documented incident beyond PR 339's WARN-002/WARN-003 case, which is the overlapping-line-range class already covered by [D-12](artifacts/implementation-decision-log.md#d-12-s8-runs-an-extraction-pass-before-pair-comparison-scoped-to-overlapping-line-ranges-only). +- **Reopen when:** a real review surfaces a contradictory finding pair the line-range detector misses. +- **Source:** R1, JD-015 (recorded as the rejected alternative in [D-12](artifacts/implementation-decision-log.md#d-12-s8-runs-an-extraction-pass-before-pair-comparison-scoped-to-overlapping-line-ranges-only)). + +### Automated test harness +- **Why deferred:** Gate 1 evidence test. 
No measured frequency of regression justifies automation infrastructure; the three manual fixture bundles are sufficient. +- **Reopen when:** the calibrated skill runs more than three times per month against new PRs and manual verification becomes a documented bottleneck. +- **Source:** R1, test-engineer YAGNI section. + +### Per-agent unit tests +- **Why deferred:** Gate 1 evidence test. The PR-bundle acceptance tests (TP-1, TP-3, TP-4) exercise the full pipeline including each dispatched agent. +- **Reopen when:** a specific agent produces a regression on a new PR that the acceptance tests did not catch. +- **Source:** R1, test-engineer YAGNI section. + +### Mode C standalone behavioral tests +- **Why deferred:** Gate 1 evidence test. Mode C is not represented in the three PR bundles; Mode A is the primary user path. +- **Reopen when:** a user files a Mode C regression. +- **Source:** R1, test-engineer YAGNI section. + +### Global edits to the four agent definition files (the investigation's original S3/S4 form) +- **Why deferred:** Gate 2 simpler-version test. Skill-level dispatch directives in Step 3.5 satisfy the same evidence with smaller blast radius; agent definitions remain general-purpose for callers outside `/code-review`. +- **Reopen when:** a cross-project validation run (TP-17) or a user report shows a non-`/code-review` skill exhibits the same calibration failure. +- **Source:** R1, SA-4 / JD-007 / JD-014 consensus; recorded as the rejected alternative in [D-4](artifacts/implementation-decision-log.md#d-4-s3-and-s4-ship-as-step-35-dispatcher-directives-not-as-edits-to-the-four-agent-definition-files). + +## Open Items + +- **OI-1:** Cross-project validation (TP-17) has not been performed at plan stage. The calibration mechanisms are validated only against gearjot-v2-web PR bundles. + - **Resolves when:** TP-17 runs against a PR from a different project and the output is recorded. + - **Blocks implementation:** No. 
TP-17 is a P2 confidence test; the P0 acceptance tests (TP-1 through TP-4, TP-18) gate ship. + +## Summary + +- **Outcome delivered:** `/code-review` produces calibrated first-pass output matching the user's manual second-pass workaround on PR 299, PR 307, and PR 339, with focus areas and branch context plumbed to every sub-agent and internal detection of contradictory same-file findings. +- **Team size:** 4 specialists plus the project-manager, see [artifacts/implementation-iteration-history.md](artifacts/implementation-iteration-history.md). +- **Rounds of facilitation:** 1, see [artifacts/implementation-iteration-history.md](artifacts/implementation-iteration-history.md). +- **Decisions committed:** 21, see [artifacts/implementation-decision-log.md](artifacts/implementation-decision-log.md). +- **Decisions settled by evidence:** 13, see [artifacts/implementation-decision-log.md](artifacts/implementation-decision-log.md). +- **Decisions settled by junior-developer reframing:** 2, see [artifacts/implementation-decision-log.md](artifacts/implementation-decision-log.md). +- **Decisions settled by user input:** 2, see [artifacts/implementation-decision-log.md](artifacts/implementation-decision-log.md). +- **Rejected alternatives recorded:** 27, see [artifacts/implementation-decision-log.md](artifacts/implementation-decision-log.md). +- **Open items remaining:** 1 (TP-17 cross-project validation, non-blocking). +- **Recommendation:** Ship as planned. diff --git a/docs/plans/code-review-guardrails/investigation.md b/docs/plans/code-review-guardrails/investigation.md new file mode 100644 index 0000000..fe45435 --- /dev/null +++ b/docs/plans/code-review-guardrails/investigation.md @@ -0,0 +1,292 @@ +# Investigation: Code-Review Skill Guardrails + +Author: Claude (River's investigation, 2026-05-15). +Sources: three real PR bundles (`tmp/gearjot-v2-web-pr-299/`, `pr-307/`, `pr-339/`) plus a behavioral trace of the `code-review` skill and its dispatched agents. 
+Evidence: 62 verbatim findings across the three PRs and 8 behavioral findings about the skill itself, written to: + +- [artifacts/pr-299-evidence.md](artifacts/pr-299-evidence.md) (E1...E23) +- [artifacts/pr-307-evidence.md](artifacts/pr-307-evidence.md) (E1...E22) +- [artifacts/pr-339-evidence.md](artifacts/pr-339-evidence.md) (E1...E17) +- [artifacts/skill-behavioral-trace.md](artifacts/skill-behavioral-trace.md) (B1...B8) + +--- + +## Executive summary + +All four reported symptoms (going too deep, YAGNI under-applying, severity inflation, and context not forwarded) are real, well-evidenced across multiple PRs, and traceable to specific lines in `plugin/skills/code-review/SKILL.md`, its three reference files, and the dispatched agent definitions. They are not separate bugs but four expressions of two underlying structural failures: + +1. **Calibration is defined once and bypassed everywhere else.** The skill's Step 3.3 calibration directive (which scopes findings to the change and demotes by change size) lives only inside the agent prompt that goes out to specialists. The classification step that processes what the agents return (Step 7) uses a different rubric (`agent-finding-classification.md`) that does not apply the size-based demotion and that sets "Most findings land here = WARN" as a default floor for seven of nine agent types. Two specialist agents (`structural-analyst`, `behavioral-analyst`) also carry "skeptic-default" rules in their own bodies that explicitly conflict with the calibration directive. The result is that severity inflation and off-rails findings are not a bug; they are the structural default behavior of the skill, and the user has been compensating by passing mode flags ("WARN-justified, SUGG suppressed per reviewer instruction") and by running the review twice on the same PR. + +2.
**User-provided focus areas are captured at Step 1 and never reach the sub-agents that produce most findings.** The skill notes the focus areas "for use in Step 4", but the nine agent prompt templates in Step 3.5 have no placeholder for them. PR descriptions, implementation plans, deferred items, and explicit "what to look at first" lists from the human author are not loaded by the skill or forwarded to agents. The agents work from the file list alone. This is why PR 307's reviewer raised WARN-002 and WARN-003 against premises the implementation plan had already resolved, and why PR 339's review answered none of the three concerns the PR author explicitly named. + +The most consequential single piece of evidence is that PR 299 required a **second `han:code-review` run with explicit reclassification**, in which "7 of the prior 11 warnings were downgraded to suggestions on a second pass" because they were "defensive refinements or refactoring opportunities, not reachable defects" (`pr-299-review-2.md:3`). The user has already been running the workaround. The investigation's job is to bake that second pass into the first pass. + +There are no `adversarial-validator` confidence-busting counter-examples in the evidence: every finding category is corroborated by at least two of the three PR bundles plus a behavioral trace. The investigation's confidence is high. + +Risk level: the proposed solutions edit a single skill body and its three reference files. They do not change the agent definitions invasively, with the exception of rewriting the two skeptic-default rules in `structural-analyst.md` and `behavioral-analyst.md`. The blast radius is contained to the code-review pipeline and its derivative `gh-pr-review`. + +--- + +## Indexed examples (cross-PR) + +Examples are grouped by the four symptoms. Each example references at least one E# finding in the artifacts directory.
+ +### Symptom 1: Goes too deep / off the rails + +- **EX1.1.** PR 299's first pass produced 11 warnings, of which 7 were demoted to suggestions because they concerned code outside the change surface or theoretical rather than reachable defects. Evidence: PR 299 E1, E3, E4, E6, E7. +- **EX1.2.** PR 307 WARN-002 raised "API may return bad coordinates" against a frontend, even though the database CHECK constraint, the non-nullable Go DTO, and Go's JSON serializer (all in a different repository) make the bad-coordinate state unrepresentable. Evidence: PR 307 E1, E5, E6. +- **EX1.3.** PR 307 WARN-003 raised a SPA-style company-switch state-leak finding against an app whose company-switch path is a full-document redirect that tears down all module state. The four files proving this were never changed by the PR and were never read by the review. Evidence: PR 307 E2, E4, E7. +- **EX1.4.** PR 339 WARN-001 flagged a stale `data-testid` in a documentation file outside the review's own declared 2-file scope. The reviewer explicitly labeled the issue "not introduced by this branch but adjacent". Evidence: PR 339 E1, E3. +- **EX1.5.** PR 339 missed an in-scope `text-warning-foreground` bug in the same file as CRIT-001 with the same root cause pattern, because attention was spent on out-of-scope documentation drift. The PR author found the bug in their own pass. Evidence: PR 339 E4. +- **EX1.6.** PR 299's feature overview directly records: "Several review comments were 'you should add filtering / a list pane / a current-location button' and the team consistently said: that's a different slice." Evidence: PR 299 E7. + +### Symptom 2: YAGNI under-applies + +- **EX2.1.** PR 299's first pass surfaced one YAGNI finding (for dead code) while filing seven YAGNI-shaped findings as warnings: speculative defensive code, refactor opportunities, single-implementation abstractions. Evidence: PR 299 E8, E9, E10. 
+- **EX2.2.** PR 307 raised YAGNI-001 and then immediately accepted the developer's rationale in the same comment, instead of routing the finding to a `project-manager` or `user-experience-designer` agent who could weigh the demoability question first. Evidence: PR 307 E8, E9. +- **EX2.3.** PR 307 WARN-002 and WARN-003 are textbook YAGNI candidates (defensive code at a trusted internal boundary, for a failure mode the architecture cannot produce) but were filed as warnings. Evidence: PR 307 E11. +- **EX2.4.** PR 339's review reported zero YAGNI findings even though the implementation plan logged an explicit YAGNI rejection (D-5: no `` component extraction) that the review never validated against the implementation. Evidence: PR 339 E5, E6, E7. +- **EX2.5.** PR 339 WARN-002 recommended adding an implementation-coupled assertion (`className.toMatch(/italic/)`), which is itself the named anti-pattern WARN-003 was flagging. The review created the YAGNI violation rather than detecting it. Evidence: PR 339 E8. +- **EX2.6.** PR 307's PR description pre-acknowledged the placeholder rationale 21 seconds before YAGNI-001 fired. The YAGNI check ran without weighting the PR-level context. Evidence: PR 307 E10. + +### Symptom 3: Severity inflation + +- **EX3.1.** PR 299's second review opened with the explicit statement that 7 of 11 first-pass warnings were demoted to suggestions on reclassification. Evidence: PR 299 E11, E12. +- **EX3.2.** PR 299 SUGG-003 was a first-pass warning even though the concurrency agent that raised it labeled the defect "effectively impossible in production". Evidence: PR 299 E6, E13. +- **EX3.3.** PR 299 SUGG-004 was a first-pass warning over a "hypothetical" threat with explicit "defense-in-depth" rationale. Evidence: PR 299 E14. +- **EX3.4.** PR 307 had a 3:1 ratio of won't-fix to accepted WARN findings. Evidence: PR 307 E12, E13, E14, E16.
+- **EX3.5.** PR 339 raised WARN-002 and WARN-003 with co-equal severity even though they prescribed contradictory remedies for the same category of issue. The user resolved WARN-002 by doing the opposite of what it recommended. Evidence: PR 339 E10, E11, E12. +- **EX3.6.** The user has been compensating with mode flags ("WARN-justified, SUGG suppressed per reviewer instruction") because the default mode produces inflation. Evidence: PR 299 E17. +- **EX3.7.** Process log explicitly names the missing first-pass reachability gate: "Bake this into the first pass instead of needing a second pass to catch it." Evidence: PR 299 E10, E15, E16. + +### Symptom 4: Context not forwarded to sub-agents + +- **EX4.1.** The skill's nine agent prompt templates have no placeholder for user-provided focus areas. The user's free-form argument reaches Step 4 only. Evidence: B1. +- **EX4.2.** PR 339's PR description named three specific reviewer concerns (conditional-logic copy, DOM ordering at the section boundary, intentional UUID exposure). The review surfaced none of them. Evidence: PR 339 E14, E15, E16, E17. +- **EX4.3.** PR 307's implementation plan in issue #304 explicitly resolved the WARN-003 question before any code was written: "The viewport flag does not need `clearForCompanySwitch()` because a company switch hard-navigates; it just needs `_resetForTest()`." The review raised WARN-003 anyway. Evidence: PR 307 E18, E22. +- **EX4.4.** PR 299's PR description's Deferred section explicitly listed the Sentry PII work as out of scope. Both review passes re-raised it as a security gap. Evidence: PR 299 E20, E21. +- **EX4.5.** PR 307's review fired 21 seconds after PR open, suggesting agents started before the PR-level context could be loaded by the orchestrator. Evidence: PR 307 E21. 
+- **EX4.6.** Process logs across all three PRs name "before raising a violates-standard-X warning, prove the standard's premise applies", "skill prompts should know this is a multi-repo project", and "test-gap chaining in code review needs prior-review context". Evidence: PR 307 E19, E20; PR 299 E22, E23. + +--- + +## Indexed root causes + +Each cause C# names the structural mechanism, cites the skill file or line, and points at the evidence and behavioral findings that support it. + +### C1: Severity floor in the classification rubric pulls findings up to WARN +- **Location:** `plugin/skills/code-review/references/agent-finding-classification.md`, lines 6, 14, 34, 42, 50, 58, 68. Seven of nine agent rubrics include the phrase "Most findings land here" pointing at WARN. +- **Effect:** Any finding that is ambiguous between SUGG and WARN is pulled up to WARN by the rubric's documented expected landing zone. +- **Supports:** symptom 3 (severity inflation). +- **Evidence:** PR 299 E11-E17; PR 307 E12-E16; PR 339 E9-E13; B3. + +### C2: Step 7 reclassifies agent output without applying Step 3.3's size-based demotion +- **Location:** `plugin/skills/code-review/SKILL.md`, Step 7 (line 253 onward) reads agent output files and applies the rubric in `agent-finding-classification.md`. The size-based demotion rules from Step 3.3's calibration directive ("Small change: omit Suggestions entirely") are inside the agent prompt only and are not referenced from the rubric. +- **Effect:** Even if an agent honored the calibration directive in its output file, the classification step re-raises severity according to the rubric, which has no notion of change size. +- **Supports:** symptom 3. +- **Evidence:** B4; PR 299 E17. + +### C3: Skeptic-default rules in two specialist agents conflict with the calibration directive +- **Location:** `plugin/agents/structural-analyst.md` Rules section ("Default posture is skeptical: assume structural problems exist until proven otherwise. 
When in doubt about whether something is a structural issue, include it: a false positive is cheaper than a missed risk"). Same rule with "behavioral" substituted in `plugin/agents/behavioral-analyst.md`. +- **Effect:** The agent body says "include when in doubt"; the calibration directive says "prefer the lower severity". The more absolute rule wins. +- **Supports:** symptom 1 (going too deep) and symptom 3. +- **Evidence:** B2; PR 299 E1, E2, E3. + +### C4: Adjacent-file and call-site reads in junior-developer and edge-case-explorer pull out-of-scope code into findings +- **Location:** `plugin/agents/junior-developer.md` Protocol 4 ("patterns in code adjacent to what the artifact will change"). `plugin/agents/edge-case-explorer.md` Protocol 1 ("use Grep to search for every call site... read the callers"). +- **Effect:** Findings about out-of-scope code flow back into the review as if they concerned the changed files. The Step 3.4 file-list scoping does not constrain these protocols. +- **Supports:** symptom 1. +- **Evidence:** B7; PR 299 E5, E7; PR 339 E1. + +### C5: No focus-area or user-context placeholder in any agent prompt template +- **Location:** `plugin/skills/code-review/SKILL.md` Step 3.5, all nine prompt templates. The skill body acknowledges focus areas at Step 1 ("note them for use in Step 4") and Step 4 ("apply extra scrutiny"), but never plumbs them into Step 3.5's templates. +- **Effect:** All eight or nine sub-agents work without the user's focus context. They produce findings unweighted by the user's priorities. +- **Supports:** symptom 4 (context not forwarded). +- **Evidence:** B1; PR 339 E14-E17; PR 307 E17-E22; PR 299 E18-E20. + +### C6: No step in the skill loads PR-level or branch-level context beyond a file list +- **Location:** `plugin/skills/code-review/SKILL.md` Step 1 ("Detect review context") reads the script output and the diff. 
It does not read the PR description, the linked issue, the branch's commit messages, or the planning artifacts in adjacent repos. CLAUDE.md is read for project-discovery purposes only. +- **Effect:** The PR description's "Deferred", "Files of interest", and "What to look at first" sections do not reach the sub-agents. Cross-repo context (planning, linked issues, backend invariants) is never loaded. +- **Supports:** symptom 4 and symptom 1. +- **Evidence:** PR 299 E20, E21; PR 307 E20; PR 339 E14, E15. + +### C7: YAGNI is applied as a pattern-match against named anti-patterns rather than as the evidence-test gate the rule requires +- **Location:** `plugin/skills/code-review/references/review-checklist.md` YAGNI section lists anti-patterns as bullet triggers without embedding the affirmative evidence test. `plugin/references/yagni-rule.md` says the rule should be "run the evidence test first; flag only if no evidence applies". +- **Effect:** YAGNI fires on items that pass the evidence test (the user described the need) and misses items that don't match a named anti-pattern but fail the evidence test (novel speculative code that looks normal). +- **Supports:** symptom 2 (YAGNI under-applies) and symptom 1. +- **Evidence:** B6; PR 307 E8, E9, E10, E11; PR 339 E5, E6, E7, E8; PR 299 E8, E9. + +### C8: No self-consistency check on findings; contradictory findings can coexist at the same severity +- **Location:** `plugin/skills/code-review/SKILL.md` Step 9 (verification) has eleven checks but no "for each pair of findings on the same file/lines, detect contradictory remedies" check. +- **Effect:** Two co-equal warnings can recommend opposite fixes. The user adjudicates manually. +- **Supports:** symptom 3. +- **Evidence:** PR 339 E10, E11, E12, E13. + +### C9: No premise-verification step before raising "violates standard X" findings +- **Location:** `plugin/skills/code-review/SKILL.md` Step 5 (documentation compliance) reads standards documents and reports violations. 
No sub-step asks "does this standard's premise apply to this codebase before I raise the violation?" +- **Effect:** Standards written for one architecture (SPA-style company switch, rich-error API responses) are applied to codebases with a different architecture (full-page redirect, type-system-closed contracts). +- **Supports:** symptom 1. +- **Evidence:** PR 307 E2, E4, E19; PR 299 E1. + +### C10: No reachability gate before assigning severity +- **Location:** `plugin/skills/code-review/SKILL.md` Step 3.3's calibration directive partially addresses this ("Do not raise theoretical concerns the change does not touch"), but only within the agent prompt. Step 7 has no reachability filter. +- **Effect:** Findings rated "theoretical" by their own raising agent still come through as warnings. +- **Supports:** symptom 3. +- **Evidence:** PR 299 E6, E10, E13, E14, E15, E16. + +### C11: Whole-file checklist application in Modes B and C, plus YAGNI on whole-file content +- **Location:** `plugin/skills/code-review/SKILL.md` Step 4 sub-step 4 ("Apply the review checklist to the entire file content" when no diff is available). Step 4 sub-step 5 applies the full checklist (including YAGNI) to whole files. +- **Effect:** In Mode B (uncommitted) and Mode C (no git), the skill cannot distinguish introduced code from pre-existing code. YAGNI surfaces for code that predates the change entirely. +- **Supports:** symptom 1, symptom 2. +- **Evidence:** B5, B8. + +### C12: The Review Constraints global rule "When uncertain, choose the higher severity" governs manual review (Steps 4 to 6) and directly contradicts the calibration directive +- **Location:** `plugin/skills/code-review/SKILL.md` line 24: "When uncertain, choose the higher severity." This is in the Review Constraints section that governs the orchestrator's own work in Steps 4 to 6 (file-by-file review, documentation compliance, documentation freshness). 
Step 3.3's calibration directive says the opposite ("prefer the lower severity") but applies only to agent prompts and only at Step 7. +- **Effect:** Manual findings produced by Steps 4 to 6 (the orchestrator reading code and docs directly, not agent output) are still routed through the "prefer higher severity" rule with no size-based demotion. Documentation-freshness findings, documentation-compliance findings, and file-by-file checklist findings all inherit the inflated default. +- **Supports:** symptom 3 (severity inflation) and symptom 1 (going too deep on documentation drift). +- **Evidence:** PR 339 E1 (WARN-001 is a documentation-compliance finding produced by Step 5 or 6, not by an agent; it concerns code "not introduced by this branch but adjacent" and would not have been raised by any agent constrained to the scoped file list). V9 in the adversarial validation. + +--- + +## Indexed solutions + +Each solution S# describes a concrete change, cites the file(s) to edit, the C# causes it addresses, and the evidence rule (every solution must trace to evidence from the artifacts). Solutions are ordered by leverage: S1 to S5 are high-impact structural changes; S6 to S11 are additive guardrails. + +### S1: Rewrite agent-finding-classification.md to remove the WARN severity floor and derive severity from change size +- **Change:** In each of the seven rubric sections that currently say "Most findings land here" pointing at WARN, replace that floor with a size-aware rule: + > For Small changes, this category lands at SUGG unless the finding is directly introduced by this change and not handled, in which case it lands at WARN; only critical or data-safety findings escalate further. For Medium changes, this category lands at WARN when directly introduced and at SUGG otherwise. For Large changes, the rubric remains as written. When in doubt, prefer the lower severity. +- **Files to edit:** `plugin/skills/code-review/references/agent-finding-classification.md`. 
+- **Causes addressed:** C1, C2. +- **Evidence rule:** removes the structural mechanism that produced PR 299's 7-of-11 inflation and PR 307's 3-of-4 inflation. + +### S2: Move the calibration directive's size-based demotion logic into Step 7 +- **Change:** In Step 7, before applying the classification rubric, add a sub-step that re-applies the size-based demotion rules from Step 3.3: + > Before classifying agent findings, apply the size-based demotion rules from Step 3.3's calibration directive to every finding read from agent output files. A finding marked WARN by an agent that does not satisfy the calibration directive (directly introduced by this change, or critical irrespective of who introduced it) is demoted one severity. Suggestions on Small changes are dropped entirely. +- **Files to edit:** `plugin/skills/code-review/SKILL.md` Step 7. +- **Causes addressed:** C2. +- **Evidence rule:** closes the gap between the directive in the agent prompt and the classification that produces final output. + +### S3: Lower the default output severity of structural-analyst and behavioral-analyst, but keep "include when in doubt" +- **Change:** Both agents already have Anti-Patterns sections that specifically warn against false-positive shapes ("Abstraction Purity Bias", "Duplication False Positive", "Static-as-Behavioral"). Do not remove the "include when in doubt" rule, because the second-pass evidence in PR 299 confirms that the SUGG-005, SUGG-006, SUGG-007 findings produced by the structural-analyst were *correct at SUGG severity*. The defect was that the classification rubric promoted them to WARN, not that the agents produced them. Replace the rule with: + > Default posture is skeptical, but calibrated. When in doubt about whether something is an issue, include it as a low-severity finding (SUGG). Raise to WARN or CRIT only when the change actively introduces or worsens the issue, or when the issue is critical irrespective of who introduced it. 
A false positive at SUGG severity is cheaper than a missed real issue, but a false positive at WARN severity erodes trust. +- **Files to edit:** `plugin/agents/structural-analyst.md`, `plugin/agents/behavioral-analyst.md`. +- **Causes addressed:** C3 (partially; the rest is handled by S1). +- **Evidence rule:** preserves the agents' ability to produce legitimate low-priority findings (which the PR 299 second-pass output validated) while preventing the auto-promotion to WARN. Aligned with V8 in the adversarial validation. + +### S4: Constrain junior-developer and edge-case-explorer reads to the scoped file list except where their protocol explicitly requires outward reads, and document the constraint +- **Change:** Add to `junior-developer.md` Protocol 4 and `edge-case-explorer.md` Protocol 1: "Outward reads (adjacent code, callers) are for context only; findings must concern code on the scoped file list. A finding about code outside the file list is permitted only when it directly demonstrates that the changed code on the file list cannot be safely interpreted without the out-of-scope context. Otherwise, omit the finding." +- **Files to edit:** `plugin/agents/junior-developer.md`, `plugin/agents/edge-case-explorer.md`. +- **Causes addressed:** C4. +- **Evidence rule:** prevents PR 339 E1 (the data-testid drive-by) and PR 299 E5 (the suppressed-suggestion deep dive) without removing the agents' ability to gather context. + +### S5: Add `{focus areas}` and `{user context}` placeholders to every agent prompt template +- **Change:** In `plugin/skills/code-review/SKILL.md` Step 3.5, add a "Focus areas" block to every agent prompt template: + > **Focus areas from the user.** {focus areas, or "none provided"}. + > **PR / branch context.** {branch context summary, or "none provided"}. + > Findings in the focus area receive extra scrutiny and additional detail. 
Findings outside the focus area must still satisfy the calibration directive above; do not raise minor findings outside the focus area when a focus area is provided. +- **Files to edit:** `plugin/skills/code-review/SKILL.md` Step 3.5. +- **Causes addressed:** C5. +- **Evidence rule:** PR 339 E14 to E17, PR 307 E17 to E22, and PR 299 E18 to E22 all name this as the primary "context not forwarded" failure. + +### S6: Add Step 1.5 "Load PR-level and branch-level context" +- **Change:** Insert a new Step 1.5 between current Steps 1 and 2: + > **Step 1.5: Load branch context.** When in Mode A or Mode B, gather context the agents will need: (a) read the PR description if a PR exists for the branch (use `gh pr view` if available, otherwise read locally if a `pr-body` file is present); (b) read commit messages for the branch's commits since the default branch; (c) if the CLAUDE.md project map names a planning directory or linked-issue convention, look for an implementation plan or design doc for this branch; (d) summarize the context as a "Branch Context" block of at most 200 words covering: scope, deferred items, premises the team has already locked in, focus areas the author named. Pass this block to every agent in Step 3.5 alongside the focus areas from S5. +- **Files to edit:** `plugin/skills/code-review/SKILL.md`. +- **Causes addressed:** C5, C6. +- **Evidence rule:** PR 307 E18 (the implementation plan resolved WARN-003 before code was written) and PR 299 E20 (the Deferred section listed the Sentry PII work) both required this context. 
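S6's loader can be pictured as a small procedure. The sketch below is a hypothetical rendering of Step 1.5 under stated assumptions: the real step is prose in `SKILL.md`, the `docs/plans/**/plan.md` glob and the 2000-character plan excerpt are illustrative stand-ins for the CLAUDE.md-resolved planning directory, and the 200-word summarization pass is omitted.

```python
import subprocess
from pathlib import Path

def load_branch_context(default_branch: str = "main") -> str:
    """Gather PR, commit, and plan context; return 'none provided' when empty."""
    parts: list[str] = []

    def run(cmd: list[str]) -> str:
        try:
            out = subprocess.run(cmd, capture_output=True, text=True, timeout=15)
            return out.stdout.strip() if out.returncode == 0 else ""
        except (FileNotFoundError, subprocess.TimeoutExpired):
            return ""  # tool missing or hung: fall through to the next source

    # (a) PR description, when gh is available and a PR exists for the branch
    pr_body = run(["gh", "pr", "view", "--json", "body", "--jq", ".body"])
    if pr_body:
        parts.append("PR description:\n" + pr_body)

    # (b) commit messages for this branch's commits since the default branch
    commits = run(["git", "log", f"{default_branch}..HEAD", "--pretty=%s"])
    if commits:
        parts.append("Commits:\n" + commits)

    # (c) implementation plan via a glob fallback over a planning directory
    for plan in sorted(Path("docs/plans").glob("**/plan.md")):
        parts.append("Plan excerpt:\n" + plan.read_text()[:2000])
        break  # first match only; the real step resolves via CLAUDE.md

    return "\n\n".join(parts) if parts else "none provided"
```

A caller would bind the result to the branch-context block and interpolate it into every Step 3.5 agent prompt alongside the focus areas from S5; each source degrades independently, so a missing `gh` binary or detached worktree still yields whatever the remaining sources provide.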
+ +### S7: Tighten the YAGNI checklist to run the evidence test first +- **Change:** Rewrite the YAGNI bullet list in `review-checklist.md` and the YAGNI block in Step 3.3 of `SKILL.md` to be a two-step procedure: + > For every change in the diff, apply YAGNI in two passes: + > Pass 1, evidence test: for each new abstraction, configuration knob, defensive guard, observability hook, runbook, SLO, index, or audit column, ask whether the diff contains evidence of need from one of the five acceptable evidence types in yagni-rule.md Gate 1. If yes, do not flag. + > Pass 2, anti-pattern check: only for items that fail Pass 1, match against the named anti-patterns. Items that match any anti-pattern become YAGNI-### findings. + > A YAGNI finding's body must name (a) the failing evidence type, (b) the matched anti-pattern, and (c) the simpler form considered. +- **Files to edit:** `plugin/skills/code-review/references/review-checklist.md`, `plugin/skills/code-review/SKILL.md` Step 3.3 YAGNI block. +- **Causes addressed:** C7. +- **Evidence rule:** PR 307 E8 to E11 (the placeholder rationale was evidence of need that YAGNI did not weigh) and PR 339 E5 to E8 (the plan's explicit YAGNI rejection should have been Pass 1 evidence). + +### S8: Add a self-consistency check to Step 9 +- **Change:** Add to Step 9 (verification): + > For every pair of findings that reference the same file path or overlapping line ranges, check whether their recommendations are mutually consistent. If two findings recommend opposite actions on the same code, demote one or both to Suggestion and add a note to each: "Tension with {other finding ID}: the human reviewer must adjudicate between {action A from finding X} and {action B from finding Y}." Do not silently drop either finding. +- **Files to edit:** `plugin/skills/code-review/SKILL.md` Step 9. +- **Causes addressed:** C8. 
+- **Evidence rule:** PR 339 E10 to E13 (the WARN-002 / WARN-003 contradiction is the bundle's "most instructive single moment"). + +### S9: Add a premise-verification sub-step to Step 5 +- **Change:** Add to Step 5 (documentation compliance): + > Before raising a "violates standard X" finding, verify that the standard's premise applies to this codebase. A standard that says "module-level Map/Set caches must clear on company switch" applies only when the codebase has SPA-style company switching with module-level Map/Set caches. Read the standard's own "When this applies" or scope statement (or infer it from the standard's own examples). If the premise does not hold in this codebase, do not raise the finding; instead, log a brief note in the agent's output explaining why the standard does not apply. +- **Files to edit:** `plugin/skills/code-review/SKILL.md` Step 5. +- **Causes addressed:** C9. +- **Evidence rule:** PR 307 E2 (WARN-003's standard premise did not apply) and PR 299 E1 (the structural agent applied standards out of context). + +### S10: Add a reachability gate to Step 3.3 and re-apply at Step 7 +- **Change:** Strengthen Step 3.3's calibration directive with explicit reachability criteria, then mirror it at Step 7: + > Reachability gate: for every finding, ask whether the described failure mode is reachable in production given the actual code paths in this codebase. "Theoretical race", "defense-in-depth", "in case the upstream is bad" all signal non-reachable findings unless the agent can cite a specific production path that exercises the failure. Demote non-reachable findings by one severity. A non-reachable "Critical" becomes Warning; a non-reachable "Warning" becomes Suggestion; a non-reachable "Suggestion" is omitted. +- **Files to edit:** `plugin/skills/code-review/SKILL.md` Step 3.3 calibration directive, Step 7 classification. +- **Causes addressed:** C10. 
+- **Evidence rule:** PR 299 E10, E15, E16 explicitly call for the reachability gate to be in the first pass. PR 299 E6 (SUGG-003 documented "effectively impossible in production" yet was a first-pass warning) and PR 307 E12 to E16 (3:1 won't-fix ratio). + +### S11: Document Mode-B and Mode-C scope limitations +- **Change:** Add to Step 4 sub-step 4: + > In Mode B (uncommitted changes) and Mode C (no git), the skill cannot distinguish introduced code from pre-existing code. In these modes, apply the review checklist conservatively: raise findings only for items that the user explicitly asked about in the focus areas, items that the file list's own author wrote intentionally (judged by being in source files versus generated/vendored), and items at the file boundaries (imports, exports, public API). Skip YAGNI entirely in Mode B and Mode C unless the user explicitly requests it. +- **Files to edit:** `plugin/skills/code-review/SKILL.md` Step 4. +- **Causes addressed:** C11. +- **Evidence rule:** B5, B8. + +### S12 (optional): Replace mode flags with explicit defaults that match user preference +- **Change:** Acknowledge that the user has been compensating with "SUGG suppressed per reviewer instruction" by making suppression the default for small changes and surfacing-on-request for medium changes. Reuse the existing `size` argument; do not add a new argument. +- **Files to edit:** `plugin/skills/code-review/SKILL.md` Step 3.3. +- **Causes addressed:** C1, C2 (reinforces). +- **Evidence rule:** PR 299 E17 (the user has already configured the workaround manually). + +### S13: Qualify the Review Constraints global "prefer higher severity" rule with size-based logic +- **Change:** Replace SKILL.md line 24 ("When uncertain, choose the higher severity") with: + > When uncertain about severity, apply the calibration directive's size-based demotion to manual findings from Steps 4 to 6 as well. For Small changes, prefer the lower severity; only Critical findings escalate.
For Medium changes, Critical and Warning findings escalate; suggestions stay at SUGG. For Large changes, prefer the higher severity when in doubt. This rule applies to manual review findings (Steps 4 to 6) and is mirrored by the agent calibration directive in Step 3.3. +- **Files to edit:** `plugin/skills/code-review/SKILL.md` Review Constraints section (line 24). +- **Causes addressed:** C12. +- **Evidence rule:** PR 339 E1 (the documentation-compliance WARN-001 finding is a manual-review finding produced under the global "higher severity" rule, not an agent finding; no agent-side fix addresses it). V9 in the adversarial validation directly names this gap. + +### Note on the S1 and S2 interaction +S1 makes the rubric size-aware. S2 adds a pre-classification demotion step at Step 7 that re-applies Step 3.3's calibration directive. If both ship together, a finding could be demoted twice: once by S2's pre-step (because the agent did not satisfy the directive) and again by S1's size-aware rubric. Bound the interaction with this rule: **S2 applies only when an agent finding's rationale does not cite a "directly introduced by this change" justification.** When an agent's finding explicitly claims direct introduction, skip S2's demotion and let S1's rubric assign the final severity. When the agent's rationale is silent on direct introduction, apply S2's demotion and then S1's rubric assigns the floor. This prevents double-demotion while preserving the protective behavior in both steps. V5 in the adversarial validation surfaces this risk. 
+ +--- + +## Cross-reference: symptom-to-cause-to-solution map + +| Symptom | Primary causes | Primary solutions | +|---------|----------------|-------------------| +| Goes too deep | C3, C4, C9, C11, C12 | S3, S4, S9, S11, S13 | +| YAGNI under-applies | C7, C11 | S7, S11 | +| Severity inflation | C1, C2, C8, C10, C12 | S1, S2, S8, S10, S13 | +| Context not forwarded | C5, C6 | S5, S6 | + +S1, S2, S5, S6, S13 are the five highest-leverage changes. Together they address the three structural failures: (a) calibration logic is not enforced where severity is assigned (S1, S2), (b) user-provided context never reaches the sub-agents (S5, S6), and (c) the global manual-review severity rule is not size-aware (S13). + +--- + +## Adversarial validation findings + +The investigation was reviewed by an `adversarial-validator` agent. Nine validation findings were raised. The investigation was updated in response. The findings and adjustments: + +- **V1.** The 7-of-11 demotion in PR 299 partly reflects user configuration (the first pass was run with "WARN-justified, SUGG suppressed" mode), not pure malfunction. **Adjustment:** confidence reduced from "High across all causes" to "High for behavioral causes (B1-B8) and directly-readable structural mechanisms (C1, C2, C5, C12); Medium for PR-outcome-only causes." +- **V2 and V8.** S3 as originally written would have suppressed correct SUGG-level findings the second pass validated, and both agents already have Anti-Patterns sections that constrain the skeptic posture. **Adjustment:** S3 rewritten to lower the default output severity to SUGG, not remove the "include when in doubt" rule. +- **V3.** S1's proposed language could under-classify concurrency findings for newly-introduced async errors. **Adjustment:** documented as a remaining risk; implementation should test S1's language against concurrency scenarios specifically. 
+- **V4.** C4 overstates the agents' protocols; outward reads in junior-developer and edge-case-explorer are for context, not findings. **Adjustment:** S4's framing remains correct (add the explicit constraint), but the diagnosis acknowledges that PR 339 E1's documentation finding is more likely a Step 5/6 manual finding than an agent finding (see C12, S13). +- **V5.** S1 and S2 could double-demote findings if both ship together. **Adjustment:** added an explicit interaction rule (the note after S13). +- **V6.** "Structural failure" overstates Symptom 4; "missing feature" is a more accurate framing for context-forwarding. **Adjustment:** acknowledged in the cause definitions; S5 and S6 remain the correct fix. +- **V7.** All three PR bundles come from the same project; the evidence is not fully independent. **Adjustment:** confidence framing now distinguishes behavioral causes (high confidence) from PR-outcome causes (medium confidence). Recommended pre-implementation step: test the proposed solutions against a different project's PR. +- **V9.** SKILL.md line 24's "When uncertain, choose the higher severity" governs manual review (Steps 4 to 6) and was not addressed by any solution. **Adjustment:** added C12 and S13. + +## What the investigation does not cover + +- **Test plan.** Once solutions are implemented, a test plan is needed to verify the fixes do not break existing review quality. Suggested: re-run the code-review skill against PRs 299, 307, and 339 (treating the merged-state code as the input) and confirm that (a) PR 299 produces 4 warnings (matching the second pass), not 11; (b) PR 307 produces 1 warning and 1 YAGNI-or-context-flagged demotability question, not 4 warnings; (c) PR 339 produces 1 CRIT and at most 1 WARN, and detects the WARN-002/WARN-003 contradiction internally. +- **Cross-project validation.** All three PR bundles come from gearjot-v2-web.
Before shipping the solutions widely, re-run the analysis against a PR from a different project with a different stack to confirm the structural mechanisms (C1, C2, C5, C12) hold beyond the one team. +- **Implementation sequence.** The skill changes are small and can be batched, but if sequenced: do S1 and S13 first (the highest-frequency severity issue covering both agent and manual findings), then S2 (with the V5 interaction rule), then S5 and S6 (the highest-trust context issue), then S3 and S4 (the agent-rule changes), then the additive guardrails S7 through S11. + +--- + +## Confidence + +- **High for the behavioral causes (B1 to B8) and the directly-readable structural mechanisms (C1, C2, C5, C6, C12).** These are visible by reading the skill source and the agent definitions; they are not inferred from PR outcomes. +- **Medium for the PR-outcome-driven causes (C3, C4, C9, C10).** Three PRs from one project is corroboration but not independent triangulation. The mechanisms are plausible but would benefit from cross-project validation. +- **High that the user has been compensating manually.** "WARN-justified, SUGG suppressed per reviewer instruction" is a visible, dated workaround that proves the default behavior does not match the user's preference. The fix needs to land at the skill level. diff --git a/docs/skills/code-review.md b/docs/skills/code-review.md index fafb250..462c74a 100644 --- a/docs/skills/code-review.md +++ b/docs/skills/code-review.md @@ -13,10 +13,15 @@ Operator documentation for the `/code-review` skill in the han plugin. This docu ## Key concepts -- **Three severity levels.** CRIT (must fix before merge: security, data corruption, breaking API), WARN (should fix: bugs, missing error handling, missing tests), SUGG (consider: style, refactoring, docs). When uncertain, choose the higher severity for manual-review findings; agent-dispatched findings calibrate severity to size (see Sizing below). 
-- **Three review modes.** Mode A uses the full git branch diff. Mode B reviews uncommitted work when no branch diff exists. Mode C reviews specified files when git is absent. +- **Three severity levels.** CRIT (must fix before merge: security, data corruption, breaking API), WARN (should fix: bugs, missing error handling, missing tests), SUGG (consider: style, refactoring, docs). Severity calibration is governed by Step 3.3 in the skill body, the authoritative home for size-based demotion. Manual findings (Steps 4 to 6) and agent findings (Step 7) follow the same rules: small changes escalate only Critical and prefer the lower severity on uncertainty; medium escalates Critical and Warning; large prefers the higher severity when in doubt. +- **Three review modes.** Mode A uses the full git branch diff. Mode B reviews uncommitted work when no branch diff exists. Mode C reviews specified files when git is absent. **In Mode B and Mode C the YAGNI checklist is skipped unless explicitly requested**, because no diff exists to distinguish introduced code from pre-existing code. - **Size-aware agent dispatch.** Two agents always run on every review (`junior-developer` for clarity and standards, `adversarial-security-analyst` for exploit-path security). The rest of the roster (`test-engineer`, `edge-case-explorer`, `structural-analyst`, `behavioral-analyst`, `concurrency-analyst`, `data-engineer`, `devops-engineer`) is dispatched conditionally based on what the changed files touch. Larger sizes raise the upper bound on the roster; smaller sizes prefer fewer agents producing higher-signal findings. See the [Sizing](#sizing) section below. - **Calibration directive.** Every dispatched agent receives a calibration directive that requires findings to be either introduced or worsened by the change, or critical irrespective of who introduced it. Theoretical concerns, pre-existing best-practice gaps, and benign-outcome scaling worries are excluded. 
Severity scales with size: small change → only Critical findings escalate; medium → Critical and Warning; large → all severities. +- **Per-agent dispatcher tailoring.** When `/code-review` dispatches `structural-analyst` and `behavioral-analyst`, it appends a default-SUGG directive so those agents start at the lowest severity and escalate only when the change introduces or worsens the issue. When it dispatches `junior-developer` and `edge-case-explorer`, it appends a file-list scoping directive so findings concern code on the scoped file list (with a narrower wording for `edge-case-explorer` that preserves its caller-read protocol while keeping the failure-mode target on the file list). These directives are `/code-review`'s tailoring; the agents' default behavior in other skills is unchanged. +- **Reachability phrase-match gate (Step 7.2).** When an agent's own rationale contains phrases like *theoretical*, *hypothetical*, *defense-in-depth*, *effectively impossible*, *in case the upstream*, *could happen*, *should never happen*, or *edge case that does not occur*, the finding is demoted by one severity before the rubric is applied. Security findings are exempt because the security agent's evidence standard already requires a demonstrated exploit path. +- **Branch context loaded at Step 1.5.** Before agents are dispatched, the skill loads four sources of branch-level context in order: the PR description (via `gh pr view` when `gh` is available, Mode A only); a local `pr-body`, `PR_BODY.md`, or `.pr-body` file at the repo root; branch commit messages; and an implementation plan from the planning directory. The planning directory resolves first to the `plans:` (or `planning:`) key under CLAUDE.md's `## Project Discovery` section; otherwise the skill globs `docs/plans/*/feature-implementation-plan.md` and `plans/*/feature-implementation-plan.md`, picking the directory whose name matches the current branch (treating `-` and `_` as interchangeable).
The loaded content is summarized into a `$branch_context` block of at most 200 words and plumbed, alongside the user's `$focus_areas` argument, into every agent prompt so agents avoid re-raising items the team has already deferred or resolved. Step 1.5 is skipped in Mode C; in Mode A and Mode B, when none of the four sources returns content the skill emits a single fail-open warning and proceeds with `$branch_context` set to `none provided`. +- **Self-consistency check at Step 9.0.** Before the structural verification, the skill scans every pair of findings on the same file with overlapping line ranges, detects contradictory recommendations, demotes both, and adds a `Tension with {other-task-id}:` note for the human reviewer. Cross-file semantic contradictions are out of scope. +- **Premise verification before standards-compliance findings.** Step 5 requires reading at least one architectural file in the codebase that demonstrates a standard's premise before the skill raises a "violates standard X" finding. When the file does not confirm the premise, the finding is omitted with a logged note rather than raised on inferred premises. - **Project-pattern deference.** A pattern that differs from general best practices but is consistent within the project is *not* a finding. Only deviations from the project's own conventions count. - **Automated tool boundary.** If the project has a linter or formatter, trust it. Only flag style issues that tooling cannot catch. - **Documentation compliance and freshness.** ADRs, coding standards, and general docs are read and checked against the diff. Stale docs that misdescribe current behavior become CRIT findings. @@ -66,7 +71,8 @@ A structured review in-channel containing: - **Critical findings** (🔴). Each with task ID (`CRIT-001`, `CRIT-002`, …), category, `file_path:line_number`, the issue, the recommended fix, and (for security findings) an `EXPLOIT:` description. - **Warnings** (🟡). Same structure with task ID `WARN-NNN`. 
- **Suggestions** (🟢). Same structure with task ID `SUGG-NNN`. -- **Agent findings.** Coverage gaps (`T-NNN`), edge cases (`EC-NNN`), security findings (`SEC-NNN`), structural findings (`S-NNN`), behavioral findings (`B-NNN`), clarity findings (`JD-NNN`), and concurrency findings (`C-NNN` when the concurrency analyst was dispatched). Each is classified into the main severity tiers using the classification rubric and cross-referenced in the appropriate severity section. +- **Agent findings.** Coverage gaps (`T#`), edge cases (`EC#`), security findings (`SEC-NNN`), structural findings (`S#`), behavioral findings (`B#`), clarity findings (`JD#`), concurrency findings (`C#` when the concurrency analyst was dispatched), data findings (`D#` when the data engineer was dispatched), and devops findings (`DV#` when the devops engineer was dispatched). Each is classified into the main severity tiers using the classification rubric and cross-referenced in the appropriate severity section. +- **YAGNI findings.** Listed in their own `### 🟡 YAGNI` section with task IDs `YAGNI-NNN`. The section opens with the verbatim statement *"These findings will not be corrected unless explicitly requested. They are documented so the team can decide consciously whether to keep, simplify, or defer the items."* Each finding names the failing evidence type, the matched anti-pattern, and the simpler form considered. YAGNI findings are advisory; they are not counted under CRIT / WARN / SUGG and do not block a clean review. - **Deferred tests note.** Test cases the `test-engineer` considered but excluded as brittle, listed for transparency (not counted toward the finding cap). - **ADR / coding-standard / documentation compliance findings.** Violations of project-specific docs, tagged with the source (for example, `[ADR: 0042]`, `[Standard: error-handling]`, `[Docs Update: payments.md]`). 
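The cross-referenced findings above are what Step 9.0's same-file consistency scan consumes. The overlap detection can be sketched as follows; the tuple fields follow the documented shape, while the pairing helper is an assumption, and deciding that two overlapping recommendations actually contradict is the part this sketch leaves abstract:

```python
from itertools import combinations

def overlaps(a: tuple, b: tuple) -> bool:
    """True when two inclusive line ranges share at least one line."""
    return a[0] <= b[1] and b[0] <= a[1]

def tension_pairs(findings: list[dict]) -> list[tuple[str, str]]:
    """Candidate pairs for the Step 9.0 scan: same file, overlapping lines.
    A real implementation would also compare the recommended-action
    summaries and demote only the genuinely contradictory pairs."""
    return [
        (f1["id"], f2["id"])
        for f1, f2 in combinations(findings, 2)
        if f1["file"] == f2["file"] and overlaps(f1["lines"], f2["lines"])
    ]
```

Each returned pair would then receive the `Tension with {other-task-id}:` note described in Step 9.0.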
@@ -88,9 +94,9 @@ Size is the primary lever the skill uses to decide how aggressively to review th | Size | Files | Other signals | Roster (max) | Severity bands in scope | |---|---|---|---|---| -| **Small** *(default)* | 1–3 files | Single subsystem; no cross-cutting concerns; no new module boundaries; no schema, migration, or infra changes; no auth/PII surface added. | The two required agents (`junior-developer`, `adversarial-security-analyst`) plus any conditional agent whose signal clearly fires (for example, a new test boundary triggers `test-engineer`). | Only Critical findings escalate from agents; manual review still flags Warnings introduced by the change. Suggestions are dropped. | -| **Medium** | 3–10 files | One or two adjacent subsystems; may touch a single cross-cutting concern (one API contract, one schema migration, one new permission check, one new index). | Required two plus the conditional agents whose signals fire. Typically `test-engineer`, `edge-case-explorer`, and one of `structural-analyst` / `behavioral-analyst` / `data-engineer` / `devops-engineer`. | Critical and Warning findings escalate from agents; Suggestions only when directly introduced by this change. | -| **Large** | More than 10 files | Multiple subsystems, architectural changes, security or data implications, multi-service coordination, or you explicitly request full agent review. | Required two plus all conditional agents whose signals fire. | All severities are in scope. | +| **Small** *(default)* | 1–3 files | Single subsystem; no cross-cutting concerns; no new module boundaries; no schema, migration, or infra changes; no auth/PII surface added. | The two required agents (`junior-developer`, `adversarial-security-analyst`) plus any conditional agent whose signal clearly fires (for example, a new test boundary triggers `test-engineer`). | Only Critical findings escalate. Raise Warnings only when the finding is directly introduced by the change. Suggestions are omitted. 
Same rule for manual findings (Steps 4–6) and agent findings (Step 7). | +| **Medium** | 3–10 files | One or two adjacent subsystems; may touch a single cross-cutting concern (one API contract, one schema migration, one new permission check, one new index). | Required two plus the conditional agents whose signals fire. Typically `test-engineer`, `edge-case-explorer`, and one of `structural-analyst` / `behavioral-analyst` / `data-engineer` / `devops-engineer`. | Critical and Warning findings escalate. Raise Suggestions only when directly introduced by the change. Same rule for manual and agent findings. | +| **Large** | More than 10 files | Multiple subsystems, architectural changes, security or data implications, multi-service coordination, or you explicitly request full agent review. | Required two plus all conditional agents whose signals fire. | All severities are in scope. Same rule for manual and agent findings. | How the size is chosen: @@ -118,21 +124,24 @@ Agents run on their default models. Finding caps of 30 per pass keep output boun ## In more detail -The skill walks a nine-step process: +The skill walks a ten-step process (Step 1.5 is a context loader inserted between Steps 1 and 2): -1. **Identify changes.** Detect git mode (A/B/C), resolve project config, enumerate changed files. +1. **Identify changes.** Detect git mode (A/B/C), resolve project config, enumerate changed files. Bind `$focus_areas` from the user's free-form argument. +1.5. 
**Load branch context.** Attempt PR description via `gh pr view --json title,body,headRefName,baseRefName` (Mode A only, when `gh` is available); a local `pr-body`, `PR_BODY.md`, or `.pr-body` file at the repo root; branch commit messages via `git log {default-branch}..HEAD --pretty=format:%B` (Mode A) or `git log -n 20 --pretty=format:%B` (Mode B); and an implementation plan resolved through the CLAUDE.md `plans:` / `planning:` key or, failing that, a Glob over `docs/plans/*/feature-implementation-plan.md` and `plans/*/feature-implementation-plan.md` that prefers the directory matching the current branch name (with `-` and `_` interchangeable). Summarize loaded content into `$branch_context` (at most 200 words). When nothing loads, emit a single fail-open warning, set `$branch_context` to `none provided`, and proceed. Skipped in Mode C. 2. **Automated quality checks.** Run the project's lint, build, and test commands. Report each failure as a CRIT finding with category `[Automated Check]`. Do not fix; report. -3. **Classify size and dispatch review agents in parallel.** Step 3.1 classifies the change as small / medium / large, defaulting to small. Step 3.2 selects agents: the required two (`junior-developer`, `adversarial-security-analyst`) plus any conditional agents whose signals appear in the file list (`test-engineer`, `edge-case-explorer`, `structural-analyst`, `behavioral-analyst`, `concurrency-analyst`, `data-engineer`, `devops-engineer`). Step 3.3 attaches a size-scoped calibration directive to every brief. Step 3.4 passes each agent only the slice of the file list relevant to its domain. Step 3.5 launches all selected agents in parallel. -4. **Manual file-by-file review.** Every changed file (alphabetical) against the review checklist: correctness, data isolation, performance, error handling, testing, API design, maintainability, organization, docs, style, database, ADR compliance. -5. 
**Documentation compliance analysis.** Read ADRs, coding standards, and docs relevant to the change. Flag contradictions. +3. **Classify size and dispatch review agents in parallel.** Step 3.1 classifies the change as small / medium / large, defaulting to small, and binds `{size}` for every later consumer. Step 3.2 selects agents: the required two (`junior-developer`, `adversarial-security-analyst`) plus any conditional agents whose signals appear in the file list. Step 3.3 is the authoritative home for size-based demotion and attaches the calibration directive verbatim to every brief. Step 3.4 narrows each agent's brief to a domain-scoped slice of the file list (for example, `structural-analyst` receives source files only; `data-engineer` receives schema, migration, query, ORM, and data-access files only; `junior-developer` and `adversarial-security-analyst` receive the full list). Step 3.5 launches all selected agents in parallel, appending the shared `$focus_areas` and `$branch_context` blocks to every prompt and the per-agent dispatcher directives to `structural-analyst`, `behavioral-analyst`, `junior-developer`, and `edge-case-explorer`. +4. **Manual file-by-file review.** Every changed file (alphabetical) against the review checklist: correctness, data isolation, performance, error handling, testing, API design, maintainability, organization, docs, style, database, ADR compliance. In Mode B and Mode C, the YAGNI checklist is skipped unless the user requests it in `$focus_areas`. +5. **Documentation compliance analysis.** Read ADRs, coding standards, and docs relevant to the change. Verify each standard's premise applies by reading at least one architectural file in this codebase before raising a "violates standard X" finding; omit findings whose premise is not verified. 6. **Documentation freshness review.** Check whether docs describing the changed code are now stale. -7. **Collect and classify agent results.** Read every dispatched agent's output file. 
Classify their findings into CRIT/WARN/SUGG using the finding-classification rubrics. Junior-developer findings that overlap with a specialist's finding reference the specialist instead of duplicating. +7. **Collect and classify agent results in three sub-steps.** 7.1 reads each dispatched agent's output file. 7.2 applies the reachability phrase-match demotion gate (CRIT → WARN → SUGG → omitted) when a finding's rationale contains a documented reachability phrase; security findings are exempt. 7.3 classifies the surviving findings using the size-aware rubric in `agent-finding-classification.md`, governed by Step 3.3's size rules. Junior-developer findings that overlap with a specialist's finding reference the specialist instead of duplicating. 8. **Generate review output.** Assemble the final review using the review template. -9. **Verify.** Task IDs sequential, `file_path:line_number` valid, exploit fields populated for security findings, summary table matches detail. +9. **Verify.** Step 9.0 runs the self-consistency check (extract `{task-id, file-path, line-range, recommended-action-summary}` tuples, then compare overlapping pairs and demote contradictory recommendations with a `Tension with {other-task-id}:` note). Step 9.1 then verifies task IDs are sequential, `file_path:line_number` references are valid, exploit fields are populated for security findings, the summary table matches the detail sections, and the YAGNI section's verbatim opening is preserved. ## YAGNI -YAGNI in `/code-review` is **advisory-only**. The reviewer surfaces speculative additions (defensive code at trusted internal boundaries, single-implementation interfaces, configuration knobs no caller sets, instrumentation for telemetry that isn't reaching the destination yet) but a YAGNI finding alone does not block a clean review. The posture is *make the cost of inclusion visible*, not *reject the change*. 
You decide whether to keep the speculative addition; when you keep it, the rationale is recorded so the choice stays visible. Critical-path correctness, security, and data-integrity findings are unaffected by this advisory posture and follow the standard severity rules. +YAGNI in `/code-review` is **advisory-only** and runs as a two-pass procedure. **Pass 1, evidence test:** for every speculative addition (defensive code, single-implementation interfaces, configuration knobs no caller sets, instrumentation for non-flowing telemetry), check whether the diff contains evidence of need from one of the acceptable evidence types in [`yagni-rule.md`](../../plugin/references/yagni-rule.md). When evidence is present, do not flag. **Pass 2, anti-pattern check:** only items that fail Pass 1 are matched against the named anti-patterns; matches become `YAGNI-###` findings whose body names the failing evidence type, the matched anti-pattern, and the simpler form considered. + +A YAGNI finding alone does not block a clean review; the posture is *make the cost of inclusion visible*, not *reject the change*. Critical-path correctness, security, and data-integrity findings are unaffected by this advisory posture and follow the standard severity rules. In Mode B and Mode C, the YAGNI checklist is skipped unless the user explicitly requests it, since the diff signal that distinguishes introduced code from pre-existing code is absent. See [YAGNI](../yagni.md) for the two gates, the acceptable-evidence list, the named anti-patterns, and why review skills make the cost of inclusion visible rather than enforce inclusion bans. diff --git a/docs/skills/gh-pr-review.md b/docs/skills/gh-pr-review.md index 7090364..6db7a06 100644 --- a/docs/skills/gh-pr-review.md +++ b/docs/skills/gh-pr-review.md @@ -13,6 +13,7 @@ Operator documentation for the `/gh-pr-review` skill in the han plugin. 
This doc ## Key concepts - **Wraps `/code-review`.** The skill delegates the actual review to `/code-review`, adds a pre-post clarity check from `junior-developer`, then handles the GitHub-posting step. +- **Branch context flows automatically.** Because the wrapped `/code-review` runs Step 1.5 (branch context loading) on the same branch, the PR description fetched via `gh pr view` is summarized into a `$branch_context` block and plumbed to every dispatched agent. Agents avoid re-raising items the PR description has already deferred or resolved. No extra dependency for `/gh-pr-review` users — `gh` is already required to post the review. - **Self-authored PR handling.** GitHub rejects formal reviews from PR authors, so when the author and the current gh user match, the skill posts as a PR comment rather than a review. When they differ, it posts as a formal review. - **Review event type chosen from severity.** `REQUEST_CHANGES` when any CRIT or WARN finding exists; `COMMENT` when only SUGG findings exist. Self-authored PRs always post as comments. - **Optional fix plan.** After the review lands, the skill offers to create an implementation plan for every Critical and Warning item, ordered by severity, with file paths and line numbers. diff --git a/docs/yagni.md b/docs/yagni.md index 4727c22..9d7239d 100644 --- a/docs/yagni.md +++ b/docs/yagni.md @@ -82,7 +82,7 @@ YAGNI applies in two postures: **producing** (when a skill drafts an artifact) a | [`/plan-implementation`](./skills/plan-implementation.md) | Every plan step, abstraction, infrastructure addition, observability hook, and rollout step. A YAGNI sweep runs before the plan is committed. | | [`/plan-a-phased-build`](./skills/plan-a-phased-build.md) | Every phase. Phases whose only justification is "completeness of the roadmap" get deferred or merged into a smaller adjacent phase. 
| | [`/iterative-plan-review`](./skills/iterative-plan-review.md) | A YAGNI review pillar runs alongside correctness, completeness, and risk. Every uncited item raises a `Category: YAGNI candidate` finding. | -| [`/code-review`](./skills/code-review.md) | YAGNI is **advisory-only**. Findings surface speculative additions, but they do not block a clean review on their own. The reviewer's posture is "make the cost of inclusion visible," not "reject the change." | +| [`/code-review`](./skills/code-review.md) | YAGNI is **advisory-only** and runs as a two-pass procedure (Pass 1 evidence test against Gate 1; Pass 2 named anti-pattern match). Each finding records the failing evidence type, the matched anti-pattern, and the simpler form considered. Findings surface speculative additions but do not block a clean review on their own; the reviewer's posture is "make the cost of inclusion visible," not "reject the change." Skipped in Mode B (uncommitted changes) and Mode C (no git) unless explicitly requested via the focus-areas argument, since no diff exists to distinguish introduced code from pre-existing code. | | [`/coding-standard`](./skills/coding-standard.md) | A standard is justified only when the project does the thing the standard governs *today* and the standard solves a real, concrete problem the team is currently hitting. | | [`/test-planning`](./skills/test-planning.md) | A YAGNI sweep removes speculative tests: for code paths that don't exist, hypothetical adversaries, or branches the change doesn't touch. | | [`/architectural-decision-record`](./skills/architectural-decision-record.md) | An ADR requires a **forcing function** today: a real decision being made now, with consequences. ADRs about decisions that don't have a forcing function are YAGNI. 
| diff --git a/plugin/.claude-plugin/plugin.json b/plugin/.claude-plugin/plugin.json index 70206cb..d30dc8f 100644 --- a/plugin/.claude-plugin/plugin.json +++ b/plugin/.claude-plugin/plugin.json @@ -1,6 +1,6 @@ { "name": "han", "description": "Evidence-based investigation and documentation skills for software projects. Includes deep-dive investigations with root cause analysis, iterative plan refinement, architectural decision records (ADRs), project and feature documentation, coding standards, code reviews, test planning, PR descriptions, PR code reviews, and adversarial security vulnerability analysis.", - "version": "2.2.0", + "version": "2.3.0", "skills": "./skills" } diff --git a/plugin/skills/code-review/SKILL.md b/plugin/skills/code-review/SKILL.md index 19453fe..a68551e 100644 --- a/plugin/skills/code-review/SKILL.md +++ b/plugin/skills/code-review/SKILL.md @@ -3,7 +3,7 @@ name: code-review description: "Run a comprehensive code review on local source files. Use this skill when the user asks to review, audit, inspect, evaluate, or check code — or when they ask to make sure, verify, or validate that code follows good coding standards, is free of errors or bugs, has sufficient test coverage, or meets best practices, even if they never use the word \"review.\" Triggers for any request to assess code quality, correctness, or security of specific files, directories, or the current branch. Also use when the user invokes /code-review directly. Works on git branches (reviewing changed files against the default branch) or on specified files and directories when git is not available. Does not post comments to GitHub pull requests — use gh-pr-review for that. Does not analyze architectural structure or module boundaries — use architectural-analysis for that." 
arguments: size argument-hint: "[size: small | medium | large] [optional context about changes or areas to focus on]" -allowed-tools: Bash(git *), Bash(make *), Bash(npm *), Read, Grep, Glob, Agent +allowed-tools: Bash(git *), Bash(gh *), Bash(make *), Bash(npm *), Read, Grep, Glob, Agent --- When running a code review, follow the process outlined here. @@ -21,13 +21,13 @@ Severity levels: - **Warning** — Should fix. Bugs that don't corrupt data, significant performance issues, missing required tests, missing error handling. - **Suggestion** — Consider improving. Style improvements, optional performance gains, documentation gaps, refactoring opportunities. -When uncertain, choose the **higher** severity. Include `file_path:line_number` references and code examples for suggested fixes. +Severity calibration is governed by **Step 3.3** (the authoritative home for size-based demotion). Manual findings from Steps 4 to 6 follow the same size-based rules as agent findings classified at Step 7: Small changes escalate only Critical findings and default uncertain ones to the lower severity; Medium changes escalate Critical and Warning; Large changes prefer the higher severity when in doubt. Read `{size}` from Step 3.1. Include `file_path:line_number` references and code examples for suggested fixes. **Finding caps:** Manual review findings (Steps 4-6) and agent findings (Step 7) are each capped at 30 items. Prioritize by severity: all CRIT first, then WARN, then SUGG. If either cap is exceeded, note that additional items were omitted and another code review is recommended after addressing current items. Security findings are not capped (see classification rubric). **Project pattern deference:** A pattern that differs from general best practices but is consistent within the project is not a review finding. Only flag deviations from the project's own conventions.
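The finding-cap rule above can be sketched as a small prioritization pass. The dict shape and helper name are hypothetical; the skill applies the rule in prose rather than code:

```python
SEVERITY_ORDER = {"CRIT": 0, "WARN": 1, "SUGG": 2}

def apply_cap(findings: list[dict], cap: int = 30):
    """Keep at most `cap` findings, prioritized all CRIT first, then WARN,
    then SUGG. Security findings bypass the cap entirely. Returns the kept
    findings and the count of omitted items to note in the review."""
    security = [f for f in findings if f.get("security")]
    capped = sorted(
        (f for f in findings if not f.get("security")),
        key=lambda f: SEVERITY_ORDER[f["severity"]],
    )[:cap]
    omitted = len(findings) - len(security) - len(capped)
    return security + capped, omitted
```

When `omitted` is positive, the review notes that additional items were omitted and recommends another pass after the current items are addressed.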
-**YAGNI findings are a separate, non-correcting class.** Apply the evidence-based YAGNI rule from [../../references/yagni-rule.md](../../references/yagni-rule.md) to every change in the diff. A YAGNI finding identifies code introduced by this change that has no evidence of being needed *now* — a new abstraction with one implementation, a configuration knob no caller sets, a defensive guard at a trusted internal boundary, a runbook for an alert that has never fired, an observability hook for telemetry that isn't flowing, an SLO for absent traffic, an index for a query that doesn't run, an audit column nobody reads, a feature flag wrapping a single code path with no rollout strategy, code added "for future flexibility" or symmetry. **YAGNI findings are listed in their own `### 🟡 YAGNI` section, separate from Critical / Warning / Suggestion**, and **do not appear under CRIT / WARN / SUGG**. The YAGNI section opens with this exact statement: *"These findings will not be corrected unless explicitly requested. They are documented so the team can decide consciously whether to keep, simplify, or defer the items."* Each YAGNI finding records what was found, which gate or named anti-pattern from the rule applies, and the trigger that would justify keeping it. Severity calibration (the calibration directive in Step 3.3) does NOT apply to YAGNI — these findings are surfaced regardless of change size, but they are advisory, not corrective. +**YAGNI findings are a separate, non-correcting class.** Apply the two-pass YAGNI procedure documented in [`references/review-checklist.md`](references/review-checklist.md) (Pass 1 runs the evidence test from [../../references/yagni-rule.md](../../references/yagni-rule.md) Gate 1; Pass 2 matches against the named anti-patterns) to every change in the diff. **YAGNI findings are listed in their own `### 🟡 YAGNI` section, separate from Critical / Warning / Suggestion**, and **do not appear under CRIT / WARN / SUGG**. 
The YAGNI section opens with this exact statement: *"These findings will not be corrected unless explicitly requested. They are documented so the team can decide consciously whether to keep, simplify, or defer the items."* Each YAGNI finding records (a) the failing evidence type from Pass 1, (b) the matched anti-pattern from Pass 2, and (c) the simpler form considered. Severity calibration (the directive in Step 3.3, the authoritative home) does NOT apply to YAGNI; these findings are surfaced regardless of change size and are advisory, not corrective. **Automated tool boundary:** If the project has a linter or formatter, trust it. Only flag style issues that automated tools can't catch. @@ -73,7 +73,24 @@ Use the script output to determine the review mode. If the script reports `git-a - Present the discovered files and ask the user to confirm the review scope - Note: In Mode C, review files by reading them in full rather than comparing against a diff (no diff is available) -If the user provided focus areas in their arguments, note them for use in Step 4. +**Bind `$focus_areas`.** Read the user's free-form argument string from the invocation (everything after the optional `$size` positional). If non-empty, bind `$focus_areas` to that string verbatim. If empty, bind `$focus_areas` to the literal string `none provided`. This binding is consumed by every Step 3.5 agent prompt and by the Step 4 manual review. + +## Step 1.5: Load Branch Context + +Load PR-level and branch-level context that the agents at Step 3.5 will need. Skip this step in **Mode C** (no git); for Mode A and Mode B, attempt the four sources below in order and combine what loads into a single `$branch_context` binding. + +1. **PR description (Mode A only).** If `gh` is available, run `gh pr view --json title,body,headRefName,baseRefName 2>/dev/null` for the current branch and capture the body. If `gh` is not available or no PR exists for this branch, skip to source 2. +2. 
**Local `pr-body` file.** Look for a file named `pr-body`, `PR_BODY.md`, or `.pr-body` at the repo root. If present, read it. +3. **Branch commit messages.** Run `git log {default-branch}..HEAD --pretty=format:%B` (Mode A) or `git log -n 20 --pretty=format:%B` (Mode B) and capture the messages. +4. **Implementation plan in the planning directory.** Resolve the planning directory using this order: + - Read CLAUDE.md's `## Project Discovery` section for a `plans:` or `planning:` key naming the directory (e.g., `plans: docs/plans/`). Use that path if present. + - If no key, Glob `docs/plans/*/feature-implementation-plan.md` and `plans/*/feature-implementation-plan.md`. + - When the Glob returns multiple matches, pick the directory whose name matches the current branch name (treat `-` and `_` as interchangeable). If no directory matches, log `no planning artifact found for branch {branch}` and skip this source. + - Read the matched plan file if found. + +**Summarize loaded content into a Branch Context block of at most 200 words** covering: scope of the change, deferred items the team named, premises the team has already locked in, focus areas the author called out. Bind the summary to `$branch_context`. + +**Fail-open behavior.** When none of the four sources returns content, emit this single-line warning to the orchestrator's output: `Branch Context: no PR or planning artifact found; agents will run without branch-level context.` Bind `$branch_context` to the literal string `none provided` and proceed. ## Step 2: Automated Quality Checks @@ -101,6 +118,8 @@ Determine the output directory for agent reports: if the project has an existing State the chosen size in one line with the justification (e.g., "Medium: 6 files touched, adds one index and a query for it" or "Medium: passed via `$size`"). Also draft a one-line summary of what the change does — this is reused in agent briefs below. 
+**This step is the authoritative source for `{size}`.** Every later consumer reads `{size}` from here: the Review Constraints rule above, the Step 3.3 calibration directive, the Step 3.5 agent prompts, the Step 7.2 demotion gate, and the rubric in `references/agent-finding-classification.md`. Do not re-derive size at any of those sites. + ### Step 3.2: Select agents **Always dispatch — minimum roster across all sizes:** @@ -130,6 +149,8 @@ State the selected roster to the user in one line per agent before launching. ### Step 3.3: Scope every agent brief to the change +**Step 3.3 is the authoritative home for size-based demotion.** Every other site that needs the size-based rule references this step by name rather than restating it: the Review Constraints rule for manual findings, the Step 7.2 demotion gate for agent findings, the rubric in `references/agent-finding-classification.md`, and the YAGNI two-pass procedure in `references/review-checklist.md`. + Every dispatched agent receives — alongside its domain-specific prompt — the following calibration directive verbatim. This directive overrides the default review-wide "prefer the higher severity" rule for agent-dispatched findings: > **Calibrate findings to the change being reviewed.** This is a **{size}** change touching {N} files. The change does the following: {one-line summary from Step 3.1}. @@ -151,7 +172,7 @@ Every dispatched agent receives — alongside its domain-specific prompt — the > > When uncertain about severity, prefer the **lower** severity. If the worst-case impact is "an operator sees an error and retries," that is not Critical. > -> **YAGNI findings are separate from severity.** Apply [../../references/yagni-rule.md](../../references/yagni-rule.md) to every change in the diff regardless of size. 
A YAGNI finding identifies code introduced by this change that has no evidence of being needed now: a new abstraction with one implementation, configuration knob no caller sets, defensive guard at a trusted internal boundary, runbook for never-fired alert, observability for non-flowing telemetry, SLO for absent traffic, index for unrun query, audit column with no consumer, feature flag wrapping a single code path with no rollout plan, code added "for future flexibility" or symmetry. Raise YAGNI findings as `Category: YAGNI candidate` regardless of change size — they are advisory, listed in a separate section, and not corrected unless the user explicitly requests it. Cite the simpler form that would satisfy the same evidence and the trigger that would justify keeping the larger form. +> **YAGNI findings are separate from severity.** Apply the two-pass YAGNI procedure documented in [`references/review-checklist.md`](references/review-checklist.md) (Pass 1: evidence test against [`../../references/yagni-rule.md`](../../references/yagni-rule.md) Gate 1; Pass 2: named anti-pattern match) to every change in the diff regardless of size. The size-based demotion in this Step 3.3 directive does NOT apply to YAGNI findings; they are advisory at every size, listed in a separate section, and not corrected unless the user explicitly requests it. Each finding's body must name (a) the failing evidence type, (b) the matched anti-pattern, and (c) the simpler form considered. ### Step 3.4: Domain-scoped file lists @@ -171,7 +192,23 @@ Pass each agent only the slice of the file list relevant to its domain: ### Step 3.5: Dispatch -Launch all selected agents **in parallel** using the `Agent` tool with `run_in_background: true`, in a single message so they run concurrently. Each agent's prompt has three parts: the domain-specific question, the calibration directive verbatim from Step 3.3, and the domain-scoped file list from Step 3.4. 
Include the branch name only if one was detected (Mode A or Mode B). Do not wait for results — continue immediately to Step 4. +Launch all selected agents **in parallel** using the `Agent` tool with `run_in_background: true`, in a single message so they run concurrently. Each agent's prompt has four parts: the domain-specific question, the calibration directive verbatim from Step 3.3, the domain-scoped file list from Step 3.4, and two named-binding blocks for user focus areas and branch context. Include the branch name only if one was detected (Mode A or Mode B). Do not wait for results; continue immediately to Step 4. + +**Two named-binding blocks ship with every agent prompt.** Append the following to every prompt below, after the calibration directive and before the domain-specific instructions: + +> **Focus areas from the user.** $focus_areas. +> +> **PR / branch context.** $branch_context. +> +> Findings in the focus area receive extra scrutiny and additional detail. Findings outside the focus area must still satisfy the calibration directive above; do not raise minor findings outside the focus area when a focus area is provided. Use the branch context to avoid re-raising items the PR description or implementation plan has already deferred or resolved. + +Substitute the values of `$focus_areas` (bound at Step 1) and `$branch_context` (bound at Step 1.5) literally. Do not paraphrase or summarize either binding inside the prompt. + +**Per-agent dispatcher directives.** Add the following directive to each named agent's prompt in addition to the shared blocks above. Other agents do not receive these directives. + +- **`structural-analyst` and `behavioral-analyst`.** Add: *"Default the severity of every finding you raise to SUGG. Escalate to WARN only when the change actively introduces or worsens the issue described, and to CRIT only when the issue is critical irrespective of who introduced it. 
A false positive at SUGG is cheaper than a missed real issue; a false positive at WARN erodes trust."* This dispatcher directive is the `/code-review` skill's tailoring; it does not modify the agent's general behavior outside `/code-review`. +- **`junior-developer`.** Add: *"Outward reads (adjacent code, callers) are for context only; findings must concern code on the scoped file list above. A finding about code outside the file list is permitted only when it directly demonstrates that the changed code on the file list cannot be safely interpreted without the out-of-scope context. Otherwise, omit the finding."* This dispatcher directive is the `/code-review` skill's tailoring; it does not modify the agent's general behavior outside `/code-review`. +- **`edge-case-explorer`.** Add: *"Findings must ultimately trace to a failure mode in code on the scoped file list above, even when callers outside the file list provide the evidence for that failure mode. Read callers as evidence per your Protocol 1, but the failure-mode target of every finding stays on the file list."* This narrower wording preserves the agent's caller-read protocol; it is the `/code-review` skill's tailoring and does not modify the agent's general behavior outside `/code-review`. Domain-specific prompts (the `{size}`, `{N}`, `{change summary}`, `{file list}`, and `{branch}` placeholders are filled from earlier steps): @@ -204,7 +241,13 @@ Review each file from the Step 1 file list **in alphabetical order**. For each f 4. **Examine the diff** to understand what changed. If no diff is available (Mode B uncommitted review or Mode C non-git review from Step 1), skip this sub-step — the full file read from sub-step 3 provides all necessary context. Apply the review checklist to the entire file content. 5. 
**Apply the review checklist** at [review-checklist.md](references/review-checklist.md) -If the user provided focus areas in their arguments, apply extra scrutiny to those areas and include additional detail in findings for matching categories. +If the user provided focus areas in their arguments (the `$focus_areas` binding from Step 1), apply extra scrutiny to those areas and include additional detail in findings for matching categories. + +**Mode B and Mode C scope note.** In Mode B (uncommitted changes) and Mode C (no git), the skill cannot distinguish introduced code from pre-existing code; the diff signal that drives the calibration directive is absent. In these modes, apply the review checklist conservatively: + +- Raise findings only for items the user explicitly named in the focus areas (`$focus_areas`), items in source files (skip generated and vendored content), and items at file boundaries (imports, exports, public API). +- **Skip the YAGNI checklist entirely in Mode B and Mode C unless the user explicitly requests it in `$focus_areas`.** YAGNI requires distinguishing introduced code from pre-existing code; without a diff, every speculative addition predating the change would surface as if introduced now. +- The size-based demotion in Step 3.3 still applies, but treat the change as Small unless the user passed `$size`. ## Step 5: Documentation Compliance Analysis @@ -222,8 +265,9 @@ For each source where Step 1's project config lookup returned a path: 1. Scan filenames in the directory to identify documents relevant to the changed files 2. Read each relevant document in full -3. Evaluate whether the changes contradict, circumvent, deviate from, or are inconsistent with the document -4. Report violations as review items using the category prefix from the table above +3. 
**Verify the standard's premise applies before raising a "violates standard X" finding.** Read at least one architectural file in this codebase that demonstrates the standard's premise: an entry-point file for runtime-shape standards, a router or navigation surface for routing standards, a config file for configuration standards, an integration boundary for cross-service standards. When the architectural file confirms the premise, proceed with the violation analysis. When the file does not confirm the premise (e.g., the standard assumes SPA-style company switching but the codebase uses full-page redirects; the standard assumes rich-error API responses but the codebase uses type-system-closed contracts), do not raise the finding. Log a single line in the orchestrator's notes: `premise not verified for {standard}; finding omitted`. The "infer the premise from the standard's own examples" path is not a forward path; it is a reason to omit the finding. +4. Evaluate whether the changes contradict, circumvent, deviate from, or are inconsistent with the document +5. Report violations as review items using the category prefix from the table above #### Compliance severity guidance @@ -252,7 +296,11 @@ Documentation freshness findings merge into the same output sections as the othe ## Step 7: Collect and Classify Agent Results -Wait for all agents dispatched in Step 3 to complete. Each agent returns a summary with finding counts and a file path. +Wait for all agents dispatched in Step 3 to complete. Each agent returns a summary with finding counts and a file path. **Skip this step if no agents were dispatched in Step 3.** + +This step runs in three numbered sub-steps. Order matters: read the agent output, then apply the reachability demotion gate, then apply the size-aware rubric. + +### Step 7.1: Read agent output files Read only the output files for agents that were actually dispatched in Step 3. 
Skip the read for any agent that was not selected: @@ -266,11 +314,30 @@ Read only the output files for agents that were actually dispatched in Step 3. S - `{output_directory}/data-analysis.md` — data-engineer findings (D-series) - `{output_directory}/devops-analysis.md` — devops-engineer findings (DV-series) -Extract the items from the Findings sections of each file that was read. Then classify them as follows: +Extract the items from the Findings sections of each file that was read. + +### Step 7.2: Apply the reachability phrase-match demotion gate + +For each finding read at Step 7.1, scan the rationale text (the agent's own explanation of why the finding matters) for any of these reachability phrases: + +- `theoretical` +- `hypothetical` +- `defense-in-depth` +- `effectively impossible` +- `in case the upstream` +- `could happen` +- `should never happen` +- `edge case that does not occur` + +When a finding's rationale contains any of these phrases, the agent itself signaled that the failure mode is not reachable in production. Demote the finding by one severity: CRIT becomes WARN, WARN becomes SUGG, SUGG is omitted entirely. Apply the demotion exactly once per finding regardless of how many phrases match. + +This gate is the merged form of the reachability and "directly introduced" filters; the size-aware rubric in Step 7.3 is the single later pass and does not re-demote on these phrases. The phrase list is the only signal the gate uses; do not infer reachability from other text. -**Skip this step if no agents were dispatched in Step 3.** +Security findings (SEC-series) are exempt from this gate because the security agent's evidence standard already requires a demonstrated exploit path or CVE reference before any finding is raised. -Classify agent findings using the rubrics at [agent-finding-classification.md](references/agent-finding-classification.md). Continue task ID numbering sequentially from Steps 4-6 (see Task ID Assignment above). 
+### Step 7.3: Classify with the size-aware rubric + +Classify the surviving findings using the rubrics at [agent-finding-classification.md](references/agent-finding-classification.md). The rubric defines what each severity means in each agent category; Step 3.3's size-based demotion (read `{size}` from Step 3.1) governs which findings escalate to those bands. Continue task ID numbering sequentially from Steps 4-6 (see Task ID Assignment above). ### Deferred tests @@ -284,7 +351,22 @@ Use the template at [template.md](references/template.md) for the output structu ## Step 9: Verify Review Output -Before presenting the review, verify: +Before presenting the review, run the self-consistency check first, then verify the structural items below. + +### Step 9.0: Self-consistency check + +Detect contradictory recommendations on overlapping code. Run two passes: + +1. **Extraction pass.** For every finding (manual and agent), extract a tuple: `{task-id, file-path, line-range, recommended-action-summary}`. The recommended-action-summary is a one-line summary of what the finding tells the developer to do (e.g., "remove the className.toMatch assertion", "add a className.toMatch assertion", "wrap the call in try/catch", "remove the try/catch wrapper"). Skip findings that have no actionable recommendation. +2. **Comparison pass.** For every pair of tuples on the same `file-path` whose `line-range` overlaps, check whether the two `recommended-action-summary` values prescribe opposite actions on the same code (one says add X, the other says remove X; one says split, the other says merge; one says inline, the other says extract). For each contradictory pair found: + - Demote both findings by one severity (CRIT → WARN, WARN → SUGG, SUGG stays at SUGG and is annotated rather than dropped). + - Append a `Tension with {other-task-id}:` note to each finding's body, naming the contradicting task ID and the opposite action it prescribes. The human reviewer must adjudicate. 
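The comparison pass of the Step 9.0 self-consistency check above can be sketched as follows. The `Finding` shape and the prefix-based `contradicts` heuristic are illustrative assumptions — the real pass reasons about the action summaries rather than string prefixes — but the overlap test, the one-band demotion with SUGG annotated rather than dropped, and the paired `Tension with` notes follow the step as written:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    task_id: str
    file_path: str
    line_range: tuple[int, int]  # inclusive
    action: str                  # recommended-action-summary
    severity: str                # "CRIT" | "WARN" | "SUGG"
    notes: list[str]

# SUGG stays at SUGG and is annotated rather than dropped.
DEMOTE = {"CRIT": "WARN", "WARN": "SUGG", "SUGG": "SUGG"}

def overlaps(a: tuple[int, int], b: tuple[int, int]) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]

def contradicts(x: str, y: str) -> bool:
    # Hypothetical heuristic: opposite verbs on the same object.
    pairs = [("add ", "remove "), ("split ", "merge "), ("extract ", "inline ")]
    return any(
        (x.startswith(p) and y.startswith(q) and x[len(p):] == y[len(q):])
        or (x.startswith(q) and y.startswith(p) and x[len(q):] == y[len(p):])
        for p, q in pairs
    )

def self_consistency_check(findings: list[Finding]) -> None:
    for i, a in enumerate(findings):
        for b in findings[i + 1:]:
            if (a.file_path == b.file_path
                    and overlaps(a.line_range, b.line_range)
                    and contradicts(a.action, b.action)):
                a.severity, b.severity = DEMOTE[a.severity], DEMOTE[b.severity]
                a.notes.append(f"Tension with {b.task_id}: prescribes '{b.action}'")
                b.notes.append(f"Tension with {a.task_id}: prescribes '{a.action}'")
```

The note lands on both members of the pair, which is what verification item 13 later checks for.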
+ +Scope is overlapping line ranges in a single file only. Cross-file semantic contradictions are out of scope for this check. + +### Step 9.1: Structural verification + +Then verify: 1. Task IDs are sequential within each category (CRIT-001, CRIT-002, ...; WARN-001, WARN-002, ...) 2. Agent findings from every dispatched agent (testing, edge-case, structural, behavioral, concurrency, data, devops, junior-developer) have valid task IDs continuing from manual review IDs. Findings from agents that were not dispatched in Step 3 must not appear. @@ -298,4 +380,5 @@ Before presenting the review, verify: 10. Junior-developer findings that overlap with a specialist agent's finding reference the specialist finding instead of duplicating it 11. The review output is the COMPLETE and FINAL response. Do not append a trailing summary, commentary, sign-off, or follow-up message after the review. The structured review document IS the deliverable — nothing follows it. 12. The `### 🟡 YAGNI` section, when present, opens with the verbatim statement: *"These findings will not be corrected unless explicitly requested. They are documented so the team can decide consciously whether to keep, simplify, or defer the items."* YAGNI findings appear ONLY in this section — they are not duplicated under CRIT/WARN/SUGG and are not included in the Review Summary table. +13. Any `Tension with {other-task-id}:` notes added by Step 9.0 appear on both members of each contradictory pair. diff --git a/plugin/skills/code-review/references/agent-finding-classification.md b/plugin/skills/code-review/references/agent-finding-classification.md index 771d924..4c6d094 100644 --- a/plugin/skills/code-review/references/agent-finding-classification.md +++ b/plugin/skills/code-review/references/agent-finding-classification.md @@ -3,16 +3,20 @@ For each test plan item (T1, T2, ...) 
from the test-engineer, assign category **[Testing: Coverage Gap]** and classify severity: - **CRIT**: Coverage gap for code handling security, data integrity, financial calculations, or authentication. Also CRIT when no tests exist at all for a significant new feature or endpoint. -- **WARN**: Coverage gap for business logic, error handling paths, or integration points. Most findings land here. -- **SUGG**: Low-priority gap where brittleness risk is high or the code path is unlikely to regress. Rare. +- **WARN**: Coverage gap for business logic, error handling paths, or integration points. +- **SUGG**: Low-priority gap where brittleness risk is high or the code path is unlikely to regress. + +Size-based demotion is governed by the calibration directive in [SKILL.md](../SKILL.md) Step 3.3, which is the authoritative home for size-based severity rules. The bands above define what each severity means in this category; Step 3.3 governs which findings escalate to those bands at the change's size (read from Step 3.1). When uncertain, prefer the lower severity. ### Processing edge-case-explorer results For each edge case item (EC1, EC2, ...) from the edge-case-explorer, assign category **[Testing: Edge Case]** and classify severity: - **CRIT**: Critical or High priority edge cases — likely AND severe AND not handled or tested. Especially those involving security, data corruption, or data isolation. -- **WARN**: Medium priority edge cases — plausible with moderate impact, or partially handled but untested. Most findings land here. -- **SUGG**: Low priority edge cases — unlikely or low-impact. Rare. +- **WARN**: Medium priority edge cases (plausible with moderate impact, or partially handled but untested). +- **SUGG**: Low priority edge cases (unlikely or low-impact). + +Size-based demotion is governed by [SKILL.md](../SKILL.md) Step 3.3 (the authoritative home). 
The bands above define each severity; Step 3.3 governs which edge cases escalate to those bands at the change's size (read from Step 3.1). When uncertain, prefer the lower severity. ### Processing adversarial-security-analyst results @@ -31,51 +35,61 @@ Security findings are not subject to the 30-item cap that applies to manual revi For each structural finding (S1, S2, ...) from the structural-analyst, assign category **[Structure]** and classify severity: - **CRIT**: Module boundary violation or cyclic dependency introduced by this change that blocks safe future work (e.g., domain code reaching into infrastructure, a new cycle that forces cascading edits). -- **WARN**: New coupling across seams, duplication with an existing utility, a leaky abstraction, or an unjustified interface with one implementation. Most findings land here. +- **WARN**: New coupling across seams, duplication with an existing utility, a leaky abstraction, or an unjustified interface with one implementation. - **SUGG**: Cohesion or naming-level structural smells the reader can live with until a broader refactor. +Size-based demotion is governed by [SKILL.md](../SKILL.md) Step 3.3 (the authoritative home). The bands above define each severity; Step 3.3 governs which structural findings escalate to those bands at the change's size (read from Step 3.1). When uncertain, prefer the lower severity. + ### Processing behavioral-analyst results For each behavioral finding (B1, B2, ...) from the behavioral-analyst, assign category **[Behavior]** and classify severity: - **CRIT**: Silent data loss, error swallowed at an integration boundary, state mutation that violates a documented invariant, or data flow that sends incorrect values to an external system. -- **WARN**: Error propagation gap that only surfaces under failure modes, state-management coupling that makes the change brittle, or boundary assumption not matched by the caller. Most findings land here. 
+- **WARN**: Error propagation gap that only surfaces under failure modes, state-management coupling that makes the change brittle, or boundary assumption not matched by the caller. - **SUGG**: Data-flow clarity issue — the code works, but the pathway is hard to trace. +Size-based demotion is governed by [SKILL.md](../SKILL.md) Step 3.3 (the authoritative home). The bands above define each severity; Step 3.3 governs which behavioral findings escalate to those bands at the change's size (read from Step 3.1). When uncertain, prefer the lower severity. + ### Processing concurrency-analyst results (only if dispatched) For each concurrency finding (C1, C2, ...) from the concurrency-analyst, assign category **[Concurrency]** and classify severity: - **CRIT**: Race condition on authentication, billing, data isolation, or any path where interleaving produces wrong data. Also CRIT for demonstrable deadlock or lock-ordering reversal. -- **WARN**: Shared-resource contention under realistic load, async error path that swallows failures, or missing cancellation/timeout handling on a long-running task. Most findings land here. +- **WARN**: Shared-resource contention under realistic load, async error path that swallows failures, or missing cancellation/timeout handling on a long-running task. - **SUGG**: Concurrency hazard that is theoretically possible but requires implausible interleaving, or an async pattern that should use a stronger primitive. +Size-based demotion is governed by [SKILL.md](../SKILL.md) Step 3.3 (the authoritative home). The bands above define each severity; Step 3.3 governs which concurrency findings escalate to those bands at the change's size (read from Step 3.1). When uncertain, prefer the lower severity. 
+ ### Processing data-engineer results (only if dispatched) For each data-engineer finding (D1, D2, ...), assign category **[Data]** and classify severity: - **CRIT**: Schema or migration that causes data loss, corruption, or unrecoverable drift; a query that returns wrong rows under realistic concurrent writes; a destructive co-deploy; PII/PHI/PCI stored in plaintext where the project requires encryption; broken referential integrity introduced by this change. -- **WARN**: New N+1, missing covering index on a query the change introduces, expand-only migration that lacks a contract step, isolation-level mismatch with the access pattern, missing constraint on a column the change uses. Most findings land here. +- **WARN**: New N+1, missing covering index on a query the change introduces, expand-only migration that lacks a contract step, isolation-level mismatch with the access pattern, missing constraint on a column the change uses. - **SUGG**: Naming, normalization-level, or modeling concerns the team can defer. Reject findings that violate the calibration directive — particularly multi-instance / replay concerns where the storage primitive is naturally idempotent (e.g., flagging `CREATE INDEX IF NOT EXISTS` as a critical concurrent-deploy hazard). When in doubt about whether a finding survives the directive, demote one severity level. +Size-based demotion is governed by [SKILL.md](../SKILL.md) Step 3.3 (the authoritative home). The bands above define each severity; Step 3.3 governs which data findings escalate to those bands at the change's size (read from Step 3.1). When uncertain, prefer the lower severity. 
+ ### Processing devops-engineer results (only if dispatched) For each devops-engineer finding (DV1, DV2, ...), assign category **[DevOps]** and classify severity: - **CRIT**: Production-readiness gap that the change actively introduces and that has a non-benign worst case — a secret committed to source, a rollout step with no rollback path for a destructive change, an SLO-impacting regression with no observability to catch it, an auth or network policy that exposes a previously private endpoint. -- **WARN**: Missing observability for a new code path, missing feature flag for a behavior that warrants progressive rollout, a CI step that masks a real failure, a Dockerfile change that materially increases image size or attack surface. Most findings land here. +- **WARN**: Missing observability for a new code path, missing feature flag for a behavior that warrants progressive rollout, a CI step that masks a real failure, a Dockerfile change that materially increases image size or attack surface. - **SUGG**: Operational ergonomics — improved alerting, log-line shape, dashboard coverage — for code paths the change touches. Reject findings that violate the calibration directive — particularly hypothetical scale problems for workloads the project does not currently have. Demote one severity level when in doubt. +Size-based demotion is governed by [SKILL.md](../SKILL.md) Step 3.3 (the authoritative home). The bands above define each severity; Step 3.3 governs which devops findings escalate to those bands at the change's size (read from Step 3.1). When uncertain, prefer the lower severity. + ### Processing junior-developer results For each junior-developer finding (JD1, JD2, ...) from the junior-developer, assign category **[Clarity]** and classify severity: - **CRIT**: Only when the finding names a direct violation of a documented coding standard, ADR, or CLAUDE.md rule. Clarity-for-clarity's-sake is never CRIT. 
- **WARN**: Violation of a convention that the project clearly follows elsewhere, or an assumption baked into the change that a teammate would not spot from context. -- **SUGG**: Unclear naming, confusing control flow, or missing-but-optional comment explaining a non-obvious why. Most findings land here. +- **SUGG**: Unclear naming, confusing control flow, or missing-but-optional comment explaining a non-obvious why. Do not duplicate a junior-developer finding when another agent has already raised the same issue — prefer the specialist's classification and reference it from the JD finding instead. diff --git a/plugin/skills/code-review/references/review-checklist.md b/plugin/skills/code-review/references/review-checklist.md index 24bcbb1..bc99319 100644 --- a/plugin/skills/code-review/references/review-checklist.md +++ b/plugin/skills/code-review/references/review-checklist.md @@ -1,6 +1,14 @@ ### Review Checklist **YAGNI** (apply [../../../references/yagni-rule.md](../../../references/yagni-rule.md); these become `YAGNI-###` items in the separate YAGNI section, never CRIT/WARN/SUGG) + +Apply YAGNI in two passes for every change in the diff. Severity calibration is governed by SKILL.md Step 3.3 (the authoritative home), but YAGNI findings are advisory regardless of size and run at every change size. + +1. **Pass 1, evidence test.** For each new abstraction, configuration knob, defensive guard, observability hook, runbook, SLO, index, audit column, feature flag, or speculative addition, ask whether the diff contains evidence of need from one of the acceptable evidence types in [`yagni-rule.md`](../../../references/yagni-rule.md) Gate 1. If yes, do not flag. +2. **Pass 2, anti-pattern check.** Only for items that fail Pass 1, match against the named anti-patterns below. Items that match any anti-pattern become `YAGNI-###` findings. 
The body of the finding must name (a) the failing evidence type from Pass 1, (b) the matched anti-pattern from this list, and (c) the simpler form considered. + +Named anti-patterns to match in Pass 2: + - New abstraction (interface, base class, port, adapter) introduced for code with one current concrete implementation and no churn history - Configuration knob, env var, or feature flag added with no caller setting a non-default value, no documented rollout strategy, or no expiration criterion - Defensive guard (null check, type check, validation) added at a trusted internal boundary the caller fully controls