code-review guardrails#3
Merged
Merged
Conversation
Records evidence and root-cause analysis for four reported symptoms of the code-review skill: going too deep, YAGNI under-applying, severity inflation, and human-provided context not reaching sub-agents. Source evidence comes from three real PR bundles in tmp/ plus a behavioral trace of the skill body and dispatched agents. Findings include 12 root causes (C1-C12), 13 proposed solutions (S1-S13), and an adversarial-validator pass that produced V1-V9 corrections.
Synthesized from the 13 solutions in investigation.md through a medium-sized plan-implementation run (software-architect, behavioral-analyst, test-engineer, junior-developer). Round 1 resolved all 8 Open Questions via evidence, reframing, or reasonable-call user defaults; spec-maturity gate did not trip. Plan adopts BA-4's merge of S2+S10 onto a reachability phrase-match signal, scopes S3 and S4 as Step 3.5 dispatcher directives rather than global agent-body edits, adds five long-form docs and a minor version bump as in-scope work, and defers S12, structured rationale signals, cross-file semantic contradiction detection, an automated harness, per-agent unit tests, and Mode C standalone tests under the YAGNI rule.
Recalibrate /code-review so its first pass produces the output users were getting only by running a manual second-pass reclassification. - Make Step 3.3 the single source of truth for size-based demotion. Remove the WARN floor from seven of the nine agent rubrics in agent-finding-classification.md. - Restructure Step 7 into 7.1 (read), 7.2 (reachability phrase-match demotion gate, security-exempt), and 7.3 (rubric classification). - Add Step 1.5 to load PR description, branch commit messages, and implementation plan into \$branch_context, plumbed to every agent. Bind user argument to \$focus_areas. - Add per-agent dispatcher tailoring at Step 3.5 for structural-analyst, behavioral-analyst (default-SUGG), junior-developer, and edge-case-explorer (file-list scoping). - Add Step 9.0 self-consistency check to flag overlapping-line-range findings with contradictory recommendations. - Require premise verification before standards-compliance findings at Step 5. - Rewrite YAGNI as two-pass (evidence test, then anti-pattern match); skip YAGNI in Mode B and Mode C unless explicitly requested. - Add Bash(gh *) to allowed-tools for Step 1.5. - Update affected agent docs and skill docs to mirror the changes.
Audit against the v2.3.0 implementation surfaced several inaccuracies and omissions in the operator-facing documentation. Fix: - docs/agents/junior-developer.md: Plain-Language Reframing is Protocol 8, not Protocol 7 (Protocol 7 is YAGNI Evidence Sweep). - docs/skills/code-review.md "What you get back": correct agent task ID format (B#, S#, JD#, T#, EC#, C#, D#, DV#, not B-NNN/S-NNN/JD-NNN); add D# and DV# entries; add the YAGNI section to the output description with its verbatim opening, ID series, and advisory disposition. - docs/skills/code-review.md "In more detail" step 3: include Step 3.4 domain-scoped file list slicing with concrete examples. - docs/skills/code-review.md Sizing table: remove "from agents" phrasing that implied agents and manual review follow different calibration rules; SKILL.md applies the same size rule to both. - docs/skills/code-review.md Step 1.5 Key Concept and prose: name the three pr-body filename variants, the planning directory branch-name matching rule, and the 200-word $branch_context cap. - docs/yagni.md table: note the two-pass procedure and the Mode B / C YAGNI skip on the /code-review row. - docs/skills/gh-pr-review.md: add a Key Concept noting that the wrapped /code-review Step 1.5 feeds the PR description into agent prompts. - CHANGELOG.md v2.3.0 Documentation list: add docs/yagni.md and docs/skills/gh-pr-review.md.
…bric Brings the junior-developer SUGG band in line with the other rubrics after the v2.3.0 floor removal. The rubric defines what kind of finding belongs in each band; size-based placement is governed by SKILL.md Step 3.3 (the authoritative home), not by an anchor sentence.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR recalibrates
/code-reviewso its first pass produces consistently scoped, non-inflated output, eliminating the manual reclassification step users had been running to get actionable results.gh pr view, localpr-bodyfiles, branch commit messages, implementation plans) into a 200-word$branch_contextblock that ships verbatim to every dispatched agent, so agents can avoid re-raising items the team has already deferred.Behavior changes
Before: Agents ran with no shared context about the PR's intent or deferred items, a WARN floor in seven of nine agent rubrics ensured most findings inflated to WARN regardless of change size, theoretical findings ("could happen," "defense-in-depth") survived to the output unchanged, and contradictory same-file recommendations required the human to adjudicate without any flag that a conflict existed.
After:
$branch_contextTension with {other-task-id}:to each$focus_areasstructural-analyst/behavioral-analystdefault severityWhat to look at first
theoretical,hypothetical,defense-in-depth,effectively impossible,in case the upstream,could happen,should never happen,edge case that does not occur) are the entire gate. A finding whose rationale avoids these phrases but is still theoretical passes through. The plan chose phrase-matching over a structured field intentionally — verify that tradeoff is documented well enough that future editors don't add phrases ad hoc.structural-analyst,behavioral-analyst,junior-developer, andedge-case-explorerdirectives live in Step 3.5, not in the agent definition files. This means agents called by other skills are unaffected, but it also means the behavior is invisible from the agent docs alone. The four affected agent docs carry a one-paragraph note — check that it's accurate.$branch_contexttonone provided. A reviewer should confirm that the warning is sufficient and that agents handlenone providedgracefully without hallucinating context.How this was tested
tmp/gearjot-v2-web-pr-{299,307,339}/and compared finding counts and severities against pre-v2.3.0 output to confirm inflation was reduced.theoreticalanddefense-in-depthlanguage in the PR-299 bundle.Tension withannotations.Files of interest
plugin/skills/code-review/SKILL.md— the skill itself; Steps 1.5, 3.3, 3.5, 7.1-7.3, and 9.0 are the load-bearing additionsplugin/...nces/agent-finding-classification.md— the rubric where the WARN floor was removed from seven agent sectionsplugin/skills/code-review/references/review-checklist.md— the YAGNI two-pass procedure and the Mode B/C skip ruleCHANGELOG.md— the v2.3.0 entry is the fastest way to orient; the Deferred section names what was kept out and whydocs/skills/code-review.md— operator-facing doc updated to mirror the new step structure; confirms the implementation is accurately described