Skip to content

feat: ship feature-flow:consult-codex Phase 1+2 (opt-in) (#235)#240

Merged
uta2000 merged 24 commits intomainfrom
feat/codex-consultation
Apr 17, 2026
Merged

feat: ship feature-flow:consult-codex Phase 1+2 (opt-in) (#235)#240
uta2000 merged 24 commits intomainfrom
feat/codex-consultation

Conversation

@uta2000
Copy link
Copy Markdown
Owner

@uta2000 uta2000 commented Apr 17, 2026

Summary

Adds the feature-flow:consult-codex skill — a Claude-orchestrated four-phase flow that consults the existing codex MCP server as a second AI opinion at lifecycle checkpoints. Opt-in by default (codex.enabled: false), so existing installs see no behavioral change.

Closes #235.

What ships in Phase 1+2

  • feature-flow:consult-codex skill — orchestration backbone (consult.js CLI with start / record-response / verdict subcommands), state module (atomic JSON writes + GC), config loader (regex YAML, no new deps), model fallback chain, brief skeleton builder (12 KB UTF-8 byte cap), record-exchange (state + Markdown log with pending-section rewrite), and the review-design proactive mode.
  • PreToolUse verdict-gate.js hook — fail-open Skill gate that BLOCKs non-verdict calls when a strict consultation is pending. Wired in for the deferred stuck mode; soft tier ships without it.
  • feature-flow:design-document integration — new "Optional codex review" step gated on codex.enabled AND codex.proactive_reviews.design_doc. No-op when codex is off.
  • Config template — commented codex: block in skills/start/references/project-context.md and the dogfooding .feature-flow.yml so users discover the option.

Architecture (load-bearing)

Claude-orchestrated four-phase flow. MCP tools like mcp__codex__codex are only callable from Claude's own context, not subprocess scripts — so the skill is an orchestration GUIDE Claude follows, not a monolithic runner:

  1. consult.js start (subprocess) — builds brief, returns JSON, does NOT touch state
  2. Claude calls mcp__codex__codex directly (not subprocess), per the SKILL.md error-classification table
  3. consult.js record-response (subprocess, stdin-driven) — appends consultation to state, writes log, increments budget
  4. consult.js verdict (separate Skill call) — records decision, rewrites pending log section

Budget is incremented in Phase 3, NOT Phase 1, so failed MCP calls don't burn budget.

Tiered verdict enforcement

  • Soft tier (proactive review-design): single-shot reminder; missing verdict surfaces as <not recorded> PR audit defect.
  • Strict tier (reactive stuck, deferred to follow-up plan): PreToolUse hook BLOCKs non-verdict Skill calls until verdict is recorded.

Out of scope (deferred follow-ups)

  • Phase 3: review-plan mode (after verify-plan-criteria)
  • Phase 4: review-code mode + PR feature-flow-metadata integration
  • Phase 5: stuck mode + signal collector + escape-hatch dispatcher

Quality gates passed

  • 24 commits on the branch with two-stage subagent reviews (spec-compliance + code-quality) on every implementation task
  • 85 test assertions across 9 test files — state (13), config (10), resolve-model (6), build-brief (6), record-exchange (5), consult (13), review-design (8), smoke e2e (14), verdict-gate (10)
  • 4-reviewer pre-push panel (security, architecture, silent-failure, test-coverage) — all blockers fixed and re-verified:
    • Security: stdin size cap, path traversal containment in loadDoc
    • Architecture: setMetadata self-init bug fixed (broke first-use happy path); session-id default harmonized with consult.js CLI to prevent silent GC wipe
    • Silent-failure: stderr observability on F4-F8 paths
    • Test coverage: verdict-gate fail-open paths, dispatcher path-traversal, CLI shell-invocation, surrogate-pair truncation, hand-edited log
  • Architecture refactor: extracted state.peek() to consolidate three duplicate raw-fs reads in consult.js start

Real bugs caught and fixed during review

  • T1: cryptic ENOENT in mutators when load() was forgotten → readOrThrow helper
  • T2: YAML loader retained quotes around values (would've broken T3 model lookup) → stripValue helper
  • T5: 12 KB byte cap overshoot on multi-byte UTF-8 codepoint boundary → trailing \uFFFD strip
  • T7: start swallowed corrupt-state silently → stderr warning; error path bypassed state.load → routed through it
  • T8: Phase 3 echo piping broke on apostrophes/backticks in Codex response → quoted heredoc; error status undocumented → added
  • T10: same UTF-8 boundary bug as T5; load-bearing T7/T10 integration gap (transientState didn't forward design_doc_path) → bridged via persisted-state read
  • Pre-push panel: setMetadata self-init session-id used Date.now(), mismatched consult.js's cli-session default → would silently GC the metadata; harmonized

Test plan

  • All 85 unit/integration test assertions pass locally (for t in skills/consult-codex/scripts/*.test.js skills/consult-codex/scripts/modes/*.test.js hooks/scripts/verdict-gate.test.js; do node "$t" || exit 1; done)
  • End-to-end smoke test (smoke.test.js) exercises four-phase orchestration with simulated MCP result
  • Verdict-gate hook tested for fail-open paths, args-regex tolerance, and skill-name mismatch BLOCK
  • Plugin minor-bumped to 1.36.0 (additive opt-in, no breaking changes)
  • Rebased onto current main (resolved one CHANGELOG conflict, kept both ### Added entries — quick-path start: add quick-path triage with code-aware scope confirmation #234 and codex)
  • Manual live-codex smoke test (spec sub-step 12.4): completed against real mcp__codex__codex MCP server in isolated temp worktree — all four phases passed end-to-end, log/state invariants verified, Codex returned substantive design review (not rubber stamp). Result documented on #235.

Known minor follow-ups (do not block merge)

  • The codex example block was added to dogfooding .feature-flow.yml and skills/start/references/project-context.md, but NOT to references/project-context-schema.md (the canonical schema doc). Adding it there improves discoverability for users browsing the schema.
  • plugin.json version bump reformatted the previously-inline keywords array onto multiple lines (spec-prescribed JSON.stringify(j,null,2); cosmetic noise).

🤖 Generated with Claude Code

uta2000 and others added 24 commits April 17, 2026 13:04
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Addresses code-quality review of T1: mutators now fail with a clear
'state not initialized' error if load() was never called, and tests
cover both that path and the existing setVerdict unknown-id throw.
T2 review found that quoted string values (model: "gpt-5.2") were
passed through to callers with embedded quotes, which would silently
break T3 model resolution. Adds stripValue helper. Also renames the
malformed-yaml test to reflect that it tests throw-safety (the
loader is intentionally lenient and silently drops unrecognized
lines), and removes a dead currentKey variable.
T5 review found that Buffer slice mid-codepoint produced a U+FFFD
replacement that overran the cap by 1-2 bytes. Strip the trailing
replacement char so the result stays <= MAX_BRIEF_BYTES. Tighten the
truncation test to use Buffer.byteLength and add a multi-byte case.
Also fix the constraint-block test count (4->5) and add the missing
wrong-goal substring check.
…isolation

T6 review found that briefExcerpt would crash on non-string brief
inputs (caller-side guard only handled falsy values), and there was
no test pinning the multi-consultation rewrite isolation invariant
that the rewrite path silently relies on. Add a String() coercion
guard, an isolation test, and extract the PENDING_MARKER constant
so the search and slice paths share one source of truth.
…d budget-exhaustion test

T7 review found that the recordResponse error path bypassed state.load
(skipping recovery branches and inconsistent with the rest of the
codebase) and that start silently swallowed corrupt-state errors,
which would let every consultation get a 'first call free' on a
broken state file. Routes the error-path read through state.load,
adds stderr warnings on the two corrupt-state catches in start,
drops the unused sessionId destructure, hoists fs to the top of the
file, and adds the missing budget-exhaustion cross-subcommand test.
…error status

T8 review found that Phase 3's echo-pipe pattern would break on
apostrophes/backticks in Codex's response, the error status was
undocumented for both subprocess phases, and the references files
listed deferred features without a deferred-status label. Replace
echo with a quoted heredoc, add the error status bullet to both
phases, and label modes.md / escape-hatch.md deferred features.
T10 review found that loadDoc had the same UTF-8 boundary overrun as
T5's build-brief, the 10000-char truncation assertion was
unsatisfiable against the 10240-byte cap, and the new T10 buildInputs
throw broke the dispatcher's start path because consult.js's
transientState didn't forward design_doc_path from the persisted
state file. Strip trailing U+FFFD on doc truncation, switch the
assertion to byte length + add a multi-byte case, read design_doc_path
into transientState (read-only, preserves the no-touch-state invariant),
and update the 2 dispatcher tests to set up a design doc and metadata.
…din cap, observability

Panel review found two blockers and three cheap hardening wins:
- setMetadata threw on missing state, breaking the design-document
  integration's first-use happy path. Self-initialize using env-var
  defaults that match the consult.js CLI wrapper.
- loadDoc had no path-traversal containment, so a design_doc_path
  typo could silently exfiltrate ~/.aws/credentials or ~/.ssh/id_rsa
  via Codex. Reject paths outside the worktree.
- Unbounded stdin in consult.js could OOM on a runaway 100MB Codex
  response. Cap at 2 MB and convert overflow to codex_call_failed.
- verdict-gate's outer catch had zero observability — strict-tier
  enforcement could silently break. Add stderr warning.
- config.js silently treated 'enabled: tru' (typo) as disabled with
  no feedback. Warn to stderr when the value is unparseable.
Architecture re-review caught a second-order bug in the prior fix:
setMetadata's auto-init used auto-${Date.now()} when env vars were
absent, but consult.js's CLI wrapper defaults to 'cli-session'. The
mismatch silently triggered GC on the next state.load, wiping the
design_doc_path the design-document integration just wrote. Use the
same 'cli-session' default and add a regression test that proves the
no-env-var integration path survives a consult.js-style load.
Closes the remaining panel follow-ups before push:
- Extract state.peek helper to consolidate consult.js start's three
  duplicate raw-fs reads behind one read-only path with consistent
  corrupt-state handling.
- Forward introspect error message via resolved.detail so operators
  see why model resolution failed instead of bare 'model_unresolvable'.
- Persist mcpResult.error.detail in state.consultations[].error_detail
  for post-mortem audit (log render deferred — state is canonical).
- Tag rewritePendingVerdict no-op branches (header / pending marker
  missing) with stderr warnings so log/state divergence is visible.
- Tag verdict-gate's loadState and JSON.parse catches with stderr
  (kept readStdinSync silent — empty stdin is normal).
- Tag state.js .bak rename with stderr so silent history wipes are
  traceable.
- Widen verdict-gate args regex to tolerate leading whitespace and
  --id=c1 form.
- Add 12 new test assertions (1 resolve-model, 3 consult, 6 verdict-gate,
  1 review-design surrogate-pair, 1 record-exchange hand-edit). Total
  suite: 85 assertions across 9 files.
@uta2000 uta2000 merged commit 2c289be into main Apr 17, 2026
uta2000 added a commit that referenced this pull request Apr 17, 2026
…236 #238 #239

Aligns version fields across .claude-plugin/plugin.json (already at 1.36.0
from #240), .claude-plugin/marketplace.json, and .feature-flow.yml. Cuts the
[1.36.0] - 2026-04-17 release heading in CHANGELOG.md bundling the five
features shipped today:

- #238 Scope-critique pass for design-verification (PR #243, this PR)
- #236 Advisor tool integration (PR #242)
- #239 Senior developer panel, Phase 1c code review (PR #241)
- #235 Codex consultation, Phase 1+2 opt-in (PR #240)
- #234 Quick-path triage with code-aware scope confirmation (PR #237)

Consolidates .changelogs/238.md fragment into CHANGELOG.md (per the Ship
phase convention) and deletes the fragment. Updates README with a
"What's New in 1.36.0" section listing all five PRs, adds a
Codex Consultation subsection under Optional Enhancements, adds a
Quick-path triage subsection, and updates the Skills table to include
consult-codex and the scope-critique pass note for design-verification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
uta2000 added a commit that referenced this pull request Apr 17, 2026
* docs: add implementation plan for scope-critique-pass (#238)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(design-verification): add scope-critique pass (#238)

Adds a ## Scope critique section to skills/design-verification/SKILL.md and a
supporting reference file with five strategic-shape questions (scope,
dependencies, simpler alternatives, observability, config surface) that run
after the existing compatibility check to catch oversized scope, phantom
dependencies, and unobservable capability bets before a design reaches
create-issue.

Motivated by #236 post-creation review: 6 strategic issues bypassed existing
brainstorming, design-verification, and self-review gates. The new pass fills
that gap with a bias-resistance checklist, red-flag lists per question, and a
findings table that promotes BLOCKER rows to the top-level Blockers list.

Closes #238

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: add changelog fragment for #238

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version to 1.36.0 and consolidate changelog for #234 #235 #236 #238 #239

Aligns version fields across .claude-plugin/plugin.json (already at 1.36.0
from #240), .claude-plugin/marketplace.json, and .feature-flow.yml. Cuts the
[1.36.0] - 2026-04-17 release heading in CHANGELOG.md bundling the five
features shipped today:

- #238 Scope-critique pass for design-verification (PR #243, this PR)
- #236 Advisor tool integration (PR #242)
- #239 Senior developer panel, Phase 1c code review (PR #241)
- #235 Codex consultation, Phase 1+2 opt-in (PR #240)
- #234 Quick-path triage with code-aware scope confirmation (PR #237)

Consolidates .changelogs/238.md fragment into CHANGELOG.md (per the Ship
phase convention) and deletes the fragment. Updates README with a
"What's New in 1.36.0" section listing all five PRs, adds a
Codex Consultation subsection under Optional Enhancements, adds a
Quick-path triage subsection, and updates the Skills table to include
consult-codex and the scope-critique pass note for design-verification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

codex: consult codex MCP at 4 lifecycle checkpoints

1 participant