fix(goal_curation): progress-evidence gate fails closed on semantic verdict parse-miss (#2569) by rysweet · Pull Request #2600 · rysweet/Simard

rysweet · 2026-07-04T21:41:16Z

Summary

Hardens Simard's progress-evidence gate — the reviewer that decides whether a claimed goal-progress % increase is honest — from fail-open to fail-closed on a semantic verdict parse-miss. Previously, when the LLM/recipe reviewer produced no recognizable accept/reject verdict on a successful, non-empty run, the gate returned Accept, letting hallucinated "0%→100% with no verdict" bumps land as false "done" states (a recurring reasoner-reliability failure).

Policy (mirrors the merge judge's infra-vs-semantic split), applied to both live tiers

The daemon resolves the gate in three tiers: RecipeProgressChecker (primary) → LlmReviewerProgressChecker (direct-LLM fallback, progress_reviewer.rs) → Noop. Both live LLM tiers now follow:

INFRA failure → keep fail-OPEN (Accept): LLM transport error; recipe spawn failure / non-zero exit; output that strips to entirely empty. Goals are never blocked on infra hiccups.
SEMANTIC parse-miss → fail-CLOSED (Reject): a successful, non-empty response with no recognizable verdict, or a valid JSON object with an unknown verdict string. Reject only keeps the prior percent and logs a hallucination alert (operations.rs); it does not stall the goal, and the claim is re-attemptable next cycle.

Robustness fix (recipe tier)

The recipe tier previously used a naïve substring scan (contains("reject") then contains("accept")), which wrongly Rejected a legitimate {"verdict":"accept","rationale":"…no reason to reject…"} and could Accept "unacceptable". It now parses the structured {"verdict": …} JSON first (reusing the direct-LLM tier's tolerant parse_reviewer_response), falling back to the keyword scan only for prose — matching the merge judge and direct-LLM tier.

Relationship to #2569

#2569 reports simard merge-pr hard-failing with no verdict keyword found on a SUCCESS-with-no-verdict recipe run. That merge-path hard-fail was already fixed on main by #2486 / #2490 / #2504 (JSON-envelope transport + graceful fail-closed to Verdict::Unclear → MergeOutcome::Refused), all merged before #2569 was filed against released v0.22.0. This PR:

Adds a regression test pinning the reporter's exact SUCCESS banners (both 30s and 102s, per "not a timeout race") → Verdict::Unclear, locking simard merge-pr: merge-readiness-judge recipe returns SUCCESS but no verdict keyword #2569's repro.
Fixes the analogous fail-open in the progress path (the same verdict-transport family, but on the goal-progress gate rather than the merge gate).

Merge-ready evidence

1. qa-team scenarios (gadugi-test)

New outside-in scenario: tests/gadugi/progress-verdict-fail-closed.{yaml,sh}.
gadugi-test validate -f tests/gadugi/progress-verdict-fail-closed.yaml --strict → Valid files: 1.
gadugi-test run -d tests/gadugi -s progress-verdict-fail-closed → Passed: 1, Failed: 0.
The scenario pins 13 named #[test] fns across the 3 modules (guards against silent rename→no-op) and proves: JSON-accept-with-reject-in-rationale → Accept; unknown verdict → Reject; non-empty no-keyword → Reject; empty/noise-only → Accept; production SUCCESS-banner-only → Reject; direct-LLM tier fail-closed/open; simard merge-pr: merge-readiness-judge recipe returns SUCCESS but no verdict keyword #2569 banners → Unclear.

2. Docs (user-facing reference surfaces updated)

docs/reference/progress-evidence-api.md, docs/reference/text-parsing-wire-formats.md (§2a), docs/concepts/progress-evidence-gating.md, docs/concepts/text-based-brain-protocol.md — updated to the new infra/semantic policy and the JSON-first parse order. Also corrected a stale claim that progress_reviewer.rs "was deleted" (it is the live direct-LLM fallback tier) and a wrong ADAPTER_TAG prefix in a decision table.
mkdocs build --strict → exit 0.

3. Quality-audit (3 SEEK→VALIDATE→FIX cycles, clean final)

Cycle 1 (code-review): no blockers; confirmed the discriminator + downstream Reject safety.
Cycle 2 (rubber-duck): 1 blocking finding — the recipe tier's substring scan undermined the reliability claim (accept-with-reject-in-rationale → wrong Reject). Fixed by parsing structured JSON first + added 4 regression tests.
Cycle 3 (code-review): clean — zero Critical/High/Medium findings. One LOW doc-consistency nit (sibling doc repeating the stale "deleted" claim) was also fixed.

4. CI green (validated locally against the CI invocation)

cargo test --all-features --locked --no-fail-fast -- --skip install_packages_runs_and_self_installs → 0 failures, 7327 passed (clean env matching CI).
cargo clippy --all-targets --all-features --locked -- -D warnings → pass.
cargo fmt --all -- --check → pass. Pre-commit + pre-push hooks (incl. cargo test --release + full clippy) → pass.
Note: in this live-engineer shell, SIMARD_SCALING=auto is exported, which makes the pre-existing, unrelated ooda_loop::decide::tests::decide_respects_max_concurrent_actions read a scaler override and fail. It passes with the var unset (as in CI). Not touched by this PR.

6. Focused diff

10 files: 4 source (2 behavior + 1 comment-only + 1 test-only), 4 docs, 2 new gadugi scenario files. No unrelated edits.

Closes #2569

…erdict parse-miss (#2569) The progress-evidence gate previously FAILED OPEN (Accept) whenever the LLM/recipe reviewer produced no recognizable accept/reject verdict — even on a successful, non-empty run — letting hallucinated "0%->100% with no verdict" bumps land as false "done" states. Adopt the merge judge's infra-vs-semantic split across both live tiers (RecipeProgressChecker primary + LlmReviewerProgressChecker direct-LLM fallback): - INFRA failure (transport error / spawn failure / non-zero exit / output that strips to empty) -> keep fail-OPEN, so goals aren't blocked on infra hiccups. - SEMANTIC parse-miss (successful, non-empty response with no verdict, or an unknown verdict string) -> fail-CLOSED (Reject). Reject only keeps the prior percent + logs a hallucination alert; it does not stall the goal. Also parse the structured {"verdict": ...} JSON first in the recipe tier (was a naive substring scan that wrongly Rejected an `accept` whose rationale mentioned "reject", and could Accept "unacceptable"), reusing the direct-LLM tier's tolerant parse_reviewer_response — matching the merge judge. The merge path reported in #2569 was already fail-closed on main (#2486/#2490/#2504, merged before #2569 was filed against released v0.22.0); this adds a regression test pinning the reporter's exact SUCCESS-with-no-verdict banners (30s AND 102s) -> Verdict::Unclear, and fixes the analogous fail-open in the progress path. Docs (progress-evidence-api, text-parsing-wire-formats §2a, progress-evidence-gating, text-based-brain-protocol) updated to the new policy and corrected to reflect that progress_reviewer.rs is the live direct-LLM fallback tier (not deleted). Adds a gadugi outside-in scenario plus parser/decision regression tests. Closes #2569 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions · 2026-07-04T21:46:48Z

📊 Coverage Summary

Generated by cargo llvm-cov --workspace --summary-only (nightly, excluding test files)

Module	Lines	Covered	Coverage
Total	130639	108214	82.8%

_{Coverage data from CI run. Test files matching tests?/ are excluded from line counts.}

rysweet merged commit 3729e1f into main Jul 4, 2026
17 checks passed

rysweet deleted the fix/progress-verdict-fail-closed-2569 branch July 4, 2026 21:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(goal_curation): progress-evidence gate fails closed on semantic verdict parse-miss (#2569)#2600

fix(goal_curation): progress-evidence gate fails closed on semantic verdict parse-miss (#2569)#2600
rysweet merged 1 commit into
mainfrom
fix/progress-verdict-fail-closed-2569

rysweet commented Jul 4, 2026

Uh oh!

github-actions Bot commented Jul 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

rysweet commented Jul 4, 2026

Summary

Policy (mirrors the merge judge's infra-vs-semantic split), applied to both live tiers

Robustness fix (recipe tier)

Relationship to #2569

Merge-ready evidence

1. qa-team scenarios (gadugi-test)

2. Docs (user-facing reference surfaces updated)

3. Quality-audit (3 SEEK→VALIDATE→FIX cycles, clean final)

4. CI green (validated locally against the CI invocation)

6. Focused diff

Uh oh!

github-actions Bot commented Jul 4, 2026

📊 Coverage Summary

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant