fix(goal_curation): progress-evidence gate fails closed on semantic verdict parse-miss (#2569)#2600
Merged
Merged
Conversation
…erdict parse-miss (#2569) The progress-evidence gate previously FAILED OPEN (Accept) whenever the LLM/recipe reviewer produced no recognizable accept/reject verdict — even on a successful, non-empty run — letting hallucinated "0%->100% with no verdict" bumps land as false "done" states. Adopt the merge judge's infra-vs-semantic split across both live tiers (RecipeProgressChecker primary + LlmReviewerProgressChecker direct-LLM fallback): - INFRA failure (transport error / spawn failure / non-zero exit / output that strips to empty) -> keep fail-OPEN, so goals aren't blocked on infra hiccups. - SEMANTIC parse-miss (successful, non-empty response with no verdict, or an unknown verdict string) -> fail-CLOSED (Reject). Reject only keeps the prior percent + logs a hallucination alert; it does not stall the goal. Also parse the structured {"verdict": ...} JSON first in the recipe tier (was a naive substring scan that wrongly Rejected an `accept` whose rationale mentioned "reject", and could Accept "unacceptable"), reusing the direct-LLM tier's tolerant parse_reviewer_response — matching the merge judge. The merge path reported in #2569 was already fail-closed on main (#2486/#2490/#2504, merged before #2569 was filed against released v0.22.0); this adds a regression test pinning the reporter's exact SUCCESS-with-no-verdict banners (30s AND 102s) -> Verdict::Unclear, and fixes the analogous fail-open in the progress path. Docs (progress-evidence-api, text-parsing-wire-formats §2a, progress-evidence-gating, text-based-brain-protocol) updated to the new policy and corrected to reflect that progress_reviewer.rs is the live direct-LLM fallback tier (not deleted). Adds a gadugi outside-in scenario plus parser/decision regression tests. Closes #2569 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
📊 Coverage Summary
Coverage data from CI run. Test files matching |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Hardens Simard's progress-evidence gate — the reviewer that decides whether a claimed goal-progress % increase is honest — from fail-open to fail-closed on a semantic verdict parse-miss. Previously, when the LLM/recipe reviewer produced no recognizable
accept/rejectverdict on a successful, non-empty run, the gate returnedAccept, letting hallucinated "0%→100% with no verdict" bumps land as false "done" states (a recurring reasoner-reliability failure).Policy (mirrors the merge judge's infra-vs-semantic split), applied to both live tiers
The daemon resolves the gate in three tiers:
RecipeProgressChecker(primary) →LlmReviewerProgressChecker(direct-LLM fallback,progress_reviewer.rs) →Noop. Both live LLM tiers now follow:Accept): LLM transport error; recipe spawn failure / non-zero exit; output that strips to entirely empty. Goals are never blocked on infra hiccups.Reject): a successful, non-empty response with no recognizable verdict, or a valid JSON object with an unknown verdict string.Rejectonly keeps the prior percent and logs a hallucination alert (operations.rs); it does not stall the goal, and the claim is re-attemptable next cycle.Robustness fix (recipe tier)
The recipe tier previously used a naïve substring scan (
contains("reject")thencontains("accept")), which wrongly Rejected a legitimate{"verdict":"accept","rationale":"…no reason to reject…"}and could Accept"unacceptable". It now parses the structured{"verdict": …}JSON first (reusing the direct-LLM tier's tolerantparse_reviewer_response), falling back to the keyword scan only for prose — matching the merge judge and direct-LLM tier.Relationship to #2569
#2569 reports
simard merge-prhard-failing withno verdict keyword foundon a SUCCESS-with-no-verdict recipe run. That merge-path hard-fail was already fixed onmainby #2486 / #2490 / #2504 (JSON-envelope transport + graceful fail-closed toVerdict::Unclear→MergeOutcome::Refused), all merged before #2569 was filed against released v0.22.0. This PR:Verdict::Unclear, locking simard merge-pr: merge-readiness-judge recipe returns SUCCESS but no verdict keyword #2569's repro.Merge-ready evidence
1. qa-team scenarios (gadugi-test)
tests/gadugi/progress-verdict-fail-closed.{yaml,sh}.gadugi-test validate -f tests/gadugi/progress-verdict-fail-closed.yaml --strict→ Valid files: 1.gadugi-test run -d tests/gadugi -s progress-verdict-fail-closed→ Passed: 1, Failed: 0.#[test]fns across the 3 modules (guards against silent rename→no-op) and proves: JSON-accept-with-reject-in-rationale → Accept; unknown verdict → Reject; non-empty no-keyword → Reject; empty/noise-only → Accept; production SUCCESS-banner-only → Reject; direct-LLM tier fail-closed/open; simard merge-pr: merge-readiness-judge recipe returns SUCCESS but no verdict keyword #2569 banners → Unclear.2. Docs (user-facing reference surfaces updated)
docs/reference/progress-evidence-api.md,docs/reference/text-parsing-wire-formats.md(§2a),docs/concepts/progress-evidence-gating.md,docs/concepts/text-based-brain-protocol.md— updated to the new infra/semantic policy and the JSON-first parse order. Also corrected a stale claim thatprogress_reviewer.rs"was deleted" (it is the live direct-LLM fallback tier) and a wrongADAPTER_TAGprefix in a decision table.mkdocs build --strict→ exit 0.3. Quality-audit (3 SEEK→VALIDATE→FIX cycles, clean final)
Rejectsafety.4. CI green (validated locally against the CI invocation)
cargo test --all-features --locked --no-fail-fast -- --skip install_packages_runs_and_self_installs→ 0 failures, 7327 passed (clean env matching CI).cargo clippy --all-targets --all-features --locked -- -D warnings→ pass.cargo fmt --all -- --check→ pass. Pre-commit + pre-push hooks (incl.cargo test --release+ full clippy) → pass.SIMARD_SCALING=autois exported, which makes the pre-existing, unrelatedooda_loop::decide::tests::decide_respects_max_concurrent_actionsread a scaler override and fail. It passes with the var unset (as in CI). Not touched by this PR.6. Focused diff
10 files: 4 source (2 behavior + 1 comment-only + 1 test-only), 4 docs, 2 new gadugi scenario files. No unrelated edits.
Closes #2569