fix(brain/memory): shared robust JSON/verdict extractor for noisy recipe stdout (#2484)#2490
Conversation
Independent confirmation + repair guidance to land thisA separate engineer cycle on this goal independently re-derived the same design for the distill sub-path (shared Confirms the live finding. This cycle reproduced the distill parse failure again at t≈8001 with a fresh signature Blockers to landing (neither is a defect in this diff):
Quality bar: the six merge-ready criteria (qa scenario, docs, ≥3 quality-audit cycles, CI green, evidence in body, focused diff) should be re-verified after the rebase, since the conflict resolution is a new change. |
…ipe stdout
`recipe-runner-rs` stdout (and the `step_results[].output` string inside its
`--output-format json` envelope) is routinely contaminated with three kinds of
non-payload noise that broke the formerly bespoke per-phase extractors:
1. ANSI SGR/CSI/OSC colour codes from `tracing` (a raw `\x1b` byte is invalid
inside a JSON document, so `serde_json` rejects the span).
2. Timestamped tracing-log lines interleaved with the agent answer.
3. The runner's text-mode summary banner (`Recipe: … SUCCESS`, `Steps: …`,
`[completed] …`).
Each recipe-backed phase scanned that raw text with its own fragile extractor
and fell back to a permissive default on a miss — the exact OODA failures
observed (distill `did not contain a parseable { "facts": [...] } object`,
merge/progress `no verdict keyword found`, lifecycle `continue_skipping`).
Add one shared, hardened `src/recipe_output/` module as the only
ANSI/log/banner-stripping path (`strip_ansi`, `strip_recipe_noise`,
`balanced_objects`/`last_balanced_object`/`extract_json_payload` dual-pass,
`extract_verdict`, `record_parse_outcome`). Adopt it in distill, merge-judge,
progress-checker, and the three OODA brain phases (decide/orient/lifecycle);
the distill-private ANSI stripper and the two duplicate strippers
(`meeting_backend::sanitize`, `stewardship::dedup`) now delegate to it.
Emit `recipe_parse_{success,failure}_total{phase}` at each subprocess call site
(never inside a pure parse fn, so unit tests write no metrics), complementary to
the brain phases' existing `brain_verdict_parsed_total{phase,outcome}` (#2429).
Scope guard: `strip_*` return `Cow::Borrowed`/byte-identical text on clean
output, so no phase's decision semantics change — only previously-defaulted
noisy cases now recover.
Closes #2484
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
52e08ea to
a1e81a3
Compare
📊 Coverage Summary
Coverage data from CI run. Test files matching |
…erdict parse-miss (#2569) (#2600) The progress-evidence gate previously FAILED OPEN (Accept) whenever the LLM/recipe reviewer produced no recognizable accept/reject verdict — even on a successful, non-empty run — letting hallucinated "0%->100% with no verdict" bumps land as false "done" states. Adopt the merge judge's infra-vs-semantic split across both live tiers (RecipeProgressChecker primary + LlmReviewerProgressChecker direct-LLM fallback): - INFRA failure (transport error / spawn failure / non-zero exit / output that strips to empty) -> keep fail-OPEN, so goals aren't blocked on infra hiccups. - SEMANTIC parse-miss (successful, non-empty response with no verdict, or an unknown verdict string) -> fail-CLOSED (Reject). Reject only keeps the prior percent + logs a hallucination alert; it does not stall the goal. Also parse the structured {"verdict": ...} JSON first in the recipe tier (was a naive substring scan that wrongly Rejected an `accept` whose rationale mentioned "reject", and could Accept "unacceptable"), reusing the direct-LLM tier's tolerant parse_reviewer_response — matching the merge judge. The merge path reported in #2569 was already fail-closed on main (#2486/#2490/#2504, merged before #2569 was filed against released v0.22.0); this adds a regression test pinning the reporter's exact SUCCESS-with-no-verdict banners (30s AND 102s) -> Verdict::Unclear, and fixes the analogous fail-open in the progress path. Docs (progress-evidence-api, text-parsing-wire-formats §2a, progress-evidence-gating, text-based-brain-protocol) updated to the new policy and corrected to reflect that progress_reviewer.rs is the live direct-LLM fallback tier (not deleted). Adds a gadugi outside-in scenario plus parser/decision regression tests. Closes #2569 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Problem (root cause)
recipe-runner-rsstdout — and thestep_results[].outputstring inside its--output-format jsonenvelope — is routinely contaminated with three kinds ofnon-payload noise that broke the formerly bespoke per-phase extractors:
tracing/env_logger(e.g. a leading\x1b[2m"dim" whose rawESC/0x1bbyte is invalid inside a JSONdocument, so
serde_jsonrejects the span).Recipe: … SUCCESS,Steps: …,[completed] …).Each recipe-backed phase scanned that raw text with its own fragile extractor
and fell back to a permissive default on a miss — the exact OODA failures
observed in episodes:
'distill' step output did not contain a parseable { "facts": [...] } object→ the whole 20-episode batch deferred a cycle (live evidence at t=8451).
no verdict keyword … found→ fail-closed / fail-open default.continue_skipping/advance_goal/ floor.These are one root cause (noisy stdout) hitting N bespoke extractors, and the
codebase was already accreting duplicate ANSI strippers.
Change
One shared, hardened
src/recipe_output/module is now the onlyANSI/log/banner-stripping path:
strip_ansi(&str) -> CowCow::Borrowedon the clean path.strip_recipe_noise(&str) -> Cowstrip_ansi+ drop ISO-8601 tracing lines and runner-banner lines.Cow::Borrowedon the clean path.balanced_objects/last_balanced_object/extract_json_payload{…}scan. JSON extraction is dual-pass (line-dropped and ANSI-only) so the payload survives both an interleaved log line inside a pretty body and a same-line log prefix.extract_verdict(raw, keywords)record_parse_outcome(phase, success)recipe_parse_{success,failure}_total{phase}tometrics.jsonl.Adopted by distill (
memory_consolidation/distillation.rs, dual-passscan_for_facts_object), merge-judge (stewardship/recipe_merge_judge.rs),progress-checker (
goal_curation/recipe_progress_checker.rs), and the threeOODA brain phases — decide / orient / engineer-lifecycle
(
ooda_brain/recipe_brain.rs). The distill-private ANSI stripper and the twoduplicate strippers (
meeting_backend::sanitize,stewardship::dedup) nowdelegate to
strip_ansi.record_parse_outcomefires only at the subprocess call sites (never insidea pure parse fn, so unit tests write no metrics), complementary to the brain
phases' existing
brain_verdict_parsed_total{phase,outcome}(#2429): that counterowns the brain-phase dashboard; the new family adds the memory/distill and
progress-checker phases and gives numerator + denominator per phase.
Scope guard: no change to any phase's decision semantics on clean output —
strip_*returnCow::Borrowed/ byte-identical text, so only previouslydefaulted noisy cases now recover.
Rebased onto current
mainThis PR was 86 commits behind
main(the brain phases independently gained theJSON-envelope transport + escalation ladder +
brain_verdict_parsed_totalsince the branch was cut). It has been rebased onto current
mainand theshared extractor re-wired onto main's evolved parsers. This also resolves the
prior
cargo-auditfailure (it came from the stale 86-commit-oldCargo.lockwhich carried
lopdf/quinn-proto; currentmain's lockfile is clean) and thefull_goal_lifecycle_crudCI failure (a stale-base artifact).Merge-ready evidence
1. qa-team scenario (gadugi)
tests/gadugi/recipe-output-extractor.yaml(+ driverrecipe-output-extractor.sh).gadugi-test validate -f tests/gadugi/recipe-output-extractor.yaml --strict→
✓ Scenario "Recipe-Output Extractor Hardening (#2484)" is valid/✓ All 1 file(s) are valid.gadugi-test run -d <dir>→✓ Passed: 1 ✗ Failed: 0 - Total: 1 ✓ All tests passed!(driver runs the hermetic extractor suite, asserts the shared-module recovery
test names and the distill-path
parse_recipe_output_recovers_from_ansi_log_noise/
parse_recipe_output_recovers_from_runner_bannernames +test result: ok;command exit 0, ~42s).
2. Docs / changed surfaces
docs/reference/text-parsing-wire-formats.md(the normativeparsing-wire-format reference): new Protocol 0: Shared noise pre-stripping
section, the
recipe_parse_*_total{phase}counters, pre-strip notes added toProtocol 1 & 2, and the refreshed test inventory.
a new internal
recipe_outputmodule). The only externally observableadditions are two new
metrics.jsonlcounter names, documented above. NoCLI/HTTP/API surface changed.
3. Quality-audit (≥3 SEEK→VALIDATE→FIX cycles, clean final)
strip_ansi/strip_recipe_noisereturnCow::Borrowedon clean input (dedicated tests),so every existing decide/orient/lifecycle/merge/progress/distill test that
feeds clean text passes byte-for-byte unchanged. FIX: the decide banner test
reclassified a single-line
Recipe:banner fromDefaultMalformedtoDefaultEmpty(the banner is now correctly recognised as pure noise andstripped); updated to use the realistic multi-line banner whose
Recipe '<name>': SUCCESSsummary line survives → stillDefaultMalformed,and added an explicit pure-noise→
DefaultEmptytest. Both outcomes areparse-failures yielding the same decision, so semantics are unchanged.
scan_for_facts_objectdual-pass recovers (a) a payload after an ANSI/log line (line-dropped view)
and (b) a payload on the SAME physical line as a log prefix (ANSI-only view),
preserving the last-non-empty preference across both views. New regression
tests
parse_recipe_output_recovers_from_ansi_log_noiseandparse_recipe_output_recovers_from_runner_banner(using valid distillconcepts) assert recovery; the raw span is pinned unparseable.
regression pin (
production_banner_has_no_verdict_keyword) still holds, addedtext_verdict_drops_ready_substring_in_noise_log_line(analready→readysubstring in a dropped log line no longer fabricates a
Readyverdict) and aprogress-checker noise-recovery test; verified the
sanitize/dedupdelegations keep
normalize_strips_ansi_and_collapses_whitespacegreen.Independent
code-reviewagent pass: no significant issues. Clean final cycle.4. CI
Local equivalents of every CI gate pass:
cargo fmt --all -- --check→ clean.cargo clippy --all-targets --all-features --locked -- -D warnings→ clean(both the pre-commit
--releasegate and the full pre-push gate passed).cargo audit→ exit 0 (clean lockfile after rebase).recipe_output(incl. the 2 distill recoverynames) 38,
ooda_brain::recipe_brain169,recipe_merge_judge31,recipe_progress_checker13,distillation72,stewardship22.cargo test --lib --all-features --locked --no-fail-fast→6382 passed.The single non-passing test (
ooda_loop::decide::decide_respects_max_concurrent_actions)is a pre-existing, host-environment failure: it fails identically on clean
mainin this build host (the host's real~/.simard/prompt_assets/…/live state shadows the temp fixtures) and is untouched by this diff; it passes
in CI's clean runner (it did not fail in the prior CI run of this PR).
6. Focused diff
12 files: the new
recipe_outputmodule (2), the six adopting parsers, the twodelegated strippers, one reference doc, and the qa scenario pair. No unrelated
edits.
Closes #2484