fix(brain/memory): shared robust JSON/verdict extractor for noisy recipe stdout (#2484) by rysweet · Pull Request #2490 · rysweet/Simard

rysweet · 2026-06-28T11:06:41Z

Problem (root cause)

recipe-runner-rs stdout — and the step_results[].output string inside its
--output-format json envelope — is routinely contaminated with three kinds of
non-payload noise that broke the formerly bespoke per-phase extractors:

1. ANSI SGR/CSI/OSC colour codes from tracing/env_logger (e.g. a leading
   \x1b[2m "dim" whose raw ESC/0x1b byte is invalid inside a JSON
   document, so serde_json rejects the span).
2. Timestamped tracing-log lines interleaved with the agent answer.
3. The runner's text-mode summary banner (Recipe: … SUCCESS, Steps: …,
   [completed] …).

Each recipe-backed phase scanned that raw text with its own fragile extractor
and fell back to a permissive default on a miss — the exact OODA failures
observed in episodes:

- distill: 'distill' step output did not contain a parseable { "facts": [...] } object
  → the whole 20-episode batch deferred a cycle (live evidence at t=8451).
- merge-judge / progress-checker: no verdict keyword … found → fail-closed / fail-open default.
- engineer-lifecycle / decide / orient: banner/noise misparse → continue_skipping / advance_goal / floor.

These are one root cause (noisy stdout) hitting N bespoke extractors, and the
codebase was already accreting duplicate ANSI strippers.

Change

One shared, hardened src/recipe_output/ module is now the only
ANSI/log/banner-stripping path:

| Function | Behaviour |
|---|---|
| `strip_ansi(&str) -> Cow` | Single ANSI (CSI/OSC/two-char) stripper. `Cow::Borrowed` on the clean path. |
| `strip_recipe_noise(&str) -> Cow` | `strip_ansi` + drop ISO-8601 tracing lines and runner-banner lines. `Cow::Borrowed` on the clean path. |
| `balanced_objects` / `last_balanced_object` / `extract_json_payload` | String-literal-aware balanced `{…}` scan. JSON extraction is dual-pass (line-dropped and ANSI-only) so the payload survives both an interleaved log line inside a pretty body and a same-line log prefix. |
| `extract_verdict(raw, keywords)` | Precedence keyword scan over cleaned text. |
| `record_parse_outcome(phase, success)` | Emits `recipe_parse_{success,failure}_total{phase}` to `metrics.jsonl`. |
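
For readers unfamiliar with the technique, a string-literal-aware balanced `{…}` scan can be sketched roughly as below. This is an illustration only, not the code in `src/recipe_output/`: the `_sketch` function names and their signatures are assumptions made for this example, and the real module's handling of edge cases may differ.

```rust
/// Returns every top-level balanced `{...}` span in `text`, in order.
/// Braces and escaped quotes inside JSON string literals are ignored.
/// (Hypothetical helper; name and signature are assumptions.)
pub fn balanced_objects_sketch(text: &str) -> Vec<&str> {
    let mut spans = Vec::new();
    let mut depth = 0usize; // current brace nesting depth
    let mut start = 0usize; // byte offset of the opening top-level brace
    let mut in_string = false;
    let mut escaped = false;

    // All structural characters are ASCII, so scanning bytes is safe:
    // UTF-8 continuation bytes can never equal `{`, `}`, `"` or `\`.
    for (i, b) in text.bytes().enumerate() {
        if depth == 0 {
            // Outside any object only an opening brace matters.
            if b == b'{' {
                depth = 1;
                start = i;
                in_string = false;
                escaped = false;
            }
            continue;
        }
        if in_string {
            if escaped {
                escaped = false; // this byte was escaped, skip it
            } else if b == b'\\' {
                escaped = true;
            } else if b == b'"' {
                in_string = false;
            }
            continue;
        }
        match b {
            b'"' => in_string = true,
            b'{' => depth += 1,
            b'}' => {
                depth -= 1;
                if depth == 0 {
                    spans.push(&text[start..=i]);
                }
            }
            _ => {}
        }
    }
    spans
}

/// Last balanced object, mirroring a "last payload wins" preference.
pub fn last_balanced_object_sketch(text: &str) -> Option<&str> {
    balanced_objects_sketch(text).pop()
}
```

Note that this sketch does not recover from a stray unmatched `{` in banner or log text; stripping noise lines before scanning, as the table describes, is what keeps that case rare.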

Adopted by distill (memory_consolidation/distillation.rs, dual-pass
scan_for_facts_object), merge-judge (stewardship/recipe_merge_judge.rs),
progress-checker (goal_curation/recipe_progress_checker.rs), and the three
OODA brain phases — decide / orient / engineer-lifecycle
(ooda_brain/recipe_brain.rs). The distill-private ANSI stripper and the two
duplicate strippers (meeting_backend::sanitize, stewardship::dedup) now
delegate to strip_ansi.

record_parse_outcome fires only at the subprocess call sites (never inside
a pure parse fn, so unit tests write no metrics), complementary to the brain
phases' existing brain_verdict_parsed_total{phase,outcome} (#2429): that counter
owns the brain-phase dashboard; the new family adds the memory/distill and
progress-checker phases and gives numerator + denominator per phase.

Scope guard: no change to any phase's decision semantics on clean output —
strip_* return Cow::Borrowed / byte-identical text, so only previously
defaulted noisy cases now recover.
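
The scope guard relies on the `Cow::Borrowed` clean path. A minimal sketch of how such a stripper might look follows; `strip_ansi_sketch` is a hypothetical name, and the exact escape grammar handled by the real `strip_ansi` is not shown in this PR, so the CSI/OSC/two-char handling below is an assumption based on the table's description.

```rust
use std::borrow::Cow;

/// Removes ANSI escape sequences. Input without an ESC byte is returned
/// as `Cow::Borrowed`, i.e. the very same slice with no allocation.
/// (Hypothetical sketch; not the module's actual implementation.)
pub fn strip_ansi_sketch(input: &str) -> Cow<'_, str> {
    // Clean path: no ESC byte, hand back the original slice untouched.
    if !input.contains('\x1b') {
        return Cow::Borrowed(input);
    }
    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        if c != '\x1b' {
            out.push(c);
            continue;
        }
        match chars.next() {
            // CSI: ESC [ parameters..., ended by a final byte in 0x40..=0x7E.
            Some('[') => {
                for p in chars.by_ref() {
                    if ('\x40'..='\x7e').contains(&p) {
                        break;
                    }
                }
            }
            // OSC: ESC ] ..., ended by BEL or by ST (ESC followed by `\`).
            Some(']') => {
                while let Some(p) = chars.next() {
                    if p == '\x07' {
                        break;
                    }
                    if p == '\x1b' {
                        if chars.peek() == Some(&'\\') {
                            chars.next();
                        }
                        break;
                    }
                }
            }
            // Two-char escape (ESC + one char) or a trailing lone ESC: drop it.
            Some(_) | None => {}
        }
    }
    Cow::Owned(out)
}
```

Because clean input comes back as the identical borrowed slice, any downstream parser sees byte-identical text, which is the property the scope guard depends on.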

Rebased onto current `main`

This PR was 86 commits behind main (the brain phases independently gained the
JSON-envelope transport + escalation ladder + brain_verdict_parsed_total
since the branch was cut). It has been rebased onto current main and the
shared extractor re-wired onto main's evolved parsers. This also resolves the
prior cargo-audit failure (it came from the stale 86-commit-old Cargo.lock
which carried lopdf/quinn-proto; current main's lockfile is clean) and the
full_goal_lifecycle_crud CI failure (a stale-base artifact).

Merge-ready evidence

1. qa-team scenario (gadugi)

tests/gadugi/recipe-output-extractor.yaml (+ driver recipe-output-extractor.sh).

gadugi-test validate -f tests/gadugi/recipe-output-extractor.yaml --strict
→ ✓ Scenario "Recipe-Output Extractor Hardening (#2484)" is valid / ✓ All 1 file(s) are valid.
gadugi-test run -d <dir> → ✓ Passed: 1 ✗ Failed: 0 - Total: 1 ✓ All tests passed!
(driver runs the hermetic extractor suite, asserts the shared-module recovery
test names and the distill-path parse_recipe_output_recovers_from_ansi_log_noise
/ parse_recipe_output_recovers_from_runner_banner names + test result: ok;
command exit 0, ~42s).

2. Docs / changed surfaces

Updated docs/reference/text-parsing-wire-formats.md (the normative
parsing-wire-format reference): new Protocol 0: Shared noise pre-stripping
section, the recipe_parse_*_total{phase} counters, pre-strip notes added to
Protocol 1 & 2, and the refreshed test inventory.
Changed surfaces: all code surfaces are internal (library parse functions +
a new internal recipe_output module). The only externally observable
additions are two new metrics.jsonl counter names, documented above. No
CLI/HTTP/API surface changed.

3. Quality-audit (≥3 SEEK→VALIDATE→FIX cycles, clean final)

Cycle 1 (clean-path / decision-semantics): verified strip_ansi /
strip_recipe_noise return Cow::Borrowed on clean input (dedicated tests),
so every existing decide/orient/lifecycle/merge/progress/distill test that
feeds clean text passes byte-for-byte unchanged. FIX: the decide banner test
reclassified a single-line Recipe: banner from DefaultMalformed to
DefaultEmpty (the banner is now correctly recognised as pure noise and
stripped); updated to use the realistic multi-line banner whose
Recipe '<name>': SUCCESS summary line survives → still DefaultMalformed,
and added an explicit pure-noise→DefaultEmpty test. Both outcomes are
parse-failures yielding the same decision, so semantics are unchanged.
Cycle 2 (payload-drop safety): confirmed the distill scan_for_facts_object
dual-pass recovers (a) a payload after an ANSI/log line (line-dropped view)
and (b) a payload on the SAME physical line as a log prefix (ANSI-only view),
preserving the last-non-empty preference across both views. New regression
tests parse_recipe_output_recovers_from_ansi_log_noise and
parse_recipe_output_recovers_from_runner_banner (using valid distill
concepts) assert recovery; the raw span is pinned unparseable.
Cycle 3 (verdict false-positives + delegation): verified the merge-judge
regression pin (production_banner_has_no_verdict_keyword) still holds, added
text_verdict_drops_ready_substring_in_noise_log_line (an already→ready
substring in a dropped log line no longer fabricates a Ready verdict) and a
progress-checker noise-recovery test; verified the sanitize/dedup
delegations keep normalize_strips_ansi_and_collapses_whitespace green.
Independent code-review agent pass: no significant issues. Clean final cycle.
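
The Cycle 3 false-positive (an "already" in a log line fabricating a Ready verdict) can be illustrated with a small sketch: drop timestamped log lines first, then run the precedence keyword scan on what remains. Everything here is hypothetical: the function names, the timestamp heuristic, and the keyword list in the usage note are assumptions, and the input is assumed to be ANSI-stripped already.

```rust
/// Heuristic: does the line begin with `YYYY-MM-DDTHH:MM:SS`?
/// In PATTERN, `d` stands for "any ASCII digit"; other bytes are literal.
fn looks_like_tracing_line(line: &str) -> bool {
    const PATTERN: &[u8] = b"dddd-dd-ddTdd:dd:dd";
    let bytes = line.trim_start().as_bytes();
    bytes.len() >= PATTERN.len()
        && PATTERN.iter().zip(bytes).all(|(&p, &c)| match p {
            b'd' => c.is_ascii_digit(),
            lit => c == lit,
        })
}

/// Returns the first keyword (in the caller's precedence order) found in
/// any line that survives log-line dropping. Case-insensitive.
/// (Hypothetical sketch of a precedence keyword scan.)
pub fn extract_verdict_sketch<'k>(raw: &str, keywords: &[&'k str]) -> Option<&'k str> {
    // Keep only non-log lines, lowercased once for comparison.
    let cleaned: Vec<String> = raw
        .lines()
        .filter(|l| !looks_like_tracing_line(l))
        .map(|l| l.to_ascii_lowercase())
        .collect();
    // Earlier keywords win, so list "not ready" before "ready".
    keywords.iter().copied().find(|kw| {
        let kw = kw.to_ascii_lowercase();
        cleaned.iter().any(|l| l.contains(&kw))
    })
}
```

With keywords `["not ready", "ready"]`, a log line containing "already" is dropped before the scan, so it yields no verdict, while a surviving `Verdict: READY` line yields `"ready"`. A plain substring scan remains fragile on the answer text itself, which is one reason to prefer a structured JSON verdict where one is available.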

4. CI

Local equivalents of every CI gate pass:

cargo fmt --all -- --check → clean.
cargo clippy --all-targets --all-features --locked -- -D warnings → clean
(both the pre-commit --release gate and the full pre-push gate passed).
cargo audit → exit 0 (clean lockfile after rebase).
Touched-module unit tests green: recipe_output (incl. the 2 distill recovery
names) 38, ooda_brain::recipe_brain 169, recipe_merge_judge 31,
recipe_progress_checker 13, distillation 72, stewardship 22.
Full cargo test --lib --all-features --locked --no-fail-fast → 6382 passed.
The single non-passing test (ooda_loop::decide::decide_respects_max_concurrent_actions)
is a pre-existing, host-environment failure: it fails identically on clean
main on this build host (the host's real ~/.simard/prompt_assets/… /
live state shadows the temp fixtures) and is untouched by this diff; it passes
in CI's clean runner (it did not fail in the prior CI run of this PR).

6. Focused diff

12 files: the new recipe_output module (2), the six adopting parsers, the two
delegated strippers, one reference doc, and the qa scenario pair. No unrelated
edits.

Closes #2484

rysweet · 2026-06-28T12:30:45Z

Independent confirmation + repair guidance to land this

A separate engineer cycle on this goal independently re-derived the same design for the distill sub-path (shared recipe_output::strip_recipe_noise + non-anchored balanced-object scan + a per-phase parse-outcome counter from which a distill_parse_failure_rate is derivable) before discovering this PR. That convergence is corroboration that this is the right shape — so rather than open a duplicate, here is the concrete evidence + what's blocking the merge.

Confirms the live finding. This cycle reproduced the distill parse failure again at t≈8001 with a fresh signature \x1b[2m2026-06-28T11:21:26.517009Z… (raw ESC + ISO-8601 tracing timestamp). This PR's record_parse_outcome("distill", …) + strip_recipe_noise routing in distillation.rs remediate exactly that case; logged as evidence on #2484.

Blockers to landing (neither is a defect in this diff):

mergeable: CONFLICTING (DIRTY) vs main. main has advanced since this branched; needs a rebase/merge. The likely collision points are the files this PR and recently-merged work both touch (src/lib.rs module list, src/memory_consolidation/distillation.rs). A clean rebase onto current origin/main + re-run of the local gates should clear it.
cargo-audit red is repo-wide, not introduced here. The failing run reports 2 vulnerabilities found + unmaintained/unsound advisories (proc-macro-error2 unmaintained, plus unsound warnings) — these come from the dependency graph and fail on main as well, so they are not a regression from this PR. Landing this should not be gated on it; track the advisories separately (advisory ignore in audit.toml or a dependency bump) at the repo level.

Quality bar: the six merge-ready criteria (qa scenario, docs, ≥3 quality-audit cycles, CI green, evidence in body, focused diff) should be re-verified after the rebase, since the conflict resolution is a new change.

…ipe stdout

`recipe-runner-rs` stdout (and the `step_results[].output` string inside its `--output-format json` envelope) is routinely contaminated with three kinds of non-payload noise that broke the formerly bespoke per-phase extractors:

1. ANSI SGR/CSI/OSC colour codes from `tracing` (a raw `\x1b` byte is invalid inside a JSON document, so `serde_json` rejects the span).
2. Timestamped tracing-log lines interleaved with the agent answer.
3. The runner's text-mode summary banner (`Recipe: … SUCCESS`, `Steps: …`, `[completed] …`).

Each recipe-backed phase scanned that raw text with its own fragile extractor and fell back to a permissive default on a miss — the exact OODA failures observed (distill `did not contain a parseable { "facts": [...] } object`, merge/progress `no verdict keyword found`, lifecycle `continue_skipping`).

Add one shared, hardened `src/recipe_output/` module as the only ANSI/log/banner-stripping path (`strip_ansi`, `strip_recipe_noise`, `balanced_objects`/`last_balanced_object`/`extract_json_payload` dual-pass, `extract_verdict`, `record_parse_outcome`). Adopt it in distill, merge-judge, progress-checker, and the three OODA brain phases (decide/orient/lifecycle); the distill-private ANSI stripper and the two duplicate strippers (`meeting_backend::sanitize`, `stewardship::dedup`) now delegate to it.

Emit `recipe_parse_{success,failure}_total{phase}` at each subprocess call site (never inside a pure parse fn, so unit tests write no metrics), complementary to the brain phases' existing `brain_verdict_parsed_total{phase,outcome}` (#2429).

Scope guard: `strip_*` return `Cow::Borrowed`/byte-identical text on clean output, so no phase's decision semantics change — only previously-defaulted noisy cases now recover.

Closes #2484

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions · 2026-06-28T19:59:20Z

📊 Coverage Summary

Generated by cargo llvm-cov --workspace --summary-only (nightly, excluding test files)

Module	Lines	Covered	Coverage
Total	113460	95051	83.8%

Coverage data from CI run. Test files matching tests?/ are excluded from line counts.

…erdict parse-miss (#2569) (#2600)

The progress-evidence gate previously FAILED OPEN (Accept) whenever the LLM/recipe reviewer produced no recognizable accept/reject verdict — even on a successful, non-empty run — letting hallucinated "0%->100% with no verdict" bumps land as false "done" states.

Adopt the merge judge's infra-vs-semantic split across both live tiers (RecipeProgressChecker primary + LlmReviewerProgressChecker direct-LLM fallback):

- INFRA failure (transport error / spawn failure / non-zero exit / output that strips to empty) -> keep fail-OPEN, so goals aren't blocked on infra hiccups.
- SEMANTIC parse-miss (successful, non-empty response with no verdict, or an unknown verdict string) -> fail-CLOSED (Reject). Reject only keeps the prior percent + logs a hallucination alert; it does not stall the goal.

Also parse the structured {"verdict": ...} JSON first in the recipe tier (was a naive substring scan that wrongly Rejected an `accept` whose rationale mentioned "reject", and could Accept "unacceptable"), reusing the direct-LLM tier's tolerant parse_reviewer_response — matching the merge judge.

The merge path reported in #2569 was already fail-closed on main (#2486/#2490/#2504, merged before #2569 was filed against released v0.22.0); this adds a regression test pinning the reporter's exact SUCCESS-with-no-verdict banners (30s AND 102s) -> Verdict::Unclear, and fixes the analogous fail-open in the progress path.

Docs (progress-evidence-api, text-parsing-wire-formats §2a, progress-evidence-gating, text-based-brain-protocol) updated to the new policy and corrected to reflect that progress_reviewer.rs is the live direct-LLM fallback tier (not deleted). Adds a gadugi outside-in scenario plus parser/decision regression tests.

Closes #2569

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
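
The infra-vs-semantic split described in that commit message can be pictured as a small decision function. The sketch below is illustrative only: `ReviewerOutcome`, `Decision`, and `gate_sketch` are hypothetical names invented for this example, and it assumes the structured verdict string has already been extracted (or found missing) by an earlier parsing step.

```rust
/// Outcome of the progress gate. (Hypothetical type for illustration.)
#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Accept,
    Reject,
}

/// What came back from the reviewer. (Hypothetical type for illustration.)
#[derive(Debug)]
pub enum ReviewerOutcome {
    /// Transport error, spawn failure, non-zero exit, or output that
    /// strips to empty.
    InfraFailure,
    /// Successful, non-empty response; `verdict` is the parsed verdict
    /// string if one was present.
    Parsed { verdict: Option<String> },
}

/// Fail-open on infra failures, fail-closed on semantic parse-misses.
pub fn gate_sketch(outcome: &ReviewerOutcome) -> Decision {
    match outcome {
        // Infra hiccups must not block goals: fail open.
        ReviewerOutcome::InfraFailure => Decision::Accept,
        ReviewerOutcome::Parsed { verdict } => match verdict.as_deref() {
            // Exact match only, so "unacceptable" can never Accept.
            Some("accept") => Decision::Accept,
            // Explicit reject, unknown verdict string, or no verdict
            // at all: fail closed.
            _ => Decision::Reject,
        },
    }
}
```

Matching on the exact parsed verdict rather than scanning for substrings is what avoids both failure modes the message names: an `accept` rejected because its rationale mentions "reject", and "unacceptable" being read as an accept.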

rysweet mentioned this pull request Jun 28, 2026

Brain/memory reliability: shared robust JSON/verdict extractor for noisy recipe-step output (root cause behind #2419-family + #2468) #2484

Closed

rysweet force-pushed the engineer/continuously-research-how-to-measure-and-improv-12a2cb2b-1782638493-9cd79f branch from 52e08ea to a1e81a3 June 28, 2026 19:54

rysweet merged commit ccea51a into main Jun 29, 2026
15 checks passed

rysweet deleted the engineer/continuously-research-how-to-measure-and-improv-12a2cb2b-1782638493-9cd79f branch June 29, 2026 12:32
