
fix(brain/memory): shared robust JSON/verdict extractor for noisy recipe stdout (#2484) #2490

Merged
rysweet merged 1 commit into main from engineer/continuously-research-how-to-measure-and-improv-12a2cb2b-1782638493-9cd79f
Jun 29, 2026


@rysweet rysweet commented Jun 28, 2026

Owner

Problem (root cause)

recipe-runner-rs stdout — and the step_results[].output string inside its
--output-format json envelope — is routinely contaminated with three kinds of
non-payload noise that broke the formerly bespoke per-phase extractors:

  1. ANSI SGR/CSI/OSC colour codes from tracing/env_logger (e.g. a leading
    \x1b[2m "dim" whose raw ESC/0x1b byte is invalid inside a JSON
    document, so serde_json rejects the span).
  2. Timestamped tracing-log lines interleaved with the agent answer.
  3. The runner's text-mode summary banner (Recipe: … SUCCESS, Steps: …,
    [completed] …).

Each recipe-backed phase scanned that raw text with its own fragile extractor
and fell back to a permissive default on a miss — the exact OODA failures
observed in episodes:

  • distill: 'distill' step output did not contain a parseable { "facts": [...] } object
    → the whole 20-episode batch deferred a cycle (live evidence at t=8451).
  • merge-judge / progress-checker: no verdict keyword … found → fail-closed / fail-open default.
  • engineer-lifecycle / decide / orient: banner/noise misparse → continue_skipping / advance_goal / floor.

These are one root cause (noisy stdout) hitting N bespoke extractors, and the
codebase was already accreting duplicate ANSI strippers.

Change

One shared, hardened src/recipe_output/ module is now the only
ANSI/log/banner-stripping path:

| Function | Behaviour |
|---|---|
| strip_ansi(&str) -> Cow | Single ANSI (CSI/OSC/two-char) stripper. Cow::Borrowed on the clean path. |
| strip_recipe_noise(&str) -> Cow | strip_ansi + drop ISO-8601 tracing lines and runner-banner lines. Cow::Borrowed on the clean path. |
| balanced_objects / last_balanced_object / extract_json_payload | String-literal-aware balanced {…} scan. JSON extraction is dual-pass (line-dropped and ANSI-only) so the payload survives both an interleaved log line inside a pretty body and a same-line log prefix. |
| extract_verdict(raw, keywords) | Precedence keyword scan over cleaned text. |
| record_parse_outcome(phase, success) | Emits recipe_parse_{success,failure}_total{phase} to metrics.jsonl. |
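
As an illustration of the first table row, here is a minimal sketch of a single-pass ANSI stripper that stays on the borrowed path for clean input. It is not the code from this diff: the concrete return type `Cow<'_, str>` and the exact handling of each escape family are assumptions made for the example.

```rust
use std::borrow::Cow;

/// Illustrative ANSI stripper (a sketch, not the PR's implementation).
/// Removes CSI (`ESC [ ... final`), OSC (`ESC ] ... BEL` or `ESC ] ... ESC \`)
/// and two-character (`ESC x`) sequences.
pub fn strip_ansi(input: &str) -> Cow<'_, str> {
    if !input.contains('\x1b') {
        // Clean path: no allocation, byte-identical text.
        return Cow::Borrowed(input);
    }
    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        if c != '\x1b' {
            out.push(c);
            continue;
        }
        match chars.next() {
            // CSI: skip parameter bytes up to and including the final byte (0x40..=0x7E).
            Some('[') => {
                for p in chars.by_ref() {
                    if ('\x40'..='\x7e').contains(&p) {
                        break;
                    }
                }
            }
            // OSC: skip until BEL or the string terminator `ESC \`.
            Some(']') => {
                while let Some(p) = chars.next() {
                    if p == '\x07' {
                        break;
                    }
                    if p == '\x1b' {
                        if chars.peek() == Some(&'\\') {
                            chars.next();
                        }
                        break;
                    }
                }
            }
            // Two-character escape, or a lone trailing ESC: drop it.
            Some(_) | None => {}
        }
    }
    Cow::Owned(out)
}
```

With this sketch, `strip_ansi("\x1b[2m2026 INFO\x1b[0m {\"a\":1}")` yields `2026 INFO {"a":1}`, while a string without any ESC byte comes back as `Cow::Borrowed`.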

Adopted by distill (memory_consolidation/distillation.rs, dual-pass
scan_for_facts_object), merge-judge (stewardship/recipe_merge_judge.rs),
progress-checker (goal_curation/recipe_progress_checker.rs), and the three
OODA brain phases — decide / orient / engineer-lifecycle
(ooda_brain/recipe_brain.rs). The distill-private ANSI stripper and the two
duplicate strippers (meeting_backend::sanitize, stewardship::dedup) now
delegate to strip_ansi.
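
The string-literal-aware balanced scan that these phases share can be sketched roughly as follows. This is a simplified stand-in under stated assumptions (byte-level scan, quotes tracked only inside an object); the real `balanced_objects` / `last_balanced_object` signatures are not shown in this PR description and may differ.

```rust
/// Returns every top-level balanced `{...}` span in `text`, in order of appearance.
/// Braces inside JSON string literals are treated as data, not structure.
pub fn balanced_objects(text: &str) -> Vec<&str> {
    let mut spans = Vec::new();
    let mut depth = 0usize;
    let mut start = 0usize;
    let mut in_string = false;
    let mut escaped = false;
    // Scanning bytes is safe here: `{`, `}`, `"` and `\` are ASCII and never
    // occur inside a multi-byte UTF-8 sequence.
    for (i, b) in text.bytes().enumerate() {
        if in_string {
            if escaped {
                escaped = false;
            } else if b == b'\\' {
                escaped = true;
            } else if b == b'"' {
                in_string = false;
            }
            continue;
        }
        match b {
            // Only track quotes inside an object, so quotes in prose are ignored.
            b'"' if depth > 0 => in_string = true,
            b'{' => {
                if depth == 0 {
                    start = i;
                }
                depth += 1;
            }
            b'}' if depth > 0 => {
                depth -= 1;
                if depth == 0 {
                    spans.push(&text[start..=i]);
                }
            }
            _ => {}
        }
    }
    spans
}

/// Convenience wrapper: the last balanced object, if any.
pub fn last_balanced_object(text: &str) -> Option<&str> {
    balanced_objects(text).last().copied()
}
```

One limitation of this sketch: an unmatched `{` in an earlier noise line keeps the depth above zero and hides later objects, which is one reason to drop log and banner lines before scanning.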

record_parse_outcome fires only at the subprocess call sites (never inside
a pure parse fn, so unit tests write no metrics), complementary to the brain
phases' existing brain_verdict_parsed_total{phase,outcome} (#2429): that counter
owns the brain-phase dashboard; the new family adds the memory/distill and
progress-checker phases and gives numerator + denominator per phase.
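
The call-site-only rule can be pictured with a small sketch. Everything below is hypothetical: the JSON field names, the generic `Write` sink (the real code targets metrics.jsonl), and the stand-in parser are assumptions chosen to keep the example self-contained.

```rust
use std::io::{self, Write};

/// Writes one JSON line per outcome. The field names ("metric", "phase", "value")
/// are an assumed schema, not the documented metrics.jsonl format.
pub fn write_parse_outcome<W: Write>(sink: &mut W, phase: &str, success: bool) -> io::Result<()> {
    let name = if success {
        "recipe_parse_success_total"
    } else {
        "recipe_parse_failure_total"
    };
    // Phase names are assumed to be plain identifiers, so no JSON escaping is done.
    writeln!(sink, "{{\"metric\":\"{name}\",\"phase\":\"{phase}\",\"value\":1}}")
}

/// Pure parse step (stand-in): touches no metrics, so unit tests stay side-effect free.
fn parse_payload(cleaned: &str) -> Option<&str> {
    let t = cleaned.trim();
    (t.starts_with('{') && t.ends_with('}')).then_some(t)
}

/// Call-site wrapper: the only place where the outcome is recorded.
pub fn parse_and_record<'a, W: Write>(
    sink: &mut W,
    phase: &str,
    stdout: &'a str,
) -> io::Result<Option<&'a str>> {
    let parsed = parse_payload(stdout);
    write_parse_outcome(sink, phase, parsed.is_some())?;
    Ok(parsed)
}
```

Because success and failure are separate counters per phase, a failure rate for a phase is failure / (success + failure).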

Scope guard: no change to any phase's decision semantics on clean output —
strip_* return Cow::Borrowed / byte-identical text, so only previously
defaulted noisy cases now recover.
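
To make the scope guard concrete, here is a minimal sketch of line dropping that returns the input untouched when nothing matches. The two predicates are assumptions for illustration (a `YYYY-MM-DDTHH:MM:SS` prefix for tracing lines, and the `Recipe: `, `Steps: `, `[completed] ` prefixes for banner lines); the actual rules in `strip_recipe_noise` may be stricter or broader, and ANSI stripping is assumed to have happened already.

```rust
use std::borrow::Cow;

/// Assumed shape of a tracing line: it starts with `YYYY-MM-DDTHH:MM:SS`.
fn looks_like_tracing_line(line: &str) -> bool {
    const SHAPE: &[u8] = b"dddd-dd-ddTdd:dd:dd";
    let bytes = line.trim_start().as_bytes();
    bytes.len() >= SHAPE.len()
        && SHAPE
            .iter()
            .zip(bytes)
            .all(|(&s, &c)| if s == b'd' { c.is_ascii_digit() } else { s == c })
}

/// Assumed banner prefixes, taken from the examples in the problem statement.
fn looks_like_banner_line(line: &str) -> bool {
    let t = line.trim_start();
    t.starts_with("Recipe: ") || t.starts_with("Steps: ") || t.starts_with("[completed] ")
}

fn is_noise(line: &str) -> bool {
    looks_like_tracing_line(line) || looks_like_banner_line(line)
}

/// Drops noise lines. When no line is noise the original text is returned
/// borrowed, so clean output is byte-identical.
pub fn strip_noise_lines(input: &str) -> Cow<'_, str> {
    if !input.lines().any(is_noise) {
        return Cow::Borrowed(input);
    }
    let kept: Vec<&str> = input.lines().filter(|l| !is_noise(l)).collect();
    // Note: the rebuilt text uses `\n` separators and has no trailing newline.
    Cow::Owned(kept.join("\n"))
}
```

The early return is what keeps clean-path behaviour unchanged: a caller that previously parsed clean text sees exactly the same bytes.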

Rebased onto current main

This PR was 86 commits behind main (the brain phases independently gained the
JSON-envelope transport + escalation ladder + brain_verdict_parsed_total
since the branch was cut). It has been rebased onto current main and the
shared extractor re-wired onto main's evolved parsers. This also resolves the
prior cargo-audit failure (it came from the stale 86-commit-old Cargo.lock
which carried lopdf/quinn-proto; current main's lockfile is clean) and the
full_goal_lifecycle_crud CI failure (a stale-base artifact).


Merge-ready evidence

1. qa-team scenario (gadugi)

tests/gadugi/recipe-output-extractor.yaml (+ driver recipe-output-extractor.sh).

  • gadugi-test validate -f tests/gadugi/recipe-output-extractor.yaml --strict
    ✓ Scenario "Recipe-Output Extractor Hardening (#2484)" is valid / ✓ All 1 file(s) are valid.
  • gadugi-test run -d <dir>
    ✓ Passed: 1 ✗ Failed: 0 - Total: 1 / ✓ All tests passed!
    (driver runs the hermetic extractor suite, asserts the shared-module recovery
    test names and the distill-path parse_recipe_output_recovers_from_ansi_log_noise
    / parse_recipe_output_recovers_from_runner_banner names + test result: ok;
    command exit 0, ~42s).

2. Docs / changed surfaces

  • Updated docs/reference/text-parsing-wire-formats.md (the normative
    parsing-wire-format reference): new Protocol 0: Shared noise pre-stripping
    section, the recipe_parse_*_total{phase} counters, pre-strip notes added to
    Protocol 1 & 2, and the refreshed test inventory.
  • Changed surfaces: all code surfaces are internal (library parse functions +
    a new internal recipe_output module). The only externally observable
    additions are two new metrics.jsonl counter names, documented above. No
    CLI/HTTP/API surface changed.

3. Quality-audit (≥3 SEEK→VALIDATE→FIX cycles, clean final)

  • Cycle 1 (clean-path / decision-semantics): verified strip_ansi /
    strip_recipe_noise return Cow::Borrowed on clean input (dedicated tests),
    so every existing decide/orient/lifecycle/merge/progress/distill test that
    feeds clean text passes byte-for-byte unchanged. FIX: the decide banner test
    reclassified a single-line Recipe: banner from DefaultMalformed to
    DefaultEmpty (the banner is now correctly recognised as pure noise and
    stripped); updated to use the realistic multi-line banner whose
    Recipe '<name>': SUCCESS summary line survives → still DefaultMalformed,
    and added an explicit pure-noise→DefaultEmpty test. Both outcomes are
    parse-failures yielding the same decision, so semantics are unchanged.
  • Cycle 2 (payload-drop safety): confirmed the distill scan_for_facts_object
    dual-pass recovers (a) a payload after an ANSI/log line (line-dropped view)
    and (b) a payload on the SAME physical line as a log prefix (ANSI-only view),
    preserving the last-non-empty preference across both views. New regression
    tests parse_recipe_output_recovers_from_ansi_log_noise and
    parse_recipe_output_recovers_from_runner_banner (using valid distill
    concepts) assert recovery; the raw span is pinned unparseable.
  • Cycle 3 (verdict false-positives + delegation): verified the merge-judge
    regression pin (production_banner_has_no_verdict_keyword) still holds, added
    text_verdict_drops_ready_substring_in_noise_log_line (the "ready" substring of
    "already" in a dropped log line no longer fabricates a Ready verdict) and a
    progress-checker noise-recovery test; verified the sanitize/dedup
    delegations keep normalize_strips_ansi_and_collapses_whitespace green.
    Independent code-review agent pass: no significant issues. Clean final cycle.
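
The false-positive case from Cycle 3 can be illustrated with a small sketch of a precedence keyword scan that runs only after noise lines are dropped. The timestamp predicate, the case-insensitive substring match, and the keyword lists are assumptions for the example, not the signature or behaviour of the real `extract_verdict`.

```rust
/// Assumed shape of a tracing line: it starts with `YYYY-MM-DDTHH:MM:SS`.
fn is_tracing_line(line: &str) -> bool {
    const SHAPE: &[u8] = b"dddd-dd-ddTdd:dd:dd";
    let bytes = line.trim_start().as_bytes();
    bytes.len() >= SHAPE.len()
        && SHAPE
            .iter()
            .zip(bytes)
            .all(|(&s, &c)| if s == b'd' { c.is_ascii_digit() } else { s == c })
}

/// Scans the cleaned text for the first keyword in precedence order.
/// `keywords` is ordered from highest to lowest precedence.
pub fn extract_verdict<'k>(raw: &str, keywords: &[&'k str]) -> Option<&'k str> {
    // Drop log lines first, so words inside them cannot produce a verdict.
    let cleaned = raw
        .lines()
        .filter(|l| !is_tracing_line(l))
        .collect::<Vec<_>>()
        .join("\n")
        .to_ascii_lowercase();
    keywords
        .iter()
        .copied()
        .find(|k| cleaned.contains(k.to_ascii_lowercase().as_str()))
}
```

Listing `"not ready"` before `"ready"` lets the more specific keyword win. Note that a plain substring match can still hit inside a longer word on a payload line, so this sketch only removes the log-line source of false positives.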

4. CI

Local equivalents of every CI gate pass:

  • cargo fmt --all -- --check → clean.
  • cargo clippy --all-targets --all-features --locked -- -D warnings → clean
    (both the pre-commit --release gate and the full pre-push gate passed).
  • cargo audit → exit 0 (clean lockfile after rebase).
  • Touched-module unit tests green: recipe_output (incl. the 2 distill recovery
    names) 38, ooda_brain::recipe_brain 169, recipe_merge_judge 31,
    recipe_progress_checker 13, distillation 72, stewardship 22.
  • Full cargo test --lib --all-features --locked --no-fail-fast → 6382 passed.
    The single non-passing test (ooda_loop::decide::decide_respects_max_concurrent_actions)
    is a pre-existing, host-environment failure: it fails identically on clean
    main on this build host (the host's real ~/.simard/prompt_assets/… /
    live state shadows the temp fixtures) and is untouched by this diff; it passes
    in CI's clean runner (it did not fail in the prior CI run of this PR).

6. Focused diff

12 files: the new recipe_output module (2), the six adopting parsers, the two
delegated strippers, one reference doc, and the qa scenario pair. No unrelated
edits.

Closes #2484

@rysweet

rysweet commented Jun 28, 2026

Owner Author

Independent confirmation + repair guidance to land this

A separate engineer cycle on this goal independently re-derived the same design for the distill sub-path (shared recipe_output::strip_recipe_noise + non-anchored balanced-object scan + a per-phase parse-outcome counter from which a distill_parse_failure_rate is derivable) before discovering this PR. That convergence is corroboration that this is the right shape — so rather than open a duplicate, here is the concrete evidence + what's blocking the merge.

Confirms the live finding. This cycle reproduced the distill parse failure again at t≈8001 with a fresh signature \x1b[2m2026-06-28T11:21:26.517009Z… (raw ESC + ISO-8601 tracing timestamp). This PR's record_parse_outcome("distill", …) + strip_recipe_noise routing in distillation.rs remediate exactly that case; logged as evidence on #2484.

Blockers to landing (neither is a defect in this diff):

  1. mergeable: CONFLICTING (DIRTY) vs main. main has advanced since this branched; needs a rebase/merge. The likely collision points are the files this PR and recently-merged work both touch (src/lib.rs module list, src/memory_consolidation/distillation.rs). A clean rebase onto current origin/main + re-run of the local gates should clear it.
  2. cargo-audit red is repo-wide, not introduced here. The failing run reports 2 vulnerabilities found + unmaintained/unsound advisories (proc-macro-error2 unmaintained, plus unsound warnings) — these come from the dependency graph and fail on main as well, so they are not a regression from this PR. Landing this should not be gated on it; track the advisories separately (advisory ignore in audit.toml or a dependency bump) at the repo level.

Quality bar: the six merge-ready criteria (qa scenario, docs, ≥3 quality-audit cycles, CI green, evidence in body, focused diff) should be re-verified after the rebase, since the conflict resolution is a new change.

…ipe stdout

`recipe-runner-rs` stdout (and the `step_results[].output` string inside its
`--output-format json` envelope) is routinely contaminated with three kinds of
non-payload noise that broke the formerly bespoke per-phase extractors:

1. ANSI SGR/CSI/OSC colour codes from `tracing` (a raw `\x1b` byte is invalid
   inside a JSON document, so `serde_json` rejects the span).
2. Timestamped tracing-log lines interleaved with the agent answer.
3. The runner's text-mode summary banner (`Recipe: … SUCCESS`, `Steps: …`,
   `[completed] …`).

Each recipe-backed phase scanned that raw text with its own fragile extractor
and fell back to a permissive default on a miss — the exact OODA failures
observed (distill `did not contain a parseable { "facts": [...] } object`,
merge/progress `no verdict keyword found`, lifecycle `continue_skipping`).

Add one shared, hardened `src/recipe_output/` module as the only
ANSI/log/banner-stripping path (`strip_ansi`, `strip_recipe_noise`,
`balanced_objects`/`last_balanced_object`/`extract_json_payload` dual-pass,
`extract_verdict`, `record_parse_outcome`). Adopt it in distill, merge-judge,
progress-checker, and the three OODA brain phases (decide/orient/lifecycle);
the distill-private ANSI stripper and the two duplicate strippers
(`meeting_backend::sanitize`, `stewardship::dedup`) now delegate to it.

Emit `recipe_parse_{success,failure}_total{phase}` at each subprocess call site
(never inside a pure parse fn, so unit tests write no metrics), complementary to
the brain phases' existing `brain_verdict_parsed_total{phase,outcome}` (#2429).

Scope guard: `strip_*` return `Cow::Borrowed`/byte-identical text on clean
output, so no phase's decision semantics change — only previously-defaulted
noisy cases now recover.

Closes #2484

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@rysweet rysweet force-pushed the engineer/continuously-research-how-to-measure-and-improv-12a2cb2b-1782638493-9cd79f branch from 52e08ea to a1e81a3 Compare June 28, 2026 19:54
@github-actions


📊 Coverage Summary

Generated by cargo llvm-cov --workspace --summary-only (nightly, excluding test files)

| Module | Lines | Covered | Coverage |
|---|---|---|---|
| Total | 113460 | 95051 | 83.8% |

Coverage data from CI run. Test files matching tests?/ are excluded from line counts.

@rysweet rysweet merged commit ccea51a into main Jun 29, 2026
15 checks passed
@rysweet rysweet deleted the engineer/continuously-research-how-to-measure-and-improv-12a2cb2b-1782638493-9cd79f branch June 29, 2026 12:32
rysweet added a commit that referenced this pull request Jul 4, 2026
…erdict parse-miss (#2569) (#2600)

The progress-evidence gate previously FAILED OPEN (Accept) whenever the
LLM/recipe reviewer produced no recognizable accept/reject verdict — even on a
successful, non-empty run — letting hallucinated "0%->100% with no verdict"
bumps land as false "done" states.

Adopt the merge judge's infra-vs-semantic split across both live tiers
(RecipeProgressChecker primary + LlmReviewerProgressChecker direct-LLM fallback):
- INFRA failure (transport error / spawn failure / non-zero exit / output that
  strips to empty) -> keep fail-OPEN, so goals aren't blocked on infra hiccups.
- SEMANTIC parse-miss (successful, non-empty response with no verdict, or an
  unknown verdict string) -> fail-CLOSED (Reject). Reject only keeps the prior
  percent + logs a hallucination alert; it does not stall the goal.
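
The infra-vs-semantic split described above can be sketched as a single decision function. The type names, the `"accept"` / `"reject"` verdict strings, and the shape of `ReviewerRun` are hypothetical; they only illustrate the policy, not the code in RecipeProgressChecker or LlmReviewerProgressChecker.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    Accept,
    Reject,
}

/// Hypothetical summary of one reviewer invocation.
#[derive(Debug, Clone, Copy)]
pub enum ReviewerRun<'a> {
    /// Transport error, spawn failure, or non-zero exit.
    InfraFailure,
    /// The run succeeded; `cleaned_output` is the text after noise stripping and
    /// `verdict` is whatever verdict string the parser found, if any.
    Completed {
        cleaned_output: &'a str,
        verdict: Option<&'a str>,
    },
}

pub fn decide(run: ReviewerRun<'_>) -> Decision {
    match run {
        // Infra failure: fail open so goals are not blocked on infra hiccups.
        ReviewerRun::InfraFailure => Decision::Accept,
        // Output that strips to empty counts as an infra failure too.
        ReviewerRun::Completed { cleaned_output, .. } if cleaned_output.trim().is_empty() => {
            Decision::Accept
        }
        ReviewerRun::Completed { verdict: Some("accept"), .. } => Decision::Accept,
        ReviewerRun::Completed { verdict: Some("reject"), .. } => Decision::Reject,
        // Semantic parse-miss: non-empty response with a missing or unknown
        // verdict fails closed.
        ReviewerRun::Completed { .. } => Decision::Reject,
    }
}
```

The order of the match arms matters: the empty-output guard has to come before the verdict arms so that an empty response is never classified as a semantic miss.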

Also parse the structured {"verdict": ...} JSON first in the recipe tier (was a
naive substring scan that wrongly Rejected an `accept` whose rationale mentioned
"reject", and could Accept "unacceptable"), reusing the direct-LLM tier's
tolerant parse_reviewer_response — matching the merge judge.

The merge path reported in #2569 was already fail-closed on main
(#2486/#2490/#2504, merged before #2569 was filed against released v0.22.0);
this adds a regression test pinning the reporter's exact SUCCESS-with-no-verdict
banners (30s AND 102s) -> Verdict::Unclear, and fixes the analogous fail-open in
the progress path.

Docs (progress-evidence-api, text-parsing-wire-formats §2a,
progress-evidence-gating, text-based-brain-protocol) updated to the new policy
and corrected to reflect that progress_reviewer.rs is the live direct-LLM
fallback tier (not deleted). Adds a gadugi outside-in scenario plus
parser/decision regression tests.

Closes #2569

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


Development

Successfully merging this pull request may close these issues.

Brain/memory reliability: shared robust JSON/verdict extractor for noisy recipe-step output (root cause behind #2419-family + #2468)

1 participant