fix(subagent): enrich failure diagnostics — activity trail, signal capture, structured body#6
Merged
Conversation
…pture, structured body (#3) * fix(subagent): preserve signal and enrich failure diagnostics The 'Subagent X failed (13.9m) — terminated' message pattern surfaced a real loss of diagnostic information in the background-mode failure path. When pi's subagent child exited with a non-zero code or was killed by signal, the harness rendered only 'run.errorMessage || stderr || "(no output)"', which often showed a single cryptic word with no indication of what went wrong, what the subagent was doing, or how many tokens it used before failing. Two fixes in one commit (same concern, same code path): 1. Capture signal in proc.on('close', (code, signal)). The second argument was being discarded; signal-kills (SIGTERM / SIGKILL / SIGHUP) were indistinguishable from clean non-zero exits. 2. Replace the 'errorMessage || stderr || ...' one-liner with a structured 'buildFailureBody' formatter that surfaces, in order: errorMessage, status (stopReason + exitCode + signal), stderr (tail-truncated at 2KB to avoid drowning the parent's context), last tool call the subagent attempted, usage line (tokens + cost + model before failure), and partial assistant text if any was produced. Each section is omitted if empty; the 'no diagnostic information captured' fallback fires only when LITERALLY nothing was preserved — strictly more informative than before. The formatter lives in a new src/subagent-diagnostics.ts module so it can be unit-tested without the peer-dep imports the main extension carries (tests follow in a later commit). * feat(subagent): save raw JSON event stream to /tmp for post-mortem When a subagent fails mid-run the failure body can still be thin \u2014 e.g. the child died with only a short stderr, or the error surfaced before the assistant produced any text. In those cases the parent agent has no way to reconstruct what the subagent was doing. Mirror pi's `--mode json` stdout into `/tmp/subagent-<id>-events.jsonl` (write-on-open, flushed in finishRun). Every assistant message, tool call, tool result, and error the subagent emitted is preserved line- by-line, even when the process is later killed mid-pipe or hangs. Post-mortem: `jq . < /tmp/subagent-<id>-events.jsonl`. A footer line in the failure message points consumers at the file so an agent or human reviewing a failure knows where the detail lives without having to discover the convention. Stream errors are swallowed intentionally \u2014 failing to write a post- mortem log must never cascade into the subagent's own execution. Best-effort by design; result.md / stderr / the structured failure body remain the primary channels. * test(subagent): unit tests for buildFailureBody diagnostic formatter Sixteen cases pinning the structured failure-body format: Fallback path - empty input returns the explicit 'no diagnostic information captured' fallback, never bare empty string - whitespace-only strings are treated as absent Per-section rendering - errorMessage → **Error:** line - stopReason='end_turn' is suppressed (normal completion) - exitCode=0 is suppressed when signal is the real cause - full status line composes stop + exit + signal - stderr wraps in fenced code block - stderr tail-truncation at STDERR_TAIL_BYTES - lastToolCall, usageLine render verbatim - finalText='(no output)' is suppressed (subagent's no-output marker) Composition - stable section order (Error → Status → stderr → Last activity → Usage → Partial output) pinned by indexOf sequence - blank-line separation between sections (markdown block boundaries) - signal-kill with empty stderr (the exact symptom this PR fixes): previously rendered as '(no output)', now surfaces signal=SIGTERM and preserves partial assistant text - thin-diagnostic stopReason='aborted' alone still fires a Status line Tests import buildFailureBody directly (pure function, no peer-dep indirection). Follows the pattern of existing tests in this suite that keep pi's peer deps out of the test path. * test(subagent): unit tests for buildFailureBody diagnostic formatter 12 tests covering all rendering paths of buildFailureBody: - empty input → fallback message - each field rendered individually (errorMessage, status, stderr, lastToolCall, usageLine, finalText) - stderr tail-truncation at STDERR_TAIL_BYTES - omission of empty/whitespace-only fields - omission of finalText when it is '(no output)' - omission of status line when stopReason=end_turn and exitCode=0 - field ordering in the combined case Tests import from subagent-diagnostics.ts directly (no peer deps). * test(subagent): restore full 16-case test suite for buildFailureBody The previous commit inadvertently replaced the comprehensive test file with a shorter version. Restore the full suite: - Fallback path: empty input, whitespace-only strings - Per-section: errorMessage, status (stopReason/exitCode/signal), stderr fenced block, stderr tail-truncation, lastToolCall, usageLine, finalText suppression for '(no output)' - Composition: stable section order, blank-line separation, signal-kill-with-empty-stderr (the exact symptom this PR fixes) 52 tests total (was 48). * fix(subagent): route signal-killed subagents through failure branch Blocker surfaced in fresh correctness review of commit 05cd781. The signal capture in proc.on('close', (code, signal) => ...) was wired to run.signal, but the isError predicate on the line below didn't consult it: isError = run.exitCode !== 0 || stopReason in { error, aborted } When pi's subagent child is SIGTERM'd mid-run — the exact failure mode that motivated this PR — Node invokes close with code=null, signal= 'SIGTERM'. finishRun(code ?? 0) then sets run.exitCode = 0. No turn_end was emitted (the process died mid-stream), so run.stopReason stays undefined. Result: isError evaluates false, the 'completed' branch runs, the failure body becomes empty output. The signal field on TrackedRun was captured but never rendered because buildFailureBody was never called. Add run.signal !== undefined to the predicate so the failure branch fires and the structured body (including the new signal= status line) actually surfaces to the parent agent. Covered by existing test 'signal-kill with otherwise clean exit still surfaces the signal' in subagent-diagnostics.test.ts, which asserts the signal renders in the body — this commit wires that test's assumption to the real code path. * fix(subagent-diagnostics): dynamic fence length for stderr with backticks Blocker surfaced in correctness review: the naked triple-backtick fence around stderr content breaks markdown rendering when stderr itself contains a run of three or more backticks (e.g. a stack trace that quoted a markdown snippet, or a shell error that echoed a heredoc). CommonMark requires the opening and closing fences to be at least as long as the longest inner backtick run + 1. Add fenceFor(content) helper that scans for the longest backtick run and returns a fence one longer (minimum three). Exported for test coverage. buildFailureBody now uses fenceFor(tail) instead of a hardcoded triple-backtick fence. Six new tests pin the fence-length logic: - plain text → three backticks - one or two inner backticks → still three (minimum) - three inner backticks → four - longest-run wins across multiple runs - ten-backtick run → eleven-backtick fence * refactor(subagent-diagnostics): review follow-ups — renames, phrasing, guards Apply suggestions from the quality and correctness reviews: Renames (naming consistency + caller/reader alignment) - buildFailureBody → formatFailureBody (matches format* convention used throughout the module: formatTokens, formatUsage, formatToolCallShort) - FailureDiagnostics.finalText → partialOutput (the field is rendered under **Partial output:**; caller thought 'final text', reader saw 'partial output' — mismatch removed) Comments - usageLine: add inline doc explaining why it's a pre-formatted string rather than a structured Usage object (peer-dep-free boundary) - end_turn suppression: add '// Suppress end_turn — normal completion' comment so the filter's intent is self-evident Phrasing - Fallback message: '...the harness lost the failure detail' → '...check the post-mortem .jsonl' — less blame-y, more actionable Guards - Post-mortem footer: only append the jq pointer when eventStream was successfully opened; avoids pointing at a file that doesn't exist when /tmp is read-only or createWriteStream threw Correctness nit - stopReason: trim before comparison so leading/trailing whitespace doesn't accidentally pass the 'end_turn' filter Test tightening - /exit=1/ → /\bexit=1\b/ to avoid false-match on 'exit=10' * docs(subagent): update stale reference to formatFailureBody One inline comment referenced the old name buildFailureBody. The rename in 72aa530 missed this doc line. Substantive behaviour unchanged. * feat(subagent-diagnostics): full activity trail in failure body When a subagent fails, the parent agent needs to know the FULL paths of files the subagent was reading/editing — a truncated single "Last activity: $ cat /local/home/pcad/..." forces the parent to cat the .jsonl anyway to find out what was going on. That's context pollution (raw JSON event blobs in the parent's context) and defeats the point of having a structured failure body. Replace the single "Last activity" line with a bounded, full-fidelity activity trail: - **Last 20 tool calls** in chronological order (most-recent failures happen near the end; older events live in events.jsonl if needed) - **Full paths, full commands** — no home-tilde collapse, no smart-shortening. A path is useless if it's truncated at the interesting part. - **Per-line cap at MAX_ACTIVITY_LINE_CHARS (256)** — an "obviously unreasonable" ceiling. Typical paths ≈150 chars, commands ≈200. 256 is strictly more than normal; overflow gets tail-stripped with an explicit "…(N chars truncated)" signal pointing at the bytes. - **Header declares count + elision** — "42 tool calls, showing last 20, older 22 in /tmp/subagent-X-events.jsonl" — the parent always knows the scope and where to look for the rest. Worst-case render: 20 × 256 = ~5KB of trail. Small enough for the parent's context, large enough to cover the span of a failing subagent's final activity burst. Implementation split: - subagent-diagnostics.ts (peer-dep-free, testable): adds `ToolCallEvent` interface, `truncateTail`, `formatToolCallFull`, `buildActivityTrail`. `FailureDiagnostics.lastToolCall` renamed to `activityTrail: string` (the caller pre-builds the markdown block). - subagent.ts (glue): iterates `run.messages`, extracts `{ name, arguments }` pairs into a `ToolCallEvent[]`, calls `buildActivityTrail(events, { eventsFile })` and passes the result as `activityTrail` into `formatFailureBody`. The widget's live-status path still uses `formatToolCallShort` (different consumer, different budget) — unchanged. Tests: 22 new (truncateTail: 5, formatToolCallFull: 8, buildActivityTrail: 9) + updates to 4 existing composition/rendering tests to use the new activityTrail field. Total: 80/80 passing. * fix(subagent): review-driven — missing import + truncateTail count Two fixes from correctness+quality review of a5b0850: B1 (blocker): The import in subagent.ts was never updated to include buildActivityTrail and ToolCallEvent, so the new finishRun code at lines 296 and 308 would throw ReferenceError at runtime on every subagent failure. The feature was broken on the exact code path it targets. Tests passed only because they import from subagent- diagnostics.ts directly and never exercise subagent.ts's failure branch. Root cause: my earlier multi-edit tool call had one block that failed to match, and because the tool is atomic-or-none, the import update was silently dropped with it. Fixed by re-adding the import. B2 (correctness): truncateTail reported the wrong truncated-byte count in the '...(N chars truncated)' suffix. The value was computed as `s.length - maxChars` but the actual chars dropped is `s.length - keep`, which is larger by suffix.length (~22 chars). The suffix was lying about how many bytes were lost. Fix: iterate toward the fixpoint where `keep + suffix(s.length - keep).length === maxChars`. Converges in 1-2 passes for realistic inputs because suffix length varies only with the digit count of the truncated-count. The declared count now matches the actual chars dropped. Test update: replaced the hardcoded '400 chars truncated' expectation (which was wrong per the bug) with a semantic check: the reported count must exceed the naive `s.length - maxChars` by at least the suffix length, and output still fits within the cap. Tests: 80/80 passing. * refactor(subagent-diagnostics): polish follow-ups from quality review Address non-blocker findings from the correctness review of a5b0850. All are documentation-tightening or edge-case coverage; no behaviour changes on the hot path. Changes to subagent-diagnostics.ts: - truncateTail: JSDoc now documents the output-length invariant explicitly. out.length <= maxChars holds only when maxChars >= suffix.length (~21 chars). Below that, the suffix alone exceeds the cap and output degrades to the suffix \u2014 callers aren't allowed to go that small in this module. Previously the "never exceeds maxChars" claim was strictly false for degenerate maxChars < 21. - buildActivityTrail: maxEvents <= 0 now returns empty string instead of producing a dangling "**Activity (N tool calls; showing last 0; N older elided):**" header over no bullet list. Contract boundary: an empty trail is rendered as empty string, caller-side omission is the natural consequence. Symmetric with the existing empty-events guard. Tests added (6 new cases): - truncateTail with maxChars=0 (pins the documented degraded behaviour). - formatToolCallFull with BigInt argument (JSON.stringify-throws path through the try/catch). - formatToolCallFull with circular-reference argument (same). - buildActivityTrail with empty events + eventsFile (contract: ignored). - buildActivityTrail with maxEvents=0 (returns empty). - buildActivityTrail with negative maxEvents (same). - First-kept-event boundary assertion added to the existing cap test (events at index 5 is kept when 25 events with cap=20). Tests: 86/86 passing (was 80). Not addressed (deferred to separate PRs or skipped as taste): - formatToolCallShort vs formatToolCallFull signature drift (would touch widget code; separate concern). - build* vs format* verb consistency (rename churn > value; taste). - Adding a `typecheck` npm script (valuable but needs peer-dep scaffolding so pnpm install resolves pi-coding-agent and pi-ai for tsc \u2014 worth its own commit with proper tsconfig.json + dev deps). - Loosening tests coupled to exact section phrasing (low value until phrasing actually changes).
Owner
|
Thanks @cad0p — great PR. The before/after failure body sold it, and the peer-dep-free The Happy to open follow-up issues for the deferred items ( |
This was referenced May 12, 2026
Owner
|
Follow-ups tracked:
No pressure — just so they don't drop. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Enrich the subagent extension's background-mode failure body so the parent agent gets actionable diagnostic context instead of a single cryptic word (e.g.
"terminated").Motivating observation — a session of research subagents surfaced failures rendering as:
The parent had no way to distinguish upstream API 5xx (retry) from context-budget stall (tighten scope) from signal-kill (inspect signal). After this change, the same class of failure looks like:
What changed
Two runtime correctness fixes:
Signal capture:
proc.on("close", (code, signal) => ...)— thesignalarg was being discarded. A signal-killed subagent (no precedingturn_end) was mis-routed through the "completed" branch withexitCode=0and empty body. Nowrun.signalis set on close, andisErrorincludesrun.signal !== undefinedso signal-killed subagents surface through the failure path.Dynamic stderr fence: the old code wrapped stderr with hardcoded
```fences. When stderr itself contained three or more backticks (a linter output that includes markdown, say), the inner run terminated the outer fence and split the block. NewfenceFor(content)picks a fence longer than the longest inner run.New diagnostic module
src/subagent-diagnostics.ts(peer-dep-free, so it unit-tests without pulling in@mariozechner/pi-coding-agentor@mariozechner/pi-tui):formatFailureBody({ errorMessage, stopReason, exitCode, signal, stderr, activityTrail, usageLine, partialOutput })— structured section renderer. Each section is omitted when the corresponding field is empty; when everything is empty it returns an explicit(no diagnostic information captured)fallback so the parent knows the harness lost detail rather than the subagent producing nothing.truncateTail(s, maxChars)— tail-strip with a"…(N chars truncated)"suffix whose reported count is the actual chars dropped (fixpoint iteration to account for the suffix's own digit-count impact on the budget).formatToolCallFull(event, maxLineChars)— per-event renderer for bash/read/write/edit/grep/find/ls/unknown tools. Full paths, no home-tilde collapse — the parent shouldn't have to open the events.jsonl to learn which file was being read.buildActivityTrail(events, { maxEvents, maxLineChars, eventsFile })— last N events in chronological order with count summary and optional events.jsonl pointer. Per-line capMAX_ACTIVITY_LINE_CHARS = 256(generous ceiling; real paths ≈150, real commands ≈200). Event capDEFAULT_MAX_ACTIVITY_EVENTS = 20keeps the trail under ~5KB worst case.Raw JSON event log at
/tmp/subagent-<id>-events.jsonl— every event pi emits (tool calls, tool results, thinking blocks, message deltas) is mirrored line-by-line for post-mortem. Essential when a subagent fails mid-run —jq . < /tmp/subagent-<id>-events.jsonlreconstructs exactly what it was doing. Stream errors are swallowed by design — a failed post-mortem log must never cascade into the subagent's own execution.Test coverage
86 passing tests (64 new for this change). Structure:
formatFailureBody— composition, fallback path, per-section rendering including the signal-kill-with-empty-stderr regression casefenceFor— backtick-run detection, boundary cases, extreme runstruncateTail— short passthrough, exact-length, tail-strip, report-count correctness, degeneratemaxChars=0contractformatToolCallFull— each tool type's renderer, legacy-key fallback, unknown-tool JSON serialization including BigInt + circular-ref throwsbuildActivityTrail— empty inputs, maxEvents overrides including0and negative, elision + eventsFile interaction, chronological ordering, slice boundary, per-line cap enforcementAll new tests import from
subagent-diagnostics.tsdirectly, inheriting the existing test pattern of keeping pi peer deps out of the test path.Non-goals / intentionally deferred
tsc --noEmitin CI — genuinely valuable. During this change's development a missing import was nearly shipped; it was caught only because reviewers manually rantsc. The project has notsconfig.jsonand no tsc step innpm test. Worth adding but needs its own PR with proper peer-dep scaffolding (pi-coding-agent, pi-ai, typebox as dev-deps). Flagged in the change's internal commit notes so it doesn't drop.Harmonizing
formatToolCallShort(subagent.ts) withformatToolCallFull(diagnostics) — the short form feeds the live widget and has different budget constraints. Could unify later if a third caller appears.Backwards compatibility
FailureDiagnostics.lastToolCallis renamed toactivityTrail(was only ever internal to the failure-body formatter; the widget'sTrackedRun.lastToolCallis unaffected). No public subagent-extension API changes.Testing
Developed and verified against real failures from a production subagent session — transient 5xx from the LLM gateway, upstream
terminatederror-message propagation, and long-running subagents hitting internal stalls. Each failure now renders with enough context that the parent agent can decide to retry vs respawn-with-tighter-scope without inspecting/tmp.Happy to rebase, squash, or iterate on scope. The change is additive on the failure path (the success path is untouched) so it should be safe to land independently of other in-flight work on the extension.