Skip to content

fix(subagent): enrich failure diagnostics — activity trail, signal capture, structured body#6

Merged
samfoy merged 1 commit into
samfoy:masterfrom
cad0p:feat/subagent-failure-diagnostics
May 12, 2026
Merged

fix(subagent): enrich failure diagnostics — activity trail, signal capture, structured body#6
samfoy merged 1 commit into
samfoy:masterfrom
cad0p:feat/subagent-failure-diagnostics

Conversation

@cad0p
Copy link
Copy Markdown
Contributor

@cad0p cad0p commented May 11, 2026

Summary

Enrich the subagent extension's background-mode failure body so the parent agent gets actionable diagnostic context instead of a single cryptic word (e.g. "terminated").

Motivating observation — a session of research subagents surfaced failures rendering as:

## Subagent `not-generalization` failed (13.9m)

terminated

The parent had no way to distinguish upstream API 5xx (retry) from context-budget stall (tighten scope) from signal-kill (inspect signal). After this change, the same class of failure looks like:

## Subagent `happened-not` failed (4.0m)

**Error:** Kiro API error: 500 Internal Server Error {"message":"..."}

**Status:** stop=error

**Activity (42 tool calls, showing last 20, older 22 in /tmp/subagent-xyz-events.jsonl):**

- read: /home/me/workspace/project/src/schema.ts
- bash: $ grep -rn "speculative" /home/me/workspace/project/src/
- bash: $ mkdir -p /home/me/notes/features/
- ... (17 more)

**Usage before failure:** 42t ↑3.8M ↓3.5k claude-opus-4-7

**Partial output:** Now I have a thorough quantitative picture. Let me write the deliverable:

_Post-mortem: `jq . < /tmp/subagent-happened-not-events.jsonl`_

What changed

Two runtime correctness fixes:

  1. Signal capture: proc.on("close", (code, signal) => ...) — the signal arg was being discarded. A signal-killed subagent (no preceding turn_end) was mis-routed through the "completed" branch with exitCode=0 and empty body. Now run.signal is set on close, and isError includes run.signal !== undefined so signal-killed subagents surface through the failure path.

  2. Dynamic stderr fence: the old code wrapped stderr with hardcoded ``` fences. When stderr itself contained three or more backticks (a linter output that includes markdown, say), the inner run terminated the outer fence and split the block. New fenceFor(content) picks a fence longer than the longest inner run.

New diagnostic module src/subagent-diagnostics.ts (peer-dep-free, so it unit-tests without pulling in @mariozechner/pi-coding-agent or @mariozechner/pi-tui):

  • formatFailureBody({ errorMessage, stopReason, exitCode, signal, stderr, activityTrail, usageLine, partialOutput }) — structured section renderer. Each section is omitted when the corresponding field is empty; when everything is empty it returns an explicit (no diagnostic information captured) fallback so the parent knows the harness lost detail rather than the subagent producing nothing.
  • truncateTail(s, maxChars) — tail-strip with a "…(N chars truncated)" suffix whose reported count is the actual chars dropped (fixpoint iteration to account for the suffix's own digit-count impact on the budget).
  • formatToolCallFull(event, maxLineChars) — per-event renderer for bash/read/write/edit/grep/find/ls/unknown tools. Full paths, no home-tilde collapse — the parent shouldn't have to open the events.jsonl to learn which file was being read.
  • buildActivityTrail(events, { maxEvents, maxLineChars, eventsFile }) — last N events in chronological order with count summary and optional events.jsonl pointer. Per-line cap MAX_ACTIVITY_LINE_CHARS = 256 (generous ceiling; real paths ≈150, real commands ≈200). Event cap DEFAULT_MAX_ACTIVITY_EVENTS = 20 keeps the trail under ~5KB worst case.

Raw JSON event log at /tmp/subagent-<id>-events.jsonl — every event pi emits (tool calls, tool results, thinking blocks, message deltas) is mirrored line-by-line for post-mortem. Essential when a subagent fails mid-run — jq . < /tmp/subagent-<id>-events.jsonl reconstructs exactly what it was doing. Stream errors are swallowed by design — a failed post-mortem log must never cascade into the subagent's own execution.

Test coverage

86 passing tests (64 new for this change). Structure:

  • formatFailureBody — composition, fallback path, per-section rendering including the signal-kill-with-empty-stderr regression case
  • fenceFor — backtick-run detection, boundary cases, extreme runs
  • truncateTail — short passthrough, exact-length, tail-strip, report-count correctness, degenerate maxChars=0 contract
  • formatToolCallFull — each tool type's renderer, legacy-key fallback, unknown-tool JSON serialization including BigInt + circular-ref throws
  • buildActivityTrail — empty inputs, maxEvents overrides including 0 and negative, elision + eventsFile interaction, chronological ordering, slice boundary, per-line cap enforcement

All new tests import from subagent-diagnostics.ts directly, inheriting the existing test pattern of keeping pi peer deps out of the test path.

Non-goals / intentionally deferred

  • tsc --noEmit in CI — genuinely valuable. During this change's development a missing import was nearly shipped; it was caught only because reviewers manually ran tsc. The project has no tsconfig.json and no tsc step in npm test. Worth adding but needs its own PR with proper peer-dep scaffolding (pi-coding-agent, pi-ai, typebox as dev-deps). Flagged in the change's internal commit notes so it doesn't drop.

  • Harmonizing formatToolCallShort (subagent.ts) with formatToolCallFull (diagnostics) — the short form feeds the live widget and has different budget constraints. Could unify later if a third caller appears.

Backwards compatibility

FailureDiagnostics.lastToolCall is renamed to activityTrail (was only ever internal to the failure-body formatter; the widget's TrackedRun.lastToolCall is unaffected). No public subagent-extension API changes.

Testing

Developed and verified against real failures from a production subagent session — transient 5xx from the LLM gateway, upstream terminated error-message propagation, and long-running subagents hitting internal stalls. Each failure now renders with enough context that the parent agent can decide to retry vs respawn-with-tighter-scope without inspecting /tmp.


Happy to rebase, squash, or iterate on scope. The change is additive on the failure path (the success path is untouched) so it should be safe to land independently of other in-flight work on the extension.

…pture, structured body (#3)

* fix(subagent): preserve signal and enrich failure diagnostics

The 'Subagent X failed (13.9m) — terminated' message pattern surfaced
a real loss of diagnostic information in the background-mode failure
path. When pi's subagent child exited with a non-zero code or was
killed by signal, the harness rendered only 'run.errorMessage ||
stderr || "(no output)"', which often showed a single cryptic word
with no indication of what went wrong, what the subagent was doing,
or how many tokens it used before failing.

Two fixes in one commit (same concern, same code path):

1. Capture signal in proc.on('close', (code, signal)). The second
   argument was being discarded; signal-kills (SIGTERM / SIGKILL /
   SIGHUP) were indistinguishable from clean non-zero exits.

2. Replace the 'errorMessage || stderr || ...' one-liner with a
   structured 'buildFailureBody' formatter that surfaces, in order:
   errorMessage, status (stopReason + exitCode + signal), stderr
   (tail-truncated at 2KB to avoid drowning the parent's context),
   last tool call the subagent attempted, usage line (tokens + cost +
   model before failure), and partial assistant text if any was
   produced. Each section is omitted if empty; the 'no diagnostic
   information captured' fallback fires only when LITERALLY nothing
   was preserved — strictly more informative than before.

The formatter lives in a new src/subagent-diagnostics.ts module so
it can be unit-tested without the peer-dep imports the main
extension carries (tests follow in a later commit).

* feat(subagent): save raw JSON event stream to /tmp for post-mortem

When a subagent fails mid-run the failure body can still be thin \u2014
e.g. the child died with only a short stderr, or the error surfaced
before the assistant produced any text. In those cases the parent
agent has no way to reconstruct what the subagent was doing.

Mirror pi's `--mode json` stdout into `/tmp/subagent-<id>-events.jsonl`
(write-on-open, flushed in finishRun). Every assistant message, tool
call, tool result, and error the subagent emitted is preserved line-
by-line, even when the process is later killed mid-pipe or hangs.

Post-mortem: `jq . < /tmp/subagent-<id>-events.jsonl`.

A footer line in the failure message points consumers at the file so
an agent or human reviewing a failure knows where the detail lives
without having to discover the convention.

Stream errors are swallowed intentionally \u2014 failing to write a post-
mortem log must never cascade into the subagent's own execution.
Best-effort by design; result.md / stderr / the structured failure
body remain the primary channels.

* test(subagent): unit tests for buildFailureBody diagnostic formatter

Sixteen cases pinning the structured failure-body format:

  Fallback path
    - empty input returns the explicit 'no diagnostic information captured'
      fallback, never bare empty string
    - whitespace-only strings are treated as absent

  Per-section rendering
    - errorMessage → **Error:** line
    - stopReason='end_turn' is suppressed (normal completion)
    - exitCode=0 is suppressed when signal is the real cause
    - full status line composes stop + exit + signal
    - stderr wraps in fenced code block
    - stderr tail-truncation at STDERR_TAIL_BYTES
    - lastToolCall, usageLine render verbatim
    - finalText='(no output)' is suppressed (subagent's no-output marker)

  Composition
    - stable section order (Error → Status → stderr → Last activity → Usage
      → Partial output) pinned by indexOf sequence
    - blank-line separation between sections (markdown block boundaries)
    - signal-kill with empty stderr (the exact symptom this PR fixes):
      previously rendered as '(no output)', now surfaces signal=SIGTERM
      and preserves partial assistant text
    - thin-diagnostic stopReason='aborted' alone still fires a Status line

Tests import buildFailureBody directly (pure function, no peer-dep
indirection). Follows the pattern of existing tests in this suite that
keep pi's peer deps out of the test path.

* test(subagent): unit tests for buildFailureBody diagnostic formatter

12 tests covering all rendering paths of buildFailureBody:
- empty input → fallback message
- each field rendered individually (errorMessage, status, stderr,
  lastToolCall, usageLine, finalText)
- stderr tail-truncation at STDERR_TAIL_BYTES
- omission of empty/whitespace-only fields
- omission of finalText when it is '(no output)'
- omission of status line when stopReason=end_turn and exitCode=0
- field ordering in the combined case

Tests import from subagent-diagnostics.ts directly (no peer deps).

* test(subagent): restore full 16-case test suite for buildFailureBody

The previous commit inadvertently replaced the comprehensive test file
with a shorter version. Restore the full suite:

  - Fallback path: empty input, whitespace-only strings
  - Per-section: errorMessage, status (stopReason/exitCode/signal),
    stderr fenced block, stderr tail-truncation, lastToolCall,
    usageLine, finalText suppression for '(no output)'
  - Composition: stable section order, blank-line separation,
    signal-kill-with-empty-stderr (the exact symptom this PR fixes)

52 tests total (was 48).

* fix(subagent): route signal-killed subagents through failure branch

Blocker surfaced in fresh correctness review of commit 05cd781.

The signal capture in proc.on('close', (code, signal) => ...) was
wired to run.signal, but the isError predicate on the line below
didn't consult it:

  isError = run.exitCode !== 0 || stopReason in { error, aborted }

When pi's subagent child is SIGTERM'd mid-run — the exact failure mode
that motivated this PR — Node invokes close with code=null, signal=
'SIGTERM'. finishRun(code ?? 0) then sets run.exitCode = 0. No
turn_end was emitted (the process died mid-stream), so run.stopReason
stays undefined. Result: isError evaluates false, the 'completed'
branch runs, the failure body becomes empty output. The signal field
on TrackedRun was captured but never rendered because buildFailureBody
was never called.

Add run.signal !== undefined to the predicate so the failure branch
fires and the structured body (including the new signal= status line)
actually surfaces to the parent agent.

Covered by existing test 'signal-kill with otherwise clean exit still
surfaces the signal' in subagent-diagnostics.test.ts, which asserts
the signal renders in the body — this commit wires that test's
assumption to the real code path.

* fix(subagent-diagnostics): dynamic fence length for stderr with backticks

Blocker surfaced in correctness review: the naked triple-backtick fence
around stderr content breaks markdown rendering when stderr itself
contains a run of three or more backticks (e.g. a stack trace that
quoted a markdown snippet, or a shell error that echoed a heredoc).
CommonMark requires the opening and closing fences to be at least as
long as the longest inner backtick run + 1.

Add fenceFor(content) helper that scans for the longest backtick run
and returns a fence one longer (minimum three). Exported for test
coverage. buildFailureBody now uses fenceFor(tail) instead of a
hardcoded triple-backtick fence.

Six new tests pin the fence-length logic:
  - plain text → three backticks
  - one or two inner backticks → still three (minimum)
  - three inner backticks → four
  - longest-run wins across multiple runs
  - ten-backtick run → eleven-backtick fence

* refactor(subagent-diagnostics): review follow-ups — renames, phrasing, guards

Apply suggestions from the quality and correctness reviews:

  Renames (naming consistency + caller/reader alignment)
  - buildFailureBody → formatFailureBody (matches format* convention used
    throughout the module: formatTokens, formatUsage, formatToolCallShort)
  - FailureDiagnostics.finalText → partialOutput (the field is rendered
    under **Partial output:**; caller thought 'final text', reader saw
    'partial output' — mismatch removed)

  Comments
  - usageLine: add inline doc explaining why it's a pre-formatted string
    rather than a structured Usage object (peer-dep-free boundary)
  - end_turn suppression: add '// Suppress end_turn — normal completion'
    comment so the filter's intent is self-evident

  Phrasing
  - Fallback message: '...the harness lost the failure detail' →
    '...check the post-mortem .jsonl' — less blame-y, more actionable

  Guards
  - Post-mortem footer: only append the jq pointer when eventStream was
    successfully opened; avoids pointing at a file that doesn't exist
    when /tmp is read-only or createWriteStream threw

  Correctness nit
  - stopReason: trim before comparison so leading/trailing whitespace
    doesn't accidentally pass the 'end_turn' filter

  Test tightening
  - /exit=1/ → /\bexit=1\b/ to avoid false-match on 'exit=10'

* docs(subagent): update stale reference to formatFailureBody

One inline comment referenced the old name buildFailureBody. The
rename in 72aa530 missed this doc line. Substantive behaviour
unchanged.

* feat(subagent-diagnostics): full activity trail in failure body

When a subagent fails, the parent agent needs to know the FULL paths
of files the subagent was reading/editing — a truncated single "Last
activity: $ cat /local/home/pcad/..." forces the parent to cat the
.jsonl anyway to find out what was going on. That's context pollution
(raw JSON event blobs in the parent's context) and defeats the point
of having a structured failure body.

Replace the single "Last activity" line with a bounded, full-fidelity
activity trail:

  - **Last 20 tool calls** in chronological order (most-recent failures
    happen near the end; older events live in events.jsonl if needed)
  - **Full paths, full commands** — no home-tilde collapse, no
    smart-shortening. A path is useless if it's truncated at the
    interesting part.
  - **Per-line cap at MAX_ACTIVITY_LINE_CHARS (256)** — an "obviously
    unreasonable" ceiling. Typical paths ≈150 chars, commands ≈200.
    256 is strictly more than normal; overflow gets tail-stripped with
    an explicit "…(N chars truncated)" signal pointing at the bytes.
  - **Header declares count + elision** — "42 tool calls, showing
    last 20, older 22 in /tmp/subagent-X-events.jsonl" — the parent
    always knows the scope and where to look for the rest.

Worst-case render: 20 × 256 = ~5KB of trail. Small enough for the
parent's context, large enough to cover the span of a failing
subagent's final activity burst.

Implementation split:

  - subagent-diagnostics.ts (peer-dep-free, testable): adds
    `ToolCallEvent` interface, `truncateTail`, `formatToolCallFull`,
    `buildActivityTrail`. `FailureDiagnostics.lastToolCall` renamed to
    `activityTrail: string` (the caller pre-builds the markdown block).

  - subagent.ts (glue): iterates `run.messages`, extracts
    `{ name, arguments }` pairs into a `ToolCallEvent[]`, calls
    `buildActivityTrail(events, { eventsFile })` and passes the result
    as `activityTrail` into `formatFailureBody`.

The widget's live-status path still uses `formatToolCallShort`
(different consumer, different budget) — unchanged.

Tests: 22 new (truncateTail: 5, formatToolCallFull: 8, buildActivityTrail:
9) + updates to 4 existing composition/rendering tests to use the new
activityTrail field. Total: 80/80 passing.

* fix(subagent): review-driven — missing import + truncateTail count

Two fixes from correctness+quality review of a5b0850:

B1 (blocker): The import in subagent.ts was never updated to include
buildActivityTrail and ToolCallEvent, so the new finishRun code at
lines 296 and 308 would throw ReferenceError at runtime on every
subagent failure. The feature was broken on the exact code path it
targets. Tests passed only because they import from subagent-
diagnostics.ts directly and never exercise subagent.ts's failure
branch.

Root cause: my earlier multi-edit tool call had one block that failed
to match, and because the tool is atomic-or-none, the import update
was silently dropped with it. Fixed by re-adding the import.

B2 (correctness): truncateTail reported the wrong truncated-byte count
in the '...(N chars truncated)' suffix. The value was computed as
`s.length - maxChars` but the actual chars dropped is `s.length - keep`,
which is larger by suffix.length (~22 chars). The suffix was lying
about how many bytes were lost.

Fix: iterate toward the fixpoint where
`keep + suffix(s.length - keep).length === maxChars`. Converges in
1-2 passes for realistic inputs because suffix length varies only with
the digit count of the truncated-count. The declared count now matches
the actual chars dropped.

Test update: replaced the hardcoded '400 chars truncated' expectation
(which was wrong per the bug) with a semantic check: the reported
count must exceed the naive `s.length - maxChars` by at least the
suffix length, and output still fits within the cap.

Tests: 80/80 passing.

* refactor(subagent-diagnostics): polish follow-ups from quality review

Address non-blocker findings from the correctness review of a5b0850.
All are documentation-tightening or edge-case coverage; no behaviour
changes on the hot path.

Changes to subagent-diagnostics.ts:

- truncateTail: JSDoc now documents the output-length invariant
  explicitly. out.length <= maxChars holds only when maxChars >=
  suffix.length (~21 chars). Below that, the suffix alone exceeds the
  cap and output degrades to the suffix \u2014 callers aren't allowed to
  go that small in this module. Previously the "never exceeds maxChars"
  claim was strictly false for degenerate maxChars < 21.

- buildActivityTrail: maxEvents <= 0 now returns empty string instead
  of producing a dangling "**Activity (N tool calls; showing last 0;
  N older elided):**" header over no bullet list. Contract boundary:
  an empty trail is rendered as empty string, caller-side omission is
  the natural consequence. Symmetric with the existing empty-events
  guard.

Tests added (6 new cases):

- truncateTail with maxChars=0 (pins the documented degraded behaviour).
- formatToolCallFull with BigInt argument (JSON.stringify-throws path
  through the try/catch).
- formatToolCallFull with circular-reference argument (same).
- buildActivityTrail with empty events + eventsFile (contract: ignored).
- buildActivityTrail with maxEvents=0 (returns empty).
- buildActivityTrail with negative maxEvents (same).
- First-kept-event boundary assertion added to the existing cap test
  (events at index 5 is kept when 25 events with cap=20).

Tests: 86/86 passing (was 80).

Not addressed (deferred to separate PRs or skipped as taste):

- formatToolCallShort vs formatToolCallFull signature drift (would
  touch widget code; separate concern).
- build* vs format* verb consistency (rename churn > value; taste).
- Adding a `typecheck` npm script (valuable but needs peer-dep
  scaffolding so pnpm install resolves pi-coding-agent and pi-ai for
  tsc \u2014 worth its own commit with proper tsconfig.json + dev deps).
- Loosening tests coupled to exact section phrasing (low value until
  phrasing actually changes).
@cad0p cad0p marked this pull request as ready for review May 11, 2026 23:15
@cad0p cad0p requested a review from samfoy as a code owner May 11, 2026 23:15
@samfoy samfoy merged commit b09372c into samfoy:master May 12, 2026
@samfoy
Copy link
Copy Markdown
Owner

samfoy commented May 12, 2026

Thanks @cad0p — great PR. The before/after failure body sold it, and the peer-dep-free subagent-diagnostics.ts split is a nice pattern for keeping unit tests out of the peer-dep graph. Merged as b09372c.

The fenceFor helper and truncateTail fixpoint are little gems — both bug classes I've hit elsewhere. Appreciate the care on the signal-branch routing fix too; that one was subtle.

Happy to open follow-up issues for the deferred items (tsc --noEmit in CI, harmonizing formatToolCallShort / formatToolCallFull) if you'd like them tracked.

@samfoy
Copy link
Copy Markdown
Owner

samfoy commented May 12, 2026

Follow-ups tracked:

No pressure — just so they don't drop.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants