
EPIC: Subagent LLM stall detection and visibility #398

@randomm

Goal

Enable the system to autonomously detect when a subagent LLM connection has stalled (no tokens for N minutes) and give parent agents real-time visibility into what a running subagent is doing. This eliminates the need for human monitoring of subagent LLM interactions and turns invisible indefinite hangs into visible, bounded failures the system can act on.

Background

When an LLM provider stalls mid-stream, the subagent appears "running" but is actually frozen. The for-await loop in processor.ts hangs indefinitely waiting for the next SSE chunk. Abort signals don't interrupt hung network reads. The only safety net is a 30-minute timeout, so a zombie agent can waste half an hour before anyone notices. check_task returns only "running", with no detail about whether the agent is making progress or stuck.
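
The hang comes from awaiting the next chunk with no upper bound. One possible way to bound it, sketched below, is to race each `next()` call against a timer. `withStallTimeout` and `StreamStallError` are hypothetical names used only for illustration and are not the actual processor.ts code; the error text reuses the `LLM stream stalled` marker named in this issue.

```typescript
// Hypothetical error type; only the "LLM stream stalled" text comes from this issue.
class StreamStallError extends Error {
  constructor(timeoutMs: number) {
    super(`LLM stream stalled: no tokens for ${timeoutMs}ms`)
    this.name = "StreamStallError"
  }
}

// Hypothetical wrapper: re-yields chunks from `source`, but throws if the gap
// between two consecutive chunks exceeds `timeoutMs`.
async function* withStallTimeout<T>(
  source: AsyncIterable<T>,
  timeoutMs: number,
): AsyncGenerator<T> {
  const iterator = source[Symbol.asyncIterator]()
  try {
    while (true) {
      let timer: ReturnType<typeof setTimeout> | undefined
      const stall = new Promise<never>((_, reject) => {
        timer = setTimeout(() => reject(new StreamStallError(timeoutMs)), timeoutMs)
      })
      let result: IteratorResult<T>
      try {
        // Whichever settles first wins: the next chunk or the stall timer.
        result = await Promise.race([iterator.next(), stall])
      } finally {
        // Stop the timer before handing control back to the consumer.
        clearTimeout(timer)
      }
      if (result.done) return
      yield result.value
    }
  } finally {
    // Best-effort cleanup. Not awaited, because a hung source may never settle.
    iterator.return?.()?.catch(() => {})
  }
}
```

In this sketch the hung read is abandoned rather than cancelled, so the stall surfaces as a thrown error in the consuming `for await` loop while the underlying network read may stay open until the provider or runtime closes it.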

Scope

  • LLM stream stall detector in processor.ts (detect "no tokens for N minutes")
  • Enhanced check_task response with last tool calls, stall status, and last activity timestamp
  • Configuration for stall timeout (default 3 minutes)
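
A minimal sketch of how the last two scope items might fit together, using the `OPENCODE_STALL_TIMEOUT_MS` variable and the `TaskResult` field names listed under Fork Manifest Requirement. Everything else is an assumption: the helper names, the shape of a tool call entry, `lastActivity` as epoch milliseconds, and the choice to fall back to the default on invalid input.

```typescript
// Default from this issue: 3 minutes.
const DEFAULT_STALL_TIMEOUT_MS = 3 * 60 * 1000

// Hypothetical helper. Takes the environment as a parameter so it can be tested;
// a caller would typically pass process.env.
function resolveStallTimeoutMs(env: Record<string, string | undefined>): number {
  const raw = env["OPENCODE_STALL_TIMEOUT_MS"]
  if (raw === undefined || raw.trim() === "") return DEFAULT_STALL_TIMEOUT_MS
  const parsed = Number(raw)
  // Assumption: non-numeric or non-positive values fall back to the default.
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_STALL_TIMEOUT_MS
}

// Assumed shape of one recent tool call; the issue only names the field.
interface ToolCallSummary {
  tool: string
  at: number // epoch milliseconds (assumption)
}

// Only the three new field names come from this issue; types are assumptions.
interface TaskResult {
  status: string
  stallDetected: boolean
  lastToolCalls: ToolCallSummary[]
  lastActivity: number // epoch milliseconds (assumption)
}

// Hypothetical formatter showing what a parent agent could read from check_task.
function describeTask(result: TaskResult, now: number): string {
  const idleSeconds = Math.round((now - result.lastActivity) / 1000)
  const tools = result.lastToolCalls.map((call) => call.tool).join(", ") || "none"
  const health = result.stallDetected ? "stalled" : "active"
  return `${result.status} (${health}), last activity ${idleSeconds}s ago, recent tools: ${tools}`
}
```

With fields like these, a parent agent can tell a subagent that is "running" and active apart from one that is "running" but has gone quiet.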

Out of Scope

  • Graceful stream resumption after stall (HTTP/SSE limitation)
  • Parent-child abort signal chaining (separate concern)
  • Structured progress reporting API (future enhancement)

Child Issues

Acceptance Criteria

Epic is done when both child issues are closed and the system can autonomously detect and surface stalled subagents.

Fork Manifest Requirement

This EPIC modifies the subagent monitoring system introduced by the async-tasks fork feature. Upon completion, the .fork-features/manifest.json entry for async-tasks MUST be updated:

  • modifiedFiles: Add packages/opencode/src/session/processor.ts (stall detector lives here)
  • criticalCode: Add the following markers so sync-time agents understand what this code does and can verify it survives upstream merges:
    • lastTokenTime — per-session timestamp tracking in processor stream loop
    • OPENCODE_STALL_TIMEOUT_MS — env var for configurable stall timeout
    • LLM stream stalled — error message thrown on stall detection
    • stallDetected — field on TaskResult indicating stall was detected
    • lastToolCalls — field on TaskResult showing recent tool call activity
    • lastActivity — field on TaskResult showing timestamp of last stream event
  • absorptionSignals: Add stall.*detector, stream.*stall, lastTokenTime so upstream adoption is detected
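
To illustrate how the absorption signals above might be used, here is a small sketch that treats each signal as a regular expression and tests it against upstream source text. How sync-time agents actually evaluate these patterns is not specified in this issue, so the function name, the regex interpretation, and the case-insensitive matching are all assumptions.

```typescript
// Signal patterns copied from this issue.
const absorptionSignals = ["stall.*detector", "stream.*stall", "lastTokenTime"]

// Hypothetical check: returns the signals that match somewhere in `source`.
// Assumption: each signal is a regular expression, matched case-insensitively.
function matchedSignals(source: string, signals: string[]): string[] {
  return signals.filter((pattern) => new RegExp(pattern, "i").test(source))
}

// Example: an upstream file that appears to have adopted similar logic.
const upstreamSnippet = [
  "// stream stall watchdog",
  "let lastTokenTime = Date.now()",
].join("\n")

const hits = matchedSignals(upstreamSnippet, absorptionSignals)
```

Any non-empty result would suggest that upstream may have adopted equivalent behavior and that the fork-specific code deserves a manual review.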

Rationale: Even though the modified files are already partially covered in the manifest, the stall detection logic represents a distinct semantic divergence from upstream. Without explicit criticalCode markers and absorption signals, sync-time agents will not understand what this code does or why it exists, and may silently drop it during merge conflict resolution.
