Detect stalled agent processes #67

@VincentShipsIt

name: detect-stalled-processes
description: Kill and retry agent processes that emit no output past a stall timeout.
status: backlog
estimated_complexity: low
blast_radius: contained

PRD: detect-stalled-processes

Executive Summary

Adopt Symphony's stall detection. On each pipeline scheduler tick, compute `now - lastEventAt` for every running agent process. Once that value exceeds `stallTimeoutMs` (default 5 min; 0 disables), the agent is killed and the pipeline is marked FAILED with reason `stalled` (eligible for retry). This closes the long-tail bug where wedged agents pin slots indefinitely. Natural follow-on to #62/#63/#64.

Problem Statement

`ProcessManager` (`packages/agents/src/process-manager.ts`) only kills processes on app quit (PR #62-64) or explicit cancel. A claude/codex process that hangs without exiting (e.g. blocked on a tool call, network wedge) holds a concurrency slot forever. Nothing detects this automatically.

Goals

  • Each running process tracks `lastEventAt`, updated on every stdout chunk.
  • Periodic scheduler tick kills processes idle past `stallTimeoutMs`.
  • Pipeline state for killed thread is FAILED with reason `stalled`.
  • Default 300000ms (5min); 0 disables stall detection entirely.

Non-Goals

  • Detecting infinite loops that produce continuous output but no progress.
  • Restarting the agent automatically; retry policy is handled by #74 (Split continuation vs failure backoff).
  • Per-phase stall thresholds (single threshold for v1).

User Stories

  • As a user running an overnight pipeline, I want a wedged agent killed within minutes so the orchestrator can move on, so I do not wake up to a stuck queue.
    Acceptance:
    • An agent emitting no stdout for 5 minutes is killed.
    • Pipeline state for that thread becomes FAILED with reason mentioning `stalled`.
    • An agent emitting regular stdout survives indefinitely.

Functional Requirements

  1. `ManagedProcess` exposes `lastEventAt: number` (monotonic timestamp ms).
  2. `lastEventAt` is updated on every `pty.onData` chunk and on spawn.
  3. `ProcessManager.killStalled(stallTimeoutMs: number): string[]` iterates the registry, kills any entry where `now - lastEventAt > stallTimeoutMs`, returns the killed ids.
  4. Pipeline scheduler invokes `killStalled` on a configurable interval (default 30s).
  5. `stallTimeoutMs <= 0` short-circuits detection (no kills).
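
Requirements 1, 2, 3, and 5 could be sketched roughly as follows. This is a minimal illustration, not the actual `process-manager.ts`: the `register` and `touch` helpers, the injected `now` clock, and the `kill` callback are assumptions made so the example is self-contained and testable without real timers.

```typescript
// Sketch only. `register`, `touch`, and the injected clock are hypothetical;
// only ManagedProcess, lastEventAt, and killStalled come from the requirements.
interface ManagedProcess {
  id: string;
  lastEventAt: number; // monotonic ms; set on spawn and on every data chunk
  kill: () => void; // stand-in for the real SIGHUP-then-SIGKILL escalation
}

class ProcessManager {
  private registry = new Map<string, ManagedProcess>();
  private now: () => number;

  // The clock is injected so tests can drive time directly. Production code
  // might pass a monotonic source such as () => performance.now().
  constructor(now: () => number) {
    this.now = now;
  }

  // Called at spawn time: lastEventAt starts at "now" (requirement 2).
  register(id: string, kill: () => void): ManagedProcess {
    const proc: ManagedProcess = { id, lastEventAt: this.now(), kill };
    this.registry.set(id, proc);
    return proc;
  }

  // Would be called from the pty data handler on every chunk.
  touch(id: string): void {
    const proc = this.registry.get(id);
    if (proc) proc.lastEventAt = this.now();
  }

  // Single O(n) pass over the registry; returns the ids that were killed.
  killStalled(stallTimeoutMs: number): string[] {
    if (stallTimeoutMs <= 0) return []; // requirement 5: detection disabled
    const now = this.now();
    const killed: string[] = [];
    for (const [id, proc] of this.registry) {
      if (now - proc.lastEventAt > stallTimeoutMs) {
        proc.kill();
        this.registry.delete(id); // deleting during Map iteration is safe
        killed.push(id);
      }
    }
    return killed;
  }
}
```

Injecting the clock keeps the comparison in one time base and lets the idle, active, and disabled cases be tested by assigning to a counter rather than waiting on wall-clock time.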

Non-Functional Requirements

  • Stall scan must be O(n) over the registry and complete in <10ms for 50 entries.
  • Kill must use the same SIGHUP-then-SIGKILL escalation as `killAllAndWait`.
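
For reference, a SIGHUP-then-SIGKILL escalation can look like the sketch below. It is not taken from `killAllAndWait`, whose implementation is not shown in this issue; the `Killable` interface, the `exited` promise, and the `graceMs` parameter are assumptions for illustration.

```typescript
// Hypothetical minimal view of a child process handle.
interface Killable {
  kill(signal?: string): void;
}

// Sends SIGHUP, waits up to graceMs for the process to exit, and escalates
// to SIGKILL if it has not. `exited` must resolve once the process has exited.
// Returns the last signal that was sent.
async function killWithEscalation(
  proc: Killable,
  exited: Promise<void>,
  graceMs: number,
): Promise<"SIGHUP" | "SIGKILL"> {
  proc.kill("SIGHUP");

  let timer: ReturnType<typeof setTimeout> | undefined;
  const timedOut = new Promise<"timeout">((resolve) => {
    timer = setTimeout(() => resolve("timeout"), graceMs);
  });

  const outcome = await Promise.race([
    exited.then(() => "exited" as const),
    timedOut,
  ]);
  if (timer !== undefined) clearTimeout(timer); // do not leave a stray timer

  if (outcome === "exited") return "SIGHUP";
  proc.kill("SIGKILL");
  return "SIGKILL";
}
```

The escalation matters for the manual `SIGSTOP` check in particular: a stopped process does not act on SIGHUP until it is resumed, whereas SIGKILL terminates it regardless.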

Success Criteria

  • Unit tests with fake timers: idle pty killed after threshold; active pty survives.
  • Disabled-by-zero test: `killStalled(0)` is a no-op.
  • Reason surface test: pipeline state for the killed thread carries the `stalled` substring, per the reason-surfacing from PR #65 (Add WORKFLOW.md target-repo policy).
  • Integration test: pipeline scheduler tick wires `killStalled` and observes the kill.
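
The integration criterion could be exercised against wiring along these lines. This is a sketch under assumptions: `StallKiller`, `PipelineStore`, `markFailed`, `stallTick`, and `startStallScan` are hypothetical names, and it assumes the id returned by `killStalled` is also the key for the pipeline thread.

```typescript
// Hypothetical interfaces; only killStalled and the `stalled` reason come
// from this PRD.
interface StallKiller {
  killStalled(stallTimeoutMs: number): string[];
}

interface PipelineStore {
  markFailed(threadId: string, reason: string): void;
}

// One scheduler tick: kill stalled processes, then surface the reason.
// Assumes process ids double as pipeline thread ids.
function stallTick(
  pm: StallKiller,
  store: PipelineStore,
  stallTimeoutMs: number,
): string[] {
  const killed = pm.killStalled(stallTimeoutMs);
  for (const id of killed) {
    store.markFailed(id, `stalled: no output for more than ${stallTimeoutMs}ms`);
  }
  return killed;
}

// Starts the periodic scan and returns a function that stops it.
// Defaults mirror the PRD: 5 min timeout, 30 s scan interval.
function startStallScan(
  pm: StallKiller,
  store: PipelineStore,
  stallTimeoutMs: number = 300_000,
  intervalMs: number = 30_000,
): () => void {
  if (stallTimeoutMs <= 0) return () => {}; // disabled: never schedule a scan
  const handle = setInterval(
    () => stallTick(pm, store, stallTimeoutMs),
    intervalMs,
  );
  return () => clearInterval(handle);
}
```

Keeping `stallTick` separate from the interval lets a test call it directly with stubbed collaborators, while `startStallScan` only owns scheduling and the disabled-by-zero short circuit.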

Out of Scope

Dependencies

Verification Plan

  • tests: `packages/agents/src/process-manager.test.ts` — new cases for `killStalled` (idle, active, disabled). Pipeline test asserts scheduler invocation.
  • manual: start a pipeline, send `SIGSTOP` to the agent pid to simulate hang; confirm kill within 5min and FAILED reason `stalled`.

Risks & Open Questions

  • Some agents naturally pause for >5min during long tool calls (e.g. large `bun test`). Mitigation: make the threshold tunable per WORKFLOW.md once #65 (Add WORKFLOW.md target-repo policy) lands; default tuned to the typical claude/codex idle window.
  • A stdout-only signal may not distinguish legitimate quiet from a wedged process; detection could be augmented with pty event activity later if false positives appear.
