name: detect-stalled-processes
description: Kill and retry agent processes that emit no output past a stall timeout.
status: backlog
estimated_complexity: low
blast_radius: contained
PRD: detect-stalled-processes
Executive Summary
Adopt Symphony's stall detection. On each tick, the pipeline scheduler computes `now - lastEventAt` for every running agent process. Once that value exceeds `stallTimeoutMs` (default 5min, 0 disables), the agent is killed and the pipeline is marked failed with reason `stalled` (eligible for retry). This closes the long-tail bug where wedged agents pin slots indefinitely. Natural follow-on to #62/#63/#64.
Problem Statement
`ProcessManager` (`packages/agents/src/process-manager.ts`) only kills processes on app quit (PR #62-64) or explicit cancel. A claude/codex process that hangs without exiting (e.g. blocked on a tool call, network wedge) holds a concurrency slot forever. Nothing detects this automatically.
Goals
Each running process tracks `lastEventAt`, updated on every stdout chunk.
Periodic scheduler tick kills processes idle past `stallTimeoutMs`.
Pipeline state for killed thread is FAILED with reason `stalled`.
Non-Goals
Per-phase stall thresholds (single threshold for v1).
User Stories
As a user running an overnight pipeline, I want a wedged agent killed within minutes so the orchestrator can move on and I do not wake up to a stuck queue.
Acceptance:
An agent emitting no stdout for 5 minutes is killed.
Pipeline state for that thread becomes FAILED with reason mentioning `stalled`.
An agent emitting regular stdout survives indefinitely.
Functional Requirements
`lastEventAt` is updated on every `pty.onData` chunk and on spawn.
`ProcessManager.killStalled(stallTimeoutMs: number): string[]` iterates the registry, kills any entry where `now - lastEventAt > stallTimeoutMs`, returns the killed ids.
Pipeline scheduler invokes `killStalled` on a configurable interval (default 30s).
`stallTimeoutMs <= 0` short-circuits detection (no kills).
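The requirements above can be sketched as a small registry. This is a minimal illustration, not the actual `ProcessManager`: the `StallRegistry` class, the `TrackedProcess` shape, the `register`/`touch` methods, and the injectable `now` clock are all hypothetical names introduced here. Only `lastEventAt`, the `killStalled(stallTimeoutMs: number): string[]` signature, and the `<= 0` short-circuit come from the requirements.

```typescript
// Hypothetical registry entry; only `lastEventAt` is named in the PRD.
interface TrackedProcess {
  lastEventAt: number; // epoch ms of spawn or of the most recent output chunk
  kill: () => void;    // stands in for the real kill escalation
}

class StallRegistry {
  private readonly entries = new Map<string, TrackedProcess>();
  private readonly now: () => number;

  // The clock is injectable (an assumption) so tests can drive time directly.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Spawn counts as activity, so lastEventAt starts at registration time.
  register(id: string, kill: () => void): void {
    this.entries.set(id, { lastEventAt: this.now(), kill });
  }

  // Intended to be called from the output handler on every chunk.
  touch(id: string): void {
    const entry = this.entries.get(id);
    if (entry) entry.lastEventAt = this.now();
  }

  // Single O(n) pass over the registry; returns the ids that were killed.
  killStalled(stallTimeoutMs: number): string[] {
    if (stallTimeoutMs <= 0) return []; // detection disabled
    const now = this.now();
    const killed: string[] = [];
    for (const [id, entry] of this.entries) {
      if (now - entry.lastEventAt > stallTimeoutMs) {
        entry.kill();
        this.entries.delete(id); // deleting during Map iteration is safe
        killed.push(id);
      }
    }
    return killed;
  }
}
```

The comparison is strict (`>`), matching the requirement as written, so an entry that has been idle for exactly `stallTimeoutMs` survives until the next scan.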
Non-Functional Requirements
Stall scan must be O(n) over the registry and complete in <10ms for 50 entries.
Kill must use the same SIGHUP-then-SIGKILL escalation as `killAllAndWait`.
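For illustration, a SIGHUP-then-SIGKILL escalation might look like the sketch below. The internals of `killAllAndWait` are not shown in this PRD, so the `Killable` interface, the `isAlive` check, the `graceMs` grace period, and the injectable `schedule` function are all assumptions made for this example.

```typescript
// Hypothetical handle; the real pty/process wrapper may differ.
interface Killable {
  kill(signal: "SIGHUP" | "SIGKILL"): void;
  isAlive(): boolean;
}

// Injectable scheduler (an assumption) so the grace period can be faked in tests.
type Schedule = (fn: () => void, delayMs: number) => void;

const realSchedule: Schedule = (fn, delayMs) => {
  setTimeout(fn, delayMs);
};

function killWithEscalation(
  proc: Killable,
  graceMs: number,
  schedule: Schedule = realSchedule,
): void {
  // Ask politely first so the process can clean up.
  proc.kill("SIGHUP");
  // After the grace period, force-kill only if it is still running.
  schedule(() => {
    if (proc.isAlive()) proc.kill("SIGKILL");
  }, graceMs);
}
```

This matters for the manual `SIGSTOP` check: on POSIX systems a stopped process does not act on SIGHUP until it is continued, so the SIGKILL step is likely the one that actually ends it.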
Success Criteria
Unit tests with fake timers: idle pty killed after threshold; active pty survives.
Disabled-by-zero test: `killStalled(0)` is a no-op.
Verification Plan
tests: `packages/agents/src/process-manager.test.ts` — new cases for `killStalled` (idle, active, disabled). Pipeline test asserts scheduler invocation.
manual: start a pipeline, send `SIGSTOP` to the agent pid to simulate a hang; confirm the kill lands once the 5min threshold passes (at most one scan interval later, 30s by default) and the FAILED reason is `stalled`.
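The scheduler-side behaviour that the pipeline test would assert can be sketched as follows. This is a rough outline under stated assumptions: `ThreadState`, `StallKiller`, `runStallScan`, and `startStallScanner` are hypothetical names, and the sketch assumes the ids returned by `killStalled` can be used directly as thread keys.

```typescript
// Hypothetical thread state; the PRD only specifies FAILED with reason `stalled`.
type ThreadState =
  | { status: "RUNNING" }
  | { status: "FAILED"; reason: string };

// Minimal view of ProcessManager needed by the scheduler.
interface StallKiller {
  killStalled(stallTimeoutMs: number): string[];
}

// One scan: kill stalled processes, then mark their threads as failed.
function runStallScan(
  pm: StallKiller,
  threads: Map<string, ThreadState>,
  stallTimeoutMs: number,
): string[] {
  const killed = pm.killStalled(stallTimeoutMs);
  for (const id of killed) {
    threads.set(id, { status: "FAILED", reason: "stalled" });
  }
  return killed;
}

// Repeats the scan on an interval (default 30s); returns a stop function.
function startStallScanner(
  pm: StallKiller,
  threads: Map<string, ThreadState>,
  stallTimeoutMs: number,
  scanIntervalMs: number = 30_000,
): () => void {
  const timer = setInterval(() => {
    runStallScan(pm, threads, stallTimeoutMs);
  }, scanIntervalMs);
  return () => clearInterval(timer);
}
```

With the defaults, a wedged agent is detected on the first scan after it has been idle for more than 5min, so the worst-case detection latency is roughly the stall timeout plus one scan interval.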
Risks & Open Questions
Some agents naturally pause for >5min during long tool calls (e.g. large `bun test`). Mitigation: tunable threshold per WORKFLOW.md once #65 (Add WORKFLOW.md target-repo policy) lands; default tuned to typical claude/codex idle window.
A stdout-only signal cannot reliably distinguish a legitimately quiet agent from a wedged one; could augment with pty event activity later if false positives appear.