Detect stalled agent processes

---
name: detect-stalled-processes
description: Kill and retry agent processes that emit no output past a stall timeout.
status: backlog
estimated_complexity: low
blast_radius: contained
---

# PRD: detect-stalled-processes

## Executive Summary
Adopt Symphony's stall detection. Pipeline scheduler tick computes \`now - lastEventAt\` for each running agent process. Past \`stallTimeoutMs\` (default 5min, 0 disables), the agent is killed and the pipeline marked failed-with-reason \`stalled\` (eligible for retry). Closes the long-tail bug where wedged agents pin slots indefinitely. Natural follow-on to #62/#63/#64.

## Problem Statement
\`ProcessManager\` (\`packages/agents/src/process-manager.ts\`) only kills on app quit (PR #62-64) or explicit cancel. A claude/codex process that hangs without exiting (e.g. blocked on a tool call, network wedge) holds a concurrency slot forever. No automatic detection.

## Goals
- Each running process tracks \`lastEventAt\`, updated on every stdout chunk.
- Periodic scheduler tick kills processes idle past \`stallTimeoutMs\`.
- Pipeline state for killed thread is FAILED with reason \`stalled\`.
- Default 300000ms (5min); 0 disables stall detection entirely.

## Non-Goals
- Detecting infinite loops that produce continuous output but no progress.
- Restarting the agent automatically — retry policy handled by #74.
- Per-phase stall thresholds (single threshold for v1).

## User Stories
- As a user running an overnight pipeline, I want a wedged agent killed within minutes so the orchestrator can move on, so I do not wake up to a stuck queue.
  **Acceptance:**
  - An agent emitting no stdout for 5 minutes is killed.
  - Pipeline state for that thread becomes FAILED with reason mentioning \`stalled\`.
  - An agent emitting regular stdout survives indefinitely.

## Functional Requirements
1. \`ManagedProcess\` exposes \`lastEventAt: number\` (monotonic timestamp ms).
2. \`lastEventAt\` is updated on every \`pty.onData\` chunk and on spawn.
3. \`ProcessManager.killStalled(stallTimeoutMs: number): string[]\` iterates the registry, kills any entry where \`now - lastEventAt > stallTimeoutMs\`, returns the killed ids.
4. Pipeline scheduler invokes \`killStalled\` on a configurable interval (default 30s).
5. \`stallTimeoutMs <= 0\` short-circuits detection (no kills).

## Non-Functional Requirements
- Stall scan must be O(n) over the registry and complete in <10ms for 50 entries.
- Kill must use the same SIGHUP-then-SIGKILL escalation as \`killAllAndWait\`.

## Success Criteria
- Unit tests with fake timers: idle pty killed after threshold; active pty survives.
- Disabled-by-zero test: \`killStalled(0)\` is a no-op.
- Reason surface test: pipeline state for killed thread carries \`stalled\` substring per PR #65 reason-surfacing.
- Integration test: pipeline scheduler tick wires \`killStalled\` and observes the kill.

## Out of Scope
- Retry orchestration after stall (handled by #74).
- Per-phase stall thresholds.
- UI controls for live stall threshold tuning.

## Dependencies
- \`packages/agents/src/process-manager.ts\` (registry + kill).
- Pipeline scheduler tick (existing or to be introduced via #68 reconciliation).
- Symphony spec: https://github.com/openai/symphony/blob/main/SPEC.md#85-active-run-reconciliation

## Verification Plan
- **tests:** \`packages/agents/src/process-manager.test.ts\` — new cases for \`killStalled\` (idle, active, disabled). Pipeline test asserts scheduler invocation.
- **manual:** start a pipeline, send \`SIGSTOP\` to the agent pid to simulate hang; confirm kill within 5min and FAILED reason \`stalled\`.

## Risks & Open Questions
- Some agents naturally pause for >5min during long tool calls (e.g. large \`bun test\`). Mitigation: tunable threshold per WORKFLOW.md once #65 lands; default tuned to typical claude/codex idle window.
- Distinguishing legitimate quiet from wedged on stdout-only signal — could augment with pty event activity later if false positives appear.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Detect stalled agent processes #67

name: detect-stalled-processes
description: Kill and retry agent processes that emit no output past a stall timeout.
status: backlog
estimated_complexity: low
blast_radius: contained

PRD: detect-stalled-processes

Executive Summary

Problem Statement

Goals

Non-Goals

User Stories

Functional Requirements

Non-Functional Requirements

Success Criteria

Out of Scope

Dependencies

Verification Plan

Risks & Open Questions

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Detect stalled agent processes #67

Description

name: detect-stalled-processes description: Kill and retry agent processes that emit no output past a stall timeout. status: backlog estimated_complexity: low blast_radius: contained

PRD: detect-stalled-processes

Executive Summary

Problem Statement

Goals

Non-Goals

User Stories

Functional Requirements

Non-Functional Requirements

Success Criteria

Out of Scope

Dependencies

Verification Plan

Risks & Open Questions

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

name: detect-stalled-processes
description: Kill and retry agent processes that emit no output past a stall timeout.
status: backlog
estimated_complexity: low
blast_radius: contained