Split continuation vs failure backoff

---
name: split-continuation-failure-backoff
description: Continuation retries use a 1s fixed delay; failure retries use exponential backoff capped at max_retry_backoff_ms.
status: backlog
estimated_complexity: low
blast_radius: contained
---

# PRD: split-continuation-failure-backoff

## Executive Summary
Adopt Symphony's two-tier retry policy. Clean exits that re-check whether the issue is still active use a short fixed 1s delay. Failure-driven retries (timeout, crash, agent error) use exponential backoff (10s × 2^(attempt-1)) capped at `max_retry_backoff_ms` (default 5 min). This aligns retry pacing with the actual reason for the retry.

## Problem Statement
Today's retry pacing is uniform regardless of why the previous attempt ended. Continuation retries (issue still active after a clean attempt) wait as long as failure retries, which slows the autonomous loop unnecessarily. Failure retries, lacking exponential backoff, can hammer a struggling API.

## Goals
- Retry scheduler accepts a reason: `continuation` or `failure`.
- Continuation: fixed 1000ms.
- Failure: `min(10000 × 2^(attempt-1), max_retry_backoff_ms)` with default cap 300000ms.
- Existing retry timer for the same thread is cancelled when scheduling a new one.
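The delay rule in the goals above can be sketched as a pure function. This is a minimal sketch, not the project's implementation: the names `RetryReason`, `retryDelayMs`, and the constants are hypothetical, and clamping `attempt` values below 1 to the base delay is an assumption the PRD does not state.

```typescript
// Hypothetical names throughout; only the numbers and the formula come from the PRD.
type RetryReason = "continuation" | "failure";

const CONTINUATION_DELAY_MS = 1_000; // fixed 1s re-check delay
const FAILURE_BASE_DELAY_MS = 10_000; // delay for the first failure retry
const DEFAULT_MAX_RETRY_BACKOFF_MS = 300_000; // default cap: 5 min

function retryDelayMs(
  reason: RetryReason,
  attempt: number,
  maxRetryBackoffMs: number = DEFAULT_MAX_RETRY_BACKOFF_MS,
): number {
  if (reason === "continuation") {
    return CONTINUATION_DELAY_MS;
  }
  // attempt 1 = first retry, so the exponent starts at 0.
  // Assumption: attempts below 1 are treated like attempt 1.
  const exponent = Math.max(0, attempt - 1);
  // For very large exponents 2 ** exponent becomes Infinity; Math.min still returns the cap.
  return Math.min(FAILURE_BASE_DELAY_MS * 2 ** exponent, maxRetryBackoffMs);
}
```

Keeping the calculation free of timers makes the cap and the exponent easy to check in isolation, separately from scheduling and cancellation.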

## Non-Goals
- Retry reasons beyond continuation vs failure (e.g. distinguishing API error vs stall; could come later if useful).
- Per-phase retry budgets.
- Jittered backoff (deterministic for now; can add later if thundering-herd appears).

## User Stories
- As a user running shipcode autonomously, I want continuation re-checks to be near-instant and failure retries to back off, so my queue is responsive but does not hammer external APIs on outage.
  **Acceptance:**
  - Clean exit with issue still active → next attempt fires within ~1s.
  - Three consecutive failures → 10s, 20s, 40s delays observed.
  - Cap honored: at high attempt count the delay never exceeds `max_retry_backoff_ms`.
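As a worked example of the acceptance criteria, the snippet below lists the failure delays for consecutive attempts under the default cap. The helper name `failureDelaySeries` is hypothetical. With a 10s base and a 300s cap, the cap first applies at attempt 6, because 10s × 2^5 = 320s exceeds 300s.

```typescript
// Hypothetical helper: failure delays for attempts 1..count under the given cap.
function failureDelaySeries(count: number, maxRetryBackoffMs: number = 300_000): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt <= count; attempt++) {
    delays.push(Math.min(10_000 * 2 ** (attempt - 1), maxRetryBackoffMs));
  }
  return delays;
}

console.log(failureDelaySeries(7));
// [ 10000, 20000, 40000, 80000, 160000, 300000, 300000 ]
```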

## Functional Requirements
1. Retry queue scheduler accepts `{ threadId, reason: 'continuation' | 'failure', attempt, error? }`.
2. Continuation reason → fixed 1000ms.
3. Failure reason → `Math.min(10000 * 2 ** (attempt - 1), maxRetryBackoffMs)`.
4. Scheduling a new retry for the same thread cancels the existing timer.
5. `maxRetryBackoffMs` is sourced from WORKFLOW.md when present (default 300000).
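One way the requirements above could fit together is a small scheduler that keeps at most one timer per thread. This is a sketch under assumptions, not the existing scheduler's API: `RetryScheduler`, `RetryRequest`, `onRetry`, and `pendingCount` are hypothetical names, and reading the cap from WORKFLOW.md is left to the caller, who passes it to the constructor.

```typescript
type RetryReason = "continuation" | "failure";

// Request shape from requirement 1; the interface name is hypothetical.
interface RetryRequest {
  threadId: string;
  reason: RetryReason;
  attempt: number;
  error?: unknown;
}

class RetryScheduler {
  // At most one pending timer per thread.
  private readonly timers = new Map<string, ReturnType<typeof setTimeout>>();
  private readonly onRetry: (req: RetryRequest) => void;
  private readonly maxRetryBackoffMs: number;

  constructor(onRetry: (req: RetryRequest) => void, maxRetryBackoffMs: number = 300_000) {
    this.onRetry = onRetry;
    this.maxRetryBackoffMs = maxRetryBackoffMs;
  }

  // Schedules a retry and returns the chosen delay in milliseconds.
  schedule(req: RetryRequest): number {
    // Requirement 4: a new retry for the same thread replaces the old timer.
    this.cancel(req.threadId);
    const delayMs =
      req.reason === "continuation"
        ? 1_000
        : Math.min(10_000 * 2 ** (req.attempt - 1), this.maxRetryBackoffMs);
    const handle = setTimeout(() => {
      // Drop the bookkeeping entry before firing so the map never holds a spent handle.
      this.timers.delete(req.threadId);
      this.onRetry(req);
    }, delayMs);
    this.timers.set(req.threadId, handle);
    return delayMs;
  }

  // Cancels the pending retry for a thread; returns false if none was pending.
  cancel(threadId: string): boolean {
    const handle = this.timers.get(threadId);
    if (handle === undefined) return false;
    clearTimeout(handle);
    this.timers.delete(threadId);
    return true;
  }

  get pendingCount(): number {
    return this.timers.size;
  }
}
```

Clearing the handle and deleting the map entry in the same method is what keeps cancellation from leaking timers; the same `cancel` call could be reused when an issue is closed.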

## Non-Functional Requirements
- Timer accounting must not leak handles on cancel.
- Default cap of 5 min is observable in tests with fake timers.

## Success Criteria
- Unit tests with fake timers: continuation fires at +1000ms.
- Failure backoff produces 10s, 20s, 40s, 80s, ..., capped at `maxRetryBackoffMs`.
- Cancel test: scheduling a new retry for the same thread cancels the prior timer.
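The criteria above could be exercised along these lines. To stay independent of any test framework, this sketch uses a hand-rolled `FakeClock` and a tiny stand-in scheduler (`makeScheduler`); both are hypothetical, and a real suite would more likely drive the actual scheduler with its framework's fake timers.

```typescript
// Minimal fake clock: timers fire only when advance() moves time past them.
class FakeClock {
  private now = 0;
  private nextId = 1;
  private pending = new Map<number, { at: number; fn: () => void }>();

  get time(): number { return this.now; }
  get pendingCount(): number { return this.pending.size; }

  setTimeout(fn: () => void, ms: number): number {
    const id = this.nextId++;
    this.pending.set(id, { at: this.now + ms, fn });
    return id;
  }

  clearTimeout(id: number): void {
    this.pending.delete(id);
  }

  // Fires due timers in time order, then lands exactly on the target time.
  advance(ms: number): void {
    const target = this.now + ms;
    for (;;) {
      let dueId: number | undefined;
      let dueAt = Infinity;
      for (const [id, t] of this.pending) {
        if (t.at <= target && t.at < dueAt) { dueId = id; dueAt = t.at; }
      }
      if (dueId === undefined) break;
      const timer = this.pending.get(dueId)!;
      this.pending.delete(dueId);
      this.now = timer.at;
      timer.fn();
    }
    this.now = target;
  }
}

// Stand-in scheduler implementing the PRD's delay rule against the fake clock.
function makeScheduler(clock: FakeClock, fire: (threadId: string) => void, cap = 300_000) {
  const timers = new Map<string, number>();
  return {
    schedule(threadId: string, reason: "continuation" | "failure", attempt: number): void {
      const prior = timers.get(threadId);
      if (prior !== undefined) clock.clearTimeout(prior);
      const delay = reason === "continuation" ? 1_000 : Math.min(10_000 * 2 ** (attempt - 1), cap);
      timers.set(threadId, clock.setTimeout(() => { timers.delete(threadId); fire(threadId); }, delay));
    },
  };
}

function runChecks(): { fired: string[]; pendingAfter: number } {
  const fired: string[] = [];
  const clock = new FakeClock();
  const sched = makeScheduler(clock, (id) => fired.push(`${id}@${clock.time}`));

  sched.schedule("t1", "continuation", 1);
  clock.advance(999); // nothing fires before +1000ms
  clock.advance(1); // continuation fires at t=1000

  sched.schedule("t2", "failure", 3); // would fire at t=41000
  sched.schedule("t2", "failure", 1); // replaces it; fires at t=11000
  clock.advance(40_000); // t=41000: the cancelled timer must not fire

  sched.schedule("t3", "failure", 20); // capped at 300000ms, due at t=341000
  clock.advance(299_999);
  clock.advance(1);

  return { fired, pendingAfter: clock.pendingCount };
}

console.log(runChecks());
// { fired: [ 't1@1000', 't2@11000', 't3@341000' ], pendingAfter: 0 }
```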

## Out of Scope
- Pulling retry orchestration out of the existing scheduler — keep wiring local.
- Persisting retry queue across process restart.
- UI surface for retry timing.

## Dependencies
- #65 (config source for cap).
- Existing retry hooks in `packages/pipeline/src/issue-group-scheduler.ts` or equivalent.
- Symphony spec: https://github.com/openai/symphony/blob/main/SPEC.md#84-retry-and-backoff

## Verification Plan
- **tests:** `packages/pipeline/src/issue-group-scheduler.test.ts`: new cases for continuation delay, failure exponential, cap, cancel.
- **manual:** with daemon (#70) running, fail an attempt and observe the backoff in logs; confirm successful re-attempt afterward.

## Risks & Open Questions
- Attempt count semantics: does attempt 1 mean first retry or first attempt? Match Symphony spec (attempt 1 = first retry, exponent 0 → 10s).
- Cancellation on issue close: should closing an issue also cancel its pending retry timers? Yes; integrate with #68.
