
Split continuation vs failure backoff #74

@VincentShipsIt

name: split-continuation-failure-backoff
description: Continuation retries use a 1s fixed delay; failure retries use exponential backoff capped at max_retry_backoff_ms.
status: backlog
estimated_complexity: low
blast_radius: contained

PRD: split-continuation-failure-backoff

Executive Summary

Adopt Symphony's two-tier retry policy. Clean exits that re-check whether the issue is still active use a short fixed 1s delay. Failure-driven retries (timeout, crash, agent error) use exponential backoff (10s × 2^(attempt-1)) capped at `max_retry_backoff_ms` (default 5 min, i.e. 300000ms). This aligns retry pacing with the actual reason for the retry.

Problem Statement

Today's retry pacing is uniform regardless of why the previous attempt ended. Continuation retries (issue still active after a clean attempt) wait as long as failure retries, which slows the autonomous loop unnecessarily. Failure retries, in turn, have no exponential backoff and can hammer a struggling API.

Goals

  • Retry scheduler accepts a reason: `continuation` or `failure`.
  • Continuation: fixed 1000ms.
  • Failure: `min(10000 × 2^(attempt-1), max_retry_backoff_ms)` with default cap 300000ms.
  • Existing retry timer for the same thread is cancelled when scheduling a new one.
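
The delay rules in the goals above can be sketched as a pure function. This is a minimal illustration, not the shipcode implementation: the names `RetryReason`, `computeRetryDelayMs`, and the constants are hypothetical, and clamping attempts below 1 to the base delay is an assumption the PRD does not state.

```typescript
// Hypothetical names; only the formula and default values come from the goals above.
type RetryReason = "continuation" | "failure";

const CONTINUATION_DELAY_MS = 1_000;
const FAILURE_BASE_DELAY_MS = 10_000;
const DEFAULT_MAX_RETRY_BACKOFF_MS = 300_000;

function computeRetryDelayMs(
  reason: RetryReason,
  attempt: number,
  maxRetryBackoffMs: number = DEFAULT_MAX_RETRY_BACKOFF_MS,
): number {
  // Continuation retries ignore the attempt count entirely.
  if (reason === "continuation") return CONTINUATION_DELAY_MS;

  // Attempt 1 is treated as the first retry, so the exponent starts at 0.
  // Assumption: attempts below 1 are clamped rather than rejected.
  const exponent = Math.max(0, attempt - 1);
  return Math.min(FAILURE_BASE_DELAY_MS * 2 ** exponent, maxRetryBackoffMs);
}
```

Keeping the computation free of timers means the cap and the exponent convention can be tested without any fake clock.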

Non-Goals

  • Finer-grained retry reasons beyond continuation vs failure (e.g. distinguishing API error vs stall; could come later if useful).
  • Per-phase retry budgets.
  • Jittered backoff (deterministic for now; can add later if thundering-herd appears).

User Stories

  • As a user running shipcode autonomously, I want continuation re-checks to be near-instant and failure retries to back off, so my queue is responsive but does not hammer external APIs on outage.
    Acceptance:
    • Clean exit with issue still active → next attempt fires within ~1s.
    • Three consecutive failures → 10s, 20s, 40s delays observed.
    • Cap honored: at high attempt count the delay never exceeds `max_retry_backoff_ms`.

Functional Requirements

  1. Retry queue scheduler accepts `{ threadId, reason: 'continuation' | 'failure', attempt, error? }`.
  2. Continuation reason → fixed 1000ms.
  3. Failure reason → `Math.min(10000 * 2 ** (attempt - 1), maxRetryBackoffMs)`.
  4. Scheduling a new retry for the same thread cancels the existing timer.
  5. `maxRetryBackoffMs` is sourced from WORKFLOW.md when present (default 300000).
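
One way requirements 1 to 4 could fit together is sketched below. It is an illustration under assumptions, not the existing scheduler: `RetryScheduler`, `TimerApi`, `onRetry`, and `pendingCount` are hypothetical names, and injecting the timer functions is a design choice made here so that tests can substitute a fake clock.

```typescript
type RetryReason = "continuation" | "failure";

interface RetryRequest {
  threadId: string;
  reason: RetryReason;
  attempt: number;
  error?: unknown;
}

// Hypothetical abstraction over setTimeout/clearTimeout so a fake clock can be injected.
interface TimerApi<H> {
  set(callback: () => void, delayMs: number): H;
  clear(handle: H): void;
}

class RetryScheduler<H> {
  private readonly timers = new Map<string, H>();
  private readonly timerApi: TimerApi<H>;
  private readonly onRetry: (request: RetryRequest) => void;
  private readonly maxRetryBackoffMs: number;

  constructor(
    timerApi: TimerApi<H>,
    onRetry: (request: RetryRequest) => void,
    maxRetryBackoffMs: number = 300_000,
  ) {
    this.timerApi = timerApi;
    this.onRetry = onRetry;
    this.maxRetryBackoffMs = maxRetryBackoffMs;
  }

  // Schedules a retry and returns the delay that was chosen.
  schedule(request: RetryRequest): number {
    // Requirement 4: at most one pending timer per thread.
    this.cancel(request.threadId);

    const delayMs =
      request.reason === "continuation"
        ? 1_000
        : Math.min(10_000 * 2 ** (request.attempt - 1), this.maxRetryBackoffMs);

    const handle = this.timerApi.set(() => {
      // Drop the bookkeeping entry before running the retry so no handle lingers.
      this.timers.delete(request.threadId);
      this.onRetry(request);
    }, delayMs);
    this.timers.set(request.threadId, handle);
    return delayMs;
  }

  // Cancels the pending retry for a thread, if any; returns whether one existed.
  cancel(threadId: string): boolean {
    if (!this.timers.has(threadId)) return false;
    this.timerApi.clear(this.timers.get(threadId) as H);
    this.timers.delete(threadId);
    return true;
  }

  get pendingCount(): number {
    return this.timers.size;
  }
}
```

In production the `TimerApi` would presumably wrap `setTimeout` and `clearTimeout`; in tests it can be a small in-memory map of callbacks, which also makes handle leaks on cancel directly observable through `pendingCount`.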

Non-Functional Requirements

  • Timer accounting must not leak handles on cancel.
  • Default cap of 5 min is observable in tests with fake timers.

Success Criteria

  • Unit tests with fake timers: continuation fires at +1000ms.
  • Failure backoff produces 10s, 20s, 40s, 80s, ..., capped at `maxRetryBackoffMs`.
  • Cancel test: scheduling a new retry for the same thread cancels the prior timer.
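
As a worked check on where the cap takes over with the default values (assuming attempt 1 is the first retry, per the Symphony convention noted under Risks), the uncapped delay first reaches the cap at the smallest attempt n satisfying:

```math
10000 \cdot 2^{\,n-1} \ge 300000 \iff 2^{\,n-1} \ge 30 \iff n \ge 6
```

So attempts 1 through 5 produce 10s, 20s, 40s, 80s, and 160s, and attempt 6 (320s uncapped) is the first to be clamped to 300s. A cap test under the default therefore needs at least six consecutive failures, or a smaller configured `maxRetryBackoffMs`.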

Out of Scope

  • Pulling retry orchestration out of the existing scheduler — keep wiring local.
  • Persisting retry queue across process restart.
  • UI surface for retry timing.

Dependencies

Verification Plan

  • tests: `packages/pipeline/src/issue-group-scheduler.test.ts` — new cases for continuation delay, failure exponential, cap, cancel.
  • manual: with the daemon ("Add daemon mode for label dispatch", #70) running, fail an attempt and observe the backoff delay in the logs; confirm that the subsequent re-attempt succeeds.

Risks & Open Questions

  • Attempt count semantics: does attempt 1 mean the first retry or the first attempt? Match the Symphony spec: attempt 1 is the first retry, so the exponent is 0 and the delay is 10s.
  • Cancellation on issue close: should closing an issue also cancel its pending retry timers? Yes; integrate with "Cancel pipelines on issue state change" (#68).

Metadata

  • Assignees: none
  • Labels: enhancement
  • Type: none
  • Project status: Done
  • Milestone: none
  • Relationships: none
  • Development: no branches or pull requests