feat: Retry Iron Loop executor on API overload (529) with configurable backoff #6

@davidbijl

Problem

When the Anthropic API returns HTTP 529 (overloaded) during an Iron Loop executor run,
the agent terminates mid-execution. Depending on when the overload hits, this leaves
the plan in one of two states:

  1. Pre-run overload (no tools executed yet): plan stays in-progress, status file
    shows working, but nothing was written. Safe to auto-retry.
  2. Mid-run overload (some steps completed, files written to disk): same .status
    state, but the implementation is partially on disk. Auto-retry here risks duplicate
    writes or inconsistent state — a human should review before continuing.

Currently there is no recovery path. The operator must:

  • Notice the plan is stuck (no dashboard signal)
  • Inspect what was written
  • Decide whether to retry or finish manually

This creates real friction on long executor runs (6–10 Iron Loop steps), where a single
529 terminates hours of progress.


Proposed solution (three-layer change)

Layer 1 — executor agent definition

In agents/iron-loop/iron-loop-executor.md, add an instruction to the agent:

If the API returns overloaded (529) before any tool writes have been made in the
current step, write status: "overload-retry" to the plan's .status file and call
ScheduleWakeup with a configurable interval. If writes have already occurred,
write status: "overload-partial" instead, so a human gate can review.
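
The branching rule in the instruction above can be sketched in JavaScript. This is only an illustration under stated assumptions: `classifyOverload`, the `writesInStep` counter, and the shape of the returned object are hypothetical names, not part of the existing codebase. It does not call ScheduleWakeup, since its signature is not known here.

```javascript
// Hypothetical helper: decide which status to write after an HTTP 529.
// `writesInStep` (number of tool writes made in the current step) and the
// returned object shape are assumptions made for illustration only.
function classifyOverload(httpStatus, writesInStep, intervalSeconds, now = Date.now()) {
  if (httpStatus !== 529) {
    return null; // not an overload; leave normal error handling in charge
  }
  if (writesInStep > 0) {
    // Part of the implementation is already on disk: wait for a human.
    return { status: "overload-partial" };
  }
  // Nothing written yet: safe to retry after the configured interval.
  return {
    status: "overload-retry",
    retry_at: new Date(now + intervalSeconds * 1000).toISOString(),
  };
}
```

The `retry_at` value is stored as an ISO 8601 string so that a dashboard could compute a countdown from it without extra state.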

Layer 2 — state layer (actions.js / background.js)

  • Add overload-retry and overload-partial to the status enum.
  • In advanceAgent(): if the current plan's status is overload-retry, treat it like
    working and resume from the last completed step marker, rather than skipping to the
    next plan.
  • overload-partial should behave like a human gate: block advanceAgent() until a
    human clears it.
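
A rough sketch of how the two new statuses might be handled in the state layer. The real signature of advanceAgent() is not shown in this issue, so the function below is deliberately named `advanceAgentSketch`. The plan shape (`name`, `status`, `lastCompletedStep`) and the returned action objects are assumptions.

```javascript
// Statuses that must block until a human clears them.
const HUMAN_GATED = new Set(["overload-partial"]);

// Hypothetical plan shape: { name, status, lastCompletedStep }.
// Returns a description of what the state layer would do next.
function advanceAgentSketch(plans, currentIndex) {
  const plan = plans[currentIndex];

  if (HUMAN_GATED.has(plan.status)) {
    // Behaves like a human gate: do not advance, do not retry.
    return { action: "blocked", plan: plan.name };
  }

  if (plan.status === "overload-retry") {
    // Stay on the same plan and resume after the last completed step
    // instead of skipping ahead to the next plan.
    return { action: "resume", plan: plan.name, fromStep: plan.lastCompletedStep + 1 };
  }

  // Any other status falls through to the existing behavior, represented
  // here as simply moving on to the next plan if there is one.
  const next = plans[currentIndex + 1];
  return next ? { action: "start", plan: next.name } : { action: "idle" };
}
```

Keeping the human-gated statuses in a set means a future gate status could be added without touching the control flow.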

Layer 3 — dashboard display (menu-screens.js)

Replace the spinner with a human-readable indicator:

Status            Display
overload-retry    ⏳ retry in Xm
overload-partial  ⚠ partial — review

The "retry in Xm" countdown could read from the .status file's retry_at timestamp.
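
One way the display strings could be derived, as a minimal sketch. It assumes the .status file has already been parsed into an object and that `retry_at` is an ISO 8601 string. `formatStatus` is a hypothetical name, not an existing function in menu-screens.js.

```javascript
// Hypothetical formatter for the dashboard status column.
// Assumes `record` is the parsed .status content, e.g.
// { status: "overload-retry", retry_at: "2025-01-01T12:10:00.000Z" }.
function formatStatus(record, now = Date.now()) {
  if (record.status === "overload-partial") {
    return "⚠ partial — review";
  }
  if (record.status === "overload-retry") {
    const msLeft = Date.parse(record.retry_at) - now;
    if (!Number.isFinite(msLeft)) {
      return "⏳ retry pending"; // retry_at missing or unparseable
    }
    // Round up so the display never shows 0m while time still remains.
    const minutes = Math.max(0, Math.ceil(msLeft / 60000));
    return `⏳ retry in ${minutes}m`;
  }
  return record.status; // other statuses keep their current display
}
```

Passing `now` as a parameter keeps the function easy to test and lets the dashboard refresh the countdown on its own tick.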

Config schema (.ctoc/settings.yaml)

retry:
  overload_interval_seconds: 600   # default: 10 min
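
A small sketch of how the setting might be read with its default. It assumes the YAML has already been parsed into a plain object by whatever loader the project uses. `overloadIntervalSeconds` is a hypothetical helper name.

```javascript
// Default taken from the proposed schema: 600 seconds (10 minutes).
const DEFAULT_OVERLOAD_INTERVAL_SECONDS = 600;

// `settings` is assumed to be the parsed content of .ctoc/settings.yaml.
// Falls back to the default when the key is missing or not a positive integer.
function overloadIntervalSeconds(settings) {
  const value = settings?.retry?.overload_interval_seconds;
  return Number.isInteger(value) && value > 0
    ? value
    : DEFAULT_OVERLOAD_INTERVAL_SECONDS;
}
```

Rejecting non-positive or non-integer values avoids scheduling an immediate or malformed wakeup from a typo in the config.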

Open questions for maintainers

  1. Preferred layer for retry logic: Should the executor agent drive the retry
    (writing overload-retry + calling ScheduleWakeup itself), or should this be
    handled entirely in the state layer with the agent just exiting cleanly?

  2. Step-level resume vs full restart: The executor processes its steps (7–15)
    sequentially. Is there an existing step-marker mechanism that would allow resuming
    from step N, or would a restart from step 7 be the safer default?

  3. ScheduleWakeup availability: Is ScheduleWakeup available inside executor
    agent context, or does the resume need to go through a different scheduling
    mechanism (e.g., a cron entry in .ctoc/)?

  4. Scope appetite: Happy to submit a focused PR for just Layer 1 (agent instruction
    only, no state-layer changes) as a first step if that's easier to review.


Happy to contribute this — just want to confirm the preferred approach before writing
code. Flagging the mid-run vs pre-run distinction upfront because it's the design
decision with the most downstream impact.
