Problem
When the Anthropic API returns HTTP 529 (overloaded) during an Iron Loop executor run,
the agent terminates mid-execution. Depending on when the overload hits, this leaves
the plan in one of two states:
- Pre-run overload (no tools executed yet): the plan stays in-progress and the status
  file shows working, but nothing was written. Safe to auto-retry.
- Mid-run overload (some steps completed, files written to disk): same .status state,
  but the implementation is partially on disk. Auto-retry here risks duplicate writes or
  inconsistent state, so a human should review before continuing.
Currently there is no recovery path. The operator must:
- Notice the plan is stuck (no dashboard signal)
- Inspect what was written
- Decide whether to retry or finish manually
This creates real friction on long executor runs (6–10 Iron Loop steps), where a single
529 terminates hours of progress.
Proposed solution (three-layer change)
Layer 1 — executor agent definition
In agents/iron-loop/iron-loop-executor.md, add an instruction to the agent:
If the API returns overloaded (529) before any tool writes have been made in the
current step, write status: "overload-retry" to the plan's .status file and call
ScheduleWakeup with a configurable interval. If writes have already occurred,
write status: "overload-partial" instead, so a human gate can review.
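The pre-run vs mid-run decision in that instruction can be sketched as a small pure function. This is only an illustration under stated assumptions: `statusAfterOverload`, the `writesInStep` counter, and the ISO-8601 format of `retry_at` are hypothetical and not part of the existing codebase.

```javascript
// Hypothetical helper: choose the status to record after a 529.
// Assumptions: `writesInStep` counts tool writes made in the current step,
// and `retry_at` is stored as an ISO-8601 string in the .status file.
function statusAfterOverload({ writesInStep, now, intervalSeconds }) {
  if (writesInStep > 0) {
    // Something is already on disk: hand over to a human gate.
    return { status: "overload-partial" };
  }
  // Nothing was written yet: safe to schedule an automatic retry.
  const retryAt = new Date(now.getTime() + intervalSeconds * 1000);
  return { status: "overload-retry", retry_at: retryAt.toISOString() };
}
```

The function only computes the status record; writing it to the .status file and calling ScheduleWakeup would stay with the caller.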
Layer 2 — state layer (actions.js / background.js)
- Add overload-retry and overload-partial to the status enum.
- In advanceAgent(): if the current plan's status is overload-retry, treat it like
  working and resume from the last completed step marker, rather than skipping to the
  next plan.
- overload-partial should behave like a human gate: block advanceAgent() until a
  human clears it.
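A rough sketch of how advanceAgent() might branch on the two new statuses. The plan shape, the `lastCompletedStep` field, and the returned action names are all assumptions made for illustration; whether a step marker exists at all is one of the open questions in this issue.

```javascript
// Hypothetical decision helper that advanceAgent() could call.
// Assumptions: `plan.status`, `plan.retry_at` (ISO string) and
// `plan.lastCompletedStep` (integer, may be missing) are illustrative fields.
function nextAction(plan, now) {
  switch (plan.status) {
    case "overload-partial":
      // Human gate: do nothing until someone clears the status.
      return { action: "block" };
    case "overload-retry": {
      if (plan.retry_at && new Date(plan.retry_at) > now) {
        return { action: "wait" };
      }
      // Resume after the last completed step, or restart if no marker exists.
      const last = plan.lastCompletedStep;
      return { action: "resume", fromStep: Number.isInteger(last) ? last + 1 : null };
    }
    case "working":
      return { action: "continue" };
    default:
      return { action: "advance" };
  }
}
```

A `fromStep` of `null` stands for "no marker found", which would map to the full-restart fallback.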
Layer 3 — dashboard display (menu-screens.js)
Replace the spinning ◐ with a human-readable indicator:
| Status | Display |
| --- | --- |
| overload-retry | ⏳ retry in Xm |
| overload-partial | ⚠ partial — review |
The "retry in Xm" countdown could read from the .status file's retry_at timestamp.
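One way the countdown text could be derived from `retry_at`, as a sketch only. `retryLabel` is a hypothetical name, and rounding up to whole minutes and clamping at zero are assumptions about the desired display.

```javascript
// Hypothetical formatter for the overload-retry dashboard label.
// Assumption: `retryAt` is an ISO-8601 string read from the .status file.
function retryLabel(retryAt, now) {
  const remainingMs = new Date(retryAt).getTime() - now.getTime();
  // Round up so the label never shows 0m while time is still left,
  // and clamp so an overdue retry does not show a negative value.
  const minutes = Math.max(0, Math.ceil(remainingMs / 60000));
  return `⏳ retry in ${minutes}m`;
}
```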
Config schema (.ctoc/settings.yaml)
retry:
  overload_interval_seconds: 600  # default: 10 min
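A sketch of how the setting could be resolved once the YAML has been parsed into a plain object. How .ctoc/settings.yaml is actually loaded is not covered here; `overloadIntervalSeconds` and the validation rule are assumptions.

```javascript
// Default taken from the proposed schema: 600 seconds (10 minutes).
const DEFAULT_OVERLOAD_INTERVAL_SECONDS = 600;

// Hypothetical resolver: `settings` is the already-parsed settings object.
function overloadIntervalSeconds(settings) {
  const value = settings?.retry?.overload_interval_seconds;
  if (value === undefined || value === null) {
    return DEFAULT_OVERLOAD_INTERVAL_SECONDS;
  }
  // Reject values that would schedule a retry immediately or in the past.
  if (!Number.isFinite(value) || value <= 0) {
    throw new RangeError(
      `retry.overload_interval_seconds must be a positive number, got ${value}`
    );
  }
  return value;
}
```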
Open questions for maintainers
- Preferred layer for retry logic: Should the executor agent drive the retry
  (writing overload-retry + calling ScheduleWakeup itself), or should this be
  handled entirely in the state layer with the agent just exiting cleanly?
- Step-level resume vs full restart: The executor processes steps sequentially
  (7–15). Is there an existing step-marker mechanism that would allow resuming from
  step N, or would a restart from step 7 be the safer default?
- ScheduleWakeup availability: Is ScheduleWakeup available inside executor
  agent context, or does the resume need to go through a different scheduling
  mechanism (e.g., a cron entry in .ctoc/)?
- Scope appetite: Happy to submit a focused PR for just Layer 1 (agent instruction
  only, no state-layer changes) as a first step if that's easier to review.
Happy to contribute this — just want to confirm the preferred approach before writing
code. Flagging the mid-run vs pre-run distinction upfront because it's the design
decision with the most downstream impact.