Skip to content

fix(cron): re-arm timer when onTimer rejects unexpectedly#7

Open
suboss87 wants to merge 57 commits into
mainfrom
fix/cron-timer-rearm-on-catch
Open

fix(cron): re-arm timer when onTimer rejects unexpectedly#7
suboss87 wants to merge 57 commits into
mainfrom
fix/cron-timer-rearm-on-catch

Conversation

@suboss87
Copy link
Copy Markdown
Owner

@suboss87 suboss87 commented Apr 28, 2026

Closes openclaw#73166

Problem

When onTimer rejects unexpectedly (e.g. a transient error thrown from inside the finally block's armTimer call due to Node.js internals or GC pressure), the .catch() handler in armTimer's setTimeout callback only logs the error. No new timer is registered, permanently breaking the scheduler chain with no recovery path until the next gateway restart.

Root cause

src/cron/service/timer.ts — the setTimeout callback inside armTimer:

state.timer = setTimeout(() => {
  void onTimer(state).catch((err) => {
    state.deps.log.error({ err: String(err) }, "cron: timer tick failed");
    // Missing: armTimer(state) re-arm
  });
}, clampedDelay);

If onTimer rejects, the catch block logs but does not re-arm. state.timer is left as null (set to null at the top of armTimer before the throw).

Fix

Call armTimer(state) inside the .catch() handler so the scheduler chain survives an unexpected rejection.

Regression test

Added to src/cron/service.armtimer-tight-loop.test.ts: makes nowMs() throw on the 4th call (inside the finally block's armTimer), which causes onTimer to reject. Verifies that log.error is called and state.timer is non-null after the .catch() re-arm. All 4 tests in the file pass.


Generated by Claude Code


Open in Devin Review

claude added 30 commits April 1, 2026 15:34
claude and others added 25 commits April 19, 2026 15:41
Before this fix, if onTimer rejected unexpectedly (e.g. a Node.js
internal error or GC pressure causing an exception in the finally
block's armTimer call), the .catch() handler only logged the error.
The scheduler chain was then permanently broken with no timer set,
silently halting all cron jobs until the next gateway restart.

Fix: call armTimer(state) inside the .catch() handler so a rare
unexpected rejection does not permanently stop the scheduler.

Regression test exercises the path by making nowMs() throw on the
4th call (inside the finally block's armTimer), which causes onTimer
to reject; the .catch() re-arm is then verified via state.timer.

Closes openclaw#73166.

https://claude.ai/code/session_01NHHoPHTrH4F9qFJBJHqjTk
devin-ai-integration[bot]

This comment was marked as resolved.

Copy link
Copy Markdown

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 new potential issue.

View 3 additional findings in Devin Review.

Open in Devin Review

Comment thread src/cron/service/timer.ts
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 armRunningRecheckTimer's .catch() doesn't re-arm, leaving scheduler vulnerable to the same permanent death the PR aims to fix

The PR adds re-arm logic in armTimer's .catch() handler (lines 553-559) to prevent permanent scheduler death when onTimer rejects unexpectedly. However, armRunningRecheckTimer at src/cron/service/timer.ts:572-576 has the same void onTimer(state).catch(...) pattern without the re-arm fix. If onTimer is called from armRunningRecheckTimer's callback with state.running = false (possible via a race where the watchdog fires at the same event-loop tick that the primary onTimer completes), and armTimer throws in onTimer's finally block (line 736), the .catch() at line 573-574 only logs the error without re-arming. At that point, armRunningRecheckTimer's callback cleared the timer that armTimer had set (line 597 calls armRunningRecheckTimer which clears state.timer), and the finally block's armTimer cleared the watchdog before throwing (line 508-509). No active timer remains, permanently killing the scheduler chain — the exact failure mode this PR intends to prevent.

(Refers to lines 572-576)

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cron scheduler silently stops firing after ~2.5 days of gateway uptime

2 participants