Skip to content

fix(workflow-engine): prevent loop state from being persisted during rollback replay#4386

Merged
NathanFlurry merged 1 commit intomainfrom
queue-next-timeout
Mar 8, 2026
Merged

fix(workflow-engine): prevent loop state from being persisted during rollback replay#4386
NathanFlurry merged 1 commit intomainfrom
queue-next-timeout

Conversation

@NathanFlurry
Copy link
Member

Description

Fixes a bug where loop entry state (iteration count and state value) was being incorrectly persisted during rollback replay. Previously, the loop's progress was unconditionally updated every historyPruneInterval iterations, even when the engine was replaying history during a rollback. This corrupted the saved loop progress by advancing the persisted iteration counter while merely replaying previously-completed iterations.

The fix separates two concerns: state persistence (which should only happen outside rollback mode) and history pruning (which is safe and desirable regardless of mode).

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • rollback.test.ts: Verifies that when a loop iteration fails and triggers a rollback, the persisted loop iteration count reflects only the progress made before the failure (not the replayed iterations).
  • sleep.test.ts: Verifies that after a sleep inside a loop, resuming the workflow does not re-execute previous loop iterations.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective

@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4386 March 8, 2026 22:09 Destroyed
@railway-app
Copy link

railway-app bot commented Mar 8, 2026

🚅 Deployed to the rivet-pr-4386 environment in rivet-frontend

Service Status Web Updated (UTC)
mcp-hub ✅ Success (View Logs) Web Mar 8, 2026 at 10:10 pm
ladle ❌ Build Failed (View Logs) Web Mar 8, 2026 at 10:09 pm
website 🕒 Building (View Logs) Web Mar 8, 2026 at 10:09 pm
frontend-inspector 🕒 Building (View Logs) Web Mar 8, 2026 at 10:09 pm
frontend-cloud 🕒 Building (View Logs) Web Mar 8, 2026 at 10:09 pm

@NathanFlurry NathanFlurry merged commit b8de8f6 into main Mar 8, 2026
13 of 18 checks passed
@NathanFlurry NathanFlurry deleted the queue-next-timeout branch March 8, 2026 22:10
@claude
Copy link

claude bot commented Mar 8, 2026

Code Review

Summary

This PR fixes a real bug: during rollback replay, the loop's persisted iteration and state were being overwritten with replayed values every historyPruneInterval iterations, corrupting the checkpoint. The fix is well-targeted and the separation of concerns (state persistence vs. history pruning) is the right approach.


Core Fix (context.ts)

Correctness — looks good.

The original code bundled two unrelated concerns into the same historyPruneInterval block:

  1. Updating the persisted state/iteration on the entry.
  2. Collecting and flushing loop history deletions.

The fix correctly pulls (1) out into a !rollbackMode guard and leaves (2) to run unconditionally. This is exactly right: pruning stale history entries during replay is safe and desirable; updating the persisted iteration counter during replay is not.

Subtle behavior change worth noting.

Before the fix, entry.dirty = true (with updated state/iteration) was only set every historyPruneInterval iterations. After the fix, entry.dirty = true is set on every non-rollback iteration. In practice this doesn't change flush frequency (flushes still only happen at the prune interval via deferredFlush and at loop break), but it does mean the latest state/iteration will be included in any flush that does happen to run. This is strictly more correct — it removes a class of subtle state-loss between prune intervals. Still, worth being aware this is a secondary behavior change.

At loop break — already correct (pre-existing).

The entry.kind.data.iteration = iteration and entry.dirty = true inside the "break" branch (lines 773–778) run unconditionally, including in rollback mode. That looks intentional: when the loop completes (even during a rollback replay path), the final output and metadata need to be persisted.


Tests

rollback.test.ts — the core regression test.

The test is clear and targets the right behavior. One minor observation: it asserts expect(rollbacks).toContain("outside") but doesn't verify the loop-internal rollbacks (loop-0, loop-1). Adding those checks would make it easier for future readers to understand exactly what was (and wasn't) rolled back. Not a blocker, just a suggestion.

The type-narrowing guard pattern used here:

expect(loopEntry?.kind.type).toBe("loop");
if (loopEntry?.kind.type !== "loop") {
  throw new Error("Expected loop entry");
}

is a clean way to satisfy TypeScript without suppressing the assertion — good pattern.

sleep.test.ts — covers a related replay path.

Testing both "yield" and "live" modes is the right call. The yield path (two separate runWorkflow calls across a real timer wait) is the more realistic scenario and the test handles it well with the 30ms wait.

The live mode path is less obviously exercising the same code path (since sleep resolution happens inline), but it still validates the seen side-effect array stays at ["first"].


Minor Nits

  • The updated comment // Periodically defer the flush so the next iteration can overlap with loop pruning and any pending dirty state writes. is clearer than the original. No issues.
  • The rollback test reaches Loop.continue(2) in the second run (after shouldFail = false), but the workflow hits ctx.loop which has no Loop.break. The test only asserts on the rejection path, so the Loop.continue(2) branch is never reached — the test effectively just validates the failure + rollback case. This is intentional and fine.

Overall

The fix is minimal, correct, and well-tested. The two tests together cover both the rollback corruption case and the sleep-resume case. This is ready to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant