
Conversation

@MasterPtato
Contributor

No description provided.

@vercel

vercel bot commented Nov 18, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

4 Skipped Deployments
Project              Deployment  Preview  Updated (UTC)
rivet-cloud          Ignored     Ignored  Nov 18, 2025 2:30am
rivet-inspector      Ignored     Ignored  Nov 18, 2025 2:30am
rivet-site           Ignored     Ignored  Nov 18, 2025 2:30am
rivetkit-serverless  Skipped     Skipped  Nov 18, 2025 2:30am

@vercel vercel bot temporarily deployed to Preview – rivetkit-serverless November 18, 2025 02:30 Inactive
Contributor Author

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude bot commented Nov 18, 2025

PR Review: fix(gas): optimize loops

Summary

This PR optimizes loop performance in the workflow engine by:

  1. Parallelizing database operations using tokio::join!
  2. Reducing database write frequency from every iteration to every 20 iterations (controlled by LOOP_ITERS_PER_COMMIT); a simplified sketch of both changes follows this list
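
For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of what the two changes amount to. The types and function names (Db, commit_loop_event, commit_branch, run_iteration) are placeholders for illustration, not the engine's actual API:

use std::sync::Arc;

const LOOP_ITERS_PER_COMMIT: usize = 20;

struct Db;

impl Db {
    // Stand-in for the loop-event upsert that persists loop state/iteration.
    async fn commit_loop_event(&self, _iteration: usize) {}
    // Stand-in for committing the iteration's branch events.
    async fn commit_branch(&self, _iteration: usize) {}
}

// Stand-in for the user code executed on each iteration.
async fn run_iteration(_iteration: usize) {}

async fn run_loop(db: Arc<Db>, total_iters: usize) {
    for iteration in 0..total_iters {
        if iteration % LOOP_ITERS_PER_COMMIT == 0 {
            // Commit iteration: the loop-event write, the branch write, and the
            // user code are independent, so they run concurrently via tokio::join!.
            tokio::join!(
                db.commit_loop_event(iteration),
                db.commit_branch(iteration),
                run_iteration(iteration)
            );
        } else {
            // Non-commit iteration: the loop-event write is skipped entirely.
            tokio::join!(db.commit_branch(iteration), run_iteration(iteration));
        }
    }
    // Final loop state is still persisted when the loop breaks.
    db.commit_loop_event(total_iters).await;
}

#[tokio::main]
async fn main() {
    run_loop(Arc::new(Db), 50).await;
}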

Code Quality & Best Practices

Strengths:

  • ✅ The optimization approach is sound - parallelizing independent DB operations will reduce latency
  • ✅ Good use of tokio::join! to run loop event commit, branch commit, and user code concurrently
  • ✅ Helpful inline comments explaining the optimization strategy (lines 949-955)
  • ✅ Maintains backward compatibility with existing loop behavior

Concerns:

  1. Missing Safety Documentation ⚠️
    The comment on line 45 mentions "mark previous loop history to forgotten" but this only happens when upsert_workflow_loop_event is called. With the new batching, history is only marked as forgotten every 20 iterations instead of every iteration. This could have implications:

    • The "active" history will accumulate 20 iterations worth of events before being moved to "forgotten" history
    • Consider documenting this behavior change and whether it impacts recovery scenarios
  2. Hard-coded Magic Number ⚠️

    const LOOP_ITERS_PER_COMMIT: usize = 20;

    While defined as a constant, there's no justification for why 20 was chosen. Consider:

    • Adding a comment explaining the rationale (performance vs. recovery granularity tradeoff)
    • Whether this should be configurable based on use case (tight loops with small state vs. expensive operations)
  3. State Loss Risk 🔴 CRITICAL
    If a workflow crashes between iterations (e.g., at iteration 15), the state from iterations 1-14 will be lost since they were never persisted. The workflow will restart from iteration 0 when replayed.

    Example scenario:

    • Loop runs iterations 1-15
    • Workflow crashes at iteration 15
    • State was only committed at iteration 0
    • On replay, workflow restarts from iteration 0, redoing all 15 iterations

    This breaks the workflow engine's durability guarantees. The current behavior ensures every iteration is recoverable.

Potential Bugs

Critical Issue: The optimization introduces a correctness bug. Looking at line 994:

if iteration % LOOP_ITERS_PER_COMMIT == 0 {
    self.db.upsert_workflow_loop_event(...)
}

The modulo check means:

  • Iteration 0: passes the check (0 % 20 = 0), but its commit is handled separately by the initial loop event creation, not this branch
  • Iteration 1-19: NOT committed
  • Iteration 20: committed (20 % 20 = 0)
  • Iteration 21-39: NOT committed
  • etc.

But the loop event was already created at iteration 0 (lines 892-904), so iteration 20 would be the first update. However, if the loop breaks at iteration 15, the final state IS committed (lines 1021-1032), so that case is handled.
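
As a standalone sanity check of that schedule (not the engine's code), the commit iterations implied by the modulo guard can be listed directly:

fn main() {
    const LOOP_ITERS_PER_COMMIT: usize = 20;
    let commits: Vec<usize> = (0..45usize)
        .filter(|&i| i % LOOP_ITERS_PER_COMMIT == 0)
        .collect();
    // Prints [0, 20, 40]; iteration 0 is covered by the initial loop event
    // creation, so iteration 20 is the first batched update in practice.
    println!("{commits:?}");
}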

The real issue: If a workflow crashes/restarts between commits (e.g., at iteration 15), it will replay from iteration 0, not iteration 15. This changes the recovery semantics.

Performance Considerations

Positive:

  • Reducing DB writes from N to N/20 for a loop with N iterations will significantly improve throughput
  • Parallelizing the initial loop event commit with the first iteration's branch commit is clever

Concerns:

  • The tradeoff is between performance and crash recovery granularity
  • For long-running loops (thousands of iterations), losing progress on crash could be expensive
  • Consider profiling with realistic workloads to validate the 20-iteration interval

Security Concerns

No security issues identified. The changes are confined to internal workflow state management.

Test Coverage

Missing tests for the optimization:

  1. No test verifies that state commits happen every 20 iterations
  2. No test validates crash/restart behavior with the new batching
  3. No test for the parallelization of DB operations

Recommended tests:

// Test that loop state is persisted every LOOP_ITERS_PER_COMMIT iterations
#[tokio::test]
async fn test_loop_commit_batching() {
    // Run a loop for 50 iterations
    // Kill and restart the workflow at iteration 35
    // Verify it resumes from iteration 20 (last commit), not iteration 0
}

// Test that final state is always committed on Loop::Break
#[tokio::test]
async fn test_loop_final_state_commit() {
    // Run loop that breaks at arbitrary iteration (e.g., 15)
    // Verify final state is persisted correctly
}

Recommendations

  1. Consider reverting or adding a feature flag - The durability implications need careful evaluation (see the configurability sketch after this list)
  2. Document the tradeoff - Add detailed comments explaining why batching is safe (or make it configurable)
  3. Add comprehensive tests - Especially for crash/recovery scenarios
  4. Profile before merging - Measure actual performance impact with realistic loop workloads
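
A rough illustration of recommendations 1 and 2, using an assumed WorkflowConfig type (not an existing struct in this codebase), where a default interval of 1 preserves today's commit-every-iteration durability:

struct WorkflowConfig {
    // Number of loop iterations between loop-event commits.
    loop_iters_per_commit: usize,
}

impl Default for WorkflowConfig {
    fn default() -> Self {
        // Defaulting to 1 keeps the current per-iteration durability; tight
        // loops with cheap, replayable iterations can opt into a larger value.
        Self {
            loop_iters_per_commit: 1,
        }
    }
}

fn should_commit(iteration: usize, config: &WorkflowConfig) -> bool {
    iteration % config.loop_iters_per_commit.max(1) == 0
}

fn main() {
    let config = WorkflowConfig::default();
    // With the default interval, every iteration commits, matching the old behavior.
    assert!((0..10usize).all(|i| should_commit(i, &config)));
}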

Questions for Author

  1. Have you tested crash/restart behavior with loops that run for 10-50 iterations?
  2. What happens to workflows in-flight when this change is deployed? Will they replay from iteration 0?
  3. Was 20 iterations chosen based on profiling data or intuition?
  4. Are there use cases where loops hold critical state that must be persisted every iteration?

Overall Assessment: The optimization is well-intentioned but potentially introduces a subtle correctness issue around crash recovery. I'd recommend either:

  • Making this configurable with a default that preserves current behavior
  • OR thoroughly documenting the new recovery semantics and adding tests to validate them

Happy to discuss further! 🤖

Generated with Claude Code
