
Conversation

@MasterPtato
Contributor

No description provided.

@vercel

vercel bot commented Nov 18, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

4 Skipped Deployments
Project              Deployment  Preview  Updated (UTC)
rivet-cloud          Ignored     Ignored  Nov 18, 2025 2:30am
rivet-inspector      Ignored     Ignored  Nov 18, 2025 2:30am
rivet-site           Ignored     Ignored  Nov 18, 2025 2:30am
rivetkit-serverless  Skipped     Skipped  Nov 18, 2025 2:30am

@vercel vercel bot temporarily deployed to Preview – rivetkit-serverless November 18, 2025 02:30 Inactive
Contributor Author

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude bot commented Nov 18, 2025

PR Review: fix(gas): optimize loops

Summary

This PR optimizes loop performance in the workflow engine by:

  1. Parallelizing database operations using tokio::join!
  2. Reducing database write frequency from every iteration to every 20 iterations (controlled by LOOP_ITERS_PER_COMMIT); a simplified sketch of both changes follows this list
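
For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of what the two changes amount to. The types and function names (Db, commit_loop_event, commit_branch, run_iteration) are placeholders for illustration, not the engine's actual API:

use std::sync::Arc;

const LOOP_ITERS_PER_COMMIT: usize = 20;

struct Db;

impl Db {
    // Stand-in for the loop-event upsert that persists loop state/iteration.
    async fn commit_loop_event(&self, _iteration: usize) {}
    // Stand-in for committing the iteration's branch events.
    async fn commit_branch(&self, _iteration: usize) {}
}

// Stand-in for the user code executed on each iteration.
async fn run_iteration(_iteration: usize) {}

async fn run_loop(db: Arc<Db>, total_iters: usize) {
    for iteration in 0..total_iters {
        if iteration % LOOP_ITERS_PER_COMMIT == 0 {
            // Commit iteration: the loop-event write, the branch write, and the
            // user code are independent, so they run concurrently via tokio::join!.
            tokio::join!(
                db.commit_loop_event(iteration),
                db.commit_branch(iteration),
                run_iteration(iteration)
            );
        } else {
            // Non-commit iteration: the loop-event write is skipped entirely.
            tokio::join!(db.commit_branch(iteration), run_iteration(iteration));
        }
    }
    // Final loop state is still persisted when the loop breaks.
    db.commit_loop_event(total_iters).await;
}

#[tokio::main]
async fn main() {
    run_loop(Arc::new(Db), 50).await;
}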

Code Quality & Best Practices

Strengths:

  • ✅ The optimization approach is sound - parallelizing independent DB operations will reduce latency
  • ✅ Good use of tokio::join! to run loop event commit, branch commit, and user code concurrently
  • ✅ Helpful inline comments explaining the optimization strategy (lines 949-955)
  • ✅ Maintains backward compatibility with existing loop behavior

Concerns:

  1. Missing Safety Documentation ⚠️
    The comment on line 45 mentions "mark previous loop history to forgotten" but this only happens when upsert_workflow_loop_event is called. With the new batching, history is only marked as forgotten every 20 iterations instead of every iteration. This could have implications:

    • The "active" history will accumulate 20 iterations worth of events before being moved to "forgotten" history
    • Consider documenting this behavior change and whether it impacts recovery scenarios
  2. Hard-coded Magic Number ⚠️

    const LOOP_ITERS_PER_COMMIT: usize = 20;

    While defined as a constant, there's no justification for why 20 was chosen. Consider:

    • Adding a comment explaining the rationale (performance vs. recovery granularity tradeoff)
    • Whether this should be configurable based on use case (tight loops with small state vs. expensive operations)
  3. State Loss Risk 🔴 CRITICAL
    If a workflow crashes between iterations (e.g., at iteration 15), the state from iterations 1-14 will be lost since they were never persisted. The workflow will restart from iteration 0 when replayed.

    Example scenario:

    • Loop runs iterations 1-15
    • Workflow crashes at iteration 15
    • State was only committed at iteration 0
    • On replay, workflow restarts from iteration 0, redoing all 15 iterations

    This breaks the workflow engine's durability guarantees. The current behavior ensures every iteration is recoverable.

Potential Bugs

Critical Issue: The optimization introduces a correctness bug. Looking at line 994:

if iteration % LOOP_ITERS_PER_COMMIT == 0 {
    self.db.upsert_workflow_loop_event(...)
}

The modulo check means:

  • Iteration 0: passes the check (0 % 20 = 0), but its commit is handled separately by the initial loop event creation, not this branch
  • Iteration 1-19: NOT committed
  • Iteration 20: committed (20 % 20 = 0)
  • Iteration 21-39: NOT committed
  • etc.

But the loop event was already created at iteration 0 (lines 892-904), so iteration 20 would be the first update. However, if the loop breaks at iteration 15, the final state IS committed (lines 1021-1032), so that case is handled.
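
As a standalone sanity check of that schedule (not the engine's code), the commit iterations implied by the modulo guard can be listed directly:

fn main() {
    const LOOP_ITERS_PER_COMMIT: usize = 20;
    let commits: Vec<usize> = (0..45usize)
        .filter(|&i| i % LOOP_ITERS_PER_COMMIT == 0)
        .collect();
    // Prints [0, 20, 40]; iteration 0 is covered by the initial loop event
    // creation, so iteration 20 is the first batched update in practice.
    println!("{commits:?}");
}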

The real issue: If a workflow crashes/restarts between commits (e.g., at iteration 15), it will replay from iteration 0, not iteration 15. This changes the recovery semantics.

Performance Considerations

Positive:

  • Reducing DB writes from N to N/20 for a loop with N iterations will significantly improve throughput
  • Parallelizing the initial loop event commit with the first iteration's branch commit is clever

Concerns:

  • The tradeoff is between performance and crash recovery granularity
  • For long-running loops (thousands of iterations), losing progress on crash could be expensive
  • Consider profiling with realistic workloads to validate the 20-iteration interval

Security Concerns

No security issues identified. The changes are confined to internal workflow state management.

Test Coverage

Missing tests for the optimization:

  1. No test verifies that state commits happen every 20 iterations
  2. No test validates crash/restart behavior with the new batching
  3. No test for the parallelization of DB operations

Recommended tests:

// Test that loop state is persisted every LOOP_ITERS_PER_COMMIT iterations
#[tokio::test]
async fn test_loop_commit_batching() {
    // Run a loop for 50 iterations
    // Kill and restart the workflow at iteration 35
    // Verify it resumes from iteration 20 (last commit), not iteration 0
}

// Test that final state is always committed on Loop::Break
#[tokio::test]
async fn test_loop_final_state_commit() {
    // Run loop that breaks at arbitrary iteration (e.g., 15)
    // Verify final state is persisted correctly
}

Recommendations

  1. Consider reverting or adding a feature flag - The durability implications need careful evaluation (see the configurability sketch after this list)
  2. Document the tradeoff - Add detailed comments explaining why batching is safe (or make it configurable)
  3. Add comprehensive tests - Especially for crash/recovery scenarios
  4. Profile before merging - Measure actual performance impact with realistic loop workloads
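
A rough illustration of recommendations 1 and 2, using an assumed WorkflowConfig type (not an existing struct in this codebase), where a default interval of 1 preserves today's commit-every-iteration durability:

struct WorkflowConfig {
    // Number of loop iterations between loop-event commits.
    loop_iters_per_commit: usize,
}

impl Default for WorkflowConfig {
    fn default() -> Self {
        // Defaulting to 1 keeps the current per-iteration durability; tight
        // loops with cheap, replayable iterations can opt into a larger value.
        Self {
            loop_iters_per_commit: 1,
        }
    }
}

fn should_commit(iteration: usize, config: &WorkflowConfig) -> bool {
    iteration % config.loop_iters_per_commit.max(1) == 0
}

fn main() {
    let config = WorkflowConfig::default();
    // With the default interval, every iteration commits, matching the old behavior.
    assert!((0..10usize).all(|i| should_commit(i, &config)));
}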

Questions for Author

  1. Have you tested crash/restart behavior with loops that run for 10-50 iterations?
  2. What happens to workflows in-flight when this change is deployed? Will they replay from iteration 0?
  3. Was 20 iterations chosen based on profiling data or intuition?
  4. Are there use cases where loops hold critical state that must be persisted every iteration?

Overall Assessment: The optimization is well-intentioned but potentially introduces a subtle correctness issue around crash recovery. I'd recommend either:

  • Making this configurable with a default that preserves current behavior
  • OR thoroughly documenting the new recovery semantics and adding tests to validate them

Happy to discuss further! 🤖

Generated with Claude Code
