@NathanFlurry
Member

No description provided.

@vercel

vercel bot commented Nov 20, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| rivetkit-serverless | Ready | Preview | Comment | Nov 20, 2025 8:01am |

3 Skipped Deployments

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| rivet-cloud | Ignored | | | Nov 20, 2025 8:01am |
| rivet-inspector | Ignored | Preview | | Nov 20, 2025 8:01am |
| rivet-site | Ignored | Preview | | Nov 20, 2025 8:01am |

Member Author

NathanFlurry commented Nov 20, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude bot commented Nov 20, 2025

Code Review: Graceful Runner Shutdown

I've reviewed PR #3497 which fixes graceful runner shutdown. Here are my findings:

✅ Strengths

  1. Proper Race Condition Prevention: The new #shuttingDown flag (lines 453-458 in mod.ts) correctly prevents concurrent shutdown calls, which is a solid improvement.

  2. Graceful Actor Cleanup: The new #waitForActorsToStop() method (lines 587-672) implements a thoughtful polling mechanism with:

    • 100ms check interval (reasonable for responsiveness)
    • 120s timeout (prevents indefinite hangs)
    • Progress logging every 5s (good observability)
    • Early exit conditions (actors stopped, WebSocket closed, timeout)
  3. Better Async Signal Handling: Converting signal handlers to async (lines 414-447) allows proper awaiting of shutdown, fixing potential race conditions where shutdown might not complete before process exit.

  4. Enhanced Debugging: Added structured logging in actor-conn.ts to track action lifecycle (lines 200-205, 469-474, 587-592) which will help debug action tracking issues.
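
For reference, the polling behaviour praised in item 2 can be sketched roughly as below. This is an illustration only, not the code in mod.ts: `waitUntilStopped`, `WaitOptions`, `WaitResult`, and the two callbacks are hypothetical names. Only the 100 ms interval, 120 s timeout, 5 s progress logging, and the three exit conditions come from the PR as described above.

```typescript
// Hypothetical sketch of a polling wait; names are illustrative, not from mod.ts.
interface WaitOptions {
    checkIntervalMs: number; // 100 in the PR, per this review
    timeoutMs: number; // 120_000 in the PR
    progressIntervalMs: number; // 5_000 in the PR
    log?: (msg: string) => void;
}

type WaitResult = "stopped" | "socket-closed" | "timeout";

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function waitUntilStopped(
    activeActorCount: () => number,
    isSocketOpen: () => boolean,
    opts: WaitOptions,
): Promise<WaitResult> {
    const start = Date.now();
    let lastProgress = start;
    for (;;) {
        // Early exits: all actors stopped, socket gone, or deadline passed.
        if (activeActorCount() === 0) return "stopped";
        if (!isSocketOpen()) return "socket-closed";
        const now = Date.now();
        if (now - start >= opts.timeoutMs) return "timeout";
        // Periodic progress log so a slow shutdown stays observable.
        if (now - lastProgress >= opts.progressIntervalMs) {
            opts.log?.(`waiting for ${activeActorCount()} actors to stop`);
            lastProgress = now;
        }
        await sleep(opts.checkIntervalMs);
    }
}
```

Returning the exit reason, rather than resolving with no value, lets the caller log or react differently to a timeout versus a clean stop.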

⚠️ Concerns & Issues

1. CRITICAL: Process Exit Commented Out (Lines 424, 434)

// TODO: Add back
// process.exit(0);

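Issue: Signal handlers no longer exit the process. Because registering a SIGTERM/SIGINT handler in Node.js replaces the default terminate-on-signal behavior, the application will only exit after shutdown completes if nothing is left keeping the event loop alive.

Impact: If any socket, timer, or other handle remains open, the process will hang after receiving a shutdown signal and will need SIGKILL to terminate.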

Recommendation: Either:

  • Re-enable process.exit(0) after awaiting all handlers
  • Add a comment explaining why this was removed and what the new termination mechanism is
  • Ensure the shutdown flow naturally terminates the process
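
A minimal sketch of the first option, assuming the shutdown entry point returns a promise. `installSignalHandlers` and `ProcessLike` are hypothetical names introduced here; `ProcessLike` only mirrors the two members of Node's `process` object that the handler needs, so the logic can be exercised without terminating a real process.

```typescript
// Hypothetical sketch: await graceful shutdown, then exit explicitly.
interface ProcessLike {
    on(signal: "SIGTERM" | "SIGINT", handler: () => void): unknown;
    exit(code: number): void;
}

function installSignalHandlers(
    proc: ProcessLike,
    shutdown: () => Promise<void>,
): void {
    let exiting = false;
    const handle = async () => {
        if (exiting) return; // ignore repeated signals while shutting down
        exiting = true;
        let code = 0;
        try {
            await shutdown();
        } catch {
            code = 1; // shutdown failed; still exit, but report failure
        }
        proc.exit(code);
    };
    proc.on("SIGTERM", () => void handle());
    proc.on("SIGINT", () => void handle());
}
```

In a Node.js entry point this would presumably be wired up as `installSignalHandlers(process, () => runner.shutdown())`, where `runner.shutdown` stands in for whatever shutdown method the PR exposes.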

2. Removed Shutdown Guards May Allow Invalid Operations

Multiple shutdown guards were removed from:

  • sleepActor() (lines 263-269)
  • #sendActorIntent() (lines 1109+)
  • #sendActorState() (lines 1150+)
  • #sendCommandAcknowledgment() (lines 1192+)
  • setAlarm() (lines 1497+)
  • #sendKvRequest() (lines 1530+)
  • __sendToServer() (lines 1606+)

Concern: These methods can now be called during shutdown, potentially sending messages to a closed/closing WebSocket.

Questions:

  • Is this intentional to allow final cleanup messages?
  • Should some guards remain to prevent new operations (vs. cleanup operations)?
  • Could this cause "WebSocket is not open" errors or message loss?

Recommendation: Document why these guards were removed or add selective guards that distinguish between:

  • Cleanup operations (allowed during shutdown)
  • New operations (blocked during shutdown)
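
One way such a selective guard could look is sketched below. Everything here is an assumption for illustration: the `MessageKind` names, the `canSend` helper, and in particular the choice of which message kinds count as cleanup. The author would need to decide that split.

```typescript
// Hypothetical classification of outbound messages; not taken from mod.ts.
type MessageKind =
    | "actor-state"
    | "command-ack"
    | "actor-intent"
    | "kv-request"
    | "set-alarm";

// Assumed cleanup traffic that should still flow while shutting down.
const ALLOWED_DURING_SHUTDOWN: ReadonlySet<MessageKind> = new Set<MessageKind>([
    "actor-state",
    "command-ack",
]);

function canSend(
    kind: MessageKind,
    shuttingDown: boolean,
    socketOpen: boolean,
): boolean {
    if (!socketOpen) return false; // never write to a closed socket
    if (!shuttingDown) return true; // normal operation: everything allowed
    return ALLOWED_DURING_SHUTDOWN.has(kind); // shutdown: cleanup only
}
```

Centralizing the decision in one function keeps the per-method guards to a single line and makes the allowed set easy to audit.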

3. Potential Memory Leak in #waitForActorsToStop()

The interval created at line 658 could leak if the WebSocket closes immediately after the interval is created but before the next tick:

const interval = setInterval(() => {
    if (checkActors()) {
        clearInterval(interval);
        resolve();
    }
}, shutdownCheckInterval);

Issue: If checkActors() never returns true due to an edge case, the interval keeps firing and the shutdown promise never resolves.

Recommendation: Clear the interval when the WebSocket emits its close event, or register it with the existing cleanup path:

ws.addEventListener('close', () => {
    clearInterval(interval);
    resolve();
});
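
A slightly fuller sketch of the same idea routes every exit through one `finish` function, so the interval, the timeout, and the close listener are always released together. `waitWithCleanup` and its parameters are hypothetical; the real method may already handle some of these cases inside checkActors().

```typescript
// Hypothetical sketch: whichever condition fires first releases all resources.
function waitWithCleanup(
    ws: EventTarget,
    checkActors: () => boolean,
    checkIntervalMs: number,
    timeoutMs: number,
): Promise<"done" | "closed" | "timeout"> {
    return new Promise((resolve) => {
        if (checkActors()) {
            resolve("done"); // nothing to wait for, so no timers are created
            return;
        }
        const finish = (reason: "done" | "closed" | "timeout") => {
            clearInterval(interval);
            clearTimeout(timeout);
            ws.removeEventListener("close", onClose);
            resolve(reason);
        };
        const onClose = () => finish("closed");
        const interval = setInterval(() => {
            if (checkActors()) finish("done");
        }, checkIntervalMs);
        const timeout = setTimeout(() => finish("timeout"), timeoutMs);
        ws.addEventListener("close", onClose);
    });
}
```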

4. Missing Test Coverage

No tests were found for shutdown behavior, which is a notable gap given the complexity of this change.

Recommendation: Add tests for:

  • Concurrent shutdown calls (should be idempotent)
  • Shutdown with active actors (should wait)
  • Shutdown timeout scenario
  • WebSocket closes during shutdown
  • Signal handler behavior
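
As a starting point, the first case could be tested along these lines. `FakeRunner` is a hypothetical stand-in, since the real runner class is not shown in this review, and it caches the in-flight promise rather than using the PR's boolean #shuttingDown flag. A real test would target the actual runner and use the project's test framework.

```typescript
// Hypothetical stand-in for the runner; not the class from mod.ts.
class FakeRunner {
    stopCount = 0;
    private shutdownPromise: Promise<void> | null = null;

    shutdown(): Promise<void> {
        // Reuse the in-flight promise so concurrent callers share one shutdown.
        if (this.shutdownPromise === null) {
            this.shutdownPromise = this.doShutdown();
        }
        return this.shutdownPromise;
    }

    private async doShutdown(): Promise<void> {
        await new Promise<void>((r) => setTimeout(r, 10)); // simulate async cleanup
        this.stopCount += 1;
    }
}

async function testConcurrentShutdownIsIdempotent(): Promise<void> {
    const runner = new FakeRunner();
    await Promise.all([runner.shutdown(), runner.shutdown(), runner.shutdown()]);
    if (runner.stopCount !== 1) {
        throw new Error(`expected 1 shutdown, got ${runner.stopCount}`);
    }
}
```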

5. Minor: Inconsistent Logging Levels

In actor-conn.ts, action tracking uses debug level (line 200), but the error case uses error level (line 572). The success case at line 469 also uses debug.

Recommendation: Consider using trace for the frequent success cases to reduce noise in debug logs.

🔍 Questions for Author

  1. What is the plan for process termination now that process.exit(0) is commented out?
  2. Were the removed shutdown guards causing issues during graceful shutdown?
  3. Have you tested this with a large number of actors to ensure the 120s timeout is sufficient?
  4. Should #shutdown and #shuttingDown be consolidated into a single state enum (NOT_STARTED, RUNNING, SHUTTING_DOWN, SHUTDOWN)?
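
To make question 4 concrete, a consolidated state could look roughly like this. The state names are the ones suggested above; the `NEXT` transition table and `transition` helper are hypothetical, and the allowed transitions are an assumption about the intended lifecycle.

```typescript
// Hypothetical sketch of a single lifecycle state replacing two booleans.
type RunnerState = "NOT_STARTED" | "RUNNING" | "SHUTTING_DOWN" | "SHUTDOWN";

// Assumed legal transitions; adjust to the runner's real lifecycle.
const NEXT: Record<RunnerState, readonly RunnerState[]> = {
    NOT_STARTED: ["RUNNING", "SHUTDOWN"],
    RUNNING: ["SHUTTING_DOWN"],
    SHUTTING_DOWN: ["SHUTDOWN"],
    SHUTDOWN: [],
};

function transition(from: RunnerState, to: RunnerState): RunnerState {
    if (!NEXT[from].includes(to)) {
        throw new Error(`invalid transition ${from} -> ${to}`);
    }
    return to;
}
```

With a single state, a second shutdown call while in SHUTTING_DOWN can be detected by inspecting one field instead of reasoning about two flags that might disagree.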

📝 Minor Style Notes

  1. Line 4 in mod.ts: Import order changed (not a real issue, just noting)
  2. The example script change (examples/counter/scripts/connect.ts) removes manual disposal - is this now handled automatically?

Summary

This is a meaningful improvement to shutdown handling, but the commented-out process.exit(0) calls need to be addressed before merging. The removal of shutdown guards should also be documented or reconsidered to prevent potential issues with operations during shutdown.

Overall assessment: Needs revision primarily due to the process exit issue.
