fix(rivetkit): handle all top-level errors #3578

NathanFlurry · 2025-12-03T01:40:15Z

No description provided.

vercel · 2025-12-03T01:40:20Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Preview	Comments	Updated (UTC)
rivetkit-serverless	Ready	Preview	Comment	Dec 3, 2025 1:52am

3 Skipped Deployments

Project	Deployment	Preview	Updated (UTC)
rivet-cloud	Ignored		Dec 3, 2025 1:52am
rivet-inspector	Ignored	Preview	Dec 3, 2025 1:52am
rivet-site	Ignored	Preview	Dec 3, 2025 1:52am

NathanFlurry · 2025-12-03T01:40:36Z

fix(rivetkit): handle all top-level errors #3578 👈 (View in Graphite)
main

How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

_{An organization admin has enabled the Graphite Merge Queue in this repository.} _{Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.}

This stack of pull requests is managed by Graphite. Learn more about stacking.

claude · 2025-12-03T01:42:28Z

Code Review: fix(rivetkit): handle all top-level errors

Summary

This PR improves error handling across the TypeScript runner and RivetKit by adding proper error handlers to async operations, event handlers, and timers. It also introduces a new Biome linting rule to prevent future floating promises and adds a Claude Code agent for auditing error handling.

✅ Strengths

1. Comprehensive Error Handling Coverage

The PR systematically adds .catch() handlers to previously unhandled async operations:

Timer callbacks (setTimeout, setInterval)
Event handlers
Background promise spawns
WebSocket reconnection logic

2. Consistent Error Logging Pattern

All error handlers follow a consistent structured logging pattern:

.catch((err) => {
  this.log?.error({
    msg: "descriptive error message",
    error: stringifyError(err),
  });
});

3. New Error Utility

The stringifyError helper at engine/sdks/typescript/runner/src/utils.ts:161-173 provides robust error serialization that handles various error types (Error objects, strings, objects, primitives).

4. Exported RunnerShutdownError

Moving RunnerShutdownError from tunnel.ts to mod.ts and exporting it allows it to be reused across modules, improving code organization.

5. Proactive Prevention

The Biome linting rules (noFloatingPromises, noMisusedPromises) will prevent these issues from being reintroduced.

🔍 Issues & Concerns

1. Critical: Silent Error Swallowing in State Manager ⚠️

Location: rivetkit-typescript/packages/rivetkit/src/actor/instance/state-manager.ts:259-273

this.#savePersistInner().catch((error) => {
  this.#actor.rLog.error({
    msg: "error saving persist data in scheduled save",
    error: stringifyError(error),
  });
});

Problem: State persistence failures are only logged but not surfaced to the caller or tracked. This could lead to:

Data loss without the system realizing it
Silent corruption if state saves fail repeatedly
Difficult debugging when state isn't being persisted as expected

Recommendation: Consider:

Tracking failed save attempts and implementing retry logic
Exposing save failures through observability metrics
Potentially throwing/propagating critical persistence errors to stop the actor if persistence is consistently failing
At minimum, add a counter or circuit breaker to detect repeated failures

2. Reconnection Error Handling Incomplete

Location: engine/sdks/typescript/runner/src/mod.ts:1785-1790

this.#openPegboardWebSocket().catch((err) => {
  this.log?.error({
    msg: "error during websocket reconnection",
    error: stringifyError(err),
  });
});

Problem: If #openPegboardWebSocket() fails, the error is logged but:

No retry is scheduled
The system may be left in a disconnected state indefinitely
The reconnection backoff counter may not be properly managed

Recommendation: The error handler should likely call #scheduleReconnect() again to ensure the retry loop continues even when connection attempts throw exceptions.

3. Actor Stop Errors May Indicate Resource Leaks

Location: engine/sdks/typescript/runner/src/mod.ts:337-343

this.forceStopActor(actorId).catch((err) => {
  this.log?.error({
    msg: "error stopping actor",
    actorId,
    error: stringifyError(err),
  });
});

Problem: If stopping an actor fails, the actor may remain in an inconsistent state:

Resources might not be cleaned up properly
The actor might still be in this.#actors map
Slots may not be released correctly

Recommendation: Consider tracking actors that fail to stop and implement cleanup fallback logic or forced resource release.

4. Command Handler Error Recovery

Location: engine/sdks/typescript/runner/src/mod.ts:976-992

this.#handleCommandStartActor(commandWrapper).catch((err) => {
  this.log?.error({
    msg: "error handling start actor command",
    actorId: commandWrapper.checkpoint.actorId,
    error: stringifyError(err),
  });
});

Problem: When command handlers fail:

The server may not be notified of the failure
The command checkpoint may not be updated correctly
The actor might be left in an inconsistent state

Recommendation: Consider sending error events back to the server for failed commands so the control plane can take corrective action.

5. Tunnel Message Handler Lacks Context

Location: engine/sdks/typescript/runner/src/mod.ts:852-857

this.#tunnel?.handleTunnelMessage(message.val).catch((err) => {
  this.log?.error({
    msg: "error handling tunnel message",
    error: stringifyError(err),
  });
});

Problem: The error log doesn't include the message type or any identifying information, making debugging difficult.

Recommendation: Add message metadata to the error log:

this.log?.error({
  msg: "error handling tunnel message",
  messageType: message.val.kind?.tag,
  error: stringifyError(err),
});

📋 Minor Observations

1. Biome Schema Update

The schema bump from 2.1.1 to 2.2.3 is good, but ensure all developers update their Biome installations to avoid version conflicts.

2. Claude Code Agent

The new js-error-handler-auditor agent is well-documented with clear examples. Consider running it periodically or in CI to catch new violations.

3. Missing Test Coverage

The PR doesn't include tests for the error handling paths. Consider adding tests that:

Trigger errors in timer callbacks
Simulate WebSocket reconnection failures
Test state persistence failure scenarios

4. Consistent Error Context

Some error handlers include rich context (like actorId) while others don't. Consider standardizing what context should always be included in error logs.

🎯 Recommendations

High Priority

Address state persistence error handling - Implement retry logic or failure tracking
Fix reconnection error loop - Ensure reconnection continues even after errors
Enhance command failure handling - Notify server of command failures

Medium Priority

Add tests for error handling paths
Standardize error log context across all handlers
Document error handling patterns in CLAUDE.md

Low Priority

Run the new error handler auditor agent regularly
Consider adding metrics for error rates by error type

🔒 Security

No security concerns identified. The changes are defensive and improve system resilience.

⚡ Performance

Minimal performance impact. The added error handlers are only executed on error paths.

🧪 Testing

Missing: This PR would benefit from:

Unit tests for stringifyError utility
Integration tests that trigger error conditions
Tests verifying errors are logged with proper context

✨ Overall Assessment

This is a good improvement to the codebase that addresses a real problem (unhandled promise rejections). The systematic approach and consistent patterns are excellent. However, the error handlers should be reviewed to ensure they do more than just log - particularly for critical paths like state persistence and reconnection logic.

Recommendation: Request changes to address the critical state persistence issue and reconnection error handling before merging.

Review generated with Claude Code

graphite-app · 2025-12-03T01:50:43Z

Merge activity

Dec 3, 1:50 AM UTC: NathanFlurry added this pull request to the Graphite merge queue.
Dec 3, 1:51 AM UTC: CI is running for this pull request on a draft pull request (#3579) due to your merge queue CI optimization settings.
Dec 3, 1:52 AM UTC: Merged by the Graphite merge queue via draft PR: #3579.

claude · 2025-12-03T01:54:01Z

PR Review: fix(rivetkit): handle all top-level errors

Summary

This PR improves error handling across the TypeScript runner and RivetKit codebase by wrapping async operations and event handlers with proper error handling. It also adds a new Biome linting configuration to catch floating promises and misused promises, along with a new Claude agent for auditing error handling.

Positive Changes

1. Comprehensive Error Handling Coverage ✅

The PR systematically addresses unhandled promises and async operations across critical code paths:

Timer callbacks (setInterval for KV cleanup, command acknowledgment)
WebSocket reconnection logic
Command processing (start/stop actor)
Tunnel message handling
State persistence callbacks

2. Consistent Error Logging ✅

All error handlers use the new stringifyError utility which properly formats different error types (Error objects, strings, plain objects, primitives).

3. Proper Error Types ✅

Introduction of RunnerShutdownError class provides semantic distinction for shutdown-related errors, which are explicitly caught and ignored appropriately.

4. Linting Enforcement ✅

The Biome configuration now enforces noFloatingPromises and noMisusedPromises rules to prevent future issues.

Issues & Concerns

1. Silent Error Handling in Background Operations ⚠️

Severity: Medium

Multiple locations log errors but don't propagate them or trigger recovery logic. For example, in mod.ts:337-343, if forceStopActor fails during shutdown, the actor may remain in an inconsistent state.

Recommendation: Consider tracking failed actor stops and potentially retrying or marking the runner as unhealthy if critical cleanup fails.

2. Race Condition in State Manager ⚠️

Severity: Medium

In state-manager.ts:259-264, 268-273, the error handlers for #savePersistInner use .catch(), but the method also has internal error handling. This creates dual error handling paths that could swallow errors.

Recommendation: Since #savePersistInner already has comprehensive internal error handling, consider re-throwing after logging in the catch handlers to preserve error propagation.

3. Inconsistent Error Handling Pattern ⚠️

Severity: Low

mod.ts:305 uses console.error instead of structured logging.

Recommendation: Change to structured logging with this.log?.error() for consistency.

4. Missing Error Context ℹ️

Severity: Low

Several error logs could benefit from additional context (e.g., reconnection attempt numbers).

Security Considerations

✅ No security issues identified. The changes improve robustness by preventing unhandled promise rejections.

Testing Recommendations

Add tests for error scenarios to verify proper logging without process crashes
Test state manager error handling and retry logic
Verify RunnerShutdownError handling works as expected

Overall Assessment

Verdict: Approve with minor suggestions ✅

This is a solid improvement to error handling. The systematic approach and linting rules will prevent future issues. The concerns raised are mostly about improving observability and error propagation rather than critical bugs.

Priority Fixes:

Fix inconsistent console.error usage (mod.ts:305)
Consider re-throwing in state manager catch handlers
Add more context to error logs

Nice to have:

Add metrics/telemetry for error rates
Consider circuit breaker pattern for repeated failures
Document error handling patterns in CLAUDE.md

vercel bot had a problem deploying to Preview – rivetkit-serverless December 3, 2025 01:40 Failure

fix(rivetkit): handle all top-level errors

f278f7b

NathanFlurry force-pushed the 12-02-fix_rivetkit_handle_all_top-level_errors branch from 2ce0680 to f278f7b Compare December 3, 2025 01:50

graphite-app bot pushed a commit that referenced this pull request Dec 3, 2025

fix(rivetkit): handle all top-level errors (#3578)

d1fda82

graphite-app bot closed this Dec 3, 2025

graphite-app bot deleted the 12-02-fix_rivetkit_handle_all_top-level_errors branch December 3, 2025 01:52

vercel bot deployed to Preview – rivetkit-serverless December 3, 2025 01:52 View deployment

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(rivetkit): handle all top-level errors #3578

fix(rivetkit): handle all top-level errors #3578

NathanFlurry commented Dec 3, 2025

Uh oh!

vercel bot commented Dec 3, 2025 •

edited

Loading

Uh oh!

NathanFlurry commented Dec 3, 2025

Uh oh!

claude bot commented Dec 3, 2025

Uh oh!

graphite-app bot commented Dec 3, 2025 •

edited

Loading

Uh oh!

claude bot commented Dec 3, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

fix(rivetkit): handle all top-level errors #3578

fix(rivetkit): handle all top-level errors #3578

Conversation

NathanFlurry commented Dec 3, 2025

Uh oh!

vercel bot commented Dec 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NathanFlurry commented Dec 3, 2025

How to use the Graphite Merge Queue

Uh oh!

claude bot commented Dec 3, 2025

Code Review: fix(rivetkit): handle all top-level errors

Summary

✅ Strengths

1. Comprehensive Error Handling Coverage

2. Consistent Error Logging Pattern

3. New Error Utility

4. Exported RunnerShutdownError

5. Proactive Prevention

🔍 Issues & Concerns

1. Critical: Silent Error Swallowing in State Manager ⚠️

2. Reconnection Error Handling Incomplete

3. Actor Stop Errors May Indicate Resource Leaks

4. Command Handler Error Recovery

5. Tunnel Message Handler Lacks Context

📋 Minor Observations

1. Biome Schema Update

2. Claude Code Agent

3. Missing Test Coverage

4. Consistent Error Context

🎯 Recommendations

High Priority

Medium Priority

Low Priority

🔒 Security

⚡ Performance

🧪 Testing

✨ Overall Assessment

Uh oh!

graphite-app bot commented Dec 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merge activity

Uh oh!

claude bot commented Dec 3, 2025

PR Review: fix(rivetkit): handle all top-level errors

Summary

Positive Changes

1. Comprehensive Error Handling Coverage ✅

2. Consistent Error Logging ✅

3. Proper Error Types ✅

4. Linting Enforcement ✅

Issues & Concerns

1. Silent Error Handling in Background Operations ⚠️

2. Race Condition in State Manager ⚠️

3. Inconsistent Error Handling Pattern ⚠️

4. Missing Error Context ℹ️

Security Considerations

Testing Recommendations

Overall Assessment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

vercel bot commented Dec 3, 2025 •

edited

Loading

graphite-app bot commented Dec 3, 2025 •

edited

Loading