Skip to content

Conversation

@NathanFlurry
Copy link
Member

No description provided.

@vercel
Copy link

vercel bot commented Dec 3, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rivetkit-serverless Ready Ready Preview Comment Dec 3, 2025 1:52am
3 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Dec 3, 2025 1:52am
rivet-inspector Ignored Ignored Preview Dec 3, 2025 1:52am
rivet-site Ignored Ignored Preview Dec 3, 2025 1:52am

Copy link
Member Author


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link

claude bot commented Dec 3, 2025

Code Review: fix(rivetkit): handle all top-level errors

Summary

This PR improves error handling across the TypeScript runner and RivetKit by adding proper error handlers to async operations, event handlers, and timers. It also introduces a new Biome linting rule to prevent future floating promises and adds a Claude Code agent for auditing error handling.

✅ Strengths

1. Comprehensive Error Handling Coverage

The PR systematically adds .catch() handlers to previously unhandled async operations:

  • Timer callbacks (setTimeout, setInterval)
  • Event handlers
  • Background promise spawns
  • WebSocket reconnection logic

2. Consistent Error Logging Pattern

All error handlers follow a consistent structured logging pattern:

.catch((err) => {
  this.log?.error({
    msg: "descriptive error message",
    error: stringifyError(err),
  });
});

3. New Error Utility

The stringifyError helper at engine/sdks/typescript/runner/src/utils.ts:161-173 provides robust error serialization that handles various error types (Error objects, strings, objects, primitives).

4. Exported RunnerShutdownError

Moving RunnerShutdownError from tunnel.ts to mod.ts and exporting it allows it to be reused across modules, improving code organization.

5. Proactive Prevention

The Biome linting rules (noFloatingPromises, noMisusedPromises) will prevent these issues from being reintroduced.

🔍 Issues & Concerns

1. Critical: Silent Error Swallowing in State Manager ⚠️

Location: rivetkit-typescript/packages/rivetkit/src/actor/instance/state-manager.ts:259-273

this.#savePersistInner().catch((error) => {
  this.#actor.rLog.error({
    msg: "error saving persist data in scheduled save",
    error: stringifyError(error),
  });
});

Problem: State persistence failures are only logged but not surfaced to the caller or tracked. This could lead to:

  • Data loss without the system realizing it
  • Silent corruption if state saves fail repeatedly
  • Difficult debugging when state isn't being persisted as expected

Recommendation: Consider:

  1. Tracking failed save attempts and implementing retry logic
  2. Exposing save failures through observability metrics
  3. Potentially throwing/propagating critical persistence errors to stop the actor if persistence is consistently failing
  4. At minimum, add a counter or circuit breaker to detect repeated failures

2. Reconnection Error Handling Incomplete

Location: engine/sdks/typescript/runner/src/mod.ts:1785-1790

this.#openPegboardWebSocket().catch((err) => {
  this.log?.error({
    msg: "error during websocket reconnection",
    error: stringifyError(err),
  });
});

Problem: If #openPegboardWebSocket() fails, the error is logged but:

  • No retry is scheduled
  • The system may be left in a disconnected state indefinitely
  • The reconnection backoff counter may not be properly managed

Recommendation: The error handler should likely call #scheduleReconnect() again to ensure the retry loop continues even when connection attempts throw exceptions.

3. Actor Stop Errors May Indicate Resource Leaks

Location: engine/sdks/typescript/runner/src/mod.ts:337-343

this.forceStopActor(actorId).catch((err) => {
  this.log?.error({
    msg: "error stopping actor",
    actorId,
    error: stringifyError(err),
  });
});

Problem: If stopping an actor fails, the actor may remain in an inconsistent state:

  • Resources might not be cleaned up properly
  • The actor might still be in this.#actors map
  • Slots may not be released correctly

Recommendation: Consider tracking actors that fail to stop and implement cleanup fallback logic or forced resource release.

4. Command Handler Error Recovery

Location: engine/sdks/typescript/runner/src/mod.ts:976-992

this.#handleCommandStartActor(commandWrapper).catch((err) => {
  this.log?.error({
    msg: "error handling start actor command",
    actorId: commandWrapper.checkpoint.actorId,
    error: stringifyError(err),
  });
});

Problem: When command handlers fail:

  • The server may not be notified of the failure
  • The command checkpoint may not be updated correctly
  • The actor might be left in an inconsistent state

Recommendation: Consider sending error events back to the server for failed commands so the control plane can take corrective action.

5. Tunnel Message Handler Lacks Context

Location: engine/sdks/typescript/runner/src/mod.ts:852-857

this.#tunnel?.handleTunnelMessage(message.val).catch((err) => {
  this.log?.error({
    msg: "error handling tunnel message",
    error: stringifyError(err),
  });
});

Problem: The error log doesn't include the message type or any identifying information, making debugging difficult.

Recommendation: Add message metadata to the error log:

this.log?.error({
  msg: "error handling tunnel message",
  messageType: message.val.kind?.tag,
  error: stringifyError(err),
});

📋 Minor Observations

1. Biome Schema Update

The schema bump from 2.1.1 to 2.2.3 is good, but ensure all developers update their Biome installations to avoid version conflicts.

2. Claude Code Agent

The new js-error-handler-auditor agent is well-documented with clear examples. Consider running it periodically or in CI to catch new violations.

3. Missing Test Coverage

The PR doesn't include tests for the error handling paths. Consider adding tests that:

  • Trigger errors in timer callbacks
  • Simulate WebSocket reconnection failures
  • Test state persistence failure scenarios

4. Consistent Error Context

Some error handlers include rich context (like actorId) while others don't. Consider standardizing what context should always be included in error logs.

🎯 Recommendations

High Priority

  1. Address state persistence error handling - Implement retry logic or failure tracking
  2. Fix reconnection error loop - Ensure reconnection continues even after errors
  3. Enhance command failure handling - Notify server of command failures

Medium Priority

  1. Add tests for error handling paths
  2. Standardize error log context across all handlers
  3. Document error handling patterns in CLAUDE.md

Low Priority

  1. Run the new error handler auditor agent regularly
  2. Consider adding metrics for error rates by error type

🔒 Security

No security concerns identified. The changes are defensive and improve system resilience.

⚡ Performance

Minimal performance impact. The added error handlers are only executed on error paths.

🧪 Testing

Missing: This PR would benefit from:

  • Unit tests for stringifyError utility
  • Integration tests that trigger error conditions
  • Tests verifying errors are logged with proper context

✨ Overall Assessment

This is a good improvement to the codebase that addresses a real problem (unhandled promise rejections). The systematic approach and consistent patterns are excellent. However, the error handlers should be reviewed to ensure they do more than just log - particularly for critical paths like state persistence and reconnection logic.

Recommendation: Request changes to address the critical state persistence issue and reconnection error handling before merging.


Review generated with Claude Code

@NathanFlurry NathanFlurry force-pushed the 12-02-fix_rivetkit_handle_all_top-level_errors branch from 2ce0680 to f278f7b Compare December 3, 2025 01:50
@graphite-app
Copy link
Contributor

graphite-app bot commented Dec 3, 2025

Merge activity

  • Dec 3, 1:50 AM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Dec 3, 1:51 AM UTC: CI is running for this pull request on a draft pull request (#3579) due to your merge queue CI optimization settings.
  • Dec 3, 1:52 AM UTC: Merged by the Graphite merge queue via draft PR: #3579.

graphite-app bot pushed a commit that referenced this pull request Dec 3, 2025
@graphite-app graphite-app bot closed this Dec 3, 2025
@graphite-app graphite-app bot deleted the 12-02-fix_rivetkit_handle_all_top-level_errors branch December 3, 2025 01:52
@claude
Copy link

claude bot commented Dec 3, 2025

PR Review: fix(rivetkit): handle all top-level errors

Summary

This PR improves error handling across the TypeScript runner and RivetKit codebase by wrapping async operations and event handlers with proper error handling. It also adds a new Biome linting configuration to catch floating promises and misused promises, along with a new Claude agent for auditing error handling.

Positive Changes

1. Comprehensive Error Handling Coverage

The PR systematically addresses unhandled promises and async operations across critical code paths:

  • Timer callbacks (setInterval for KV cleanup, command acknowledgment)
  • WebSocket reconnection logic
  • Command processing (start/stop actor)
  • Tunnel message handling
  • State persistence callbacks

2. Consistent Error Logging

All error handlers use the new stringifyError utility which properly formats different error types (Error objects, strings, plain objects, primitives).

3. Proper Error Types

Introduction of RunnerShutdownError class provides semantic distinction for shutdown-related errors, which are explicitly caught and ignored appropriately.

4. Linting Enforcement

The Biome configuration now enforces noFloatingPromises and noMisusedPromises rules to prevent future issues.

Issues & Concerns

1. Silent Error Handling in Background Operations ⚠️

Severity: Medium

Multiple locations log errors but don't propagate them or trigger recovery logic. For example, in mod.ts:337-343, if forceStopActor fails during shutdown, the actor may remain in an inconsistent state.

Recommendation: Consider tracking failed actor stops and potentially retrying or marking the runner as unhealthy if critical cleanup fails.

2. Race Condition in State Manager ⚠️

Severity: Medium

In state-manager.ts:259-264, 268-273, the error handlers for #savePersistInner use .catch(), but the method also has internal error handling. This creates dual error handling paths that could swallow errors.

Recommendation: Since #savePersistInner already has comprehensive internal error handling, consider re-throwing after logging in the catch handlers to preserve error propagation.

3. Inconsistent Error Handling Pattern ⚠️

Severity: Low

mod.ts:305 uses console.error instead of structured logging.

Recommendation: Change to structured logging with this.log?.error() for consistency.

4. Missing Error Context ℹ️

Severity: Low

Several error logs could benefit from additional context (e.g., reconnection attempt numbers).

Security Considerations

No security issues identified. The changes improve robustness by preventing unhandled promise rejections.

Testing Recommendations

  1. Add tests for error scenarios to verify proper logging without process crashes
  2. Test state manager error handling and retry logic
  3. Verify RunnerShutdownError handling works as expected

Overall Assessment

Verdict: Approve with minor suggestions

This is a solid improvement to error handling. The systematic approach and linting rules will prevent future issues. The concerns raised are mostly about improving observability and error propagation rather than critical bugs.

Priority Fixes:

  1. Fix inconsistent console.error usage (mod.ts:305)
  2. Consider re-throwing in state manager catch handlers
  3. Add more context to error logs

Nice to have:

  • Add metrics/telemetry for error rates
  • Consider circuit breaker pattern for repeated failures
  • Document error handling patterns in CLAUDE.md

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants