fix(rivetkit): restore hibernatable sockets and hydrate serverless starts #4658

Draft
NathanFlurry wants to merge 1 commit into break-up/expose-hibernation-metadata from break-up/restore-hibernating-sockets

@NathanFlurry
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Member Author

NathanFlurry commented Apr 14, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude bot commented Apr 14, 2026

PR Review: fix(rivetkit): restore hibernatable sockets and hydrate serverless starts

Note: This PR is currently in DRAFT state.

Overview

This PR fixes two related hibernation issues in the engine actor driver:

  1. Hibernatable socket restoration - Properly rebinds in-flight WebSocket connections when an actor wakes from hibernation.
  2. Serverless start hydration - Hydrates the CommandStartActor payload with persisted hibernating-socket metadata at serverless start time.

It also refactors the test suite to use retry-based polling (readAfterSleepCycle) instead of fixed waitFor delays, and updates gateway routing test paths from /api/hello to /request/api/hello.


actor-driver.ts

Good patterns:

  • The new disposeDynamicRuntime and disposeAllDynamicRuntimes methods replace a bare .clear() call with proper async dispose() calls.
  • Force-stopping remaining actors when ENVOY_STOP_WAIT_MS expires prevents the process from hanging indefinitely.
  • The detach pattern on binding objects cleanly decouples listener registration from binding lifetime.

Issues and suggestions:

1. Error path leaves dangling binding in bindHibernatableConnectSocket - If routeWebSocket throws, the binding stays in hibernatableConnectBindings with no listeners attached, so nothing will ever clean it up. Consider wrapping the call in try/catch and calling deleteHibernatableConnectBinding + websocket.close on failure.

2. changed mutation inside Promise.all - Setting changed = true inside async callbacks is safe (the callbacks run on a single thread, so the writes cannot race) but non-obvious. Consider deriving it after Promise.all resolves: commands.some((c, i) => c !== decoded.val[i]).

3. O(n) scan in rebind methods - Both rebindHibernatableConnectSockets and rebindDynamicHibernatableRunnerWebSockets filter all bindings to find those for a single actorId, so each rebind is O(n) in the total number of bindings, however few belong to that actor. Consider indexing bindings by actorId.

4. Stale proxyToActorWs between rebinds - When isRestoringHibernatable is set and event.reason === dynamic.runtime.disposed, onProxyClose returns early. As a result, the old proxyToActorWs may remain open between rebinds.

5. Logging inconsistency - One log site uses err: instead of error: (around the dynamic hibernating requests restore block). All other log sites use error:.
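
Suggestion 1 above can be sketched as follows. This is an illustration under stated assumptions, not the code in actor-driver.ts: `ConnectBindings`, the `requestId` key, the `Binding` shape, and the `route` callback (standing in for routeWebSocket) are hypothetical, and close code 1011 is only one reasonable choice for a server-side socket.

```typescript
// Hypothetical minimal socket surface; the real code uses a WebSocket-like object.
interface ClosableSocket {
  close(code?: number, reason?: string): void;
}

// Hypothetical binding shape.
interface Binding {
  actorId: string;
  socket: ClosableSocket;
}

class ConnectBindings {
  private readonly bindings = new Map<string, Binding>();

  has(requestId: string): boolean {
    return this.bindings.has(requestId);
  }

  // `route` stands in for routeWebSocket: it attaches listeners and may throw.
  async bind(
    requestId: string,
    binding: Binding,
    route: (binding: Binding) => Promise<void>,
  ): Promise<void> {
    this.bindings.set(requestId, binding);
    try {
      await route(binding);
    } catch (error) {
      // Roll back so no listener-less binding is left in the map,
      // then close the socket so the peer is not left waiting.
      this.bindings.delete(requestId);
      binding.socket.close(1011, "failed to route hibernatable socket");
      throw error;
    }
  }
}
```

Rethrowing after the rollback keeps the failure visible to the caller while ensuring the map only ever holds bindings whose listeners were attached.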


Test changes (actor-sleep.ts, actor-conn-hibernation.ts)

Good patterns:

  • readAfterSleepCycle with bounded retries replaces fragile fixed-wait waitFor calls.
  • The invariant startCount === sleepCount + 1 is more correct than the old startCount === 2 && sleepCount === 1.
  • Making the hibernation suite .sequential prevents shared-state race conditions.
  • Adding a 90-second suite timeout avoids silent hangs.
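
To make the first two points concrete, here is a minimal sketch of bounded polling against the startCount === sleepCount + 1 invariant. The helper name `readAfterSleepCycleSketch`, its parameters, the default values, and the `SleepCounts` shape are assumptions; the real readAfterSleepCycle in the test suite may be structured differently.

```typescript
// Hypothetical counters exposed by the actor under test.
interface SleepCounts {
  startCount: number;
  sleepCount: number;
}

const delay = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Poll `read` until the actor has completed at least one sleep/wake cycle,
// giving up after `maxAttempts` reads instead of waiting a fixed time.
async function readAfterSleepCycleSketch(
  read: () => Promise<SleepCounts>,
  maxAttempts = 20,
  intervalMs = 50,
): Promise<SleepCounts> {
  let last: SleepCounts | undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    last = await read();
    // An awake actor that has slept n times has started n + 1 times,
    // so this holds whether it slept once or several times while polling.
    if (last.sleepCount >= 1 && last.startCount === last.sleepCount + 1) {
      return last;
    }
    if (attempt < maxAttempts) await delay(intervalMs);
  }
  throw new Error(
    `no completed sleep cycle after ${maxAttempts} attempts; last=${JSON.stringify(last)}`,
  );
}
```

The relative invariant tolerates an extra sleep cycle sneaking in on a slow machine, which the exact startCount === 2 && sleepCount === 1 check would report as a failure.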

Minor observations:

1. waitForHibernatableRegistration magic number - The 100ms wait is undocumented. A brief comment explaining what it is waiting for would help future readers.

2. readAfterSleepCycle retries on all errors - The function retries up to maxAttempts times regardless of error type, so a permanent failure is only reported after every attempt has been used. Consider failing fast on errors that cannot succeed on retry.

3. alarms keep actor awake test semantic change - The original asserted exactly sleepCount === 1, verifying that the alarm delayed the first sleep. The new version asserts sleepCount >= 1. Probably fine, but a comment noting this trade-off would be helpful.
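
One way to address observation 2 is to let the caller classify errors, as in the sketch below. `retryTransient`, the `isRetryable` predicate, and the `permanent` tag used by the example classifier are hypothetical names introduced for illustration; which errors count as permanent in this test suite is for the author to decide.

```typescript
// Retry `op` up to `maxAttempts` times, but rethrow immediately when
// `isRetryable` says the error cannot succeed on a later attempt.
async function retryTransient<T>(
  op: () => Promise<T>,
  maxAttempts: number,
  isRetryable: (error: unknown) => boolean,
): Promise<T> {
  if (maxAttempts < 1) throw new RangeError("maxAttempts must be at least 1");
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await op();
    } catch (error) {
      if (!isRetryable(error)) throw error; // permanent: fail fast
      lastError = error; // transient: remember it and try again
    }
  }
  throw lastError;
}

// Example classifier (assumption): errors tagged `permanent: true`
// are not retried; everything else is treated as transient.
const isTransient = (error: unknown): boolean =>
  !(
    typeof error === "object" &&
    error !== null &&
    (error as { permanent?: boolean }).permanent === true
  );
```

A delay between attempts is omitted here for brevity; in a polling test it would sit after the transient branch.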


gateway-routing.ts

All URL changes consistently add a /request/ prefix segment. No issues, but the PR description could note what router change this corresponds to.

NathanFlurry force-pushed the break-up/expose-hibernation-metadata branch from 26f98bc to fde1e0b on April 15, 2026 02:40
NathanFlurry force-pushed the break-up/restore-hibernating-sockets branch from ccc38b5 to 8293235 on April 15, 2026 02:40
NathanFlurry force-pushed the break-up/expose-hibernation-metadata branch from fde1e0b to 789b9cd on April 15, 2026 02:50
NathanFlurry force-pushed the break-up/restore-hibernating-sockets branch from 8293235 to 1ba6a3b on April 15, 2026 02:50
NathanFlurry force-pushed the break-up/restore-hibernating-sockets branch from 1ba6a3b to de87a8b on April 15, 2026 06:55