Skip to content

chore(rivetkit): rewrite work registry + fix waituntil not preventing sleep#4990

Draft
NathanFlurry wants to merge 1 commit intomainfrom
05-06-chore_rivetkit_rewrite_work_registry_fix_waituntil_not_preventing_sleep
Draft

chore(rivetkit): rewrite work registry + fix waituntil not preventing sleep#4990
NathanFlurry wants to merge 1 commit intomainfrom
05-06-chore_rivetkit_rewrite_work_registry_fix_waituntil_not_preventing_sleep

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app
Copy link
Copy Markdown

railway-app Bot commented May 7, 2026

🚅 Deployed to the rivet-pr-4990 environment in rivet-frontend

Service Status Web Updated (UTC)
frontend-inspector ❌ Build Failed (View Logs) Web May 7, 2026 at 4:18 am
frontend-cloud ❌ Build Failed (View Logs) Web May 7, 2026 at 4:15 am
kitchen-sink ❌ Build Failed (View Logs) Web May 7, 2026 at 4:15 am
website 😴 Sleeping (View Logs) Web May 7, 2026 at 4:06 am
ladle ❌ Build Failed (View Logs) Web May 7, 2026 at 12:24 am
mcp-hub ✅ Success (View Logs) Web May 7, 2026 at 12:24 am

Copy link
Copy Markdown
Member Author

NathanFlurry commented May 7, 2026

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 7, 2026

Code Review

Overview

This PR does two things in one: it fixes a real bug (waitUntil work registered inside onSleep races with native runtime state cleanup, making c.vars undefined during the grace period) and it pays down the resulting tech-debt by collapsing five separate per-kind guard types (KeepAwakeGuard, DisconnectCallbackGuard, WebSocketCallbackGuard, etc.) into a single ActorWorkKind enum + ActorWorkPolicy + ActorWorkGuard. The refactor is a clear improvement.


Correctness

Dead code in the teardown loop (sleep.rs ~L768)

abortable_shutdown_tasks.0.shutdown().await;
// This loop is unreachable — shutdown() already drains the JoinSet
while let Some(result) = abortable_shutdown_tasks.0.join_next().await {
    ...
}

JoinSet::shutdown() aborts all remaining tasks and then joins them all before returning. The join_next loop that follows will immediately return None. The second loop over .1 (unabortable tasks) is correct.

wait_for_tracked_shutdown_work uses a 1-year dummy deadline (sleep.rs ~L499)

let deadline = Instant::now() + Duration::from_secs(60 * 60 * 24 * 365);

The intent is "wait forever, but cancel via shutdown_deadline." Using Instant::now() + Duration::MAX (or a named constant) would be more honest than a magic year. A one-line comment explaining why this is an epoch rather than a direct shutdown_deadline.cancelled() await would help future readers understand the wait_for_shutdown_tasks polling architecture.

Return value of wait_for_tracked_shutdown_work is always discarded

The function returns bool (drain completed vs. cancelled by deadline), but every call site does:

await runtime.actorWaitForTrackedShutdownWork(ctx);

Either document that callers are expected to ignore the value, or switch to Promise<void> in the TypeScript interface (or () in Rust) to avoid the implicit "I don't care" contract.

abortable_shutdown_tasks tuple naming is confusing

The local variable returned from the lock-swap block is abortable_shutdown_tasks but holds both abortable (JoinSet<()>) and unabortable (JoinSet<()>) task sets as .0/.1. Destructuring immediately would make the code clearer:

let (mut abortable, mut unabortable) = { ... };

TOCTOU in native spawn_work_inner (minor, matches existing track_shutdown_task pattern)

The teardown_started check happens without holding the JoinSet lock, so teardown could begin between the check and the spawn. The wasm path correctly holds the lock across both operations. This is an existing pattern gap rather than a regression, but worth noting.


Design

  • The ActorWorkPolicy::aborts_at_shutdown_deadline field is only meaningful for detached (spawned) work, but it is also set on KeepAwake and InternalKeepAwake which only ever use begin_work_region. The field is dead for those variants. A doc comment clarifying "only applies to spawn_work paths" would prevent future misuse.

  • idle_work_region has an early if !kind.policy().blocks_idle_sleep { return None; } guard followed by explicit return None arms in the match for WaitUntil and RegisteredTask. The two-pass redundancy is fine for exhaustiveness safety (new enum variants won't silently skip), but a comment noting that intent would help.

  • The change from keepAwake to waitUntil in TrackedWebSocketHandleAdapter (native.ts ~L2161) is correct — the WebSocketCallbackRegion already holds idle sleep, so the promise chain only needs to drain during grace. Confirmed that JS finally blocks run regardless of Rust-side abort, so endWebSocketCallback is always called.


Test Coverage

The new sleepWaitUntilVarsDuringGrace actor and "waitUntil in onSleep keeps c.vars available during grace" test directly reproduce the bug and verify the fix. The negative assertion pattern (expect(output).not.toContain("Cannot set properties of undefined")) is pragmatic given that the bug manifests as a runtime crash string.

One thing to check: waitFor is a custom helper from shared-utils, not vi.waitFor, so it should not trigger pnpm run check:wait-for-comments. Confirm this is the case if CI runs that lint.


Minor style

  • Variable rename shutdown_tasksabortable_shutdown_tasks at ~L653 is an improvement but the tuple approach (abortable_shutdown_tasks.0 / .1) makes the unabortable side hard to spot — prefer destructuring.
  • The PR description is the default boilerplate with all boxes unchecked. Since this is a draft, that's expected, but the summary section should be filled in before moving to review.

Summary

The refactor is architecturally sound and the bug fix is correct. The two actionable items before merge are:

  1. Remove the dead join_next loop after shutdown().await (or change to not call shutdown() and let the existing join_next loop abort-then-drain).
  2. Clarify the wait_for_tracked_shutdown_work return type / discard intent.

Everything else is polish.

@abcxff abcxff mentioned this pull request May 7, 2026
11 tasks
@NathanFlurry NathanFlurry changed the base branch from 05-05-fix_rivetkit_use_keepawake_for_websocket_callback_tracking_to_prevent_c.vars_crash_after_grace_deadline to graphite-base/4990 May 7, 2026 04:14
@NathanFlurry NathanFlurry force-pushed the graphite-base/4990 branch from 0499832 to 395aa83 Compare May 7, 2026 04:14
@NathanFlurry NathanFlurry force-pushed the 05-06-chore_rivetkit_rewrite_work_registry_fix_waituntil_not_preventing_sleep branch from 11f2601 to 61f01e4 Compare May 7, 2026 04:14
@NathanFlurry NathanFlurry changed the base branch from graphite-base/4990 to main May 7, 2026 04:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant