Skip to content

fix(rivetkit-core): fail closed sqlite during shutdown#4912

Closed
NathanFlurry wants to merge 4 commits into05-03-chore_depot-client_migrate_to_worker_threadfrom
driver-test-complaints/fail-closed-sqlite-shutdown
Closed

fix(rivetkit-core): fail closed sqlite during shutdown#4912
NathanFlurry wants to merge 4 commits into05-03-chore_depot-client_migrate_to_worker_threadfrom
driver-test-complaints/fail-closed-sqlite-shutdown

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Copy link
Copy Markdown
Member Author

NathanFlurry commented May 3, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 3, 2026

Code Review: fix(rivetkit-core): fail closed sqlite during shutdown

Overview

This PR fixes a correctness bug where stale cloned c.db handles could reopen a local SQLite worker or keep sending remote SQL after actor cleanup. The fix adds a shared closed: Arc<AtomicBool> flag to SqliteDb, moves SQLite cleanup before final state serialization, parallelizes the remaining shutdown tasks, and adds a malloc_trim call after actor shutdown to reclaim native allocator memory. It also ships new tooling (memory soak harness, VFS metrics, registry diagnostics) and bumps the serverless start payload default to 16 MiB.


Correctness Issues

1. sync_alarm_logged() dropped on the Destroy path (potential bug)

In the old finish_shutdown_cleanup_with_ctx, sync_alarm_logged() was called unconditionally before the match reason arm, so it ran for both Sleep and Destroy. In the new run_shutdown_task_cleanup_alarm, it only runs inside the ShutdownKind::Sleep arm. The Destroy path now calls only cancel_driver_alarm_logged() + wait_for_pending_alarm_writes. Whether omitting sync_alarm_logged on Destroy is intentional should be verified — if not, this is a silent regression.

2. Double impl EnvoyHandle block in handle.rs

After extracting decode_serverless_actor_start_payload as a free function the diff adds a second impl EnvoyHandle { ... } block. This compiles fine in Rust but is clearly unintentional — the trailing methods (send_kv_request, etc.) should stay inside the original impl block rather than starting a new one.

3. onSleep/onDestroy are now unconditionally registered

Previously these callbacks were only registered when the user provided them (onSleep ? wrapNativeCallback(...) : undefined). They are now always registered as live NAPI callbacks. Core is expected to always fire them regardless. Confirm that rivetkit-core never optimised the None callback path in a way that produced different observable behaviour (e.g. skip the callback entirely, skipping the new closeDatabase/clearNativeRuntimeState side-effects).


Design & Convention Notes

4. trim_native_allocator_after_shutdown on every shutdown

Calling malloc_trim(0) after every actor stop is intentionally aggressive and was validated against the soak results. The unsafe extern "C" block is sound (standard glibc symbol). The concern is CPU overhead under high-concurrency shutdown bursts — each call can scan the full arena freelist. Consider a heuristic (e.g. only trim when the actor had SQLite enabled, or rate-limit trims) if the per-shutdown cost shows up in profiling.

5. JsSqliteVfsMetrics counter fields are f64 in NAPI

This is correct for NAPI (JS numbers are all f64), but the field names (commitCount, pageCacheEntries, etc.) suggest integer semantics. Documenting that these are counts-as-floats would help callers who might do integer arithmetic on them.

6. normalizeNativeMetrics dual-key fallback (camelCase + snake_case)

The helper reads both raw[camelKey] and raw[snakeKey] to support both NAPI (camelCase) and wasm (snake_case) backends. This works but is worth a one-line comment explaining the dual naming — a future reader may wonder why the fallback exists.

7. MALLOC_ARENA_MAX / MALLOC_TRIM_THRESHOLD_ defaults set only on the spawned engine

set_default_allocator_env applies only to the child engine process. The kitchen-sink serverless runner itself only gets these variables from dev:serverless or the user's environment. Consider noting in a comment that the caller process must set these independently if the allocator savings matter for the serverless host too.


Test Coverage

8. No test for ensure_open() fail-closed invariant

The shared closed flag is the core correctness fix. There's no new test asserting that a SqliteDb returns Closed errors after close() is called, or that a second open() attempt after close fails. The new action_dispatch_blocks_idle_sleep_until_reply test covers the keep-awake fix well, but the fail-closed invariant deserves its own unit test.

9. wait_actor_registered_then_stopped has no test

This new method drives the serverless SSE stopping event. A unit test that verifies the function returns on actor deregistration (and does not hang when the actor never registers) would give confidence in the notification race.


Minor / Style

10. Inconsistent indentation in proc-metrics-report.ts — the cpuChart, ioChart, and cycle latency scatter trace have inner properties indented one extra level relative to the rest of the file. No functional issue, but worth cleaning up before merge.

11. Checklist not filled out — the PR description checklist is blank. For a correctness fix touching the shutdown critical path, ticking at least the self-review and test items would be appropriate.


Summary

The core SQLite fail-closed fix and the shutdown ordering change are sound. The main items to verify before merge are: (1) whether sync_alarm_logged being dropped on the Destroy path is intentional, (2) the duplicate impl block in handle.rs, and (3) adding a test for the ensure_open invariant. Everything else is minor or advisory.

@NathanFlurry NathanFlurry changed the base branch from 05-03-chore_depot-client_migrate_to_worker_thread to graphite-base/4912 May 3, 2026 23:40
@NathanFlurry NathanFlurry force-pushed the driver-test-complaints/fail-closed-sqlite-shutdown branch 2 times, most recently from bec22e6 to 707cf74 Compare May 4, 2026 00:04
@NathanFlurry NathanFlurry reopened this May 4, 2026
This was referenced May 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant