
fix(rivetkit-core): fail closed sqlite during shutdown #4921

Closed
NathanFlurry wants to merge 1 commit into driver-test/complaints/raise-serverless-start-payload-limit from driver-test-complaints/fail-closed-sqlite-during-shtudown

Conversation

@NathanFlurry
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Member Author

NathanFlurry commented May 4, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Contributor

claude Bot commented May 4, 2026

Code Review

Overview

This PR fixes a shutdown-ordering race where "late callbacks" from a dying actor generation could issue SQLite work after the final state was serialized. The fix has two parts:

  1. sqlite.rs — adds a shared Arc<AtomicBool> closed-flag to SqliteDb so all operations fail with sqlite.closed once close() is called, even on cloned handles.
  2. task.rs — moves ctx.sql().cleanup() to run before save_final_state() (previously it ran after in finish_shutdown_cleanup_with_ctx), and parallelizes the three independent post-state teardown tasks with tokio::join!.

The core logic is correct, well-reasoned, and consistent with the project's fail-by-default and fail-closed design principles.
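As a point of reference, the fail-closed gate described above can be sketched like this. This is a hypothetical, simplified shape; Db, ensure_open, and the error string are illustrative stand-ins, not the actual rivetkit-core types.

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

// Simplified stand-in for SqliteDb: only the shared closed-flag.
#[derive(Clone, Default)]
struct Db {
    // Arc so that cloned handles all observe the same close().
    closed: Arc<AtomicBool>,
}

impl Db {
    fn ensure_open(&self) -> Result<(), &'static str> {
        // Acquire pairs with the AcqRel swap in close().
        if self.closed.load(Ordering::Acquire) {
            return Err("sqlite.closed");
        }
        Ok(())
    }

    fn close(&self) {
        // swap makes a second close() a no-op.
        let _was_closed = self.closed.swap(true, Ordering::AcqRel);
    }
}

fn main() {
    let db = Db::default();
    let stale_handle = db.clone(); // e.g. held by a late callback
    assert!(db.ensure_open().is_ok());
    db.close();
    // The cloned handle observes the close too.
    assert_eq!(stale_handle.ensure_open(), Err("sqlite.closed"));
}
```

The key design point is that the flag lives behind an Arc rather than in each handle, so a callback holding an old clone cannot race past the gate.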


Correctness

Double-check in open() is correct. The outer ensure_open() provides a fast-path rejection; the inner one (inside open_lock) closes the TOCTOU window where close() could fire between the two checks. This is the right pattern.

close() idempotency. Using swap(true, Ordering::AcqRel) to make close() a no-op on the second call is clean and thread-safe. The Acquire ordering in ensure_open() correctly pairs with this.

All DB paths go through guarded helpers. Local operations call open() (which calls ensure_open()), then native_db_handle() (which also calls ensure_open()). Remote operations call remote_config() (which calls ensure_open()). After close(), no new work can start.

sync_alarm_logged() removed from Destroy path. The old code called sync_alarm_logged() for both Sleep and Destroy, immediately followed by cancel_driver_alarm_logged() for Destroy. Since cancel_driver_alarm_logged() sets the engine alarm to None anyway, the preceding sync_alarm_logged() was redundant for Destroy. The new per-arm behavior is semantically correct.


Potential Issues

local_exec / local_query / local_run / local_execute re-open the DB after close.
These helpers call self.open().await? before native_db_handle(). open() now checks ensure_open() at the top, so a closed DB is rejected; and if close() fires after open() returns Ok but before native_db_handle() is called, the second ensure_open() inside native_db_handle() catches it. There is no correctness issue, but note that the "DB handle is gone" branch of native_db_handle() is now reachable when db.lock() is None even while closed is still false (e.g. before the first open()). The existing ok_or_else(|| SqliteRuntimeError::Closed.build()) on line 364 handles this by returning Closed; arguably a distinct NotOpen or NotInitialized error variant would distinguish "was closed" from "was never opened," but this is a minor point.
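The variant split suggested above could look roughly like this. SqliteRuntimeError::Closed exists in the PR; the NotOpen variant and the helper are hypothetical illustrations.

```rust
// Hypothetical error split; only `Closed` exists in the PR today.
#[derive(Debug, PartialEq)]
enum SqliteRuntimeError {
    Closed,  // close() was called; fail-closed for the rest of this generation
    NotOpen, // db.lock() is None but `closed` is still false (never opened)
}

// native_db_handle() could pick the variant from the closed-flag instead of
// unconditionally returning Closed when the handle is missing.
fn missing_handle_error(closed: bool) -> SqliteRuntimeError {
    if closed {
        SqliteRuntimeError::Closed
    } else {
        SqliteRuntimeError::NotOpen
    }
}
```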

cleanup() is now a trivial alias for close().

pub(crate) async fn cleanup(&self) -> Result<()> {
    self.close().await
}

Since both close() and cleanup() are now identical and close() is already pub, the cleanup() wrapper adds indirection without adding meaning. Consider either inlining the close() call at the one call site in task.rs, or keeping cleanup() but adding a comment explaining the semantic distinction (if one is intended).

Arc<AtomicBool> is correct but deserves a comment on why Arc.
The comment on the field explains stale cloned handles, which is the right reason. Worth noting that Default::default() for Arc<AtomicBool> yields Arc::new(AtomicBool::new(false)) — this is correct but non-obvious to a new reader.
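The non-obvious default behavior can be checked directly:

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

fn main() {
    // Default::default() for Arc<AtomicBool> wraps AtomicBool::new(false),
    // so a freshly-constructed Db starts in the open state.
    let closed: Arc<AtomicBool> = Default::default();
    assert!(!closed.load(Ordering::Acquire));
}
```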


Shutdown Sequence Ordering

The new sequence is:

1. cleanup() → close SQLite (fail-closed from here on)
2. save_final_state()
3. tokio::join!(teardown_sleep_state, wait_for_pending_state_writes, cleanup_alarm)
4. mark_destroy_completed() (if Destroy)

The rationale is sound: step 1 ensures any late DB callback from this generation fails immediately, preventing a corrupt write that could race with step 2's state serialization. The parallelization in step 3 is safe because the three tasks are independent.

One thing to verify: does save_final_state() itself (or any callback it triggers) attempt SQLite access? If so, it would now fail with Closed. From the code, save_final_state() serializes in-memory actor state to KV — it does not directly use SQLite — so this should be fine. Worth adding a brief comment on that invariant since the ordering is load-bearing.
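The load-bearing ordering can be sketched synchronously. This is a schematic trace, not the real task.rs code, which is async and runs the three step-3 tasks concurrently via tokio::join!; the step names come from the sequence above.

```rust
// Synchronous sketch of the shutdown ordering. In the real code the three
// step-3 tasks run concurrently; here they are listed sequentially.
fn shutdown(log: &mut Vec<&'static str>, is_destroy: bool) {
    log.push("close_sqlite");     // 1. fail-closed from here on
    log.push("save_final_state"); // 2. must not touch SQLite (the invariant)
    // 3. independent teardown tasks (tokio::join! in the real code)
    log.push("teardown_sleep_state");
    log.push("wait_for_pending_state_writes");
    log.push("cleanup_alarm");
    if is_destroy {
        log.push("mark_destroy_completed"); // 4. Destroy only
    }
}

fn main() {
    let mut log = Vec::new();
    shutdown(&mut log, true);
    // SQLite is closed before the final state is saved.
    let close = log.iter().position(|s| *s == "close_sqlite").unwrap();
    let save = log.iter().position(|s| *s == "save_final_state").unwrap();
    assert!(close < save);
}
```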


Test Coverage

The .agent/notes/ file documents that all actor-db* and actor-sleep-db driver tests pass, including the stress case on rerun. However, per CLAUDE.md, tests should prove the fix is effective. There is no new targeted test for the specific race this fixes (late callback after shutdown start, before the old cleanup). The existing stress tests may cover this incidentally, but a deterministic unit test for the ensure_open gate (call close(), then assert subsequent exec/query/execute return sqlite.closed) would make the invariant explicit and prevent regressions.
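A deterministic regression test along the lines suggested could take this shape. Db here is a synchronous stand-in for the real SqliteDb handle; the real test would call the async exec/query/execute paths and match on the actual error type.

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

// Stand-in for SqliteDb: just the closed gate under test.
#[derive(Clone, Default)]
struct Db {
    closed: Arc<AtomicBool>,
}

impl Db {
    fn exec(&self) -> Result<(), &'static str> {
        if self.closed.load(Ordering::Acquire) {
            return Err("sqlite.closed");
        }
        Ok(()) // the real helper would run the statement here
    }
    fn close(&self) {
        self.closed.swap(true, Ordering::AcqRel);
    }
}

fn main() {
    let db = Db::default();
    assert!(db.exec().is_ok());
    db.close();
    // Every operation after close() must fail with sqlite.closed.
    assert_eq!(db.exec(), Err("sqlite.closed"));
    db.close(); // idempotent: a second close changes nothing
    assert_eq!(db.exec(), Err("sqlite.closed"));
}
```

This pins the invariant directly rather than relying on the stress tests to hit the race incidentally.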


Minor Style Notes

  • Log messages "actor shutdown cleanup step completed" are repeated verbatim across all three new run_shutdown_task_* helpers. A step field differentiates them, which is fine, but the repetition across five nearly-identical tracing::debug! blocks is noisy. These could be extracted into a small log_shutdown_step(actor_id, reason, step) helper to reduce duplication.
  • The agent note file (.agent/notes/db-driver-test-rerun-2026-05-03.md) documents the "ltx trailer checksums must be zeroed" transient failure. Since it appeared only once and isolated reruns passed, it looks like pre-existing flakiness unrelated to this change; no action needed here, but worth tracking separately if it recurs.

Summary

The fix is correct and addresses a real bug. The Arc<AtomicBool> closed-flag approach is idiomatic and the fail-closed semantics are consistent with the project's design principles. The main follow-up items are: (1) add a deterministic test for the ensure_open gate, and (2) consider whether cleanup() should be collapsed into close() at the call site.

