
fix(rivetkit-core): fail closed sqlite during shutdown #4921

Closed
NathanFlurry wants to merge 1 commit into driver-test/complaints/raise-serverless-start-payload-limit from driver-test-complaints/fail-closed-sqlite-during-shtudown

Conversation

@NathanFlurry
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Member Author

NathanFlurry commented May 4, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Contributor

claude Bot commented May 4, 2026

Code Review

Overview

This PR fixes a shutdown-ordering race where "late callbacks" from a dying actor generation could issue SQLite work after the final state was serialized. The fix has two parts:

  1. sqlite.rs — adds a shared Arc<AtomicBool> closed-flag to SqliteDb so all operations fail with sqlite.closed once close() is called, even on cloned handles.
  2. task.rs — moves ctx.sql().cleanup() to run before save_final_state() (previously it ran after in finish_shutdown_cleanup_with_ctx), and parallelizes the three independent post-state teardown tasks with tokio::join!.

The core logic is correct, well-reasoned, and consistent with the project's fail-by-default and fail-closed design principles.
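As a point of reference, the fail-closed gate described above can be sketched like this. This is a hypothetical, simplified shape; Db, ensure_open, and the error string are illustrative stand-ins, not the actual rivetkit-core types.

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

// Simplified stand-in for SqliteDb: only the shared closed-flag.
#[derive(Clone, Default)]
struct Db {
    // Arc so that cloned handles all observe the same close().
    closed: Arc<AtomicBool>,
}

impl Db {
    fn ensure_open(&self) -> Result<(), &'static str> {
        // Acquire pairs with the AcqRel swap in close().
        if self.closed.load(Ordering::Acquire) {
            return Err("sqlite.closed");
        }
        Ok(())
    }

    fn close(&self) {
        // swap makes a second close() a no-op.
        let _was_closed = self.closed.swap(true, Ordering::AcqRel);
    }
}

fn main() {
    let db = Db::default();
    let stale_handle = db.clone(); // e.g. held by a late callback
    assert!(db.ensure_open().is_ok());
    db.close();
    // The cloned handle observes the close too.
    assert_eq!(stale_handle.ensure_open(), Err("sqlite.closed"));
}
```

The key design point is that the flag lives behind an Arc rather than in each handle, so a callback holding an old clone cannot race past the gate.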


Correctness

Double-check in open() is correct. The outer ensure_open() provides a fast-path rejection; the inner one (inside open_lock) closes the TOCTOU window where close() could fire between the two checks. This is the right pattern.

close() idempotency. Using swap(true, Ordering::AcqRel) to make close() a no-op on the second call is clean and thread-safe. The Acquire ordering in ensure_open() correctly pairs with this.

All DB paths go through guarded helpers. Local operations call open() (which calls ensure_open()), then native_db_handle() (which also calls ensure_open()). Remote operations call remote_config() (which calls ensure_open()). After close(), no new work can start.

sync_alarm_logged() removed from Destroy path. The old code called sync_alarm_logged() for both Sleep and Destroy, immediately followed by cancel_driver_alarm_logged() for Destroy. Since cancel_driver_alarm_logged() sets the engine alarm to None anyway, the preceding sync_alarm_logged() was redundant for Destroy. The new per-arm behavior is semantically correct.


Potential Issues

local_exec / local_query / local_run / local_execute re-open the DB after close.
These helpers call self.open().await? before native_db_handle(). open() now checks ensure_open() at the top, so a closed DB is rejected; and if close() fires after open() returns Ok but before native_db_handle() is called, the second ensure_open() inside native_db_handle() catches it. There is no correctness issue, but note that the "DB handle is gone" branch of native_db_handle() is now reachable when db.lock() is None even while closed is still false (e.g. before the first open()). The existing ok_or_else(|| SqliteRuntimeError::Closed.build()) on line 364 handles this by returning Closed; arguably a distinct NotOpen or NotInitialized error variant would distinguish "was closed" from "was never opened," but this is a minor point.
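The variant split suggested above could look roughly like this. SqliteRuntimeError::Closed exists in the PR; the NotOpen variant and the helper are hypothetical illustrations.

```rust
// Hypothetical error split; only `Closed` exists in the PR today.
#[derive(Debug, PartialEq)]
enum SqliteRuntimeError {
    Closed,  // close() was called; fail-closed for the rest of this generation
    NotOpen, // db.lock() is None but `closed` is still false (never opened)
}

// native_db_handle() could pick the variant from the closed-flag instead of
// unconditionally returning Closed when the handle is missing.
fn missing_handle_error(closed: bool) -> SqliteRuntimeError {
    if closed {
        SqliteRuntimeError::Closed
    } else {
        SqliteRuntimeError::NotOpen
    }
}
```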

cleanup() is now a trivial alias for close().

pub(crate) async fn cleanup(&self) -> Result<()> {
    self.close().await
}

Since both close() and cleanup() are now identical and close() is already pub, the cleanup() wrapper adds indirection without adding meaning. Consider either inlining the close() call at the one call site in task.rs, or keeping cleanup() but adding a comment explaining the semantic distinction (if one is intended).

Arc<AtomicBool> is correct but deserves a comment on why Arc.
The comment on the field explains stale cloned handles, which is the right reason. Worth noting that Default::default() for Arc<AtomicBool> yields Arc::new(AtomicBool::new(false)) — this is correct but non-obvious to a new reader.
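The non-obvious default behavior can be checked directly:

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

fn main() {
    // Default::default() for Arc<AtomicBool> wraps AtomicBool::new(false),
    // so a freshly-constructed Db starts in the open state.
    let closed: Arc<AtomicBool> = Default::default();
    assert!(!closed.load(Ordering::Acquire));
}
```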


Shutdown Sequence Ordering

The new sequence is:

1. cleanup() → close SQLite (fail-closed from here on)
2. save_final_state()
3. tokio::join!(teardown_sleep_state, wait_for_pending_state_writes, cleanup_alarm)
4. mark_destroy_completed() (if Destroy)

The rationale is sound: step 1 ensures any late DB callback from this generation fails immediately, preventing a corrupt write that could race with step 2's state serialization. The parallelization in step 3 is safe because the three tasks are independent.

One thing to verify: does save_final_state() itself (or any callback it triggers) attempt SQLite access? If so, it would now fail with Closed. From the code, save_final_state() serializes in-memory actor state to KV — it does not directly use SQLite — so this should be fine. Worth adding a brief comment on that invariant since the ordering is load-bearing.
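The load-bearing ordering can be sketched synchronously. This is a schematic trace, not the real task.rs code, which is async and runs the three step-3 tasks concurrently via tokio::join!; the step names come from the sequence above.

```rust
// Synchronous sketch of the shutdown ordering. In the real code the three
// step-3 tasks run concurrently; here they are listed sequentially.
fn shutdown(log: &mut Vec<&'static str>, is_destroy: bool) {
    log.push("close_sqlite");     // 1. fail-closed from here on
    log.push("save_final_state"); // 2. must not touch SQLite (the invariant)
    // 3. independent teardown tasks (tokio::join! in the real code)
    log.push("teardown_sleep_state");
    log.push("wait_for_pending_state_writes");
    log.push("cleanup_alarm");
    if is_destroy {
        log.push("mark_destroy_completed"); // 4. Destroy only
    }
}

fn main() {
    let mut log = Vec::new();
    shutdown(&mut log, true);
    // SQLite is closed before the final state is saved.
    let close = log.iter().position(|s| *s == "close_sqlite").unwrap();
    let save = log.iter().position(|s| *s == "save_final_state").unwrap();
    assert!(close < save);
}
```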


Test Coverage

The .agent/notes/ file documents that all actor-db* and actor-sleep-db driver tests pass, including the stress case on rerun. However, per CLAUDE.md, tests should prove the fix is effective. There is no new targeted test for the specific race this fixes (late callback after shutdown start, before the old cleanup). The existing stress tests may cover this incidentally, but a deterministic unit test for the ensure_open gate (call close(), then assert subsequent exec/query/execute return sqlite.closed) would make the invariant explicit and prevent regressions.
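A deterministic regression test along the lines suggested could take this shape. Db here is a synchronous stand-in for the real SqliteDb handle; the real test would call the async exec/query/execute paths and match on the actual error type.

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

// Stand-in for SqliteDb: just the closed gate under test.
#[derive(Clone, Default)]
struct Db {
    closed: Arc<AtomicBool>,
}

impl Db {
    fn exec(&self) -> Result<(), &'static str> {
        if self.closed.load(Ordering::Acquire) {
            return Err("sqlite.closed");
        }
        Ok(()) // the real helper would run the statement here
    }
    fn close(&self) {
        self.closed.swap(true, Ordering::AcqRel);
    }
}

fn main() {
    let db = Db::default();
    assert!(db.exec().is_ok());
    db.close();
    // Every operation after close() must fail with sqlite.closed.
    assert_eq!(db.exec(), Err("sqlite.closed"));
    db.close(); // idempotent: a second close changes nothing
    assert_eq!(db.exec(), Err("sqlite.closed"));
}
```

This pins the invariant directly rather than relying on the stress tests to hit the race incidentally.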


Minor Style Notes

  • Log messages "actor shutdown cleanup step completed" are repeated verbatim across all three new run_shutdown_task_* helpers. A step field differentiates them, which is fine, but the repetition across five nearly-identical tracing::debug! blocks is noisy. These could be extracted into a small log_shutdown_step(actor_id, reason, step) helper to reduce duplication.
  • The agent note file (.agent/notes/db-driver-test-rerun-2026-05-03.md) documents the "ltx trailer checksums must be zeroed" transient failure. Since it appeared only once and isolated reruns passed, it looks like pre-existing flakiness unrelated to this change; no action needed here, but worth tracking separately if it recurs.

Summary

The fix is correct and addresses a real bug. The Arc<AtomicBool> closed-flag approach is idiomatic and the fail-closed semantics are consistent with the project's design principles. The main follow-up items are: (1) add a deterministic test for the ensure_open gate, and (2) consider whether cleanup() should be collapsed into close() at the call site.

