Skip to content

fix(sqlite): stop actors on head fence mismatch#4977

Merged
NathanFlurry merged 1 commit intomainfrom
sqlite-fencing/head-fence-fatal-stop
May 5, 2026
Merged

fix(sqlite): stop actors on head fence mismatch#4977
NathanFlurry merged 1 commit intomainfrom
sqlite-fencing/head-fence-fatal-stop

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

@NathanFlurry NathanFlurry commented May 5, 2026

Stack Context

This stack hardens Rivet Actor SQLite storage against lifecycle overlap, where multiple actor instances can observe or write the same Depot-backed SQLite database concurrently.

What?

  • Add Depot head-fence checks for SQLite reads and commits.
  • Preserve structured depot.head_fence_mismatch errors through pegboard-envoy and envoy protocol v5.
  • Mark native SQLite VFS instances fatal on head-fence mismatch and surface that at the worker boundary.
  • Stop the actor generation on fatal SQLite storage errors for both local native SQLite and remote SQLite.

Why?

SQLite corruption symptoms around actor lifecycle churn point to overlapping actor instances accessing the same database. Depot is the authority for the current SQLite head, so this PR checks expected vs actual head inside Depot transactions and treats mismatches as a fatal lifecycle invariant violation instead of allowing continued reads or writes.

Validation

  • cargo check -p depot-client -p rivetkit-core
  • cargo check -p depot-client-embedded -p pegboard-envoy -p rivetkit-core
  • cargo check -p rivetkit-core
  • cargo test -p depot-client fatal_sqlite_error_reports_first_message_through_worker_boundary -- --nocapture
  • cargo test -p depot-client head_fence_ioerr_maps_to_fatal_worker_error_and_future_operations_fail_closed -- --nocapture
  • cargo test -p depot-client resolve_pages_sends_known_head_txid_as_read_fence -- --nocapture
  • cargo test -p depot --test conveyer_commit head_fence -- --nocapture
  • cargo test -p rivetkit-core remote_head_fence_mismatch_stops_actor_once -- --nocapture
  • git diff --check

Copy link
Copy Markdown
Member Author

NathanFlurry commented May 5, 2026

This stack of pull requests is managed by Graphite. Learn more about stacking.

@NathanFlurry NathanFlurry marked this pull request as ready for review May 5, 2026 14:23
@NathanFlurry NathanFlurry force-pushed the sqlite-fencing/head-fence-fatal-stop branch from 4fb3acf to 337772d Compare May 5, 2026 14:30
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 5, 2026

Code Review: fix(sqlite): stop actors on head fence mismatch

Overview

This PR hardens Rivet Actor SQLite storage against lifecycle overlap by adding Depot head-fence checks for SQLite reads and commits. The core idea is sound: Depot is the authority for head_txid, the VFS tracks it locally, and any mismatch is treated as a fatal lifecycle invariant violation that stops the actor. The new protocol version bump (v4 to v5) correctly adds structured group/code fields to SqliteErrorResponse plus head_txid fields to read/commit responses.

The design is well-conceived and the test coverage is solid. A few things worth addressing before this lands:


Issues

1. Missing log in io_write fatal path (unlike io_read)

In vfs.rs, io_read logs a tracing::error! with the actor ID, requested pages, and error message before calling ctx.mark_fatal(message). The io_write fatal arm is silent — no log, context is lost. io_read should be the model. A fatal write-path fence mismatch is equally worth logging.

2. Duplicated is_head_fence_mismatch_response and depot_error helpers

is_head_fence_mismatch_response appears in three places with inconsistent style: depot-client/src/vfs.rs uses constants, rivetkit-core/src/actor/sqlite.rs uses raw string literals, and ws_to_tunnel_task.rs inlines the comparison without a helper. Similarly depot_error (the .chain().find_map(downcast) helper) is duplicated across depot-client-embedded, vfs_support.rs, and ws_to_tunnel_task.rs. Both are small enough that duplication is not a hard blocker, but the inconsistency is worth tidying.

3. commit_too_large hardcodes group/code while head_fence_mismatch goes through sqlite_error_response

The CommitTooLarge arm in ws_to_tunnel_task.rs hardcodes group: "depot".to_string(), code: "commit_too_large".to_string() while head_fence_mismatch is routed through sqlite_error_response(err) which uses RivetError::extract. This pre-existed the PR, but adding group/code to the hardcoded path while leaving it inconsistent with the general path is a smell. If SqliteStorageError grows more variants, the pattern to follow is not obvious.

4. Missing newline at EOF in depot.head_fence_mismatch.json

The diff shows no newline at end of file. All other artifact files end with a newline.

5. Missing blank line before impl Debug for SqliteDb in sqlite.rs

The is_head_fence_mismatch_response function is immediately followed by impl std::fmt::Debug for SqliteDb without a blank line.


Observations (non-blocking)

head_txid update in resolve_pages takes two separate write locks. The first write lock reads state.head_txid into expected_head_txid, the transport call runs, then a second write lock updates state.head_txid from the response. In practice this is safe (single writer, Depot reflects current head, fence check catches staleness), but the reasoning is not immediately obvious and a brief comment would help.

saturating_add(1) fallback for head_txid on v4 transports. When talking to a v4 server, the VFS increments its local counter to keep a fence active. This is a reasonable heuristic but provides weaker guarantees during a rolling upgrade window. Worth documenting explicitly.

No VFS-level test for read-path fence rejection. head_fence_ioerr_maps_to_fatal_worker_error_and_future_operations_fail_closed covers the commit-path. The read-path (io_read to GetPagesError::Fatal) is tested indirectly at the depot level. A VFS-level integration test for the read fence path would close the gap.


Protocol / Backward Compatibility

The v4 to v5 conversion looks correct: SqliteCommitOk unit to struct with head_txid: None, SqliteErrorResponse gets group: "core", code: "internal_error" when upgrading from v4, and both ToEnvoy and ToRivet converters enumerate all variants with no _ fall-through. The expected_head_txid: None default means the fence is disabled until the first round-trip, which is the safe backward-compatible behavior.


Summary

The core invariant (Depot is the authority, mismatch is fatal, actor stops) is correctly implemented end-to-end. The test coverage is good. The blocking items are the missing io_write log, the missing EOF newline, and the style nit in sqlite.rs. The duplication of the helpers is worth addressing but not a hard blocker.

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 5, 2026

Code Review: fix(sqlite): stop actors on head fence mismatch

Overview

This PR hardens Rivet Actor SQLite storage against concurrent access by introducing a head fence check using a monotonically-increasing head_txid that Depot validates on every read and commit. Mismatches indicate overlapping actor instances and are treated as fatal lifecycle violations. The change includes protocol v5, full backward-compatibility converters, and the propagation path for fatal errors from VFS to worker to actor lifecycle.

The approach is sound and the implementation is generally clean. A few items worth attention below.


Issues

Missing newline in error artifact

engine/artifacts/errors/depot.head_fence_mismatch.json has no trailing newline. All other error JSON files should have one.

depot_error fix not consistently applied to CommitTooLarge

pegboard-envoy::depot_error was correctly fixed to use err.chain().find_map(...) so wrapped errors are matched. However, the CommitTooLarge branch in handle_sqlite_commit still hardcodes group and code as raw string literals (group: "depot".to_string(), code: "commit_too_large".to_string()), bypassing the error system. The HeadFenceMismatch path correctly sets these via RivetError::extract. CommitTooLarge should go through sqlite_error_response(err) for consistency.

HeadFenceMismatch in handle_sqlite_commit (pegboard-envoy) propagates as Err, not as SqliteErrorResponse

When commit_with_options returns HeadFenceMismatch, the _ => Err(err) arm propagates it as a hard error, closing the WebSocket rather than sending a structured SqliteErrorResponse back to the actor. The local VFS path surfaces fence mismatches as CommitBufferError::FenceMismatch and calls mark_fatal. The remote path just kills the connection. Both terminate the actor, but the remote path does not go through mark_fatal, so SqliteWorkerFatalError is never emitted and subsequent operations see Closed rather than fatal. This is unlikely to cause a correctness issue in practice, but it means the stale-writer test does not cover the remote path end-to-end.

sqlite_error_response logic duplicated in three crates

Identical logic appears in depot-client-embedded/src/lib.rs, depot-client/tests/inline/vfs_support.rs, and pegboard-envoy/src/ws_to_tunnel_task.rs. The depot-client-types crate already holds shared constants and would be a natural home for a shared helper to avoid future drift.


Minor Observations

  • Blank line removed in vfs.rs between pub type SqliteTransportHandle and fn sqlite_now_ms() is a minor style inconsistency.

  • saturating_add(1) fallback for head_txid: When outcome.head_txid is None (mirror transport), the VFS increments its local estimate. Correct for transports that do not participate in fence checking, but the rationale is non-obvious. A short inline comment would help.

  • worker_fatal_reported.store(false, ...) placement fix: Moving the reset before open_database_from_transport closes a narrow window where a fatal error on the very first operation would not trigger the actor stop signal. Good catch.

  • v4 to v5 downgrade loses fence info: Intentional and correctly handled. head_txid: None disables fence checking on old-version peers. convert_sqlite_error_response_v4_to_v5 uses core.internal_error so v4 errors do not accidentally trigger fence handling when upgraded.


Test Coverage

Good. Tests cover:

  • resolve_pages_sends_known_head_txid_as_read_fence: verifies the fence txid is forwarded in get-pages requests
  • fatal_sqlite_error_reports_first_message_through_worker_boundary: first-message semantics and clean close after fatal
  • head_fence_ioerr_maps_to_fatal_worker_error_and_future_operations_fail_closed: end-to-end with a real stale writer
  • commit_head_fence_rejects_stale_writer and get_pages_head_fence_rejects_stale_reader: Depot-level fence enforcement
  • remote_head_fence_mismatch_stops_actor_once: rivetkit-core remote path and AtomicBool deduplication

Missing: a test verifying the remote fence path through pegboard-envoy produces a structured SqliteErrorResponse back to the actor transport rather than just closing the connection.


Summary

The core logic is correct and well-tested. Three items worth addressing before merge:

  1. Trailing newline in depot.head_fence_mismatch.json
  2. CommitTooLarge in handle_sqlite_commit should use sqlite_error_response rather than hardcoded string literals
  3. The pegboard-envoy HeadFenceMismatch commit path should serialize a structured SqliteErrorResponse rather than propagating Err, for consistency with the local VFS behavior

Base automatically changed from 05-05-fix_rivetkit_fix_crypto_dep_in_drizzle to main May 5, 2026 14:58
@NathanFlurry NathanFlurry merged commit 337772d into main May 5, 2026
10 of 22 checks passed
@NathanFlurry NathanFlurry deleted the sqlite-fencing/head-fence-fatal-stop branch May 5, 2026 14:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant