Skip to content

fix(depot-client): only fatal sqlite vfs on fence mismatch#5007

Merged
NathanFlurry merged 1 commit into
mainfrom
sqlite-vfs/fatal-only-fence-mismatch
May 9, 2026
Merged

fix(depot-client): only fatal sqlite vfs on fence mismatch#5007
NathanFlurry merged 1 commit into
mainfrom
sqlite-vfs/fatal-only-fence-mismatch

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app
Copy link
Copy Markdown

railway-app Bot commented May 9, 2026

🚅 Deployed to the rivet-pr-5007 environment in rivet-frontend

Service Status Web Updated (UTC)
kitchen-sink ❌ Build Failed (View Logs) Web May 9, 2026 at 7:33 am
website 😴 Sleeping (View Logs) Web May 9, 2026 at 6:34 am
frontend-inspector ❌ Build Failed (View Logs) Web May 9, 2026 at 6:26 am
frontend-cloud ❌ Build Failed (View Logs) Web May 9, 2026 at 6:25 am
ladle ❌ Build Failed (View Logs) Web May 9, 2026 at 6:25 am
mcp-hub ✅ Success (View Logs) Web May 9, 2026 at 6:25 am

Copy link
Copy Markdown
Member Author

NathanFlurry commented May 9, 2026

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 9, 2026

Code Review

Overview

This PR changes the SQLite VFS error-handling model so that only confirmed head-fence mismatches permanently kill the VFS, while plain transport failures (network drops, lost responses) are treated as transient and allow retries. The key insight is sound: if a commit never reached the depot, the head fence has not advanced, so a retry from the same head will succeed; if the commit was applied but the response was lost, the next attempt will get a fence mismatch from the server, which is caught and made fatal at that point.


Positives

  • Semantic rename GetPagesError::Fatal to GetPagesError::FenceMismatch is a clear improvement.
  • Removing mark_dead and collapsing the dead/fatal distinction simplifies reasoning: dead now means permanently broken due to a consistency violation, not encountered any error.
  • New tests are well-structured. lost_commit_response_fails_later_on_head_fence_mismatch and unapplied_commit_transport_failure_can_retry_from_same_head cleanly demonstrate the two key scenarios and use the fail_next_commit_after_apply hook to simulate the applied-but-response-lost case.
  • Early return plus null-pointer in NativeDatabase::Drop after a failed flush (self.db = ptr::null_mut(); return;) is the right move to prevent further use of a handle that may be in an ambiguous state.

Issues / Questions

1. Test name mismatch (direct_engine_fresh_reopen_recovers_after_poisoned_handle)

The test was updated to assert !is_dead() but the name still references a "poisoned_handle". This will mislead future readers. Consider renaming to something like direct_engine_fresh_reopen_survives_transient_transport_error.

2. StageNotFound variant removed silently

CommitBufferError::StageNotFound(u64) is removed from the enum with no explanation. If the depot can still return a "stage not found" response, this needs to be handled somewhere (perhaps folded into CommitBufferError::Other). Worth a note explaining whether this was dead code or is now folded into another variant.

3. transient_commit_error deferred path only covers atomic-write (io_file_control)

The defer is stored in io_file_control for CommitBufferError::Other and consumed in io_sync. However, the non-atomic write path in handle_non_finalize_commit_error now calls set_last_error directly, not defer_transient_commit_error. This means a non-atomic commit transport failure sets last_error immediately but io_sync will proceed normally, potentially masking the error at the SQLite level. Is this intentional? If so, the distinction between the two paths deserves a comment.

4. transient_commit_error not cleared on VFS reset / reopen

If a transient error is deferred but io_sync is never called (e.g., an early drop), the next database open on a re-registered VFS context would start with a stale transient_commit_error. Worth confirming that VfsContext is always freshly constructed on open and not reused across opens.

5. fail_next_commit_after_apply hook semantics

take_commit_after_apply_error() is called on every successful commit_buffer result (the Ok(result) arm). If a commit fails before being applied, the hook will not fire. This is correct, but a one-line comment explaining the intent (fires after the commit is applied server-side, before the response is returned) would help future test authors.


Minor Notes

  • Per CLAUDE.md the PR checklist is entirely unchecked. Expected for a draft, but fill it before marking ready.
  • The new fail_next_commit_after_apply hook in vfs_support.rs could use a brief comment explaining its semantics.

Summary

The design is correct and the new test coverage for the two critical scenarios (pre-apply drop vs. post-apply lost response) is solid. Main items before merging: clarify the StageNotFound removal, document or resolve the asymmetry between atomic and non-atomic commit error-deferral paths, and rename the stale test.

@NathanFlurry NathanFlurry marked this pull request as ready for review May 9, 2026 07:00
@NathanFlurry NathanFlurry changed the base branch from 05-07-do_not_merge_serverless_restart_race_condition to graphite-base/5007 May 9, 2026 07:33
@NathanFlurry NathanFlurry force-pushed the sqlite-vfs/fatal-only-fence-mismatch branch from f2a28ab to 24c5d49 Compare May 9, 2026 07:33
@NathanFlurry NathanFlurry force-pushed the graphite-base/5007 branch from 011798b to 0889a3a Compare May 9, 2026 07:33
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-5007 May 9, 2026 07:33 Destroyed
@NathanFlurry NathanFlurry changed the base branch from graphite-base/5007 to runtime-options/preserve-bridge-errors May 9, 2026 07:33
Base automatically changed from runtime-options/preserve-bridge-errors to main May 9, 2026 07:50
@NathanFlurry NathanFlurry merged commit 24c5d49 into main May 9, 2026
10 of 19 checks passed
@NathanFlurry NathanFlurry deleted the sqlite-vfs/fatal-only-fence-mismatch branch May 9, 2026 07:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant