Skip to content

fix(sqlite): tolerate startup takeover races#4779

Draft
NathanFlurry wants to merge 1 commit intodriver-fixes/refresh-runner-config-after-envoy-connectfrom
driver-fixes/sqlite-startup-takeover-races
Draft

fix(sqlite): tolerate startup takeover races#4779
NathanFlurry wants to merge 1 commit intodriver-fixes/refresh-runner-config-after-envoy-connectfrom
driver-fixes/sqlite-startup-takeover-races

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Copy link
Copy Markdown
Member Author

NathanFlurry commented Apr 26, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 26, 2026

Code Review

Good targeted fix for a real race condition. The two-part approach (fixing the reset_v1_migration guard + adding retry logic in maybe_load_sqlite_startup_data) is logically sound. A few things worth discussing:


takeover.rstracing::error! becomes misleading after this fix

The CAS failure inside takeover() still fires at error level even though the new retry logic in maybe_load_sqlite_startup_data now treats this as a recoverable race:

// sqlite-storage/src/takeover.rs ~L206
tracing::error!(
    actor_id = %actor_id,
    "meta changed during takeover, concurrent writer detected"
);
return Err(SqliteStorageError::ConcurrentTakeover.into());

After this PR, ConcurrentTakeover is expected during startup races and will be silently retried or resolved. Keeping it at error will produce misleading noise in logs for every race that the new code successfully handles. Consider downgrading to debug! or warn! to match the actual severity of this now-recoverable case.


sqlite_runtime.rsobserved_generation = 0 when no existing meta

let mut observed_generation = 0;
if let Some(meta) = sqlite_engine.try_load_meta(&actor_id).await? {
    ...
    observed_generation = meta.generation;
}

DBHead::new() initializes generation to 1, so any successful first-ever takeover on a fresh actor returns generation >= 1 > 0. This means a racing first-boot will always satisfy current_meta.generation > observed_generation and return without preloaded pages. That is correct behavior, but the intent is non-obvious. A comment explaining why 0 is the right sentinel would help a future reader.


sqlite_runtime.rs — no preloaded pages on the fast-path return

When current_meta.generation > observed_generation, the code returns:

preloaded_pages: Vec::new(),

This is correct for correctness, but the actor now needs to fetch page 1 (and any other hot pages) lazily. This is a performance regression vs. the winner path, but acceptable for a rare race. Consider adding a comment that this degraded path is intentional.


Minor: loop structure / unreachable!

The loop 0..=SQLITE_STARTUP_TAKEOVER_RETRIES (i.e. 4 iterations: 0, 1, 2, 3) plus the attempt == SQLITE_STARTUP_TAKEOVER_RETRIES early-return is correct but requires two mental passes to verify. An alternative that reads more directly:

for attempt in 0..SQLITE_STARTUP_TAKEOVER_RETRIES {
    // try; on success return; on ConcurrentTakeover check gen + maybe sleep; on other err return
}
// final attempt: no retry, propagate error

Not a blocker — the current form is correct — but worth a note for future readers.


Test coverage gap

The invalidate_v1_migration_ignores_native_v2_heads test is a nice addition. The retry logic in maybe_load_sqlite_startup_data is the more complex new path, but it sits in pegboard-envoy where integration tests are harder to write. At minimum, a comment in the function body explaining the two distinct cases handled by the retry loop (racing concurrent startup vs. in-flight meta update) would help future readers understand what the test coverage gap is and why the code is written the way it is.


Overall: The fix is correct and the scope is appropriately tight. The main thing I'd want addressed before merge is the tracing::error! level in takeover.rs — it will pollute monitoring alerts for a race that is now fully expected and handled.

@github-actions
Copy link
Copy Markdown
Contributor

Preview packages published to npm

Install with:

npm install rivetkit@pr-4779

All packages published as 0.0.0-pr.4779.35d4cca with tag pr-4779.

Engine binary is shipped via @rivetkit/engine-cli on linux-x64-musl, linux-arm64-musl, darwin-x64, and darwin-arm64. Windows users should use the release installer or set RIVET_ENGINE_BINARY.

Docker images:

docker pull rivetdev/engine:slim-35d4cca
docker pull rivetdev/engine:full-35d4cca
Individual packages
npm install rivetkit@pr-4779
npm install @rivetkit/react@pr-4779
npm install @rivetkit/rivetkit-napi@pr-4779
npm install @rivetkit/workflow-engine@pr-4779

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant