Skip to content

chore: sqlite v1 to v2 data migration#4692

Draft
NathanFlurry wants to merge 1 commit into04-19-chore_move_rivetkit_to_task_modelfrom
04-21-chore_sqlite_v1_to_v2_data_migration
Draft

chore: sqlite v1 to v2 data migration#4692
NathanFlurry wants to merge 1 commit into04-19-chore_move_rivetkit_to_task_modelfrom
04-21-chore_sqlite_v1_to_v2_data_migration

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app
Copy link
Copy Markdown

railway-app Bot commented Apr 22, 2026

🚅 Deployed to the rivet-pr-4692 environment in rivet-frontend

Service Status Web Updated (UTC)
website 😴 Sleeping (View Logs) Web Apr 22, 2026 at 9:14 am
kitchen-sink ❌ Build Failed (View Logs) Web Apr 22, 2026 at 9:12 am
frontend-cloud ❌ Build Failed (View Logs) Web Apr 22, 2026 at 9:04 am
frontend-inspector ❌ Build Failed (View Logs) Web Apr 22, 2026 at 9:04 am
mcp-hub ✅ Success (View Logs) Web Apr 22, 2026 at 9:03 am
ladle ❌ Build Failed (View Logs) Web Apr 22, 2026 at 9:03 am

Copy link
Copy Markdown
Member Author

NathanFlurry commented Apr 22, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@NathanFlurry NathanFlurry changed the base branch from 04-19-chore_move_rivetkit_to_task_model to graphite-base/4692 April 22, 2026 09:04
@claude
Copy link
Copy Markdown

claude Bot commented Apr 22, 2026

Code Review: SQLite v1 to v2 Data Migration

This is a solid, well-tested migration path. The core logic is correct and the test coverage is comprehensive. A few items to address before merging:


Bug / Correctness

Indentation in prepare_v1_migration breaks the if let block
engine/packages/sqlite-storage/src/takeover.rs

The body of if let Some(existing_meta) = ... is not indented relative to the if-let arm, which is both a formatting violation and risks future misreads:

if let Some(existing_meta) =
    udb::tx_get_value_serializable(&tx, &subspace, &meta_storage_key).await?
{
let existing_head = decode_db_head(&existing_meta)?;   // ← should be one tab deeper
ensure!(
    matches!(existing_head.origin, SqliteOrigin::MigratingFromV1),
    SqliteStorageError::ConcurrentTakeover
);
}

Should be:

if let Some(existing_meta) =
    udb::tx_get_value_serializable(&tx, &subspace, &meta_storage_key).await?
{
    let existing_head = decode_db_head(&existing_meta)?;
    ensure!(
        matches!(existing_head.origin, SqliteOrigin::MigratingFromV1),
        SqliteStorageError::ConcurrentTakeover
    );
}

Performance Concerns

Every actor start scans v1 KV keys even after a successful migration
engine/packages/pegboard-envoy/src/sqlite_runtime.rsmaybe_migrate_v1_to_v2

After migration succeeds, the v1 KV keys are intentionally not deleted. As a result, every subsequent actor start still runs:

  1. sqlite_v1_data_exists → true (KV prefix scan)
  2. Acquire migration lock
  3. sqlite_v1_data_exists again → true (second KV prefix scan)
  4. try_load_headMigratedFromV1 → early return

This is 2 extra KV round-trips on every warm start for any actor that was ever on v1. If v1 actors are numerous, this adds up at scale. Consider adding a test for idempotent call behavior, and a TODO comment explaining why v1 data is not cleaned up post-migration (or clean it up in a follow-up).

Wrong histogram bucket set for SQLITE_MIGRATION_PAGES
engine/packages/pegboard-envoy/src/metrics.rs

SQLITE_MIGRATION_PAGES reuses the latency-oriented BUCKETS (max 500.0). The migration limit is 128 MB / 4 KB = 32 768 pages, so nearly all migrations will fall into the overflow bucket. A custom exponential bucket set covering 1–40 000 would make this histogram useful:

pub static ref SQLITE_MIGRATION_PAGES: Histogram = register_histogram_with_registry!(
    "pegboard_envoy_sqlite_migration_pages",
    "Number of pages imported during sqlite v1 to v2 migration.",
    vec![1.0, 4.0, 16.0, 64.0, 256.0, 1024.0, 4096.0, 16384.0, 32768.0],
    *REGISTRY
).unwrap();

CLAUDE.md Convention Violations

eprintln! in test functions
rivetkit-rust/packages/rivetkit-sqlite/src/vfs.rsprofile_large_tx_insert_5mb, profile_hot_row_updates, profile_large_tx_insert_1mb, profile_large_tx_insert_1mb_preloaded

CLAUDE.md: "Never use eprintln! or println! for logging in Rust code. Always use tracing macros."

These profile tests print timing summaries with eprintln!. Either convert to tracing::info! calls, or mark the eprintln! blocks with a comment explaining they are intentional test-only diagnostic output and use println! (which the cargo test runner captures by default).

anyhow! macro usage (prefer .context() / .ok_or_else(|| ...) with .context())
rivetkit-rust/packages/rivetkit-sqlite/src/database.rs and engine/packages/sqlite-storage/src/udb.rs

CLAUDE.md: "Prefer anyhow's .context() over anyhow! macro."

Affected lines:

// database.rs
.ok_or_else(|| anyhow!("missing sqlite startup data for actor {actor_id}"))?;
.map_err(|e| anyhow!("failed to register sqlite VFS: {e}"))?;
.map_err(|e| anyhow!("failed to open sqlite database: {e}"))

// udb.rs
return Err(anyhow::anyhow!(
    "unknown sqlite-storage value marker {other} for key {:?}", key
));

Minor / Design Notes (no action required)

decode_db_head fallback is silently lenient on parse errors
engine/packages/sqlite-storage/src/types.rs

When the new DBHead (with origin) fails to deserialize, the code silently retries with LegacyDBHead. If the failure is due to data corruption (not just a missing trailing field), LegacyDBHead might still succeed and return silently wrong data. Adding an explicit check that the legacy decode only fires when the error looks like a trailing-field mismatch (e.g. checking byte length) would make this more robust, but the current approach is probably acceptable given BARE's strict length encoding.

split_bytes edge case with empty max_chunk_bytes = 0
engine/packages/pegboard-envoy/src/sqlite_runtime.rs

When bytes.is_empty() || max_chunk_bytes == 0, the function returns vec![bytes.to_vec()] — a single empty chunk. Callers protect against this (max_delta_bytes coming from SqliteMeta is always positive), but an assert!(max_chunk_bytes > 0) guard would make the invariant explicit.

v1 data never explicitly cleaned up post-migration
The design decision to leave v1 KV keys in place after migration appears intentional (as evidenced by skips_native_v2_state_even_if_v1_tombstone_exists). A short comment in maybe_migrate_v1_to_v2 explaining this choice would help future readers.


Test Coverage Assessment

Coverage is excellent:

  • migrates_v1_sqlite_into_v2_storage — happy path
  • retries_cleanly_after_stale_partial_v1_import — lease/retry logic
  • rejects_fresh_in_progress_v1_migrations — in-progress guard
  • skips_native_v2_state_even_if_v1_tombstone_exists — origin check
  • bails_when_v2_meta_is_unreadable — corrupt meta handling
  • recovers_a_pending_v1_journal_before_import — journal rollback
  • migrates_zero_size_v1_state_without_pages — empty DB
  • rejects_v1_databases_with_unsupported_page_size — page size guard
  • rejects_v1_wal_sidecars / rejects_v1_shm_sidecars — WAL/SHM guards
  • rejects_v1_files_with_missing_chunks — chunk integrity
  • rejects_v1_files_that_exceed_migration_limit — size guard
  • reads_v1_files_beyond_the_default_kv_list_limit — pagination

Missing: a test that calls maybe_migrate_v1_to_v2 twice on an already-migrated actor and asserts it returns Ok(false) on the second call (idempotency path through MigratedFromV1).


Summary

The architecture is sound. The key fixes before merging are:

  1. Fix the if let body indentation in prepare_v1_migration (formatting/readability bug)
  2. Replace eprintln! calls in profile tests with tracing::info! or println!
  3. Fix the SQLITE_MIGRATION_PAGES histogram buckets (observability bug)
  4. Replace anyhow! macro usages with .context()/.ok_or_else(...).context(...) per project conventions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant