feat(flatkv): add snapshot, WAL catchup, and rollback support #2972

Merged
blindchaser merged 25 commits into main from yiren/flatkv-snapshot on Feb 27, 2026

@blindchaser
Contributor

Describe your changes and provide context

Introduce a snapshot-based lifecycle for FlatKV so that restarts replay
only from the nearest PebbleDB checkpoint instead of the full WAL.

Key changes:

  • Snapshot management: immutable PebbleDB checkpoints created via
    Checkpoint(), managed through a "current" symlink and atomic
    directory operations. Configurable interval, retention, and
    minimum time between snapshots.
  • Working directory: mutable clone of the baseline snapshot (hardlinks
    for .sst files) so writes never mutate snapshot dirs.
  • WAL catchup: on open, replay changelog entries from the snapshot
    version to the target version using O(1) arithmetic + O(log N)
    binary search for offset resolution.
  • Rollback: rewind to the best snapshot <= target, truncate WAL,
    replay to exact version, and prune future snapshots.
  • File lock: prevent concurrent access from multiple processes.
  • Migration: automatically move pre-snapshot flat layout into a
    versioned snapshot directory on first open.
  • Auto WAL truncation: periodically discard WAL entries older than
    the earliest retained snapshot.
  • Fix account LtHash baseline capture to use pre-batch state when
    multiple ApplyChangeSets calls precede a single Commit.
  • Add legacyDB to flushAllDBs.
  • Mark Iterator/IteratorByPrefix as EXPERIMENTAL (unused in production).
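
A minimal sketch of the two lookups described in the rollback and WAL catchup bullets. It assumes snapshot versions are held in an ascending slice and that WAL entries carry strictly increasing versions; `bestSnapshot` and `walOffset` are hypothetical names, and treating the O(1) arithmetic as a fast path with binary search as the fallback is one plausible reading of the description, not necessarily how FlatKV does it.

```go
package main

import (
	"fmt"
	"sort"
)

// bestSnapshot returns the largest snapshot version <= target.
// versions must be sorted ascending; ok is false when every snapshot
// is newer than target.
func bestSnapshot(versions []int64, target int64) (v int64, ok bool) {
	// sort.Search finds the first index whose version is > target.
	i := sort.Search(len(versions), func(i int) bool { return versions[i] > target })
	if i == 0 {
		return 0, false
	}
	return versions[i-1], true
}

// walOffset resolves the offset of the WAL entry that holds version.
// entryVersions[i] is the version stored at offset i (strictly increasing).
func walOffset(entryVersions []int64, version int64) (int, bool) {
	n := len(entryVersions)
	if n == 0 {
		return 0, false
	}
	// O(1) guess: exact whenever the versions up to this point are contiguous.
	guess := version - entryVersions[0]
	if guess >= 0 && guess < int64(n) && entryVersions[guess] == version {
		return int(guess), true
	}
	// O(log N) fallback for logs with gaps.
	i := sort.Search(n, func(i int) bool { return entryVersions[i] >= version })
	if i < n && entryVersions[i] == version {
		return i, true
	}
	return 0, false
}

func main() {
	snapshots := []int64{100, 200, 300}
	v, ok := bestSnapshot(snapshots, 250)
	fmt.Println(v, ok)

	wal := []int64{201, 202, 205, 206}
	off, found := walOffset(wal, 205)
	fmt.Println(off, found)
}
```

With the sample data, a rollback to version 250 would start from snapshot 200 and replay WAL entries up to the target.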

Testing performed to validate your change

github-actions bot commented Feb 24, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ passed | ✅ passed | ✅ passed | ✅ passed | Feb 27, 2026, 6:20 PM |

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bccc292045

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +521 to +523
if err := s.open(); err != nil {
	return fmt.Errorf("open for rollback: %w", err)
}

P1: Reclone baseline snapshot before rollback catchup

When Rollback updates current and immediately calls open, it does not invalidate working/SNAPSHOT_BASE, so after a restart (where working was already cloned from that same snapshot) createWorkingDir reuses a working DB that may already be at a higher version than targetVersion. In that case catchup skips all entries (entry.Version <= committedVersion), rollback fails with a version mismatch, and this can happen after the WAL has already been truncated, leaving rollback in a partially-mutated state. Force a fresh clone of working from the selected snapshot before opening during rollback (like LoadVersion(target>0) already does).

Useful? React with 👍 / 👎.
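
A rough sketch of what "force a fresh clone of working" could look like. Everything here is an assumption for illustration: `recloneWorkingDir` is a hypothetical helper, the directory is treated as flat, and only the hardlink-for-`.sst` rule comes from the PR description.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// recloneWorkingDir discards workingDir and rebuilds it from snapshotDir, so a
// stale working copy at a higher version can never be reused by a rollback.
// Sketch assumptions: one flat directory, both paths on the same filesystem.
func recloneWorkingDir(snapshotDir, workingDir string) error {
	if err := os.RemoveAll(workingDir); err != nil {
		return fmt.Errorf("remove stale working dir: %w", err)
	}
	if err := os.MkdirAll(workingDir, 0o755); err != nil {
		return fmt.Errorf("create working dir: %w", err)
	}
	entries, err := os.ReadDir(snapshotDir)
	if err != nil {
		return fmt.Errorf("read snapshot dir: %w", err)
	}
	for _, e := range entries {
		if e.IsDir() {
			continue // nested directories are out of scope for this sketch
		}
		src := filepath.Join(snapshotDir, e.Name())
		dst := filepath.Join(workingDir, e.Name())
		if filepath.Ext(e.Name()) == ".sst" {
			// .sst files are immutable, so sharing the inode is safe.
			err = os.Link(src, dst)
		} else {
			// Mutable files get a private copy so the snapshot stays untouched.
			err = copyFile(src, dst)
		}
		if err != nil {
			return fmt.Errorf("clone %s: %w", e.Name(), err)
		}
	}
	return nil
}

func copyFile(src, dst string) error {
	data, err := os.ReadFile(src)
	if err != nil {
		return err
	}
	return os.WriteFile(dst, data, 0o644)
}

// demo builds a snapshot plus a stale working dir, reclones, and checks the result.
func demo() error {
	root, err := os.MkdirTemp("", "flatkv-reclone")
	if err != nil {
		return err
	}
	defer os.RemoveAll(root)
	snap, work := filepath.Join(root, "snapshot"), filepath.Join(root, "working")
	for _, d := range []string{snap, work} {
		if err := os.Mkdir(d, 0o755); err != nil {
			return err
		}
	}
	files := map[string]string{
		filepath.Join(snap, "000001.sst"):      "sst",
		filepath.Join(snap, "MANIFEST-000001"): "manifest",
		filepath.Join(work, "stale.sst"):       "newer data",
	}
	for p, body := range files {
		if err := os.WriteFile(p, []byte(body), 0o644); err != nil {
			return err
		}
	}
	if err := recloneWorkingDir(snap, work); err != nil {
		return err
	}
	if _, err := os.Stat(filepath.Join(work, "stale.sst")); !os.IsNotExist(err) {
		return fmt.Errorf("stale file survived the reclone")
	}
	_, err = os.Stat(filepath.Join(work, "000001.sst"))
	return err
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("error:", err)
		os.Exit(1)
	}
	fmt.Println("working dir recloned")
}
```

Doing the reclone before any WAL truncation would keep a failed rollback from leaving the store half-mutated, which is the ordering concern raised in the comment.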

s.pruneSnapshots(dir, version)

success = true
s.lastSnapshotTime = time.Now()

Check warning

Code scanning / CodeQL

Calling the system time Warning

Calling the system time may be a possible source of non-determinism

codecov bot commented Feb 24, 2026

Codecov Report

❌ Patch coverage is 63.18408% with 222 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.16%. Comparing base (ff1fcea) to head (8ea2bfe).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| sei-db/state_db/sc/flatkv/snapshot.go | 66.86% | 59 Missing and 50 partials ⚠️ |
| sei-db/state_db/sc/flatkv/store.go | 57.60% | 30 Missing and 23 partials ⚠️ |
| sei-db/state_db/sc/flatkv/store_catchup.go | 58.33% | 23 Missing and 17 partials ⚠️ |
| sei-db/state_db/sc/flatkv/store_write.go | 37.50% | 8 Missing and 7 partials ⚠️ |
| sei-db/state_db/sc/flatkv/store_lifecycle.go | 76.47% | 2 Missing and 2 partials ⚠️ |
| sei-db/state_db/sc/flatkv/store_read.go | 50.00% | 1 Missing ⚠️ |
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2972      +/-   ##
==========================================
+ Coverage   58.14%   58.16%   +0.01%     
==========================================
  Files        2111     2113       +2     
  Lines      173562   174071     +509     
==========================================
+ Hits       100924   101247     +323     
- Misses      63683    63780      +97     
- Partials     8955     9044      +89     
| Flag | Coverage Δ |
| --- | --- |
| sei-chain-pr | 65.85% <63.18%> (?) |
| sei-db | 69.50% <ø> (ø) |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files with missing lines | Coverage Δ |
| --- | --- |
| sei-db/db_engine/pebbledb/db.go | 93.50% <100.00%> (+0.35%) ⬆️ |
| sei-db/state_db/sc/flatkv/config.go | 100.00% <100.00%> (ø) |
| sei-db/state_db/sc/flatkv/iterator.go | 39.65% <100.00%> (+1.63%) ⬆️ |
| sei-db/state_db/sc/flatkv/keys.go | 100.00% <ø> (+4.70%) ⬆️ |
| sei-db/state_db/sc/flatkv/store_meta.go | 67.56% <100.00%> (ø) |
| sei-db/state_db/sc/flatkv/store_read.go | 58.27% <50.00%> (+1.51%) ⬆️ |
| sei-db/state_db/sc/flatkv/store_lifecycle.go | 52.38% <76.47%> (+3.99%) ⬆️ |
| sei-db/state_db/sc/flatkv/store_write.go | 68.75% <37.50%> (-3.13%) ⬇️ |
| sei-db/state_db/sc/flatkv/store_catchup.go | 58.33% <58.33%> (ø) |
| sei-db/state_db/sc/flatkv/store.go | 59.68% <57.60%> (-4.97%) ⬇️ |

... and 1 more

Comment on lines +58 to +60
// Checkpointable is an optional capability for DB engines that support
// efficient point-in-time snapshots via filesystem hardlinks.
type Checkpointable interface {
Contributor

Can you augment the godoc with information about concurrency? For example, is it safe to call this method concurrently with updates in other threads? When this method returns, is the checkpoint capable of surviving a host OS crash?

Contributor Author

Sure, added concurrency and durability documentation to the Checkpointable godoc.
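
For illustration only, here is the kind of godoc the reviewer is asking for, attached to a toy engine. The `Checkpoint(destDir string) error` signature, `toyDB`, and the stated guarantees are all assumptions that describe this sketch, not the wording that landed in the PR or the behaviour of PebbleDB.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// Checkpointable is an optional capability for DB engines that support
// point-in-time snapshots.
//
// Concurrency: Checkpoint may be called concurrently with writes from other
// goroutines. The checkpoint contains every write that completed before the
// call started; a write that races with the call may or may not be included.
//
// Durability: when Checkpoint returns nil, the checkpoint file and the
// directory entries leading to it have been fsynced, so the checkpoint is
// expected to survive a host OS crash.
type Checkpointable interface {
	Checkpoint(destDir string) error
}

// toyDB is a hypothetical in-memory engine that backs the godoc above.
type toyDB struct {
	mu   sync.Mutex
	data map[string]string
}

func (db *toyDB) Set(k, v string) {
	db.mu.Lock()
	defer db.mu.Unlock()
	db.data[k] = v
}

func (db *toyDB) Checkpoint(destDir string) error {
	db.mu.Lock() // writers wait, so the image is a consistent point in time
	defer db.mu.Unlock()
	if err := os.Mkdir(destDir, 0o755); err != nil {
		return fmt.Errorf("create checkpoint dir: %w", err)
	}
	f, err := os.Create(filepath.Join(destDir, "data"))
	if err != nil {
		return err
	}
	for k, v := range db.data {
		if _, err := fmt.Fprintf(f, "%s=%s\n", k, v); err != nil {
			f.Close()
			return err
		}
	}
	if err := f.Sync(); err != nil { // file contents reach stable storage
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	// fsync the new directory and its parent so both directory entries persist.
	for _, dir := range []string{destDir, filepath.Dir(destDir)} {
		if err := syncDir(dir); err != nil {
			return err
		}
	}
	return nil
}

func syncDir(dir string) error {
	d, err := os.Open(dir)
	if err != nil {
		return err
	}
	defer d.Close()
	return d.Sync()
}

// demo checkpoints, writes again, and verifies the checkpoint did not change.
func demo() error {
	root, err := os.MkdirTemp("", "flatkv-checkpoint")
	if err != nil {
		return err
	}
	defer os.RemoveAll(root)
	db := &toyDB{data: map[string]string{}}
	db.Set("a", "1")
	var cp Checkpointable = db
	dest := filepath.Join(root, "cp")
	if err := cp.Checkpoint(dest); err != nil {
		return err
	}
	db.Set("b", "2") // happens after the checkpoint, so it must not appear in it
	got, err := os.ReadFile(filepath.Join(dest, "data"))
	if err != nil {
		return err
	}
	if string(got) != "a=1\n" {
		return fmt.Errorf("unexpected checkpoint contents %q", got)
	}
	return nil
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("error:", err)
		os.Exit(1)
	}
	fmt.Println("checkpoint ok")
}
```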

Comment on lines +20 to +24
snapshotPrefix = "snapshot-"
snapshotDirLen = len(snapshotPrefix) + 20

currentLink = "current"
currentTmpLink = "current-tmp"
Contributor

Short descriptions for what each constant is for might be helpful.

Contributor

One way to document this, feel free to push back if you think it's too much overhead.

At past companies, when documenting this sort of file layout, I sometimes find it useful to do the following:

  • write a simple unit test that generates a basic file structure
  • pause the unit test before it deletes its data
  • run tree on the directory created by the test
  • edit the result for readability, then copy-paste it somewhere

Here's an example: https://github.com/Layr-Labs/eigenda/blob/master/litt/docs/filesystem_layout.md

If the file layout is too big, it might make sense to split it out into a markdown file that you can just reference in the godoc.

Contributor Author

Good suggestion. I'm adding inline godoc with the directory layout (ASCII tree) above the constants in snapshot.go. This covers the logical structure (which directories exist and what they store). If we add more complexity later (e.g. sharded storage), we may revisit and split it out into a dedicated doc.
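
As a small illustration of what the constants under discussion imply, the sketch below formats and parses snapshot directory names. The constants are copied from the diff; the zero-padded 20-digit decimal encoding and the helpers `snapshotName` and `parseSnapshotName` are assumptions inferred from `snapshotDirLen = len(snapshotPrefix) + 20`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const (
	// snapshotPrefix starts every snapshot directory name.
	snapshotPrefix = "snapshot-"
	// snapshotDirLen is the exact length of a valid snapshot directory name:
	// the prefix plus a 20-character version field.
	snapshotDirLen = len(snapshotPrefix) + 20
)

// snapshotName renders version as a fixed-width name (assumed encoding).
// Zero padding makes lexical order match numeric order.
func snapshotName(version int64) string {
	return fmt.Sprintf("%s%020d", snapshotPrefix, version)
}

// parseSnapshotName extracts the version, rejecting anything that is not
// exactly prefix + 20 decimal digits.
func parseSnapshotName(name string) (int64, error) {
	if len(name) != snapshotDirLen || !strings.HasPrefix(name, snapshotPrefix) {
		return 0, fmt.Errorf("not a snapshot directory: %q", name)
	}
	digits := name[len(snapshotPrefix):]
	for _, c := range digits {
		if c < '0' || c > '9' {
			return 0, fmt.Errorf("non-digit in snapshot version: %q", name)
		}
	}
	v, err := strconv.ParseInt(digits, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("parse snapshot version: %w", err)
	}
	return v, nil
}

func main() {
	name := snapshotName(42)
	v, err := parseSnapshotName(name)
	fmt.Println(name, v, err)
}
```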

Contributor

Yeah, I think that kind of layout document might help people understand FlatKV better, but I agree we can work on a README for FlatKV as a whole and include it there in the future.

// not a full path.
func updateCurrentSymlink(root, snapshotDir string) error {
	tmpPath := filepath.Join(root, currentTmpLink)
	_ = os.Remove(tmpPath)
Contributor

What if this removal fails due to invalid file permissions? I'm guessing you aren't checking the error in case it fails due to the path not existing. Perhaps this can first check if the file exists, delete it if it exists, and then return an error if that deletion fails.

Contributor Author

adding error handling here
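
One way the error handling could look, as a sketch rather than the code that was merged. Only the function signature, the two constants, and the `os.Remove(tmpPath)` call come from the diff; the symlink-then-rename body is an assumption based on the "current" and "current-tmp" names and the PR's mention of atomic operations.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

const (
	currentLink    = "current"
	currentTmpLink = "current-tmp"
)

// updateCurrentSymlink points root/current at snapshotDir, which is a name
// relative to root, not a full path.
func updateCurrentSymlink(root, snapshotDir string) error {
	tmpPath := filepath.Join(root, currentTmpLink)
	// A leftover temp link from an interrupted run is removed. A missing one
	// is fine; any other failure (e.g. permissions) is reported.
	if err := os.Remove(tmpPath); err != nil && !errors.Is(err, fs.ErrNotExist) {
		return fmt.Errorf("remove stale %s: %w", currentTmpLink, err)
	}
	if err := os.Symlink(snapshotDir, tmpPath); err != nil {
		return fmt.Errorf("create temp symlink: %w", err)
	}
	// Renaming over the old link replaces it in a single step on POSIX systems.
	if err := os.Rename(tmpPath, filepath.Join(root, currentLink)); err != nil {
		return fmt.Errorf("swap %s symlink: %w", currentLink, err)
	}
	return nil
}

// demo repoints "current" twice and checks the link target each time.
// Directory names are shortened for the demo.
func demo() error {
	root, err := os.MkdirTemp("", "flatkv-symlink")
	if err != nil {
		return err
	}
	defer os.RemoveAll(root)
	for _, name := range []string{"snapshot-1", "snapshot-2"} {
		if err := os.Mkdir(filepath.Join(root, name), 0o755); err != nil {
			return err
		}
		if err := updateCurrentSymlink(root, name); err != nil {
			return err
		}
		target, err := os.Readlink(filepath.Join(root, currentLink))
		if err != nil {
			return err
		}
		if target != name {
			return fmt.Errorf("current -> %q, want %q", target, name)
		}
	}
	return nil
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("error:", err)
		os.Exit(1)
	}
	fmt.Println("current symlink updated")
}
```

Checking the error with `errors.Is(err, fs.ErrNotExist)` instead of a separate existence check avoids a window between the check and the delete.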

snapshot-interval = {{ .StateCommit.FlatKVConfig.SnapshotInterval }}

# SnapshotKeepRecent defines how many old snapshots to keep besides the latest one.
# 0 = keep only the current snapshot. Default: 1.
Contributor

We probably need more than 1 snapshot here in order to do state export

Contributor Author

I see. Bumping the default to 2.
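
To make the retention semantics concrete, a small sketch under the documented rule that `SnapshotKeepRecent` counts old snapshots kept in addition to the latest one. `snapshotsToPrune` is a hypothetical helper, not the PR's `pruneSnapshots`.

```go
package main

import "fmt"

// snapshotsToPrune returns the versions that fall outside the retention
// window. versions must be sorted ascending. The latest snapshot is always
// kept, plus keepRecent older ones.
func snapshotsToPrune(versions []int64, keepRecent int) []int64 {
	if keepRecent < 0 {
		keepRecent = 0 // treat negative values like 0 in this sketch
	}
	keep := keepRecent + 1
	if len(versions) <= keep {
		return nil
	}
	// Copy so callers cannot alias the input slice.
	return append([]int64(nil), versions[:len(versions)-keep]...)
}

func main() {
	versions := []int64{100, 200, 300, 400}
	fmt.Println(snapshotsToPrune(versions, 2)) // default after this review
	fmt.Println(snapshotsToPrune(versions, 0)) // keep only the current snapshot
}
```

With four snapshots and a value of 2, only the oldest snapshot would be pruned, leaving two older snapshots available alongside the latest.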

Comment on lines +85 to +87
# SnapshotMinTimeInterval is the minimum wall-clock seconds between consecutive
# auto-snapshots. Prevents dense snapshots during catch-up. Default: 3600 (1 hour).
snapshot-min-time-interval = {{ .StateCommit.FlatKVConfig.SnapshotMinTimeInterval }}
Contributor

I think this is a memiavl specific optimization, and not needed by flatKV, since FlatKV snapshot is so cheap, would rather remove the config for simplicity

Contributor Author

agree. I removed snapshot-min-time-interval from FlatKV config/template/code, and now auto-snapshot is triggered only by snapshot-interval.

if replayed > 0 {
	if !s.config.Fsync {
		if err := s.flushAllDBs(); err != nil {
Contributor

Why do we flush here?

Contributor Author

During catch-up with NoSync, per-entry batch commits may still be in the OS page cache. If we advance the global metadata immediately, a crash can leave the global watermark ahead of the durable data.

So we do a single flush before committing global metadata to preserve durability ordering. I added this rationale as an inline comment.
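
The ordering argument can be sketched with fakes that record the sequence of operations. The `dataDB` and `metaDB` interfaces, the `catchup` function, and the DB names are hypothetical stand-ins; only the rule "flush once after replay when per-write fsync is off, then advance global metadata" comes from the thread.

```go
package main

import "fmt"

// dataDB and metaDB are hypothetical stand-ins for the per-domain data DBs
// and the global metadata store.
type dataDB interface {
	ApplyNoSync(version int64) error
	Flush() error
}

type metaDB interface {
	SetCommittedVersion(version int64) error
}

// catchup replays versions into every data DB, then makes the data durable
// before the global watermark moves, so a crash never leaves the watermark
// ahead of the data.
func catchup(dbs []dataDB, meta metaDB, versions []int64, fsync bool) error {
	replayed := 0
	for _, v := range versions {
		for _, db := range dbs {
			if err := db.ApplyNoSync(v); err != nil {
				return fmt.Errorf("replay version %d: %w", v, err)
			}
		}
		replayed++
	}
	if replayed == 0 {
		return nil
	}
	if !fsync {
		// Per-entry commits may still sit in the page cache: flush once.
		for _, db := range dbs {
			if err := db.Flush(); err != nil {
				return fmt.Errorf("flush before metadata commit: %w", err)
			}
		}
	}
	return meta.SetCommittedVersion(versions[len(versions)-1])
}

// recorder collects the order in which the fakes are called.
type recorder struct{ log []string }

type fakeDB struct {
	name string
	rec  *recorder
}

func (d fakeDB) ApplyNoSync(v int64) error {
	d.rec.log = append(d.rec.log, fmt.Sprintf("apply:%s:%d", d.name, v))
	return nil
}

func (d fakeDB) Flush() error {
	d.rec.log = append(d.rec.log, "flush:"+d.name)
	return nil
}

type fakeMeta struct{ rec *recorder }

func (m fakeMeta) SetCommittedVersion(v int64) error {
	m.rec.log = append(m.rec.log, fmt.Sprintf("meta:%d", v))
	return nil
}

// demoLog replays two versions with fsync disabled and returns the call order.
func demoLog() []string {
	rec := &recorder{}
	dbs := []dataDB{fakeDB{"account", rec}, fakeDB{"legacy", rec}}
	if err := catchup(dbs, fakeMeta{rec}, []int64{11, 12}, false); err != nil {
		panic(err)
	}
	return rec.log
}

func main() {
	for _, step := range demoLog() {
		fmt.Println(step)
	}
}
```

In the recorded sequence every flush precedes the metadata write, which is the invariant the author describes.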

}

// flushAllDBs flushes all data DBs to ensure data is on disk.
func (s *CommitStore) flushAllDBs() error {
Contributor

I think the only place where we need to call flush is right before we take a new snapshot?

Contributor Author

See my comment above.

@blindchaser blindchaser enabled auto-merge (squash) February 27, 2026 18:25
@blindchaser blindchaser merged commit 3990ce9 into main Feb 27, 2026
38 checks passed
@blindchaser blindchaser deleted the yiren/flatkv-snapshot branch February 27, 2026 18:34
yzang2019 added a commit that referenced this pull request Feb 27, 2026
* main:
  ERC20 simulation benchmark (#2979)
  feat(flatkv): add snapshot, WAL catchup, and rollback support (#2972)
yzang2019 pushed a commit that referenced this pull request Feb 27, 2026

3 participants