feat(flatkv): add snapshot, WAL catchup, and rollback support #2972
blindchaser merged 25 commits into main
Conversation
- Use getAccountValue (cache-aware) instead of getAccountValueFromDB for LtHash old-value reads, fixing incorrect deltas across multiple ApplyChangeSets calls before Commit
- Clear pending write buffers in open() to prevent stale writes after LoadVersion
- Fix misleading "always sync" comment on WAL writes (NoSync=true)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bccc292045
```go
if err := s.open(); err != nil {
	return fmt.Errorf("open for rollback: %w", err)
}
```
Reclone baseline snapshot before rollback catchup
When Rollback updates current and immediately calls open, it does not invalidate working/SNAPSHOT_BASE, so after a restart (where working was already cloned from that same snapshot) createWorkingDir reuses a working DB that may already be at a higher version than targetVersion. In that case catchup skips all entries (entry.Version <= committedVersion), rollback fails with a version mismatch, and this can happen after the WAL has already been truncated, leaving rollback in a partially-mutated state. Force a fresh clone of working from the selected snapshot before opening during rollback (like LoadVersion(target>0) already does).
Codecov Report

```
@@           Coverage Diff            @@
##             main    #2972    +/-   ##
==========================================
+ Coverage   58.14%   58.16%   +0.01%
==========================================
  Files        2111     2113       +2
  Lines      173562   174071     +509
==========================================
+ Hits       100924   101247     +323
- Misses      63683    63780      +97
- Partials     8955     9044      +89
==========================================
```
```go
// Checkpointable is an optional capability for DB engines that support
// efficient point-in-time snapshots via filesystem hardlinks.
type Checkpointable interface {
```
Can you augment the godoc with information about concurrency? For example, is it safe to call this method concurrently with updates in other threads? When this method returns, is the checkpoint capable of surviving a host OS crash?
sure, added concurrency and durability documentation to the Checkpointable godoc
```go
snapshotPrefix = "snapshot-"
snapshotDirLen = len(snapshotPrefix) + 20

currentLink    = "current"
currentTmpLink = "current-tmp"
```
Short descriptions for what each constant is for might be helpful.
One way to document this, feel free to push back if you think it's too much overhead.
At past companies, when documenting this sort of file layout, I sometimes find it useful to do the following:
- write a simple unit test that generates a basic file structure
- pause the unit test before it deletes its data
- run `tree` on the directory created by the test
- edit the result for readability and copy-paste it somewhere

Here's an example: https://github.com/Layr-Labs/eigenda/blob/master/litt/docs/filesystem_layout.md
If the file layout is too big, it might make sense to split it out into a markdown file that you can just reference in the godoc.
good suggestion. adding inline godoc with the directory layout (ASCII tree) above the constants in snapshot.go. this covers the logical structure (which directories exist and what they store). If we add more complexity later (e.g. sharded storage), we may revisit and split it out into a dedicated doc.
Yeah, I think that layer document might help people understand FlatKV better, but agree we can work on a ReadMe of this whole FlatKV and include there in the future
```go
// not a full path.
func updateCurrentSymlink(root, snapshotDir string) error {
	tmpPath := filepath.Join(root, currentTmpLink)
	_ = os.Remove(tmpPath)
```
What if this removal fails due to invalid file permissions? I'm guessing you aren't checking the error in case it fails due to the path not existing. Perhaps this can first check if the file exists, delete it if it exists, and then return an error if that deletion fails.
adding error handling here
sei-db/config/toml.go
```toml
snapshot-interval = {{ .StateCommit.FlatKVConfig.SnapshotInterval }}

# SnapshotKeepRecent defines how many old snapshots to keep besides the latest one.
# 0 = keep only the current snapshot. Default: 1.
```
We probably need more than 1 snapshot here in order to do state export
i see. bumped the default to 2
sei-db/config/toml.go
```toml
# SnapshotMinTimeInterval is the minimum wall-clock seconds between consecutive
# auto-snapshots. Prevents dense snapshots during catch-up. Default: 3600 (1 hour).
snapshot-min-time-interval = {{ .StateCommit.FlatKVConfig.SnapshotMinTimeInterval }}
```
I think this is a memiavl-specific optimization and not needed by FlatKV; since FlatKV snapshots are so cheap, I would rather remove the config for simplicity.
agree. I removed snapshot-min-time-interval from FlatKV config/template/code, and now auto-snapshot is triggered only by snapshot-interval.
```go
if replayed > 0 {
	if !s.config.Fsync {
		if err := s.flushAllDBs(); err != nil {
```
during catch-up with NoSync, per-entry batch commits may still be in OS/page cache. If we advance global metadata immediately, a crash can leave the global watermark ahead of durable data.
so we do a single flush before committing global metadata to preserve durability ordering. I added this rationale as an inline comment.
```go
}

// flushAllDBs flushes all data DBs to ensure data is on disk.
func (s *CommitStore) flushAllDBs() error {
```
I think the only place where we need to call flush is right before we take a new snapshot?
## Describe your changes and provide context

Introduce a snapshot-based lifecycle for FlatKV so that restarts replay only from the nearest PebbleDB checkpoint instead of the full WAL.

Key changes:
- Snapshot management: immutable PebbleDB checkpoints created via Checkpoint(), managed through a "current" symlink and atomic directory operations. Configurable interval, retention, and minimum time between snapshots.
- Working directory: mutable clone of the baseline snapshot (hardlinks for .sst files) so writes never mutate snapshot dirs.
- WAL catchup: on open, replay changelog entries from the snapshot version to the target version using O(1) arithmetic + O(log N) binary search for offset resolution.
- Rollback: rewind to the best snapshot <= target, truncate WAL, replay to exact version, and prune future snapshots.
- File lock: prevent concurrent access from multiple processes.
- Migration: automatically move pre-snapshot flat layout into a versioned snapshot directory on first open.
- Auto WAL truncation: periodically discard WAL entries older than the earliest retained snapshot.
- Fix account LtHash baseline capture to use pre-batch state when multiple ApplyChangeSets calls precede a single Commit.
- Add legacyDB to flushAllDBs.
- Mark Iterator/IteratorByPrefix as EXPERIMENTAL (unused in production).

## Testing performed to validate your change