Populating a database key by key pays the insert path on every key: descend, split the leaf, post a separator, split interior nodes. For a cold load of a sorted dump that work is wasted, because the final tree shape is known up front. Add db.Load(next), a streaming pull API that builds the tree bottom-up. A new optional engine.BulkLoader capability carries the build; the B-tree core packs each leaf once, links the B-link sibling chain as it goes, and stacks the interior levels on top, writing every page exactly once instead of O(n log n) splits. Loading into a non-empty database, or an engine without the capability, falls back to chunked WriteBatch commits, which accept any key order. The fast path writes pages straight through the pager and makes the build durable with one checkpoint after it returns. The oracle is advanced only on success and before the checkpoint, so a failed or crashed load leaves the database empty: the load is atomic at the checkpoint boundary. It requires strictly ascending keys, which is exactly the order kv dump emits, so dump | load round-trips by construction. kv load now drives db.Load directly, scanning JSONL one line at a time so a huge import streams rather than buffering.
Populating a database one key at a time pays the insert path on every key: descend from the root, split the leaf when it fills, post a separator up the spine, split interior nodes. For a cold load of a sorted dump that work is wasted, because the final tree shape is knowable up front. This adds the bottom-up fast path that spec 15 §6 names, the engine behind kv dump | kv load.

API
db.Load(next) (and the public kv.DB.Load) takes a pull function that delivers key/value pairs in ascending order, and returns the commit version at which the data becomes visible. The pull shape keeps memory flat: the loader consumes one pair at a time, so a multi-gigabyte dump never lands in memory at once.

The build
A new optional engine.BulkLoader capability carries the bottom-up build, so an engine that can do it implements it and one that cannot is still complete. The B-tree core packs each leaf once, links the B-link sibling chain as it goes, and stacks the interior levels on top. Every page is written once: O(n) instead of the O(n log n) the insert path pays on splits. A TestBulkLoadMatchesInsert test asserts that the bottom-up tree resolves every key identically to a tree built by inserting the same keys one at a time.

Durability
The loader writes pages straight through the pager, bypassing the per-batch WAL, which is why the fast path is restricted to an empty database (no prior state to protect, and per-key logging would defeat the streaming).
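The restriction above implies a gate in front of the fast path: the engine must offer the capability and the database must be empty. A minimal Go sketch of such a gate, using a type assertion to detect the optional interface, might look like the following. `Engine`, `Empty`, `BulkLoad`, and `choosePath` are hypothetical names chosen for the example, not the PR's actual API.

```go
package main

import "fmt"

// Engine is a hypothetical minimal engine interface for this sketch.
type Engine interface {
	Empty() bool
}

// BulkLoader stands in for the optional engine.BulkLoader capability.
// The method name and signature are assumptions.
type BulkLoader interface {
	BulkLoad(next func() (key, value []byte, ok bool, err error)) error
}

// btreeEngine implements both Engine and BulkLoader.
type btreeEngine struct{ keys int }

func (e *btreeEngine) Empty() bool { return e.keys == 0 }

func (e *btreeEngine) BulkLoad(next func() (key, value []byte, ok bool, err error)) error {
	for {
		_, _, ok, err := next()
		if err != nil {
			return err
		}
		if !ok {
			return nil
		}
		e.keys++ // a real engine would pack the pair into a page here
	}
}

// plainEngine lacks the capability but is still a complete Engine.
type plainEngine struct{ keys int }

func (e *plainEngine) Empty() bool { return e.keys == 0 }

// choosePath reports which path a load would take for this engine:
// the fast path needs both the capability and an empty database.
func choosePath(e Engine) string {
	if _, ok := e.(BulkLoader); ok && e.Empty() {
		return "fast"
	}
	return "fallback"
}

func main() {
	fmt.Println(choosePath(&btreeEngine{}))        // capable and empty
	fmt.Println(choosePath(&btreeEngine{keys: 3})) // capable but not empty
	fmt.Println(choosePath(&plainEngine{}))        // no capability
}
```

Because the capability is a separate interface, an engine opts in simply by implementing the method; nothing in `Engine` itself has to change.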
db.Load makes the build durable with a single checkpoint after the build returns. The oracle is advanced to the load version only on success and before the checkpoint, so a failed or crashed load leaves the database empty: the half-built pages live only in the buffer pool and are dropped, and recovery comes back to the empty image. The load is atomic at the checkpoint boundary, the same guarantee a transaction gives by a different route.

Two paths, one call
When the engine supports the capability and the database is empty, Load takes the fast path. Otherwise it falls back to chunked WriteBatch commits, which accept any key order and merge into existing data. The fast path requires strictly ascending keys and reports a violation rather than building a mis-ordered tree; a dump emits ascending keys, so dump | load round-trips by construction.

CLI
kv load now drives db.Load directly, scanning JSONL one line at a time so a huge import streams. kv dump is unchanged. The existing dump-load round-trip test now exercises the fast path end to end through the binary.

Implementation notes: ~/notes/Spec/2059/implementation/31-bulk-load.md.

go build, gofmt -l, go vet, and go test -race ./... all clean.