Add a bottom-up bulk-load fast path #30

Merged: tamnd merged 1 commit into main from bulk-load on Jun 20, 2026

@tamnd (Owner) commented on Jun 20, 2026

Populating a database one key at a time pays the full cost of the insert path on every key: descend from the root, split the leaf when it fills, post a separator up the spine, split interior nodes. For a cold load of a sorted dump that work is wasted, because the final tree shape is knowable up front. This PR adds the bottom-up fast path named in spec 15 §6, the engine behind kv dump | kv load.

API

db.Load(next) (and the public kv.DB.Load) takes a pull function delivering key/value pairs in ascending order and returns the commit version the data is visible at. The pull shape keeps memory flat: the loader consumes one pair at a time, so a multi-gigabyte dump never lands in memory at once.
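A minimal sketch of what a pull-shaped source can look like. The `PullFunc` signature, `sliceSource`, and `consume` are assumptions made for illustration; the actual signature of db.Load's argument is not shown in this description.

```go
package main

import (
	"errors"
	"fmt"
)

// PullFunc is an assumed shape for the pull function: each call returns the
// next pair in ascending key order, and ok is false once the source is done.
type PullFunc func() (key, value []byte, ok bool, err error)

// sliceSource adapts an already sorted in-memory slice to a PullFunc.
func sliceSource(pairs [][2]string) PullFunc {
	i := 0
	return func() ([]byte, []byte, bool, error) {
		if i >= len(pairs) {
			return nil, nil, false, nil
		}
		p := pairs[i]
		i++
		return []byte(p[0]), []byte(p[1]), true, nil
	}
}

// consume stands in for a loader: it holds one pair at a time, so memory use
// does not grow with the size of the input.
func consume(next PullFunc) (int, error) {
	n := 0
	for {
		k, _, ok, err := next()
		if err != nil {
			return n, err
		}
		if !ok {
			return n, nil
		}
		if len(k) == 0 {
			return n, errors.New("empty key")
		}
		n++
	}
}

func main() {
	n, err := consume(sliceSource([][2]string{{"a", "1"}, {"b", "2"}, {"c", "3"}}))
	fmt.Println(n, err) // prints "3 <nil>"
}
```

Because the consumer asks for the next pair only when it is ready for it, the producer can sit on top of a file, a pipe, or a network stream without an intermediate buffer.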

The build

A new optional engine.BulkLoader capability carries the bottom-up build: an engine that supports it implements the capability, and one that does not is still a complete engine. The B-tree core:

  • packs cells into a leaf until the page would overflow, sealing at a user-key boundary so a version group is never split across leaves;
  • defers each sealed leaf by one step so its B-link right-sibling pointer is set before it is written;
  • stacks interior levels from the leaf separators, promoting the separator between two full nodes up to the parent, until a single root remains.
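The three steps above can be sketched on a toy in-memory tree. Everything here (`page`, `entry`, `builder`, fixed entry counts in place of byte-sized pages, string keys without version groups) is a simplification assumed for illustration, not the PR's code; it only shows the one-step deferral that lets each node be stored with its right-sibling pointer already set, and the stacking of levels until one root remains. It assumes leafCap >= 1 and fanout >= 2.

```go
package main

import "fmt"

type page struct {
	leaf     bool
	keys     []string // leaf: user keys; interior: separators
	children []int    // interior only: child page ids
	right    int      // B-link right sibling, -1 at the end of a level
}

// entry is what a finished node hands to its parent: its id and smallest key.
type entry struct {
	id  int
	min string
}

type builder struct {
	pages  map[int]page // stands in for the pager; each id is stored once
	nextID int
	pend   *page // sealed node waiting for its right sibling's id
	pendID int
}

// flush writes the pending node now that its right sibling id is known.
func (b *builder) flush(right int) {
	if b.pend == nil {
		return
	}
	b.pend.right = right
	b.pages[b.pendID] = *b.pend
	b.pend = nil
}

// seal assigns p an id, writes the previous node pointing at it, and defers p.
func (b *builder) seal(p *page) int {
	id := b.nextID
	b.nextID++
	b.flush(id)
	b.pend, b.pendID = p, id
	return id
}

// level packs items into nodes of at most n and returns one entry per node.
func (b *builder) level(items []entry, n int, leaf bool) []entry {
	var out []entry
	for s := 0; s < len(items); s += n {
		e := s + n
		if e > len(items) {
			e = len(items)
		}
		p := &page{leaf: leaf}
		for i, it := range items[s:e] {
			if leaf {
				p.keys = append(p.keys, it.min)
			} else {
				p.children = append(p.children, it.id)
				if i > 0 {
					p.keys = append(p.keys, it.min) // separator between children
				}
			}
		}
		out = append(out, entry{b.seal(p), items[s].min})
	}
	b.flush(-1) // the last node of a level has no right sibling
	return out
}

// build returns the root page id, or -1 for empty input.
func (b *builder) build(keys []string, leafCap, fanout int) int {
	items := make([]entry, len(keys))
	for i, k := range keys {
		items[i] = entry{min: k}
	}
	lvl := b.level(items, leafCap, true)
	for len(lvl) > 1 {
		lvl = b.level(lvl, fanout, false)
	}
	if len(lvl) == 0 {
		return -1
	}
	return lvl[0].id
}

// scan walks the leaf level left to right along the B-link chain from page 0.
func (b *builder) scan() []string {
	var out []string
	for id := 0; id >= 0 && id < b.nextID; id = b.pages[id].right {
		out = append(out, b.pages[id].keys...)
	}
	return out
}

func main() {
	b := &builder{pages: map[int]page{}}
	root := b.build([]string{"a", "b", "c", "d", "e", "f", "g"}, 2, 2)
	fmt.Println(root, len(b.pages), b.scan()) // prints "6 7 [a b c d e f g]"
}
```

In this toy, the smallest key of every node after the first in a group becomes a separator in the parent, and the smallest key of a whole group is what gets promoted one level further up. A real build would also have to seal on page-size and version-group boundaries, which the sketch leaves out.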

Every page is written once, so the build does O(n) work, instead of the O(n log n) the insert path pays for a root-to-leaf descent on every key plus page rewrites on splits. A TestBulkLoadMatchesInsert test asserts the bottom-up tree resolves every key identically to the same keys inserted one at a time.

Durability

The loader writes pages straight through the pager, bypassing the per-batch WAL, which is why the fast path is restricted to an empty database (no prior state to protect, and per-key logging would defeat the streaming). db.Load makes the build durable with a single checkpoint after the build returns. The oracle is advanced to the load version only on success and before the checkpoint, so a failed or crashed load leaves the database empty: the half-built pages live only in the buffer pool and are dropped, and recovery comes back to the empty image. The load is atomic at the checkpoint boundary, the same guarantee a transaction gives by a different route.

Two paths, one call

When the engine supports the capability and the database is empty, Load takes the fast path. Otherwise it falls back to chunked WriteBatch commits, which accept any key order and merge into existing data. The fast path requires strictly ascending keys and reports a violation rather than building a mis-ordered tree; a dump emits ascending keys, so dump | load round-trips by construction.
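A sketch of how an optional capability and the ordering check might fit together in Go. Only the name BulkLoader comes from this description; the `Engine` interface, the method names `Empty` and `BulkLoad`, the `PullFunc` shape, `ErrNotAscending`, and the toy engines are assumptions for illustration.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

type PullFunc func() (key, value []byte, ok bool, err error)

// Engine is the assumed base interface; BulkLoader is the optional capability.
type Engine interface{ Empty() bool }

type BulkLoader interface{ BulkLoad(next PullFunc) error }

var ErrNotAscending = errors.New("bulk load: keys are not strictly ascending")

// ascending wraps next and fails on the first key that is not strictly
// greater than its predecessor.
func ascending(next PullFunc) PullFunc {
	var prev []byte
	first := true
	return func() ([]byte, []byte, bool, error) {
		k, v, ok, err := next()
		if err != nil || !ok {
			return nil, nil, false, err
		}
		if !first && bytes.Compare(k, prev) <= 0 {
			return nil, nil, false, ErrNotAscending
		}
		prev, first = append(prev[:0], k...), false
		return k, v, true, nil
	}
}

// btree is a toy engine that implements the capability by draining the source.
type btree struct{ empty bool }

func (t *btree) Empty() bool { return t.empty }

func (t *btree) BulkLoad(next PullFunc) error {
	next = ascending(next)
	for {
		if _, _, ok, err := next(); err != nil || !ok {
			return err
		}
	}
}

// plain is a toy engine without the capability.
type plain struct{}

func (plain) Empty() bool { return true }

// choosePath picks the fast path only when both conditions hold.
func choosePath(e Engine) string {
	if _, ok := e.(BulkLoader); ok && e.Empty() {
		return "fast"
	}
	return "fallback"
}

// fromKeys is a small helper source for the example.
func fromKeys(keys ...string) PullFunc {
	i := 0
	return func() ([]byte, []byte, bool, error) {
		if i == len(keys) {
			return nil, nil, false, nil
		}
		i++
		return []byte(keys[i-1]), nil, true, nil
	}
}

func main() {
	fmt.Println(choosePath(&btree{empty: true}), choosePath(&btree{}), choosePath(plain{}))
	fmt.Println((&btree{empty: true}).BulkLoad(fromKeys("a", "b", "b")))
}
```

The type assertion keeps the capability out of the base interface, so an engine that never implements bulk loading needs no stub method. A duplicate key counts as a violation here because the check is strict.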

CLI

kv load now drives db.Load directly, scanning JSONL one line at a time so a huge import streams. kv dump is unchanged. The existing dump-load round-trip test now exercises the fast path end to end through the binary.

Implementation notes: ~/notes/Spec/2059/implementation/31-bulk-load.md.

go build, gofmt -l, go vet, and go test -race ./... all clean.

Populating a database key by key pays the insert path on every key: descend,
split the leaf, post a separator, split interior nodes. For a cold load of a
sorted dump that work is wasted, because the final tree shape is known up front.

Add db.Load(next), a streaming pull API that builds the tree bottom-up. A new
optional engine.BulkLoader capability carries the build; the B-tree core packs
each leaf once, links the B-link sibling chain as it goes, and stacks the
interior levels on top, writing every page exactly once instead of O(n log n)
splits. Loading into a non-empty database, or an engine without the capability,
falls back to chunked WriteBatch commits, which accept any key order.

The fast path writes pages straight through the pager and makes the build
durable with one checkpoint after it returns. The oracle is advanced only on
success and before the checkpoint, so a failed or crashed load leaves the
database empty: the load is atomic at the checkpoint boundary. It requires
strictly ascending keys, which is exactly the order kv dump emits, so dump |
load round-trips by construction.

kv load now drives db.Load directly, scanning JSONL one line at a time so a
huge import streams rather than buffering.
@tamnd merged commit a3f7fbd into main on Jun 20, 2026
1 check passed
@tamnd deleted the bulk-load branch on June 20, 2026 at 14:18