Add a bottom-up bulk-load fast path by tamnd · Pull Request #30 · tamnd/kv

tamnd · 2026-06-20T14:17:16Z

Populating a database one key at a time pays the insert path on every key: descend from the root, split the leaf when it fills, post a separator up the spine, split interior nodes. For a cold load of a sorted dump that work is wasted, because the final tree shape is knowable up front. This adds the bottom-up fast path that spec 15 §6 names, which is the engine behind kv dump | kv load.

API

db.Load(next) (and the public kv.DB.Load) takes a pull function delivering key/value pairs in ascending order and returns the commit version the data is visible at. The pull shape keeps memory flat: the loader consumes one pair at a time, so a multi-gigabyte dump never lands in memory at once.
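
To make the pull shape concrete, here is a minimal sketch of how a caller might adapt a sorted source to a pull function. The Pair type, the NextFunc signature, and the drain helper are assumptions for illustration; the actual signature that db.Load accepts is not shown in this description.

```go
package main

import "fmt"

// Pair is a hypothetical key/value record; the real type is not shown in this PR.
type Pair struct {
	Key, Value []byte
}

// NextFunc is an assumed pull signature: it returns the next pair in
// ascending key order, and ok=false once the source is exhausted.
type NextFunc func() (p Pair, ok bool, err error)

// fromSlice adapts an already-sorted slice to the pull shape.
func fromSlice(pairs []Pair) NextFunc {
	i := 0
	return func() (Pair, bool, error) {
		if i >= len(pairs) {
			return Pair{}, false, nil
		}
		p := pairs[i]
		i++
		return p, true, nil
	}
}

// drain stands in for the loader: it holds one pair at a time and
// returns how many pairs it consumed.
func drain(next NextFunc) (int, error) {
	n := 0
	for {
		_, ok, err := next()
		if err != nil {
			return n, err
		}
		if !ok {
			return n, nil
		}
		n++
	}
}

func main() {
	next := fromSlice([]Pair{
		{Key: []byte("a"), Value: []byte("1")},
		{Key: []byte("b"), Value: []byte("2")},
	})
	n, err := drain(next)
	fmt.Println(n, err)
}
```

Because the consumer only ever asks for the next pair, the producer decides how much to buffer; a file-backed producer can keep a single line in memory.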

The build

A new optional engine.BulkLoader capability carries the bottom-up build, so an engine that can do it implements it and one that cannot is still complete. The B-tree core:

packs cells into a leaf until the page would overflow, sealing at a user-key boundary so a version group is never split across leaves;
defers each sealed leaf by one step so its B-link right-sibling pointer is set before it is written;
stacks interior levels from the leaf separators, promoting the separator between two full nodes up to the parent, until a single root remains.
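
The first two steps can be sketched as follows. This is a simplified in-memory model, not the code in this PR: Cell, Leaf, and packLeaves are hypothetical names, page IDs are handed out sequentially, and "writing" a leaf just appends it to a slice. It shows sealing only at user-key boundaries and holding each sealed leaf back until its right sibling's ID is known.

```go
package main

import "fmt"

// Cell is a hypothetical encoded entry: one version of one user key.
type Cell struct {
	UserKey string
	Size    int // encoded size in bytes
}

// Leaf is a simplified leaf page.
type Leaf struct {
	ID    int
	Right int // B-link right-sibling page ID; 0 means none
	Cells []Cell
}

// packLeaves packs whole version groups into leaves of at most capacity
// bytes and returns the leaves in the order they would be written.
// A group larger than capacity gets a leaf to itself in this sketch;
// a real engine would need an overflow strategy.
func packLeaves(cells []Cell, capacity int) []Leaf {
	var written []Leaf
	var pending *Leaf // sealed, but held back until its right sibling exists
	nextID := 1
	cur := &Leaf{ID: nextID}
	used := 0

	seal := func() {
		if pending != nil {
			pending.Right = cur.ID // sibling is now known: link, then write
			written = append(written, *pending)
		}
		pending = cur
		nextID++
		cur = &Leaf{ID: nextID}
		used = 0
	}

	for i := 0; i < len(cells); {
		// Find the version group [i, j): consecutive cells of one user key.
		j, groupSize := i, 0
		for j < len(cells) && cells[j].UserKey == cells[i].UserKey {
			groupSize += cells[j].Size
			j++
		}
		if used > 0 && used+groupSize > capacity {
			seal() // the group would overflow: seal at the key boundary
		}
		cur.Cells = append(cur.Cells, cells[i:j]...)
		used += groupSize
		i = j
	}
	if len(cur.Cells) > 0 {
		seal()
	}
	if pending != nil {
		written = append(written, *pending) // last leaf has no right sibling
	}
	return written
}

func main() {
	cells := []Cell{{"a", 40}, {"a", 40}, {"b", 40}, {"c", 40}, {"c", 40}}
	for _, l := range packLeaves(cells, 100) {
		fmt.Printf("leaf %d -> right %d, %d cells\n", l.ID, l.Right, len(l.Cells))
	}
}
```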

Every page is written exactly once, so the build costs O(n) page writes instead of the O(n log n) work the insert path pays across descents and splits. TestBulkLoadMatchesInsert asserts that the bottom-up tree resolves every key identically to a tree built by inserting the same keys one at a time.

Durability

The loader writes pages straight through the pager, bypassing the per-batch WAL, which is why the fast path is restricted to an empty database (no prior state to protect, and per-key logging would defeat the streaming). db.Load makes the build durable with a single checkpoint after the build returns. The oracle is advanced to the load version only on success and before the checkpoint, so a failed or crashed load leaves the database empty: the half-built pages live only in the buffer pool and are dropped, and recovery comes back to the empty image. The load is atomic at the checkpoint boundary, the same guarantee a transaction gives by a different route.

Two paths, one call

When the engine supports the capability and the database is empty, Load takes the fast path. Otherwise it falls back to chunked WriteBatch commits, which accept any key order and merge into existing data. The fast path requires strictly ascending keys and reports a violation rather than build a mis-ordered tree; a dump emits ascending keys, so dump | load round-trips by construction.

CLI

kv load now drives db.Load directly, scanning JSONL one line at a time so a huge import streams. kv dump is unchanged. The existing dump-load round-trip test now exercises the fast path end to end through the binary.

Implementation notes: ~/notes/Spec/2059/implementation/31-bulk-load.md.

go build, gofmt -l, go vet, and go test -race ./... all clean.

Populating a database key by key pays the insert path on every key: descend, split the leaf, post a separator, split interior nodes. For a cold load of a sorted dump that work is wasted, because the final tree shape is known up front.

Add db.Load(next), a streaming pull API that builds the tree bottom-up. A new optional engine.BulkLoader capability carries the build; the B-tree core packs each leaf once, links the B-link sibling chain as it goes, and stacks the interior levels on top, writing every page exactly once instead of O(n log n) splits. Loading into a non-empty database, or an engine without the capability, falls back to chunked WriteBatch commits, which accept any key order.

The fast path writes pages straight through the pager and makes the build durable with one checkpoint after it returns. The oracle is advanced only on success and before the checkpoint, so a failed or crashed load leaves the database empty: the load is atomic at the checkpoint boundary. It requires strictly ascending keys, which is exactly the order kv dump emits, so dump | load round-trips by construction.

kv load now drives db.Load directly, scanning JSONL one line at a time so a huge import streams rather than buffering.

tamnd merged commit a3f7fbd into main Jun 20, 2026
1 check passed

tamnd deleted the bulk-load branch June 20, 2026 14:18

tamnd mentioned this pull request Jun 20, 2026

Add a full vacuum that rebuilds and swaps the file #31

Merged
