
Disk based joins, #4592 (#4801)

Merged
jarohen merged 2 commits into xtdb:main from jarohen:disk-based-joins-4592, Sep 22, 2025
Conversation


@jarohen jarohen commented Sep 22, 2025

resolves #4592

This PR spills join build-sides to disk when they go over 100k rows.

  1. First, we use Spill to write pages of the build-side to disk every time the accumulated relation goes over 100k rows. This writes one Arrow file, with one page for every 100k-row batch (B pages in total).
  2. We then choose a partition count P. This aims to keep the number of batches per partition consistent - if there are 1M rows in the input, B ≈ 10, so we choose partition count P = 16 (the next power of two).
  3. Then, we 'Shuffle' the spilled file to another Arrow file. For each batch, we write P pages - so this second file will contain B * P = 160 pages of ~6k rows each (~62k rows per partition).
  4. When we read this back in, Shuffle then combines the pages for each partition - i.e. partition 0 contains pages 0, 16, 32, etc.; partition 1 contains pages 1, 17, 33, etc.
  5. When the JoinCursor has completed the build phase, it looks to see whether the build-side was shuffled. If so, it uses DiskHashJoin (DHJ), otherwise MemoryHashJoin (MHJ).
    • See bb58216 for the introduction of MHJ, already on main.
  6. DHJ is itself a cursor. This first spills and shuffles the probe side to disk in a similar way, so that we can align partitions with the build-side.
    • The probe side must also have a power-of-two partition count, and additionally must have at least as many partitions as the build-side. This is a reasonable assumption, because we always try to choose the smaller side as the build-side.
  7. We then stream through the shuffled build/probe side files:
    1. We pull in a build-side part to begin with.
    2. Then, for each probe-side part (recall that there's at least one for each build-side part), we do the actual join, appending any null RHS rows for that probe partition if applicable.
    3. At the end of the build-side part, we yield another relation of null rows if the LHS is an outer join.
  8. ...
  9. Profit.
  • One implementation complexity here is around the alignment of the parts within DHJ:
    • In Shuffle, I chose to hash based on the lower bits of the hash-code - partly because it makes for easier bit-maths, partly because if we do recommend that people use the top bits of an IID for application-level partitioning, those bits may not be as well distributed.
    • So, if the build and probe sides have different partition counts (say 4 and 8), consider a hash that ends in 6: 6 (mod 8) == 2 (mod 4).
    • Partition 2 on the build side therefore matches partitions 2 and 6 on the probe side, not 4 (2n) and 5 (2n + 1) as you might expect.
  • Outer joins also make this more fun. We solve this by adding an extra iteration at the end of every build-part, which the DHJ spots, yielding the unmatched rows at that point.
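The shuffled-file page layout in steps 3-4 can be sketched as follows (a minimal illustration with hypothetical helper names, not XTDB's actual code), assuming the shuffle writes pages batch-major, i.e. page index = batch * P + partition:

```python
def pages_for_partition(part: int, b: int, p: int) -> list[int]:
    # Page indices in the shuffled file belonging to one partition,
    # assuming B batches were each shuffled into P pages, written
    # batch-major: page index = batch * P + partition.
    return [batch * p + part for batch in range(b)]

# With B = 10 batches and P = 16 partitions (160 pages total),
# partition 0 reads pages 0, 16, 32, ...; partition 1 reads 1, 17, 33, ...
assert pages_for_partition(0, 10, 16) == [0, 16, 32, 48, 64, 80, 96, 112, 128, 144]
assert pages_for_partition(1, 10, 16)[:3] == [1, 17, 33]
```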

It seems that other disk-based hash-join implementations choose a partition count ahead of time, and then opt to re-partition if required. I elected not to write a dynamic re-partitioner, but it did mean a slightly more complex alignment algo 😅
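The low-bit partition alignment described above can be sketched like this (hypothetical helpers under my own naming, not the actual implementation), assuming partition = hash & (P - 1) with P a power of two:

```python
def partition(hash_code: int, p: int) -> int:
    # Partition by the LOWER bits of the hash; p must be a power of two.
    return hash_code & (p - 1)

def probe_partitions_for(build_part: int, build_p: int, probe_p: int) -> list[int]:
    # All probe-side partitions that can contain matches for a build-side
    # partition: because both sides partition on the low bits, probe
    # partition q matches build partition b exactly when q mod build_p == b.
    return [q for q in range(probe_p) if q % build_p == build_part]

# Build side: 4 partitions; probe side: 8. A hash ending in 6 is
# 6 (mod 8) but 2 (mod 4):
assert partition(0b110, 8) == 6
assert partition(0b110, 4) == 2

# So build partition 2 aligns with probe partitions 2 and 6,
# not 4 (2n) and 5 (2n + 1):
assert probe_partitions_for(2, 4, 8) == [2, 6]
```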


checkpoint: build-side spills to disk if it goes above a threshold, but then just reads it all back in

checkpoint: only do hashing after spill

checkpoint: write hashes out during shuffle

shuffle/spill: spill and shuffle out to their own isolated classes

checkpoint: shuffle now pipes the data through too

checkpoint: shuffle doing a shuffle

checkpoint: Shuffle can reload a part at a time
@jarohen jarohen self-assigned this Sep 22, 2025
@jarohen jarohen merged commit a5b9aca into xtdb:main Sep 22, 2025
1 check passed
@jarohen jarohen deleted the disk-based-joins-4592 branch September 25, 2025 14:35