
Disk based joins, #4592 (#4801)

Merged
jarohen merged 2 commits into xtdb:main from jarohen:disk-based-joins-4592, Sep 22, 2025
Conversation


@jarohen jarohen commented Sep 22, 2025

resolves #4592

This PR spills join build-sides to disk when they go over 100k rows.

  1. First, we use Spill to write pages of the build-side to disk every time the accumulated relation goes over 100k rows. This writes one Arrow file, with one page for every 100k-row batch (B pages in total).
  2. We then choose a partition count P. This aims to keep the number of batches per partition consistent - if there are 1M rows in the input, B ≈ 10, so we choose partition count P = 16 (the next power of two).
  3. Then, we 'Shuffle' the spilled file to another Arrow file. For each batch, we write P pages - so this second file will contain B * P = 160 pages of ~6k rows each (~62k rows per partition).
  4. When we read this back in, Shuffle then combines the pages for each partition - i.e. partition 0 contains pages 0, 16, 32, etc.; partition 1 contains pages 1, 17, 33, etc.
  5. When the JoinCursor has completed the build phase, it looks to see whether the build-side was shuffled. If so, it uses DiskHashJoin (DHJ), otherwise MemoryHashJoin (MHJ).
    • See bb58216 for the introduction of MHJ, already on main.
  6. DHJ is itself a cursor. This first spills and shuffles the probe side to disk in a similar way, so that we can align partitions with the build-side.
    • The probe side must also have a power-of-two partition count, and additionally must have at least as many partitions as the build-side. This is a reasonable assumption, because we always try to choose the smaller side as the build-side.
  7. We then stream through the shuffled build/probe side files:
    1. We pull in a build-side part to begin with.
    2. Then, for each probe-side part (recall that there's at least one for each build-side part), we do the actual join, appending any null RHS rows for that probe partition if applicable.
    3. At the end of the build-side part, we yield another relation of null rows if the LHS is an outer join.
  8. ...
  9. Profit.
  • One implementation complexity here is around the alignment of the parts within DHJ:
    • In Shuffle, I chose to hash based on the lower bits of the hash-code - partly because it makes for easier bit-maths, partly because if we do recommend that people use the top bits of an IID for application-level partitioning, those bits may not be as well distributed.
    • So, if the build and probe sides have different partition counts (say 4 and 8), consider a hash that ends in 6: 6 (mod 8) == 2 (mod 4).
    • Partition 2 on the build side therefore matches partitions 2 and 6 on the probe side, not 4 (2n) and 5 (2n + 1) as you might expect.
  • Outer joins also make this more fun. We solve this by adding an extra iteration at the end of every build-part, which the DHJ spots, yielding the unmatched rows at that point.
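The shuffled-file page layout in steps 3-4 can be sketched as follows (a minimal illustration with hypothetical helper names, not XTDB's actual code), assuming the shuffle writes pages batch-major, i.e. page index = batch * P + partition:

```python
def pages_for_partition(part: int, b: int, p: int) -> list[int]:
    # Page indices in the shuffled file belonging to one partition,
    # assuming B batches were each shuffled into P pages, written
    # batch-major: page index = batch * P + partition.
    return [batch * p + part for batch in range(b)]

# With B = 10 batches and P = 16 partitions (160 pages total),
# partition 0 reads pages 0, 16, 32, ...; partition 1 reads 1, 17, 33, ...
assert pages_for_partition(0, 10, 16) == [0, 16, 32, 48, 64, 80, 96, 112, 128, 144]
assert pages_for_partition(1, 10, 16)[:3] == [1, 17, 33]
```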

It seems that other disk-based hash-join implementations choose a partition count ahead of time, and then opt to re-partition if required. I elected not to write a dynamic re-partitioner, but it did mean a slightly more complex alignment algo 😅
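The low-bit partition alignment described above can be sketched like this (hypothetical helpers under my own naming, not the actual implementation), assuming partition = hash & (P - 1) with P a power of two:

```python
def partition(hash_code: int, p: int) -> int:
    # Partition by the LOWER bits of the hash; p must be a power of two.
    return hash_code & (p - 1)

def probe_partitions_for(build_part: int, build_p: int, probe_p: int) -> list[int]:
    # All probe-side partitions that can contain matches for a build-side
    # partition: because both sides partition on the low bits, probe
    # partition q matches build partition b exactly when q mod build_p == b.
    return [q for q in range(probe_p) if q % build_p == build_part]

# Build side: 4 partitions; probe side: 8. A hash ending in 6 is
# 6 (mod 8) but 2 (mod 4):
assert partition(0b110, 8) == 6
assert partition(0b110, 4) == 2

# So build partition 2 aligns with probe partitions 2 and 6,
# not 4 (2n) and 5 (2n + 1):
assert probe_partitions_for(2, 4, 8) == [2, 6]
```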


checkpoint: build-side spills to disk if it goes above a threshold, but then just reads it all back in

checkpoint: only do hashing after spill

checkpoint: write hashes out during shuffle

shuffle/spill: spill and shuffle out to their own isolated classes

checkpoint: shuffle now pipes the data through too

checkpoint: shuffle doing a shuffle

checkpoint: Shuffle can reload a part at a time
@jarohen jarohen self-assigned this Sep 22, 2025
@jarohen jarohen merged commit a5b9aca into xtdb:main Sep 22, 2025
1 check passed
@jarohen jarohen deleted the disk-based-joins-4592 branch September 25, 2025 14:35