[FEA] Dynamic filtering during joins

**Is your feature request related to a problem? Please describe.**

When performing a join, once we have built the first hash table, we potentially have useful information for filtering the probe table. Specifically, if the keys admit an ordering, then if we know the range of the keys, then we can discard any rows from the probe table that are not in this range.

This might be a win even in the case of a single GPU join, but when introducing multiple GPUs, or a query engine, it seems likely to be of significant benefit.

**query engine**

If we can build the hash table and get statistics for the build table, then we can potentially push filters all the way down to the IO stage for the probe table. Since there might be arbitrary compute before getting to the join, this could make a significant difference to performance (in a query-dependent manner).

**multi GPU**

Prefiltering before doing a join, especially if we have to do a shuffle join, can (depending on key values) significantly reduce the amount of data that we need to move around. It might even make the table small enough that it makes sense to do a broadcast join rather than a shuffle. In this scenario, we probably want to compute statistics before any hashing takes place.

I think this could be implemented in dask-cudf right now by just manually computing min/max ranges for the join keys and then pre-filtering.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[FEA] Dynamic filtering during joins #16989

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[FEA] Dynamic filtering during joins #16989

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions