Skip to content

[FEA] Dynamic filtering during joins #16989

@wence-

Description

@wence-

Is your feature request related to a problem? Please describe.

When performing a join, once we have built the first hash table, we potentially have useful information for filtering the probe table. Specifically, if the keys admit an ordering, then if we know the range of the keys, then we can discard any rows from the probe table that are not in this range.

This might be a win even in the case of a single GPU join, but when introducing multiple GPUs, or a query engine, it seems likely to be of significant benefit.

query engine

If we can build the hash table and get statistics for the build table, then we can potentially push filters all the way down to the IO stage for the probe table. Since there might be arbitrary compute before getting to the join, this could make a significant difference to performance (in a query-dependent manner).

multi GPU

Prefiltering before doing a join, especially if we have to do a shuffle join, can (depending on key values) significantly reduce the amount of data that we need to move around. It might even make the table small enough that it makes sense to do a broadcast join rather than a shuffle. In this scenario, we probably want to compute statistics before any hashing takes place.

I think this could be implemented in dask-cudf right now by just manually computing min/max ranges for the join keys and then pre-filtering.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions