-
Notifications
You must be signed in to change notification settings - Fork 997
Description
Is your feature request related to a problem? Please describe.
When performing a join, once we have built the first hash table, we potentially have useful information for filtering the probe table. Specifically, if the keys admit an ordering, then if we know the range of the keys, then we can discard any rows from the probe table that are not in this range.
This might be a win even in the case of a single GPU join, but when introducing multiple GPUs, or a query engine, it seems likely to be of significant benefit.
query engine
If we can build the hash table and get statistics for the build table, then we can potentially push filters all the way down to the IO stage for the probe table. Since there might be arbitrary compute before getting to the join, this could make a significant difference to performance (in a query-dependent manner).
multi GPU
Prefiltering before doing a join, especially if we have to do a shuffle join, can (depending on key values) significantly reduce the amount of data that we need to move around. It might even make the table small enough that it makes sense to do a broadcast join rather than a shuffle. In this scenario, we probably want to compute statistics before any hashing takes place.
I think this could be implemented in dask-cudf right now by just manually computing min/max ranges for the join keys and then pre-filtering.