-
Notifications
You must be signed in to change notification settings - Fork 871
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support shuffle-based groupby aggregations in dask_cudf #11800
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh, sorry, all of my suggestions have resulted in disaster. Not sure if the better option given we're close to release is go back to the old code and go round again, or to try and fix things up.
I'm inclined perhaps to revert and try again more systematically, WDYT?
I think the safest path forward is to stick with the ugly What we want for the impending release is to support |
One more thing we can do for 22.10 is to improve the explicit-comms shuffle logic a bit to avoid creating/removing a "_partitions" column when it already exists. |
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## branch-22.10 #11800 +/- ##
===============================================
Coverage ? 87.52%
===============================================
Files ? 133
Lines ? 21801
Branches ? 0
===============================================
Hits ? 19081
Misses ? 2720
Partials ? 0 Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
Update: All changes related to explicit-comms have been removed from this PR - The plan for 22.10 is to support |
Reverts #992, which had led to unexpected issues. See rapidsai/cudf#11800 (review) cc @wence- Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #1001
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, and sorry for all the going around in circles
@gpucibot merge |
## Description This PR fixes a subtle bug introduced in #11800. While working on the corresponding dask-cuda benchmark for that PR rapidsai/dask-cuda#979, we discovered that non-deterministic column ordering in `_groupby_partition_agg` and `_tree_node_agg` can trigger metadata-enforcement errors in follow-up operations. This PR simply sorts the output column ordering in those functions (so that the column ordering is always deterministic). Note that this bug is difficult to reproduce in a pytest, because it rarely occurs with a smaller number of devices (I need to use a full dgx machine to consistently trigger the error). ## Checklist - [ ] I am familiar with the [Contributing Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md). - [ ] New or existing tests cover these changes. - [ ] The documentation is up to date with these changes. Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Ashwin Srinath (https://github.com/shwina)
Description
This PR corresponds to the
dask_cudf
version of dask/dask#9302 (adding a shuffle-based algorithm for high-cardinality groupby aggregations). The benefits of this algorithm are most significant for cases wheresplit_out>1
is necessary:NOTES:
shuffle="explicit-comms"
is also supported (whendask_cuda
is installed)dask.dataframe
anddask_cudf
, the specialized_shuffle_aggregate
is currently necessary.Checklist