Support shuffle-based groupby aggregations in dask_cudf #11800

rjzamora · 2022-09-28T02:54:47Z

Description

This PR corresponds to the dask_cudf version of dask/dask#9302 (adding a shuffle-based algorithm for high-cardinality groupby aggregations). The benefits of this algorithm are most significant for cases where split_out>1 is necessary:

agg = ddf.groupby("id").agg({"x": "mean", "y": "max"}, split_out=4, shuffle=True)
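To make the idea concrete, here is a minimal pure-Python sketch of a shuffle-based groupby. It is not the dask_cudf implementation: `shuffle_groupby` and the list-of-dicts "partitions" are hypothetical stand-ins for the real DataFrame partitions. The sketch assumes the core idea is that rows are first hash-partitioned on the groupby key, so each of the `split_out` output partitions holds a disjoint set of keys and can be aggregated independently.

```python
from collections import defaultdict


def shuffle_groupby(partitions, key, split_out):
    """Hash-shuffle rows by `key`, then aggregate each output partition."""
    # Shuffle: every row with the same key lands in the same output partition.
    shuffled = [[] for _ in range(split_out)]
    for part in partitions:
        for row in part:
            shuffled[hash(row[key]) % split_out].append(row)

    # Aggregate: each output partition is reduced on its own.
    results = []
    for part in shuffled:
        groups = defaultdict(list)
        for row in part:
            groups[row[key]].append(row)
        results.append({
            k: {
                "x": sum(r["x"] for r in rows) / len(rows),  # mean of x
                "y": max(r["y"] for r in rows),              # max of y
            }
            for k, rows in groups.items()
        })
    return results


# Hypothetical input: two partitions that share the key id=1.
partitions = [
    [{"id": 1, "x": 1.0, "y": 5}, {"id": 2, "x": 2.0, "y": 1}],
    [{"id": 1, "x": 3.0, "y": 7}, {"id": 3, "x": 4.0, "y": 2}],
]
out = shuffle_groupby(partitions, "id", split_out=4)
```

Because no key is split across output partitions, the per-partition results never need to be combined with each other, which is what makes the approach attractive for high-cardinality keys.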

NOTES:

~~shuffle="explicit-comms" is also supported (when dask_cuda is installed)~~
It should be possible to refactor or remove some of this code in the future. However, due to some subtle differences between the groupby code in dask.dataframe and dask_cudf, the specialized _shuffle_aggregate is currently necessary.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.


python/dask_cudf/dask_cudf/core.py

wence-

Ugh, sorry, all of my suggestions have resulted in disaster. Given we're close to release, I'm not sure whether the better option is to go back to the old code and go round again, or to try and fix things up.

I'm inclined perhaps to revert and try again more systematically, WDYT?

python/dask_cudf/dask_cudf/core.py

rjzamora · 2022-09-28T12:52:19Z

> I'm inclined perhaps to revert and try again more systematically, WDYT?

I think the safest path forward is to stick with the ugly shuffle work-around proposed in this PR (for the 22.10 release), but to rip it out as soon as something like dask/dask#9521 is supported upstream.

What we want for the impending release is to support ddf.shuffle/sort_values/set_index/merge/join(..., shuffle="explicit-comms") and ddf.groupby(...).agg(..., shuffle="explicit-comms"). The current design ended up being uglier than we wanted (and not particularly extensible), but it does "accomplish" this. Therefore, I think we should target a more-extensible redesign for 22.12 (to provide enough time for upstream buy-in and code review).

rjzamora · 2022-09-28T12:56:47Z

One more thing we can do for 22.10 is to improve the explicit-comms shuffle logic a bit to avoid creating/removing a "_partitions" column when it already exists.
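As a rough illustration of that suggestion (not the dask-cuda code), the sketch below adds the partition column only when it is missing and drops it only if this call added it. The helper `shuffle_with_partition_column`, the `assign_partition` callback, and the dict-of-lists "frame" are all hypothetical.

```python
def shuffle_with_partition_column(frame, assign_partition, column="_partitions"):
    """Add `column` only when it is missing, and drop it only if we added it."""
    added = column not in frame
    if added:
        frame = {**frame, column: assign_partition(frame)}
    # ... the shuffle itself would consume frame[column] here ...
    if added:
        frame = {k: v for k, v in frame.items() if k != column}
    return frame


calls = []


def assign(frame):
    # Hypothetical partition assignment; records each call so reuse is visible.
    calls.append(1)
    return [v % 2 for v in frame["id"]]


plain = shuffle_with_partition_column({"id": [1, 2, 3]}, assign)
tagged = shuffle_with_partition_column(
    {"id": [1, 2, 3], "_partitions": [0, 1, 0]}, assign
)
```

In this sketch a frame that already carries `"_partitions"` passes through untouched, so the assignment work is skipped and the caller's column is preserved.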

python/dask_cudf/dask_cudf/core.py

codecov · 2022-09-28T14:55:13Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.10@bcf361f).
Patch has no changes to coverable lines.

Additional details and impacted files

@@               Coverage Diff               @@
##             branch-22.10   #11800   +/-   ##
===============================================
  Coverage                ?   87.52%           
===============================================
  Files                   ?      133           
  Lines                   ?    21801           
  Branches                ?        0           
===============================================
  Hits                    ?    19081           
  Misses                  ?     2720           
  Partials                ?        0


rjzamora · 2022-09-28T17:52:00Z

Update: All changes related to explicit-comms have been removed from this PR. The plan for 22.10 is to support shuffle=True or shuffle="tasks" (and to leave out shuffle="explicit-comms").

@wence-

Reverts #992, which had led to unexpected issues. See rapidsai/cudf#11800 (review)

cc @wence-

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- Lawrence Mitchell (https://github.com/wence-)

URL: #1001

wence-

Thanks, and sorry for all the going around in circles

shwina · 2022-09-28T20:45:47Z

@gpucibot merge

rjzamora · 2022-09-28T20:48:50Z

Thanks @wence- and @quasiben for helping to get this in!

## Description

This PR fixes a subtle bug introduced in #11800. While working on the corresponding dask-cuda benchmark for that PR (rapidsai/dask-cuda#979), we discovered that non-deterministic column ordering in `_groupby_partition_agg` and `_tree_node_agg` can trigger metadata-enforcement errors in follow-up operations. This PR simply sorts the output columns in those functions (so that the column ordering is always deterministic).

Note that this bug is difficult to reproduce in a pytest, because it rarely occurs with a smaller number of devices (I need to use a full DGX machine to consistently trigger the error).

## Checklist

- [ ] I am familiar with the [Contributing Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md).
- [ ] New or existing tests cover these changes.
- [ ] The documentation is up to date with these changes.

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Ashwin Srinath (https://github.com/shwina)
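The follow-up fix can be pictured with a small pure-Python sketch. It is an illustration only: the real functions operate on cuDF DataFrames, and `sort_columns`, the dict-of-lists partitions, and the column names below are hypothetical. The point is that two partitions holding the same columns in different orders end up with identical column metadata once the names are sorted.

```python
def sort_columns(frame):
    """Return a copy of `frame` (column name -> values) with sorted column names."""
    return {name: frame[name] for name in sorted(frame)}


# Two partitions with the same columns, produced in different orders.
part_a = {"y_max": [7], "x_sum": [4.0], "x_count": [2]}
part_b = {"x_count": [1], "y_max": [1], "x_sum": [2.0]}

cols_before = (list(part_a), list(part_b))  # orders differ between partitions
cols_a = list(sort_columns(part_a))
cols_b = list(sort_columns(part_b))         # orders now agree
```

Sorting is one simple way to get a canonical order; any fixed ordering that every partition agrees on would serve the same purpose.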

rjzamora added 18 commits August 22, 2022 07:59

add explicit-comms shuffle option for sort_values and shuffle

edca3aa

use npartitions=4 in test

750a30b

use npartitions=4 in test

76d8faa

add shuffle-based groupby

4f30fe1

Merge remote-tracking branch 'upstream/branch-22.10' into ec-shuffle-option

b3de1a8

use dask-config to support explicit-comms shuffle

eccf5a6

move _shuffle_context

b01eef9

fix test

694b25b

add test coverage

04d80b3

remove commented lines

4c2530b

align with groupby benchmark again

a611c29

use ec shuffle directly for repartitioning

9137662

Merge branch 'ec-shuffle-option' into shuffle-groupby

885506f

Merge remote-tracking branch 'upstream/branch-22.10' into shuffle-groupby

be1bd3b

remove stale code

482b21a

simplify _shuffle_aggregate a bit

35f7fb3

use rearrange_by_column

437cf66

add explicit-comms test coverage

dbbdc7c

rjzamora added the "2 - In Progress", "improvement", and "non-breaking" labels Sep 28, 2022

github-actions bot added the "Python" label Sep 28, 2022

rjzamora commented Sep 28, 2022

python/dask_cudf/dask_cudf/core.py

rjzamora marked this pull request as ready for review September 28, 2022 02:58

rjzamora requested a review from a team as a code owner September 28, 2022 02:58

simplify further

e732baf

wence- requested changes Sep 28, 2022

python/dask_cudf/dask_cudf/core.py

rjzamora added 2 commits September 28, 2022 05:40

fix args bug

6f56cef

add note

b698e9e

rjzamora commented Sep 28, 2022

python/dask_cudf/dask_cudf/core.py

rjzamora mentioned this pull request Sep 28, 2022

Revert "Update rearrange_by_column patch for explicit comms" rapidsai/dask-cuda#1001

Merged

revert changes related to explicit-comms

7ca1996

rjzamora requested a review from wence- September 28, 2022 18:36

quasiben approved these changes Sep 28, 2022

wence- approved these changes Sep 28, 2022

rapids-bot bot merged commit 5a4afec into rapidsai:branch-22.10 Sep 28, 2022

rjzamora deleted the shuffle-groupby branch September 28, 2022 20:48

rjzamora mentioned this pull request Sep 30, 2022

Fix bug in new shuffle-based groupby implementation #11836

Merged
