Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Optimize groupby-agg in dask_cudf #6248

Merged
merged 20 commits into from Sep 24, 2020

Conversation

rjzamora
Copy link
Member

@rjzamora rjzamora commented Sep 16, 2020

Experimental groupby-aggregation optimizations. New algorithm leverages groupby(...).agg(..) in cudf, rather than looping through each column. New backend applies to operations like the following:

# List of aggregations
ddf.groupby("id").agg(["min", "max", "mean", "count"], split_out=1, split_every=8)

# Dictionary of aggregations
ddf.groupby("id").agg({"x": ["min", "max", "mean", "count"]}, split_out=1, split_every=8)

Note that the new backend can also be used for a pandas-backed Dask-DataFrame (e.g. dask_cudf.groupby_agg(ddf, ...)). However, the new algorithm does not seem to benefit performance in pandas.

cc @pentschev @kkraus14

TODO:

  • Add tests
  • Clean up, validate, harden
  • Add "std" and "var" support
  • Handle aggregation dict

@GPUtester
Copy link
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@codecov
Copy link

codecov bot commented Sep 17, 2020

Codecov Report

Merging #6248 into branch-0.16 will increase coverage by 0.45%.
The diff coverage is 95.16%.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.16    #6248      +/-   ##
===============================================
+ Coverage        84.42%   84.87%   +0.45%     
===============================================
  Files               82       83       +1     
  Lines            13857    14383     +526     
===============================================
+ Hits             11699    12208     +509     
- Misses            2158     2175      +17     
Impacted Files Coverage Δ
python/dask_cudf/dask_cudf/groupby.py 95.05% <95.05%> (ø)
python/dask_cudf/dask_cudf/__init__.py 82.35% <100.00%> (+1.10%) ⬆️
python/dask_cudf/dask_cudf/core.py 76.14% <100.00%> (+0.23%) ⬆️
python/cudf/cudf/io/feather.py 100.00% <0.00%> (ø)
python/cudf/cudf/comm/serialize.py 0.00% <0.00%> (ø)
python/custreamz/custreamz/_version.py 0.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/_version.py 0.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/tests/test_csv.py 100.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/tests/test_orc.py 100.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/tests/test_json.py 100.00% <0.00%> (ø)
... and 35 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 555b91a...09b5e11. Read the comment docs.

@rjzamora rjzamora self-assigned this Sep 17, 2020
@rjzamora rjzamora changed the title [WIP] Optimized DataFrame.groupby_agg API [WIP] Optimized dask_cudf groupby-agg API Sep 17, 2020
@rjzamora rjzamora marked this pull request as ready for review September 19, 2020 16:09
@rjzamora rjzamora requested a review from a team as a code owner September 19, 2020 16:09
@rjzamora rjzamora changed the title [WIP] Optimized dask_cudf groupby-agg API [REVIEW] Optimize groupby-agg in dask_cudf Sep 19, 2020
@kkraus14
Copy link
Collaborator

rerun tests

@kkraus14 kkraus14 added the dask Dask issue label Sep 21, 2020
@kkraus14 kkraus14 added this to PR-WIP in v0.16 Release via automation Sep 21, 2020
python/dask_cudf/dask_cudf/groupby.py Outdated Show resolved Hide resolved
python/dask_cudf/dask_cudf/groupby.py Outdated Show resolved Hide resolved
v0.16 Release automation moved this from PR-WIP to PR-Needs review Sep 21, 2020
v0.16 Release automation moved this from PR-Needs review to PR-Reviewer approved Sep 23, 2020
@kkraus14 kkraus14 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 4 - Needs Dask Reviewer labels Sep 23, 2020
@kkraus14
Copy link
Collaborator

rerun tests

@kkraus14 kkraus14 merged commit 0bbb1c6 into rapidsai:branch-0.16 Sep 24, 2020
v0.16 Release automation moved this from PR-Reviewer approved to Done Sep 24, 2020
rapids-bot bot pushed a commit that referenced this pull request Jan 23, 2021
Adds a `CudfSeriesGroupBy` class to dask_cudf.  This allows the optimizations from #6248 to be used for `CudfSeriesGroupBy.mean` (in addition to `CudfDataFrameGroupBy.aggregate`).

Authors:
  - Richard (Rick) Zamora (@rjzamora)

Approvers:
  - Keith Kraus (@kkraus14)

URL: #7194
@rjzamora rjzamora deleted the groupby-agg-api branch December 9, 2022 15:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge dask Dask issue
Projects
No open projects
v0.16 Release
  
Done
Development

Successfully merging this pull request may close these issues.

None yet

3 participants