[REVIEW] Optimize groupby-agg in dask_cudf #6248
Please update the changelog in order to start CI tests. View the gpuCI docs here.
Codecov Report
@@             Coverage Diff              @@
##           branch-0.16    #6248     +/-   ##
===============================================
+ Coverage      84.42%   84.87%   +0.45%
===============================================
  Files             82       83       +1
  Lines          13857    14383     +526
===============================================
+ Hits           11699    12208     +509
- Misses          2158     2175      +17
Continue to review full report at Codecov.
…_groupby_multiindex_reset_index
rerun tests
rerun tests
Experimental groupby-aggregation optimizations. The new algorithm leverages groupby(...).agg(...) in cudf, rather than looping through each column. The new backend applies to operations like the following:

Note that the new backend can also be used for a pandas-backed Dask DataFrame (e.g. dask_cudf.groupby_agg(ddf, ...)). However, the new algorithm does not seem to benefit performance in pandas.

cc @pentschev @kkraus14
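For readers unfamiliar with the distinction the description draws, here is a minimal sketch contrasting a per-column loop with a single groupby(...).agg(...) call. It is not the code from this PR: it uses plain pandas as a stand-in for cudf, and the toy data, column names, and both helper functions (agg_by_column_loop, agg_single_call) are hypothetical names made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "a": [0, 0, 1, 1, 1],
    "b": [1.0, 2.0, 3.0, 4.0, 5.0],
    "c": [10, 20, 30, 40, 50],
})

def agg_by_column_loop(frame, by, spec):
    """Aggregate one (column, function) pair at a time, then stitch the results."""
    pieces = {}
    for col, funcs in spec.items():
        for func in funcs:
            # One groupby pass per column/function pair.
            pieces[(col, func)] = frame.groupby(by)[col].agg(func)
    out = pd.DataFrame(pieces)
    out.columns = pd.MultiIndex.from_tuples(list(pieces))
    return out

def agg_single_call(frame, by, spec):
    """Hand the whole aggregation spec to a single groupby(...).agg(...) call."""
    return frame.groupby(by).agg(spec)

spec = {"b": ["sum", "mean"], "c": ["max"]}
looped = agg_by_column_loop(df, "a", spec)
fused = agg_single_call(df, "a", spec)
```

Both helpers return the same values; the difference is only in how many groupby passes are issued. Whether the single-call form is faster depends on the backend, which matches the note above that pandas does not seem to benefit.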
TODO:
dict