Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add handling for nested dicts in dask-cudf groupby #9054

Merged
merged 5 commits into from
Aug 26, 2021

Conversation

charlesbluca
Copy link
Member

Closes #9017

Adds handling for nested dict (renamed) aggregations supplied to dask-cudf's groupby, by storing the new aggregation names when standardizing the aggs input and applying them in _finalize_gb_agg().

@charlesbluca charlesbluca added feature request New feature or request 3 - Ready for Review Ready for review by team dask Dask issue non-breaking Non-breaking change labels Aug 17, 2021
@charlesbluca charlesbluca requested review from a team as code owners August 17, 2021 21:26
@github-actions github-actions bot added the Python Affects Python cuDF API. label Aug 17, 2021
agg_array.append(
aggs_renames.get(_make_name(col, agg, sep=sep), agg)
)
_meta.columns = pd.MultiIndex.from_arrays([col_array, agg_array])
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like that we have to do the aggregation renames for both _meta and the groupby result, but this is required so that we have the correct final_columns for the last step of _finalize_gb_agg(). It would be nice if we also supported nested dict aggregations in cuDF's groupby so that _meta would have the correct index without any additional steps in dask-cuDF.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @shwina said this could be done but would require some effort. Since pandas does not support nested dicts it seemed like cuDF did not have to go down this path. We could be wrong and if you feel strongly you should speak up

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since pandas does not support nested dicts it seemed like cuDF did not have to go down this path

If pandas doesn't support something ugly, I'd lean away from doing it in cudf for the sake of dask-cudf logic :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I generally agree - there could be larger motivations to want nested renaming support for groupby in cuDF, but I don't think this case alone is a good enough reason to work on it

@codecov
Copy link

codecov bot commented Aug 17, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@04b7027). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head 479c3da differs from pull request most recent head 0954d23. Consider uploading reports for the commit 0954d23 to get more accurate results
Impacted file tree graph

@@               Coverage Diff               @@
##             branch-21.10    #9054   +/-   ##
===============================================
  Coverage                ?   10.78%           
===============================================
  Files                   ?      114           
  Lines                   ?    18716           
  Branches                ?        0           
===============================================
  Hits                    ?     2018           
  Misses                  ?    16698           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 04b7027...0954d23. Read the comment docs.

@quasiben
Copy link
Member

This looks cleaner than #9033 and would be in favor of this PR. @rjzamora if you have a few minutes your thoughts/reviews would be appreciated

@@ -367,6 +382,8 @@ def _is_supported(arg, supported: set):
for col in arg:
if isinstance(arg[col], list):
_global_set = _global_set.union(set(arg[col]))
elif isinstance(arg[col], dict):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does order matter for _global_set? If it does, using set can sometimes change the order and give unexpected results.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't matter here, since we only need _global_set to check if our aggs are a subset of supported. Ordering is more of a concern with _redirect_aggs() since that function returns a copy of the aggs that's used for the remainder of the groupby. AFAIK that should be good

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, sounds good! For the most part the code looks pretty good to me!

@charlesbluca
Copy link
Member Author

rerun tests

@quasiben
Copy link
Member

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 4e0584b into rapidsai:branch-21.10 Aug 26, 2021
@charlesbluca charlesbluca deleted the fix-9017 branch July 19, 2022 14:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team dask Dask issue feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Groupby Agg with Dict-of-dicts
4 participants