Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add "collect" aggregation support to dask-cudf #15593

Merged
merged 13 commits into from
May 1, 2024

Conversation

rjzamora
Copy link
Member

@rjzamora rjzamora commented Apr 24, 2024

Currently blocked by dask/dask#11064

Description

This PR (along with it's upstream dependency) enables "collect" aggregations in dask-cudf when query-planning is enabled. It also adds an clearer error message for as_index usage (which is not supported in dask-dataframe, but was supported in legacy dask-cudf)

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora rjzamora added bug Something isn't working 2 - In Progress Currently a work in progress dask Dask issue non-breaking Non-breaking change labels Apr 24, 2024
@rjzamora rjzamora self-assigned this Apr 24, 2024
@github-actions github-actions bot added the Python Affects Python cuDF API. label Apr 24, 2024
@rjzamora rjzamora marked this pull request as ready for review April 26, 2024 14:34
@rjzamora rjzamora requested a review from a team as a code owner April 26, 2024 14:34
@rjzamora rjzamora added 4 - Needs Review Waiting for reviewer to review or respond and removed 2 - In Progress Currently a work in progress labels Apr 26, 2024
Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor requests for clarification in the doc aspect.

Comment on lines 84 to 89
if as_index is not True:
raise NotImplementedError(
f"`as_index` is not supported by dask-expr. Please disable "
"query planning, or reset the index after aggregating.\n"
f"{_LEGACY_WORKAROUND}"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This is a somewhat confusing error message. The only way to get past it with query planning enabled is to say as_index=True, but the error message seems to say "as_index=True` is not handled by dask-expr.

Do you mean:

dask-expr only supports as_index=True. For as_index=False either disable query planning or reset the index with reset_index after aggregating.

WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like dask-expr doesn't actually have support for the as_index keyword arg in general and always follows the behavior of as_index=False, so perhaps we should consider:

  • checking if the kwarg is provided at all, emitting a FutureWarning if so
  • if the kwarg isn't as_index=True, raise the error description suggested above

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The upstream dask.dataframe API has always raised an error when as_index is used, but dask-cudf has used a distinct groupby API until now.

I agree that the real problem is that as_index is not supported at all by dask-expr. Therefore @charlesbluca's suggestion is probably the most "correct". With that said, I'm feeling a bit hesitant to add more noise for something that technically "works fine" :/

python/dask_cudf/dask_cudf/expr/_groupby.py Outdated Show resolved Hide resolved
Comment on lines 84 to 96
if "as_index" in kwargs:
warnings.warn(
"The `as_index` argument is no longer supported in "
"dask-cudf when query-planning is enabled.",
FutureWarning,
)

if kwargs.pop("as_index", True) is not True:
raise NotImplementedError(
f"`as_index=False` is not supported. Please disable "
"query planning, or reset the index after aggregating.\n"
f"{_LEGACY_WORKAROUND}"
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This essentially does what @charlesbluca suggested - The message is still slightly confusing, but so is the behavior I guess.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe something to the effect of "as_index is not supported with query-planning enabled this will behave consistently with as_index=True" for both messages, with an additional blurb for the error showing the non-true value passed by the user?

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes, looks good. I agree the message is still a bit confusing but I do not obviously have a better suggestion :(

@rjzamora
Copy link
Member Author

rjzamora commented May 1, 2024

I really appreciate the reviews @wence- @charlesbluca ! I just cleanup up the messaging a bit - Definitely not perfect, but hopefully clear enough for most users.

@wence-
Copy link
Contributor

wence- commented May 1, 2024

/merge

@rapids-bot rapids-bot bot merged commit 7458a6e into rapidsai:branch-24.06 May 1, 2024
69 checks passed
@rjzamora rjzamora deleted the groupby-updates branch May 2, 2024 13:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4 - Needs Review Waiting for reviewer to review or respond bug Something isn't working dask Dask issue non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

3 participants