Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support nested types as groupby keys in libcudf #11792

Merged
merged 31 commits into from
Nov 16, 2022

Conversation

PointKernel
Copy link
Member

@PointKernel PointKernel commented Sep 27, 2022

Related to #8039

This PR replaces old row operators in list groupby with new ones thus nested types like lists and structs can be used as groupby keys in libcudf. It partially fixes the issue and comprehensive python refactoring requires non-trivial future work. It is a breaking change since libcudf will no longer throw an error when nulls are excluded and groupby keys are of nested types. Comprehensive tests for complex nested types like structs of lists of lists of structs depend on #11222.

Closes #10181

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@PointKernel PointKernel added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change labels Sep 27, 2022
@PointKernel PointKernel self-assigned this Sep 27, 2022
@PointKernel PointKernel added this to PR-WIP in v22.12 Release via automation Sep 27, 2022
@codecov
Copy link

codecov bot commented Oct 27, 2022

Codecov Report

Base: 87.47% // Head: 88.10% // Increases project coverage by +0.62% 🎉

Coverage data is based on head (270593f) compared to base (f817d96).
Patch has no changes to coverable lines.

❗ Current head 270593f differs from pull request most recent head d94e452. Consider uploading reports for the commit d94e452 to get more accurate results

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-22.12   #11792      +/-   ##
================================================
+ Coverage         87.47%   88.10%   +0.62%     
================================================
  Files               133      135       +2     
  Lines             21826    22133     +307     
================================================
+ Hits              19093    19500     +407     
+ Misses             2733     2633     -100     
Impacted Files Coverage Δ
python/cudf/cudf/core/column/interval.py 85.45% <0.00%> (-9.10%) ⬇️
python/cudf/cudf/io/text.py 91.66% <0.00%> (-8.34%) ⬇️
python/strings_udf/strings_udf/__init__.py 76.47% <0.00%> (-7.85%) ⬇️
python/cudf/cudf/core/_base_index.py 81.28% <0.00%> (-4.27%) ⬇️
python/cudf/cudf/io/json.py 92.06% <0.00%> (-2.68%) ⬇️
python/cudf/cudf/utils/utils.py 89.91% <0.00%> (-0.69%) ⬇️
python/cudf/cudf/core/column/timedelta.py 90.17% <0.00%> (-0.58%) ⬇️
python/cudf/cudf/core/column/datetime.py 89.21% <0.00%> (-0.51%) ⬇️
python/cudf/cudf/core/column/column.py 87.96% <0.00%> (-0.46%) ⬇️
python/dask_cudf/dask_cudf/core.py 73.72% <0.00%> (-0.41%) ⬇️
... and 43 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@ttnghia ttnghia self-requested a review October 28, 2022 04:27
@github-actions github-actions bot added the cuDF (Python) Affects Python cuDF API. label Oct 31, 2022
@shwina
Copy link
Contributor

shwina commented Nov 1, 2022

@PointKernel as we saw yesterday, when attempting to do a groupby-agg with a list column as keys, we see the following error:

RuntimeError: cuDF failure at: cudf/cpp/src/groupby/hash/groupby.cu:661: Null keys of nested type cannot be excluded.

This is because the default option in Python is dropna=True which asks libcudf to exclude null keys. Now, should the above error be raised in libcudf only when there are actually nulls in the list column, or should it be the caller's (i.e., Cython's) responsibility to check for nulls before passing the exclude policy to libcudf?

@PointKernel
Copy link
Member Author

@PointKernel as we saw yesterday, when attempting to do a groupby-agg with a list column as keys, we see the following error:

RuntimeError: cuDF failure at: cudf/cpp/src/groupby/hash/groupby.cu:661: Null keys of nested type cannot be excluded.

This is because the default option in Python is dropna=True which asks libcudf to exclude null keys. Now, should the above error be raised in libcudf only when there are actually nulls in the list column, or should it be the caller's (i.e., Cython's) responsibility to check for nulls before passing the exclude policy to libcudf?

IIRC, The error message was added to align with Spark's behavior thus we don't have to throw the error in libcudf. I will just remove it and see if anyone complains.

@PointKernel PointKernel requested review from a team as code owners November 15, 2022 20:02
@github-actions github-actions bot added CMake CMake build issue conda conda issue cuDF (Java) Affects Java cuDF API. cuDF (Python) Affects Python cuDF API. gpuCI labels Nov 15, 2022
@github-actions github-actions bot removed CMake CMake build issue cuDF (Java) Affects Java cuDF API. gpuCI cuDF (Python) Affects Python cuDF API. conda conda issue labels Nov 15, 2022
@PointKernel PointKernel removed request for a team November 15, 2022 20:06
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to see my comment regarding ordering of outputs in the test addressed. However, given the tight time constraints (code freeze starts tomorrow) I don't want to block this PR on that. If you have time to make that change in this PR that's great, but if not please file it away for future work. Thanks!

v22.12 Release automation moved this from PR-Needs review to PR-Reviewer approved Nov 16, 2022
@PointKernel
Copy link
Member Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit afb3c97 into rapidsai:branch-22.12 Nov 16, 2022
v22.12 Release automation moved this from PR-Reviewer approved to Done Nov 16, 2022
@PointKernel PointKernel deleted the support-list-groupby branch November 16, 2022 20:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team breaking Breaking change feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

[FEA] group by aggregations that include a list of strings as the grouping type
4 participants