Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support percent_rank() aggregation #10227

Merged
merged 13 commits into from Feb 10, 2022

Conversation

mythrocks
Copy link
Contributor

@mythrocks mythrocks commented Feb 4, 2022

Fixes #9644.

The percent_rank() aggregation is a ranking function, similar
to rank() and dense_rank(). It calculates the fractional/relative
rank of a row within a group of rows in a column, and returns a double
value in the range [0.0, 1.0] for each row in the group/column.

This commit includes the libcudf changes to support percent_rank()
aggregations, and the supporting JNI bindings.

Note that percent_rank() is typically used as a window aggregation. It
is implemented as a (grouped) scan aggregation, just as rank() and
dense_rank() because it operates across the whole group of rows. (i.e.
the window specification if fixed, spanning the entire group.)

References:

  1. SQL Server
  2. PostgreSQL
  3. SparkSQL

The `percent_rank()` aggregation is a ranking function, similar
to `rank()` and `dense_rank()`. It calculates the fractional/relative
rank of a row within a group of rows in a column, and returns a double
value in the range [0.0, 1.0] for each row in the group/column.

This commit includes the `libcudf` changes to support `percent_rank()`
aggregations, and the supporting JNI bindings.

Note that `percent_rank()` is typically used as a window aggregation. It
is implemented as a (grouped) scan aggregation, just as `rank()` and
`dense_rank()` because it operates across the whole group of rows. (i.e.
the window specification if fixed, spanning the entire group.)

References:
1. [SQL Server](https://docs.microsoft.com/en-us/sql/t-sql/functions/percent-rank-transact-sql)
2. [PostgreSQL](https://www.postgresql.org/docs/10/functions-window.html)
3. [SparkSQL](https://sparkbyexamples.com/spark/spark-sql-window-functions/#ranking-functions)
@mythrocks mythrocks self-assigned this Feb 4, 2022
@mythrocks mythrocks requested review from a team as code owners February 4, 2022 22:27
@github-actions github-actions bot added cuDF (Java) Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Feb 4, 2022
@mythrocks mythrocks marked this pull request as draft February 4, 2022 22:27
@mythrocks
Copy link
Contributor Author

This PR is currently in draft. Some doxygen updates are yet to be done.

@mythrocks
Copy link
Contributor Author

mythrocks commented Feb 5, 2022

I have taken the liberty of changing the example used for RANK and DENSE_RANK aggregations. I submit that this captures the ranking example (with ties) more realistically.

@mythrocks mythrocks added this to PR-WIP in v22.04 Release via automation Feb 5, 2022
@mythrocks mythrocks added feature request New feature or request non-breaking Non-breaking change labels Feb 5, 2022
@mythrocks mythrocks marked this pull request as ready for review February 5, 2022 00:07
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice work for a relatively large PR. I have attached feedback.

cpp/src/aggregation/aggregation.cpp Outdated Show resolved Hide resolved
cpp/src/groupby/sort/group_rank_scan.cu Outdated Show resolved Hide resolved
cpp/src/groupby/sort/group_rank_scan.cu Outdated Show resolved Hide resolved
cpp/src/reductions/scan/rank_scan.cu Outdated Show resolved Hide resolved
cpp/src/reductions/scan/rank_scan.cu Outdated Show resolved Hide resolved
cpp/src/reductions/scan/rank_scan.cu Outdated Show resolved Hide resolved
cpp/src/reductions/scan/scan.cpp Outdated Show resolved Hide resolved
cpp/tests/groupby/rank_scan_tests.cpp Show resolved Hide resolved
@codecov

This comment was marked as off-topic.

Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the java/JNI side this looks fine.

1. Bug fix in rank_scan.cu, for single element groups.
2. Disambiguate ternary if.
3. Use the right memory resource.
4. Removed unused function parameters in tests.
5. Fix device lambdas to explicitly return ints instead of bools.
@mythrocks
Copy link
Contributor Author

Argh. I haven't modified the rank_tests.cpp yet. I'm working on this now.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job, no additional comments from me. Thanks!

edit: Looks like some tests depend on the error message text, which was changed in this PR.

/workspace/.conda-bld/work/cpp/tests/reductions/rank_tests.cpp:248
Value of: e.what()
Expected: ends with "Unsupported dense rank aggregation operator for exclusive scan"
  Actual: 0x29f3f018 pointing to "cuDF failure at: /workspace/.conda-bld/work/cpp/src/reductions/scan/scan.cpp:41: Dense rank aggregation operator requires an inclusive scan" (of type char const*)

cpp/include/cudf/aggregation.hpp Outdated Show resolved Hide resolved
v22.04 Release automation moved this from PR-WIP to PR-Reviewer approved Feb 7, 2022
@mythrocks
Copy link
Contributor Author

Sorry for the delay. I've modified rank_tests.cpp to remove cruft, and update for percent_rank().

1. Pass primitive function arguments by copy.
2. Non-negative placeholder values for null rows.
3. Clearer variable names.
4. Unused variables are now unnamed.
@mythrocks
Copy link
Contributor Author

Rerun tests.

Copy link
Contributor

@codereport codereport left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good 👍

@mythrocks
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 8f0f6f8 into rapidsai:branch-22.04 Feb 10, 2022
v22.04 Release automation moved this from PR-Reviewer approved to Done Feb 10, 2022
@mythrocks
Copy link
Contributor Author

Thank you for the reviews, @bdice, @revans2, @codereport, and @ttnghia. This change has now been merged.

@karthikeyann
Copy link
Contributor

@mythrocks RANK should be unified into single aggregation with type of the rank as parameter (FIRST, AVERAGE, MIN, MAX, DENSE).
I shelved a PR #9569 for bringing this together with another feature request.
Could percent be made a parameter in RANK aggregation?

@mythrocks
Copy link
Contributor Author

@karthikeyann, sorry for the delayed response. A couple of things to consider:

  1. As ranking functions go, I'd put ROW_NUMBER and eventually NTILE with them as well. I'm supportive of grouping them together (please excuse the pun. :]).
  2. I haven't seen MIN, MAX, AVG usually listed among ranking functions. But those are indeed grouped broadly under "analytic" functions. Would there be value in separating the ranking aggregations from the other analytic ones?

@revans2
Copy link
Contributor

revans2 commented Feb 22, 2022

@mythrocks pandas has ranking functions for min, max, and average that SQL does not have

@mythrocks
Copy link
Contributor Author

mythrocks commented Feb 24, 2022

@mythrocks pandas has ranking functions for min, max, and average that SQL does not have

On first read, I thought this meant that Pandas treats MIN() as a ranking function. That didn't compute. Reading the Pandas docs makes it clear.

To close the loop on this, the use of min, max, avg in Pandas is for how to break ties, when multiple rows occupy the same position on sorting:

  1. Using min produces the same results as SQL rank().
  2. Using dense produces the same results as SQL dense_rank().
  3. Avg is interesting: If there are 4 rows tied for 2nd place, their rank becomes 2.25.

Could percent be made a parameter in RANK aggregation?

I think that should be possible. We can take this discussion over to #9569, to explore our options.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuDF (Java) Affects Java cuDF API. feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

[FEA] percent_rank in window operations
6 participants