Support `percent_rank()` aggregation #10227

mythrocks · 2022-02-04T22:27:11Z

The percent_rank() aggregation is a ranking function, similar
to rank() and dense_rank(). It calculates the fractional/relative
rank of a row within a group of rows in a column, and returns a double
value in the range [0.0, 1.0] for each row in the group/column.

This commit includes the libcudf changes to support percent_rank()
aggregations, and the supporting JNI bindings.

Note that percent_rank() is typically used as a window aggregation. It
is implemented as a (grouped) scan aggregation, just like rank() and
dense_rank(), because it operates across the whole group of rows (i.e.
the window specification is fixed, spanning the entire group).
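
As a rough illustration of the definition above, the sketch below computes percent rank for one already-sorted group using the usual SQL formula, `(rank - 1) / (rows_in_group - 1)`. The function name `percent_rank` and the plain-list input are assumptions made for this example; it is not the libcudf implementation and does not use the libcudf or JNI APIs.

```python
def percent_rank(sorted_values):
    """Percent rank of each row in one group that is already sorted.

    Hypothetical helper for illustration only; not part of libcudf.
    """
    n = len(sorted_values)
    if n <= 1:
        # A group with a single row (or no rows) has no (n - 1) denominator;
        # SQL engines report 0.0 for the lone row.
        return [0.0] * n

    result = []
    rank = 0
    for i, value in enumerate(sorted_values):
        # Tied rows share the rank of the first row in their run,
        # which is the same rule SQL rank() uses.
        if i == 0 or value != sorted_values[i - 1]:
            rank = i + 1
        result.append((rank - 1) / (n - 1))
    return result


print(percent_rank([10, 20, 20, 30, 40]))  # [0.0, 0.25, 0.25, 0.75, 1.0]
```

Because every row's result depends on the size of the whole group, the computation cannot be expressed with a sliding window frame, which is consistent with treating it as a scan over the entire group.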

References:

1. [SQL Server](https://docs.microsoft.com/en-us/sql/t-sql/functions/percent-rank-transact-sql)
2. [PostgreSQL](https://www.postgresql.org/docs/10/functions-window.html)
3. [SparkSQL](https://sparkbyexamples.com/spark/spark-sql-window-functions/#ranking-functions)

mythrocks · 2022-02-04T22:28:55Z

This PR is currently in draft. Some doxygen updates are yet to be done.

mythrocks · 2022-02-05T00:06:51Z

I have taken the liberty of changing the example used for RANK and DENSE_RANK aggregations. I submit that this captures the ranking example (with ties) more realistically.

bdice

Really nice work for a relatively large PR. I have attached feedback.

cpp/src/aggregation/aggregation.cpp

cpp/src/groupby/sort/group_rank_scan.cu

cpp/src/reductions/scan/rank_scan.cu

cpp/src/reductions/scan/scan.cpp

cpp/tests/groupby/rank_scan_tests.cpp

java/src/test/java/ai/rapids/cudf/TableTest.java

revans2

From the java/JNI side this looks fine.

Review fixes (commit 241d649):

1. Bug fix in rank_scan.cu, for single element groups.
2. Disambiguate ternary if.
3. Use the right memory resource.
4. Removed unused function parameters in tests.
5. Fix device lambdas to explicitly return ints instead of bools.

cpp/include/cudf/aggregation.hpp

mythrocks · 2022-02-07T21:26:51Z

Argh. I haven't modified the rank_tests.cpp yet. I'm working on this now.

bdice

Great job, no additional comments from me. Thanks!

edit: Looks like some tests depend on the error message text, which was changed in this PR.

/workspace/.conda-bld/work/cpp/tests/reductions/rank_tests.cpp:248
Value of: e.what()
Expected: ends with "Unsupported dense rank aggregation operator for exclusive scan"
  Actual: 0x29f3f018 pointing to "cuDF failure at: /workspace/.conda-bld/work/cpp/src/reductions/scan/scan.cpp:41: Dense rank aggregation operator requires an inclusive scan" (of type char const*)

cpp/include/cudf/aggregation.hpp

mythrocks · 2022-02-08T02:39:16Z

Sorry for the delay. I've modified rank_tests.cpp to remove cruft, and update for percent_rank().

cpp/include/cudf/detail/scan.hpp

cpp/src/groupby/sort/group_rank_scan.cu

cpp/src/groupby/sort/scan.cpp

cpp/src/reductions/scan/rank_scan.cu

cpp/src/groupby/sort/group_rank_scan.cu

cpp/src/reductions/scan/scan.cpp

cpp/tests/groupby/rank_scan_tests.cpp

1. Pass primitive function arguments by copy.
2. Non-negative placeholder values for null rows.
3. Clearer variable names.
4. Unused variables are now unnamed.

mythrocks · 2022-02-09T18:00:15Z

Rerun tests.

codereport

Looks really good 👍

cpp/src/groupby/sort/group_rank_scan.cu

cpp/src/groupby/sort/group_scan.hpp

mythrocks · 2022-02-10T22:10:33Z

@gpucibot merge

mythrocks · 2022-02-10T22:11:19Z

Thank you for the reviews, @bdice, @revans2, @codereport, and @ttnghia. This change has now been merged.

karthikeyann · 2022-02-16T16:00:14Z

@mythrocks RANK should be unified into a single aggregation, with the type of rank as a parameter (FIRST, AVERAGE, MIN, MAX, DENSE).
I shelved PR #9569, which was meant to bring this together with another feature request.
Could percent be made a parameter in RANK aggregation?

mythrocks · 2022-02-21T20:52:52Z

@karthikeyann, sorry for the delayed response. A couple of things to consider:

1. As ranking functions go, I'd put ROW_NUMBER and eventually NTILE with them as well. I'm supportive of grouping them together (please excuse the pun. :]).
2. I haven't seen MIN, MAX, AVG usually listed among ranking functions. But those are indeed grouped broadly under "analytic" functions. Would there be value in separating the ranking aggregations from the other analytic ones?

revans2 · 2022-02-22T15:16:20Z

@mythrocks pandas has ranking functions for min, max, and average that SQL does not have

mythrocks · 2022-02-24T05:06:00Z

> @mythrocks pandas has ranking functions for min, max, and average that SQL does not have

On first read, I thought this meant that Pandas treats MIN() as a ranking function. That didn't compute. Reading the Pandas docs makes it clear.

To close the loop on this, the use of min, max, avg in Pandas is for how to break ties, when multiple rows occupy the same position on sorting:

- Using `min` produces the same results as SQL `rank()`.
- Using `dense` produces the same results as SQL `dense_rank()`.
- `average` is interesting: if there are 4 rows tied for 2nd place (occupying positions 2 through 5), each of them gets rank 3.5, the mean of those positions.
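
The tie-breaking behaviour described here can be sketched in plain Python. The function `rank_with_ties` and its `method` argument are hypothetical names chosen for this example; the method names mirror the ones pandas uses, but the code calls neither pandas nor libcudf.

```python
def rank_with_ties(sorted_values, method="min"):
    """Rank an already-sorted list, resolving ties by the given method.

    Hypothetical helper for illustration; supports "min", "max",
    "average" and "dense".
    """
    n = len(sorted_values)
    ranks = [0] * n
    dense_rank = 0
    start = 0
    while start < n:
        # Find the end of the run of rows tied with sorted_values[start].
        end = start
        while end + 1 < n and sorted_values[end + 1] == sorted_values[start]:
            end += 1
        dense_rank += 1
        first, last = start + 1, end + 1  # 1-based positions of the run

        if method == "min":
            rank = first                  # same as SQL rank()
        elif method == "max":
            rank = last
        elif method == "average":
            rank = (first + last) / 2     # mean of the occupied positions
        elif method == "dense":
            rank = dense_rank             # same as SQL dense_rank()
        else:
            raise ValueError(f"unknown method: {method}")

        for i in range(start, end + 1):
            ranks[i] = rank
        start = end + 1
    return ranks


values = [10, 20, 20, 30]
print(rank_with_ties(values, "min"))      # [1, 2, 2, 4]
print(rank_with_ties(values, "dense"))    # [1, 2, 2, 3]
print(rank_with_ties(values, "average"))  # [1.0, 2.5, 2.5, 4.0]
```

All four methods share the same run-detection loop and differ only in which number they assign to a run, which is one argument for exposing the rank type as a parameter of a single aggregation.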

> Could percent be made a parameter in RANK aggregation?

I think that should be possible. We can take this discussion over to #9569, to explore our options.

mythrocks self-assigned this Feb 4, 2022

mythrocks requested review from a team as code owners February 4, 2022 22:27

mythrocks requested review from bdice and codereport February 4, 2022 22:27

github-actions bot added the `cuDF (Java)` and `libcudf` labels Feb 4, 2022

mythrocks marked this pull request as draft February 4, 2022 22:27

mythrocks added 3 commits February 4, 2022 16:03

Updated ranking function example, documentation.

3450c14

Removed debug code.

f3e52b9

Updated copyright dates.

9dd5f0f

mythrocks added this to PR-WIP in v22.04 Release via automation Feb 5, 2022

mythrocks added the `feature request` and `non-breaking` labels Feb 5, 2022

mythrocks marked this pull request as ready for review February 5, 2022 00:07

bdice reviewed Feb 5, 2022

sameerz mentioned this pull request Feb 7, 2022

[FEA] percent_rank in window operations #9644

Closed

revans2 approved these changes Feb 7, 2022

Review fixes:

241d649


ttnghia reviewed Feb 7, 2022

cpp/include/cudf/aggregation.hpp

Reword the TIE comment in the example.

2f3bc91

bdice approved these changes Feb 7, 2022

cpp/include/cudf/aggregation.hpp

v22.04 Release automation moved this from PR-WIP to PR-Reviewer approved Feb 7, 2022

Update rank_tests for percent_rank().

12d4822

mythrocks requested a review from ttnghia February 8, 2022 18:35