Support percent_rank() aggregation #10227
The `percent_rank()` aggregation is a ranking function, similar to `rank()` and `dense_rank()`. It calculates the fractional (relative) rank of a row within a group of rows in a column, and returns a double value in the range [0.0, 1.0] for each row in the group/column.

This commit includes the `libcudf` changes to support `percent_rank()` aggregations, and the supporting JNI bindings.

Note that `percent_rank()` is typically used as a window aggregation. It is implemented as a (grouped) scan aggregation, just like `rank()` and `dense_rank()`, because it operates across the whole group of rows (i.e. the window specification is fixed, spanning the entire group).

References:
1. [SQL Server](https://docs.microsoft.com/en-us/sql/t-sql/functions/percent-rank-transact-sql)
2. [PostgreSQL](https://www.postgresql.org/docs/10/functions-window.html)
3. [SparkSQL](https://sparkbyexamples.com/spark/spark-sql-window-functions/#ranking-functions)
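To make the description concrete, here is a minimal pure-Python sketch of the usual SQL definition, `(rank - 1) / (rows_in_group - 1)`, evaluated over a whole group at once. It is an illustration only, not the `libcudf` implementation: the function names `percent_rank` and `grouped_percent_rank` are hypothetical, the input is assumed to be pre-sorted, and returning 0.0 for a single-row group follows the convention of the SQL engines referenced above.

```python
from itertools import groupby


def percent_rank(sorted_values):
    """Percent rank of each row in ONE group whose values are already sorted.

    Assumption (hypothetical helper, not a libcudf API): ties share the
    lowest rank, as with SQL rank().
    """
    n = len(sorted_values)
    if n <= 1:
        # A single-row group has nothing to be ranked against; avoid 0/0
        # and return 0.0, matching the usual SQL behaviour.
        return [0.0] * n

    ranks = []
    for i, value in enumerate(sorted_values):
        if i > 0 and value == sorted_values[i - 1]:
            ranks.append(ranks[-1])  # tie: reuse the previous row's rank
        else:
            ranks.append(i + 1)      # 1-based position within the group
    return [(r - 1) / (n - 1) for r in ranks]


def grouped_percent_rank(keys, values):
    """Apply percent_rank per group; equal keys are assumed to be adjacent."""
    out = []
    for _, rows in groupby(zip(keys, values), key=lambda kv: kv[0]):
        out.extend(percent_rank([v for _, v in rows]))
    return out
```

Because the denominator depends on the size of the entire group, every row needs to see the whole group, which is consistent with treating the operation as a grouped scan rather than a sliding window.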
This PR is currently in draft. Some doxygen updates are yet to be done.
I have taken the liberty of changing the example used for |
Really nice work for a relatively large PR. I have attached feedback.
From the java/JNI side this looks fine.
1. Bug fix in rank_scan.cu, for single element groups.
2. Disambiguate ternary if.
3. Use the right memory resource.
4. Removed unused function parameters in tests.
5. Fix device lambdas to explicitly return ints instead of bools.
Argh. I haven't modified the |
Great job, no additional comments from me. Thanks!
edit: Looks like some tests depend on the error message text, which was changed in this PR.
/workspace/.conda-bld/work/cpp/tests/reductions/rank_tests.cpp:248
Value of: e.what()
Expected: ends with "Unsupported dense rank aggregation operator for exclusive scan"
Actual: 0x29f3f018 pointing to "cuDF failure at: /workspace/.conda-bld/work/cpp/src/reductions/scan/scan.cpp:41: Dense rank aggregation operator requires an inclusive scan" (of type char const*)
Sorry for the delay. I've modified |
1. Pass primitive function arguments by copy.
2. Non-negative placeholder values for null rows.
3. Clearer variable names.
4. Unused variables are now unnamed.
Rerun tests.
Looks really good 👍
@gpucibot merge
Thank you for the reviews, @bdice, @revans2, @codereport, and @ttnghia. This change has now been merged.
@mythrocks RANK should be unified into a single aggregation, with the type of rank as a parameter (FIRST, AVERAGE, MIN, MAX, DENSE).
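For readers following the proposal, a rough pure-Python sketch of what a single rank routine parameterised by method could compute is shown here. The `RankMethod` enum and the `rank` function are hypothetical names invented for illustration; they are not the `libcudf` or JNI API, and the tie-handling rules are assumed to follow the pandas-style meanings of the five names.

```python
from enum import Enum


class RankMethod(Enum):
    # Hypothetical enum mirroring the method names listed in the comment.
    FIRST = "first"
    AVERAGE = "average"
    MIN = "min"
    MAX = "max"
    DENSE = "dense"


def rank(sorted_values, method):
    """Rank the rows of one pre-sorted group using the chosen tie rule."""
    n = len(sorted_values)
    out = []
    dense = 0   # number of distinct values seen so far
    start = 0   # index of the first row in the current run of ties
    while start < n:
        end = start
        while end + 1 < n and sorted_values[end + 1] == sorted_values[start]:
            end += 1
        dense += 1
        for pos in range(start, end + 1):
            if method is RankMethod.FIRST:
                r = pos + 1                  # ties broken by position
            elif method is RankMethod.MIN:
                r = start + 1                # same as SQL rank()
            elif method is RankMethod.MAX:
                r = end + 1
            elif method is RankMethod.AVERAGE:
                r = (start + end + 2) / 2    # mean of the tied positions
            else:
                r = dense                    # same as SQL dense_rank()
            out.append(r)
        start = end + 1
    return out
```

Under these assumptions, MIN reproduces SQL `rank()` and DENSE reproduces `dense_rank()`, so the existing aggregations would become special cases of the one parameterised routine.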
@karthikeyann, sorry for the delayed response. A couple of things to consider:
|
@mythrocks pandas has ranking functions for min, max, and average that SQL does not have.
On first read, I thought this meant that Pandas treats To close the loop on this, the use of
I think that should be possible. We can take this discussion over to #9569, to explore our options.
Fixes #9644.