Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

STRING order-by column for RANGE window functions #13143

Merged

Conversation

mythrocks
Copy link
Contributor

@mythrocks mythrocks commented Apr 14, 2023

Description

This commit adds support to use STRING columns as the order-by column in range-based window functions.

Context

A range-based window function allows a variable-width window to be defined around each row of an input column. The "range" describes the range of values in an order-by column.

Consider the following input:

agg_column = {  1,   1,   1,   1,   1,   1,   1 };   // The aggregation column.
oby_column = { "0", "1", "2", "3", "4", "5", "6" };  // The order-by column.
//                        ^

If the range is defined as [preceding=∞, following=0], then for the row at index 2, the window would be rows 0-2.

The supported range bounds for STRING order-by column are:

  1. UNBOUNDED (i.e. extending to the beginning/end of the column/group)
  2. CURRENT ROW, indicating that all equivalent values of the current row in the oby_column.

Since STRING columns cannot have "delta" values, only UNBOUNDED and CURRENT ROW based window bounds are supported, unlike for numeric types which could specify a specific scalar delta value as a range.

Depends on #13095.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

…nge calculations.

This change is in preparation for supporting `STRING` order-by columns in range window functions.

In the staging step for executing window range queries, the boundaries of each row's window are
calculated. This involves subtracting/adding the `preceding`/`following` values from each
order-by column row, and then searching backwards/forwards for the boundary values.

The staging step has been using `column_device_view.data()` for accessing the order-by rows,
an acceptable approach for when the order-by columns are numeric (e.g. `INT32`).

This approach fails when the order-by column is a `STRING`, because `.data()` is not defined
for such columns. A better approach would be to use `.element()` to directly access the rows,
because it has special handling for `STRING`, among other types, while continuing to work
for numeric primitives.

In a followup to this change, support for `STRING` order-by columns will be added.

Signed-off-by: MithunR <mythrocks@gmail.com>
1. Naming conventions for function/method parameter.
2. More const for identifiers, where applicable.
@mythrocks mythrocks requested a review from a team as a code owner April 14, 2023 23:08
@mythrocks mythrocks marked this pull request as draft April 14, 2023 23:08
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Apr 14, 2023
@mythrocks mythrocks self-assigned this Apr 14, 2023
@mythrocks mythrocks added feature request New feature or request non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS labels Apr 14, 2023
1. Renamed range_window_bounds::extent to extent_type.
2. Punctuation.
3. @brief.
@mythrocks mythrocks requested a review from ttnghia April 19, 2023 20:58
@mythrocks
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 63f0b76 into rapidsai:branch-23.06 Apr 21, 2023
@mythrocks
Copy link
Contributor Author

Thanks for the reviews, chaps. This is now merged.

rapids-bot bot pushed a commit that referenced this pull request Apr 24, 2023
Fixes unused parameter warning/error introduced by #13143 in `grouped_rolling.cu`
```
CMakeFiles/cudf.dir/src/rolling/grouped_rolling.cu.o
/cudf/cpp/src/rolling/grouped_rolling.cu(280): error #177-D: parameter "delta" was declared but never referenced
```
This was found when building with nvcc 11.5.
I was hoping there would be some clever partial specialization or function overloading that could be done here but none were as clean as the current implementation. Solution was to just add `[[maybe_unused]]` to offending parameter declaration.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Bradley Dice (https://github.com/bdice)

URL: #13192
mythrocks added a commit to NVIDIA/spark-rapids that referenced this pull request Apr 27, 2023
Closes #7883.
Depends on rapidsai/cudf#13143, rapidsai/cudf#13199.

This commit adds support for `STRING` order-by columns for RANGE window functions.

Before this commit, only numeric and timestamp types were supported as order-by columns in window specifications. However, it is possible to specify window frames such as follows:
```sql
SELECT COUNT(1) OVER( PARTITION BY gid ORDER BY str_col )
```
The implicit range here is `UNBOUNDED PRECEDING AND CURRENT ROW`, although explicit bounds may also be specified.
Note that range values cannot be specified here, because `STRING` does not support intervals.

This change should now allow the plugin to support `UNBOUNDED PRECEDING`, `UNBOUNDED FOLLOWING`, and `CURRENT ROW` as range window bounds, when the order-by column is `STRING`.

Signed-off-by: MithunR <mythrocks@gmail.com>
rapids-bot bot pushed a commit that referenced this pull request Jun 20, 2023
This commit adds support for `FLOAT32` and `FLOAT64` order-by columns for RANGE-based window functions.

## Background
Up until this commit, order-by columns for RANGE window functions were allowed to be integral numerics, timestamps, or strings (for unbounded/current rows).

With this commit, window functions will be permitted to run on floating point value ranges. E.g. This supports windows defined with floating point deltas, like `rows with values exceeding the current row by no more than 3.14f`.

This is in the same vein as the support for `DECIMAL` (#11645) and `STRING` (#13143).

Authors:
  - MithunR (https://github.com/mythrocks)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - David Wendt (https://github.com/davidwendt)

URL: #13512
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants