Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement COLLECT rolling window aggregation #7189

Merged
merged 27 commits into from
Jan 30, 2021

Conversation

mythrocks
Copy link
Contributor

Closes #7133.

This is an implementation of the COLLECT aggregation in the context of rolling window functions. This enables the collection of rows (of type T) within specified window boundaries into a list column (containing elements of type T). In this context, one list row would be generated per input row. E.g. Consider the following example:

auto input_col = fixed_width_column_wrapper<int32_t>{70, 71, 72, 73, 74};

Calling rolling_window() with preceding=2, following=1, min_periods=1 produces the following:

auto output_col = cudf::rolling_window(input_col, 2, 1, 1, collect_aggr);
            // == [ [70,71], [70,71,72], [71,72,73], [72,73,74], [73,74] ]

COLLECT is supported with rolling_window(), grouped_rolling_window(), and grouped_time_range_rolling_window(), across primitive types and arbitrarily nested lists and structs.

min_periods is also honoured: If the number of observations is fewer than min_periods, the resulting list row is null.

@mythrocks mythrocks requested review from a team as code owners January 22, 2021 04:16
@mythrocks mythrocks self-assigned this Jan 22, 2021
@mythrocks
Copy link
Contributor Author

mythrocks commented Jan 22, 2021

Part of the algorithm employed here is a refinement of @harrism's suggestion on #6791 (in a different context). I have tagged @harrism on this PR to confirm that this works as he intended.
(The function in question is get_list_child_to_list_row_mapping(), in rolling_detail.cuh.)

@mythrocks mythrocks added 3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond feature request New feature or request non-breaking Non-breaking change labels Jan 22, 2021
@mythrocks
Copy link
Contributor Author

There are now 4 fairly large tests that pertain to *rolling_window() functions:

  1. ROLLING_TEST
  2. GROUPED_ROLLING_TEST
  3. LEAD_LAG_TEST
  4. COLLECT_LIST_TEST

I intended to combine them all under the tests/rolling/ directory. I will do this in a subsequent PR so as not to detract from this one.

@codecov
Copy link

codecov bot commented Jan 22, 2021

Codecov Report

Merging #7189 (7115460) into branch-0.18 (8860baf) will increase coverage by 0.09%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.18    #7189      +/-   ##
===============================================
+ Coverage        82.09%   82.19%   +0.09%     
===============================================
  Files               97       99       +2     
  Lines            16474    16813     +339     
===============================================
+ Hits             13524    13819     +295     
- Misses            2950     2994      +44     
Impacted Files Coverage Δ
python/cudf/cudf/__init__.py 100.00% <ø> (ø)
python/cudf/cudf/_fuzz_testing/parquet.py 0.00% <ø> (ø)
python/cudf/cudf/_lib/__init__.py 100.00% <ø> (ø)
python/cudf/cudf/_typing.py 91.66% <ø> (ø)
python/cudf/cudf/core/__init__.py 100.00% <ø> (ø)
python/cudf/cudf/core/abc.py 91.48% <ø> (+4.25%) ⬆️
python/cudf/cudf/core/buffer.py 80.00% <ø> (+0.95%) ⬆️
python/cudf/cudf/core/column/__init__.py 100.00% <ø> (ø)
python/cudf/cudf/core/column/categorical.py 92.73% <ø> (-0.62%) ⬇️
python/cudf/cudf/core/column/column.py 87.75% <ø> (-0.39%) ⬇️
... and 70 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b608832...7115460. Read the comment docs.

v0.18 Release automation moved this from PR-Needs review to PR-Reviewer approved Jan 29, 2021
@mythrocks mythrocks removed the request for review from harrism January 29, 2021 21:37
@mythrocks mythrocks added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Jan 29, 2021
@mythrocks
Copy link
Contributor Author

Thank you for the reviews, @vuule, @rgsl888prabhu! 👏

@kkraus14 kkraus14 removed 3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond labels Jan 30, 2021
@kkraus14
Copy link
Collaborator

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 14b0900 into rapidsai:branch-0.18 Jan 30, 2021
v0.18 Release automation moved this from PR-Reviewer approved to Done Jan 30, 2021
@mythrocks
Copy link
Contributor Author

I have filed #7258 for optional null-filtering, to support Spark use-cases.

rapids-bot bot pushed a commit that referenced this pull request Feb 4, 2021
Fixes #7265.

`cudf::detail::get_num_child_rows()` is currently defined in `cudf/lists/detail/utilities.cuh`. The build pipelines for #7189 are fine, but there seem to be build failures in dependent projects such as `spark-rapids`:
```
[2021-01-31T08:12:10.611Z] /.../workspace/spark/cudf18_nightly/cpp/include/cudf/lists/detail/utilities.cuh:31:18: error: 'cudf::size_type cudf::detail::get_num_child_rows(const cudf::column_view&, rmm::cuda_stream_view)' defined but not used [-Werror=unused-function]
[2021-01-31T08:12:10.611Z]  static cudf::size_type get_num_child_rows(cudf::column_view const& list_offsets,
[2021-01-31T08:12:10.611Z]                   ^~~~~~~~~~~~~~~~~~
[2021-01-31T08:12:11.981Z] cc1plus: all warnings being treated as errors
[2021-01-31T08:12:12.238Z] make[2]: *** [CMakeFiles/cudf_hash.dir/build.make:82: CMakeFiles/cudf_hash.dir/src/hash/hashing.cu.o] Error 1
[2021-01-31T08:12:12.238Z] make[1]: *** [CMakeFiles/Makefile2:220: CMakeFiles/cudf_hash.dir/all] Error 2
```
In any case, it is less than ideal for the function to be completely defined in the header, especially given that the likes of `hashing.cu` are exposed to it (by way of `scatter.cuh`). 

This commit moves the function definition to a separate translation unit, without changing implementation or interface.

Authors:
  - MithunR (@mythrocks)

Approvers:
  - @nvdbaranec
  - Mike Wilson (@hyperbolic2346)
  - David (@davidwendt)

URL: #7266
rapids-bot bot pushed a commit that referenced this pull request Feb 4, 2021
Add unit tests for aggregate 'collect' with windowing.

This PR depends on the PR #7189 . 

Signed-off-by: Liangcai Li <liangcail@nvidia.com>

Authors:
  - Liangcai Li (@firestarman)

Approvers:
  - MithunR (@mythrocks)
  - Robert (Bobby) Evans (@revans2)

URL: #7121
rapids-bot bot pushed a commit that referenced this pull request Feb 18, 2021
Closes #7258.

#7189 implements `COLLECT` aggregations to be done from window functions. The semantics of how null input rows are handled are consistent with CUDF semantics.
E.g. 
```c++
auto input_col = fixed_width_column_wrapper<int32_t>{70, ∅, 72, 73, 74};
auto output_col = cudf::rolling_window(input_col, 2, 1, 1, collect_aggr);
            // == [ [70,∅], [70,∅,72], [∅,72,73], [72,73,74], [73,74] ]
```
Note that the null element (`∅`) is replicated in the first 3 rows of the output.

SparkSQL (and Hive, and other big data SQL systems) have different semantics, in that all null elements are purged. The output for the same operation should yield the following:
```c++
auto sparkish_output_col = cudf::rolling_window(input_col, 2, 1, 1, collect_aggr);
            // == [ [70], [70,72], [72,73], [72,73,74], [73,74] ]
```

CUDF should allow the `COLLECT` aggregation to be constructed with an optional `null_policy` argument (with default `INCLUDE`). The `COLLECT` window function should check the policy, and filter out null list-elements _a posteriori_.

Authors:
  - MithunR (@mythrocks)

Approvers:
  - Ram (Ramakrishna Prabhu) (@rgsl888prabhu)
  - AJ Schmidt (@ajschmidt8)
  - Vukasin Milovanovic (@vuule)
  - Jake Hemstad (@jrhemstad)

URL: #7264
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge feature request New feature or request non-breaking Non-breaking change
Projects
No open projects
v0.18 Release
  
Done
Development

Successfully merging this pull request may close these issues.

[FEA] COLLECT aggregation for rolling windows
8 participants