Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support null_policy::EXCLUDE for COLLECT rolling aggregation #7264

Merged
merged 10 commits into from
Feb 18, 2021

Conversation

mythrocks
Copy link
Contributor

Closes #7258.

#7189 implements COLLECT aggregations to be done from window functions. The semantics of how null input rows are handled are consistent with CUDF semantics.
E.g.

auto input_col = fixed_width_column_wrapper<int32_t>{70, ∅, 72, 73, 74};
auto output_col = cudf::rolling_window(input_col, 2, 1, 1, collect_aggr);
            // == [ [70,∅], [70,∅,72], [∅,72,73], [72,73,74], [73,74] ]

Note that the null element () is replicated in the first 3 rows of the output.

SparkSQL (and Hive, and other big data SQL systems) have different semantics, in that all null elements are purged. The output for the same operation should yield the following:

auto sparkish_output_col = cudf::rolling_window(input_col, 2, 1, 1, collect_aggr);
            // == [ [70], [70,72], [72,73], [72,73,74], [73,74] ]

CUDF should allow the COLLECT aggregation to be constructed with an optional null_policy argument (with default INCLUDE). The COLLECT window function should check the policy, and filter out null list-elements a posteriori.

@mythrocks mythrocks requested a review from a team as a code owner February 1, 2021 03:17
@mythrocks mythrocks self-assigned this Feb 1, 2021
@mythrocks mythrocks added 3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond feature request New feature or request non-breaking Non-breaking change labels Feb 1, 2021
@mythrocks
Copy link
Contributor Author

The build failure might be because branch-0.19 is quite new:

22:41:05 PackagesNotFoundError: The following packages are not available from current channels:
22:41:05 
22:41:05   - ucx-py=0.19
22:41:05   - rapids-build-env=0.19
22:41:05   - rapids-notebook-env=0.19
22:41:05 
22:41:05 Current channels:...

cpp/src/rolling/rolling_detail.cuh Outdated Show resolved Hide resolved
cpp/src/rolling/rolling_detail.cuh Outdated Show resolved Hide resolved
cpp/src/rolling/rolling_detail.cuh Outdated Show resolved Hide resolved
@mythrocks
Copy link
Contributor Author

I just checked that there isn't a pressing need to similarly exclude NaN values from a collected list, at least from SparkSQL. Spark retains NaN values.

@mythrocks
Copy link
Contributor Author

rerun tests

@codecov
Copy link

codecov bot commented Feb 4, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-0.19@846e6b6). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@              Coverage Diff               @@
##             branch-0.19    #7264   +/-   ##
==============================================
  Coverage               ?   82.19%           
==============================================
  Files                  ?      100           
  Lines                  ?    16968           
  Branches               ?        0           
==============================================
  Hits                   ?    13947           
  Misses                 ?     3021           
  Partials               ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 846e6b6...60aa7c6. Read the comment docs.

Copy link
Contributor

@rgsl888prabhu rgsl888prabhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jake has almost covered all essentials, have only minor question.

cpp/src/groupby/sort/groupby.cu Show resolved Hide resolved
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few suggestions, already looks great!

cpp/tests/collect_list/collect_list_test.cpp Outdated Show resolved Hide resolved
cpp/src/rolling/rolling_detail.cuh Outdated Show resolved Hide resolved
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Feb 13, 2021
@mythrocks mythrocks requested a review from vuule February 13, 2021 04:15
@mythrocks mythrocks requested a review from a team as a code owner February 14, 2021 08:00
@github-actions github-actions bot added the conda label Feb 14, 2021
1. Iterator-based iterator_with_null_at().
2. purge_null_values() interface.
@mythrocks mythrocks added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Feb 18, 2021
@mythrocks
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 9119726 into rapidsai:branch-0.19 Feb 18, 2021
@mythrocks
Copy link
Contributor Author

Thanks for the reviews, all. I've learnt a bit on this one.

firestarman added a commit to firestarman/spark-rapids that referenced this pull request Mar 1, 2021
Since cudf supports skipping null values by PRs
  rapidsai/cudf#7264, and
  rapidsai/cudf#7457.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
rapids-bot bot pushed a commit that referenced this pull request Mar 5, 2021
This PR is to support skipping nulls for `collect ` aggregation in JVM by creating a new class `CollectAggregation` who accepts a `NullPolicy ` argument indicating whether to include nulls. 

Skipping nulls has already been supported by `collect ` aggregation with rolling in native (#7264), so this PR just exposes the feaure in JVM.

This PR also introduces `NullPolicy ` and updates the related aggregates.

Signed-off-by: firestarman <firestarmanllc@gmail.com>

Authors:
  - Liangcai Li (@firestarman)

Approvers:
  - Robert (Bobby) Evans (@revans2)
  - MithunR (@mythrocks)

URL: #7457
firestarman added a commit to firestarman/spark-rapids that referenced this pull request Mar 8, 2021
Since cudf supports skipping null values by PRs
  rapidsai/cudf#7264, and
  rapidsai/cudf#7457.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
firestarman added a commit to NVIDIA/spark-rapids that referenced this pull request Mar 9, 2021
Since cudf supports skipping null values by PRs
  rapidsai/cudf#7264, and
  rapidsai/cudf#7457.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
hyperbolic2346 pushed a commit to hyperbolic2346/cudf that referenced this pull request Mar 25, 2021
This PR is to support skipping nulls for `collect ` aggregation in JVM by creating a new class `CollectAggregation` who accepts a `NullPolicy ` argument indicating whether to include nulls. 

Skipping nulls has already been supported by `collect ` aggregation with rolling in native (rapidsai#7264), so this PR just exposes the feaure in JVM.

This PR also introduces `NullPolicy ` and updates the related aggregates.

Signed-off-by: firestarman <firestarmanllc@gmail.com>

Authors:
  - Liangcai Li (@firestarman)

Approvers:
  - Robert (Bobby) Evans (@revans2)
  - MithunR (@mythrocks)

URL: rapidsai#7457
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
Since cudf supports skipping null values by PRs
  rapidsai/cudf#7264, and
  rapidsai/cudf#7457.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
Since cudf supports skipping null values by PRs
  rapidsai/cudf#7264, and
  rapidsai/cudf#7457.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond 5 - Ready to Merge Testing and reviews complete, ready to merge feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] COLLECT window aggregation should support null_policy::EXCLUDE
5 participants