Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for using tdigests to compute approximate percentiles. #8983

Merged
merged 34 commits into from Sep 24, 2021

Conversation

nvdbaranec
Copy link
Contributor

@nvdbaranec nvdbaranec commented Aug 5, 2021

Addresses #7170

Adds 3 pieces of new functionality:

  • A TDIGEST aggregation which creates a tdigest column (https://arxiv.org/pdf/1902.04023.pdf) from a stream of input scalars.
  • A MERGE_TDIGEST aggregation which merges multiple tdigest columns into a new one.
  • a percentile_approx function which performs percentile queries on tdigest data.

Also exposes several ::detail functions (sort, merge, slice) in detail headers.

Ready for review. I do need to add more tests though.

@nvdbaranec nvdbaranec added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. helps: Spark Functionality that helps Spark RAPIDS labels Aug 5, 2021
@nvdbaranec nvdbaranec requested review from a team as code owners August 5, 2021 21:45
@nvdbaranec nvdbaranec added this to PR-WIP in v21.10 Release via automation Aug 5, 2021
@github-actions github-actions bot added CMake CMake build issue cuDF (Java) Affects Java cuDF API. cuDF (Python) Affects Python cuDF API. labels Aug 5, 2021
@nvdbaranec nvdbaranec added non-breaking Non-breaking change and removed cuDF (Python) Affects Python cuDF API. CMake CMake build issue cuDF (Java) Affects Java cuDF API. labels Aug 5, 2021
@nvdbaranec nvdbaranec marked this pull request as draft August 5, 2021 21:46
@github-actions github-actions bot added CMake CMake build issue cuDF (Python) Affects Python cuDF API. labels Aug 19, 2021
@caryr35 caryr35 moved this from PR-WIP to PR-Needs review in v21.10 Release Sep 23, 2021
v21.10 Release automation moved this from PR-Needs review to PR-Reviewer approved Sep 23, 2021
Copy link
Contributor

@jrhemstad jrhemstad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I see the null percentile thing hasn't been resolved yet.

v21.10 Release automation moved this from PR-Reviewer approved to PR-Needs review Sep 23, 2021
@nvdbaranec
Copy link
Contributor Author

Actually, I see the null percentile thing hasn't been resolved yet.

Updated. Null percentile == null output.

Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is heady stuff. Much to learn.
Again, thank you for being flexible on the null handling.

v21.10 Release automation moved this from PR-Needs review to PR-Reviewer approved Sep 24, 2021
@nvdbaranec
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit ba76310 into rapidsai:branch-21.10 Sep 24, 2021
v21.10 Release automation moved this from PR-Reviewer approved to Done Sep 24, 2021
rapids-bot bot pushed a commit that referenced this pull request Sep 24, 2021
This PR builds on #8983 and adds Java bindings.

Authors:
  - Andy Grove (https://github.com/andygrove)
  - https://github.com/nvdbaranec

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)

URL: #9094
rapids-bot bot pushed a commit that referenced this pull request Oct 18, 2021
Addresses comments from initial PR (#8983).  Specifically implementing a tdigest_column_view for more cleanly accessing the various sub-columns of a tdigest column.

Includes several bounds checking fixes for empty groups.  Addresses an issue where entirely empty digests could potentially lead to an incorrect min/max values, which isn't technically _wrong_ but makes constructing test cases tricky.

Authors:
  - https://github.com/nvdbaranec

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Jake Hemstad (https://github.com/jrhemstad)
  - MithunR (https://github.com/mythrocks)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #9403
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue conda conda issue feature request New feature or request helps: Spark Functionality that helps Spark RAPIDS libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

7 participants