
[FEA] help support for distributed approximate_percentile/quantile #7170

Closed
revans2 opened this issue Jan 19, 2021 · 8 comments
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments


revans2 commented Jan 19, 2021

Is your feature request related to a problem? Please describe.
For the Spark Accelerator we would like to be able to support the approximate percentile aggregation approx_percentile. This is not a simple aggregation/reduction.

First quoting from http://spark.apache.org/docs/latest/api/sql/index.html#approx_percentile

approx_percentile(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric column col at the given percentage. The value of percentage must be between 0.0 and 1.0. The accuracy parameter (default: 10000) is a positive numeric literal which controls approximation accuracy at the cost of memory. Higher value of accuracy yields better accuracy, 1.0/accuracy is the relative error of the approximation. When percentage is an array, each value of the percentage array must be between 0.0 and 1.0. In this case, returns the approximate percentile array of column col at the given percentage array.

And second it does this as a distributed algorithm where there are a few phases to do the aggregation.

  1. Make an initial pass over the data and compute an intermediate value for each group (for a reduction, a single intermediate value covering all of the data).
  2. Shuffle the data so all of the values associated with a given key land on the same host (for a reduction, all values go to a single host).
  3. Combine the intermediate results together, grouped by key (or all together in the case of a reduction).
  4. Pull the percentile values out of these combined results.
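As a toy illustration of those phases, here is a hedged Python sketch where a crude equi-width histogram stands in for the real compressed intermediate (all function names here are hypothetical, and a real implementation would re-compress after the merge to bound memory):

```python
def build_summary(values, num_buckets=32):
    # Phase 1 (sketch): compress one partition's values into
    # (bucket_center, count) pairs.
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets or 1.0
    counts = [0] * num_buckets
    for v in values:
        i = min(int((v - lo) / width), num_buckets - 1)
        counts[i] += 1
    return [(lo + (i + 0.5) * width, c) for i, c in enumerate(counts) if c]

def merge_summaries(summaries):
    # Phase 3 (sketch): after the shuffle, concatenate the per-partition
    # summaries sorted by value.
    return sorted(pair for s in summaries for pair in s)

def approx_percentile(summary, p):
    # Phase 4 (sketch): walk cumulative counts to the requested rank.
    total = sum(c for _, c in summary)
    running = 0
    for value, count in summary:
        running += count
        if running >= p * total:
            return value
    return summary[-1][0]
```

The error of this sketch is bounded by the bucket width; the real feature request below asks for something smarter than equi-width buckets.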

I don't think we have to match Spark bit for bit, but we should be able to produce answers that are relatively close, and ideally space efficient on the GPU.

Describe the solution you'd like

  1. An aggregation/reduction similar to collect_list or collect_set, but with some form of smart lossy compression so we don't use too much memory.
  2. An aggregation/reduction that would essentially be like a concat_lists or concat_sets, but again with some kind of lossy compression like in 1.
  3. An API that takes the output of step 2 and produces approximate percentile/quantile value(s).
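To make the three requested operations concrete, here is a deliberately naive sketch where the "smart lossy compression" is just reservoir sampling. The names `lossy_collect`, `lossy_concat`, and `approx_quantile` are hypothetical stand-ins for the aggregations described above, and note that re-sampling merged reservoirs as written is only statistically fair when the input groups are the same size:

```python
import random

def lossy_collect(values, capacity=64, rng=None):
    # Step 1 (sketch): reservoir-sample the group's values as a crude
    # stand-in for "collect_list with smart lossy compression".
    rng = rng or random.Random(0)
    reservoir = []
    for i, v in enumerate(values):
        if len(reservoir) < capacity:
            reservoir.append(v)
        else:
            j = rng.randrange(i + 1)
            if j < capacity:
                reservoir[j] = v
    return reservoir

def lossy_concat(summaries, capacity=64, rng=None):
    # Step 2 (sketch): concatenate per-group summaries, then re-compress.
    # A real merge would track weights for unevenly sized groups.
    return lossy_collect([v for s in summaries for v in s], capacity, rng)

def approx_quantile(summary, q):
    # Step 3 (sketch): exact quantile of the small compressed summary.
    s = sorted(summary)
    return s[min(int(q * len(s)), len(s) - 1)]
```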

Describe alternatives you've considered
There really isn't a way to do this without some help from cudf.

Additional context
Like I said initially, we don't need to be exact here. We can store the intermediate data as some kind of list of bytes, or a list of doubles/longs, whatever is needed.

The details of the math for the error in the percentile are a bit beyond me without some concentrated work, but I can describe what other applications do.

Hive uses a numeric histogram for compressing the data.
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox.java
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java

Hive uses the equivalent of accuracy to determine the maximum number of buckets it stores in the histogram.
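A minimal sketch of that bucket-capped histogram idea (loosely modeled on Hive's NumericHistogram, not a faithful port): when the histogram exceeds its bucket budget, the two closest adjacent buckets are merged into their weighted mean.

```python
import bisect

class NumericHistogram:
    # Loose sketch: keep at most `max_bins` [value, count] pairs sorted
    # by value; when the budget is exceeded, merge the two closest
    # adjacent bins into their weighted mean.
    def __init__(self, max_bins):
        self.max_bins = max_bins
        self.bins = []  # sorted [value, count] pairs

    def add(self, v):
        keys = [b[0] for b in self.bins]
        i = bisect.bisect_left(keys, v)
        if i < len(self.bins) and self.bins[i][0] == v:
            self.bins[i][1] += 1
            return
        self.bins.insert(i, [v, 1])
        if len(self.bins) > self.max_bins:
            self._merge_closest()

    def _merge_closest(self):
        gaps = [self.bins[i + 1][0] - self.bins[i][0]
                for i in range(len(self.bins) - 1)]
        i = gaps.index(min(gaps))
        (v1, c1), (v2, c2) = self.bins[i], self.bins[i + 1]
        self.bins[i:i + 2] = [[(v1 * c1 + v2 * c2) / (c1 + c2), c1 + c2]]

    def quantile(self, q):
        # walk cumulative counts to the requested rank
        total = sum(c for _, c in self.bins)
        running = 0
        for v, c in self.bins:
            running += c
            if running >= q * total:
                return v
        return self.bins[-1][0]
```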

Spark uses a more mathematically grounded approach based on

"Space-efficient Online Computation of Quantile Summaries" by Greenwald, Michael and Khanna, Sanjeev. (https://doi.org/10.1145/375663.375670)

I am likely to get some of this wrong, but from reading the Spark code it looks like they store a count of how many values have been seen and an array of triplets called sampled, where each triplet holds a value (double), a g (long), and a delta (long). They also store an array of double values referred to as the head. Essentially, head buffers enough incoming data to make compressing it into sampled worthwhile. When head reaches a cutoff point, compress is called, which sorts head and inserts its values into sampled. It then compresses sampled by merging entries that are close to each other. The math for what counts as close enough is in the paper, and in the Spark code, if this is the route we want to take.
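A rough, simplified reading of that structure in Python. This is a sketch of the triplet layout and the compress condition only, not Spark's or the paper's exact algorithm (the delta assignment and compress schedule here are simplified assumptions):

```python
import bisect

class GKSummary:
    # Sketch of a Greenwald-Khanna style summary: `sampled` holds
    # [value, g, delta] triplets, where g is the gap in minimum rank
    # from the previous triplet and delta bounds the rank uncertainty.
    def __init__(self, eps=0.01):
        self.eps = eps
        self.n = 0
        self.sampled = []  # [value, g, delta], sorted by value

    def insert(self, v):
        keys = [t[0] for t in self.sampled]
        i = bisect.bisect_left(keys, v)
        # boundary tuples must be exact (delta = 0)
        delta = 0 if i in (0, len(self.sampled)) else int(2 * self.eps * self.n)
        self.sampled.insert(i, [v, 1, delta])
        self.n += 1
        if self.n % max(1, int(1 / (2 * self.eps))) == 0:
            self._compress()

    def _compress(self):
        # merge a triplet into its right neighbor when the combined
        # rank uncertainty stays within the 2*eps*n budget
        limit = 2 * self.eps * self.n
        i = len(self.sampled) - 2
        while i >= 1:  # never merge away the minimum triplet
            g1 = self.sampled[i][1]
            v2, g2, d2 = self.sampled[i + 1]
            if g1 + g2 + d2 <= limit:
                self.sampled[i:i + 2] = [[v2, g1 + g2, d2]]
            i -= 1

    def query(self, q):
        # return a value whose rank is close to q * n
        target = q * self.n
        running = 0
        for v, g, d in self.sampled:
            running += g
            if running >= target:
                return v
        return self.sampled[-1][0]
```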

Dask appears to use a combination of t-digest (https://cinc.rud.is/web/packages/tdigest/) and a simpler scheme. I'm not a Python expert, so I may be misreading the code, but it looks like the simpler scheme calls percentile on each partition of the input column for the map stage and then tries to infer what the overall percentile would be from the per-partition percentiles it calculated, which can result in serious errors in corner cases.
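A toy example of why inferring a global percentile from per-partition percentiles can go badly wrong on skewed data (this is an illustration of the failure mode, not Dask's actual code):

```python
# Two skewed partitions: a naive combination of per-partition medians
# lands far from the true median.
parts = [[1, 1, 1, 1, 1], [1, 1, 9, 9, 9]]
per_part = [sorted(p)[len(p) // 2] for p in parts]  # [1, 9]
naive = sum(per_part) / len(per_part)               # 5.0
all_values = sorted(v for p in parts for v in p)    # seven 1s, three 9s
true_median = all_values[len(all_values) // 2]      # 1
```

The naive estimate (5.0) is not even a value present in the data, while the true median is 1.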

I would be happy with a t-digest based implementation. There are C and C++ implementations with favorable licenses (MIT and Apache) that can be used as references (https://github.com/tdunning/t-digest), although some of them are a bit old and don't provide any control over the trade-off between accuracy and memory use.
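For flavor, here is a hedged sketch of a t-digest style merge pass: centroids sorted by mean are greedily combined while the merged weight stays under a size bound that shrinks toward the tails (q near 0 or 1), which is what gives t-digest its accuracy at extreme quantiles. The scale-function details here are simplified relative to the paper:

```python
def merge_centroids(centroids, total_weight, delta=100):
    # Greedily combine (mean, weight) centroids in mean order while the
    # merged weight stays under roughly 4 * total_weight * q * (1-q) / delta.
    centroids = sorted(centroids)
    merged = [list(centroids[0])]
    seen = centroids[0][1]
    for mean, weight in centroids[1:]:
        q = (seen + weight / 2) / total_weight
        bound = 4 * total_weight * q * (1 - q) / delta
        if merged[-1][1] + weight <= bound:
            m, w = merged[-1]
            merged[-1] = [(m * w + mean * weight) / (w + weight), w + weight]
        else:
            merged.append([mean, weight])
        seen += weight
    return merged

def tdigest_percentile(centroids, q):
    # Query: walk cumulative centroid weights to the requested rank.
    total = sum(w for _, w in centroids)
    running = 0.0
    for mean, weight in centroids:
        running += weight
        if running >= q * total:
            return mean
    return centroids[-1][0]
```

Because the bound collapses below 2 near the tails, tail values stay as singleton centroids and extreme quantiles come back nearly exact, while the middle of the distribution is compressed aggressively.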

I would also be happy with something not listed here so long as we get good performance and good memory usage.

@revans2 revans2 added feature request New feature or request Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Jan 19, 2021
@github-actions github-actions bot added this to Needs prioritizing in Feature Planning Jan 19, 2021
@kkraus14 kkraus14 added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jan 27, 2021
@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@jlowe
Member

jlowe commented Feb 26, 2021

This feature is still desired.


@revans2
Contributor Author

revans2 commented Mar 29, 2021

This is still desired.


@sameerz
Contributor

sameerz commented Jun 16, 2021

This feature is still needed.

@harrism harrism added the 0 - Backlog In queue waiting for assignment label Jun 16, 2021
@nvdbaranec nvdbaranec self-assigned this Jun 18, 2021
@nvdbaranec
Contributor

Going to start looking at this.

@nvdbaranec nvdbaranec moved this from Needs prioritizing to Next release in Feature Planning Jul 20, 2021
@nvdbaranec nvdbaranec added this to Issue-Needs prioritizing in v21.10 Release via automation Jul 20, 2021
@beckernick beckernick moved this from Issue-Needs prioritizing to Issue-P2 in v21.10 Release Aug 26, 2021
@beckernick beckernick removed 0 - Backlog In queue waiting for assignment inactive-30d labels Aug 26, 2021
rapids-bot bot pushed a commit that referenced this issue Sep 24, 2021
Addresses #7170

Adds 3 pieces of new functionality:

- A `TDIGEST` aggregation which creates a tdigest column (https://arxiv.org/pdf/1902.04023.pdf) from a stream of input scalars.
- A `MERGE_TDIGEST` aggregation which merges multiple tdigest columns into a new one.
- A `percentile_approx` function which performs percentile queries on tdigest data.

Also exposes several ::detail functions (`sort`, `merge`, `slice`) in detail headers.

Ready for review.  I do need to add more tests though.

Authors:
  - https://github.com/nvdbaranec

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Jake Hemstad (https://github.com/jrhemstad)
  - MithunR (https://github.com/mythrocks)
  - Robert Maynard (https://github.com/robertmaynard)

URL: #8983
@caryr35 caryr35 added this to Issue-Needs prioritizing in v21.12 Release via automation Oct 4, 2021
@caryr35 caryr35 moved this from Issue-Needs prioritizing to Issue-P2 in v21.12 Release Oct 4, 2021
@caryr35 caryr35 removed this from Issue-P2 in v21.10 Release Oct 4, 2021
@jrhemstad jrhemstad removed this from Next release in Feature Planning Nov 10, 2021
@jrhemstad jrhemstad added this to Issue-Needs prioritizing in v22.02 Release via automation Nov 10, 2021
@jrhemstad jrhemstad removed this from Issue-P2 in v21.12 Release Nov 10, 2021
@jrhemstad jrhemstad removed this from Issue-Needs prioritizing in v22.02 Release Nov 10, 2021
@jrhemstad jrhemstad added this to Needs prioritizing in Feature Planning via automation Nov 10, 2021
@jrhemstad jrhemstad added this to Issue-Needs prioritizing in v21.12 Release via automation Nov 10, 2021
@jrhemstad jrhemstad removed this from Issue-Needs prioritizing in v21.12 Release Nov 10, 2021
@nvdbaranec
Contributor

Closed with: #8983

Feature Planning automation moved this from Needs prioritizing to Closed May 23, 2022