[FEA] help support for distributed approximate_percentile/quantile #7170
Comments
This issue has been labeled
This feature is still desired.
This issue has been labeled
This is still desired.
This issue has been labeled
This feature is still needed.
Going to start looking at this.
Addresses #7170. Adds 3 pieces of new functionality:

- A `TDIGEST` aggregation which creates a tdigest column (https://arxiv.org/pdf/1902.04023.pdf) from a stream of input scalars.
- A `MERGE_TDIGEST` aggregation which merges multiple tdigest columns into a new one.
- A `percentile_approx` function which performs percentile queries on tdigest data.

Also exposes several `::detail` functions (`sort`, `merge`, `slice`) in detail headers.

Ready for review. I do need to add more tests though.

Authors:
- https://github.com/nvdbaranec

Approvers:
- AJ Schmidt (https://github.com/ajschmidt8)
- Jake Hemstad (https://github.com/jrhemstad)
- MithunR (https://github.com/mythrocks)
- Robert Maynard (https://github.com/robertmaynard)

URL: #8983
Closed with: #8983
Is your feature request related to a problem? Please describe.
For the Spark Accelerator we would like to be able to support the approximate percentile aggregation `approx_percentile`. This is not a simple aggregation/reduction. First, quoting from http://spark.apache.org/docs/latest/api/sql/index.html#approx_percentile
And second, it does this as a distributed algorithm with a few phases to the aggregation.
I don't think we have to match Spark bit for bit, but we should be able to produce answers that are relatively close, and ideally space efficient on the GPU.
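The phased shape described above can be sketched in Python. This is purely illustrative (the function names are made up, and an exact sorted list stands in for the lossy per-partition summary that a real implementation would use to stay space efficient):

```python
def make_summary(values):
    """Map phase: build a per-partition summary.
    Here: just an exact sorted list; a real implementation would
    build a compact lossy structure such as a t-digest."""
    return sorted(values)

def merge_summaries(summaries):
    """Merge phase: combine the partition summaries into one."""
    merged = [v for s in summaries for v in s]
    merged.sort()
    return merged

def percentile(summary, q):
    """Final phase: answer a percentile query from the merged summary."""
    idx = min(int(q * (len(summary) - 1) + 0.5), len(summary) - 1)
    return summary[idx]

partitions = [[5, 1, 9], [2, 8], [7, 3, 6, 4]]
merged = merge_summaries(make_summary(p) for p in partitions)
median = percentile(merged, 0.5)   # -> 5
```

The approximation error comes entirely from what the per-partition summary throws away; the three-phase structure itself stays the same regardless of which summary is chosen.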
Describe the solution you'd like
Something like `concat_lists` or `concat_sets`, but again with some kind of lossy compression like in 1.

Describe alternatives you've considered
There really isn't a way to do this without some help from cudf.
Additional context
Like I said initially, we don't need to be exact here. We can store the intermediate data as some kind of a list of bytes or a list of doubles/longs, whatever is needed.
The details of the math for the error in the percentile are a bit beyond me without some concentrated work, but I can tell you what other applications do.
Hive uses a numeric histogram for compressing the data.
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox.java
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java
Hive uses the equivalent of an accuracy parameter to determine the maximum number of buckets that it stores in the histogram.
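The bucket-merging idea behind Hive's NumericHistogram can be sketched as follows (`hist_add` is an illustrative name, not Hive's API; the real class also interpolates when answering quantile queries):

```python
def hist_add(hist, value, max_buckets):
    """Insert a value as its own (center, count) bucket, then, while over
    the bucket budget, merge the two closest adjacent buckets into their
    count-weighted centroid. This is the rough compression scheme Hive's
    NumericHistogram uses; details here are simplified."""
    hist.append([value, 1])
    hist.sort(key=lambda b: b[0])
    while len(hist) > max_buckets:
        # find the adjacent pair of buckets with the smallest gap
        i = min(range(len(hist) - 1), key=lambda j: hist[j + 1][0] - hist[j][0])
        (x1, c1), (x2, c2) = hist[i], hist[i + 1]
        # replace the pair with its count-weighted centroid
        hist[i:i + 2] = [[(x1 * c1 + x2 * c2) / (c1 + c2), c1 + c2]]
    return hist

hist = []
for v in [1.0, 2.0, 100.0, 101.0, 3.0]:
    hist_add(hist, v, max_buckets=3)
# 5 values compressed into at most 3 buckets; the total count is preserved
```

A larger bucket budget (Hive's accuracy knob) trades memory for tighter percentile estimates.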
Spark uses a more math-focused approach based on "Space-efficient Online Computation of Quantile Summaries" by Greenwald, Michael and Khanna, Sanjeev (https://doi.org/10.1145/375663.375670).
I am likely to get this wrong, but from reading the Spark code it looks like they store a count of how many values have been seen, and an array of triplets, each of which holds a `value` (double), a `g` (long), and a `delta` (long). This array of triplets is called `sampled`. They also store an array of double values which is referred to as the `head`. Essentially `head` is there to gather enough data to make it worth compressing the data into `sampled`. When `head` has reached a cutoff point, `compress` is called, which will sort `head` and then insert the values into `sampled`. It will then compress `sampled` by merging entries that are close to each other. The math of what is close enough is in the paper, and in the Spark code, if this is the route we want to go with.

Dask appears to use a combination of T-Digest (https://cinc.rud.is/web/packages/tdigest/) and (I'm not a python expert so I don't really know, but it looks like) calling percentile on each partition of the input column for the map stage and then trying to infer what the overall percentile would be from the per-partition percentiles that it calculated previously, which can result in some serious errors in corner cases.
I would be happy with a t-digest based implementation. There are C and C++ implementations as well, with favorable licenses (MIT and Apache), that can be used as references (https://github.com/tdunning/t-digest), although some of them are a bit old and don't provide any control over accuracy vs the amount of memory used.
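For a flavor of the t-digest idea, here is a very loose sketch (names are illustrative, and this is not the real algorithm): sort the values, then greedily fold them into a bounded number of (mean, weight) centroids. A real t-digest bounds each cluster with a quantile-dependent scale function so the tails stay accurate, and supports merging digests; this uses a uniform cap for brevity.

```python
def build_digest(values, max_centroids=20):
    """Compress sorted values into at most max_centroids
    (mean, weight) centroids of roughly equal weight."""
    values = sorted(values)
    cap = max(1, len(values) // max_centroids)
    centroids = []
    for v in values:
        if centroids and centroids[-1][1] < cap:
            # fold v into the current centroid's running mean
            mean, w = centroids[-1]
            centroids[-1] = ((mean * w + v) / (w + 1), w + 1)
        else:
            centroids.append((v, 1))
    return centroids

def digest_quantile(centroids, q):
    """Approximate a quantile by walking cumulative centroid weights."""
    total = sum(w for _, w in centroids)
    target = q * total
    seen = 0.0
    for mean, w in centroids:
        seen += w
        if seen >= target:
            return mean
    return centroids[-1][0]

centroids = build_digest(range(100), max_centroids=20)
q50 = digest_quantile(centroids, 0.5)   # close to the true median of 0..99
```

The memory/accuracy trade-off the issue asks for falls directly out of `max_centroids` (or, in a real t-digest, the compression parameter delta).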
I would also be happy with something not listed here so long as we get good performance and good memory usage.