[FEA] help support for distributed approximate_percentile/quantile #7170
Comments
This issue has been labeled
This feature is still desired.
This issue has been labeled
This is still desired.
This issue has been labeled
This feature is still needed.
Going to start looking at this.
Addresses #7170. Adds 3 pieces of new functionality:

- A `TDIGEST` aggregation which creates a tdigest column (https://arxiv.org/pdf/1902.04023.pdf) from a stream of input scalars.
- A `MERGE_TDIGEST` aggregation which merges multiple tdigest columns into a new one.
- A `percentile_approx` function which performs percentile queries on tdigest data.

Also exposes several `::detail` functions (`sort`, `merge`, `slice`) in detail headers.

Ready for review. I do need to add more tests though.

Authors:
- https://github.com/nvdbaranec

Approvers:
- AJ Schmidt (https://github.com/ajschmidt8)
- Jake Hemstad (https://github.com/jrhemstad)
- MithunR (https://github.com/mythrocks)
- Robert Maynard (https://github.com/robertmaynard)

URL: #8983
Closed with: #8983
Is your feature request related to a problem? Please describe.
For the Spark Accelerator we would like to be able to support the approximate percentile aggregation `approx_percentile`. This is not a simple aggregation/reduction. First, quoting from http://spark.apache.org/docs/latest/api/sql/index.html#approx_percentile
And second, it does this as a distributed algorithm with a few phases to the aggregation.
I don't think we have to match Spark bit for bit, but we should be able to produce answers that are relatively close, and ideally space efficient on the GPU.
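The phased shape described above can be sketched in Python. This is purely illustrative (the function names are made up, and an exact sorted list stands in for the lossy per-partition summary that a real implementation would use to stay space efficient):

```python
def make_summary(values):
    """Map phase: build a per-partition summary.
    Here: just an exact sorted list; a real implementation would
    build a compact lossy structure such as a t-digest."""
    return sorted(values)

def merge_summaries(summaries):
    """Merge phase: combine the partition summaries into one."""
    merged = [v for s in summaries for v in s]
    merged.sort()
    return merged

def percentile(summary, q):
    """Final phase: answer a percentile query from the merged summary."""
    idx = min(int(q * (len(summary) - 1) + 0.5), len(summary) - 1)
    return summary[idx]

partitions = [[5, 1, 9], [2, 8], [7, 3, 6, 4]]
merged = merge_summaries(make_summary(p) for p in partitions)
median = percentile(merged, 0.5)   # -> 5
```

The approximation error comes entirely from what the per-partition summary throws away; the three-phase structure itself stays the same regardless of which summary is chosen.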
Describe the solution you'd like
Something like `concat_lists` or `concat_sets`, but again with some kind of lossy compression like in 1.

Describe alternatives you've considered
There really isn't a way to do this without some help from cudf.
Additional context
Like I said initially, we don't need to be exact here. We can store the intermediate data as some kind of a list of bytes or a list of doubles/longs, whatever is needed.
The details of the math for the error in the percentile are a bit beyond me without some concentrated work, but I can tell you what other applications do.
Hive uses a numeric histogram for compressing the data.
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox.java
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java
Hive uses the equivalent of an accuracy parameter to determine the maximum number of buckets that it stores in the histogram.
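The bucket-merging idea behind Hive's NumericHistogram can be sketched as follows (`hist_add` is an illustrative name, not Hive's API; the real class also interpolates when answering quantile queries):

```python
def hist_add(hist, value, max_buckets):
    """Insert a value as its own (center, count) bucket, then, while over
    the bucket budget, merge the two closest adjacent buckets into their
    count-weighted centroid. This is the rough compression scheme Hive's
    NumericHistogram uses; details here are simplified."""
    hist.append([value, 1])
    hist.sort(key=lambda b: b[0])
    while len(hist) > max_buckets:
        # find the adjacent pair of buckets with the smallest gap
        i = min(range(len(hist) - 1), key=lambda j: hist[j + 1][0] - hist[j][0])
        (x1, c1), (x2, c2) = hist[i], hist[i + 1]
        # replace the pair with its count-weighted centroid
        hist[i:i + 2] = [[(x1 * c1 + x2 * c2) / (c1 + c2), c1 + c2]]
    return hist

hist = []
for v in [1.0, 2.0, 100.0, 101.0, 3.0]:
    hist_add(hist, v, max_buckets=3)
# 5 values compressed into at most 3 buckets; the total count is preserved
```

A larger bucket budget (Hive's accuracy knob) trades memory for tighter percentile estimates.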
Spark uses a more math-focused approach based on "Space-efficient Online Computation of Quantile Summaries" by Greenwald, Michael and Khanna, Sanjeev (https://doi.org/10.1145/375663.375670).
I am likely to get this wrong, but from reading the Spark code it looks like they store a count of how many values have been seen, and an array of triplets, each of which holds a `value` (double), a `g` (long), and a `delta` (long). This array of triplets is called `sampled`. They also store an array of double values which is referred to as the `head`. Essentially `head` is there to gather enough data to make it worth compressing the data into `sampled`. When `head` has reached a cutoff point, `compress` is called, which will sort `head` and then insert the values into `sampled`. It will then compress `sampled` by merging entries that are close to each other. The math of what is close enough is in the paper, and in the Spark code, if this is the route we want to go with.

Dask appears to use a combination of T-Digest (https://cinc.rud.is/web/packages/tdigest/) and (I'm not a python expert so I don't really know, but it looks like) calling percentile on each partition of the input column for the map stage and then trying to infer what the overall percentile would be from the per-partition percentiles that it calculated previously, which can result in some serious errors in corner cases.
I would be happy with a t-digest based implementation. There are C and C++ implementations as well, with favorable licenses (MIT and Apache), that can be used as references (https://github.com/tdunning/t-digest), although some of them are a bit old and don't provide any control over accuracy vs the amount of memory used.
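For a flavor of the t-digest idea, here is a very loose sketch (names are illustrative, and this is not the real algorithm): sort the values, then greedily fold them into a bounded number of (mean, weight) centroids. A real t-digest bounds each cluster with a quantile-dependent scale function so the tails stay accurate, and supports merging digests; this uses a uniform cap for brevity.

```python
def build_digest(values, max_centroids=20):
    """Compress sorted values into at most max_centroids
    (mean, weight) centroids of roughly equal weight."""
    values = sorted(values)
    cap = max(1, len(values) // max_centroids)
    centroids = []
    for v in values:
        if centroids and centroids[-1][1] < cap:
            # fold v into the current centroid's running mean
            mean, w = centroids[-1]
            centroids[-1] = ((mean * w + v) / (w + 1), w + 1)
        else:
            centroids.append((v, 1))
    return centroids

def digest_quantile(centroids, q):
    """Approximate a quantile by walking cumulative centroid weights."""
    total = sum(w for _, w in centroids)
    target = q * total
    seen = 0.0
    for mean, w in centroids:
        seen += w
        if seen >= target:
            return mean
    return centroids[-1][0]

centroids = build_digest(range(100), max_centroids=20)
q50 = digest_quantile(centroids, 0.5)   # close to the true median of 0..99
```

The memory/accuracy trade-off the issue asks for falls directly out of `max_centroids` (or, in a real t-digest, the compression parameter delta).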
I would also be happy with something not listed here so long as we get good performance and good memory usage.