Support datasets that contain more than 2B values #14

Closed
jpountz opened this issue Mar 18, 2014 · 3 comments · Fixed by #15

Comments

@jpountz
Contributor

jpountz commented Mar 18, 2014

The current implementation uses ints to represent counts of values. It would be useful to switch to longs so that quantile estimation would still work if more than 2B values are accumulated.
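
For context on what overflows here, a minimal standalone sketch (illustrative only, not the t-digest API) of how an int-typed total wraps around once more than Integer.MAX_VALUE values are counted, while a long keeps working:

```java
// Illustrative only: why an int total count breaks past ~2.1B values.
public class CountOverflowDemo {
    public static void main(String[] args) {
        int intCount = Integer.MAX_VALUE;     // 2_147_483_647, roughly 2.1B
        long longCount = Integer.MAX_VALUE;

        intCount += 1;   // wraps around to a negative number
        longCount += 1;  // keeps counting correctly

        System.out.println("int after one more value:  " + intCount);   // -2147483648
        System.out.println("long after one more value: " + longCount);  // 2147483648
    }
}
```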

@tdunning
Owner

I can do that, but the current limit is typically more generous than 2^31 total samples: it is more like two billion samples in a single centroid. If you don't have repeated values, the limit is about 1000 times that.

Furthermore, it is common to have many aggregates going at the same time, each of which has this limit separately.

So my question to you is: how real is this request? How many problematic examples have you seen?

How much do you have aggre


@jpountz
Contributor Author

jpountz commented Mar 18, 2014

I don't have a problematic example yet, since the feature we are working on that leverages t-digest has not been released. However, some of our users store more than 2B documents, and I was checking whether quantile estimation would work on such large datasets.

Maybe something simple that could be done to raise the maximum dataset size, with negligible impact on memory usage, would be to just change ArrayDigest.totalWeight and TreeDigest.count to longs?
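
A rough sketch of what that suggestion could look like (the class and method below are hypothetical stand-ins, not the actual ArrayDigest/TreeDigest internals): only the digest-wide running total changes type, so the memory cost is a few extra bytes per digest object.

```java
// Hypothetical sketch of the suggested change: the digest-wide total becomes
// a long while per-centroid counts stay as ints.
class DigestSketch {
    private long totalWeight = 0;        // was: int

    void add(double x, int weight) {
        // ... merging x into the nearest centroid omitted ...
        totalWeight += weight;           // stays correct beyond 2^31 - 1 values
    }

    long size() {
        return totalWeight;              // was: int
    }
}
```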

@tdunning
Owner


Good point. I will do that sometime this week unless I see a pull request from you sooner.

jpountz added a commit to jpountz/t-digest that referenced this issue Mar 19, 2014
Centroid counts remain tracked as integers, but whenever counts of several
centroids need to be summed up, a long is used instead. This should allow
for summarizing datasets of at least several tens of billions of values.

Close tdunning#14
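
The idea in that commit message can be illustrated with a small sketch (the CentroidStub type below is a stand-in, not the library's actual Centroid class): per-centroid counts stay as ints, but any sum across centroids is accumulated in a long.

```java
import java.util.List;

// Stand-in type to illustrate the commit's approach: counts remain int per
// centroid, and only the sum across centroids is widened to long.
class CentroidStub {
    final double mean;
    final int count;                        // unchanged: still an int
    CentroidStub(double mean, int count) { this.mean = mean; this.count = count; }
}

class WeightSum {
    // Accumulating into a long avoids overflow even when the total weight
    // exceeds 2^31 - 1.
    static long totalWeight(List<CentroidStub> centroids) {
        long total = 0;
        for (CentroidStub c : centroids) {
            total += c.count;
        }
        return total;
    }
}
```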