Support datasets that contain more than 2B values #14

Closed
jpountz opened this issue Mar 18, 2014 · 3 comments · Fixed by #15

Comments

@jpountz
Contributor

jpountz commented Mar 18, 2014

The current implementation uses ints to represent counts of values. It would be useful to switch to longs so that quantile estimation would still work if more than 2B values are accumulated.
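
For context on what overflows here, a minimal standalone sketch (illustrative only, not the t-digest API) of how an int-typed total wraps around once more than Integer.MAX_VALUE values are counted, while a long keeps working:

```java
// Illustrative only: why an int total count breaks past ~2.1B values.
public class CountOverflowDemo {
    public static void main(String[] args) {
        int intCount = Integer.MAX_VALUE;     // 2_147_483_647, roughly 2.1B
        long longCount = Integer.MAX_VALUE;

        intCount += 1;   // wraps around to a negative number
        longCount += 1;  // keeps counting correctly

        System.out.println("int after one more value:  " + intCount);   // -2147483648
        System.out.println("long after one more value: " + longCount);  // 2147483648
    }
}
```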

@tdunning
Owner

I can do that, but the current limit is typically more generous than 2^31 total samples: it is more like two billion samples in a single centroid. If you don't have repeated values, the limit is about 1000 times that.

Furthermore, it is common to have many aggregates going at the same time, each of which has this limit separately.

So my question to you is: how real is this request? How many problematic examples have you seen?

How much do you have aggre


@jpountz
Contributor Author

jpountz commented Mar 18, 2014

I don't have a problematic example yet, since the feature we are working on that leverages t-digest has not been released. However, some of our users store more than 2B documents, and I was checking whether quantile estimation would work on such large datasets.

Maybe something simple that could be done to raise the maximum dataset size, with negligible impact on memory usage, would be to just change ArrayDigest.totalWeight and TreeDigest.count to longs?
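
A rough sketch of what that suggestion could look like (the class and method below are hypothetical stand-ins, not the actual ArrayDigest/TreeDigest internals): only the digest-wide running total changes type, so the memory cost is a few extra bytes per digest object.

```java
// Hypothetical sketch of the suggested change: the digest-wide total becomes
// a long while per-centroid counts stay as ints.
class DigestSketch {
    private long totalWeight = 0;        // was: int

    void add(double x, int weight) {
        // ... merging x into the nearest centroid omitted ...
        totalWeight += weight;           // stays correct beyond 2^31 - 1 values
    }

    long size() {
        return totalWeight;              // was: int
    }
}
```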

@tdunning
Owner


Good point. I will do that sometime this week unless I see a pull request from you sooner.

jpountz added a commit to jpountz/t-digest that referenced this issue Mar 19, 2014
Centroid counts remain tracked as integers, but whenever counts of several
centroids need to be summed up, a long is used instead. This should allow
for summarizing datasets of at least several tens of billions of values.

Close tdunning#14
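
The idea in that commit message can be illustrated with a small sketch (the CentroidStub type below is a stand-in, not the library's actual Centroid class): per-centroid counts stay as ints, but any sum across centroids is accumulated in a long.

```java
import java.util.List;

// Stand-in type to illustrate the commit's approach: counts remain int per
// centroid, and only the sum across centroids is widened to long.
class CentroidStub {
    final double mean;
    final int count;                        // unchanged: still an int
    CentroidStub(double mean, int count) { this.mean = mean; this.count = count; }
}

class WeightSum {
    // Accumulating into a long avoids overflow even when the total weight
    // exceeds 2^31 - 1.
    static long totalWeight(List<CentroidStub> centroids) {
        long total = 0;
        for (CentroidStub c : centroids) {
            total += c.count;
        }
        return total;
    }
}
```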