Support datasets that contain more than 2B values #14
The current implementation uses ints to represent counts of values. It would be useful to switch to longs so that quantile estimation would still work when more than 2B values are accumulated.
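For illustration, a minimal Java snippet (not from the library) showing why int-based counts cap out near 2B: once a count reaches Integer.MAX_VALUE, one more increment silently wraps to a negative value.

```java
public class IntCountOverflow {
    public static void main(String[] args) {
        int count = Integer.MAX_VALUE;   // 2,147,483,647: the largest value an int can hold
        count += 1;                      // one more sample silently wraps around
        System.out.println(count);       // prints -2147483648
    }
}
```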
Comments

I can do that, but the current limit is typically more generous than 2^31 samples: it is more like two billion samples in a single centroid. If you don't have repeated values, the limit is about 1000 times that. Furthermore, it is common to have many aggregates going at the same time, each of which has this limit separately. So my question to you is: how real is this request? How many problematic examples have you seen? How much do you have aggregated?
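As a rough reading of that estimate (the centroid count below is an illustrative assumption, not a figure from the library): a digest holding about 1000 centroids, each capped at Integer.MAX_VALUE samples, could represent on the order of two trillion values before any single count overflows.

```java
public class CapacityEstimate {
    public static void main(String[] args) {
        long perCentroidLimit = Integer.MAX_VALUE;        // ~2.1e9 samples per centroid
        long centroids = 1000;                            // assumed digest size, for illustration only
        System.out.println(perCentroidLimit * centroids); // 2147483647000 (~2.1e12)
    }
}
```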
I don't have any problematic examples yet, since the feature we are working on that leverages t-digest has not been released. However, some of our users store more than 2B documents, and I was checking whether quantile estimation would still work on such large datasets. Maybe something simple that could be done to raise the maximum dataset size while having negligible impact on memory usage would be to just change …
Good point. I will do that sometime this week unless I see a pull request first.

On Tue, Mar 18, 2014 at 2:13 PM, Adrien Grand <notifications@github.com> wrote: …
Centroid counts remain tracked as integers, but whenever the counts of several centroids need to be summed, a long is used instead. This should allow summarizing datasets of at least several tens of billions of values. Closes tdunning#14
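A minimal sketch of the approach the commit describes, not the library's actual code: each centroid keeps its int count, and only the running total is widened to a long, so the sum cannot overflow. The Centroid class here is a hypothetical stand-in for the library's own centroid type.

```java
import java.util.List;

public class CountSum {
    // Hypothetical stand-in for the library's centroid type.
    static class Centroid {
        final double mean;
        final int count;           // per-centroid count stays an int
        Centroid(double mean, int count) { this.mean = mean; this.count = count; }
    }

    // Summing into a long avoids overflow even when the total exceeds 2^31 - 1.
    static long totalCount(List<Centroid> centroids) {
        long total = 0;            // the long accumulator is the whole fix
        for (Centroid c : centroids) {
            total += c.count;      // each int is widened to long before the addition
        }
        return total;
    }

    public static void main(String[] args) {
        List<Centroid> digest = List.of(
                new Centroid(1.0, Integer.MAX_VALUE),
                new Centroid(2.0, Integer.MAX_VALUE));
        System.out.println(totalCount(digest)); // 4294967294, well past the int limit
    }
}
```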