Scale centroid sizes according to sqrt(q*(1-q)) instead of q*(1-q) #30
Interesting you should mention this. I have been wondering about this same issue. And the size scaling with [...]

I worry that you have taken the square root of the q(1-q) factor, however. The red line is your new limit. The black line is the current limit. [...]

The problem that I see is that if the error is proportional to the square [...]

It seems to me that the error that you have noted actually means that we [...]

I think that this graph shows the relative error for my original suggested [...]

On the other hand, if we actually use (4*q*(1-q))^2 as the size limits, the [...]

Your data suggest, however, that there is a big difference in the size [...]
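The figures referenced above did not come through, but the limit functions themselves are easy to tabulate. Here is a minimal sketch (my own illustration, not code from the thread) that prints the three candidate size limits, up to the common factor count/compression, so the red-line/black-line comparison can be reproduced numerically:

```java
public class LimitComparison {
    public static void main(String[] args) {
        // Candidate centroid size limits, up to the common factor count/compression:
        // the current limit 4*q*(1-q), the proposed 2*sqrt(q*(1-q)), and the
        // (4*q*(1-q))^2 alternative mentioned above.
        System.out.println("q       4q(1-q)    2sqrt(q(1-q))  (4q(1-q))^2");
        for (double q : new double[]{0.001, 0.01, 0.1, 0.25, 0.5}) {
            double current  = 4 * q * (1 - q);
            double proposed = 2 * Math.sqrt(q * (1 - q));
            double squared  = Math.pow(4 * q * (1 - q), 2);
            System.out.printf("%.3f   %.6f   %.6f   %.6f%n", q, current, proposed, squared);
        }
    }
}
```

Near the tails the square-root limit is much larger than the current one (at q = 0.001 it allows roughly 16x the centroid size), which is presumably where both the digest size savings and the extra tail error come from.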
Could you please add the mentioned figures to your comment?
I have the feeling that the error is proportional to the centroid size rather than to the square root of the centroid size. Let's say we have m centroids with sizes c_i where c_1 + c_2 + ... + c_m = N. [...]
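One way the accounting set up above might continue (a sketch of my own, not text from the thread) is to measure everything in rank units: interpolating inside centroid i can misplace a rank by O(c_i), while the purely statistical fluctuation of the number of samples below the true q-quantile is binomial with standard deviation sqrt(N*q*(1-q)). Keeping the first below the second bounds the centroid size:

```latex
c_1 + c_2 + \dots + c_m = N, \qquad
\underbrace{\Delta r_i = O(c_i)}_{\text{interpolation error in rank}}
\;\lesssim\;
\underbrace{\sigma_r = \sqrt{N\,q(1-q)}}_{\text{statistical rank noise}}
\;\Longrightarrow\;
c_i \lesssim \sqrt{N\,q(1-q)} .
```

Note that this gives a bound proportional to sqrt(q*(1-q)), matching the shape proposed in this issue, though with a sqrt(N) rather than a linear-in-N prefactor.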
Harumph... I was hoping that the figures would flow from email.
Hmm.... I think I see where you are going. There are two kinds of estimation possible. One kind is to estimate the empirical quantiles. The other is to estimate the quantiles of the underlying distribution. Estimating the underlying distribution is subject first to the estimation of the empirical distribution and then subject to the inescapable variation of the empirical distribution relative to the underlying distribution. It sounds like you are talking about matching the error in estimating the empirical quantile to the discrepancy between the empirical and underlying distribution. So far, I have only been pushing for improving the error in estimating the empirical quantiles.
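The distinction can be written as a triangle inequality (my own formalization of the two error sources named above, not text from the thread):

```latex
\bigl|\hat{Q}(q) - Q(q)\bigr| \;\le\;
\underbrace{\bigl|\hat{Q}(q) - Q_N(q)\bigr|}_{\text{t-digest vs.\ empirical}}
+ \underbrace{\bigl|Q_N(q) - Q(q)\bigr|}_{\text{empirical vs.\ underlying}},
\qquad
\bigl|Q_N(q) - Q(q)\bigr| \;\approx\; \frac{1}{f\bigl(Q(q)\bigr)}\sqrt{\frac{q(1-q)}{N}},
```

where Q is the quantile function of the underlying distribution, Q_N the empirical one, \hat{Q} the t-digest estimate, and f the underlying density; the second relation is the standard asymptotic error of a sample quantile. Only the first term is under the algorithm's control, which is the sense in which the two goals differ.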
Good point about relative error. My choice was based on two factors, both relatively weak: [...]
At the very least, I think that you have made a very strong case that the limit should be pluggable across all implementations.
I agree, the t-digest algorithm should be somehow parameterized to allow arbitrary limits. I think if one is only interested in a single quantile value or a predefined set of quantile values, completely different limit functions could be more appropriate.
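What a pluggable limit could look like, as a hypothetical sketch (the interface and names below are my own illustration, not the library's actual API):

```java
/** Hypothetical hook for the centroid size limit; not t-digest's real API. */
interface SizeLimit {
    /** Maximum centroid weight at quantile q, given n points and the compression. */
    double limit(double q, long n, double compression);
}

class QuadraticLimit implements SizeLimit {           // current behavior
    public double limit(double q, long n, double compression) {
        return 4 * n * q * (1 - q) / compression;
    }
}

class SqrtLimit implements SizeLimit {                // proposal from this issue
    public double limit(double q, long n, double compression) {
        return 2 * n * Math.sqrt(q * (1 - q)) / compression;
    }
}
```

With such a hook, the hard-coded `4 * count * q * (1 - q) / compression` in `add()` would become a call like `sizeLimit.limit(q, count, compression)`.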
I closed this issue in favor of #37.
Since the statistical error when estimating quantiles is proportional to sqrt(q*(1-q)) (compare Numerical Recipes, 3rd edition, p. 438), I think it could be better to choose the centroid sizes accordingly. In the add() method I replaced the line
```java
double k = 4 * count * q * (1 - q) / compression;
```
by
```java
double k = 2 * count * Math.sqrt(q * (1 - q)) / compression;
```
and got very interesting results as shown below. For comparison, the figures are also shown for the original approach. The error-scaling and scaling figures show a reduction of digest sizes by a factor of more than 2. The error figures suggest that the new approach is somewhat less accurate for the Gamma distribution. However, I believe that the loss in accuracy is small compared to the statistical error introduced by data sampling. It would be interesting to measure the quality of the t-digest algorithm in terms of the statistical error as proposed in Numerical Recipes (3rd edition, p. 438):
[...] (A_1, A_2, ..., A_N are the sampled values and the IQ agent is the quantile estimator, t-digest in our case)
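The Numerical Recipes formula itself did not survive above, but the parenthetical suggests a measure that expresses the estimator's error in units of the statistical noise sqrt(q*(1-q)/N). A sketch under that assumption (my construction, not necessarily the exact NR measure):

```java
import java.util.Arrays;

public class QuantileQuality {
    /**
     * Error of an estimated q-quantile, normalized by the statistical noise
     * sqrt(q*(1-q)/N) of the empirical quantile: values of order 1 mean the
     * estimator's error is comparable to the unavoidable sampling error.
     */
    static double normalizedError(double[] samples, double q, double estimate) {
        int n = samples.length;
        long below = Arrays.stream(samples).filter(a -> a < estimate).count();
        double qHat = (double) below / n;           // empirical rank of the estimate
        double noise = Math.sqrt(q * (1 - q) / n);  // statistical quantile noise
        return (qHat - q) / noise;
    }

    public static void main(String[] args) {
        double[] samples = new java.util.Random(42).doubles(200_000).toArray();
        double q = 0.99;
        // Stand-in for a t-digest estimate: the exact empirical quantile.
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        double estimate = sorted[(int) (q * sorted.length)];
        System.out.println(normalizedError(samples, q, estimate));
    }
}
```

In practice `estimate` would come from `TDigest.quantile(q)`, and the measure would be averaged over many independent sample sets.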