Improved constraint on centroid sizes #40
I have been traveling too much lately to think well, so this might be the reason for my not understanding the virtue of this alternative constraint. Can you say in a sentence why this constraint is better? Also, how would pluggable constraints interact with this suggestion?
The proposed constraint is slightly more accurate near 0 and 1. The main difference between the two constraints is the way the integral is calculated. The original constraint is obtained by approximating the integral using the rectangle rule. To solve the integral for arbitrary scaling laws, which cannot be done by analytical means, an appropriate numerical quadrature rule needs to be applied. As a replacement for the rectangle rule, the trapezoidal rule could be used instead. The trapezoidal rule has the nice property that it overestimates the integral of convex functions, which leads to a stricter constraint.
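To make the relationship concrete, here is a small sketch (using the notation from the proposal quoted at the end of this issue, and taking `q_i` as the centroid's mid-quantile, which is an assumption of this sketch):

\[
\int_{z_i}^{z_{i+1}} f(q)\,dq \;\approx\; (z_{i+1}-z_i)\,f(q_i) \;=\; \frac{w_i}{W}\,f(q_i),
\qquad q_i \approx \frac{z_i+z_{i+1}}{2},
\]
\[
(b-a)\,f\!\Big(\frac{a+b}{2}\Big) \;\le\; \int_a^b f(q)\,dq \;\le\; (b-a)\,\frac{f(a)+f(b)}{2}
\qquad \text{for convex } f .
\]

So the original condition `w_i/W * f(q_i) <= 4*delta` is just the rectangle (midpoint) rule applied to the integral condition; for convex `f` the midpoint rule underestimates the integral while the trapezoidal rule overestimates it, which is why the integral and trapezoidal forms give stricter merge tests.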
Using the new constraint and the Assume the
For pluggable centroid scaling laws it could make sense to provide the antiderivative instead. The antiderivative describes the scale on which you want to limit the histogram bin sizes. For example, for `f(q) = 1/(q*(1-q))` the antiderivative is `F(q) = ln(q/(1-q))`, i.e. the logit scale.
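As a rough illustration of this idea (a hypothetical interface with made-up names, not the actual t-digest API), a pluggable scaling law could expose only the antiderivative, and the merge test would then compare a difference of antiderivative values against the budget:

```java
// Hypothetical sketch of a pluggable centroid scaling law that exposes only the
// antiderivative F of f(q); interface and names are illustrative, not the t-digest API.
interface ScaleLaw {
    /** Antiderivative F(q) of the density f(q) that limits centroid sizes. */
    double antiderivative(double q);
}

final class MergePolicy {
    private final ScaleLaw law;
    private final double budget; // 4 * delta in the notation of this issue

    MergePolicy(ScaleLaw law, double delta) {
        this.law = law;
        this.budget = 4 * delta;
    }

    /**
     * Centroids covering the quantile range [zLeft, zRight] may be merged iff
     * the integral of f over that range stays within the budget, i.e.
     * F(zRight) - F(zLeft) <= 4 * delta.
     */
    boolean canMerge(double zLeft, double zRight) {
        return law.antiderivative(zRight) - law.antiderivative(zLeft) <= budget;
    }
}

// The scaling law proposed in this issue, f(q) = 1/(q*(1-q)), has the logit
// function as its antiderivative; F diverges at 0 and 1, so merging into the
// first or last centroid is inhibited automatically.
final class LogitScaleLaw implements ScaleLaw {
    @Override
    public double antiderivative(double q) {
        return Math.log(q / (1 - q));
    }
}
```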
OK. I understand your point. But does it have practical impact? Here, for instance, is the result of applying the standard approach: As you can see, close adherence. Magnifying: Likewise, the cluster sizes don't exceed the limit. So if the limit is not exceeded, would this alternative make a big difference?

On the other hand, your integral formulation of the size limit, and the resulting constant steps in logit space, are intriguing with respect to the size growth. Here are results of size growth for various size limits: As you can see, there is a qualitative difference between the standard q(1-q) rule and the sqrt(q(1-q)) rule. I don't fully understand this yet, but the approach I was trying to take was somewhat similar to your integral approach.

Here, btw, is the code that produced this graph. It uses a simplification of the t-digest that focuses only on a special case of ordered points where you know the number of points.
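As a rough sketch of that kind of experiment (hypothetical class and method names, greedy clustering of already-sorted points with a known count, not the linked gist), one could count how many clusters each size-limit rule produces as the number of points grows:

```java
import java.util.function.DoubleUnaryOperator;

// Rough sketch (not the linked gist): build clusters greedily over n ordered
// points and count how many clusters a given size-limit rule allows.
public final class SizeGrowthExperiment {

    /** Greedy single pass: grow the current cluster while its size stays under the limit. */
    static int countClusters(long n, double delta, DoubleUnaryOperator sizeRule) {
        int clusters = 0;
        long i = 0;
        while (i < n) {
            long size = 0;
            while (i < n) {
                // Limit evaluated at the next point's quantile (a simplification);
                // every point is always allowed to start a singleton cluster.
                double q = (i + 0.5) / n;
                double limit = 4 * delta * n * sizeRule.applyAsDouble(q);
                if (size + 1 > Math.max(1, limit)) {
                    break; // cluster is full
                }
                size++;
                i++;
            }
            clusters++;
        }
        return clusters;
    }

    public static void main(String[] args) {
        double delta = 0.01;
        for (long n = 1_000; n <= 100_000_000L; n *= 10) {
            int standard = countClusters(n, delta, q -> q * (1 - q));            // q(1-q) rule
            int sqrtRule = countClusters(n, delta, q -> Math.sqrt(q * (1 - q))); // sqrt(q(1-q)) rule
            System.out.printf("n=%d  q(1-q): %d clusters  sqrt(q(1-q)): %d clusters%n",
                    n, standard, sqrtRule);
        }
    }
}
```

Run for growing n, the q(1-q) count should keep climbing while the sqrt(q(1-q)) count should level off, which is the qualitative difference discussed above.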
It would be interesting to investigate the cluster sizes over q also for the integral-formulated constraint (especially in the magnified figure). There should be more centroids with size 1 near q=0 and q=1, allowing for more accurate quantile estimations in that range.

Anyway, I believe the integral formulation should be the fundamental constraint to start from. The rectangle rule, which leads to the original constraint, is a very good approximation for most values of q. However, for quantiles very close to 0 and 1, where the approximation error is largest, better results could be obtained using a more accurate integration method.

The different centroid size growth over the number of inserted points can be explained as follows (using the integral formulation). For the q(1-q) rule, f(q) = 1/(q*(1-q)) has a divergent integral at the boundaries, so the number of centroids keeps growing as more points are inserted. In contrast, the sqrt(q(1-q)) rule corresponds to f(q) = 1/sqrt(q*(1-q)), whose integral over [0, 1] is finite, so the number of centroids stays bounded.

The advantage of using a centroid size rule with finite integral is that the memory consumption will converge to a constant that depends only on the compression and not on the number of points. I think for most use cases this behavior is exactly what you would expect from a quantile estimation algorithm. I do not know if there are use cases which require querying quantiles ever closer to 0 and 1 while adding points, which would justify the increasing memory needs. Usually, the quantiles you are interested in are fixed and do not change with an increasing number of inserted points.
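To make the growth argument concrete, the two integrals can be written out (a sketch; the constant factors are only indicative):

\[
\int \frac{dq}{q(1-q)} \;=\; \ln\frac{q}{1-q} + C
\qquad\Longrightarrow\qquad
\int_{1/W}^{1-1/W} \frac{dq}{q(1-q)} \;=\; 2\ln(W-1),
\]
\[
\int_0^1 \frac{dq}{\sqrt{q(1-q)}} \;=\; \Big[\,2\arcsin\sqrt{q}\,\Big]_0^1 \;=\; \pi .
\]

Since each centroid may cover at most `4*delta` of the respective integral, the q(1-q) rule needs on the order of `ln(W)/(2*delta)` centroids as W grows, whereas the sqrt(q(1-q)) rule needs on the order of `pi/(4*delta)` centroids independent of W.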
Today I had time to analyze the asymptotic dependence of X on W. By definition the (X+1)-th centroid is the first centroid that fulfills the constraint:
Excellent. I had derived exactly the same result for how many 1's there are.
I am closing this (quite late) because Otmar was completely right and this is now the basis for all supported implementations. |
I propose an improved condition for whether neighboring centroids can be merged or not. Let `{(x_1, w_1), (x_2, w_2), ..., (x_N, w_N)}` be the centroids in ascending order, where `x_i` is the mean and `w_i` is the weight of the `i`-th centroid. Let `W` be the sum of all weights, `W := w_1 + w_2 + ... + w_N`.

The original t-digest condition is equivalent to `w_i/W * f(q_i) <= 4*delta`, where `f(q) := 1/(q*(1-q))` and `delta` is the reciprocal value of the compression.

Instead, I propose to use the constraint `int_{z_i}^{z_{i+1}} f(q) dq <= 4*delta` with `z_i := (sum_{j<i} w_j)/W`. Note that this inequality is more strict than the original constraint due to Jensen's inequality and because `f(q)` is a convex function. The integral of `f(q)` from `z_i` to `z_{i+1}` can be solved analytically, and the new constraint can be expressed as `ln(z_{i+1}/(1-z_{i+1})) - ln(z_i/(1-z_i)) <= 4*delta`, or equivalently as `(z_{i+1}-z_i) <= (e^{4*delta}-1)*z_i*(1-z_{i+1})`. The last inequality can be evaluated very efficiently if the constant first factor on the right-hand side is precalculated.

Since the integral of `f(q)` diverges at the boundaries 0 and 1, the new constraint inhibits in a natural way merging with the first or the last centroid. This is not the case for the original constraint.
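A minimal sketch of how the proposed test can be evaluated (illustrative code, not the t-digest implementation): precompute the constant factor `e^{4*delta} - 1` once, and each merge check is then only a few multiplications.

```java
// Illustrative sketch of the proposed merge test, not the actual t-digest code.
// z values are cumulative weight fractions: z_i = (sum of weights of all
// centroids before the i-th) / W.
public final class IntegralMergeConstraint {
    private final double factor; // e^{4*delta} - 1, precalculated once

    public IntegralMergeConstraint(double delta) {
        // Math.expm1 computes e^x - 1 accurately for small x.
        this.factor = Math.expm1(4 * delta);
    }

    /**
     * Returns true if a centroid covering the quantile range [zLow, zHigh]
     * satisfies the proposed constraint
     *   ln(zHigh/(1-zHigh)) - ln(zLow/(1-zLow)) <= 4*delta,
     * evaluated in the equivalent multiplicative form
     *   (zHigh - zLow) <= (e^{4*delta} - 1) * zLow * (1 - zHigh).
     * For zLow == 0 or zHigh == 1 the right-hand side is zero, so merging with
     * the first or last centroid is naturally inhibited (a real implementation
     * would presumably still admit singleton centroids unconditionally).
     */
    public boolean satisfies(double zLow, double zHigh) {
        return zHigh - zLow <= factor * zLow * (1 - zHigh);
    }
}
```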