# Improved constraint on centroid sizes #40

Closed
oertl opened this Issue Jan 28, 2015 · 9 comments


### oertl commented Jan 28, 2015

I propose an improved condition for deciding whether neighboring centroids can be merged: Let `{(x_1, w_1), (x_2, w_2), ..., (x_N, w_N)}` be the centroids in ascending order, where `x_i` is the mean and `w_i` is the weight of the `i`-th centroid, and let `W := w_1 + w_2 + ... + w_N` be the sum of all weights. The original t-digest condition is equivalent to `w_i/W * f(q_i) <= 4*delta`, where `f(q) := 1/(q*(1-q))` and `delta` is the reciprocal of the compression. Instead, I propose the constraint `int_{z_i}^{z_{i+1}} f(q) dq <= 4*delta`, where `z_i := (sum_{j<i} w_j)/W` is the cumulative weight fraction to the left of the `i`-th centroid.
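As a sketch of the two conditions (in Python; the function names and the divergence handling are mine, not from any t-digest implementation), the original rectangle-rule check and the proposed integral check could look like this:

```python
import math

def rectangle_rule_ok(w_i, q_i, W, delta):
    """Original t-digest condition: w_i/W * f(q_i) <= 4*delta,
    with f(q) = 1/(q*(1-q)) evaluated at the centroid's quantile."""
    f = 1.0 / (q_i * (1.0 - q_i))
    return w_i / W * f <= 4.0 * delta

def integral_rule_ok(z_lo, z_hi, delta):
    """Proposed condition: int_{z_lo}^{z_hi} f(q) dq <= 4*delta.
    For f(q) = 1/(q*(1-q)) the antiderivative is logit(q) = ln(q/(1-q)),
    so the integral diverges whenever the bin touches 0 or 1."""
    if z_lo <= 0.0 or z_hi >= 1.0:
        return False  # integral is infinite, constraint cannot hold
    logit = lambda q: math.log(q / (1.0 - q))
    return logit(z_hi) - logit(z_lo) <= 4.0 * delta
```

Note how the integral check is automatically strict at the boundaries: a bin whose quantile range touches 0 or 1 can never grow beyond a single point.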

### tdunning commented Feb 1, 2015

 I have been traveling too much lately to think well, so that might be why I am not seeing the virtue of this alternative constraint. Can you say in a sentence why this constraint is better? Also, how would pluggable constraints interact with this suggestion?

### oertl commented Feb 1, 2015

 The proposed constraint is slightly more accurate near 0 and 1. The main difference between the two constraints is how the integral is calculated. The original constraint is obtained by approximating the integral with the rectangle rule. Because `f(q)` is convex, the rectangle rule underestimates the integral, so the resulting inequality is less strict. For example, consider 10 centroids, each with weight 1, and `delta = 0.6`. The original constraint would allow merging the first two centroids while the proposed one would not. For arbitrary scaling laws, where the integral cannot be solved analytically, an appropriate numerical quadrature rule is needed. The trapezoidal rule would be a natural replacement for the rectangle rule: it overestimates the integral of convex functions, which leads to a stricter constraint.
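The 10-centroid example can be checked numerically (a sketch; I assume, as in the original check, that the merged centroid's quantile is taken at the midpoint of its bin):

```python
import math

W, delta = 10.0, 0.6          # 10 centroids of weight 1, delta = 0.6
f = lambda q: 1.0 / (q * (1.0 - q))

# Original (rectangle-rule) check for merging the first two centroids:
# the merged centroid has weight 2 and sits at midpoint quantile q = 0.1.
w_merged, q_merged = 2.0, 0.1
allowed = w_merged / W * f(q_merged) <= 4 * delta
print(allowed)  # 2.22 <= 2.4, so the merge is allowed

# Integral check over the merged bin [0, 0.2]: since f(q) ~ 1/q near 0,
# the integral diverges, so the merge is forbidden. A tiny lower limit
# stands in for the divergent endpoint here.
logit = lambda q: math.log(q / (1.0 - q))
rejected = logit(0.2) - logit(1e-12) > 4 * delta
print(rejected)  # True: the proposed constraint rejects the merge
```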

### oertl commented Feb 1, 2015

 Using the new constraint and the `q*(1-q)` scaling law I have found an interesting relationship: assume the `i`-th centroid corresponds to a histogram bin ranging from `z_i` to `z_{i+1}`. For the `q*(1-q)` scaling law the new constraint becomes `ln(z_{i+1}/(1-z_{i+1})) - ln(z_i/(1-z_i)) <= 4*delta`. Hence, this inequality limits the bin width on the logit scale. Interestingly, the authors of the incremental quantile estimation algorithm (http://projecteuclid.org/download/pdfview_1/euclid.ss/1177334520) propose the same idea: using equidistant points on the logit scale to approximate the CDF.
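The closed form can be cross-checked against a direct numerical quadrature of `1/(q*(1-q))` (a small self-contained sketch; the midpoint integrator and the chosen bin are only for illustration):

```python
import math

logit = lambda q: math.log(q / (1.0 - q))
f = lambda q: 1.0 / (q * (1.0 - q))

def integrate(g, a, b, n=100000):
    """Simple midpoint quadrature of g over [a, b]."""
    h = (b - a) / n
    return sum(g(a + (k + 0.5) * h) for k in range(n)) * h

# The integral of f over a bin equals the bin width on the logit scale.
z_i, z_j = 0.2, 0.35
numeric = integrate(f, z_i, z_j)
closed = logit(z_j) - logit(z_i)
print(numeric, closed)  # the two values agree to high precision
```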

### oertl commented Feb 1, 2015

 For pluggable centroid scaling laws it could make sense to provide the antiderivative instead. The antiderivative describes the scale on which the histogram bin sizes are limited. For example, `ln(z/(1-z))` corresponds to the `q*(1-q)` law and `arcsin(2*z-1)` corresponds to the `sqrt(q*(1-q))` law. If the antiderivative is given as a parameter, the t-digest algorithm does not need to be concerned with numerical integration at all.
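A pluggable version of the merge test is then just a difference of antiderivative values (a sketch; `can_merge` and the passed-in functions are illustrative names, not an existing API):

```python
import math

def can_merge(z_lo, z_hi, F, delta):
    """Merge test parameterized by the antiderivative F of the scale
    function: merging is allowed iff F(z_hi) - F(z_lo) <= 4*delta.
    No numerical integration is needed."""
    return F(z_hi) - F(z_lo) <= 4.0 * delta

# F(z) = ln(z/(1-z)) corresponds to the q*(1-q) law,
# F(z) = arcsin(2*z-1) to the sqrt(q*(1-q)) law.
logit = lambda z: math.log(z / (1.0 - z))
arcsin_law = lambda z: math.asin(2.0 * z - 1.0)

print(can_merge(0.4, 0.6, logit, 0.6))        # small logit step: allowed
print(can_merge(0.01, 0.5, logit, 0.6))       # wide logit step: rejected
print(can_merge(0.0, 0.5, arcsin_law, 0.6))   # pi/2 <= 2.4: allowed
```

Note that `arcsin(2*z-1)` is finite at `z = 0` and `z = 1`, whereas the logit diverges there, which is exactly the qualitative difference between the two laws discussed below.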

### tdunning commented Feb 1, 2015

 OK, I understand your point. But does it have practical impact? Here, for instance, is the result of applying the standard approach: as you can see, adherence is close. Magnifying: likewise, the cluster sizes don't exceed the limit. So if the limit is not exceeded, would this alternative make a big difference?

On the other hand, your integral formulation of the size limit and the resulting constant steps in logit space are intriguing with respect to size growth. Here are results of size growth for various size limits: as you can see, there is a qualitative difference between the standard `q*(1-q)` rule and the `sqrt(q*(1-q))` rule. I don't fully understand this yet, but the approach I was trying to take was somewhat similar to your integral approach.

Here, by the way, is the code that produced this graph. It uses a simplification of the t-digest that focuses on the special case of ordered points where the number of points is known in advance: https://gist.github.com/tdunning/7a1d24ac3efb18138536

### oertl commented Feb 2, 2015

 It would be interesting to investigate the cluster sizes over `q` also for the integral-formulated constraint (especially in the magnified figure). There should be more centroids of size 1 near `q = 0` and `q = 1`, allowing for more accurate quantile estimates in that range. In any case, I believe the integral formulation should be the fundamental constraint to start from. The rectangle rule, which leads to the original constraint, is a very good approximation for most values of `q`. However, for quantiles very close to 0 and 1, where the approximation error is largest, better results could be obtained with a more accurate integration method.

The different centroid size growth over the number of inserted points can be explained as follows (using the integral formulation). For the `q*(1-q)` rule there are a couple of centroids of size 1 that violate the constraint, namely the very first and the very last ones; all centroids in between fulfill it. Say there are `X` such violating centroids near 0 and, by symmetry, also `X` near 1. If `N` is the total number of centroids, then `N - 2*X` centroids fulfill the constraint. Summing the constraints `int_{z_i}^{z_{i+1}} 1/(q*(1-q)) dq <= 4*delta` over all of them (`X+1 <= i <= N-X`) gives `int_{X/W}^{1-X/W} 1/(q*(1-q)) dq <= 4*delta*(N-2*X)`, that is, `2*ln(W/X-1) <= 4*delta*(N-2*X)`. I am still unsure about the dependency of `X` on `W`. However, if it can be neglected, the number of centroids will increase at least logarithmically. This is also what the red graph suggests.

In contrast, the `sqrt(q*(1-q))` rule allows the size of all centroids, including the first and the last one, to scale linearly with the number of points. Consider for example the first centroid: the constraint for its size `w_1` reads `int_{0}^{w_1/W} 1/sqrt(q*(1-q)) dq <= 4*delta`, or `arcsin(2*w_1/W - 1) + pi/2 <= 4*delta`, or `2*w_1/W <= 1 + sin(4*delta - pi/2)`. Hence, `w_1` can get arbitrarily large as `W` increases. Consequently, for large `W` we can assume that the LHS of the constraint is almost equal to the RHS: `int_{z_i}^{z_{i+1}} 1/sqrt(q*(1-q)) dq ~ 4*delta`. Summing the constraints over all centroids gives `int_0^1 1/sqrt(q*(1-q)) dq ~ N*4*delta`, that is, `pi ~ N*4*delta`, which explains why the number of centroids `N` approaches a constant.

The key difference is that the integral of `1/(q*(1-q))` over `[0, 1]` diverges while that of `1/sqrt(q*(1-q))` is finite. The advantage of a centroid size rule with a finite integral is that the memory consumption converges to a constant that depends only on the compression, not on the number of points. I think for most use cases this is exactly the behavior you expect from a quantile estimation algorithm. I do not know of use cases that require querying quantiles ever closer to 0 and 1 while adding points, which would justify the growing memory needs. Usually, the quantiles you are interested in are fixed and do not change with the number of inserted points.
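The two limiting behaviors can be seen in a toy simulation (a simplified one-pass greedy merge over sorted unit weights, written by me to illustrate the argument; it is not the full t-digest algorithm):

```python
import math

def centroid_count(W, delta, F, eps=1e-15):
    """Greedy left-to-right merging of W unit weights under the integral
    constraint F(z_hi) - F(z_lo) <= 4*delta, where F is the antiderivative
    of the scale function. Clamping by eps guards the points where the
    logit law diverges."""
    def Fc(z):
        return F(min(max(z, eps), 1.0 - eps))
    n, i = 0, 0
    while i < W:
        j = i + 1  # every bin holds at least one point
        while j < W and Fc((j + 1) / W) - Fc(i / W) <= 4 * delta:
            j += 1
        n, i = n + 1, j
    return n

logit = lambda z: math.log(z / (1.0 - z))          # q*(1-q) law
arcsin_law = lambda z: math.asin(2.0 * z - 1.0)    # sqrt(q*(1-q)) law

for W in (10**4, 10**5):
    print(W, centroid_count(W, 0.01, logit), centroid_count(W, 0.01, arcsin_law))
```

The centroid count under the logit law keeps growing with `W`, while under the arcsin law it stays near `pi/(4*delta)`, matching the analysis above.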

### oertl commented Feb 4, 2015

 Today I had time to analyze the asymptotic dependence of `X` on `W`. By definition, the `(X+1)`-th centroid is the first centroid that fulfills the constraint: `ln((X+1)/(W-X-1)) - ln(X/(W-X)) < 4*delta`, or equivalently `(X+1)*(W-X)/(X*(W-X-1)) < exp(4*delta)`, or `W/(W*X-X*(X+1)) < exp(4*delta)-1`. For `W >> X` the LHS converges to `1/X`, which gives `1/X < exp(4*delta)-1` and finally `X > 1/(exp(4*delta)-1)`. Therefore, for `W >> X`, the smallest `X` fulfilling the inequality is approximately `1/(exp(4*delta)-1)`, which is approximately `1/(4*delta)` for `delta << 1` and which is independent of `W`. As a consequence, the number of centroids does indeed increase logarithmically with the number of points `W` for the `q*(1-q)` rule.
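The claim that `X` becomes independent of `W` is easy to verify by brute force (a sketch; `smallest_valid_X` is just the definition above, scanned upward from `X = 1`):

```python
import math

def smallest_valid_X(W, delta):
    """Smallest X such that the (X+1)-th unit-weight centroid fulfills
    ln((X+1)/(W-X-1)) - ln(X/(W-X)) < 4*delta."""
    X = 1
    while math.log((X + 1) / (W - X - 1)) - math.log(X / (W - X)) >= 4 * delta:
        X += 1
    return X

delta = 0.01
predicted = 1.0 / (math.exp(4 * delta) - 1)   # ~ 1/(4*delta) for small delta
for W in (10**4, 10**6, 10**8):
    # X stays the same across four orders of magnitude of W,
    # and matches the asymptotic prediction to within one unit
    print(W, smallest_valid_X(W, delta), round(predicted, 1))
```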

### tdunning commented Feb 5, 2015

 Excellent. I had derived exactly the same result for how many size-1 centroids there were, but I like the extension showing that there is no dependence on `W`. The other way to attack this is to assume linearity in `q` at the periphery.

### tdunning commented Apr 19, 2017

 I am closing this (quite late) because Otmar was completely right and this is now the basis for all supported implementations.