
Scale centroid sizes according to sqrt(q*(1-q)) instead of q*(1-q) #30

Closed

oertl opened this issue Nov 2, 2014 · 9 comments

@oertl

oertl commented Nov 2, 2014

Since the statistical error when estimating quantiles is proportional to sqrt(q*(1-q)) (compare Numerical Recipes, 3rd edition, p. 438), I think it could be better to choose the centroid sizes accordingly. In the add() method I replaced the line
double k = 4 * count * q * (1 - q) / compression;
by
double k = 2 * count * Math.sqrt(q * (1 - q)) / compression;
and got very interesting results, shown below. For comparison, the figures are also shown for the original approach. The error-scaling and scaling figures show a reduction of digest size by a factor of more than 2. The error figures suggest that the new approach is somewhat less accurate for the Gamma distribution. However, I believe the error is still small compared to the statistical error introduced by data sampling. It would be interesting to measure the quality of the t-digest algorithm in terms of the statistical error as proposed in Numerical Recipes (3rd edition, p. 438):

..., there are statistical errors. One way to characterize these is to ask what value j has A_j closest to the quantile reported by IQ agent, and then how small is |j - pN| as a fraction of sqrt(N p (1-p)). If this fraction is less than 1, then the estimate is “good enough,” meaning that no method can do substantially better at estimating the population quantiles given your sample.

(A_1, A_2, ..., A_N are the sampled values and IQ agent is the quantile estimator, t-digest in our case)
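A minimal sketch of that measure, assuming the raw samples are kept and sorted; the class and method names are illustrative, not part of t-digest:

```java
import java.util.Arrays;

// Sketch of the "good enough" statistic from Numerical Recipes (3rd ed., p. 438);
// class and method names are illustrative, not part of t-digest.
final class QuantileQuality {
    // Given the sorted samples A_1..A_N and an estimate of the p-quantile,
    // find the 1-based index j whose A_j is closest to the estimate and
    // return |j - pN| / sqrt(N p (1-p)); values below 1 mean no method
    // could do substantially better on this sample.
    static double errorFraction(double[] sortedSamples, double p, double estimate) {
        int n = sortedSamples.length;
        int pos = Arrays.binarySearch(sortedSamples, estimate);
        if (pos < 0) pos = -pos - 1; // insertion point if not found exactly
        int j = Math.min(pos, n - 1);
        // the nearest sample is either at the insertion point or just before it
        if (pos > 0 && Math.abs(sortedSamples[pos - 1] - estimate)
                < Math.abs(sortedSamples[j] - estimate)) {
            j = pos - 1;
        }
        return Math.abs((j + 1) - p * n) / Math.sqrt(n * p * (1 - p));
    }
}
```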

  • [Figure: error-scaling, 2 sqrt(q (1-q))]
  • [Figure: error-scaling, 4 q (1-q)]
  • [Figure: scaling, 2 sqrt(q (1-q))]
  • [Figure: scaling, 4 q (1-q)]
  • [Figure: error, 2 sqrt(q (1-q))]
  • [Figure: error, 4 q (1-q)]
  • [Figure: sizes, 2 sqrt(q (1-q))]
  • [Figure: sizes, 4 q (1-q)]
oertl added a commit to oertl/t-digest that referenced this issue Nov 2, 2014
@tdunning
Owner

tdunning commented Nov 2, 2014

Interesting you should mention this.

I have been wondering about this same issue. And the size scaling with your alternative is interesting.

I worry that you have taken the square root of the q(1-q) factor, however.
This plot shows what I mean:

[Figure: comparison of the size limits]

The red line is your new limit. The black line is the current limit. The green line is the square of the current limit.

The problem that I see is that if the error is proportional to the square root of the limit, then using sqrt(q(1-q)) as the limit makes the error proportional to the fourth root of q(1-q). That makes the errors near the boundaries much larger rather than smaller.

It seems to me that the error that you have noted actually means that we should use [4 q (1-q)]^2 as the limit rather than the current limit.

I think that this graph shows the relative error for my original suggested size limits (black) and for your suggested limits (red).

[Figure: error scaling]

On the other hand, if we actually use (4 q (1-q))^2 as the size limits, the relative error should look like the green line.

Your data suggest, however, that there is a big difference in the size scaling. That is seriously surprising to me.
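
To make the boundary behavior concrete, here is a small standalone sketch (not library code) that prints the implied error scale sqrt(k(q)) for both limits near q = 0, under the assumption that the error grows like the square root of the limit:

```java
// Standalone sketch (not part of t-digest) of the scaling argument above:
// if the estimation error grows like the square root of the size limit,
// then sqrt(4 q (1-q)) ~ sqrt(q) near the tails, while
// sqrt(2 sqrt(q (1-q))) ~ q^(1/4), which is much larger for small q.
public class TailErrorScaling {
    public static void main(String[] args) {
        for (double q : new double[] {1e-8, 1e-6, 1e-4, 1e-2}) {
            double kCurrent = 4 * q * (1 - q);             // current limit
            double kProposed = 2 * Math.sqrt(q * (1 - q)); // proposed limit
            System.out.printf("q=%.0e  sqrt(current)=%.3e  sqrt(proposed)=%.3e%n",
                    q, Math.sqrt(kCurrent), Math.sqrt(kProposed));
        }
    }
}
```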


@oertl
Author

oertl commented Nov 3, 2014

Could you please add the mentioned figures to your comment?

@oertl
Author

oertl commented Nov 3, 2014

The problem that I see is that if the error is proportional to the square root of the limit, then using sqrt(q(1-q)) as the limit makes the error proportional to the fourth root of q(1-q). That makes the errors near the boundaries much larger rather than smaller.

I have the feeling that the error is proportional to the centroid size rather than to the square root of the centroid size. Say we have m centroids with sizes c_i, where c_1 + c_2 + ... + c_m = N.
To estimate the q-quantile we determine k so that (c_1 + ... + c_(k-1))/N <= q <= (c_1 + ... + c_k)/N. Hence, the error we make by choosing the neighboring centroid is proportional to c_k/N <= (4 N delta q (1-q))/N = 4 delta q (1-q), writing delta for 1/compression. I know that linear interpolation between the two neighboring centroids is used to improve the quantile estimate, but I do not know whether this substantially changes the error scaling law in terms of q. However, using 2 sqrt(q (1-q)) in place of 4 q (1-q) would give an error of 2 delta sqrt(q (1-q)), which matches the scaling of the statistical error. A sketch of the bracketing argument is given below.
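
A minimal sketch of that bracketing step, assuming only the centroid sizes are available; names are illustrative, not the library's internals:

```java
// Sketch of the bracketing argument above (illustrative, not t-digest's
// interpolation code). Given centroid sizes c_1..c_m with c_1+...+c_m = N,
// find the centroid whose cumulative-weight interval contains qN; choosing
// a neighboring centroid instead shifts the rank by at most c_k, i.e. the
// quantile by at most c_k / N.
final class CentroidErrorSketch {
    static double coarseQuantileErrorBound(long[] centroidSizes, double q) {
        long total = 0;
        for (long c : centroidSizes) total += c; // N
        double targetRank = q * total;           // qN
        long cumulative = 0;
        for (long c : centroidSizes) {
            cumulative += c;
            if (cumulative >= targetRank) {
                return (double) c / total;       // c_k / N
            }
        }
        return (double) centroidSizes[centroidSizes.length - 1] / total;
    }
}
```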

@tdunning
Owner

tdunning commented Nov 3, 2014

Harumph... I was hoping that the figures would flow from email.


@tdunning
Owner

tdunning commented Nov 3, 2014

Hmm.... I think I see where you are going.

There are two kinds of estimation possible. One is to estimate the empirical quantiles; the other is to estimate the quantiles of the underlying distribution. Estimating the underlying distribution is subject first to the error of estimating the empirical distribution, and then to the inescapable variation of the empirical distribution relative to the underlying one. It sounds like you are talking about matching the error in estimating the empirical quantile to the discrepancy between the empirical and underlying distributions.

So far, I have only been pushing for improving the error in estimating the empirical quantiles.

@oertl
Author

oertl commented Nov 4, 2014

You are right, I am talking about estimating the underlying distribution. I think the t-digest algorithm would also be convenient for that purpose.

Concerning the error scaling figure you posted:

Looking at the curves, I guess that the "relative error" is calculated as max(sqrt(e)/q, sqrt(e)/(1-q)), where e denotes the absolute error of q? Or should it be max(e/q, e/(1-q))? In the latter case, which makes more sense to me, the green line would correspond to the original approach and the relative error would be finite for any q. However, if the goal is to limit the relative error for all q, 2 min(q, 1-q) would be the better centroid size limit, yielding a constant relative error for all q. I am still confused about the 4 q (1-q) centroid size limit. Is there any mathematical foundation for it?
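
To illustrate the difference, a small standalone sketch (not library code), assuming the absolute error in q scales like the size limit k(q); it prints the implied relative error max(k/q, k/(1-q)) for the three candidate limits:

```java
// Illustrative sketch (not t-digest code): assuming the absolute error in q
// scales with the size limit k(q), compare the implied relative error
// max(k/q, k/(1-q)) for the three limit functions discussed above.
// 4q(1-q) stays bounded by 4, 2min(q,1-q) is a constant 2, while
// 2sqrt(q(1-q)) blows up like 2/sqrt(q) near the tails.
public class RelativeErrorComparison {
    public static void main(String[] args) {
        for (double q : new double[] {1e-4, 1e-2, 0.1, 0.5, 0.9, 0.99, 0.9999}) {
            double kCurrent = 4 * q * (1 - q);
            double kSqrt = 2 * Math.sqrt(q * (1 - q));
            double kMin = 2 * Math.min(q, 1 - q);
            System.out.printf("q=%.4f  rel(4q(1-q))=%.3f  rel(2sqrt)=%.3f  rel(2min)=%.3f%n",
                    q, rel(kCurrent, q), rel(kSqrt, q), rel(kMin, q));
        }
    }

    static double rel(double k, double q) {
        return Math.max(k / q, k / (1 - q));
    }
}
```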

@tdunning
Owner

tdunning commented Nov 4, 2014

Good point about relative error. My choice was based on two factors, both relatively weak:

  1. The value near the tails is what matters most. In this region, min(q, 1-q) \approx q (1-q).

  2. I always prefer smooth and continuous values. Your suggestion is continuous, and I can see no reason why a continuous second derivative should matter here.

At the very least, I think that you have made a very strong case that the limit should be pluggable across all implementations.
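
For illustration, one shape such pluggability could take; purely a sketch, and the SizeLimit and Limits names are hypothetical, not t-digest's API:

```java
// Hypothetical sketch of a pluggable size-limit function; not t-digest's API.
// k(q) gives the maximum centroid weight at quantile q, as a multiple of
// count / compression.
interface SizeLimit {
    double k(double q);
}

final class Limits {
    static final SizeLimit CURRENT   = q -> 4 * q * (1 - q);
    static final SizeLimit SQRT      = q -> 2 * Math.sqrt(q * (1 - q));
    static final SizeLimit PIECEWISE = q -> 2 * Math.min(q, 1 - q);

    // Inside add(), the hard-coded expression would then become something like:
    //   double k = limit.k(q) * count / compression;
}
```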


@oertl
Author

oertl commented Nov 5, 2014

I agree, the t-digest algorithm should be parameterized somehow to allow arbitrary limits. If one is only interested in a single quantile or a predefined set of quantiles, completely different limit functions could be more appropriate; a hypothetical example is sketched below.
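
For example, a limit focused on a single target quantile q0 might look like the following; purely illustrative, with an arbitrary floor eps:

```java
// Hypothetical limit function concentrating resolution near one target
// quantile q0: centroids near q0 stay small (accurate), centroids far away
// may grow. The floor eps is an arbitrary choice that keeps every centroid
// weight bounded away from zero so the digest size stays bounded.
final class TargetedLimit {
    static double k(double q, double q0) {
        double eps = 0.01;
        return Math.max(eps, Math.abs(q - q0));
    }
}
```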

@oertl
Author

oertl commented Jan 2, 2015

I closed this issue in favor of #37.

@oertl oertl closed this as completed Jan 2, 2015