
Scale centroid sizes according to sqrt(q*(1-q)) instead of q*(1-q) #30

Closed

oertl opened this issue Nov 2, 2014 · 9 comments

@oertl

oertl commented Nov 2, 2014

Since the statistical error when estimating quantiles is proportional to sqrt(q*(1-q)) (compare Numerical Recipes, 3rd edition, p. 438), I think it could be better to choose the centroid sizes accordingly. In the add() method I replaced the line
double k = 4 * count * q * (1 - q) / compression;
by
double k = 2 * count * Math.sqrt(q * (1 - q)) / compression;
and got very interesting results, shown below. For comparison, the figures are also shown for the original approach. The error-scaling and scaling figures show a reduction of digest size by a factor of more than 2. The error figures suggest that the new approach is somewhat less accurate for the Gamma distribution. However, I believe the error is still small compared to the statistical error introduced by data sampling. It would be interesting to measure the quality of the t-digest algorithm in terms of the statistical error as proposed in Numerical Recipes (3rd edition, p. 438):

..., there are statistical errors. One way to characterize these is to ask what value j has A_j closest to the quantile reported by IQ agent, and then how small is |j - pN| as a fraction of sqrt(N p (1-p)). If this fraction is less than 1, then the estimate is “good enough,” meaning that no method can do substantially better at estimating the population quantiles given your sample.

(A_1, A_2, ..., A_N are the sampled values and IQ agent is the quantile estimator, t-digest in our case)
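A minimal sketch of that measure, assuming the raw samples are kept and sorted; the class and method names are illustrative, not part of t-digest:

```java
import java.util.Arrays;

// Sketch of the "good enough" statistic from Numerical Recipes (3rd ed., p. 438);
// class and method names are illustrative, not part of t-digest.
final class QuantileQuality {
    // Given the sorted samples A_1..A_N and an estimate of the p-quantile,
    // find the 1-based index j whose A_j is closest to the estimate and
    // return |j - pN| / sqrt(N p (1-p)); values below 1 mean no method
    // could do substantially better on this sample.
    static double errorFraction(double[] sortedSamples, double p, double estimate) {
        int n = sortedSamples.length;
        int pos = Arrays.binarySearch(sortedSamples, estimate);
        if (pos < 0) pos = -pos - 1; // insertion point if not found exactly
        int j = Math.min(pos, n - 1);
        // the nearest sample is either at the insertion point or just before it
        if (pos > 0 && Math.abs(sortedSamples[pos - 1] - estimate)
                < Math.abs(sortedSamples[j] - estimate)) {
            j = pos - 1;
        }
        return Math.abs((j + 1) - p * n) / Math.sqrt(n * p * (1 - p));
    }
}
```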

  • [Figure: error-scaling, 2 sqrt(q (1-q))]
  • [Figure: error-scaling, 4 q (1-q)]
  • [Figure: scaling, 2 sqrt(q (1-q))]
  • [Figure: scaling, 4 q (1-q)]
  • [Figure: error, 2 sqrt(q (1-q))]
  • [Figure: error, 4 q (1-q)]
  • [Figure: sizes, 2 sqrt(q (1-q))]
  • [Figure: sizes, 4 q (1-q)]
oertl added a commit to oertl/t-digest that referenced this issue Nov 2, 2014
@tdunning
Owner

tdunning commented Nov 2, 2014

Interesting you should mention this.

I have been wondering about this same issue. And the size scaling with your alternative is interesting.

I worry that you have taken the square root of the q(1-q) factor, however.
This plot shows what I mean:

[Figure: comparison of the size limits]

The red line is your new limit. The black line is the current limit. The green line is the square of the current limit.

The problem that I see is that if the error is proportional to the square root of the limit, then using sqrt(q(1-q)) as the limit makes the error proportional to the fourth root of q(1-q). That makes the errors near the boundaries much larger rather than smaller.

It seems to me that the error that you have noted actually means that we should use [4 q (1-q)]^2 as the limit rather than the current limit.

I think that this graph shows the relative error for my original suggested size limits (black) and for your suggested limits (red).

[Figure: error scaling]

On the other hand, if we actually use (4 q (1-q))^2 as the size limits, the relative error should look like the green line.

Your data suggest, however, that there is a big difference in the size scaling. That is seriously surprising to me.
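
To make the boundary behavior concrete, here is a small standalone sketch (not library code) that prints the implied error scale sqrt(k(q)) for both limits near q = 0, under the assumption that the error grows like the square root of the limit:

```java
// Standalone sketch (not part of t-digest) of the scaling argument above:
// if the estimation error grows like the square root of the size limit,
// then sqrt(4 q (1-q)) ~ sqrt(q) near the tails, while
// sqrt(2 sqrt(q (1-q))) ~ q^(1/4), which is much larger for small q.
public class TailErrorScaling {
    public static void main(String[] args) {
        for (double q : new double[] {1e-8, 1e-6, 1e-4, 1e-2}) {
            double kCurrent = 4 * q * (1 - q);             // current limit
            double kProposed = 2 * Math.sqrt(q * (1 - q)); // proposed limit
            System.out.printf("q=%.0e  sqrt(current)=%.3e  sqrt(proposed)=%.3e%n",
                    q, Math.sqrt(kCurrent), Math.sqrt(kProposed));
        }
    }
}
```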


@oertl
Author

oertl commented Nov 3, 2014

Could you please add the mentioned figures to your comment?

@oertl
Author

oertl commented Nov 3, 2014

The problem that I see is that if the error is proportional to the square root of the limit, then using sqrt(q(1-q)) as the limit makes the error proportional to the fourth root of q(1-q). That makes the errors near the boundaries much larger rather than smaller.

I have the feeling that the error is proportional to the centroid size rather than to the square root of the centroid size. Say we have m centroids with sizes c_i, where c_1 + c_2 + ... + c_m = N.
To estimate the q-quantile we determine k so that (c_1 + ... + c_(k-1))/N <= q <= (c_1 + ... + c_k)/N. Hence, the error we make by choosing the neighboring centroid is proportional to c_k/N <= (4 N delta q (1-q))/N = 4 delta q (1-q), writing delta for 1/compression. I know that linear interpolation between the two neighboring centroids is used to improve the quantile estimate, but I do not know whether this substantially changes the error scaling law in terms of q. However, using 2 sqrt(q (1-q)) in place of 4 q (1-q) would give an error of 2 delta sqrt(q (1-q)), which matches the scaling of the statistical error. A sketch of the bracketing argument is given below.
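
A minimal sketch of that bracketing step, assuming only the centroid sizes are available; names are illustrative, not the library's internals:

```java
// Sketch of the bracketing argument above (illustrative, not t-digest's
// interpolation code). Given centroid sizes c_1..c_m with c_1+...+c_m = N,
// find the centroid whose cumulative-weight interval contains qN; choosing
// a neighboring centroid instead shifts the rank by at most c_k, i.e. the
// quantile by at most c_k / N.
final class CentroidErrorSketch {
    static double coarseQuantileErrorBound(long[] centroidSizes, double q) {
        long total = 0;
        for (long c : centroidSizes) total += c; // N
        double targetRank = q * total;           // qN
        long cumulative = 0;
        for (long c : centroidSizes) {
            cumulative += c;
            if (cumulative >= targetRank) {
                return (double) c / total;       // c_k / N
            }
        }
        return (double) centroidSizes[centroidSizes.length - 1] / total;
    }
}
```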

@tdunning
Owner

tdunning commented Nov 3, 2014

Harumph... I was hoping that the figures would flow from email.


@tdunning
Owner

tdunning commented Nov 3, 2014

Hmm.... I think I see where you are going.

There are two kinds of estimation possible. One is to estimate the empirical quantiles; the other is to estimate the quantiles of the underlying distribution. Estimating the underlying distribution is subject first to the error of estimating the empirical distribution, and then to the inescapable variation of the empirical distribution relative to the underlying one. It sounds like you are talking about matching the error in estimating the empirical quantile to the discrepancy between the empirical and underlying distributions.

So far, I have only been pushing for improving the error in estimating the empirical quantiles.

@oertl
Author

oertl commented Nov 4, 2014

You are right, I am talking about estimating the underlying distribution. I think the t-digest algorithm would also be convenient for that purpose.

Concerning the error scaling figure you posted:

Looking at the curves, I guess that the "relative error" is calculated as max(sqrt(e)/q, sqrt(e)/(1-q)), where e denotes the absolute error of q? Or should it be max(e/q, e/(1-q))? In the latter case, which makes more sense to me, the green line would correspond to the original approach and the relative error would be finite for any q. However, if the goal is to limit the relative error for all q, 2 min(q, 1-q) would be the better centroid size limit, yielding a constant relative error for all q. I am still confused about the 4 q (1-q) centroid size limit. Is there any mathematical foundation for it?
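
To illustrate the difference, a small standalone sketch (not library code), assuming the absolute error in q scales like the size limit k(q); it prints the implied relative error max(k/q, k/(1-q)) for the three candidate limits:

```java
// Illustrative sketch (not t-digest code): assuming the absolute error in q
// scales with the size limit k(q), compare the implied relative error
// max(k/q, k/(1-q)) for the three limit functions discussed above.
// 4q(1-q) stays bounded by 4, 2min(q,1-q) is a constant 2, while
// 2sqrt(q(1-q)) blows up like 2/sqrt(q) near the tails.
public class RelativeErrorComparison {
    public static void main(String[] args) {
        for (double q : new double[] {1e-4, 1e-2, 0.1, 0.5, 0.9, 0.99, 0.9999}) {
            double kCurrent = 4 * q * (1 - q);
            double kSqrt = 2 * Math.sqrt(q * (1 - q));
            double kMin = 2 * Math.min(q, 1 - q);
            System.out.printf("q=%.4f  rel(4q(1-q))=%.3f  rel(2sqrt)=%.3f  rel(2min)=%.3f%n",
                    q, rel(kCurrent, q), rel(kSqrt, q), rel(kMin, q));
        }
    }

    static double rel(double k, double q) {
        return Math.max(k / q, k / (1 - q));
    }
}
```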

@tdunning
Owner

tdunning commented Nov 4, 2014

Good point about relative error. My choice was based on two factors, both relatively weak:

  1. The value near the tails is what matters most. In this region, min(q, 1-q) \approx q (1-q).

  2. I always prefer smooth and continuous values. Your suggestion is continuous, and I can see no reason why a continuous second derivative should matter here.

At the very least, I think that you have made a very strong case that the limit should be pluggable across all implementations.
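
For illustration, one shape such pluggability could take; purely a sketch, and the SizeLimit and Limits names are hypothetical, not t-digest's API:

```java
// Hypothetical sketch of a pluggable size-limit function; not t-digest's API.
// k(q) gives the maximum centroid weight at quantile q, as a multiple of
// count / compression.
interface SizeLimit {
    double k(double q);
}

final class Limits {
    static final SizeLimit CURRENT   = q -> 4 * q * (1 - q);
    static final SizeLimit SQRT      = q -> 2 * Math.sqrt(q * (1 - q));
    static final SizeLimit PIECEWISE = q -> 2 * Math.min(q, 1 - q);

    // Inside add(), the hard-coded expression would then become something like:
    //   double k = limit.k(q) * count / compression;
}
```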


@oertl
Author

oertl commented Nov 5, 2014

I agree, the t-digest algorithm should be parameterized somehow to allow arbitrary limits. If one is only interested in a single quantile or a predefined set of quantiles, completely different limit functions could be more appropriate; a hypothetical example is sketched below.
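
For example, a limit focused on a single target quantile q0 might look like the following; purely illustrative, with an arbitrary floor eps:

```java
// Hypothetical limit function concentrating resolution near one target
// quantile q0: centroids near q0 stay small (accurate), centroids far away
// may grow. The floor eps is an arbitrary choice that keeps every centroid
// weight bounded away from zero so the digest size stays bounded.
final class TargetedLimit {
    static double k(double q, double q0) {
        double eps = 0.01;
        return Math.max(eps, Math.abs(q - q0));
    }
}
```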

@oertl
Author

oertl commented Jan 2, 2015

I closed this issue in favor of #37.

@oertl oertl closed this as completed Jan 2, 2015