-
Notifications
You must be signed in to change notification settings - Fork 228
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Question: Proof of bounds on merging digest size #77
Comments
On Tue, Dec 13, 2016 at 7:06 AM, Jim Crist ***@***.***> wrote:
- What is the actual theoretical bound?
- Can you sketch out a proof of how you came by this number?
The actual theoretic bound is 2x what is in the code comments.
Yes, I can sketch the proof. The proof is by contradiction.
In the algorithm, we convert q values to k values using
k = compression * (asin(2 * q - 1) + \pi / 2) / \pi
For q = 1, k_max = compression. This is the largest possible value k can
have and the number of centroids is less than or equal to ceiling(2 *
compression). We only consider cases where compression >= 1 which implies
that we have 2 or more centroids.
Take n as the number of centroids. The sum of the k-sizes for these
centroids must be k_max.
If n is even, then suppose that n > 2 ceiling(k_max). Further, to avoid
collapse, we know that all pairs of centroid weights must be greater or
equal to 1. But we can form k_max consecutive pairs which must add up to
more than k_max. This implies that the weights on the items after the k_max
pairs must be negative which is impossible. Thus, n > 2 ceiling(k_max) is
impossible for even n.
If n is odd, then we will have n > 2 ceiling(k_max) + 1. Again, we can form
at least k_max pairs, each with k_size greater or equal to 1. There will be
at least 3 centroids after these pairs. But this implies that the k-size of
the remaining centroids will be less than or equal to one. But the first
two of these must have k-size together of at least one so the remaining
centroids must have negative k-size. Thus, n > 2 ceiling(k_max) + 1 > 2
ceiling(k_max) is impossible for n odd.
|
Thanks, that was a very clear proof. I really appreciate you taking the time to sketch that out. This comment should probably be updated to reflect this - there are a few implementations that are using the incorrect |
Thanks for the references. I have posted issues with the proof.
…On Mon, Dec 12, 2016 at 10:47 PM, Jim Crist ***@***.***> wrote:
Thanks, that was a very clear proof. I really appreciate you taking the
time to sketch that out.
This comment
<https://github.com/tdunning/t-digest/blob/master/src/main/java/com/tdunning/math/stats/MergingDigest.java#L135>
should probably be updated to reflect this - there are a few
implementations that are using the incorrect ceil(compression * pi / 2)
that it states (see here
<https://github.com/stripe/veneur/blob/master/tdigest/merging_digest.go#L75>
and here
<https://github.com/vega/datalib-sketch/blob/master/src/t-digest.js#L20>).
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#77 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAPSev4C9AamIkxEvZ-DbrZSNXvQqXI0ks5rHj-KgaJpZM4LLFWX>
.
|
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
I've been working on a python implementation of the merging digest (written mostly in c). Code is here. I'm running into bounds issues on the centroid array, following the theoretical size of
ceil(compression * pi/2)
(as seen here).I modified the java implementation to test this bound, and get out-of-bounds errors when running the existing test suite. I also modified the java tests to output the input data, and ran my implementation on those inputs - finding the same results. As such, I trust my implementation matches yours - and since both have out-of-bounds errors on buffers of that size, I'm skeptical of that theoretical size.
From reading the paper, it appears you concluded that the bounds on the number of maximally merged centroids is
2 * compression
(see here), which differs from what's stated in the code. I can kind of convince myself of this proof by iterating through a worst case scenario on paper, but am not math-y enough to write out a formal proof. Using a bound of2 * compression
seems to squash all the out-of-bounds errors though.Question:
The text was updated successfully, but these errors were encountered: