Question: Proof of bounds on merging digest size #77

jcrist · 2016-12-12T22:06:53Z

I've been working on a python implementation of the merging digest (written mostly in c). Code is here. I'm running into bounds issues on the centroid array, following the theoretical size of ceil(compression * pi/2) (as seen here).

I modified the java implementation to test this bound, and get out-of-bounds errors when running the existing test suite. I also modified the java tests to output the input data, and ran my implementation on those inputs - finding the same results. As such, I trust my implementation matches yours - and since both have out-of-bounds errors on buffers of that size, I'm skeptical of that theoretical size.

From reading the paper, it appears you concluded that the bounds on the number of maximally merged centroids is 2 * compression (see here), which differs from what's stated in the code. I can kind of convince myself of this proof by iterating through a worst case scenario on paper, but am not math-y enough to write out a formal proof. Using a bound of 2 * compression seems to squash all the out-of-bounds errors though.

Question:

What is the actual theoretical bound?
Can you sketch out a proof of how you came by this number?

The text was updated successfully, but these errors were encountered:

tdunning · 2016-12-13T01:03:21Z

On Tue, Dec 13, 2016 at 7:06 AM, Jim Crist ***@***.***> wrote: - What is the actual theoretical bound? - Can you sketch out a proof of how you came by this number?

The actual theoretic bound is 2x what is in the code comments. Yes, I can sketch the proof. The proof is by contradiction. In the algorithm, we convert q values to k values using k = compression * (asin(2 * q - 1) + \pi / 2) / \pi For q = 1, k_max = compression. This is the largest possible value k can have and the number of centroids is less than or equal to ceiling(2 * compression). We only consider cases where compression >= 1 which implies that we have 2 or more centroids. Take n as the number of centroids. The sum of the k-sizes for these centroids must be k_max. If n is even, then suppose that n > 2 ceiling(k_max). Further, to avoid collapse, we know that all pairs of centroid weights must be greater or equal to 1. But we can form k_max consecutive pairs which must add up to more than k_max. This implies that the weights on the items after the k_max pairs must be negative which is impossible. Thus, n > 2 ceiling(k_max) is impossible for even n. If n is odd, then we will have n > 2 ceiling(k_max) + 1. Again, we can form at least k_max pairs, each with k_size greater or equal to 1. There will be at least 3 centroids after these pairs. But this implies that the k-size of the remaining centroids will be less than or equal to one. But the first two of these must have k-size together of at least one so the remaining centroids must have negative k-size. Thus, n > 2 ceiling(k_max) + 1 > 2 ceiling(k_max) is impossible for n odd.

jcrist · 2016-12-13T06:47:37Z

Thanks, that was a very clear proof. I really appreciate you taking the time to sketch that out.

This comment should probably be updated to reflect this - there are a few implementations that are using the incorrect ceil(compression * pi / 2) that it states (see here and here).

tdunning · 2016-12-13T19:01:19Z

Thanks for the references. I have posted issues with the proof.

…

On Mon, Dec 12, 2016 at 10:47 PM, Jim Crist ***@***.***> wrote: Thanks, that was a very clear proof. I really appreciate you taking the time to sketch that out. This comment <https://github.com/tdunning/t-digest/blob/master/src/main/java/com/tdunning/math/stats/MergingDigest.java#L135> should probably be updated to reflect this - there are a few implementations that are using the incorrect ceil(compression * pi / 2) that it states (see here <https://github.com/stripe/veneur/blob/master/tdigest/merging_digest.go#L75> and here <https://github.com/vega/datalib-sketch/blob/master/src/t-digest.js#L20>). — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#77 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAPSev4C9AamIkxEvZ-DbrZSNXvQqXI0ks5rHj-KgaJpZM4LLFWX> .

jcrist closed this as completed Dec 13, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Question: Proof of bounds on merging digest size #77

Question: Proof of bounds on merging digest size #77

jcrist commented Dec 12, 2016

tdunning commented Dec 13, 2016 via email

jcrist commented Dec 13, 2016

tdunning commented Dec 13, 2016 via email

Question: Proof of bounds on merging digest size #77

Question: Proof of bounds on merging digest size #77

Comments

jcrist commented Dec 12, 2016

tdunning commented Dec 13, 2016 via email

jcrist commented Dec 13, 2016

tdunning commented Dec 13, 2016 via email