
Improvement to t-digest stability under merging #78

Closed
erikerlandson opened this issue Dec 22, 2016 · 20 comments

@erikerlandson

I noticed that merging t-digests appeared to cause some drift from the underlying distribution, at least using the method mentioned in the original paper where clusters are randomly shuffled and inserted. Inserting combined clusters from largest-to-smallest gave a big improvement in "convergence" of the merged clusters. I wrote up the results here:

http://erikerlandson.github.io/blog/2016/12/19/converging-monoid-addition-for-t-digest/
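The largest-to-smallest idea can be sketched in a few lines. This is a toy illustration of the insertion *order* only, not the blog post's actual implementation; the (centroid, weight) cluster representation is an assumption:

```python
# Toy sketch: clusters as (centroid, weight) pairs. To combine two digests,
# pool all clusters and re-insert them largest-to-smallest, so the heaviest
# clusters anchor the result before lighter ones fill in around them.

def combine_largest_first(clusters_a, clusters_b):
    pooled = clusters_a + clusters_b
    pooled.sort(key=lambda c: c[1], reverse=True)  # heaviest weight first
    return pooled  # the order in which clusters would be re-inserted

order = combine_largest_first([(0.1, 5.0), (0.9, 1.0)], [(0.5, 3.0)])
```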

@tdunning
Owner

Hmm...

The more modern implementations (MergingDigest) already use a sorted-merge approach. Check out the faster-merge branch for the implementation, and the new-article branch for some discussion of how this all works.
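The sorted-merge idea can be illustrated with a simplified single pass. This is an assumed sketch, not the MergingDigest source; the k1-style scale function and the one-k-unit merge criterion are simplifications:

```python
import math

def k_limit_merge(clusters, delta=100.0):
    """One merge pass: clusters sorted by mean, greedily combined while the
    k-scale budget allows. A simplified sketch of the sorted-merge idea."""
    clusters = sorted(clusters)                 # (mean, weight) pairs, by mean
    total = sum(w for _, w in clusters)

    def k(q):                                   # k1-style scale function
        return delta / (2 * math.pi) * math.asin(2 * q - 1)

    out = []
    mean, weight = clusters[0]
    w_so_far = 0.0
    k_low = k(0.0)
    for m, w in clusters[1:]:
        q = min(1.0, (w_so_far + weight + w) / total)
        if k(q) - k_low < 1.0:                  # still within one k-unit: merge
            mean = (mean * weight + m * w) / (weight + w)
            weight += w
        else:                                   # emit and start a new cluster
            out.append((mean, weight))
            w_so_far += weight
            k_low = k(w_so_far / total)
            mean, weight = m, w
    out.append((mean, weight))
    return out

merged = k_limit_merge([(i / 100, 1.0) for i in range(100)], delta=10.0)
```

Note that total weight is conserved and the output cluster count stays bounded by the scale function, which is the property the thread is discussing.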

@tdunning
Owner

I did some checking on an R implementation of the merging algorithm and didn't see any of this behavior. Here, for instance, is the evolution of the KS statistic over 100 merges:

[figure: KS evolution over 100 merges]

On the left is a picture of the cluster sizes and on the right is KS over iterations.
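For reference, the KS (Kolmogorov–Smirnov) statistic being plotted is the maximum absolute gap between two CDFs. A minimal sketch, assuming we compare an empirical CDF (such as one implied by a digest's centroids) against a known reference CDF; names here are illustrative:

```python
def ks_statistic(samples, reference_cdf):
    """Max gap between the empirical CDF of `samples` and `reference_cdf`."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # The ECDF jumps from i/n to (i+1)/n at x; check the gap on both sides.
        d = max(d, abs(reference_cdf(x) - i / n),
                   abs((i + 1) / n - reference_cdf(x)))
    return d

# Uniform(0,1) reference against a perfectly regular sample:
d = ks_statistic([(i + 0.5) / 100 for i in range(100)], lambda x: x)
```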

@tdunning
Owner

@erikerlandson
Author

Aha, that explains the code I was looking at. I'll work on including that algorithm in my comparisons.

@tdunning
Owner

tdunning commented Dec 23, 2016 via email

@erikerlandson
Author

I implemented the new merge-based combining algorithm for comparison. The code is here -- I believe I have it working correctly, but it hasn't been reviewed by anybody else.

My updated plots look like this (merge-based algorithm added in green):

[figure: merge_compare — updated convergence plots, merge-based algorithm in green]

Assuming my code is all working as expected, largest-to-smallest converges a bit better than merge-based, although either is clearly far better than the older random-insertion algorithm.

@tdunning
Owner

tdunning commented Dec 24, 2016 via email

@erikerlandson
Author

I wonder how important it really is, especially if it's inconvenient to change. My original goal was to just find something that didn't keep getting worse, like the older random-insertion algorithm does. The merge-based algorithm clearly meets and exceeds that requirement already.

My logic for combining clusters with exactly the same centroid is a by-product of my choice of backing data structure. Since I'm using a map whose key is the centroid location, multiple clusters with the same centroid are inconvenient to work with, so I decided to just live with the possibility of violating the cluster size bound. I predicted it would be a rare event in most scenarios.
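That map-backed design can be sketched as follows (illustrative, not the actual code): keying clusters by centroid location means mass landing on an existing key simply accumulates there, even if it pushes the cluster past the nominal size bound.

```python
def insert(clusters, x, w):
    """clusters: dict mapping centroid location -> cluster weight.
    Mass at an exact existing key merges unconditionally, which is how a
    cluster can grow past the nominal size bound."""
    clusters[x] = clusters.get(x, 0.0) + w

c = {}
insert(c, 0.0, 10.0)
insert(c, 0.0, 10.0)   # exact duplicate: weight piles up at one key
```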

@tdunning
Owner

tdunning commented Dec 24, 2016 via email

@tdunning
Owner

tdunning commented Dec 25, 2016 via email

@erikerlandson
Author

nextAfter() is a cool widget! I like the idea of introducing jitter. Why not, it's just a sketch anyway.
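The jitter trick can be sketched with Python's `math.nextafter` (3.9+), which mirrors Java's `Math.nextAfter`: nudge a colliding value by one ulp so duplicates land on distinct map keys. The helper name is hypothetical:

```python
import math

def jittered_key(clusters, x):
    """Nudge x upward by one ulp until it no longer collides with an
    existing centroid key (hypothetical helper, sketching the jitter idea)."""
    while x in clusters:
        x = math.nextafter(x, math.inf)   # smallest possible step upward
    return x

c = {1.0: 5.0}
k = jittered_key(c, 1.0)   # collides, so it gets nudged one ulp above 1.0
```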

I have a theory that my logic could work OK with a big spike of zeros (or any other repeated value): such a spike would already be larger than the nominal size limit, so it would never merge with clusters at other locations (and therefore never shift). Clusters only get to grow past their limit by acquiring new mass at exactly the same location. But that needs either proof or disproof via additional unit testing.

My original concept involved a sort of back-off sketching model where discrete values were stored in a map, unless and until some sufficient number of unique values was encountered, and then it would back-off to a t-digest. I suppose a hybrid model might be possible, where large spikes get stored exterior to the digest.
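That back-off model might look roughly like this (hypothetical names throughout; the approximate digest is stubbed as a plain list rather than a real t-digest):

```python
class BackoffSketch:
    """Exact value -> count map until more than max_unique distinct values
    are seen, then back off to an approximate digest (stubbed here)."""

    def __init__(self, max_unique=4):
        self.max_unique = max_unique
        self.counts = {}      # exact mode
        self.digest = None    # stands in for a real t-digest

    def update(self, x):
        if self.digest is None:
            self.counts[x] = self.counts.get(x, 0) + 1
            if len(self.counts) > self.max_unique:
                # threshold crossed: migrate exact counts into the digest
                self.digest = sorted(self.counts.items())
                self.counts = {}
        else:
            self.digest.append((x, 1))

s = BackoffSketch(max_unique=2)
for v in [0.0, 0.0, 1.0, 2.0, 3.0]:
    s.update(v)
```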

@tdunning
Owner

tdunning commented Jan 12, 2017 via email

@tdunning
Owner

Erik,

Thanks again for your excellent observation. I just wanted to mention that the MergingDigest now has an add function that takes a list of TDigests. This mega-merge is handled very efficiently, and accuracy should be better than with monoid-style accumulation.

@erikerlandson
Author

That sounds very cool! 👍

@tdunning
Owner

Btw... I think that you may actually have been seeing the thin end of #89

AVLTreeDigest has a serious bug with repeated values.

@tdunning
Owner

tdunning commented Aug 6, 2017

This seems better lately. This is from the data produced by TDigestTest.testKSDrift.

[figure: KS drift for AVLTreeDigest and MergingDigest from TDigestTest.testKSDrift]

Note that the algorithms differ in that AVLTreeDigest uses a size limit that lets the number of centroids grow without bound, roughly as \delta * log(n), while MergingDigest uses a size limit that keeps the number strictly bounded by 2 \delta and, practically speaking, near 1.3 \delta. In any case, the KS statistic is very small in both cases.
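Taking those formulas at face value, a back-of-envelope comparison for an illustrative \delta = 100 and n = 10^6 (the natural log is used here; the comment does not specify a base):

```python
import math

delta, n = 100, 1_000_000
avl_scale = delta * math.log(n)   # AVLTreeDigest: count grows ~ delta * log(n)
merge_hard = 2 * delta            # MergingDigest: strict bound
merge_typical = 1.3 * delta       # MergingDigest: typical count in practice
```

Even at n = 10^6 the AVLTreeDigest figure (about 1380 centroids) is well above the MergingDigest hard bound of 200, which illustrates why the two are not quite apples to apples.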

@erikerlandson
Author

Nice, it looks like that cut the D in half!

@tdunning
Owner

tdunning commented Aug 7, 2017

Well, it isn't clear that this is all apples to apples. But, yes. This is encouraging.

The MergingDigest has different size limits that make it behave very differently in terms of centroid count. When I goosed the compression limit to give similar centroid counts, MergingDigest and AVLTreeDigest have very similar D. The

@tdunning
Owner

tdunning commented Aug 7, 2017

Hmm... my last comment got cut off.

I was going to say that there is some not-yet-understood initial convergence behavior. It doesn't appear significant as far as I can tell.

@tdunning
Owner

tdunning commented Apr 2, 2021

I think that this is resolved. It wasn't a single fix, but the accumulated changes resulted in significant improvements.

Thanks to all!

@tdunning tdunning closed this as completed Apr 2, 2021