
Improvement to t-digest stability under merging #78

Closed
erikerlandson opened this issue Dec 22, 2016 · 20 comments

@erikerlandson

I noticed that merging t-digests appeared to cause some drift from the underlying distribution, at least using the method mentioned in the original paper where clusters are randomly shuffled and inserted. Inserting combined clusters from largest-to-smallest gave a big improvement in "convergence" of the merged clusters. I wrote up the results here:

http://erikerlandson.github.io/blog/2016/12/19/converging-monoid-addition-for-t-digest/
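The largest-to-smallest idea can be sketched in a few lines. This is a toy illustration of the insertion *order* only, not the blog post's actual implementation; the (centroid, weight) cluster representation is an assumption:

```python
# Toy sketch: clusters as (centroid, weight) pairs. To combine two digests,
# pool all clusters and re-insert them largest-to-smallest, so the heaviest
# clusters anchor the result before lighter ones fill in around them.

def combine_largest_first(clusters_a, clusters_b):
    pooled = clusters_a + clusters_b
    pooled.sort(key=lambda c: c[1], reverse=True)  # heaviest weight first
    return pooled  # the order in which clusters would be re-inserted

order = combine_largest_first([(0.1, 5.0), (0.9, 1.0)], [(0.5, 3.0)])
```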

@tdunning
Owner

Hmm...

The more modern implementations (MergingDigest) already use a sorted-merge approach. Check out the faster-merge branch for the implementation, and the new-article branch for some discussion of how this all works.
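The sorted-merge idea can be illustrated with a simplified single pass. This is an assumed sketch, not the MergingDigest source; the k1-style scale function and the one-k-unit merge criterion are simplifications:

```python
import math

def k_limit_merge(clusters, delta=100.0):
    """One merge pass: clusters sorted by mean, greedily combined while the
    k-scale budget allows. A simplified sketch of the sorted-merge idea."""
    clusters = sorted(clusters)                 # (mean, weight) pairs, by mean
    total = sum(w for _, w in clusters)

    def k(q):                                   # k1-style scale function
        return delta / (2 * math.pi) * math.asin(2 * q - 1)

    out = []
    mean, weight = clusters[0]
    w_so_far = 0.0
    k_low = k(0.0)
    for m, w in clusters[1:]:
        q = min(1.0, (w_so_far + weight + w) / total)
        if k(q) - k_low < 1.0:                  # still within one k-unit: merge
            mean = (mean * weight + m * w) / (weight + w)
            weight += w
        else:                                   # emit and start a new cluster
            out.append((mean, weight))
            w_so_far += weight
            k_low = k(w_so_far / total)
            mean, weight = m, w
    out.append((mean, weight))
    return out

merged = k_limit_merge([(i / 100, 1.0) for i in range(100)], delta=10.0)
```

Note that total weight is conserved and the output cluster count stays bounded by the scale function, which is the property the thread is discussing.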

@tdunning
Owner

I did some checking on an R implementation of the merging algorithm and didn't see any of this behavior. Here, for instance, is the evolution of the KS statistic over 100 merges:

[figure: KS evolution over 100 merges]

On the left is a picture of the cluster sizes and on the right is KS over iterations.
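For reference, the KS (Kolmogorov–Smirnov) statistic being plotted is the maximum absolute gap between two CDFs. A minimal sketch, assuming we compare an empirical CDF (such as one implied by a digest's centroids) against a known reference CDF; names here are illustrative:

```python
def ks_statistic(samples, reference_cdf):
    """Max gap between the empirical CDF of `samples` and `reference_cdf`."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # The ECDF jumps from i/n to (i+1)/n at x; check the gap on both sides.
        d = max(d, abs(reference_cdf(x) - i / n),
                   abs((i + 1) / n - reference_cdf(x)))
    return d

# Uniform(0,1) reference against a perfectly regular sample:
d = ks_statistic([(i + 0.5) / 100 for i in range(100)], lambda x: x)
```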

@tdunning
Owner

@erikerlandson
Author

Aha, that explains the code I was looking at. I'll work on including that algorithm in my comparisons.

@tdunning
Owner

tdunning commented Dec 23, 2016 via email

@erikerlandson
Author

I implemented the new merge-based combining algorithm for comparison. The code is here -- I believe I have it working correctly, but it hasn't been reviewed by anybody else.

My updated plots look like this (merge-based algorithm added in green):

[figure: merge_compare — updated convergence plots, merge-based algorithm in green]

Assuming my code is all working as expected, largest-to-smallest converges a bit better than merge-based, although either is clearly far better than the older random-insertion algorithm.

@tdunning
Owner

tdunning commented Dec 24, 2016 via email

@erikerlandson
Author

I wonder how important it really is, especially if it's inconvenient to change. My original goal was to just find something that didn't keep getting worse, like the older random-insertion algorithm does. The merge-based algorithm clearly meets and exceeds that requirement already.

My logic for combining clusters with exactly the same centroid is a by-product of my choice of backing data structure. Since I'm using a map whose key is the centroid location, multiple clusters with the same centroid are inconvenient to work with, so I decided to just live with the possibility of violating the cluster size bound. I predicted it would be a rare event in most scenarios.
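That map-backed design can be sketched as follows (illustrative, not the actual code): keying clusters by centroid location means mass landing on an existing key simply accumulates there, even if it pushes the cluster past the nominal size bound.

```python
def insert(clusters, x, w):
    """clusters: dict mapping centroid location -> cluster weight.
    Mass at an exact existing key merges unconditionally, which is how a
    cluster can grow past the nominal size bound."""
    clusters[x] = clusters.get(x, 0.0) + w

c = {}
insert(c, 0.0, 10.0)
insert(c, 0.0, 10.0)   # exact duplicate: weight piles up at one key
```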

@tdunning
Owner

tdunning commented Dec 24, 2016 via email

@tdunning
Owner

tdunning commented Dec 25, 2016 via email

@erikerlandson
Author

nextAfter() is a cool widget! I like the idea of introducing jitter. Why not, it's just a sketch anyway.
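The jitter trick can be sketched with Python's `math.nextafter` (3.9+), which mirrors Java's `Math.nextAfter`: nudge a colliding value by one ulp so duplicates land on distinct map keys. The helper name is hypothetical:

```python
import math

def jittered_key(clusters, x):
    """Nudge x upward by one ulp until it no longer collides with an
    existing centroid key (hypothetical helper, sketching the jitter idea)."""
    while x in clusters:
        x = math.nextafter(x, math.inf)   # smallest possible step upward
    return x

c = {1.0: 5.0}
k = jittered_key(c, 1.0)   # collides, so it gets nudged one ulp above 1.0
```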

I have a theory that my logic could work OK with a big spike of zeros (or any other repeated value): such a spike would already be larger than the nominal size limit, so it would never merge with clusters at other locations (and therefore never shift). Clusters only get to grow past their limit by acquiring new mass at exactly the same location. But that needs either proof or disproof via additional unit testing.

My original concept involved a sort of back-off sketching model where discrete values were stored in a map, unless and until some sufficient number of unique values was encountered, and then it would back-off to a t-digest. I suppose a hybrid model might be possible, where large spikes get stored exterior to the digest.
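That back-off model might look roughly like this (hypothetical names throughout; the approximate digest is stubbed as a plain list rather than a real t-digest):

```python
class BackoffSketch:
    """Exact value -> count map until more than max_unique distinct values
    are seen, then back off to an approximate digest (stubbed here)."""

    def __init__(self, max_unique=4):
        self.max_unique = max_unique
        self.counts = {}      # exact mode
        self.digest = None    # stands in for a real t-digest

    def update(self, x):
        if self.digest is None:
            self.counts[x] = self.counts.get(x, 0) + 1
            if len(self.counts) > self.max_unique:
                # threshold crossed: migrate exact counts into the digest
                self.digest = sorted(self.counts.items())
                self.counts = {}
        else:
            self.digest.append((x, 1))

s = BackoffSketch(max_unique=2)
for v in [0.0, 0.0, 1.0, 2.0, 3.0]:
    s.update(v)
```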

@tdunning
Owner

tdunning commented Jan 12, 2017 via email

@tdunning
Owner

Erik,

Thanks again for your excellent observation. I just wanted to mention that the MergingDigest now has an add function that takes a list of TDigests. This mega-merge is handled very efficiently, and accuracy should be better than with monoid-style accumulation.

@erikerlandson
Author

That sounds very cool! 👍

@tdunning
Owner

Btw... I think that you may actually have been seeing the thin end of #89

AVLTreeDigest has a serious bug with repeated values.

@tdunning
Owner

tdunning commented Aug 6, 2017

This seems better lately. This is from the data produced by TDigestTest.testKSDrift.

[figure: KS drift for AVLTreeDigest and MergingDigest from TDigestTest.testKSDrift]

Note that the algorithms differ in that AVLTreeDigest uses a size limit that lets the number of centroids grow without bound, roughly as \delta * log(n), while MergingDigest uses a size limit that keeps the number strictly bounded by 2 \delta and, practically speaking, near 1.3 \delta. In any case, the KS statistic is very small in both cases.
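Taking those formulas at face value, a back-of-envelope comparison for an illustrative \delta = 100 and n = 10^6 (the natural log is used here; the comment does not specify a base):

```python
import math

delta, n = 100, 1_000_000
avl_scale = delta * math.log(n)   # AVLTreeDigest: count grows ~ delta * log(n)
merge_hard = 2 * delta            # MergingDigest: strict bound
merge_typical = 1.3 * delta       # MergingDigest: typical count in practice
```

Even at n = 10^6 the AVLTreeDigest figure (about 1380 centroids) is well above the MergingDigest hard bound of 200, which illustrates why the two are not quite apples to apples.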

@erikerlandson
Author

Nice, it looks like that cut the D in half!

@tdunning
Owner

tdunning commented Aug 7, 2017

Well, it isn't clear that this is all apples to apples. But, yes. This is encouraging.

The MergingDigest has different size limits that make it behave very differently in terms of centroid count. When I goosed the compression limit to give similar centroid counts, MergingDigest and AVLTreeDigest have very similar D. The

@tdunning
Owner

tdunning commented Aug 7, 2017

Hmm... my last comment got cut off.

I was going to say that there is some not-yet-understood initial convergence behavior. It doesn't appear significant as far as I can tell.

@tdunning
Owner

tdunning commented Apr 2, 2021

I think that this is resolved. It wasn't a single fix, but the accumulated changes resulted in significant improvements.

Thanks to all!

@tdunning tdunning closed this as completed Apr 2, 2021