# An implementation of t-digest numeric distribution sketching #495

Closed · wants to merge 16 commits

## Conversation

6 participants
Contributor

### erikerlandson commented Oct 18, 2015

As described in the following paper: *Computing Extremely Accurate Quantiles Using t-Digests*, by Ted Dunning and Otmar Ertl: https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf
Contributor

### erikerlandson commented Oct 18, 2015

For reference, my blog post describing the reasoning behind the binary tree algorithms backing the t-digest: http://erikerlandson.github.io/blog/2015/09/26/a-library-of-binary-tree-algorithms-as-mixable-scala-traits/
Contributor

### erikerlandson commented Oct 19, 2015

 CI is showing a failure, but the log output makes it seem like tests succeeded. I'm not replicating any failures on my local machine.

### johnynek reviewed Oct 19, 2015

```diff
+ * @note This operation, combined with the empty digest, satisfy a Monoid law, with the caveat
+ * that it involves random shuffling of clusters, and so the result is not deterministic.
+ */
+ def ++(that: TDigest) = TDigestMonoid.plus(this, that, this.delta)
```

#### johnynek Oct 19, 2015

Collaborator

return type on public methods, please.

### johnynek reviewed Oct 19, 2015

```diff
+ * @param x a numeric value
+ * @return the cumulative probability that a random sample from the distribution is <= x
+ */
+ def cdf[N](x: N)(implicit num: Numeric[N]) = clusters.cdf(x)
```

#### johnynek Oct 19, 2015

Collaborator

return type on public methods please.

### johnynek reviewed Oct 19, 2015

```diff
+ * @param q a quantile value. The value of q is expected to be on interval [0, 1]
+ * @return the value x such that cdf(x) = q
+ */
+ def cdfInverse[N](q: N)(implicit num: Numeric[N]) = clusters.cdfInverse(q)
```

Collaborator

### johnynek reviewed Oct 19, 2015

```diff
+ * (it may grow larger if data are not presented randomly). The default corresponds to
+ * an expected number of clusters of about 100.
+ */
+ val deltaDefault = (50.0 / 100.0) // delta * E[clusters] ~ 50
```

#### johnynek Oct 19, 2015

Collaborator

type on public val, please.

### johnynek reviewed Oct 19, 2015

```diff
+ * clusters will only trigger the corresponding re-clustering threshold when data are being
+ * presented in a non-random order.
+ */
+ val K = 10.0 * 50.0
```

#### johnynek Oct 19, 2015

Collaborator

type on public val, please.

### johnynek reviewed Oct 19, 2015

```diff
+}
+
+/** The Monoid type class for TDigest objects */
+class TDigestMonoid(delta: Double = TDigest.deltaDefault) extends Monoid[TDigest] {
```

#### johnynek Oct 19, 2015

Collaborator

I don't like that both TDigest and the monoid know about delta. Can we make it so only one needs to?

#### erikerlandson Oct 19, 2015

Contributor

I think you're right, the monoid has no need to know. That does imply the need for a policy on what it means when adding two t-digests with differing delta values. My proposal would be just adopting the left-hand-side delta, which is simple and also allows a user to choose which delta to use via picking which argument appears on the left. I don't like them as much, but other plausible options would be:

• minimum (i.e. pick lowest sketch resolution)
• maximum (highest resolution)
• average
• throw exception (only allow adding digests with identical delta values)
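To make the left-hand-side policy concrete, here is a minimal sketch using a hypothetical simplified `Digest` shape (not the PR's actual `TDigest` class); it illustrates only the delta-selection policy, not real t-digest cluster merging:

```scala
// Hypothetical simplified digest: a delta plus (centroid, mass) pairs.
final case class Digest(delta: Double, clusters: Vector[(Double, Double)]) {
  // Adopt the left-hand side's delta; a user chooses which resolution
  // survives by picking which digest appears on the left.
  def ++(that: Digest): Digest =
    Digest(this.delta, this.clusters ++ that.clusters)
}
```

Under this policy, `a ++ b` keeps `a.delta` and `b ++ a` keeps `b.delta`.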
Contributor

### erikerlandson commented Oct 19, 2015

 Comment on t-digest monoids: I'm pretty sure I can address the issue of non-deterministic behavior of monoid `plus`, which may be worth doing, but I think there is a deeper failure of strict associativity, since `(td1 ++ td2) ++ td3` is unlikely to yield the same final set of clusters as `td1 ++ (td2 ++ td3)`, even if the re-clustering presentation order becomes repeatable. I'd contend that this is OK (the behavior is "statistically monoidal", to whatever extent that's a thing), but worth calling out, since I'm declaring that it has a monoid type class.

### johnynek reviewed Oct 19, 2015

```diff
+
+/** Factory functions for TDigest */
+object TDigest {
+ import scala.language.reflectiveCalls
```

#### johnynek Oct 19, 2015

Collaborator

can we remove this?

### johnynek reviewed Oct 19, 2015

```diff
+/** Factory functions for TDigest */
+object TDigest {
+ import scala.language.reflectiveCalls
+ import com.twitter.algebird.Monoid
```

#### johnynek Oct 19, 2015

Collaborator

this is not needed now, right? This is in the space right?

### johnynek reviewed Oct 19, 2015

```diff
+ * val map2 = IncrementMap.key[String].value(mon)
+ * }}}
+ */
+ def key[K](implicit ord: Ordering[K]) = new AnyRef {
```

#### johnynek Oct 19, 2015

Collaborator

you are using several types here, why not use a named type here? `UnsetValueIncrementMap` or something.

Collaborator

### avibryant commented Oct 19, 2015

 @erikerlandson agreed on "statistically monoidal" - it seems fine, and is consistent with `QTree` and with the top-k stuff in `CMS`. We should come up with some formal expression of this but it's definitely something we're comfortable with.
Contributor

### erikerlandson commented Oct 20, 2015

 @johnynek looks like some unrelated HLL unit test failed, but I can't re-run it

### johnynek reviewed Oct 23, 2015

```diff
+ if (s.nclusters <= R) s
+ else {
+ // too many clusters: attempt to compress it by re-clustering
+ val ds = scala.util.Random.shuffle(s.clusters.toVector)
```

#### johnynek Oct 23, 2015

Collaborator

this is non-deterministic, but I wonder: if we track a seed which we update on merges, can we make it deterministic and yet keep the property that in expectation it is randomized?

#### erikerlandson Oct 23, 2015

Contributor

Yes, definitely. I was thinking of just using a Random object seeded with some variation of `hashCode` on the data, something along the lines of:

```scala
def deterministicShuffle[T](data: Seq[T]): Seq[T] =
  if (data.isEmpty) data
  else (new scala.util.Random(data.head.hashCode)).shuffle(data)
```

### johnynek reviewed Oct 23, 2015

```diff
+ delta: Double = deltaDefault)(implicit num: Numeric[N]): TDigest = {
+ require(delta > 0.0, s"delta was not > 0")
+ val td = data.foldLeft(empty(delta))((c, e) => c + ((e, 1)))
+ val ds = scala.util.Random.shuffle(td.clusters.toVector)
```

#### johnynek Oct 23, 2015

Collaborator

here is another randomization. Is this needed? Why not have the user shuffle if they want that behavior and just do the fold here?

#### erikerlandson Oct 23, 2015

Contributor

Input data is not pre-shuffled. What you see there is a final shuffle, used for a final clustering compression.

### johnynek reviewed Oct 23, 2015

```diff
+object tree {
+ /** The color (red or black) of a node in a Red/Black tree */
+ sealed trait Color extends Serializable
+ case object R extends Color
```

#### johnynek Oct 23, 2015

Collaborator

can we call these Red and Black?

### johnynek reviewed Oct 23, 2015

```diff
+ /** Defines the data payload of a tree node */
+ trait Data[K] extends Serializable {
+ /** The axiomatic unit of data for R/B trees is a key */
+ val key: K
```

#### johnynek Oct 23, 2015

Collaborator

`def key: K` is what we normally use here. You can override def with a val, but an abstract val has some bad properties (like opening up the val-init-order bug as a possibility).
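A small sketch of the val-init-order hazard (hypothetical traits, not the PR's `Data`): trait field initializers run before the subclass's, so a concrete `val` computed from an abstract `val` sees `null`, while an abstract `def` is evaluated at call time and is safe:

```scala
trait KeyedVal {
  val key: String                    // abstract val
  val label: String = "key=" + key   // runs during trait init; key is still null here
}

trait KeyedDef {
  def key: String                    // abstract def: evaluated at call time
  def label: String = "key=" + key
}

object InitOrderDemo {
  val viaVal = new KeyedVal { val key = "a" }  // viaVal.label is "key=null"
  val viaDef = new KeyedDef { def key = "a" }  // viaDef.label is "key=a"
}
```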

### johnynek reviewed Oct 23, 2015

```diff
+ case object B extends Color
+
+ /** Defines the data payload of a tree node */
+ trait Data[K] extends Serializable {
```

#### johnynek Oct 23, 2015

Collaborator

this can be covariant: `Data[+K]`, which may be useful.

#### erikerlandson Oct 23, 2015

Contributor

Covariance involves some open questions for me. For example, if we want to inherit from `SortedMap[K, +V]`, then I have to support a type-widening insert:

`def +[V2 >: V](kv: (K, V2)): Map[K, V2]`

Which I think I can do, but it isn't painless either.
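For illustration, a toy covariant map (hypothetical, list-backed, not the PR's tree classes) showing why the widening signature is forced: with `+V` covariant, inserting a value of a different type must widen `V` to a common supertype `V2`.

```scala
// Toy covariant map: insertion widens V to V2 >: V so MyMap stays covariant in V.
final case class MyMap[K, +V](entries: List[(K, V)]) {
  def +[V2 >: V](kv: (K, V2)): MyMap[K, V2] =
    MyMap(kv :: entries.filterNot(_._1 == kv._1))
  def get(k: K): Option[V] = entries.find(_._1 == k).map(_._2)
}
```

Adding a `Double` to a `MyMap[String, Int]` yields a `MyMap[String, AnyVal]`, the least upper bound.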

### johnynek reviewed Oct 23, 2015

```diff
+
+ override def hashCode = key.hashCode
+ override def equals(that: Any) = that match {
+ case data: Data[K] => this.key.equals(data.key)
```

#### johnynek Oct 23, 2015

Collaborator

I think this gives a warning because the `K` can't be checked. I think you need: `case data: Data[_] => key.equals(data.key)`

### johnynek reviewed Oct 23, 2015

```diff
+ trait Node[K] extends Serializable {
+
+ /** The ordering that is applied to key values */
+ val keyOrdering: Ordering[K]
```

#### johnynek Oct 23, 2015

Collaborator

I'd prefer not to have abstract vals.

### johnynek reviewed Oct 23, 2015

```diff
+ val lsub: Node[K]
+ /** The right sub-tree */
+ val rsub: Node[K]
+
```

#### johnynek Oct 23, 2015

Collaborator

same abstract val concern.

#### erikerlandson Oct 23, 2015

Contributor

I think updating these to be `def` instead of `val` will be relatively straightforward

### johnynek reviewed Oct 23, 2015

```diff
+ }
+ }
+
+ private object RNode {
```

#### johnynek Oct 23, 2015

Collaborator

can we call these RedNode/BlackNode?

#### johnynek Oct 23, 2015

Collaborator

Actually, why not make:

```scala
case class RedNode[K](data: Data[K], lsub: INode[K], rsub: INode[K]) extends INode[K]
case class ...
```

or even:

`case class INode[+C <: Color, K](color: C, lsub: INode[Color, K], rsub: INode[Color, K])`

This would be more idiomatic.

#### erikerlandson Oct 23, 2015

Contributor

Because they get inherited from, and Scala doesn't seem to like inheriting from case classes.

Collaborator

### johnynek commented Oct 23, 2015

This is exciting, and I really appreciate the work on this. I want to get this merged, but I also hope you agree with us that careful code review is important.

One thing I'm wondering: could we break this into two PRs? One to add the red/black trees with tests, and a second that leverages the red/black trees to implement the t-digest (building on the first, clearly)?

What I'm not seeing yet in the review is why scala.collection.immutable.SortedMap (http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.SortedMap) won't work here. Could you comment in the code? Sorry if I missed it.

Also, red/black trees are O(log_2 N), but the hash-trie based standard maps are O(log_32 N). Can you comment on what features we need of the red/black tree that the faster hash-trie can't give us?
Contributor

### erikerlandson commented Oct 23, 2015

@johnynek, I agree, it's a big PR with a lot going on, and I think reviews may as well be rigorous; otherwise they aren't as useful.
Contributor

### erikerlandson commented Oct 23, 2015

@johnynek regarding inheriting from `SortedMap`: I am pretty sure I can update the tree classes to satisfy the `SortedMap` interface trait, although it involves implementing several methods that I didn't immediately have great answers for, so I punted a bit. I had figured that might be addressed in a future PR. However, if you prefer, I can try to take that bull by the horns now.

Regarding the particular use of binary trees: one nontrivial reason for that choice is that binary trees support log-time maintenance and queries for prefix sums, my "cover" constructs, and a couple of other basic algorithms. So there is more going on than just mapping objects. Although that might be possible in a non-binary-tree data structure, I think the logic would be a lot harder at best, and maybe not possible at all. Using red/black trees was just a way to guarantee balanced trees with a well-understood data structure that I had a hope of getting to work along with all the other functionality I was layering on top :)
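The prefix-sum point can be sketched with an augmented BST (hypothetical node types, not the PR's tree classes): each node caches its subtree's total mass, so the cumulative mass of all keys at or below `x` is a single O(log n) descent.

```scala
sealed trait MassTree {
  def mass: Double                  // total mass in this subtree (cached)
  def prefixSum(x: Double): Double  // total mass of keys <= x
}

case object Tip extends MassTree {
  val mass = 0.0
  def prefixSum(x: Double) = 0.0
}

final case class MassNode(key: Double, m: Double, lsub: MassTree, rsub: MassTree)
    extends MassTree {
  val mass: Double = m + lsub.mass + rsub.mass
  def prefixSum(x: Double): Double =
    if (x < key) lsub.prefixSum(x)                 // everything here is > x except part of lsub
    else lsub.mass + m + rsub.prefixSum(x)         // whole left subtree and this node count
}
```

Hash-based containers can't support this query because they don't maintain keys in order.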
Contributor

### erikerlandson commented Oct 23, 2015

 Maybe an even better way to explain it is that a lot of the logic I need to support requires that clusters are maintained in numeric / location order. So any hash-based container will not get me what I need. A tree data structure maintains the ordering by (numeric) key that I need, along with the desirable log-time operations.
Contributor

### erikerlandson commented Oct 23, 2015

@johnynek regarding separate Red/Black node classes: in an earlier iteration I was doing that, but the code wasn't as DRY. To make a long story short, it resulted in two parallel copies of the internal node logic (which is where most of the logic is). So I felt that designing around a single internal node class, with color as one field, was the cleaner solution.
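The color-as-field design being described might look like this (hypothetical names, not the PR's actual classes):

```scala
// One internal node class; color is just a field, so balancing logic is
// written once against RBNode instead of duplicated per node class.
sealed trait Color
case object Red extends Color
case object Black extends Color

sealed trait RBTree[K]
final case class RBTip[K]() extends RBTree[K]
final case class RBNode[K](color: Color, key: K, lsub: RBTree[K], rsub: RBTree[K])
    extends RBTree[K]
```

Recoloring during rebalancing then becomes a `copy(color = ...)` rather than converting between two node classes.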
Contributor

### erikerlandson commented Oct 23, 2015

 @johnynek splitting into a "tree PR" and a "t-digest PR" is OK with me, would you like me to pull the trigger on that?

Contributor

### erikerlandson commented Oct 24, 2015

 The supporting tree/map library has been factored out to #496, and I will now be keeping this branch rebased off of topic branch `feature/treemaps`


### erikerlandson added some commits Oct 24, 2015

- `319588e` A Library of Binary Tree Algorithms as Mixable Scala Traits
- `2e67187` Replace abstract 'val' with 'def' in trait definitions
- `e0312ca` inherit from Scala's MapLike and SetLike traits
- `96f572c` change implementation of INodeIterator so it will support iteratorFrom methods
- `986f08b` iteratorFrom() family from SortedMapLike and SortedSetLike
- `642dcdc` rangeImpl
- `2e083e6` inherit from SortedSetLike and SortedMapLike

### erikerlandson added some commits Oct 30, 2015

- `927898d` re-centralize definition of '+' insertion to clean up some problems with specific classes that inherit from the hierarchy which interact badly with type-widening
- `d4cc85a` replace 'object tree' with 'package tree' (similar for 'infra')
- `810de02` add explicit return types
- `4c9b73d` Replace IncrementingMonoid with MonoidAggregator
- `62505f4` annotate the purpose of the Inject classes
- `96137ef` An implementation of t-digest numeric distribution sketching
- `49ea6cf` sync TDigestMap with updates to tree/map hierarchy requirements
- `e78810c` make t-digest shuffling operations referentially transparent
- `503db35` update to use MonoidAggregator
Contributor

### erikerlandson commented Feb 9, 2016

 @johnynek @ianoc @avibryant is there still interest in #495 and #496 ?

### kainoa21 commented Aug 18, 2016

Would very much like to see this PR merged; I am currently using the reference implementation but would benefit from an Algebird-based implementation.
Contributor

### erikerlandson commented Aug 22, 2016

 @kainoa21 coincidentally I started re-visiting this last weekend. There was a request to un-factor the various tree functions to reduce the code bulk on the back end, which is still on my to-do list.

### jnievelt reviewed Aug 22, 2016

```diff
+ // if we have already distributed all the mass, remaining clusters unchanged
+ cmNew = cmNew :+ ((clust.centroid, clust.mass))
+ } else if (xn == clust.centroid) {
+ // if xn lies exactly on the centroid, add all mass in regardless of bound
```

#### jnievelt Aug 22, 2016

Contributor

What's the justification for this? It doesn't seem to come from the paper.

In fact, there is explicit acknowledgment that two clusters could have the same centroid:

> For centroids with identical means, order of creation is used as a tie-breaker to allow an unambiguous ordering

#### erikerlandson Aug 22, 2016

Contributor

It allows the centroid to serve as a unique key, which simplifies things. Multiple clusters with the same centroid add nothing useful to the model.

Collaborator

### johnynek commented Aug 22, 2016

Sorry, we did not communicate clearly on this. This is indeed an exciting algorithm; however, if we are going to become the maintainers of it, it needs to be significantly smaller in size. An implementation that could reuse existing Scala collection classes may be small enough to accept. Honestly, the best approach is probably for @erikerlandson to provide a low-dependency subproject that hosts the t-digest, and then we could have an `algebird-tdigest` sub-project which adds the relevant Algebird integration (monoids and aggregators mostly, I think).
Collaborator

### johnynek commented Aug 22, 2016

I'd like to add that these cases are a challenge. We want to welcome contributions, but inclusion can seem like an endorsement, which has burned us a few times in the past when implementations bit-rot or perform poorly (not saying that will happen here; I just mean that we need to really have the bandwidth to understand and maintain the code we add).

### willb commented Aug 23, 2016 (edited)

 @johnynek What about an `algebird-contrib` repository (or organization) for cases just like this, where developers could host community-maintained code that was designed to work with Algebird but without an Algebird imprimatur?
Contributor

### erikerlandson commented Dec 7, 2016

This is now available as a package here: https://github.com/isarn/isarn-sketches, so I'm going to close this out.

### erikerlandson closed this Dec 7, 2016
