A Library of Binary Tree Algorithms as Mixable Scala Traits #496

Closed
wants to merge 12 commits into
from

Projects

None yet

3 participants

@erikerlandson
Contributor

This is a PR for tree and map traits supporting some binary tree algorithms, factored out of the PR for implementing t-digest: #495

A Library of Binary Tree Algorithms as Mixable Scala Traits
http://erikerlandson.github.io/blog/2015/09/26/a-library-of-binary-tree-algorithms-as-mixable-scala-traits/

@erikerlandson
Contributor

@johnynek I have tree/map libs in a state that is ready for some more review. They now inherit from SortedMapLike and SortedSetLike. (#495 is not currently rebased onto these latest changes). Unit tests are passing except for a spurious randomized test failure in HLL.

@erikerlandson
Contributor

Actually I'm going to tweak this a little, it causes some subtle problems when I try to apply it to TDigestMap

@erikerlandson
Contributor

@johnynek OK I have #495 and #496 synced up.

@ianoc ianoc commented on the diff Nov 12, 2015
.travis.yml
@@ -1,9 +1,8 @@
language: scala
sudo: false
-matrix:
@ianoc
ianoc Nov 12, 2015 Collaborator

why is this change in here? other than the compile step being before the test step whats the gain/reasoning?

@erikerlandson
erikerlandson Nov 12, 2015 Contributor

The test was timing out. Factoring out the compilation helps keep the actual 'sbt test' command from running too long. This used to be a separate commit with its own message, but it was all squashed when I refactored the PR

@ianoc ianoc and 1 other commented on an outdated diff Nov 12, 2015
.../main/scala/com/twitter/algebird/maps/increment.scala
+limitations under the License.
+*/
+
+package com.twitter.algebird.maps.increment
+
+import math.Ordering
+
+import scala.collection.SortedMap
+
+import com.twitter.algebird.Monoid
+
+import com.twitter.algebird.maps.redblack.tree._
+import com.twitter.algebird.maps.ordered._
+import com.twitter.algebird.maps.ordered.tree.DataMap
+
+object tree {
@ianoc
ianoc Nov 12, 2015 Collaborator

object names generally are all capitalized

@erikerlandson
erikerlandson Nov 12, 2015 Contributor

In this case the tree object is functioning more like a package, and so it seemed apropos to leave it lower case

@ianoc
ianoc Nov 12, 2015 Collaborator

why not make it a package then? this defining things in lower case objects then importing them immediately seems unusual.

@erikerlandson
Contributor

@ianoc I replaced object with package

@johnynek johnynek commented on the diff Nov 18, 2015
.../main/scala/com/twitter/algebird/maps/increment.scala
+ bNode(d, lsub, rsub)
+ }
+ }
+ }
+}
+
+import tree._
+
+package infra {
+ class Inject[K, V](val keyOrdering: Ordering[K], val valueMonoid: Monoid[V])
+ extends Serializable {
+ def iNode(clr: Color, dat: Data[K], ls: Node[K], rs: Node[K]) =
+ new Inject[K, V](keyOrdering, valueMonoid) with INodeInc[K, V] with IncrementMap[K, V] {
+ // INode[K]
+ val color = clr
+ val lsub = ls.asInstanceOf[NodeInc[K, V]]
@johnynek
johnynek Nov 18, 2015 Collaborator

why the casts here? Why not accept NodeInc[K, V] as the type for ls? same for the rest.

@johnynek
johnynek Nov 18, 2015 Collaborator

Can you put a return type on iNode. What are we using here?

@erikerlandson
erikerlandson Nov 19, 2015 Contributor

iNode is an abstract function defined in ...redblack.tree.Node (the very top of the hierarchy). Its return type is INode (the highest level internal tree node type). At that level, it has no knowledge of any sub classes, and so ls and rs have to be Node, and therefore they need to be cast. I also would have preferred not to require casting, but could not discover any way around it. I decided it was at least (a) localized and (b) only relevant to anybody designing a new tree/map subclass.

@johnynek johnynek and 1 other commented on an outdated diff Nov 18, 2015
.../main/scala/com/twitter/algebird/maps/increment.scala
+ * @tparam V The value type
+ * @tparam IN The node type of the concrete internal R/B tree subclass
+ * @tparam M The map self-type of the concrete map subclass
+ */
+trait IncrementMapLike[K, V, IN <: INodeInc[K, V], M <: IncrementMapLike[K, V, IN, M] with SortedMap[K, V]]
+ extends NodeInc[K, V] with OrderedMapLike[K, V, IN, M] {
+
+ /**
+ * Add (w.r.t. valueMonoid) a given value to the value currently stored at key.
+ * @note If key is not present, equivalent to insert(k, valueMonoid.plus(valueMonoid.zero, iv)
+ */
+ def increment(k: K, iv: V) = this.incr(
+ new DataMap[K, V] {
+ val key = k
+ val value = iv
+ }).asInstanceOf[M]
@johnynek
johnynek Nov 18, 2015 Collaborator

how does this cast succeed? Why is it needed?

It seems that DataMap[K, V] <: DataMap[K] <: AnyRef.

I'd really rather not have any casts.

@erikerlandson
erikerlandson Nov 19, 2015 Contributor

It is being used so that the xxxMap classes can return their own type (encoded by self type M). As far as I know, the only way to avoid it would be to extend the self types up into the tree classes. I made a few attempts at that, and could not get any of them to work, and so in the end I decided that casting in a few places was acceptable as a way to get it all working.

The cast succeeds because the class-specific implementation of iNode generates objects of the correct type.

Generally speaking, I share your aesthetic of wishing to avoid asInstanceOf, I just wasn't able to figure out how to avoid it in the places I did use it. The places I do use it, the type casts will succeed.

That said, if I or anybody else can demonstrate a version of the code that doesn't need them, that would be very nice.

@johnynek johnynek and 1 other commented on an outdated diff Nov 18, 2015
.../main/scala/com/twitter/algebird/maps/increment.scala
+ /**
+ * Instantiate a new empty IncrementMap from key and value types
+ * {{{
+ * import com.twitter.algebird.maps.increment._
+ *
+ * // map strings to integers, using default string ordering and default value monoid
+ * val map1 = IncrementMap.key[String].value[Int]
+ * // Use a custom ordering
+ * val ord: Ordering[String] = ...
+ * val map2 = IncrementMap.key(ord).value[Int]
+ * // A custom value monoid, defines what 'increment' means
+ * val mon: Monoid[Int] = ...
+ * val map2 = IncrementMap.key[String].value(mon)
+ * }}}
+ */
+ def key[K](implicit ord: Ordering[K]) = infra.GetValue(ord)
@johnynek
johnynek Nov 18, 2015 Collaborator

please put a return type on all public and protected methods.

@erikerlandson
erikerlandson Nov 19, 2015 Contributor

Sure, that will be no problem.

@johnynek johnynek and 1 other commented on an outdated diff Nov 18, 2015
.../main/scala/com/twitter/algebird/maps/prefixsum.scala
+ }
+
+ /** Mediating class between value method and prefix method */
+ case class GetPrefix[K, V](ord: Ordering[K]) {
+ def prefix[P](implicit mon: IncrementingMonoid[P, V]): PrefixSumMap[K, V, P] =
+ new Inject[K, V, P](ord, mon) with LNodePS[K, V, P] with PrefixSumMap[K, V, P]
+ }
+ }
+}
+
+/**
+ * A monoid that also supports an 'increment' operation. This class is intended to encode
+ * semantics similar to Scala's seq.aggregate(z)(seqop,combop), where the standard
+ * Monoid 'plus' corresponds to 'combop' and the 'inc' method corresponds to 'seqop'.
+ */
+trait IncrementingMonoid[T, E] extends Monoid[T] {
@johnynek
johnynek Nov 18, 2015 Collaborator

why is this needed? What are the laws? Is it any different than:

case class PreparedMonoid[E, T](prepare: E => T, monoid: Monoid[T])

why define a new trait?

@erikerlandson
erikerlandson Nov 19, 2015 Contributor

PreparedMonoid might be equivalent. I couldn't find anything except Monoid when I looked. If you already have something like it, that would be even better.

@erikerlandson
erikerlandson Nov 19, 2015 Contributor

I don't actually see any definition of PreparedMonoid, is this a recent addition?

@erikerlandson
erikerlandson Nov 19, 2015 Contributor

I have some misgivings about the Algebird use of the prepare concept. I believe it doesn't perform well when it is used on monoids with nontrivial structure.

As one example, consider a monoid defined over Map where m1.plus(m1) (or m1 ++ m2) is defined as something such as "merge maps m1 and m2" Suppose we also want to implement aggregate - like behavior over a list of elements. That is Map[K, V], where (K, V) is the element type. In Scala aggregate, this shows up as combop(m1, m2) (the analog of monoid m1 ++ m2), and seqop(m1, (k,v)) (which I will short-hand as operator m1 + (k,v)). Here, m + (k,v) is defined as "insert (k,v) into map m" (which is efficient).

Now, compare this to Algebird's aggregators, where instead of defining +, the user has to define a prepare function, in this case it would be prepare((k,v)) => Map((k,v)) That is, for every single pair (k,v) it has to instantiate an entire Map object containing a single k/v entry, and then merge it into the result using the monoidal op ++.

I ran into this when defining the monoid for TDigest, where I had to do some gymnastics inside of the plus function to make sure that a singleton-tdigest created by prepare could be merged in without tripping over quadratic inefficiencies. A model closer to Scala's combop and seqop (or ++ and +) would be more efficient (and, IMO, actually easier to think about). My definition of the trait IncrementingMonoid was an attempt to capture this more efficent behavior, where I add inc (aka +) into the trait.

@erikerlandson
erikerlandson Dec 3, 2015 Contributor

After working through #501, I think I can replace my IncrementingMonoid with MonoidAggregator, since append is the analog of inc

@johnynek johnynek and 1 other commented on an outdated diff Nov 18, 2015
.../src/test/scala/com/twitter/algebird/maps/mixed.scala
+ with INodePS[K, V, P] with INodeInc[K, V] with INodeNearMap[K, V] {
+ def lsub: NodeMix[K, V, P]
+ def rsub: NodeMix[K, V, P]
+ }
+ }
+
+ import tree._
+
+ object infra {
+ import com.twitter.algebird.maps.redblack.tree._
+ import com.twitter.algebird.maps.ordered.tree.DataMap
+
+ class Inject[K, V, P](
+ val keyOrdering: Numeric[K],
+ val valueMonoid: Monoid[V],
+ val prefixMonoid: IncrementingMonoid[P, V]) extends Serializable {
@johnynek
johnynek Nov 18, 2015 Collaborator

both valueMonoid and prefixMonoid have monoids on V. What if they are not related? Can we avoid using two monoids and just take a function from P => V here?

@erikerlandson
erikerlandson Nov 19, 2015 Contributor

They do different things, and serve orthogonal purposes. So they definitely aren't related. I presume it is only a problem where implicits are in play. Currently one of them is the different type IncrementingMonoid so it isn't generating any type conflict. I assume the same would be true if I can replace it with PreparedMonoid

@johnynek johnynek commented on the diff Nov 18, 2015
.../src/test/scala/com/twitter/algebird/maps/mixed.scala
+ with LNodePS[K, V, P] with LNodeInc[K, V] with LNodeNearMap[K, V]
+
+ trait INodeMix[K, V, P] extends NodeMix[K, V, P]
+ with INodePS[K, V, P] with INodeInc[K, V] with INodeNearMap[K, V] {
+ def lsub: NodeMix[K, V, P]
+ def rsub: NodeMix[K, V, P]
+ }
+ }
+
+ import tree._
+
+ object infra {
+ import com.twitter.algebird.maps.redblack.tree._
+ import com.twitter.algebird.maps.ordered.tree.DataMap
+
+ class Inject[K, V, P](
@johnynek
johnynek Nov 18, 2015 Collaborator

This is a very abstract name. Can you add a comment as to how it used?

@erikerlandson
erikerlandson Nov 19, 2015 Contributor

Yes, I'll add some doc to explain it

@johnynek johnynek commented on the diff Nov 18, 2015
.../src/test/scala/com/twitter/algebird/maps/mixed.scala
+
+ trait INodeMix[K, V, P] extends NodeMix[K, V, P]
+ with INodePS[K, V, P] with INodeInc[K, V] with INodeNearMap[K, V] {
+ def lsub: NodeMix[K, V, P]
+ def rsub: NodeMix[K, V, P]
+ }
+ }
+
+ import tree._
+
+ object infra {
+ import com.twitter.algebird.maps.redblack.tree._
+ import com.twitter.algebird.maps.ordered.tree.DataMap
+
+ class Inject[K, V, P](
+ val keyOrdering: Numeric[K],
@johnynek
johnynek Nov 18, 2015 Collaborator

why is it called a keyOrdering but it is a Numeric? Can we avoid Numeric which is a pretty strong type class.

@erikerlandson
erikerlandson Nov 19, 2015 Contributor

keyOrdering is just the general ordering relation for all keys, and in general it is only required to be Ordering. However, in the case of the subclasses that provide nearest-neighbor queries, keys must also support a notion of distance, defined as |k1 - k2|, and more generically obey the idea that when k1 < k2 < k3, then also k2 - k1 < k3 - k1.

In other words, for the nearest-neighbor subclass, the keys aren't just ordered, they are numeric. This works out nice and clean, since Numeric <: Ordering in scals's type system, so that subclass just tightens the type constraint to Numeric

@erikerlandson
Contributor

@johnynek, replacing IncrementingMonoid with the standard and more idiomatic MonoidAggregator works as expected.

@erikerlandson
Contributor

@johnynek @ianoc @avibryant
Any feedback and/or questions on the latest state?

@erikerlandson
Contributor

This is now available here:
https://github.com/isarn/isarn-collections

So I'm going to close this out

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment