Immutable TreeMap/TreeSet performance (SI-5331) #82

merged 37 commits into from Feb 16, 2012

6 participants


As discussed on the scala-internals mailing list. There are some further improvements related to performance and code.

Note: This breaks test/files/run/t2873.scala, since it uses RedBlack#Empty for testing for the absence of a compiler bug. I don't know how to fix this test so that it still tests for the same compiler bug.

erikrozendaal added some commits Dec 17, 2011
@erikrozendaal erikrozendaal Use RedBlack.iterator to create iterators for TreeSet/TreeMap.
This turns iterator creation from an O(n) operation into an
O(log n) operation. Unfortunately, it halves actual iteration
speed (consuming the iterator fully), probably due to the many
by-name closures that are needed.
@erikrozendaal erikrozendaal Use custom implementation for iterating over RedBlack trees. Raw
performance is much better than '++' based iterator.
@erikrozendaal erikrozendaal Optimized implementations of head/headOption/last/lastOption for
@erikrozendaal erikrozendaal Optimized implementation of init/tail for TreeSet/TreeMap. 9cdede8
@erikrozendaal erikrozendaal RedBlack.scala: Change count from 'def' to 'val' in NonEmpty tree
to ensure TreeSet/TreeMap 'range' operations are O(log n) instead
of O(n).
@erikrozendaal erikrozendaal Implemented drop/take/slice/splitAt/dropRight/takeRight for
TreeMap/TreeSet by splitting the underlying RedBlack tree. This
makes the operation O(log n) instead of O(n) and allows more
structural sharing.
@erikrozendaal erikrozendaal Implemented takeWhile/dropWhile/span to use tree splitting. This
changes the operation from O(n log n) to O(n) and allows for more
structural sharing.
@erikrozendaal erikrozendaal Switched from isSmaller to ordering. 3f66061
@erikrozendaal erikrozendaal Moved from implicit ordering value to implicit parameter. 7ec9b0b
@erikrozendaal erikrozendaal Moved from Empty case object to case class in preparation of moving
type parameter A.
@erikrozendaal erikrozendaal Moved type parameter A from RedBlack to Tree. 418adc6
@erikrozendaal erikrozendaal Changed abstract class RedBlack to singleton object. 6c0e036
@erikrozendaal erikrozendaal Use single shared Empty instance across all RedBlack trees. d2706db
@erikrozendaal erikrozendaal Make sure the redblack test compiles and runs. 6b95074
@erikrozendaal erikrozendaal Made RedBlack private to the scala.collection.immutable package.
Use ArrayStack instead of Stack in TreeIterator for slightly increased
@erikrozendaal erikrozendaal TreeMap/TreeSet no longer keep track of the size (the RedBlack tree
already does so).
@erikrozendaal erikrozendaal Performance improvements for iteration (foreach and iterator). b421bba
@erikrozendaal erikrozendaal Improved performance of RedBlack.NonEmpty.nth (helps take/drop/split/…
@erikrozendaal erikrozendaal Added some tests for TreeMap/TreeSet. ad0b09c
@erikrozendaal erikrozendaal Minimize number of calls to ordering. c51bdea

That's awesome. Have you signed the SCA? I can't merge this much work until that's in place.


Thanks for reminding me. Just mailed the signed SCA to Danielle at EPFL according to the instructions.


What is the gain of making it private? I could see having it private right from the beginning, but removing access to something that is public -- without even removing the code itself -- ought to bring some gain.

Sadly, it will make testing stuff on REPL a real pain.

Mainly to make it possible to further improve the implementation (performance) without causing more pain later, such as replacing the Empty singleton instance with null and relying less on polymorphism.

Right now seems a good time because RedBlack has changed so much anyway that it is definitely no longer source (or binary) compatible.

BTW, object these classes still leak out of TreeMap/TreeSet when serializing. Should we use custom serialization to avoid breaking this in the future? Is compatible serialization guaranteed by the Scala library at all?

I didn't realize -- or, rather, the coin didn't drop -- that the changes were that big on the source code compatibility front. If that's the case, I can only agree that it is a good time to make it private.


Since you put the test itself on the package, why do you need to import its contents?

Hmm... yes, this could be moved into "object Test" instead...

@dcsobral dcsobral and 2 others commented on an outdated diff Dec 28, 2011
def isEmpty: Boolean
def isBlack: Boolean
- def lookup(x: A): Tree[B]
- def update[B1 >: B](k: A, v: B1): Tree[B1] = blacken(upd(k, v))
- def delete(k: A): Tree[B] = blacken(del(k))
- def range(from: Option[A], until: Option[A]): Tree[B] = blacken(rng(from, until))
- def foreach[U](f: (A, B) => U)
- def toStream: Stream[(A,B)]
+ def lookup(x: A)(implicit ordering: Ordering[A]): Tree[A, B]
+ def update[B1 >: B](k: A, v: B1)(implicit ordering: Ordering[A]): Tree[A, B1] = blacken(upd(k, v))
+ def delete(k: A)(implicit ordering: Ordering[A]): Tree[A, B] = blacken(del(k))
+ def range(from: Option[A], until: Option[A])(implicit ordering: Ordering[A]): Tree[A, B] = blacken(rng(from, until))
+ def foreach[U](f: ((A, B)) => U)
dcsobral Dec 28, 2011

This is wrong. By passing the Ordering on each method, you make it possible for the tree to be inconsistent. Sure, someone has to work at it, and having the class private helps prevent such misuse, but the very possibility indicates the ordering is in the wrong place. I really dislike this design.

Instead, the tree itself should have an Ordering (and getting A back), which kind of gets back to RedBlack being a class and its subclasses having a hidden pointer to it -- which would be required to access the Ordering. However, as long as TreeMap and TreeSet don't extend RedBlack, it shouldn't be a problem. RedBlack becomes just a store for the Ordering shared by the nodes.

viktorklang Dec 29, 2011

I agree w Daniel

erikrozendaal Dec 29, 2011

This was another reason I feel RedBlack (as it currently is in this pull request) should be private[immutable]. This way TreeMap/TreeSet can ensure a consistent Ordering[A] is always used.

Making RedBlack a class again could help with this. The costs are additional instances of this class (including the Empty nested object) and the additional $$outer pointer in the Tree nodes (this may or may not cause additional overhead, depending on alignment, etc). At least on MacOS X JDK 1.6.0_29 with compressed oops there was additional memory usage per node.

So definitely a tradeoff between performance and potential correctness.

viktorklang Dec 29, 2011

That's an understandable tradeoff. conserving memory is important for datastructures.

@dcsobral dcsobral and 2 others commented on an outdated diff Dec 29, 2011
- def iterator: Iterator[(A, B)] =
- left.iterator ++ Iterator.single(Pair(key, value)) ++ right.iterator
+ override def foreach[U](f: ((A, B)) => U) {
+ if (left ne Empty.Instance) left foreach f
dcsobral Dec 29, 2011

I'm pretty sure this is a pessimization. You introduced two test conditions that will be executed on every node except empty ones . Much better to have Empty do nothing.

erikrozendaal Dec 29, 2011

Strangely enough this makes a huge different in foreach. Looping over a random TreeMap[Int, Int] size 10,000:

  • without guard conditions: ~3150 times per second
  • with guard conditions: ~7100 times per second

For a TreeSet[Int] of the same size:

  • without guard conditions: ~3000 times per second
  • with guard conditions: ~6000 times per second

(TreeSet is slower than TreeMap, I suspect due to unboxing of Ints when invoking the supplied foreach body)

See for the basic benchmark used (there must be an easier way to do this though).

This is why I feel using plain nulls and less polymorpishm will give another speed boost. For example, the TreeSet "head" method is still about 3 times slower than the java.util.Map "first", probably due to the need to a needed method call in RedBlack.NonEmpty.smallest.

viktorklang Dec 29, 2011

nullchecks are gigacheap

dcsobral Dec 29, 2011

That's bizarre. Are you sure this was the only thing changed on the benchmark? On the other hand, if are keeping the method this way, you might as well move the whole foreach to Tree, put a self condition for the empty tree, and make it final. Experiments with Range have shown this (making the method final in the class it is called) gives the best performance possible.

erikrozendaal Dec 29, 2011

I think pointer comparisons are just a lot faster than polymorphic calls. For a tree of size n there will be n+1 pointers to the empty node, so we save on half the calls to foreach. It may also help that the call sites to left.foreach() and right.foreach() are now monomorphic, maximizing the efficiency of the inline method call cache.

Moving the code to Tree and making it final is a bit harder though. Tree doesn't have access to key/value/left/right, so would need to do downcasting. Again, using null to represent the empty node and moving key/value/left/right to Tree would be optimal but is quite a complicated undertaking...


I've updated the RedBlack implementation to use null for representing the empty tree. Also the iterators were further optimized. You can find the implementation at

Below you can see a performance comparison between the tree from this pull request (up to commit erikrozendaal@c51bdea), the null based redblack tree without inlining the accessors to the Tree fields, null based with inlining, java.util.TreeMap and scala.collection.immutable.HashMap. All measurements are based on a map with 10,000 random integer keys/value pairs. Run with Java 1.6.0_29 with -d64 -server. Numbers are operations per second.

pull/82 with nulls, use getters with nulls, inlined getters java.util.TreeMap immutable.HashMap
lookup 15,331,921 25,107,208 25,288,542 24,927,772 46,726,788
add 1,876,683 2,768,516 5,216,653 6,415,887
add & remove 630,904 1,261,497 2,225,205 3,193,698 2,933,961
head 30,097,999 41,659,029 69,986,842 66,009,215 46,456,097
last 25,671,173 36,402,421 63,042,957 67,400,887
iterator (create) 3,850,423 26,174,516 27,073,711 58,404,392 180,170,441
iterator (first) 3,741,934 25,085,605 25,207,521 42,033,231 46,981,884
iterator (all) 3,594 11,836 11,925 9,329 4,652
foreach 8,099 15,375 15,258 7,058
drop(5) 269,805 669,568 732,468
take(5) 1,529,728 3,544,943 6,643,459
splitAt(size / 2) 204,768 451,053 567,126
range(-1e8, 1e8) 385,453 712,859 901,274 53,214,983
range(-1e8, 1e8),headOption 379,835 700,441 885,708 12,531,721

As you can see the null based implementation is 2-4x faster than the one with the Empty singleton node. When inlining the accessors it is also very close to performance of the java version. The iteration optimization make it even faster than the Java TreeMap or the immutable.HashMap. The main area where the java.util.TreeMap still wins is with the range related operations. In the Java version subMap creates a view (O(1)) while the Scala version constructs a new tree (O(log n)).


  • Is this new code acceptable?
  • If yes, shall I update this pull request or create a new one after this one is merged?

It looks pretty good to me but I have to count on the others in this thread to help me spot issues as I have no time for the fine toothed comb treatment. I'm sorry you had to resort to "@(inline @getter)" -- if you can put together some evidence tht direct field access is significantly faster, maybe we can convince martin to make it less difficult to achieve.


I'm wondering if you shouldn't leave the current public implementation
of rb tree alone, and provide yours as a new private implementation,
plus the changes to TreeSet and TreeMap. The public implementation can
be deprecated, and we then look at the migration impacts of not having
it as ancestor to TreeSet/TreeMap.

Yes, I thought about that too. Shouldn't be a problem and it may help avoid some level of source/binary compatibility. I wonder how many people really use RedBlack directly though. It has "implementation detail" written all over it (few docs, API not consistent with standard collections, etc). But maybe it's better to be safe than sorry :)


I think that's a great idea. Yes, we don't really have the luxury of assuming something isn't much used, we have to suffer the same restrictions on everything. A new implementation makes my life much simpler.


I had a look at the benchmark and it seems to me that operations are being measured without using the result. Is this true or did I miss something? If that is true, then benchmark results are potentially inaccurate.


To give a concrete example:

bench("head", tree, count) { _.head }

def bench(name: String, tree: TreeSet[Int], count: Int)(block: TreeSet[Int] => Unit): Unit = {
  val s1 = time {
    for (_ <- 1 to count) {
  println("%-25s: size: %7d: %7d iterations in %6.3f seconds (%12.2f per second)".format(name, tree.size, count, s1, count / s1))

(Micro)benchmarking hotspot is indeed very hard, especially to get consistent results, even from run to run :(

I originally started with an implementation of bench that returned the last result. Later I switched to unit to reduce benchmark overhead. This didn't seem to change relative speeds between the various different benchmarks, so I'm pretty sure hotspot isn't totally optimizing away the code.

Using the following version of bench:

def bench[A, B, C](name: String, tree: TreeMap[A, B], count: Int)(block: TreeMap[A, B] => C): C = {
  var result: C = null.asInstanceOf[C]
  val s1 = time {
    for (_ <- 1 to count) {
      result = block(tree)
  println("%-25s: size: %7d: %7d iterations in %6.3f seconds (%12.2f per second)".format(name, tree.size, count, s1, count / s1))

Changes the head benchmarks to 31.5M / 44M for getters vs inlined getters respectively. So although the numbers change I don't think it affects the conclusion.

Finally, it is hard to tell the affect of micro-benchmark results on real world programs...

erikrozendaal added some commits Jan 6, 2012
@erikrozendaal erikrozendaal Deprecate TreeMap.isSmaller and TreeSet.isSmaller.
These methods were used by the old RedBlack tree implementation,
but are no longer required and were not defined in any interface.
Use ordering or compare instead.
@erikrozendaal erikrozendaal Restore old RedBlack class to maintain backwards compatibility.
The class is marked as deprecated and no longer used by the TreeMap/TreeSet
implementation but is restored in case it was used by anyone else (since
it was not marked as private to the Scala collection library).

Renamed RedBlack.{Tree,RedTree,BlackTree} to Node, RedNode, and BlackNode
to work around name clash with RedBlack class.

I restored the old implementation of RedBlack class in the commit erikrozendaal@f656142 and marked the class as deprecated.

This required renaming the {Red,Black}Tree classes in the RedBlack object to avoid naming conflicts. I renamed them to {Red,Black}Node. Another option is to rename the RedBlack object to something else instead (so it is no longer a companion of the deprecated RedBlack class).

Let me know what you think and I'll adjust it and add it to this pull request.


I think it would be preferable to avoid the existing public object altogether, so that it can be removed when the deprecation expires.

The two RedBlack implementations are totally independent (one is a class, the other an object) that happen to share the same name. But your comment convincend me that the updated implementation should have a new name. So I'll rename it to RedBlackTree and put it in its own file.

@erikrozendaal erikrozendaal Renamed object RedBlack to RedBlackTree.
This more clearly separates the new implementation from the now
deprecated abstract class RedBlack and avoids naming conflicts
for the member classes.

I added the null based implementation to this pull request. The old RedBlack class has also been restored and the new implementation is named RedBlackTree. I wonder if it is worth the effort to rewrite the history so that all binary compatible changes are at the start and the new RedBlackTree based implementation is added later. This would make integration with the next 2.9.x release easier. I could also just make a separate pull request for 2.9.x that only contains the binary compatible changes (mainly improvements to head/last/iterator/toStream).


The fact that micro-benchmarks on HotSpot are hard is more reason not to break the rules. There is enough uncertainty as it is without adding more. Also, if you want to reduce overhead, do not use Range (use a while loop). At least until the optimization efforts on that are finished.

@erikrozendaal erikrozendaal Tests for takeWhile/dropWhile/span.
Also simplified implementation of span to just use splitAt.

Good point! I'll rerun the benchmarks using the following benchmark method:

  def bench[A, B](name: String, tree: TreeMap[A, B], count: Int)(block: TreeMap[A, B] => Int => Int): Unit = {
    print("%-25s: ".format(name))
    val f = block(tree)
    var result = 0
    val elapsed = time {
      var i = 0
      while (i < count) {
        result += f(i)
        i += 1
    println("size: %7d: %7d iterations in %6.3f seconds (%12.2f per second): result = %s".format(tree.size, count, elapsed, count / elapsed, result))

    val benches = Map[String, (TreeMap[java.lang.Integer, java.lang.Integer] => Int => Int)](
      "lookup" -> { tree => n => if (tree.contains(rnd(n))) 1 else 0 },
      "add" -> { tree => n => tree.updated(rnd(n), n).size },
      "add-remove" -> { tree => n => (tree.updated(rnd(n), n) - rnd(n)).size },
      "head" -> { tree => n => tree.head._1 },
      "last" -> { tree => n => tree.last._1 },
      "tail" -> { tree => n => tree.tail.size },
      "init" -> { tree => n => tree.init.size },
      "iterator" -> { tree => n => if (tree.iterator.hasNext) 1 else 0 },
      "" -> { tree => n => },
      "iterator.size" -> { tree => n => tree.iterator.size },
      // etc...

I'll also run one benchmark at a time (in a single JVM), this seems to give best/most consistent results.


Here are the updated benchmark results. Complete results can be found at (except for the java.util.TreeMap output, which accidentally overwrote). Numbers changed, sometimes radically, but I think the main conclusions are still the same.

#82 with nulls, use getters with nulls, inlined getters java.util.TreeMap immutable.HashMap
lookup 3,801,779 7,362,195 7,397,677 7,572,085 24,217,850
add 1,192,201 1,794,593 2,427,549 4,528,338
add & remove 392,738 867,402 1,016,917 3,528,667 2,465,620
head 28,917,103 40,490,928 61,891,427 57,270,760 41,516,223
last 25,028,036 32,569,793 52,351,476 56,938,347
iterator (create) 4,667,903 27,797,584 27,981,034 72,102,692 366,207,914
iterator (consume first) 4,567,211 24,348,634 24,282,580 37,049,432 42,535,377
foreach 7,228 12,541 12,694 9,231
iterator (consume all) 3,583 10,052 9,895 9,147 6,202
drop(5) 256,134 632,225 696,058
take(5) 2,044,954 4,463,086 7,529,279
splitAt(size / 2) 188,260 367,508 484,105
range(-1e8, 1e8).head 382,497 693,286 880,981 12,473,028

Here are the results for OpenJDK build 1.7.0-u4-b05-20120106. It's nice to see the pattern match code got a decent speed boost (good news for more idiomatic Scala code with a lot of pattern matching). Also inlining of field access has less effect now. Still, the null based code is still the faster version of RedBlack.

PS Don't forget that the iterator based comparison is not fair between "#82" and the null based versions because additional algorithmic improvements were made in the meantime.
PPS All these numbers are running in 64-bit mode with compressed OOPS enabled, just like the previous 1.6.0_29 based results.

#82 (pattern matching, singleton empty) with nulls, use getters with nulls, inlined getters java.util.TreeMap immutable.HashMap
lookup 4,507,130 7,299,700 7,240,064 7,181,138 29,847,773
add 1,746,636 2,534,294 2,553,912 4,696,720
add & remove 756,158 1,072,570 1,060,522 3,496,455 2,608,751
head 25,772,815 43,935,453 58,751,122 59,795,623 43,097,982
last 22,547,783 37,729,653 52,286,158 59,816,398
iterator (create) 5,353,564 28,918,386 28,923,354 73,329,891 375,772,141
iterator (consume first) 5,572,964 23,698,292 23,899,985 35,495,586 42,454,982
foreach 8,054 12,523 12,411 9,541
iterator (consume all) 5,274 11,105 13,521 10,574 8,997
drop(5) 394,074 749,614 698,918
take(5) 3,680,581 7,920,021 7,778,135
splitAt(size / 2) 351,272 500,936 527,641
range(-1e8, 1e8).head 645,592 945,271 958,675 25,801,491


Please take a look at 'to' method implementation.
From the 4 subrange methods defined in scala.collection.generic.Sorted three (from, range, until) are implemented as single call to rangeImpl unlike 'to' which require two calls to 'rangeImpl' + creating and twice advancing subtree iterator in the worst case.

I wonder if it's worth to reimplement this method in TreeMap/TreeSet for better performance?

Also, it would be interesting to benchmark these four methods, especially "to(5)" vs. "until(6)".
I believe fast subrange creation is one of important scenarios for use of TreeSet/TreeMap.

@erikrozendaal erikrozendaal Optimized implementation of TreeMap/TreeSet#to method.
Performance of `to` and `until` is now the same.

Hi Pavel,

I benchmarked the range operations and improved the implementation a bit such that to now has the same performance as until.

old to optimized to java.util.TreeMap
range(-1e8, 1e8).head 984,096 938,338 25,801,491
from(1).head 1,339,841 1,214,761 23,861,585
to(0).last 551,214 1,132,467 28,157,299
until(1).last 1,148,834 1,137,265 29,828,129

As you can see the java.util.TreeMap version is still much faster due to the use of views instead of building new trees. So to really get competitive with the Java version we need to support views (or ranged iterators, or something) and probably also implement methods like lower/floor/ceiling/higher, like java.util.NavigableMap.


Hi Erik,

It's very cool that you could achieve such impressive performance improvements for your scenario.
However, I'm thinking about another performance-sensitive scenario: when we have zillion short-lived maps of moderate size each (say from 1 to 100 elements) instead of your scenario of having one map with zillion elements.

In that case another overheads come to the fore.
For example, looking into your last commit, you create 4 temporary objects on each call to range: two Options and two functions. Replacing option test here with null drops 2 of them and also gives faster calls to after and before in rng.

Also, you check ***Inclusive invariants each time when after or before is called. It's better to create specialized closure for each case in range.

In general, it would be great to benchmark the scenario I mentionned above and get the real numbers.


@erikrozendaal erikrozendaal Custom coded version of range/from/to/until.
This avoids unnecessary allocation of Option and Function objects,
mostly helping performance of small trees.

FYI since everyone is doing such a fine job shaking the tree, I figure I'll let that continue a while longer. (In other words, I haven't looked seriously at this patch and intend to let it mature as far as it can before I do.)



I've some more improvements to range/from/to/until and take/drop/slice, basically by specializing RedBlackTree.rng for each of these cases. I'll try to get the benchmark numbers and patches posted here tonight CET.

erikrozendaal added some commits Jan 22, 2012
@erikrozendaal erikrozendaal Custom implementations of drop/take/slice.
This mainly helps performance when comparing keys is expensive.
@erikrozendaal erikrozendaal Removed TODOs. 51667dc

Here are the benchmark numbers related to the last few commits. I had to change the benchmarks somewhat to get decent numbers with small trees. So instead of getting the head or last element I just get the size of the resulting tree. For some benchmarks the random seed also had to be changed. All noted in the table below.

master default to (f26f610) optimized to (00b5cb8) specialized from/to/until/range (7824dbd) specialized drop/take/slice (78374f3)
Size 10000, Seed 1234
range(-1e8, 1e8).head 984,096 938,338 1,274,907 1,274,907
from(1).head 1,339,841 1,214,761 1,358,472 1,358,472
to(0).last 551,214 1,132,467 1,208,318 1,208,318
until(1).last 1,148,834 1,137,265 1,246,551 1,246,551
Size 10000, Seed 1234
drop(5).size 762,210 761,143 783,775 825,173
take(5).size 8,613,232 6,491,055 9,433,669 14,746,586
splitAt(size / 2).size 578,984 504,674 616,517 654,141
slice(3, size - 3).size 373,338 348,044 397,642 414,725
Size 10, Seed 1234
range(-1e9, 1e9).size 8,213,661 11,389,161 8,291,250 14,930,028 15,113,249
from(1).size 8,630,657 13,060,927 11,276,415 14,540,512 14,371,185
to(0).size 1,052,989 3,840,600 11,524,856 14,487,387 14,262,957
until(1).size 7,775,261 11,946,964 11,365,586 14,217,502 14,295,698
Size 10, Seed 1235
drop(5).size 281,202 11,900,665 10,404,720 12,763,176 14,397,612
take(5).size 347,416 11,831,718 10,036,927 15,386,880 16,705,482
splitAt(size / 2).size 587,840 5,683,290 4,796,227 7,146,472 8,209,405
slice(3, size - 3).size 320,606 6,529,863 5,025,384 9,531,225 11,963,432

I think I'm pretty much done with performance optimizations. I think it is now more interesting to discuss how we can get this patch into 2.10 and also see if there are binary compatible parts that can be put into 2.9. Maybe at least deprecate RedBlack starting with 2.9.2?


A migration warning on TreeSet/TreeMap is necessary as well. I don't think there's any need to rush deprecation to 2.9.2 -- it can be deprecated on 2.10.


So the migration warning should go into 2.9.x branch? Or 2.10? Warning about RedBlack inheritance, isSmaller, etc?


Actually, I've changed my mind. I was thinking about warning about the fact that TreeSet no longer extends RedBlack, but any dependency on it will result in a compile error, so there's no need for it.


So, anything else that needs to be done before merging this into 2.10?


I will interpret the petering out of discussion as implied endorsement by the involved parties if someone doesn't say otherwise. (It's still possible I personally will discover objections, but let's hope not.)

@paulp paulp merged commit 51667dc into scala:master Feb 16, 2012
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment