Radix-Balanced Finger Tree Vectors #8534

Open · szeiger wants to merge 4 commits into base: 2.13.x from szeiger:wip/new-vector

Conversation

szeiger commented Nov 10, 2019

This is a complete rewrite of the old scala.collection.immutable.Vector. The current draft contains my raw development history and still calls the new version NVector for easy benchmarking and testing. I duplicated the obvious tests and added more NVector-specific tests but after the switch-over the complete test suite may reveal additional bugs. The final version will also have to be opened against 2.14.x (which doesn't exist yet). It is not expected to be binary compatible with the 2.13 implementation.

I recommend the RRB tree papers by Bagwell/Rompf and Stucki as background. Like the old Vector, the new one does not use a "relaxed" radix-balanced tree, but these papers also give a good overview of the non-relaxed design and of the specific optimizations for it in the old Vector implementation.

Unlike other data structures that have been proposed as replacements for Vector as the default immutable.IndexedSeq, this one aims to be a proper replacement for Vector that does not make any performance trade-offs.

Design

My design differs from the old one in one important aspect: Instead of a single movable finger (the "display" pointers) it has a fixed finger at each end. These fingers are part of the proper data structure, not a transient optimization. There are no other ways to get to the data in the fingers (it is not part of the main data array), no focus point, and no stabilization. As a side-effect, the data structure is completely immutable and no extra memory fences are required. I am not aware of existing descriptions or implementations of this data structure. Based on related structures "radix-balanced finger tree" seems to be an appropriate name.

A second major difference lies in the implementation. Whereas the old Vector uses a single class for collections of all supported sizes, the new one is split into Vector0 to Vector6 for the different dimensions of the main data array. Vector0 is a singleton object for the empty vector, Vector1 (used for collections up to size 32) is essentially the same as ArraySeq, all higher dimensions are finger trees.

I prefer to think of the nested arrays containing the data as hypercubes instead of trees. For a full VectorN we start with an n-dimensional hypercube data_n and slice off an n-1-dimensional prefix_n-1 and suffix_n-1. We continue to slice off at the beginning of the prefix and the end of the suffix until we reach prefix1 and suffix1. The length of each side of the initial hypercube is 32 (same as the old Vector). After slicing, the highest dimension of data can have a length up to 30, prefix1 and suffix1 can be up to 32, and the intermediate prefixes and suffixes up to 31. When the original data does not fill the entire hypercube, there may be empty parts both at the beginning (the offset) and the end. When the offset is 0 we call the vector "left-aligned". Notice that the slicing ensures that only the highest dimension of each data array can be incomplete. All lower dimensions always have a length of 32 and are filled entirely.
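
To make the layout concrete, here is a minimal sketch (my illustration, not the PR's code; the field names, the running-length bookkeeping and the assumption of a left-aligned vector with offset 0 are simplifications) of how an index could be resolved in a 3-dimensional vector with fingers:

    // Hypothetical sketch of a 3-dimensional radix-balanced finger tree vector.
    // The real implementation also stores running lengths and handles a non-zero offset.
    class Sketch3[A](
        prefix1: Array[AnyRef],               // 1 to 32 elements
        prefix2: Array[Array[AnyRef]],        // up to 31 fully filled slices of 32
        data3:   Array[Array[Array[AnyRef]]], // up to 30 fully filled 32x32 planes
        suffix2: Array[Array[AnyRef]],        // up to 31 fully filled slices of 32
        suffix1: Array[AnyRef]                // 1 to 32 elements
    ) {
      private val len1    = prefix1.length
      private val len12   = len1 + prefix2.length * 32
      private val len123  = len12 + data3.length * 32 * 32
      private val len1234 = len123 + suffix2.length * 32
      val length: Int     = len1234 + suffix1.length

      def apply(i: Int): A = {
        val r =
          if (i < len1) prefix1(i)
          else if (i < len12)   { val j = i - len1;   prefix2(j >>> 5)(j & 31) }
          else if (i < len123)  { val j = i - len12;  data3(j >>> 10)((j >>> 5) & 31)(j & 31) }
          else if (i < len1234) { val j = i - len123; suffix2(j >>> 5)(j & 31) }
          else suffix1(i - len1234)
        r.asInstanceOf[A]
      }
    }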

Balancing requires that prefix1 and suffix1 are both non-empty. There is no obvious advantage from doing the same for higher-dimensional prefixes and suffixes; it would actually make appending and prepending more expensive. They only get rebalanced when it is necessary for prefix1 or suffix1. There is no requirement that the main data array be non-empty, either. For example, when appending an element to a completely filled and left-aligned Vector3, the existing prefix1, prefix2 and data3 become prefix1, prefix2 and prefix3 of a new Vector4. data4, suffix3 and suffix2 remain empty and suffix1 contains the new element.

Building is done with a single NVectorBuilder for all dimensions. It can be initialized with an existing vector structure for more efficient concatenation. I did not get around to implementing an optimized concatenation where the right-hand side vector determines the alignment. Like most other algorithms for this data structure, it is quite easy to visualize how it works and convince yourself that it is possible, but actually implementing it correctly is much harder. This would allow concatenation of two vectors of sizes m and n in O(min(n, m) + log(n+m)) time.

There is another builder class NVectorSliceBuilder for non-trivial slicing operations. The hypercube fingers are very well suited for slicing because any slicing will automatically result in array slices with dimensions in the correct order for a new vector. Only minor rebalancing is required in some cases.

Memory Usage

The old Vector has a huge overhead for small instances. In addition to the actual data arrays, a Vector object adds 56 bytes (in 64-bit HotSpot with compressed pointers). By having different classes for different dimensions, this goes down to 16 bytes for an NVector1, the same as an immutable.ArraySeq. The overhead grows up to 80 bytes for an NVector6. The break-even is at NVector4. Constant overhead for these larger sizes should be inconsequential. It could be reduced by not storing the intermediate prefix lengths, at the expense of slower access to the prefix.

Additionally, all instances built by VectorBuilder have their data arrays padded to 32, which constitutes an extra 124 bytes of overhead for a Vector of size 1 that was built with a VectorBuilder. NVectorBuilder always truncates the arrays to avoid this.
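
The arithmetic behind that number is presumably (assuming compressed 4-byte references, as above): a size-1 Vector built by VectorBuilder carries a 32-slot data array of which 31 slots are unused, i.e. 31 × 4 bytes = 124 bytes of padding.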

Benchmarking

Benchmarks were run with

bench/jmh:run -jvmArgs "-Xms256M -Xmx10G" scala.collection.immutable.VectorBenchmark2.nv
bench/jmh:run -jvmArgs "-Xms256M -Xmx10G" scala.collection.immutable.VectorBenchmark2.v

while the initialization of the unnecessary part (v or nv) was commented out. If you have more than 16G of memory you should be able to choose a larger heap size and run everything in one go.

Interactive diagrams can be found at http://szeiger.de/tmp/vector2/vector-results.html. All diagrams plot time against number of elements, both on logarithmic scales. The diagrams on that page are interactive so you can explore the exact numbers for all data points. Due to the double logarithmic scales and the huge range of instance sizes, even big differences can appear very small in the diagrams.

The diagram generation code is currently part of this draft PR. We may want to split it off into a separate project for more general use or just keep it here because it's only a small amount of code.

Performance

We can see small improvements for sequential apply and very similar results for random apply. Other than micro-optimizations, there isn't much that can be done here:

ApplyRandom
ApplySequential

Updating with random indexes shows the expected logarithmic growth but the new implementation is several times faster due to lower overhead:

UpdateRandom

Sequential updates are the Achilles heel of the radix-balanced finger trees, at least in theory. The movable finger of the old Vector allows sequential updates in what appears to be amortized constant time (i.e. an update positions the finger at the updated position and then further updates in the vicinity can access the location more directly). Due to index computations being based on integers which have to grow logarithmically, this is not really the case, but as long as all indexes are represented by the same 32-bit Int type, the performance graph is essentially flat. The new Vector cannot do this optimization and therefore the logarithmic growth is directly visible when stepping up in vector dimensions. Fortunately, it is so much faster to begin with and the base-32 logarithm grows so slowly that it still wins at larger sizes:

UpdateSequential

Sequential appending and prepending is efficient and has low overhead in the new Vector. The old Vector also makes it efficient by positioning the finger appropriately but the overhead is much larger so that the new implementation ends up being several times faster:

Append
Prepend

Alternating between appending and prepending is a worst-case scenario for the old Vector because the finger always ends up in the wrong location (i.e. you get all the overhead but none of the benefit). The new implementation is about 35 to 40 times faster in this case:

Apprepend

All bulk-appends benefit from initializing an NVectorBuilder with the existing left-hand-side NVector, plus other optimizations. Appending an ArraySeq of the same size is several times faster:

BulkAppend100p

This is even more apparent when we repeatedly append a collection only 1/10 the size, where the new implementation is about 25 to 50 times faster in non-trivial cases:

BulkAppend10p

Both implementations have optimizations for small bulk-appends. Below a certain size limit for the collection being appended, individual appends are performed instead of using a builder. The new implementation optimizes this further: When the data to be appended fits completely into suffix1 (which allows for appends up to 31 elements depending on alignment), it is done as a bulk copy into the array without the need to create either a builder or intermediate vectors. As a result, appending a collection of size 2 is about 8 to 10 times faster:

BulkAppend2
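
The fast path for the suffix1 case described above could look roughly like this (a sketch with made-up names; the real appendedAll also updates the length fields and handles all the other dimensions):

    // Hypothetical fast path: the appended elements fit into the (at most 32-wide) suffix1.
    // One array allocation, no builder and no intermediate vectors.
    def appendedAllSmall(suffix1: Array[AnyRef], extra: Array[AnyRef]): Array[AnyRef] = {
      val newSuffix1 = java.util.Arrays.copyOf(suffix1, suffix1.length + extra.length)
      System.arraycopy(extra, 0, newSuffix1, suffix1.length, extra.length)
      newSuffix1 // the caller would wrap this in a new vector instance with updated lengths
    }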

When concatenating two vectors, we can make use of the fact that both are built from array slices. We should explore generalizing this to other array-based collections in the future but this will require new abstractions in the collections library. For now the optimization is limited to NVector. Instead of appending the right-hand side element by element to the NVectorBuilder, we can append a whole slice of 32 elements with (depending on alignment) one or two System.arraycopy calls. This is another 2 to 3 times faster than appending individual elements:

BulkAppendSame

Iterating with foreach and iterator is pretty much the same in both implementations, with some wins for the new one thanks to better optimization for small instance sizes:

Foreach
Iterator

With the balancing requirement of non-empty prefix1 and suffix1, both head and last are only one array indexing operation away in the new implementation. The old Vector still does well on head because the instances for the benchmark were built with a VectorBuilder (which always results in the focus being on the first element) but shows logarithmic growth on last, where the new implementation is still O(1):

Head
Last

tail is 8 to 10 times faster in the new Vector. The rebalancing is amortized constant time but it gets a bit slower for higher dimensions because the overhead for managing the instances grows:

Tail

Building with a Builder is a bit faster for most instance sizes in the new implementation. The old one wins for very small instances because it doesn't truncate the arrays and therefore creates a huge constant overhead for small vectors (see above):

Build

map is essentially unoptimized in the old implementation. It uses the default implementation based on Iterator + Builder which is decently optimized and good in terms of big-O-complexity but we can still do much better by rebuilding the existing structure:

MapNew

The new implementation also has map-conserve semantics as an internal optimization. While a map call always results in a new NVector instance, the internal arrays (at all dimensions) are shared if none of the entries change. Since arrays are treated as fully immutable there are no ownership issues. I opted for a conservative implementation which is essentially free for the non-conserving case (i.e. there is no observable performance penalty when mapping everything to a new object). The performance improvements for map(identity) are still decent (but it could be about twice as fast with a dedicated mapConserve implementation):

MapIdentity
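
At the level of a single leaf array, the sharing idea sketched above amounts to something like this (a simplified illustration, not the PR's actual map code):

    // Map one leaf array, but return the original array unchanged (so it can be shared
    // structurally) if the function did not produce any new values.
    def mapConserveArray(a: Array[AnyRef], f: AnyRef => AnyRef): Array[AnyRef] = {
      var i = 0
      while (i < a.length) {
        val x = a(i)
        val y = f(x)
        if (y ne x) {
          // first change found: copy what we have seen so far, then map the rest eagerly
          val b = java.util.Arrays.copyOf(a, a.length)
          b(i) = y
          i += 1
          while (i < a.length) { b(i) = f(a(i)); i += 1 }
          return b
        }
        i += 1
      }
      a // no entry changed: share the original array
    }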

@scala-jenkins scala-jenkins added this to the 2.13.2 milestone Nov 10, 2019

mdedetrich commented Nov 10, 2019

Wow, @Ichoran what are your thoughts on this?

Ichoran commented Nov 10, 2019

Wow, this looks fantastic, Stefan! I don't have time right now to read through all the code, but the bits that I sampled all look reasonable (fairly repetitive, but that's expected given the nature of the data structure). The algorithmic design seems sound in concept; I'll try it on collection-laws when I get a chance to see how it holds up in practice.

Anyway, this is great! Maybe we could introduce it in a separate dependency in the meantime so we don't have to wait for 2.14 to get it?

odersky commented Nov 10, 2019

This looks really impressive, Stefan! Can we swap the implementations in the 2.13 series? If not, make it the standard for 3.0 maybe?

* the prefix). The level is increased/decreased when the affected side plus main data is already full/empty
* - All arrays are left-aligned and truncated
*
* In addition to the data slices (`prefix`, `prefix2`, ..., `dataN`, ..., `prefix2`, `prefix1`) we store a running

diesalbla Nov 10, 2019

Is there any reference we may cite, regarding the design and implementation of radix-balanced finger trees? For instance, is this ICFP article closely related?

szeiger Nov 11, 2019

As I mentioned above I am not aware of any pre-existing references. The Bolivar paper (I found a PDF here) covers RRB trees, like the two papers I referenced in the PR description. My implementation uses non-relaxed trees. Relaxed trees can greatly improve insertion and concatenation performance but will show worse results for indexed access. I think it would be interesting to see if RBF trees can be generalized to RRBF trees. I expect it to be possible in principle but with higher implementation complexity than for a relaxed version of normal RB trees.

The "Off-Tree Tail" and "Left Shadow" optimizations from section 2.3 are worth mentioning in this context. The fingers are a generalization of an off-tree tail. While the off-tree tail is only the last one-dimensional slice, a finger extends this to all dimensions from 1 to n-1 for an n-dimensional vector. The left shadow is implemented by the old Vector. The finger on the left now provides the same advantages with a more compact memory representation.

retronym Nov 12, 2019

Should this be:

(`prefix`, `prefix2`, ..., `dataN`, ..., `suffix2`, `suffix1`)

?

szeiger Nov 12, 2019

No, it should be (`prefix1`, `prefix2`, ..., `dataN`, ..., `suffix2`, `suffix1`). I got 3 of 5 names wrong...

linasm commented Nov 11, 2019

Amazing results! If 2.13 is really out of question, maybe https://github.com/scala/scala-collection-contrib would be the fastest way of making it available?

@dwijnand dwijnand modified the milestones: 2.13.2, 2.14.0-M1 Nov 11, 2019

odd commented Nov 11, 2019

Very impressive! 🚀
Wanting to see which binary compatibility problems MiMa detected, I renamed the old implementation OVector and the new one Vector. This should make it easier to generate the MiMa breakage report and it also makes any classes in the standard library using Vector use the new implementation in their tests instead of the old (e.g. VectorMap). The code can be found at szeiger#2.

retronym commented Nov 12, 2019

We ought to warm up the benchmarks in a way that ensures we measure the megamorphic dispatch of call sites that range over NVectorN.

Right now, I see inlining logs like:

sbt:root> bench/jmh:run -jvmArgs -Xmx1G scala.collection.immutable.VectorBenchmark2.nvApprepend -f1 -psize=500 -jvmArgs -XX:+UnlockDiagnosticVMOptions -jvmArgs -XX:+PrintInlining
...
[info]                               @ 22   org.openjdk.jmh.infra.Blackhole::consume (49 bytes)   disallowed by CompilerOracle
[info]                               @ 19   scala.collection.immutable.VectorBenchmark2::nvApprepend (70 bytes)   force inline by CompilerOracle
[info]                                 @ 1   scala.collection.immutable.VectorBenchmark2::nv (5 bytes)   accessor
[info]                                 @ 6   scala.collection.immutable.VectorBenchmark2::nv (5 bytes)   accessor
[info]                                 @ 14   scala.collection.immutable.VectorBenchmark2::size (5 bytes)   accessor
[info]                                 @ 29   scala.collection.immutable.VectorBenchmark2::o (5 bytes)   accessor
[info]                                 @ 32   scala.collection.immutable.NVector2::appended (6 bytes)   inline (hot)
[info]                                  \-> TypeProfile (300500/300500 counts) = scala/collection/immutable/NVector2
[info]                                   @ 2   scala.collection.immutable.NVector2::appended (197 bytes)   inline (hot)
[info]                                     @ 18   scala.collection.immutable.NVectorStatics$::copyAppend (18 bytes)   inline (hot)
[info]                                       @ 5   java.util.Arrays::copyOf (13 bytes)   inline (hot)
[info]                                         @ 3   java.lang.Object::getClass (0 bytes)   (intrinsic)
[info]                                         @ 6   java.util.Arrays::copyOf (46 bytes)   (intrinsic)
[info]                                     @ 23   scala.collection.immutable.NVector2::length (5 bytes)   accessor
[info]                                     @ 30   scala.collection.immutable.NVector2::copy$default$1 (5 bytes)   accessor
[info]                                     @ 36   scala.collection.immutable.NVector2::copy$default$2 (5 bytes)   accessor
[info]                                     @ 42   scala.collection.immutable.NVector2::copy$default$3 (5 bytes)   accessor
[info]                                     @ 56   scala.collection.immutable.NVector2::copy (15 bytes)   inline (hot)

A quick and dirty way to see the effect is to disable JMH's forking and just run all experiments in its control JVM.

sbt:root> bench/jmh:run scala.collection.immutable.VectorBenchmark2.nvApprepend -f0 -psize=5000,50000,1,500

[info] Benchmark                     (size)  Mode  Cnt        Score       Error  Units
[info] VectorBenchmark2.nvApprepend    5000  avgt    5    92220.333 ±  9314.284  ns/op
[info] VectorBenchmark2.nvApprepend   50000  avgt    5  1029616.649 ± 41164.566  ns/op
[info] VectorBenchmark2.nvApprepend       1  avgt    5       22.443 ±     1.452  ns/op
[info] VectorBenchmark2.nvApprepend     500  avgt    5     9525.892 ±   391.606  ns/op
[success] Total time: 44 s, completed 12/11/2019 1:58:20 PM
sbt:root> bench/jmh:run scala.collection.immutable.VectorBenchmark2.nvApprepend -f0 -psize=500

[info] Benchmark                     (size)  Mode  Cnt     Score     Error  Units
[info] VectorBenchmark2.nvApprepend     500  avgt    5  8126.497 ± 231.999  ns/op

There are a few techniques to harden the benchmark against this problem.

If the only virtual callsites you want to protect are all in the benchmark code itself, you can indirect them through a private method that bears the @jmh.annotations.CompilerControl(DONT_INLINE) annotation.
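
For illustration, the indirection could look like this in a Scala JMH benchmark (a sketch; the benchmark state and the class names are made up):

    import org.openjdk.jmh.annotations._

    // Keeping the virtual call behind a never-inlined helper stops the JIT from
    // specializing the call site to the single receiver class seen in a per-size run.
    @State(Scope.Thread)
    class DispatchSketch {
      var v: Vector[AnyRef] = Vector.fill(100)("x")

      @CompilerControl(CompilerControl.Mode.DONT_INLINE)
      private def appendOne(v: Vector[AnyRef], x: AnyRef): Vector[AnyRef] = v.appended(x)

      @Benchmark def append: Vector[AnyRef] = appendOne(v, "y")
    }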

If you are benchmarking calls into library code that contains the virtual call sites, you need to warm them up with various subclasses (in this case, various sizes of collection).

JMH has a command line option that lets you specify a set of benchmark methods to run during the warmup phase: -wm BULK -wmb <regex> (run bench/jmh:run -help for details).

Adding this to our setup:

    println("size = " + size)
    println("JVM: " + ManagementFactory.getRuntimeMXBean().getName())

Shows that:

sbt>  bench/jmh:run -jvmArgs -Xmx1G scala.collection.immutable.VectorBenchmark2.nvApprepend -f1 -psize=1,50,500,5000 -wm BULK

Warms up all sizes in the same JVM before running the measurement for a particular size. However, it will use a fresh JVM for each size, which means N^2 benchmark execution time. I don't see an extension point in JMH to customize this. We could patch this part of JMH to add support for a new warmup mode that shared a measurement JVM for parameter variations of a given benchmark.

Perhaps the simplest option for now would be some code in the setup method that runs each benchmark method at a few sizes.
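
Something along these lines, perhaps (a sketch; the sizes, fields and benchmark body are made up):

    import org.openjdk.jmh.annotations._

    // Pre-touch the benchmarked operations with several vector sizes during setup so the
    // JIT sees all the NVectorN receiver classes before the measured size runs.
    @State(Scope.Benchmark)
    class WarmupSketch {
      @Param(Array("1", "50", "500", "5000")) var size: Int = _
      var nv: Vector[AnyRef] = _

      @Setup(Level.Trial) def setup(): Unit = {
        for (s <- Seq(1, 50, 500, 5000, 50000)) {
          var v: Vector[AnyRef] = Vector.fill(s)("x")
          var i = 0
          while (i < 1000) { v = v.appended("y").prepended("z"); i += 1 }
        }
        nv = Vector.fill(size)("x")
      }

      @Benchmark def nvApprepend: Vector[AnyRef] = nv.appended("y").prepended("z")
    }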

* - prefix1 and suffix1 are never empty
* - Balancing does not cross the main data array (i.e. prepending never touches the suffix and appending never touches
* the prefix). The level is increased/decreased when the affected side plus main data is already full/empty
* - All arrays are left-aligned and truncated

retronym Nov 12, 2019

It would be good to encode some of these invariants in a checkConsistent method which we could incorporate into a ScalaCheck test suite that performs random operations and asserts that the internal structure is consistent.
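
The idea could be wired up roughly like this (a sketch; checkConsistent is a placeholder here, and as noted in the reply below, the PR keeps its validation methods in the JUnit tests instead):

    import org.scalacheck.{Gen, Prop, Properties}

    // Apply a random sequence of operations and assert the structural invariants after
    // each step. The real check would verify prefix/suffix lengths, alignment, etc.
    object VectorInvariantSpec extends Properties("Vector invariants") {
      sealed trait Op
      case object Append extends Op; case object Prepend extends Op; case object Tail extends Op

      val genOps: Gen[List[Op]] = Gen.listOf(Gen.oneOf[Op](Append, Prepend, Tail))

      def checkConsistent(v: Vector[Int]): Boolean = v.length >= 0 // placeholder invariant

      property("random ops keep the structure consistent") = Prop.forAll(genOps) { ops =>
        ops.foldLeft(Vector.empty[Int]) { (v, op) =>
          val v2 = op match {
            case Append  => v.appended(v.length)
            case Prepend => v.prepended(v.length)
            case Tail    => if (v.isEmpty) v else v.tail
          }
          assert(checkConsistent(v2)); v2
        }
        true
      }
    }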

szeiger Nov 12, 2019

There are validation methods in the JUnit test. I originally had them directly in NVector but factored them out to keep the code size down.

case that: NVector[_] => false
case o => super.equals(o)
}
}

retronym Nov 12, 2019

We should keep optimized implementations of equals / sameElements for Vector-to-Vector comparisons, short-circuiting when the lengths differ and avoiding element comparison when this and that structurally share a backing array.

szeiger Nov 12, 2019

The old optimizations don't make sense here. The only case where you would benefit from the shortcut is when comparing the result of map(identity). If the vectors were built separately, they won't share any arrays. Doing structural comparisons on each array separately might be useful. This could cover cases such as dropping the head element and then consing it back on. The implementation would be rather complex and you need to take care not to compromise the more common cases though. You cannot handle all arrays individually because equal vectors can be balanced differently. In the old Vector the only possible difference was the offset but now we have to deal with all the different prefix lengths.

retronym Nov 20, 2019

@szeiger Would (NVector.tabulate(1000)(i => i)).equals(NVector.tabulate(999)(i => i)) short circuit based on a cheap length call, as per Vector?

szeiger Nov 20, 2019

Yes, this uses the default implementation of sameElements which checks knownSize first (and the size of an IndexedSeq is always known).

szeiger commented Nov 19, 2019

I ran some new benchmarks with megamorphic dispatch. The results (at http://szeiger.de/tmp/vector2mega/vector-results.html) unfortunately mean that the new vector is not always faster which makes the argument for it much harder.

  • apply is roughly 10% slower than the old version (both for sequential and random access)
  • head is roughly 30% slower than the old version
  • last starts out the same as head but is faster above 1000 elements because it is O(1) in the new design and O(log n) in the old one
  • Iterating with iterator is 10% to 20% slower than the old version

There are positive effects, too. Well, positive for my new implementation, because some operations in the old vector also suffer from the varying sizes. It can't be megamorphic method dispatch (the old Vector is a single class, so its call sites remain monomorphic). My guess is that previously the same branches were taken and got highly optimized. Now we get more variation which has huge performance effects. For more complex algorithms this benefits the new implementation. Once the initial impact of the non-inlined call site has been absorbed, the rest is still specific to a certain dimension. The effect can be seen in sequential updates (last diagram on the page). The old vector used to be around 150μs independent of size. Now it starts at this level for small sizes but then appears to be growing logarithmically (which it really shouldn't, given the algorithm). Maybe @retronym has a better idea about what's going on.

(tail for size=1 looks much faster in the old vector now. The new vector shows similar results in the non-megamorphic benchmarks. I noticed these oddities before but didn't point them out because they affected both versions equally. This appears to be a benchmarking artifact. Even though the size is initialized as a parameter, HotSpot is apparently still able to constant-fold the whole call graph into an empty vector. ~5ns is typically the result I get on this machine for a no-op.)

retronym commented Nov 20, 2019

head is roughly 30% slower than the old version

I believe you could fix this particular slowdown by using a field in the base class to store NVector1.data1 / {NVector0..6}.prefix1 in the same field and pulling a single, final implementation of head up there.
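
A sketch of that refactoring (hypothetical class and field names):

    // Every concrete vector stores its leftmost leaf array in the same superclass field,
    // so `head` gets a single, final, monomorphic implementation.
    abstract class VectorSketch[+A](protected val prefix1: Array[AnyRef]) {
      final def head: A =
        if (prefix1.length == 0) throw new NoSuchElementException("empty vector")
        else prefix1(0).asInstanceOf[A]
    }
    final class Vector1Sketch[+A](data1: Array[AnyRef]) extends VectorSketch[A](data1)
    final class Vector2Sketch[+A](p1: Array[AnyRef] /* , prefix2, data2, ... */) extends VectorSketch[A](p1)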

My guess is that previously the same branches were taken and got highly optimized.

Yes, benchmarking with polluted branch profiles (for the JIT or even the CPU branch predictor) is yet another best practice. If we want to dig in to this we could start by using JITWatch to compare.

mkeskells commented Nov 20, 2019

Are there any benchmarks for allocations and retained memory usage? In large apps this is as much a problem as CPU usage.

CPU improvements look good

szeiger commented Nov 20, 2019

I believe you could fix this particular slowdown by using a field in the base class to store NVector1.data1 / {NVector0..6}.prefix1 in the same field and pulling a single, final implementation of head up there.

Yes, I had the same idea and I already did that yesterday and ran some benchmarks over night. The results look promising: http://szeiger.de/tmp/vector2mega2/vector-results.html. head is consistently faster again and other results appear to be unaffected. I'll run a variation of this tonight to get at least bimorphic dispatch for last.

Unfortunately apply is the one really important method that has no easy fix.

Are there any benchmarks for allocations and retained memory usage. In large apps this is as much a problem as cpu usage.

I didn't run any benchmarks for allocations. My expectation is that bulk operations may do some more one-time allocations (e.g. in NVectorBuilder.result). Allocations while building are the same (I even use the old algorithm for allocating new vector slices).

Any operation on the prefix or suffix (tail, init, appended, prepended, and updates in the prefix or suffix) will allocate less. Only sequential updates in the main area of large vectors are expected to do worse because they now have to do a full path copy every time.

Retained memory for small vectors should be much better than before and start to even out as instances get larger. Constant overhead is a bit higher in dimensions 5 and 6 but the small advantage for the old implementation only holds as long as the vectors are left-aligned and right-truncated (i.e. they were built by appending individual elements, not with a builder), otherwise the small constant savings are dwarfed by the padding overhead. Overall, retained memory should be very close to optimal for NVector.

NVectorIterator keeps a reference to the whole source vector while iterating but that was also true for VectorIterator. In both implementations there are no easy ways to improve this. You would have to trade retained memory for more allocations and slower performance.

szeiger commented Nov 20, 2019

I pushed an update with the head changes and some cleanups. The bimorphic version of last doesn't have any advantages over the megamorphic one. It's actually a bit slower in my benchmarks (but that may be just noise).

szeiger commented Nov 20, 2019

Interesting result: Using NVectorIterator everywhere (so you get a monomorphic iterator call and monomorphic calls into the Iterator's methods) is detrimental to iterator performance! This looks similar to the branch optimization issues we see with the old Vector. Perhaps splitting NVectorIterator into multiple versions specific to one level could be useful. I wanted to do that initially but having a single implementation turned out to be simpler.

szeiger commented Nov 20, 2019

I benchmarked some alternatives for iterator. Unfortunately, all were much slower. It looks like we're already hitting the sweet spot of optimization with the separate iterators for levels 0 and 1 and a shared implementation for higher dimensions. If you look at the warmup data, it's clear that HotSpot is really screwing it up in this case. Warmup goes from small to large sizes and every single measurement along the way is much better than what you get in the end during the actual benchmarked iterations.

Of course, what this also means is that the iterator performance is very dependent on warmup and getting results within a few percent of the old implementation is probably fine.

joshlemer commented Nov 20, 2019

@szeiger did you ever try a version of this new structure which does not split out implementations based on depth? i.e., just have all fields present, but perhaps null, and logic is dispatched based on some private[this] val depth: Int?

szeiger commented Nov 21, 2019

Having everything in one class would add a lot of memory overhead for small instances, even more than the old vector (but with the truncated arrays we should still come out ahead in the end). After seeing the effects of branch prediction in the benchmarks for the old Vector, I doubt that it would be beneficial for performance. If you have to switch anyway in the critical code paths you may as well make HotSpot do it.

I'll see if I can run some tests for this. It shouldn't be that hard to try it for a subset of operations.

szeiger commented Nov 21, 2019

Here's a single-class implementation: https://github.com/szeiger/scala/blob/wip/new-vector-one-class/src/library/scala/collection/immutable/NVector.scala. I started the benchmarking for apply but aborted when I saw the first results scrolling by. It's about 2x slower for sequential access in small vectors. A few methods that operate on the prefix and suffix can be written with a single implementation without looking at the dimension but other than last for small sizes they are already much faster.

smarter commented Nov 21, 2019

Have you tried benchmarking with Graal? Might be useful for future-proofing the work done here.

retronym commented Nov 22, 2019

Having everything in one class would add a lot of memory overhead for small instances,

Another way to factor this out would be to keep all the subclasses for the data, but move all the implementations up into the base class. Those implementations would look a lot like the ones we have today: switch initially on something to discriminate the size, then cast to the subclass to get out the data.

I would guess that the fastest way would be to have a depth field that could be used in a (depth: @switch) match { ... }. You could avoid the space cost of that field and instead use a cascade of isInstanceOf checks.
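
For illustration, the two dispatch styles might look like this (a sketch; the subclasses stand in for the real NVector1/NVector2):

    import scala.annotation.switch

    // A shared `apply` in the base class, dispatching either on an explicit depth field
    // (compiled to a switch) or on a cascade of instance checks.
    abstract class NVectorSketch[+A](val depth: Int) {
      final def applyViaSwitch(i: Int): A = (depth: @switch) match {
        case 1 => this.asInstanceOf[NVector1Sketch[A]].apply1(i)
        case 2 => this.asInstanceOf[NVector2Sketch[A]].apply2(i)
        case _ => throw new IndexOutOfBoundsException(i.toString)
      }
      // alternative: no depth field, discriminate via instance checks instead
      final def applyViaInstanceOf(i: Int): A = this match {
        case v: NVector1Sketch[A] => v.apply1(i)
        case v: NVector2Sketch[A] => v.apply2(i)
        case _ => throw new IndexOutOfBoundsException(i.toString)
      }
    }
    final class NVector1Sketch[+A](data1: Array[AnyRef]) extends NVectorSketch[A](1) {
      def apply1(i: Int): A = data1(i).asInstanceOf[A]
    }
    final class NVector2Sketch[+A](prefix1: Array[AnyRef]) extends NVectorSketch[A](2) {
      def apply2(i: Int): A = prefix1(i).asInstanceOf[A] // real code would also check prefix2/data2/suffix1
    }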

This is all getting close to the trickery from:
https://shipilev.net/blog/2015/black-magic-method-dispatch/#_cheating_the_runtime_2

The warnings and conclusions at the end of that article are worth reading. As @smarter cautions, we don't want to overfit to C2 without also trying Graal.

retronym commented Nov 22, 2019

One small improvement would be to share a field for length() to make that monomorphic. I've added a commit to that effect in: retronym#78.

I've also pushed a commit that makes apply a final method of NVector which dispatches on a new field depth. It then statically dispatches to the sub-class-specific implementations.

retronym commented Nov 22, 2019

I ran some new benchmarks with megamorphic dispatch

@szeiger What change did you make to pollute the profiles? I don't see any changes to VectorBenchmark2 pushed to this PR.

retronym commented Nov 22, 2019

It's about 2x slower for sequential access in small vectors.

Just a guess, but this could be due to the new implementation using explicit bounds checking, whereas before NVector1.apply was written to use implicit bounds checking on array access.

This suggests yet another sort of profile pollution (😅) -- calling Vector(-1) and Vector(Int.MaxValue) for each dimension of NVector.

szeiger commented Nov 22, 2019

Here's a complete run with the latest Graal CE using mixed-size warmup: http://szeiger.de/tmp/vector2graalce/vector-results.html

TL;DR: There are no big surprises that would change the general performance assessment. The problem with the 10% lower performance in apply remains, the performance issue in last is gone on Graal.

There are some notable performance differences, and in most cases they benefit the old implementation, but these tend to be benchmarks where the new vector has a huge performance lead on HotSpot anyway, so the lead becomes smaller while the new vector still ends up much faster.

  • appended is much faster for the old vector and a bit faster for the new one. The new one is still much faster than the old one.
  • Alternating appended / prepended: No significant changes
  • apply is a bit faster for both, the old implementation still has roughly a 10% performance advantage.
  • The Builder for the old vector has the same performance as on HotSpot, the new one loses its small performance advantage and runs at almost exactly the same speed (slower than on HotSpot)
  • The regular appendedAll (100% and 10% size benchmarks) is much faster for the old vector and much slower for the new one than on HotSpot. A significant performance advantage for the new one remains.
  • The optimized appendedAll for small suffixes is much faster than on HotSpot for both implementations.
  • Optimized vector appending is a bit slower than on HotSpot in the new implementation, the (non-optimized) counterpart in the old implementation a bit faster. No big change in general.
  • foreach: No significant changes
  • head: Both are significantly faster, no change in relative performance
  • last: Both are a bit faster, the new implementation more so than the old one. This means that the new implementation is now faster above size 1, not only above size 1000.
  • Iterating with iterator: Same performance for the old implementation. The new one gets a few percent faster and is now at the same level.
  • map(identity): Similar performance for the old vector, the new one is more than twice as slow as on HotSpot (but still much faster than the old one)
  • Mapping to a new value: The old implementation gets a bit faster, the new one a bit slower. No big changes.
  • prepend: The old implementation is roughly twice as fast, the new one a bit slower. Still much faster than the old one.
  • tail: Both are significantly faster, same relative performance. HotSpot was able to elide the trivial size=1 case for the old, monomorphic call in the benchmark but not the new, polymorphic one. Graal seems to be able to do it for both.
  • update: The branch prediction issue we saw with HotSpot for the old implementation is gone, the graph shows the expected constant-time sequential update performance and it is twice as fast as on HotSpot. The new implementation is a bit slower now, still much faster than the old one.

It is also interesting to compare these results to the original ones with single-size warmup. Most of the cases where we see a significant speed-up in Graal (appended (old implementation), regular appendedAll (old), small suffix appendedAll (both), prepended (old)) are essentially back to their single-size warmup performance on HotSpot. Graal doesn't fall into the branch prediction trap that kills performance on HotSpot if you warm up with the "wrong" sizes.

szeiger commented Dec 4, 2019

Latest results on JDK 8 and JDK 11 (different machine!) are as expected. And it reproduces the very weird results in vBulkAppend2 and vUpdateSequential that I saw earlier on JDK 11.

The current version should be fully forwards and backwards binary compatible with Scala 2.13 so you can use it as a drop-in replacement for testing and benchmarking while keeping all your normal 2.13 dependencies. In order to use it, put the following into your build.sbt:

resolvers += "scala-pr" at "https://scala-ci.typesafe.com/artifactory/scala-pr-validation-snapshots/"

scalaVersion := "2.13.2-bin-9da5c9f-SNAPSHOT"

scalaBinaryVersion := "2.13"

Making sure that we have the right version:

> console
[info] Starting scala interpreter...
[...]
scala> Vector.empty.getClass.getName
val res0: String = scala.collection.immutable.Vector0$

If this says scala.collection.immutable.Vector (without the 0$ at the end) you have the old Vector.

Update: Use 2.13.2-bin-9da5c9f-SNAPSHOT (compatible with parallel collections) instead of 2.13.2-bin-5f613e3-SNAPSHOT

plokhotnyuk commented Dec 4, 2019

@szeiger Which CPU have you used for the tests? If it was Intel then please check which version of microcode it has, using a command like:

$ cat /proc/cpuinfo | grep od
model        : 158
model name    : Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
microcode    : 0xca
...

The story behind this is that on 12 Nov 2019 Intel released new microcode for the Jump Conditional Code (JCC) erratum, which shows up to 2x performance degradation in hot loops, especially those with heavily branched or deeply nested bodies. Moreover, with current JVMs the performance of such places becomes unpredictable because it depends on whether call/ret and branch/jump instructions hit a 32-byte boundary.

szeiger commented Dec 4, 2019

@plokhotnyuk The initial benchmarks were run on my old MacBook Pro:

machdep.cpu.brand_string: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
machdep.cpu.model: 58
machdep.cpu.microcode_version: 33

The benchmark servers (as listed in the directory name of benchmark results):

alice:

model		: 60
model name	: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
microcode	: 0x27

india:

model		: 30
model name	: Intel(R) Core(TM) i7 CPU         860  @ 2.80GHz
microcode	: 0xa

Benchmark results differ quite a bit between JDK 8 (HotSpot), JDK 11 (HotSpot) and JDK 8 (Graal CE) but are consistent across different runs on all machines.

I don't have a good intuition how the microcode changes would affect this code, in particular if the new Vector would be more or less affected than the old one. If you have access to machines that allow you to compare these results and would like to run the benchmarks yourself, you can use the latest benchmarking script from https://gist.github.com/szeiger/59fe1507f0b81cc37963d9e9ed0ca7d4. Simply pipe it into sbt like this:

time sbt <bench.txt

and 10 hours later you should have a test/benchmarks/vector-bench-data.js file which can be opened via test/benchmarks/vector-results.html.

plokhotnyuk commented Dec 5, 2019

@szeiger szeiger modified the milestones: 2.14.0-M1, 2.13.2 Dec 6, 2019
szeiger added 2 commits Oct 17, 2019
to avoid scala/bug#11812
@szeiger szeiger force-pushed the szeiger:wip/new-vector branch from 9da5c9f to f1a8e44 Dec 12, 2019
@szeiger szeiger marked this pull request as ready for review Dec 12, 2019

szeiger commented Dec 12, 2019

Squashed and rebased, draft status removed. This should be ready to merge for 2.13.2.

Jasper-M commented Dec 12, 2019

Could it be that you somehow reverted the changes to Vector in the rebase?

retronym commented Dec 13, 2019

dotc reports:

[warn] -- Warning: /home/travis/build/scala/scala/src/library/scala/collection/immutable/Vector.scala:241:8 

[warn] 241 |private final object Vector0 extends BigVector[Nothing](empty1, empty1, 0) {

[warn]     |        ^^^^^

[warn]     |        final modifier is redundant for objects

[error] -- [E007] Type Mismatch Error: /home/travis/build/scala/scala/src/library/scala/collection/immutable/Vector.scala:149:63 

[error] 149 |        case it: Iterable[_] => it.foreach(x => v = v.appended(x))

[error]     |                                                               ^

[error]     |         Found:    _(x)

[error]     |         Required: B

[error]     |

[error]     |         where:    B is a type in method appendedAll0 with bounds >: A

[error] -- [E007] Type Mismatch Error: /home/travis/build/scala/scala/src/library/scala/collection/immutable/Vector.scala:1405:21 

[error] 1405 |      else addVector(v)

[error]      |                     ^

[error]      |                     Found:    Vector[_](v)

[error]      |                     Required: Vector[A]

[error] -- Error: /home/travis/build/scala/scala/src/library/scala/collection/immutable/Vector.scala:1556:81 

[error] 1556 |  private[immutable] def getData: Array[Array[_]] = Array(a1, a2, a3, a4, a5, a6)

[error]      |                                                                                 ^

[error]      |no implicit argument of type reflect.ClassTag[Array[?]] was found for parameter evidence$6 of method apply in object Array
@szeiger szeiger force-pushed the szeiger:wip/new-vector branch from f5ff15d to b1dabca Dec 13, 2019

szeiger commented Dec 13, 2019

Added some casts that should hopefully satisfy dotty

joshlemer commented Dec 19, 2019

This is out of scope for this PR but for want of a better place to put this:

With this new and improved Vector implementation, it could be time to revisit the status of List as the default Seq (aka, the implementation delegated to by the Seq factory). I wonder in what situations List would continue to outperform Vector, now that small-vector memory overhead is reduced and many operations are orders of magnitude faster? If it is only prepending and head/tail decomposition, that seems like a pretty niche use-case for the default collection.

Jasper-M commented Dec 19, 2019

I bet that List is still faster for everything except indexed access, appending and concatenation, i.e. for all the usual combinators (map, flatMap, filter, find, ...) that seem "safe" to use on some Seq.

joshlemer commented Dec 19, 2019

Something like this is quite a bit faster for Vector1's filterImpl

  override def filterImpl(pred: A => Boolean, isFlipped: Boolean): Vector[A] = {
    // approach:
    //  we traverse the data array until we find an element which does not pass the filter.
    //  if all elements pass the filter, we return `this`
    //
    //  otherwise, we continue to the end of the data array, marking each index that passes the filter, in the
    //  `bitmap`.
    //
    //  Once all indices are processed, the new data array can be computed from the index of first failure (`i`) plus
    //  the bitcount of the bitmap. Now we allocate the final array, and copy all the elements before first failure into it.
    //
    //  We then traverse the original data array starting from first-failure + 1, and each time the index's bit position
    //  is marked in `bitmap`, we copy that element into the new array at the next available index.

    var i = 0
    val len = _data1.length
    while (i != len) {
      if (pred(_data1(i).asInstanceOf[A]) == isFlipped) {
        // each 1 bit indicates that index passes the filter.
        // all indices < i are also assumed to pass the filter
        var bitmap = 0

        var j = i + 1

        while (j < len) {
          if (pred(_data1(j).asInstanceOf[A]) != isFlipped) {
            bitmap |= (1 << j)
          }
          j += 1
        }
        val newLen = i + java.lang.Integer.bitCount(bitmap)

        if (newLen == 0) return Vector0

        val newData = new Array[AnyRef](newLen)
        System.arraycopy(_data1, 0, newData, 0, i)

        var k = i + 1
        while (i != newLen) {
          if (((1 << k) & bitmap) != 0) {
            newData(i) = _data1(k)
            i += 1
          }
          k += 1
        }
        return new Vector1[A](newData)
      }
      i += 1
    }
    this
  }
// this implementation
[info] Benchmark                    (size)  Mode  Cnt    Score    Error  Units
[info] VectorBenchmark3.filter_vec       1  avgt   16    4.898 ±  0.044  ns/op
[info] VectorBenchmark3.filter_vec       5  avgt   16   35.216 ±  0.232  ns/op
[info] VectorBenchmark3.filter_vec      10  avgt   16   57.222 ±  0.482  ns/op
[info] VectorBenchmark3.filter_vec      32  avgt   16  124.393 ± 12.371  ns/op


// wip-newvector branch 
[info] Benchmark                    (size)  Mode  Cnt    Score    Error  Units
[info] VectorBenchmark3.filter_vec       1  avgt   16   37.886 ± 11.691  ns/op
[info] VectorBenchmark3.filter_vec       5  avgt   16   53.154 ±  3.952  ns/op
[info] VectorBenchmark3.filter_vec      10  avgt   16   75.427 ± 15.255  ns/op
[info] VectorBenchmark3.filter_vec      32  avgt   16  140.693 ± 10.697  ns/op

Edit: Actually maybe this would be more useful extracted to a more general purpose Array.filter kind of method so it could be used by ArraySeq and others too.

szeiger commented Jan 8, 2020

Filtering in general could use some optimization. It's not even overridden at the moment (nor is it in the old implementation). .filter(_ => false) is twice as fast in the new Vector but keeping all data or partial data can be 50% slower.

BTW, constructor parameters like _data1 should not be accessed by any method. Unfortunately there is no way to enforce this. You have to use prefix1 which is a monomorphic getter method that can be nicely inlined. If you use _data1 anywhere you create a new class field with an unnecessary copy of the reference.
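
A minimal illustration of that pitfall (hypothetical names, not the PR's actual hierarchy):

    // In Scala 2, using a constructor parameter from a method body makes the compiler
    // retain it as an extra field, duplicating the reference already stored in the parent.
    abstract class BaseSketch(val prefix1: Array[AnyRef])
    final class Leaf1Sketch(_data1: Array[AnyRef]) extends BaseSketch(_data1) {
      def headOk: AnyRef  = prefix1(0) // goes through the monomorphic getter, no extra field
      def headBad: AnyRef = _data1(0)  // this use makes scalac emit a second field for _data1
    }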

This is a complete rewrite of the old scala.collection.immutable.Vector.
The design differs from the old one in one important aspect: Instead
of a single movable finger (the "display" pointers) it has a fixed
finger at each end. These fingers are part of the proper data
structure, not a transient optimization. There are no other ways to
get to the data in the fingers (it is not part of the main data
array), no focus point, and no stabilization. As a side-effect, the
data structure is completely immutable and no extra memory fences are
required. A second major difference lies in the implementation.
Whereas the old Vector uses a single class for collections of all
supported sizes, the new one is split into Vector0 to Vector6 for the
different dimensions of the main data array.
szeiger added a commit to szeiger/scala that referenced this pull request Jan 10, 2020
@szeiger szeiger force-pushed the szeiger:wip/new-vector branch from b1dabca to 822ba03 Jan 10, 2020

szeiger commented Jan 10, 2020

I've added the optimized filter implementation. Performance of the different versions that I benchmarked was rather surprising.

The inherited implementation (the same as in the old Vector) uses an Iterator plus Builder. Both are fast (as shown by the other benchmarks) but the combination of both was still twice as slow as for the old Vector.

The implementation proposed by @joshlemer works great for Vector1 and we can get a good speed-up for BigVector with a simple filterImpl based on foreach and VectorBuilder, but combining them in the obvious way (with two separate branches) is much slower than it should be.

In the end I combined the two in such a way that we always process prefix1 with the bitmap approach first. If the Vector is a BigVector, the array with the results from prefix1 is then used to initialize a VectorBuilder and the remainder of the data is processed in the simple way. This gives good performance for small and large sizes.

Based on #8534 (comment)
@szeiger szeiger force-pushed the szeiger:wip/new-vector branch from 822ba03 to 34655b8 Jan 10, 2020