Improved runtime speed for Vector, restoring previous performance. #5516

Merged
merged 2 commits into from Nov 11, 2016

7 participants

@Ichoran
Contributor
Ichoran commented Nov 9, 2016

copyOf was redirected to java.util.Arrays.copyOf to reduce the work that the JIT compiler has to do to make this work well.

zeroLeft, zeroRight, copyLeft, and copyRight were made private[this] to improve performance (presumably by encouraging the JIT compiler to try harder despite using the new trait encoding scheme).

Checks of performance were conducted with Thyme to aid iteration speed.

@scala-jenkins scala-jenkins added this to the 2.12.1 milestone Nov 9, 2016
@@ -466,24 +466,24 @@ override def companion: GenericCompanion[Vector] = Vector
display5 = copyRange(display5, oldLeft, newLeft)
}
- private def zeroLeft(array: Array[AnyRef], index: Int): Unit = {
+ private[this] def zeroLeft(array: Array[AnyRef], index: Int): Unit = {
@viktorklang
viktorklang Nov 9, 2016 Contributor

final is also nice?

@Ichoran
Ichoran Nov 9, 2016 Contributor

Either one does the trick.

@Ichoran
Ichoran Nov 9, 2016 Contributor

Actually, I reran the original tests and I'm not 100% sure it makes any difference at all. If it's important I should rerun them later, maybe with JMH. I've been trying to iterate quickly, so have been using Thyme which is more subject to fluctuations.

@Ichoran
Ichoran Nov 9, 2016 Contributor

"it" meaning having the method be final, private[this], or just private like it used to be. In any case the effects are small in my hands (~5%).

@lrytz
lrytz Nov 9, 2016 Member

The bytecode for a private[this] method is exactly the same as for private, so this should not make any difference.

@paplorinc
paplorinc Nov 9, 2016 Contributor

In my experiment only avoiding the pre-null-ing of arrays had any performance impact (i.e. the array allocation and copying should be next to each other, otherwise the array is null-ed first).

Before:

VectorBenchmark.Slice.scala_persistent  thrpt   4193.769 ops/s

After:

VectorBenchmark.Slice.scala_persistent  thrpt  10306.275 ops/s

i.e. 2.5x faster, congrats @Ichoran

@Ichoran
Ichoran Nov 9, 2016 Contributor

@lrytz - Yeah, I see that. It does (or has--not sure about now) occasionally make a difference with fields and it was faster for me to check the benchmark than the bytecode. But it seems there was just a benchmarking fluctuation that made everything I did after the one bad benchmark seem like it was helping (a tiny bit).

@paplorinc

👍

@Ichoran
Contributor
Ichoran commented Nov 9, 2016

@paplorinc - Which change specifically was the improvement? I tested the java.util.Arrays.copyOf one extensively, but the others I just threw in at the last moment as a benchmark indicated there was a small residual slowdown. It seemed to work so I left it, but it sounds like it's unnecessary and should be removed?

@retronym
Member

HotSpot optimizes val a = new Array; arrayCopy(..., dest = a) to avoid zeroing: http://hg.openjdk.java.net/jdk9/jdk9/hotspot/file/tip/src/share/vm/opto/macroArrayCopy.cpp

The extra null checks and/or other small differences in bytecode shapes between 2.11.8 and 2.12.0 probably hampered this optimization.

Would be interesting to see if the 2.11.8 benchmark slows down when this JIT optimization is disabled (-XX:-ReduceBulkZeroing.)
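
A minimal sketch of the two call shapes being compared here, assuming a hypothetical `Forwarder` object that stands in for `scala.compat.Platform` (the names `ZeroingShapes`, `Forwarder`, `copyAdjacent`, and `copyViaForwarder` are illustrative and do not come from the PR). Both methods return the same result; whether HotSpot elides the zeroing for either shape is what the benchmarks in this thread measure, so the sketch makes no claim about that.

```scala
object ZeroingShapes {
  // Hypothetical stand-in for a forwarding module such as scala.compat.Platform.
  object Forwarder {
    @inline def arraycopy(src: AnyRef, srcPos: Int, dest: AnyRef, destPos: Int, length: Int): Unit =
      System.arraycopy(src, srcPos, dest, destPos, length)
  }

  // Allocation immediately followed by a copy that overwrites the whole array.
  def copyAdjacent(a: Array[AnyRef]): Array[AnyRef] = {
    val b = new Array[AnyRef](a.length)
    System.arraycopy(a, 0, b, 0, a.length)
    b
  }

  // Same result, but the copy is routed through the forwarding module.
  def copyViaForwarder(a: Array[AnyRef]): Array[AnyRef] = {
    val b = new Array[AnyRef](a.length)
    Forwarder.arraycopy(a, 0, b, 0, a.length)
    b
  }
}
```

Benchmarking both methods with and without the `-XX:-ReduceBulkZeroing` JVM flag mentioned above is one way to see whether the elision applies to a given shape.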

- Platform.arraycopy(a, 0, b, 0, a.length)
- b
- }
+ private[immutable] final def copyOf(a: Array[AnyRef]) = java.util.Arrays.copyOf(a, a.length)
@retronym
retronym Nov 10, 2016 Member

This will go through:

public static <T,U> T[] copyOf(U[] original, int newLength, Class<? extends T[]> newType) {
    @SuppressWarnings("unchecked")
    T[] copy = ((Object)newType == (Object)Object[].class)
        ? (T[]) new Object[newLength]
        : (T[]) Array.newInstance(newType.getComponentType(), newLength);
    System.arraycopy(original, 0, copy, 0,
                     Math.min(original.length, newLength));
    return copy;
}

It might be better to just call new Array[AnyRef]; System.arraycopy here to avoid passing through the generic code in copyOf, unless we're sure that HotSpot can elide the array zeroing for it as well (in cases where it has been used by other callers so both branches of the conditional are live.)
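
A rough sketch of the two candidates for `copyOf` under discussion, written as standalone methods (the names `CopyOfVariants`, `viaArrays`, and `manual` are illustrative, not from the PR). It only shows that the two are interchangeable in result for an `Array[AnyRef]`; it says nothing about which one the JIT optimizes better.

```scala
object CopyOfVariants {
  // Forwards to the JDK helper, as in the first version of this patch.
  def viaArrays(a: Array[AnyRef]): Array[AnyRef] =
    java.util.Arrays.copyOf(a, a.length)

  // Allocates and copies directly, avoiding the generic three-argument copyOf.
  def manual(a: Array[AnyRef]): Array[AnyRef] = {
    val b = new Array[AnyRef](a.length)
    System.arraycopy(a, 0, b, 0, a.length)
    b
  }
}
```

One difference worth keeping in mind: `Arrays.copyOf` preserves the runtime class of the source array, while the manual version always allocates an `Object[]`. That is why the JDK method contains the class comparison quoted above.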

@Ichoran
Ichoran Nov 10, 2016 edited Contributor

@retronym - On my machine, the copyOf version is actually better-optimized than the new Array/arraycopy. I don't see anything left in the assembly of the object type test, and it looks a lot like the 2.11.8 version.

@Ichoran
Contributor
Ichoran commented Nov 10, 2016

@retronym - Keeping bulk zeroing does make a big difference on 2.11--about 50% on my machine, which is a majority of the slowdown observed. There are measurable residual effects, but those are a few percent at most. 2.12 with the copyOf = java.util.Arrays.copyOf fix, plus the inlined System.arraycopy, works the same way, while 2.12.0 is barely affected by -XX:-ReduceBulkZeroing.

@Ichoran
Contributor
Ichoran commented Nov 10, 2016 edited

Here are some benchmark numbers:

Throughput (- means zeroing elision disabled)
---------------------------------------------
Test          :+   tail   init  slice  updated
2.11.8      100%   100%   100%   100%     100%
2.11.8-      71%    71%    74%    69%      72%
2.12.0       69%    70%    75%    67%      71%
2.12.0-      68%    70%    75%    67%      71%
patch        98%   100%   100%    97%      99%
patch-       69%    71%    74%    69%      72%
+copyOf      98%    90%    96%    91%      99%
-Platform    98%    99%    99%    97%     100%

Error is about +- 0.5%.

+copyOf is just changing copyOf to forward to java.util.Arrays.copyOf, while -Platform is replacing all Platform.arraycopy calls with java.lang.System.arraycopy.

I don't think there's a meaningful difference between the two (I don't think I was systematic enough before), so I think @retronym is right that we should just stick to new Array/System.arraycopy.

@Ichoran Ichoran Improved runtime speed for Vector, restoring previous performance.
All calls to Platform.arraycopy were rewritten as java.lang.System.arraycopy to reduce the work that the JIT compiler has to do to produce optimized code that avoids zeroing just-allocated arrays that are about to be copied over.

(Tested with -XX:-ReduceBulkZeroing as suggested by retronym.)
e5fd42d
@Ichoran
Contributor
Ichoran commented Nov 10, 2016

@retronym - I was just about to mention the library ones there, plus one in mutable.PriorityQueue also that is not exactly the same pattern but could become it if optimized/inlined enough. Not sure if I should fix them all (just by doing Platform -> System) as part of this patch? I guess there's little harm in it.

@retronym
Member

+1 for inlining all callers of Platform.arraycopy.

@retronym
Member

Thanks for running those benchmarks and presenting the results so clearly!

@Ichoran Ichoran Manually inlined all other instances of Platform.arraycopy to System.arraycopy

to avoid the same kind of slowdowns that Vector was experiencing due
to the less aggressive inlining by scalac.
7f26b44
@Ichoran
Contributor
Ichoran commented Nov 10, 2016

@retronym - Got them all, I think.

@retronym
Member

LGTM

@retronym retronym self-assigned this Nov 10, 2016
@paplorinc

While I agree that these changes solve the issue, I think this is a lot of workaround for a known root cause (i.e. symptomatic treatment for the Scala inliner working differently).

If we solve this locally, the clients of @inline will still be affected.

Since there is a very specific test for the inlining of Platform.arraycopy, I suggest we look there for why it's passing: https://github.com/scala/scala/blob/0e0614c866526d8922a34e3aab1afc64d7b4f01c/test/junit/scala/tools/nsc/backend/jvm/opt/InlinerTest.scala#L267-L267

@@ -102,7 +101,7 @@ object Array extends FallbackArrayBuilding {
def copy(src: AnyRef, srcPos: Int, dest: AnyRef, destPos: Int, length: Int) {
val srcClass = src.getClass
if (srcClass.isArray && dest.getClass.isAssignableFrom(srcClass))
- arraycopy(src, srcPos, dest, destPos, length)
+ java.lang.System.arraycopy(src, srcPos, dest, destPos, length)
@paplorinc
paplorinc Nov 10, 2016 Contributor

Note: this probably isn't affected anyway, as the array that gets here is already null-ed out.
But at least we're consistent :) (in which case Platform.arraycopy should probably be removed).
Also note that in other cases (e.g. https://github.com/scala/scala/blob/40f7fce0af1da614d99048b024e1ff579635f0f2/src/compiler/scala/tools/nsc/backend/jvm/BCodeHelpers.scala#L461-L461) the System reference is not fully qualified.

@@ -478,12 +477,12 @@ override def companion: GenericCompanion[Vector] = Vector
// if (array eq null)
// println("OUCH!!! " + right + "/" + depth + "/"+startIndex + "/" + endIndex + "/" + focus)
val a2 = new Array[AnyRef](array.length)
- Platform.arraycopy(array, 0, a2, 0, right)
+ java.lang.System.arraycopy(array, 0, a2, 0, right)
@paplorinc
paplorinc Nov 10, 2016 Contributor

👍

a2
}
private def copyRight(array: Array[AnyRef], left: Int): Array[AnyRef] = {
val a2 = new Array[AnyRef](array.length)
- Platform.arraycopy(array, left, a2, left, a2.length - left)
+ java.lang.System.arraycopy(array, left, a2, left, a2.length - left)
@paplorinc
paplorinc Nov 10, 2016 Contributor

👍
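
To make the effect of these two helpers concrete, here is a sketch that lifts `copyLeft` and `copyRight` out of `Vector` as standalone methods (the wrapper object `CopySides` is illustrative; the method bodies follow the hunk above). Unlike a full `copyOf`, each one overwrites only part of the new array, so the untouched slots must still read as `null` afterwards.

```scala
object CopySides {
  // Keeps elements [0, right); the remaining slots stay null.
  def copyLeft(array: Array[AnyRef], right: Int): Array[AnyRef] = {
    val a2 = new Array[AnyRef](array.length)
    System.arraycopy(array, 0, a2, 0, right)
    a2
  }

  // Keeps elements [left, length); the leading slots stay null.
  def copyRight(array: Array[AnyRef], left: Int): Array[AnyRef] = {
    val a2 = new Array[AnyRef](array.length)
    System.arraycopy(array, left, a2, left, a2.length - left)
    a2
  }
}
```

For a four-element source `("a", "b", "c", "d")`, `copyLeft(src, 2)` keeps `"a"` and `"b"` and `copyRight(src, 2)` keeps `"c"` and `"d"`.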

@@ -955,7 +954,7 @@ private[immutable] trait VectorPointer[T] {
private[immutable] final def copyOf(a: Array[AnyRef]) = {
@paplorinc
paplorinc Nov 10, 2016 Contributor

Minor: I think that the signature of copyLeft, copyRight, copyRange and copyOf should be the same (except for the name, of course), so maybe consider removing the [immutable] final modifiers here (or add them to the rest also)

@Ichoran
Ichoran Nov 10, 2016 Contributor

I'm going for a bug fix with no visible changes (and binary compatibility), so while I agree that it's inconsistent, I don't think this is the time to change it.

@@ -955,7 +954,7 @@ private[immutable] trait VectorPointer[T] {
private[immutable] final def copyOf(a: Array[AnyRef]) = {
val b = new Array[AnyRef](a.length)
- Platform.arraycopy(a, 0, b, 0, a.length)
+ java.lang.System.arraycopy(a, 0, b, 0, a.length)
@paplorinc
paplorinc Nov 10, 2016 Contributor

👍

@@ -1119,7 +1118,7 @@ private[immutable] trait VectorPointer[T] {
private[immutable] final def copyRange(array: Array[AnyRef], oldLeft: Int, newLeft: Int) = {
val elems = new Array[AnyRef](32)
- Platform.arraycopy(array, oldLeft, elems, newLeft, 32 - math.max(newLeft,oldLeft))
+ java.lang.System.arraycopy(array, oldLeft, elems, newLeft, 32 - math.max(newLeft,oldLeft))
@paplorinc
paplorinc Nov 10, 2016 Contributor

👍

@@ -67,7 +67,7 @@ class ArrayBuffer[A](override protected val initialSize: Int)
override def sizeHint(len: Int) {
if (len > size && len >= 1) {
val newarray = new Array[AnyRef](len)
- scala.compat.Platform.arraycopy(array, 0, newarray, 0, size0)
+ java.lang.System.arraycopy(array, 0, newarray, 0, size0)
@paplorinc
paplorinc Nov 10, 2016 Contributor

👍

@@ -331,8 +331,8 @@ sealed class PriorityQueue[A](implicit val ord: Ordering[A])
val pq = new PriorityQueue[A]
val n = resarr.p_size0
pq.resarr.p_ensureSize(n)
+ java.lang.System.arraycopy(resarr.p_array, 1, pq.resarr.p_array, 1, n-1)
@paplorinc
paplorinc Nov 10, 2016 Contributor

I don't think this will avoid nulling

@Ichoran
Ichoran Nov 10, 2016 Contributor

Probably not now, but it's more amenable to future optimization heuristics if there are no intervening instructions.

@@ -101,7 +101,7 @@ trait ResizableArray[A] extends IndexedSeq[A]
if (newSize > Int.MaxValue) newSize = Int.MaxValue
val newArray: Array[AnyRef] = new Array(newSize.toInt)
- scala.compat.Platform.arraycopy(array, 0, newArray, 0, size0)
+ java.lang.System.arraycopy(array, 0, newArray, 0, size0)
@paplorinc
paplorinc Nov 10, 2016 Contributor

👍

@paplorinc
paplorinc Nov 10, 2016 edited Contributor

If you want to change all Platform.arraycopy references, consider changing line 120 also

@Ichoran
Ichoran Nov 10, 2016 Contributor

Oops, I thought I did.

@@ -100,7 +100,7 @@ trait BaseTypeSeqs {
def copy(head: Type, offset: Int): BaseTypeSeq = {
val arr = new Array[Type](elems.length + offset)
- scala.compat.Platform.arraycopy(elems, 0, arr, offset, elems.length)
+ java.lang.System.arraycopy(elems, 0, arr, offset, elems.length)
@paplorinc
paplorinc Nov 10, 2016 Contributor

👍

@@ -68,7 +68,7 @@ trait Names extends api.Names {
while (i < len) {
if (nc + i == chrs.length) {
val newchrs = new Array[Char](chrs.length * 2)
- scala.compat.Platform.arraycopy(chrs, 0, newchrs, 0, chrs.length)
+ java.lang.System.arraycopy(chrs, 0, newchrs, 0, chrs.length)
@paplorinc
paplorinc Nov 10, 2016 Contributor

👍

@@ -220,7 +220,7 @@ trait Names extends api.Names {
/** Copy bytes of this name to buffer cs, starting at position `offset`. */
final def copyChars(cs: Array[Char], offset: Int) =
- scala.compat.Platform.arraycopy(chrs, index, cs, offset, len)
+ java.lang.System.arraycopy(chrs, index, cs, offset, len)
@paplorinc
paplorinc Nov 10, 2016 Contributor

same, no allocation here.
Unless you decide to remove Platform.arraycopy completely, I don't think the changes where there's no allocation, just a copy are worth the trouble :)

@retronym
retronym Nov 10, 2016 Member

We can't remove Platform.arraycopy in 2.12.1 because of compatibility constraints. But I support the move to uniformly avoid using it, which is a bit simpler than doing so selectively for particular code patterns.

@paplorinc
Contributor
paplorinc commented Nov 10, 2016 edited

@Ichoran, please update the title and description of the PR to reflect the new changes also :)

@retronym
Member

Since there is a very specific test for the inlining of Platform.arraycopy, I suggest we look there for why it's passing:

The call is inlined, but there is an additional check that Platform$.MODULE$ is not null (i.e., that its super call has completed). This is actually a step forward for correctness, but in this particular case it hinders performance. It is worth revisiting, but I think the manual inlining solution here is pragmatic, especially given that Platform._ seems to be a remnant from the days of Scala.NET.
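
As a rough source-level picture of the situation described above (not actual compiler output), the sketch below writes out by hand what a call site might look like once the forwarding method has been inlined but the module load and null check remain. `Forwarder` is a hypothetical stand-in for `Platform`, and `InlinedShape` and `copyAfterInlining` are illustrative names.

```scala
object InlinedShape {
  // Hypothetical forwarding module, standing in for scala.compat.Platform.
  object Forwarder {
    def arraycopy(src: AnyRef, srcPos: Int, dest: AnyRef, destPos: Int, length: Int): Unit =
      System.arraycopy(src, srcPos, dest, destPos, length)
  }

  // Hand-written approximation of the call site after inlining: the module is
  // still loaded and null-checked between the allocation and the copy.
  def copyAfterInlining(a: Array[AnyRef]): Array[AnyRef] = {
    val b = new Array[AnyRef](a.length)
    val module = Forwarder                              // module load
    if (module eq null) throw new NullPointerException  // receiver null check
    System.arraycopy(a, 0, b, 0, a.length)              // inlined method body
    b
  }
}
```

The result is identical to a plain allocate-and-copy; the point is only that extra instructions now sit between the allocation and the `System.arraycopy` call.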

@viktorklang
Contributor

Perhaps a dumb question, but with 2.12 being Java8+ wouldn't invokedynamic with switchpoint be possibly better than always checking nullness?

Cheers,


@paplorinc
Contributor

I think there are two issues here that could probably be addressed separately:

  • The inliner behaves differently and, to my untrained mind, slightly unintuitively. I would expect it to behave like IntelliJ IDEA's Inline Method refactoring: for a ("static") void method, simply replace the call with the body of the method. The result should resemble the change that @Ichoran made manually, without any additional checks in this case (the bytecode seems to contain 12 additional instructions).
    Users expect it to be for free.
    If given some guidelines, I would gladly implement this.
  • Even if the inliner worked as it did previously, we should get rid of these repetitive methods in Platform, as apparently their only task is to get inlined anyway. Therefore I suggest we deprecate Platform's arraycopy in this commit and replace all of its occurrences :)
@retronym
Member
retronym commented Nov 10, 2016 edited

@inline is supposed to preserve the semantics. That means throwing an NPE if the receiver of an inlined virtual call is null, and executing the expression that yields the receiver. I agree it would be nice to have a way to omit these effects. Mostly this is a UI problem; the backend changes should be easy enough. Maybe @lrytz has given that some thought?

Deprecation of this method (and maybe Platform altogether) may well be the best course of action.

That said, our first priority is to fix the performance regression in 2.12.1, so I'm happy with this PR in its current state to meet that goal.

@retronym
Member

Perhaps a dumb question, but with 2.12 being Java8+ wouldn't invokedynamic with switchpoint be possibly better than always checking nullness?

Null checks that are always false are basically free in C2 compiled code. This case is really quite special: the null check itself isn't expensive, but it changes the bytecode shape for the JIT optimization of array instantiation/copy.

@lrytz
Member
lrytz commented Nov 10, 2016

simply replace the call with the content of the method

Yes, that should indeed be the case. But we have to make sure the inliner preserves semantics, so a null check on the receiver is necessary, unless the compiler can prove that the receiver cannot be null.

In the case of a module like object Platform, there's even more: when performing inlining, the compiler does not know if the Platform module is already initialized, i.e., its constructor side-effects were executed. So the module load is also there for that reason.

A simple analysis on module classfiles to know that their initialization doesn't have side-effects (only methods, no constructor statements, superclass Object) is actually planned, and would allow eliminating the module load (this analysis would also ensure that the module load is never null). scala/scala-dev#16.

Related, there's also scala/scala-dev#112.
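
A small sketch of the distinction drawn above between a module whose initialization has no side effects and one whose constructor does observable work. All names here (`ModuleInitDemo`, `Pure`, `Effectful`, `log`) are hypothetical; the sketch only illustrates why the module load cannot be dropped blindly.

```scala
object ModuleInitDemo {
  val log = scala.collection.mutable.ArrayBuffer.empty[String]

  // Only methods, no constructor statements: loading this module is unobservable.
  object Pure {
    def twice(x: Int): Int = x * 2
  }

  // The constructor statement runs the first time the module is accessed.
  object Effectful {
    log += "Effectful initialized"
    def twice(x: Int): Int = x * 2
  }
}
```

Calling `Pure.twice` leaves `log` empty, while the first call to `Effectful.twice` appends one entry. An inliner that replaced `Effectful.twice(x)` with `x * 2` and dropped the module load would silently skip that entry.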

@Ichoran
Contributor
Ichoran commented Nov 10, 2016

Why must an inliner preserve incidental semantics of side-effects not directly referenced in the method itself?

I would be totally fine with this giving 35 and not printing anything:

class Foo { println("I just made a foo!"); @inline def times(a: Int, b: Int) = a*b }
val f: Foo = null
f.times(5, 7)
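
For contrast with the behavior wished for above, here is a sketch of what the same example does under ordinary method-call semantics, which is what the inliner is said in this thread to preserve. The wrapper `NullReceiverDemo` and the helper `throwsOnNull` are illustrative names.

```scala
object NullReceiverDemo {
  class Foo {
    println("I just made a foo!")
    @inline def times(a: Int, b: Int): Int = a * b
  }

  // Returns true if calling `times` on a null receiver throws a NullPointerException.
  def throwsOnNull(): Boolean = {
    val f: Foo = null
    try { f.times(5, 7); false }
    catch { case _: NullPointerException => true }
  }
}
```

No `Foo` is ever constructed here, so nothing is printed, and the call fails with a `NullPointerException` rather than yielding 35.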
@paplorinc
Contributor
paplorinc commented Nov 10, 2016 edited

I agree with @Ichoran, people would expect a lot simpler behavior for such a simple concept.
Maybe rather add a force parameter to the annotation that ignores any such checks, since this annotation is meant for forcing an optimization anyway?

@lrytz
Member
lrytz commented Nov 10, 2016

Why must an inliner preserve incidental semantics

I think it should, this seems to be also a core principle of the JVM. Making code execute in a way that is not according to the language spec can lead to bugs / behaviors that are surprising / hard to understand. Of course I'm all for optimizing more, but I think we should keep to the spec. I think there's a lot of room to do more even with this restriction.

It's true that we're not very consistent at the moment either: scala/scala-dev#112 is an example. But for example we also defer class loading (when inlining a static method, or eliminating unreachable code), which can cause observable differences (defer side-effects, eliminate deadlocks). I think we never really discussed and defined precisely where to draw the line.

@Ichoran
Contributor
Ichoran commented Nov 10, 2016

The problem with preserving incidental semantics is that they're incidental. Very often you don't care whether a module is initialized or not. You just want to execute some code, and there is no way to run it without dragging in module initialization. If there is no way to avoid incidentals that you don't want, you're forced to do silly time-wasting stuff like inline code by hand, a purely mechanical process that you're going through because your language won't do it for you.

I agree that it's tricky to know what is incidental and what is intended, and I also agree that it can be very surprising when behavior changes based upon inlining or not (because of incidental behavior). But it's also important, as this slowdown has illustrated.

@retronym
Member

I'm happy that the followup ideas to improve the optimizer are covered by those tickets @lrytz linked to, so I'm going to merge this one. Thanks for your detailed review, @paplorinc!

@retronym retronym merged commit e6cb4d9 into scala:2.12.x Nov 11, 2016

6 checks passed

cla @Ichoran signed the Scala CLA. Thanks!
combined All previous commits successful.
integrate-ide [3491] SUCCESS. Took 2 s.
validate-main [4022] SUCCESS. Took 60 min.
validate-publish-core [3892] SUCCESS. Took 4 min.
validate-test [3399] SUCCESS. Took 55 min.
@Ichoran
Contributor
Ichoran commented Nov 11, 2016

@retronym - I hadn't had a chance to catch a Platform that somehow I missed in ResizableArray, but I guess that's not actually worth waiting for. Not sure that copy method is ever used, and if it is, it's not necessarily going to be next to an allocation. (I should have just used sed rather than an interactive editor....)

@paplorinc
Contributor
paplorinc commented Nov 11, 2016 edited

I've checked the rest of the affected methods (i.e. not just slice) and most seem to be very close to their previous performance now! Congrats @Ichoran!
The only one in Vector that I found to be measurably slower is groupBy, which runs at roughly 80-90% of its 2.11.8 throughput (about 81% in the run below), but that might still be acceptable.

Using the Javaslang collection benchmarks with
2.11.8

Benchmark                                (CONTAINER_SIZE) Mode   Score      Error     Units
VectorBenchmark.GroupBy.scala_persistent 100              thrpt  215444.420 ±9500.270 ops/s

and
2.12.1-SNAPSHOT

Benchmark                                (CONTAINER_SIZE) Mode   Score      Error     Units
VectorBenchmark.GroupBy.scala_persistent 100              thrpt  173771.549 ±2979.146 ops/s

Running the rest of the benchmarks, I noticed that MutableList.update got considerably slower also, i.e.:
2.11.8

Benchmark                          (CONTAINER_SIZE)  Mode   Score    Error Units
ListBenchmark.Update.scala_mutable            1000  thrpt 979.478 ±9.109 ops/s

and
2.12.1-SNAPSHOT

Benchmark                          (CONTAINER_SIZE)  Mode   Score  Error Units
ListBenchmark.Update.scala_mutable            1000  thrpt 520.756 ±7.598 ops/s

I will look into these over the weekend :)
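
To put the two regressions above on the same scale, here is a small sketch that turns the reported scores into new-to-old throughput ratios. The names `RelativeThroughput` and `ratio` are illustrative, and the error bars from the benchmark output are ignored.

```scala
object RelativeThroughput {
  // Ratio of new to old throughput; values below 1.0 mean a slowdown.
  def ratio(oldOps: Double, newOps: Double): Double = newOps / oldOps

  // Scores copied from the benchmark output above (2.11.8 vs 2.12.1-SNAPSHOT).
  val groupBy: Double    = ratio(215444.420, 173771.549) // about 0.81
  val listUpdate: Double = ratio(979.478, 520.756)       // about 0.53
}
```

On these numbers, groupBy keeps roughly four fifths of its previous throughput, while MutableList.update keeps only a little over half.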

@Ichoran
Contributor
Ichoran commented Nov 12, 2016

@paplorinc @lrytz - There are definitely other problems out there. GroupBy is slow in general. I haven't yet tracked down whether it's just builders or whether it's mutable map getOrElseUpdate also, but there's definitely a builder component because the slowdown is a function of the collection type even when there aren't very many distinct keys (so it's mostly key lookup and adding to builders). I'll write a ticket when I have time, and try to fix it this weekend.

If it turns out to be more optimizer stuff, I worry that we're going to be putting out small fires all over the place until we have a more aggressive inlining strategy.

@Ichoran
Contributor
Ichoran commented Nov 13, 2016

@paplorinc - I have the groupBy reported at https://issues.scala-lang.org/browse/SI-10049

Please comment there if you have details to add

@paplorinc
Contributor

Thanks, will sign up and comment.

So far I've found that HashMap.getOrElseUpdate is a lot slower now (e.g. 25 vs 36 million ops/s), and even HashMap.get is a lot slower (e.g. 100 vs 160 million ops/s), which may be the source of the problem and may have more serious consequences.

Adding a LinuxPerfAsmProfiler (or JProfiler also) seems to indicate that HashTable.HashUtils#improve may be the cause, but I have no clue why (may be a trait inlining issue?).

@Ichoran
Contributor
Ichoran commented Nov 13, 2016

@paplorinc - Let's continue on the ticket. I think there must be some details that differ, as I don't see a slowdown in get. Maybe it is partly at fault, and the optimizer can handle it in my case but not in yours (based on what should be unimportant differences in details).
