Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Substantial slowdown in groupBy (all collections) #10049

Closed
scabug opened this Issue Nov 13, 2016 · 21 comments

Comments

Projects
None yet
2 participants
@scabug
Copy link

scabug commented Nov 13, 2016

groupBy is substantially slower in 2.12, with throughput down to as little as 60% of 2.11 on my machine. This seems to affect all collections (Vector, List, and Array tested).

Manually inlining the groupBy code does not seem to help. Indeed, benchmarking of HashMap.getOrElseUpdate shows a considerable decrease in throughput, possibly enough to account for the entire effect. AnyRefMap does not appear to be affected.

Manually inlining the get, match, and update calls from getOrElseUpdate does appear to solve the issue (at least mostly).

The old bytecode in scala.collection.mutable.MapLike$class was:

  public static java.lang.Object getOrElseUpdate(scala.collection.mutable.MapLike, java.lang.Object, scala.Function0);
    Code:
       0: aload_0
       1: aload_1
       2: invokeinterface #35,  2           // InterfaceMethod scala/collection/mutable/MapLike.get:(Ljava/lang/Object;)Lscala/Option;
       7: astore_3
       8: aload_3
       9: instanceof    #119                // class scala/Some
      12: ifeq          31
      15: aload_3
      16: checkcast     #119                // class scala/Some
      19: astore        4
      21: aload         4
      23: invokevirtual #123                // Method scala/Some.x:()Ljava/lang/Object;
      26: astore        5
      28: goto          62
      31: getstatic     #128                // Field scala/None$.MODULE$:Lscala/None$;
      34: aload_3
      35: invokevirtual #132                // Method java/lang/Object.equals:(Ljava/lang/Object;)Z
      38: ifeq          65
      41: aload_2
      42: invokeinterface #137,  1          // InterfaceMethod scala/Function0.apply:()Ljava/lang/Object;
      47: astore        6
      49: aload_0
      50: aload_1
      51: aload         6
      53: invokeinterface #39,  3           // InterfaceMethod scala/collection/mutable/MapLike.update:(Ljava/lang/Object;Ljava/lang/Object;)V
      58: aload         6
      60: astore        5
      62: aload         5
      64: areturn
      65: new           #139                // class scala/MatchError
      68: dup
      69: aload_3
      70: invokespecial #142                // Method scala/MatchError."<init>":(Ljava/lang/Object;)V
      73: athrow

Now, however, the bytecode is as follows, with extra instructions highlighted with >>>>>:

  public V getOrElseUpdate(K, scala.Function0<V>);
    Code:
       0: aload_0
       1: aload_1
       2: invokeinterface #109,  2          // InterfaceMethod get:(Ljava/lang/Object;)Lscala/Option;
       7: astore        4
       9: aload         4
      11: instanceof    #221                // class scala/Some
      14: ifeq          37
      17: aload         4
      19: checkcast     #221                // class scala/Some
      22: astore        5
      24: aload         5
      26: invokevirtual #224                // Method scala/Some.value:()Ljava/lang/Object;
      29: astore        6
>>>>> 31: aload         6
>>>>> 33: astore_3
      34: goto          87
>>>>> 37: goto          40
      40: getstatic     #229                // Field scala/None$.MODULE$:Lscala/None$;
      43: aload         4
      45: invokevirtual #233                // Method java/lang/Object.equals:(Ljava/lang/Object;)Z
      48: ifeq          74
      51: aload_2
      52: invokeinterface #237,  1          // InterfaceMethod scala/Function0.apply:()Ljava/lang/Object;
      57: astore        7
      59: aload_0
      60: aload_1
      61: aload         7
      63: invokeinterface #113,  3          // InterfaceMethod update:(Ljava/lang/Object;Ljava/lang/Object;)V
      68: aload         7
      70: astore_3
>>>>> 71: goto          87
>>>>> 74: goto          77
      77: new           #239                // class scala/MatchError
      80: dup
      81: aload         4
      83: invokespecial #242                // Method scala/MatchError."<init>":(Ljava/lang/Object;)V
      86: athrow
      87: aload_3
      88: areturn

Whether these extra jumps themselves are the extra cost, or whether they prevent some JIT optimization, I have not yet investigated. However, the bytecode appears less efficiently organized (note the swap of the exception vs return which requires one of the extra jumps in the success case).

@scabug

This comment has been minimized.

Copy link
Author

scabug commented Nov 13, 2016

Imported From: https://issues.scala-lang.org/browse/SI-10049?orig=1
Reporter: @Ichoran
Affected Versions: 2.12.0

@scabug

This comment has been minimized.

Copy link
Author

scabug commented Nov 13, 2016

@Ichoran said (edited on Nov 13, 2016 9:07:18 PM UTC):
Note--I did not see any difference in get, but maybe my test was still within the ability of the optimizer to compensate for, as Pap Lőrinc reports a slowdown in get as well. Alternatively, maybe it's the pattern match in conjunction that is the issue (I have not yet tested manually inlining the exact pattern match used to destructure the result of the get).

@scabug

This comment has been minimized.

Copy link
Author

scabug commented Nov 13, 2016

@scabug

This comment has been minimized.

Copy link
Author

scabug commented Nov 13, 2016

@paplorinc said:
I think it might be related to the (mutable?) HashMap's slowdown, but my benchmarks may be biased:

Creating an i -> i mutable Map and requesting each one (and separately requesting the same number of missing values) shows a slowdown in 2.11.8 vs 2.12.0, using latest JMH, in 10/10 runs and 1026 values.
Again, not sure if this is the cause, as it turns out to be quite difficult for me to profile.

@Benchmark
public void scala_persistent_some(Blackhole bh) {
    for (int i = 0; i < CONTAINER_SIZE; i++) {
        Option<Object> value = map.get(i);
        assert !value.isEmpty();
        bh.consume(value);
    }
}

@Benchmark
public void scala_persistent_none(Blackhole bh) {
    for (int i = CONTAINER_SIZE; i < 2 * CONTAINER_SIZE; i++) {
        Option<Object> value = map.get(i);
        assert value.isEmpty();
        bh.consume(value);
    }
}

resulting in:

2.11.8
    Impl                       Score
    scala_persistent_none  77,367.12
    scala_persistent_some  47,659.77

2.12.0
    Impl                       Score
    scala_persistent_none  47,823.76
    scala_persistent_some  33,916.23

Note: the asserts were validated in a separate, non-benchmarked run!

@scabug

This comment has been minimized.

Copy link
Author

scabug commented Nov 13, 2016

@paplorinc said (edited on Nov 13, 2016 11:00:57 PM UTC):

2.12.0                                             |       2.11.8
                                                   |
; - s.c.m.HashMap   ::table@1      (line 40)       |       ; - s.c.m.HashMap         ::table@1      (line 40)
; - s.c.m.HashTable ::index@1      (line 367)      |       ; - s.c.m.HashTable$class ::index@1      (line 364) 
; - s.c.m.HashTable ::index$@2     (line 366)      |       ; - s.c.m.HashMap         ::index@2      (line 40)
; - s.c.m.HashMap   ::index@2      (line 40)       |       
; - s.c.m.HashTable ::findEntry@10 (line 132)      |       ; - s.c.m.HashTable$class ::findEntry@10 (line 132)
; - s.c.m.HashTable ::findEntry$@2 (line 131)      |      
; - s.c.m.HashMap   ::findEntry@2  (line 40)       |       ; - s.c.m.HashMap         ::findEntry@2  (line 40)
; - s.c.m.HashMap   ::get@2        (line 70)       |       ; - s.c.m.HashMap         ::get@2        (line 70)

i.e. it seems to me that the 2.12 bytecode contains index and findEntry as default methods, while the other one seems to call a static method.
But wasn't this the whole point of the 2.12 rewrite, shouldn't this be faster?

@scabug

This comment has been minimized.

Copy link
Author

scabug commented Nov 13, 2016

@Ichoran said:
@paplorinc - Can you check whether getOrElseUpdate is faster in your hands if the implementation is changed to the following:

    val x = get(key)
    if (x eq None) { val d = op; this(key) = d; d }
    else x.asInstanceOf[Some[V]].get

In my hands, it's full speed or faster, either when there are lots of successful lookups, or when new keys are added almost every time.

@scabug

This comment has been minimized.

Copy link
Author

scabug commented Nov 14, 2016

@retronym said:
@Ichoran Could you share a benchmark that shows the biggest difference?

I've played with a few variations in retronym/compiler-benchmark@a2260bb

I don't have a clear model about what's going on here.

One observation is that the new trait encoding requires some additional indirections which can eat into HotSpot's inlining depth budget, which is by default 9. I'm trying things like looking at the effect of -XX:MaxInlineLevel=16 in the 2.12.0 run.

@scabug

This comment has been minimized.

Copy link
Author

scabug commented Nov 14, 2016

@retronym said:
@paplorinc I just saw your comments after I posted.

Unfortunately we had to introduce some indirection in the dispatch of trait methods.

Gory details:

scala/scala#5177
scala/scala-dev#143
scala/scala#5317
scala/scala#5429
scala/scala-dev#228

Here's the inlining log for my benchmark of Map#get

info]                                 @ 8   scala.collection.mutable.HashMap::get (29 bytes)   inline (hot)
[info]                                   @ 2   scala.collection.mutable.HashMap::findEntry (6 bytes)   inline (hot)
[info]                                     @ 2   scala.collection.mutable.HashTable::findEntry$ (6 bytes)   inline (hot)
[info]                                       @ 2   scala.collection.mutable.HashTable::findEntry (19 bytes)   inline (hot)
[info]                                         @ 5   scala.collection.mutable.HashMap::elemHashCode (6 bytes)   inline (hot)
[info]                                          \-> TypeProfile (577289/577289 counts) = scala/collection/mutable/HashMap
[info]                                           @ 2   scala.collection.mutable.HashTable$HashUtils::elemHashCode$ (6 bytes)   inline (hot)
[info]                                             @ 2   scala.collection.mutable.HashTable$HashUtils::elemHashCode (5 bytes)   inline (hot)
[info]                                               @ 1   scala.runtime.Statics::anyHash (65 bytes)   inline (hot)
[info]                                                 @ 61   java.lang.Integer::hashCode (8 bytes)   inline (hot)
[info]                                                   @ 4   java.lang.Integer::hashCode (2 bytes)   inlining too deep
[info]                                         @ 10   scala.collection.mutable.HashMap::index (6 bytes)   inline (hot)
[info]                                           @ 2   scala.collection.mutable.HashTable::index$ (6 bytes)   inline (hot)
[info]                                             @ 2   scala.collection.mutable.HashTable::index (34 bytes)   inline (hot)
[info]                                               @ 1   scala.collection.mutable.HashMap::table (5 bytes)   accessor
[info]                                               @ 13   scala.collection.mutable.HashMap::seedvalue (5 bytes)   accessor
[info]                                               @ 18   scala.collection.mutable.HashMap::improve (7 bytes)   inline (hot)
[info]                                                 @ 3   scala.collection.mutable.HashTable$HashUtils::improve$ (7 bytes)   inline (hot)
[info]                                                   @ 3   scala.collection.mutable.HashTable$HashUtils::improve (27 bytes)   inlining too deep
[info]                                               @ 26   java.lang.Integer::bitCount (49 bytes)   (intrinsic)
[info]                                         @ 15   scala.collection.mutable.HashTable::findEntry0 (44 bytes)   inline (hot)
[info]                                           @ 1   scala.collection.mutable.HashMap::table (5 bytes)   accessor
[info]                                           @ 15   scala.collection.mutable.DefaultEntry::key (5 bytes)   accessor
[info]                                            \-> TypeProfile (38245/38245 counts) = scala/collection/mutable/DefaultEntry
[info]                                           @ 21   scala.collection.mutable.HashMap::elemEquals (7 bytes)   inline (hot)
[info]                                             @ 3   scala.collection.mutable.HashTable::elemEquals$ (7 bytes)   inline (hot)
[info]                                               @ 3   scala.collection.mutable.HashTable::elemEquals (12 bytes)   inline (hot)
[info]                                                 @ 2   scala.runtime.BoxesRunTime::equals (13 bytes)   inline (hot)
[info]                                                   @ 9   scala.runtime.BoxesRunTime::equals2 (52 bytes)   too big
[info]                                           @ 30   scala.collection.mutable.DefaultEntry::next (5 bytes)   inline (hot)
[info]                                             @ 1   scala.collection.mutable.DefaultEntry::next (5 bytes)   accessor
[info]                                   @ 22   scala.collection.mutable.DefaultEntry::value (5 bytes)   accessor
[info]                                   @ 25   scala.Some::<init> (10 bytes)   inline (hot)
[info]                                     @ 6   scala.Option::<init> (9 bytes)   inline (hot)
[info]                                       @ 1   java.lang.Object::<init> (1 bytes)   inline (hot)
[info]                                       @ 5   scala.Product::$init$ (1 bytes)   inline (hot)

Hot calls into higher level methods (like groupBy) can run into the inliing depth limit at a slightly different spot.

That HotSpot limit is known to be a bit on the low side for historical reasons. The implementation of "lambda forms" in the java.lang.invoke implementation ran into it, but rather than raising it added special cases into HotSpot for lambda form frames.

@scabug

This comment has been minimized.

Copy link
Author

scabug commented Nov 14, 2016

@paplorinc said (edited on Nov 14, 2016 8:20:23 PM UTC):
Thanks @Ichoran. I've tried your pattern matching unrolling, but for some reason it only slowed down my test (30k vs 32k ops/s):

public static class Test extends Base {
    scala.collection.mutable.HashMap<Object, Object> map = HashMap$.MODULE$.empty();

    @State(Scope.Thread)
    public class Initialized {
        @Setup(Level.Invocation)
        public void initializeMutable(Base state) {
            for (int i = 0; i < state.CONTAINER_SIZE; i++) {
                map.put(i, i);
            }
        }

        @TearDown(Level.Invocation)
        public void tearDown() {
            map.clear();
        }
    }

    @Benchmark
    public void scala_persistent_some(Blackhole bh) {
        for (int i = 0; i < CONTAINER_SIZE; i++) {
            Object value = map.getOrElseUpdate(i, () -> 0);
            bh.consume(value);
        }
        assert map.size() == CONTAINER_SIZE;
    }
}

Note: I prefer testing Scala speed from Java, as the bugs we're testing might affect the outcome if we're self-diagnosing.

@scabug

This comment has been minimized.

Copy link
Author

scabug commented Nov 14, 2016

@paplorinc said (edited on Nov 14, 2016 8:20:02 PM UTC):
@retronym, thanks for sharing your benchmarks.
It seems you were able to reproduce the get slowdown.
I usually prefer exhaustive benchmarks (i.e. one, which consumes all elements), otherwise weird biases may arise (e.g. you're not testing not-found elements, or same-bucket ones - needed, especially since the equals2 appeared as hotspot also).

A viable fix would be to change getOrElseUpdate (which has a weird name ... if the map doesn't have the element, it's not an update anymore, it's a put) to calculate the index only once, and don't check for availability the second time - but this may need some interface changes also.
I would gladly create a PR for it, but I may need some guidance :)

@scabug

This comment has been minimized.

Copy link
Author

scabug commented Nov 14, 2016

@Ichoran said:
@retronym - It's rather fiddly. I keep losing the effect when I try to generalize tests because the JVM doesn't do a very robust job of fully optimizing the 2.11 code either.

However, the best I've got so far is at https://gist.github.com/Ichoran/94b4698905b905934dfd644f6cef7b4b

This is using Thyme (now up in a snapshot for 2.11 and 2.12) so I can iterate quickly on ideas. (Note--some machines may need the Thyme.warmed(0.03) to be changed to Thyme.warmed(0.05); the number is a target stability, and some machines never meet the smaller number.) Given the relative lack of stability of results, though, it may be better to switch to a JMH-based test suite. On the other hand, given the lack of stability I am uncertain how microbenchmarking results are going to translate into real-world performance.

@scabug

This comment has been minimized.

Copy link
Author

scabug commented Nov 14, 2016

@paplorinc said (edited on Nov 14, 2016 8:21:41 PM UTC):
I have created a PR with a fix and would really appreciate your feedback, @retronym, @Ichoran!

While the underlying problem has rather been avoided (i.e. why is HashMap.get slower in the first place and why aren't default methods inlined properly, etc), it does seem to resolve the issue of slow groupBy s.

It's my first Scala PR, all comments are very welcome! :)

@scabug

This comment has been minimized.

Copy link
Author

scabug commented Nov 14, 2016

@paplorinc said:
The build passed: scala/scala#5528

@scabug

This comment has been minimized.

Copy link
Author

scabug commented Nov 15, 2016

@retronym said:
I'm toying with a patch to OpenJDK (https://gist.github.com/retronym/d27090a68570485c3e329a52db0c7b25) that will unconditionally inline these forwarder methods that we emit without eating into the inliing depth budget.

@scabug

This comment has been minimized.

Copy link
Author

scabug commented Nov 15, 2016

@paplorinc said:
Wow, interesting :)
So every forwarder method (where the params would equal the parent method's params) would increase the max inline depth?

@scabug scabug closed this Nov 22, 2016

@scabug

This comment has been minimized.

Copy link
Author

scabug commented Nov 24, 2016

@paplorinc said (edited on Nov 24, 2016 10:04:45 AM UTC):
I could port this to 2.11.9 and 2.13.0 once scala/scala#5537 is merged :)

@scabug

This comment has been minimized.

Copy link
Author

scabug commented Nov 30, 2016

@retronym said:
Yes, a backport to 2.11.x would be welcome. No need to forward port to 2.13.x, we periodically merge in that direction. Please label your commit(s) in the backport with [backport] (in the commit title) so we know to exclude them in the next 2.11.x => 2.12.x merge.

@scabug

This comment has been minimized.

Copy link
Author

scabug commented Nov 30, 2016

@paplorinc said (edited on Nov 30, 2016 10:02:53 AM UTC):
Forward port was recommended by @Ichoran you (scala/scala#5528 (comment)) as it may include visibility changes to the API, i.e. no need to reimplement methods from HashTable.
FYI: the build failed with NPE for my commit after merge, is it a randomly failing test, or a real one? (https://scala-ci.typesafe.com/job/scala-2.12.x-validate-test/3717/console)

@scabug

This comment has been minimized.

Copy link
Author

scabug commented Nov 30, 2016

@retronym said:
Yes, please open an issue for that one. Even if the status quo wins the day, it will be a useful record of the discussion.

@scabug

This comment has been minimized.

Copy link
Author

scabug commented Nov 30, 2016

@paplorinc said:
Done, thanks: #10084

Also, I noticed that the hashing does a manual bit rotation, changed that in a separate PR also: scala/scala#5569

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.