SD-232 Recycle classloaders to be anti-hostile to JIT #2754

Merged
merged 2 commits into sbt:0.13.13 on Sep 26, 2016

Conversation

retronym (Member) commented Sep 26, 2016

The compiler interface subclasses `scala.tools.nsc.Global`,
and loading this new subclass before each `compile` task forces
the HotSpot JIT to deoptimize large swathes of compiled code. It's
a bit like SBT has rigged the dice to always descend the longest
ladder in a game of Snakes and Ladders.

The slowdown seems to be larger with Scala 2.12. There are a number
of variables at play, but I think the main factor here is that
we now rely on JIT to devirtualize calls to final methods in traits
whereas we used to emit static calls. JIT does a good job at this,
so long as classloading doesn't undo that good work.

This commit extends the existing `ClassLoaderCache` to encompass
the classloader that includes the compiler interface JAR. I've
resorted to adding a var to `AnalyzingCompiler` to inject the
dependency, to get the cache to the spot I need it without
binary-incompatible changes to the intervening method signatures.

Here are a few numbers showing the speedup in peak (warmed up) performance:

| Codebase | Scala Version | Direct | SBT 0.13.12 | This PR |
| --- | --- | --- | --- | --- |
| src/compiler | 2.12.0-SNAPSHOT | 19s | 32s (1.68x) | 24s (1.26x) |
| scalap | 2.11.8 | 1.19s | 1.72s (1.45x) | 1.39s (1.17x) |
| scalap | 2.12.0-SNAPSHOT | 1.02s | 1.95s (1.91x) | 1.20s (1.17x) |
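The caching idea can be sketched as follows. This is an illustrative, self-contained model, not sbt's actual `ClassLoaderCache` API: a classloader keyed by its classpath is built once and reused under a `SoftReference`, so the compiler interface's `Global` subclass is loaded exactly once per classpath rather than once per compile task.

```scala
import java.io.File
import java.lang.ref.SoftReference
import java.net.URLClassLoader
import scala.collection.mutable

// Hypothetical sketch of a classpath-keyed classloader cache.
// SoftReference lets the JVM reclaim a cached loader under memory
// pressure; a cleared entry is simply rebuilt on next use.
object InterfaceLoaderCache {
  private val cache = mutable.Map.empty[List[File], SoftReference[ClassLoader]]

  def apply(classpath: List[File], parent: ClassLoader): ClassLoader =
    cache.synchronized {
      cache.get(classpath).flatMap(ref => Option(ref.get)) match {
        case Some(loader) =>
          loader // cache hit: no new Global subclass is loaded, JIT state survives
        case None =>
          val loader = new URLClassLoader(classpath.map(_.toURI.toURL).toArray, parent)
          cache.update(classpath, new SoftReference[ClassLoader](loader))
          loader
      }
    }
}
```

Reusing the same loader for the same classpath is what keeps HotSpot's profile-driven optimizations valid across compile tasks.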
@retronym retronym changed the title from SD-232 Recycle classloaders to be anti-hostile to JIT. to SD-232 Recycle classloaders to be anti-hostile to JIT Sep 26, 2016

retronym (Member) commented Sep 26, 2016

Note that the 2.12 numbers are based on a branch that includes scala/scala#5417, a performance regression in Scala 2.12's code generation.

@dwijnand

I'm not sure what game of Snakes and Ladders you played, because you're meant to ascend with the ladders and descend with the snakes.

I'm a little bit on the fence here about whether we should do the necessary song-and-dance to burrow State's classLoaderCache to AnalyzingCompiler.

So I'll LGTM, and see what @eed3si9n says.

We need some notes here, both to flash how much better it performs and to document `sbt.disable.interface.classloader.cache`.

retronym (Member) commented Sep 26, 2016

Must have been Co-Snakes-and-ladders.

retronym (Member) commented Sep 26, 2016

I just tried this out on SBT itself. Steady-state performance of compile (measured by running `;clean;update`, then `compile` repeatedly in the shell) improved a little: 34s -> 31s.

dwijnand (Member) commented Sep 26, 2016

@retronym I removed the var and setter in dwijnand@b6a4789.

Have a look and see if you want to cherry-pick it.

smarter (Contributor) commented Sep 26, 2016

> The compiler interface subclasses scala.tools.nsc.Global,
> and loading this new subclass before each compile task forces
> HotSpot JIT to deoptimize large swathes of compiled code

Interesting, do you mean that it's deoptimizing code unrelated to scala.tools.nsc.Global? Do you know why?

Also note to self: I should probably imitate that caching in https://github.com/lampepfl/dotty/blob/master/bridge/src/main/scala/xsbt/CompilerClassLoader.scala until I can port this to sbt proper.

retronym (Member) commented Sep 26, 2016

@smarter Take a look at the diagnostic flags and output in scala/scala-dev#232 for the trail of clues.

In general, loading classes can deoptimize speculative optimizations based on Class Hierarchy Analysis. Similarly, loading and using a new `Global` subclass can invalidate devirtualized call sites (a new receiver class appears, which invalidates the assumption that the call site was effectively mono- or bi-morphic). I think that `checkcast` or `instanceof` checks could also be affected.

Here's the JDK code that prints out the deoptimization messages: http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/share/vm/runtime/deoptimization.cpp
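To make the mono/bi-morphic point concrete, here is a toy, self-contained model; the names (`GlobalLike`, `FirstInterface`, etc.) are invented for illustration and stand in for `Global` and the per-version compiler interface subclasses. While the JIT has only ever observed one receiver class at the hot call site in `driver`, it can devirtualize and inline the call; loading and running a second subclass invalidates that speculation. Running this under `-XX:+UnlockDiagnosticVMOptions -XX:+TraceDeoptimization` should show the corresponding deoptimization messages.

```scala
// Toy model of the sbt situation: each freshly loaded compiler-interface
// JAR is a "new subclass of Global" from the JIT's point of view.
object DeoptSketch {
  abstract class GlobalLike { def compile(): Int }
  final class FirstInterface extends GlobalLike { def compile(): Int = 1 }
  // Using a second subclass at the call site below breaks the
  // monomorphic assumption the JIT made while only FirstInterface was seen.
  final class SecondInterface extends GlobalLike { def compile(): Int = 2 }

  // Hot virtual call site: HotSpot profiles the receiver types seen here
  // and specializes the compiled code accordingly.
  def driver(g: GlobalLike, n: Int): Int = {
    var sum = 0
    var i = 0
    while (i < n) { sum += g.compile(); i += 1 }
    sum
  }
}
```

Caching the interface classloader means the set of `Global` subclasses the JIT observes stays fixed, so the specialized code is never thrown away.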

retronym (Member) commented Sep 26, 2016

A theory for why the improvement in peak performance in SBT's own build was only modest: when multiple subprojects compiled in parallel, the single copy of `Global` (loaded in a shared classloader) was immediately seen to have numerous active subclasses, so call sites didn't "overfit" to an assumption that would soon be invalidated. One could test this hypothesis by disabling parallel compilation of subprojects.

@eed3si9n

Overall the change looks good, but I'm a bit concerned about how this SoftReference behaves alongside several other uses of SoftReference, like the Analysis cache in a many-subproject build.
If we're putting this into 0.13.13, then the RC should be bumped back.

retronym (Member) commented Sep 26, 2016

@eed3si9n I've run a test under this patch in which I constrained the heap and metaspace, and then compiled a project with a sequence of different Scala versions within a single SBT session.

https://gist.github.com/retronym/d621926d6f9ea9d3c3c992ac44fa87a7

eed3si9n (Member) commented Sep 26, 2016

@retronym Thanks for the additional testing.

@eed3si9n eed3si9n merged commit b4bc9f1 into sbt:0.13.13 Sep 26, 2016

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

@dwijnand dwijnand referenced this pull request Oct 25, 2016

Closed

notes on possible 2.12 release notes improvements #202

retronym (Member) commented Nov 8, 2016

I've just verified this again on the scala/scala 2.12.x branch now that we've upgraded to SBT 0.13.13.

Below are the times to perform a clean compile of our sources as the JVM warms up, with the classloader caching in this PR disabled then enabled.

```
⚡ for b in true false; do echo "disabled=$b"; (((for i in {1..5}; do printf ';clean;update\ncompile\n'; done) | sbt -Dsbt.disable.interface.classloader.cache=$b) 2>&1) | egrep 'Total time: \d\d' ; done
disabled=true
[success] Total time: 131 s, completed 08/11/2016 11:47:50 AM
[success] Total time: 107 s, completed 08/11/2016 11:49:39 AM
[success] Total time: 105 s, completed 08/11/2016 11:51:25 AM
[success] Total time: 105 s, completed 08/11/2016 11:53:11 AM
[success] Total time: 105 s, completed 08/11/2016 11:54:57 AM
disabled=false
[success] Total time: 118 s, completed 08/11/2016 11:57:04 AM
[success] Total time: 88 s, completed 08/11/2016 11:58:32 AM
[success] Total time: 85 s, completed 08/11/2016 11:59:59 AM
[success] Total time: 86 s, completed 08/11/2016 12:01:26 PM
[success] Total time: 83 s, completed 08/11/2016 12:02:50 PM
```
retronym (Member) commented Nov 8, 2016

I've repeated the experiment on our 2.11.x branch (with one local change to bump from SBT 0.13.12 to 0.13.13). Similar numbers:

```
⚡ for b in true false; do echo "disabled=$b"; (((for i in {1..4}; do printf ';clean;update\ncompile\n'; done) | sbt -Dsbt.disable.interface.classloader.cache=$b) 2>&1) | egrep 'Total time: \d\d' ; done
disabled=true
[success] Total time: 138 s, completed 08/11/2016 12:08:32 PM
[success] Total time: 106 s, completed 08/11/2016 12:10:20 PM
[success] Total time: 102 s, completed 08/11/2016 12:12:04 PM
[success] Total time: 102 s, completed 08/11/2016 12:13:48 PM
disabled=false
[success] Total time: 118 s, completed 08/11/2016 12:15:55 PM
[success] Total time: 86 s, completed 08/11/2016 12:17:23 PM
[success] Total time: 84 s, completed 08/11/2016 12:18:48 PM
[success] Total time: 82 s, completed 08/11/2016 12:20:12 PM
```

That's good news; I wasn't sure whether the gain would be the same using 2.11 or 2.12 as the compiler.

eed3si9n (Member) commented Nov 8, 2016

Nice

retronym (Member) commented Nov 8, 2016

@eed3si9n Do you know if/how this problem relates to Maven/Gradle/IntelliJ using zinc without SBT?

eed3si9n (Member) commented Nov 8, 2016

Given that this setting is passed in via Defaults.scala, I would assume that none of the other build tools would be aware of this change unless we publish typesafehub/zinc and let them know specifically about it.

retronym (Member) commented Nov 8, 2016

@eed3si9n Okay, tracking that task as typesafehub/zinc#105

stuhood (Contributor) commented Nov 8, 2016

https://github.com/pantsbuild/pants will happily pick this up assuming it is available in https://github.com/sbt/zinc, which we consume directly.

Thanks!

pedrofurla commented Nov 11, 2016

Some interesting results from one of the Scala projects at x.ai:

The code base has 209 files, 70k LoC (excluding blanks and comments).

There are 174 traits, 193 classes (93 are implicit classes), 514 case classes, and 1029 objects.

There is a small abuse of Shapeless that produces coproducts for most of our domain objects; when fully compiled it adds a significant slowdown. The same can be said of our centralized serialization source.

On SBT 0.13.12

First try in a fresh SBT session, after `;clean; test:clean; update`:

```
sbt> test:compile
...
[success] Total time: 255 s, completed Nov 10, 2016 4:22:17 PM
```

Same SBT session, after another `;clean; test:clean; update`:

```
sbt> test:compile
...
[success] Total time: 245 s, completed Nov 10, 2016 7:50:02 PM
```

On SBT 0.13.13

First try in a fresh SBT session, after `;clean; test:clean; update`:

```
sbt> test:compile
...
[success] Total time: 228 s, completed Nov 10, 2016 4:01:28 PM
```

Same SBT session, after another `;clean; test:clean; update`:

```
sbt> test:compile
...
[success] Total time: 185 s, completed Nov 10, 2016 4:04:39 PM
```
retronym (Member) commented Nov 11, 2016

Thanks for sharing those numbers, @pedrofurla!

adriaanm (Contributor) commented Apr 14, 2017

Related PR in dotty's compiler bridge: lampepfl/dotty#2262
Is this something to adopt in the Scala 2.x bridges as well?

smarter (Contributor) commented Apr 14, 2017

@adriaanm I don't think there's anything that needs to be done in the scalac bridge. The dotty bridge needs to create its own classloader (because the sbt phases are part of the compiler and not part of the bridge, and the classloader constructed by sbt doesn't support that), but the scalac bridge just uses the classloader that should already be cached by sbt thanks to this PR.

smarter (Contributor) commented Apr 14, 2017

> (because the sbt phases are part of the compiler and not part of the bridge, and the classloader constructed by sbt doesn't support that)

For the record, see https://github.com/lampepfl/dotty/blob/9e45ad16d012e6a2ff3be411c2fe101b1c74b831/sbt-bridge/src/xsbt/CompilerClassLoader.scala#L45-L72 for the gory details.

adriaanm (Contributor) commented Apr 14, 2017

Ok, thanks for explaining!

@retronym retronym referenced this pull request Dec 10, 2017

Merged

Add bench/ #1181

smarter added a commit to smarter/mill that referenced this pull request Jul 27, 2018

Make hot compilation 2x faster by properly reusing classloaders
So far, Mill was caching ScalaInstance which contains a classloader but
this is not enough: Zinc creates its own classloader by combining the
ScalaInstance classloader with the path to the compiler-bridge. Zinc
takes care of caching this classloader in each instance of
ScalaCompiler.

We can take advantage of this by caching an instance of Compilers since
it contains everything we want to cache (ScalaCompiler and
ScalaInstance).

This significantly reduces the amount of classloading that happens
at each compilation step, as measured by running:

```
export _JAVA_OPTIONS="-XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading -XX:+TraceClassUnloading"
mill -i foo.compile
```

Then looking at the output of `out/mill-worker-1/logs` while adding a
new line in a source file in `foo` and running `mill -i foo.compile` again.

The speedup is going to depend on the project, but I measured a ~2x
improvement when running on the Mill build `time(core.compile())` and
`rm -rf out/core/compile/` in a loop until results stabilized. See also
the results that were obtained when a very similar issue was fixed in
sbt itself: sbt/sbt#2754

lihaoyi added a commit to lihaoyi/mill that referenced this pull request Jul 27, 2018

Make hot compilation 2x faster by properly reusing classloaders (#393)