sbt 1.x Performance improvement targets #4012

Open

jrudolph opened this issue Mar 14, 2018 · 11 comments

jrudolph (Member) commented Mar 14, 2018

Over the last few weeks I spent a bit of time analyzing the performance of starting an already fully cached sbt build (i.e. timing time sbt exit && time sbt exit && time sbt exit and ignoring the first result). I probably won't have time to act on these findings myself, but I wanted to dump them here for anyone who wants to help improve performance.

  • Calculating aggregations is slow. Aggregations are cached at the beginning of the run: for each key, the settings structure is queried for "what is the value of aggregate in scope X?". This is somewhat slow because aggregate usually falls back to the default value in the global scope, and to find that value the whole delegation chain usually has to be walked.

  • Validating key references is slow. Before evaluating settings, sbt checks that the settings dependency structure is valid. This is slow because it uses generic abstractions like AList and creates lots of garbage. I haven't checked, but I wonder whether it is necessary to check dependency chains before settings evaluation at all, or whether dependency problems could be collected while evaluating the settings tree.

  • Parallel settings evaluation is ineffective. Settings are evaluated concurrently using the INode setup. In my tests, I didn't observe any notable parallelism, even in cases with at least two slow tasks that didn't seem to depend on each other. I haven't looked into it deeply, but I suspect a few issues with parallel settings evaluation:

    • Task split-up is much too fine-grained. Often there are only tiny chunks of code to be evaluated. Each task submission for parallel execution has a cost, and I suspect that for small tasks this cost outweighs the potential benefit of parallelism. There often seem to be linear dependency chains that would probably execute faster sequentially on one thread.
    • Task scheduling is suboptimal. Scheduling is a very hard problem, but there might be clever ways to schedule things if you know enough about the structure of the settings dependencies.
    • Settings evaluation seems like a good candidate for a fork-join pool. The current thread pool executor might introduce additional latency and probably schedules work less well than you could by using a fork-join pool directly (see the sketch after this list).

    I had a try at removing parallel execution completely, which seemed to work and, at least in my limited testing, wasn't slower. However, it wasn't quite correct, as some builds (cinnamon) started to fail with unresolvable setting dependencies. (I thought that after the topological sort, dependencies would always have been executed before their dependees, but that assumption might be wrong, e.g. for Bind nodes.) Parallel setting execution might still be worth it on some builds, though.

  • splitExpressions during parsing of .sbt files is slow because a Scala compiler has to be initialized just for parsing. I played around with caching the results of splitExpressions in jrudolph@b7a9f74, which seemed to work quite well (I didn't figure out how to use sbt's caching infrastructure correctly, though).

  • As already noted in Performance observations (sbt 1.0 startup performance) #3694, a main performance problem is the (de)serialization of update task caches (which are accessed to build the meta-project and the project itself). There are several issues:

    • Tracked.lastOutput isn't well suited to the task at hand: after the data has been read, the cache-reading code runs markAsCached to flag that the data came from the cache, and then the data that was just deserialized is written back to the cache again (see jrudolph@3836732 for a rather manual attempt to fix that).
    • sjson-new deserialization is very slow. Reading the ~5MB caches for the akka project (and again for the meta-project) takes about 1 second on my machine. The main reason for the slowness is that data is copied around many times, because a facade is put in front of scala-json and data is shuffled back and forth through it. I attempted to improve the performance in eed3si9n/sjson-new@c248ba0, but there is still lots of room for improvement. Improving that cache will likely also benefit the update task when run in the project itself. (IMO, using a binary format like protobuf for the update cache makes the most sense, as it will likely be faster than any JSON representation.)
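
To make the fork-join idea above concrete, here is a minimal, hypothetical sketch (none of this is sbt's actual INode machinery; Node, EvalTask and the example graph are made up for illustration) of evaluating a small dependency graph on a ForkJoinPool, where join() participates in work stealing instead of parking a pool thread:

```scala
import java.util.concurrent.{ForkJoinPool, RecursiveTask}

// Hypothetical stand-in for a setting node: a value computed from its dependency values.
final case class Node(name: String, deps: List[Node], compute: List[Int] => Int)

// Each node becomes a RecursiveTask; join() lets the worker steal other work instead of blocking.
final class EvalTask(node: Node) extends RecursiveTask[Int] {
  override def compute(): Int = {
    val depTasks = node.deps.map(new EvalTask(_))
    depTasks.foreach(_.fork())              // schedule dependencies in parallel
    val depValues = depTasks.map(_.join())  // join without blocking the pool thread
    node.compute(depValues)
  }
}

object ForkJoinSettingsDemo extends App {
  val a = Node("a", Nil, _ => 1)
  val b = Node("b", Nil, _ => 2)
  val c = Node("c", List(a, b), _.sum)

  val pool = new ForkJoinPool()
  println(pool.invoke(new EvalTask(c))) // prints 3
  pool.shutdown()
}
```

A real implementation would additionally need to memoize shared nodes so diamond dependencies are evaluated only once, and would batch tiny nodes to avoid the fine-grained-task problem described above.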

I attached a zip with flame graphs captured with async-profiler: one including GC and JIT compilation threads, and one with only the Java threads.

flame-graphs.zip

lihaoyi (Contributor) commented May 28, 2018

Some learnings from Mill:

> splitExpressions during parsing .sbt files is slow because a Scala compiler has to be initialized for parsing. I played around with caching the results of splitExpressions in jrudolph/xsbt@b7a9f74 which seemed to work quite well (I didn't figure out how to use sbt's caching infrastructure correctly, though).

Mill/Ammonite use ScalaParse to split expressions in the top-level script file, but even that takes a while to initialize the first time, purely due to classloading, so Mill caches the output of splitExpressions. It seems like it shouldn't be hard to do, since it's just a pure function String => Seq[String]?
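
A minimal sketch of what such a cache could look like, assuming the splitter is (or can be wrapped as) a pure String => Seq[String] function; SplitCache, cachedSplit and the SHA-1 keying are made up for illustration, not sbt's or Mill's actual code:

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest
import scala.collection.concurrent.TrieMap

// Hypothetical cache around an expensive but pure splitter (String => Seq[String]).
object SplitCache {
  private val cache = TrieMap.empty[String, Seq[String]]

  private def sha1(s: String): String =
    MessageDigest.getInstance("SHA-1")
      .digest(s.getBytes(StandardCharsets.UTF_8))
      .map("%02x".format(_))
      .mkString

  // Only runs the real splitter on a cache miss, keyed by the file's content hash.
  def cachedSplit(content: String)(split: String => Seq[String]): Seq[String] =
    cache.getOrElseUpdate(sha1(content), split(content))
}
```

A persistent variant would write the map to disk keyed by the content hash, so that warm startups never have to initialize the splitter (or a compiler) at all.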

> sjson-new deserialization is super slow. Reading the ~5MB caches for the akka project (and again for the meta project) takes about 1 second on my machine

Would swapping in a fast JSON serializer be a solution here? Mill uses uPickle, which in my arbitrary benchmarks (which are more complex/branchy/intricate than typical cache JSON, which is full of long strings) does 65-70 MB/s on my MacBook by default, and 85-90 MB/s if you cache the implicitly constructed serializer objects (which Mill does when hot). This is for java.lang.String <-> case class conversion, and read and write are similar speeds. That's fast enough that Mill's JSON handling basically doesn't show up in the profiles at all and is dwarfed by things like stat-ing files or reading the individual JSON cache files off disk.
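
For reference, a minimal uPickle sketch of the "cache the derived serializer" pattern mentioned above; CacheEntry is a made-up stand-in for whatever sbt would actually serialize, not the real UpdateReport shape:

```scala
import upickle.default._

// Made-up cache entry shape, for illustration only.
final case class CacheEntry(organization: String, name: String, revision: String, artifacts: Seq[String])

object CacheCodec {
  // Materialize the ReadWriter once and reuse it, instead of re-deriving it at every call site.
  implicit val rw: ReadWriter[CacheEntry] = macroRW

  def save(entry: CacheEntry): String = write(entry)
  def load(json: String): CacheEntry  = read[CacheEntry](json)
}
```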

If you don't want to use uPickle, Circe's performance is within a factor of 2 (~60 MB/s?), and Play-Json within a factor of 3-4 (~30 MB/s?), which may be enough to make serialization disappear from your profiles.


One thing that Mill faced, and that your profiles show SBT facing too, is classloading time: according to akka-sbt-1.1.1-original.svg, your cold SBT startup is spending more than half its time in the C2 JIT compiler! Mill is also mostly classloading bound, and most of its 2s cold startup is classloading.

There's not much to do here except aggressively cache things so that cached startups don't need to load as many classes. For example, classloading scala.tools.nsc easily takes an additional 1-2 seconds (even without running anything!), and Mill/Ammonite go to great pains to make sure that when things are cached, scala.tools.nsc is not classloaded at all (we have a unit test for this!).
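
A rough, hypothetical helper for seeing how much classloading a given code path forces, using the standard ClassLoadingMXBean (it only gives counts, not which classes were loaded):

```scala
import java.lang.management.ManagementFactory

// How many classes does a block of code force the JVM to load?
object ClassLoadCount {
  def countLoaded[A](label: String)(body: => A): A = {
    val bean   = ManagementFactory.getClassLoadingMXBean
    val before = bean.getTotalLoadedClassCount
    val result = body
    println(s"$label loaded ${bean.getTotalLoadedClassCount - before} classes")
    result
  }
}
```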

Notably, heavy dependencies like Scalaz or Cats work against you in fighting classloading time: even if you don't run much code, just touching the library in a few places is enough to force the bulk of the classloading to take place. That's one reason why, in Ammonite/Mill, I have been very aggressive about culling libraries with large transitive dependency graphs in favor of zero-dependency libraries like uPickle.

eed3si9n (Member) commented

@lihaoyi Thanks for the tips! These are all useful points to consider.
I think swapping out the serialization should be on the table, especially for the update task, which caches the entire UpdateReport.

retronym (Member) commented Jun 1, 2018

With regard to cold performance (of both sbt and scalac), it is worth thinking about how to integrate Application Class Data Sharing, now free in OpenJDK, to shave a few hundred milliseconds off startup time. I'm also keenly waiting for JWarmup, to persist JIT profiles from a previous run and speed up warmup.

dwijnand (Member) commented Jun 1, 2018

I had a little go with AppCDS and it didn't seem to change anything out of the box:

08:33:18 $ JENV_VERSION=10 time sbtx exit
Can't parse java version from: 10
Java HotSpot(TM) 64-Bit Server VM warning: Ignoring option MaxPermSize; support was removed in 8.0
[info] Loading settings from idea.sbt,global-plugins.sbt ...
[info] Loading global plugins from /Users/dnw/.dotfiles/.sbt/1.0/plugins
[info] Loading settings from plugins.sbt ...
[info] Loading project definition from /s/t-sbt-AppCDS/project
[info] Loading settings from build.sbt ...
[info] Set current project to t (in build file:/s/t-sbt-AppCDS/)
        7.77 real        25.08 user         1.10 sys

08:33:36 $ JENV_VERSION=10 time sbtx exit
Can't parse java version from: 10
Java HotSpot(TM) 64-Bit Server VM warning: Ignoring option MaxPermSize; support was removed in 8.0
[info] Loading settings from idea.sbt,global-plugins.sbt ...
[info] Loading global plugins from /Users/dnw/.dotfiles/.sbt/1.0/plugins
[info] Loading settings from plugins.sbt ...
[info] Loading project definition from /s/t-sbt-AppCDS/project
[info] Loading settings from build.sbt ...
[info] Set current project to t (in build file:/s/t-sbt-AppCDS/)
        7.70 real        24.19 user         1.11 sys

08:33:45 $ JENV_VERSION=10 time sbtx -J-XX:+UnlockCommercialFeatures -J-XX:+UseAppCDS exit
Can't parse java version from: 10
Java HotSpot(TM) 64-Bit Server VM warning: Ignoring option MaxPermSize; support was removed in 8.0
[info] Loading settings from idea.sbt,global-plugins.sbt ...
[info] Loading global plugins from /Users/dnw/.dotfiles/.sbt/1.0/plugins
[info] Loading settings from plugins.sbt ...
[info] Loading project definition from /s/t-sbt-AppCDS/project
[info] Loading settings from build.sbt ...
[info] Set current project to t (in build file:/s/t-sbt-AppCDS/)
        7.75 real        24.97 user         1.10 sys

08:33:55 $ JENV_VERSION=10 time sbtx -J-XX:+UnlockCommercialFeatures -J-XX:+UseAppCDS exit
Can't parse java version from: 10
Java HotSpot(TM) 64-Bit Server VM warning: Ignoring option MaxPermSize; support was removed in 8.0
[info] Loading settings from idea.sbt,global-plugins.sbt ...
[info] Loading global plugins from /Users/dnw/.dotfiles/.sbt/1.0/plugins
[info] Loading settings from plugins.sbt ...
[info] Loading project definition from /s/t-sbt-AppCDS/project
[info] Loading settings from build.sbt ...
[info] Set current project to t (in build file:/s/t-sbt-AppCDS/)
        8.10 real        25.24 user         1.20 sys

dwijnand (Member) commented Jun 1, 2018

More experiments, little progress:

22:08:43 $ JENV_VERSION=10 time sbtx -J-XX:+UnlockCommercialFeatures -J-XX:+UseAppCDS -J-Xshare:off -J-XX:DumpLoadedClassList=sbt.lst exit
Can't parse java version from: 10
Java HotSpot(TM) 64-Bit Server VM warning: Ignoring option MaxPermSize; support was removed in 8.0
skip writing class com/sun/proxy/$Proxy0 from source __JVM_DefineClass__ to classlist file
skip writing class com/sun/proxy/$Proxy4 from source __JVM_DefineClass__ to classlist file
skip writing class com/sun/proxy/$Proxy5 from source __JVM_DefineClass__ to classlist file
skip writing class com/sun/proxy/$Proxy8 from source __JVM_DefineClass__ to classlist file
skip writing class java/lang/invoke/BoundMethodHandle$Species_LLLLLLLLI from source __JVM_DefineClass__ to classlist file
skip writing class java/lang/invoke/BoundMethodHandle$Species_LLLLLLLLII from source __JVM_DefineClass__ to classlist file
skip writing class java/lang/invoke/BoundMethodHandle$Species_LLLLLLLLIIL from source __JVM_DefineClass__ to classlist file
skip writing class java/lang/invoke/BoundMethodHandle$Species_LLLLLLLLIILL from source __JVM_DefineClass__ to classlist file
skip writing class com/sun/proxy/jdk/proxy1/$Proxy10 from source __JVM_DefineClass__ to classlist file
skip writing class com/sun/proxy/jdk/proxy1/$Proxy11 from source __JVM_DefineClass__ to classlist file
[info] Loading settings from idea.sbt,global-plugins.sbt ...
[info] Loading global plugins from /Users/dnw/.dotfiles/.sbt/1.0/plugins
skip writing class java/lang/invoke/BoundMethodHandle$Species_LII from source __JVM_DefineClass__ to classlist file
skip writing class java/lang/invoke/BoundMethodHandle$Species_LIIL from source __JVM_DefineClass__ to classlist file
skip writing class java/lang/invoke/BoundMethodHandle$Species_LIILL from source __JVM_DefineClass__ to classlist file
skip writing class java/lang/invoke/BoundMethodHandle$Species_LIILLL from source __JVM_DefineClass__ to classlist file
[info] Loading settings from plugins.sbt ...
[info] Loading project definition from /s/t-foo/project
[info] Loading settings from build.sbt ...
[info] Set current project to t (in build file:/s/t-foo/)
22:08:55 $ JENV_VERSION=10 time sbtx -J-XX:+UnlockCommercialFeatures -J-XX:+UseAppCDS -J-Xshare:dump -J-XX:SharedClassListFile=sbt.lst -J-XX:SharedArchiveFile=sbt.jsa exit
Can't parse java version from: 10
Java HotSpot(TM) 64-Bit Server VM warning: Ignoring option MaxPermSize; support was removed in 8.0
narrow_klass_base = 0x0000000800000000, narrow_klass_shift = 3
Allocated temporary class space: 1073741824 bytes at 0x00000008c0000000
Allocated shared space: 3221225472 bytes at 0x0000000800000000
Loading classes to share ...
Preload Warning: Verification failed for xsbt.boot.Launch
Loading classes to share: done.
Rewriting and linking classes ...
Preload Warning: Removed error class: xsbt.boot.Launch
Rewriting and linking classes: done
Number of classes 2821
    instance classes   =  2743
    obj array classes  =    70
    type array classes =     8
Updating ConstMethods ... done.
Removing unshareable information ... done.
Scanning all metaspace objects ...
Allocating RW objects ...
Allocating RO objects ...
Relocating embedded pointers ...
Relocating external roots ...
Dumping symbol table ...
Dumping String objects to closed archive heap region ...
Dumping objects to open archive heap region ...
Relocating SystemDictionary::_well_known_klasses[] ...
Removing java_mirror ... done.
mc  space:      9424 [  0.0% of total] out of     12288 bytes [ 76.7% used] at 0x0000000800000000
rw  space:   7493400 [ 23.5% of total] out of   7495680 bytes [100.0% used] at 0x0000000800003000
ro  space:  12367464 [ 38.8% of total] out of  12369920 bytes [100.0% used] at 0x0000000800729000
md  space:      6160 [  0.0% of total] out of      8192 bytes [ 75.2% used] at 0x00000008012f5000
od  space:  11153856 [ 35.0% of total] out of  11157504 bytes [100.0% used] at 0x00000008012f7000
st0 space:    716800 [  2.2% of total] out of    716800 bytes [100.0% used] at 0x00000007bff00000
oa0 space:     98304 [  0.3% of total] out of     98304 bytes [100.0% used] at 0x00000007bfe00000
total    :  31845408 [100.0% of total] out of  31858688 bytes [100.0% used]
        1.79 real         1.43 user         0.32 sys

22:09:55 $ JENV_VERSION=10 time sbtx -J-XX:+UnlockCommercialFeatures -J-XX:+UseAppCDS -J-Xshare:on -J-XX:SharedArchiveFile=sbt.jsa exit
Can't parse java version from: 10
Java HotSpot(TM) 64-Bit Server VM warning: Ignoring option MaxPermSize; support was removed in 8.0
[info] Loading settings from idea.sbt,global-plugins.sbt ...
[info] Loading global plugins from /Users/dnw/.dotfiles/.sbt/1.0/plugins
[info] Loading settings from plugins.sbt ...
[info] Loading project definition from /s/t-foo/project
[info] Loading settings from build.sbt ...
[info] Set current project to t (in build file:/s/t-foo/)
        7.64 real        24.81 user         1.19 sys

dwijnand (Member) commented Jun 2, 2018

eed3si9n added the uncategorized label Sep 18, 2018
pshirshov commented

Did you guys abandon the CP sharing idea?

azolotko (Contributor) commented Aug 8, 2021

> your cold SBT startup is spending more than half its time in the C2 JIT compiler!

Would it be appropriate for sbt to use -XX:+TieredCompilation -XX:TieredStopAtLevel=1? It shows a noticeable startup time improvement on one of my projects:

Without -XX:+TieredCompilation -XX:TieredStopAtLevel=1:

➜ hyperfine --warmup 5 --min-runs 5 'sbt exit' 
Benchmark #1: sbt exit
  Time (mean ± σ):      9.041 s ±  0.146 s    [User: 37.029 s, System: 2.140 s]
  Range (min … max):    8.795 s …  9.172 s    5 runs

With -XX:+TieredCompilation -XX:TieredStopAtLevel=1:

➜ hyperfine --warmup 5 --min-runs 5 'sbt exit' 
Benchmark #1: sbt exit
  Time (mean ± σ):      7.203 s ±  0.122 s    [User: 13.557 s, System: 1.243 s]
  Range (min … max):    7.009 s …  7.327 s    5 runs

With -XX:+TieredCompilation -XX:TieredStopAtLevel=1 and -Xverify:none (deprecated in JDK 13 and will likely be removed in a future release):

➜ hyperfine --warmup 5 --min-runs 5 'sbt exit' 
Benchmark #1: sbt exit
  Time (mean ± σ):      6.349 s ±  0.068 s    [User: 11.892 s, System: 1.254 s]
  Range (min … max):    6.260 s …  6.436 s    5 runs

With -XX:+TieredCompilation -XX:TieredStopAtLevel=1, -Xverify:none and -XX:SharedArchiveFile:

➜  hyperfine --warmup 5 --min-runs 5 'sbt exit'
Benchmark #1: sbt exit
  Time (mean ± σ):      5.425 s ±  0.026 s    [User: 10.782 s, System: 1.154 s]
  Range (min … max):    5.386 s …  5.457 s    5 runs

jrudolph (Member, Author) commented Aug 8, 2021

Have you checked actual compilation performance? Try running a loop of compilations (e.g. ;clean;compile;clean;compile etc.) that takes a few minutes, first with the default settings and then with C1 only.

azolotko (Contributor) commented Aug 8, 2021

Indeed, the compilation performance is not impressive 👎

➜  hyperfine --min-runs 2 'sbt "clean; compile; clean; compile; clean; compile; clean; compile; clean; compile"'
Benchmark #1: sbt "clean; compile; clean; compile; clean; compile; clean; compile; clean; compile"
  Time (mean ± σ):     212.903 s ±  2.530 s    [User: 709.295 s, System: 45.613 s]
  Range (min … max):   211.114 s … 214.692 s    2 runs

➜  export JAVA_OPTS="-XX:+TieredCompilation -XX:TieredStopAtLevel=1"

➜  hyperfine --min-runs 2 'sbt "clean; compile; clean; compile; clean; compile; clean; compile; clean; compile"'
Benchmark #1: sbt "clean; compile; clean; compile; clean; compile; clean; compile; clean; compile"
  Time (mean ± σ):     319.712 s ±  3.586 s    [User: 554.071 s, System: 35.555 s]
  Range (min … max):   317.177 s … 322.248 s    2 runs

azolotko (Contributor) commented Aug 8, 2021

I've been reading through advancedThresholdPolicy.hpp.
Here's another interesting combination:

➜  export JAVA_OPTS="-XX:Tier3DelayOn=0 -XX:Tier3DelayOff=0 -XX:CICompilerCount=2"

➜  hyperfine --warmup 5 --min-runs 5 'sbt exit'
Benchmark #1: sbt exit
  Time (mean ± σ):      7.880 s ±  0.107 s    [User: 22.218 s, System: 1.584 s]
  Range (min … max):    7.703 s …  7.991 s    5 runs
 
➜  hyperfine --min-runs 2 'sbt "clean; compile; clean; compile; clean; compile; clean; compile; clean; compile"'
Benchmark #1: sbt "clean; compile; clean; compile; clean; compile; clean; compile; clean; compile"
  Time (mean ± σ):     218.390 s ±  4.132 s    [User: 594.795 s, System: 41.521 s]
  Range (min … max):   215.468 s … 221.312 s    2 runs
