More capable inliner [ci: last-only] #7133
Conversation
This is finally getting ready now! I summarized the changes in the PR description. Feedback is very welcome, /cc @Ichoran, @non, @mkeskells, @fwbrasil
@lrytz interesting work! I wonder if static inlining could hurt performance, though. The JIT compilers can make much more informed decisions based on the runtime profile.
Yes, static inlining can certainly hurt performance, mostly because the JVM prefers working with small methods (the exception to that rule is probably the rather low default inlining threshold). The goal of the Scala inliner and optimizer is to produce bytecode that the JVM can handle better. Or in other words, perform some optimizations that are common in Scala and that we know the JVM doesn't handle very well. There are basically two areas of interest: higher-order methods / closures, and very small methods.

We inline very small methods as this does not increase method sizes and can sometimes lead to better static method-local analysis (for example when inlining a factory method).

All of the JVM discussion above is about HotSpot (version 8 in particular - I don't know if there were substantial changes in this area in 9/10/11). Once GraalVM becomes more prevalent we'll have to re-assess the situation. Graal definitely has a more powerful inliner with heuristics geared towards higher-order methods / functional patterns, and partial escape analysis might also have a substantial impact on Scala code. In the best case the Scala optimizer becomes obsolete, in the worst case the rewrites performed by the Scala optimizer have a negative effect when running on Graal.
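A minimal illustration of the higher-order case (my own example, not from the comment; all names are made up):

```scala
// Illustrative sketch: a higher-order method with a function-literal argument.
// If `sumWith` is inlined at the call site, the closure optimizer can rewrite
// the `f(xs(i))` invocation to call the lambda body directly, and the
// Function1 allocation (and Int boxing) can then be eliminated.
object ClosureElimExample {
  @inline final def sumWith(xs: Array[Int])(f: Int => Int): Int = {
    var acc = 0
    var i = 0
    while (i < xs.length) { acc += f(xs(i)); i += 1 }
    acc
  }

  def sumOfSquares(xs: Array[Int]): Int = sumWith(xs)(x => x * x)
}
```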
Pushed changes (proposed and written by @retronym) to cut dependencies of Java sources on Scala files in the library, and switched to JavaThenScala. This allows inlining code that references those Java-defined types, see #6085 (comment). This shows a slight performance improvement, see the 3rd data point in this graph
Inliner heuristic to inline array_apply/update if the array has a statically known type. Then the type tests of the callee can be evaluated at compile-time. Also cleanups in LocalOpt
…tanceof optimization
When an argument value (on the stack) at an inlined callsite already has a local variable holding the same value, don't create a new local variable, but use the existing one.
Don't let methods grow over 3000 asm instructions. Don't inline into synthetic forwarders. Inline forwarders, especially ones that box/unbox. Categories for inline requests, higher-order are still inlined when size is > 2000, but forwarders aren't. Fix @noinline on paramless callsite
this issue is old (since 2.12)
…red for stack map frame calculation
It hurts performance. Analyses are really memory-intensive. Caching them uses a lot of memory, and basically defers collecting them from minor to major GC (they are stored using SoftReference). With a 1.5G heap, the JVM is basically doing major GC all the time when running the optimizer, which drastically hurts performance. With a 3G heap this looks better. However, even with a 3G heap the overall performance is worse than without any caching on a 1.5G heap. Generally, the optimizer has a huge amount of memory churn; this can be observed easily with a profiler. The best way to improve on that is probably to run fewer analyzers, i.e., to fuse local optimizations that use the same kind of analysis.
Epic!
@lrytz Here's a commit to update the
@lrytz Impressive! Looks like the work to improve the inliner expanded into a rewrite!
Both `.` and `/` work as package separator; the matcher simply replaces `.` by `/`. But using dots is recommended in the setting's help.
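For reference, a minimal sbt sketch of how these settings are typically passed (flag spellings as of this PR; the pattern is just an example):

```scala
// In build.sbt: enable the inliner and restrict inlining to the current
// project's sources plus the scala.collection package. Dots are the
// recommended package separator in the patterns; slashes also work.
scalacOptions ++= Seq(
  "-opt:l:inline",
  "-opt-inline-from:<sources>,scala.collection.**"
)
```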
Very good point, I'll test the parallel backend. It will not help for inlining, because inlining works across compilation units, so we cannot parallelize. But it works for all the intra-method optimizations, which is where most of the time is spent anyway.
Technically I don't see a problem backporting it to 2.12. So it's a question of risk: it's a big change that impacts generated bytecode. We should at least gain a bit of confidence first and see how well it does on 2.13.x.
No, it's a good suggestion. We might learn a few useful things from looking at those warnings. Last time I looked, the main problem was mixed compilation: when a Scala method references a Java type for which there's no bytecode available, we cannot inline that method. This was very common because we have some core classes in the library written in Java. This PR fixes that by going back to JavaThenScala.
To support the expanded range of choices for `-opt`, and also `-opt-inline-from` which hadn't been added.
🎉 🎉 🎉
Doesn't this effectively rule out a change for Scala 2.12.x?
I don't think that it's ruled out by any principles. Binary compatible bytecode changes are OK, so it's a risk-value tradeoff.
Yeah, that's what I meant by effectively; judging by the size of the PR and how much has changed, I suspect it would be very hard to make sure it's completely binary compatible with the 2.12.x series.
I think the risk that (a backport of) this PR breaks binary compatibility is very small. What changes are the optimizations within methods, and the selection of callsites that are inlined. None of that changes the binary interface of the released scala-library. The risks I see are
This PR brings several improvements to the inliner and optimizer.
Teaser
State of 2.12
In 2.12, there is a single round of inlining.
Changes to inlining procedure
This PR changes the inliner and closure optimizer to run in a loop (or, more precisely, an inner loop of the inliner and an outer loop of inliner and closure optimizer).
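A minimal sketch of that loop structure (hypothetical method names, not the compiler's actual internals):

```scala
// Hypothetical sketch of the nested optimization loop described above.
// `inlineOnce` and `rewriteClosures` stand in for the real inliner and
// closure optimizer; each returns true if it changed the method.
object OptimizationLoopSketch {
  def inlineOnce(method: AnyRef): Boolean = false      // placeholder
  def rewriteClosures(method: AnyRef): Boolean = false // placeholder

  def optimize(method: AnyRef): Unit = {
    var changed = true
    while (changed) {                    // outer loop: inliner + closure optimizer
      while (inlineOnce(method)) ()      // inner loop: inline until no more inline requests
      changed = rewriteClosures(method)  // rewritten closures may enable further inlining
    }
  }
}
```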
Inliner heuristics
The inliner heuristics are adjusted, changes in this PR are marked (new). They work as follows:

- Methods and callsites annotated `@noinline` are not inlined
- (new) There is no inlining into synthetic forwarders (`foo$anonfun$adapted`, bridges, boxing/unboxing forwarders)
- Methods and callsites annotated `@inline` are inlined
- Higher-order methods with a function literal argument are inlined
- (new) Methods with an `XRef` parameter are inlined. When nested methods update variables of the outer method, those variables are boxed into `XRef` objects. Inlining the nested method usually allows eliminating the `XRef` box (see the example after this list).
- (new) `ScalaRunTime.array_apply` and `.array_update` are inlined if the receiver is statically known to be an array. Those methods have a big pattern match for all primitive array types. This pattern match is later reduced (by cleanup optimizations) to the single matching case.
- (new) Forwarders, factories and trivial methods are inlined: `anonfun$adapted` methods, trivial methods such as `_+1` closure bodies, etc.
- (new) Trait super accessors (static forwarders that use `invokespecial` to call the default method) are not inlined, because the `invokespecial` would have different semantics when inlined into some different class. (*)

(*) Note that trait super accessors are still inlined when selected by a different heuristic, for example if they are higher-order methods with a function literal argument. In this case the static forwarder is inlined, and then also the `invokespecial` is inlined. So the actual default method body is ultimately inlined.
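An illustrative (made-up) example of the `XRef` case: a nested method that updates a variable of the enclosing method forces that variable into a `scala.runtime.IntRef`; inlining the nested method typically lets the box/unbox optimization remove the ref again.

```scala
// Made-up example of the XRef pattern.
object XRefExample {
  def sumTo(n: Int): Int = {
    var acc = 0                       // boxed into a scala.runtime.IntRef,
    def add(i: Int): Unit = acc += i  // because the nested method writes to it
    var i = 0
    while (i <= n) { add(i); i += 1 }
    acc                               // inlining `add` lets the IntRef box be eliminated
  }
}
```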
Size limits

Inlining is now subject to size limits. The limits were chosen by looking at method sizes when compiling the compiler and how they affect running time of the ASM analyzers. They are definitely up for discussion, and should maybe have command-line flags.

- `@inline`-annotated, higher-order, synthetic forwarder and `array_apply/update` methods are inlined if the resulting method has <= 3000 ASM instruction nodes
- `XRef` params, factories and forwarders that box/unbox are inlined if the result is <= 2000 ASM instructions
Improvements to local optimizations

There are several improvements to local optimizations (cleanups):
- Module loads are assumed to be non-null; the behavior can be disabled with `-opt:-assume-modules-non-null`.
- Unused loads of a few core modules are eliminated (`Predef`, `ScalaRunTime`, `ClassTag`). This can delay / skip initialization of those modules. The behavior can be disabled with `-opt:-allow-skip-core-module-init`.
- Unused instructions whose only side effect may be loading a class are eliminated, for example `Foo.getClass`. This can be disabled with `-opt:-allow-skip-class-loading`.
- The `instanceof` and `checkcast` optimization is improved, eliminating more `INSTANCEOF` checks.
- `ClassTag(classOf[X]).newArray` is rewritten to `new Array[X]`.
- `java.lang.reflect.Array.getLength(x)` when `x` is statically known to be an array is rewritten to `ARRAYLENGTH`.
- `x.getClass` when `x` is statically known to be a primitive array is rewritten to `LDC`.
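For illustration (my examples, not from the PR), the source-level shapes these last rewrites target:

```scala
import scala.reflect.ClassTag

// Illustrative only: source patterns corresponding to the bytecode rewrites above.
object LocalOptExamples {
  // ClassTag(classOf[X]).newArray(n) can become `new Array[X](n)`,
  // avoiding the reflective array creation.
  def strings(n: Int): Array[String] = ClassTag(classOf[String]).newArray(n)

  // When the argument is statically known to be an array, the reflective
  // length lookup can become a plain ARRAYLENGTH instruction.
  def len(xs: Array[Int]): Int = java.lang.reflect.Array.getLength(xs)

  // For a statically known primitive array type, getClass can become an
  // LDC of the class constant.
  def cls(xs: Array[Int]): Class[_] = xs.getClass
}
```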
Compiler performance
First the good news: the compiler built with the new inliner/optimizer runs ~2.5% faster on the larger codebases `scalap` and `scala`, and around 1% faster on the small `better-files` codebase. Results here.

The not so good news: running the optimizer is significantly slower than before. Compiling the compiler with the optimizer enabled, the jvm phase went from 43 seconds to 93 seconds.
By far the most time is spent in the `SourceValue` analysis, which computes producers and consumers. This analysis is used in a number of places: rewriting closure invocations, push-pop, box-unbox, removing stale stores. It might be possible to fuse the local optimizations that use this analysis into one.

Besides taking a lot of CPU cycles, ASM analyses also have a huge memory churn. This can be easily observed by attaching a profiler; there is basically constant minor GC going on as soon as the compiler reaches the backend.
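For context, a minimal standalone sketch of the kind of ASM analysis referred to here, using the plain `org.objectweb.asm` API (the compiler uses its own shaded copy of ASM):

```scala
import org.objectweb.asm.tree.MethodNode
import org.objectweb.asm.tree.analysis.{Analyzer, Frame, SourceInterpreter, SourceValue}

// Minimal sketch: run ASM's SourceValue analysis on a method node. For each
// instruction it yields a Frame whose slots are SourceValues, i.e. the sets
// of instructions that may have produced each stack/local value.
object ProducersConsumersSketch {
  def producers(ownerInternalName: String, method: MethodNode): Array[Frame[SourceValue]] =
    new Analyzer[SourceValue](new SourceInterpreter).analyze(ownerInternalName, method)
}
```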
Bugfixes
Fixes some older bugs:

- `Int.unbox("")` should not be eliminated, as it throws a CCE (bug in 2.12)
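A made-up repro sketch of that point: the unboxing call has an observable effect (it throws), so eliminating it even when the result is unused would change behavior.

```scala
// Made-up sketch: Int.unbox on a non-numeric value throws a ClassCastException,
// so the call must be kept even if its result is unused.
object UnboxSideEffect {
  def main(args: Array[String]): Unit = {
    Int.unbox("")  // throws java.lang.ClassCastException at runtime
  }
}
```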