More capable inliner [ci: last-only] #7133

Merged: 56 commits into scala:2.13.x, Oct 26, 2018
Conversation

@lrytz (Member) commented Aug 24, 2018

This PR brings several improvements to the inliner and optimizer.

Teaser

class C {
  def a = Array.fill[Option[String]](10)(None)
  def b(a: Array[Int]) = a.map(_ + 1)
}
$> qsc Test.scala -opt:l:inline '-opt-inline-from:**'

$> cfr-decompiler C.class
public class C {
    public Option<String>[] a() {
        int fill_n = 10;
        if (fill_n <= 0) {
            return new Option[0];
        }
        Option[] fill_array = new Option[fill_n];
        for (int fill_i = 0; fill_i < fill_n; ++fill_i) {
            None$ none$ = None$.MODULE$;
            fill_array[fill_i] = none$;
        }
        return fill_array;
    }

    public int[] b(int[] a) {
        int n = a.length;
        int[] arrn = new int[n];
        for (int map$extension_i = 0; map$extension_i < n; ++map$extension_i) {
            int n2;
            arrn[map$extension_i] = n2 = a[map$extension_i] + 1;
        }
        return arrn;
    }
}

State of 2.12

In 2.12, there is a single round of inlining.

  • After building the call graph, the heuristic looks at every callsite and decides which ones to select for inlining
  • These inline requests are then filtered (avoid cycles) and ordered (leaves of the request tree are inlined first)
  • The selected callsites are inlined
  • Then the closure optimizer runs and re-writes closure invocations that appear within the same method as the closure allocation, to directly call the closure body method instead (a sketch follows this list).
  • Local optimizations then clean up (closure allocation is eliminated, lots of other cleanups)
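
For illustration, here is a minimal sketch of that rewrite, assuming `twice` gets inlined into `caller` (the class and method names are invented for this example, not taken from the PR):

class ClosureRewriteSketch {
  def twice(f: Int => Int, x: Int): Int = f(f(x))
  def caller: Int = twice(_ + 1, 10)
  // After `twice` is inlined into `caller`, the closure allocated for `_ + 1`
  // and its `apply` invocations end up in the same method. The closure
  // optimizer rewrites those invocations to direct calls of the
  // compiler-generated closure body method, after which local optimizations
  // can eliminate the closure allocation itself.
}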

Changes to inlining procedure

This PR changes the inliner and closure optimizer to run in a loop (more precisely, an inner loop of the inliner and an outer loop of inliner and closure optimizer):

  • The beginning is the same as in 2.12 (build call graph and select inline requests)
  • The inliner no longer "fixes up" the call graph of the callsite method (remove the inlined callsite, add those of the inlined code). Instead, methods changed by the inliner are removed and re-added to the call graph. This re-runs the type analysis, which may result in a more precise call graph (callsites may get a more precise receiver type after inlining).
  • Callsites in methods changed by the inliner are re-checked by the inliner heuristic, potentially creating new inline requests (for callsites that could not be inlined in the previous round). This is the inner loop (see the sketch after this list).
  • Once inlining is done the closure optimizer runs
  • Methods changed by the closure optimizer are once again passed to the inliner to check for more callsites to inline. The inliner and closure optimizer run in a loop until they are done; this is the outer loop.
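
A hedged sketch of why re-running the type analysis enables new inline requests (the class and its members are made up for illustration):

class LoopSketch {
  def mk: Int => Int = x => x + 1
  def use: Int = mk.apply(1)
  // Initially the receiver of `apply` in `use` has the imprecise type
  // Int => Int, so the callsite cannot be resolved to a concrete method.
  // Round 1: `mk` is inlined into `use`; removing and re-adding `use` to the
  // call graph re-runs the type analysis, which gives the receiver the
  // precise type of the compiler-generated anonfun class.
  // Round 2: the now-resolvable `apply` callsite yields a new inline request,
  // and the closure optimizer can rewrite it to the closure body method.
}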

Inliner heuristics

The inliner heuristics are adjusted; changes in this PR are marked (new). They work as follows:

  • Methods or callsites annotated @noinline are not inlined
  • We don't inline into forwarder methods (including synthetic ones, such as foo$anonfun$adapted, bridges, boxing/unboxing forwarders)
  • Methods or callsites annotated @inline are inlined
  • Higher-order methods with a function literal as argument are inlined
  • Higher-order methods where a parameter function of the callsite method is forwarded to the callee are inlined
  • (new) Methods with an XRef parameter are inlined. When nested methods update variables of the outer method, those variables are boxed into XRef objects. Inlining the nested method usually allows eliminating the XRef box (see the sketches after this list).
  • (new) ScalaRunTime.array_apply and .array_update are inlined if the receiver is statically known to be an array. Those methods have a big pattern match for all primitive array types. This pattern match is later reduced (by cleanup optimizations) to the single matching case.
  • (new) Forwarders, factory methods and trivial methods are inlined. This includes bridges, anonfun$adapted methods, trivial methods such as _+1 closure bodies, etc.
    • Field accessors are not inlined, because fields are private. Inlining a field accessor changes a call to a public getter into a field read of a private field. This would mean that methods accessing fields can then no longer be inlined into different classes (it would cause illegal access errors).
    • Trait super accessors (static methods in interfaces that use invokespecial to call the default method) are not inlined, because the invokespecial would have different semantics when inlined into some different class. (*)

(*) Note that trait super accessors are still inlined when selected by a different heuristic, for example if they are higher-order methods with a function literal argument. In this case the static forwarder is inlined, and then also the invokespecial is inlined. So the actual default method body is ultimately inlined.
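
Two minimal sketches for the new heuristics above, assuming the standard encodings of captured mutable locals and generic array access (the examples are mine, not taken from the PR's tests):

class XRefSketch {
  def count: Int = {
    var total = 0                 // captured and mutated by a nested method,
    def inc(): Unit = total += 1  // so it is boxed into a scala.runtime.IntRef
    inc(); inc()
    total
  }
  // Inlining `inc` brings the IntRef reads and writes into `count`, where the
  // box-unbox optimization can usually replace the box with a plain local.
}

class ArrayApplySketch {
  // Generic array indexing compiles to ScalaRunTime.array_apply, whose body
  // pattern-matches over all primitive array types.
  def first[T](xs: Array[T]): T = xs(0)
  def use(ints: Array[Int]): Int = first(ints)
  // After `first` is inlined into `use`, the receiver is statically known to
  // be an Array[Int], so array_apply is inlined and the cleanup optimizations
  // reduce its pattern match to the single Int case.
}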

Size limits

Inlining is now subject to size limits. The limits were chosen by looking at method sizes when compiling the compiler and how they affect the running time of the ASM analyzers. They are definitely up for discussion, and should maybe have command-line flags.

  • @inline-annotated, higher-order, synthetic forwarder and array_apply/update methods are inlined if the resulting method has <= 3000 ASM instruction nodes
  • higher-order callsites with a forwarded function parameter, methods with XRef params, factories and forwarders that box/unbox are inlined if the result is <= 2000 ASM instructions
  • generic forwarders and trivial methods are inlined up to 1000 ASM instructions

Improvements to local optimizations

There are several improvements to local optimizations (cleanups)

  • Nullness analysis is now branch-sensitive
  • Local variables created by the inliner for arguments of the callee, and local variables inlined into a method, are nulled out at the end of the inlined code. This fixes "Inlining can prolong object lifetimes leading to OutOfMemoryError" (bug#9137).
  • The inliner creates fewer local variables holding arguments: if there is already some local variable holding the argument at the callsite, it is re-used in the inlined code.
  • The optimizer treats module loads as always non-null. This slightly twists semantics: if a module is accessed in its super constructor, the field is still null (see the sketch after this list). The behavior can be disabled with -opt:-assume-modules-non-null.
  • The optimizer eliminates module loads of some built-in modules (Predef, ScalaRunTime, ClassTag). This can delay / skip initialization of those modules. The behavior can be disabled with -opt:-allow-skip-core-module-init.
  • The optimizer can delay class loading, for example by eliminating an unused Foo.getClass. This can be disabled with -opt:-allow-skip-class-loading.
  • instanceof and checkcast optimization is improved
    • it knows about array types, which means array type tests can be eliminated statically
    • it uses a nullness analyzer to rewrite INSTANCEOF checks
  • Some intrinsics are re-written by the optimizer
    • ClassTag(classOf[X]).newArray is rewritten to new Array[X]
    • java.lang.reflect.Array.getLength(x) when x is statically known to be an array is rewritten to ARRAYLENGTH
    • x.getClass when x is statically known to be a primitive array is rewritten to LDC
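
For the module-load item above, a minimal sketch of the semantic corner case (the class and object names are invented):

class SuperOfModule {
  // Prints "null": Mod's MODULE$ field is assigned only after this
  // superclass constructor has finished running.
  println(Mod)
}
object Mod extends SuperOfModule
// The optimizer nevertheless assumes loads of `Mod` are non-null, trading
// this corner case for the ability to remove null checks on module
// references; -opt:-assume-modules-non-null restores conservative behavior.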

Compiler performance

First the good news: the compiler built with the new inliner/optimizer runs ~2.5% faster on the larger codebases scalap and scala, around 1% faster on the small better-files codebase. Results here.

The not so good news: running the optimizer is significantly slower than before. Compiling the compiler with the optimizer enabled, the jvm phase went from 43 seconds to 93 seconds.

By far the most time is spent in the SourceValue analysis, which computes producers and consumers. This analysis is used in a number of places: rewriting closure invocations, push-pop elimination, box-unbox elimination, removing stale stores. It might be possible to fuse the local optimizations that use this analysis into one.
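
For context, a sketch of the underlying ASM API (the compiler wraps this in its own analyzer infrastructure; the helper below is hypothetical and only shows where producer information comes from):

import scala.jdk.CollectionConverters._
import org.objectweb.asm.tree.MethodNode
import org.objectweb.asm.tree.analysis.{Analyzer, SourceInterpreter, SourceValue}

object ProducersSketch {
  // For every frame, each stack slot's SourceValue.insns holds the set of
  // instructions that may have produced that value - this is the producer
  // information the cleanup optimizations query.
  def printProducers(ownerInternalName: String, method: MethodNode): Unit = {
    val frames = new Analyzer[SourceValue](new SourceInterpreter).analyze(ownerInternalName, method)
    for ((frame, i) <- frames.zipWithIndex if frame != null; j <- 0 until frame.getStackSize)
      println(s"insn $i, stack slot $j: ${frame.getStack(j).insns.asScala.size} producer(s)")
  }
}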

Besides taking a lot of cpu cycles, asm analyses also have a huge memory churn. This can be easily observed by attaching a profiler: there is basically constant minor GC going on as soon as the compiler reaches the backend.

Bugfixes

Fixes some older bugs:

  • An unused Int.unbox("") should not be eliminated, as it throws a CCE (bug in 2.12; see the sketch after this list)
  • Inlining may cause a crash when writing the classfile and computing stack map frames (ClassBType of inlined code not created and cached during inlining, also bug in 2.12)
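
A tiny sketch of the first fix; the example code is mine, the unboxing semantics are standard:

object UnboxSketch {
  def main(args: Array[String]): Unit = {
    // Int.unbox delegates to BoxesRunTime.unboxToInt, which casts its
    // argument to java.lang.Integer. Passing a String therefore throws a
    // ClassCastException at runtime even though the result is unused, so
    // the optimizer must not eliminate the call as dead code.
    Int.unbox("")
  }
}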

@lrytz (Member, Author) commented Sep 27, 2018

This is finally getting ready now! I summarized the changes in the PR description. Feedback is very welcome, /cc @Ichoran, @non, @mkeskells, @fwbrasil

@fwbrasil commented

@lrytz interesting work! I wonder if static inlining could hurt performance, though. The JIT compilers can make much more informed decisions based on the runtime profile.

@lrytz (Member, Author) commented Sep 27, 2018

Yes, static inlining can certainly hurt performance, mostly because the JVM prefers working with small methods (the exception to that rule is probably the rather low default -XX:MaxInlineLevel).

The goal of the Scala inliner and optimizer is to produce bytecode that the JVM can handle better. Or in other words, perform some optimizations that are common in Scala and that we know the JVM doesn't handle very well. There are basically two areas of interest:

  • Higher-order methods: they lead to megamorphic callsites if not inlined / specialized enough. Also closure elimination can improve performance as the JVM's escape analysis seems to run into limitations when the code becomes more complex.
  • Boxing: inlining can sometimes enable the elimination of boxing and unboxing. This has limited scope, it doesn't help for primitives in data structures (for example using an ArrayBuffer[Int]).

We inline very small methods as this does not increase method sizes and can sometimes lead to better static method-local analysis (for example when inlining a factory method).

All of the JVM discussion above is about HotSpot (version 8 in particular - I don't know if there were substantial changes in this area in 9/10/11). Once GraalVM becomes more prevalent we'll have to re-assess the situation. Graal definitely has a more powerful inliner with heuristics geared towards higher-order methods / functional patterns, and partial escape analysis might also have a substantial impact on Scala code. In the best case the Scala optimizer becomes obsolete, in the worst case the rewrites performed by the Scala optimizer have a negative effect when running on Graal.

@lrytz (Member, Author) commented Sep 28, 2018

Pushed changes (proposed and written by @retronym) to cut dependencies of Java sources on Scala files in the library, and switched to JavaThenScala. This allows inlining code that references those Java-defined types, see #6085 (comment). This shows a slight performance improvement, see the third data point in this graph.

lrytz and others added 12 commits October 15, 2018 21:35
Inliner heuristic to inline array_apply/update if the array has a
statically known type. Then the type tests of the callee can be
evaluated at compile-time.

Also cleanups in LocalOpt
When an argument value (on the stack) at an inlined callsite already has
a local variable holding the same value, don't create a new local
variable, but use the existing one.
Don't let methods grow over 3000 asm instructions.

Don't inline into synthetic forwarders.

Inline forwarders, especially ones that box/unbox.

Categories for inline requests, higher-order are still inlined when
size is > 2000, but forwarders aren't.

Fix @noinline on paramless callsite
It hurts performance. Analyses are really memory-intensive. Caching
them uses a lot of memory, and basically defers collecting them from
minor to major GC (they are stored using SoftReference).
With a 1.5G heap, the JVM is basically doing major GC all the time
when running the optimizer, which drastically hurts performance. With
a 3G heap this looks better. However, even with a 3G heap the overall
performance is worse than without any caching on a 1.5G heap.

Generally, the optimizer has a huge amount of memory churn, this can
be observed easily with a profiler. The best way to improve on that is
probably to run fewer analyzers, i.e., to fuse local optimizations that
use the same kind of analysis.
@retronym (Member) left a comment:

Epic!

@retronym (Member) commented

@lrytz Here's a commit to update the ScalaOptionParser in our SBT build: https://github.com/scala/scala/compare/2.13.x...retronym:review/7133?expand=1

@mkeskells (Contributor) commented

@lrytz Impressive! Looks like the work to improve the inliner expanded into a rewrite!
I am very late to the party here and will make some time to have a look in more detail this week.
I have a few questions in the interim, though:

  • has this been tested with the parallel backend - if nothing else this should be able to reduce some of the extra overhead of the inliner, as the inliner is part of the phase that runs the local optimisations, isn't it?

  • is this change readily back-portable to 2.12 - I know there are 200+ files changed, but a lot is non-JVM stuff that looks mechanical (FunctionN etc.), and I thought the jvm backend was quite similar

  • did you already look at the inliner warnings in the current build - I presume this will be a separate PR (e.g. a "better use the new inliner" PR), but I recall there were a bunch of inliner warnings that I used to see when making a new dist

Both `.` and `/` work as package separator, the matcher simply replaces
`.` by `/`. But using dots is recommended in the setting's help.
@lrytz (Member, Author) commented Oct 17, 2018

has this been tested with the parallel backend - if nothing else this should be able to reduce some of the extra overhead of the inliner, as the inliner is part of the phase that runs the local optimisations, isn't it?

Very good point, I'll test the parallel backend. It will not help for inlining, because inlining works across compilation units, so we cannot parallelize. But it works for all the intra-method optimizations, which is where most of the time is spent anyway.

is this change readily back-portable to 2.12

Technically I don't see a problem backporting it to 2.12. So it's a question of risk: it's a big change that impacts generated bytecode. We should at least gain a bit of confidence first and see how well it does on 2.13.x.

Did you already look at the inliner warnings in the current build

No, it's a good suggestion. We might learn a few useful things from looking at those warnings. Last time I looked, the main problem was mixed compilation: when a Scala method references a Java type for which there's no bytecode available, we cannot inline that method. This was very common because we have some core classes in the library written in Java. This PR fixes that by going back to JavaThenScala.

lrytz and others added 2 commits October 26, 2018 14:12
To support the expanded range of choices for `-opt`, and also
`-opt-inline-from` which hadn't been added.
@lrytz lrytz merged commit 051e7b6 into scala:2.13.x Oct 26, 2018
@SethTisue (Member) commented
🎉 🎉 🎉

@mdedetrich (Contributor) commented

Technically I don't see a problem backporting it to 2.12. So it's a question of risk: it's a big change that impacts generated bytecode. We should at least gain a bit of confidence first and see how well it does on 2.13.x.

Doesn't this effectively rule out a change for Scala 2.12.x?

@lrytz (Member, Author) commented Oct 26, 2018

Doesn't this effectively rule out a change for Scala 2.12.x?

I don't think that it's ruled out by any principles. Binary compatible bytecode changes are OK, so it's a risk-value tradeoff.

@mdedetrich (Contributor) commented

Yeah, that's what I meant by "effectively": judging by the size of the PR and how much has changed, I suspect it would be very hard to make sure it's completely binary compatible with the 2.12.x series.

@lrytz (Member, Author) commented Oct 29, 2018

I suspect it would be very hard to make sure it's completely binary compatible with the 2.12.x series

I think the risk that (a backport of) this PR breaks binary compatibility is very small. What changes is optimizations within methods, and the selection of callsites that are inlined. None of that changes the binary interface of the released scala-library.

The risks I see are:

  • bugs in the inliner / optimizer, which could lead to wrong or invalid bytecode when compiling a project with the optimizer enabled (or even wrong/invalid bytecode in the released scala-library)
  • unexpected / large compile-time increases. It's known that the additional inlining introduced in this PR causes the optimizer to take more time. I took measures to keep it under control (size limits), but there's a risk that certain code patterns need more time to analyze than expected.
