
More capable inliner [ci: last-only] #7133

Merged
merged 56 commits into scala:2.13.x from lrytz:inlineRounds Oct 26, 2018

Conversation

@lrytz
Member

lrytz commented Aug 24, 2018

This PR brings several improvements to the inliner and optimizer.

Teaser

class C {
  def a = Array.fill[Option[String]](10)(None)
  def b(a: Array[Int]) = a.map(_ + 1)
}
$> qsc Test.scala -opt:l:inline '-opt-inline-from:**'

$> cfr-decompiler C.class
public class C {
    public Option<String>[] a() {
        int fill_n = 10;
        if (fill_n <= 0) {
            return new Option[0];
        }
        Option[] fill_array = new Option[fill_n];
        for (int fill_i = 0; fill_i < fill_n; ++fill_i) {
            None$ none$ = None$.MODULE$;
            fill_array[fill_i] = none$;
        }
        return fill_array;
    }

    public int[] b(int[] a) {
        int n = a.length;
        int[] arrn = new int[n];
        for (int map$extension_i = 0; map$extension_i < n; ++map$extension_i) {
            int n2;
            arrn[map$extension_i] = n2 = a[map$extension_i] + 1;
        }
        return arrn;
    }
}

State of 2.12

In 2.12, there is a single round of inlining.

  • After building the call graph, the heuristic looks at every callsite and decides which ones to select for inlining
  • These inline requests are then filtered (avoid cycles) and ordered (leaves of the request tree are inlined first)
  • The selected callsites are inlined
  • Then the closure optimizer runs and re-writes closure invocations that appear within the same method as the closure allocation, to directly call the closure body method instead (a source-level sketch follows this list).
  • Local optimizations then clean up (the closure allocation is eliminated, along with lots of other cleanups)
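A source-level analogy of that closure rewrite (the real transformation happens on bytecode; anonfunBody is a hypothetical stand-in for the compiler-generated closure body method):

class Demo {
  def before: Int = {
    val f = (x: Int) => x + 1 // closure allocation
    f(41)                     // Function1.apply call in the same method
  }

  // After the closure optimizer and local cleanups, `before` behaves like:
  def after: Int = anonfunBody(41) // apply rewritten to the body method;
                                   // the unused allocation is then removed

  private def anonfunBody(x: Int): Int = x + 1
}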

Changes to inlining procedure

This PR changes the inliner and closure optimizer to run in a loop (or, more precisely, an inner loop of the inliner and an outer loop of the inliner and closure optimizer); a rough sketch in code follows the list:

  • The beginning is the same as in 2.12 (build call graph and select inline requests)
  • The inliner no longer "fixes up" the call graph of the callsite method (removing the inlined callsite, adding those of the inlined code). Instead, methods changed by the inliner are removed and re-added to the call graph. This re-runs the type analysis, which may result in a more precise call graph (callsites may get a more precise receiver type after inlining).
  • Callsites in methods changed by the inliner are re-checked by the inliner heuristic, potentially creating new inline requests (for callsites that could not be inlined in the previous round). This is the inner loop.
  • Once inlining is done, the closure optimizer runs
  • Methods changed by the closure optimizer are once again passed to the inliner to check for more callsites to inline. The inliner and closure optimizer run in a loop until they are done; this is the outer loop.
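A minimal sketch of that control flow (illustrative names and stubs, not the compiler's actual internal API):

object OptimizerLoop {
  type Method = String // stand-in for the backend's method representation

  // Stubs for the steps described above:
  def selectInlineRequests(methods: Set[Method]): Set[Method] = Set.empty
  def inlineAndReAddToCallGraph(requests: Set[Method]): Set[Method] = Set.empty // re-runs type analysis
  def closureOptimize(methods: Set[Method]): Set[Method] = Set.empty // returns changed methods

  def run(allMethods: Set[Method]): Unit = {
    var toExamine = allMethods
    var done = false
    while (!done) {                        // outer loop: inliner + closure optimizer
      var requests = selectInlineRequests(toExamine)
      var changedByInlining = Set.empty[Method]
      while (requests.nonEmpty) {          // inner loop: inline, re-analyze, re-select
        val changed = inlineAndReAddToCallGraph(requests)
        changedByInlining ++= changed
        requests = selectInlineRequests(changed) // new requests may appear
      }
      val changedByClosureOpt = closureOptimize(changedByInlining)
      toExamine = changedByClosureOpt      // feed changed methods back to the inliner
      done = changedByClosureOpt.isEmpty
    }
  }
}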

Inliner heuristics

The inliner heuristics are adjusted; changes in this PR are marked (new). They work as follows:

  • Methods or callsites annotated @noinline are not inlined
  • We don't inline into forwarder methods (including synthetic ones, such as foo$anonfun$adapted, bridges, boxing/unboxing forwarders)
  • Methods or callsites annotated @inline are inlined
  • Higher-order methods with a function literal as argument are inlined
  • Higher-order methods where a parameter function of the callsite method is forwarded to the callee are inlined
  • (new) Methods with an XRef parameter are inlined. When nested methods update variables of the outer method, those variables are boxed into XRef objects. Inlining the nested method usually allows eliminating the XRef box (see the sketch below).
  • (new) ScalaRunTime.array_apply and .array_update are inlined if the receiver is statically known to be an array. Those methods have a big pattern match for all primitive array types. This pattern match is later reduced (by cleanup optimizations) to the single matching case.
  • (new) Forwarders, factory methods and trivial methods are inlined. This includes bridges, anonfun$adapted methods, trivial methods such as _+1 closure bodies, etc.
    • Field accessors are not inlined, because fields are private. Inlining a field accessor changes a call to a public getter into a field read of a private field. This would mean that methods accessing fields can then no longer be inlined into different classes (it would cause illegal access errors).
    • Trait super accessors (static methods in interfaces that use invokespecial to call the default method) are not inlined, because the invokespecial would have different semantics when inlined into some different class. (*)

(*) Note that trait super accessors are still inlined when selected by a different heuristic, for example if they are higher-order methods with a function literal argument. In this case the static forwarder is inlined, and then the invokespecial is also inlined. So the actual default method body is ultimately inlined.
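To illustrate the XRef heuristic, a source-level sketch (scalac performs the rewrite on bytecode; the comments describe what the compiler generates):

class Outer {
  def count(xs: List[Int]): Int = {
    var n = 0                   // captured and updated by a nested method, so
                                // scalac boxes it into a scala.runtime.IntRef
    def bump(): Unit = n += 1   // compiled as a method taking an IntRef parameter
    xs.foreach(_ => bump())
    n                           // reads the value out of the IntRef box
  }
}

Inlining bump into count makes the IntRef allocation local to a single method, so the box-unbox cleanup can turn it back into a plain local variable.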

Size limits

Inlining is now subject to size limits. The limits were chosen by looking at method sizes when compiling the compiler, and at how they affect the running time of the ASM analyzers. They are definitely up for discussion, and should maybe be controlled by command-line flags.

  • @inline-annotated, higher-order, synthetic forwarder and array_apply/update methods are inlined if the resulting method has <= 3000 ASM instruction nodes
  • higher-order callsites with a forwarded function parameter, methods with XRef params, and factories and forwarders that box/unbox are inlined if the result is <= 2000 ASM instructions
  • generic forwarders and trivial methods are inlined up to 1000 ASM instructions

Improvements to local optimizations

There are several improvements to local optimizations (cleanups)

  • Nullness analysis is now branch-sensitive
  • Local variables created by the inliner for arguments of the callee, and local variables inlined into a method, are nulled out at the end of the inlined code. This fixes scala/bug#9137.
  • The inliner creates fewer local variables holding arguments: if there is already some local variable holding the argument at the callsite, it is re-used in the inlined code.
  • The optimizer treats module loads as always non-null. This slightly twists semantics: if a module is accessed in its super constructor, the field is still null. The behavior can be disabled with -opt:-assume-modules-non-null.
  • The optimizer eliminates module loads of some built-in modules (Predef, ScalaRunTime, ClassTag). This can delay / skip initialization of those modules. The behavior can be disabled with -opt:-allow-skip-core-module-init
  • The optimizer can delay class loading, for example by eliminating an unused Foo.getClass. This can be disabled with -opt:-allow-skip-class-loading.
  • instanceof and checkcast optimization is improved
    • it knows about array types, which means array type tests can be eliminated statically
    • it uses a nullness analyzer to rewrite INSTANCEOF checks
  • Some intrinsics are re-written by the optimizer (illustrated after this list)
    • ClassTag(classOf[X]).newArray is rewritten to new Array[X]
    • java.lang.reflect.Array.getLength(x) when x is statically known to be an array is rewritten to ARRAYLENGTH
    • x.getClass when x is statically known to be a primitive array is rewritten to LDC
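A sketch of source shapes these intrinsics target (the actual rewrites happen on bytecode; the comments show the effect):

import scala.reflect.classTag

object Intrinsics {
  val a: Array[String] = classTag[String].newArray(10) // ClassTag(classOf[String]).newArray(10)
                                                       // becomes: new Array[String](10)
  val n: Int = java.lang.reflect.Array.getLength(a)    // receiver statically an array:
                                                       // becomes an ARRAYLENGTH instruction
  val c: Class[_] = Array(1, 2, 3).getClass            // primitive array receiver:
                                                       // becomes LDC of classOf[Array[Int]]
}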

Compiler performance

First the good news: the compiler built with the new inliner/optimizer runs ~2.5% faster on the larger codebases scalap and scala, around 1% faster on the small better-files codebase. Results here.

The not-so-good news: running the optimizer is significantly slower than before. Compiling the compiler with the optimizer enabled, the jvm phase went from 43 seconds to 93 seconds.

By far the most time is spent in the SourceValue analysis, which computes producers and consumers (a minimal example follows below). This analysis is used in a number of places: rewriting closure invocations, push-pop elimination, box-unbox elimination, removing stale stores. It might be possible to fuse the local optimizations that use this analysis into one.
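For reference, a minimal sketch of running that analysis with ASM (assumes the asm-tree and asm-analysis libraries on the classpath; SourceValue frames record producers, and the producers/consumers analysis mentioned above builds on this):

import org.objectweb.asm.tree.MethodNode
import org.objectweb.asm.tree.analysis.{Analyzer, Frame, SourceInterpreter, SourceValue}

object ProdCons {
  // Returns one frame per instruction; each frame maps stack/local slots to
  // the set of instructions that may have produced the value in that slot.
  def producerFrames(ownerInternalName: String, method: MethodNode): Array[Frame[SourceValue]] = {
    val analyzer = new Analyzer[SourceValue](new SourceInterpreter)
    analyzer.analyze(ownerInternalName, method)
  }
}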

Besides taking a lot of CPU cycles, ASM analyses also cause huge memory churn. This is easily observed by attaching a profiler: there is basically constant minor GC going on as soon as the compiler reaches the backend.

Bugfixes

This PR also fixes some older bugs:

  • An unused Int.unbox("") should not be eliminated, as it throws a CCE (bug in 2.12; see the example below)
  • Inlining may cause a crash when writing the classfile and computing stack map frames (the ClassBType of inlined code was not created and cached during inlining; also a bug in 2.12)
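To illustrate the first fix:

object UnboxBug {
  def f(): Unit = {
    // The value is unused, but the call must not be eliminated: unbox casts
    // its argument to java.lang.Integer, so this throws a ClassCastException.
    Int.unbox("")
  }
}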

@lrytz lrytz added the WIP label Aug 24, 2018

@scala-jenkins scala-jenkins added this to the 2.13.0-RC1 milestone Aug 24, 2018

@lrytz lrytz force-pushed the lrytz:inlineRounds branch from d8bb261 to e1f6d2e Aug 24, 2018

@lrytz


Member

lrytz commented Aug 24, 2018

Test failure due to more inlining, leading to scala/bug#9137. Oh, and Travis doesn't like the long time spent in the inliner :) "No output has been received in the last 10m0s."

@lrytz lrytz force-pushed the lrytz:inlineRounds branch 2 times, most recently from d384e38 to 2971c8d Aug 25, 2018

@lrytz lrytz force-pushed the lrytz:inlineRounds branch 2 times, most recently from a0e2ea4 to 6b4da82 Aug 27, 2018

@lrytz lrytz force-pushed the lrytz:inlineRounds branch 5 times, most recently from 9c4f5de to 39f680f Sep 7, 2018

@lrytz lrytz force-pushed the lrytz:inlineRounds branch 2 times, most recently from 099b2b4 to 441e7d0 Sep 17, 2018

@lrytz lrytz force-pushed the lrytz:inlineRounds branch from bb60c1f to 1933c6b Sep 26, 2018

@lrytz lrytz removed the WIP label Sep 27, 2018

@lrytz lrytz changed the title from [WIP] More capable inliner [ci: last-only] to More capable inliner [ci: last-only] Sep 27, 2018

@lrytz


Member

lrytz commented Sep 27, 2018

This is finally getting ready now! I summarized the changes in the PR description. Feedback is very welcome, /cc @Ichoran, @non, @mkeskells, @fwbrasil

@fwbrasil


fwbrasil commented Sep 27, 2018

@lrytz interesting work! I wonder if static inlining could hurt performance, though. The JIT compilers can make much more informed decisions based on the runtime profile.

@lrytz


Member

lrytz commented Sep 27, 2018

Yes, static inlining can certainly hurt performance, mostly because the JVM prefers working with small methods (the exception to that rule is probably the rather low default -XX:MaxInlineLevel).

The goal of the Scala inliner and optimizer is to produce bytecode that the JVM can handle better. Or in other words, perform some optimizations that are common in Scala and that we know the JVM doesn't handle very well. There are basically two areas of interest:

  • Higher-order methods: they lead to megamorphic callsites if not inlined / specialized enough (see the sketch after this list). Also, closure elimination can improve performance, as the JVM's escape analysis seems to run into limitations when the code becomes more complex.
  • Boxing: inlining can sometimes enable the elimination of boxing and unboxing. This has limited scope: it doesn't help for primitives in data structures (for example, an ArrayBuffer[Int]).
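A sketch of the megamorphic-callsite problem (illustrative example, not from the PR):

object Megamorphic {
  // A single Function2.apply callsite lives inside combine. Each lambda
  // below gets its own class at runtime, so if combine is not inlined,
  // that callsite sees three receiver classes and the JIT treats it as
  // megamorphic: no JIT inlining, just virtual dispatch.
  def combine(a: Int, b: Int)(f: (Int, Int) => Int): Int = f(a, b)

  val sum  = combine(1, 2)(_ + _)
  val prod = combine(1, 2)(_ * _)
  val diff = combine(1, 2)(_ - _)
}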

We inline very small methods as this does not increase method sizes and can sometimes lead to better static method-local analysis (for example when inlining a factory method).

All of the JVM discussion above is about HotSpot (version 8 in particular; I don't know if there were substantial changes in this area in 9/10/11). Once GraalVM becomes more prevalent we'll have to re-assess the situation. Graal definitely has a more powerful inliner with heuristics geared towards higher-order methods / functional patterns, and partial escape analysis might also have a substantial impact on Scala code. In the best case the Scala optimizer becomes obsolete; in the worst case the rewrites performed by the Scala optimizer have a negative effect when running on Graal.

@lrytz lrytz requested a review from retronym Sep 28, 2018

@lrytz lrytz force-pushed the lrytz:inlineRounds branch from 3196467 to 99bf708 Sep 28, 2018

@lrytz lrytz closed this Sep 28, 2018

@lrytz lrytz reopened this Sep 28, 2018

@lrytz lrytz force-pushed the lrytz:inlineRounds branch 2 times, most recently from 724aa17 to 8d17277 Sep 28, 2018

@lrytz


Member

lrytz commented Sep 28, 2018

Pushed changes (proposed and written by @retronym) to cut dependencies of Java sources on Scala files in the library, and switched to JavaThenScala. This allows inlining code that references those Java-defined types, see #6085 (comment). This shows a slight performance improvement, see the 3rd data point in this graph.

lrytz and others added some commits Sep 13, 2018

Inline array_apply/update after rewriting ClassTag(x).newArray
Inliner heuristic to inline array_apply/update if the array has a
statically known type. Then the type tests of the callee can be
evaluated at compile-time.

Also cleanups in LocalOpt
Allocate fewer local variables in the inliner
When an argument value (on the stack) at an inlined callsite already has
a local variable holding the same value, don't create a new local
variable, but use the existing one.
Limit method size when inlining. Improve inliner heuristics
Don't let methods grow over 3000 asm instructions.

Don't inline into synthetic forwarders.

Inline forwarders, especially ones that box/unbox.

Categories for inline requests: higher-order methods are still inlined
when the size is > 2000, but forwarders aren't.

Fix @noinline on paramless callsite
Remove analyzer caching
It hurts performance. Analyses are really memory-intensive. Caching
them uses a lot of memory, and basically defers collecting them from
minor to major GC (they are stored using SoftReference).
With a 1.5G heap, the JVM is basically doing major GC all the time
when running the optimizer, which drastically hurts performance. With
a 3G heap this looks better. However, even with a 3G heap the overall
performance is worse than without any caching on a 1.5G heap.

Generally, the optimizer has a huge amount of memory churn, this can
be observed easily with a profiler. The best way to improve on that is
probably to run fewer analyzers, i.e., to fuse local optimizations that
use the same kind of analysis.

@lrytz lrytz force-pushed the lrytz:inlineRounds branch from a2e4ec6 to 9f95dec Oct 15, 2018

@retronym

Epic!

@retronym


Member

retronym commented Oct 16, 2018

@lrytz Here's a commit to update the ScalaOptionParser in our SBT build: https://github.com/scala/scala/compare/2.13.x...retronym:review/7133?expand=1

@mkeskells


Contributor

mkeskells commented Oct 16, 2018

@lrytz Impressive! Looks like the work to improve the inliner expanded into a rewrite!
I am very late to the party here and will make some time to have a look in more detail this week.
I have a few questions in the interim, though:

  • has this been tested with the parallel backend? If nothing else, this should be able to reduce some of the extra overheads of the inliner, as the inliner is part of the phase for local optimisations, isn't it?

  • is this change readily back-portable to 2.12? I know there are 200+ files changed, but a lot of it is non-jvm stuff that looks mechanical (FunctionN etc.), and I thought the jvm part was quite similar

  • did you already look at the inliner warnings in the current build? I presume this will be a separate PR (e.g. a PR to make better use of the new inliner), but I recall there were a bunch of inliner warnings that I used to see when making a new dist

Use recommended syntax for opt-inline-from
Both `.` and `/` work as package separators; the matcher simply replaces
`.` by `/`. But using dots is recommended in the setting's help.
@lrytz


Member

lrytz commented Oct 17, 2018

has this been tested with the parallel backend? If nothing else, this should be able to reduce some of the extra overheads of the inliner, as the inliner is part of the phase for local optimisations, isn't it?

Very good point, I'll test the parallel backend. It will not help for inlining, because inlining works across compilation units, so we cannot parallelize. But it works for all the intra-method optimizations, which is where most of the time is spent anyway.

is this change readily back-portable to 2.12

Technically I don't see a problem backporting it to 2.12. So it's a question of risk: it's a big change that impacts generated bytecode. We should at least gain a bit of confidence first and see how well it does on 2.13.x.

Did you already look at the inliner warnings in the current build

No, it's a good suggestion. We might learn a few useful things from looking at those warnings. Last time I looked, the main problem was mixed compilation: when a Scala method references a Java type for which there's no bytecode available, we cannot inline that method. This was very common because we have some core classes in the library written in Java. This PR fixes that by going back to JavaThenScala.

lrytz and others added some commits Oct 26, 2018

Update our scala option parser for the scala input task in SBT
To support the expanded range of choices for `-opt`, and also
`-opt-inline-from` which hand't been added.

@lrytz lrytz merged commit 051e7b6 into scala:2.13.x Oct 26, 2018

2 of 3 checks passed

continuous-integration/travis-ci/pr The Travis CI build is in progress
cla @lrytz signed the Scala CLA. Thanks!
validate-main [5235] SUCCESS. Took 44 min.
@SethTisue


Member

SethTisue commented Oct 26, 2018

🎉 🎉 🎉

@mdedetrich


Contributor

mdedetrich commented Oct 26, 2018

Technically I don't see a problem backporting it to 2.12. So it's a question of risk: it's a big change that impacts generated bytecode. We should at least gain a bit of confidence first and see how well it does on 2.13.x.

Doesn't this effectively rule out a change for Scala 2.12.x?

@lrytz


Member

lrytz commented Oct 26, 2018

Doesn't this effectively rule out a change for Scala 2.12.x?

I don't think that it's ruled out by any principles. Binary compatible bytecode changes are OK, so it's a risk-value tradeoff.

@mdedetrich


Contributor

mdedetrich commented Oct 27, 2018

Yeah, that's what I meant by effectively; judging by the size of the PR and how much has changed, I suspect it would be very hard to make sure it's completely binary compatible with the 2.12.x series

@lrytz


Member

lrytz commented Oct 29, 2018

I suspect it would be very hard to make sure it's completely binary compatible with the 2.12.x series

I think the risk that (a backport of) this PR breaks binary compatibility is very small. What changes is the set of optimizations within methods and the selection of callsites that are inlined. None of that changes the binary interface of the released scala-library.

The risks I see are

  • bugs in the inliner / optimizer, which could lead to wrong or invalid bytecode when compiling a project with the optimizer enabled (or even wrong/invalid bytecode in the released scala-library)
  • unexpected / large compile-time increases. It's known that the additional inlining introduced in this PR causes the optimizer to take more time. I took measures to keep it under control (size limits), but there's a risk that certain code patterns need more time to analyze than expected.