
More capable inliner [ci: last-only] #7133

Merged
merged 56 commits into scala:2.13.x from lrytz:inlineRounds Oct 26, 2018

Conversation

@lrytz
Member

lrytz commented Aug 24, 2018

This PR brings several improvements to the inliner and optimizer.

Teaser

class C {
  def a = Array.fill[Option[String]](10)(None)
  def b(a: Array[Int]) = a.map(_ + 1)
}
$> qsc Test.scala -opt:l:inline '-opt-inline-from:**'

$> cfr-decompiler C.class
public class C {
    public Option<String>[] a() {
        int fill_n = 10;
        if (fill_n <= 0) {
            return new Option[0];
        }
        Option[] fill_array = new Option[fill_n];
        for (int fill_i = 0; fill_i < fill_n; ++fill_i) {
            None$ none$ = None$.MODULE$;
            fill_array[fill_i] = none$;
        }
        return fill_array;
    }

    public int[] b(int[] a) {
        int n = a.length;
        int[] arrn = new int[n];
        for (int map$extension_i = 0; map$extension_i < n; ++map$extension_i) {
            int n2;
            arrn[map$extension_i] = n2 = a[map$extension_i] + 1;
        }
        return arrn;
    }
}

State of 2.12

In 2.12, there is a single round of inlining.

  • After building the call graph, the heuristic looks at every callsite and decides which ones to select for inlining
  • These inline requests are then filtered (avoid cycles) and ordered (leaves of the request tree are inlined first)
  • The selected callsites are inlined
  • Then the closure optimizer runs and re-writes closure invocations that appear within the same method as the closure allocation, to directly call the closure body method instead (a source-level sketch follows this list).
  • Local optimizations then clean up (the closure allocation is eliminated, along with lots of other cleanups)
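A source-level analogy of that closure rewrite (the real transformation happens on bytecode; anonfunBody is a hypothetical stand-in for the compiler-generated closure body method):

class Demo {
  def before: Int = {
    val f = (x: Int) => x + 1 // closure allocation
    f(41)                     // Function1.apply call in the same method
  }

  // After the closure optimizer and local cleanups, `before` behaves like:
  def after: Int = anonfunBody(41) // apply rewritten to the body method;
                                   // the unused allocation is then removed

  private def anonfunBody(x: Int): Int = x + 1
}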

Changes to inlining procedure

This PR changes the inliner and closure optimizer to run in a loop (or, more precisely, an inner loop of the inliner and an outer loop of the inliner and closure optimizer); a rough sketch in code follows the list:

  • The beginning is the same as in 2.12 (build call graph and select inline requests)
  • The inliner no longer "fixes up" the call graph of the callsite method (removing the inlined callsite, adding those of the inlined code). Instead, methods changed by the inliner are removed and re-added to the call graph. This re-runs the type analysis, which may result in a more precise call graph (callsites may get a more precise receiver type after inlining).
  • Callsites in methods changed by the inliner are re-checked by the inliner heuristic, potentially creating new inline requests (for callsites that could not be inlined in the previous round). This is the inner loop.
  • Once inlining is done, the closure optimizer runs
  • Methods changed by the closure optimizer are once again passed to the inliner to check for more callsites to inline. The inliner and closure optimizer run in a loop until they are done; this is the outer loop.
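A minimal sketch of that control flow (illustrative names and stubs, not the compiler's actual internal API):

object OptimizerLoop {
  type Method = String // stand-in for the backend's method representation

  // Stubs for the steps described above:
  def selectInlineRequests(methods: Set[Method]): Set[Method] = Set.empty
  def inlineAndReAddToCallGraph(requests: Set[Method]): Set[Method] = Set.empty // re-runs type analysis
  def closureOptimize(methods: Set[Method]): Set[Method] = Set.empty // returns changed methods

  def run(allMethods: Set[Method]): Unit = {
    var toExamine = allMethods
    var done = false
    while (!done) {                        // outer loop: inliner + closure optimizer
      var requests = selectInlineRequests(toExamine)
      var changedByInlining = Set.empty[Method]
      while (requests.nonEmpty) {          // inner loop: inline, re-analyze, re-select
        val changed = inlineAndReAddToCallGraph(requests)
        changedByInlining ++= changed
        requests = selectInlineRequests(changed) // new requests may appear
      }
      val changedByClosureOpt = closureOptimize(changedByInlining)
      toExamine = changedByClosureOpt      // feed changed methods back to the inliner
      done = changedByClosureOpt.isEmpty
    }
  }
}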

Inliner heuristics

The inliner heuristics are adjusted; changes in this PR are marked (new). They work as follows:

  • Methods or callsites annotated @noinline are not inlined
  • We don't inline into forwarder methods (including synthetic ones, such as foo$anonfun$adapted, bridges, boxing/unboxing forwarders)
  • Methods or callsites annotated @inline are inlined
  • Higher-order methods with a function literal as argument are inlined
  • Higher-order methods where a parameter function of the callsite method is forwarded to the callee are inlined
  • (new) Methods with an XRef parameter are inlined. When nested methods update variables of the outer method, those variables are boxed into XRef objects. Inlining the nested method usually allows eliminating the XRef box (see the sketch below).
  • (new) ScalaRunTime.array_apply and .array_update are inlined if the receiver is statically known to be an array. Those methods have a big pattern match for all primitive array types. This pattern match is later reduced (by cleanup optimizations) to the single matching case.
  • (new) Forwarders, factory methods and trivial methods are inlined. This includes bridges, anonfun$adapted methods, trivial methods such as _+1 closure bodies, etc.
    • Field accessors are not inlined, because fields are private. Inlining a field accessor changes a call to a public getter into a field read of a private field. This would mean that methods accessing fields can then no longer be inlined into different classes (it would cause illegal access errors).
    • Trait super accessors (static methods in interfaces that use invokespecial to call the default method) are not inlined, because the invokespecial would have different semantics when inlined into some different class. (*)

(*) Note that trait super accessors are still inlined when selected by a different heuristic, for example if they are higher-order methods with a function literal argument. In this case the static forwarder is inlined, and then the invokespecial is also inlined. So the actual default method body is ultimately inlined.
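To illustrate the XRef heuristic, a source-level sketch (scalac performs the rewrite on bytecode; the comments describe what the compiler generates):

class Outer {
  def count(xs: List[Int]): Int = {
    var n = 0                   // captured and updated by a nested method, so
                                // scalac boxes it into a scala.runtime.IntRef
    def bump(): Unit = n += 1   // compiled as a method taking an IntRef parameter
    xs.foreach(_ => bump())
    n                           // reads the value out of the IntRef box
  }
}

Inlining bump into count makes the IntRef allocation local to a single method, so the box-unbox cleanup can turn it back into a plain local variable.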

Size limits

Inlining is now subject to size limits. The limits were chosen by looking at method sizes when compiling the compiler, and at how they affect the running time of the ASM analyzers. They are definitely up for discussion, and should maybe be controlled by command-line flags.

  • @inline-annotated, higher-order, synthetic forwarder and array_apply/update methods are inlined if the resulting method has <= 3000 ASM instruction nodes
  • higher-order callsites with a forwarded function parameter, methods with XRef params, and factories and forwarders that box/unbox are inlined if the result is <= 2000 ASM instructions
  • generic forwarders and trivial methods are inlined up to 1000 ASM instructions

Improvements to local optimizations

There are several improvements to local optimizations (cleanups)

  • Nullness analysis is now branch-sensitive
  • Local variables created by the inliner for arguments of the callee, and local variables inlined into a method, are nulled out at the end of the inlined code. This fixes scala/bug#9137.
  • The inliner creates fewer local variables holding arguments: if there is already some local variable holding the argument at the callsite, it is re-used in the inlined code.
  • The optimizer treats module loads as always non-null. This slightly twists semantics: if a module is accessed in its super constructor, the field is still null. The behavior can be disabled with -opt:-assume-modules-non-null.
  • The optimizer eliminates module loads of some built-in modules (Predef, ScalaRunTime, ClassTag). This can delay / skip initialization of those modules. The behavior can be disabled with -opt:-allow-skip-core-module-init
  • The optimizer can delay class loading, for example by eliminating an unused Foo.getClass. This can be disabled with -opt:-allow-skip-class-loading.
  • instanceof and checkcast optimization is improved
    • it knows about array types, which means array type tests can be eliminated statically
    • it uses a nullness analyzer to rewrite INSTANCEOF checks
  • Some intrinsics are re-written by the optimizer (illustrated after this list)
    • ClassTag(classOf[X]).newArray is rewritten to new Array[X]
    • java.lang.reflect.Array.getLength(x) when x is statically known to be an array is rewritten to ARRAYLENGTH
    • x.getClass when x is statically known to be a primitive array is rewritten to LDC
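A sketch of source shapes these intrinsics target (the actual rewrites happen on bytecode; the comments show the effect):

import scala.reflect.classTag

object Intrinsics {
  val a: Array[String] = classTag[String].newArray(10) // ClassTag(classOf[String]).newArray(10)
                                                       // becomes: new Array[String](10)
  val n: Int = java.lang.reflect.Array.getLength(a)    // receiver statically an array:
                                                       // becomes an ARRAYLENGTH instruction
  val c: Class[_] = Array(1, 2, 3).getClass            // primitive array receiver:
                                                       // becomes LDC of classOf[Array[Int]]
}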

Compiler performance

First the good news: the compiler built with the new inliner/optimizer runs ~2.5% faster on the larger codebases scalap and scala, around 1% faster on the small better-files codebase. Results here.

The not-so-good news: running the optimizer is significantly slower than before. Compiling the compiler with the optimizer enabled, the jvm phase went from 43 seconds to 93 seconds.

By far the most time is spent in the SourceValue analysis, which computes producers and consumers (a minimal example follows below). This analysis is used in a number of places: rewriting closure invocations, push-pop elimination, box-unbox elimination, removing stale stores. It might be possible to fuse the local optimizations that use this analysis into one.
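For reference, a minimal sketch of running that analysis with ASM (assumes the asm-tree and asm-analysis libraries on the classpath; SourceValue frames record producers, and the producers/consumers analysis mentioned above builds on this):

import org.objectweb.asm.tree.MethodNode
import org.objectweb.asm.tree.analysis.{Analyzer, Frame, SourceInterpreter, SourceValue}

object ProdCons {
  // Returns one frame per instruction; each frame maps stack/local slots to
  // the set of instructions that may have produced the value in that slot.
  def producerFrames(ownerInternalName: String, method: MethodNode): Array[Frame[SourceValue]] = {
    val analyzer = new Analyzer[SourceValue](new SourceInterpreter)
    analyzer.analyze(ownerInternalName, method)
  }
}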

Besides taking a lot of CPU cycles, ASM analyses also cause huge memory churn. This is easily observed by attaching a profiler: there is basically constant minor GC going on as soon as the compiler reaches the backend.

Bugfixes

This PR also fixes some older bugs:

  • An unused Int.unbox("") should not be eliminated, as it throws a CCE (bug in 2.12; see the example below)
  • Inlining may cause a crash when writing the classfile and computing stack map frames (the ClassBType of inlined code was not created and cached during inlining; also a bug in 2.12)
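To illustrate the first fix:

object UnboxBug {
  def f(): Unit = {
    // The value is unused, but the call must not be eliminated: unbox casts
    // its argument to java.lang.Integer, so this throws a ClassCastException.
    Int.unbox("")
  }
}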

@lrytz lrytz added the WIP label Aug 24, 2018

@scala-jenkins scala-jenkins added this to the 2.13.0-RC1 milestone Aug 24, 2018

@lrytz lrytz force-pushed the lrytz:inlineRounds branch from d8bb261 to e1f6d2e Aug 24, 2018

@lrytz


Member

lrytz commented Aug 24, 2018

Test failure due to more inlining, leading to scala/bug#9137. Oh, and Travis doesn't like the long time spent in the inliner :) "No output has been received in the last 10m0s."

@lrytz lrytz force-pushed the lrytz:inlineRounds branch 2 times, most recently from d384e38 to 2971c8d Aug 25, 2018

@lrytz lrytz force-pushed the lrytz:inlineRounds branch 2 times, most recently from a0e2ea4 to 6b4da82 Aug 27, 2018

@lrytz lrytz force-pushed the lrytz:inlineRounds branch 5 times, most recently from 9c4f5de to 39f680f Sep 7, 2018

@lrytz lrytz force-pushed the lrytz:inlineRounds branch 2 times, most recently from 099b2b4 to 441e7d0 Sep 17, 2018

@lrytz lrytz force-pushed the lrytz:inlineRounds branch from bb60c1f to 1933c6b Sep 26, 2018

@lrytz lrytz removed the WIP label Sep 27, 2018

@lrytz lrytz changed the title from [WIP] More capable inliner [ci: last-only] to More capable inliner [ci: last-only] Sep 27, 2018

@lrytz


Member

lrytz commented Sep 27, 2018

This is finally getting ready now! I summarized the changes in the PR description. Feedback is very welcome, /cc @Ichoran, @non, @mkeskells, @fwbrasil

@fwbrasil


fwbrasil commented Sep 27, 2018

@lrytz interesting work! I wonder if static inlining could hurt performance, though. The JIT compilers can make much more informed decisions based on the runtime profile.

@lrytz


Member

lrytz commented Sep 27, 2018

Yes, static inlining can certainly hurt performance, mostly because the JVM prefers working with small methods (the exception to that rule is probably the rather low default -XX:MaxInlineLevel).

The goal of the Scala inliner and optimizer is to produce bytecode that the JVM can handle better. Or in other words, perform some optimizations that are common in Scala and that we know the JVM doesn't handle very well. There are basically two areas of interest:

  • Higher-order methods: they lead to megamorphic callsites if not inlined / specialized enough (see the sketch after this list). Also, closure elimination can improve performance, as the JVM's escape analysis seems to run into limitations when the code becomes more complex.
  • Boxing: inlining can sometimes enable the elimination of boxing and unboxing. This has limited scope: it doesn't help for primitives in data structures (for example, an ArrayBuffer[Int]).
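A sketch of the megamorphic-callsite problem (illustrative example, not from the PR):

object Megamorphic {
  // A single Function2.apply callsite lives inside combine. Each lambda
  // below gets its own class at runtime, so if combine is not inlined,
  // that callsite sees three receiver classes and the JIT treats it as
  // megamorphic: no JIT inlining, just virtual dispatch.
  def combine(a: Int, b: Int)(f: (Int, Int) => Int): Int = f(a, b)

  val sum  = combine(1, 2)(_ + _)
  val prod = combine(1, 2)(_ * _)
  val diff = combine(1, 2)(_ - _)
}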

We inline very small methods as this does not increase method sizes and can sometimes lead to better static method-local analysis (for example when inlining a factory method).

All of the JVM discussion above is about HotSpot (version 8 in particular; I don't know if there were substantial changes in this area in 9/10/11). Once GraalVM becomes more prevalent we'll have to re-assess the situation. Graal definitely has a more powerful inliner with heuristics geared towards higher-order methods / functional patterns, and partial escape analysis might also have a substantial impact on Scala code. In the best case the Scala optimizer becomes obsolete; in the worst case the rewrites performed by the Scala optimizer have a negative effect when running on Graal.

@lrytz lrytz requested a review from retronym Sep 28, 2018

@lrytz lrytz force-pushed the lrytz:inlineRounds branch from 3196467 to 99bf708 Sep 28, 2018

@lrytz lrytz closed this Sep 28, 2018

@lrytz lrytz reopened this Sep 28, 2018

@lrytz lrytz force-pushed the lrytz:inlineRounds branch 2 times, most recently from 724aa17 to 8d17277 Sep 28, 2018

@lrytz


Member

lrytz commented Sep 28, 2018

Pushed changes (proposed and written by @retronym) to cut dependencies of Java sources on Scala files in the library, and switched to JavaThenScala. This allows inlining code that references those Java-defined types, see #6085 (comment). This shows a slight performance improvement, see the 3rd data point in this graph.

lrytz and others added some commits Sep 13, 2018

Inline array_apply/update after rewriting ClassTag(x).newArray
Inliner heuristic to inline array_apply/update if the array has a
statically known type. Then the type tests of the callee can be
evaluated at compile-time.

Also cleanups in LocalOpt
Allocate fewer local variables in the inliner
When an argument value (on the stack) at an inlined callsite already has
a local variable holding the same value, don't create a new local
variable, but use the existing one.
Limit method size when inlining. Improve inliner heuristics
Don't let methods grow over 3000 asm instructions.

Don't inline into synthetic forwarders.

Inline forwarders, especially ones that box/unbox.

Categories for inline requests: higher-order methods are still inlined
when the size is > 2000, but forwarders aren't.

Fix @noinline on paramless callsite
Remove analyzer caching
It hurts performance. Analyses are really memory-intensive. Caching
them uses a lot of memory, and basically defers collecting them from
minor to major GC (they are stored using SoftReference).
With a 1.5G heap, the JVM is basically doing major GC all the time
when running the optimizer, which drastically hurts performance. With
a 3G heap this looks better. However, even with a 3G heap the overall
performance is worse than without any caching on a 1.5G heap.

Generally, the optimizer has a huge amount of memory churn, this can
be observed easily with a profiler. The best way to improve on that is
probably to run fewer analyzers, i.e., to fuse local optimizations that
use the same kind of analysis.

@lrytz lrytz force-pushed the lrytz:inlineRounds branch from a2e4ec6 to 9f95dec Oct 15, 2018

@retronym

Epic!

@retronym


Member

retronym commented Oct 16, 2018

@lrytz Here's a commit to update the ScalaOptionParser in our SBT build: https://github.com/scala/scala/compare/2.13.x...retronym:review/7133?expand=1

@mkeskells


Contributor

mkeskells commented Oct 16, 2018

@lrytz Impressive! Looks like the work to improve the inliner expanded into a rewrite!
I am very late to the party here and will make some time to have a look in more detail this week.
I have a few questions in the interim, though:

  • has this been tested with the parallel backend? If nothing else, this should be able to reduce some of the extra overheads of the inliner, as the inliner is part of the phase for local optimisations, isn't it?

  • is this change readily back-portable to 2.12? I know there are 200+ files changed, but a lot of it is non-jvm stuff that looks mechanical (FunctionN etc.), and I thought the jvm part was quite similar

  • did you already look at the inliner warnings in the current build? I presume this will be a separate PR (e.g. a PR to make better use of the new inliner), but I recall there were a bunch of inliner warnings that I used to see when making a new dist

Use recommended syntax for opt-inline-from
Both `.` and `/` work as package separators; the matcher simply replaces
`.` by `/`. But using dots is recommended in the setting's help.
@lrytz


Member

lrytz commented Oct 17, 2018

has this been tested with the parallel backend? If nothing else, this should be able to reduce some of the extra overheads of the inliner, as the inliner is part of the phase for local optimisations, isn't it?

Very good point, I'll test the parallel backend. It will not help for inlining, because inlining works across compilation units, so we cannot parallelize. But it works for all the intra-method optimizations, which is where most of the time is spent anyway.

is this change readily back-portable to 2.12

Technically I don't see a problem backporting it to 2.12. So it's a question of risk: it's a big change that impacts generated bytecode. We should at least gain a bit of confidence first and see how well it does on 2.13.x.

Did you already look at the inliner warnings in the current build

No, it's a good suggestion. We might learn a few useful things from looking at those warnings. Last time I looked, the main problem was mixed compilation: when a Scala method references a Java type for which there's no bytecode available, we cannot inline that method. This was very common because we have some core classes in the library written in Java. This PR fixes that by going back to JavaThenScala.

lrytz and others added some commits Oct 26, 2018

Update our scala option parser for the scala input task in SBT
To support the expanded range of choices for `-opt`, and also
`-opt-inline-from` which hand't been added.

@lrytz lrytz merged commit 051e7b6 into scala:2.13.x Oct 26, 2018

2 of 3 checks passed

continuous-integration/travis-ci/pr The Travis CI build is in progress
cla @lrytz signed the Scala CLA. Thanks!
validate-main [5235] SUCCESS. Took 44 min.
@SethTisue


Member

SethTisue commented Oct 26, 2018

🎉 🎉 🎉

@mdedetrich


Contributor

mdedetrich commented Oct 26, 2018

Technically I don't see a problem backporting it to 2.12. So it's a question of risk: it's a big change that impacts generated bytecode. We should at least gain a bit of confidence first and see how well it does on 2.13.x.

Doesn't this effectively rule out a change for Scala 2.12.x?

@lrytz


Member

lrytz commented Oct 26, 2018

Doesn't this effectively rule out a change for Scala 2.12.x?

I don't think that it's ruled out by any principles. Binary compatible bytecode changes are OK, so it's a risk-value tradeoff.

@mdedetrich


Contributor

mdedetrich commented Oct 27, 2018

Yeah, that's what I meant by effectively; judging by the size of the PR and how much has changed, I suspect it would be very hard to make sure it's completely binary compatible with the 2.12.x series

@lrytz


Member

lrytz commented Oct 29, 2018

I suspect it would be very hard to make sure it's completely binary compatible with the 2.12.x series

I think the risk that (a backport of) this PR breaks binary compatibility is very small. What changes is the set of optimizations within methods and the selection of callsites that are inlined. None of that changes the binary interface of the released scala-library.

The risks I see are

  • bugs in the inliner / optimizer, which could lead to wrong or invalid bytecode when compiling a project with the optimizer enabled (or even wrong/invalid bytecode in the released scala-library)
  • unexpected / large compile-time increases. It's known that the additional inlining introduced in this PR causes the optimizer to take more time. I took measures to keep it under control (size limits), but there's a risk that certain code patterns need more time to analyze than expected.