Fix GraalVM arguments in the README #57


Merged 1 commit into scala:master on Jan 30, 2018

Conversation

vjovanov (Contributor)

No description provided.

@vjovanov force-pushed the fix-graal-arguments branch from 12bf885 to d09ae00 on January 18, 2018 17:39
retronym (Member) commented Jan 23, 2018

Thanks @vjovanov.

I'm still hoping to better understand which optimizations Graal performs that make it run scalac 30% faster than C2 (details in #35).

retronym merged commit 367811e into scala:master on Jan 30, 2018
vjovanov (Contributor, Author)

This paper could help with understanding.
This talk is also good. The paper will appear at CGO 2018: "Dominance-based Duplication Simulation (DBDS) - Code Duplication to Enable Compiler Optimizations".

retronym (Member) commented Jan 30, 2018

Thanks for the links, Vojin.

I've run this benchmark suite with Graal and C2 and collected profiles from each with a few different tools (C2 / Graal).

The forward flamegraphs (C2 / Graal) show that the Graal compiler was still quite busy when the profiler started after 100s warmup. I suppose that makes sense with Graal's more aggressive approach to inlining. We might need to give Graal longer to warm up to get a real measurement of the peak performance.

Looking at the respective reverse flamegraphs (C2 / Graal), I can immediately see less time in itable stubs, suggesting that Graal has managed to devirtualize more. We've got some known problems in the 2.12 collections related to excessive forwarder methods that exhaust C2's default inlining depth (-XX:MaxInlineLevel); we've worked around these manually in HashMap, but they are still present in LinkedHashMap.
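
To make that concrete, here's a toy sketch of the forwarder shape (hypothetical names, not the real 2.12 collections code): each layer of delegation spends one level of the inlining budget before any real work happens.

```scala
// Hypothetical sketch of the forwarder-method shape described above.
// Each layer adds one frame of inlining depth before any real work
// happens, eating into C2's budget (-XX:MaxInlineLevel, default 9).
trait HasElems[A] { def elems: List[A] }
trait Forwarder1[A] extends HasElems[A]   { def size1: Int = elems.size }
trait Forwarder2[A] extends Forwarder1[A] { def size2: Int = size1 }
trait Forwarder3[A] extends Forwarder2[A] { def size3: Int = size2 }

final class Coll[A](val elems: List[A]) extends Forwarder3[A]

object Demo extends App {
  // The hot path is size3 -> size2 -> size1 -> elems.size: several levels
  // of the inlining budget are spent before the element count is touched.
  println(new Coll(List(1, 2, 3)).size3)
}
```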

Comparing the perfnorm runs (C2 / Graal), I see a ~2x improvement with Graal on iTLB loads and misses. Graal is generally better across the other stats; Clocks Per Instruction improves from 0.819 to 0.738 (about a 10% reduction). I'm not particularly good at drawing conclusions from these counters, as I've still got a lot to learn here! I'll also note that because Graal was still rather active during the benchmark (~25%, albeit on background thread(s)), these numbers are measuring the performance of scalac and Graal itself, and I need to repeat the experiment with a longer warmup to have an apples-to-apples comparison.

I'm hopeful that further analysis and testing will help Graal (by providing a good datapoint and by flushing out some bugs), but it might also help us in the near term to find a few areas in scalac that could be hand-tuned to bridge some of the performance gap.

vjovanov (Contributor, Author)

Regarding the ongoing compilation:

  1. Although compilation runs in another thread, it can congest the memory bus as well as affect frequency scaling. I have noticed that with frequency scaling disabled, and on machines with more memory bandwidth, I get better results; this could very well be the reason. So yes, we need to increase the warmup time. Using -XX:+PrintCompilation is a good way to see when benchmarks are warmed up; I prefer this to just looking at the performance of the last round.

  2. In the flamegraph I can see that about 10% of the compilation time is also spent in the C1 compiler. This could mean that the benchmark is not completely warmed up even on HotSpot.

  3. I ran the benchmark with the flags hot -psource=scalap -wi 30 -i 5 -f1 -jvm graalvm/Contents/Home/bin/java -jvmArgs -XX:+PrintCompilation, and it seems that we have some deoptimization loops. Methods like scala.reflect.internal.tpe.TypeComparers$class::firstTry$1 get recompiled on every iteration. There are a few dozen of these that we need to investigate; see the sketch after this list.
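
As a rough illustration of how one might spot these recompilation loops, here is a small sketch that counts how often each method name appears in a -XX:+PrintCompilation log (the regex is only an approximation of the log's line shape):

```scala
// Rough sketch: count how many times each method appears in a
// -XX:+PrintCompilation log. The parsing is approximate (real log lines
// carry compile ids, tiers, and "made not entrant" markers), but methods
// with very high counts are good candidates for a deopt loop.
import scala.io.Source

object RecompileLoops extends App {
  val method = """([\w.$]+::[\w$<>]+)""".r
  val counts = scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)

  for (line <- Source.fromFile(args(0)).getLines(); m <- method.findFirstIn(line))
    counts(m) += 1

  // Print the 30 most frequently compiled methods, most-compiled first.
  counts.toSeq.sortBy(-_._2).take(30).foreach { case (name, n) =>
    println(f"$n%5d  $name")
  }
}
```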

Until we get rid of the background compilation, it is hard to judge the other numbers exactly. The iTLB loads do look interesting indeed.

The devirtualization results are quite interesting. I guess most of it comes from better inlining heuristics. To better understand which inlining decisions are made, which could help in manually optimizing the scalac code, you can use the flags -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining. Unfortunately, this prints all inlining decisions. I have opened an issue that will allow you to use the Graal logging scopes to narrow the printing down to a single method. I think it would be easier to wait a bit for better printing of inlining decisions than to use the existing flags.
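
As a crude stopgap until that issue is resolved, one could post-process the full dump. A hypothetical sketch that keeps only the indented subtree under one root method (the indentation handling is approximate):

```scala
// Crude sketch: filter a full -XX:+PrintInlining dump down to the subtree
// of one root method. PrintInlining indents callees under their caller, so
// we print from a match until the indentation returns to the root's level.
import scala.io.Source

object InliningOf extends App {
  val Array(logFile, rootMethod) = args.take(2)
  var rootIndent = -1 // -1 means "not currently inside the subtree"

  for (line <- Source.fromFile(logFile).getLines()) {
    val indent = math.max(line.indexWhere(_ != ' '), 0)
    if (rootIndent >= 0 && indent <= rootIndent) rootIndent = -1 // left the subtree
    if (rootIndent < 0 && line.contains(rootMethod)) rootIndent = indent
    if (rootIndent >= 0) println(line)
  }
}
```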

I would also include a comparison with Graal core in the graphs. For Scala, it is already doing a better job than C2. Maybe it is better to invest time in optimizing the scalac code for that compiler.

retronym (Member) commented Feb 1, 2018

We have frequency scaling disabled during benchmark runs (using this script).

Our profile runs also collect the output of -prof hs_comp, which shows, e.g., the total number of methods compiled after each iteration. That's what I tend to use to determine whether things are warm enough. Here's how it looks for Graal with twice the warmup (200s) vs C2.

I've subscribed to oracle/graal#293 and look forward to what you discover. I know that running scalac on a single thread throws up some tricky problems for the JIT compiler, e.g. type tests and other branches that look constant during typer but can take different paths when new AST nodes show up or disappear later on.

I actually went and turned the default implementations of Type.mapOver, Tree.traverse, and Tree.transform into virtual calls to avoid these problems in some areas (and to avoid the O(N) cost of a pattern match running through successive instanceOf checks).
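
For illustration only, here is a toy version of the two dispatch styles (not the actual scalac code):

```scala
// Toy illustration of the trade-off described above. The pattern-match
// version pays O(number of cases) instanceOf tests per call, and its
// branch profile shifts between compiler phases; the virtual-call version
// replaces that chain with a single dispatch the JIT can devirtualize.
sealed abstract class Tree {
  def traverse(f: Tree => Unit): Unit // virtual-dispatch version
}
final case class Ident(name: String) extends Tree {
  def traverse(f: Tree => Unit): Unit = f(this)
}
final case class Apply(fun: Tree, arg: Tree) extends Tree {
  def traverse(f: Tree => Unit): Unit = { f(this); fun.traverse(f); arg.traverse(f) }
}

object MatchVersion {
  // Pattern-match version: a chain of instanceOf checks on every call.
  def traverse(t: Tree, f: Tree => Unit): Unit = t match {
    case i @ Ident(_)        => f(i)
    case a @ Apply(fun, arg) => f(a); traverse(fun, f); traverse(arg, f)
  }
}
```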

I do plan to automate benchmarking against different JDKs (e.g. 8/9/HS+Graal/OpenJ9) soon, but it requires a bit of annoying plumbing work in our results database first.
