Enable parallel optimizing and writing of classes by GenBCode #6124
Conversation
Here are some stats for this change, run prior to the last rebase; they should be close to current. I will generate more stats for a run later with the new baseline and rebased code. Configuration: each test compile is run 20 times in the same VM to warm up; all timings are in ms. The second column is not so interesting - it is the number of phases included.
Memory allocation (in GenBCode) is reduced slightly, from 423.26 MB to 412.32 MB, for single-threaded. MT figures are not available due to lack of instrumentation.
This PR is not complete.
Memory pressure is reduced.
Exciting! I'll take a look on Friday.
Force-pushed from b1b0dd9 to dd3ce70
Updated some more extensive testing results - same Windows machine, but 150 iterations of each compile so that the numbers are more stable.
Update - mostly tidy-ups; combined the sync and async writer code paths. As the back-pressure fix will significantly affect timing, I will re-run and update the previously published timings when ready. It takes a few hours of CPU to do this, though.
It seems that the individual changes you list in the commit messages are independent of each other, so they should be in separate commits. Having commits focused on one topic saves a lot of time when going through the repo history (in the future).
I haven't looked at everything in all detail, in particular the third commit. In general it looks very good! I think there is a bit of over-abstraction in ClassfileWriter / ClassHandler / UnitInfoLookup and their subclasses. I'll look at it more later; maybe some of the things can be simplified.
@@ -261,7 +261,7 @@ abstract class BTypesFromSymbols[G <: Global](val global: G) extends BTypes {
       r
     })(collection.breakOut)

-  private def setClassInfo(classSym: Symbol, classBType: ClassBType): ClassBType = {
+  private def computeJavaClassInfo(classSym: Symbol, classBType: ClassBType): Right[Nothing, ClassInfo] = {
This method is not only used for java-defined classes, so it should be named computeClassInfo.
done. I have no memory as to why I changed it ...
  def unapply(cr: ClassBType) = Some(cr.internalName)

  def apply(internalName: InternalName, cache: mutable.Map[InternalName, ClassBType])(init: (ClassBType) => Either[NoClassBTypeInfo, ClassInfo]) = {
    assert(Thread.holdsLock(frontendAccess.frontendLock))
This is not obvious to me. Before this PR, frontendLock was guarding accesses to the frontend, and the Lazy class used in ClassInfo. IUC, this PR puts CodeGen.genClassDef under that lock, too. But I think there are other places where ClassBTypes are created, for example during inlining (classBTypeFromParsedClassfile). Did you test running the parallel backend with the optimizer enabled?
  def apply(internalName: InternalName, cache: mutable.Map[InternalName, ClassBType])(init: (ClassBType) => Either[NoClassBTypeInfo, ClassInfo]) = {
    assert(Thread.holdsLock(frontendAccess.frontendLock))
    val res = new ClassBType(internalName)
    res.synchronized {
So we want to ensure that there is a single ClassBType created for each InternalName, also if there are multiple threads. IUC the synchronization here (and within def info above) is to ensure that no thread sees a partially constructed instance.
But I think in the current state of this PR, there could still be a race condition. If two threads invoke classBTypeFromParsedClassfile for the same InternalName, both could get None from cachedClassBType and create a new instance.
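A minimal sketch of how such a check-then-create race can be closed, using an atomic insert-if-absent so both threads observe the same instance. The names here (ClassBTypeLike, ClassBTypeCache) are hypothetical stand-ins for illustration, not the PR's actual code:

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical stand-in for ClassBType, keyed by internal name.
final class ClassBTypeLike(val internalName: String)

object ClassBTypeCache {
  private val cache = new ConcurrentHashMap[String, ClassBTypeLike]()

  // computeIfAbsent makes "look up, else construct and insert" a single
  // atomic step: if two threads race on the same internalName, the mapping
  // function runs once and both threads receive the same instance.
  def obtain(internalName: String): ClassBTypeLike =
    cache.computeIfAbsent(internalName, name => new ClassBTypeLike(name))
}
```

With a plain `mutable.Map` and a separate get/put, both threads can pass the "not cached" check before either inserts, which is exactly the race described above.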
@@ -67,6 +67,7 @@ abstract class ByteCodeRepository {
    * of a ClassNode is about 30 kb. I observed having 17k+ classes in the cache, i.e., 500 mb.
    */
   private def limitCacheSize(): Unit = {
+    println(s"limitCacheSize ${parsedClasses.size} $maxCacheSize")
println
@@ -9,10 +9,13 @@ import scala.collection.mutable.ListBuffer
  * The trait provides an `initialize` method that runs all initializers added through `perRunLazy`.
  */
 trait PerRunInit {
+  // We have to synchronize on inits, as many of the initialisers are themselves lazy,
+  // so the back end may initialise them in parallel, and ListBuffer is not threadsafe.
+  // Not sure it is sensible to have a perRunInit that is itself a lazy val.
The idea is that only those PerRunInits that are actually used are also created, and re-initialized on the next run.
Yes, but we have lazy vals which are per-run inits, so their registration is potentially in multiple threads, either at class-load time or when the val is first touched. I am not sure why a per-run is lazy anyway, but I could not reason that it is guaranteed to be threadsafe, and at best it would be fragile.
The problem is the actual list of per-run inits, not the actual target.
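A sketch of the synchronization being discussed, assuming the shape described in the diff above (illustrative names; not the compiler's exact code): registrations may arrive from several threads via lazy vals, and ListBuffer is not threadsafe, so both registration and the per-run replay are guarded by the same lock:

```scala
import scala.collection.mutable.ListBuffer

// Illustrative sketch of a threadsafe PerRunInit-style trait.
trait PerRunInitLike {
  // ListBuffer is not threadsafe; lazy vals may register from any thread,
  // so guard registration and replay with the same monitor.
  private val inits = ListBuffer.empty[() => Unit]

  // Register an initializer to be re-run at the start of each compiler run.
  def perRunLazy(init: () => Unit): Unit = inits.synchronized {
    inits += init
  }

  // Called at the start of each run: replays every registered initializer.
  def initialize(): Unit = inits.synchronized {
    inits.foreach(_.apply())
  }
}
```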
    compilerSettings.optInlinerEnabled || compilerSettings.optClosureInvocations
  }

  private var generatedHandler: ClassHandler = _
This var field doesn't fit in the backend's design. Could this be a component of CodeGen? If there's state that needs to be re-initialized, it would be good to fit it into the perRunLazy infrastructure.
Probably could be a per-run. Have you measured the overhead of per-run, or considered the option of an assigned clearable structure that is directly accessed?
It's effectively a local variable, but because the run loops via apply for each CompilationUnit, it needs to be shared between the two methods and can't be passed directly.
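One possible shape for the "assigned clearable structure" mentioned above (purely illustrative; PerRunSlot is a hypothetical name, not part of the compiler): a holder that is explicitly reset between runs, giving the same sharing between methods without a bare var of unclear lifetime:

```scala
// Illustrative sketch: a per-run slot that is lazily built on first access
// and explicitly cleared between runs, so stale state cannot leak forward.
final class PerRunSlot[T](mk: () => T) {
  private var value: Option[T] = None

  // Build on first access within a run; return the cached value thereafter.
  def get: T = synchronized {
    value.getOrElse { val v = mk(); value = Some(v); v }
  }

  // Called between runs to force re-initialization on next access.
  def clear(): Unit = synchronized { value = None }
}
```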
@@ -778,6 +778,15 @@ abstract class BTypes {
     } while (fcs == null)
     fcs
   }
+
+  // equality and hashCode are based on internalName
+  override def equals(obj: scala.Any): Boolean = obj match {
I think reference equality / identity hashCode should be fine here, no? The construction pattern should guarantee that there's a single instance per InternalName.
I did this because it failed some unit tests. It is not used in production though, just a test that ran from a symbol and from a classfile.
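The diff above keys equality on the internal name; here is a minimal illustration of that pattern (NamedBType is a hypothetical class, not the compiler's), where equals and hashCode both delegate to internalName so instances built via different routes compare equal:

```scala
// Illustrative: equality and hash code based solely on internalName,
// so an instance built from a symbol and one built from a parsed
// classfile compare equal despite being distinct objects.
final class NamedBType(val internalName: String) {
  override def equals(obj: Any): Boolean = obj match {
    case other: NamedBType => other.internalName == internalName
    case _                 => false
  }
  override def hashCode: Int = internalName.hashCode
}
```

Note the contract being preserved: equal objects must have equal hash codes, which is why both methods delegate to the same field.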
/rebuild
Found an initialization order issue, which means that all of the MT results used the default scheduler with 4 threads and no back pressure. Will post an update on Sunday, with updates in bold.
These changes should not affect the review; they affect scheduling, not the algorithm.
The latest commit has a marked perf improvement in Windows IO - below is the effect in single-threaded Windows.
Other figures are distorted due to too much GC, but not getting the expected parallelism. Will rerun the stats with -Yprofile-run-gc and keep digging.
/rebuild
/rebuild
Spun out some work in #6162
@mkeskells before going deeper into reviewing the actual implementation, I'd like to work on the structure of this PR. Could you create individual commits for each topic; don't blend things that can be separated. Examples (mostly from your commit message):
For each refactoring, it would be nice to have (in the commit message, maybe in code where it makes sense) a high-level overview of the new architecture and a motivation for why / how this enables future changes (parallelism). You don't need to break things into separate PRs (only if a change makes sense individually; things that depend on each other better stay together), but the commits should be on one topic and self-contained. Even before doing that, it would be helpful if you could lay out the main plan that you have in mind for parallelizing the backend, and the observations that you made so far.
Force-pushed from 3022b47 to ebdaea5
Force-pushed from 5514fd6 to 1dfbb32
/rebuild
Somehow GitHub shows my comments collapsed, but the two remaining nits are
Hmm - that's irritating - will try to do it properly ...
Force-pushed from 7c45bd8 to c8e6887
@mkeskells to summarize: I took the branch from my second review pass, the one you squashed into a single commit (c8e6887). I rebased my branch to make sure it has the same baseline as your PR. The diff of those two branches is this: https://gist.github.com/lrytz/5b0db9dd3829bb15140f098161fe6fc5, so they are almost the same. The only things that I'd still like to include from that diff
@lrytz pushed untested changes for review
/synch
Ah well, scabot is still familiarizing itself with the new CI setup...
🚀
Woot!
🎉
:-)
Very cool to see this crossing the finish line!
Minor regression under
Also under
If a bug swarms in the forest and no one is there to be bothered by it, does it still annoy?
This PR ports the JVM backend refactor from Scala 2 as part of the #14912 thread. It squashes changes based on the PRs:
- scala/scala#6012
- scala/scala#6057
The last refactor, scala/scala#6124, which introduces backend parallelism, is left for later.
This PR ports scala/scala#6124, which introduces backend parallelism, on top of the previously ported changes introduced in #15322. Adds Scala 2 flags:
* `-Yjar-compression-level:N`
* `-Ybackend-parallelism:N`
* `-Ybackend-worker-queue:N`
Introduces a compiler flag -Ybackend-parallelism N which allows specifying the number of threads for optimizing and writing classes in parallel. The granularity of parallelization is a compilation unit (a source file).
We measured a speedup of 4-5% when compiling the scala/scala sources with 8 backend threads (graph).
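The per-compilation-unit granularity described above can be sketched as a fixed pool fed one task per unit (illustrative only; the pool shape and names here are assumptions, not the compiler's actual scheduler, which also applies back pressure via a bounded queue):

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

object BackendParallelismSketch {
  // Each "compilation unit" (source file) becomes one task;
  // N worker threads optimize and write units concurrently.
  def process(units: Seq[String], threads: Int): Int = {
    val pool    = Executors.newFixedThreadPool(threads)
    val written = new AtomicInteger(0)
    units.foreach { unit =>
      pool.execute { () =>
        // Stand-in for optimizing and writing the unit's classfiles.
        written.incrementAndGet()
        ()
      }
    }
    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.SECONDS)
    written.get
  }
}
```

Because the unit of work is a whole source file, a compile dominated by one very large file gains little, which is consistent with the modest 4-5% figure reported.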
- rework ClassBType to enable parallelism, and move logic into the companion
- rewrite ClassfileWriter, specialising for JAR/dir, and providing wrappers for the less common cases
- rework directory classfile writing to be async NIO based
- make the directory creation async
- make PerRunInit threadsafe
- BackendUtils: make some data structures/APIs threadsafe (indyLambdaMethods)
- add extra parameter -YmaxAdditionalWriterThreads .. "maximum additional threads to write class files"
- add a class handler as a delegate that allows the minimal set of post-processing steps to be performed