
Enable parallel optimizing and writing of classes by GenBCode #6124

Merged
merged 3 commits into from
Feb 19, 2018

Conversation

mkeskells
Contributor

@mkeskells mkeskells commented Oct 11, 2017

Introduces a compiler flag, -Ybackend-parallelism N, which specifies the number of threads used for optimizing and writing classes in parallel.

The granularity of parallelization is a compilation unit (a source file).

We measured a speedup of 4-5% when compiling the scala/scala sources with 8 backend threads (graph).
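As a rough illustration of that granularity, the model is: one task per compilation unit, executed on a small fixed pool of backend threads. The sketch below is hypothetical (names like CompilationUnit and processUnits are invented for the example, not the compiler's actual API):

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, Executors, TimeUnit}

object ParallelBackendSketch {
  // One "unit" per source file, each carrying the classes it produces.
  final case class CompilationUnit(sourceFile: String, classes: List[String])

  // Submit one task per compilation unit to a fixed pool of backend
  // threads; each task "optimizes and writes" its unit's classes.
  def processUnits(units: List[CompilationUnit], nThreads: Int): List[String] = {
    val pool = Executors.newFixedThreadPool(nThreads)
    val written = new ConcurrentLinkedQueue[String]()
    units.foreach { unit =>
      pool.execute(() => unit.classes.foreach(c => written.add(s"${unit.sourceFile}:$c")))
    }
    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.SECONDS)
    // Drain the queue; sort for a deterministic result.
    var out = List.empty[String]
    var next = written.poll()
    while (next != null) { out = next :: out; next = written.poll() }
    out.sorted
  }
}
```

In this model, -Ybackend-parallelism 8 would correspond to an 8-thread pool.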

rework ClassBType to enable parallelism, and move logic into the companion
rewrite ClassfileWriter, specialising for JAR/dir, and providing wrappers for the less common cases
rework directory classfile writing to be async NIO based
make directory creation asynchronous
make PerRunInit threadsafe

BackendUtils - make some data structures/APIs threadsafe (indyLambdaMethods)
add extra parameter -YmaxAdditionalWriterThreads .. "maximum additional threads to write class files"
add a class handler as a delegate that allows the minimal set of post-processing steps to be performed
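The "async NIO based" directory writing mentioned in the list above can be shown in miniature with java.nio's AsynchronousFileChannel: the write is submitted and completes on a background thread. This is a hedged sketch of the technique only - the object name, file contents, and the blocking get() are illustrative, not the PR's actual writer:

```scala
import java.nio.ByteBuffer
import java.nio.channels.AsynchronousFileChannel
import java.nio.file.{Path, StandardOpenOption}

object AsyncWriteSketch {
  // Submit the classfile bytes to an asynchronous channel; the write
  // proceeds on a background thread while the caller could move on to
  // the next unit. Here we block on get() only to keep the demo simple.
  def writeClassfile(target: Path, bytes: Array[Byte]): Unit = {
    val channel = AsynchronousFileChannel.open(
      target, StandardOpenOption.CREATE, StandardOpenOption.WRITE)
    try channel.write(ByteBuffer.wrap(bytes), 0L).get()
    finally channel.close()
  }
}
```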

@scala-jenkins scala-jenkins added this to the 2.12.5 milestone Oct 11, 2017
@mkeskells
Contributor Author

mkeskells commented Oct 11, 2017

here are some stats for this change, run prior to the last rebase; they should be close to current - I will generate more stats later with the new baseline and rebased code

Configuration -
test performed on a 4 core i7 (2 real cores + hyperthreading) Windows 10 system
baseline is the baseline commit; other lines relate to the new code with varying settings of -YmaxAdditionalWriterThreads

each test compile is run 20 times in the same VM to warm up

all timings are in ms. The second column is not so interesting - it is the number of phases included

  • ALL - refers to all runs - time is for the full compile
  • after 10 90% - ignores the first 10 runs, and the worst 10% of results (e.g. heavy GC) - time is for the full compile
  • after 10 90%, phase jvm, no GC - as above but limits the timing to the jvm phase (this PR is targeted at the jvm phase, and makes no material change outside that phase), and discards measurements that include a GC
ALL

                  RunName	                AllWallMS	        
      00_backend-baseline	 810	12,148.60 [-22.10% +185.53%]
             00_backend-0	 810	12,144.77 [-19.87% +220.26%]
             00_backend-1	 810	11,817.25 [-18.54% +204.61%]
             00_backend-2	 810	11,353.86 [-23.45% +194.46%]
             00_backend-3	 810	10,224.86 [-20.78% +229.95%]
             00_backend-4	 810	10,190.22 [-21.00% +225.74%]
             00_backend-5	 810	9,808.82 [-22.97% +233.42%]	
             00_backend-6	 810	9,804.54 [-23.69% +229.63%]	
             00_backend-7	 810	9,559.68 [-22.52% +238.89%]	
             00_backend-8	 810	9,473.14 [-23.06% +260.49%]	
after 10 90%

                  RunName	                AllWallMS	        
      00_backend-baseline	 510	10,157.17 [-6.82% +7.54%]	
             00_backend-0	 510	10,180.64 [-4.41% +2.87%]	
             00_backend-1	 510	 9,841.78 [-2.18% +2.58%]	
             00_backend-2	 510	 8,959.68 [-2.99% +4.71%]	
             00_backend-3	 510	 8,424.00 [-3.84% +2.40%]	
             00_backend-4	 510	 8,314.82 [-3.18% +2.75%]	
             00_backend-5	 510	 7,886.69 [-4.20% +3.23%]	
             00_backend-6	 510	 7,868.82 [-4.92% +3.58%]	
             00_backend-7	 510	 7,606.62 [-2.62% +3.20%]	
             00_backend-8	 510	 7,487.36 [-2.66% +3.35%]	
after 10 90%, phase jvm, no GC

                  RunName	                AllWallMS	        
      00_backend-baseline	  16	 3,685.55 [-5.46% +7.87%]	
             00_backend-0	  15	 3,481.92 [-3.84% +4.13%]	
             00_backend-1	  16	 3,394.66 [-4.64% +5.73%]	
             00_backend-2	  18	 2,471.76 [-8.64% +6.61%]	
             00_backend-3	  17	 2,096.19 [-5.71% +4.60%]	
             00_backend-4	  18	1,825.50 [-6.95% +10.41%]	
             00_backend-5	  17	1,491.01 [-14.57% +7.66%]	
             00_backend-6	  16	1,304.25 [-15.29% +13.13%]	
             00_backend-7	  15	1,165.13 [-7.80% +14.10%]	
             00_backend-8	  16	 1,118.78 [-6.36% +6.55%]	

memory allocation (in GenBCode) is reduced slightly from 423.26 MB to 412.32 MB in the single-threaded case. MT figures are not available due to lack of instrumentation

@mkeskells
Contributor Author

This PR is not complete.
For example, it needs a rerun of performance tests on UNIX/Mac, there is still some tidy-up to do, and it needs validation of the additional code paths introduced. But as it provides a measurable performance improvement even in the single-threaded case, and a marked throughput improvement in the MT case, I thought it was worth pushing the review a little early

@mkeskells
Contributor Author

Memory pressure is reduced

  • the processing is pipelined, with the later phases (writing the data to disk) recovering the memory as soon as the file is written
  • the pipeline is bounded, and if the writer can't keep up, the earlier phase sees back-pressure
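That bounded pipeline behaviour can be sketched with a blocking queue: put blocks when the queue is full, which throttles the producer exactly as described. The capacity, the EOF marker, and all names below are invented for the illustration, not taken from the PR:

```scala
import java.util.concurrent.ArrayBlockingQueue

object BackPressureSketch {
  // A bounded queue between the generating phase and the writer: `put`
  // blocks when the queue is full, so a fast producer is throttled by a
  // slow consumer. Capacity 2 is an illustrative choice.
  def run(classfiles: List[String]): List[String] = {
    val queue = new ArrayBlockingQueue[String](2)
    var written = List.empty[String]
    val writer = new Thread(() => {
      var item = queue.take()
      while (item != "<EOF>") {
        written = item :: written // simulate writing to disk, freeing memory
        item = queue.take()
      }
    })
    writer.start()
    classfiles.foreach(queue.put) // blocks here when the writer lags behind
    queue.put("<EOF>")
    writer.join() // join gives safe visibility of `written`
    written.reverse
  }
}
```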

@lrytz
Member

lrytz commented Oct 11, 2017

Exciting! I'll take a look on Friday.

@mkeskells
Contributor Author

mkeskells commented Oct 12, 2017

Updated with some more extensive testing results - same Windows machine, but 150 iterations of each compile so that the numbers are more stable.

ALL
                  RunName	                AllWallMS	        
      00_backend-baseline	4050	10,222.16 [-8.60% +291.22%]	
             00_backend-0	4050	9,874.56 [-8.87% +180.48%]	
             00_backend-1	4050	7,823.89 [-11.05% +335.20%]	
             00_backend-2	4050	7,747.43 [-11.59% +330.97%]	
             00_backend-3	4050	7,891.42 [-13.35% +326.03%]	
             00_backend-4	4050	7,758.51 [-11.48% +338.29%]	
             00_backend-6	4050	7,904.31 [-11.43% +331.10%]	
             00_backend-8	4050	7,963.89 [-11.47% +323.16%]	
            00_backend-10	4050	7,745.32 [-11.52% +329.60%]	
            00_backend-12	4050	7,825.48 [-13.25% +324.15%]	
            00_backend-14	4050	7,814.28 [-10.58% +327.63%]	
            00_backend-16	4050	7,733.57 [-11.01% +323.25%]	
after 10 90%
                  RunName	                AllWallMS	        
      00_backend-baseline	3426	 9,728.14 [-3.96% +4.95%]	
             00_backend-0	3426	 9,458.36 [-4.86% +7.03%]	
             00_backend-1	3426	 7,366.27 [-5.52% +6.12%]	
             00_backend-2	3426	 7,303.72 [-6.22% +5.00%]	
             00_backend-3	3426	 7,379.21 [-7.34% +5.81%]	
             00_backend-4	3426	 7,284.50 [-5.72% +5.69%]	
             00_backend-6	3426	 7,458.70 [-6.14% +5.16%]	
             00_backend-8	3426	 7,517.85 [-6.21% +4.45%]	
            00_backend-10	3426	 7,278.99 [-5.85% +7.05%]	
            00_backend-12	3426	 7,327.12 [-7.35% +7.02%]	
            00_backend-14	3426	 7,364.65 [-5.12% +5.44%]	
            00_backend-16	3426	 7,265.07 [-5.27% +5.65%]	
after 10 90%, phase jvm, no GC
                  RunName	                AllWallMS	        
      00_backend-baseline	 117	 3,867.26 [-3.81% +4.71%]	
             00_backend-0	 117	 3,244.24 [-5.06% +7.08%]	
             00_backend-1	 112	1,344.32 [-12.66% +9.37%]	
             00_backend-2	 117	1,355.05 [-15.18% +9.20%]	
             00_backend-3	  73	1,337.02 [-13.99% +11.27%]	
             00_backend-4	 125	1,351.97 [-17.44% +9.70%]	
             00_backend-6	  67	1,337.12 [-12.08% +9.96%]	
             00_backend-8	  76	1,319.22 [-14.54% +10.70%]	
            00_backend-10	 120	1,343.10 [-12.30% +8.96%]	
            00_backend-12	  95	1,325.05 [-16.23% +11.20%]	
            00_backend-14	 125	1,322.21 [-11.65% +10.53%]	
            00_backend-16	 121	1,325.99 [-16.11% +10.90%]	
after 20 90%
                  RunName	                AllWallMS	        
      00_backend-baseline	3183	 9,702.79 [-3.71% +4.44%]	
             00_backend-0	3183	 9,417.31 [-4.45% +6.09%]	
             00_backend-1	3183	 7,336.39 [-5.14% +6.02%]	
             00_backend-2	3183	 7,279.51 [-5.90% +5.15%]	
             00_backend-3	3183	 7,357.65 [-7.07% +6.01%]	
             00_backend-4	3183	 7,259.71 [-5.40% +5.47%]	
             00_backend-6	3183	 7,438.31 [-5.88% +4.78%]	
             00_backend-8	3183	 7,501.33 [-6.01% +4.42%]	
            00_backend-10	3183	 7,252.99 [-5.52% +6.73%]	
            00_backend-12	3183	 7,305.25 [-7.07% +7.34%]	
            00_backend-14	3183	 7,341.15 [-4.82% +5.65%]	
            00_backend-16	3183	 7,236.57 [-4.90% +5.35%]	
after 20 90%, phase jvm, no GC
                  RunName	                AllWallMS	        
      00_backend-baseline	 108	 3,856.10 [-3.53% +4.33%]	
             00_backend-0	 108	 3,228.20 [-4.59% +7.02%]	
             00_backend-1	 104	1,336.48 [-12.15% +9.06%]	
             00_backend-2	 108	1,348.77 [-14.79% +9.54%]	
             00_backend-3	  65	1,325.20 [-13.22% +10.45%]	
             00_backend-4	 116	1,345.73 [-17.06% +9.49%]	
             00_backend-6	  61	1,328.12 [-11.48% +10.16%]	
             00_backend-8	  69	1,307.97 [-13.80% +11.47%]	
            00_backend-10	 113	1,338.55 [-12.00% +8.88%]	
            00_backend-12	  87	1,316.63 [-15.69% +11.56%]	
            00_backend-14	 116	1,317.76 [-11.35% +10.81%]	
            00_backend-16	 113	1,318.13 [-15.61% +9.09%]	

@mkeskells
Contributor Author

mkeskells commented Oct 13, 2017

update - mostly tidy-ups

combine sync and async writer code paths
make back-pressure work - this will affect timings
simplify asmp and dump writers, combining with DirWriter
preserve old exception handling model
use Java based jar writing

as the back-pressure fix will significantly affect timings, I will re-run and update the previously published timings when ready. It takes a few hours of CPU to do this, though

@lrytz
Member

lrytz commented Oct 13, 2017

It seems the individual changes you list in the commit messages are independent of each other, so they should be in separate commits. Having commits focused on one topic saves a lot of time when going through the repo history (in the future).

Member

@lrytz lrytz left a comment

I haven't looked at everything in full detail, in particular the third commit. In general it looks very good! I think there is a bit of over-abstraction in ClassfileWriter / ClassHandler / UnitInfoLookup and their subclasses. I'll look at it more later, maybe some of the things can be simplified.

@@ -261,7 +261,7 @@ abstract class BTypesFromSymbols[G <: Global](val global: G) extends BTypes {
r
})(collection.breakOut)

private def setClassInfo(classSym: Symbol, classBType: ClassBType): ClassBType = {
private def computeJavaClassInfo(classSym: Symbol, classBType: ClassBType): Right[Nothing, ClassInfo] = {
Member

This method is not only used for java-defined classes, so it should be named computeClassInfo.

Contributor Author

done. I have no memory as to why I changed it ...

def unapply(cr:ClassBType) = Some(cr.internalName)

def apply(internalName: InternalName, cache: mutable.Map[InternalName, ClassBType])(init: (ClassBType) => Either[NoClassBTypeInfo, ClassInfo]) = {
assert (Thread.holdsLock(frontendAccess.frontendLock))
Member

This is not obvious to me. Before this PR, frontendLock was guarding accesses to the frontend, and the Lazy class used in ClassInfo. IIUC, this PR puts CodeGen.genClassDef under that lock, too. But I think there are other places where ClassBTypes are created, for example during inlining (classBTypeFromParsedClassfile). Did you test running the parallel backend with the optimizer enabled?

def apply(internalName: InternalName, cache: mutable.Map[InternalName, ClassBType])(init: (ClassBType) => Either[NoClassBTypeInfo, ClassInfo]) = {
assert (Thread.holdsLock(frontendAccess.frontendLock))
val res = new ClassBType(internalName)
res.synchronized {
Member

So we want to ensure that a single ClassBType is created for each InternalName, even if there are multiple threads. IIUC the synchronization here (and within def info above) is to ensure that no thread sees a partially constructed instance.

But I think in the current state of this PR, there could still be a race condition. If two threads invoke classBTypeFromParsedClassfile for the same InternalName, both could get None from cachedClassBType and create a new instance.
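A common way to close this kind of check-then-create race is to let the cache perform the insertion atomically, for example with ConcurrentHashMap.computeIfAbsent. This is a generic sketch of that idea with invented names, not the fix the PR ended up with:

```scala
import java.util.concurrent.ConcurrentHashMap

object UniqueInstanceSketch {
  final class ClassRef(val internalName: String)

  private val cache = new ConcurrentHashMap[String, ClassRef]()

  // computeIfAbsent runs the factory at most once per key and publishes
  // the result safely, so two racing threads always observe the same
  // instance - unlike a separate get-then-put sequence.
  def classRefFor(internalName: String): ClassRef =
    cache.computeIfAbsent(internalName, name => new ClassRef(name))
}
```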

@@ -67,6 +67,7 @@ abstract class ByteCodeRepository {
* of a ClassNode is about 30 kb. I observed having 17k+ classes in the cache, i.e., 500 mb.
*/
private def limitCacheSize(): Unit = {
println(s"limitCacheSize ${parsedClasses.size} $maxCacheSize")
Member

println

@@ -9,10 +9,13 @@ import scala.collection.mutable.ListBuffer
* The trait provides an `initialize` method that runs all initializers added through `perRunLazy`.
*/
trait PerRunInit {
//We have to synchronize on inits, as many of the initialisers are themselves lazy,
// so the back end may initialise them in parallel, and ListBuffer is not threadsafe
// Not sure it is sensible to have a perRunInit that is itself a lazy val
Member

the idea is that only those PerRunInits that are actually used are also created, and re-initialized on the next run.

Contributor Author

yes, but we have lazy vals which are per-run inits, so their registration potentially happens on multiple threads, either at class load time or when the val is first touched. I am not sure why a per-run is lazy anyway, but I could not reason about it being guaranteed threadsafe, and at best it would be fragile.

The problem is the actual list of per-runs, not the actual target
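The registration race discussed here - lazy vals registering themselves into a shared ListBuffer from arbitrary threads - can be guarded by synchronizing on the buffer for both registration and the per-run initialize, roughly as follows (a sketch of the idea only, not the actual PerRunInit code):

```scala
import scala.collection.mutable.ListBuffer

class PerRunInitSketch {
  // ListBuffer is not threadsafe, and registrations can arrive from any
  // backend thread (e.g. when a lazy val is first touched), so both
  // registration and the snapshot taken by initialize synchronize on it.
  private val inits = ListBuffer.empty[() => Unit]

  def perRunLazy(init: () => Unit): Unit =
    inits.synchronized { inits += init }

  def initialize(): Unit = {
    val snapshot = inits.synchronized { inits.toList }
    snapshot.foreach(f => f())
  }
}
```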

compilerSettings.optInlinerEnabled || compilerSettings.optClosureInvocations
}

private var generatedHandler:ClassHandler = _
Member

This var field doesn't fit in the backend's design. Could this be a component of CodeGen? If there's state that needs to be re-initialized, it would be good to fit it into the perRunLazy infrastructure.

Contributor Author

probably could be a per-run. Have you measured the overhead of per-run, or considered the option of an assigned clearable structure that is directly accessed?

It's effectively a local variable, but because the run loops via apply for each CompilationUnit it needs to be shared between the 2 methods and can't be passed directly

@@ -778,6 +778,15 @@ abstract class BTypes {
} while (fcs == null)
fcs
}

// equallity and hashcode is based on internalName
override def equals(obj: scala.Any): Boolean = obj match {
Member

I think reference equality / identity hashCode should be fine here, no? The construction pattern should guarantee that there's a single instance per InternalName.

Contributor Author

I did this because it failed some unit tests. It is not used in production though, just a test that ran from a symbol and from a class file

@mkeskells
Contributor Author

/rebuild

@mkeskells
Contributor Author

mkeskells commented Oct 14, 2017

found an initialization order issue which means that all of the MT results used the default scheduler with 4 threads and no back-pressure. Will post an update on Sunday with updates in bold

  1. this fix done
  2. back-pressure working done
  3. investigation of the appropriate and optimum queue size partial; added -Ymax-queue
  4. statistics capture of the background thread CPU, IO and idle time done
    I target 1 second elapsed as the measure of success 😊 done

@mkeskells
Contributor Author

these changes should not affect the review; they affect scheduling, not the algorithm

@mkeskells
Contributor Author

mkeskells commented Oct 16, 2017

latest commit has a marked perf improvement in Windows IO - below is the effect in single-threaded Windows

             RunName	                AllWallMS	        
 00_backend-baseline	 116	 4,653.97 [-2.05% +4.33%]	
        00_backend-0	 119	 2,571.55 [-4.41% +4.94%]	

other figures are distorted due to too much GC, and we are not getting the expected parallelism. Will rerun the stats with -Yprofile-run-gc and keep digging

@mkeskells
Contributor Author

/rebuild

@mkeskells
Contributor Author

spun out #6145 - related (same phase) but not interlocking, as is #6125

@mkeskells
Contributor Author

/rebuild

@mkeskells
Contributor Author

spun out some work in #6162

@lrytz
Member

lrytz commented Nov 10, 2017

@mkeskells before going deeper into reviewing the actual implementation, I'd like to work on the structure of this PR. Could you create individual commits for each topic, and not blend things that can be separated. Examples (mostly from your commit message):

  • rework ClassBType to enable parallelism
  • rework directory classfile writing to be async NIO based
  • refactor ClassfileWriter
  • make PerRunInit threadsafe
  • BackendUtils - make some data structures/APIs threadsafe (indyLambdaMethods and maxLocalStackComputed)
  • a commit that actually enables parallelization; can we split it up into specific parts being
    parallelized (IO, ASM's ClassWriter / serialization, parts of the post-processing)

For each refactoring, it would be nice to have (in the commit message, maybe in code where it makes sense) a high-level overview of the new architecture and a motivation for why / how this enables future changes (parallelism).

You don't need to break things into separate PRs (only if a change makes sense individually; things that depend on each other better stay together), but the commits should be on one topic and self-contained.

Even before doing that, it would be helpful if you could lay out the main plan that you have in mind for parallelizing the backend, and the observations that you made so far.

@mkeskells
Contributor Author

/rebuild

@lrytz
Member

lrytz commented Feb 16, 2018

Somehow GitHub shows my comments collapsed, but the two remaining nits are

@mkeskells
Contributor Author

Hmm - that's irritating - will try to do it properly ....

@mkeskells mkeskells force-pushed the mike/2.12.x-backend-parallel-rebase branch 2 times, most recently from 7c45bd8 to c8e6887 Compare February 18, 2018 22:06
@lrytz
Member

lrytz commented Feb 19, 2018

@mkeskells to summarize: I took the branch from my second review pass, the one you squashed into a single commit (c8e6887). I rebased my branch to make sure it has the same baseline as your PR. The diff of those two branches is this: https://gist.github.com/lrytz/5b0db9dd3829bb15140f098161fe6fc5, so they are almost the same. The only things that I'd still like to include from that diff

@mkeskells
Contributor Author

@lrytz pushed untested changes for review

@lrytz
Member

lrytz commented Feb 19, 2018

/synch

@lrytz
Member

lrytz commented Feb 19, 2018

ah well, scabot is still familiarizing itself with the new CI setup..

Member

@lrytz lrytz left a comment

🚀

@lrytz lrytz merged commit ea64efa into scala:2.12.x Feb 19, 2018
@rorygraves
Contributor

Woot!

@hrhino
Member

hrhino commented Feb 19, 2018

🎉

@mkeskells
Contributor Author

:-)
it's a long while since this started - #5815
but we snuck in before its birthday

@SethTisue
Member

very cool to see this crossing the finish line!

@SethTisue SethTisue mentioned this pull request Feb 23, 2018
@lrytz lrytz added the release-notes worth highlighting in next release notes label Mar 14, 2018
@lrytz lrytz changed the title enable parallel writing of classes by GenBCode Enable parallel optimizing and writing of classes by GenBCode Mar 14, 2018
@lrytz lrytz added 2.12 and removed 2.12 labels Mar 14, 2018
@retronym
Member

Minor regression under -d <symlink> fixed in #6450

@som-snytt
Contributor

Also under -d script.jar. scala/bug#11815

If a bug swarms in the forest and no one is there to be bothered by it, does it still annoy?

sjrd added a commit to scala/scala3 that referenced this pull request Mar 28, 2023
This PR ports the JVM backend refactor from Scala 2 as part of the
#14912 thread.

It squashes changes based on the PRs: 

- scala/scala#6012
- scala/scala#6057

The last refactor introducing backend parallelism
scala/scala#6124 is left for later.
bishabosha added a commit to scala/scala3 that referenced this pull request Oct 10, 2023
This PR ports scala/scala#6124, which introduces
backend parallelism, on top of the previously ported changes introduced
in #15322.
Adds Scala 2 flags: 
* `-Yjar-compression-level:N`
* `-Ybackend-parallelism:N`
* `-Ybackend-worker-queue:N`