Fix #433: Make classpath hashing more lightweight #452

Merged
merged 1 commit into sbt:1.0.x from scalacenter:issue-433 on Nov 14, 2017

Conversation

@jvican
Member

jvican commented Nov 9, 2017

And make it parallel!

This patch adds a cache that relies on filesystem metadata to cache
hashes for jars that have the same last modified time across different
compiler iterations. This is important because until now there was a
significant overhead when running compile on multi-module builds that
have gigantic classpaths. In this scenario, the previous algorithm
computed hashes for all jars transitively across all these projects.

This patch is conservative; there are several things wrong with
the status quo of classpath hashing. The most important one is that
Zinc has been calling hashCode on a SHA-1 checksum, which doesn't
make sense. The second is that we don't need a SHA-1 checksum for
the kind of checks we want to do; #371
explains why. The third limitation is that file hashes
are implemented internally as ints, which is not enough to represent
the richness of the checksum. My previous PR also tackles this problem,
which will be solved in the long term.

Therefore, this pull request only tackles these two things:

  • Caching of classpath entry hashes.
  • Parallelize this IO-bound task.

Results, on my local machine:

  • No parallel hashing of the first 500 jars in my ivy cache: 1330ms.
  • Parallel hashing of the first 500 jars in my ivy cache: 770ms.
  • Second (cached) parallel hashing of the first 500 jars in my ivy cache: 1ms.
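As an illustration only (not this PR's code; the object and method names below are made up), the caching idea reads roughly like this: key each classpath entry by its path, store the (last-modified time, size) pair next to the computed hash, and recompute only when that metadata changes:

```scala
import java.io.File
import java.nio.file.{Files, Path}
import java.nio.file.attribute.{BasicFileAttributes, FileTime}
import java.util.concurrent.ConcurrentHashMap

// Sketch of a metadata-keyed hash cache (hypothetical names).
object ClasspathHashCacheSketch {
  // If neither the timestamp nor the size changed, reuse the old hash.
  private type JarMetadata = (FileTime, Long)
  private val cache = new ConcurrentHashMap[Path, (JarMetadata, Int)]()

  def hashOf(file: File, computeHash: File => Int): Int = {
    val path = file.toPath
    val attrs = Files.readAttributes(path, classOf[BasicFileAttributes])
    val metadata = (attrs.lastModifiedTime(), attrs.size())
    cache.get(path) match {
      case (`metadata`, cachedHash) => cachedHash // metadata unchanged: cache hit
      case _ => // missing or stale entry: recompute and store
        val hash = computeHash(file)
        cache.put(path, (metadata, hash))
        hash
    }
  }
}
```

In the actual PR the hash function is Zinc's Stamper.forHash and the map lives in MixedAnalyzingCompiler; the sketch only shows the metadata check discussed in the review below.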
@fommil


fommil commented Nov 9, 2017

can you please validate against the demo project at https://github.com/cakesolutions/sbt-cake/tree/sbt-perf-regression ?

I have no idea how to run a custom zinc, so I am unable to do so. Hence the monkey patch PRs to sbt.

@@ -162,6 +165,13 @@ final class MixedAnalyzingCompiler(
* of cross Java-Scala compilation.
*/
object MixedAnalyzingCompiler {
private[this] val cacheMetadataJar = new ConcurrentHashMap[File, (FileTime, FileHash)]()


@fommil

fommil Nov 9, 2017

this will leak... you need to use something like a Guava weak cache.

Also, I recommend using Path instead of File, because it has a much smaller memory footprint.

In fact, File (and URL) are the biggest heap hogs in sbt on large projects, mostly because the path prefix is duplicated. I'm talking GBs.


@jvican

jvican Nov 9, 2017

Member

You're right that I should use a path here. I'll use a synchronized map wrapping a weak reference cache. Ironically, that was my first approach, but I didn't quite like it.


@fommil

fommil Nov 9, 2017

doesn't sbt have a synchronised weak cache already?


@jvican

jvican Nov 9, 2017

Member

Not to the best of my knowledge...

Finally, I've decided not to use a synchronized weak map here. We could get fancy, but this cache is only for jars and it's going to have at most around 1000 entries, so I'll keep it as it is.

FileHash.of(x, Stamper.forHash(x).hashCode)
val parallelClasspathHashing = classpath.toParArray.map { file =>
val attrs = Files.readAttributes(file.toPath, classOf[BasicFileAttributes])
val currentFileTime = attrs.lastModifiedTime()


@fommil

fommil Nov 9, 2017

using size as well as modified time is effectively free at this point (it's in attrs), so we could use it to be safe.


@jvican

jvican Nov 9, 2017

Member

Sure.

@@ -181,9 +191,16 @@ object MixedAnalyzingCompiler {
incrementalCompilerOptions: IncOptions,
extra: List[(String, String)]
): CompileConfiguration = {
val classpathHash = classpath map { x =>
FileHash.of(x, Stamper.forHash(x).hashCode)
val parallelClasspathHashing = classpath.toParArray.map { file =>


@fommil

fommil Nov 9, 2017

.toParArray? Is there a better way? The parallel collections rarely speed things up. Certainly hitting the cache in parallel will be slower.

Perhaps parallelisation can be done in a follow up so its impact can be isolated?


@jvican

jvican Nov 9, 2017

Member

That's not my experience. Parallel collections do speed things up, and they're reliable. I can tell you hitting the cache in parallel is not slower; I've measured it locally.


@jvican

jvican Nov 9, 2017

Member

Even for 10 jars, doing it in parallel is faster than doing it in serial.


@propensive

propensive Nov 9, 2017

It might be worth comparing the performance on an SSD versus a mechanical disk. It depends on a lot of factors (and I've no idea how modern technology mitigates them), but parallelizing IO operations on a mechanical disk might actually slow it down.


@jvican

jvican Nov 9, 2017

Member

In my experience, IO-bound operations yield very good results when they're parallelized. I think there are old textbooks on parallel computing, from the pre-SSD era, that claim so too.

In any case, I benchmarked it just in case on a LAMP server. Results:

  • Parallelized hashing: 549ms.
  • Serial hashing: 1912ms.

I'm testing it on /dev/vda, which as you can see is a rotational disk:

platform@scalagesrv2:~/tests/zinc$ lsblk -d -o name,rota
NAME ROTA
fd0     1
sr0     1
vda     1
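The serial-vs-parallel comparison above can be sketched as follows. This is an illustration, not the PR's code (which uses toParArray); it parallelizes the same IO-bound work with stdlib Futures, and sha1Of is a made-up helper name:

```scala
import java.io.File
import java.nio.file.Files
import java.security.MessageDigest
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object ParallelHashingSketch {
  // SHA-1 of a file's contents (the checksum Zinc computes per classpath entry).
  def sha1Of(file: File): Seq[Byte] =
    MessageDigest.getInstance("SHA-1")
      .digest(Files.readAllBytes(file.toPath)).toSeq

  def hashSerial(files: Seq[File]): Seq[Seq[Byte]] =
    files.map(sha1Of)

  // Fan the IO-bound hashing out over the global thread pool.
  def hashParallel(files: Seq[File]): Seq[Seq[Byte]] =
    Await.result(Future.traverse(files)(f => Future(sha1Of(f))), Duration.Inf)
}
```

Both functions return the hashes in classpath order; only the scheduling differs, which is why the speedup is safe to take.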
@jvican


Member

jvican commented Nov 9, 2017

@fommil I will not validate it myself. I already have a test that proves this works. If you're interested in checking this fix out, you can publish zinc locally, change the zinc version in sbt/sbt, publish sbt/sbt locally, and use that version in build.properties.

@fommil


fommil commented Nov 9, 2017

I tried that process, but lost several hours, and then gave up and basically did this

@jvican jvican force-pushed the scalacenter:issue-433 branch 2 times, most recently from e7db2f9 to 17d451b Nov 9, 2017

@@ -162,6 +165,17 @@ final class MixedAnalyzingCompiler(
* of cross Java-Scala compilation.
*/
object MixedAnalyzingCompiler {
private type JarMetadata = (FileTime, Long)
// Using paths instead of files as key because they have more lightweight memory consumption
private[this] val cacheMetadataJar = new ConcurrentHashMap[File, (JarMetadata, FileHash)]()


@leonardehrenfried

leonardehrenfried Nov 9, 2017

Doesn't the code do something different from what the comment says? The key is a File.


@fommil

fommil Nov 9, 2017

and that is why comments SUCK


@jvican

jvican Nov 9, 2017

Member

Yup, that's a stale comment, thanks for the catch.

@jvican jvican force-pushed the scalacenter:issue-433 branch from 17d451b to 1297263 Nov 9, 2017

FileHash.of(x, Stamper.forHash(x).hashCode)
// #433: Cache jars with their metadata to avoid recomputing hashes transitively in other projects
val parallelClasspathHashing = classpath.toParArray.map { file =>
if (!file.exists()) emptyFileHash(file)


@fommil

fommil Nov 9, 2017

I'm pretty sure you can read the file attributes on a non-existent file and then use the response to detect existence... if that is the case, it would be best to use it here and avoid an extra call (which is slow on Windows, so it's best to batch / minimise all I/O calls).


@jvican

jvican Nov 9, 2017

Member

I was just getting IO errors because of this; it was unintuitive to me too.


@jvican

jvican Nov 9, 2017

Member

@fommil here's proof; /caca/peluda does not exist:

scala> val attrs = java.nio.file.Files.readAttributes((new java.io.File("/caca/peluda").toPath), classOf[BasicFileAttributes])
java.nio.file.NoSuchFileException: /caca/peluda
  at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
  at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
  at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
  at sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
  at sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
  at sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
  at java.nio.file.Files.readAttributes(Files.java:1737)
  ... 28 elided
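Given the exception shown in the REPL session above, one way to still keep a single I/O call is to treat the exception itself as the existence check. This is a sketch, not code from the PR, and the helper name is made up:

```scala
import java.nio.file.{Files, NoSuchFileException, Path}
import java.nio.file.attribute.BasicFileAttributes

object AttrsSketch {
  // One I/O call that doubles as an existence check: a missing file
  // surfaces as NoSuchFileException, which we map to None.
  def readAttributesIfExists(path: Path): Option[BasicFileAttributes] =
    try Some(Files.readAttributes(path, classOf[BasicFileAttributes]))
    catch { case _: NoSuchFileException => None }
}
```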


@fommil

fommil Nov 9, 2017

ok cool, that's good to know.

@jvican jvican force-pushed the scalacenter:issue-433 branch from 1297263 to 07aa2c9 Nov 9, 2017

@jvican jvican removed the in progress label Nov 9, 2017

@fommil

fommil approved these changes Nov 9, 2017

@stuhood

stuhood approved these changes Nov 9, 2017

Looks fine from my perspective, although aligning more cleanly with #427 would be good.

@@ -162,6 +165,17 @@ final class MixedAnalyzingCompiler(
* of cross Java-Scala compilation.
*/
object MixedAnalyzingCompiler {
// For more safety, store both the time and size
private type JarMetadata = (FileTime, Long)
private[this] val cacheMetadataJar = new ConcurrentHashMap[File, (JarMetadata, FileHash)]()


@stuhood

stuhood Nov 9, 2017

Contributor

Given that #427 is in, it would seem that all of this code should live in the companion object of the "default" implementation of hashing. Putting it at the top of MixedAnalyzingCompiler makes that class more of a grab bag.


@jvican

jvican Nov 9, 2017

Member

By the companion object of the default implementation of hashing, do you mean FileHash or Stamper? It's not clear to me we want to reuse this code across all the users of IOHash or FileHash.


@stuhood

stuhood Nov 9, 2017

Contributor

If there isn't a relevant companion class, then I guess my suggestion would be to define a new one. Not a blocker. Just that logically I would expect this to look something like:

val hashingStrategy = if (flag) DefaultStrategy else UserDefinedStrategy(...)
???

But looking at it again, ExternalHooks doesn't make it possible to do this. So only the "grab bag" part is really relevant.


@jvican

jvican Nov 9, 2017

Member

I agree it looks like a grab bag, so it's better to move it somewhere else. Thanks for the suggestion 😄.


@jvican

jvican Nov 10, 2017

Member

@stuhood I've implemented your suggestion 😉. The Zinc API has been historically such a mess that sometimes I forget it can and should be improved 😄.

@jvican jvican requested review from eed3si9n and Duhemm Nov 9, 2017

@fommil


fommil commented Nov 9, 2017

is this PR against the 1.0.x line or 1.1 line?

I think you might need to backport to 1.0.x if it could be included in an sbt point release.

@jvican


Member

jvican commented Nov 9, 2017

It's in the 1.0.x line.

@jvican


Member

jvican commented Nov 9, 2017

I've just realized that #427 is in 1.x, and I need to backport it to 1.0.x before I apply my patch. I think it would be a good idea to revisit this whole 1.0.x vs. 1.x situation...

@fommil


fommil commented Nov 9, 2017

@jvican yeah, that's what I thought. But you don't need to backport #427 for this. You can just implement this as the default, and that won't impact the binary API for 1.0.x.

@jvican


Member

jvican commented Nov 9, 2017

Right, that's better @fommil. This will be shipped in a minor release, so it's better to avoid breaking bincompat in 1.0.x and then adapt @romanowski's approach to my PR.

Make classpath hashing more lightweight
And make it parallel!

This patch adds a cache that relies on filesystem metadata to cache
hashes for jars that have the same last modified time across different
compiler iterations. This is important because until now there was a
significant overhead when running `compile` on multi-module builds that
have gigantic classpaths. In this scenario, the previous algorithm
computed hashes for all jars transitively across all these projects.

This patch is conservative; there are several things that are wrong with
the status quo of classpath hashing. The most important one is the fact
that Zinc has been doing `hashCode` on a SHA-1 checksum, which doesn't
make sense. The second one is that we don't need a SHA-1 checksum for
the kind of checks we want to do. #371
explains why. The third limitation with this check is that file hashes
are implemented internally as `int`s, which is not enough to represent
the richness of the checksum. My previous PR also tackles this problem,
which will be solved in the long term.

Therefore, this pull request only tackles these two things:

* Caching of classpath entry hashes.
* Parallelize this IO-bound task.

Results, on my local machine:

- No parallel hashing of the first 500 jars in my ivy cache: 1330ms.
- Parallel hashing of the first 500 jars in my ivy cache: 770ms.
- Second parallel hashing of the first 500 jars in my ivy cache: 1ms.

Fixes #433.

@jvican jvican force-pushed the scalacenter:issue-433 branch from 07aa2c9 to a90a0e9 Nov 10, 2017

@godenji


godenji commented Nov 11, 2017

So I just pulled this PR; published local Zinc snapshot along with Sbt snapshot that uses it.

Not seeing any difference in hashing time; sbt.io.Hash$.apply consumes the same amount of time as with sbt 1.0.3/Zinc 1.0.

Here's a Visualvm screenshot (running compile several times in succession).

Maybe this PR solves a different problem? sbt -Dsbt.task.timings=true shows all of the time is spent in compileIncremental; this is in a 30kloc app spread across 650 Scala source files with 30 subprojects.

@jvican


Member

jvican commented Nov 11, 2017

@godenji Are you sure you've correctly published local Zinc and sbt? I really recommend double-checking that, and changing the versions as much as possible to avoid problems with cached jars. Make sure you use an sbt version that hasn't been used before.

Since I don't have access to your project, it would really help if you could instrument this PR to report data and identify which classpath entries are being cached and which are not, along with any other useful auxiliary information.

@godenji


godenji commented Nov 11, 2017

Are you sure you've correctly published local Zinc and sbt?

@jvican I considered that, but after blowing away ~/.ivy2/local/org.scala-sbt/, performance is still unchanged. Starting up the test project (a vanilla giter8 Play project) does show sbt pulling in the local artifacts (e.g. "using local for ... zinc-compiler-1.1.0-SNAPSHOT"), and the published source jars do reflect the changes in this PR.

I'll dig around some more; the bottleneck may be elsewhere than what this PR resolves.

@jvican


Member

jvican commented Nov 12, 2017

@godenji You also need to blow away the corresponding folder in ~/.sbt/boot. I encourage you to use a completely different version, something like 37.1.1-jar-change.

@godenji


godenji commented Nov 12, 2017

You also need to blow away the corresponding folder in ~/.sbt/boot

@jvican that was it, noop compile now takes 0ms, yes ;-)

Since this is going into Zinc 1.0.x and Sbt 1.0.4 is slated for this coming week, is there any chance that this magnificent fix can make its way into Sbt? That would be most welcome, incremental builds are horrendous ATM.

@jvican


Member

jvican commented Nov 13, 2017

I've benchmarked this in a personal project too and I confirm it's fixed. Thanks for volunteering to try this out @godenji.

@jvican


Member

jvican commented Nov 13, 2017

@eed3si9n Can you review this please 😄?

@eed3si9n

LGTM
I like that the change is contained to implementation side.

@Duhemm

Duhemm approved these changes Nov 14, 2017

LGTM

@Duhemm Duhemm merged commit 5dcf3e6 into sbt:1.0.x Nov 14, 2017

1 check passed

continuous-integration/drone/pr the build was successful

dwijnand added a commit to dwijnand/sbt-zinc that referenced this pull request Nov 22, 2017

Merge branch '1.0.x' into merge-1.0.x-into-1.x
* 1.0.x: (25 commits)
  Add yourkit acknoledgement in the README
  Add header to cached hashing spec
  Add headers to missing files
  Fix sbt#332: Add sbt-header back to the build
  Update sbt-scalafmt to 1.12
  Make classpath hashing more lightweight
  Fix sbt#442: Name hash of value class should include underlying type
  source-dependencies/value-class-underlying: fix test
  Ignore null in generic lambda tparams
  Improve and make scripted parallel
  Fix sbt#436: Remove annoying log4j scripted exception
  Fix sbt#127: Use `unexpanded` name instead of `name`
  Add pending test case for issue/127
  source-dependencies / patMat-scope workaround
  Fixes undercompilation on inheritance on same source
  Add real reproduction case for sbt#417
  Add trait-trait-212 for Scala 2.12.3
  Fix source-dependencies/sealed
  Import statement no longer needed
  Move mima exclusions to its own file
  ...

 Conflicts:
	internal/zinc-apiinfo/src/main/scala/sbt/internal/inc/ClassToAPI.scala
	zinc/src/main/scala/sbt/internal/inc/MixedAnalyzingCompiler.scala

The ClassToAPI conflict is due to:
* sbt#393 (a 1.x PR), conflicting with
* sbt#446 (a 1.0.x PR).

The MixedAnalyzingCompiler conflict is due to:
* sbt#427 (a 1.x PR), conflicting with
* sbt#452 (a 1.0.x PR).

@dwijnand dwijnand referenced this pull request Nov 22, 2017

Merged

Merge 1.0.x into 1.x #455

dwijnand added a commit to dwijnand/sbt-zinc that referenced this pull request Nov 23, 2017

Merge branch '1.0.x' into merge-1.0.x-into-1.x
* 1.0.x: (28 commits)
  Split compiler bridge tests to another subproject
  Implement compiler bridge for 2.13.0-M2
  Add yourkit acknoledgement in the README
  "sbt '++ 2.13.0-M2!' compile" does not work with sbt 1.0.0
  Add header to cached hashing spec
  Add headers to missing files
  Fix sbt#332: Add sbt-header back to the build
  Update sbt-scalafmt to 1.12
  Make classpath hashing more lightweight
  Fix sbt#442: Name hash of value class should include underlying type
  source-dependencies/value-class-underlying: fix test
  Ignore null in generic lambda tparams
  Improve and make scripted parallel
  Fix sbt#436: Remove annoying log4j scripted exception
  Fix sbt#127: Use `unexpanded` name instead of `name`
  Add pending test case for issue/127
  source-dependencies / patMat-scope workaround
  Fixes undercompilation on inheritance on same source
  Add real reproduction case for sbt#417
  Add trait-trait-212 for Scala 2.12.3
  ...

 Conflicts:
	internal/zinc-apiinfo/src/main/scala/sbt/internal/inc/ClassToAPI.scala
	project/build.properties
	zinc/src/main/scala/sbt/internal/inc/MixedAnalyzingCompiler.scala

The ClassToAPI conflict is due to:
* sbt#393 (a 1.x PR), conflicting with
* sbt#446 (a 1.0.x PR).

The build.properties conflict is due to different PRs bumping
sbt.version from 1.0.0 to 1.0.2 to 1.0.3. (sbt#413, sbt#418, sbt#453).

The MixedAnalyzingCompiler conflict is due to:
* sbt#427 (a 1.x PR), conflicting with
* sbt#452 (a 1.0.x PR).
@typesafe-tools


typesafe-tools commented Nov 25, 2017

The validator has checked the following projects against Scala 2.12,
tested using dbuild, projects built on top of each other.

Project Reference Commit
sbt 1.x sbt/sbt@c517517
zinc pull/452/head a90a0e9
io 1.x sbt/io@e8c9757
librarymanagement 1.x sbt/librarymanagement@163e8bf
util 1.x sbt/util@f4eadfc
website 1.x

The result is: SUCCESS
