Replace slow SHA-1 by xxHash for classpath hashes #371

Closed
wants to merge 5 commits into
base: 1.x
from

8 participants
@jvican
Member

jvican commented Jul 18, 2017

This commit replaces SHA-1 with xxHash, an extremely fast
non-cryptographic hash algorithm that is used extensively in production
systems all over the world (check the website:
http://cyan4973.github.io/xxHash/).

This new hashing function was added because MessageDigest and, more
concretely, MixedAnalyzingCompiler.makeConfig showed up in some
profiles consuming too much CPU time:

sbt.internal.inc.MixedAnalyzingCompiler$.$anonfun$makeConfig$1(File) 1213ms

The culprit is Stamper.forHash, which relies on MessageDigest to
checksum the jars on the fly:

sbt.internal.inc.Stamper$.$anonfun$forHash$2(File) 1213ms
-> FilterInputStream.java:107 java.security.DigestInputStream.read(byte[], int, int) 1146ms

However, 1213ms is unreasonable for a classpath that is not even that
big (the classpath of sbt's main module). It is a price that is paid
every time we create a MixedAnalyzingCompiler, which is instantiated
every time we invoke compileIncrementally in IncrementalCompilerImpl.
In other words, it is paid on every compile iteration.
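
For readers unfamiliar with the pattern in the profile, here is a minimal sketch of checksumming a file with SHA-1 through a `DigestInputStream`. The class and method names (`Sha1Stamp`, `sha1Hex`) are hypothetical and this is not Zinc's actual `Stamper` code; it only illustrates why every byte of every jar passes through `MessageDigest`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only, not Zinc's implementation.
final class Sha1Stamp {
    static String sha1Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-1");
        byte[] buffer = new byte[8192];
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), digest)) {
            // Each read pushes the bytes it returns through the digest,
            // so the whole file is hashed as a side effect of reading it.
            while (in.read(buffer) != -1) {
                // nothing else to do with the bytes
            }
        }
        // Render the 20-byte digest as a 40-character hexadecimal string.
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Calling `Sha1Stamp.sha1Hex(path)` once per classpath entry on every compile is the kind of repeated work the profile points at.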

The benefit of this change is motivated by the following scenarios:

  1. Users incrementally compile lots of times. Reducing the running time
    of every iteration makes the incremental compilation start faster.

  2. Big projects can have huge classpaths which will make incremental
    compilation significantly slower. This is especially true when every jar
    is big.

There are faster hashing functions than xxHash. However, given that
xxHash is used in important projects and has been battle-tested, I've
decided to use it for Zinc.

Note that cryptographic hashes are not necessary to ensure that the
classpath has or has not changed. It is safe to use xxHash since the
collision rate is really low.

Replace old `Hash` by `Hash32`
The new `Hash32` class is designed to wrap a 32-bit integer produced by
the underlying hash algorithm. Instead of making this accept a `Long`
(xxHash works at the 64-bit level), we produce an `Int` because `FileHash`
needs an integer and we don't want to break that API.

Truncating from 64 bits to 32 bits is safe; see
http://fastcompression.blogspot.ch/2014/07/xxhash-wider-64-bits.html.

This commit hence removes any reference to the previous `Hash`,
which used hexadecimal values, and deprecates any method that
assumes its existence, such as `getHash` in the `Stamp` Java interface.

It also updates the underlying formats; the change to the binary format
is binary compatible.

@jvican jvican force-pushed the scalacenter:faster-hash branch from 326fa6e to 56d38b4 Jul 18, 2017

@jvican

Member

jvican commented Jul 18, 2017

I forgot to note that Zinc also hashes source files. Source files will benefit from the faster checksum too, though this matters less since they don't (or should not) pose a bottleneck. The real cost is in the classpath entries.

@@ -1,16 +0,0 @@
package sbt

@jvican

jvican Jul 18, 2017

Member

@dwijnand This had to go since the old Hash does not exist anymore.

@jvican

Member

jvican commented Jul 18, 2017

I made a quick informal experiment in the console. I ran both versions for 2600 jars of my ivy cache, that add up to 2.3GB. These are the results that I get:

// xxHash
scala> time(files.map(sbt.internal.inc.Stamper.forHashFast(_)))

Elapsed time: 597ms
res24: Seq[Long] = ArrayBuffer(-3384889253332121523, -3527231023132713716, 7892861326213445175, 2451949907259077672, 5725452954874663514, -5369460893638597645, 1204386044997430456, 3298869352266117677, -5271945718839461178, 942270622214988861, 3048720246964634412, 5427635316159346305, 5718018840473948034, -2058785138603094765, -3271930324248853072, 3972647176633517853, -179989182144432264, -8061950611569265155, 2726512015248768373, 762430146932025182, 4054887334111209062, -2187529583494234596, 4163105613072123238, -6181651819314671931, 3277286446478550665, 6887187100216121304, -8827566660350297174, -8860464038811057367, -4966084786619279801, 5053469599840204340, -4153491388658725764, 1116061402141784577, -4829038829955149288, 5773705831753227679, -1119917513819...

// SHA-1
scala> time(files.map(sbt.internal.inc.Stamper.forHash(_)))

Elapsed time: 9350ms
res25: Seq[xsbti.compile.analysis.Stamp] = ArrayBuffer(hash(8f094e4f77b2a60e95063bf654887db6c6518b87), hash(bb55318aacb1a2e9361e7691040dff9f8759365d), hash(6f55abdc07c8f32aea8f41c970a755ef6f800b21), hash(207b97f0dc5e569a3701e8a481ac75286de5deff), hash(2174717ada9529d9c5612d42c16522280dfe02a9), hash(03c5b4bafb6701f801f0fb2fa88a468a700307df), hash(0a1a164e866a636a533dab1b2a0c664f6aabc9bf), hash(cf279cb46a21e3f51f20bfec58800ff4f0746495), hash(bfa6e3ed1b1a05887744d643b47f052542673536), hash(b922623a9aeec35926bf82510d717c9f47a04dc5), hash(0bcdd28661a1616b3a56fe2386961fdfbed85f59), hash(9ecd32d56442c3e4c683e6d6a0f1136ae433705b), hash(e552a7f0055dd526b87cce502518878b91285f8a), hash(84d13a7a013ee0a4563b21953410cb890d9480de), hash(6da1be5ad00be7220e677a9093e9c6642271101...

That's roughly a 15x speedup over the status quo (9350ms vs. 597ms).

@stuhood

Contributor

stuhood commented Jul 18, 2017

Would be good to see before and after numbers for the same scenario, rather than just before numbers.

I'm a bit concerned about using a 32 bit hash for this usecase... if you're going to add a 3rdparty dep, adding one that allows for larger outputs (at least 64 bits) would be good: almost anything should be faster than SHA1.

@jvican

Member

jvican commented Jul 18, 2017

> Would be good to see before and after numbers for the same scenario, rather than just before numbers.

Yeah, agreed. I cannot do this before I can compile the sbt codebase with RC. I'll try then.

> I'm a bit concerned about using a 32 bit hash for this usecase... if you're going to add a 3rdparty dep, adding one that allows for larger outputs (at least 64 bits) would be good: almost anything should be faster than SHA1.

The current 3rd party dep is specialised for 64 bits; the only issue is that we truncate its output. See the commit message to know why 😄. I go into the details there.

In short: there I explain that it's safe to truncate from 64 to 32 bits. The resources I've read on the Internet (and the blog post I link to in the commit) claim that the quality of the 32-bit version is excellent.

By the way, as an interesting fact, this is the hash that Spark and MySQL use.

@jvican

Member

jvican commented Jul 18, 2017

Two more notes here:

  1. This PR is an enabler for build reproducibility. Zinc currently relies on last-modified times to stamp products (class files) and library dependencies (jars). As you all know, relying on the last modified time is not reproducible, so if we switch to hashing the contents of all these files, our hash function has to be fast. This will help to make Zinc reproducible.
  2. With regard to truncating, I've just realised that Zinc was doing something of dubious correctness before this change: it was calling hashCode on the hexadecimal representation of the SHA-1 hash of some file contents. This was required because FileHash needs an int instead of a string. I don't have the mathematical rigour to prove that this creates lots of collisions, but my limited understanding of these algorithms leads me to think it does. Relying on hashCode to create a good hash distribution is strongly discouraged in distributed systems running on the JVM.

To address Stu's concerns on the length of the hash: I am going to try to change the API of FileHash, meaning that hash will be a long and all the methods expecting an int will be deprecated.
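
To make the second note concrete, here is a small sketch contrasting the two reductions discussed: taking `String.hashCode` of the hexadecimal SHA-1 (which yields 32 bits) versus keeping the first 8 bytes of the digest as a 64-bit value. The class and method names are hypothetical, and the 64-bit variant is only one possible alternative, not necessarily what this PR ends up doing.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration; none of these names exist in Zinc.
final class TruncationDemo {
    static byte[] sha1(byte[] content) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance("SHA-1").digest(content);
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // 160-bit digest -> 40-char hex string -> 32-bit String.hashCode.
    static int viaStringHashCode(byte[] digest) {
        return hex(digest).hashCode();
    }

    // 160-bit digest -> first 8 bytes read as a big-endian 64-bit value.
    static long first64Bits(byte[] digest) {
        return ByteBuffer.wrap(digest).getLong();
    }
}
```

Either way most of the 160 bits are discarded; the difference is how many bits survive and whether they come straight from the digest or from a second, weaker hash over its text form.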

Migrate away from 32-bit hashes
See discussion in #371.

Motivations for using 64-bit hashes for all the file hashes:

1. In a distributed environment, 32 bits are not sufficient to ensure
   that there are no collisions.
2. The quality of the 64-bit hash is supposed to be higher than that of
   the 32-bit one.
3. Previously, we were using the `hashCode` of the hexadecimal
   representation of a SHA1 of some file contents as a 32-bit hash. This
   is unsafe, and by switching to `long` we can also make the underlying
   file content hash implementation 64-bit based.
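
As a rough way to compare the two widths, the birthday approximation p ≈ 1 - exp(-n(n-1) / 2^(b+1)) estimates the chance that at least two of n uniformly hashed files collide in a b-bit space. The sketch below is an assumption-laden illustration (ideal uniform hash, hypothetical class name), not a statement about xxHash's measured behaviour.

```java
// Hypothetical helper: birthday-bound estimate for an ideal b-bit hash.
final class CollisionEstimate {
    static double collisionProbability(long n, int bits) {
        // Number of distinct pairs among n items.
        double pairs = (double) n * (n - 1) / 2.0;
        // Size of the hash space, 2^bits.
        double space = Math.pow(2.0, bits);
        // 1 - exp(-x), computed via expm1 to stay accurate for tiny x.
        return -Math.expm1(-pairs / space);
    }
}
```

For the 2600 jars used in the informal experiment, `collisionProbability(2600, 32)` comes out a little under one in a thousand, while the 64-bit figure is many orders of magnitude smaller.
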
@jvican

Member

jvican commented Jul 18, 2017

@stuhood My last change makes the necessary changes to the FileHash API to avoid the use of hashCode and use a 64-bit hash instead. I believe it is binary compatible. Can you double-check?

@@ -25,7 +25,7 @@
*
* @return A valid string-based representation for logical equality, not referential equality.
*/
- public int getValueId();
+ public long getValueId();

@stuhood

stuhood Jul 19, 2017

Contributor

I'll reiterate that if you're going to change this, changing it to use Array[Byte] or ByteBuffer or protobuf's ByteString or something else that has dynamic length would be strongly preferable to using a fixed width value. It allows you to change the implementation of the hash function without changing the interface (which is fine for a hash function, as it just causes misses). In the future when you need more than 64 bits, it would not be necessary to change the API again.

@jvican

jvican Jul 19, 2017

Member

Makes sense.

val acc = new Array[Long](2)
val buffer = new Array[Byte](BufferSize)
while (is.read(buffer) >= 0) {
val checksumChunk = xxHash.hashBytes(buffer)

@stuhood

stuhood Jul 19, 2017

Contributor

Can you include a reference to where you found this pattern?

Alternatively, using a hash library with a more usable interface would be great. Guava's is particularly good, and includes hash of file.

@fommil

fommil Jul 19, 2017

Guava notoriously breaks binary compatibility all the time; it would be good not to depend on it in core sbt.

@jvican

jvican Jul 19, 2017

Member

@stuhood I'm looking into a way of mixing hashes safely.
Note that Guava does not support xxHash. I like our current dependency because it is high-quality, self-contained, and pretty light.
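
One detail worth checking in any chunked loop of this shape: `InputStream.read` may fill only part of the buffer, so only the bytes actually read should be fed to the hash. The sketch below shows that pattern using the JDK's `CRC32` purely as a stand-in for a streaming hasher; it is not xxHash and not the code in this PR, and the class name is hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

// Hypothetical illustration; CRC32 stands in for any streaming hash.
final class ChunkedChecksum {
    static long checksum(InputStream in, int bufferSize) throws IOException {
        CRC32 crc = new CRC32();
        byte[] buffer = new byte[bufferSize];
        int read;
        while ((read = in.read(buffer)) != -1) {
            // Feed only the first `read` bytes; the rest of the buffer
            // may still hold stale data from a previous iteration.
            crc.update(buffer, 0, read);
        }
        return crc.getValue();
    }
}
```

With a streaming hasher the result does not depend on the buffer size, which sidesteps the question of how to combine per-chunk hashes.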

jvican added some commits Jul 19, 2017

Mmap jars and sources to be hashed
This commit replaces streaming with memory-mapping (mmap) of the
contents of the files to be hashed. This is faster because no copying
occurs between kernel and user-space memory, and the OS can optimize
access through virtual memory.

This also means that subsequent executions of this method will be
faster since the memory will already be mapped. Mapping my whole ivy
cache takes 561ms the first time and around 350ms on subsequent runs.
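
A minimal sketch of hashing a memory-mapped file with plain JDK APIs follows. It uses `CRC32` as a stand-in for the real hash function, and the class name is hypothetical; the commit's actual implementation may differ.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.zip.CRC32;

// Hypothetical illustration; CRC32 stands in for the real hash function.
final class MappedChecksum {
    static long checksum(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // Map the whole file read-only. A single mapping cannot exceed
            // Integer.MAX_VALUE bytes, so very large files need several maps.
            MappedByteBuffer mapped =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            CRC32 crc = new CRC32();
            // Consumes the buffer's remaining bytes without an explicit read loop.
            crc.update(mapped);
            return crc.getValue();
        }
    }
}
```
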
Turn hash type to `byte[]` in all APIs
This affects the `FileHash` and `Hash64` APIs.

This will allow us to store the representation no matter what the hash
is. It's friendlier to API changes in the future, as Stu has mentioned
in the review.

See https://github.com/sbt/zinc/pull/371/files for discussion.
@stuhood

One critical issue.

return this.hash[0];
}
public byte[] hash64() {

@stuhood

stuhood Jul 19, 2017

Contributor

Needn't be named hash64 anymore.

@jvican

jvican Jul 20, 2017

Member

Cannot be called hash because it conflicts with the previous signature. I'll call it hashBytes.

}
public int hashCode() {
return 37 * (37 * (37 * (17 + "xsbti.compile.FileHash".hashCode()) + file().hashCode()) + hash64().hashCode());

@stuhood

stuhood Jul 19, 2017

Contributor

Important: The hashCode and equals methods are not useful (need to use the static Arrays.equals(), iirc), which is one reason to prefer ByteBuffers for this.

The other reason to prefer ByteBuffers is that they can act as views into other data, which can avoid copying it: also, protobuf ByteStrings can be "viewed" as ByteBuffers with asReadOnlyByteBuffer.
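
The point about `equals` and `hashCode` can be seen in a few lines of plain Java. The sketch below (hypothetical class name) compares two arrays with identical contents using the inherited `Object.equals`, `Arrays.equals`, and `ByteBuffer.equals`.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical illustration of array vs. buffer equality semantics.
final class ByteEquality {
    // byte[] inherits Object.equals, which compares references only.
    static boolean arrayIdentityEquals(byte[] a, byte[] b) {
        return a.equals(b);
    }

    // Arrays.equals compares length and contents.
    static boolean arrayContentEquals(byte[] a, byte[] b) {
        return Arrays.equals(a, b);
    }

    // ByteBuffer.equals compares the remaining bytes of both buffers.
    static boolean bufferEquals(byte[] a, byte[] b) {
        return ByteBuffer.wrap(a).equals(ByteBuffer.wrap(b));
    }
}
```

The same split applies to hashing: `Arrays.hashCode(bytes)` and `ByteBuffer.wrap(bytes).hashCode()` depend on the contents, whereas `bytes.hashCode()` does not.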

@jvican

jvican Jul 20, 2017

Member

👍 We're already using asReadOnlyByteBuffer in the protobuf converters, but I agree ByteBuffers are superior over Array[Byte].

override def writeStamp: String = s"hash($hexHash)"
override def getValueId: Int = hexHash.hashCode()
override def getHash: Optional[String] = Optional.of(hexHash)
final class Hash64(val hash: Long) extends StampBase {

@stuhood

stuhood Jul 19, 2017

Contributor

Is it possible to just hold the hash bytes here, and use its hashCode as the hashCode? Not sure which API constraints you're working with, but: converting to 32 bits is necessary for JVM hashCode methods, whereas converting back and forth to Long values just shouldn't (ever) be necessary except in the text based analysis format.

@jvican

jvican Jul 20, 2017

Member

This is a good suggestion 👍. I'll try to see how it goes.

@@ -26,11 +27,17 @@ final class ProtobufReaders(mapper: ReadMapper) {
java.nio.file.Paths.get(path).toFile
}
private def readLong(bs: ByteString) = {

@stuhood

stuhood Jul 19, 2017

Contributor

See above: should just take the Array[Byte] directly here. Or better yet, use ByteBuffer in your public APIs and just directly consume asReadOnlyByteBuffer here.

@jvican

jvican Jul 20, 2017

Member

It's not possible to use ByteBuffer because bytes in protobuf relies on ByteString. We cannot change this data type before.

@jvican jvican added this to the 1.1.0 milestone Jul 23, 2017

@jvican

Member

jvican commented Jul 23, 2017

I'm assigning this to the 1.1.x milestone. Even though this change is not really intrusive, I would prefer it to land in 1.1.x instead of 1.0.0. I would like to advertise 1.1.0 as the release that focuses on making the incremental compiler reproducible.

@jvican jvican modified the milestones: 1.1.0, 1.0.1 Aug 10, 2017

@dwijnand dwijnand modified the milestones: 1.0.1, 1.1.0 Sep 27, 2017

@fommil

fommil commented Oct 13, 2017

Seems I'm hitting the same in a project with ~350 source files. No-op compile goes from 16 seconds (according to sbt, but I'm sure it's slower) to 37 seconds. A Yourkit profile shows all the time in FilterInputStream, which is nuts because I have an SSD and it takes real 0m0.029s for tar to create a tarball of my entire project.

"Reverse Call Tree","Time (ms)","Level"
"java.io.FilterInputStream.read(byte[]) FilterInputStream.java","36493","1"
"sbt.io.Hash$.apply(InputStream) Hash.scala:75","","2"
"sbt.io.Hash$.$anonfun$apply$1(InputStream) Hash.scala:55","","3"
"sbt.io.Hash$$$Lambda$2088.apply(Object)","","4"
"sbt.io.Using.apply(Object, Function1) Using.scala:22","","5"
"sbt.io.Hash$.apply(File) Hash.scala:55","","6"
"sbt.internal.inc.Hash$.ofFile(File) Stamp.scala:74","","7"
"sbt.internal.inc.Stamper$.$anonfun$forHash$2(File) Stamp.scala:144","","8"
"sbt.internal.inc.Stamper$$$Lambda$2087.apply()","","9"
"sbt.internal.inc.Stamper$.tryStamp(Function0) Stamp.scala:140","","10"
"sbt.internal.inc.Stamper$.$anonfun$forHash$1(File) Stamp.scala:144","","11"
"sbt.internal.inc.Stamper$$$Lambda$1412.apply(Object)","","12"
"sbt.internal.inc.MixedAnalyzingCompiler$.$anonfun$makeConfig$1(File) MixedAnalyzingCompiler.scala:185","36489","13"
"sbt.internal.inc.MixedAnalyzingCompiler$$$Lambda$2086.apply(Object)","","14"
"scala.collection.AbstractTraversable.map(Function1, CanBuildFrom) Traversable.scala:104","","15"
"sbt.internal.inc.MixedAnalyzingCompiler$.makeConfig(ScalaCompiler, JavaCompiler, Seq, Seq, Output, GlobalsCache, Option, Seq, Seq, CompileAnalysis, Option, PerClasspathEntryLookup, Reporter, CompileOrder, boolean, IncOptions, List) MixedAnalyzingCompiler.scala:184","","16"
"sbt.internal.inc.IncrementalCompilerImpl.$anonfun$compileIncrementally$1(IncrementalCompilerImpl, ScalaCompiler, JavaCompiler, File[], Seq, Output, GlobalsCache, Option, Seq, Seq, Option, Option, PerClasspathEntryLookup, Reporter, CompileOrder, boolean, IncOptions, List, Logger) IncrementalCompilerImpl.scala:259","","17"
"sbt.internal.inc.IncrementalCompilerImpl$$Lambda$2082.apply()","","18"
"sbt.internal.inc.InitialStamps.$anonfun$source$1(InitialStamps, File) Stamp.scala:256","3","13"
@jvican

Member

jvican commented Oct 13, 2017

Quoting the exact same part of the profile that I did in the initial description of this PR, we can see that it's the same culprit:

"sbt.internal.inc.MixedAnalyzingCompiler$.$anonfun$makeConfig$1(File) MixedAnalyzingCompiler.scala:185","36489","13"

I still have to measure whether the impact here is just IO or the hashing algorithm being used. I'll check this PR with my previous experiment to see if the running time does indeed decrease.

@fommil

fommil commented Oct 13, 2017

the true cause is this line

"java.io.FilterInputStream.read(byte[]) FilterInputStream.java","36493","1"

i.e. this is all I/O (this is a backtrace). Using NIO to read the file in one operation should be significantly faster.

@jvican

Member

jvican commented Oct 13, 2017

I remember seeing "sbt.io.Hash$.apply(InputStream) Hash.scala:75","","2" with way more running time in my experiment, so in this case it's likely that either these are different issues or there's some overlapping between them or YourKit lied to me. I'll profile it again to be 100% sure.

@fommil

fommil commented Oct 13, 2017

make sure to swap between sampling and probe mode.

This was referenced Oct 13, 2017

@godenji

godenji commented Oct 30, 2017

Thanks for working on this.

Any chance this PR can get merged into 1.0.x rather than waiting for 1.1? Sbt is basically unusable right now for iterative development. Every compile incurs a several second delay before the task kicks off (small project 30kloc, 650 sources), regardless of whether or not sources have changed.

@eed3si9n

Member

eed3si9n commented Oct 31, 2017

It would be good to clarify, in the form of a repeatable benchmark test, in which scenarios we can expect to see a speedup.

@fommil

fommil commented Oct 31, 2017

@eed3si9n I've created extensive docs around this issue across the new multiple repo structure (which was confusing as hell) and even sent a POC with a fix. I gave up because there is no way to monkey patch sbt with a custom implementation of any of this stuff. So I sent PRs to allow monkey patching. Then I gave up.

It is hard to overstate how much of a performance regression this is on anything but the most trivial of projects.

A benchmark is trivial... apply this hash to the scala library jar. Cached, or as Jorge points out using the jar's header, is orders of magnitude faster.

@dispalt

dispalt commented Nov 8, 2017

Here's some more data for y'all. 7s on no-change compile. Big list of dependent jars. https://gist.github.com/dispalt/2201b25a6a3476f2f01188bf44b808a3

@jvican

Member

jvican commented Nov 9, 2017

Given the importance of this change and how it's affecting the community, I'll be updating this PR and finding a solution to this hashing problem tomorrow.

@fommil

fommil commented Nov 9, 2017

Any hash is going to be too slow to use for this purpose... we need a quick check before then based on OS metadata. Efforts to speed up the hash itself would still be greatly appreciated as it's a constant overhead for all builds.

@leonardehrenfried

leonardehrenfried commented Nov 9, 2017

jvican added a commit to scalacenter/zinc that referenced this pull request Nov 9, 2017

Fix sbt#433: Make classpath hashing more lightweight
And make it parallel!

This patch adds a cache that relies on filesystem metadata to cache
hashes for jars that have the same last modified time across different
compiler iterations. This is important because until now there was a
significant overhead when running `compile` on multi-module builds that
have gigantic classpaths. In this scenario, the previous algorithm
computed hashes for all jars transitively across all these projects.

This patch is conservative; there are several things that are wrong with
the status quo of classpath hashing. The most important one is the fact
that Zinc has been doing `hashCode` on a SHA-1 checksum, which doesn't
make sense. The second one is that we don't need a SHA-1 checksum for
the kind of checks we want to do. sbt#371
explains why. The third limitation with this check is that file hashes
are implemented internally as `int`s, which is not enough to represent
the richness of the checksum. My previous PR also tackles this problem,
which will be solved in the long term.

Therefore, this pull request only tackles these two things:

* Caching of classpath entry hashes.
* Parallelize this IO-bound task.

Results, on my local machine:

- No parallel hashing of the first 500 jars in my ivy cache: 1330ms.
- Parallel hashing of the first 500 jars in my ivy cache: 770ms.
- Second parallel hashing of the first 500 jars in my ivy cache: 1ms.
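
The two ideas in this commit message, caching hashes keyed on the last modified time and hashing entries in parallel, can be sketched as follows. This is a simplified illustration under stated assumptions: `ClasspathHashCache` and its members are hypothetical names, `CRC32` stands in for the real hash, and the actual patch may be structured quite differently.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.zip.CRC32;

// Hypothetical illustration, not the code from the referenced patch.
final class ClasspathHashCache {
    private static final class Entry {
        final long lastModified;
        final long hash;
        Entry(long lastModified, long hash) {
            this.lastModified = lastModified;
            this.hash = hash;
        }
    }

    private final ConcurrentHashMap<Path, Entry> cache = new ConcurrentHashMap<>();
    // Counts real hash computations, only so the caching effect is observable.
    final AtomicInteger computations = new AtomicInteger();

    long hashOf(Path jar) {
        try {
            long modified = Files.getLastModifiedTime(jar).toMillis();
            Entry cached = cache.get(jar);
            // Cheap metadata check first: reuse the hash if the timestamp is unchanged.
            if (cached != null && cached.lastModified == modified) {
                return cached.hash;
            }
            CRC32 crc = new CRC32(); // stand-in for the real hash function
            crc.update(Files.readAllBytes(jar));
            computations.incrementAndGet();
            Entry fresh = new Entry(modified, crc.getValue());
            cache.put(jar, fresh);
            return fresh.hash;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hashes all entries in parallel; the result keeps the classpath order.
    List<Long> hashAll(List<Path> classpath) {
        return classpath.parallelStream().map(this::hashOf).collect(Collectors.toList());
    }
}
```

A second call to `hashAll` over an unchanged classpath touches only file metadata, which is consistent with the large gap between the first and second parallel runs reported above.
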
@jvican

This comment has been minimized.

Member

jvican commented Nov 9, 2017

Done in #452.

I'll follow up on this PR and address the algorithm and the API limitations imposed by the old Zinc 1.0 API in the future. So far, #452 should:

  • Make the first time you run compile on a project faster (by doing hashing in parallel).
  • Avoid the hashing completely if the last modified times of the jars have not changed in following compiles.

Numbers on the pull request.

@jvican jvican closed this Nov 14, 2017
