New analysis format #1326
Conversation
The 3 large binary files clearly don't belong in this repo, but I don't know the right place for them. They are used by the integration tests and the benchmarks. In our internal repo we keep such files in S3 instead of checking them into git.
 * - Consistent output files: If two compiler runs result in the same internal representation of
 *   incremental state (after applying WriteMappers), they produce identical zinc files.
Nice! I guess the flip side is that the existing binary format is incorrect/unstable? ("correct" defined by Bazel as producing the same results between runs)
The current format can't be made stable; otherwise I probably wouldn't have created this new one from scratch. It is based on protobuf and has maps in it. That doesn't prevent a stable output at the spec level, but the Java/Scala protobuf tooling only gives you unordered maps.
Getting identical outputs is not so much a correctness problem as a performance one. If your only mechanism for determining what to rebuild is Zinc, then it doesn't matter. But in Bazel we a) have a faster way of skipping targets entirely (when all inputs are identical) and b) a larger overhead for dropping down into Zinc to make this decision.
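To make the determinism idea concrete, here is a minimal sketch of serializing an unordered map in sorted key order; the helper name and signature are illustrative, not the actual Zinc API:

```scala
import java.io.DataOutputStream

// Hash maps iterate in an unspecified order, so serializing them naively
// yields different bytes across runs. Sorting the keys first makes the
// output a pure function of the map's contents.
def writeSortedStringMap[V](out: DataOutputStream, m: Map[String, V])(
    writeValue: V => Unit
): Unit = {
  out.writeInt(m.size)
  for ((k, v) <- m.toSeq.sortBy(_._1)) { // deterministic key order
    out.writeUTF(k)
    writeValue(v)
  }
}
```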
    ec: ExecutionContext = ExecutionContext.global, // NOLINT
    parallelism: Int = Runtime.getRuntime.availableProcessors()
Could we remove the default arguments here to avoid inadvertently using `global`?
We are using `global`. Is there a better EC that is accessible when you construct an AnalysisStore? Should this be a parameter of the AnalysisStore.text/binary methods?
Yea, honestly I don't know how much of an issue it would be if it's relatively quick, but it might be a good idea to expose the EC as a configuration parameter so we can pick something else, like a dedicated thread pool. Even if we are using `global`, I think we should make that as explicit as possible.
I'm moving the implicits up into the ConsistentAnalysisStore factory methods and making them explicit further down.
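For illustration, a hedged sketch of the shape of that change; the factory name and parameters below are assumptions, not the merged API:

```scala
import java.io.File
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext

object EcFactoryExample {
  // No default arguments: every caller must name its ExecutionContext,
  // so an accidental dependency on ExecutionContext.global can't slip in.
  def binary(file: File, ec: ExecutionContext, parallelism: Int): Unit = {
    // construct the store here, threading ec/parallelism down explicitly
  }

  // A caller that deliberately uses a dedicated pool instead of global:
  val dedicatedEc: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(4))
}
```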
@Measurement(iterations = 5)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
class AnalysisFormatBenchmark {
Can we integrate it as part of the CI checks, similar to #1323? This doesn't have to go into this PR; we can do it in a separate one.
Maybe we can store them via Git LFS? I have never used Git LFS, so ideally people with prior experience can judge whether it is indeed a good idea.
  private[this] def writeAPIs(out: Serializer, apis: APIs, storeApis: Boolean): Unit = {
    def write(n: String, m: Map[String, AnalyzedClass]): Unit =
      writeMaybeSortedStringMap(out, n, m.mapValues(_.withCompilationTimestamp(-1L))) { ac =>
What is the reason for setting the timestamp to `-1L`? I am not fully sure, but I think `sbt testQuick` uses the analysis timestamp. In the future, new macro invalidation logic may also require timestamp information. If this is for reproducible output, we can make it opt-in.
Any metadata about a compilation run inevitably results in inconsistent outputs. We can only store data that is repeatable in a fresh build. If sbt needs this data, we could include it optionally (similar to the unsorted output that I added at the last minute).
We should probably use 2010-01-01, like elsewhere in sbt (sbt/sbt#6237), if not a few seconds after midnight of 2010-01-01 (to break ties with source JARs).
There's a weird combined ZIP + JVM bug where any timestamp before 1980 requires an extended timestamp, and that sometimes ends up capturing the timezone of the host machine, breaking hermeticity.
Related to something like this? I found this in the `JarUtils` class that is part of our Bazel tooling. I'm not sure if it was written from scratch or adapted from code in sbt or some other place originally:
private lazy val fixedTime = new SimpleDateFormat("dd-MM-yyyy").parse("01-01-1980").getTime
This looks good at first glance, but SimpleDateFormat defaults to the current timezone. I changed it to:
final val FixedTime = 315532800000L // 1980-01-01Z00:00:00
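A small self-contained sketch of that pitfall (illustrative, not code from this PR): parsing the same date string with and without a pinned time zone gives different epoch values.

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone

object FixedTimeDemo {
  // Depends on the host's default time zone, so it varies between machines:
  val hostLocal: Long =
    new SimpleDateFormat("dd-MM-yyyy").parse("01-01-1980").getTime

  // Pinning the zone to UTC makes the result hermetic:
  val utcParser = new SimpleDateFormat("dd-MM-yyyy")
  utcParser.setTimeZone(TimeZone.getTimeZone("UTC"))
  val hermetic: Long = utcParser.parse("01-01-1980").getTime // 315532800000L
}
```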
    if (!file.getParentFile.exists()) file.getParentFile.mkdirs()
    val fout = new FileOutputStream(tmpAnalysisFile)
    try {
      val gout = new ParallelGzipOutputStream(fout)
I'd be curious to benchmark how much of the write speedup is coming from parallelism. Does that basically require more CPU power, or assume there are excess CPU cores that aren't pinned by other tasks? In Bazel, compilation is out-sourced to a worker process (and sometimes remote worker machines), but this could behave differently for sbt. Note that sbt can't always occupy all cores, so we might still see a speedup, but it would be good to benchmark this characteristic.
The gzip compression is very slow relative to everything else, despite gzip with native zlib being fast and using the best performance/size tradeoff in the compression settings (as determined by benchmarks). Or maybe I should say: the new serializer and deserializer are very fast compared to gzip.
I reran a quick benchmark. Writing without compression takes 153ms, writing with Java's standard GZIPOutputStream is 402ms, writing with ParallelGzipOutputStream is 157ms. The latter is on a 12-core M2 Max (but with only 230% CPU usage; that's enough to offload all the compression to background threads, making it almost free in terms of wall clock time).
With the new flag that skips sorting, writing is now faster than reading. This is because we've reached the point where even gzip decompression during reading could benefit from parallelization (but the gains would be much smaller than for writing).
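To reproduce the relative numbers on your own machine, here is a rough benchmark sketch using only the standard library; the ParallelGzipOutputStream from this PR would slot in where noted, and the synthetic data here compresses unrealistically well:

```scala
import java.io.FileOutputStream
import java.util.zip.GZIPOutputStream

object GzipWriteBench {
  def time[A](label: String)(body: => A): A = {
    val t0 = System.nanoTime()
    val result = body
    println(f"$label: ${(System.nanoTime() - t0) / 1e6}%.0f ms")
    result
  }

  def main(args: Array[String]): Unit = {
    val data = Array.fill[Byte](64 * 1024 * 1024)(42) // stand-in for real analysis data

    time("no compression") {
      val out = new FileOutputStream("plain.bin")
      try out.write(data) finally out.close()
    }
    time("GZIPOutputStream") {
      val out = new GZIPOutputStream(new FileOutputStream("std.gz"))
      try out.write(data) finally out.close()
    }
    // For the parallel variant, substitute the stream from this PR:
    //   new ParallelGzipOutputStream(new FileOutputStream("par.gz"))
  }
}
```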
  implicit def sortedMapFactoryToCBF[CC[A, B] <: SortedMap[A, B] with SortedMapLike[
    A,
    B,
    CC[A, B]
  ], K: Ordering, V](f: SortedMapFactory[CC]): Factory[(K, V), CC[K, V]] =
    new f.SortedMapCanBuildFrom
The auto-formatter is awful. All of this was so much more readable before the auto-formatter ruined it.
A new implementation of Zinc's incremental state serialization.

- Full structural serialization (like the existing protobuf format), no shortcuts with sbinary or Java serialization (like the existing text format).
- A single implementation that supports an efficient binary format for production use and a text format for development and debugging.
- Consistent output files: If two compiler runs result in the same internal representation of incremental state (after applying WriteMappers), they produce identical zinc files. This is important for build tools like Bazel, where skipping a build entirely when the outputs are identical is much cheaper than having to run Zinc to perform the incremental state analysis.
- Smaller output files than the existing binary format.
- Faster serialization and deserialization than the existing binary format.
- Smaller implementation than either of the existing formats.
- Optional unsorted output that trades consistency and small file sizes for much faster writing.

Benchmark data based on scala-library + reflect + compiler:

|                             | Write time | Read time | File size |
|-----------------------------|------------|-----------|-----------|
| sbt Text                    | 1002 ms    | 791 ms    | 7102 kB   |
| sbt Binary                  | 654 ms     | 277 ms    | 6182 kB   |
| ConsistentBinary            | 157 ms     | 100 ms    | 3097 kB   |
| ConsistentBinary (unsorted) | 79 ms      |           | 3796 kB   |

This PR makes the new format available via the new ConsistentFileAnalysisStore. It does not replace the existing formats (but it should; it's a better choice for almost every use case).

We have been using iterations of this format internally over the last few months for the Bazel build (with our own Zinc-based tooling) of our two main monorepos, totaling about 27000 Scala (+ Java/mixed) targets ranging in size from a few LOC to almost 1 million LOC.
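For orientation, a minimal sketch of opting into the new store; the package path is inferred from the source paths in this PR and the factory arity is a guess, so check the merged sources for the real signature (mappers, sort flag, parallelism, etc.):

```scala
import java.io.File
// Assumed package, based on internal/zinc-persist/.../consistent/ in this PR:
import sbt.internal.inc.consistent.ConsistentFileAnalysisStore

object ConsistentStoreUsage {
  val analysisFile = new File("target/inc_compile.zip")
  // Hypothetical no-frills factory call; real overloads may take mappers etc.
  val store = ConsistentFileAnalysisStore.binary(analysisFile)
  // The store then behaves like any other xsbti.compile.AnalysisStore:
  // store.set(contents); val loaded = store.get()
}
```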
I don't understand why the benchmarks are failing in CI. I added the new class to the existing benchmark project, so it should be found. I was not able to reproduce the problem locally; the benchmark commands from CI run just fine on my machine.
Oh, it's the "benchmark against develop branch" step that fails. This seems to be the usual undercompilation bug of sbt-jmh: it usually needs an explicit clean when you remove a benchmark, and switching to the develop branch after building from this PR would trigger it. This problem should go away after merging, but explicitly cleaning the jmh project would be a better fix.
How about : for the formats? Seems fast too.
Bump @eed3si9n, are there any blockers to merging this? We've been using this internally with good success, and expect it would provide a lot of value to the broader community to have it upstreamed for sbt, Mill, and other build tools to use.
Overall I am on board with a more stable / hermetic analysis serializer. My main line of concern was where the speedup gain was coming from.
Thanks @szeiger!
@eed3si9n would you have time to do a 1.10.0-M4 release that includes this PR?
Yea. I'll try to get something out this weekend.
https://github.com/sbt/zinc/releases/tag/v1.10.0-RC1 is on its way to Maven Central.
See also sbt/zinc#1326. This adds a new setting `enableConsistentCompileAnalysis`, which enables the new "Consistent" Analysis format, which is faster and more repeatable than the status quo. It is initialized to `true` by default. It can be opted out of either via the setting or by using `-Dsbt.analysis2024=false`.
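Based on the setting and system property named in that commit message, opting out would look something like this; the `ThisBuild` scoping is an assumption, so consult the sbt release notes for the exact incantation:

```scala
// build.sbt — disable the new Consistent analysis format build-wide
// (alternatively, pass -Dsbt.analysis2024=false on the JVM command line)
ThisBuild / enableConsistentCompileAnalysis := false
```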
This is a new analysis store format added to Zinc by Databricks that is deterministic: given two identical Zinc states (after applying the read/write mappers), the outputs should be identical. As an added bonus, it is faster and produces smaller files than the previous format. See this PR for more info: sbt/zinc#1326. This means we can stop most of the work we're doing to make the Zinc analysis output more deterministic and just rely on this new analysis format.