New analysis format #1326

Merged: 2 commits into sbt:develop from wip/consistent-analysis-format on Mar 18, 2024

Conversation

@szeiger (Contributor) commented Jan 5, 2024

A new implementation of Zinc's incremental state serialization.

  • Full structural serialization (like the existing protobuf format), no shortcuts with sbinary or Java serialization (like the existing text format).
  • A single implementation that supports an efficient binary format for production use and a text format for development and debugging.
  • Consistent output files: If two compiler runs result in the same internal representation of incremental state (after applying WriteMappers), they produce identical zinc files. This is important for build tools like Bazel where skipping a build entirely when the outputs are identical is much cheaper than having to run Zinc to perform the incremental state analysis.
  • Smaller output files than the existing binary format.
  • Faster serialization and deserialization than the existing binary format.
  • Smaller implementation than either of the existing formats.
  • Optional unsorted output that trades consistency and small file sizes for much faster writing.

Benchmark data based on scala-library + reflect + compiler:

|                             | Write time | Read time | File size |
|-----------------------------|------------|-----------|-----------|
| sbt Text                    |    1002 ms |    791 ms | ~ 7102 kB |
| sbt Binary                  |     654 ms |    277 ms | ~ 6182 kB |
| ConsistentBinary            |     157 ms |    100 ms |   3097 kB |
| ConsistentBinary (unsorted) |      79 ms |           | ~ 3796 kB |

This PR makes the new format available via the new ConsistentFileAnalysisStore. It does not replace the existing formats (but it should; it's a better choice for almost every use case).
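For orientation, a minimal usage sketch of the new store as described in this PR (the exact package and parameter names are assumptions and may differ in the final API):

```scala
import java.io.File
import sbt.internal.inc.consistent.ConsistentFileAnalysisStore
import xsbti.compile.analysis.ReadWriteMappers

// Binary format for production use. sort = true is what buys the
// consistent, byte-identical output; the unsorted mode trades that
// consistency (and some file size) for much faster writes.
val store = ConsistentFileAnalysisStore.binary(
  new File("target/inc_compile.zip"),
  ReadWriteMappers.getEmptyMappers,
  sort = true
)

// Text format, intended for development and debugging.
val debugStore = ConsistentFileAnalysisStore.text(
  new File("target/inc_compile.txt.zip"),
  ReadWriteMappers.getEmptyMappers,
  sort = true
)
```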

We have been using iterations of this format internally over the last few months for the Bazel build (with our own Zinc-based tooling) of our two main monorepos totaling about 27000 Scala (+ Java/mixed) targets ranging in size from a few LOC to almost 1 million LOC.

@szeiger (Contributor Author) commented Jan 5, 2024

The 3 large binary files clearly don't belong in this repo, but I don't know what the right place for them is. They are used by the integration tests and the benchmarks. In our internal repo we keep such files in S3 instead of checking them into git.

Comment on lines +22 to +23
* - Consistent output files: If two compiler runs result in the same internal representation of
* incremental state (after applying WriteMappers), they produce identical zinc files.
@eed3si9n (Member) commented Jan 5, 2024

Nice! I guess the flip side is that the existing binary format is incorrect/unstable? ("correct" defined by Bazel as producing the same results between runs)

@szeiger (Contributor Author):

The current format can't be made stable, otherwise I probably wouldn't have created this new one from scratch. It is based on protobuf and has maps in it. This doesn't prevent a stable output at the spec level but the Java/Scala protobuf tooling only gives you unordered maps.

Getting identical outputs is not so much a correctness problem as one of performance. If your only mechanism to determine what to rebuild is Zinc, then it doesn't matter. But in Bazel we a) have a faster way of skipping targets entirely (when all inputs are identical) and b) a larger overhead for dropping down into Zinc to make this decision.
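The effect is easy to see in miniature: hash map iteration order is an implementation detail, so a consistent serializer has to impose an order itself. A toy sketch of the principle (not Zinc's actual serializer):

```scala
import java.io.{ ByteArrayOutputStream, DataOutputStream }

// Writing map entries in sorted key order makes the output independent
// of the map's internal iteration order; without sorting, two maps with
// identical contents can serialize to different bytes.
def writeStringMap(m: Map[String, String], sort: Boolean): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val out = new DataOutputStream(bos)
  val entries = if (sort) m.toSeq.sortBy(_._1) else m.toSeq
  out.writeInt(entries.size)
  for ((k, v) <- entries) { out.writeUTF(k); out.writeUTF(v) }
  out.flush()
  bos.toByteArray
}

// Same contents, different construction order:
val a = Map("x" -> "1") + ("y" -> "2")
val b = Map("y" -> "2") + ("x" -> "1")
assert(writeStringMap(a, sort = true) sameElements writeStringMap(b, sort = true))
```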

Comment on lines 55 to 56
ec: ExecutionContext = ExecutionContext.global, // NOLINT
parallelism: Int = Runtime.getRuntime.availableProcessors()
Member:

Could we remove the default arguments here to avoid inadvertently using global?

@szeiger (Contributor Author) commented Jan 8, 2024

We are using global. Is there a better EC that is accessible when you construct an AnalysisStore? Should this be a parameter of the AnalysisStore.text/binary methods?

Member:

Yea, honestly I don't know how much of an issue it would be if it's relatively quick, but it might be a good idea to expose the EC as a configuration parameter so we can pick something else, like a dedicated thread pool. Even if we are using global, I think we should make that as explicit as possible.

@szeiger (Contributor Author):

I'm moving the implicits up into the ConsistentAnalysisStore factory methods and making them explicit further down.
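For illustration, constructing the store with an explicit dedicated pool could then look like this (the factory shape is an assumption based on the parameters quoted above):

```scala
import java.io.File
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext
import sbt.internal.inc.consistent.ConsistentFileAnalysisStore
import xsbti.compile.analysis.ReadWriteMappers

// A dedicated pool, so serialization doesn't compete with work scheduled
// on ExecutionContext.global. Parameter names mirror the quoted snippet
// (ec, parallelism) but the exact signature may differ.
val parallelism = Runtime.getRuntime.availableProcessors()
val pool = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(parallelism))

val store = ConsistentFileAnalysisStore.binary(
  new File("target/inc_compile.zip"),
  ReadWriteMappers.getEmptyMappers,
  sort = true,
  ec = pool,
  parallelism = parallelism
)
```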

@Measurement(iterations = 5)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
class AnalysisFormatBenchmark {
Member:

Can we integrate it as part of the CI checks, similar to #1323?

This doesn't have to go into this PR, we can do this in a separate PR.

@Friendseeker (Member) commented Jan 5, 2024

> The 3 large binary files clearly don't belong in this repo, but I don't know what the right place for them is. They are used by the integration tests and the benchmarks. In our internal repo we keep such files in S3 instead of checking them into git.

Maybe we can store them via Git LFS?

I have never used Git LFS, so ideally people with prior experience with it can judge whether this is indeed a good idea.


private[this] def writeAPIs(out: Serializer, apis: APIs, storeApis: Boolean): Unit = {
def write(n: String, m: Map[String, AnalyzedClass]): Unit =
writeMaybeSortedStringMap(out, n, m.mapValues(_.withCompilationTimestamp(-1L))) { ac =>
@Friendseeker (Member) commented Jan 5, 2024

What is the reason for setting the timestamp to -1L?

I am not fully sure, but I think sbt's testQuick uses the analysis timestamp. In the future, new macro invalidation logic may also require timestamp information.

If this is for reproducible output, we can make it opt-in.

@szeiger (Contributor Author):

Any metadata about a compilation run inevitably results in inconsistent outputs. We can only store data that is repeatable in a fresh build. If sbt needs this data, we could include it optionally (similar to the unsorted output that I added at the last minute).

Member:

We should probably use 2010-01-01, like elsewhere in sbt (sbt/sbt#6237), if not a few seconds after midnight of 2010-01-01 (to break ties with source JARs).
There's a weird combined ZIP + JVM bug where any timestamp before 1980 requires an extended timestamp, and it sometimes ends up capturing the timezone of the host machine, breaking hermeticity.

@szeiger (Contributor Author):

Related to something like this? I found this in the JarUtils class that is part of our Bazel tooling. I'm not sure if it was written from scratch or adapted from code in sbt or some other place originally:

  private lazy val fixedTime = new SimpleDateFormat("dd-MM-yyyy").parse("01-01-1980").getTime

This looks good at first glance, but SDF defaults to the current timezone. I changed it to

  final val FixedTime = 315532800000L // 1980-01-01Z00:00:00
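A timezone-independent way to derive (and sanity-check) that constant is java.time, which always interprets the trailing Z as UTC:

```scala
import java.time.Instant

// Unlike SimpleDateFormat, Instant.parse cannot pick up the host timezone:
// the trailing 'Z' pins the epoch-millis value to UTC on every machine.
val FixedTime: Long = Instant.parse("1980-01-01T00:00:00Z").toEpochMilli
assert(FixedTime == 315532800000L)
```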

if (!file.getParentFile.exists()) file.getParentFile.mkdirs()
val fout = new FileOutputStream(tmpAnalysisFile)
try {
val gout = new ParallelGzipOutputStream(fout)
Member:

I'd be curious to benchmark how much of the write speedup is coming from parallelism. Would that basically require more CPU power, or assume there are excess CPU cores that aren't pinned by other tasks? In Bazel, compilation is out-sourced to a worker process (and sometimes remote worker machines), but this could behave differently for sbt. It's not that sbt can always occupy all cores, so we might still see a speedup, but it would be good to benchmark this characteristic.

@szeiger (Contributor Author):

The gzip compression is very slow relative to everything else, despite gzip with native zlib being fast and despite using the best performance/size tradeoff in the compression settings, as determined by benchmarks. Or maybe I should say: the new serializer and deserializer are very fast compared to gzip.

I reran a quick benchmark. Writing without compression takes 153ms, writing with Java's standard GZIPOutputStream is 402ms, writing with ParallelGzipOutputStream is 157ms. The latter is on a 12-core M2 Max (but with only 230% CPU usage; that's enough to offload all the compression to background threads, making it almost free in terms of wall clock time).

With the new flag that skips sorting, writing is now faster than reading. This is because we've reached the point where even gzip decompression during reading could benefit from parallelization (but the gains would be much smaller than for writing).
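To make the parallelism concrete: gzip allows concatenating independently compressed members into one valid stream, which is the classic (pigz-style) way to spread compression across threads. A simplified sketch of that idea, not necessarily how ParallelGzipOutputStream is implemented:

```scala
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPOutputStream
import scala.concurrent.{ Await, ExecutionContext, Future }
import scala.concurrent.duration.Duration

// Compress fixed-size blocks in parallel and concatenate the resulting
// gzip members; GZIPInputStream reads multi-member streams transparently.
def parallelGzip(data: Array[Byte], blockSize: Int)(implicit ec: ExecutionContext): Array[Byte] = {
  val futures = data.grouped(blockSize).toSeq.map { block =>
    Future {
      val bos = new ByteArrayOutputStream()
      val gz = new GZIPOutputStream(bos)
      gz.write(block)
      gz.close()
      bos.toByteArray
    }
  }
  Await.result(Future.sequence(futures), Duration.Inf).flatten.toArray
}
```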

Comment on lines +18 to +23
implicit def sortedMapFactoryToCBF[CC[A, B] <: SortedMap[A, B] with SortedMapLike[
A,
B,
CC[A, B]
], K: Ordering, V](f: SortedMapFactory[CC]): Factory[(K, V), CC[K, V]] =
new f.SortedMapCanBuildFrom
@szeiger (Contributor Author):

The auto-formatter is awful. All of this was so much more readable before the auto-formatter ruined it.

@szeiger force-pushed the wip/consistent-analysis-format branch 2 times, most recently from 8305c68 to 9865282, on January 10, 2024 at 14:57
@szeiger (Contributor Author) commented Jan 10, 2024

I don't understand why the benchmarks are failing in CI. I added the new class to the existing benchmark project so it should be found. I was not able to reproduce the problem locally. The benchmark commands from CI run just fine on my machine.

@szeiger (Contributor Author) commented Jan 10, 2024

Oh, it's the "benchmark against develop branch" step that fails. This seems to be the usual undercompilation bug in sbt-jmh: it usually needs an explicit clean when you remove a benchmark, and switching to the develop branch after building from this PR would trigger it. This problem should go away after merging, but explicitly cleaning the jmh project would be a better fix.

@szeiger force-pushed the wip/consistent-analysis-format branch from 9865282 to 05a13c4 on January 10, 2024 at 15:46
@szeiger force-pushed the wip/consistent-analysis-format branch from 05a13c4 to 1d296b2 on January 10, 2024 at 16:13
@He-Pin commented Jan 18, 2024

How about https://github.com/HebiRobotics/QuickBuffers or https://github.com/google/flatbuffers for the format? They seem fast too.

@lihaoyi-databricks commented:

Bump @eed3si9n, are there any blockers to merging this? We've been using this internally with good success, and expect it would provide a lot of value to the broader community to have it upstreamed for sbt, Mill, and other build tools to use.

@eed3si9n (Member) commented:

> Bump @eed3si9n, are there any blockers to merging this? We've been using this internally with good success, and expect it would provide a lot of value to the broader community to have it upstreamed for sbt, Mill, and other build tools to use.

Overall I am on board with a more stable/hermetic analysis serializer. My main concern was where the speedup was coming from (ExecutionContext.global) and whether it would end up competing for CPU attention in non-Bazel usage. Now that the ExecutionContext is a parameter, hopefully we can tweak how many threads we assign to this vs. other tasks.

@eed3si9n (Member) left a review comment:

Thanks @szeiger!

@eed3si9n merged commit dcddc1f into sbt:develop on Mar 18, 2024. 8 checks passed.
@lrytz (Contributor) commented Apr 5, 2024

@eed3si9n would you have time to do a 1.10.0-M4 release that includes this PR?

@eed3si9n (Member) commented Apr 5, 2024

Yea. I'll try to get something out this weekend.

@eed3si9n (Member) commented Apr 8, 2024

https://github.com/sbt/zinc/releases/tag/v1.10.0-RC1 is on its way to Maven Central.

eed3si9n added a commit to eed3si9n/sbt that referenced this pull request Apr 15, 2024
See also sbt/zinc#1326

This adds a new setting `enableConsistentCompileAnalysis`,
which enables the new "Consistent" Analysis format,
which is faster and more repeatable than the status quo.
This is initialized to `true` by default.
It can be opted out of either via the setting or by using
`-Dsbt.analysis2024=false`.
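In an sbt build, that opt-out would look like this (the setting name and system property are taken from the commit message above; the `ThisBuild` scoping is illustrative):

```scala
// build.sbt: keep the old analysis format instead of the new Consistent one
ThisBuild / enableConsistentCompileAnalysis := false
```

Or, without touching the build: `sbt -Dsbt.analysis2024=false`.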
jjudd pushed several commits to lucidsoftware/rules_scala that referenced this pull request, Jul 2–3, 2024
jjudd pushed a commit to lucidsoftware/rules_scala that referenced this pull request Jul 16, 2024
This is a new analysis store format added to Zinc by Databricks that is
deterministic. Given two identical Zinc states (after applying the
read/write mappers) the outputs should be identical.

As an added bonus, it is faster and smaller than the previous format. See
this PR for more info: sbt/zinc#1326

This means we can stop most of the work we're doing to make the Zinc
analysis output more deterministic and just rely on this new analysis
format.