New analysis format #1326

Merged: 2 commits into sbt:develop from wip/consistent-analysis-format on Mar 18, 2024

Conversation

@szeiger (Contributor) commented Jan 5, 2024

A new implementation of Zinc's incremental state serialization.

  • Full structural serialization (like the existing protobuf format), no shortcuts with sbinary or Java serialization (like the existing text format).
  • A single implementation that supports an efficient binary format for production use and a text format for development and debugging.
  • Consistent output files: If two compiler runs result in the same internal representation of incremental state (after applying WriteMappers), they produce identical zinc files. This is important for build tools like Bazel where skipping a build entirely when the outputs are identical is much cheaper than having to run Zinc to perform the incremental state analysis.
  • Smaller output files than the existing binary format.
  • Faster serialization and deserialization than the existing binary format.
  • Smaller implementation than either of the existing formats.
  • Optional unsorted output that trades consistency and small file sizes for much faster writing.

Benchmark data based on scala-library + reflect + compiler:

|                             | Write time | Read time | File size |
|-----------------------------|------------|-----------|-----------|
| sbt Text                    |    1002 ms |    791 ms | ~ 7102 kB |
| sbt Binary                  |     654 ms |    277 ms | ~ 6182 kB |
| ConsistentBinary            |     157 ms |    100 ms |   3097 kB |
| ConsistentBinary (unsorted) |      79 ms |           | ~ 3796 kB |

This PR makes the new format available via the new ConsistentFileAnalysisStore. It does not replace the existing formats (but it should; it's a better choice for almost every use case).
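For orientation, a minimal usage sketch of the new store as described in this PR (the exact package and parameter names are assumptions and may differ in the final API):

```scala
import java.io.File
import sbt.internal.inc.consistent.ConsistentFileAnalysisStore
import xsbti.compile.analysis.ReadWriteMappers

// Binary format for production use. sort = true is what buys the
// consistent, byte-identical output; the unsorted mode trades that
// consistency (and some file size) for much faster writes.
val store = ConsistentFileAnalysisStore.binary(
  new File("target/inc_compile.zip"),
  ReadWriteMappers.getEmptyMappers,
  sort = true
)

// Text format, intended for development and debugging.
val debugStore = ConsistentFileAnalysisStore.text(
  new File("target/inc_compile.txt.zip"),
  ReadWriteMappers.getEmptyMappers,
  sort = true
)
```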

We have been using iterations of this format internally over the last few months for the Bazel build (with our own Zinc-based tooling) of our two main monorepos totaling about 27000 Scala (+ Java/mixed) targets ranging in size from a few LOC to almost 1 million LOC.

@szeiger (Contributor Author) commented Jan 5, 2024

The 3 large binary files clearly don't belong in this repo, but I don't know what the right place for them is. They are used by the integration tests and the benchmarks. In our internal repo we keep such files in S3 instead of checking them into git.

Comment on lines +22 to +23
* - Consistent output files: If two compiler runs result in the same internal representation of
* incremental state (after applying WriteMappers), they produce identical zinc files.
@eed3si9n (Member) commented Jan 5, 2024

Nice! I guess the flip side is that the existing binary format is incorrect/unstable? ("correct" defined by Bazel as producing the same results between runs)

@szeiger (Contributor Author):

The current format can't be made stable, otherwise I probably wouldn't have created this new one from scratch. It is based on protobuf and has maps in it. This doesn't prevent a stable output at the spec level but the Java/Scala protobuf tooling only gives you unordered maps.

Getting identical outputs is not so much a correctness problem as one of performance. If your only mechanism to determine what to rebuild is Zinc, then it doesn't matter. But in Bazel we a) have a faster way of skipping targets entirely (when all inputs are identical) and b) a larger overhead for dropping down into Zinc to make this decision.
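The effect is easy to see in miniature: hash map iteration order is an implementation detail, so a consistent serializer has to impose an order itself. A toy sketch of the principle (not Zinc's actual serializer):

```scala
import java.io.{ ByteArrayOutputStream, DataOutputStream }

// Writing map entries in sorted key order makes the output independent
// of the map's internal iteration order; without sorting, two maps with
// identical contents can serialize to different bytes.
def writeStringMap(m: Map[String, String], sort: Boolean): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val out = new DataOutputStream(bos)
  val entries = if (sort) m.toSeq.sortBy(_._1) else m.toSeq
  out.writeInt(entries.size)
  for ((k, v) <- entries) { out.writeUTF(k); out.writeUTF(v) }
  out.flush()
  bos.toByteArray
}

// Same contents, different construction order:
val a = Map("x" -> "1") + ("y" -> "2")
val b = Map("y" -> "2") + ("x" -> "1")
assert(writeStringMap(a, sort = true) sameElements writeStringMap(b, sort = true))
```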

Comment on lines 55 to 56
ec: ExecutionContext = ExecutionContext.global, // NOLINT
parallelism: Int = Runtime.getRuntime.availableProcessors()
Member:

Could we remove the default arguments here to avoid inadvertently using global?

@szeiger (Contributor Author) commented Jan 8, 2024

We are using global. Is there a better EC that is accessible when you construct an AnalysisStore? Should this be a parameter of the AnalysisStore.text/binary methods?

Member:

Yea, honestly I don't know how much of an issue it would be if it's relatively quick, but it might be a good idea to expose the EC as a configuration parameter so we can pick something else, like a dedicated thread pool. Even if we are using global, I think we should make that as explicit as possible.

@szeiger (Contributor Author):

I'm moving the implicits up into the ConsistentAnalysisStore factory methods and making them explicit further down.
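For illustration, constructing the store with an explicit dedicated pool could then look like this (the factory shape is an assumption based on the parameters quoted above):

```scala
import java.io.File
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext
import sbt.internal.inc.consistent.ConsistentFileAnalysisStore
import xsbti.compile.analysis.ReadWriteMappers

// A dedicated pool, so serialization doesn't compete with work scheduled
// on ExecutionContext.global. Parameter names mirror the quoted snippet
// (ec, parallelism) but the exact signature may differ.
val parallelism = Runtime.getRuntime.availableProcessors()
val pool = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(parallelism))

val store = ConsistentFileAnalysisStore.binary(
  new File("target/inc_compile.zip"),
  ReadWriteMappers.getEmptyMappers,
  sort = true,
  ec = pool,
  parallelism = parallelism
)
```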

@Measurement(iterations = 5)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
class AnalysisFormatBenchmark {
Member:

Can we integrate it as part of the CI checks, similar to #1323?

This doesn't have to go into this PR, we can do this in a separate PR.

@Friendseeker (Member) commented Jan 5, 2024

> The 3 large binary files clearly don't belong in this repo, but I don't know what the right place for them is. They are used by the integration tests and the benchmarks. In our internal repo we keep such files in S3 instead of checking them into git.

Maybe we can store them via Git LFS?

I have never used Git LFS, so ideally people with prior experience with it can judge whether this is indeed a good idea.


private[this] def writeAPIs(out: Serializer, apis: APIs, storeApis: Boolean): Unit = {
def write(n: String, m: Map[String, AnalyzedClass]): Unit =
writeMaybeSortedStringMap(out, n, m.mapValues(_.withCompilationTimestamp(-1L))) { ac =>
@Friendseeker (Member) commented Jan 5, 2024

What is the reason for setting the timestamp to -1L?

I am not fully sure, but I think sbt's testQuick uses the analysis timestamp. In the future, new macro invalidation logic may also require timestamp information.

If this is for reproducible output, we can make it opt-in.

@szeiger (Contributor Author):

Any metadata about a compilation run inevitably results in inconsistent outputs. We can only store data that is repeatable in a fresh build. If sbt needs this data, we could include it optionally (similar to the unsorted output that I added at the last minute).

Member:

We should probably use 2010-01-01, like elsewhere in sbt (sbt/sbt#6237), if not a few seconds after midnight of 2010-01-01 (to break ties with source JARs).
There's a weird combined ZIP + JVM bug where any timestamp before 1980 requires an extended timestamp, and it sometimes ends up capturing the timezone of the host machine, breaking hermeticity.

@szeiger (Contributor Author):

Related to something like this? I found this in the JarUtils class that is part of our Bazel tooling. I'm not sure if it was written from scratch or adapted from code in sbt or some other place originally:

  private lazy val fixedTime = new SimpleDateFormat("dd-MM-yyyy").parse("01-01-1980").getTime

This looks good at first glance, but SDF defaults to the current timezone. I changed it to

  final val FixedTime = 315532800000L // 1980-01-01Z00:00:00
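A timezone-independent way to derive (and sanity-check) that constant is java.time, which always interprets the trailing Z as UTC:

```scala
import java.time.Instant

// Unlike SimpleDateFormat, Instant.parse cannot pick up the host timezone:
// the trailing 'Z' pins the epoch-millis value to UTC on every machine.
val FixedTime: Long = Instant.parse("1980-01-01T00:00:00Z").toEpochMilli
assert(FixedTime == 315532800000L)
```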

if (!file.getParentFile.exists()) file.getParentFile.mkdirs()
val fout = new FileOutputStream(tmpAnalysisFile)
try {
val gout = new ParallelGzipOutputStream(fout)
Member:

I'd be curious to benchmark how much of the write speedup is coming from parallelism. Would that basically require more CPU power, or assume there are excess CPU cores that aren't pinned by other tasks? In Bazel, compilation is out-sourced to a worker process (and sometimes remote worker machines), but this could behave differently for sbt. It's not that sbt can always occupy all cores, so we might still see a speedup, but it would be good to benchmark this characteristic.

@szeiger (Contributor Author):

The gzip compression is very slow relative to everything else, despite gzip with native zlib being fast and despite using the best performance/size tradeoff in the compression settings, as determined by benchmarks. Or maybe I should say: the new serializer and deserializer are very fast compared to gzip.

I reran a quick benchmark. Writing without compression takes 153ms, writing with Java's standard GZIPOutputStream is 402ms, writing with ParallelGzipOutputStream is 157ms. The latter is on a 12-core M2 Max (but with only 230% CPU usage; that's enough to offload all the compression to background threads, making it almost free in terms of wall clock time).

With the new flag that skips sorting, writing is now faster than reading. This is because we've reached the point where even gzip decompression during reading could benefit from parallelization (but the gains would be much smaller than for writing).
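To make the parallelism concrete: gzip allows concatenating independently compressed members into one valid stream, which is the classic (pigz-style) way to spread compression across threads. A simplified sketch of that idea, not necessarily how ParallelGzipOutputStream is implemented:

```scala
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPOutputStream
import scala.concurrent.{ Await, ExecutionContext, Future }
import scala.concurrent.duration.Duration

// Compress fixed-size blocks in parallel and concatenate the resulting
// gzip members; GZIPInputStream reads multi-member streams transparently.
def parallelGzip(data: Array[Byte], blockSize: Int)(implicit ec: ExecutionContext): Array[Byte] = {
  val futures = data.grouped(blockSize).toSeq.map { block =>
    Future {
      val bos = new ByteArrayOutputStream()
      val gz = new GZIPOutputStream(bos)
      gz.write(block)
      gz.close()
      bos.toByteArray
    }
  }
  Await.result(Future.sequence(futures), Duration.Inf).flatten.toArray
}
```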

Comment on lines +18 to +23
implicit def sortedMapFactoryToCBF[CC[A, B] <: SortedMap[A, B] with SortedMapLike[
A,
B,
CC[A, B]
], K: Ordering, V](f: SortedMapFactory[CC]): Factory[(K, V), CC[K, V]] =
new f.SortedMapCanBuildFrom
@szeiger (Contributor Author):

The auto-formatter is awful. All of this was so much more readable before the auto-formatter ruined it.

@szeiger force-pushed the wip/consistent-analysis-format branch 2 times, most recently from 8305c68 to 9865282, on January 10, 2024 at 14:57
@szeiger (Contributor Author) commented Jan 10, 2024

I don't understand why the benchmarks are failing in CI. I added the new class to the existing benchmark project so it should be found. I was not able to reproduce the problem locally. The benchmark commands from CI run just fine on my machine.

@szeiger (Contributor Author) commented Jan 10, 2024

Oh, it's the "benchmark against develop branch" step that fails. This seems to be the usual undercompilation bug in sbt-jmh: it usually needs an explicit clean when you remove a benchmark, and switching to the develop branch after building from this PR would trigger it. This problem should go away after merging, but explicitly cleaning the jmh project would be a better fix.

@szeiger force-pushed the wip/consistent-analysis-format branch from 9865282 to 05a13c4 on January 10, 2024 at 15:46
@szeiger force-pushed the wip/consistent-analysis-format branch from 05a13c4 to 1d296b2 on January 10, 2024 at 16:13
@He-Pin commented Jan 18, 2024

How about https://github.com/HebiRobotics/QuickBuffers or https://github.com/google/flatbuffers for the format? They seem fast too.

@lihaoyi-databricks commented:

Bump @eed3si9n, are there any blockers to merging this? We've been using this internally with good success, and expect it would provide a lot of value to the broader community to have it upstreamed for sbt, Mill, and other build tools to use.

@eed3si9n (Member) commented:

> Bump @eed3si9n, are there any blockers to merging this? We've been using this internally with good success, and expect it would provide a lot of value to the broader community to have it upstreamed for sbt, Mill, and other build tools to use.

Overall I am on board with a more stable/hermetic analysis serializer. My main concern was where the speedup was coming from (ExecutionContext.global) and whether it would end up competing for CPU attention in non-Bazel usage. Now that the ExecutionContext is a parameter, hopefully we can tweak how many threads we assign to this vs. other tasks.

@eed3si9n (Member) left a review comment:

Thanks @szeiger!

@eed3si9n merged commit dcddc1f into sbt:develop on Mar 18, 2024. 8 checks passed.
@lrytz (Contributor) commented Apr 5, 2024

@eed3si9n would you have time to do a 1.10.0-M4 release that includes this PR?

@eed3si9n (Member) commented Apr 5, 2024

Yea. I'll try to get something out this weekend.

@eed3si9n (Member) commented Apr 8, 2024

https://github.com/sbt/zinc/releases/tag/v1.10.0-RC1 is on its way to Maven Central.

eed3si9n added a commit to eed3si9n/sbt that referenced this pull request Apr 15, 2024
See also sbt/zinc#1326

This adds a new setting `enableConsistentCompileAnalysis`,
which enables the new "Consistent" Analysis format,
which is faster and more repeatable than the status quo.
This is initialized to `true` by default.
It can be opted out of either via the setting or by using
`-Dsbt.analysis2024=false`.
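In an sbt build, that opt-out would look like this (the setting name and system property are taken from the commit message above; the `ThisBuild` scoping is illustrative):

```scala
// build.sbt: keep the old analysis format instead of the new Consistent one
ThisBuild / enableConsistentCompileAnalysis := false
```

Or, without touching the build: `sbt -Dsbt.analysis2024=false`.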
jjudd pushed several commits to lucidsoftware/rules_scala that referenced this pull request, Jul 2–3, 2024
jjudd pushed a commit to lucidsoftware/rules_scala that referenced this pull request Jul 16, 2024
This is a new analysis store format added to Zinc by Databricks that is
deterministic. Given two identical Zinc states (after applying the
read/write mappers) the outputs should be identical.

As an added bonus, it is faster and smaller than the previous format. See
this PR for more info: sbt/zinc#1326

This means we can stop most of the work we're doing to make the Zinc
analysis output more deterministic and just rely on this new analysis
format.