Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Classpath Implementation #416

Open
1 task done
retronym opened this issue Aug 17, 2017 · 10 comments
Open
1 task done

Improve Classpath Implementation #416

retronym opened this issue Aug 17, 2017 · 10 comments
Assignees
Milestone

Comments

@retronym
Copy link
Member

retronym commented Aug 17, 2017

Background

The Scala compiler uses its Classpath abstraction to present the compilation classpath and sourcepath to the compiler. Entries in these paths are typically directories, JARs, or, more recently, the jrt:// virtual filesystem exposed by Java 9.

Scala 2.12 switched to a refactored implementation of the classpath (known as the "flat classpath"). This work was primary motivated by reducing the memory footprint of the classpath representation. Improved compilation speed by saving JAR listings is a secondary benefit.

An example of an environment where this matters is in a modularized build loaded into Scala IDE. There may be hundreds of instances of Global, each with classpaths numbering in hundreds of elements, that point at some smaller set of JARs.

Two main changes were made:

  1. Nested packages no longer give rise to nested objects in the representation, rather each element has a single element that can lookup a FQN (fully qualified name).
  2. The representation of the classpath is by default shared across multiple instances of the compiler (either a subsequent compilation or concurrent compilation in the same JVM).

Like any new implementation, there have been some teething problems.

  • An API point used by SBT to backtrack from a Symbol to the corresponding classpath entry became O(N) with poor constant factors. This was fixed in scala/scala#5956.
  • scala/bug#10295 The static caching of classpath represention does not have any invalidation based on, e.g. last modified timestamps of the JARs, nor does it ever close files once opened. In practice, this affects people using the exportJars =: true mode of SBT. A workaround is to use -YdisableFlatCpCaching

Furthermore, the cost of navigating through the classpath to a .class and converting it to a Symbol is a non-neglible part of some compilation use cases (in particular, very small compile runs), so finding ways to make this faster or avoid parts of it via reliable caching is appealing.

Goals

New Globals don't see stale JAR listings

Improve the current caching mechanism so that If a JAR is overwritten, a newly constructed Global should use the listing from the new JAR, rather than the stale cached info.

If aggregates are cached, these need to be invalidated when a component is invalidated.

Thread Safety of shared instances

Code review and adversarial testing should be used to flush out any thread safety problems in Classpath as it is shared by instance of Global.

JARs are not locked when there are no active compilations

There are two approaches here:

  • Make default the approach currently hidden behind -D scala.classpath.closeZip that open and closes zips around the extraction of each class file. While this hurts performance in a microbenchmark, it might be acceptable in the context of the compiler.
  • Extend and hook into the lifecycle of Global and/or Run (perhaps involving SBT) to implement a reference counting for cached classpath elements that correspond to open ZipFile-s , and close them when the ref-count drops to 0. I've prototyped this in retronym:ticket/10295

Performant Implementation

Create benchmarks and tests to measure:

  • Footprint of classpath representation should be minimized. Measure effect of sharing enabled/disabled.
  • Lookup performance wrt length of classpath should be better than O(n), aggregates should index package => entry
  • Avoid gratuitous allocations on lookups (e.g. by caching entries)

New Run-s respect overwritten JAR / .class files

While we're in the neighborhood of cache invalidation, we should set a stretch target of making multi-run compilation more sure of foot when treading over the shifting sands of overwritten JARs and .class files.

The naive approach would be to wipe out the symbol table on any change to the underlying files that had been witnessed during its construction.

A more nuanced approach would be able to prune only symbols that had changed, or had depended (perhaps transitively) on something that has changed.

As a concrete example, in a multi-module SBT:

  • scalac -cp static-lib.jar -d a/target/classes a/src/...
  • scalac -cp static-lib.jar:a/target/classes -d b/target/classes b/src/...
  • scalac -cp static-lib.jar -d a/target/classes a/src/... Overwrites some classfiles
  • scalac -cp static-lib.jar:a/target/classes -d b/target/classes b/src/... We could reuse Symbol-s representing static-lib.jar, rt.jar, and reset _root_.a back to its initial state. This should be possible because no signature in the those JARs refers to _root_.a.*

This has been tried in the past without success, so we must assume it isn't trivial to get right. As such, this is a stretch target.

Non Goals

Mutation of JAR or .class files during compilation currently has undefined behaviour, we don't plan to improve that.

The highest possible performance of the classpath in isolation. Performance tuning of the classpath should be motivated by scenarios in which we measure that it imposes an unreasonable overhead to a real world compilation use case.

@retronym retronym self-assigned this Aug 17, 2017
@mkeskells
Copy link

I think there is a simpler but related stretch target. It seems from the example that you are proposing to share between globals, but if we limit it to a new run with the same settings, but a changed file-system I think that would be easier and closer to the IDE flow

If we are to support some reuse of the symbols, it would be easier I think to allow reuse when the classpath doesn't change, but either the source and/or the jar content has changed. This is the typical case for a multi-module build running in parallel.

For instance if we detect that the difference between a previous compile cycle and a new one has only changed sources and jars a set of packages, can we invalidate those package symbols and all children, and thing that depend on then

I think that this would work well for an IDE, as typically a downstream project has changed some jar/classes and/or the user has edited some source

xsbt-dependency provides some information about the dependencies for the sources

the classPath could provide info on what changed between the prior run and current snap reasonably easyly

@mkeskells
Copy link

Jars that are no involved in an active compilation should not need sbt support should it?

If the Run is compiling (or being constructed) its active otherwise it isn't. That is what I tried in scala/scala#5957

@mkeskells
Copy link

-D scala.classpath.closeZip is a measurable overhead in real world benchmarks

@mkeskells
Copy link

I think it should be a design constraint (or at least a strong target) to not require SBT to implement changes. I think that the compiler should be standalone, as there are many build systems.

Certainly we should allow APIs to tools builders to make the environment better, but should not rely on these API for the functional behaviour or the compiler

@retronym
Copy link
Member Author

I've added support for benchmarking the resident compiler to get a feel for how much we can speedup with symbol table re-use.

@fommil
Copy link

fommil commented Sep 16, 2017

scala.classpath.closeZip is not a significant overhead in your full project, @mkeskells, I measured it. It is however an overhead (although still insignificant imho) on smaller projects. I use it in my builds because the time not lost restarting sbt works out as a net benefit throughout the day.

@mkeskells
Copy link

@fommil with small changes to the way classpath works we can remove this overhead completely, and speedup general classpath access, while not locking jar files (and not having to restart sbt). The 'measured overhead' is increasing, and will increase further due to changes since you left. Maybe it just what is regarded as 'significant' though. Either way it seems like a free wil

@fommil
Copy link

fommil commented Sep 17, 2017

Anything that avoids locking jar files is awesome in my book. I don't care about Windows anymore but I do care about the memory leak in ensime... especially so now that we have a compiler process per project/configuration

👍

@retronym
Copy link
Member Author

retronym commented Jun 4, 2018

Here's a straw-man proposal for a hook to support reading classfiles from a cache or a yet-to-be-written classfile from another Global instance:

trait ClasspathPlugin {
  def replaceClasspath(cp: Classpath): Classpath
}

A new variant of compiler plugin could implement this interface. The plugin could then replace one or more elements of the classpath, and override the likes of:

def classes(inPackage: String): Seq[ClassFileEntry]

We need to do a bit of a cleanup of the Classpath API first:

  • Pass in a builder to accumulate results rather than having each classpath element returning Vector-s and concatenating
  • Making parts of the API public

I would have liked to replace AbstractFile with java.nio.Path, but there are some use cases that are hard to support VirtualDirectory / VirtualFile without implementing an in-memory NIO filesystem or bringing a third party dependency that does as much. I'm on the fence as to whether we have the time budget to do this.

At that point, the custom classpath could then return a VirtualFile with a minimal classfile that wraps the ScalaSignature annotation and the Scala/ScalaRaw attributes.

Going a bit further, we could avoid the need for the dummy classfile with:

This

trait ClasspathPlugin {
  def info(file: AbstractFile): Option[ClassfileInfo]
}
trait ClassfileInfo {
  def pickle: Option[java.nio.ByteBuffer]
  def isScalaRaw: Boolean
  def isScala: Boolean
}

@fommil
Copy link

fommil commented Jun 4, 2018

(fyi, class monkey does this already)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants