
Compiler Performance Tracking Issue #48547

michaelwoerister opened this issue Feb 26, 2018 · 6 comments

@michaelwoerister (Contributor) commented Feb 26, 2018

This issue is the landing page for all things compilation speed related. We define the most important usage scenarios, as seen by the @rust-lang/compiler team and the @rust-lang/wg-compiler-performance working group. Benchmarks based on these scenarios are then used as a metric for how well the compiler is doing, compile-time-wise. We then establish a model of the compilation process and use it to estimate the effect of optimization categories on the various usage scenarios. Finally, we list concrete optimization ideas and initiatives, relate them to categories and usage scenarios, and link to their various specific GH issues.

Usage Scenarios

We track compiler performance in a number of use cases in order to gauge how compile times in the most important and common scenarios develop. We continuously measure how the compiler performs on perf.rust-lang.org. #48750 defines which concrete benchmarks go into measuring each usage scenario. In the following, "project" means a medium to large codebase consisting of dozens of crates, most of which come from crates.io.

  • FROM-SCRATCH - Compiling a project from scratch (P-high)
    This scenario occurs when a project is compiled for the first time, during CI builds, and when compiling after running cargo clean, make clean or something similar. The targeted runtime performance for builds like this is usually "fast enough", that is, the compiled program should execute with performance that does not hinder testing but it is not expected to have absolute peak performance as you would expect from a release build.

  • SMALL-CHANGE - Re-Compiling a project after a small change (P-high)
    This scenario is the most common during the regular edit-compile-debug cycle. Low compile times are the top-most priority in this scenario. The compiled runtime performance, again, should be good enough not to be an obstacle.

  • RLS - Continuously re-compiling a project for the Rust Language Server (P-high)
    This scenario covers how the Rust compiler is used by the RLS. Again, low compile times are the top-most priority here. The only difference to the previous scenario is that no executable program needs to be generated at all. The output here is diagnostics and the RLS specific save-analysis data.

  • DIST - Compiling a project for maximum runtime performance (P-medium)
    This scenario covers the case of generating release artifacts meant for being distributed and for reflecting runtime performance as perceived by end users. Here, compile times are of secondary importance -- they should be low if possible (especially for running benchmarks) but if there is a trade-off between compile time and runtime performance then runtime performance wins.

Sometimes we also see people "measuring" Rust's compile times by compiling a "Hello World" program and comparing that to how long the equivalent takes in other languages. While we do have such a benchmark in our suite, we don't consider it one of the important usage scenarios.

A Model of the Compilation Process

The compiler does lots of different things while it is compiling a program. Making any one of those things faster might improve the compile time in one of the scenarios above or it might not. This section will establish a few categories of tasks that the compiler executes and then will map each category to the scenarios that it impacts.

Compilation Phases / Task Categories

The compiler works in four large phases these days:

  1. Pre-Query Analysis (pre-query) -- This roughly includes parsing, macro expansion, name resolution, and lowering to HIR. The tasks in this phase still follow the bulk processing paradigm and the results it produces cannot be cached for incremental compilation. This phase is executed on a single thread.

  2. Query-based Analysis (query) -- This includes type checking and inference, lowering to MIR, borrow checking, MIR optimization, and translation to LLVM IR. The various sub-tasks in this phase are implemented as so-called queries, which are computations that form a DAG and the results of which can be tracked and cached for incremental compilation. Queries are also only executed if their result is requested, so in theory this system would allow for partially compiling code. This phase too is executed on a single thread.

  3. LLVM-based optimization and code generation (llvm) -- Once the compiler has generated the LLVM IR representation of a given crate, it lets LLVM optimize and then translate it into a number of object files. For optimized builds this usually takes up 65-80% of the overall compilation time. This phase can run on multiple threads in parallel, depending on compiler settings. Incremental compilation allows skipping LLVM work, but it is less effective here than for queries because caching is more coarse-grained.

  4. Linking (link) -- Finally, after LLVM has translated the program into object files, the output is linked into the final binary. This is done by an external linker which rustc takes care of invoking.

Note that this describes the compilation process for a single crate. In the scenarios given above, however, we always deal with a whole graph of crates. Cargo coordinates the build process for this graph, recompiling only crates whose code has changed. For the overall compile time of a whole project, it is important to note that Cargo will compile multiple crates in parallel, but it can only start compiling a crate once all of its dependencies have been compiled. The crates in a project form a directed acyclic graph.

Using Task Categories To Estimate Optimization Impact On Scenarios

From the description above we can infer which optimizations will affect which scenarios:

  • Making a (pre-query) task faster will affect all scenarios as these are unconditionally executed in every compilation session.
  • Making a (query) task faster will also affect all scenarios as these are also executed in every session.
  • Making (llvm) execute faster will have most effect on FROM-SCRATCH and DIST, and some effect on SMALL-CHANGE. It will have no effect on RLS.
  • Making (link) faster helps with FROM-SCRATCH, DIST, and SMALL-CHANGE, since we always have to link the whole program in these cases. In the SMALL-CHANGE scenario, linking will be a bigger portion of overall compile time than in the other two. For RLS we don't do any linking.
  • Turning a (pre-query) task into a (query) task will improve SMALL-CHANGE and RLS because we profit from incremental compilation in these scenarios. If done properly, it should not make a difference for the other scenarios.
  • Reducing the amount of work we generate for (llvm) will have the same effects as making (llvm) execute more quickly.
  • Reducing the time between the start of compiling a crate and the point where dependents of that crate can start compiling can bring superlinear compile-time speedups because it reduces contention in Cargo's parallel compilation flow.
  • Reducing the overhead for incremental compilation helps with SMALL-CHANGE and RLS and possibly FROM-SCRATCH.
  • Improving incr. comp. caching efficiency for LLVM artifacts helps with SMALL-CHANGE and possibly FROM-SCRATCH, but not DIST, which does not use incremental compilation, and RLS, which does not produce LLVM artifacts.
  • Improving generation of save-analysis data will help the RLS case, while this kind of data is not produced during any of the other scenarios.

Concrete Performance Improvement Initiatives

There are various concrete initiatives of various sizes that strive to improve the compiler's performance. Some of them are far along, some of them are just ideas that need to be validated before pursuing them further.

Incremental compilation

Work on supporting incremental re-compilation of programs has been ongoing for quite a while now and it is available on stable Rust. However, there are still many opportunities for improving it.

  • Status: "version 1.0" available on stable.
  • Current progress is tracked in #47660.
  • Affected usage scenarios: SMALL-CHANGE, RLS

Query parallelization

Currently, the compiler evaluates queries (which comprise a large part of the non-LLVM compilation phases) in a single-threaded fashion. However, since queries have a clear evaluation model that structures computations into a directed acyclic graph, it seems feasible to implement parallel evaluation for queries at the framework level. @Zoxc has even done a proof-of-concept implementation. This would potentially help with all usage scenarios, since all of them have to execute the query-based part of compilation.

  • Status: Preliminary work in progress, experimental
  • Current progress is tracked in #48685
  • Affected usage scenarios: FROM-SCRATCH, SMALL-CHANGE, RLS, DIST

MIR-only RLIBs

"MIR-only RLIBs" is what we call the idea of not generating any LLVM IR or machine code for RLIBs. Instead, all of this would be deferred to when executables, dynamic libraries, or static C libraries are generated. This potentially reduces the overall workload for compiling a whole crate graph and has some non-performance related benefits too. However, it might be detrimental in some other usage scenarios, especially if incremental compilation is not enabled.

  • Status: Blocked on query parallelization
  • Current progress is tracked in #38913
  • Affected usage scenarios: FROM-SCRATCH, SMALL-CHANGE, DIST

ThinLTO

ThinLTO is an LLVM mode that makes it possible to perform whole-program optimization in a mostly parallel fashion. It is supported by the Rust compiler and even enabled by default in some cases. It tries to reduce compile times by distributing the LLVM workload across more CPU cores. At the same time, the overall workload increases, so it can also have detrimental effects.

  • Status: available on stable, default for optimized builds
  • Current progress is tracked in #45320
  • Affected usage scenarios: FROM-SCRATCH, DIST

Sharing generic code between crates

Currently, the compiler will duplicate the machine code for generic functions within each crate that uses a specific monomorphization of a given function. If there is a lot of overlap this potentially means lots of redundant work. It should be investigated how much work and compile time could be saved by re-using monomorphizations across crate boundaries.

  • Status: Experimental implementation in #48779
  • Current progress is tracked in #47317
  • Affected usage scenarios: FROM-SCRATCH, SMALL-CHANGE, DIST

Sharing closures among generic instances

We duplicate code for closures within generic functions even if they do not depend on the generic parameters of the enclosing function. This leads to redundant work. We should try to be smarter about it.

  • Status: Unknown
  • Current progress is tracked in #46477
  • Affected usage scenarios: FROM-SCRATCH, SMALL-CHANGE, DIST

Perform inlining at the MIR-level

Performing at least some amount of inlining at the MIR-level would potentially reduce the pressure on LLVM. It would also reduce the overall amount of work to be done because inlining would only have to be done once for all monomorphizations of a function while LLVM has to redo the work for each instance. There is an experimental implementation of MIR-inlining but it is not production-ready yet.

  • Status: Experimental implementation exists on nightly, not stable or optimized yet
  • Current progress is tracked in #43542 (kind of)
  • Affected usage scenarios: FROM-SCRATCH, SMALL-CHANGE, DIST

Provide tools and instructions for profiling the compiler

Profiling is an indispensable part of performance optimization. We should make it as easy as possible to profile the compiler and get an idea of what it is spending its time on. That includes guides on how to use external profiling tools as well as improvements to the compiler's internal profiling facilities.

  • Status: Idea
  • Current progress is tracked in (nowhere yet)
  • Affected usage scenarios: FROM-SCRATCH, SMALL-CHANGE, RLS, DIST

Build released compiler binaries as optimized as possible

There is still headroom for turning on more optimizations when building the rustc release artifacts. Right now this is blocked by a mix of CI restrictions and possibly outdated restrictions on how Rust dylibs are built.

  • Status: In progress
  • Current progress is tracked in #49180
  • Affected usage scenarios: FROM-SCRATCH, SMALL-CHANGE, RLS, DIST

Feel free to leave comments below if there's anything you think is pertinent to this topic!

@nrc (Member) commented Mar 4, 2018

Expanding on the RLS scenario: this is kind of an extra dimension rather than a separate scenario because we care about initial build time (which is a 'from scratch' build) and about the incremental case (which is the 'small change' build). We do care more about the 'small change' case since that is fundamental to the RLS, but the startup time is really important too.

So, the differences between RLS and regular build:

  • we don't need to generate output code (as noted)
  • we emit save-analysis data for dependent crates (both generating and serialising this can take a significant amount of time; for example, I think we could make RLS builds significantly faster by moving from rustc_serialise to serde)
  • we supply some files from memory rather than from disk (and could potentially supply more data from memory if that improved compile times)
  • if compile tasks are long-running or we have to build several crates (e.g., editing files in different crates in a workspace project), we will be running multiple compile processes at once. And people are really sensitive to the RLS using a lot of CPU, therefore I'm a bit cynical about whether any parallelisation will be beneficial for the 'small change' RLS case (it will probably help a lot with startup though)

And some areas which I think would be desirable for the RLS:

  • incremental type checking, name resolution, macro expansion, parsing (we really need to push incremental compilation all the way to make it useful for the RLS)
  • skip later phases altogether, or return data before executing them (e.g., we only need name resolution info for some code completion, + trait resolution for other code completion, we don't need to borrow check in order to generate save-analysis data. However we do want to get diagnostics info from all phases at some point - but user tolerance for latency to errors is much greater than latency to code completion).
  • having the RLS manage the incremental compilation data and keep it in memory and pass to the compiler (and have the RLS keep all files in memory, rather than just the changed ones, though this is more of an RLS change)
  • improve save-analysis speed, probably using Serde is easiest, perhaps we can avoid serialising dep crates? Incremental save-analysis would be a huge win (I don't really know how this would work or how we'd get there, but it feels like it should be possible, and an order of magnitude easier than incremental compilation :-) )
  • 'very lazy' compilation - assume we're changing a function body and the rest of the program is unchanged, and we have the type info saved from a previous compilation. Then we'd like to compile just enough to get code completion info for the expression being edited - we don't even care about whether the rest of the function type checks, etc.
  • @matklad had some ideas about making parsing and libsyntax stuff much faster/useful for IDEs
  • not so much a pure performance issue, but one of the goals here is to support compiler-powered code completion. Once we think compilation can be fast enough, then we need to implement that and ensure that it can be done quickly - that will probably be mostly a performance challenge
@matklad (Member) commented Mar 4, 2018

Excellent summary, @nrc! I don't have anything else to add really, but I would like to give some input about relative priorities of features.

Incremental save-analysis would be a huge win (I don't really know how this would work or how we'd get there, but it feels like it should be possible, and an order of magnitude easier than incremental compilation :-) )

I think this should be prioritized, as it's very important for the overall RLS architecture. My understanding is that save-analysis is currently the bottleneck, which cancels out the "on-demand" features of rustc.

skip later phases altogether, or return data before executing them (e.g., we only need name resolution info for some code completion, + trait resolution for other code completion, we don't need to borrow check in order to generate save-analysis data. However we do want to get diagnostics info from all phases at some point - but user tolerance for latency to errors is much greater than latency to code completion).

This suggests cutting the work along compiler phases. A more important thing to do is to cut along the orthogonal dimension and run all compiler phases, but only for some code. That is what the "'very lazy' compilation" bullet is about, I think. The end goal here is to show all compilation errors for a single file (the one currently open in the editor) and not compute most of the errors for other files!

A useful analogy here is a parallel pipeline vs. a data-parallel task: the first scales up to the number of pipeline steps, which is fixed; the second is embarrassingly parallel.

Ultimately, queries and on-demand compilation should allow cutting work along both dimensions, if queries are exposed to the RLS in some form (i.e., if they don't go through the save-analysis bottleneck).

because we care about initial build time (which is a 'from scratch' build)

It seems to me that initial build for RLS could do much less work than the usual build: it should be enough to collect the items across the dependency graph, without checking their bodies.

@nrc (Member) commented Mar 5, 2018

It seems to me that initial build for RLS could do much less work than the usual build: it should be enough to collect the items across the dependency graph, without checking their bodies.

This is a great point! Though it's not quite so simple - we do need data about function internals for doing find-all-refs, etc. inside a function, but we could do that on demand if it were quick enough (which I think it would be if we could process just a single function body at a time, but that is predicated on very incremental compilation).

However, for dependent crates, we never need the function body information. We could make cargo check (and initial RLS indexing) really fast if we had an option to only compile 'header info' for metadata-only builds (maybe that is what @matklad meant). I don't think there is anything blocking this, so somebody could implement it right now.

@michaelwoerister (Contributor, Author) commented Mar 5, 2018

Thanks for the detailed feedback, @nrc & @matklad! I'll update the description of the RLS scenario. There's definitely more to it than to cargo check.

I'll open a sub-issue soon that will define the concrete benchmarks for measuring success in each of the scenarios. I suggest we continue the discussion about how to measure the RLS case there.

For concrete ideas on how to improve performance, I suggest that we open a separate tracking issue for RLS specific performance questions. I'd add this to the "Concrete Performance Improvement Initiatives" section then.

@Vurich commented Mar 13, 2018

For "Sharing closures among generic instances" I think it'd be good to generate just one struct per polymorphic function but give the closure type a type parameter iff it needs it. This would mean that closures would benefit from any work done on polymorphic functions.

So this:

```rust
fn do_a_thing<T>(foo: T) -> T {
    let my_fn = || foo;

    my_fn()
}
```

would desugar to this:

```rust
fn do_a_thing<T>(foo: T) -> T {
    struct Closure<U>(U);

    impl<U> FnOnce<()> for Closure<U> {
        type Output = U;

        fn call_once(self, _: ()) -> U {
            self.0
        }
    }

    let my_fn = Closure::<T>(foo);

    my_fn()
}
```

i.e. closures would get desugared before monomorphisation. I kinda just assumed this was how it worked anyway before seeing this issue. I don't know whether all that would be needed is removing redundant type parameters from the generated closure struct, but a remove-redundant-type-parameters pass would lead to even bigger wins than just the change proposed here.

@eddyb (Member) commented Mar 15, 2018

@Vurich That's how it works already, more or less. And yes, closures and polymorphic functions would benefit the same from "unused type parameters" analyses.
