
parallelize LLVM optimization and codegen passes #16367

Merged
merged 14 commits into rust-lang:master from epdtry:parallel-codegen on Sep 6, 2014

Conversation

@epdtry
Contributor

epdtry commented Aug 8, 2014

This branch adds support for running LLVM optimization and codegen on different parts of a crate in parallel. Instead of translating the crate into a single LLVM compilation unit, rustc now distributes items in the crate among several compilation units, and spawns worker threads to optimize and codegen each compilation unit independently. This improves compile times on multicore machines, at the cost of worse performance in the compiled code. The intent is to speed up build times during development without sacrificing too much optimization.

On the machine I tested this on, librustc build time with -O went from 265 seconds (master branch, single-threaded) to 115s (this branch, with 4 threads), a speedup of 2.3x. For comparison, the build time without -O was 90s (single-threaded). Bootstrapping rustc using 4 threads gets a 1.6x speedup over the default settings (870s vs. 1380s), and building librustc with the resulting stage2 compiler takes 1.3x as long as the master branch (44s vs. 55s, single threaded, ignoring time spent in LLVM codegen).

The user-visible changes from this branch are two new codegen flags:

  • -C codegen-units=N: Distribute items across N compilation units.
  • -C codegen-threads=N: Spawn N worker threads for running optimization and codegen. (It is possible to set codegen-threads larger than codegen-units, but this is not very useful.)
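For example, a development build that splits the crate four ways and uses four workers could be invoked roughly as follows (illustrative command line; the source path is a placeholder):

    rustc -O -C codegen-units=4 -C codegen-threads=4 src/lib.rs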

Internal changes to the compiler are described in detail on the individual commit messages.

Note: The first commit on this branch is copied from #16359, which this branch depends on.

r? @nick29581

@metajack

Contributor

metajack commented Aug 8, 2014

Have you clocked the resulting code performance difference? This seems like it would be great for servo.

@brson

Contributor

brson commented Aug 8, 2014

Very excited about this.

I don't see the words 'codegen-threads' in this patch. Are you sure it exists? What happens when you specify --codegen-units but not --codegen-threads?

@liigo

Contributor

liigo commented Aug 9, 2014

Replace the two with a single --codegen-tasks?

On Aug 9, 2014, at 7:45 AM, "Brian Anderson" notifications@github.com wrote:

Very excited about this.

I don't see the words 'codegen-threads' in this patch. Are you sure it
exists? What happens when you specify --codegen-units but not
--codegen-threads?

// For LTO purposes, the bytecode of this library is also
// inserted into the archive. We currently do this only when
// codegen_units == 1, so we don't have to deal with multiple
// bitcode files per crate.


@alexcrichton

alexcrichton Aug 9, 2014

Member

Could each module be linked into one module to be emitted? (is that too timing-intensive?)

Otherwise, could we name the files bytecodeN.deflate and list how many bytecodes are in the metadata?


@epdtry

epdtry Aug 11, 2014

Author Contributor

Could each module be linked into one module to be emitted? (is that too timing-intensive?)

I haven't tried this yet. I don't think it would take much longer than we already spend linking the object files together. The only problem is, all the LLVM modules are in separate contexts, so we would need to serialize each one and then deserialize into a shared context for linking.

Otherwise, could we name the files bytecodeN.deflate and list how many bytecodes are in the metadata?

Yeah, this is my preferred solution (which I also haven't tried implementing yet).

cmd.stdin(::std::io::process::Ignored)
   .stdout(::std::io::process::InheritFd(1))
   .stderr(::std::io::process::InheritFd(2));
cmd.status().unwrap();


@alexcrichton

alexcrichton Aug 9, 2014

Member

In the past I have seen ld -r produce some very odd object files that turn out to not work at all. I initially intended for ld -r to be used to link together the rust object with all native static libraries to produce one object to put inside of an rlib (I even went as far as to hope that the object could be the rlib itself). I ended up abandoning that strategy once I saw that ld -r was flaky across platforms.

Sadly I don't quite remember what errors I was seeing, nor why I abandoned the strategy (specific errors encountered). We may want to not perform ld -r in the meantime. On the other hand, have you confirmed this object to be usable? It may be difficult to verify that as you'd probably still need to link to upstream libraries and such, but it may be a good smoke test!


@epdtry

epdtry Aug 11, 2014

Author Contributor

have you confirmed this object to be usable?

Yes, it works fine on Linux at least. I can bootstrap rustc and get nearly all tests to pass with codegen-units > 1. (The failing tests are either LTO-related or require a single bitcode file per crate.)

I also added a few run-pass tests that use codegen-units > 1. These should fail if ld -r produces bad objects.

@alexcrichton

Member

alexcrichton commented Aug 9, 2014

Could you describe some of the difficulties with sharing the Session across worker threads? Was it mainly that Rc is used liberally inside of it? If so, do you think it would ever be feasible to share the Session in the worker threads?

@alexcrichton

Member

alexcrichton commented Aug 9, 2014

This is also some super amazing work, I'm incredibly excited to see where this goes! Major props @epdtry! 🐟

}

match config.opt_level {
Some(opt_level) => {


@alexcrichton

alexcrichton Aug 9, 2014

Member

While you're at it, could you 4-space tab this match?

Some(sess) =>
    if sess.lto() {
        let reachable = cgcx.reachable
            .expect("reachable should be Some if sess is Some");


@alexcrichton

alexcrichton Aug 9, 2014

Member

Would it be too painful to store Option<(sess, reachable)>?

llvm::LLVMWriteBitcodeToFile(llmod, buf);
match cgcx.sess {
    Some(sess) =>
        if sess.lto() {


@alexcrichton

alexcrichton Aug 9, 2014

Member

You could take this indent down one level with Some(sess) if sess.lto() => {

});
}

time(config.time_passes, "codegen passes", (), |()| {


@alexcrichton

alexcrichton Aug 9, 2014

Member

Shouldn't this time include all the if-statements above as well?

trans: &CrateTranslation,
output_types: &[OutputType],
crate_output: &OutputFilenames) {
    let mut cmd = Command::new("ld");


@alexcrichton

alexcrichton Aug 9, 2014

Member

Could this be declared closer to where it's being used? I'm a little worried about hardcoding "ld" as well, but I have worries about invoking ld -r anyway down below.

});
}

if sess.opts.cg.codegen_threads == 1 {


@alexcrichton

alexcrichton Aug 9, 2014

Member

In theory it would be nice to unify these two code paths. Due to how CodegenContext is created though, that may not be possible.



// Populate a queue with a list of codegen tasks.
let mut work_queue = RingBuf::with_capacity(1 + trans.modules.len());


@alexcrichton

alexcrichton Aug 9, 2014

Member

Is it actually important that this is a FIFO queue?

}
}

unsafe fn optimize_and_codegen(cgcx: &CodegenContext,


@alexcrichton

alexcrichton Aug 9, 2014

Member

This is a pretty big function to have the entire thing be unsafe. Would it be possible to limit the scopes more? If it ends up having an unsafe block on every other line it's probably not worth it!


@alexcrichton

alexcrichton Aug 9, 2014

Member

Hm, looking through, this may not be worth it. Could you add a comment saying why it's unsafe?


// Make sure to fail the worker so the main thread can tell
// that there were errors.
cgcx.handler.abort_if_errors();


@alexcrichton

alexcrichton Aug 9, 2014

Member

Should this happen right after work runs in case it generates errors?


@epdtry

epdtry Aug 11, 2014

Author Contributor

OK, sure. We might show fewer errors per compiler invocation, but that's probably fine, since most of the errors you can get at this point are things like "ran out of disk space" which will show up during other work operations as well.

    llvm::LLVMWriteBitcodeToFile(llmod, buf);
});
llvm::LLVMDisposeModule(llmod);
llvm::LLVMContextDispose(llcx);


@alexcrichton

alexcrichton Aug 9, 2014

Member

This seems... odd?


@epdtry

epdtry Aug 11, 2014

Author Contributor

Whoops, forgot to remove that when I changed the LTO bitcode handling.

output.with_extension("lto.bc").with_c_str(|buf| {
    llvm::LLVMWriteBitcodeToFile(llmod, buf);
})
pub fn run_passes(sess: &Session,


@alexcrichton

alexcrichton Aug 9, 2014

Member

This function is becoming quite huge, could it be broken apart? The same kinda applies to the write module at this point. It probably shouldn't be defined in back/link.rs, but rather in separate submodules. Feel free to reorganize things!

@asterite


asterite commented Aug 9, 2014

I don't know if it's applicable, but the way we do it in Crystal is to have one llvm module for each "logical unit". In our case each logical unit is a class or a module. Maybe in Rust a "logical unit" is a struct, an array, etc., together with all its impls. Then you can also have another logical unit for the top-level functions.

Then we fire up N threads and each one takes a task (an llvm module) to compile. This greatly reduces the compilation time. When you split your whole program into N modules and fire up N threads to compile those (as you are proposing here), a thread that finishes early is left without a job to do, so it sits idle. With M smaller modules and N threads, where M > N, a thread that finishes can start working on another module, reducing the idle time.
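A minimal sketch of that M-modules/N-threads scheme, written in present-day Rust rather than the 2014 dialect used in this PR (the module names, the compile step, and all identifiers here are placeholders, not code from Crystal or from this branch):

use std::sync::{mpsc, Arc, Mutex};
use std::thread;

fn compile_all(modules: Vec<String>, n_threads: usize) {
    // Queue of pending modules; workers pull from it until it drains.
    let (tx, rx) = mpsc::channel::<String>();
    for m in modules {
        tx.send(m).unwrap();
    }
    drop(tx); // close the channel so recv() errors once the queue is empty

    let rx = Arc::new(Mutex::new(rx));
    let workers: Vec<_> = (0..n_threads)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || loop {
                // Take the next module; the lock is released before compiling.
                let job = rx.lock().unwrap().recv();
                match job {
                    Ok(_module) => { /* run LLVM passes + codegen for _module */ }
                    Err(_) => break, // no work left; the worker exits
                }
            })
        })
        .collect();

    for w in workers {
        w.join().unwrap();
    }
}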

Additionally, before compiling each module we write its bitcode to a .bc file in a hidden directory (.crystal, in our case). We then compare that .bc file to the .bc file generated by the previous run. If they turn out to be the same (and this will be true as long as you don't modify any impl of that logical unit), we can safely reuse the .o file from the previous run. This, again, dramatically reduces the time to recompile a project that has only minimal changes.

Bits of the source code implementing this behaviour are here and here, in case you want to take a look.
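The caching check itself can be sketched in a few lines (again today's Rust, with made-up paths; this restates the Crystal idea and is not something this branch implements):

use std::fs;
use std::path::Path;

/// Reuse the previous run's object file if the freshly written bitcode is
/// byte-identical to the cached bitcode from the last run.
fn can_reuse_object(new_bc: &Path, cached_bc: &Path, cached_obj: &Path) -> bool {
    match (fs::read(new_bc), fs::read(cached_bc)) {
        (Ok(fresh), Ok(old)) => fresh == old && cached_obj.exists(),
        _ => false, // first build, or the cache is missing
    }
}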

@vadimcn

Contributor

vadimcn commented Aug 9, 2014

^THIS^ !!! Please, please implement incremental compilation!

Though I wonder if (transitively) unchanged modules could be culled right after resolution pass, based on source file timestamps and inter-module dependency info, which should be available at that point.
But even if they are culled after translation, it would be a major boon in day-to-day development.

@nrc

Member

nrc commented Aug 9, 2014

I would say that the 'logical unit' in Rust is a module; they tend to be fairly small (at least relative to crate size) and are naturally self-contained. It is probably worth getting data (at least) for smaller units - thanks for the idea!

Incremental compilation is the next part of the project - looking forward to what comes out of that :-)

@epdtry

Contributor Author

epdtry commented Aug 11, 2014

@metajack:

Have you clocked the resulting code performance difference?

Compiling rustc and all libraries using 4 compilation units produces a rustc that takes about 25% longer to run.

@brson:

I don't see the words 'codegen-threads' in this patch. Are you sure it exists?

It's a codegen flag, so the flag name codegen-threads is generated by a macro from the variable name codegen_threads.
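Roughly, the options macro derives the user-facing name by swapping underscores for hyphens; a stand-in for that mapping (not the actual macro) looks like:

fn flag_name(field: &str) -> String {
    // `codegen_threads` (the struct field) becomes `codegen-threads` (the -C flag).
    field.replace('_', "-")
}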

What happens when you specify --codegen-units but not --codegen-threads?

rustc generates several compilation units, then runs optimization and codegen for them all sequentially.

@alexcrichton:

Could you describe some of the difficulties with sharing the Session and across worker threads?

Like most of rustc's major data structures, Session uses RefCell all over the place. I suppose we could share it using a Mutex, if we changed how the ownership is handled and were careful about the lifetimes of the mutex guards.

@asterite:

When you split your whole program in N modules and fire up N threads to compile those (as you are proposing here), if a thread finishes early it's left without a job to do, so a thread becomes idle.

One way I tried to address this problem was by adding some basic load balancing: rustc tries to make each LLVM module roughly the same size, so that each worker thread gets the same amount of work to do. I also made codegen-units and codegen-threads separate flags so that you can have several smaller modules per worker thread. (Though in the testing I've done so far, it doesn't seem to help.)
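The balancing rule is simple enough to show in a few lines (an illustrative stand-in, not the code in this branch): each compilation unit keeps a running LLVM-instruction count, and translation of the next module goes to whichever unit is currently smallest.

fn pick_unit(instr_counts: &[usize]) -> usize {
    // Index of the compilation unit with the fewest LLVM instructions so far.
    instr_counts
        .iter()
        .enumerate()
        .min_by_key(|&(_, count)| *count)
        .map(|(idx, _)| idx)
        .unwrap_or(0)
}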

@vadimcn:

Though I wonder if (transitively) unchanged modules could be culled right after resolution pass, based on source file timestamps and inter-module dependency info, which should be available at that point.

This is basically my next project. Translation is the #2 time sink in rustc (after LLVM passes), so culling modules (or finer-grained items) before translation seems like the way to go.

@epdtry

Contributor Author

epdtry commented Aug 11, 2014

@alexcrichton:
I think I've fixed all the things you mentioned, except that I haven't implemented LTO against separately compiled libraries yet.

Regarding ld -r (since Github has unhelpfully collapsed that line comment), I haven't seen any problems yet on Linux or OSX. On both I have bootstrapped rustc and run the test suite normally with no problems. On Linux I have also run the test suite with codegen-units > 1, also with no problems. I haven't tested it on Windows yet.

@alexcrichton

Member

alexcrichton commented Aug 12, 2014

Like most of rustc's major data structures, Session uses RefCell all over the place. I suppose we could share it using a Mutex, if we changed how the ownership is handled and were careful about the lifetimes of the mutex guards.

I would definitely expect an Arc<Mutex<Session>> to be passed around (maybe Option<Session> so it could be unwrapped). I'm not entirely sure if this could be done because Rc<T> isn't Send, and I think that the session has a bunch of Rc pointers, but I'm not sure how hard it would be to get rid of those.

It looked like it would make parts of this much nicer to have access to the raw session rather than duplicating some logic here and there, but it may not be worth it in the end.
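For what it's worth, the shape being discussed looks something like this in today's Rust (FakeSession is a stand-in; the real Session holds Rc pointers, which are not Send, so it could not be shared this way without first removing those):

use std::sync::{Arc, Mutex};
use std::thread;

struct FakeSession {
    lto: bool, // stand-in field; not rustc's Session
}

fn main() {
    let sess = Arc::new(Mutex::new(FakeSession { lto: false }));
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let sess = Arc::clone(&sess);
            thread::spawn(move || {
                // Each worker locks the shared session only for as long as it needs it.
                let guard = sess.lock().unwrap();
                let _ = guard.lto;
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}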

Regarding ld -r (since Github has unhelpfully collapsed that line comment), I haven't seen any problems yet on Linux or OSX

I don't think that any of our tests actually use the object file emitted, they just emit it. I also recall that the linker always succeeded in creating an object, but the object itself was just unusable (for one reason or another). Again though, this could all just be misremembering, or some bug which has since been fixed!

@epdtry

Contributor Author

epdtry commented Aug 12, 2014

I don't think that any of our tests actually use the object file emitted, they just emit it.

The run-pass tests link the object into an executable, run the resulting binary, and check that it works. At least one step in that process should fail if ld -r emits a bad object file.

@alexcrichton

Member

alexcrichton commented Aug 12, 2014

Oh dear, I must be overlooking a test! I only see two instances of emit=.*obj in the codebase, one is the output-type-permutations run-make test (no linking involved there), and the other is the codegen tests (no linking involved either). What was the test that uses the output of ld -r?

@epdtry

Contributor Author

epdtry commented Aug 12, 2014

OK, let me back up. I think the relevant part of the design was unclear.

On the master branch, rustc produces a single object file crate.o. Then it feeds crate.o into the linker to produce an executable or shared object.

On this branch, rustc produces several object files crate.0.o, crate.1.o, etc. It feeds those into ld -r to produce a combined object file crate.o. Then crate.o is used to produce the final executable/library just like before. (That's why this branch does not need any changes to link_dylib and such.)

So, on this branch, any test that involves compiling and running Rust code will end up using ld -r as part of the linking process.
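In other words, the partial-link step amounts to something like the following (a sketch with placeholder paths, not the code from the branch):

use std::process::Command;

/// Combine crate.0.o, crate.1.o, ... into a single crate.o with `ld -r`,
/// so the existing link steps (link_dylib and such) see one object as before.
fn combine_objects(objects: &[&str], output: &str) {
    let status = Command::new("ld")
        .arg("-r")             // relocatable (partial) link
        .arg("-o")
        .arg(output)           // e.g. "crate.o"
        .args(objects)         // e.g. ["crate.0.o", "crate.1.o"]
        .status()
        .expect("failed to spawn ld");
    assert!(status.success(), "ld -r failed");
}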

@alexcrichton

Member

alexcrichton commented Aug 12, 2014

Oh wow, I missed that entirely, I thought it was only used for OutputTypeObject! Sorry I missed that!

In that case, I'm definitely willing to trust ld -r.

codegen_units: uint = (1, parse_uint,
    "divide crate into N units for optimization and codegen"),
codegen_threads: uint = (1, parse_uint,
    "number of worker threads to use when running codegen"),


@nrc

nrc Aug 21, 2014

Member

Given there was no benefit to having different values here, let's just have one option.

@nrc

Member

nrc commented Aug 23, 2014

OK, looks good! r=me with all the changes (most of which are nits, TBH) and with Alex's review. @alexcrichton r? (specifically the stuff in back and concerning linking, about which I have no idea).

@alexcrichton

Member

alexcrichton commented Aug 30, 2014

Well then, that's a new segfault I've never seen before!

@l0kod l0kod referenced this pull request Aug 31, 2014

Closed

Incremental recompilation #2369

epdtry added some commits Sep 5, 2014

split CrateContext into shared and local pieces
Break up `CrateContext` into `SharedCrateContext` and `LocalCrateContext`.  The
local piece corresponds to a single compilation unit, and contains all
LLVM-related components.  (LLVM data structures are tied to a specific
`LLVMContext`, and we will need separate `LLVMContext`s to safely run
multithreaded optimization.)  The shared piece contains data structures that
need to be shared across all compilation units, such as the `ty::ctxt` and some
tables related to crate metadata.
run optimization and codegen on worker threads
Refactor the code in `llvm::back` that invokes LLVM optimization and codegen
passes so that it can be called from worker threads.  (Previously, it used
`&Session` extensively, and `Session` is not `Share`.)  The new code can handle
multiple compilation units, by compiling each unit to `crate.0.o`, `crate.1.o`,
etc., and linking together all the `crate.N.o` files into a single `crate.o`
using `ld -r`.  The later linking steps can then be run unchanged.

The new code preserves the behavior of `--emit`/`-o` when building a single
compilation unit.  With multiple compilation units, the `--emit=asm/ir/bc`
options produce multiple files, so combinations like `--emit=ir -o foo.ll` will
not actually produce `foo.ll` (they instead produce several `foo.N.ll` files).

The new code supports `-Z lto` only when using a single compilation unit.
Compiling with multiple compilation units and `-Z lto` will produce an error.
(I can't think of any good reason to do such a thing.)  Linking with `-Z lto`
against a library that was built as multiple compilation units will also fail,
because the rlib does not contain a `crate.bytecode.deflate` file.  This could
be supported in the future by linking together the `crate.N.bc` files produced
when compiling the library into a single `crate.bc`, or by making the LTO code
support multiple `crate.N.bytecode.deflate` files.
translate into multiple llvm contexts
Rotate between compilation units while translating.  The "worker threads"
commit added support for multiple compilation units, but only translated into
one, leaving the rest empty.  With this commit, `trans` rotates between various
compilation units while translating, using a simple strategy: upon entering a
module, switch to translating into whichever compilation unit currently
contains the fewest LLVM instructions.

Most of the actual changes here involve getting symbol linkage right, so that
items translated into different compilation units will link together properly
at the end.
reuse original symbols for inlined items
When inlining an item from another crate, use the original symbol from that
crate's metadata instead of generating a new symbol using the `ast::NodeId` of
the inlined copy.  This requires exporting symbols in the crate metadata in a
few additional cases.  Having predictable symbols for inlined items will be
useful later to avoid generating duplicate object code for inlined items.
avoid duplicate translation of monomorphizations, drop glue, and visit glue

Use a shared lookup table of previously-translated monomorphizations/glue
functions to avoid translating those functions in every compilation unit where
they're used.  Instead, the function will be translated in whichever
compilation unit uses it first, and the remaining compilation units will link
against that original definition.
make symbols internal when possible
Add a post-processing pass to `trans` that converts symbols from external to
internal when possible.  Translation with multiple compilation units initially
makes most symbols external, since it is not clear when translating a
definition whether that symbol will need to be accessed from another
compilation unit.  This final pass internalizes symbols that are not reachable
from other crates and not referenced from other compilation units, so that LLVM
can perform more aggressive optimizations on those symbols.
make separate compilation respect #[inline] attributes
Adjust the handling of `#[inline]` items so that they get translated into every
compilation unit that uses them.  This is necessary to preserve the semantics
of `#[inline(always)]`.

Crate-local `#[inline]` functions and statics are blindly translated into every
compilation unit.  Cross-crate inlined items and monomorphizations of
`#[inline]` functions are translated the first time a reference is seen in each
compilation unit.  When using multiple compilation units, inlined items are
given `available_externally` linkage whenever possible to avoid duplicating
object code.

@epdtry epdtry force-pushed the epdtry:parallel-codegen branch from 6a60448 to 6d2d47b Sep 5, 2014

@epdtry

Contributor Author

epdtry commented Sep 5, 2014

Older versions of OSX's ld64 linker parse object files using variable-size stack-allocated buffers for some temporary data structures. The bus error seen on 6a60448 occurs because the object file contains too much stuff (mainly, too many unwinding table entries), and those stack allocated buffers overflow the 8MB stack limit of the parser thread. This 8MB stack size is hard-coded inside ld64, so we can't work around the bug by bumping up stack size with ulimit -s.

On master, the librustc build works fine because rustc.o requires about 5MB of stack to parse. This branch triggers a stack overflow because it uses ld -r to generate rustc.o (even with -C codegen-units=1), and ld -r adds a __compact_unwind section to the generated object file. Parsing librustc's __compact_unwind section uses an additional 4MB of stack, which puts the parser thread over its 8MB limit. There is an undocumented flag -no_compact_unwind which is supposed to suppress the generation of the __compact_unwind section, but this flag is ignored when passed in combination with -r.

Newer versions of ld64 fix the stack overflow bug, by having the object file parser use malloc when the required buffer size is large. Unfortunately, according to Wikipedia, the fixed ld64 versions (224.1+) are available only with XCode 5+, for OSX 10.8+, while Rust is supposed to support building on OSX 10.7. I'm not sure if there is any way to install newer ld64 on older versions of OSX.

The latest commit on this branch avoids running ld -r when building with only a single compilation unit (which is probably a good idea regardless of the ld64 bug). This will let librustc build without errors (giant object file, but no ld -r doubling its stack use), and the separate compilation tests should also pass (ld -r, but tiny object files). It doesn't fix the underlying problem, though - if anyone using XCode 4 tries to build a large crate with parallel codegen enabled, they will get a nasty segfault from the linker. (Though note that rustc master can already trigger the same error without ld -r, for crates with about twice as many functions as librustc.)

@alexcrichton

Member

alexcrichton commented Sep 6, 2014

@epdtry, oh my, that is quite the investigation! That's quite unfortunate that we'll segfault on older versions of OSX. It looks like there's not a whole lot we can do right now though. I'm sad that this may mean that we have to turn off parallel codegen for rustc itself by default (at least for osx), but we can cross that bridge later!

@alexcrichton

Member

alexcrichton commented Sep 6, 2014

Also, major major props for that investigation, that must have been quite a beast to track down!

@alexcrichton


alexcrichton commented on 6d2d47b Sep 6, 2014

r+

@bors

Contributor

bors commented on 6d2d47b Sep 6, 2014

saw approval from alexcrichton
at epdtry@6d2d47b

Contributor

bors replied Sep 6, 2014

merging epdtry/rust/parallel-codegen = 6d2d47b into auto

Contributor

bors replied Sep 6, 2014

epdtry/rust/parallel-codegen = 6d2d47b merged ok, testing candidate = 4bea7b3

Contributor

bors replied Sep 6, 2014

fast-forwarding master to auto = 4bea7b3

bors added a commit that referenced this pull request Sep 6, 2014

auto merge of #16367 : epdtry/rust/parallel-codegen, r=alexcrichton

@bors bors closed this Sep 6, 2014

@bors bors merged commit 6d2d47b into rust-lang:master Sep 6, 2014

1 of 2 checks passed

continuous-integration/travis-ci The Travis CI build failed
default all tests passed
@l0kod

Contributor

l0kod commented Sep 6, 2014

Awesome work! Looks great for an efficient #2369.

Though I wonder if (transitively) unchanged modules could be culled right after resolution pass, based on source file timestamps and inter-module dependency info, which should be available at that point.

I think the Ninja build system uses hashes of the compilation commands and source files (all dependencies) instead of relying on timestamps. @bors should like it ;)
Shake can do it as well for source files: http://neilmitchell.blogspot.fr/2014/06/shake-file-hashesdigests.html

@japaric

Member

japaric commented Sep 6, 2014

Q: Is this flag ignored if --test is passed to the compiler?

I just tried rustc --test -L target/deps -C codegen-units=8 src/lib.rs on my library that has 300+ tests and the compile time is still 20 seconds, and CPU usage remained at 100% (one thread).

Did I do something wrong? (Also -C codegen-threads=8 returns error: unknown codegen option)

@epdtry

Contributor Author

epdtry commented Sep 6, 2014

@japaric,

The flag is not ignored, it's just that your library is small enough that it doesn't get much benefit from this patch (especially when optimization is turned off).

With -C codegen-units=1 (the default):

time: 1.879 s   translation
  time: 0.142 s llvm function passes
  time: 0.067 s llvm module passes
  time: 3.547 s codegen passes
  time: 0.000 s codegen passes
time: 4.264 s   LLVM passes
  time: 0.408 s running linker
time: 0.409 s   linking

real    0m16.718s

And with -C codegen-units=4:

time: 2.927 s   translation
time: 0.054 s   llvm function passes
time: 0.055 s   llvm function passes
time: 0.056 s   llvm function passes
time: 0.055 s   llvm function passes
time: 0.022 s   llvm module passes
time: 0.025 s   llvm module passes
time: 0.025 s   llvm module passes
time: 0.026 s   llvm module passes
time: 1.422 s   codegen passes
time: 1.443 s   codegen passes
time: 1.448 s   codegen passes
time: 0.000 s   codegen passes
time: 1.474 s   codegen passes
time: 1.875 s   LLVM passes
  time: 0.489 s running linker
time: 0.492 s   linking

real    0m15.472s

Since rustc spends only 4 seconds in LLVM passes to begin with, there is not much room for improvement. Setting codegen-units=4 reduces the time by about 2.5s, but also slows down translation and linking, so the overall benefit is tiny.

Also -C codegen-threads=8 returns error: unknown codegen option

I removed that flag because in my testing I found no benefit from setting codegen-threads != codegen-units.

@japaric

Member

japaric commented Sep 6, 2014

@epdtry Thanks for the detailed info!

It seems that the bottleneck is the type checking phase in my particular case, and now I'm wondering if spending 50%+ of the time in that phase is normal (but that's off-topic).
