
Release nightly compilers with ability to internally parallelize #59667

Open
alexcrichton opened this issue Apr 3, 2019 · 3 comments
Labels
A-parallel-queries T-compiler WG-compiler-performance

Comments

@alexcrichton
Member

@alexcrichton alexcrichton commented Apr 3, 2019

This is a tracking issue for releasing nightly compilers that have the ability to internally parallelize themselves but still default to single-threaded mode. This is part of the larger parallel compiler tracking issue, and is intended as an incremental step towards fully closing that out.

A recent attempt to build binaries of the parallel compiler led to the question of whether we could just enable a parallel compiler by default. Note that there are two axes we can change here over time:

  • Whether or not the compiler can be parallelized at all, aka whether it's built with the --cfg parallel_compiler flag.
  • Whether or not the compiler by default is parallelized, aka the default value of -Z threads

The proposal in this issue is to default to -Z threads=1 (or the moral equivalent) but build nightly compilers with --cfg parallel_compiler (or the equivalent thereof). The intention is to get us closer to shipping a parallel compiler while buying us time to continue to fix any issues that arise. This would allow, for example, for users to very easily test out parallel compilation locally by using RUSTFLAGS=-Zthreads=16.

The main blocker for doing this is performance. As pointed out in a recent thread, it's important not to watch the comparison of instruction counts but rather the wall-time numbers. The instruction counts regress only 2-3%, which looks deceptively good, but the wall-time numbers regress roughly 10-20%, which is much more serious.

Some further investigation shows that most of the slowdown is likely coming from the use of mutexes (as opposed to other avenues like removing parallel code, the overhead of using rayon, or using Arc instead of Rc).

The next step here is to investigate whether we can recover the performance lost to the mutexes, most likely by removing them one way or another.

This issue will likely receive many updates over time!

@alexcrichton alexcrichton added A-parallel-queries T-compiler WG-compiler-performance labels Apr 3, 2019
@alexcrichton
Member Author

@alexcrichton alexcrichton commented Apr 3, 2019

@Zoxc do you know off the top of your head what some hot mutexes might be? Some local profiling of a compiler from #59644 and the commit just before is not very illuminating, while everything does get a bit slower it's hard to see where it's getting slower.

It does look like get_query (presumably this lock?) is pretty hot, but that also seems somewhat fundamental.

@HadrienG2

@HadrienG2 HadrienG2 commented Apr 3, 2019

You may want to try this lock contention profiling tool and see if it works for you: http://0pointer.de/blog/projects/mutrace.html .

@Aaron1011
Member

@Aaron1011 Aaron1011 commented Jul 5, 2019

As a temporary workaround, we could try doing something similar to the fragile crate. At runtime, we would inspect -Z threads:

  1. If -Z threads > 1, we use a normal Mutex.
  2. If -Z threads = 1, we use a 'fake' mutex - a type which implements Send/Sync, but panics if used on any thread other than the one which created it. Since only one thread should ever be accessing these Mutexes, the panic should never actually occur.

Hopefully, the overhead of these runtime checks would be much less than the overhead of a full Mutex. This would allow a parallelizable compiler to be shipped while we continue to work on improving single-threaded performance with actual Mutexes.

bors added a commit to rust-lang-ci/rust that referenced this issue Nov 14, 2020
Build the compiler with -Ctarget-cpu=x86-64-v2

This PR instructs rustbuild to compile the compiler with the `-Ctarget-cpu=x86-64-v2` option enabled in hope of getting some optimization gains from autovectorization.

The PR also adds support for `x86-64-{2,3,4}` target CPUs by [backporting an LLVM 12.0 commit](llvm/llvm-project@012dd42e027e).

I'm opening this to get a perf run to gauge the potential speedups on the rustc side. The LLVM side isn't built with the option enabled, as that would require clang 12.0 or manual enabling of the target features corresponding to the target CPU.

If the perf run shows nice improvements, we can talk about how to get this to users. We can't just enable this unconditionally for all users, as it would break for users of older CPUs. It's a similar question to rust-lang#59667.