
[WIP] Add support for using a jobserver with Rayon #56946

Open · wants to merge 1 commit into base: master

@Zoxc (Contributor) commented Dec 18, 2018

The Rayon changes are here: Zoxc/rayon#2

cc @alexcrichton
r? @nikomatsakis

@rust-highfive (Collaborator) commented Dec 18, 2018

The job mingw-check of your PR failed on Travis (raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log.
travis_time:end:053ee1e0:start=1545120573675720183,finish=1545120574678870273,duration=1003150090
$ git checkout -qf FETCH_HEAD
travis_fold:end:git.checkout

Encrypted environment variables have been removed for security reasons.
See https://docs.travis-ci.com/user/pull-requests/#pull-requests-and-security-restrictions
$ export SCCACHE_BUCKET=rust-lang-ci-sccache2
$ export SCCACHE_REGION=us-west-1
Setting environment variables from .travis.yml
$ export IMAGE=mingw-check
---
[00:03:25]     Checking arena v0.0.0 (/checkout/src/libarena)
[00:03:25]     Checking syntax_pos v0.0.0 (/checkout/src/libsyntax_pos)
[00:03:26]     Checking rustc_errors v0.0.0 (/checkout/src/librustc_errors)
[00:03:39]     Checking syntax_ext v0.0.0 (/checkout/src/libsyntax_ext)
[00:03:43] error[E0425]: cannot find function `continue_unblocked` in module `rayon_core`
[00:03:43]    --> src/librustc/ty/query/job.rs:224:21
[00:03:43] 224 |         rayon_core::continue_unblocked();
[00:03:43]     |                     ^^^^^^^^^^^^^^^^^^ not found in `rayon_core`
[00:03:43] 
[00:04:02] error: aborting due to previous error
[00:04:02] error: aborting due to previous error
[00:04:02] 
[00:04:02] For more information about this error, try `rustc --explain E0425`.
[00:04:02] error: Could not compile `rustc`.
[00:04:02] 
[00:04:02] To learn more, run the command again with --verbose.
[00:04:02] command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo" "check" "--target" "x86_64-unknown-linux-gnu" "-j" "4" "--release" "--color" "always" "--features" "" "--manifest-path" "/checkout/src/rustc/Cargo.toml" "--message-format" "json"
[00:04:02] failed to run: /checkout/obj/build/bootstrap/debug/bootstrap check
[00:04:02] Build completed unsuccessfully in 0:03:01
travis_time:end:2f018950:start=1545120583159960648,finish=1545120825707478576,duration=242547517928
The command "stamp sh -x -c "$RUN_SCRIPT"" exited with 1.
---
travis_time:end:1a302ae4:start=1545120826103270363,finish=1545120826107763497,duration=4493134
travis_fold:end:after_failure.3
travis_fold:start:after_failure.4
travis_time:start:0012bf28
$ ln -s . checkout && for CORE in obj/cores/core.*; do EXE=$(echo $CORE | sed 's|obj/cores/core\.[0-9]*\.!checkout!\(.*\)|\1|;y|!|/|'); if [ -f "$EXE" ]; then printf travis_fold":start:crashlog\n\033[31;1m%s\033[0m\n" "$CORE"; gdb --batch -q -c "$CORE" "$EXE" -iex 'set auto-load off' -iex 'dir src/' -iex 'set sysroot .' -ex bt -ex q; echo travis_fold":"end:crashlog; fi; done || true
travis_fold:end:after_failure.4
travis_fold:start:after_failure.5
travis_time:start:00b8ee70
travis_time:start:00b8ee70
$ cat ./obj/build/x86_64-unknown-linux-gnu/native/asan/build/lib/asan/clang_rt.asan-dynamic-i386.vers || true
cat: ./obj/build/x86_64-unknown-linux-gnu/native/asan/build/lib/asan/clang_rt.asan-dynamic-i386.vers: No such file or directory
travis_fold:end:after_failure.5
travis_fold:start:after_failure.6
travis_time:start:0c674736
$ dmesg | grep -i kill

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

@alexcrichton (Member) commented Dec 18, 2018

I'm not too familiar with rayon internals, but can you describe at a high level the strategy for managing the jobserver tokens?

@Zoxc (Contributor) commented Dec 18, 2018

The strategy is to call Proxy::return_token before blocking and Proxy::acquire_token when blocking is complete (on both the main thread and in Rayon worker threads). The Proxy type papers over the differences between the implicit token given to the main thread and Acquired tokens, allowing the main thread to give its token over to the thread pool.
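
For readers not following the Rayon branch, here is a minimal sketch of what such a type could look like. It is not the actual code from the branch: everything except the `Proxy::acquire_token`/`Proxy::return_token` names is hypothetical, and it assumes the `jobserver` crate's `Client`/`Acquired` types.

```rust
use std::sync::Mutex;

use jobserver::{Acquired, Client};

// One unit of permission to run work: either the implicit token every
// process has for free, or an explicit token acquired from the jobserver.
enum Token {
    Implicit,
    Explicit(Acquired),
}

// Hands a token to a thread that is about to do work and takes it back
// before the thread blocks, so a blocked thread never pins a jobserver slot.
struct Proxy {
    client: Client,
    // The implicit token is parked here whenever no thread is using it.
    implicit: Mutex<Option<Token>>,
}

impl Proxy {
    // Called when a thread is about to start doing real work.
    fn acquire_token(&self) -> Token {
        // Prefer handing out the parked implicit token; otherwise block on
        // the jobserver for an explicit one.
        if let Some(token) = self.implicit.lock().unwrap().take() {
            return token;
        }
        let acquired = self.client.acquire().expect("failed to acquire jobserver token");
        Token::Explicit(acquired)
    }

    // Called right before a thread blocks or goes to sleep.
    fn return_token(&self, token: Token) {
        match token {
            // The implicit token can't be written back to the jobserver
            // pipe, so park it for the next acquire_token call.
            Token::Implicit => *self.implicit.lock().unwrap() = Some(Token::Implicit),
            // Dropping an Acquired releases the token back to the jobserver.
            Token::Explicit(acquired) => drop(acquired),
        }
    }
}
```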

@alexcrichton (Member) commented Dec 18, 2018

As someone not familiar with rayon, can you expand on that a bit more? Acquiring and releasing a token is a pretty expensive operation and would be pretty inefficient if we did it super commonly, so is blocking in rayon something that's amortized over a long time?

@Zoxc (Contributor) commented Dec 19, 2018

There are two places where blocking can happen that are not related to setup/teardown:

  • When a Rayon worker thread has no work to do and has been spinning for a bit looking for work. This can be reduced by ensuring that there's always work available for the thread pool; making the compiler on-demand driven will help expose parallelism to Rayon. The spinning isn't ideal if some other thread has work to do but is waiting for a token; we could probably adapt it to not spin in that case.

  • When a query requires another query which is already executing in another thread. This can be eliminated by using fibers / coroutines so we can do something else instead of blocking.

I'd also add that we spawn n threads immediately for the Rayon thread pool, and they will all call Proxy::acquire_token before they do any real work. If they get a token, they'll just spin a bit waiting for work, then release the token and fall asleep. There is currently no parallelism in the front end of the compiler, so there won't be any work for them to do.
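
To make the behaviour described above concrete, a worker thread's outer loop under this scheme would look roughly like the sketch below. This is not the actual rustc-rayon code: `find_work`, `sleep`, and `SPIN_LIMIT` are stand-ins for rayon-core internals, and `Proxy` is the type sketched earlier in the thread.

```rust
// Sketch of a Rayon worker thread's outer loop under the jobserver scheme.
// `find_work` yields a job if one is available; `sleep` parks the thread
// until new work is announced. Both stand in for rayon-core internals.
fn worker_loop(
    proxy: &Proxy,
    mut find_work: impl FnMut() -> Option<Box<dyn FnOnce()>>,
    mut sleep: impl FnMut(),
) {
    const SPIN_LIMIT: u32 = 64; // arbitrary here; a tuning knob in practice
    loop {
        // Hold a jobserver token only while we might actually run work.
        let token = proxy.acquire_token();
        let mut spins = 0;
        while spins < SPIN_LIMIT {
            match find_work() {
                Some(job) => {
                    job();
                    spins = 0;
                }
                None => spins += 1,
            }
        }
        // Nothing showed up: give the token back before falling asleep so
        // the slot is free for other threads or processes.
        proxy.return_token(token);
        sleep();
    }
}
```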

@Zoxc (Contributor) commented Dec 19, 2018

@michaelwoerister (Contributor) commented Dec 19, 2018

Could the jobserver be made abstract to Rayon? That is, Rayon would not directly use the jobserver crate and would instead just use something with the jobserver interface to acquire and release tokens. That would be a bit more general, and rustc (or any other piece of code using Rayon) could do some buffering and application-specific management of tokens (e.g. keeping tokens around a bit longer if it expects more work to show up soon).

Another approach that might be interesting: Add some way to tell Rayon the target number of active threads it should be running. It would then internally try to match this soft target by not assigning any more work to threads it wants to wind down.
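
To make the first suggestion a bit more concrete, the interface Rayon codes against could be as small as the hypothetical trait below; the names here are made up, and any buffering policy would live entirely on the rustc side.

```rust
// What Rayon would see: just "get permission to run" / "give it back".
// Rayon never learns whether this is backed by a make jobserver, a
// semaphore, or nothing at all.
pub trait TokenProvider: Send + Sync {
    /// Block until the caller is allowed to run work.
    fn acquire(&self);
    /// Signal that the caller has stopped running work.
    fn release(&self);
}

// Trivial implementation for builds that don't run under a jobserver.
pub struct Unlimited;

impl TokenProvider for Unlimited {
    fn acquire(&self) {}
    fn release(&self) {}
}
```

A rustc-side implementation could then, for example, hold on to a released token for a short grace period before giving it back to the jobserver, which is the kind of application-specific management mentioned above.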

@nikomatsakis (Contributor) commented Dec 19, 2018

(FYI I have this scheduled for review Thu Dec 20 at 13:00 UTC-05:00.)

@nikomatsakis (Contributor) commented Dec 20, 2018

OK, so I read this PR and the other one. @Zoxc let me summarize what I think is going on in the current PRs. I'm putting the comment here because I want to keep conversation "consolidated".

I believe that the current design basically has each rayon thread acquire a token from the jobserver before it starts looking for work and release that token when it goes to sleep, right?

This obviously makes a lot of sense, though I'm wondering a bit if there is some interaction with the LLVM compilation threads we want to be careful of. In particular, I believe that LLVM execution and (e.g.) trans can overlap -- this was a requirement to help us reduce peak memory usage requirements of incremental compilation, if I recall. Maybe we want to move those LLVM things into spawned rayon tasks, so that they are sharing the same basic thread-pool? (I've sort of forgotten how that system is setup, I'll have to investigate.)

I would definitely prefer if the rayon core code was "agnostic" as to the specifics of the thread-pool, as @michaelwoerister suggested. Given how simple the interface is, it basically seems like we are talking about adding two callbacks -- acquire_token and release_token -- to the rayon threadpool interface, right?
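
In builder terms, wiring those two callbacks up might look something like the following. The `acquire_thread_handler`/`release_thread_handler` names are placeholders for whatever the hooks end up being called (they are not existing rayon API), and `TokenProvider` is the hypothetical interface sketched in an earlier comment.

```rust
use std::sync::Arc;

// Hypothetical wiring: rustc hands rayon-core two closures at pool-build
// time, so rayon-core never depends on the `jobserver` crate directly.
// The two *_thread_handler methods are placeholders, not existing rayon API.
fn build_pool(tokens: Arc<dyn TokenProvider>, threads: usize) -> rayon::ThreadPool {
    let on_acquire = Arc::clone(&tokens);
    let on_release = Arc::clone(&tokens);
    rayon::ThreadPoolBuilder::new()
        .num_threads(threads)
        // Would run before a worker starts looking for work.
        .acquire_thread_handler(move || on_acquire.acquire())
        // Would run right before a worker blocks or goes to sleep.
        .release_thread_handler(move || on_release.release())
        .build()
        .expect("failed to build the compiler's thread pool")
}
```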

(The PR has some other changes, e.g., adopting #[thread_local], but I am not sure what the motivation for those changes is.)

@michaelwoerister (Contributor) commented Dec 21, 2018

Maybe we want to move those LLVM things into spawned rayon tasks, so that they are sharing the same basic thread-pool?

That's basically how I imagined this working in the future. The current LLVM scheduling is rather complicated, but only because codegen/trans is bound to the main thread. All of this should get a lot simpler once the tcx can be shared between threads.

@Dylan-DPC (Member) commented Jan 21, 2019

ping from triage @Zoxc @nikomatsakis any updates on this?

@Zoxc Zoxc force-pushed the Zoxc:jobserver branch from 3409a32 to 8d210ce Jan 22, 2019

@Zoxc (Contributor) commented Jan 22, 2019

I've updated this to use the callbacks I added to Rayon. The jobserver module has now been moved to rustc_data_structures (a.k.a. rustc_misc).

@rust-highfive (Collaborator) commented Jan 22, 2019

The job x86_64-gnu-llvm-6.0 of your PR failed on Travis (raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log.
travis_time:end:01bf40e0:start=1548179329381587011,finish=1548179451168497109,duration=121786910098
$ git checkout -qf FETCH_HEAD
travis_fold:end:git.checkout

Encrypted environment variables have been removed for security reasons.
See https://docs.travis-ci.com/user/pull-requests/#pull-requests-and-security-restrictions
$ export SCCACHE_BUCKET=rust-lang-ci-sccache2
$ export SCCACHE_REGION=us-west-1
Setting environment variables from .travis.yml
$ export IMAGE=x86_64-gnu-llvm-6.0
---
############################################################              84.7%
######################################################################    98.0%
######################################################################## 100.0%
[00:01:30] extracting /checkout/obj/build/cache/2019-01-04/cargo-beta-x86_64-unknown-linux-gnu.tar.gz
[00:01:30] error: failed to read `/par/rayon-tlv/Cargo.toml`
[00:01:30] Caused by:
[00:01:30]   No such file or directory (os error 2)
[00:01:30] failed to run: /checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo build --manifest-path /checkout/src/bootstrap/Cargo.toml --locked
[00:01:30] Build completed unsuccessfully in 0:00:14
[00:01:30] Build completed unsuccessfully in 0:00:14
[00:01:30] Makefile:71: recipe for target 'prepare' failed
[00:01:30] make: *** [prepare] Error 1
[00:01:31] Command failed. Attempt 2/5:
[00:01:31] error: failed to read `/par/rayon-tlv/Cargo.toml`
[00:01:31] Caused by:
[00:01:31]   No such file or directory (os error 2)
[00:01:31] failed to run: /checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo build --manifest-path /checkout/src/bootstrap/Cargo.toml --locked
[00:01:31] Build completed unsuccessfully in 0:00:00
[00:01:31] Build completed unsuccessfully in 0:00:00
[00:01:31] Makefile:71: recipe for target 'prepare' failed
[00:01:31] make: *** [prepare] Error 1
[00:01:33] Command failed. Attempt 3/5:
[00:01:33] error: failed to read `/par/rayon-tlv/Cargo.toml`
[00:01:33] Caused by:
[00:01:33]   No such file or directory (os error 2)
[00:01:33] failed to run: /checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo build --manifest-path /checkout/src/bootstrap/Cargo.toml --locked
[00:01:33] Build completed unsuccessfully in 0:00:00
[00:01:33] Build completed unsuccessfully in 0:00:00
[00:01:33] make: *** [prepare] Error 1
[00:01:33] Makefile:71: recipe for target 'prepare' failed
[00:01:36] Command failed. Attempt 4/5:
[00:01:36] error: failed to read `/par/rayon-tlv/Cargo.toml`
[00:01:36] Caused by:
[00:01:36]   No such file or directory (os error 2)
[00:01:36] failed to run: /checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo build --manifest-path /checkout/src/bootstrap/Cargo.toml --locked
[00:01:36] Build completed unsuccessfully in 0:00:00
[00:01:36] Build completed unsuccessfully in 0:00:00
[00:01:36] make: *** [prepare] Error 1
[00:01:36] Makefile:71: recipe for target 'prepare' failed
[00:01:40] Command failed. Attempt 5/5:
[00:01:40] error: failed to read `/par/rayon-tlv/Cargo.toml`
[00:01:40] Caused by:
[00:01:40]   No such file or directory (os error 2)
[00:01:40] failed to run: /checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo build --manifest-path /checkout/src/bootstrap/Cargo.toml --locked
[00:01:40] Build completed unsuccessfully in 0:00:00

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)
