Miri detects a data race in rayon-core's test mutual_install #938

Closed
saethlin opened this issue May 27, 2022 · 6 comments

@saethlin
Contributor

I observed this on commit 9fcf0ffe28db6ffe396fb13dbf01cbb925674187, but not on the latest published version of rayon-core.

To reproduce:

$ git clone git@github.com:rust-lang/rayon
$ cd rayon/rayon-core
$ MIRIFLAGS="-Zmiri-disable-isolation" cargo +miri miri test mutual_install

I then see:

test thread_pool::test::mutual_install ... warning: thread support is experimental and incomplete: weak memory effects are not emulated.

error: Undefined Behavior: Data race detected between Deallocate on Thread(id = 2) and Read on Thread(id = 1) at alloc167539 (current vector clock = VClock([47, 428, 824]), conflicting timestamp = VClock([0, 429, 824]))
   --> rayon-core/src/registry.rs:506:5
    |
506 |     }
    |     ^ Data race detected between Deallocate on Thread(id = 2) and Read on Thread(id = 1) at alloc167539 (current vector clock = VClock([47, 428, 824]), conflicting timestamp = VClock([0, 429, 824]))
    |
    = help: this indicates a bug in the program: it performed an invalid operation, and caused Undefined Behavior
    = help: see https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html for further information

full error with backtrace
test thread_pool::test::mutual_install ... warning: thread support is experimental and incomplete: weak memory effects are not emulated.

error: Undefined Behavior: Data race detected between Deallocate on Thread(id = 2) and Read on Thread(id = 1) at alloc167539 (current vector clock = VClock([47, 428, 824]), conflicting timestamp = VClock([0, 429, 824]))
   --> rayon-core/src/registry.rs:506:5
    |
506 |     }
    |     ^ Data race detected between Deallocate on Thread(id = 2) and Read on Thread(id = 1) at alloc167539 (current vector clock = VClock([47, 428, 824]), conflicting timestamp = VClock([0, 429, 824]))
    |
    = help: this indicates a bug in the program: it performed an invalid operation, and caused Undefined Behavior
    = help: see https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html for further information
            
    = note: inside `registry::Registry::in_worker_cross::<[closure@rayon-core/src/thread_pool/mod.rs:110:33: 110:44], bool>` at rayon-core/src/registry.rs:506:5
note: inside `registry::Registry::in_worker::<[closure@rayon-core/src/thread_pool/mod.rs:110:33: 110:44], bool>` at rayon-core/src/registry.rs:446:17
   --> rayon-core/src/registry.rs:446:17
    |
446 |                 self.in_worker_cross(&*worker_thread, op)
    |                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside `thread_pool::ThreadPool::install::<[closure@rayon-core/src/thread_pool/test.rs:163:27: 167:14], bool>` at rayon-core/src/thread_pool/mod.rs:110:9
   --> rayon-core/src/thread_pool/mod.rs:110:9
    |
110 |         self.registry.in_worker(|_, _| op())
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside closure at rayon-core/src/thread_pool/test.rs:163:13
   --> rayon-core/src/thread_pool/test.rs:163:13
    |
163 | /             pool1.install(|| {
164 | |                 // If they blocked on inter-pool installs, there would be no
165 | |                 // threads left to run this!
166 | |                 true
167 | |             })
    | |______________^
note: inside closure at rayon-core/src/thread_pool/mod.rs:110:40
   --> rayon-core/src/thread_pool/mod.rs:110:40
    |
110 |         self.registry.in_worker(|_, _| op())
    |                                        ^^^^
note: inside closure at rayon-core/src/registry.rs:499:17
   --> rayon-core/src/registry.rs:499:17
    |
499 |                 op(&*worker_thread, true)
    |                 ^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside closure at rayon-core/src/job.rs:116:21
   --> rayon-core/src/job.rs:116:21
    |
116 |             move || func(true)
    |                     ^^^^^^^^^^
    = note: inside `<std::panic::AssertUnwindSafe<[closure@rayon-core/src/job.rs:116:13: 116:31]> as std::ops::FnOnce<()>>::call_once` at rustc_src/src/panic/unwind_safe.rs:271:9
    = note: inside `std::panicking::r#try::do_call::<std::panic::AssertUnwindSafe<[closure@rayon-core/src/job.rs:116:13: 116:31]>, bool>` at rustc_src/src/panicking.rs:492:40
    = note: inside `std::panicking::r#try::<bool, std::panic::AssertUnwindSafe<[closure@rayon-core/src/job.rs:116:13: 116:31]>>` at rustc_src/src/panicking.rs:456:19
    = note: inside `std::panic::catch_unwind::<std::panic::AssertUnwindSafe<[closure@rayon-core/src/job.rs:116:13: 116:31]>, bool>` at rustc_src/src/panic.rs:137:14
note: inside `unwind::halt_unwinding::<[closure@rayon-core/src/job.rs:116:13: 116:31], bool>` at rayon-core/src/unwind.rs:17:5
   --> rayon-core/src/unwind.rs:17:5
    |
17  |     panic::catch_unwind(AssertUnwindSafe(func))
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside `<job::StackJob<latch::TickleLatch<latch::SpinLatch>, [closure@rayon-core/src/registry.rs:496:13: 500:14], bool> as job::Job>::execute` at rayon-core/src/job.rs:123:38
   --> rayon-core/src/job.rs:123:38
    |
123 |         (*this.result.get()) = match unwind::halt_unwinding(call(func)) {
    |                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside `job::JobRef::execute` at rayon-core/src/job.rs:60:9
   --> rayon-core/src/job.rs:60:9
    |
60  |         (self.execute_fn)(self.pointer)
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside `registry::WorkerThread::execute` at rayon-core/src/registry.rs:730:9
   --> rayon-core/src/registry.rs:730:9
    |
730 |         job.execute();
    |         ^^^^^^^^^^^^^
note: inside `registry::WorkerThread::wait_until_cold::<latch::CountLatch>` at rayon-core/src/registry.rs:711:17
   --> rayon-core/src/registry.rs:711:17
    |
711 |                 self.execute(job);
    |                 ^^^^^^^^^^^^^^^^^
note: inside `registry::WorkerThread::wait_until::<latch::CountLatch>` at rayon-core/src/registry.rs:685:13
   --> rayon-core/src/registry.rs:685:13
    |
685 |             self.wait_until_cold(latch);
    |             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside `registry::main_loop` at rayon-core/src/registry.rs:809:5
   --> rayon-core/src/registry.rs:809:5
    |
809 |     worker_thread.wait_until(&registry.terminate_latch);
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside `registry::ThreadBuilder::run` at rayon-core/src/registry.rs:56:18
   --> rayon-core/src/registry.rs:56:18
    |
56  |         unsafe { main_loop(self.worker, self.registry, self.index) }
    |                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside closure at rayon-core/src/registry.rs:101:20
   --> rayon-core/src/registry.rs:101:20
    |
101 |         b.spawn(|| thread.run())?;
    |                    ^^^^^^^^^^^^

note: some details are omitted, run with `MIRIFLAGS=-Zmiri-backtrace=full` for a verbose backtrace

error: aborting due to previous error; 32 warnings emitted

error: test failed, to rerun pass '--lib'
@saethlin
Contributor Author

saethlin commented May 27, 2022

I also see a data race via inferno's test suite, so perhaps the particular test that's failing there is misleading:

test collapse::dtrace::tests::test_collapse_multi_dtrace_non_utf8 ... warning: thread support is experimental and incomplete: weak memory effects are not emulated.

error: Undefined Behavior: Data race detected between Read on Thread(id = 2) and Write on Thread(id = 1) at alloc654748+0xd38 (current vector clock = VClock([6455, 7233, 1111, 579, 376, 331, 197, 79, 34, 34, 34]), conflicting timestamp = VClock([0, 7236]))
   --> src/collapse/common.rs:417:35
    |
417 |             MultiThreaded(arc) => *arc.entry(key).or_insert(0) += count,
    |                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Data race detected between Read on Thread(id = 2) and Write on Thread(id = 1) at alloc654748+0xd38 (current vector clock = VClock([6455, 7233, 1111, 579, 376, 331, 197, 79, 34, 34, 34]), conflicting timestamp = VClock([0, 7236]))
    |
    = help: this indicates a bug in the program: it performed an invalid operation, and caused Undefined Behavior
    = help: see https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html for further information

@cuviper
Member

cuviper commented May 27, 2022

> I observed this on commit 9fcf0ffe28db6ffe396fb13dbf01cbb925674187, but not on the latest published version of rayon-core.

To be clear, that commit is not really part of this repo, but of the rust-lang/rustc-rayon fork. I believe the exact problem is that it's missing the fix from #740. It's pretty nice that Miri identified this race, because it was a pain to track down at the time. It looks like the backtrace only shows one of the two threads involved in the race, though, and we would ideally want to see the other side as well.
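
The Deallocate-vs-Read pair in the report above has the shape of the classic latch hazard in this kind of code: the job and its latch live on the waiting thread's stack, and that frame is popped as soon as the waiter observes the latch, so every access the executing thread makes to the job has to be ordered before that observation, and nothing may touch the job afterwards. The following is only a minimal sketch of that protocol, not rayon-core's actual code -- the type and method names are invented -- and whether #740 fixed exactly this spot is best confirmed against that PR.

use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicBool, Ordering};

// Illustrative only: a job that lives on the *waiting* thread's stack.
struct StackJob<T> {
    latch: AtomicBool,             // set once the result is ready
    result: UnsafeCell<Option<T>>, // written by the executing thread
}

unsafe impl<T: Send> Sync for StackJob<T> {}

impl<T> StackJob<T> {
    fn new() -> Self {
        StackJob {
            latch: AtomicBool::new(false),
            result: UnsafeCell::new(None),
        }
    }

    // Runs on the worker thread that executes the job.
    fn complete(&self, value: T) {
        unsafe { *self.result.get() = Some(value) }; // 1. publish the result
        self.latch.store(true, Ordering::Release);   // 2. set the latch *last*
        // 3. `self` must not be touched after this point: the waiting thread
        //    frees this stack frame as soon as it observes the latch, so any
        //    later read becomes a Deallocate-vs-Read pair like the one Miri
        //    reports above.
    }

    // Runs on the thread that owns the job on its stack.
    fn wait(&self) -> T {
        while !self.latch.load(Ordering::Acquire) {
            std::hint::spin_loop();
        }
        unsafe { (*self.result.get()).take().unwrap() }
    }
}

This also explains why the error points at a closing brace at registry.rs:506: the "Deallocate" is simply the stack frame of in_worker_cross being popped when it returns.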

Anyway, I would like to get rustc-rayon rebased again, and I pushed a branch for that: master...cuviper:rustc-rayon-next. I'm not sure about the deadlock detection though -- I had to make a few functional changes since we don't wake up all threads at once anymore. When I tried this in a rustc build, the ui testsuite hit a deadlock and then panicked when its deadlock handler tried to read a WorkerLocal from a thread outside the pool. I don't know if that's something I broke, or if that was already poorly handled, since the rustc-parallel builds don't get much attention.

If you (or anyone else watching) have time to look into this, that would be great!

@saethlin
Contributor Author

saethlin commented May 27, 2022

> we would ideally want to see the other side as well.

I'm not actually going to implement this, but would local spans suffice instead of a backtrace for the other racing access? This seems like the same problem with historical context that I solved for Stacked Borrows checking in this PR; you can see an example with color here: https://miri.saethlin.dev/ub?crate=smallvec&version=1.8.0 (I've debugged a single data race in my whole life, so I have no idea what helps.)
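
For concreteness, one hypothetical shape that "historical context" could take -- these are not Miri's actual data structures, and every name here (Span, Access, CellHistory) is invented -- is to remember the span of each thread's most recent access to a location, so that when a conflict is detected the remembered span can be reported as the other side of the race:

use std::collections::HashMap;

// Invented types, for illustration only.
#[derive(Clone, Copy, Debug)]
struct Span {
    file: &'static str,
    line: u32,
}

#[derive(Clone, Copy, Debug)]
struct Access {
    thread: usize,
    is_write: bool,
    span: Span, // kept around purely so it can be reported later
}

// Per-location history: the most recent remembered access from each thread.
#[derive(Default)]
struct CellHistory {
    last_access: HashMap<usize, Access>,
}

impl CellHistory {
    // Record `new` and return a remembered access by another thread that
    // conflicts with it, as judged by the caller-supplied happens-before
    // check. The returned access carries the span of "the other side".
    fn record(
        &mut self,
        new: Access,
        happens_before_new: impl Fn(&Access) -> bool,
    ) -> Option<Access> {
        let conflict = self
            .last_access
            .values()
            .copied()
            .find(|old| {
                old.thread != new.thread
                    && (old.is_write || new.is_write)
                    && !happens_before_new(old)
            });
        self.last_access.insert(new.thread, new);
        conflict
    }
}

The cost is one remembered span per thread per tracked location, which is presumably the trade-off to weigh against reconstructing a second full backtrace.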

@cuviper
Member

cuviper commented May 27, 2022

The span would definitely help, if a second backtrace is not feasible. In this case, the fix was needed in that other location.

@saethlin
Contributor Author

saethlin commented May 28, 2022

I think the bigger challenge with backtraces is the skill of the implementor 😅
Thanks for the feedback, I'll think on this.

bors added a commit to rust-lang-ci/rust that referenced this issue May 28, 2022

Update to rebased rustc-rayon 0.4

In rayon-rs/rayon#938, miri uncovered a race in `rustc-rayon-core` that had already been fixed in the regular `rayon-core`. I have now rebased that fork onto the latest rayon branch and published it as 0.4. I also updated `indexmap` to bump the dependency.

`Cargo.lock` changes:

    Updating indexmap v1.8.0 -> v1.8.2
    Updating rayon v1.5.1 -> v1.5.3
    Updating rayon-core v1.9.1 -> v1.9.3
    Updating rustc-rayon v0.3.2 -> v0.4.0
    Updating rustc-rayon-core v0.3.2 -> v0.4.1
@cuviper
Member

cuviper commented May 31, 2022

I landed the rustc-rayon rebase in rust-lang/rust#97472 -- please let us know if you find anything else!
