Use of get_cx can fail after clearing JS runtime #21866

jdm · 2018-10-04T00:35:09Z

Using window.get_cx can trigger a script thread panic when it runs as part of a task that affects a global that already had its runtime cleared. We can avoid this by cancelling all outstanding tasks for the global as part of Window::clear_js_runtime.

jdm · 2018-10-04T00:36:24Z

An example of what this looks like:

  ▶ CRASH [expected OK] /_mozilla/mozilla/postmessage_closed.html
  │ 
  │ VMware, Inc.
  │ softpipe
  │ 3.3 (Core Profile) Mesa 18.1.0-devel
  │ called `Option::unwrap()` on a `None` value (thread ScriptThread PipelineId { namespace_id: PipelineNamespaceId(1), index: PipelineIndex(1) }, at libcore/option.rs:345)
  │ stack backtrace:
  │    0:     0x7f458ade275c - backtrace::backtrace::trace::h25a855f1b024854d
  │    1:     0x7f458ade2482 - <backtrace::capture::Backtrace as core::default::Default>::default::habdbfa508190f944
  │    2:     0x7f458ade24f8 - backtrace::capture::Backtrace::new::h4d43a9c81c9f7d2b
  │    3:     0x7f4587d14879 - servo::main::{{closure}}::h4f3daf96d0e7b2ea
  │    4:     0x7f458adf8083 - std::panicking::rust_panic_with_hook::h96bd948766a721a4
  │                         at libstd/panicking.rs:481
  │    5:     0x7f458adf7c29 - std::panicking::continue_panic_fmt::h79dd527b10c49814
  │                         at libstd/panicking.rs:391
  │    6:     0x7f458adf7b25 - rust_begin_unwind
  │                         at libstd/panicking.rs:326
  │    7:     0x7f458ae2fa4b - core::panicking::panic_fmt::ha9838fa819cc4c3a
  │                         at libcore/panicking.rs:77
  │    8:     0x7f458ae2f97a - core::panicking::panic::h285e7c5e0188ee3f
  │                         at libcore/panicking.rs:52
  │    9:     0x7f4588d1e015 - <script::dom::window::Window::post_message::post_serialised_message<F> as script::task::TaskOnce>::run_once::hc24a6f1083c58812
  │   10:     0x7f458854e6a3 - <T as script::task::TaskBox>::run_box::h7decbd20c577411a
  │   11:     0x7f458869022d - script::script_thread::ScriptThread::handle_msg_from_script::ha5cb4dc8dc384b38
  │   12:     0x7f4588682def - _ZN6script13script_thread12ScriptThread11handle_msgs17ha3193724cf5b3cbfE.llvm.16639006869903642791
  │   13:     0x7f458867f801 - script::script_thread::ScriptThread::start::hd0d0c5cc418f3718
  │   14:     0x7f458881f20f - std::sys_common::backtrace::__rust_begin_short_backtrace::h56d875724efd5dee
  │   15:     0x7f4588d23e35 - _ZN3std9panicking3try7do_call17h110851039340f857E.llvm.10071600851121277610
  │   16:     0x7f458ae20f79 - __rust_maybe_catch_panic
  │                         at libpanic_unwind/lib.rs:103
  │   17:     0x7f4588a2aa60 - <F as alloc::boxed::FnBox<A>>::call_box::h88161e523050cfb5
  │   18:     0x7f458ae108fa - <alloc::boxed::Box<(dyn alloc::boxed::FnBox<A, Output$u3d$R$GT$$u20$$u2b$$u20$$u27$a$RP$$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::h91e1a0628cdb6b64
  │                         at /checkout/src/liballoc/boxed.rs:656
  │                          - std::sys_common::thread::start_thread::h8710783cd11c6158
  │                         at libstd/sys_common/thread.rs:24
  │   19:     0x7f458adf8855 - std::sys::unix::thread::Thread::new::thread_start::h8ecb5f3a76b46e8e
  │                         at libstd/sys/unix/thread.rs:90
  │   20:     0x7f458608b183 - start_thread
  │   21:     0x7f458431703c - clone
  │   22:                0x0 - <unknown>
  │ ERROR 2018-10-04T00:02:01Z: servo: called `Option::unwrap()` on a `None` value
  └ Pipeline failed in hard-fail mode.  Crashing!

asajeffrey · 2018-10-04T01:04:45Z

Related to #21419, and in general lots of code which assumes browsing contexts are never discarded.

AgustinCB · 2018-10-04T10:21:15Z

How can I reproduce this on local?

AgustinCB · 2018-10-04T10:28:53Z

Also, when you mean the global, do you mean all the tasks submitted through GlobalScope.pipeline_id?

jdm · 2018-10-04T10:45:20Z

You may be lucky enough to be able to trigger this (potentially) intermittent failure by running ./mach test-wpt tests/wpt/mozilla/tests/mozilla/postmessage_closed.html. And when I talk about the global, I mean tasks that get queued that make use of a Window object and end up calling the get_cx method for some reason when they are later executed.

AgustinCB · 2018-10-04T12:59:01Z

Sorry, I have more questions:

How can I get a list of such a tasks? I have been taking a look at the window and globalscope files but didn't find references.

jdm · 2018-10-04T15:09:00Z

There's no way to get such a list without looking at each use of CommonScriptMsg::Task, as far as I can tell. It shouldn't be necessary to modify any of them, though; we built our task cancellation mechanism that you've been improving to enable this use case - if we use the API that marks all task sources as cancelled then any outstanding tasks for those sources will be ignored.

gterzian · 2018-10-05T11:07:26Z

@AgustinCB Here are a few more pointers, hopefully they're correct and useful:

It appears the test is first using postMessage on an iframe, and then removing the iframe from the DOM tree of the page where it is embedded.
It appears to be a same origin iframe, hence postMessage will end up calling Window.post_message
post_message will enqueue a task using the DOMManipulation task-source, here

servo/components/script/dom/window.rs

Line 2287 in caa4d19

let _ = self.script_chan.send(CommonScriptMsg::Task(
When this task is executed by the script_thread here get_cx is called, see

servo/components/script/dom/window.rs

Line 2263 in caa4d19

let cx = this.get_cx();
Removing the iframe from the tree, will, I think, end up calling unbind_from_tree, which will among others send a ScriptMsg::RemoveIFrame message to the constellation.
It's also noteworthy that the iframe will block waiting on the answer from the constellation at

servo/components/script/dom/htmliframeelement.rs

Line 651 in caa4d19

let exited_pipeline_ids = receiver.recv().unwrap();

. Since this "iframe" is same-origin, it is running in the same script-thread as the rest of the page, and this will thus block the entire script-thread, which will not be able to handle other tasks in the meantime.
While the script-thread is blocked because of 6, the constellation is not, and will do the work related to "closing the iframe", which is found at handle_remove_iframe_msg.
This will include "closing a browsing context", which is a workflow that ends-up sending a ConstellationControlMsg::ExitPipeline back to the script-thread were this iframe is running.
The script-thread will handle this message in handle_exit_pipeline_msg, which also calls window.clear_js_runtime

So the way to think about how this is processed is that:

PostMessage enqueues a task on a script-thread.
remove ends up sending a ConstellationControlMsg::ExitPipeline to that same script-thread, while that script-thread is actually blocked, hence it won't for example handle the task related to 1.
The script-thread handles running tasks, and handling message from the constellation, as part of the same "run loop" as part of handle_msgs
The result of that both the "PostMessage" task, and the "Exit Pipeline" message, are sort of enqueued and waiting on being handled by this script-thread at the same time(since the script-thread will block at 2, meaning it won't be able to handled the "PostMessage" before the "Exit Pipeline" message has been received).
At that point, it's a question of which one is handled first. If the "Exit Pipeline" is handled first, we'll get a crash when the "PostMessage" is handled next. There is some randomness there, because the select! statement that is used to "choose" a message either from the task-queue, or from the constellation queue of messages, will pick one fairly randomly. See

servo/components/script/script_thread.rs

Line 1109 in caa4d19

let mut event = select! {
The solution to this problem is to ensure that as part of handle_exit_pipeline_msg, all tasks for the window corresponding to the pipeline that is exited, are marked as cancelled.
The weird thing, is that it appears to be done already inside clear_js_runtime at

servo/components/script/dom/window.rs

Line 1288 in caa4d19

self.ignore_all_events();
One thing I can think of, is to try to used a more strict Ordering at

servo/components/script/dom/window.rs

Line 338 in caa4d19

flag.store(true, Ordering::Relaxed);
Because when a script-thread runs a task(see

servo/components/script/script_thread.rs

Line 1566 in caa4d19

MainThreadScriptMsg::Common(CommonScriptMsg::Task(_, task, _, _)) => task.run_box(),

)
The task will check whether it has been cancelled using

servo/components/script/task.rs

Line 108 in caa4d19

.map_or(false, |cancelled| cancelled.load(Ordering::SeqCst))

Hope this helps(and that if changing the ordering really is the fix, it didn't spoil the fun)!

jdm added A-content/dom Interacting with the DOM from web content I-panic Servo encounters a panic. labels Oct 4, 2018

jdm mentioned this issue Oct 4, 2018

Unify the task source and task canceller API #21804

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use of get_cx can fail after clearing JS runtime #21866

Use of get_cx can fail after clearing JS runtime #21866

jdm commented Oct 4, 2018

jdm commented Oct 4, 2018

asajeffrey commented Oct 4, 2018

AgustinCB commented Oct 4, 2018

AgustinCB commented Oct 4, 2018

jdm commented Oct 4, 2018

AgustinCB commented Oct 4, 2018

jdm commented Oct 4, 2018

gterzian commented Oct 5, 2018 •

edited

Use of get_cx can fail after clearing JS runtime #21866

Use of get_cx can fail after clearing JS runtime #21866

Comments

jdm commented Oct 4, 2018

jdm commented Oct 4, 2018

asajeffrey commented Oct 4, 2018

AgustinCB commented Oct 4, 2018

AgustinCB commented Oct 4, 2018

jdm commented Oct 4, 2018

AgustinCB commented Oct 4, 2018

jdm commented Oct 4, 2018

gterzian commented Oct 5, 2018 • edited

gterzian commented Oct 5, 2018 •

edited