New test failures in test-perf #20127
Given that there are various mutexes involved in the fetch code and the HTTP cache, it's plausible that there could be a deadlock hiding in there.
Adding in an old-school debugging println, there are some loads that are never unblocked:
Comparing a successful run:
with an unsuccessful run:
there are loads that are started but never terminated.
Now with the
so it looks like some image loads aren't finishing.
A successful run with finish_load being tracked:
Ah:
with two lines both saying
Good news... on my laptop in the office, I can trigger this with a debug build! Yay, less stumbling around in the dark! A sample run ends with:
So a deadlock is looking quite likely: script is waiting on layout, which never completes. I wonder if script is holding a lock while doing this?
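The "holding a lock while waiting" suspicion is the classic shape of this kind of hang. Here's a minimal, self-contained sketch of the pattern (all names hypothetical, not Servo code): thread A holds a mutex while blocking on a reply from thread B, but B needs that same mutex before it can reply.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;
use std::time::Duration;

// Hypothetical sketch of "script holds a lock while waiting on layout":
// thread A holds `cache` while blocking on a reply from thread B, but B
// needs `cache` before it can reply, so the reply never arrives.
fn reply_arrives() -> bool {
    let cache = Arc::new(Mutex::new(0u32));
    let (reply_tx, reply_rx) = mpsc::channel::<u32>();

    let guard = cache.lock().unwrap(); // A takes the lock first...

    let cache_b = Arc::clone(&cache);
    thread::spawn(move || {
        // B: needs the cache lock before it can send its reply.
        let v = *cache_b.lock().unwrap();
        let _ = reply_tx.send(v);
    });

    // ...and then blocks waiting for B's reply while still holding it.
    let got = reply_rx.recv_timeout(Duration::from_millis(200)).is_ok();
    drop(guard); // release the lock so B can finish once we give up
    got
}

fn main() {
    // The reply never comes within the timeout: a deadlock in miniature.
    assert!(!reply_arrives());
}
```

The timeout is only there so the demo terminates; the real system, with plain blocking waits, would hang forever.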
Aaand adding some extra debugging printlns:

```rust
debug!("Sending webrender tx");
self.webrender_api.send_transaction(self.webrender_document, txn);
debug!("Finished webrender tx");
```

oh look:
So the webrender tx is sent but never completes.
Current guess: webrender does something with the image cache that needs a lock or similar, but oh dear, it's still owned by the script thread.
I don't think there's any way that WR can interact with the image cache. In general, WR is a push-style API that shouldn't block on any API call (it's possible I'm forgetting something, though). My guess would be an ipc-channel bug.
A backtrace from gdb, from the layout thread in its stuck state...
which would suggest something similar to servo/ipc-channel#34?
@antrik, looks like this might be a case where IPC channel send is blocking?
OK, more digging... Servo ought to be deadlock-free even if IPC sending blocks. I even wrote some notes to myself!

servo/components/constellation/constellation.rs Lines 80 to 90 in 2d4caaf

In particular, sending to the WR API is blocking, and layout can block on WR:

servo/components/layout_thread/lib.rs Line 1061 in e8f7786

Digging through this particular deadlock in rr, the font cache thread is blocked sending to WR:
although the WR back end is waiting to recv:
One thing that's weird is that the fd that WR is listening on is fd 16, but the font cache is sending on fd 8:
Ah, the fds are created with socketpair, so it's not too surprising that they are different. I still don't understand why the WR thread is still waiting to recv, though :/
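As a sanity check on that observation: socketpair(2) returns two distinct, connected descriptors, so the sending side and the receiving side naturally show different fd numbers in gdb. A tiny Unix-only illustration using only the Rust standard library:

```rust
use std::os::unix::io::AsRawFd;
use std::os::unix::net::UnixStream;

// UnixStream::pair() wraps socketpair(2): two connected endpoints,
// each with its own file descriptor number.
fn main() {
    let (a, b) = UnixStream::pair().expect("socketpair failed");
    // The endpoints are distinct fds (e.g. one side 8, the other 16),
    // even though they form a single bidirectional channel.
    assert_ne!(a.as_raw_fd(), b.as_raw_fd());
    println!("endpoint fds: {} and {}", a.as_raw_fd(), b.as_raw_fd());
}
```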
Ah, I have a theory! The webrender API sends a transaction by first sending the payload, then sending the transaction itself:

https://github.com/servo/webrender/blob/d7735c7da9844775d4a3e6ec8b6cea25fb0b0486/webrender_api/src/api.rs#L897-L903

The problem is that IPC send blocks when the buffer is full. At this point, the WR back end is waiting on its API channel:

https://github.com/servo/webrender/blob/d7735c7da9844775d4a3e6ec8b6cea25fb0b0486/webrender/src/render_backend.rs#L727-L736

The transaction sender is blocked waiting to send the payload, after which it will send the transaction message. But the WR back end is waiting for the transaction message before unblocking the payload. Cyclic blocking, so DEADLOCK! Gosh this was painful to diagnose. Here are the relevant GDB backtraces...
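This cyclic blocking can be reproduced in miniature with two bounded channels. A hedged sketch, not webrender code: `sync_channel(0)` makes every send block until a matching recv, standing in for "IPC send blocks once the buffer is full". The sender sends payload-then-message; the receiver waits for the message first, so neither side makes progress.

```rust
use std::sync::mpsc::{sync_channel, RecvTimeoutError};
use std::thread;
use std::time::Duration;

// Returns true if the cyclic-blocking pattern deadlocks (detected via
// a timeout rather than hanging forever).
fn cyclic_blocking_deadlocks() -> bool {
    // Rendezvous channels: every send blocks until the matching recv,
    // an exaggeration of "send blocks once the OS buffer fills up".
    let (payload_tx, payload_rx) = sync_channel::<Vec<u8>>(0);
    let (msg_tx, msg_rx) = sync_channel::<&'static str>(0);

    // Sender side: payload first, then the transaction message.
    let sender = thread::spawn(move || {
        payload_tx.send(vec![0u8; 16]).unwrap(); // blocks here
        msg_tx.send("transaction").unwrap();
    });

    // Receiver side (like the WR back end): waits for the *message*
    // before it would pull the payload -- the reverse order.
    let deadlocked = matches!(
        msg_rx.recv_timeout(Duration::from_millis(200)),
        Err(RecvTimeoutError::Timeout)
    );

    // Break the cycle so the demo exits cleanly: take the payload,
    // then the message, in the order the sender produces them.
    let _payload = payload_rx.recv().unwrap();
    let _msg = msg_rx.recv().unwrap();
    sender.join().unwrap();
    deadlocked
}

fn main() {
    assert!(cyclic_blocking_deadlocks());
}
```

Draining the channels in the sender's order at the end is also the shape of the fix: as long as someone receives the payload before waiting on the message, the cycle never forms.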
Blocking sends are a problem on mac as well as linux. I wrote a program that demonstrates that the IPC channels on mac eventually block during the send operation if no receive operation occurs.
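A sketch of the same demonstration using a plain Unix socket pair from the standard library (not the ipc-channel API itself): writing with nobody reading eventually fills the finite kernel buffer, which is exactly the point where a blocking send would stall.

```rust
use std::io::{ErrorKind, Write};
use std::os::unix::net::UnixStream;

// Write into one end of a socketpair with nobody reading, in
// non-blocking mode, until the kernel buffer is full. A *blocking*
// sender would stall at exactly this point.
fn bytes_until_send_blocks() -> usize {
    let (mut tx, _rx) = UnixStream::pair().expect("socketpair failed");
    tx.set_nonblocking(true).unwrap();
    let chunk = [0u8; 4096];
    let mut sent = 0;
    loop {
        match tx.write(&chunk) {
            Ok(n) => sent += n,
            // The kernel buffer is full; a blocking write would hang here.
            Err(e) if e.kind() == ErrorKind::WouldBlock => return sent,
            Err(e) => panic!("unexpected write error: {}", e),
        }
    }
}

fn main() {
    let capacity = bytes_until_send_blocks();
    println!("send would block after {} bytes", capacity);
    assert!(capacity > 0);
}
```

The exact capacity is platform-dependent (and tunable), which is consistent with the bug being timing- and load-sensitive.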
Yes, @antrik didn't think there was any way to give non-blocking semantics to IPC send on all platforms.
Hopefully this is fixed by servo/webrender#2480
Fixed by servo/webrender#2480
We started getting more test failures in test-perf, beginning 2018-01-20. https://datastudio.google.com/u/0/reporting/1b16G5pQNp2lE-1nNcPlOure9G0eR36cq/page/heOK
An example that used to pass is http://localhost:8123/page_load_test/tp5n/en.wikipedia.org/en.wikipedia.org/wiki/Rorschach_test.html. To replicate this, serve the `etc/ci/performance` directory, e.g. `python -m SimpleHTTPServer 8123`. The test fails nondeterministically, sometimes rendering and sometimes just producing a blank page.
There are no obvious culprits in the commits after 6fc71a7 and before b62b51a, which are the ones for Jan 19.
My suspicion is that this is a timing-sensitive deadlock, and that some commit on Jan 19 happened to change the timing such that the deadlock happens on CI.
My attempts to diagnose this haven't borne any fruit. I did once get a panic:
which seems to be the same panic as #19707, but I don't know if it's related. Most runs deadlock; they don't panic.
IRC conversation with @jdm at https://mozilla.logbot.info/servo/20180226#c14363739