Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removed panicking when frame or pipeline lookup fails. #10082

Merged
merged 1 commit into from Mar 31, 2016

Conversation

@asajeffrey
Copy link
Member

asajeffrey commented Mar 18, 2016

Removed the methods pipeline(id), pipeline_mut(id), frame(id) and frame_mut(id) from constellation, which panicked when the table lookup failed.

The panics were causing race conditions, e.g. visiting google.com and resizing the page would cause a panic, most likely due to an iframe being added and removed, with the DOMLoad event arriving after the iframe had been removed, causing a panic.

This patch fixes #10017 and #8769 (although in non-webrender builds there's now a different panic, see #10017 (comment)).

There are a few TODO items in the initial commit, for cases where it's not completely obvious what to do in the case of failure.


This change is Reviewable

@KiChjang
Copy link
Member

KiChjang commented Mar 18, 2016

@bors-servo
Copy link
Contributor

bors-servo commented Mar 18, 2016

Trying commit a4c9bbf with merge a66b754...

bors-servo added a commit that referenced this pull request Mar 18, 2016
Removed panicking when frame or pipeline lookup fails.

Removed the methods `pipeline(id)`, `pipeline_mut(id)`, `frame(id)` and `frame_mut(id)` from constellation, which panicked when the table lookup failed.

The panics were causing race conditions, e.g. visiting google.com and resizing the page would cause a panic, most likely due to an iframe being added and removed, with the `DOMLoad` event arriving after the iframe had been removed, causing a panic.

This patch fixes #10017 and #8769 (although in non-webrender builds there's now a different panic, see #10017 (comment)).

There are a few `TODO` items in the initial commit, for cases where it's not completely obvious what to do in the case of failure.

<!-- Reviewable:start -->
---
This change is [<img src="https://reviewable.io/review_button.svg" height="35" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/servo/servo/10082)
<!-- Reviewable:end -->
@asajeffrey
Copy link
Member Author

asajeffrey commented Mar 18, 2016

@mbrubeck and @jdm suggest re-enabling /_mozilla/css/iframe/hide_layers2.html, since this patch removes the intermittent on headless builds.

@bors-servo
Copy link
Contributor

bors-servo commented Mar 18, 2016

@asajeffrey
Copy link
Member Author

asajeffrey commented Mar 18, 2016

We might want to land #8641 before this PR.

@glennw
Copy link
Member

glennw commented Mar 20, 2016

This seems like a scary change.

In the original implementation I had those conditions as panics because I was fairly certain that if they do cause a panic, they signal a logic problem elsewhere.

I think that making these not panic is likely to hide / mask other genuine bugs.

Ideally I'd like to see an exact sequence of events / messages that results in these race conditions.

Thoughts @larsbergstrom ?

@larsbergstrom
Copy link
Contributor

larsbergstrom commented Mar 20, 2016

@glennw Thanks for bringing this to my attention! I totally agree, and the reason these are scary is that when I first started working on them pipeline-related shutdown issues resulted in crashes - either the compositor or script task can get into a really horrific state (w.r.t. their native resources) if they are not shut down correclty, sometime even leading to segfaults.

I realize that the shutdown code is pretty terrifying (cleaning up our shutdown segfaults it was my first task, then glenn got it early on to handle the intermittents, and now it appears to be one of yours ). That said, now that we have session types available, I'd be totally open to instead sitting down and writing out all of the protocols and attempting to codify the shutdown sequence more explicitly.

@asajeffrey
Copy link
Member Author

asajeffrey commented Mar 21, 2016

@glennw yup, pretty scary!

The problem is that the current code is causing panics pretty regularly (e.g. load up google.com and resize the window). The panic is caused by an unexpected sequence of events (e.g. where an iframe is loaded then removed, but its DOMLoad event arrives after the iframe's resources have been reclaimed). I think the problem is that the code requires certain sequences of events to be impossible, but the events are concurrent, so I don't see any way to enforce arrival order.

I agree that we need to rethink the protocols more thoroughly, and take a long hard look at the compositor, the question is what to do in the short term.

@larsbergstrom I'm not 100% sure session types will address the issue, but I'm certainly open to giving it a shot!

@larsbergstrom
Copy link
Contributor

larsbergstrom commented Mar 21, 2016

Maybe Thursday we can have a go at determining the new version of: shutdown

The Rust session types work was designed with exactly that diagram in mind :-) Though I'm sure things have gotten more complicated since I fixed what I could and then ran in fear in early 2014.

@asajeffrey
Copy link
Member Author

asajeffrey commented Mar 21, 2016

@larsbergstrom In this case it's not the shutdown sequence that's causing the problem, it's an iframe being removed, and the constellation panicking because it can't send messages to the iframe. This is why I don't know whether session types will fix the problem, since it's caused by lookup failures in the pipeline hash table.

@pcwalton
Copy link
Contributor

pcwalton commented Mar 21, 2016

I agree with the general unease around silently ignoring frames that have unexpectedly gone missing. I do sympathize with wanting to improve browser robustness even in the presence of logic bugs though. Perhaps we could loudly print errors in these cases?

@asajeffrey
Copy link
Member Author

asajeffrey commented Mar 21, 2016

@pcwalton we can include warning messages. These aren't exactly logic bugs though, just messages arriving in odd orders, which is pretty much par for the course on any concurrent system.

@asajeffrey
Copy link
Member Author

asajeffrey commented Mar 21, 2016

Relevant discussion on irc: http://logs.glob.uno/?c=mozilla%23servo#c388656
Brainstorming on goals for threads in servo: https://public.etherpad-mozilla.org/p/servo-threads

@asajeffrey
Copy link
Member Author

asajeffrey commented Mar 21, 2016

I did some digging in constellation.rs, categorizing the potential causes of panic, the results are in https://public.etherpad-mozilla.org/p/servo-threads.

TL;DR summary: there are four potential sources of panic:

  1. Lookup failure in the pipeline/frame maps
  2. Creating new processes
  3. Creating new channels
  4. Sending/receiving on those channels

This PR is just addressing (1), this still leaves a lot of (4).

@asajeffrey
Copy link
Member Author

asajeffrey commented Mar 21, 2016

Some more irc chat, this time about what to do in the chan.send(...).unwrap() case: http://logs.glob.uno/?c=mozilla%23servo#c389192

Also @jdm points out (http://logs.glob.uno/?c=mozilla%23servo#c389143) that soldiering on in the presence of script thread failure can have nasty interactions with rooting, see #6462.

@asajeffrey
Copy link
Member Author

asajeffrey commented Mar 21, 2016

Not entirely sure what happened there, github reckons I unassigned @glennw, no such luck :)

@DemiMarie
Copy link

DemiMarie commented Mar 22, 2016

One approach is to try to use processes instead of threads whenever possible. In that case, there should be no (or much less) need to cleanup – the OS will handle that if the process just dies.

@asajeffrey
Copy link
Member Author

asajeffrey commented Mar 22, 2016

@drbo true, but there's still some clean-up required, e.g. when a pipeline crashes the constellation needs to update it's pipeline table, redirect the page to about:failure, etc.

@asajeffrey asajeffrey force-pushed the asajeffrey:remove-constellation-panic branch from a4c9bbf to 7bb3c08 Mar 22, 2016
@pcwalton
Copy link
Contributor

pcwalton commented Mar 22, 2016

Processes vs. threads is a red herring since we use unwinding when threads abnormally exit. The resource cleanup isn't the issue.

@asajeffrey asajeffrey force-pushed the asajeffrey:remove-constellation-panic branch 2 times, most recently from 0297acb to d188dd7 Mar 22, 2016
@asajeffrey
Copy link
Member Author

asajeffrey commented Mar 22, 2016

The latest commit removes the TODO(ajeffrey) comments, and adds debug messages in each case where pipeline/frame lookup fails.

@pcwalton: Is debug! enough, or do we want to complain louder than that? Note that many of the messages can fire in the case of unusual but not-actually-bugs event orders.

@bors-servo
Copy link
Contributor

bors-servo commented Mar 31, 2016

Trying commit f079391 with merge e33b804...

bors-servo added a commit that referenced this pull request Mar 31, 2016
Removed panicking when frame or pipeline lookup fails.

Removed the methods `pipeline(id)`, `pipeline_mut(id)`, `frame(id)` and `frame_mut(id)` from constellation, which panicked when the table lookup failed.

The panics were causing race conditions, e.g. visiting google.com and resizing the page would cause a panic, most likely due to an iframe being added and removed, with the `DOMLoad` event arriving after the iframe had been removed, causing a panic.

This patch fixes #10017 and #8769 (although in non-webrender builds there's now a different panic, see #10017 (comment)).

There are a few `TODO` items in the initial commit, for cases where it's not completely obvious what to do in the case of failure.

<!-- Reviewable:start -->
---
This change is [<img src="https://reviewable.io/review_button.svg" height="35" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/servo/servo/10082)
<!-- Reviewable:end -->
@bors-servo
Copy link
Contributor

bors-servo commented Mar 31, 2016

@glennw
Copy link
Member

glennw commented Mar 31, 2016

Review status: 0 of 1 files reviewed at latest revision, 10 unresolved discussions, some commit checks failed.


components/compositing/constellation.rs, line 267 [r7] (raw file):
Should these have a debug warning? It's probably not great if the frame tree iterator can't find a frame in the current tree.


Comments from the review on Reviewable.io

@asajeffrey asajeffrey force-pushed the asajeffrey:remove-constellation-panic branch from f079391 to df82a5b Mar 31, 2016
@asajeffrey
Copy link
Member Author

asajeffrey commented Mar 31, 2016

@glennw
Copy link
Member

glennw commented Mar 31, 2016

@bors-servo
Copy link
Contributor

bors-servo commented Mar 31, 2016

📌 Commit df82a5b has been approved by glennw

@bors-servo
Copy link
Contributor

bors-servo commented Mar 31, 2016

Testing commit df82a5b with merge 7518c4d...

bors-servo added a commit that referenced this pull request Mar 31, 2016
Removed panicking when frame or pipeline lookup fails.

Removed the methods `pipeline(id)`, `pipeline_mut(id)`, `frame(id)` and `frame_mut(id)` from constellation, which panicked when the table lookup failed.

The panics were causing race conditions, e.g. visiting google.com and resizing the page would cause a panic, most likely due to an iframe being added and removed, with the `DOMLoad` event arriving after the iframe had been removed, causing a panic.

This patch fixes #10017 and #8769 (although in non-webrender builds there's now a different panic, see #10017 (comment)).

There are a few `TODO` items in the initial commit, for cases where it's not completely obvious what to do in the case of failure.

<!-- Reviewable:start -->
---
This change is [<img src="https://reviewable.io/review_button.svg" height="35" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/servo/servo/10082)
<!-- Reviewable:end -->
@bors-servo
Copy link
Contributor

bors-servo commented Mar 31, 2016

@bors-servo bors-servo merged commit df82a5b into servo:master Mar 31, 2016
3 checks passed
3 checks passed
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
homu Test successful
Details
@paulrouget
Copy link
Contributor

paulrouget commented Apr 1, 2016

Can someone tl;dr this for me? What's the impact on browserhtml?

What happens for the legitimate "unable to find frame/pipeline" bugs? Do they get ignore? Are they reported via debug logs? Do they still panic Servo? If not, do they need to be reported with a crash message within the browser?

@jdm
Copy link
Member

jdm commented Apr 1, 2016

They are ignored, and should not bring down Servo any longer. If they don't appear in the terminal already, they can be surfaced via RUST_LOG=constellation.

The main impact on browser.html is that it should be harder for a panic that occurs in a script/layout/paint thread to bring down the rest of the browser now.

@paulrouget
Copy link
Contributor

paulrouget commented Apr 1, 2016

Ok. So we probably want to show that within browserhtml: #10334

@asajeffrey
Copy link
Member Author

asajeffrey commented Apr 1, 2016

We should think about what log level to issue these messages at. At the moment it's debug, but that probably should be increased.

nox added a commit to nox/servo that referenced this pull request Aug 10, 2016
@metajack metajack mentioned this pull request Aug 10, 2016
8 of 18 tasks complete
nox added a commit to nox/servo that referenced this pull request Aug 10, 2016
samuknet added a commit to samuknet/servo that referenced this pull request Sep 6, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

10 participants
You can’t perform that action at this time.