Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hardening more of Servo against panic? #24167

Open
asajeffrey opened this issue Sep 10, 2019 · 8 comments
Open

Hardening more of Servo against panic? #24167

asajeffrey opened this issue Sep 10, 2019 · 8 comments
Labels
I-panic Servo encounters a panic.

Comments

@asajeffrey
Copy link
Member

At the moment, most of Servo will panic under error conditions, some which (e.g. GL errors) are in crates that are out of our control.

This is appropriate behaviour for use-cases where we'd like to see the bug reports, but not necessarily what end-users want. Should we provide some hardening for use-cases where it's more appropriate to log the error and try to recover? Perhaps behind a pref or feature gate?

@asajeffrey asajeffrey added the I-panic Servo encounters a panic. label Sep 10, 2019
@asajeffrey
Copy link
Member Author

And before @nox jumps in to say it, the Erlang philosophy of let-the-processes-crash-and-have-a-small-trusted-service-to-restart-them runs into problems in the case of services like our script threads, which have libraries like SpiderMonkey that have lots of in-memory unserializable state. :(

@nox
Copy link
Contributor

nox commented Sep 10, 2019

I don't understand what that means @asajeffrey. Killing and restarting gets rid of in-memory state, that's the point. We should make sure to not panic in places where SM wouldn't be able to recover, but making script threads panic shouldn't bring down Servo.

The point of Erlang is that leaves crashing don't bring down your core, it's not so much about restarting in a way that perfectly restores the state. For example, a Web server written in Erlang shouldn't go down just because a request provoked a crash, the process handling the request will crash and not be restarted and the rest of the system will happily continue to do whatever it is doing.

@larsbergstrom
Copy link
Contributor

One other wrinkle that we've had around panic is the delayed-panic due to channel-based communication. That is, if the receive end of a channel is in a thread that panics but we do not tear down, the thread with a send end will panic at the moment that they attempt to send a message. IPC channels may have been hardened against this, but my memory is fuzzy.

I feel like there are several issues here:

  1. Libraries doing excessive checking and panicing, assuming that consumers are "handling" an issue like getting a different-but-compatible surface type back from OpenGL, and we should fix those. Panic shouldn't be a part of the general course of action.
  2. Panic that is recoverable, such as a loading a web page that exhausts resources or causes the system to go off the rails in a way that we know is isolated to a single tab/window (previously Constellation, though idk if that's changed in years gone by) and should be able to be tossed out the window and restarted.
  3. abort abort the whole ship is going down!

I'm not sure how far we will get with a custom panic handler with log output and resume, but I'd definitely be interested to find out, too!

@gterzian
Copy link
Member

gterzian commented Sep 10, 2019

One other wrinkle that we've had around panic is the delayed-panic due to channel-based communication. That is, if the receive end of a channel is in a thread that panics but we do not tear down, the thread with a send end will panic at the moment that they attempt to send a message.

An IpcSender returns a Result, and I think you get the Error if the receiver is disconnected, although it's not documented: https://doc.servo.org/ipc_channel/ipc/struct.IpcSender.html

In the constellation we take care never to unwrap or expect on those sends, I think, yet I don't think do we do anything to restart a script-process if it has apparently crashed.

It might be easier to harden Servo in a coarse grain manner, at the level of processes, versus trying to catch panics and do something about it inside processes.

I personally like hard expectations where they're possible, like script, because if the tests on CI passes, I don't go in to read the logs to see if there perhaps was a warning. I do look when there's a panic.

A central component like the constellation is probably best placed to restart other component if they fail, and for example script should probably not include logic "restarting WebGL" or something similar. It could on the other end send a message to the constellation if a send to WegGL fails for example, and then panic itself(or run the "shutdown script-thread logic").

isolated to a single tab/window (previously Constellation, though idk if that's changed in years gone by)

The constellation is now basically the main process of the browser, managing all tabs that in multiprocess mode run in their own content process.

See

browsing_context_group_set: HashMap<BrowsingContextGroupId, BrowsingContextGroup>,

@larsbergstrom
Copy link
Contributor

It would be good to write up how we handle the tab-crashed scenario w/o tearing down the whole browser. Based on every other system where I've had teams handle this scenario (e.g., addons in visual studio; user code execution in an interpreter; etc.), I'm a bit nervous if we don't have an architectural boundary for this and are relying on developer heroics to handle an edge case on every channel that may or may not correspond to a crashed tab boundary. Not just for continuing to run, but for ensuring we don't have zombie thread cycles off continuing to run some poor abandoned portion of the code.

Apologies if this is already nailed down and I'm asking for things that have already been done; I'll admit that I (like the wiki!) have not been keeping fully up to date with all changes, though I do try to at least skim all PRs and issues that go through.

@asajeffrey
Copy link
Member Author

@nox: the problem is that (at least at the moment) if a script thread panics, it tears down the thread, and there's nothing we can do to have SM recover from that. We either need to not panic in the first place, or have a panic handler that tries to keep the script thread alive.

@asajeffrey
Copy link
Member Author

@larsbergstrom what I'd dearly love is to have some static analysis that ensures panic-freedom, but we're probably quite a way out from that, at least partly due to third-party code.

The constellation and compositor are meant to be hardened, so in the worst case we should be able to show a crash report page. Not that this will make end-users super-happy.

@gterzian
Copy link
Member

gterzian commented Sep 11, 2019

It would be good to write up how we handle the tab-crashed scenario w/o tearing down the whole browser.

So currently we don't actually do it, and out of the top of my head, I can imagine the following workflow:

  1. We need a way to catch all panics happening in a content/script process.

We have something like that at the embedder level, but it's not going to work in multiprocess mode, since it's only setup at the "main process". See

fn install_crash_handler() {

  1. When a panic occur in script, I think we want to try the "clean shutdown" of a script-thread(read: process), and if it fails, do a process::abort().

See

fn handle_exit_script_thread_msg(&self) {

  1. Then, we need a way for the constellation to reliably become aware of 2 happening in any content-process. This could be done:

    • If the process was able to do a "clean shutdown" following a panic, simply by sending a message to the constellation.
    • Inside the constellation, by handling an error on the various event_loop.send, but that means there might be a delay between an content-process crashing, and the constellation become aware of it by trying to send it a message.
    • By using perhaps the background-hang-monitor, which has an instance running in each conent-process, to send a regular ping to the constellation. Although not hearing from a monitor might also imply it's busy resolving a backtrace for a hanging thread(this might be a reason to move the resolving to a worker thread so that the "main thread" of the monitor is always responsive unless the process crashed), so it would require some additional checks.
  2. The constellation does keep a Pipeline for each document running in a content-process, and that has an ServoUrl. So when the constellation has determined that a tab crashed, it could:

    • show an error page if the tab is the one in front,
    • ask the user whether to reload the tab via a message to the embedder, and
    • if yes reload all the pipelines in a newly created content-process.

Actually, I think it would be more correct, and much easier, to only reload the top-level pipeline, since that will then load any child pipelines "normally", and instead of trying to somehow re-create the exact state of the page when it crashed, we're just reloading the page(it might actually have changed in the meantime and not contain the same iframes, especially if those frames were ads).

If it's an iframe for a different origin that crashes(in a different process from the page in which it is embedded), do we reload it?

  1. There's very likely a million details not covered here(if we reload an auxiliary tab, do we want to reload it as such or treat is as a completely new one? New one probably, since the reload happens in response to user input, the "do you want to reload the tab" flow, not script opening a tab like the original auxiliary load). And this covers only the "tab crash" scenario, not crashes of other components like the image-cache, networking, and so on...

  2. What "reloading a top-level browsing-context" means might even be something to address at the spec level. I'm not aware of any such parts of the spec, although there might be some already. Things like "what to do if it's an auxiliary" . What to do if it's not a top-level BC but instead a service or shared worker that crashes? Those two would be running in their own "agent cluster", read: process, hence their crashes would not bring down an actual tab, yet it might prevent the tab from doing what it wants to do. An SW could crash while no tabs are open, for example in response to a server notification. Do you then queue some sort of crash notification to be delivered next time the origin is loaded in a tab, so that it can reload the worker? Do you just reload it immediately? I've opened Agent cluster and crash recovery: specifications? whatwg/html#4901

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I-panic Servo encounters a panic.
Projects
None yet
Development

No branches or pull requests

4 participants