Allow script aborting? #401

Open
domenic opened this Issue Feb 22, 2016 · 17 comments

Member

domenic commented Feb 22, 2016

Right now multiple environments provide the ability to abort a script mid-run, which the spec does not allow. For example, Node.js's vm module allows timeouts on script execution, and its process.abort() and process.exit() functions can interrupt script execution from inside script.

HTML has a whole section on this, as aborting scripts happens fairly often: the infamous "slow script dialog", but also e.g. page navigation, closing a window, or disabling JavaScript in the browser (these days through extensions) mid-run. And, perhaps most explicitly, the worker.terminate() method.

In whatwg/html@6a48bfb we made this a little more formal by noting exactly what this impacts and where it is called: go to https://html.spec.whatwg.org/#abort-a-running-script, click the words "abort a running script", and you can see all callers and referencers. We added the statement:

> Although the JavaScript specification does not account for this possibility, it's sometimes necessary to abort a running script. This causes any ScriptEvaluation or ModuleEvaluation to cease immediately, emptying the JavaScript execution context stack without triggering any of the normal mechanisms like finally blocks. [ECMA262]
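As a rough illustration of the behavior described above, here is a minimal sketch using the Node.js vm module's timeout option. The script source and the `log` array are made up for the example; the expectation that neither `catch` nor `finally` runs reflects how V8-based Node.js behaves and is not something the JavaScript specification defines.

```javascript
const vm = require("node:vm");

// Shared array so the sandboxed script can record which blocks it reached.
const log = [];

// The script never finishes on its own. Its catch and finally blocks would
// run if the loop were left through an ordinary exception.
const source = `
  log.push("start");
  try {
    while (true) {}
  } catch (e) {
    log.push("catch");
  } finally {
    log.push("finally");
  }
`;

let abortCode = null;
try {
  // The timeout option asks the host to abort the script after 50 ms.
  vm.runInNewContext(source, { log }, { timeout: 50 });
} catch (err) {
  // The abort surfaces in the calling code as an ordinary, catchable Error.
  abortCode = err.code;
}

console.log(abortCode);
console.log(log);
```

Inside the aborted script the termination behaves like an uncatchable exception, while the code that started the script sees a normal error and keeps running.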

It's probably a good idea for ES to explicitly mention this possibility in its spec, and maybe make it more formal. What do you think?

bterlson Feb 22, 2016

Member

I agree that this should be made more formal. AFAICT this is also morally equivalent to specifying a semantics for non-user-initiated aborts like OOM. I would like to see a proposal for this, although I suspect there may be a lot of subtle issues here. For example, from a security perspective, AFAICT, the right thing to do is to nuke the entire vat when any of this occurs, because it's impossible to enforce that a program's invariants are intact (e.g. imagine termination in the middle of allocating a doubly-linked list).

domenic Feb 22, 2016

Member

> For example, from a security perspective, AFAICT, the right thing to do is to nuke the entire vat when any of this occurs, because it's impossible to enforce that a program's invariants are intact (e.g. imagine termination in the middle of allocating a doubly-linked list).

While this is probably true, I am not sure that is what browsers do, and it's definitely not what Node's vm module does. @bzbarsky indicated in whatwg/html#715 that the semantics are more equivalent to an uncatchable (and unfinallyable) exception. At least, the slow script dialog is. I suppose worker termination/closing a window/cross-origin page navigation nuke the vat. (Not sure about same-origin navigation or disabling JS.)

bterlson Feb 22, 2016

Member

I am also pretty sure this is not what browsers do (uncatchable exception seems right, except for OOM which I think some of us let you catch?). But, I suspect it would be difficult to advance a proposal through TC39 that specifies an insecure semantics. On the other hand, it might be possible to convince browsers and node that they should do "the right thing"? Anyway, I don't have a strong opinion on this yet, just putting the data out there. I really do want to see these semantics specified (see also: https://twitter.com/bterlson/status/695339622893690882).

bterlson Feb 22, 2016

Member

Perhaps @erights can comment on this issue.

bzbarsky Feb 22, 2016

I can't speak intelligently to this without knowing what definition of "vat" is being used here.

The only cases in Gecko that involve uncatchable exceptions are the slow script handling and worker termination, I believe. Everything else is catchable: that includes OOM, out of stack, and probably other weird things.

Oh, and I guess there are some OOM cases, certainly outside the JS implementation, but possibly also inside it, that abort the process on OOM. I doubt whatever qualifies as "a process" in browsers corresponds to whatever "vat" means in ES.
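To make the "catchable" side of this distinction concrete, here is a small sketch of stack exhaustion being caught from script. The function name is invented for the example, and the exact error type differs between engines (V8 reports a RangeError, for instance), so the sketch only checks for a generic Error.

```javascript
// Recurse without a base case until the engine runs out of stack.
function recurseForever(depth) {
  return recurseForever(depth + 1) + 1;
}

let caught = null;
let finallyRan = false;
try {
  recurseForever(0);
} catch (err) {
  // Unlike a slow-script abort, this failure is visible to the script.
  caught = err;
} finally {
  finallyRan = true;
}

console.log(caught instanceof Error, finallyRan);
```

Here both the catch and the finally block run, which is exactly what does not happen for the uncatchable cases discussed in this thread.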

domenic Feb 22, 2016

Member

Vat ~ HTML event loop

bzbarsky Feb 22, 2016

In that case I doubt browsers would be willing to kill the whole vat.

lars-t-hansen Mar 22, 2016

Contributor

FWIW, and related to this but not the same, the agents spec proposes a rule here that I'd love some feedback on. It pertains to agent clusters, which are collections of agents that can share memory (prototypically, a web page plus its dedicated workers), and the rule states that if the embedding forcibly terminates one of the agents in the cluster it must either (a) forcibly terminate all of them or (b) provide a reliable signaling mechanism so that the other agents can detect the termination. The motivation for this is avoiding deadlocks if one agent is killed.
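The two options in that rule can be sketched as a toy model. Everything here (`Agent`, `AgentCluster`, the policy names) is hypothetical and exists only to show the shape of the rule; it is not an API from the agents spec or from any engine.

```javascript
// Toy model only: these classes are hypothetical, not a real API.
class Agent {
  constructor(name) {
    this.name = name;
    this.alive = true;
    this.notices = []; // names of peers this agent was told have been terminated
  }
}

class AgentCluster {
  constructor(agents) {
    this.agents = agents;
  }

  // Called when the embedding forcibly terminates `victim`.
  // "terminate-all" models option (a); "notify" models option (b).
  terminate(victim, policy) {
    if (policy !== "terminate-all" && policy !== "notify") {
      throw new Error(`unknown policy: ${policy}`);
    }
    victim.alive = false;
    for (const agent of this.agents) {
      if (agent === victim) continue;
      if (policy === "terminate-all") {
        agent.alive = false;
      } else {
        agent.notices.push(victim.name);
      }
    }
  }
}

const page = new Agent("page");
const worker = new Agent("worker");
new AgentCluster([page, worker]).terminate(worker, "notify");
console.log(page.alive, page.notices);
```

Under "notify" the surviving agent keeps running but has a reliable record that its peer is gone, which is what would let it avoid waiting forever on that peer.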

I don't actually expect that rule to survive as-is -- termination is not the same as throwing an uncatchable exception, but the latter could create a deadlock situation just as easily as the former -- but some of the discussions I found indicate that the lack of termination discovery mechanisms is a fairly significant pain on the web. Also see my third point on this HTML bug.

@domenic, I thought a Vat was more like a collection of HTML event loops that could communicate by messages?

domenic Mar 22, 2016

Member

I'm not sure what the latest is on Mark's thoughts on how the concept of vats from E translates to the web. But in this context we are referring to an event loop. I would prefer we just use that term going forward to be precise (cough @bterlson cough).

bzbarsky Mar 22, 2016

Given that there is an explicit API to forcibly terminate a worker, seems to me like "forcibly terminate all of them" is not an option.

erights Mar 22, 2016

First on terminology, a "vat" is precisely enough defined. Prior to SAB, a browser's various event loops (workers) were vats. But with the introduction of SAB, "vat" no longer applies to the browser. An agent is an event loop, but it is not a vat because it is synchronously coupled to other agents. An agent cluster is only asynchronously coupled to other agent clusters, but it is not a vat because it has internal concurrency.

So no "vat"s in the browser. What is the difference between "agent" and "event loop"? When should we say one vs the other?

(I will comment on the substantive issue soon)

lars-t-hansen Mar 22, 2016

Contributor

@bzbarsky

> Given that there is an explicit API to forcibly terminate a worker, seems to me like "forcibly terminate all of them" is not an option.

That's why I wrote "... if the embedding forcibly terminates one of the agents..." attempting to distinguish this from anything the program might do to itself. That line is super fine but at least the idea is that if programmatic action does something to the program, then fine, we assume the program knows what's going on (though I realize this knowledge can be hard to disseminate), but if the embedding does something without the program's distributed knowledge then it would be better if the program could know it. (This probably applies to the slow-script timeout too; a post-hoc notification, e.g. an event, is better than silence. Probably.)

erights Mar 22, 2016

On unpredictable errors, all synchronously coupled state should normally be assumed corrupt. Erlang's "fail only" strategy works so beautifully because processes are fine-grained and are only asynchronously coupled to other processes. Once an unpredictable error happens, the process it happens in is in a confused state, and so is the last one that should be asked to engage in any recovery action. It is the least likely to be able to do so correctly. This applies to OOM and any other unpredictable resource exhaustion, including any externally imposed timeout that can stop a process between any two instructions.

I remain hopeful that we can exceed this common wisdom eventually with agents vs agent clusters. If we don't do anything clever, then we must terminate an agent cluster all at once. But since agents within a cluster are synchronously coupled only via SAB, I remain hopeful that we can find a way to kill one agent and let other agents in the cluster, somehow, continue.

Earlier it was stated that sudden detachment of an SAB on termination of one of its participants would be racy and hazardous. Since there is a known hazard, I agree we should steer clear of it. But I still don't understand that hazard. Does someone have a pointer to something I can read on this?

lars-t-hansen Mar 22, 2016

Contributor

@erights, just thinking out loud here:

The scenario is that an agent fails (is stopped for termination) and the embedding wishes to detach the memory of all SABs that that agent shares with any other agent in its cluster "quickly" to minimize the chance of the other agents "going wrong".

The other agents may be in various states: they may be at their event loop, they may be using the SAB actively from the interpreter and from jitted code, and they may be in the process of sending or receiving the SAB on a channel. Some of the agents may be blocked in futexWait (on the shared memory to be detached, or other shared memory), they may be computing in the DOM or elsewhere in the browser or its libraries (also on the shared memory to be detached), they may be blocked in the OS's scheduler, etc. In addition some of the SABs may be in flight between agents, indeed structured clone may be in the process of cloning the SAB. Agents that are currently computing may take a "long" time before they return (from JS to the event loop / from the DOM to JS / from GL to DOM / ...), in some cases they may not return at all (eg a worker running its own event loop on top of shared memory communication).

To detach the SAB safely means to perform some operation on the SAB in each of the agents so that the agent can recognize that the SAB has been detached, without risking any agent reading deallocated data, using a stale pointer, or similar problem. The timeliness requirement ("quickly") means that we can't wait until the agent is at its event loop.

If the embedding wishes to manipulate the SABs and their underlying data structures while the agents that own those SABs are running then the data structures must be thread-safe, for one thing. I think it should be apparent that there are many hazards here, given the scenarios outlined above. To avoid those we'll require some kind of critical section around DOM calls, for example. (May be affordable, probably isn't, might be optimized.) For performance reasons we won't have those critical sections in JS code, so in JS code the embedding must do something else (e.g. atomically clearing a pointer might be ideal), but it's unclear what that action would be and how the embedding would choose what to do given that the remote agent is actively running. Generally it would seem to involve more synchronization. There's also a question of exactly when agreement has been reached among all the agents and the buffer memory can be freed; this will be sometime after all the agents have detached.

The cheapest, safest way to avoid that mess is probably for the embedding to set a flag on the memory of each SAB that is to be detached (probably on the buffer's header), and to check the flag every time the SAB's buffer is extracted for use, and if the system is in a safe state (in JS code and not nested within anything except the event loop?) I expect the thread could perform the local detachment if the flag is set. In practice that check may be too expensive so the flag could be checked less often (call + loop back-edge, or on a timer interrupt; an interrupt could even be triggered for this purpose). I think this is not hazardous, per se. (It is still more expensive than the detachment check for ArrayBuffer, probably, which IIUC depends on invalidating JIT code from the detach operation, but that works because it is all intra-thread.)
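The flag-checking idea can be approximated in user-land JavaScript as a sketch. This is only an analogy: the proposal above concerns engine internals, whereas the `GuardedSharedBuffer` class, the choice of word 0 as the flag, and the `requestDetach` helper are all assumptions invented for the illustration.

```javascript
// Hypothetical layout: word 0 of the shared buffer is a "detach requested"
// flag; the payload starts at word 1.
const FLAG_INDEX = 0;
const DETACH_REQUESTED = 1;

class GuardedSharedBuffer {
  constructor(sab) {
    this.words = new Int32Array(sab);
    this.detachedLocally = false;
  }

  // Stand-in for the embedding marking the buffer for detachment.
  static requestDetach(sab) {
    Atomics.store(new Int32Array(sab), FLAG_INDEX, DETACH_REQUESTED);
  }

  // Stand-in for the check made each time the buffer is extracted for use.
  payload() {
    if (!this.detachedLocally &&
        Atomics.load(this.words, FLAG_INDEX) === DETACH_REQUESTED) {
      this.detachedLocally = true; // local detachment at a safe point
    }
    return this.detachedLocally ? null : this.words.subarray(1);
  }
}

const sab = new SharedArrayBuffer(16); // four 32-bit words
const view = new GuardedSharedBuffer(sab);
const before = view.payload(); // usable: flag not yet set
GuardedSharedBuffer.requestDetach(sab);
const after = view.payload(); // null: this agent has detached locally
console.log(before.length, after);
```

Note that `before` is still a live view after the flag is set, which mirrors the point made above: the detachment only takes effect at the next check, not instantaneously.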

But I question the premise - what are we achieving here? The shared memory disappears, to be sure, but all that does is confuse the program even more; its invariants about available memory and so on are all gone. OOB reads on the shared memory return undefined and OOB writes silently fail. An agent is able to futexWait on shared memory that will become detached in another agent before the latter agent can wake the former (whatever we do, detachment cannot be instantaneous in all agents and these races will be typical).
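A short sketch of the two behaviors mentioned here, under some assumptions: it uses Atomics.wait, the name under which futexWait later shipped, it runs in Node.js (browsers do not allow blocking waits on the main thread), and the 50 ms timeout is an addition for the example so that the wait returns at all.

```javascript
const shared = new Int32Array(new SharedArrayBuffer(8)); // two 32-bit words

// Out-of-bounds read: no exception, just undefined.
const oobRead = shared[10];

// Out-of-bounds write: silently ignored, the array is unchanged.
shared[10] = 123;
const lengthAfterWrite = shared.length;

// A waiter whose peer never wakes it. Without the timeout argument this
// call would block forever, which is the hazard for the waiting agent.
const waitResult = Atomics.wait(shared, 0, 0, 50);

console.log(oobRead, lengthAfterWrite, waitResult);
```

None of these operations reports an error, so a program whose shared memory vanished would get no direct signal that anything had gone wrong.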

cc @lukewagner

lukewagner Mar 23, 2016

I happened to be experimenting with this recently and noticed that, in FF, if an inner same-origin iframe is slow-script-stopped, the outer window continues to run, but, in Chrome, the outer window also stops. While the latter seems abstractly safer (being same-origin, the outer window could easily be in a corrupt state as well), FF's behavior was actually more useful for the use case that prompted the experiment: to have the outer window observe the "crash" of the inner window (say, via heartbeat) and notify the server.
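The heartbeat idea might look roughly like the sketch below. `HeartbeatMonitor`, its method names, and the 1000 ms threshold are hypothetical; a simulated clock stands in for real timers so the example is deterministic.

```javascript
// Hypothetical helper: the outer window records each heartbeat it receives
// from the inner window and treats a long silence as a crash.
class HeartbeatMonitor {
  constructor(timeoutMs, now = () => Date.now()) {
    this.timeoutMs = timeoutMs;
    this.now = now;
    this.lastBeat = now();
  }

  // Called whenever a heartbeat arrives from the inner window.
  beat() {
    this.lastBeat = this.now();
  }

  // True while the last heartbeat is recent enough.
  isPeerAlive() {
    return this.now() - this.lastBeat <= this.timeoutMs;
  }
}

// Simulated clock, in milliseconds.
let clock = 0;
const monitor = new HeartbeatMonitor(1000, () => clock);

clock = 500;
monitor.beat();

clock = 1200;
const aliveAt1200 = monitor.isPeerAlive(); // 700 ms since the last beat

clock = 2000;
const aliveAt2000 = monitor.isPeerAlive(); // 1500 ms of silence

console.log(aliveAt1200, aliveAt2000);
```

In a page, the inner window would presumably send its heartbeats by message on a timer, and the outer window would call `beat()` from its message handler and report to the server once `isPeerAlive()` turns false.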

Based on this, the abort-script issue feels symmetric to our discussion of the SAB-termination issue in shmem/#55: even though a graph of {same-origin windows, agents} might be in an incoherent state due to one of its nodes being externally stopped/killed, the program might be able to recover (even if "recover" only means "report an error and gracefully quit").

One compromise might be to default to "safe" (killing one node of the graph (for some definition of "the graph") kills every node in the graph) and require opt-in to notification instead of killing. To be clear, I'm talking about killing whole agents, which avoids the hazards Lars described above, not detachment of individual SABs. (Sorry, don't mean to scope-creep the original issue here.)

bzbarsky Mar 24, 2016

> the outer window continues to run

More precisely, slow-script-stopping a window doesn't prevent async execution of things in either that window or other windows in Firefox, last I checked.

lukewagner Mar 26, 2016

@bzbarsky Ah, I was testing with just a setTimeout loop.
