Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle Interruption After Fatal Error In FiberContext #4812

Merged
merged 4 commits into from Mar 17, 2021

Conversation

adamgfraser
Copy link
Contributor

Resolves #4810.

The cause of the issue is that when we attempt to interrupt the main fiber in the shutdown hook added by App we suspend indefinitely.

My analysis is that when we call interrupt it invokes kill0 in FiberContext, which if the fiber is still in an Executing state adds an interrupted cause and then calls await, which if the fiber is in an Executing state registers itself as an observer of the fiber and waits for the fiber to be done. Normally this works fine because adding the interrupted cause leads to the fiber entering a Done state and notifying the observers, allowing the interruption to successfully return.

However, when a fatal exception is thrown we stop interpreting further continuations in the main run loop for that fiber, so we never get to a Done state and never notify the observers of that fiber, hence the interruption hanging forever.

To address that, this PR implements a new fatal method that sets the state to Done and notifies any pending observers at that point.

@jdegoes
Copy link
Member

jdegoes commented Mar 17, 2021

The only issue here is that during fatal exception, we cannot guarantee success of the notifyObservers method. In fact, it may allocate more memory by trying to execute more effects, which can compound the underlying problem.

Another possible solution is to capture the fact that we are dealing with a fatal error somewhere, and in that event, do not try to interrupt the fiber in the shutdown hook.

@adamgfraser
Copy link
Contributor Author

Maybe a variant of that is we just set a flag in FiberContext that we are in a fatal state and then on interruption we check that flag and if it is set we just return immediately and don't do anything since there is no way to execute a graceful shutdown at this?

@jdegoes
Copy link
Member

jdegoes commented Mar 17, 2021

Well, it's a descendant fiber that had the fatal error. The top-level fiber doesn't have that problem.

The other thing we could decide to do is just "try" to do cleanup and the reportFatal later, after the cleanup is done. I think users would appreciate best effort but it is difficult to avoid swallowing the fatal error if you do anything too fancy.

@jdegoes
Copy link
Member

jdegoes commented Mar 17, 2021

Ah, actually, your suggestion might work: although it assumes the top-level fiber successfully interrupts everything, all the way down to the fiber that had the problem.

Basically, it's the whole "interrupt root and have your app shut down gracefully" that I question will actually work reliably in a real-life fatal error: because running all those effects, waiting for interruption, etc., involves a ton of allocation and thread pool work, all of which are not guaranteed to happen in the presence of a fatal error.

@adamgfraser
Copy link
Contributor Author

Yes I totally agree. I feel like there is a risk that we optimize for these examples where a fatal exception is thrown but we are not actually in a fatal state and if we are in a fatal state nothing we do may matter much.

@jdegoes
Copy link
Member

jdegoes commented Mar 17, 2021

All right, well, what if FiberContext.fatal were an AtomicBoolean, which was set when a Fiber encountered a fatal error. Then the interrupt hook could check that (zero allocations) before deciding to call fiber.interrupt.

If it were set, it could also print out some kind of message using println, like:

**** WARNING ***
Catastrophic JVM error encountered. Application not safely interrupted. Resources may be leaked. Check the logs for more details and consider overriding `Platform.reportFatal` to capture context.

@adamgfraser
Copy link
Contributor Author

Yes, I think that makes sense. I will revise accordingly.

@jdegoes jdegoes merged commit 82c8040 into zio:master Mar 17, 2021
@adamgfraser adamgfraser deleted the 4810 branch March 17, 2021 22:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

zio.App does not exit if fatal error occurs
2 participants