Prax freezes after one of multiple applications is killed by the monitor #82
Thanks for pointing this out @jacksonrayhamilton. It's good to know I'm not the only one experiencing these lockups. I spent quite some time trying to figure out the cause of this, together with my employer, so I will share all our findings here in case they are useful.

We get similar behaviour even without the app-killer

First, as you mentioned, the [...]

I have this small script that will make Prax deadlock 100% of the time:

#!/usr/bin/env ruby
RESPONSIVE_HOST = 'index.test'
FAILING_HOST = 'fail.test' # the app on this domain will crash while starting up (missing gems)
def get_webpage(host)
  puts "GET webpage from #{host}..."
  `curl -m 10 http://#{host}`
  puts 'done'
end
# ---------
get_webpage RESPONSIVE_HOST
# app gets started by Prax and responds with status 200
get_webpage RESPONSIVE_HOST
# app keeps responding to requests
get_webpage FAILING_HOST
# the failing app will crash and that will make Prax unresponsive, ...
# as we will see with the next statement
sleep 5
get_webpage RESPONSIVE_HOST
# even the responsive app will not be able to respond anymore

Note that you need to have at least one (non-failing) app running for the problem to occur. This will not result in a lock:

# [...]
# ---------
get_webpage FAILING_HOST
# the failing app will crash but that will NOT make Prax unresponsive, ...
get_webpage FAILING_HOST
# the failing app will crash but that will NOT make Prax unresponsive, ...
# repeat this as often as you like
get_webpage RESPONSIVE_HOST
# everything should still be fine

This, to me, seems to be the same situation that you described, where the app-killer kills all applications at once. It seems that Prax becomes unresponsive not only when an application process is killed, but also when an application process simply stops. (Maybe only with an unsuccessful exit status? 🤔)

Deadlock?

We overloaded the whole Prax source with [...] Now which fiber could that be? Could it be a fiber that's managing the connection to the RESPONSIVE_HOST? That sounds plausible, because without it we cannot seem to achieve deadlock. On the other hand, how can it be that it isn't locking without a crashing app? It seems logical that it must give other fibers a chance to run, otherwise you could never open a second connection or run the app-killer fiber. Could it be that [...] I think it should never happen that a fiber will freeze up forever after a call to [...]

(A silly idea for debugging: can I maybe somehow label the fibers to find the culprit? Do I have to recompile the whole compiler for that?)

Prax did not have this problem a year ago

I have this PC at home on which I am using an older version of Prax, and I cannot reproduce this issue on it. I compiled it on 2017-06-29 using my Arch Linux PKGBUILD. I would have to do some research to find out exactly which versions of Prax and Crystal were used, but they were the most recent ones at the time.

The problem seems to be independent of the operating system

I suspected that Arch Linux changed some system libraries between summer 2017 and September 2018 that could have caused the error. I compiled Prax on macOS (a BSD with completely different libs, right?) only to find that I can easily reproduce the problem there too. It seems we can rule out the OS.
Wow, this is some thorough investigation! Thanks a lot, especially for the reproducible scenarios. I'll try to understand the issue when I get some free time.

Indeed, @tijn, you can name Crystal fibers, using [...]

Note that Crystal is single-threaded: the event loop runs on a single thread (the app has threads, but they're all created by the garbage collector). The event loop can be locked, blocking all fibers from running, if one fiber uses a blocking C syscall or ends up in a busy loop (such as [...])
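To make the point about one blocking call stalling every fiber concrete, here is a rough analogy in Ruby rather than Crystal. It is only a sketch: run_round_robin is a hypothetical helper, not Crystal's scheduler or anything from Prax, and it assumes plain Ruby fibers with no fiber scheduler installed, so that sleep blocks the whole thread.

```ruby
# Runs each task in its own fiber, round-robin, on the current thread.
# A task hands control back by calling Fiber.yield; nothing can preempt it.
def run_round_robin(tasks)
  fibers = tasks.map { |task| Fiber.new { task.call; :done } }
  # Resume every fiber once per pass and drop the ones that have finished.
  fibers.reject! { |fiber| fiber.resume == :done } until fibers.empty?
end

log = []
# A cooperative task: does a little work, then yields to the others.
polite  = -> { 3.times { |i| log << "polite #{i}"; Fiber.yield } }
# A task that makes a blocking call without yielding (stand-in for a blocking syscall).
blocker = -> { log << "blocker start"; sleep 0.2; log << "blocker end" }

run_round_robin([polite, blocker])
puts log
```

While the blocker sleeps, the polite task cannot make progress: its remaining steps only appear in the log after "blocker end". A call that never returns would stall the other fibers forever, which is the kind of freeze described in this thread.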
Maybe this is related: https://github.com/ysbaddaden/prax.cr/blob/master/src/prax.cr#L50-L51
We reworked how SIGCHLD is handled in Crystal some months ago.
Well, that was spot-on! Removing the call to [...] I rewrote the signal handler to use [...]

Signal::CHLD.trap do
  loop do
    code = LibC.waitpid(-1, out exit_code, LibC::WNOHANG)
    STDERR.puts "SIGCHLD #{code} #{exit_code} #{Errno.value}"
    if code == -1 && Errno.value == Errno::EINTR
      # FIXME: is this right?
      sleep 0.1 # sleep and continue the loop until there is a proper return value
    else
      break
    end
  end
end

I first had a version that didn't check [...]
I didn't notice the missing WNOHANG in the waitpid call, which made it a blocking call. The loop is correct, but I should check the manpage. I'm fairly sure we can just discard the return value, or just loop [...]
Pushed the comment button inadvertently... I think we can loop until it returns -1 and errno is EAGAIN, which means it would have blocked (no child process to reap). But I haven't verified the manpage yet.
Fixed in 966c903. Crystal's runtime now reaps child processes before delegating to any custom SIGCHLD handler, so trying to reap again blocked the process. Even using WNOHANG would be bad, since Crystal's runtime expects to reap child processes itself. Thanks for helping debug this, guys! I should have known: I rewrote how SIGCHLD is handled in Crystal's runtime, which ended up breaking Prax. Sigh.
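For reference, a minimal Ruby sketch of non-blocking reaping in general. It is not the fix applied in 966c903 (that fix stops reaping in the handler, because Crystal's runtime already does it), and reap_exited_children is a hypothetical helper name. With WNOHANG, POSIX waitpid returns 0 when children exist but none has exited yet, and fails with ECHILD when there are no children at all. Ruby surfaces these two cases as nil and Errno::ECHILD.

```ruby
# Reaps every child process that has already exited, without ever blocking.
# Returns the list of reaped pids (empty if nothing was ready).
def reap_exited_children
  reaped = []
  loop do
    # -1 means "any child"; WNOHANG makes the call return immediately.
    pid = Process.waitpid(-1, Process::WNOHANG)
    break if pid.nil? # children exist, but none has exited yet
    reaped << pid
  end
  reaped
rescue Errno::ECHILD
  reaped # no child processes left to wait for
end
```

Calling it with no children returns an empty array straight away, whereas a plain Process.waitpid(-1) with a still-running child would block the caller until that child exits.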
Fixed the "reaper skips an app" issue in 6973ce4.
Thanks, the issue was fixed for me.
First off, it seems like the monitor needs to duplicate Prax.applications when iterating over it, otherwise deleting an element early in the array can cause later elements to not be iterated over.

You can see that by running the following code in crystal play:

[...]

b will be [1, 2]. If you change a.each to a.dup.each, then b will be [1, 2, 3].

With that out of the way, onto my main issue. I’ve noticed that if I have multiple applications running and one of my applications becomes inactive and the monitor kills it, then after the monitor goes to sleep again, Prax becomes completely unresponsive. I can’t make any new requests (all new requests just hang), and the monitor does not appear to run again after DELAY seconds.

An additional bit of information: when I quit Prax by typing ^C, I see the following printed to my terminal as Prax attempts to kill the other application (the one that it deemed suitable to keep alive because it was “active” earlier):

[...]

Another bit of info: in another scenario, if all of my applications are deemed inactive and killed at the same time, then Prax does not freeze up.

My first guess was that, with the monitor running in its own thread, running delete on a class variable would cause another thread accessing the shared Prax.applications value to explode. However, according to the Crystal docs this seems to be okay to do for now: https://crystal-lang.org/docs/guides/concurrency.html#communicating-data

My next guess would be that the act of killing an application process causes one fiber to hang, which prevents any other fibers from executing after the monitor goes to sleep. However, this seems to be contradicted by the case where the monitor kills all of the applications simultaneously and things still work.
Your expertise would be much appreciated here.
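The iteration pitfall from the first part of this report (deleting from an array while iterating over it) can be reproduced in a few lines. The following is a Ruby sketch, not the crystal play snippet. Ruby's Array#each skips elements in the same way for this case, and the helper name, the sample values, and the deletion condition are assumptions made for illustration.

```ruby
# Collects the elements that each visits while one element is deleted mid-iteration.
def visited_while_deleting(array, iterate_over_copy:)
  visited = []
  source = iterate_over_copy ? array.dup : array
  source.each do |x|
    visited << x
    array.delete(x) if x == 2 # deleting shifts the later elements one slot left
  end
  visited
end

p visited_while_deleting([1, 2, 3], iterate_over_copy: false) # => [1, 2]
p visited_while_deleting([1, 2, 3], iterate_over_copy: true)  # => [1, 2, 3]
```

Iterating over a copy leaves the deletion itself untouched; only the traversal is protected, which is why duplicating the collection before the loop is enough to visit every application.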