
Prax freezes after one of multiple applications is killed by the monitor #82

Closed
jacksonrayhamilton opened this issue Nov 26, 2018 · 9 comments


@jacksonrayhamilton (Contributor)

First off, it seems like the monitor needs to duplicate Prax.applications when iterating over it; otherwise, deleting an element early in the array causes later elements to be skipped.

diff --git a/src/prax/monitor.cr b/src/prax/monitor.cr
index 828b5ba..ea4863f 100644
--- a/src/prax/monitor.cr
+++ b/src/prax/monitor.cr
@@ -6,7 +6,7 @@ module Prax
     def self.clear_stalled_applications
       Prax.logger.debug "clearing stalled applications"
 
-      Prax.applications.each do |app|
+      Prax.applications.dup.each do |app|
         if app.last_accessed_at + TTL < Time.now
           Prax.applications.delete(app)
           app.stop

You can see this by running the following code in Crystal's playground (crystal play):

a = [1, 2, 3]
b = [] of Int32
a.each do |e|
  b << e
  if e == 2
    a.delete(2) # mutating `a` mid-iteration shifts the remaining elements left
  end
end
a # => [1, 3]
b # => [1, 2]

b will be [1, 2]. If you change a.each to a.dup.each, then b will be [1, 2, 3].
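For reference, here is the same snippet with the dup applied:

a = [1, 2, 3]
b = [] of Int32
a.dup.each do |e| # iterate over a snapshot, so deleting from `a` can't skip elements
  b << e
  if e == 2
    a.delete(2)
  end
end
b # => [1, 2, 3]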

With that out of the way, onto my main issue. I’ve noticed that if I have multiple applications running and the monitor kills one of them for being inactive, then after the monitor goes back to sleep, Prax becomes completely unresponsive. I can’t make any new requests (they all just hang), and the monitor does not appear to run again after DELAY seconds.

An additional bit of information: when I quit Prax by typing ^C, I see the following printed to my terminal as Prax attempts to kill the other application (the one it deemed suitable to keep alive because it was “active” earlier):

^CI, [2018-11-26 08:55:02 -08:00 #18881]  INFO -- prax: killing application: app
kill: No such process (Errno)
  from /usr/share/crystal/src/process.cr:42:7 in 'kill'
  from /usr/share/crystal/src/process.cr:340:5 in 'kill'
  from /usr/share/crystal/src/process.cr:339:3 in 'kill'
  from src/prax/application/spawner.cr:104:9 in 'kill'
  from src/prax/application/spawner.cr:74:7 in 'stop'
  from src/prax/application/spawner.cr:68:13 in 'stop'
  from src/prax/application/spawner.cr:38:9 in 'execute'
  from src/prax/application/spawner.cr:16:11 in '->'
  from /usr/share/crystal/src/fiber.cr:255:3 in 'run'
  from /usr/share/crystal/src/fiber.cr:75:34 in '->'
  from ???
FATAL: uncaught exception while processing handler for INT, exiting

Another bit of info: in another scenario, if all of my applications are deemed inactive and killed at the same time, then Prax does not freeze up.

My first guess was that, with the monitor running in its own thread, running delete on a class variable would cause another thread accessing the shared Prax.applications value to explode. However according to the Crystal docs this seems to be okay to do for now: https://crystal-lang.org/docs/guides/concurrency.html#communicating-data

My next guess would be that the act of killing an application process causes one fiber to hang, which prevents any other fibers from executing after the monitor goes to sleep. However, this seems to be contradicted by the case above: when the monitor kills all of the applications simultaneously, things still work.

Your expertise would be much appreciated here.

@tijn (Contributor) commented Nov 26, 2018

Thanks for pointing this out @jacksonrayhamilton. It's good to know I'm not the only one experiencing these lockups.

I spent quite some time trying to figure out the cause of this, together with my employer, so I guess I will share all our findings here. They could be useful.

We get similar behaviour even without the app-killer

First, as you mentioned, the Prax.applications.dup.each fix seems to be a red herring. We saw that problem too, but fixing it doesn't really make a difference: you can easily reproduce the issue even without the app-killer. We actually compiled Prax without it for testing purposes.

I have this small script that makes Prax deadlock 100% of the time:

#!/usr/bin/env ruby

RESPONSIVE_HOST = 'index.test'
FAILING_HOST = 'fail.test' # the app on this domain will crash while starting up (missing gems)

def get_webpage(host)
  puts "GET webpage from #{host}..."
  `curl -m 10 http://#{host}`
  puts 'done'
end

# ---------

get_webpage RESPONSIVE_HOST
# app gets started by Prax and responds with status 200

get_webpage RESPONSIVE_HOST
# app keeps responding to requests

get_webpage FAILING_HOST
# the failing app will crash and that will make Prax unresponsive, ...
# as we will see with the next statement

sleep 5
get_webpage RESPONSIVE_HOST
# even the responsive app will not be able to respond anymore

Note that you need to have at least one (non-failing) app running for the problem to occur. This will not result in a lockup:

# [...]
# ---------

get_webpage FAILING_HOST
# the failing app will crash but that will NOT make Prax unresponsive, ...

get_webpage FAILING_HOST
# the failing app will crash but that will NOT make Prax unresponsive, ...
# repeat this as often as you like

get_webpage RESPONSIVE_HOST
# everything should still be fine

This, to me, seems to be the same situation that you described where the app-killer kills all applications at once.

It seems that it's not only killing an application process that causes Prax to become unresponsive; it also happens when an application process simply stops. (Maybe only with an unsuccessful exit status? 🤔)

Deadlock?

We overloaded the whole Prax source with puts statements and found out that the last line it executes is line 112 in the application spawner that was spawning the crashing application. It simply says: sleep 0.1. In other words: some other fiber then takes over and prevents all other fibers from ever running again, until we press ctrl+c.

Now which fiber could that be? Could it be a fiber that's managing the connection to the RESPONSIVE_HOST? It sounds plausible, because without it we cannot seem to achieve deadlock. On the other hand, how can it be that it doesn't lock up without a crashing app? It seems logical that it must give other fibers a chance to run; otherwise you could never open a second connection or run the app-killer fiber.

Could it be that Process.new spawns some fiber that causes deadlock whenever the process hangs? But then again, why would it only do this when another process is running successfully?

I think it should never happen that a fiber freezes up forever after a call to sleep. My guess is that we have to search for the cause in the Crystal standard library. Maybe the Fiber code changed? Maybe Process changed? We took a look at the changelog, of course, but we couldn't find anything suspicious. This was our first dive into the Crystal source though; someone more experienced would probably have an easier time understanding the changes.

(A silly idea for debugging: can I maybe somehow label the fibers to find the culprit? Do I have to recompile the whole compiler for that?)

Prax did not have this problem a year ago

I have a PC at home running an older version of Prax on which I cannot reproduce this issue. I compiled it on 2017-06-29 using my Arch Linux PKGBUILD. I would have to do some research to find out exactly which versions of Prax and Crystal were used, but they were the most recent ones at the time.

The problem seems to be independent from the operating system

I suspected that Arch Linux changed some system libraries between summer 2017 and September 2018 that could have caused the error. I compiled Prax on macOS (a BSD with completely different libs, right?) only to find that I can easily reproduce the problem there too. It seems we can rule out the OS.

@ysbaddaden (Owner) commented Nov 28, 2018

Wow, this is some thorough investigation! Thanks a lot, especially for the reproducible scenarios. I'll try to understand the issue when I get some free time.

Indeed, applications.each + applications.delete will skip an app to check, but only when the previous app was stopped; the skipped app will still be iterated on the next run, so it shouldn't create real problems. We could use applications.reverse_each instead of duplicating the array (please open a PR); see the sketch below.
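For illustration, here's what that could look like in the monitor, reusing the logic from the diff above (a sketch, untested; reverse_each comes from Crystal's Indexable module):

# Walking the array backwards means deleting the current element only
# shifts elements we have already visited, so nothing gets skipped.
Prax.applications.reverse_each do |app|
  if app.last_accessed_at + TTL < Time.now
    Prax.applications.delete(app)
    app.stop
  end
end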

@tijn you can name Crystal fibers, using spawn(name: "app-killer") for example, if I recall correctly.

Note that Crystal is single-threaded: the event loop runs on a single thread (the app has threads, but they're all created by the garbage collector). The event loop can be locked, blocking all fibers from running, if one fiber uses a blocking C syscall or ends up in a busy loop (such as loop { i += 1 }).
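For instance, here is a minimal standalone sketch of that starvation (nothing Prax-specific; the fiber names just use the labelling suggested above):

spawn(name: "heartbeat") do
  loop do
    puts "heartbeat"
    sleep 1 # yields back to the event loop
  end
end

spawn(name: "busy") do
  i = 0
  loop { i += 1 } # never yields, so "heartbeat" stops printing
end

sleep 5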

@ysbaddaden (Owner)

Maybe this is related: https://github.com/ysbaddaden/prax.cr/blob/master/src/prax.cr#L50-L51

We reworked how SIGCHLD is handled in Crystal some months ago.

@tijn (Contributor) commented Dec 2, 2018

> Maybe this is related: https://github.com/ysbaddaden/prax.cr/blob/master/src/prax.cr#L50-L51
>
> We reworked how SIGCHLD is handled in Crystal some months ago.

Well, that was spot-on!

Removing the call to waitpid resolved the issue. But it's there for a reason, of course; I guess we could end up with zombie processes without it, so we have to find a better way to call it.

I rewrote the signal handler to use WNOHANG so it becomes non-blocking... However, I am totally unfamiliar with waitpid so I wonder if this could be correct:

Signal::CHLD.trap do
  loop do
    code = LibC.waitpid(-1, out exit_code, LibC::WNOHANG)
    STDERR.puts "SIGCHLD #{code} #{exit_code} #{Errno.value}"
    if code == -1 && Errno.value == Errno::EINTR
      # FIXME: is this right?
      sleep 0.1 # sleep and continue the loop until there is a proper return value
    else
      break
    end
  end
end

I first had a version that didn't check errno, but sometimes that would end up in a never-ending loop. This one seems to work better in that regard; I haven't seen it fail yet... 🤞 Still, expert advice is very welcome!

@ysbaddaden (Owner)

I didn't notice the missing WNOHANG in the waitpid call, making it a blocking call.

The loop is correct, but I should check the manpage; I'm fairly sure we can just discard the return value, or just loop

@ysbaddaden (Owner)

Pushed comment button inadvertently...

I think we can loop until it returns -1 and errno is EAGAIN, which means it would have blocked (no child process to reap). But I haven't verified the manpage, yet.

@ysbaddaden (Owner)

Fixed in 966c903.

Crystal's runtime now reaps child processes before delegating to any custom SIGCHLD handler. Trying to reap again led to blocking the process. Even using WNOHANG would be bad, since Crystal's runtime expects to reap child processes.
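To illustrate the rule, a hypothetical sketch of what a custom handler may look like under it (this is not the actual 966c903 change; the Prax.logger call is borrowed from the monitor code above): react to the signal, but never reap the child yourself.

Signal::CHLD.trap do
  # Crystal's runtime has already reaped the child by the time this handler
  # runs, so only do application-level bookkeeping here; no LibC.waitpid call.
  Prax.logger.debug "a child process exited"
end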

Thanks for helping debug, guys!

I should have known: I rewrote how SIGCHLD is handled in Crystal's runtime, which ended up breaking Prax. Sigh.

@ysbaddaden (Owner)

Fixed the "reaper skips an app" issue in 6973ce4.

@jacksonrayhamilton (Contributor, Author)

Thanks, the issue was fixed for me.
