-
Notifications
You must be signed in to change notification settings - Fork 28
CA-285840: Make sure stunnel doesn't hang talking to dead hosts #41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
It would be way better to be using select rather than blocking reads, but that's a much bigger change and this will need to be backported. Also there's a new mechanism for breaking out of the retry loop if a host has been marked as dead, which will require a change in xapi to be used. Signed-off-by: Jon Ludlam <jonathan.ludlam@citrix.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is tricky code that looks good and I don't have suggestions.
We can merge this without the xapi change and it won't break the build - but we do need the xapi change for the full fix. But I agree it's tricky code, and some other eyes on this bit would be useful. The xapi change is relatively trivial. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments on the Unixy race conditions and signals. I am not familiar with how the rest of the stunnel code works.
let run_watchdog timeout (fire_fn : unit -> unit) f (pout,pin) = | ||
let fired = ref None in | ||
let th = Thread.create (fun () -> | ||
match Unix.select [pout] [] [] timeout with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
select can wake up early for a variety of reasons with EINTR, especially if the process is receiving signals. It would be good to use a function here that sleeps based on an absolute deadline, pthread_ and sem_ have a few but I haven't seem any of them bound in ocaml standard library.
Thought that Thread.wait_timed_read would DTRT, but it just calls select without retrying.
Might have to use gettimeofday, compute how long is left and retry ourselves unless we can find a helper function that does that already
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps the solution here helps: https://caml.inria.fr/mantis/print_bug_page.php?bug_id=793
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EINTR has a more dramatic effect on xapi: https://github.com/xapi-project/xen-api/blob/master/ocaml/xapi/xapi.ml#L974-L989
timeout | ||
(fun () -> | ||
StunnelDebug.warn "Watchdog fired: killing pid: %d" pid; | ||
Unix.kill pid Sys.sigterm) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we know this is the right PID and stunnel hasn't died already and the PID has been reused by something else? Does stunnel have a lockfile it keeps open with the pid that we can use to kill it without race conditions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only really by construction - we do this pattern elsewhere in the code too. I really want to throw away all this code and use direct openssl bindings instead, but since this is for backporting that's not really an option here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
discussed that XAPI would die anyway if it receives EINTR due to the global EINTR handler, so not handling EINTR and getting a shorter sleep here is fine.
Also in-flight RPC calls through stunnel would get terminated due to the watchdog here killing stunnel so that should prevent RPCs to dead hosts getting stuck for too long.
It would be way better to be using select rather than blocking
reads, but that's a much bigger change and this will need to be
backported.
Also there's a new mechanism for breaking out of the retry loop
if a host has been marked as dead, which will require a change
in xapi to be used.
Signed-off-by: Jon Ludlam jonathan.ludlam@citrix.com