Skip to content

Conversation

jonludlam
Copy link
Collaborator

It would be way better to be using select rather than blocking
reads, but that's a much bigger change and this will need to be
backported.

Also there's a new mechanism for breaking out of the retry loop
if a host has been marked as dead, which will require a change
in xapi to be used.

Signed-off-by: Jon Ludlam jonathan.ludlam@citrix.com

It would be way better to be using select rather than blocking
reads, but that's a much bigger change and this will need to be
backported.

Also there's a new mechanism for breaking out of the retry loop
if a host has been marked as dead, which will require a change
in xapi to be used.

Signed-off-by: Jon Ludlam <jonathan.ludlam@citrix.com>
Copy link
Contributor

@lindig lindig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is tricky code that looks good and I don't have suggestions.

@jonludlam
Copy link
Collaborator Author

We can merge this without the xapi change and it won't break the build - but we do need the xapi change for the full fix. But I agree it's tricky code, and some other eyes on this bit would be useful. The xapi change is relatively trivial.

Copy link
Contributor

@edwintorok edwintorok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments on the Unixy race conditions and signals. I am not familiar with how the rest of the stunnel code works.

let run_watchdog timeout (fire_fn : unit -> unit) f (pout,pin) =
let fired = ref None in
let th = Thread.create (fun () ->
match Unix.select [pout] [] [] timeout with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

select can wake up early for a variety of reasons with EINTR, especially if the process is receiving signals. It would be good to use a function here that sleeps based on an absolute deadline, pthread_ and sem_ have a few but I haven't seem any of them bound in ocaml standard library.
Thought that Thread.wait_timed_read would DTRT, but it just calls select without retrying.

Might have to use gettimeofday, compute how long is left and retry ourselves unless we can find a helper function that does that already

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timeout
(fun () ->
StunnelDebug.warn "Watchdog fired: killing pid: %d" pid;
Unix.kill pid Sys.sigterm)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we know this is the right PID and stunnel hasn't died already and the PID has been reused by something else? Does stunnel have a lockfile it keeps open with the pid that we can use to kill it without race conditions?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only really by construction - we do this pattern elsewhere in the code too. I really want to throw away all this code and use direct openssl bindings instead, but since this is for backporting that's not really an option here.

Copy link
Contributor

@edwintorok edwintorok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discussed that XAPI would die anyway if it receives EINTR due to the global EINTR handler, so not handling EINTR and getting a shorter sleep here is fine.
Also in-flight RPC calls through stunnel would get terminated due to the watchdog here killing stunnel so that should prevent RPCs to dead hosts getting stuck for too long.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants