Catch interrupted signals (on select, specificaly) in the connect loop#250
Catch interrupted signals (on select, specificaly) in the connect loop#250bbangert merged 6 commits intopython-zk:masterfrom
Conversation
…dition, use the socket_error_handling context manager in connect loop to raise nicer exceptions
c8215f0 to
9a6ecba
Compare
There was a problem hiding this comment.
I'm slightly confused how this tests the case you are trying. Can you add a comment as to how this actually tests this?
There was a problem hiding this comment.
This seems to be the only reliable way (from within python) to reproduce the issue. Basically I need a mechanism that will send a signal to the kazoo handler thread which interrupts the select system call. I've tried a bunch of different mechanisms (os.killpg, etc.) but they all seem to kill the test suite as well. I've updated the comment here.
|
@harlowja Thanks for the feedback, I've updated my pull req with the feedback. |
There was a problem hiding this comment.
Does this comment make sense anymore?
There was a problem hiding this comment.
No it does not -- comments cleaned up
Update comments in select error handling to reflect new behavior
|
@harlowja Anything else to modify, or are we good for merge? |
|
Ping |
1 similar comment
|
Ping |
There was a problem hiding this comment.
Shouldn't there be some assert or something to ensure this didn't break the world?
There was a problem hiding this comment.
I didn't think so, since this isn't testing the client, but rather checking for an interrupt. But its easy enough to check the node's data :)
|
Looks good to me. |
Catch interrupted signals (on select, specificaly) in the connect loop
…has elapsed Earlier I had submitted python-zk#250 to handle select being interrupted, and had changed this behavior to return as if it was a timeout. Now that we've been running this for a while we are seeing some occasional issues with that patch. Primarily that the remaining kazoo codebase assumes that a timeout from select is an actual timeout (which is reasonable)-- except that the previous patch changed that-- so you actually have to check the time elapsed in addition to the return. Instead of doing that (which seems like a mess, and error prone) this patch simply retries until the given timeout (assuming one was given). This does take into account that you may experience more than one interrupt in a given select() call (and we adjust the timeout accordingly for each iteration).
This is in a similar vein to python-zk#250, but the root cause of this issue is actually an upstream python issue (http://bugs.python.org/issue20611) which is not fixed in 2.x or <3.5. Python doesn't handle interrupted system calls-- so the applications are responsible for doing so. This patch simply retries the create_connection call on system call interrupt, while honoring the original timeout request. Note: although the timeout for `create_tcp_connection` will be honored, if there are interrupts each subsequent call to `create_connection` will have less time to complete.
This is in a similar vein to python-zk#250, but the root cause of this issue is actually an upstream python issue (http://bugs.python.org/issue20611) which is not fixed in 2.x or <3.5. Python doesn't handle interrupted system calls-- so the applications are responsible for doing so. This patch simply retries the create_connection call on system call interrupt, while honoring the original timeout request. Note: although the timeout for `create_tcp_connection` will be honored, if there are interrupts each subsequent call to `create_connection` will have less time to complete.
This is in a similar vein to python-zk#250, but the root cause of this issue is actually an upstream python issue (http://bugs.python.org/issue20611) which is not fixed in 2.x or <3.5. Python doesn't handle interrupted system calls-- so the applications are responsible for doing so. This patch simply retries the create_connection call on system call interrupt, while honoring the original timeout request. Note: although the timeout for `create_tcp_connection` will be honored, if there are interrupts each subsequent call to `create_connection` will have less time to complete.
This is in a similar vein to python-zk#250, but the root cause of this issue is actually an upstream python issue (http://bugs.python.org/issue20611) which is not fixed in 2.x or <3.5. Python doesn't handle interrupted system calls-- so the applications are responsible for doing so. This patch simply retries the create_connection call on system call interrupt, while honoring the original timeout request. Note: although the timeout for `create_tcp_connection` will be honored, if there are interrupts each subsequent call to `create_connection` will have less time to complete.
This is in a similar vein to python-zk#250, but the root cause of this issue is actually an upstream python issue (http://bugs.python.org/issue20611) which is not fixed in 2.x or <3.5. Python doesn't handle interrupted system calls-- so the applications are responsible for doing so. This patch simply retries the create_connection call on system call interrupt, while honoring the original timeout request. Note: although the timeout for `create_tcp_connection` will be honored, if there are interrupts each subsequent call to `create_connection` will have less time to complete.
In the current build if the process gets a signal you get a backtrace like:
This is due to the kazoo connection thread getting the signal and not handling the system call interrupt. It seems that your _socket_error_handling contextmanager covers that case, and with local testing it seems to have fixed the issue.