Catch interrupted signals (on select, specificaly) in the connect loop #250

Merged: bbangert merged 6 commits into python-zk:master from jacksontj:signal_interrupt on Nov 19, 2014

@jacksontj
Contributor

In the current build, if the process receives a signal, you get a backtrace like:

[ERROR   ] Unhandled exception in connection loop
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/kazoo/protocol/connection.py", line 522, in _connect_attempt
    [], [], timeout)[0]
  File "/usr/lib/python2.6/site-packages/kazoo/handlers/threading.py", line 250, in select
    return select.select(*args, **kwargs)
error: (4, 'Interrupted system call')
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib64/python2.6/threading.py", line 532, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.6/threading.py", line 484, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.6/site-packages/kazoo/protocol/connection.py", line 466, in zk_loop
    if retry(self._connect_loop, retry) is STOP_CONNECTING:
  File "/usr/lib/python2.6/site-packages/kazoo/retry.py", line 123, in __call__
    return func(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/kazoo/protocol/connection.py", line 483, in _connect_loop
    status = self._connect_attempt(host, port, retry)
  File "/usr/lib/python2.6/site-packages/kazoo/protocol/connection.py", line 522, in _connect_attempt
    [], [], timeout)[0]
  File "/usr/lib/python2.6/site-packages/kazoo/handlers/threading.py", line 250, in select
    return select.select(*args, **kwargs)
error: (4, 'Interrupted system call')

This happens because the kazoo connection thread receives the signal and does not handle the interrupted system call. Your _socket_error_handling context manager appears to cover that case, and in local testing it fixed the issue.
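
As a rough illustration of the idea, a context manager can translate a low-level interrupted-call error into a single library-level exception that the connect loop already knows how to handle. This is only a sketch under assumptions: `ConnectionDropped`, `socket_error_handling`, `fake_select`, and `connect_attempt` are hypothetical names written for this example and are not taken from kazoo's source. It also uses modern Python 3 syntax, whereas the traceback above comes from Python 2.6.

```python
import contextlib
import errno


class ConnectionDropped(Exception):
    """Hypothetical library-level error raised instead of a raw OS error."""


@contextlib.contextmanager
def socket_error_handling():
    """Convert low-level socket/select errors into ConnectionDropped."""
    try:
        yield
    except OSError as exc:
        # An interrupted system call arrives here with exc.errno == errno.EINTR.
        raise ConnectionDropped(
            "socket connection error: %s" % (exc.strerror,)
        ) from exc


def fake_select():
    """Stand-in for select.select() being interrupted by a signal."""
    raise OSError(errno.EINTR, "Interrupted system call")


def connect_attempt():
    """Hypothetical connect step that reports the error instead of crashing."""
    try:
        with socket_error_handling():
            fake_select()
    except ConnectionDropped as exc:
        return str(exc)
    return "connected"
```

With a wrapper like this, the calling loop only needs to catch one exception type, so an interrupt no longer escapes as an unhandled exception in the connection thread.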

…dition, use the socket_error_handling context manager in connect loop to raise nicer exceptions
@jacksontj jacksontj closed this Sep 26, 2014
@jacksontj jacksontj reopened this Sep 26, 2014
@jacksontj jacksontj changed the title Catch interrupted signals (on select, specificall) in the connect loop Catch interrupted signals (on select, specificaly) in the connect loop Oct 2, 2014
Comment thread kazoo/tests/test_interrupt.py Outdated
Contributor

I'm slightly confused how this tests the case you are trying. Can you add a comment as to how this actually tests this?

Contributor Author

This seems to be the only reliable way (from within Python) to reproduce the issue. Basically, I need a mechanism that sends a signal to the kazoo handler thread and interrupts the select system call. I've tried several other mechanisms (os.killpg, etc.), but they all seem to kill the test suite as well. I've updated the comment here.
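
For readers who want to experiment with the general mechanism, the sketch below delivers a signal to the current process while `select.select()` is blocking. It is not the test from this pull request: the function and variable names are made up for illustration, it assumes a POSIX system with `SIGUSR1`, and it must be called from the main thread because `signal.signal()` only works there. Note that Python 3.5 and later retry the interrupted `select()` automatically, so the `InterruptedError` branch is only expected on older interpreters.

```python
import os
import select
import signal
import threading

received = []  # signal numbers seen by the handler


def _record_signal(signum, frame):
    """No-op style handler: just remember that the signal arrived."""
    received.append(signum)


def select_with_signal(delay=0.1, timeout=0.5):
    """Block in select() and have a timer thread signal this process."""
    previous = signal.signal(signal.SIGUSR1, _record_signal)
    read_fd, write_fd = os.pipe()  # nothing is written, so select() just waits
    timer = threading.Timer(delay, os.kill, args=(os.getpid(), signal.SIGUSR1))
    timer.start()
    try:
        try:
            readable, _, _ = select.select([read_fd], [], [], timeout)
            interrupted = False
        except InterruptedError:
            # Expected only on interpreters that do not retry on EINTR.
            readable, interrupted = [], True
    finally:
        timer.join()
        signal.signal(signal.SIGUSR1, previous)  # restore the old handler
        os.close(read_fd)
        os.close(write_fd)
    return readable, interrupted
```

Installing a handler matters here: without one, the default action for `SIGUSR1` terminates the process, which would take the test suite down with it.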

@jacksontj
Contributor Author

@harlowja Thanks for the feedback, I've updated the pull request accordingly.

Comment thread kazoo/handlers/threading.py Outdated
Contributor

Does this comment make sense anymore?

Contributor Author

No, it does not. Comments cleaned up.

Update comments in select error handling to reflect new behavior
@jacksontj
Contributor Author

@harlowja Anything else to modify, or are we good for merge?

@jacksontj
Contributor Author

Ping

@jacksontj
Contributor Author

Ping

Comment thread kazoo/tests/test_interrupt.py Outdated
Member

Shouldn't there be some assert or something to ensure this didn't break the world?

Contributor Author

I didn't think so, since this isn't testing the client, but rather checking for an interrupt. But it's easy enough to check the node's data :)

@bbangert
Member

Looks good to me.

bbangert added a commit that referenced this pull request Nov 19, 2014
Catch interrupted signals (on select, specificaly) in the connect loop
@bbangert bbangert merged commit d14303a into python-zk:master Nov 19, 2014
jacksontj added a commit to jacksontj/kazoo that referenced this pull request May 14, 2016
…has elapsed

Earlier I submitted python-zk#250 to handle select being interrupted, and changed the behavior to return as if the call had timed out. Now that we've been running this for a while, we are seeing occasional issues with that patch. The rest of the kazoo codebase assumes that a timeout from select is an actual timeout (which is reasonable), but the previous patch broke that assumption, so callers would have to check the elapsed time in addition to the return value. Instead of doing that (which seems messy and error prone), this patch simply retries until the given timeout has elapsed (assuming one was given). It accounts for more than one interrupt occurring during a given select() call by adjusting the timeout on each iteration.
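
The retry-until-deadline approach described in this commit message can be sketched roughly as follows. This is an illustrative reconstruction, not the code from the commit: `select_retrying` and its `_select` parameter are hypothetical names, with `_select` added only so the behavior can be exercised with a fake that raises `EINTR`.

```python
import errno
import select
import time


def select_retrying(rlist, wlist, xlist, timeout=None, _select=select.select):
    """Call select(), retrying on EINTR until the original timeout elapses."""
    # Fix the deadline once so repeated interrupts cannot extend the wait.
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        try:
            return _select(rlist, wlist, xlist, timeout)
        except OSError as exc:
            if exc.errno != errno.EINTR:
                raise  # a real error, not an interrupt
            if deadline is not None:
                # Shrink the timeout by the time already spent waiting.
                timeout = deadline - time.monotonic()
                if timeout <= 0:
                    # The budget is used up: report a genuine timeout.
                    return [], [], []
```

Because an empty result is only returned once the deadline has actually passed, callers can keep treating an empty result as a real timeout without checking the elapsed time themselves.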
jacksontj added a commit to jacksontj/kazoo that referenced this pull request May 14, 2016
This is in a similar vein to python-zk#250, but the root cause of this issue is an upstream Python issue (http://bugs.python.org/issue20611), which is not fixed in 2.x or in 3.x releases before 3.5. Python does not retry interrupted system calls itself in those versions, so applications are responsible for doing so.

This patch simply retries the create_connection call on system call interrupt, while honoring the original timeout request.

Note: although the timeout for `create_tcp_connection` will be honored, if there are interrupts each subsequent call to `create_connection` will have less time to complete.
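
A minimal sketch of this retry, assuming a wrapper shaped like the one below. The name `create_tcp_connection` comes from the note above, but its signature, the `_connect` injection point, and the choice to raise `socket.timeout` when the budget runs out are assumptions made for illustration.

```python
import errno
import socket
import time


def create_tcp_connection(address, timeout, _connect=socket.create_connection):
    """Retry create_connection on EINTR within one overall timeout."""
    end = time.monotonic() + timeout
    while True:
        # Each attempt only gets whatever is left of the original budget.
        remaining = end - time.monotonic()
        if remaining <= 0:
            raise socket.timeout("connection attempt timed out")
        try:
            return _connect(address, remaining)
        except OSError as exc:
            if exc.errno != errno.EINTR:
                raise  # refused, unreachable, timed out, and so on
```

This shows the trade-off mentioned in the note: the overall timeout is honored, but every interrupt leaves the next `create_connection` call with less time to complete.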
@jacksontj jacksontj mentioned this pull request May 14, 2016