Catch interrupted signals (on select, specificaly) in the connect loop by jacksontj · Pull Request #250 · python-zk/kazoo

jacksontj · 2014-09-25T23:53:13Z

In the current build, if the process receives a signal you get a backtrace like:

[ERROR   ] Unhandled exception in connection loop
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/kazoo/protocol/connection.py", line 522, in _connect_attempt
    [], [], timeout)[0]
  File "/usr/lib/python2.6/site-packages/kazoo/handlers/threading.py", line 250, in select
    return select.select(*args, **kwargs)
error: (4, 'Interrupted system call')
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib64/python2.6/threading.py", line 532, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.6/threading.py", line 484, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.6/site-packages/kazoo/protocol/connection.py", line 466, in zk_loop
    if retry(self._connect_loop, retry) is STOP_CONNECTING:
  File "/usr/lib/python2.6/site-packages/kazoo/retry.py", line 123, in __call__
    return func(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/kazoo/protocol/connection.py", line 483, in _connect_loop
    status = self._connect_attempt(host, port, retry)
  File "/usr/lib/python2.6/site-packages/kazoo/protocol/connection.py", line 522, in _connect_attempt
    [], [], timeout)[0]
  File "/usr/lib/python2.6/site-packages/kazoo/handlers/threading.py", line 250, in select
    return select.select(*args, **kwargs)
error: (4, 'Interrupted system call')

This happens because the kazoo connection thread receives the signal and does not handle the interrupted system call. Your _socket_error_handling context manager appears to cover that case, and in local testing it fixed the issue.
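For readers unfamiliar with the pattern, here is a minimal sketch of a context manager that turns low-level socket and select errors (including the EINTR shown in the traceback) into a single, friendlier exception. It is only an illustration written for Python 3: the `ConnectionDropped` class below is a local stand-in and the body is an assumption, not kazoo's actual implementation.

```python
import errno
import select
import socket
from contextlib import contextmanager


class ConnectionDropped(Exception):
    """Stand-in (assumption) for the client's connection-lost exception."""


@contextmanager
def socket_error_handling():
    """Re-raise socket/select errors as ConnectionDropped."""
    try:
        yield
    except (socket.error, select.error) as exc:
        # On Python 3 both names alias OSError, which exposes .strerror.
        # Fall back to the exception itself when no message is available.
        message = getattr(exc, "strerror", None) or exc
        raise ConnectionDropped("socket connection error: %s" % (message,))
```

Wrapping the `select` call of the connect attempt in `with socket_error_handling():` would then surface an interrupt as `ConnectionDropped` rather than as an unhandled exception in the connection thread.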

harlowja · 2014-10-09T19:23:16Z

I'm slightly confused how this tests the case you are trying. Can you add a comment as to how this actually tests this?

This seems to be the only reliable way (from within Python) to reproduce the issue. Basically, I need a mechanism that sends a signal to the kazoo handler thread so that it interrupts the select system call. I've tried a number of other mechanisms (os.killpg, etc.), but they all seem to kill the test suite as well. I've updated the comment here.

jacksontj · 2014-10-10T23:42:23Z

@harlowja Thanks for the feedback; I've updated the pull request accordingly.

harlowja · 2014-10-11T06:07:07Z

Does this comment make sense anymore?

No, it does not. Comments cleaned up.

jacksontj · 2014-10-27T19:07:33Z

@harlowja Anything else to modify, or are we good for merge?

jacksontj · 2014-11-08T20:30:31Z

Ping

jacksontj · 2014-11-18T04:37:30Z

Ping

bbangert · 2014-11-18T19:04:16Z

Shouldn't there be some assert or something to ensure this didn't break the world?

I didn't think so, since this isn't testing the client but rather checking for an interrupt. But it's easy enough to check the node's data :)

bbangert · 2014-11-19T17:16:49Z

Looks good to me.

…has elapsed

Earlier I submitted python-zk#250 to handle select being interrupted, and changed the behavior to return as if the call had timed out. Now that we have been running this for a while, we are seeing occasional issues with that patch. The main one is that the rest of the kazoo codebase assumes (reasonably) that a timeout from select is an actual timeout, which the previous patch made untrue, so callers would have to check the elapsed time in addition to the return value. Instead of doing that (which seems messy and error prone), this patch simply retries until the given timeout has elapsed (assuming one was given). It accounts for the possibility of more than one interrupt during a given select() call by adjusting the timeout for each iteration.
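The retry-until-deadline behavior described above can be sketched roughly as follows. This is an illustration written for Python 3, not the code from the patch: the name `select_with_retry` and the injectable `_select` parameter are assumptions added here so the loop can be exercised without real signals.

```python
import errno
import select
import time


def select_with_retry(rlist, wlist, xlist, timeout=None, _select=select.select):
    """Call select(), retrying on EINTR until the original timeout elapses."""
    # Fix the deadline once so repeated interrupts cannot extend the wait.
    deadline = None if timeout is None else time.monotonic() + timeout
    remaining = timeout
    while True:
        try:
            if remaining is None:
                return _select(rlist, wlist, xlist)
            return _select(rlist, wlist, xlist, remaining)
        except OSError as exc:
            if exc.errno != errno.EINTR:
                raise  # Only interrupted system calls are retried.
        if deadline is not None:
            # Shrink the timeout by the time already spent waiting.
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                # The full timeout has passed, so report a genuine timeout.
                return [], [], []
```

With this shape, an empty result means the full timeout really elapsed, which is what the surrounding code expects. On Python 3.5 and later the interpreter already retries interrupted calls itself (PEP 475), so a loop like this mainly matters on older interpreters.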

This is in a similar vein to python-zk#250, but the root cause is an upstream Python issue (http://bugs.python.org/issue20611) that is not fixed in 2.x or in 3.x releases before 3.5. Python does not handle interrupted system calls there, so applications are responsible for doing so. This patch simply retries the create_connection call when the system call is interrupted, while honoring the originally requested timeout. Note: although the overall timeout for `create_tcp_connection` is honored, each call to `create_connection` after an interrupt has less time to complete.
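A rough sketch of that connection retry, again for Python 3 and not taken from the patch itself: `connect_with_retry` and its `_connect` parameter are hypothetical names used for illustration, with `socket.create_connection` as the real call being wrapped.

```python
import errno
import socket
import time


def connect_with_retry(address, timeout=None, _connect=socket.create_connection):
    """Retry create_connection on EINTR without exceeding the overall timeout."""
    deadline = None if timeout is None else time.monotonic() + timeout
    remaining = timeout  # None means block without a timeout.
    while True:
        try:
            return _connect(address, remaining)
        except OSError as exc:
            if exc.errno != errno.EINTR:
                raise  # Real connection errors propagate unchanged.
        if deadline is not None:
            # Each retry only gets whatever is left of the original budget.
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise socket.timeout("timed out")
```

The shrinking `remaining` value is the trade-off mentioned in the note above: the overall deadline holds, but a connection attempt made after several interrupts may have very little time left.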

Catch interrupted signals (in select) in the threading handler. In addition, use the socket_error_handling context manager in connect loop to raise nicer exceptions

9a6ecba

jacksontj force-pushed the signal_interrupt branch from c8215f0 to 9a6ecba on September 26, 2014 01:33

Add python 3 compatibility

6cfb013

jacksontj closed this Sep 26, 2014

jacksontj reopened this Sep 26, 2014

jacksontj changed the title ~~Catch interrupted signals (on select, specificall) in the connect loop~~ Catch interrupted signals (on select, specificaly) in the connect loop Oct 2, 2014

harlowja reviewed Oct 9, 2014

Treate system call interrupts as a timeout (to simplify timeouts)

b332dec

harlowja reviewed Oct 11, 2014

Update comments in select error handling

0776b5a

Update comments in select error handling to reflect new behavior

bbangert reviewed Nov 18, 2014

jacksontj added 2 commits November 18, 2014 17:43

Add a sanity check per @bbangert

5caa758

Merge branch 'signal_interrupt' of github.com:jacksontj/kazoo into signal_interrupt

1038d52

bbangert added a commit that referenced this pull request Nov 19, 2014

Merge pull request #250 from jacksontj/signal_interrupt

d14303a

Catch interrupted signals (on select, specificaly) in the connect loop

bbangert merged commit d14303a into python-zk:master Nov 19, 2014

jacksontj mentioned this pull request May 14, 2016

System call interrupts #395

Merged

bbangert merged 6 commits into python-zk:master from jacksontj:signal_interrupt
