System call interrupts #395

Merged
bbangert merged 2 commits into python-zk:master from jacksontj:system_call_interrupts
Jun 3, 2016

@jacksontj
Contributor

This is a follow-up to #250.

This PR includes two patches, both fixing further interrupt issues. The root cause is an upstream Python issue (http://bugs.python.org/issue20611).

The first patch changes my previous fix (#250) from reporting a timeout to retrying. The problem we see is that the rest of the kazoo code assumes that a timeout is a real timeout (which is reasonable). So instead of reworking all timeouts within kazoo, we can simply retry.

The second patch handles the same problem during connection setup.

…has elapsed

Earlier I submitted python-zk#250 to handle select being interrupted, and changed the behavior to return as if a timeout had occurred. Now that we have been running this for a while, we are seeing occasional issues with that patch. Primarily, the rest of the kazoo codebase assumes that a timeout from select is an actual timeout (which is reasonable), but the previous patch broke that assumption, so callers would have to check the elapsed time in addition to the return value. Instead of doing that (which seems messy and error-prone), this patch simply retries until the given timeout has elapsed (assuming one was given). It also accounts for more than one interrupt occurring within a single select() call by adjusting the timeout for each iteration.
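
The retry described above could look roughly like the sketch below. This is an illustration, not the code in the patch: the name `select_with_retry` and the `_select` parameter (there only so the behavior can be exercised without real signals) are hypothetical, and `time.monotonic()` assumes Python 3.3 or later. Note that Python 3.5 and later already retry `select.select` on EINTR themselves (PEP 475), so the `except` branch matters mainly on older interpreters.

```python
import errno
import select
import time


def select_with_retry(rlist, wlist, xlist, timeout=None, _select=select.select):
    """Call select(), retrying on EINTR until the original timeout elapses.

    Hypothetical helper for illustration; not kazoo's actual implementation.
    """
    # Fix the deadline once so repeated interrupts cannot extend the wait.
    deadline = None if timeout is None else time.monotonic() + timeout
    remaining = timeout
    while True:
        try:
            if remaining is None:
                return _select(rlist, wlist, xlist)
            return _select(rlist, wlist, xlist, remaining)
        except OSError as exc:
            # Only an interrupted system call is retried; anything else
            # is a real error and propagates to the caller.
            if exc.errno != errno.EINTR:
                raise
        if deadline is not None:
            # Shrink the timeout by the time already spent waiting.
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                # The budget is used up: report a genuine timeout.
                return [], [], []
```

With this shape, an empty result always means the full timeout has elapsed, and the function returns a three-element result on every path, including a timeout of 0.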
@jacksontj
Contributor Author

It seems that all the tests failed with some gzip error? That sounds unrelated, especially since all tests failed with the same error (and I don't get that error locally).

@jacksontj jacksontj closed this May 16, 2016
@jacksontj jacksontj reopened this May 16, 2016
@jacksontj
Contributor Author

cc @bbangert @harlowja
Since you helped review my patch last time, I figure you'll be interested in this one as well.

@bbangert
Member

bbangert commented May 17, 2016

This looks fine to me, but I'm also stumped by the Travis errors. Restarting Travis didn't seem to fix anything either.

@jacksontj
Contributor Author

@bbangert do you know of a way to re-trigger the run? From the errors it looks like travis might have had an issue.

@bbangert
Member

@jacksontj yup, triggered it again just now

@jacksontj
Contributor Author

This is really odd: when I run the tests locally they pass just fine. Is there some way to get additional debugging info from Travis? Or is there someone who maintains it that we could ping?

@jacksontj
Contributor Author

@bbangert any ideas who we can talk to?

@bbangert
Member

@jacksontj not offhand. There might have been a few failures on master before this one though, so I need to retry them as well to see when it broke for good.

@bbangert
Member

@jacksontj ok, well, I went back to a prior travis test that only had 2 fail, and reran it.... and then they all failed. I'm guessing something on Travis has changed, such that all our tests are now insta-fail.

@bbangert
Member

I've found the issue: the chosen mirror no longer has 3.4.7. I reverted that so that the tests can run, and will restart this shortly.

@bbangert
Member

@jacksontj looks like there's 1 error reported.

======================================================================
ERROR: test_dirty_sock (kazoo.tests.test_connection.TestConnectionHandler)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/python-zk/kazoo/kazoo/tests/test_connection.py", line 229, in test_dirty_sock
    wait(lambda: client.handler.select([read_sock], [], [], 0)[0] == [])
  File "/home/travis/build/python-zk/kazoo/kazoo/tests/util.py", line 105, in __call__
    if func():
  File "/home/travis/build/python-zk/kazoo/kazoo/tests/test_connection.py", line 229, in <lambda>
    wait(lambda: client.handler.select([read_sock], [], [], 0)[0] == [])
TypeError: 'NoneType' object is unsubscriptable

@jacksontj jacksontj force-pushed the system_call_interrupts branch 3 times, most recently from d9699ac to c8e3f22 Compare June 3, 2016 01:25
This is in a similar vein to python-zk#250, but the root cause of this issue is an upstream Python issue (http://bugs.python.org/issue20611) that is not fixed in 2.x or in 3.x releases before 3.5. Those versions do not handle interrupted system calls, so applications are responsible for doing so.

This patch simply retries the create_connection call on a system call interrupt, while honoring the originally requested timeout.

Note: although the timeout for `create_tcp_connection` will be honored, if there are interrupts each subsequent call to `create_connection` will have less time to complete.
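
As a rough illustration of that note, the sketch below retries `socket.create_connection` against a fixed deadline, so each retry after an interrupt gets only the time that is left. It is not the code in the patch: `create_connection_with_retry` and its `_connect` parameter (used only to make the behavior testable without a network) are hypothetical names, and `time.monotonic()` assumes Python 3.3 or later.

```python
import errno
import socket
import time


def create_connection_with_retry(address, timeout, _connect=socket.create_connection):
    """Retry create_connection on EINTR without exceeding the overall timeout.

    Hypothetical helper for illustration; not kazoo's actual implementation.
    """
    # The deadline is fixed up front; interrupts do not reset it.
    deadline = time.monotonic() + timeout
    remaining = timeout
    while True:
        try:
            return _connect(address, remaining)
        except OSError as exc:
            # Retry only interrupted system calls; refused connections,
            # real timeouts and other errors propagate unchanged.
            if exc.errno != errno.EINTR:
                raise
        # Each subsequent attempt has less time to complete.
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise socket.timeout("connection attempt timed out after interrupts")
```

A caller would use it like `create_connection_with_retry(("localhost", 2181), 10.0)`, where the host and port are placeholder values.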
@jacksontj jacksontj force-pushed the system_call_interrupts branch from c8e3f22 to e7aac2e Compare June 3, 2016 01:40
@jacksontj
Contributor Author

@bbangert fixed the issues :) now travis is happy.

@bbangert
Member

bbangert commented Jun 3, 2016

@jacksontj looks good!

@bbangert bbangert merged commit 1b4bca7 into python-zk:master Jun 3, 2016