System call interrupts by jacksontj · Pull Request #395 · python-zk/kazoo

jacksontj · 2016-05-14T01:03:05Z

This is a followup to #250

This PR includes two patches, both fixing more interrupt issues. The root cause is an upstream issue in Python (http://bugs.python.org/issue20611).

The first patch changes my previous fix (#250) from raising a timeout to retrying-- the issue we see is that the rest of the kazoo code assumes that a timeout is a timeout (which is reasonable). So instead of reworking all timeouts within kazoo, we can simply retry.

The second patch handles the same during connection setup.

…has elapsed

Earlier I had submitted python-zk#250 to handle select being interrupted, and had changed this behavior to return as if it was a timeout. Now that we've been running this for a while, we are seeing occasional issues with that patch. The remaining kazoo codebase assumes that a timeout from select is an actual timeout (which is reasonable), but the previous patch broke that assumption, so callers would have to check the elapsed time in addition to the return value. Instead of doing that (which seems messy and error-prone), this patch simply retries until the given timeout has elapsed (assuming one was given). It accounts for the possibility of more than one interrupt during a single select() call by adjusting the timeout for each iteration.
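For readers who want to see the retry idea in isolation, here is a minimal sketch of it. It is not the code from this PR: the name `select_retrying` and the injectable `_select` parameter are hypothetical, and the sketch assumes Python 3 (`time.monotonic`, `OSError` carrying `errno`). On Python 3.5 and later the interpreter already retries `select()` itself, so the `except` branch only matters on older versions or with a substituted `_select`.

```python
import errno
import select
import time


def select_retrying(rlist, wlist, xlist, timeout=None, _select=select.select):
    """Hypothetical helper: retry select() on EINTR within the original timeout."""
    # Fix the deadline once, so repeated interrupts cannot extend the total wait.
    deadline = None if timeout is None else time.monotonic() + timeout
    remaining = timeout
    while True:
        try:
            return _select(rlist, wlist, xlist, remaining)
        except OSError as exc:
            if exc.errno != errno.EINTR:
                raise  # a real error, not an interrupted system call
        if deadline is not None:
            # Shrink the timeout by the time already spent waiting.
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                # The full timeout really has elapsed: report a genuine timeout.
                return [], [], []
```

With this shape, an empty result always means the requested time actually passed, so callers do not need to check the elapsed time themselves. The function also returns a three-element result on every path, so callers can safely index it.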

jacksontj · 2016-05-16T18:04:51Z

Seems that all the tests failed with some gzip error? Sounds unrelated-- esp. since all tests failed with the same error (and I don't get that error locally)

jacksontj · 2016-05-17T18:43:32Z

cc @bbangert @harlowja
Since you guys helped review my patch last time-- I figure you'll be interested in this one as well.

bbangert · 2016-05-17T23:26:52Z

This looks fine to me, but I'm also stumped by the Travis errors. Restarting Travis didn't seem to fix anything either.

jacksontj · 2016-05-18T15:14:45Z

@bbangert do you know of a way to re-trigger the run? From the errors it looks like travis might have had an issue.

bbangert · 2016-05-18T15:49:13Z

@jacksontj yup, triggered it again just now

jacksontj · 2016-05-19T21:24:53Z

This is really odd, when I run the tests locally they seem to pass just fine. Is there some way to get additional debugging info from travis? Or is there someone who maintains it that we could ping?

jacksontj · 2016-05-24T17:44:31Z

@bbangert any ideas who we can talk to?

bbangert · 2016-05-24T19:04:47Z

@jacksontj not offhand. there might have been a few failures before this one though on master, so I need to retry them as well to see when it broke for good

bbangert · 2016-05-24T19:55:42Z

@jacksontj ok, well, I went back to a prior travis test that only had 2 fail, and reran it.... and then they all failed. I'm guessing something on Travis has changed, such that all our tests are now insta-fail.

bbangert · 2016-05-24T20:02:54Z

I've found the issue. The mirror chosen no longer has 3.4.7, I reverted that so that the tests can run. Will be restarting this shortly.

bbangert · 2016-05-24T20:12:45Z

@jacksontj looks like there's 1 error reported.

======================================================================
ERROR: test_dirty_sock (kazoo.tests.test_connection.TestConnectionHandler)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/python-zk/kazoo/kazoo/tests/test_connection.py", line 229, in test_dirty_sock
    wait(lambda: client.handler.select([read_sock], [], [], 0)[0] == [])
  File "/home/travis/build/python-zk/kazoo/kazoo/tests/util.py", line 105, in __call__
    if func():
  File "/home/travis/build/python-zk/kazoo/kazoo/tests/test_connection.py", line 229, in <lambda>
    wait(lambda: client.handler.select([read_sock], [], [], 0)[0] == [])
TypeError: 'NoneType' object is unsubscriptable

This is in a similar vein to python-zk#250, but the root cause of this issue is actually an upstream Python issue (http://bugs.python.org/issue20611), which is not fixed in 2.x or in 3.x before 3.5. Those versions don't retry interrupted system calls, so applications are responsible for doing so. This patch simply retries the create_connection call on system call interrupt, while honoring the original timeout request. Note: the overall timeout for `create_tcp_connection` is honored, so after an interrupt each subsequent call to `create_connection` has less time to complete.
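The connection-setup retry can be sketched along the same lines. This is an illustration under assumptions, not kazoo's implementation: `connect_with_retry` and its `_connect` parameter are hypothetical names, the sketch assumes Python 3, and it only wraps the standard library's `socket.create_connection`.

```python
import errno
import socket
import time


def connect_with_retry(address, timeout, _connect=socket.create_connection):
    """Hypothetical helper: retry create_connection on EINTR within one overall timeout."""
    end = time.monotonic() + timeout
    while True:
        # Each attempt only gets whatever is left of the original budget.
        remaining = end - time.monotonic()
        if remaining <= 0:
            raise socket.timeout("timed out")
        try:
            return _connect(address, timeout=remaining)
        except OSError as exc:
            # socket.timeout and other errors have a different errno (or none)
            # and propagate; only an interrupted system call is retried.
            if exc.errno != errno.EINTR:
                raise
```

Because the deadline is computed once up front, an attempt that follows an interrupt gets a shorter timeout than the one before it, which is the trade-off the note above describes.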

jacksontj · 2016-06-03T01:51:39Z

@bbangert fixed the issues :) now travis is happy.

bbangert · 2016-06-03T01:58:11Z

@jacksontj looks good!

jacksontj closed this May 16, 2016

jacksontj reopened this May 16, 2016

jacksontj force-pushed the system_call_interrupts branch 3 times, most recently from d9699ac to c8e3f22 Compare June 3, 2016 01:25

jacksontj force-pushed the system_call_interrupts branch from c8e3f22 to e7aac2e Compare June 3, 2016 01:40

bbangert merged commit 1b4bca7 into python-zk:master Jun 3, 2016
