
Experiment with flaky project to re-run designated flakey tests #748

Closed
sigmavirus24 opened this issue Nov 23, 2015 · 14 comments

@sigmavirus24
Contributor

https://github.com/box/flaky looks like it might help us work around our CI flakiness (as a temporary solution until we can make the test suite itself less flaky by default).
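
For reference, a minimal sketch of how flaky could be applied, assuming its plugin is enabled for our test runner; the test name and run counts are illustrative, not a decision:

from flaky import flaky

# max_runs/min_passes are flaky's knobs for "re-run up to N times, pass if at
# least M runs pass"; the values and the test name here are arbitrary.
@flaky(max_runs=3, min_passes=1)
def test_something_socket_dependent():
    pass  # a test we expect to occasionally fail on an overloaded CI box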

@shazow
Member

shazow commented Nov 24, 2015

Ideally we'd want to re-run tests that fail with a specific error. All of the flaky Travis failures look like this:

  ...
  File "/opt/python/3.2.5/lib/python3.2/socket.py", line 287, in readinto
    return self._sock.recv_into(b)
nose.proxy.error: [Errno 104] Connection reset by peer

As far as I can tell, the machine is overloaded and is randomly killing sockets.

Since like 80% of our tests depend on sockets, it won't be super productive to annotate them. I (mostly) deflaked all the failures that were related to using time.sleep. I think there might be one or two left, but they're almost never the cause of failures anymore.

@Lukasa
Sponsor Contributor

Lukasa commented Nov 26, 2015

I'd like to fix this properly, if we can. The problem is something in the Travis environment, and it should be possible for us to track it down and fix it.

I seem to recall seeing two failures. The first, as pointed out by @shazow, is:

======================================================================
ERROR: test_connect_timeout (test.with_dummyserver.test_connectionpool.TestConnectionPoolTimeouts)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/shazow/urllib3/test/with_dummyserver/test_connectionpool.py", line 158, in test_connect_timeout
    self.assertRaises(ConnectTimeoutError, pool._make_request, conn, 'GET', url)
  File "/opt/python/pypy-2.5.0/lib-python/2.7/unittest/case.py", line 473, in assertRaises
    callableObj(*args, **kwargs)
  File "/home/travis/build/shazow/urllib3/urllib3/connectionpool.py", line 376, in _make_request
    httplib_response = conn.getresponse(buffering=True)
  File "/opt/python/pypy-2.5.0/lib-python/2.7/httplib.py", line 1068, in getresponse
    response.begin()
  File "/opt/python/pypy-2.5.0/lib-python/2.7/httplib.py", line 409, in begin
    version, status, reason = self._read_status()
  File "/opt/python/pypy-2.5.0/lib-python/2.7/httplib.py", line 365, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File "/opt/python/pypy-2.5.0/lib-python/2.7/socket.py", line 534, in readline
    data = self._sock.recv(self._rbufsize)
error: [Errno 104] Connection reset by peer

The second, far less common, is:

======================================================================
ERROR: test suite for <class 'test.with_dummyserver.test_https.TestHTTPS_NoSAN'>
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/shazow/urllib3/.tox/py32/lib/python3.2/site-packages/nose/suite.py", line 210, in run
    self.setUp()
  File "/home/travis/build/shazow/urllib3/.tox/py32/lib/python3.2/site-packages/nose/suite.py", line 293, in setUp
    self.setupContext(ancestor)
  File "/home/travis/build/shazow/urllib3/.tox/py32/lib/python3.2/site-packages/nose/suite.py", line 316, in setupContext
    try_run(context, names)
  File "/home/travis/build/shazow/urllib3/.tox/py32/lib/python3.2/site-packages/nose/util.py", line 471, in try_run
    return func()
  File "/home/travis/build/shazow/urllib3/dummyserver/testcase.py", line 103, in setUpClass
    cls._start_server()
  File "/home/travis/build/shazow/urllib3/dummyserver/testcase.py", line 92, in _start_server
    cls.scheme, cls.host)
  File "/home/travis/build/shazow/urllib3/dummyserver/server.py", line 206, in run_tornado_app
    sockets = bind_sockets(None, address=host)
  File "/home/travis/build/shazow/urllib3/dummyserver/server.py", line 192, in bind_sockets
    sock.bind(sockaddr)
socket.error: [Errno 98] Address already in use

The second of these should be fairly easy to fix: at line 206 of server.py, we should wrap the bind_sockets() call in a try...except block that simply tries again if EADDRINUSE is raised. That should safely resolve the problem.
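
A rough sketch of what that retry could look like, assuming the bind_sockets helper from dummyserver/server.py is passed in; the attempt count is an arbitrary choice:

import errno
import socket

def bind_with_retry(bind_sockets, host, attempts=5):
    # Retry the bind if something is still holding the port (EADDRINUSE)
    # rather than failing the whole test class's setUp on the first try.
    for attempt in range(attempts):
        try:
            return bind_sockets(None, address=host)
        except socket.error as exc:
            if exc.errno != errno.EADDRINUSE or attempt == attempts - 1:
                raise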

@Lukasa
Sponsor Contributor

Lukasa commented Nov 26, 2015

Hmm, hang on...the test I have a log for was expecting a connection timeout but didn't get one: it actually connected. Let me see if that's a common thread or a one-off.

@Lukasa
Sponsor Contributor

Lukasa commented Nov 26, 2015

It's the same here, and here, and here, and here, and here (though there's a bonus failure in that one), and here, and...

Actually, I got bored. Basically, I think that many of these Travis failures boil down to the fact that the connection timeout is not actually firing on Travis. The listener then returns immediately, causing Python to close the socket and raise our ECONNRESET error (because the socket got an RST when we wrote data to it). It would be better if assertRaises thought of that as a failure, rather than an error, because if it did we might have spotted this sooner.

I think, then, that we can fix this class of errors by making SHORT_TIMEOUT even shorter. Let me see if I'm right.

@Lukasa
Sponsor Contributor

Lukasa commented Nov 26, 2015

Ah, the tests in there sometimes use SHORT_TIMEOUT as something that will cause a connection timeout, and sometimes as something that won't. Line 82 in connectionpool.py calls urlopen expecting that it'll get a ReadTimeoutError, but if I shorten SHORT_TIMEOUT it gets a ConnectTimeoutError instead.

I think the problem here is that the two failing test cases have noop_handler functions that do nothing, where what they should actually do is sleep for a short while and then return.

There's a further problem with some of these tests: they actually make multiple connection attempts. That's not how the socket server works: if you want a test to accept multiple inbound connections, the socket handler needs to make multiple accept calls. This one doesn't, which means the later connection attempts have no socket to connect to at all. I think the only reason they work is the TCP linger time: otherwise I'd expect a different error altogether.

I'm going to attempt to fix this in two stages. First, I'll add a sleep to the noop_handler. Second, I'll split the two offending tests up into smaller tests that each do only one thing and don't rely on the TCP linger time.
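
As a sketch of the first stage (assuming the socket-level handlers receive the listening socket as their only argument; the handler name and the sleep length are illustrative):

import time

def noop_handler(listener):
    # Accept the inbound connection and hold it open briefly, rather than
    # returning immediately and letting the socket get torn down under us.
    sock = listener.accept()[0]
    time.sleep(0.1)  # arbitrary short delay for the sketch
    sock.close()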

@Lukasa
Sponsor Contributor

Lukasa commented Nov 26, 2015

Weirdly, putting a sleep in those functions appears to make them more likely to fail, not less. That's...unexpected.

@Lukasa
Sponsor Contributor

Lukasa commented Nov 26, 2015

OH, I SEE. 😎

Any time those tests were passing it was because of the TCP linger time. The tests only passed when that function exited so quickly that the CPython garbage collector closed the socket, and the TCP linger time caused us to quietly sit there ignoring the packets, but not refusing them.

If the tests take longer, the OS actually accepts the connection under the hood because of the listen backlog. This means the connection time is really fast! In principle we'd expect this not to happen because we're setting the listen backlog to zero, but the listen backlog is advisory and, at least on OS X, if you set it to zero the OS just happily ignores you.
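
A standalone illustration of that backlog behaviour (not project code): even with listen(0), the connect below typically succeeds because the OS still queues the connection.

import socket

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(0)  # a backlog of 0 is only advisory

# We never call accept(), yet the connect usually completes almost instantly.
client = socket.create_connection(listener.getsockname(), timeout=1)
print("connected despite listen(0)")
client.close()
listener.close()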

Now we have to work out how to construct a scenario where we know there is an unreachable port. We could try always connecting to an unroutable IP address, I suppose...
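
For example, something like this (the address is an assumption for illustration: TEST-NET-1 from RFC 5737, which should never route; we'd still need to confirm it behaves that way on CI):

import socket

try:
    # 192.0.2.1 is reserved documentation space and should never route; on
    # most networks the connect attempt just hangs until the timeout fires,
    # though some networks may instead fail fast with an unreachable error.
    socket.create_connection(("192.0.2.1", 80), timeout=0.5)
except (socket.timeout, socket.error) as exc:
    print("connect failed as desired: %r" % exc)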

@Lukasa
Sponsor Contributor

Lukasa commented Nov 26, 2015

So, that appears to fix the "connection reset by peer" errors. But there's a new one we need to chase (unrelated to the fix, I think):

======================================================================
ERROR: test_cross_host_redirect (test.with_dummyserver.test_poolmanager.TestPoolManager)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/shazow/urllib3/test/with_dummyserver/test_poolmanager.py", line 80, in test_cross_host_redirect
    timeout=0.01, retries=1)
  File "/home/travis/build/shazow/urllib3/urllib3/request.py", line 69, in request
    **urlopen_kw)
  File "/home/travis/build/shazow/urllib3/urllib3/request.py", line 90, in request_encode_url
    return self.urlopen(method, url, **extra_kw)
  File "/home/travis/build/shazow/urllib3/urllib3/poolmanager.py", line 190, in urlopen
    return self.urlopen(method, redirect_location, **kw)
  File "/home/travis/build/shazow/urllib3/urllib3/poolmanager.py", line 162, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/home/travis/build/shazow/urllib3/urllib3/connectionpool.py", line 609, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/home/travis/build/shazow/urllib3/urllib3/util/retry.py", line 273, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=47312): Max retries exceeded with url: /echo?a=b (Caused by ReadTimeoutError("HTTPConnectionPool(host='127.0.0.1', port=47312): Read timed out. (read timeout=0.01)",))
-------------------- >> begin captured logging << --------------------
tornado.access: INFO: 200 GET / (::1) 8.65ms
urllib3.util.retry: DEBUG: Converted retries value: 0 -> Retry(total=0, connect=None, read=None, redirect=0)
urllib3.connectionpool: INFO: Starting new HTTP connection (1): localhost
tornado.access: INFO: 303 GET /redirect?target=http%3A%2F%2F127.0.0.1%3A47312%2Fecho%3Fa%3Db (::1) 6.20ms
urllib3.connectionpool: DEBUG: "GET /redirect?target=http%3A%2F%2F127.0.0.1%3A47312%2Fecho%3Fa%3Db HTTP/1.1" 303 0
urllib3.util.retry: DEBUG: Converted retries value: 0 -> Retry(total=0, connect=None, read=None, redirect=None)
urllib3.util.retry: DEBUG: Converted retries value: 1 -> Retry(total=1, connect=None, read=None, redirect=0)
tornado.access: INFO: 303 GET /redirect?target=http%3A%2F%2F127.0.0.1%3A47312%2Fecho%3Fa%3Db (::1) 6.98ms
urllib3.connectionpool: DEBUG: "GET /redirect?target=http%3A%2F%2F127.0.0.1%3A47312%2Fecho%3Fa%3Db HTTP/1.1" 303 0
urllib3.util.retry: DEBUG: Converted retries value: 1 -> Retry(total=1, connect=None, read=None, redirect=None)
urllib3.util.retry: DEBUG: Incremented Retry for (url='http://localhost:47312/redirect?target=http%3A%2F%2F127.0.0.1%3A47312%2Fecho%3Fa%3Db'): Retry(total=0, connect=None, read=None, redirect=None)
urllib3.poolmanager: INFO: Redirecting http://localhost:47312/redirect?target=http%3A%2F%2F127.0.0.1%3A47312%2Fecho%3Fa%3Db -> http://127.0.0.1:47312/echo?a=b
urllib3.connectionpool: INFO: Starting new HTTP connection (1): 127.0.0.1
tornado.access: INFO: 200 GET /echo?a=b (127.0.0.1) 7.51ms
--------------------- >> end captured logging << ---------------------

@Lukasa
Sponsor Contributor

Lukasa commented Nov 26, 2015

Here I think the test timeout is just way too low: we will raise a ReadTimeoutError if the request takes more than 0.01 seconds, which is probably too short. Given that this test is supposed to test the Retry logic, not the timeout logic, I think we should raise that timeout to a level that will still fail the test if something goes wrong, but isn't so aggressive that it causes spurious test failures.
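
A hedged sketch of the kind of change I mean (the numbers and the explicit Timeout split are assumptions, not necessarily what we'd land on):

from urllib3.util.timeout import Timeout

# Still bounds the request, so a genuinely broken retry path fails the test,
# but gives a loaded CI box plenty of headroom for the redirect round-trips.
RETRY_TEST_TIMEOUT = Timeout(connect=1.0, read=1.0)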

@Lukasa
Sponsor Contributor

Lukasa commented Nov 26, 2015

And weirdly, it's only the cross_host_redirect test that contains a timeout. @shazow, it seems you added the timeout in da66d83, and I can't for the life of me work out why you did.

@Lukasa
Sponsor Contributor

Lukasa commented Nov 26, 2015

Ok, we also hit this occasionally (on my local machine):

======================================================================
ERROR: test_retries (test.with_dummyserver.test_socketlevel.TestProxyManager)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/cory/Documents/Python/urllib3/test/with_dummyserver/test_socketlevel.py", line 538, in test_retries
    assert_same_host=False, retries=False)
  File "/Users/cory/.pyenv/versions/2.7.10/lib/python2.7/unittest/case.py", line 475, in assertRaises
    callableObj(*args, **kwargs)
  File "/Users/cory/Documents/Python/urllib3/urllib3/connectionpool.py", line 609, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/Users/cory/Documents/Python/urllib3/urllib3/util/retry.py", line 224, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/Users/cory/Documents/Python/urllib3/urllib3/connectionpool.py", line 559, in urlopen
    body=body, headers=headers)
  File "/Users/cory/Documents/Python/urllib3/urllib3/connectionpool.py", line 376, in _make_request
    httplib_response = conn.getresponse(buffering=True)
  File "/Users/cory/.pyenv/versions/2.7.10/lib/python2.7/httplib.py", line 1132, in getresponse
    response.begin()
  File "/Users/cory/.pyenv/versions/2.7.10/lib/python2.7/httplib.py", line 453, in begin
    version, status, reason = self._read_status()
  File "/Users/cory/.pyenv/versions/2.7.10/lib/python2.7/httplib.py", line 417, in _read_status
    raise BadStatusLine(line)
ProtocolError: ('Connection aborted.', BadStatusLine("''",))
-------------------- >> begin captured logging << --------------------
urllib3.util.retry: DEBUG: Converted retries value: 1 -> Retry(total=1, connect=None, read=None, redirect=None)
urllib3.connectionpool: INFO: Starting new HTTP connection (1): localhost
urllib3.util.retry: DEBUG: Incremented Retry for (url='http://www.google.com'): Retry(total=0, connect=None, read=None, redirect=None)
urllib3.connectionpool: WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None)) after connection broken by 'ProtocolError('Connection aborted.', BadStatusLine("''",))': http://www.google.com
urllib3.connectionpool: INFO: Starting new HTTP connection (2): localhost
urllib3.connectionpool: DEBUG: "GET http://www.google.com HTTP/1.1" 200 87
urllib3.util.retry: DEBUG: Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0)
--------------------- >> end captured logging << ---------------------

This one seems to be about whether the socket is closed before we send the final request. Let me see if I can fix that up as well. I'm on a roll here, so let's just do it.

@shazow
Member

shazow commented Dec 3, 2015

Is this issue still relevant?

@Lukasa
Sponsor Contributor

Lukasa commented Dec 3, 2015

I think probably not? At this point I think we want to make a dedicated push to un-flake the tests, rather than live with the flakiness.

@shazow
Member

shazow commented Dec 3, 2015

I diiiiid, I did several such pushes, but there is always more. :P

@shazow shazow closed this as completed Dec 3, 2015