Experiment with flaky project to re-run designated flakey tests #748
Ideally we'd want to re-run tests that fail with a specific error. All of the flaky Travis failures look like this:
As far as I can tell, the machine is overloaded and is randomly killing sockets. Since like 80% of our tests depend on sockets, it won't be super productive to annotate them. I (mostly) deflaked all the failures that were related to using time.sleep. I think there might be one or two left, but they're almost never the cause of failures anymore.
I'd like to fix this properly, if we can. The problem is something in the Travis environment, and it should be possible for us to track it down and fix it. I seem to recall seeing two failures. The first, as pointed out by @shazow, is:
The second, far less common, is:
The second one of these should be fairly fixable: in line 206 of …
Hmm, hang on... the test I have a log for was expecting a connection timeout but didn't get one: it actually connected. Let me see if that's a common thread or a one-off.
It's the same here, and here, and here, and here, and here (though there's a bonus failure in that one), and here, and... Actually, I got bored.

Basically, I think that many of these Travis failures boil down to the fact that the connection timeout is not actually firing on Travis. The listener then returns immediately, causing Python to close the socket and raise our ECONNRESET error (because the socket got an RST when we wrote data to it), when it would be better if the timeout actually fired. I think, then, that we can fix this class of errors by making the listener block rather than return immediately.
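A minimal sketch of the failure mode described above, not taken from the test suite itself: a handler that returns (and closes the socket) immediately leaves unread data in its receive buffer, the kernel answers with an RST, and the client's next write or read surfaces as "connection reset by peer". Names like `immediate_close_handler` are illustrative.

```python
import socket
import threading
import time

def immediate_close_handler(listener):
    # Accept one connection and close it without reading. Closing a
    # socket with unread data in its receive buffer makes the kernel
    # send an RST rather than a normal FIN.
    conn, _ = listener.accept()
    time.sleep(0.2)  # let the client's first send land in our buffer
    conn.close()

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
addr = listener.getsockname()
threading.Thread(target=immediate_close_handler, args=(listener,)).start()

client = socket.create_connection(addr)
client.sendall(b"GET / HTTP/1.1\r\n")  # never read by the server
time.sleep(0.4)                        # give the server time to close
try:
    client.sendall(b"Host: example.com\r\n\r\n")  # write after the RST
    client.recv(1024)
except (ConnectionResetError, BrokenPipeError) as e:
    # The exact errno (ECONNRESET vs EPIPE) varies by platform and by
    # which call hits the reset connection first.
    print("connection reset by peer:", e)
finally:
    client.close()
    listener.close()
```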
Ah, the tests in there sometimes use … I think the problem here is that the two failing test cases have …

There's a further problem with some of these tests, which is that they actually make multiple connection attempts. That's not how the socket server works: if you want a test to accept multiple inbound connections, the socket handler actually needs to make multiple accept calls. This one doesn't, which means that the later connection attempts are going to no socket at all. I think the only reason they work is the TCP linger time: otherwise I'd expect a different error altogether.

I'm going to attempt to fix this in two stages. First, I'll add a sleep to the noop_handler. Second, I'll split the two offending tests up into smaller tests that each do only one thing and don't rely on the TCP linger time.
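A sketch of the "one accept per inbound connection" point above, with illustrative names rather than the project's real test helpers: the handler calls `accept()` once for each connection the test will make, so a retry connects to a real socket instead of depending on linger time.

```python
import socket
import threading

def multi_accept_handler(listener, expected=2):
    # One accept() call per inbound connection the test will make.
    # Without the second accept(), a retrying client connects to
    # nothing and only "works" thanks to the TCP linger time.
    for _ in range(expected):
        conn, _ = listener.accept()
        conn.recv(65536)
        conn.sendall(b"HTTP/1.1 204 No Content\r\n\r\n")
        conn.close()

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(2)
addr = listener.getsockname()
t = threading.Thread(target=multi_accept_handler, args=(listener,))
t.start()

for attempt in range(2):  # two attempts, matched by two accept() calls
    with socket.create_connection(addr) as c:
        c.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n")
        print(attempt, c.recv(1024).decode().splitlines()[0])

t.join()
listener.close()
```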
Weirdly, putting a sleep in those functions appears to make them more likely to fail, not less. That's... unexpected.
OH, I SEE. 😎

Any time those tests were passing it was because of the TCP linger time. The tests only passed when that function exited so quickly that the CPython garbage collector closed the socket, and the TCP linger time caused us to quietly sit there ignoring the packets, but not refusing them. If the tests take longer, the OS actually accepts the connection under the hood because of the listen backlog. This means the connection completes really fast, instead of timing out!

In principle we'd expect this not to happen because we're setting the listen backlog to zero, but the listen backlog is advisory and, at least on OS X, if you set it to zero the OS just happily ignores you.

Now we have to work out how to construct a scenario where we know that there is an unreachable port. We could try always connecting to an unroutable IP address, I suppose...
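A sketch of that "unroutable address" idea. The address here is an assumption for illustration: any address that is unroutable from the test machine works, because the SYN is never answered (no accept via the backlog, and no RST either), so the connect timeout genuinely fires.

```python
import socket

# Assumed-unroutable address for this sketch; adjust for your network.
# On some networks this address may be routable or answer with an RST,
# in which case a different exception would be raised.
TARPIT_HOST = "10.255.255.1"

try:
    socket.create_connection((TARPIT_HOST, 80), timeout=0.5)
except socket.timeout:
    print("connect timed out, as the test expects")
```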
So, that appears to fix the "connection reset by peer" errors. But there's a new one we need to chase (unrelated to the fix, I think):
Here I think the test timeout is just way too low: we will raise a …
Ok, we also hit this occasionally (on my local machine):
This one seems to be about whether the socket is closed before we send the final request. Let me see if I can fix that up as well. I'm on a roll here, so let's just do it.
Is this issue still relevant?
I think probably not? At this point I think we want to make a dedicated push to un-flake the tests, rather than live with the flakiness.
I diiiiid, I did several such pushes, but there is always more. :P
https://github.com/box/flaky looks like it might help us overcome our CI's flakiness (as a temporary solution until we can make the test suite itself less flaky by default).
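As I understand flaky's API, its `rerun_filter` hook is what would let us do exactly what the top of this thread asks for: re-run only tests that fail with a specific error, rather than annotating every socket-using test. A sketch, with illustrative test and helper names:

```python
from flaky import flaky

def _rerun_on_reset(err, name, test, plugin):
    # err is the sys.exc_info() triple from the failing run; only
    # re-run when the failure was a connection reset.
    return err is not None and issubclass(err[0], ConnectionResetError)

@flaky(max_runs=3, min_passes=1, rerun_filter=_rerun_on_reset)
def test_request_over_socket():
    ...  # the socket-level test body would go here
```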