Tracking issue: intermittent test failures #200

Open
njsmith opened this Issue Jun 11, 2017 · 18 comments

@njsmith
Member

njsmith commented Jun 11, 2017

It's the nature of I/O libraries like Trio that their test suites are prone to weird intermittent failures. But they're often hard to track down, and usually the way you encounter them is that you're trying to land some other unrelated feature and the CI randomly fails, so the temptation is to click "re-run build" and worry about it later.

This temptation must be resisted. If left unchecked, you eventually end up with tests that fail all the time for unknown reasons and no-one trusts them and it's this constant drag on development. Flaky tests must be eradicated.

But to make things extra fun, there's another problem: CI services genuinely are a bit flaky, so when you see a weird failure or lock-up in the tests then it's often unclear whether this is a bug in our code, or just some cloud provider having indigestion. And you don't want to waste hours trying to reproduce indigestion. Which means we need to compare notes across multiple failures. Which is tricky when I see one failure, and you see another, and neither of us realizes that we're seeing the same thing. Hence: this issue.

What to do if you see a weird test failure that makes no sense:

  • Visit this bug; it's #200 so it's hopefully easy to remember.

  • Check to see if anyone else has reported the same failure

  • Either way, add a note recording what you saw. Make sure to link to the failed test log.

  • If it's a failed travis-ci run, DO NOT CLICK THE "RESTART BUILD" OR "RESTART JOB" BUTTON! That will wipe out the log and replace it with your new run, so we lose the information about what failed. Instead, close and then re-open the PR; this will tickle Travis into re-testing your commit, but in a way that gives the new build a new URL, so that the old log remains accessible.

Currently known issues

On the radar:

  • Why did this appveyor build fail with "Could not find a version that satisfies the requirement requests>=2.7.9 (from codecov) (from versions: )"? (re: #535 (comment))
  • segfault in pypy 3.6 nightly after faulthandler timeout fired: #200 (comment)
@njsmith

Member

njsmith commented Jun 17, 2017

Regarding the weird pypy nightly freeze in test_local: I downloaded the pypy-c-jit-91601-609a3cdf9cf7-linux64 nightly and have let it loop running the trio test suite for the last few hours on my laptop, and I haven't been able to reproduce the problem so far. (Though I did get two failures in test_ki.py, both on this line, which is getting "did not raise KeyboardInterrupt". The test sessions otherwise finished normally.)
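
For reference, the reproduction loop was just something along these lines (the exact pytest invocation here is illustrative, not the precise command I used):

    import subprocess
    import sys

    # Keep re-running the test suite until something fails, and save the
    # failing output so it isn't lost to scrollback. The pytest arguments
    # are illustrative.
    run = 0
    while True:
        run += 1
        result = subprocess.run(
            [sys.executable, "-m", "pytest", "--pyargs", "trio"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
        )
        if result.returncode == 0:
            print(f"run {run} passed")
            continue
        with open(f"failure-{run}.log", "wb") as f:
            f.write(result.stdout)
        print(f"run {run} failed; output saved to failure-{run}.log")
        break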

@njsmith changed the title from "testing.py needs to get broken down into submodules" to "Tracking issue: intermittent test failures" Jun 17, 2017

@njsmith

Member

njsmith commented Aug 4, 2017

#119 is now fixed, I think / hope. Demoting it to "on the radar".

@njsmith

Member

njsmith commented Aug 8, 2017

Got fed up and fixed #140 :-)

@njsmith

Member

njsmith commented Aug 18, 2017

#140 came back. This makes no sense at all.

@njsmith

Member

njsmith commented Sep 7, 2017

Freeze in test_local.py::test_run_local_simultaneous_runs on pypy3 5.8 – maybe just travis being flaky, maybe something more.

@njsmith

Member

njsmith commented Nov 6, 2017

Here's a weird one: https://travis-ci.org/python-trio/trio/jobs/298164123

It looks like in our test run for CPython 3.6.2 on MacOS, one of our calls to the synchronous stdlib function SSLSocket.unwrap() raised an SSLWantWriteError. Which should be impossible for a synchronous call, I think? Maybe this is some weird intermittent bug in the stdlib ssl module?
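
To illustrate what's surprising here, a minimal sketch of a blocking unwrap() (the host is just a placeholder, not what the test actually does): on a blocking socket, the ssl module is supposed to handle retries internally, so SSLWantWriteError should never escape to the caller.

    import socket
    import ssl

    # Sketch only: a blocking (synchronous) TLS connection, then a TLS
    # shutdown via unwrap(). Because the underlying socket is blocking,
    # the non-blocking signal SSLWantWriteError shouldn't be able to
    # escape from unwrap(), yet that's what the CI log shows.
    ctx = ssl.create_default_context()
    with socket.create_connection(("example.com", 443)) as sock:
        with ctx.wrap_socket(sock, server_hostname="example.com") as tls:
            plain_sock = tls.unwrap()  # synchronous close_notify exchange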

@njsmith

Member

njsmith commented Dec 1, 2017

Here's a new one, I guess somehow introduced by #358: a timeout test failing on Windows because a 1 second sleep is measured to take just a tinnnnny bit less than 1 second.
https://ci.appveyor.com/project/njsmith/trio/build/1.0.768/job/3lbdyxl63q3h9s21
#361 attempts a diagnosis and fix.
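
For illustration, the pattern at issue is roughly this; the tolerance is the kind of fix needed, and the value here is just an example, not necessarily what #361 uses:

    import time

    # On Windows, clock/sleep granularity can make a nominal 1-second
    # sleep measure fractionally under 1 second, so an exact comparison
    # is flaky. A small tolerance absorbs that jitter; the value here is
    # only an example.
    TOLERANCE = 0.01

    start = time.perf_counter()
    time.sleep(1)
    elapsed = time.perf_counter() - start
    assert elapsed >= 1 - TOLERANCE, f"sleep(1) measured as {elapsed:.6f}s"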

@njsmith

Member

njsmith commented Dec 5, 2017

The weird SSL failure happened again: https://travis-ci.org/python-trio/trio/jobs/311618077
Again on CPython 3.6 on MacOS.

Filed upstream as bpo-32219. Possibly for now we should ignore SSLWantWriteError there as a workaround.

Edit: #365 is the PR for ignoring it.
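
Roughly the shape of that workaround, as a sketch (not necessarily exactly what #365 does):

    import ssl

    def unwrap_ignoring_want_write(tls_sock):
        # If the blocking unwrap() spuriously raises SSLWantWriteError
        # (which should be impossible, per bpo-32219), treat the TLS
        # shutdown as done instead of failing the test. Sketch only.
        try:
            return tls_sock.unwrap()
        except ssl.SSLWantWriteError:
            return None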

@njsmith

Member

njsmith commented Dec 21, 2017

Another freeze on PyPy nightly in tests/test_local.py::test_run_local_simultaneous_runs: https://travis-ci.org/python-trio/trio/jobs/319497598

Same thing happened on Sept. 7, above: #200 (comment)
And back in June: #200 (comment)

Filed a bug: #379

@njsmith

Member

njsmith commented Feb 21, 2018

Sigh: #447

@njsmith

Member

njsmith commented Apr 21, 2018

@njsmith

Member

njsmith commented May 19, 2018

There was a mysterious appveyor build failure here: #535

@pquentin

Member

pquentin commented Jun 13, 2018

Strange PyPy nightly failure: https://travis-ci.org/python-trio/trio/jobs/391945139

Since it happened on master, I can't close/reopen a PR, but restarting the job produced the same failure.

(I think someone restarted the job above and it finally worked: the job is now green.)

@Fuyukai

Contributor

Fuyukai commented Jul 29, 2018

Jenkins keeps eating pull requests with a segfault (but only sometimes). Looks like a bug in the immutables library, but I can't reproduce it locally, and I don't know how to get the core dump.

@njsmith

Member

njsmith commented Jul 29, 2018

Here's a log with the segfault on Jenkins: https://ci.cryptography.io/blue/rest/organizations/jenkins/pipelines/python-trio/pipelines/trio/branches/PR-575/runs/2/nodes/6/steps/33/log/?start=0

The crash-handler traceback shows it as happening on line 27 of contextvars/__init__.py, which seems to be:

        self._data = immutables.Map()

And immutables.Map is a complex data structure implemented in C, so agreed that this smells like a bug in that code.

Filed a bug upstream here: MagicStack/immutables#7
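
For anyone who wants to chase this locally, a rough stress sketch (iteration counts are arbitrary, and there's no guarantee it triggers the crash):

    import contextvars
    import immutables

    # Churn immutables.Map directly, since that's the C data structure
    # under suspicion. Iteration counts are arbitrary.
    m = immutables.Map()
    for i in range(500_000):
        m = m.set(i, i)
        if i % 3 == 0:
            m = m.delete(i)

    # Also exercise the path from the traceback: constructing fresh
    # contexts, which hits `self._data = immutables.Map()` in the
    # contextvars backport.
    for _ in range(100_000):
        contextvars.Context()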

@njsmith

Member

njsmith commented Jul 30, 2018

The Jenkins thing has been worked around by #583... but now that we're running MacOS tests on Travis, we get #584 instead.

@webknjaz referenced this issue Sep 3, 2018: Fix pyopenssl adapter under Python 3 #113

@njsmith

Member

njsmith commented Sep 26, 2018

A segfault in pypy 3.6 nightly, apparently related to the faulthandler timeout firing in trio/tests/test_ssl.py::test_renegotiation_randomized: https://travis-ci.org/python-trio/trio/jobs/433409577

Reported it on the #pypy IRC channel anyway, though there's not much to go on yet.
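
For reference, the "faulthandler timeout" here is the stdlib watchdog that dumps every thread's traceback if the run takes too long; mechanically it works roughly like this (the timeout and the sleep are placeholders, not the real CI settings):

    import faulthandler
    import sys
    import time

    # If we're still running after the timeout, faulthandler dumps every
    # thread's traceback to stderr; with exit=False the process keeps
    # going. The 5-second timeout and the sleep are placeholders for the
    # real test run.
    faulthandler.dump_traceback_later(5, exit=False, file=sys.stderr)
    try:
        time.sleep(10)  # stand-in for a "stuck" test; triggers the dump
    finally:
        faulthandler.cancel_dump_traceback_later()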

@njsmith

Member

njsmith commented Oct 4, 2018

Another strange pypy 3.6 nightly faulthandler traceback: https://travis-ci.org/python-trio/trio/jobs/436962955

I don't really understand this one at all.
