Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

control-C sometimes getting missed on Windows #119

Closed
njsmith opened this issue Apr 11, 2017 · 13 comments
Closed

control-C sometimes getting missed on Windows #119

njsmith opened this issue Apr 11, 2017 · 13 comments
Labels

Comments

@njsmith
Copy link
Member

njsmith commented Apr 11, 2017

We're getting occasional failures on appveyor, that look like:

\trio\_core\tests\test_ki.py::test_ki_is_good_neighbor PASSED
..\trio\_core\tests\test_ki.py::test_ki_wakes_us_up 
[... coverage info ...]
!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
to show a full traceback on KeyboardInterrupt use --fulltrace
C:\projects\trio\trio\_core\_run.py:1136: KeyboardInterrupt
==================== 14 passed, 1 skipped in 23.20 seconds ====================

(source)

This is a test that delivers a synthetic SIGINT while the trio thread is sleeping, to make sure that the KeyboardInterrupt gets delivered promptly.

_run.py:1136 is the raise KeyboardInterrupt at the end of run() – the one that checks ki.pending one last time as run is exiting.

Given the time report at the end, it looks like the await sleep(20) in test_ki_wakes_us_up may be expiring? And that's the only checkpoint in the test, so if we somehow aren't being woken up by the signal arriving, but the signal is in fact arriving, then this would make sense.

...how can this be happening though? I'm really not sure :-(. May need to push some instrumented builds in a PR and keep rebuilding them on appveyor until I get failures...

@njsmith
Copy link
Member Author

njsmith commented Apr 11, 2017

Here's another one from a few days ago, which looks identical to the one linked above: https://ci.appveyor.com/project/njsmith/trio/build/1.0.265/job/9gyy7y6jp4vlu472
[Edit: and again: https://ci.appveyor.com/project/njsmith/trio/build/1.0.265/job/9gyy7y6jp4vlu472 ]

Instrumentation is in #120, but it looks like this was a bit rare to start with so tracking it down may be difficult :-(

@njsmith
Copy link
Member Author

njsmith commented Apr 11, 2017

Okay, managed to reproduce locally. Some tricky things required to get a trace: make the sigint trace print to stderr, b/c if it runs re-entrant with one of the stdout prints then you have two calls to the same BufferedWriter.write at the same time and it blows up; run without pytest's -s option b/c console printing seems to mess up timing, but then hack the raise KeyboardInterrupt to be raise KeyError so that pytest will actually print the captured stdout/stderr. Anyway:

---------------------------- Captured stdout call -----------------------------
sleeping @ 1491898063.0975676
woke up @ 1491898063.0975676
draining wakeup socket @ 1491898063.0975676
done draining
sleeping @ 1491898063.0975676
woke up @ 1491898063.0975676
draining wakeup socket @ 1491898063.0975676
done draining
sleeping @ 1491898063.0975676
sleeping @ 1491898063.0975676
woke up @ 1491898064.112795
wakeup sock is readable
PyErr_CheckSignals returned 0
draining wakeup socket @ 1491898064.112795
drained b'\x02'
done draining
sleeping @ 1491898064.112795
woke up @ 1491898083.113062
draining wakeup socket @ 1491898083.113062
done draining
finally @ 1491898083.113062
sleeping @ 1491898083.113062
woke up @ 1491898083.113062
draining wakeup socket @ 1491898083.113062
done draining
draining wakeup socket @ 1491898083.113062
drained b'\x00'
done draining
sleeping @ 1491898083.113062
woke up @ 1491898083.113062
draining wakeup socket @ 1491898083.113062
done draining
sleeping @ 1491898083.113062
woke up @ 1491898083.113062
draining wakeup socket @ 1491898083.113062
done draining
---------------------------- Captured stderr call -----------------------------
sigint handler @ 1491898083.113062

So the smoking gun here is:

  • wake up at 1491898064.112795
  • wakeup socket is readable, and we'll see contains \x02, so the C-level signal handler's wakeup fd code ran
  • but PyErr_CheckSignals (called via ctypes.pythonapi.PyErr_CheckSignals) says that there are no signals pending and does not run the Python-level signal handler! The bytecode eval loop doesn't run the signal handler either!
  • so we go back to sleep at 1491898064.112795
  • then ~20 seconds later at 1491898083.113062, we wake up b/c the select timer expired
  • and now the Python handler runs immediately (1491898083.113062)

The problem seems to be that in CPython's implementation of trip_signal, the logic goes:

  • set a flag saying that the signal arrived
  • then write to the wakeup fd (good, we dodged a race condition! ...but wait!)
  • then set a global flag saying "some signal arrived" and call Py_AddPendingCall (the latter of which is what sets the magic flag that tells the bytecode eval loop to immediately pause and run signal handlers)

Normatively, it should never be possible to get a write to the wakeup fd but not have a pending signal. So the fact that we observed this in the trace above seems to me to be a smoking gun: the problem is that the IO thread got scheduled to run after the write to the wakeup fd, but before Py_AddPendingCall.

So it's difficult to validate this because I can only reproduce the bug on Windows and I don't have a way to build CPython on Windows, but I'm ~99% confident that this is happening because trip_signal does things in the wrong order. It should write to the wakeup fd last, after it's set all the global variables.

@njsmith
Copy link
Member Author

njsmith commented Apr 11, 2017

Here's a run I managed to get showing the problem on appveyor: https://ci.appveyor.com/project/njsmith/trio/build/1.0.286/job/mcc0rdjx5fpoqvo6

@njsmith
Copy link
Member Author

njsmith commented Apr 11, 2017

CPython bug report: https://bugs.python.org/issue30038
Proposed fix: python/cpython#1082

PyPy doesn't seem to be affected.

@njsmith
Copy link
Member Author

njsmith commented Apr 11, 2017

Maybe this will be a useful workaround for the test suite until fixed versions of CPython are released? https://github.com/box/flaky

njsmith added a commit to njsmith/trio that referenced this issue Apr 11, 2017
This is being flaky on Windows, almost certainly due to a bug in
CPython 3.5.{0,1,2,3} and 3.6.{0,1}:
    https://bugs.python.org/issue30038
More details in python-triogh-119.

I've submitted (what I think is) a fix to CPython, but in the mean
time the random appveyor failures are pretty annoying and I don't see
any way to fix them except wait for CPython to make a new release, so
as a temporary measure this commit adds some retry logic to this
test.

This is an ugly hack and should be removed as soon as possible.
(Unfortunately that probably won't be until after we can drop 3.6
support...)
@njsmith
Copy link
Member Author

njsmith commented Apr 11, 2017

Committed a temporary workaround in #121

@njsmith
Copy link
Member Author

njsmith commented Apr 11, 2017

Hmm, another more-or-less tolerable workaround would be to have the test simply raise the signal twice. This is what a real person would do if they hit this (i.e., hit control-C again if the first try didn't work), and should work (since if we do write-to-fd, mark-signal-for-delivery, write-to-fd, mark-signal-for-delivery, then that sequence does contain the correct [mark-signal-for-delivery, write-to-fd] sequence as a subsequence).

njsmith added a commit to njsmith/trio that referenced this issue Apr 12, 2017
This makes the workaround more realistic, more precisely targeted, and
less messy code-wise.
njsmith added a commit that referenced this issue Apr 12, 2017
@njsmith
Copy link
Member Author

njsmith commented Apr 12, 2017

#122 implemented the try-control-C-twice workaround, and more precisely targeted it to CPython/Windows/current-versions.

@njsmith
Copy link
Member Author

njsmith commented Apr 19, 2017

I guess there is one way to work around this without fixing CPython, though it's pretty nasty...

Use SetConsoleCtrlHandler to register our own handler. In our handler, use cffi to call raise(SIGINT), and run the regular signal handler synchronously. Then write to our wakeup fd. Then return the magic value that tells Windows not to run any other handlers. So basically this would be monkeypatching and reimplementing the C runtime's signal delivery code... but I guess it would work?

@njsmith njsmith changed the title nondeterministic failure on appveyor control-C sometimes getting missed on Windows Apr 20, 2017
njsmith added a commit to njsmith/cpython that referenced this issue May 16, 2017
Before, it was possible to get the following sequence of
events (especially on Windows, where the C-level signal handler for
SIGINT is run in a separate thread):

- SIGINT arrives
- trip_signal is called
- trip_signal writes to the wakeup fd
- the main thread wakes up from select()-or-equivalent
- the main thread checks for pending signals, but doesn't see any
- the main thread drains the wakeup fd
- the main thread goes back to sleep
- trip_signal sets is_tripped=1 and calls Py_AddPendingCall to notify
  the main thread the it should run the Python-level signal handler
- the main thread doesn't notice because it's asleep

This has been causing repeated failures in the Trio test suite:
  python-trio/trio#119
vstinner pushed a commit to python/cpython that referenced this issue May 16, 2017
Before, it was possible to get the following sequence of
events (especially on Windows, where the C-level signal handler for
SIGINT is run in a separate thread):

- SIGINT arrives
- trip_signal is called
- trip_signal writes to the wakeup fd
- the main thread wakes up from select()-or-equivalent
- the main thread checks for pending signals, but doesn't see any
- the main thread drains the wakeup fd
- the main thread goes back to sleep
- trip_signal sets is_tripped=1 and calls Py_AddPendingCall to notify
  the main thread the it should run the Python-level signal handler
- the main thread doesn't notice because it's asleep

This has been causing repeated failures in the Trio test suite:
  python-trio/trio#119
@njsmith
Copy link
Member Author

njsmith commented Jun 8, 2017

@njsmith
Copy link
Member Author

njsmith commented Jun 8, 2017

I just saw a different and also very weird failure mode in this test: trying to reproduce in my local VM, I ran with --count=10, and on this particular run all 10 tries the test finished in like 0.85-0.95 seconds, so the assert 1 <= duration check failed. I've never seen this before or since. Super weird.

vstinner added a commit to python/cpython that referenced this issue Jun 10, 2017
…2075)

Before, it was possible to get the following sequence of
events (especially on Windows, where the C-level signal handler for
SIGINT is run in a separate thread):

- SIGINT arrives
- trip_signal is called
- trip_signal writes to the wakeup fd
- the main thread wakes up from select()-or-equivalent
- the main thread checks for pending signals, but doesn't see any
- the main thread drains the wakeup fd
- the main thread goes back to sleep
- trip_signal sets is_tripped=1 and calls Py_AddPendingCall to notify
  the main thread the it should run the Python-level signal handler
- the main thread doesn't notice because it's asleep

This has been causing repeated failures in the Trio test suite:
  python-trio/trio#119
(cherry picked from commit 4ae0149)
@njsmith
Copy link
Member Author

njsmith commented Aug 4, 2017

Update:

I figured out one possible reason this could still be failing, that's almost certainly wrong: bpo-31119. AFAICT this is a real bug, but x86/x86-64 provide such strong memory ordering guarantees that it probably doesn't happen in practice on those platforms.

I also figured out why it's actually failing (I think)! Here's a log with --fulltrace enabled: https://ci.appveyor.com/project/njsmith/trio/build/1.0.602/job/9lx65iv7nw877jjf
(which I got by hacking the appveyor build to do pip install pytest-repeat and then pytest [stuff] --fulltrace trio/_core/tests/test_ki.py -k test_ki_wakes_us_up --count=1000)

So the problem is that sometimes, the second ki arrives at a moment when we are in a KI-protected section, so we need to add a checkpoint to check for it. Otherwise it gets detected when our main function exits, which is what we've been seeing.

@njsmith
Copy link
Member Author

njsmith commented Aug 4, 2017

ran 4 builds with --count=1000, 2 of them failed: https://ci.appveyor.com/project/njsmith/trio/build/1.0.602

added a checkpoint then did another 4 builds with --count=1000, they all passed: https://ci.appveyor.com/project/njsmith/trio/build/1.0.603

So tentatively, I think that this must be it.

njsmith added a commit to njsmith/trio that referenced this issue Aug 4, 2017
@njsmith njsmith closed this as completed in 50edf8a Aug 4, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant