New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Race condition in how trip_signal writes to wakeup fd #74224
Comments
In trip_signal [1], the logic goes:
So the problem here is that it's step (2) that wakes up the main thread to check for signals, but it's step (3) that actually arranges for the Python-level signal handler to run. (Step (1) turns out to be irrelevant, because no-one looks at the per-signal flags unless the global is_tripped flag is set. This might be why no-one noticed this bug through code inspection though – I certainly missed it, despite explicitly checking for it several times!) The result is that the following sequence of events is possible:
It turns out that this is a real thing that can actually happen; it's causing an annoying intermittent failure in the trio testsuite on appveyor; and under the correct conditions I can reproduce it very reliably in my local Windows VM. See [2]. I think the fix is just to swap the order of steps (2) and (3), so we write to the wakeup fd last. Unfortunately I can't easily test this because I don't have a way to build CPython on Windows. But [2] has some IMHO pretty compelling evidence that this is what's happening. [1] cpython/Modules/signalmodule.c Lines 238 to 291 in 6fab78e
[2] python-trio/trio#119 |
Last time I had to make a major change related to signal handling, it was in the asyncio module because of a race conditon which occurred on FreeBSD. commit fe5649c
|
Right. My claim would be that the PR I just submitted is the correct fix for bpo-21645 as well. The approach asyncio uses is very elegant, but unfortunately it assumes that the wakeup fd has infinite buffer, which isn't true. If enough signals or other callbacks are scheduled to the event loop quickly enough, then the buffer can overflow and signals can get lost. This is a very classic issue and source of confusion for everyone learning Unix – the reason that there are traditionally 32 unix signals, and that signals can be coalesced and SIGCHLD is so annoying, is that the kernel treats pending signals as flags, not a queue; this means that in the worst case it only takes 32 bits to store all pending signals. (Even the posix committee screwed this up when designing the real-time signals system, which is almost unusable as a result.) On Unix, if you send a process 10 SIGHUP and 1 SIGCHLD, then it might get only 1 SIGHUP and 1 SIGCHLD, but it's guaranteed to get that. With asyncio's model, it might get 10 SIGHUP and 0 SIGCHLD, or even 0 SIGHUP and 0 SIGCHLD (since other methods like call_soon_threadsafe also write to the wakeup fd). In any case, this is why other async libraries (at least twisted, libuv, tornado, trio) treat the wakeup fd as a boolean empty-or-not, and pass the actual information via some other out-of-band mechanism that involves a Python-level signal handler. So the patch here is necessary for everyone else to safely use set_wakeup_fd. |
Previous changes in signal handling. It's the commit c13ef66 of the issue bpo-8407 which changed the order: <add_pending, write into wakeup fd> became <write into wakeup fd, add_pending>. I really *hate* having to think to these evil things which are signals and threads... Signals and threads don't go well altogether :-p It reminds me the "Ghosts of Unix past, part 3: Unfixable designs" article... commit 6c9b35b
commit c13ef66
|
Err, libuv obviously doesn't use a Python-level signal handler. I just meant to include them as another example of a library I checked that uses a self-pipe to handle signals but relies on out-of-band information to transmit what the actual signal is :-). |
I think the idea in c13ef66 wasn't so much that we wanted the wakeup fd to be written to first, as that the way the code was written back then, the presence of 'if (is_tripped) return;' meant that it wasn't getting written to *at all* in some cases. Since then the code got restructured and the early return was removed, so this isn't an issue anymore. |
If it helps, notice that the SetEvent(sigint_event) call used to wake up the main thread on windows is also performed unconditionally and after the call to Py_AddPendingEvent. From the point of view of twisted/tornado/trio, this is exactly the same as the write to the wakeup fd -- the only reason we use the wakeup fd instead of the sigint_event is that it's more convenient to wait on an fd than on an event object. |
"I think the idea in c13ef66 wasn't so much that we wanted the wakeup fd to be written to first, as that the way the code was written back then, the presence of 'if (is_tripped) return;' meant that it wasn't getting written to *at all* in some cases. Since then the code got restructured and the early return was removed, so this isn't an issue anymore." I agree. It wasn't deliberate to change the order, it's was more to write a short patch fixing a very specific use case. |
I know that it can be very difficult to write such test, but can you please try to write a script trying to reproduce the describe the race condition? Later, we can run the script in a test to check for non-regression. |
The attached script wakeup-fd-racer.py fails consistently for me using cpython 3.6.0 on my windows 10 vm:
(It may help that the VM only has 1 CPU? But the same test passes on Linux even when I use taskset to restrict it to 1 cpu. Maybe Windows has some scheduling heuristic where one thread writing to a socket when another thread is blocked on it triggers an immediate context switch.) |
While I suggest you to *not* use an event loop (wakeup fd pipe/socket handle with select) and signal.signal(), you are true that there is a race condition if you use select() with signal.signal() so I merged your change. Do you consider that it's worth it to backport the change to 3.5 and 3.6? Python 2.7 is not affected, right? |
Unfortunately this is the only 100% reliable way to do signal handling with an event loop so I guess you are unlikely to convince any of the projects that use it this way to change their minds :-)
I guess it'd be nice? At least for 3.6? I just had yet another random PR fail its tests due to this bug -- it's turning out to be quite difficult to make my tests reliable :-/. (BTW do you happen to know any tricks to force CPython to do an immediate PyErr_CheckSignals on Windows?) |
Never mind on this... it looks like calling repr() on any object is sufficient. |
I guess now would be a good time to decide whether this should be backported to 3.6, with 3.6.2 coming out in a few days :-). (Or if not, then it can probably be closed?) |
@ned Deily: Hey, Nathaniel wants to backport the commit 4ae0149 which changes how the C signal handler of CPython. There is a low risk of regression, it can be seen as a backward incompatible change. I'm not super excited to backport such change. I consider that there are other solutions to handle signals without hitting this bug. For example, asyncio works well on Python 3.5, there is no known race condition. Basically, the only impacted application is Nathaniel's Trio project. @yury: And you, what do you think? |
I think you mean it's backwards *compatible*? There's definitely no change in APIs or ABIs or anything, all that changes is the order of some statements inside Python's signal handler. (We used to we write to the magic wakeup fd and then set a flag saying a signal arrived; now we do the same things but in the opposite order. I wouldn't call it zero risk, just on the general principle that signal handlers are tricksy beasts, but I think it's about as low risk as a signal handler change can be.) AFAICT the race condition also affects twisted and tornado, though I don't think they've noticed. The main symptom is that it makes control-C handling flaky on Windows, so it's a difficult thing to test for. I would prefer to see it backported to 3.6, but it's a weird edge case bug so the world is not going to end if the fix has to wait for 3.7. |
I mean incompatible since it changes the behaviour in a subtle way. |
But, by that definition, like... every change is backwards incompatible. I'm pretty confident that no-one was relying on this race condition. I can't even figure out what that would mean. |
It sounds like a bug to me, and thus suitable to backport, but it certainly would be worth getting other opinions since this is a tricky and critical area. Antoine, Steve, others: thoughts? |
I think it's safe to merge this in 3.6. |
Ned Deily and Yury Selivanov are in favor of a backport, thank you for looking at this one guys. So I created the PR 2075. I will merge it once the CI pass. |
Misc/NEWS entries for 3.6 and master for these changes? |
Why is this PR not closed? |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: