-
-
Notifications
You must be signed in to change notification settings - Fork 30.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Race condition in Thread._wait_for_tstate_lock() #89437
Comments
Bernát Gábor found an interesting bug on Windows. Sometimes, when a process is interrupted on Windows with CTRL+C, its parent process hangs in thread.join(): Reproducer:
tox.ini: [testenv] Source: https://gist.github.com/gaborbernat/f1e1aff0f2ee514b50a92a4d019d4d13 I tried to attach the Python process in Python: there is a single thread, the main thread which is blocked in thread.join(). You can also see it in the faulthandler traceback. I did a long analysis of the _tstate_lock and I checked that thread really completed. Raw debug traces: == thread 6200 exit == thread_run[pid=3984, thread_id=6200]: clear == main thread is calling join() but gets a KeyboardInterrupt == [pid=3984, thread_id=8000] Lock<obj=000001C1122669C0>.acquire() -> ACQUIRED == main thread calls repr(thread) == ROOT: [3984] KeyboardInterrupt - teardown started == main thread calls thread.join()... which hangs == _wait_for_tstate_lock[pid=3984, current thread_id=8000, self thread_id=6200]: acquire(block=True, timeout=-1): lock obj= 0x1c1122669c0 Explanation:
def _wait_for_tstate_lock(self, block=True, timeout=-1):
lock = self._tstate_lock
if lock is None:
assert self._is_stopped
elif lock.acquire(block, timeout):
# -- got KeyboardInterrupt here ---
lock.release()
self._stop()
You can reproduce the issue on Linux with attached patch and script:
Example: $ git apply threading_bug.patch
$ ./python threading_bug.py
join...
join failed with: KeyboardInterrupt()
join again... I'm now working on a PR to fix the race condition. |
For curious people, see also bpo-44422 "Fix threading.enumerate() reentrant call": another example of race condition recently fixed in the threading module. But it's unrelated to this bug ;-) |
I am not sure that it can be solved at Python level. Possible solutions:
if lock._blink(block, timeout):
self._stop()
with suppress_interrupt():
if not lock._blink(block, timeout):
return
self._stop() |
That's not enough technically. This can be any signal handler that raises an exception, no? |
See bpo-21822 from 2014, which I think was the first report of this issue. |
I just want to note that my maximal reproducer moved to tox-dev/tox@e5d1a43 (that has been merged on the branch, so should be now stable). |
Right. In pure Python, we cannot write code which works in all cases. My PR 28532 fix the most common case: application interrupted by a single CTRL+C.
It's important to be able to interrupt acquire() which can be called in blocking mode with no timeout: it's exactly what tox does, and users expect to be able to interrupt tox in this case.
The acquire()+release() sequence can be made atomic in C, but it doesn't solve the problem of _stop() which can be interrupted by a second exception. This bug is likely as old as Python. I don't think that we should attempt to design a perfect solution. I only propose to make the race condition (way) less likely. |
This is the same as bpo-21822, so I suppose it should be closed as well with a reference to this issue. |
You're right. I marked bpo-21822 as a duplicate of this issue. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: