-
-
Notifications
You must be signed in to change notification settings - Fork 31.7k
python3.8 regression - ThreadPool join via __del__ hangs forever #83541
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Assume the following code: from multiprocessing.pool import ThreadPool
class A(object):
def __init__(self):
self.pool = ThreadPool()
def __del__(self):
self.pool.close()
self.pool.join()
a = A()
print(a) The code snippet above hangs forever on Python 3.8+ (works ok on Python 3.7 and earlier). An example output where I've added some extra prints on to the thread joins:
I've tested on MacOs, but could replicate on Linux too within the CI. |
An FYI update, the code is used by swagger-codegen to generate HTTP clients (https://github.com/swagger-api/swagger-codegen/blob/master/modules/swagger-codegen/src/main/resources/python/api_client.mustache#L73). The example below basically replicates creating such client as a global. |
This code snippet seems to trigger a crash in the _struct module: |
Reported the issue downstream too under swagger-api/swagger-codegen#9991 |
This is not a regression: the code only worked on 3.7 because there is a sleep in the pool maintainance loop that makes the pool more likely being more consistent, but is the same problem as there is a race condition. For example, reducing the sleep in the loop (notice 3.8 does not use a sleep but instead uses a more resilient system of sentinels) for the 3.7 version reproduces the problem: diff --git a/Lib/multiprocessing/pool.py b/Lib/multiprocessing/pool.py
index 3e9a0d6b48..f8d438d87c 100644
--- a/Lib/multiprocessing/pool.py
+++ b/Lib/multiprocessing/pool.py
@@ -410,7 +410,7 @@ class Pool(object):
# is terminated.
while thread._state == RUN or (pool._cache and thread._state != TERMINATE):
pool._maintain_pool()
- time.sleep(0.1)
+ time.sleep(0.00000001)
# send sentinel to stop workers
pool._taskqueue.put(None)
util.debug('worker handler exiting') With that patch, 3.7 hangs as well. The problem here is that something regarding the locks inside the SimpleQueue of inbound tasks is not in a consistent state when Python finalization calls __del__. |
Check out also https://bugs.python.org/issue38744 |
I've applied pull request patch (which looks a bit weird to me, removes something then adds the same thing back etc https://patch-diff.githubusercontent.com/raw/python/cpython/pull/19009.patch). Essentially the change is: diff --git a/Lib/multiprocessing/pool.py b/Lib/multiprocessing/pool.py - worker_handler._state = TERMINATE
+ # Explicitly do the cleanup here if it didn't come from terminate()
+ # (required for if the queue will block, if it is already closed)
+ if worker_handler._state != TERMINATE:
+ # Notify that the worker_handler state has been changed so the
+ # _handle_workers loop can be unblocked (and exited) in order to
+ # send the finalization sentinel all the workers.
+ worker_handler._state = TERMINATE
+ change_notifier.put(None)
+
task_handler._state = TERMINATE util.debug('helping task handler/workers to finish') Unfortunately test case from first message in this bug report still hangs for me with this applied. |
(note, testing on 3.8 branch + pull request patch) |
Is explained in the message in https://bugs.python.org/msg364211: What happens is that is moving that code so is executed in both code paths: explicit termination and multiprocessing finalization.
Hummm, the test case from this bug is precisely the test case in PR 19009 so it should not hang. |
Oh, actually I am wrong as this is a different issue I had in mind. PR 19009 should fix the case reported in https://bugs.python.org/issue38744 and this equivalent: from multiprocessing import Pool
class A:
def __init__(self):
self.pool = Pool(processes=1)
def test():
problem = A()
problem.pool.map(float, tuple(range(10)))
if __name__ == "__main__":
test() Notice that in this code there is no __del__ call. |
I tried and I cannot reproduce the hang with the code provided in the first message here after PR 19009. (Tested on Linux, Windows and MacOS) |
Ok, so two test cases from multiprocessing.pool import ThreadPool
class A(object):
def __init__(self):
self.pool = ThreadPool()
def __del__(self):
self.pool.close()
self.pool.join()
a = A()
print(a) from multiprocessing import Pool
class A:
def __init__(self):
self.pool = Pool(processes=1)
def test():
problem = A()
problem.pool.map(float, tuple(range(10)))
if __name__ == "__main__":
test() On two my test environments with Linux, 5.5.2 and 4.9.208 kernel, glibc 2.31 (userspace is mostly the same on both) test 1) hangs, 2) doesn't hang. Testing done on 3.8 branch + https://patch-diff.githubusercontent.com/raw/python/cpython/pull/19023.patch For test 1) no traceback on ctrl+c and also no trace logged with faulthandler installed, so no idea where it hangt. strace shows |
Sadly, I am unable to reproduce, so without more clarification or a reliable reproducer, we cannot start debugging the issue.
Can you attach with gdb and tell the C trace and the python trace (check the gdb helpers on the CPython repo for the later). |
http://ixion.pld-linux.org/~arekm/python-gdb1.txt it's a result of thread apply all bt, thread apply all bt full, where pystack command shows nothing. ps. I'm arekm@freenode and can give ssh account for playing with the issue |
Can you send me an email to pablogsal@gmail.com to set up ssh access? |
I still cannot reproduce it in your system: [test@ixion-pld cpython]$ hostname [test@ixion-pld cpython]$ cat lel.py real 0m0.124s |
I tested it on the master branch with commit 5b66ec1 |
master works fine so there is some other/related issue. |
On master test 1) hangs before commit below and works after commit below. Unfortunately applying that commit to 3.8 branch doesn't help - 3.8 still hangs. Some other fix is also needed I guess commit 9ad58ac
|
This one is also needed on 3.8 to get it not hang with 1) test case. So 3.8 branch + 9ad58ac + 4d96b46 is working for me. commit 4d96b46
|
Oh, sorry, I missed your last message. |
Let me look at commit 9ad58ac: Calling _PyThreadState_DeleteExcept() in Py_FinalizeEx() is really dangerous. It frees PyThreadState memory of daemon threads. Daemon threads continue to run while Py_FinalizeEx() is running (which takes an unknown amount of time, we only know that it's larger than 0 seconds). When a daemon thread attempts to acquire the GIL, it will likely crash if its PyThreadState memory is freed. This memory can be overriden by another memory allocation, or dereferencing the pointer can trigger a segmentation fault. This change caused multiple regressions in the master branch. I had hard time to fix all crashes: I modified take_gil() 4 times, and I'm still not sure that my fix is correct. I had to modify take_gil() function which acquire the GIL: this function is really fragile and I would prefer to not touch it in a stable branch. See bpo-39877 changes to have an idea of the complexity of the problem. Python finalization is really fragile: https://pythondev.readthedocs.io/finalization.html |
commit 4d96b46 "(...) The release_sentinel() function is now called when the C API is still This change is "safer" than the other one, but there is still a risk of introducing a regression. The problem is that this commit is part of a larger effort to make the Python finalization more reliable. The commit makes sense in the serie of commits pushed into the master branch, but I'm not sure that it makes sense if it's taken alone. I don't have the bandwidth right now to investigate this issue, to identify the root issue, and suggest which commits should be backported or not. |
Ok, so sadly it seems that we need to close this issue as is fixed in master but the backport is risky. |
If you want to get a reliable behavior, don't rely on destructors. Python finalization is not determistic and destructors can be called while Python is not longer fully functional. Release ressources explicitly. For example, use multiprocessing.Pool with a context manager, or even call close() and join() methods explicitly. |
So only this test (from first commit) should be added on master to check for further regressions in the area, right? |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: