multiprocessing.Pool and ThreadPool leak resources after being deleted #78353
In the multiprocessing.Pool documentation it is written: "When the pool object is garbage collected terminate() will be called immediately." https://docs.python.org/3.7/library/multiprocessing.html#multiprocessing.pool.Pool.terminate
A. This does not happen: creating a Pool, deleting it and collecting the garbage does not call terminate().
B. The documentation for Pool itself does not specify that it has a context manager (although the examples use it).
C. This bug is present in both Python 3 and 2. |
Would you give me an example of how you delete the Pool and collect the garbage? If you use a context manager, it will call the terminate() function.
You can find this info on the same page: "New in version 3.3: Pool objects now support the context management protocol – see Context Manager Types. __enter__() returns the pool object, and __exit__() calls terminate()." |
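For illustration, a minimal sketch of that documented usage (exiting the with block calls terminate()):

from multiprocessing import Pool

if __name__ == "__main__":
    with Pool(4) as pool:            # __enter__() returns the pool
        print(pool.map(int, ["1", "2", "3"]))
    # __exit__() has called terminate(): the worker processes are gone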
>>> from multiprocessing import Pool
>>> import gc
>>> a = Pool(10)
>>> del a
>>> gc.collect()
0
After this, there are still left behind Process (Pool) or Dummy (ThreadPool) objects, and big _cache data (if you did something with it) that lingers until the process dies. You are correct on the other issue (I was using and reading the Python 2 documentation, which does not have that...). |
A patch would just add

def __del__(self):
    self.terminate()

to the Pool object. |
But alas that does not work... |
Adding a __del__ method to the Pool class should work, but I'm not sure we should do this. |
It would be sufficient to modify the documentation to reflect the code. There are other objects that behave like this, for example file objects: [0] https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files |
Indeed, I think this simply needs a documentation fix. |
What other object in the standard lib leaks resources when deleted in CPython? Even that documentation says the garbage collector will eventually destroy it, just like here... I think there is an implementation bug. |
I think I've found the code bug causing the leak: cpython/Lib/multiprocessing/pool.py, line 180 (commit caa331d).
There is a circular reference between the Pool object, and the self._worker_handler Thread object (and it's also saved in the frame locals for the thread object, which prevents it from being garbage collected). |
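The pattern is easy to reproduce outside multiprocessing; a minimal sketch (the Owner class is hypothetical, standing in for Pool and its worker handler thread):

import gc
import threading
import time
import weakref

class Owner:
    def __init__(self):
        # Like Pool's worker handler: the thread's target is a bound method,
        # so the running thread (via its frame locals/args) holds a strong
        # reference back to self, forming the cycle described above.
        self._handler = threading.Thread(target=self._work, daemon=True)
        self._handler.start()

    def _work(self):
        time.sleep(100000)  # stands in for the handler loop

o = Owner()
wr = weakref.ref(o)
del o
gc.collect()
print(wr())  # still alive: the running thread keeps the object reachable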
Threads come to mind, for example:

>>> import time, threading, weakref
>>> t = threading.Thread(target=time.sleep, args=(100000,))
>>> t.start()
>>> wr = weakref.ref(t)
>>> del t
>>> wr()
<Thread(Thread-1, started 139937234327296)>

Note I'm not against fixing this issue, just saying it's not that surprising for Pool to keep lingering around when you lost any user-visible reference to it. |
It actually makes tons of sense that while the thread is running, the object representing it is alive. After the thread finishes its work, the object dies.

>>> import time, threading, weakref, gc
>>> t = threading.Thread(target=time.sleep, args=(10,))
>>> wr = weakref.ref(t)
>>> t.start()
>>> del t
>>> gc.collect()
>>> wr()
<Thread(Thread-1, started 139937234327296)>
Wait 10 seconds...
>>> gc.collect()
>>> wr()

The thread is gone (which doesn't happen with the pool). Anyhow, I've submitted a patch on GH to fix the bug that was introduced 9 years ago; feel free to check it. |
Thanks a lot tzickle, I'll take a look. |
Thanks tzickler for the report and pull request, and sorry for the delay. This is now fixed in all 3.x branches. I will close this now as multiprocessing in 2.7 diverges quite a bit from 3.x. If you want to fix the issue in 2.7 as well, please say so and I'll reopen. |
(tzickel, sorry for mistyping your handle :-/) |
It's ok, you only did it twice :) I've submitted a manual 2.7 fix on GH. |
multiprocessing.Pool.imap hangs on macOS after applying this commit:

import multiprocessing

def the_test():
    print("Begin")
    for x in multiprocessing.Pool().imap(int, ["4", "3"]):
        print(x)
    print("End")

the_test()

This also happens in the backported branches. |
The previous posts here touch on all of these subjects:
B. A large amount of code was developed for this technique.
C. The reason I opened this bug was because I was called to see why a long-running process crashes after a while, and found out it leaked tons of subprocesses / pool._cache memory.
D. The quoted code will currently leak lots of subprocesses on each invocation...
I too think we should push for the said fix. |
tzickel:

It is a *very bad* practice to rely on __del__. Don't do that. That's why we introduced ResourceWarning.

tzickel:

Is this API *incompatible* with pool.close()? Explicitly release resources?

Pablo:

I'm not comfortable with the fix. I cannot explain why, but I feel like adding a strong dependency from a child to its parent is going to lead to more bugs, not less. It sounds like a recipe for reference cycles. Maybe I'm just plain wrong.

At this point, I would like someone to explain to me what the problem is. #10852 is a solution, ok, but what is the problem? Why does the code hang, whereas previously it was fine? Is the example code really correct? Do we want to support such usage? I understand that msg330864 relies on black magic to expect that it's going to be fine. The lifetime of the pool is implicit, and it sounds like a bad design. I don't want to endorse that. |
The pool child objects (imap iterators, async results, etc.) need to keep a reference to the parent because if they do not, the caller is in charge of doing that, and that may lead to bugs. It is the same scenario as if I get a dictionary iterator and then destroy my reference to the dictionary: if the iterator does not keep a reference to the parent (the dictionary), it will not be possible to use it in the future. Indeed, we can see that this is what happens:

>>> x = {1:2}
>>> y = iter(x)
>>> gc.get_referrers(x)
[<dict_keyiterator object at 0x0000024447D6D598>,
{'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <class '_frozen_importlib.BuiltinImporter'>, '__spec__': None, '__annotations__': {}, '__builtins__': <module 'builtins' (built-in)>, 'y': <dict_keyiterator object at 0x0000024447D6D598>, 'gc': <module 'gc' (built-in)>, 'x': {1: 2}}
]

We can see that the dict_keyiterator refers to the dictionary, keeping it alive. Here we have the same situation: if we do not keep the pool alive, the iterator will hang when iterating because the jobs won't get finished.
The code hangs because the pool was not being destroyed before due to the reference cycle between the pool and an internal object (a Thread). Now it hangs because the worker thread is destroyed with the pool as no references are kept to it and the job that the iterator is waiting on is never finished.
I found the weird code in the example in several projects. I have corrected it to use the pool as a context manager or to call close(), but this means that users are doing this, and it used to work and now it does not: technically this is a regression. |
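For reference, a sketch of the corrected pattern mentioned above, using the pool as a context manager so its lifetime is explicit:

import multiprocessing

def the_test():
    print("Begin")
    with multiprocessing.Pool() as pool:  # lifetime is now explicit
        for x in pool.imap(int, ["4", "3"]):
            print(x)
    print("End")

if __name__ == "__main__":
    the_test()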
That's why I'm asking for a revert :-) I suggest reverting this change immediately from 2.7, 3.6 and 3.7, and later seeing what can be done for master. Even if we design an API carefully, there will always be someone to misuse it :-) I would prefer to stop promoting such bad code and find a transition to more correct code. I disagree that a child should keep its parent alive. I would be ok with using a *weak reference* from the child to the parent to detect when the parent goes away, and maybe trigger an action in that case. For example, raise an exception or log a warning. |
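A minimal sketch of that weak-reference idea (ChildResult and its internals are hypothetical, not the actual pool API):

import weakref

class ChildResult:
    """Hypothetical pool child object, e.g. an imap iterator."""

    def __init__(self, pool):
        # Weak reference: the child no longer keeps the pool alive.
        self._pool_ref = weakref.ref(pool)

    def __next__(self):
        pool = self._pool_ref()
        if pool is None:
            # The parent pool went away: fail loudly instead of
            # hanging forever on a job that will never complete.
            raise RuntimeError("the Pool backing this result was collected")
        raise NotImplementedError  # sketch: real job-fetching logic elided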
But this is normal across the standard library. For example, here is how a deque iterator keeps the deque alive:

>>> import gc, weakref
>>> from collections import deque
>>> x = deque([1,2,3])
>>> deque_iter = iter(x)
>>> deque_weakref = weakref.ref(x)
>>> del x
>>> gc.collect()
>>> gc.get_referrers(deque_weakref())
[<_collections._deque_iterator object at 0x0000024447ED6EA8>]
Here, the deque iterator is the *only* reference to the deque. When we destroy it, the deque is destroyed:
>>> del deque_iter
>>> gc.collect()
>>> deque_weakref()
None |
Reverting the code will cause another class of problems, like the reason I fixed it. Programs written such as the example that Pablo gave (and what I've seen) will quietly leak child processes, file descriptors (for the pipes) and memory to a varying degree; this might not be detected, or in the end be detected in a big error or crash. Also, in some ResourceWarnings (if not all), the resources are closed in the end (like in sockets); here, without this code patch, you cannot implicitly reclaim the resources (because there is a Thread involved), which I think is a high bar for the user to think about. You can also enable multiprocessing's debug logging to see how the code behaves with and without the fix. I also agree with Pablo that there is code in the stdlib that holds a reference between child and parent. There is also code that has a circular reference (for example Python 2's OrderedDict), and that is ok as well (not that this is the situation here). |
Just to clarify: it is not just that there is code in the stdlib that keeps a reference between child and parent. The examples I have given are the exact same situation that we have here: the iterator object associated with another object needs to keep its parent alive to work correctly. |
Another example of complex issue related to object lifetime, resources (file descriptors) and multiprocessing: bpo-30966, add SimpleQueue.close(). |
I reverted the change in the 2.7, 3.6, 3.7 and master branches because it introduces a regression and we are very close to a release: I don't want to have the pressure to push a quick fix. I would like to make sure that we have enough time to design a proper fix. I'm not saying that Pablo's fix is not correct, it's just bad timing. This bug has likely been here for a long time, so I think that it's ok to still have it in the next 3.6 and 3.7 bugfix releases. I suggest opening a discussion on the python-dev mailing list about multiprocessing relying on the garbage collector and the lifetime of multiprocessing objects (Pool, Process, result, etc.). It seems like I disagree with Pablo and tzickel, whereas Armin Rigo (PyPy, which has a different GC) is more on my side (release resources explicitly) :-) I would prefer to move towards explicit resource management instead of relying on destructors and the garbage collector. For example, it's a bad practice to rely on these when using PyPy. See my previous comments about issues related to multiprocessing objects' lifetime. |
I agree that reverting in bugfix branches was the right thing to do. I think the fix should have remained in master, though. |
See also bpo-35424: "multiprocessing.Pool: emit ResourceWarning". I wrote 10986 to fix 2 tests which leak resources. I have a question: why do tests have to call "pool.join()" after "with pool:"? When I use a file, I know that the resources are released after "with file:". Should Pool.__exit__() call Pool.join()? This question reminds me of my fix in socketserver (bpo-31151 and bpo-31233), which leaked processes and threads, and my bug bpo-34037 (asyncio leaks threads). |
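The test pattern in question, sketched under the current semantics where Pool.__exit__() only calls terminate() and does not wait:

import multiprocessing

if __name__ == "__main__":
    pool = multiprocessing.Pool(2)
    with pool:
        pool.map(int, ["1", "2"])
    # __exit__() called terminate(), but the handler threads and worker
    # processes may still be shutting down; join() waits for them, which
    # is why the tests need it to avoid dangling threads and processes.
    pool.join()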
I started a thread on python-dev to discuss these issues: |
The new test_del_pool() test of the fix failed on a buildbot: bpo-35413 "test_multiprocessing_fork: test_del_pool() leaks dangling threads and processes on AMD64 FreeBSD CURRENT Shared 3.x". |
multiprocessing.Pool's destructor now emits a ResourceWarning if the pool is still running, i.e. if neither .close() nor .terminate() has been called: see bpo-35424. It is a first alarm that the problematic example is wrong. Should we reconsider fixing this bug in the master branch? If yes, we should carefully document this backward-incompatible change. |
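ResourceWarning is silenced by default, so the alarm is only visible if warnings are enabled; a sketch, assuming a Python version with the bpo-35424 change (leaky.py is a placeholder name):

# Run with warnings enabled to see the alarm, e.g.:
#   python -X dev leaky.py
#   python -W always::ResourceWarning leaky.py
import multiprocessing

if __name__ == "__main__":
    pool = multiprocessing.Pool(2)
    pool.map(int, ["1", "2"])
    # Neither close() nor terminate() is called, so the destructor
    # emits a ResourceWarning when the pool object is finalized.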
Pablo fixed bpo-35378 with: New changeset 3766f18 by Pablo Galindo in branch 'master': Does this change also fix this issue? If not, can we attempt again to fix this issue? Moreover, should we do something in Python 3.7? Sadly, I don't think that we can do anything for 3.7 and 2.7. |
What's the status of this issue? |
Pablo's fix looks like a superset of the original fix applied here, so I'm assuming it fixes this issue as well. |
It should probably be backported to all supported 3.x branches though, so people aren't required to move to 3.8 to benefit from it. |