Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

multiprocessing.Pool and ThreadPool leak resources after being deleted #78353

Closed
tzickel mannequin opened this issue Jul 20, 2018 · 52 comments
Closed

multiprocessing.Pool and ThreadPool leak resources after being deleted #78353

tzickel mannequin opened this issue Jul 20, 2018 · 52 comments
Labels
3.8 (EOL) end of life docs Documentation in the Doc dir stdlib Python modules in the Lib dir topic-multiprocessing type-bug An unexpected behavior, bug, or error

Comments

@tzickel
Copy link
Mannequin

tzickel mannequin commented Jul 20, 2018

BPO 34172
Nosy @pitrou, @vstinner, @benjaminp, @ned-deily, @zware, @mattip, @MojoVampire, @applio, @tzickel, @pablogsal, @Windsooon
PRs
  • bpo-34172: multiprocessing.Pool leaks resources after being deleted #8450
  • [3.7] bpo-34172: multiprocessing.Pool leaks resources after being deleted (GH-8450) #9676
  • [3.6] bpo-34172: multiprocessing.Pool leaks resources after being deleted (GH-8450) #9677
  • [2.7] bpo-34172: multiprocessing.Pool leaks resources after being deleted (GH-8450) #9686
  • [3.7] Revert "bpo-34172: multiprocessing.Pool leaks resources after being deleted (GH-8450) (GH-9676)" #10968
  • [3.6] Revert "bpo-34172: multiprocessing.Pool leaks resources after being deleted (GH-8450) (GH-9677)" #10969
  • [2.7] Revert "[2.7] bpo-34172: multiprocessing.Pool leaks resources after being deleted (GH-9686)" #10970
  • Revert "bpo-34172: multiprocessing.Pool leaks resources after being deleted (GH-8450)" #10971
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2018-07-20.16:45:29.720>
    labels = ['3.8', 'type-bug', 'library', 'docs']
    title = 'multiprocessing.Pool and ThreadPool leak resources after being deleted'
    updated_at = <Date 2019-10-23.18:45:00.937>
    user = 'https://github.com/tzickel'

    bugs.python.org fields:

    activity = <Date 2019-10-23.18:45:00.937>
    actor = 'josh.r'
    assignee = 'docs@python'
    closed = False
    closed_date = None
    closer = None
    components = ['Documentation', 'Library (Lib)']
    creation = <Date 2018-07-20.16:45:29.720>
    creator = 'tzickel'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 34172
    keywords = ['patch']
    message_count = 52.0
    messages = ['322028', '322076', '322090', '322129', '322147', '322150', '322172', '322190', '322194', '322230', '322231', '323103', '323104', '326904', '326909', '326913', '326914', '326915', '326948', '326964', '330864', '330865', '330866', '330868', '330869', '330891', '330954', '330955', '330960', '330962', '330988', '330996', '330997', '330999', '331009', '331026', '331087', '331190', '331198', '331199', '331200', '331210', '331216', '331218', '331221', '331629', '331703', '332294', '335357', '355229', '355247', '355248']
    nosy_count = 13.0
    nosy_names = ['pitrou', 'vstinner', 'benjamin.peterson', 'ned.deily', 'docs@python', 'zach.ware', 'mattip', 'josh.r', 'davin', 'tzickel', 'pablogsal', 'Windson Yang', 'Vy Nguyen']
    pr_nums = ['8450', '9676', '9677', '9686', '10968', '10969', '10970', '10971']
    priority = None
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue34172'
    versions = ['Python 3.8']

    @tzickel
    Copy link
    Mannequin Author

    tzickel mannequin commented Jul 20, 2018

    In multiprocessing.Pool documentation it's written "When the pool object is garbage collected terminate() will be called immediately.":

    https://docs.python.org/3.7/library/multiprocessing.html#multiprocessing.pool.Pool.terminate

    A. This does not happen, creating a Pool, deleting it and collecting the garbage, does not call terminate.

    B. The documentation for Pool itself does not specify it has a context manager (but the examples show it).

    C. This bug is both in Python 3 & 2.

    @tzickel tzickel mannequin added 3.7 (EOL) end of life stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Jul 20, 2018
    @Windsooon
    Copy link
    Mannequin

    Windsooon mannequin commented Jul 21, 2018

    A. This does not happen, creating a Pool, deleting it and collecting the garbage, does not call terminate.

    Would you give me an example how you delete the Pool and collecting the garbage? If you use context manager, It will call terminate() function.

    B. The documentation for Pool itself does not specify it has a context manager (but the examples show it).

    You can find this info on the same page:

    New in version 3.3: Pool objects now support the context management protocol – see Context Manager Types. __enter__() returns the pool object, and __exit__() calls terminate().

    @tzickel
    Copy link
    Mannequin Author

    tzickel mannequin commented Jul 21, 2018

    >>> from multiprocessing import Pool
    >>> import gc
    >>> a = Pool(10)
    >>> del a
    >>> gc.collect()
    0
    >>>

    After this, there are still left behind Process (Pool) or Dummy (ThreadPool) and big _cache data (If you did something with it) which lingers till the process dies.

    You are correct on the other issue (I'm using and reading the Python 2 documentation which does not have that...).

    @Windsooon
    Copy link
    Mannequin

    Windsooon mannequin commented Jul 22, 2018

    A patch would just add

        def __del__(self):
            self.terminate()

    in the Pool object.

    @tzickel
    Copy link
    Mannequin Author

    tzickel mannequin commented Jul 22, 2018

    But alas that does not work...

    @Windsooon
    Copy link
    Mannequin

    Windsooon mannequin commented Jul 22, 2018

    Add a __del__ method in the Pool class should work. But I'm not sure we should do this.

    @mattip
    Copy link
    Contributor

    mattip commented Jul 23, 2018

    It would be sufficient to modify the documentation to reflect the code.

    There are other objects like file that recommend[0] calling a method to release resources without depending on implementation-specific details like garbage collection.

    [0] https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files

    @pitrou
    Copy link
    Member

    pitrou commented Jul 23, 2018

    Indeed, I think this simply needs a documentation fix.

    @pitrou pitrou added docs Documentation in the Doc dir 3.8 (EOL) end of life labels Jul 23, 2018
    @tzickel
    Copy link
    Mannequin Author

    tzickel mannequin commented Jul 23, 2018

    What other object in the standard lib, leaks resources when deleted in CPython ? Even that documentation says the garbage collector will eventually destroy it, just like here... I think there is an implementation bug.

    @tzickel
    Copy link
    Mannequin Author

    tzickel mannequin commented Jul 23, 2018

    I think I've found the code bug causing the leak:

    args=(self, )

    There is a circular reference between the Pool object, and the self._worker_handler Thread object (and it's also saved in the frame locals for the thread object, which prevents it from being garbage collected).

    @pitrou
    Copy link
    Member

    pitrou commented Jul 23, 2018

    What other object in the standard lib, leaks resources when deleted in CPython ?

    Threads come to mind, for example:

    >>> import time, threading, weakref
    >>> t = threading.Thread(target=time.sleep, args=(100000,))
    >>> t.start()
    >>> wr = weakref.ref(t)
    >>> del t
    >>> wr()
    <Thread(Thread-1, started 139937234327296)>

    Note I'm not against fixing this issue, just saying it's not that surprising for Pool to keep lingering around when you lost any user-visible reference to it.

    @tzickel
    Copy link
    Mannequin Author

    tzickel mannequin commented Aug 4, 2018

    It actually makes tons of sense that while the thread is running, that the object representing it is alive. After the thread finishes its work, the object dies.

    >>> import time, threading, weakref, gc
    >>> t = threading.Thread(target=time.sleep, args=(10,))
    >>> wr = weakref.ref(t)
    >>> t.start()
    >>> del t
    >>> gc.collect()
    >>> wr()
    <Thread(Thread-1, started 139937234327296)>
    Wait 10 seconds...
    >>> gc.collect()
    >>> wr()

    The thread is gone (which doesn't happen with the pool).

    Anyhow, I've submitted a patch to fix the bug that was introduced 9 years ago on GH, feel free to check it.

    @pitrou
    Copy link
    Member

    pitrou commented Aug 4, 2018

    Thanks a lot tzickle, I'll take a look.

    @pitrou
    Copy link
    Member

    pitrou commented Oct 2, 2018

    New changeset 97bfe8d by Antoine Pitrou (tzickel) in branch 'master':
    bpo-34172: multiprocessing.Pool leaks resources after being deleted (GH-8450)
    97bfe8d

    @pitrou
    Copy link
    Member

    pitrou commented Oct 2, 2018

    New changeset 97f998a by Antoine Pitrou (Miss Islington (bot)) in branch '3.7':
    bpo-34172: multiprocessing.Pool leaks resources after being deleted (GH-8450) (GH-9676)
    97f998a

    @pitrou
    Copy link
    Member

    pitrou commented Oct 2, 2018

    New changeset 07b96a9 by Antoine Pitrou (Miss Islington (bot)) in branch '3.6':
    bpo-34172: multiprocessing.Pool leaks resources after being deleted (GH-8450) (GH-9677)
    07b96a9

    @pitrou
    Copy link
    Member

    pitrou commented Oct 2, 2018

    Thanks tzickler for the report and pull request, and sorry for the delay.

    This is now fixed in all 3.x branches. I will close this now as multiprocessing in 2.7 diverges quite a bit from 3.x. If you want to fix the issue in 2.7 as well, please say so and I'll reopen.

    @pitrou pitrou closed this as completed Oct 2, 2018
    @pitrou
    Copy link
    Member

    pitrou commented Oct 2, 2018

    (tzickel, sorry for mistyping your handle :-/)

    @tzickel
    Copy link
    Mannequin Author

    tzickel mannequin commented Oct 3, 2018

    Its ok, you only did it twice :) I've submitted a manual 2.7 fix on GH.

    @pitrou pitrou reopened this Oct 3, 2018
    @pitrou
    Copy link
    Member

    pitrou commented Oct 3, 2018

    New changeset 4a7dd30 by Antoine Pitrou (tzickel) in branch '2.7':
    [2.7] bpo-34172: multiprocessing.Pool leaks resources after being deleted (GH-9686)
    4a7dd30

    @pitrou pitrou closed this as completed Oct 3, 2018
    @pablogsal
    Copy link
    Member

    multiprocessing.Pool.imap hangs in MacOs after applying this commit:

    import multiprocessing
    
    def the_test():
        print("Begin")
        for x in multiprocessing.Pool().imap(int,
                ["4", "3"]):
            print(x)
        print("End")
    
    the_test()

    This also happens in the backported branches.

    @tzickel
    Copy link
    Mannequin Author

    tzickel mannequin commented Dec 3, 2018

    The previous posts here touch all this subjects:
    A. The documentation explicitly says: "When the pool object is garbage collected terminate() will be called immediately." (Happened till a code refactor 9 years ago introduced this bug).

    B. Large amount of code was developed for this technique:
    https://github.com/python/cpython/blob/master/Lib/multiprocessing/util.py#L147 (Till the end of the file almost)

    C. The reason I opened this bug was because I was called to see why a long running process crashes after a while, and found out it leaked tons of subprocesses / pool._cache memory.

    D. The quoted code, will currently simply leak each invocation lots of subprocesses...

    I too, think we should push for the said fix.

    @vstinner
    Copy link
    Member

    vstinner commented Dec 3, 2018

    tzickel:

    A. The documentation explicitly says: "When the pool object is garbage collected terminate() will be called immediately." (Happened till a code refactor 9 years ago introduced this bug).

    It is a *very bad* practice to rely on __del__. Don't do that. That's why we introduced ResourceWarning.

    tzickel:

    https://github.com/python/cpython/blob/master/Lib/multiprocessing/util.py#L147 (Till the end of the file almost)

    Is this API *incompatible* with pool.close()? Explicitly release resources?

    Pablo:

    Víctor, I have a PR fixing this in: (...) The fix is really simple, so I think we should fix it and not revert the change on this case.

    I'm not comfortable with the fix. I cannot explain why but I feel like adding a strong dependency from a child to its parent is going to lead to more bugs, not less. It sounds like a recipe for reference cycles. Maybe I'm just plain wrong.

    At this point, I would like that someone explains me what the problem is. #10852 is a solution, ok, but what is the problem? What does the code hangs, whereas previously it was fine? Is the example code really correct? Do we want to support such usage?

    I understand that msg330864 rely on black magic to expect that it's going to be fine. The lifetime of the pool is implicit and it sounds like a bad design. I don't want to endorse that.

    @pablogsal
    Copy link
    Member

    I'm not comfortable with the fix. I cannot explain why but I feel like adding a strong dependency from a child to its parent is going to lead to more bugs, not less. It sounds like a recipe for reference cycles. Maybe I'm just plain wrong.

    The pool child objects (imap iterators, async results...etc) need to keep a reference to the parent because if not, the caller is in charge of doing that and that may lead to bugs. Is the same scenario as if I get a dictionary iterator and then I destroy my reference to the dictionary: if the iterator does not keep a reference to the parent (the dictionary) it will not be possible to be used in the future. Indeed, we can see that this is what happens:

    I'm not comfortable with the fix. I cannot explain why but I feel like adding a strong dependency from a child to its parent is going to lead to more bugs, not less. It sounds like a recipe for reference cycles. Maybe I'm just plain wrong.

    >>> x = {1:2}
    >>> y = iter(x)
    >>> gc.get_referrers(x)
    [<dict_keyiterator object at 0x0000024447D6D598>, 
    {'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <class '_frozen_importlib.BuiltinImporter'>, '__spec__': None, '__annotations__': {}, '__builtins__': <module 'builtins' (built-in)>, 'y': <dict_keyiterator object at 0x0000024447D6D598>, 'gc': <module 'gc' (built-in)>, 'x': {1: 2}}
    ]

    We can see that the dict_keyiterator refers to the dictionary, keeping it alive.

    Here we have the same situation: if we do not keep the pool alive, the iterator will hang when iterating because the jobs won't get finished.

    At this point, I would like that someone explains to me what the problem is. #10852 is a solution, ok, but what is the problem? What does the code hangs, whereas >previously it was fine? Is the example code really correct? Do we want to support such usage?

    The code hangs because the pool was not being destroyed before due to the reference cycle between the pool and an internal object (a Thread). Now it hangs because the worker thread is destroyed with the pool as no references are kept to it and the job that the iterator is waiting on is never finished.

    I understand that msg330864 rely on black magic to expect that it's going to be fine. The lifetime of the pool is implicit and it sounds like a bad design. I don't want to endorse that.

    I found the weird code in the example in several projects. I have corrected it to use the pool as a context manager or to call close(), but this means that users are doing this and it used to work and not it does not: technically is a regression.

    @vstinner
    Copy link
    Member

    vstinner commented Dec 3, 2018

    I found the weird code in the example in several projects. I have corrected it to use the pool as a context manager or to call close(), but this means that users are doing this and it used to work and not it does not: technically is a regression.

    That's why I'm asking for a revert :-) I suggest to revert this change immediately from 2.7, 3.6 and 3.7, and later see what can be done for master.

    Even if we design carefully an API, there will be always someone to misuse it :-) I would prefer to stop promoting such bad code and find a transition to more correct code.

    I disagree that a child should keep its parent alive.

    I would be ok to use a *weak reference* from the child to the parent to detect when the parent goes away, and maybe trigger an action in that case. For example, raise an exception or log a warning.

    @pablogsal
    Copy link
    Member

    I disagree that a child should keep its parent alive.

    But this is normal across the standard library. For example, here is how a deque iterator keeps the deque alive:

    >>> x = deque([1,2,3])
    >>> deque_iter = iter(x)
    >>> deque_weakref = weakref.ref(x)
    >>> del x
    >>> gc.collect()
    >>> gc.get_referrers(deque_weakref())
    [<_collections._deque_iterator object at 0x0000024447ED6EA8>]
    
    Here, the deque iterator is the *only* reference to the deque. When we destroy it, the deque is destroyed:
    >>> del deque_iter
    >>> gc.collect()
    >>> deque_weakref()
    None

    @tzickel
    Copy link
    Mannequin Author

    tzickel mannequin commented Dec 4, 2018

    Reverting the code will cause another class of problems, like the reason I fixed it. Programs written such as the example that Pablo gave (and what I've seen) will quietly leak child processes, file descriptors (for the pipes) and memory to a variety degree might not be detected, or in the end detected in a big error or crash.

    Also in some ResourceWarnings (if not all), the resources are closed in the end (like in sockets), here without this code patch you cannot implicitly reclaim the resources (because there is a Thread involved here), which I think is a high bar for the user to think about.

    You can also enable multiprocessing's debug logging to see how the code behaves with and without the fix:
    https://stackoverflow.com/a/1353037

    I also agree with Pablo that there is code in the stdlib that holdes reference between child and parent. There is also code that has circular reference (for example Python 2's OrderedDict) and that is ok as well (not that this is the situation here).

    @pablogsal
    Copy link
    Member

    that there is code in the stdlib that holdes reference between child and parent

    Just to clarify: is not that is just code in the stdlib that keeps a reference between child and parent. The examples I have given are the exact same situation that we have here: the iterator object associated with another needs to keep its parent alive to work correctly.

    @vstinner
    Copy link
    Member

    vstinner commented Dec 5, 2018

    multiprocessing should help the developer to detect when the API is misused. For example, emit a ResourceWarning if a pool is not released explicitly. It might help to detect such bugs:

    @vstinner
    Copy link
    Member

    vstinner commented Dec 5, 2018

    Another example of complex issue related to object lifetime, resources (file descriptors) and multiprocessing: bpo-30966, add SimpleQueue.close().

    @vstinner
    Copy link
    Member

    vstinner commented Dec 6, 2018

    New changeset 3c6b0d9 by Victor Stinner in branch '3.7':
    [3.7] Revert "bpo-34172: multiprocessing.Pool leaks resources after being deleted (GH-8450) (GH-9676)" (bpo-10968)
    3c6b0d9

    @vstinner
    Copy link
    Member

    vstinner commented Dec 6, 2018

    New changeset eb38ee0 by Victor Stinner in branch '3.6':
    [3.6] Revert "bpo-34172: multiprocessing.Pool leaks resources after being deleted (GH-8450) (GH-9677)" (GH-10969)
    eb38ee0

    @vstinner
    Copy link
    Member

    vstinner commented Dec 6, 2018

    New changeset 358fc87 by Victor Stinner in branch '2.7':
    Revert "[2.7] bpo-34172: multiprocessing.Pool leaks resources after being deleted (GH-9686)" (GH-10970)
    358fc87

    @vstinner
    Copy link
    Member

    vstinner commented Dec 6, 2018

    New changeset 9dfc754 by Victor Stinner in branch 'master':
    Revert "bpo-34172: multiprocessing.Pool leaks resources after being deleted (GH-8450)" (GH-10971)
    9dfc754

    @vstinner
    Copy link
    Member

    vstinner commented Dec 6, 2018

    I reverted the change in 2.7, 3.6, 3.7 and master branches because it introduces a regression and we are very close to a release:
    https://mail.python.org/pipermail/python-dev/2018-December/155920.html

    I don't want to have the pressure to push a quick fix. I would like to make sure that we have enough time to design a proper fix. I'm not saying that Pablo's fix is not correct, it's just bad timing.

    This bug is likely here for a long time, so I think that it's ok to still have it in the next 3.6 and 3.7 bugfix releases.

    I suggest to open a discussion on the python-dev mailing list about multiprocessing relying on the garbage collector and lifetime of multiprocessing objects (Pool, Process, result, etc.). It seems like I disagree with Pablo and tzickel, whereas Armin Rigo (PyPy which has a different GC) is more on my side (release explicitly resources) :-) I would prefer to move towards explicit resource managment instead of relying on destructors and the garbage collector. For example, it's a bad practice to rely on these when using PyPy.

    See my previous comments about issues related to multiprocessing objects lifetime.

    @pitrou
    Copy link
    Member

    pitrou commented Dec 6, 2018

    I agree that reverting in bugfix branches was the right thing to do. I think the fix should have remained in master, though.

    @pitrou pitrou removed the 3.7 (EOL) end of life label Dec 6, 2018
    @vstinner
    Copy link
    Member

    vstinner commented Dec 6, 2018

    See also bpo-35424: "multiprocessing.Pool: emit ResourceWarning".

    I wrote 10986 to fix 2 tests which leak resources.

    I have a question. Why do tests have to call "pool.join()" after "with pool:"? When I use a file, I know that the resources are released after "with file:".

    Should Pool.__exit__() call Pool.join()?

    This question reminds me my fix in socketserver (bpo-31151 and bpo-31233) which leaked processes and threads, and my bug bpo-34037 (asyncio leaks threads).

    @vstinner
    Copy link
    Member

    I started a thread on python-dev to discuss these issues:
    https://mail.python.org/pipermail/python-dev/2018-December/155946.html

    @vstinner
    Copy link
    Member

    The new test_del_pool() test of the fix failed on a buildbot: bpo-35413 "test_multiprocessing_fork: test_del_pool() leaks dangling threads and processes on AMD64 FreeBSD CURRENT Shared 3.x".

    @vstinner
    Copy link
    Member

    multiprocessing.Pool destructor now emits a ResourceWarning if it is still running: if .close() nor .terminate() have been called, see bpo- 35424.

    It is a first alarm that the problematic example is wrong.

    Should reconsider to fix this bug in the master branch? If yes, we should carefully document this backward incompatible change.

    @vstinner
    Copy link
    Member

    Pablo fixed bpo-35378 with:

    New changeset 3766f18 by Pablo Galindo in branch 'master':
    bpo-35378: Fix multiprocessing.Pool references (GH-11627)
    3766f18

    Does this change also fix this issue? If not, can we attempt again to fix this issue?

    Moreover, should we do something in Python 3.7? Sadly, I don't think that we can do anything for 3.7 and 2.7.

    @vstinner
    Copy link
    Member

    What's the status of this issue?

    @MojoVampire
    Copy link
    Mannequin

    MojoVampire mannequin commented Oct 23, 2019

    Pablo's fix looks like a superset of the original fix applied here, so I'm assuming it fixes this issue as well.

    @MojoVampire
    Copy link
    Mannequin

    MojoVampire mannequin commented Oct 23, 2019

    It should probably be backport to all supported 3.x branches though, so people aren't required to move to 3.8 to benefit from it.

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.8 (EOL) end of life docs Documentation in the Doc dir stdlib Python modules in the Lib dir topic-multiprocessing type-bug An unexpected behavior, bug, or error
    Projects
    Development

    No branches or pull requests

    5 participants