Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

importlib lock race issue in deadlock handling code #91351

Closed
rpurdie mannequin opened this issue Apr 1, 2022 · 17 comments · Fixed by #94504
Closed

importlib lock race issue in deadlock handling code #91351

rpurdie mannequin opened this issue Apr 1, 2022 · 17 comments · Fixed by #94504
Labels
3.10 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-importlib

Comments

@rpurdie
Copy link
Mannequin

rpurdie mannequin commented Apr 1, 2022

BPO 47195
Nosy @rpurdie
Files
  • testit2.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2022-04-01.20:23:24.167>
    labels = ['interpreter-core', '3.10']
    title = 'importlib lock race issue in deadlock handling code'
    updated_at = <Date 2022-04-02.06:56:46.108>
    user = 'https://github.com/rpurdie'

    bugs.python.org fields:

    activity = <Date 2022-04-02.06:56:46.108>
    actor = 'rpurdie'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Interpreter Core']
    creation = <Date 2022-04-01.20:23:24.167>
    creator = 'rpurdie'
    dependencies = []
    files = ['50714']
    hgrepos = []
    issue_num = 47195
    keywords = []
    message_count = 2.0
    messages = ['416517', '416539']
    nosy_count = 1.0
    nosy_names = ['rpurdie']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue47195'
    versions = ['Python 3.10']

    Linked PRs

    @rpurdie
    Copy link
    Mannequin Author

    rpurdie mannequin commented Apr 1, 2022

    We've seen tracebacks in production like:

    File "<frozen importlib._bootstrap>", line 1004, in _find_and_load(name='oe.gpg_sign', import_=<built-in function __import__>)
    File "<frozen importlib._bootstrap>", line 158, in _ModuleLockManager.__enter__()
    File "<frozen importlib._bootstrap>", line 110, in _ModuleLock.acquire()
    KeyError: 139622474778432

    and

    File "<frozen importlib._bootstrap>", line 1004, in _find_and_load(name='oe.path', import_=<built-in function __import__>)
    File "<frozen importlib._bootstrap>", line 158, in _ModuleLockManager.__enter__()
    File "<frozen importlib._bootstrap>", line 110, in _ModuleLock.acquire()
    KeyError: 140438942700992

    I've attached a reproduction script which shows that if an import XXX is in progress and waiting at the wrong point when an interrupt arrives (in this case a signal) and triggers it's own import YYY, _blocking_on[tid] in importlib/_bootstrap.py gets overwritten and lost, triggering the traceback we see above upon exit from the second import.

    I'm using a signal handler here as the interrupt, I don't know what our production source is as yet but this reproducer proves it is possible.

    @rpurdie rpurdie mannequin added 3.10 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Apr 1, 2022
    @rpurdie
    Copy link
    Mannequin Author

    rpurdie mannequin commented Apr 2, 2022

    This is a production backtrace after I inserted code to traceback if tid was already in _blocking_on. It is being triggered by a warning about an unclosed asyncio event loop and confirms my theory about nested imports, in the production case I'd guess being triggered by gc given the __del__.

    File "/home/pokybuild/yocto-worker/oe-selftest-fedora/build/meta/classes/base.bbclass", line 26, in oe_import
    import oe.data
    File "<frozen importlib._bootstrap>", line 1024, in _find_and_load
    File "<frozen importlib._bootstrap>", line 171, in __enter__
    File "/home/pokybuild/yocto-worker/oe-selftest-fedora/build/bitbake/lib/bb/cooker.py", line 168, in acquire
    return orig_acquire(self)
    File "<frozen importlib._bootstrap>", line 110, in acquire
    File "/usr/lib64/python3.10/asyncio/base_events.py", line 685, in __del__
    _warn(f"unclosed event loop {self!r}", ResourceWarning, source=self)
    File "/usr/lib64/python3.10/warnings.py", line 112, in _showwarnmsg
    _showwarnmsg_impl(msg)
    File "/usr/lib64/python3.10/warnings.py", line 28, in _showwarnmsg_impl
    text = _formatwarnmsg(msg)
    File "/usr/lib64/python3.10/warnings.py", line 128, in _formatwarnmsg
    return _formatwarnmsg_impl(msg)
    File "/usr/lib64/python3.10/warnings.py", line 56, in _formatwarnmsg_impl
    import tracemalloc
    File "<frozen importlib._bootstrap>", line 1024, in _find_and_load
    File "<frozen importlib._bootstrap>", line 171, in __enter__
    File "/home/pokybuild/yocto-worker/oe-selftest-fedora/build/bitbake/lib/bb/cooker.py", line 167, in acquire
    bb.warn("\n".join(traceback.format_stack()))

    @exarkun
    Copy link
    Contributor

    exarkun commented Jun 22, 2022

    I also encountered this case, triggered by the resource warning emitted by an unclosed socket.socket. On Python 3.9.12 and Python 3.9.13 the following reproduces the problem on Windows and Linux (deterministically, if a call to gc.collect() is inserted into acquire immediately after _blocking_on[tid] = self in _bootstrap.py):

    import socket, gc
    
    class Cycle:
        pass
    
    def main():
        c = Cycle()
        c.cycle = c
        c.s = socket.socket()
    
        del c
        # import any module that hasn't been imported yet to trigger the bug
        import base64
    
    main()
    

    Here's the _bootstrap.py diff that makes it deterministic:

    diff --git a/Lib/importlib/_bootstrap.py b/Lib/importlib/_bootstrap.py
    index e00b27ece2..808d7bf501 100644
    --- a/Lib/importlib/_bootstrap.py
    +++ b/Lib/importlib/_bootstrap.py
    @@ -36,6 +36,23 @@ def _new_module(name):
         return type(sys)(name)
    
    
    +def collect():
    +    gc = sys.modules.get("gc")
    +    collect = getattr(gc, "collect", None)
    +    if collect is not None:
    +        collect()
    +
    +_fd = None
    +def debug(s):
    +    global _fd
    +    os = sys.modules.get("os")
    +    if _fd is None:
    +        if getattr(os, "open", None) is not None:
    +            _fd = os.open("C:/bootstrap.log", os.O_CREAT|os.O_APPEND|os.O_WRONLY)
    +    if _fd is not None:
    +        if getattr(os, "write", None) is not None:
    +            os.write(_fd, (s + "\n").encode("utf-8"))
    +    
     # Module-level locking ########################################################
    
     # A dict mapping module names to weakrefs of _ModuleLock instances
    @@ -91,7 +108,12 @@ class _ModuleLock:
             Otherwise, the lock is always acquired and True is returned.
             """
             tid = _thread.get_ident()
    +        
    +        # debug(f"Hello {self.name} {len(_blocking_on)} {tid}") # >>> {self.name} {tid} {_blocking_on} {self}")
    +        debug(f"import {self.name}")
             _blocking_on[tid] = self
    +        if self.name == "base64":
    +            collect()
             try:
                 while True:
                     with self.lock:
    @@ -107,6 +129,7 @@ class _ModuleLock:
                     self.wakeup.acquire()
                     self.wakeup.release()
             finally:
    +            # debug(f"Goodbye") # <<< {self.name} {tid} {_blocking_on} {self}")
                 del _blocking_on[tid]
    
         def release(self):
    

    This basically shows the same thing as the original reproducer attached above but with a patch instead of a monkey-patch and no sleep/signals.

    Seeing how these reproducers work, the cause seems pretty straightforward to me.

    1. _ModuleLock.acquire puts the main thread into _blocking_on.
    2. The garbage collector runs and finds something like an asyncio or a socket
    3. The resource's __del__ calls warnings.warn
    4. warnings.warn leads to something being imported - linecache or tracemalloc
    5. _ModuleLock.acquire is re-entered and puts the main thread into _blocking_on again
    6. _ModuleLock.acquire is left and removes the main thread from _blocking_on (which is now empty)
    7. warnings.warn, __del__ and the garbage collector all unwinds
    8. _ModuleLock.acquire is left again but on its way out it tries to remove the main thread from _blocking_on again. However, it has already been removed and we get the KeyError.

    I haven't fully understood what this part of _bootstrap.py is trying to accomplish yet so I'm not going to try to suggest what the fix might be yet.

    @AA-Turner
    Copy link
    Member

    @exarkun opened a PR against 3.9 which is security only -- I've closed though a PR against HEAD would be welcome.

    cc also @brettcannon as importlib expert.

    A

    @brettcannon
    Copy link
    Member

    If the failure is due to KeyError on _blocking_on when trying to delete a key from the dict, then we could catch the KeyError, send out an ImportWarning that stuff got weird, and then simply move on.

    My guess is that wouldn't actually break anything as there isn't something to unlock, so it's really just stale caching at that point instead of a semantic failure.

    @exarkun
    Copy link
    Contributor

    exarkun commented Jun 27, 2022

    Hi @brettcannon . Did you have a chance to look at my PR?

    @brettcannon
    Copy link
    Member

    @exarkun which PR? The only one I see referenced in this issue was closed for being targeted at the wrong branch. Do you have a PR for main to look at?

    @exarkun
    Copy link
    Contributor

    exarkun commented Jun 28, 2022

    Hi @brettcannon. I meant the PR that was closed for being targeted at the wrong branch, yes. I am happy to fix that process mistake if the code in the branch is useful. If it is not going to be accepted then I'd prefer not to sink more effort into it.

    I've looked at the recent history of importlib/_bootstrap.py and I don't think there are any changes that make a difference to this bug or to the changes in the PR so I think the current branch is quite representative of what the changes will be after they are retargeted at main.

    Thanks again for your attention.

    @brettcannon
    Copy link
    Member

    (For other's reference, the PR under discussion is #94342 )

    I took a very quick glance and this is not a small change, so getting it reviewed may take some time (and honestly @pitrou did a lot of the work on the locking mechanism). Does the simpler "ignore KeyError" solution potentially work?

    @exarkun
    Copy link
    Contributor

    exarkun commented Jun 29, 2022

    I took a very quick glance and this is not a small change, so getting it reviewed may take some time (and honestly @pitrou did a lot of the work on the locking mechanism).

    Quite so. The cases it is trying to handle are not simple and the implementation strategy taken is not simple and the problems that arise as a result are not simple. :( One thing I'll point out is that the commits in the PR should be sequentially reviewable, with each commit focusing on a different and (I hope) internally consistent part of the change.

    Does the simpler "ignore KeyError" solution potentially work?

    Unfortunately I don't think so.

    It doesn't solve the problem with deadlock detection during a re-entrant import. That problem goes like this:

    1. Outer import starts and populates _blocking_on
    2. Inner import starts and stomps on outer import's _blocking_on state
    3. Inner import completes and clears _blocking_on state for the running thread
    4. Another import begins and makes an incorrect has_deadlock decision based on the corrupted _blocking_on state (the original thread is blocking on something but it is no longer recorded as blocking on that thing)
    5. The original outer import may now get around to clearing _blocking_on and adding the KeyError handling prevents it from raising that exception but a problem may already have been created by (4)

    It also doesn't solve the single-threaded re-entrant deadlock. That problem goes like this:

    1. Outer import starts and populates _blocking_on
    2. Outer import takes _ModuleLock.lock for the module being imported
    3. Inner import for the same module begins
    4. Inner import takes _ModuleLock.lock for that module and deadlocks

    In this case the outer import never even gets to cleaning up _blocking_on so there is no KeyError to handle.

    @pitrou
    Copy link
    Member

    pitrou commented Jun 30, 2022

    Thanks a lot @exarkun for tackling this. I would agree with using a RLock if it makes things simpler (RLock is written in C nowadays so should be usable from the import bootstrap).

    @exarkun
    Copy link
    Contributor

    exarkun commented Jun 30, 2022

    Thanks @pitrou. I do think RLock will help.

    If I adapt the implementation to that approach then should I also re-target it at main?

    @pitrou
    Copy link
    Member

    pitrou commented Jun 30, 2022

    If I adapt the implementation to that approach then should I also re-target it at main?

    Yes. Backports get cherry-picked from main afterwards.

    @exarkun
    Copy link
    Contributor

    exarkun commented Jul 1, 2022

    I submitted a new PR against main and with the RLock simplification - #94504.

    @exarkun
    Copy link
    Contributor

    exarkun commented Jul 11, 2022

    Hello. Could someone (@brettcannon @pitrou @ericsnowcurrently ?) take a look at #94504? Thanks.

    @pitrou
    Copy link
    Member

    pitrou commented Jul 11, 2022

    Yeah, I have it on my TODO list, but not high priority, sorry :-)

    @brettcannon
    Copy link
    Member

    It's in my review queue, but it's 12/15.

    brettcannon added a commit that referenced this issue Jan 21, 2023
    facebook-github-bot pushed a commit to facebookincubator/cinder that referenced this issue Feb 12, 2024
    …imports
    
    Summary:
    upstream issue: python/cpython#91351
    upstream PR: python/cpython#94504
    upstream merge commit: python/cpython@3325f05
    
    symptom:
    
    ```
    File "<frozen importlib._bootstrap>", line 1004, in _find_and_load(name='oe.gpg_sign', import_=<built-in function __import__>)
    File "<frozen importlib._bootstrap>", line 158, in _ModuleLockManager.__enter__()
    File "<frozen importlib._bootstrap>", line 110, in _ModuleLock.acquire()
    KeyError: 139622474778432
    
    and
    
    File "<frozen importlib._bootstrap>", line 1004, in _find_and_load(name='oe.path', import_=<built-in function __import__>)
    File "<frozen importlib._bootstrap>", line 158, in _ModuleLockManager.__enter__()
    File "<frozen importlib._bootstrap>", line 110, in _ModuleLock.acquire()
    KeyError: 140438942700992
    ```
    
    Reviewed By: carljm
    
    Differential Revision: D53641441
    
    fbshipit-source-id: e142eb17442da370861cd3a3398b0eef9930d041
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.10 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-importlib
    Projects
    None yet
    Development

    Successfully merging a pull request may close this issue.

    5 participants