Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Popen.terminate fails with ProcessLookupError under certain conditions #84730

Closed
AlexanderOvervoorde mannequin opened this issue May 7, 2020 · 10 comments
Closed

Popen.terminate fails with ProcessLookupError under certain conditions #84730

AlexanderOvervoorde mannequin opened this issue May 7, 2020 · 10 comments
Assignees
Labels
3.9 only security fixes 3.10 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@AlexanderOvervoorde
Copy link
Mannequin

AlexanderOvervoorde mannequin commented May 7, 2020

BPO 40550
Nosy @gpshead, @oconnor663, @miss-islington, @FFY00
PRs
  • bpo-40550: Fix time-of-check/time-of-action issue in subprocess.Popen.send_signal. #20010
  • [3.9] bpo-40550: Fix time-of-check/time-of-action issue in subprocess.Popen.send_signal. (GH-20010) #23439
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/gpshead'
    closed_at = <Date 2020-11-21.09:26:37.579>
    created_at = <Date 2020-05-07.20:01:24.417>
    labels = ['type-bug', 'library', '3.9', '3.10']
    title = 'Popen.terminate fails with ProcessLookupError under certain conditions'
    updated_at = <Date 2020-12-03.16:32:27.047>
    user = 'https://bugs.python.org/AlexanderOvervoorde'

    bugs.python.org fields:

    activity = <Date 2020-12-03.16:32:27.047>
    actor = 'oconnor663'
    assignee = 'gregory.p.smith'
    closed = True
    closed_date = <Date 2020-11-21.09:26:37.579>
    closer = 'gregory.p.smith'
    components = ['Library (Lib)']
    creation = <Date 2020-05-07.20:01:24.417>
    creator = 'Alexander Overvoorde'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 40550
    keywords = ['patch']
    message_count = 10.0
    messages = ['368370', '368389', '368462', '368466', '368470', '368492', '381535', '381537', '381538', '382427']
    nosy_count = 5.0
    nosy_names = ['gregory.p.smith', 'oconnor663', 'miss-islington', 'FFY00', 'Alexander Overvoorde']
    pr_nums = ['20010', '23439']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'commit review'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue40550'
    versions = ['Python 3.9', 'Python 3.10']

    @AlexanderOvervoorde
    Copy link
    Mannequin Author

    AlexanderOvervoorde mannequin commented May 7, 2020

    The following program frequently raises a ProcessLookupError exception when calling proc.terminate():

    import threading
    import subprocess
    import multiprocessing
    
    proc = subprocess.Popen(["sh", "-c", "exit 0"])
    
    q = multiprocessing.Queue()
    
    def communicate_thread(proc):
        proc.communicate()
    
    t = threading.Thread(target=communicate_thread, args=(proc,))
    t.start()
    
    proc.terminate()
    
    t.join()
    

    I'm reproducing this with Python 3.8.2 on Arch Linux by saving the script and rapidly executing it like this:

    $ bash -e -c "while true; do python3 test.py; done

    The (unused) multiprocessing.Queue seems to play a role here because the problem vanishes when removing that one line.

    @AlexanderOvervoorde AlexanderOvervoorde mannequin added 3.8 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels May 7, 2020
    @FFY00
    Copy link
    Member

    FFY00 commented May 7, 2020

    This is the backtrace I get:

    Traceback (most recent call last):
      File "/home/anubis/test/multiprocessing-error.py", line 16, in <module>
        proc.terminate()
      File "/home/anubis/git/cpython/Lib/subprocess.py", line 2069, in terminate
        self.send_signal(signal.SIGTERM)
      File "/home/anubis/git/cpython/Lib/subprocess.py", line 2064, in send_signal
        os.kill(self.pid, sig)
    ProcessLookupError: [Errno 3] No such process

    Is yours the same? This is expected, the process exited before proc.terminate().

    You should wrap proc.terminate() in a try..except block:

    try:
    proc.terminate()
    except ProcessLookupError:
    pass

    I am not sure we want to suppress this.

    @AlexanderOvervoorde
    Copy link
    Mannequin Author

    AlexanderOvervoorde mannequin commented May 8, 2020

    I'm not sure that it is expected since Popen.send_signal does contain the following check:

    def send_signal(self, sig):
        """Send a signal to the process."""
        # Skip signalling a process that we know has already died.
        if self.returncode is None:
            os.kill(self.pid, sig)
    

    Additionally, the following program does not raise a ProcessLookupError despite the program already having exited:

    import subprocess
    import time
    
    proc = subprocess.Popen(["sh", "-c", "exit 0"])
    
    time.sleep(5)
    
    proc.terminate()
    

    @FFY00
    Copy link
    Member

    FFY00 commented May 8, 2020

    This is a simple time-of-check - time-of-action issue, which is why I suggested that it shouldn't be fixed. I was not aware send_signal did have this check, which tells us it is supposed to be suported(?). Anyway, to fix it we need to catch the exception and check for errno == 3 so that we can ignore it.

    Optimally we would want to have an atomic operation here, but no such thing exists. There is still the very faint possibility that after your process exits a new process will take its id and we kill it instead.

    We should keep the returncode check and just ignore the exception when errno == 3. This is the best option.

    @FFY00
    Copy link
    Member

    FFY00 commented May 8, 2020

    I submitted a patch. As explained above, this only mitigates the time-of-check/time-of-action issue in case the process doesn't exist, it *is not* a proper fix for the issue. But don't see any way to properly fix it.

    @AlexanderOvervoorde
    Copy link
    Mannequin Author

    AlexanderOvervoorde mannequin commented May 9, 2020

    I understand that it's not a perfect solution, but at least it's a little bit closer. Thanks for your patch :)

    @gpshead gpshead added 3.9 only security fixes 3.10 only security fixes labels Nov 21, 2020
    @gpshead gpshead self-assigned this Nov 21, 2020
    @gpshead gpshead added 3.9 only security fixes 3.10 only security fixes labels Nov 21, 2020
    @gpshead gpshead self-assigned this Nov 21, 2020
    @gpshead
    Copy link
    Member

    gpshead commented Nov 21, 2020

    New changeset 01a202a by Filipe Laíns in branch 'master':
    bpo-40550: Fix time-of-check/time-of-action issue in subprocess.Popen.send_signal. (GH-20010)
    01a202a

    @gpshead
    Copy link
    Member

    gpshead commented Nov 21, 2020

    Thanks for the patch!

    PRs are in or on their way in for 3.10 and 3.9.

    The 3.8 auto-backport failed, if anyone wants it on a future 3.8.x please follow up with a manual cherry pick to make a PR for the 3.8 branch.

    @gpshead gpshead removed the 3.8 only security fixes label Nov 21, 2020
    @gpshead gpshead closed this as completed Nov 21, 2020
    @gpshead gpshead removed the 3.8 only security fixes label Nov 21, 2020
    @gpshead gpshead closed this as completed Nov 21, 2020
    @miss-islington
    Copy link
    Contributor

    New changeset 713b4bb by Miss Islington (bot) in branch '3.9':
    bpo-40550: Fix time-of-check/time-of-action issue in subprocess.Popen.send_signal. (GH-20010)
    713b4bb

    @oconnor663
    Copy link
    Mannequin

    oconnor663 mannequin commented Dec 3, 2020

    I'm late to the party, but I want to explain what's going on here in case it's helpful to folks. The issue you're seeing here has to do with whether a child processs has been "reaped". (Windows is different from Unix here, because the parent keeps an open handle to the child, so this is mostly a Unix thing.) In short, when a child exits, it leaves a "zombie" process whose only job is to hold some metadata and keep the child's PID reserved. When the parent calls wait/waitpid/waitid or similar, that zombie process is cleaned up. That means that waiting has important correctness properties apart from just blocking the parent -- signaling after wait returns is unsafe, and forgetting to wait also leaks kernel resources.

    Here's a short example demonstrating this:

      import signal                                                                                                                                                                                
      import subprocess                                                                                                                                                                            
      import time                                                                                                                                                                                  
                                                                                                                                                                                                   
      # Start a child process and sleep a little bit so that we know it's exited.                                                                                                                              
      child = subprocess.Popen(["true"])                                                                                                                                                           
      time.sleep(1)                                                                                                                                                                                
                                                                                                                                                                                                   
      # Signal it. Even though it's definitely exited, this is not an error.                                                                                                                                  
      os.kill(child.pid, signal.SIGKILL)                                                                                                                                                           
      print("signaling before waiting works fine")                                                                                                                                                 
                                                                                                                                                                                                   
      # Now wait on it. We could also use os.waitpid or os.waitid here. This reaps                                                                                                                 
      # the zombie child.                                                                                                                                                                          
      child.wait()                                                                                                                                                                                 
                                                                                                                                                                                                   
      # Try to signal it again. This raises ProcessLookupError, because the child's                                                                                                                
      # PID has been freed. But note that Popen.kill() would be a no-op here,
      # because it knows the child has already been waited on.                                                                                                                                                    
      os.kill(child.pid, signal.SIGKILL)                                                                                                                                                           
    

    With that in mind, the original behavior with communicate() that started this bug is expected. The docs say that communicate() "waits for process to terminate and sets the returncode attribute." That means internally it calls waitpid, so your terminate() thread is racing against process exit. Catching the exception thrown by terminate() will hide the problem, but the underlying race condition means your program might end up killing an unrelated process that just happens to reuse the same PID at the wrong time. Doing this properly requires using waitid(WNOWAIT), which is...tricky.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.9 only security fixes 3.10 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants