
logging regression with threading + fork are mixed in 3.7.1rc2 (deadlock potential) #80714

Closed
gpshead opened this issue Apr 5, 2019 · 19 comments
Assignees
Labels
3.7 (EOL) end of life 3.8 (EOL) end of life release-blocker stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@gpshead
Member

gpshead commented Apr 5, 2019

BPO 36533
Nosy @rhettinger, @gpshead, @ned-deily, @ambv, @cagney, @kevans91, @koubaa
PRs
  • bpo-36533: Reinit logging.Handler locks on fork(). #12704
  • [3.7] bpo-36533: Reinit logging.Handler locks on fork(). (GH-12704) #13170
  • bpo-1635741 port _curses_panel to multi-phase init (PEP 489) #21986
  • Update all Python Cookbook links after migration to GitHub #22205
  • bpo-40423: Optimization: use close_range(2) if available #22651
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/gpshead'
    closed_at = <Date 2019-05-07.20:32:05.698>
    created_at = <Date 2019-04-05.08:15:51.465>
    labels = ['3.7', '3.8', 'type-bug', 'library', 'release-blocker']
    title = 'logging regression with threading + fork are mixed in 3.7.1rc2 (deadlock potential)'
    updated_at = <Date 2020-10-11.20:42:06.792>
    user = 'https://github.com/gpshead'

    bugs.python.org fields:

    activity = <Date 2020-10-11.20:42:06.792>
    actor = 'kevans'
    assignee = 'gregory.p.smith'
    closed = True
    closed_date = <Date 2019-05-07.20:32:05.698>
    closer = 'gregory.p.smith'
    components = ['Library (Lib)']
    creation = <Date 2019-04-05.08:15:51.465>
    creator = 'gregory.p.smith'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 36533
    keywords = ['patch', '3.7regression']
    message_count = 19.0
    messages = ['339472', '339474', '339475', '339480', '339481', '339503', '339505', '339529', '339650', '339688', '339760', '339825', '339863', '339864', '339882', '341543', '341768', '341815', '341816']
    nosy_count = 9.0
    nosy_names = ['rhettinger', 'gregory.p.smith', 'ned.deily', 'lukasz.langa', 'cagney', 'kevans', 'hugh', 'ebizzinfotech', 'koubaa']
    pr_nums = ['12704', '13170', '21986', '22205', '22651']
    priority = 'release blocker'
    resolution = 'fixed'
    stage = 'commit review'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue36533'
    versions = ['Python 3.7', 'Python 3.8']

    @gpshead
    Member Author

    gpshead commented Apr 5, 2019

    I'm splitting a discussion buried in the way-too-long thread of https://bugs.python.org/issue6721 out into its own issue here, to be treated as a 3.7 release blocker so a rollback-or-repair decision can be made before 3.7.4.

    3b69993

    I believe that commit, which shipped in 3.7.1, is leading to a behavior regression for an application (the Fedora installer? libreswan's kvmrunner?). Full details can be found in the messages of the other issue starting with:
    https://bugs.python.org/issue6721#msg329474

    TL;DR - logging.Handler instances each have their own threading.RLock.
    libreswan implements at least one logging.Handler subclass. That subclass's custom emit() implementation directly calls potentially many other sub-handlers' emit() methods. Some of those emit() methods (such as logging.StreamHandler's) call flush(), which acquires the handler's lock.

    So they've got a dependency between these two locks: the first must be acquired before the second.

    But the logging module APIs have no concept of sub-handlers and lock ordering.

    I see many flaws with the libreswan code's design (I'm already ignoring the futility of threading + fork), but this still caused a behavior regression in the stable 3.7 release.
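
    For illustration, a minimal sketch of the handler arrangement being described; the class and names here are hypothetical, not libreswan's actual code:

        import logging

        class CompositeDebugHandler(logging.Handler):
            """Hypothetical composite handler that fans records out to sub-handlers."""

            def __init__(self, sub_handlers):
                super().__init__()
                self.sub_handlers = list(sub_handlers)

            def emit(self, record):
                # Logger.handle() calls this with self's lock already held.
                for handler in self.sub_handlers:
                    # StreamHandler.emit() ends with flush(), which acquires that
                    # handler's own lock, so self's lock must always be taken
                    # before each sub-handler's lock.
                    handler.emit(record)

        root = logging.getLogger()
        root.addHandler(CompositeDebugHandler([logging.StreamHandler()]))
        root.warning("two locks are now involved in emitting this record")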

    (more comments coming as followups to avoid a wall of text with too many topics)

    @gpshead gpshead self-assigned this Apr 5, 2019
    @gpshead gpshead added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Apr 5, 2019
    @gpshead
    Member Author

    gpshead commented Apr 5, 2019

    That custom DebugHandler's emit() implementation that calls into one or more sub-handlers suggests that libreswan _might_ be able to fix it by implementing custom acquire() and release() methods on the DebugHandler... BUT that approach is fundamentally flawed: it requires acquiring multiple locks in a single atomic operation OR guaranteeing that they will always be acquired in a single specific order.

    Given that the logging module's fork-time code maintains no ordering, even custom acquire() and release() methods on your DebugHandler would not be sufficient, as sub-handler locks could be acquired first during fork. Also, that would assume your sub-handlers are not also used directly elsewhere within all possible coexisting logger configurations.

    A _gross workaround_ that libreswan, or anything else in this boat, could implement to satisfy that constraint is to force all potentially related handlers to share the same single RLock instance. :(
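
    A minimal sketch of that workaround; the share_lock() helper is invented here for illustration and is not a logging API:

        import logging
        import threading

        _shared_lock = threading.RLock()  # one lock for every related handler

        def share_lock(handler):
            # Replace the handler's per-instance RLock with the shared one, so any
            # lock ordering between "parent" and sub-handlers collapses to one lock.
            handler.lock = _shared_lock
            return handler

        console = share_lock(logging.StreamHandler())
        logfile = share_lock(logging.FileHandler("debug.log"))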

    @gpshead
    Member Author

    gpshead commented Apr 5, 2019

    Within the logging module we could go beyond using a WeakSet and maintain an ordering. But we'd need to allow a way for Handler subclasses to indicate what order matters; that'd require a new API (not suitable for 3.7)

    An ordering itself may be insufficient as code can implement arbitrary Handler code with cyclic graphs of handlers calling directly into other handlers such that no possible order could ever satisfy them all.

    This pulls us back to "stop having multiple locks" for any code that creates relationships between handlers, which is good advice for Handler authors and something we should document if it isn't already.

    Side note: We should also document that Handler.emit() as an API MUST be called with the Handler's lock acquired via Handler.acquire() first. That is the undocumented contract under which logging.Logger.handle()'s code always calls into handler emit(). The libreswan code calling other handlers' emit() methods without doing that smells wrong, but that has no bearing on their current problem: even if it did acquire those locks first, it'd still deadlock, just sooner, rather than within the other handler's emit()->flush()->acquire() call chain.

    Logging library users are better off never directly calling another handler. Put messages onto queues, with a dedicated thread processing those queues into the relevant handlers. If you want to keep the synchronicity of logging API calls returning only after the message has been handled, block waiting for those queued items to be processed only _after_ releasing your own lock.
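
    The stdlib's logging.handlers.QueueHandler and QueueListener (available since Python 3.2) implement essentially this pattern; a minimal sketch:

        import logging
        import logging.handlers
        import queue

        log_queue = queue.SimpleQueue()

        # The only handler attached to the logger just enqueues records; its emit()
        # never touches the downstream handlers' locks.
        logging.getLogger().addHandler(logging.handlers.QueueHandler(log_queue))

        # A dedicated thread drains the queue into the real handlers.
        listener = logging.handlers.QueueListener(
            log_queue, logging.StreamHandler(), logging.FileHandler("debug.log"))
        listener.start()

        logging.getLogger().warning("this record is emitted on the listener thread")
        listener.stop()  # processes anything still queued, then joins the thread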

    @gpshead
    Member Author

    gpshead commented Apr 5, 2019

    Now to back up: why was this application mixing fork() and threads at all? That is a fundamental flaw. Should we be doing work to support things that so far merely _happen_ to work on such broken designs?

    Another alternative is for the DebugHandler implementation, in whatever code implements it, to de-register itself, from its own constructor, from the logging library's private WeakSet of handlers whose locks are acquired at fork time. We don't have a public API for this, so the workaround on 3.7.1 - 3.7.3 would look like:

       def __init__(self, *args, **kwargs):
           super().__init__(*args, **kwargs)
           # De-register from logging's private at-fork weakset (exists in 3.7.1 - 3.7.3 only).
           if hasattr(logging, '_at_fork_acquire_release_weakset'):
               logging._at_fork_acquire_release_weakset.discard(self)

    This means it'd still have the bug the code already had on all versions of Python before this logging library "acquire the locks before fork" fix went in: namely, if the forked child tries to log, it could deadlock because the fork happened while the inherited logging handler lock was held.

    @gpshead
    Member Author

    gpshead commented Apr 5, 2019

    A stdlib alternative to this whole mess would be to avoid acquiring the logging locks before fork() as we currently do and just blindly re-initialize all of them afterwards under the assumption that they "can't" be protecting anything in a newly forked child process. Handlers that need specific resource synchronization around fork would then be required to deal with their own os.register_at_fork() calls. (ex: to avoid multiple processes writing to the same file or fd at once)
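
    A sketch of what such a handler-specific hook might look like under that approach; ForkSafeFileHandler and its method names are hypothetical, only os.register_at_fork() (Python 3.7+) is a real API:

        import logging
        import os

        class ForkSafeFileHandler(logging.FileHandler):
            """Hypothetical handler that manages its own fork-time synchronization."""

            def __init__(self, filename):
                super().__init__(filename)
                # after_in_child runs in the forked child, where only the forking
                # thread exists, so rebuilding state there cannot deadlock.
                # (Note: register_at_fork keeps a reference to this bound method,
                # so the handler will live for the remainder of the process.)
                os.register_at_fork(after_in_child=self._reinit_in_child)

            def _reinit_in_child(self):
                self.createLock()            # fresh, unheld lock in the child
                self.stream = self._open()   # don't share the parent's file object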

    @cagney
    Mannequin

    cagney mannequin commented Apr 5, 2019

    On Fri, 5 Apr 2019 at 04:15, Gregory P. Smith <report@bugs.python.org> wrote:

    > New submission from Gregory P. Smith <greg@krypto.org>:

    > I'm splitting a discussion buried in the way-too-long thread of https://bugs.python.org/issue6721 out into its own issue here, to be treated as a 3.7 release blocker so a rollback-or-repair decision can be made before 3.7.4.

    > 3b69993

    > I believe that commit, which shipped in 3.7.1, is leading to a behavior regression for an application (the Fedora installer? libreswan's kvmrunner?). Full details can be found in the messages of the other issue starting with:
    > https://bugs.python.org/issue6721#msg329474

    Two separate applications have tripped up on this:

    • Fedora's installer, aka anaconda. I don't know any further details.

    • libreswan's test framework, aka kvmrunner. kvmrunner uses pexpect, pexpect uses ptyprocess, and ptyprocess uses pty.fork() + os.exec*() (but, hey, who knew! I didn't until this week). This seems to be a long-standing concern:
    pexpect/ptyprocess#43

    > TL;DR - logging.Handler instances each have their own threading.RLock.
    > libreswan implements at least one logging.Handler subclass. That subclass's custom emit() implementation directly calls potentially many other sub-handlers' emit() methods. Some of those emit() methods (such as logging.StreamHandler's) call flush(), which acquires the handler's lock.

    # Implement log-level inversion.

    # Ref: https://docs.python.org/2/howto/logging.html#logging-flow

    # By default, a parent (root) logger, regardless of its log-level,
    # will log all the records logged by a child. For instance, if a
    # child logger is logging at DEBUG-level, then the parent will log
    # (display on the console) those DEBUG-level records even when it has
    # been configured to log only INFO-level records. This is because the
    # default log-level ("Logger enabled for level of call?") check is
    # only applied once at record-creation.

    # This code allows DEBUG-level logging to a file, while simultaneously
    # (the inversion) restricting console log records to just INFO-level
    # say.

    The logging.Handler's lock ensures that the two sub-streams are kept in order; it prevents threads interleaving their writes.

    > So they've got a dependency between these two locks: the first must be acquired before the second.

    > But the logging module APIs have no concept of sub-handlers and lock ordering.

    > I see many flaws with the libreswan code's design (I'm already ignoring the futility of threading + fork), but this still caused a behavior regression in the stable 3.7 release.

    Always first understand the problem.

    > (more comments coming as followups to avoid a wall of text with too many topics)


    @cagney
    Mannequin

    cagney mannequin commented Apr 5, 2019

    On Fri, 5 Apr 2019 at 05:02, Gregory P. Smith <report@bugs.python.org> wrote:

    > Gregory P. Smith <greg@krypto.org> added the comment:

    > A stdlib alternative to this whole mess would be to avoid acquiring the logging locks before fork() as we currently do and just blindly re-initialize all of them afterwards under the assumption that they "can't" be protecting anything in a newly forked child process. Handlers that need specific resource synchronization around fork would then be required to deal with their own os.register_at_fork() calls. (ex: to avoid multiple processes writing to the same file or fd at once)

    FYI, below is a simpler, but related, test.
    _at_fork_acquire_release_weakset doesn't seem to be locked:

    Ignoring exception from logging atfork <Handler (NOTSET)> release method: cannot release un-acquired lock

    Exception ignored in: <function _after_at_fork_weak_calls at 0x7f7307550378>
    Traceback (most recent call last):
      File "/home/python/v3.7.3/lib/python3.7/logging/__init__.py", line 269, in _after_at_fork_weak_calls
        _at_fork_weak_calls('release')
      File "/home/python/v3.7.3/lib/python3.7/logging/__init__.py", line 254, in _at_fork_weak_calls
        for instance in _at_fork_acquire_release_weakset:
      File "/home/python/v3.7.3/lib/python3.7/_weakrefset.py", line 60, in __iter__
        for itemref in self.data:
    RuntimeError: Set changed size during iteration

    Exception in thread Thread-1:
    Traceback (most recent call last):
      File "/home/python/v3.7.3/lib/python3.7/threading.py", line 917, in _bootstrap_inner
        self.run()
      File "/home/python/v3.7.3/lib/python3.7/threading.py", line 865, in run
        self._target(*self._args, **self._kwargs)
      File "./btc.py", line 11, in lockie
        h = logging.Handler()
      File "/home/python/v3.7.3/lib/python3.7/logging/__init__.py", line 824, in __init__
        self.createLock()
      File "/home/python/v3.7.3/lib/python3.7/logging/__init__.py", line 847, in createLock
        _register_at_fork_acquire_release(self)
      File "/home/python/v3.7.3/lib/python3.7/logging/__init__.py", line 250, in _register_at_fork_acquire_release
        _at_fork_acquire_release_weakset.add(instance)
      File "/home/python/v3.7.3/lib/python3.7/_weakrefset.py", line 83, in add
        self._commit_removals()
      File "/home/python/v3.7.3/lib/python3.7/_weakrefset.py", line 56, in _commit_removals
        discard(l.pop())
    IndexError: pop from empty list

    import logging
    import os
    import sys
    import threading
    
    def lockie():
        while True:
            # constructing a Handler adds it to _at_fork_acquire_release_weakset
            #sys.stdout.write(".")
            #sys.stdout.flush()
            h = logging.Handler()

    threading.Thread(target=lockie).start()

    for n in range(0,100000):
        if n % 100 == 0:
            sys.stdout.write("%d" % n)
            sys.stdout.flush()
        pid = os.fork()
        if pid != 0:
            os.wait()
        else:
            os._exit(0)

    @gpshead
    Member Author

    gpshead commented Apr 6, 2019

    Here's a PR that goes the re-initialization route rather than attempting to acquire/release anything other than the single module _lock.

    @gpshead gpshead added the 3.8 (EOL) end of life label Apr 6, 2019
    @cagney
    Mannequin

    cagney mannequin commented Apr 8, 2019

    I think the only pragmatic solution here is to add an optional parameter to logging.basicConfig() that specifies that the logger should use a single global lock; and then start documenting that thread locks and fork() don't work well together.

    And note that this solution is pragmatic, not correct (@dhr, when we discussed this offline, pointed out that since Python is using threads, a "correct" solution would be to use some sort of inter-process lock).

    For instance, if I were to implement emit() as something contrived like:

        with lock:
            s = record.to-string()
            for i in 1 .. s'length:
                b[i] = s[i]
            for i in 1 .. b'length:
                stdout.write(b[i]).flush()
            b = []

    then when fork() breaks 'lock' the guarantee that the code is atomic is also broken:

    • when the child enters the code 'b' is undefined
    • the guarantee that log records don't interleave is lost

    While a global lock would help mitigate the first case, it really isn't a "correct" fix.

    @gpshead
    Member Author

    gpshead commented Apr 9, 2019

    The logging library has never guaranteed that it wouldn't interleave or duplicate buffered I/O output when fork() is involved. We should not start trying to claim that it does. Too complicated and fragile. It can't. Users who want that should implement their own single logging handler that is well behaved in whatever regards they need. There are a ton of options for them.

    A basicConfig setting to switch the mode of operation to a single lock instead of per-handler locks could only happen in 3.8 (new feature). I'll ponder if it is even feasible to add that after the re-init change goes in. Even with such a feature, library user authored Handler subclasses would always be free to violate that. We can't protect people from themselves.

    @cagney
    Mannequin

    cagney mannequin commented Apr 9, 2019

    I pointed out two issues with breaking the locks.

    Your response addressed the less important issue:

    • the guarantee that log records don't interleave is lost

    Since the documentation seems to be silent, the guarantee (or expectation) is implied - logging systems emit records atomically - and if this isn't true across fork() then the exception should be documented.

    but it didn't address the more fundamental problem:

    • when the child enters the code 'b' is undefined

    i.e., breaking the lock will allow the child to access data in an undefined state. This will result in either core dumps or hangs (presumably the motivation for the original change was to prevent this?).

    For instance, if the fork() happens while a thread is holding the <stdio.h> FILE lock on a log file, then the child trying to access that FILE will hang.

    @gpshead
    Member Author

    gpshead commented Apr 10, 2019

    """
    Since the documentation seems to be silent the guarantee (or expectation) is implied - logging systems emit records atomically - if this isn't true across fork() then the exception should be documented.
    """

    We can add a note to the docs. As a general rule: documentation being silent means there are zero guarantees. Behavior of the existing implementations over time is the only source of truth.

    """
    but it didn't address the more fundamental problem:

    • when the child enters the code 'b' is undefined

    i.e., breaking the lock will allow the child to access data in an undefined state. This will result in either core dumps or hangs (presumably the motivation for the original change was to prevent this?).
    """

    Nope. The contrived emit() pseudocode from msg339650 never defined b in the first place. That code, if massaged into Python syntax, would raise an UnboundLocalError no matter who executed it. A fork() from another thread would not change that. There is no concept of undefined/uninitialized state from the Python perspective.

    The motivation for the original fix was entirely deadlock avoidance in code that uses fork. It failed and introduced an alternate form of deadlock in code that had been lucky up until that point. Thus my new PR proposing to fix the regression by making that not fail yet still preventing locks from starting out held in the child process.

    @cagney
    Mannequin

    cagney mannequin commented Apr 10, 2019

    > Nope. The contrived emit() pseudocode from msg339650 never defined b in the first place. That code, if massaged into Python syntax, would raise an UnboundLocalError no matter who executed it. A fork() from another thread would not change that. There is no concept of undefined/uninitialized state from the Python perspective.

    Wow! It is pseudocode, right? (I figured using Ada-like attributes was a bit of a giveaway.) Nitpicking it achieves nothing (if you feel that your language needs 'b' pre-defined then, hey!, pre-define it; I'm pretty sure that won't fix the problem of operations such as b[i] = 2 not being atomic, let alone thread-safe, in Python).

    Let's instead focus on my point that breaking the locks won't fix the problem. For instance, above. Or: "For instance, if the fork() happens while a thread is holding the <stdio.h> FILE lock on a log file, then the child trying to access that FILE will hang."

    @cagney
    Mannequin

    cagney mannequin commented Apr 10, 2019

    BTW, a non-visible change might be to use a global readers-writer lock where fork() is the writer.

    https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock
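
    A rough sketch of that idea, assuming an invented ReadWriteLock class (the stdlib has no readers-writer lock): handlers would take the read side around emit(), and fork() would take the write side, so a fork can only proceed while no handler lock is held:

        import os
        import threading

        class ReadWriteLock:
            """Minimal readers-writer lock: many concurrent readers, one writer."""

            def __init__(self):
                # Use a plain Lock (not RLock) so the forked child can release it.
                self._cond = threading.Condition(threading.Lock())
                self._readers = 0

            def acquire_read(self):
                with self._cond:
                    self._readers += 1

            def release_read(self):
                with self._cond:
                    self._readers -= 1
                    if self._readers == 0:
                        self._cond.notify_all()

            def acquire_write(self):
                self._cond.acquire()
                while self._readers:
                    self._cond.wait()   # releases the lock while waiting

            def release_write(self):
                self._cond.release()

        _fork_rwlock = ReadWriteLock()

        # Handlers would wrap their emit() work in acquire_read()/release_read(),
        # while fork() takes the write side around the actual fork:
        os.register_at_fork(
            before=_fork_rwlock.acquire_write,
            after_in_parent=_fork_rwlock.release_write,
            after_in_child=_fork_rwlock.release_write,
        )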

    @gpshead
    Member Author

    gpshead commented Apr 10, 2019

    The problem I am fixing is merely making Python's logging library not the source of this deadlock, because it was not the source in the past. I am solving a regression in CPython behavior between 3.7.0 and 3.7.1 that led to a logging.Handler lock related deadlock in two identified problematic applications.

    Breaking the logging.Handler-owned locks does solve that, unless someone has concrete proof from these specific applications otherwise.

    It would be helpful if someone *willing to actually be constructive* would test them with my PR's patch applied to their 3.7.3 interpreter as confirmation.

    @ambv
    Contributor

    ambv commented May 6, 2019

    This deadlock is a release blocker for 3.8.0. FTR, I don't consider it blocking alphas and betas.

    @gpshead
    Member Author

    gpshead commented May 7, 2019

    New changeset 64aa6d2 by Gregory P. Smith in branch 'master':
    bpo-36533: Reinit logging.Handler locks on fork(). (GH-12704)
    64aa6d2

    @gpshead gpshead removed the 3.8 (EOL) end of life label May 7, 2019
    @gpshead
    Member Author

    gpshead commented May 7, 2019

    New changeset 3b4b28e by Gregory P. Smith in branch '3.7':
    [3.7] bpo-36533: Reinit logging.Handler locks on fork(). (GH-12704) (GH-13170)
    3b4b28e

    @gpshead
    Member Author

    gpshead commented May 7, 2019

    The regression should be fixed. It'd be helpful if owners of applications that were running into this could test their applications with this specific change.

    @gpshead gpshead added the 3.8 (EOL) end of life label May 7, 2019
    @gpshead gpshead closed this as completed May 7, 2019
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022