Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

_PyThreadState_Init and fork race leads to inconsistent key list #73826

Closed
JnStanek mannequin opened this issue Feb 24, 2017 · 18 comments
Closed

_PyThreadState_Init and fork race leads to inconsistent key list #73826

JnStanek mannequin opened this issue Feb 24, 2017 · 18 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@JnStanek
Copy link
Mannequin

JnStanek mannequin commented Feb 24, 2017

BPO 29640
Nosy @vstinner, @encukou, @zware, @matrixise, @applio, @stratakis
PRs
  • bpo-29640: Protect key list during fork() #783
  • [WIP] bpo-29640 for pthread: Use native thread-local storage to avoid locking issues #5141
  • Files
  • bz1268226_reproducer2.tar.gz: python patch that makes it easier to reproduce
  • fork_hold_keymutex.patch
  • python.patch
  • 0001-Protect-key-list-during-fork.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2020-04-27.04:19:17.396>
    created_at = <Date 2017-02-24.09:03:58.320>
    labels = ['interpreter-core', 'type-crash']
    title = '_PyThreadState_Init and fork race leads to inconsistent key list'
    updated_at = <Date 2020-04-27.04:19:17.395>
    user = 'https://bugs.python.org/JnStanek'

    bugs.python.org fields:

    activity = <Date 2020-04-27.04:19:17.395>
    actor = 'zach.ware'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-04-27.04:19:17.396>
    closer = 'zach.ware'
    components = ['Interpreter Core']
    creation = <Date 2017-02-24.09:03:58.320>
    creator = 'J\xc3\xa1n Stan\xc4\x8dek'
    dependencies = []
    files = ['46666', '46682', '46730', '46753']
    hgrepos = []
    issue_num = 29640
    keywords = ['patch']
    message_count = 18.0
    messages = ['288510', '288753', '289723', '289993', '299299', '309700', '309705', '309712', '309725', '309731', '316649', '316655', '317077', '317184', '317273', '317276', '317291', '367389']
    nosy_count = 9.0
    nosy_names = ['vstinner', 'petr.viktorin', 'neologix', 'zach.ware', 'matrixise', 'davin', 'cstratak', 'J\xc3\xa1n Stan\xc4\x8dek', 'Thomas Mortensson']
    pr_nums = ['783', '5141']
    priority = 'normal'
    resolution = 'out of date'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'crash'
    url = 'https://bugs.python.org/issue29640'
    versions = ['Python 2.7']

    @JnStanek
    Copy link
    Mannequin Author

    JnStanek mannequin commented Feb 24, 2017

    Following crash is sporadically observed in RHEL7 anaconda:

    (gdb) f 0
    #0 PyThread_ReInitTLS () at /usr/src/debug/Python-2.7.5/Python/thread.c:411
    411 if (p->id != id) {
    (gdb) list
    406 keymutex = PyThread_allocate_lock();
    407
    408 /* Delete all keys which do not match the current thread id */
    409 q = &keyhead;
    410 while ((p = *q) != NULL) {
    411 if (p->id != id) {
    412 *q = p->next;
    413 free((void *)p);
    414 /* NB This does *not* free p->value! */
    415 }
    (gdb) p p
    $1 = (struct key *) 0x3333333333333333
    (gdb) p keyhead
    $2 = (struct key *) 0x3333333333333333

    key list is protected by keymutex (except for PyThread_ReInitTLS), but there doesn't seem to be any protection against concurrent fork(). What seems to happen is fork() at the moment when key list is not consistent.

    For example, if I initialize new key to 0xfe:

    static struct key *find_key(int key, void *value)
        // find_key with extra memset()
        ...
        p = (struct key *)malloc(sizeof(struct key));
        memset(p, 0xfe, sizeof(struct key));
        if (p != NULL) {
            p->id = id;
            p->key = key;
            p->value = value;
            p->next = keyhead;
            keyhead = p;
        }
        ...

    Looking at disassembly, compiler reordered last 2 writes:

    0x00000000004fcb50 <+272>: callq 0x413d10 <malloc@plt>
    0x00000000004fcb55 <+277>: movabs $0xfefefefefefefefe,%rcx
    0x00000000004fcb5f <+287>: test %rax,%rax
    0x00000000004fcb62 <+290>: mov %rcx,(%rax)
    0x00000000004fcb65 <+293>: mov %rcx,0x8(%rax)
    0x00000000004fcb69 <+297>: mov %rcx,0x10(%rax)
    0x00000000004fcb6d <+301>: mov %rcx,0x18(%rax)
    0x00000000004fcb71 <+305>: je 0x4fcaff <PyThread_set_key_value+191>
    0x00000000004fcb73 <+307>: mov 0x2f1e26(%rip),%rdx # 0x7ee9a0 <keyhead>
    0x00000000004fcb7a <+314>: mov 0x2f1dff(%rip),%rdi # 0x7ee980 <keymutex>
    0x00000000004fcb81 <+321>: xor %r14d,%r14d
    0x00000000004fcb84 <+324>: mov %rbp,0x8(%rax)
    0x00000000004fcb88 <+328>: mov %r12d,0x10(%rax)
    0x00000000004fcb8c <+332>: mov %r13,0x18(%rax)
    0x00000000004fcb90 <+336>: mov %rax,0x2f1e09(%rip) # 0x7ee9a0 <keyhead>
    0x00000000004fcb97 <+343>: mov %rdx,(%rax)

    Now consider what happens, when different threads call fork() in between these 2 writes: we updated keyhead, but keyhead->next has not been updated yet.

    Now when anaconda hangs, I get:

    (gdb) list
    407 keymutex = PyThread_allocate_lock();
    408
    409 /* Delete all keys which do not match the current thread id */
    410 q = &keyhead;
    411 while ((p = *q) != NULL) {
    412 if (p->id != id) {
    413 *q = p->next;
    414 free((void *)p);
    415 /* NB This does *not* free p->value! */
    416 }
    (gdb) p p
    $1 = (struct key *) 0xfefefefefefefefe
    (gdb) p keyhead
    $2 = (struct key *) 0xfefefefefefefefe

    Here's how I think we get into this state:

    -------------------------> thread 1
    # has GIL
    Thread.start
    _start_new_thread(self.__bootstrap, ())
    PyThread_start_new_thread(t_bootstrap)
    # spawns thread 3

    -------------------------> thread 2
    ...
    # waiting for GIL

    -------------------------> thread 3
    t_bootstrap
    _PyThreadState_Init # does not have GIL yet at this point
    _PyGILState_NoteThreadState
    PyThread_set_key_value(autoTLSkey, (void *)tstate)
    find_key
    # key list is temporarily not consistent
    # due to compiler reordering couple writes in find_key

    -------------------------> thread 1
    continuing Thread.start
      self.__started.wait()
        Event.wait()
          self.__cond.wait 
            Condition.wait()
              waiter = _allocate_lock()
              waiter.acquire()
                lock_PyThread_acquire_lock
                  Py_BEGIN_ALLOW_THREADS
                    PyEval_SaveThread
                      PyThread_release_lock(interpreter_lock);

    -------------------------> thread 2
    ...
    # acquired GIL
    os.fork() # forks inconsistent list

    -------------------------> child
    PyOS_AfterFork()
    PyThread_ReInitTLS()
    SIGSEGV

    Attached patch for python makes it easier to reproduce, by adding delays to couple places to make window key list is not consistent larger.

    @JnStanek JnStanek mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-crash A hard crash of the interpreter, possibly with a core dump labels Feb 24, 2017
    @encukou
    Copy link
    Member

    encukou commented Mar 1, 2017

    Here is a proof of concept patch from Jaroslav Škarvada. It fixes the problem by holding the mutex used for PyThread_create_key while forking.

    To make it more than PoC it needs adding _PyThread_AcquireKeyLock and _ReleaseKeyLock (similar to _PyImport_AcquireLock() etc.) and calling those. Other than that, does this approach look reasonable?

    @stratakis
    Copy link
    Mannequin

    stratakis mannequin commented Mar 16, 2017

    In order to reproduce:

    Apply the python.patch from bz1268226_reproducer2.tar.gz

    Compile python

    Run the reproduce4.py from bz1268226_reproducer2.tar.gz

    As indicated by the reproducer, the status returned by os.wait() for the child is 139.

    I will refine a bit the patch and work on a PR.

    @stratakis
    Copy link
    Mannequin

    stratakis mannequin commented Mar 22, 2017

    Patch for protecting the key list while forking.

    @ThomasMortensson
    Copy link
    Mannequin

    ThomasMortensson mannequin commented Jul 27, 2017

    Hey, any status update on this bug? Suffered a similar issue on a Centos 6.5 kernel when spawning multiple processes in a Twisted environment. Is this PR targeted for inclusion into the source tree?

    Thanks,

    Tom

    @encukou
    Copy link
    Member

    encukou commented Jan 9, 2018

    I'm a bit out of my depth here. Victor, could you chime in?

    The problem with Harris' patch is that, once fork() is protected by the thread lock, acquiring that lock (by e.g. calling PyGILState_GetThisThreadState) from an atfork handler hangs deadlocks.

    I thought that can be solved by doing the locking in an atfork handler, but that's not working out -- CPython's pthread_atfork (which would lock _PyThread_AcquireKeyLock for the duration of the fork) would need to be called *after* an extension's pthread_atfork (which needs the thread lock temporarily).

    @vstinner
    Copy link
    Member

    vstinner commented Jan 9, 2018

    There is a more general issue for any lock and fork(): bpo-6721, "Locks in the standard library should be sanitized on fork".

    @vstinner
    Copy link
    Member

    vstinner commented Jan 9, 2018

    Python 3 is not affected by this issue because it uses native thread locale storage (TLS):

    • pthread: pthread_getspecific() / pthread_setspecific()
    • Windows: TlsGetValue() / TlsSetValue()

    I'm not sure that it's doable to backport such enhancement, since Python 2.7 supports many thread implementations, not only NT (Windows) and pthread:

    Maybe it's doable for a Linux vendor, but it's going to be a large change that has to be maintained downstream :-/

    @encukou
    Copy link
    Member

    encukou commented Jan 9, 2018

    Gah! The more I look into locks & forks ... the more I learn, to put it mildly.

    Instead of piling on workarounds, I'll try my hand at using native thread-local storage for pthread, and avoid the locking altogether.

    Hopefully that can make it in as a bugfix? At least for this bug, it most likely *is* the most appropriate fix -- though I can only fix it for pthread.

    @encukou
    Copy link
    Member

    encukou commented Jan 9, 2018

    WIP pull request: #5141
    (I'll not get back to this for a few days, sadly.)

    @matrixise
    Copy link
    Member

    Hi Petr,

    Do you continue this patch/issue?
    Thank you

    @encukou
    Copy link
    Member

    encukou commented May 15, 2018

    Not immediately, but it is on my TODO list.
    If anyone wants to tackle it in the mean time, I'd be happy to answer any questions

    @vstinner
    Copy link
    Member

    Oh, it seems like I was wrong in my previous comment. Python 2.7 code base is already designed to support native TLS. It's just that we only implement native TLS on Windows. So yeah, it seems doable to implement native TLS for pthread.

    History of Py_HAVE_NATIVE_TLS:

    commit 8d98d2c
    Author: Mark Hammond <mhammond@skippinet.com.au>
    Date: Sat Apr 19 15:41:53 2003 +0000

    New PyGILState_ API - implements PEP-311, from patch 684256.
    

    commit 00f2df4
    Author: Kristján Valur Jónsson <kristjan@ccpgames.com>
    Date: Fri Jan 9 20:03:27 2009 +0000

    bpo-3582.  Improved thread support and TLS for Windows
    

    @encukou
    Copy link
    Member

    encukou commented May 20, 2018

    pthread is not generally compatible with int, so it can't be used with Python 2's API.
    This was changed in py3 with PEP-539, which however makes it so that the old TLS API is only available if PTHREAD_KEY_T_IS_COMPATIBLE_WITH_INT.

    @vstinner
    Copy link
    Member

    the old TLS API is only available if PTHREAD_KEY_T_IS_COMPATIBLE_WITH_INT.

    PTHREAD_KEY_T_IS_COMPATIBLE_WITH_INT is defined on most (pthread) platforms, no? I understood that the PEP-539 is mostly designed for Cygwin, a platform which is not officially supported by Python. At least, PTHREAD_KEY_T_IS_COMPATIBLE_WITH_INT is set to 1 on my Fedora 27 (Linux).

    I propose to cast pthread_key_create() result to int, but only define PyThread_create_key() in Python/thread_pthread.h if PTHREAD_KEY_T_IS_COMPATIBLE_WITH_INT is defined.

    It means that the pthread implementation of Python would still have this bug (race condition) if PTHREAD_KEY_T_IS_COMPATIBLE_WITH_INT is not defined. But backporting the PEP-539 to Python 2.7 doesn't seem worth it.

    What do you think?

    @encukou
    Copy link
    Member

    encukou commented May 22, 2018

    PTHREAD_KEY_T_IS_COMPATIBLE_WITH_INT is defined on most (pthread) platforms, no?

    I propose to cast pthread_key_create() result to int, but only define PyThread_create_key() in Python/thread_pthread.h if PTHREAD_KEY_T_IS_COMPATIBLE_WITH_INT is defined.

    I don't think that's a good idea. Changing API, even for platforms that aren't officially supported, sounds very harsh this late in the release cycle.

    But! I suppose we could fix the bug only for platforms with PTHREAD_KEY_T_IS_COMPATIBLE_WITH_INT. Other platforms would keep the current implementation -- they'd still have the bug, but the API would stay unchanged.

    @vstinner
    Copy link
    Member

    But! I suppose we could fix the bug only for platforms with PTHREAD_KEY_T_IS_COMPATIBLE_WITH_INT.

    Yes, this is my proposal.

    > I propose to cast pthread_key_create() result to int, but only define PyThread_create_key() in Python/thread_pthread.h if PTHREAD_KEY_T_IS_COMPATIBLE_WITH_INT is defined.

    I don't think that's a good idea. Changing API, even for platforms that aren't officially supported, sounds very harsh this late in the release cycle.

    Which API change? I don't propose to modify the existing public C API "int PyThread_create_key(void)". I only propose to change it's implementation to the native pthread API when PTHREAD_KEY_T_IS_COMPATIBLE_WITH_INT is defined.

    @zware
    Copy link
    Member

    zware commented Apr 27, 2020

    As 2.7 is now EOL, I'm closing the issue.

    @zware zware closed this as completed Apr 27, 2020
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    interpreter-core (Objects, Python, Grammar, and Parser dirs) type-crash A hard crash of the interpreter, possibly with a core dump
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants