Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash During Subinterpreter Finalization #105699

Closed
ericsnowcurrently opened this issue Jun 12, 2023 · 9 comments
Closed

Crash During Subinterpreter Finalization #105699

ericsnowcurrently opened this issue Jun 12, 2023 · 9 comments
Assignees
Labels
3.12 bugs and security fixes 3.13 new features, bugs and security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-subinterpreters type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@ericsnowcurrently
Copy link
Member

ericsnowcurrently commented Jun 12, 2023

There's an isolation leak somewhere. It may be just in the _xxsubinterpreters module, but I suspect it's not.

See #99114 (comment).

Reproducers:

FYI, I see crashes on this fairly infrequently.

Linked PRs

@ericsnowcurrently ericsnowcurrently added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-crash A hard crash of the interpreter, possibly with a core dump topic-subinterpreters 3.12 bugs and security fixes 3.13 new features, bugs and security fixes labels Jun 12, 2023
@ericsnowcurrently
Copy link
Member Author

ericsnowcurrently commented Jun 12, 2023

Probably the same thing:

(AMD64 Arch Linux TraceRefs 3.12)

test_one (test.test__xxsubinterpreters.DestroyTests.test_one) ... ok
Objects/object.c:2211: _Py_ForgetReference: Assertion failed: invalid object chain
Enable tracemalloc to get the memory block allocation traceback
object address  : 0x7f6bf962d300
object refcount : 0
object type     : 0x55a320cd18c0
object type name: bytes
object repr     : <refcnt 0 at 0x7f6bf962d300>
Fatal Python error: _PyObject_AssertFailed: _PyObject_AssertFailed
Python runtime state: initialized
Current thread 0x00007f6c09e2e740 (most recent call first):
  <no Python frame>
Debug memory block at address p=0x7f6c09d55e90: API '�'
    18302063728033398269 bytes originally requested
    The 7 pad bytes at p-7 are not all FORBIDDENBYTE (0xfd):
        at p-7: 0xdd *** OUCH
        at p-6: 0xdd *** OUCH
        at p-5: 0xdd *** OUCH
        at p-4: 0xdd *** OUCH
        at p-3: 0xdd *** OUCH
        at p-2: 0xdd *** OUCH
        at p-1: 0xdd *** OUCH
    Because memory is corrupted at the start, the count of bytes requested
       may be bogus, and checking the trailing pad bytes may segfault.
    The 8 pad bytes at tail=0xfdfe7d6a07d35c8d are Fatal Python error: Segmentation fault
Current thread 0x00007f6c09e2e740 (most recent call first):
  <no Python frame>
Extension modules: _testcapi, _xxsubinterpreters, _xxinterpchannels (total: 3)
make: *** [Makefile:2015: buildbottest] Segmentation fault (core dumped)

@ericsnowcurrently
Copy link
Member Author

maybe related: #105690

ericsnowcurrently added a commit that referenced this issue Jun 14, 2023
This fixes a race during import. The existing _PyRuntimeState.imports.pkgcontext is shared between interpreters, and occasionally this would cause a crash when multiple interpreters were importing extensions modules at the same time.  To solve this we add a thread-local variable for the value.  We also leave the existing state (and infrequent race) in place for platforms that do not support thread-local variables.
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jun 14, 2023
…-105740)

This fixes a race during import. The existing _PyRuntimeState.imports.pkgcontext is shared between interpreters, and occasionally this would cause a crash when multiple interpreters were importing extensions modules at the same time.  To solve this we add a thread-local variable for the value.  We also leave the existing state (and infrequent race) in place for platforms that do not support thread-local variables.
(cherry picked from commit b87d288)

Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>
ericsnowcurrently added a commit that referenced this issue Jun 14, 2023
…) (gh-105765)

This fixes a race during import. The existing _PyRuntimeState.imports.pkgcontext is shared between interpreters, and occasionally this would cause a crash when multiple interpreters were importing extensions modules at the same time.  To solve this we add a thread-local variable for the value.  We also leave the existing state (and infrequent race) in place for platforms that do not support thread-local variables.
(cherry picked from commit b87d288)

Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>
@ericsnowcurrently
Copy link
Member Author

ericsnowcurrently commented Jul 19, 2023

After gh-106899, I'm only seeing 3 very infrequent crashers:

  1. failing assertion in _xxinterpreterchannelsmodule.c (not user-facing)
  2. bogus interned string state
  3. something bad in start_thread() (pthread)
(stack track for that last one)
Thread 1 (Thread 0x7f1c41efb700 (LWP 27875)):
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f1c4a6bd7f1 in __GI_abort () at abort.c:79
#2  0x00007f1c4a706837 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f1c4a833a7b "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007f1c4a70d8ba in malloc_printerr (str=str@entry=0x7f1c4a831c8e "free(): invalid size") at malloc.c:5342
#4  0x00007f1c4a818a6c in _int_free (have_lock=0, p=0x7f1c28001490, av=0x7f1c28000020) at malloc.c:4171
#5  __GI___libc_free (mem=0x7f1c280014a0) at malloc.c:3134
#6  tcache_thread_shutdown () at malloc.c:2979
#7  arena_thread_freeres () at arena.c:950
#8  0x00007f1c4a819562 in __libc_thread_freeres () at thread-freeres.c:29
#9  0x00007f1c4b21a700 in start_thread (arg=0x7f1c41efb700) at pthread_create.c:476
#10 0x00007f1c4a79e61f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

@ericsnowcurrently
Copy link
Member Author

With the 3 PRs I have up I don't see any more crashes (other than in _xxinterpchannels).

ericsnowcurrently added a commit that referenced this issue Jul 21, 2023
A static (process-global) str object must only have its "interned" state cleared when no longer interned in any interpreters.  They are the only ones that can be shared by interpreters so we don't have to worry about any other str objects.

We trigger clearing the state with the main interpreter, since no other interpreters may exist at that point and _PyUnicode_ClearInterned() is only called during interpreter finalization.

We do not address here the fact that a string will only be interned in the first interpreter that interns it.  In any subsequent interpreters str.state.interned is already set so _PyUnicode_InternInPlace() will skip it.  That needs to be addressed separately from fixing the crasher.
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jul 21, 2023
A static (process-global) str object must only have its "interned" state cleared when no longer interned in any interpreters.  They are the only ones that can be shared by interpreters so we don't have to worry about any other str objects.

We trigger clearing the state with the main interpreter, since no other interpreters may exist at that point and _PyUnicode_ClearInterned() is only called during interpreter finalization.

We do not address here the fact that a string will only be interned in the first interpreter that interns it.  In any subsequent interpreters str.state.interned is already set so _PyUnicode_InternInPlace() will skip it.  That needs to be addressed separately from fixing the crasher.
(cherry picked from commit 87e7cb0)

Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>
ericsnowcurrently added a commit that referenced this issue Jul 21, 2023
…106923)

There was a slight race in _Py_ClearFileSystemEncoding() (when called from _Py_SetFileSystemEncoding()), between freeing the value and setting the variable to NULL, which occasionally caused crashes when multiple isolated interpreters were used.  (Notably, I saw at least 10 different, seemingly unrelated spooky-action-at-a-distance, ways this crashed. Yay, free threading!)  We avoid the problem by only setting the global variables with the main interpreter (i.e. runtime init).
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jul 21, 2023
pythongh-106923)

There was a slight race in _Py_ClearFileSystemEncoding() (when called from _Py_SetFileSystemEncoding()), between freeing the value and setting the variable to NULL, which occasionally caused crashes when multiple isolated interpreters were used.  (Notably, I saw at least 10 different, seemingly unrelated spooky-action-at-a-distance, ways this crashed. Yay, free threading!)  We avoid the problem by only setting the global variables with the main interpreter (i.e. runtime init).
(cherry picked from commit 0ba07b2)

Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>
Yhg1s pushed a commit that referenced this issue Jul 21, 2023
…le (gh-106923) (#106964)

gh-105699: Fix a Crasher Related to a Deprecated Global Variable (gh-106923)

There was a slight race in _Py_ClearFileSystemEncoding() (when called from _Py_SetFileSystemEncoding()), between freeing the value and setting the variable to NULL, which occasionally caused crashes when multiple isolated interpreters were used.  (Notably, I saw at least 10 different, seemingly unrelated spooky-action-at-a-distance, ways this crashed. Yay, free threading!)  We avoid the problem by only setting the global variables with the main interpreter (i.e. runtime init).
(cherry picked from commit 0ba07b2)

Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>
Yhg1s pushed a commit that referenced this issue Jul 21, 2023
gh-105699: Fix an Interned Strings Crasher (gh-106930)

A static (process-global) str object must only have its "interned" state cleared when no longer interned in any interpreters.  They are the only ones that can be shared by interpreters so we don't have to worry about any other str objects.

We trigger clearing the state with the main interpreter, since no other interpreters may exist at that point and _PyUnicode_ClearInterned() is only called during interpreter finalization.

We do not address here the fact that a string will only be interned in the first interpreter that interns it.  In any subsequent interpreters str.state.interned is already set so _PyUnicode_InternInPlace() will skip it.  That needs to be addressed separately from fixing the crasher.
(cherry picked from commit 87e7cb0)

Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jul 22, 2023
…ythonGH-106966)

(cherry picked from commit adda43d)

Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>
ericsnowcurrently added a commit that referenced this issue Jul 25, 2023
…H-106966) (gh-107012)

gh-105699: Add some stress tests for subinterpreter creation (GH-106966)
(cherry picked from commit adda43d)

Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>
@ericsnowcurrently
Copy link
Member Author

This can be closed once gh-106974 lands.

@ericsnowcurrently
Copy link
Member Author

The stress tests I added are still crashing occasionally. I suspect most of the problem lies with the _xxsubinterpreters module, but I'm going to check. In the meantime I'm going to disable the two tests to cut down on noise as we approach 3.12rc1.

CC @Yhg1s

@ericsnowcurrently
Copy link
Member Author

Notably, I can reproduce a crash on the 3.12 branch but not on main. (./python -m test test_interpreters -m test_create_many_threaded)

ericsnowcurrently added a commit that referenced this issue Jul 27, 2023
The two tests are crashing periodically in CI and on buildbots.  I suspect the problem is in the _xxsubinterpreters module. 
 Regardless, I'm disabling the tests temporarily, to reduce the noise as we approach 3.12rc1.  I'll be investigating the crashes separately.
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jul 27, 2023
The two tests are crashing periodically in CI and on buildbots.  I suspect the problem is in the _xxsubinterpreters module.
 Regardless, I'm disabling the tests temporarily, to reduce the noise as we approach 3.12rc1.  I'll be investigating the crashes separately.
(cherry picked from commit 4f67921)

Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>
ericsnowcurrently added a commit that referenced this issue Jul 27, 2023
…h-107357)

gh-105699: Disable the Interpreters Stress Tests (gh-107354)

The two tests are crashing periodically in CI and on buildbots.  I suspect the problem is in the _xxsubinterpreters module.
 Regardless, I'm disabling the tests temporarily, to reduce the noise as we approach 3.12rc1.  I'll be investigating the crashes separately.
(cherry picked from commit 4f67921)

Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>
ericsnowcurrently added a commit that referenced this issue Jul 28, 2023
This fixes a crasher due to a race condition, triggered infrequently when two isolated (own GIL) subinterpreters simultaneously initialize their sys or builtins modules.  The crash happened due the combination of the "detached" thread state we were using and the "last holder" logic we use for the GIL.  It turns out it's tricky to use the same thread state for different threads.  Who could have guessed?

We solve the problem by eliminating the one object we were still sharing between interpreters.  We replace it with a low-level hashtable, using the "raw" allocator to avoid tying it to the main interpreter.

We also remove the accommodations for "detached" thread states, which were a dubious idea to start with.
ericsnowcurrently added a commit to ericsnowcurrently/cpython that referenced this issue Jul 28, 2023
…hongh-106974)

This fixes a crasher due to a race condition, triggered infrequently when two isolated (own GIL) subinterpreters simultaneously initialize their sys or builtins modules.  The crash happened due the combination of the "detached" thread state we were using and the "last holder" logic we use for the GIL.  It turns out it's tricky to use the same thread state for different threads.  Who could have guessed?

We solve the problem by eliminating the one object we were still sharing between interpreters.  We replace it with a low-level hashtable, using the "raw" allocator to avoid tying it to the main interpreter.

We also remove the accommodations for "detached" thread states, which were a dubious idea to start with.

(cherry picked from commit 8ba4df9)
ericsnowcurrently added a commit that referenced this issue Jul 28, 2023
…-106974) (gh-107412)

gh-105699: Use a _Py_hashtable_t for the PyModuleDef Cache (gh-106974)

This fixes a crasher due to a race condition, triggered infrequently when two isolated (own GIL) subinterpreters simultaneously initialize their sys or builtins modules.  The crash happened due the combination of the "detached" thread state we were using and the "last holder" logic we use for the GIL.  It turns out it's tricky to use the same thread state for different threads.  Who could have guessed?

We solve the problem by eliminating the one object we were still sharing between interpreters.  We replace it with a low-level hashtable, using the "raw" allocator to avoid tying it to the main interpreter.

We also remove the accommodations for "detached" thread states, which were a dubious idea to start with.

(cherry picked from commit 8ba4df9)
@ericsnowcurrently
Copy link
Member Author

I've disable the new stress tests as they were causing crashes regularly that I'm not convinced indicate bugs (outside _xxsubinterpreters). Once that's sorted out and I've re-enabled the tests, we can close this issue.

ericsnowcurrently added a commit that referenced this issue Aug 8, 2023
We had disabled them due to crashes they exposed, which have since been fixed.
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Aug 8, 2023
…thongh-107572)

We had disabled them due to crashes they exposed, which have since been fixed.
(cherry picked from commit f9e3ff1)

Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>
@ericsnowcurrently
Copy link
Member Author

I'm calling this good.

ericsnowcurrently added a commit that referenced this issue Nov 27, 2023
…h-107572) (#107783)

We had disabled them due to crashes they exposed, which have since been fixed.
(cherry picked from commit f9e3ff1)

Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>
Co-authored-by: Łukasz Langa <lukasz@langa.pl>
Co-authored-by: T. Wouters <thomas@python.org>
ericsnowcurrently added a commit to ericsnowcurrently/cpython that referenced this issue Nov 27, 2023
ericsnowcurrently added a commit that referenced this issue Nov 27, 2023
…rs Stress Tests" (gh-112474)

Revert "[3.12] gh-105699: Re-enable the Multiple-Interpreters Stress Tests (gh-107572) (#107783)"

This reverts commit a4aac7d.

The stress tests are still failing on FreeBSD.
ericsnowcurrently added a commit to ericsnowcurrently/cpython that referenced this issue Nov 28, 2023
…sts (pythongh-107572) (python#107783)

We had disabled them due to crashes they exposed, which have since been fixed.
(cherry picked from commit f9e3ff1)

Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>
Co-authored-by: Łukasz Langa <lukasz@langa.pl>
Co-authored-by: T. Wouters <thomas@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.12 bugs and security fixes 3.13 new features, bugs and security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-subinterpreters type-crash A hard crash of the interpreter, possibly with a core dump
Projects
Status: Done
Development

No branches or pull requests

1 participant