Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read and write of __class__ in two threads causes crash #120198

Closed
Fidget-Spinner opened this issue Jun 7, 2024 · 19 comments
Closed

Read and write of __class__ in two threads causes crash #120198

Fidget-Spinner opened this issue Jun 7, 2024 · 19 comments
Labels
type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@Fidget-Spinner
Copy link
Member

Fidget-Spinner commented Jun 7, 2024

Crash report

What happened?

See #120195 for crashing CI.
Apply diff from there and run

python -m test test_free_threading -m test_type

CPython versions tested on:

CPython main branch

Operating systems tested on:

No response

Output from running 'python -VV' on the command line:

No response

Linked PRs

@Fidget-Spinner Fidget-Spinner added the type-crash A hard crash of the interpreter, possibly with a core dump label Jun 7, 2024
@Eclips4
Copy link
Member

Eclips4 commented Jun 7, 2024

Is this a free threaded build?

@Fidget-Spinner
Copy link
Member Author

Fidget-Spinner commented Jun 7, 2024

Is this a free threaded build?

Both builds (default and free-threaded).

@Fidget-Spinner
Copy link
Member Author

FWIW I cannot repro locally on my own machine, but it crashes the CI.

@Eclips4
Copy link
Member

Eclips4 commented Jun 7, 2024

FWIW I cannot repro locally on my own machine, but it crashes the CI.

If I run this code, no crash occurs:

import threading

NTHREADS = 6

def test___class___modification():
    class Foo:
        pass

    class Bar:
        pass

    thing = Foo()
    def work():
        foo = thing
        for _ in range(10000):
            foo.__class__ = Bar
            type(foo)
            foo.__class__ = Foo
            type(foo)


    threads = []
    for i in range(NTHREADS):
        thread = threading.Thread(target=work)
        thread.start()
        threads.append(thread)

    for thread in threads:
        thread.join()

test___class___modification()

But if I apply your changes from the #120195 and run test_free_threading.test_type it will crash:

./python.exe -m test -v test_free_threading -m test_type
== CPython 3.14.0a0 (heads/main:6b606522ca, Jun 7 2024, 08:22:12) [Clang 15.0.0 (clang-1500.3.9.4)]
== macOS-14.5-arm64-arm-64bit-Mach-O little-endian
== Python build: debug
== cwd: /Users/admin/Projects/cpython/build/test_python_worker_71431æ
== CPU count: 8
== encodings: locale=UTF-8 FS=utf-8
== resources: all test resources are disabled, use -u option to unskip tests

Using random seed: 3124489832
0:00:00 load avg: 3.31 Run 1 test sequentially in a single process
0:00:00 load avg: 3.31 [1/1] test_free_threading
test___class___modification (test.test_free_threading.test_type.TestType.test___class___modification) ... ok
test_attr_cache (test.test_free_threading.test_type.TestType.test_attr_cache) ... ok
test_attr_cache_consistency (test.test_free_threading.test_type.TestType.test_attr_cache_consistency) ... ok
test_attr_cache_consistency_subclass (test.test_free_threading.test_type.TestType.test_attr_cache_consistency_subclass) ... ok

----------------------------------------------------------------------
Ran 4 tests in 10.804s

OK
Python/gc.c:94: gc_decref: Assertion "gc_get_refs(g) > 0" failed: refcount is too small
Enable tracemalloc to get the memory block allocation traceback

object address  : 0x133073a30
object refcount : 1
object type     : 0x10323a798
object type name: type
object repr     : <class 'test.test_free_threading.test_type.TestType.test___class___modification.<locals>.Bar'>

Fatal Python error: _PyObject_AssertFailed: _PyObject_AssertFailed
Python runtime state: initialized

Current thread 0x00000001f61d0c00 (most recent call first):
  Garbage-collecting
  File "/Users/admin/Projects/cpython/Lib/test/support/__init__.py", line 814 in gc_collect
  File "/Users/admin/Projects/cpython/Lib/test/libregrtest/single.py", line 144 in _load_run_test
  File "/Users/admin/Projects/cpython/Lib/test/libregrtest/single.py", line 181 in _runtest_env_changed_exc
  File "/Users/admin/Projects/cpython/Lib/test/libregrtest/single.py", line 281 in _runtest
  File "/Users/admin/Projects/cpython/Lib/test/libregrtest/single.py", line 310 in run_single_test
  File "/Users/admin/Projects/cpython/Lib/test/libregrtest/main.py", line 363 in run_test
  File "/Users/admin/Projects/cpython/Lib/test/libregrtest/main.py", line 397 in run_tests_sequentially
  File "/Users/admin/Projects/cpython/Lib/test/libregrtest/main.py", line 541 in _run_tests
  File "/Users/admin/Projects/cpython/Lib/test/libregrtest/main.py", line 576 in run_tests
  File "/Users/admin/Projects/cpython/Lib/test/libregrtest/main.py", line 739 in main
  File "/Users/admin/Projects/cpython/Lib/test/libregrtest/main.py", line 747 in main
  File "/Users/admin/Projects/cpython/Lib/test/__main__.py", line 2 in <module>
  File "/Users/admin/Projects/cpython/Lib/runpy.py", line 88 in _run_code
  File "/Users/admin/Projects/cpython/Lib/runpy.py", line 198 in _run_module_as_main

Extension modules: _testinternalcapi, _testcapi (total: 2)
zsh: abort      ./python.exe -m test -v test_free_threading -m test_type

@Fidget-Spinner
Copy link
Member Author

Oh yeah I get it only in the test too. Thanks!

@Fidget-Spinner
Copy link
Member Author

Well I'm not sure how to fix this. So it's up for grabs for anyone to take!

@Zheaoli
Copy link
Contributor

Zheaoli commented Jun 8, 2024

I can reproduce on Linux/x86_64, would you mind I take a try?

@Fidget-Spinner
Copy link
Member Author

@Zheaoli Yeah sure go ahead!

@Zheaoli
Copy link
Contributor

Zheaoli commented Jun 8, 2024

Seems the bugs root cause here https://github.com/python/cpython/blob/main/Lib/test/libregrtest/setup.py#L73-L77

Can reproduced by the code following below

import threading
import sys

NTHREADS = 20
def _test_audit_hook(name, args):
    # print(name, args)
    pass
sys.addaudithook(_test_audit_hook)
class A:
    @classmethod
    def test___class___modification(self):
        class Foo:
            pass

        class Bar:
            pass

        thing = Foo()
        def work():
            foo = thing
            for _ in range(100000):
                foo.__class__ = Bar
                type(foo)
                foo.__class__ = Foo
                type(foo)


        threads = []
        for i in range(NTHREADS):
            thread = threading.Thread(target=work)
            thread.start()
            threads.append(thread)

        for thread in threads:
            thread.join()

A.test___class___modification()

Let me find more detail about the race condition here

@Zheaoli
Copy link
Contributor

Zheaoli commented Jun 8, 2024

This bug is also in Python 3.9 ~ now

@Zheaoli
Copy link
Contributor

Zheaoli commented Jun 9, 2024

Update status:

I may need some help here.

For now, I have observed some interesting detail

  1. The crash will disappear if we add extra code like(print) in the hook function
  2. Comment https://github.com/python/cpython/blob/main/Python/sysmodule.c#L282 this line, and the crash will disappear
  3. the crash will disappear too if I use watchpoint in the gdb to watch the object ob_refcnt filed changed event.....
  4. This bug exist in origin code for PEP 578

I think this is a race condition issue for the sys_audit. But I may need some help to dive deeper into this issue...

cc @zooba for #12613

@picnixz
Copy link
Contributor

picnixz commented Jun 9, 2024

FTR, I got a crash with a print(1) in the audit hook and only 1000 runs and 4 threads (not a constant crash but still with the same error). I'll investigate on my side.

@Fidget-Spinner
Copy link
Member Author

@Zheaoli it might not be a threading issue. I wrapped the function you sent with _PyEval_StopTheWorld(is); _PyEval_StartTheWorld(is); and it still crashes. Usually that's strong enough to prevent any races, so I'm thinking the threads might be exposing something else.

BTW, I checked the object_set_class code and couldn't find any issues there. This is where it's calling the audit hooks.

if (PySys_Audit("object.__setattr__", "OsO",
. If I comment that out, it passes too. So this suggests there's probably indeed an issue with audit.

@Zheaoli
Copy link
Contributor

Zheaoli commented Jun 9, 2024

So this suggests there's probably indeed an issue with audit.

Yep, https://github.com/python/cpython/blob/main/Python/sysmodule.c#L282 the crash because of here...

@zooba
Copy link
Member

zooba commented Jun 10, 2024

@Fidget-Spinner has it right. object_set_class is capturing the object's type and then calling the hook. The hook goes back into Python, which may release the GIL and let another thread come in and change the object's type. Then after the hook, it continues with the old type.

PyTypeObject *oldto = Py_TYPE(self);

The fix is to move this line to below where the hook is called (we do parameter validation before invoking the hook on purpose). Anyone want to make a PR?

@Fidget-Spinner
Copy link
Member Author

I have a PR up already, so I will reuse it for the fix, but I will attribute @Zheaoli as co-author for it thanks to their significant help triaging the issue down to audit.

miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jun 11, 2024
…dit hook active (pythonGH-120195)

(cherry picked from commit 203565b)

Co-authored-by: Ken Jin <kenjin@python.org>
zooba pushed a commit that referenced this issue Jun 11, 2024
…ok active (GH-120195)

(cherry picked from commit 203565b)

Co-authored-by: Ken Jin <kenjin@python.org>
@encukou
Copy link
Member

encukou commented Jun 12, 2024

It seems that this started a reference leak on the 3.13 buildbots, see e.g. https://buildbot.python.org/all/#/builders/1396/builds/138

@zooba
Copy link
Member

zooba commented Jun 12, 2024

I only see a test failure on a single macOS nogil buildbot, and all the rest are fine? And I'd expect this new test to expose threading issues, because that's what it's for.

The fix has certainly not introduced a reference leak, so if anything is going to get reverted, it should just be the test.

@Fidget-Spinner
Copy link
Member Author

I just fixed the problem on free threading, so the problem is gone.

The refleak buildbot was failing not because of an actual refleak but it ran the test long enough to always segfault. So I fixed it on the free threading buildbot as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type-crash A hard crash of the interpreter, possibly with a core dump
Projects
None yet
Development

No branches or pull requests

6 participants