
gh-128942: make arraymodule.c free-thread safe (lock-free) #130771

Open · wants to merge 17 commits into base: main

Conversation

@tom-pytel (Contributor) commented Mar 2, 2025:

I added lock-free single-element reads and writes, mostly by copying the list object's homework. TL;DR: pyperformance scimark seems to be back to about what it was without the free-thread-safe stuff (pending confirmation, of course). I tried a few other things, but the list strategy seems good enough (except for the negative-index thing I mentioned in #130744, if that is an issue).
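
For anyone not familiar with the list approach being borrowed here, below is a minimal standalone sketch (placeholder names and types, not this PR's actual code) of the lock-free single-element pattern: readers and writers touch individual slots with relaxed atomic loads and stores, so no per-access lock is taken.

/* sketch.c -- illustrative only: lock-free single-element reads/writes
 * on a shared buffer using C11 relaxed atomics (compile with -pthread). */
#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

#define N 16

static _Atomic long items[N];   /* stands in for the array's item buffer */

static void *writer(void *arg)
{
    (void)arg;
    for (long i = 0; i < 100000; i++) {
        /* a plain store instruction on mainstream CPUs, but well-defined under C11 */
        atomic_store_explicit(&items[i % N], i, memory_order_relaxed);
    }
    return NULL;
}

static void *reader(void *arg)
{
    (void)arg;
    long sum = 0;
    for (long i = 0; i < 100000; i++) {
        sum += atomic_load_explicit(&items[i % N], memory_order_relaxed);
    }
    printf("reader done, sum=%ld\n", sum);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, writer, NULL);
    pthread_create(&t2, NULL, reader, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

In the sketch each access is just a plain load or store on x86-64 and AArch64, which is why this pattern keeps single-threaded timings close to the unlocked baseline.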

Timings: the relevant rows are "OLD" (the non-free-thread-safe arraymodule), "SLOW" (the previous, slower PR), and the last two, "LFREERW".

### scimark_fft ###
WITHGIL: Mean +- std dev: 200 ms +- 6 ms
OLD:     Mean +- std dev: 238 ms +- 7 ms
SLOW:    Mean +- std dev: 274 ms +- 4 ms
ITEMFIX: Mean +- std dev: 272 ms +- 14 ms
NOLOCK:  Mean +- std dev: 240 ms +- 4 ms
TEST1:   Mean +- std dev: 267 ms +- 4 ms
TEST2:   Mean +- std dev: 255 ms +- 6 ms
LFREERD: Mean +- std dev: 256 ms +- 7 ms   # left in standard Py_SIZE() in getarrayitem
LFREERD: Mean +- std dev: 262 ms +- 7 ms   # atomic read of size in getarrayitem
LFREERD: Mean +- std dev: 270 ms +- 7 ms   # build and run again
LFREERD: Mean +- std dev: 255 ms +- 10 ms  # orthodox
LFREERD: Mean +- std dev: 259 ms +- 8 ms   # build and run again
LFREERW: Mean +- std dev: 239 ms +- 7 ms   # lockfree write single element
LFREERW: Mean +- std dev: 242 ms +- 6 ms   # build and run again
STATICL: Mean +- std dev: 223 ms +- 6 ms   # statically linked
STATICL: Mean +- std dev: 223 ms +- 5 ms   # build and run again
CLEANUP: Mean +- std dev: 224 ms +- 5 ms   # including atomic get/set
AMEMCPY: Mean +- std dev: 226 ms +- 6 ms   # atomic item copy on resize (instead of memcpy)
AMEMCPY: Mean +- std dev: 222 ms +- 5 ms   # build and run again
LOCKWR:  Mean +- std dev: 226 ms +- 4 ms   # locked writes again, but much better with static linking
LOCKWR:  Mean +- std dev: 226 ms +- 5 ms   # build and run again

### scimark_lu ###
WITHGIL: Mean +- std dev: 64.6 ms +- 1.9 ms
OLD:     Mean +- std dev: 89.0 ms +- 2.1 ms
SLOW:    Mean +- std dev: 95.4 ms +- 3.1 ms
ITEMFIX: Mean +- std dev: 92.0 ms +- 2.1 ms
NOLOCK:  Mean +- std dev: 88.5 ms +- 2.5 ms
TEST1:   Mean +- std dev: 89.7 ms +- 2.1 ms
TEST2:   Mean +- std dev: 90.5 ms +- 2.4 ms
LFREERD: Mean +- std dev: 89.9 ms +- 2.7 ms
LFREERD: Mean +- std dev: 90.9 ms +- 2.6 ms
LFREERD: Mean +- std dev: 93.4 ms +- 4.0 ms
LFREERD: Mean +- std dev: 91.9 ms +- 3.7 ms
LFREERD: Mean +- std dev: 91.0 ms +- 5.0 ms
LFREERW: Mean +- std dev: 89.2 ms +- 2.8 ms
LFREERW: Mean +- std dev: 88.8 ms +- 2.5 ms
STATICL: Mean +- std dev: 86.9 ms +- 1.9 ms
STATICL: Mean +- std dev: 87.1 ms +- 2.2 ms
CLEANUP: Mean +- std dev: 88.9 ms +- 2.5 ms
AMEMCPY: Mean +- std dev: 87.9 ms +- 1.9 ms
AMEMCPY: Mean +- std dev: 88.7 ms +- 4.8 ms
LOCKWR:  Mean +- std dev: 88.0 ms +- 2.4 ms
LOCKWR:  Mean +- std dev: 88.5 ms +- 3.5 ms

### scimark_monte_carlo ###
WITHGIL: Mean +- std dev: 40.1 ms +- 2.3 ms
OLD:     Mean +- std dev: 53.4 ms +- 1.0 ms
SLOW:    Mean +- std dev: 56.8 ms +- 1.4 ms
ITEMFIX: Mean +- std dev: 55.9 ms +- 0.7 ms
NOLOCK:  Mean +- std dev: 54.4 ms +- 1.3 ms
TEST1:   Mean +- std dev: 54.1 ms +- 0.6 ms
TEST2:   Mean +- std dev: 54.1 ms +- 0.6 ms
LFREERD: Mean +- std dev: 54.3 ms +- 0.9 ms
LFREERD: Mean +- std dev: 55.9 ms +- 0.9 ms
LFREERD: Mean +- std dev: 55.4 ms +- 1.1 ms
LFREERD: Mean +- std dev: 55.4 ms +- 1.1 ms
LFREERD: Mean +- std dev: 55.2 ms +- 1.5 ms
LFREERW: Mean +- std dev: 53.2 ms +- 1.1 ms
LFREERW: Mean +- std dev: 53.9 ms +- 1.5 ms
STATICL: Mean +- std dev: 55.0 ms +- 1.9 ms
STATICL: Mean +- std dev: 53.3 ms +- 1.0 ms
CLEANUP: Mean +- std dev: 52.6 ms +- 0.5 ms
AMEMCPY: Mean +- std dev: 53.3 ms +- 0.6 ms
AMEMCPY: Mean +- std dev: 53.4 ms +- 1.2 ms
LOCKWR:  Mean +- std dev: 55.4 ms +- 4.0 ms
LOCKWR:  Mean +- std dev: 54.8 ms +- 1.8 ms

### scimark_sor ###
WITHGIL: Mean +- std dev: 70.3 ms +- 3.3 ms
OLD:     Mean +- std dev: 98.1 ms +- 1.8 ms
SLOW:    Mean +- std dev: 99.9 ms +- 1.1 ms
ITEMFIX: Mean +- std dev:  100 ms +- 1 ms
NOLOCK:  Mean +- std dev: 97.3 ms +- 1.4 ms
TEST1:   Mean +- std dev: 97.5 ms +- 0.7 ms
TEST2:   Mean +- std dev: 98.5 ms +- 1.5 ms
LFREERD: Mean +- std dev: 97.0 ms +- 2.0 ms
LFREERD: Mean +- std dev: 98.2 ms +- 1.4 ms
LFREERD: Mean +- std dev: 98.4 ms +- 2.7 ms
LFREERD: Mean +- std dev: 97.6 ms +- 1.5 ms
LFREERD: Mean +- std dev: 97.9 ms +- 1.8 ms
LFREERW: Mean +- std dev: 97.2 ms +- 1.6 ms
LFREERW: Mean +- std dev: 96.9 ms +- 2.0 ms
STATICL: Mean +- std dev: 97.8 ms +- 1.9 ms
STATICL: Mean +- std dev: 97.2 ms +- 1.7 ms
CLEANUP: Mean +- std dev: 97.7 ms +- 1.2 ms
AMEMCPY: Mean +- std dev: 96.0 ms +- 2.0 ms
AMEMCPY: Mean +- std dev: 96.4 ms +- 1.5 ms
LOCKWR:  Mean +- std dev: 97.7 ms +- 1.5 ms
LOCKWR:  Mean +- std dev: 96.8 ms +- 2.5 ms

### scimark_sparse_mat_mult ###
WITHGIL: Mean +- std dev: 2.89 ms +- 0.11 ms
OLD:     Mean +- std dev: 3.94 ms +- 0.15 ms
SLOW:    Mean +- std dev: 4.68 ms +- 0.07 ms
ITEMFIX: Mean +- std dev: 4.34 ms +- 0.07 ms
NOLOCK:  Mean +- std dev: 3.94 ms +- 0.58 ms
TEST1:   Mean +- std dev: 3.99 ms +- 0.12 ms
TEST2:   Mean +- std dev: 3.80 ms +- 0.12 ms
LFREERD: Mean +- std dev: 3.84 ms +- 0.17 ms
LFREERD: Mean +- std dev: 3.83 ms +- 0.15 ms
LFREERD: Mean +- std dev: 3.94 ms +- 0.12 ms
LFREERD: Mean +- std dev: 3.74 ms +- 0.12 ms
LFREERD: Mean +- std dev: 3.83 ms +- 0.19 ms
LFREERW: Mean +- std dev: 3.84 ms +- 0.19 ms
LFREERW: Mean +- std dev: 3.87 ms +- 0.17 ms
STATICL: Mean +- std dev: 3.37 ms +- 0.08 ms
STATICL: Mean +- std dev: 3.49 ms +- 0.12 ms
CLEANUP: Mean +- std dev: 3.45 ms +- 0.14 ms
AMEMCPY: Mean +- std dev: 3.44 ms +- 0.13 ms
AMEMCPY: Mean +- std dev: 3.46 ms +- 0.14 ms
LOCKWR:  Mean +- std dev: 3.52 ms +- 0.47 ms
LOCKWR:  Mean +- std dev: 3.44 ms +- 0.10 ms

@tom-pytel (Contributor, Author):

ping @colesbury

@ZeroIntensity (Member) left a comment:

Nice! Disclaimer: I'm not an expert on the FT list implementation, so take some of my comments with a grain of salt.

Seeing good single-threaded performance is nice, but what about multi-threaded scaling? The number of locks still here scares me a little; it would be nice if this scaled well for concurrent use too, especially for operations that don't require concurrent writes (e.g., comparisons and copies).

@tom-pytel (Contributor, Author):

Note: this is not ready to go; there is a memory issue that still needs resolving.

@tom-pytel (Contributor, Author):

@ZeroIntensity you can remove the do-not-merge label; it's not an arraymodule issue, it's a QSBR issue, see #130794.

@tom-pytel (Contributor, Author):

The main thing needed here for acceptance is a benchmark run, which I am not able to start (I only did a local pyperformance check against main), so someone with access will have to initiate one to compare against main.

@colesbury self-requested a review March 4, 2025 16:22
@colesbury (Contributor) left a comment:

I haven't gotten a chance to look through arraymodule.c yet. I'll review that later this week.

@@ -117,7 +117,7 @@ extern wchar_t *_PyMem_DefaultRawWcsdup(const wchar_t *str);
extern int _PyMem_DebugEnabled(void);

// Enqueue a pointer to be freed possibly after some delay.
-extern void _PyMem_FreeDelayed(void *ptr);
+PyAPI_FUNC(void) _PyMem_FreeDelayed(void *ptr);
Contributor:

Let's move arraymodule from Setup.stdlib.in to Setup.bootstrap.in so that it's statically linked to CPython. That will:

  1. avoid the need to make this function visible externally, and
  2. make uses of the critical section API faster, because PyThreadState_GET() will be faster.

Contributor (Author):

Are there possible repercussions for some weird software out there that depends on array being an external module? Like some instrumentation package or something similar?

@colesbury (Contributor) left a comment:

The overall approach here seems good. A few comments below.

@@ -1673,5 +1677,266 @@ def test_gh_128961(self):
        self.assertRaises(StopIteration, next, it)


class FreeThreadingTest(unittest.TestCase):
Contributor:

Let's add test_array to TSAN_TESTS in Lib/test/libregrtest/tsan.py. The tests should be able to pass with thread sanitizer.

@tom-pytel (Contributor, Author) Mar 5, 2025:

Thing is, I don't think the test infrastructure used here and/or the other tests (global vars, probably more the latter) are parallel-safe yet (will look into it):

$ ./python -m test --parallel-threads=4 -j4 test_array
Using random seed: 2880345231
0:00:00 load avg: 0.37 Run 1 test in parallel using 1 worker process
0:00:02 load avg: 0.37 [1/1/1] test_array failed (35 errors, 17 failures)
Warning -- Uncaught thread exception: IndexErrorWarning -- Uncaught thread exception: IndexError

Exception in thread test_cmp (test.test_array.ByteTest.test_cmp)-0:
Exception in thread test_cmp (test.test_array.ByteTest.test_cmp)-2:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/tom/work/cpython/arraylockfree/cp/Lib/threading.py", line 1054, in _bootstrap_inner
    self.run()
    ~~~~~~~~^^
  File "/home/tom/work/cpython/arraylockfree/cp/Lib/threading.py", line 996, in run
    self._target(*self._args, **self._kwargs)
    ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tom/work/cpython/arraylockfree/cp/Lib/threading.py", line 1054, in _bootstrap_inner
    self.run()
    ~~~~~~~~^^
  File "/home/tom/work/cpython/arraylockfree/cp/Lib/test/libregrtest/parallel_case.py", line 24, in run_worker
    test_case.run(result)
    ~~~~~~~~~~~~~^^^^^^^^
  File "/home/tom/work/cpython/arraylockfree/cp/Lib/unittest/case.py", line 664, in run
    self.doCleanups()
    ~~~~~~~~~~~~~~~^^
  File "/home/tom/work/cpython/arraylockfree/cp/Lib/threading.py", line 996, in run
    self._target(*self._args, **self._kwargs)
    ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tom/work/cpython/arraylockfree/cp/Lib/test/libregrtest/parallel_case.py", line 24, in run_worker
    test_case.run(result)
    ~~~~~~~~~~~~~^^^^^^^^
  File "/home/tom/work/cpython/arraylockfree/cp/Lib/unittest/case.py", line 664, in run
    self.doCleanups()
    ~~~~~~~~~~~~~~~^^
  File "/home/tom/work/cpython/arraylockfree/cp/Lib/unittest/case.py", line 695, in doCleanups
    function, args, kwargs = self._cleanups.pop()
                             ~~~~~~~~~~~~~~~~~~^^
  File "/home/tom/work/cpython/arraylockfree/cp/Lib/unittest/case.py", line 695, in doCleanups
    function, args, kwargs = self._cleanups.pop()
                             ~~~~~~~~~~~~~~~~~~^^
IndexError: pop from empty list
IndexError: pop from empty list
Warning -- Uncaught thread exception: IndexError
Exception in thread test_count (test.test_array.ByteTest.test_count)-3:
...

Contributor:

To be clear, I am not suggesting using --parallel-threads=N. Just adding it to TSAN_TESTS so it's run with a Python built with TSAN. To get the same effect locally, you would build Python with TSAN and run:

./python -m test --tsan -j0

Which runs the tests in the TSAN_TESTS list. It doesn't run with --parallel-threads=N.

The --parallel-threads=N option is of limited use right now. It doesn't work with warnings (yet), and may not work with some cleanups.

Contributor (Author):

Well, it did help me find two races, so it is helpful. I still need to add extra individual tests for the standard TSAN run.

@unittest.skipUnless(support.Py_GIL_DISABLED, 'this test can only possibly fail with GIL disabled')
@threading_helper.reap_threads
@threading_helper.requires_working_threading()
def test_free_threading(self):
Contributor:

This test is not catching some basic data races. See https://gist.github.com/colesbury/358feeb3a9fc628f63f5d22a2f33916d for an example.

I think you'll want:

  1. More subscript accesses (reads or writes) concurrent with other operations, especially things that resize the array
  2. Do things in a short loop, not just once.

@@ -130,9 +144,89 @@ enum machine_format_code {

#define array_Check(op, state) PyObject_TypeCheck(op, state->ArrayType)

#ifdef Py_GIL_DISABLED
typedef struct {
Contributor:

I think we should probably use the same structure for the free threading and default builds. We didn't do that in PyListObject because it was constrained by the public API.

What I mean is something like:

typedef struct arrayobject {
    PyObject_VAR_HEAD
    _PyArrayArray *data;
    const struct arraydescr *ob_descr;
    PyObject *weakreflist; /* List of weak references */
    Py_ssize_t ob_exports;  /* Number of exported buffers */
} arrayobject;

And accesses to ob_item would look like arr->data->ob_item, and the capacity like arr->data->allocated. So there is no more need to mess with _Py_CONTAINER_OF(), and allocated is stored in only one place (in _PyArrayArray.allocated).

_PyArrayArray is a bit of a confusing name, and there are already other functions with the array_array_ prefix in this file. Maybe PyArrayData or PyArrayItems?
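
For context, a rough sketch of what such a data struct could look like if it mirrors the _PyListArray idea from listobject.c (illustrative only: the names, fields, and the omitted object header are placeholders, not definitions from this PR):

/* Illustrative sketch: capacity and item storage live in one heap block, so a
 * single pointer swap publishes a consistent (capacity, items) pair to
 * lock-free readers. */
#include <stddef.h>

typedef struct {
    ptrdiff_t allocated;   /* stands in for Py_ssize_t: the only copy of the capacity */
    char ob_item[];        /* raw item storage, itemsize * allocated bytes */
} PyArrayData_sketch;

typedef struct {
    /* ... PyObject_VAR_HEAD, ob_descr, weakreflist, ob_exports omitted ... */
    PyArrayData_sketch *data;   /* replaced as a whole on resize */
} arrayobject_sketch;

With a layout like this, element access would go through arr->data->ob_item, and a resize would allocate a new block, copy the items over, publish the new data pointer, and hand the old block to a delayed free (which is where _PyMem_FreeDelayed and QSBR would come in).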

Contributor (Author):

I went for minimal reasonable changes, but if you want the full cleanup I can certainly do that.

@@ -231,37 +350,37 @@ b_setitem(arrayobject *ap, Py_ssize_t i, PyObject *v)
return -1;
}
if (i >= 0)
-    ((char *)ap->ob_item)[i] = (char)x;
+    ((char *)ob_item)[i] = (char)x;
Contributor:

I think these need to be FT_ATOMIC_STORE_CHAR_RELAXED and similar.

Contributor (Author):

Maybe not here, since these can be called multiple times from a single locked function; better to do acquire/release (for set) in the calling lock-free function?

@tom-pytel (Contributor, Author) commented Mar 5, 2025:

The actual arraymodule cleanup is fine, just one thing you should pay attention to when looking at this:

data->items is not a deref but pointer arithmetic, so it is safe to use when data is NULL in memcpy calls and in _PyBytes_Repeat, as long as the associated size is 0. But there are places where this might not be safe, which are accommodated, specifically:

  • array_array_buffer_info_impl
  • array_array_tobytes_impl
  • array_array_fromunicode_impl
  • array_array_tounicode_impl
  • array_buffer_getbuf_lock_held

Are there any other places where this needs to take place?

It's the test, and trying to run it with --parallel-threads under TSAN, that's problematic.

  • warnings doesn't play nicely with --parallel-threads; I found a least-evil solution, but it's not wonderful. This still gets "env changed" but at least it can run. Is there a way to check for parallel-threads other than this?
    def setUp(self):
        if (not support.Py_GIL_DISABLED or
                any('"parallel_threads": null' in a for a in sys.argv) or
                all('parallel_threads' not in a for a in sys.argv)):
            self.enterContext(warnings.catch_warnings())
  • --parallel-threads doesn't play well with the threaded FreeThreadingTest; I get dangling threads sometimes.

  • ./Include/cpython/pyatomic_gcc.h:615:3: warning: ‘atomic_thread_fence’ is not supported with ‘-fsanitize=thread’ [-Wtsan]
    Do we care?

  • And worst of all: I cleaned up the data races in arraymodule (as far as I can see). I used to be able to run test stuff with suppressions set without problems, but with test_array I get:

$ ./python -m test --parallel-threads=4 -j4 test_array
Using random seed: 2354862211
0:00:00 load avg: 0.55 Run 1 test in parallel using 1 worker process
0:00:30 load avg: 0.80 running (1): test_array (30.1 sec)
0:00:41 load avg: 0.75 [1/1/1] test_array worker non-zero exit code (Exit code 66)
==================

WARNING: ThreadSanitizer: data race (pid=1075807)
  Read of size 8 at 0x7f58e22c0098 by thread T4:
    #0 _Py_TYPE Include/object.h:268 (python+0x22a21a)
    #1 Py_IS_TYPE Include/object.h:291 (python+0x22a21a)
    #2 compare_unicode_unicode_threadsafe Objects/dictobject.c:1413 (python+0x22a21a)
    #3 do_lookup Objects/dictobject.c:1010 (python+0x22bb38)
    #4 unicodekeys_lookup_unicode_threadsafe Objects/dictobject.c:1438 (python+0x22bd9b)
    #5 _Py_dict_lookup_threadsafe Objects/dictobject.c:1493 (python+0x232e50)
    #6 _PyDict_GetItemRef_KnownHash Objects/dictobject.c:2342 (python+0x233f39)
    #7 PyDict_GetItemRef Objects/dictobject.c:2378 (python+0x234070)
    #8 cache_struct_converter Modules/_struct.c:2524 (_struct.cpython-314td-x86_64-linux-gnu.so+0xc916)
    #9 calcsize Modules/clinic/_struct.c.h:249 (_struct.cpython-314td-x86_64-linux-gnu.so+0xceb0)
    #10 cfunction_vectorcall_O Objects/methodobject.c:523 (python+0x25961c)
    #11 _PyObject_VectorcallTstate Include/internal/pycore_call.h:167 (python+0x1896b3)
    #12 PyObject_Vectorcall Objects/call.c:327 (python+0x189810)
    #13 _PyEval_EvalFrameDefault Python/generated_cases.c.h:1371 (python+0x42310d)
    ...

  Previous write of size 8 at 0x7f58e22c0098 by thread T1:
    #0 memset ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:799 (libtsan.so.0+0x614cb)
    #1 memset ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:797 (libtsan.so.0+0x614cb)
    #2 memset /usr/include/x86_64-linux-gnu/bits/string_fortified.h:59 (python+0x274169)
    #3 fill_mem_debug Objects/obmalloc.c:2739 (python+0x274169)
    #4 _PyMem_DebugRawAlloc Objects/obmalloc.c:2823 (python+0x27429d)
    #5 _PyMem_DebugRawMalloc Objects/obmalloc.c:2839 (python+0x2742d6)
    #6 _PyMem_DebugMalloc Objects/obmalloc.c:3004 (python+0x274331)
    #7 PyObject_Malloc Objects/obmalloc.c:1412 (python+0x296e92)
    #8 PyUnicode_New Objects/unicodeobject.c:1407 (python+0x31fd7d)
    #9 PyUnicode_Concat Objects/unicodeobject.c:11647 (python+0x335e53)
    #10 PyNumber_Add Objects/abstract.c:1135 (python+0x14e11c)
    #11 _PyEval_EvalFrameDefault Python/generated_cases.c.h:62 (python+0x41b60b)
    ...

WARNING: ThreadSanitizer: data race (pid=1079529)
  Write of size 8 at 0x7f0050cbd0f0 by thread T17:
    #0 descr_get_qualname Objects/descrobject.c:624 (python+0x1ac687)
    #1 getset_get Objects/descrobject.c:193 (python+0x1aaf41)
    #2 _PyObject_GenericGetAttrWithDict Objects/object.c:1694 (python+0x268a4e)
    ...

  Previous write of size 8 at 0x7f0050cbd0f0 by thread T20:
    #0 descr_get_qualname Objects/descrobject.c:624 (python+0x1ac687)
    #1 getset_get Objects/descrobject.c:193 (python+0x1aaf41)
    #2 _PyObject_GenericGetAttrWithDict Objects/object.c:1694 (python+0x268a4e)
    ...

Which is not arraymodule stuff and is not suppressed. So as of now it's not runnable cleanly under TSAN (at least on my system, the way I am running it).

I left the bad setUp() in test_array for now so you can see it, but it will be removed (or fixed).

@colesbury (Contributor) commented Mar 5, 2025:

I'd like test_array added to TSAN_TESTS, not TSAN_PARALLEL_TESTS.

warnings doesn't play nice with --parallel-threads

Yes, warnings is not thread-safe (even with the GIL). Neil is working on making it thread-safe, but we shouldn't add the test to TSAN_PARALLEL_TESTS for now. It's not worth trying to work around the warnings stuff.

@tom-pytel (Contributor, Author):

OK, I've been playing with the races and have been able to shut up the ones between reads and writes by using atomic operations in get/set, but not the ones between reads/writes and resize without a mutex.

Are you sure that TSAN is supposed to be quiet here? After all, TSAN detects when multiple threads access the same memory without synchronization, and the whole point of lock-free operation is to allow multiple threads to access the same memory without synchronization. Which makes me think the TSAN warnings should be expected?

I'm sending up a version with atomic reads/writes in get and set so you can have a look, but I don't think they should be there. They don't slow anything down on Intel but could be bad on other architectures (ARM? not an expert). Adding links to three places in arraymodule which I think should be sufficient to make sure the bad memory is never read/written, regardless of TSAN, and also a reproducer that causes TSAN warnings with list in the same manner.

Thoughts?

https://github.com/tom-pytel/cpython/blob/04a8f9dc9fe07f432ef206c90e660a0dfe74ac48/Modules/arraymodule.c#L295

https://github.com/tom-pytel/cpython/blob/04a8f9dc9fe07f432ef206c90e660a0dfe74ac48/Modules/arraymodule.c#L872

https://github.com/tom-pytel/cpython/blob/04a8f9dc9fe07f432ef206c90e660a0dfe74ac48/Modules/arraymodule.c#L947

List behavior:

import threading
# from array import array


def copy_back_and_forth(b, a, count):
    b.wait()
    for _ in range(count):
        a[0] = a[1]
        a[1] = a[0]


def check(funcs, *args):
    barrier = threading.Barrier(len(funcs))
    thrds = []

    for func in funcs:
        thrd = threading.Thread(target=func, args=(barrier, *args))

        thrds.append(thrd)
        thrd.start()

    for thrd in thrds:
        thrd.join()


if __name__ == "__main__":
    check([copy_back_and_forth] * 10, [0, 1], 100)
    # check([copy_back_and_forth] * 10, array('i', [0, 1]), 100)  # currently shut up by atomics in get/set

@ZeroIntensity (Member):

Are you using the TSan suppressions?

@tom-pytel (Contributor, Author) commented Mar 6, 2025:

Are you using the TSan suppressions?

Yes. And also no warnings if you run for fewer iterations.

List tsan warnings:

WARNING: ThreadSanitizer: data race (pid=20569)
  Atomic read of size 8 at 0x7fd6062b6260 by thread T1:
    #0 __tsan_atomic64_load ../../../../src/libsanitizer/tsan/tsan_interface_atomic.cpp:539 (libtsan.so.0+0x7fe0e)
    #1 _Py_atomic_load_ptr Include/cpython/pyatomic_gcc.h:300 (python+0x20433f)
    #2 _Py_TryXGetRef Include/internal/pycore_object.h:649 (python+0x20433f)
    #3 list_get_item_ref Objects/listobject.c:364 (python+0x20433f)
    #4 _PyList_GetItemRef Objects/listobject.c:415 (python+0x20a558)
    #5 _PyEval_EvalFrameDefault Python/generated_cases.c.h:686 (python+0x41e883)
    #6 _PyEval_EvalFrame Include/internal/pycore_ceval.h:116 (python+0x46da0b)
    #7 _PyEval_Vector Python/ceval.c:1820 (python+0x46da0b)
    #8 _PyFunction_Vectorcall Objects/call.c:413 (python+0x188eb4)
    #9 _PyObject_VectorcallTstate Include/internal/pycore_call.h:167 (python+0x190b2c)
    #10 method_vectorcall Objects/classobject.c:72 (python+0x190b2c)
    #11 _PyVectorcall_Call Objects/call.c:273 (python+0x18c9a7)
    #12 _PyObject_Call Objects/call.c:348 (python+0x18ce9f)
    #13 PyObject_Call Objects/call.c:373 (python+0x18cf24)
    #14 thread_run Modules/_threadmodule.c:354 (python+0x6689d4)
    #15 pythread_wrapper Python/thread_pthread.h:242 (python+0x57d02b)

  Previous write of size 8 at 0x7fd6062b6260 by thread T10:
    #0 PyList_SET_ITEM Include/cpython/listobject.h:47 (python+0x40aa2c)
    #1 _PyEval_EvalFrameDefault Python/generated_cases.c.h:11260 (python+0x466bd8)
    #2 _PyEval_EvalFrame Include/internal/pycore_ceval.h:116 (python+0x46da0b)
    #3 _PyEval_Vector Python/ceval.c:1820 (python+0x46da0b)
    #4 _PyFunction_Vectorcall Objects/call.c:413 (python+0x188eb4)
    #5 _PyObject_VectorcallTstate Include/internal/pycore_call.h:167 (python+0x190b2c)
    #6 method_vectorcall Objects/classobject.c:72 (python+0x190b2c)
    #7 _PyVectorcall_Call Objects/call.c:273 (python+0x18c9a7)
    #8 _PyObject_Call Objects/call.c:348 (python+0x18ce9f)
    #9 PyObject_Call Objects/call.c:373 (python+0x18cf24)
    #10 thread_run Modules/_threadmodule.c:354 (python+0x6689d4)
    #11 pythread_wrapper Python/thread_pthread.h:242 (python+0x57d02b)

To be clear, I am not saying this is an error in list, but rather expected correct behavior.

@colesbury (Contributor):

  • Yes, that's a bug in the STORE_SUBSCR_LIST_INT code. You can file an issue with the data race warning.
  • Atomic reads and writes with memory_order_relaxed have no ordering or synchronization, but they are not data races by definition.
  • Don't use explicit memory fences. They're more complicated than using atomic operations, TSAN doesn't handle them, and I don't think they're necessary here.

@tom-pytel (Contributor, Author):

  • Yes, that's a bug in the STORE_SUBSCR_LIST_INT code. You can file an issue with the data race warning.
  • Atomic reads and writes with memory_order_relaxed have no ordering or synchronization, but they are not data races by definition.
  • Don't use explicit memory fences. They're more complicated than using atomic operations, TSAN doesn't handle them, and I don't think they're necessary here.

I don't think the memory fence is necessary either, but it doesn't hurt. Will remove it anyway. Point is, are you sure it's a bug? Would you not expect TSAN to complain with lock-free operation?

@colesbury (Contributor):

Yes, it's a bug because it's a data race, and data races are undefined behavior. More concretely, it's important that we write the list element with at least memory_order_release so that any previously written contents of the PyObject are visible before the write to the list. In general, we want to avoid undefined behavior in C even if it doesn't necessarily lead to a noticeable change in program behavior.

Would you not expect tsan to complain with lock-free operation?

You need to use atomic operations with the appropriate memory order. In some cases, memory_order_relaxed is sufficient, such as for the elements of array. Atomic loads and stores with memory_order_relaxed compile to plain machine load and store instructions on every architecture [1].

Footnotes

  [1] Assuming normal data sizes, etc.
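
A minimal standalone illustration of the distinction being drawn here (placeholder names, not CPython code): publishing an object pointer wants a release store so the object's earlier-written contents are visible to a reader that sees the pointer, while a plain scalar element of an array only needs relaxed ordering.

/* ordering.c -- illustrative only */
#include <stdatomic.h>

typedef struct { int refcnt; int value; } obj_t;

static obj_t *_Atomic slot;      /* e.g. a list element holding an object pointer */
static _Atomic int scalar_item;  /* e.g. an array('i') element */

void publish_object(obj_t *o)
{
    o->refcnt = 1;
    o->value = 42;
    /* release: the writes to *o above must become visible before the pointer does */
    atomic_store_explicit(&slot, o, memory_order_release);
}

void store_scalar(int v)
{
    /* relaxed is enough: no other memory needs to be ordered before this value */
    atomic_store_explicit(&scalar_item, v, memory_order_relaxed);
}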

@ZeroIntensity (Member):

Don't we want acquire ordering, not release? That's what I originally suggested before the impl switched to relaxed.

@tom-pytel (Contributor, Author):

Don't we want acquire ordering, not release? That's what I originally suggested before the impl switched to relaxed.

Yes, you were right about that one, but he's talking about writing the list element in STORE_SUBSCR_LIST_INT.

@tom-pytel (Contributor, Author):

  • Yes, that's a bug in the STORE_SUBSCR_LIST_INT code. You can file an issue with the data race warning.

Issue posted #130920.

@tom-pytel (Contributor, Author) commented Mar 6, 2025:

So if you insist on appeasing the TSAN gods, then as far as I can see it's either this or locked ass_subscr / ass_item, because otherwise random writes keep landing in the middle of a non-atomic memcpy. Or it could be spinlocked... Luckily, statically linked, both are fast. For the record, list has all writes locked.

Relevant timings:

### scimark_fft ###
WITHGIL: Mean +- std dev: 200 ms +- 6 ms
OLD:     Mean +- std dev: 238 ms +- 7 ms
...
AMEMCPY: Mean +- std dev: 226 ms +- 6 ms   # atomic item copy on resize (instead of memcpy)
AMEMCPY: Mean +- std dev: 222 ms +- 5 ms   # build and run again
LOCKWR:  Mean +- std dev: 226 ms +- 4 ms   # locked writes again, but much better with static linking
LOCKWR:  Mean +- std dev: 226 ms +- 5 ms   # build and run again

Note: this, as opposed to locked writes, could have data-integrity issues.
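
For reference, a standalone sketch of the kind of per-element copy the "AMEMCPY" rows refer to (illustrative only; the element type and names are placeholders): copying items one at a time with relaxed atomics instead of a plain memcpy, so element writes racing with a resize are not data races in the C memory model, at the cost of the integrity caveat noted above.

/* atomic_copy.c -- illustrative only */
#include <stdatomic.h>
#include <stddef.h>

static void
copy_items_relaxed(_Atomic int *dst, _Atomic int *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* each slot is loaded and stored individually with relaxed ordering */
        int v = atomic_load_explicit(&src[i], memory_order_relaxed);
        atomic_store_explicit(&dst[i], v, memory_order_relaxed);
    }
}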
