
Add LoongArch support #11

Closed
wants to merge 1 commit

Conversation

wangjingwb
Contributor

Please review. Thanks.

@compudj
Contributor

compudj commented May 13, 2022

There is still one unanswered question about this patch: https://lists.lttng.org/pipermail/lttng-dev/2022-January/030119.html

compudj added a commit that referenced this pull request Feb 10, 2023
Fix a deadlock for auto-resize hash tables when cds_lfht_destroy
is called with RCU read-side lock held.

Example stack trace of a hang:

  Thread 2 (Thread 0x7f21ba876700 (LWP 26114)):
  #0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  #1  0x00007f21beba7aa0 in futex (val3=0, uaddr2=0x0, timeout=0x0, val=-1, op=0, uaddr=0x7f21bedac308 <urcu_memb_gp+8>) at ../include/urcu/futex.h:81
  #2  futex_noasync (timeout=0x0, uaddr2=0x0, val3=0, val=-1, op=0, uaddr=0x7f21bedac308 <urcu_memb_gp+8>) at ../include/urcu/futex.h:90
  #3  wait_gp () at urcu.c:265
  #4  wait_for_readers (input_readers=input_readers@entry=0x7f21ba8751b0, cur_snap_readers=cur_snap_readers@entry=0x0,
      qsreaders=qsreaders@entry=0x7f21ba8751c0) at urcu.c:357
  #5  0x00007f21beba8339 in urcu_memb_synchronize_rcu () at urcu.c:498
  #6  0x00007f21be99f93f in fini_table (last_order=<optimized out>, first_order=13, ht=0x5651cec75400) at rculfhash.c:1489
  #7  _do_cds_lfht_shrink (new_size=<optimized out>, old_size=<optimized out>, ht=0x5651cec75400) at rculfhash.c:2001
  #8  _do_cds_lfht_resize (ht=ht@entry=0x5651cec75400) at rculfhash.c:2023
  #9  0x00007f21be99fa26 in do_resize_cb (work=0x5651e20621a0) at rculfhash.c:2063
  #10 0x00007f21be99dbfd in workqueue_thread (arg=0x5651cec74a00) at workqueue.c:234
  #11 0x00007f21bd7c06db in start_thread (arg=0x7f21ba876700) at pthread_create.c:463
  #12 0x00007f21bd4e961f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

  Thread 1 (Thread 0x7f21bf285300 (LWP 26098)):
  #0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  #1  0x00007f21be99d8b7 in futex (val3=0, uaddr2=0x0, timeout=0x0, val=-1, op=0, uaddr=0x5651d8b38584) at ../include/urcu/futex.h:81
  #2  futex_async (timeout=0x0, uaddr2=0x0, val3=0, val=-1, op=0, uaddr=0x5651d8b38584) at ../include/urcu/futex.h:113
  #3  futex_wait (futex=futex@entry=0x5651d8b38584) at workqueue.c:135
  #4  0x00007f21be99e2c8 in urcu_workqueue_wait_completion (completion=completion@entry=0x5651d8b38580) at workqueue.c:423
  #5  0x00007f21be99e3f9 in urcu_workqueue_flush_queued_work (workqueue=0x5651cec74a00) at workqueue.c:452
  #6  0x00007f21be9a0c83 in cds_lfht_destroy (ht=0x5651d8b2fcf0, attr=attr@entry=0x0) at rculfhash.c:1906

This deadlock is easy to reproduce when rapidly adding a large number of
entries in the cds_lfht, removing them, and calling cds_lfht_destroy().

The deadlock will occur if the call to cds_lfht_destroy() takes place
while a resize of the hash table is ongoing.

Fix this by moving the teardown of the lfht worker thread to the libcds
library destructor, so it does not have to wait on synchronize_rcu from
a resize callback from within a read-side critical section. As a
consequence, the atfork callbacks are left registered within each urcu
flavor for which a resizeable hash table is created until the end of the
executable lifetime.

The other part of the fix is to move the hash table destruction to the
worker thread for auto-resize hash tables. This prevents having to wait
for resize callbacks from within an RCU read-side critical section. This is
guaranteed by the fact that the worker thread serializes previously
queued resize callbacks before the destroy callback.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: If8b1c3c8063dc7b9846dc5c3fc452efd917eab4d
@loongson-sm

There is still one unanswered question about this patch: https://lists.lttng.org/pipermail/lttng-dev/2022-January/030119.html

Does this issue still block the merge of the pull request? If it does, we can provide the hardware for you to perform verification.

@compudj
Contributor

compudj commented Aug 10, 2023

There is still one unanswered question about this patch: https://lists.lttng.org/pipermail/lttng-dev/2022-January/030119.html

Does this issue still block the merge of the pull request? If it does, we can provide the hardware for you to perform verification.

Yes, this still blocks the merge of the pull request. I need to understand the architecture design well enough to merge and maintain support for an architecture. For this I would need either answers to my questions, or boards available for testing and access to the architecture documentation. Testing-wise, the ideal scenario is if we can add at least 2 test boards in our test rack at EfficiOS, so they can be wired up in our CI.

We can discuss access to hardware by email. Please contact me at mathieu.desnoyers@efficios.com

@glaubitz

glaubitz commented Sep 3, 2023

Yes, this still blocks the merge of the pull request. I need to understand the architecture design well enough to merge and maintain support for an architecture. For this I would need either answers to my questions, or boards available for testing and access to the architecture documentation. Testing-wise, the ideal scenario is if we can add at least 2 test boards in our test rack at EfficiOS, so they can be wired up in our CI.

LoongArch is already supported by QEMU so CI should be possible without real hardware.

FWIW, this missing patch is blocking other packages on Debian now. I am adding the patch locally now.

See: https://buildd.debian.org/status/fetch.php?pkg=liburcu&arch=loong64&ver=0.14.0-1&stamp=1693584292&raw=0

@compudj
Contributor

compudj commented Sep 3, 2023

The proposed patch has this set in include/urcu/uatomic/loongarch.h:

#define UATOMIC_HAS_ATOMIC_BYTE
#define UATOMIC_HAS_ATOMIC_SHORT

The architecture manual states:

https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html#atomic-memory-access-instructions

"The access address of an AM* atomic access instruction is the value of the general register rj. The access address of an AM* atomic access instruction always requires natural alignment, and failure to meet this condition will trigger a non-alignment exception.

Atomic access instructions ending in .W and .WU read and write memory and intermediate operations with a data length of 32 bits, while atomic access instructions ending in .D and .DU read and write memory and intermediate operations with a data length of 64 bits. Whether ending in .W or .WU, the data of a word retrieved from memory by an atomic access instruction is symbolically extended and written to the general register rd."

So there is a discrepancy between the patch implementation and the architecture manual. There is a test for this under tests/unit/test_uatomic.c. I suspect the main reason why this test does not fail is because the addresses of the byte and short variables happen to be naturally aligned on word-size.

So the reason why I have not merged this patch is because I think it has a bug, and that if the urcu tests all pass with this, there is a hole in the testing that needs to be filled.

I am not in favor of this patch being picked up as is in Debian without these issues being addressed first. I have voiced my concerns repeatedly to the patch submitter and they were never addressed.

@compudj
Contributor

compudj commented Sep 3, 2023

I have pushed this commit into the liburcu master branch which should help trigger the "unaligned atomic trap" detailed in the loongson architecture manual, please try it out:

commit cac31bf
Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date:   Sun Sep 3 10:55:24 2023 -0400

Tests: Add test for byte/short atomics on addresses which are not word-aligned

Add a unit test to catch architectures which do not allow byte and short
atomic operations on addresses which are not word aligned.

If an architecture supports byte and short atomic operations, it should
be valid to issue those operations on variables which are not
word-aligned, otherwise the architecture should not define
UATOMIC_HAS_ATOMIC_BYTE nor UATOMIC_HAS_ATOMIC_SHORT.

This should help identify architectures which mistakenly define
UATOMIC_HAS_ATOMIC_BYTE and UATOMIC_HAS_ATOMIC_SHORT.

@compudj
Contributor

compudj commented Sep 3, 2023

Yes, this still blocks the merge of the pull request. I need to understand the architecture design well enough to merge and maintain support for an architecture. For this I would need either answers to my questions, or boards available for testing and access to the architecture documentation. Testing-wise, the ideal scenario is if we can add at least 2 test boards in our test rack at EfficiOS, so they can be wired up in our CI.

LoongArch is already supported by QEMU so CI should be possible without real hardware.

FWIW, this missing patch is blocking other packages on Debian now. I am adding the patch locally now.

See: https://buildd.debian.org/status/fetch.php?pkg=liburcu&arch=loong64&ver=0.14.0-1&stamp=1693584292&raw=0

Testing liburcu in a QEMU environment is not sufficient to validate the correctness of the memory barriers, as this inherently depends on the hardware implementation of the processor. Typically emulators will fall back on a stronger memory consistency model compared to the emulated hardware, which makes things "work" but fail to cover the various race conditions that can happen if the memory barriers are wrong.

So no, testing within QEMU is not enough for liburcu.

@glaubitz

glaubitz commented Sep 3, 2023

I am not in favor of this patch being picked up as is in Debian without these issues being addressed first. I have voiced my concerns repeatedly to the patch submitter and they were never addressed.

The patch is already in the »unreleased« suite of the Debian distribution, as it would otherwise block many other packages.

So no, testing within QEMU is not enough for liburcu.

There are two LoongArch machines in the GCC compile farm for which access can be obtained by any open source developers:

https://cfarm.tetaneutral.net/machines/list/

The two machines are currently offline, but I will reach out to my contacts at Loongson to get them back online.

@compudj
Contributor

compudj commented Sep 3, 2023

I am not in favor of this patch being picked up as is in Debian without these issues being addressed first. I have voiced my concerns repeatedly to the patch submitter and they were never addressed.

The patch is already in the »unreleased« suite of the Debian distribution, as it would otherwise block many other packages.

Short-term, if you really need to deploy this patch, I recommend that you remove the UATOMIC_HAS_ATOMIC_BYTE and UATOMIC_HAS_ATOMIC_SHORT defines from include/urcu/uatomic/loongarch.h. It still needs to be tested on real hardware, but my main underlying concern is the presence of those two defines in public headers that contradict the architecture reference manual.

So no, testing within QEMU is not enough for liburcu.

There are two LoongArch machines in the GCC compile farm for which access can be obtained by any open source developers:

https://cfarm.tetaneutral.net/machines/list/

The two machines are currently offline, but I will reach out to my contacts at Loongson to get them back online.

I would be interested to have liburcu tested on real Loongson boards before I merge its support into liburcu. Please let me know how it goes.

@wangjingwb
Contributor Author

cac31bf

Test passed.
Environment:
CPU: Loongson-3A5000
Memory: 16 GB
OS: Loongnix 20.5
Kernel version: 4.19.190.8.11
GCC version: 8.3.0-6.lnd.vec.36

Results:
./test_uatomic
1..255

Test atomic ops on byte with 0 byte offset from long alignment

ok 1 - uatomic_read(&vals.c[i]) == 10
ok 2 - uatomic_read(&vals.c[i]) == (typeof((&vals.c[i])))-1UL
ok 3 - uatomic_read(&vals.c[i]) == 22
ok 4 - v == (typeof((&vals.c[i])))-1UL
ok 5 - uatomic_read(&vals.c[i]) == 22
ok 6 - v == 22
ok 7 - uatomic_read(&vals.c[i]) == 55
ok 8 - v == 22
ok 9 - uatomic_read(&vals.c[i]) == 23
ok 10 - uatomic_read(&vals.c[i]) == 22
ok 11 - v == 96
ok 12 - uatomic_read(&vals.c[i]) == 96
ok 13 - uatomic_read(&vals.c[i]) == 122
ok 14 - v == 121
ok 15 - uatomic_read(&vals.c[i]) == 119
ok 16 - uatomic_read(&vals.c[i]) == 121
ok 17 - uatomic_read(&vals.c[i]) == 1

Test atomic ops on byte with 1 byte offset from long alignment

ok 18 - uatomic_read(&vals.c[i]) == 10
ok 19 - uatomic_read(&vals.c[i]) == (typeof((&vals.c[i])))-1UL
ok 20 - uatomic_read(&vals.c[i]) == 22
ok 21 - v == (typeof((&vals.c[i])))-1UL
ok 22 - uatomic_read(&vals.c[i]) == 22
ok 23 - v == 22
ok 24 - uatomic_read(&vals.c[i]) == 55
ok 25 - v == 22
ok 26 - uatomic_read(&vals.c[i]) == 23
ok 27 - uatomic_read(&vals.c[i]) == 22
ok 28 - v == 96
ok 29 - uatomic_read(&vals.c[i]) == 96
ok 30 - uatomic_read(&vals.c[i]) == 122
ok 31 - v == 121
ok 32 - uatomic_read(&vals.c[i]) == 119
ok 33 - uatomic_read(&vals.c[i]) == 121
ok 34 - uatomic_read(&vals.c[i]) == 1

Test atomic ops on byte with 2 byte offset from long alignment

ok 35 - uatomic_read(&vals.c[i]) == 10
ok 36 - uatomic_read(&vals.c[i]) == (typeof((&vals.c[i])))-1UL
ok 37 - uatomic_read(&vals.c[i]) == 22
ok 38 - v == (typeof((&vals.c[i])))-1UL
ok 39 - uatomic_read(&vals.c[i]) == 22
ok 40 - v == 22
ok 41 - uatomic_read(&vals.c[i]) == 55
ok 42 - v == 22
ok 43 - uatomic_read(&vals.c[i]) == 23
ok 44 - uatomic_read(&vals.c[i]) == 22
ok 45 - v == 96
ok 46 - uatomic_read(&vals.c[i]) == 96
ok 47 - uatomic_read(&vals.c[i]) == 122
ok 48 - v == 121
ok 49 - uatomic_read(&vals.c[i]) == 119
ok 50 - uatomic_read(&vals.c[i]) == 121
ok 51 - uatomic_read(&vals.c[i]) == 1

Test atomic ops on byte with 3 byte offset from long alignment

ok 52 - uatomic_read(&vals.c[i]) == 10
ok 53 - uatomic_read(&vals.c[i]) == (typeof((&vals.c[i])))-1UL
ok 54 - uatomic_read(&vals.c[i]) == 22
ok 55 - v == (typeof((&vals.c[i])))-1UL
ok 56 - uatomic_read(&vals.c[i]) == 22
ok 57 - v == 22
ok 58 - uatomic_read(&vals.c[i]) == 55
ok 59 - v == 22
ok 60 - uatomic_read(&vals.c[i]) == 23
ok 61 - uatomic_read(&vals.c[i]) == 22
ok 62 - v == 96
ok 63 - uatomic_read(&vals.c[i]) == 96
ok 64 - uatomic_read(&vals.c[i]) == 122
ok 65 - v == 121
ok 66 - uatomic_read(&vals.c[i]) == 119
ok 67 - uatomic_read(&vals.c[i]) == 121
ok 68 - uatomic_read(&vals.c[i]) == 1

Test atomic ops on byte with 4 byte offset from long alignment

ok 69 - uatomic_read(&vals.c[i]) == 10
ok 70 - uatomic_read(&vals.c[i]) == (typeof((&vals.c[i])))-1UL
ok 71 - uatomic_read(&vals.c[i]) == 22
ok 72 - v == (typeof((&vals.c[i])))-1UL
ok 73 - uatomic_read(&vals.c[i]) == 22
ok 74 - v == 22
ok 75 - uatomic_read(&vals.c[i]) == 55
ok 76 - v == 22
ok 77 - uatomic_read(&vals.c[i]) == 23
ok 78 - uatomic_read(&vals.c[i]) == 22
ok 79 - v == 96
ok 80 - uatomic_read(&vals.c[i]) == 96
ok 81 - uatomic_read(&vals.c[i]) == 122
ok 82 - v == 121
ok 83 - uatomic_read(&vals.c[i]) == 119
ok 84 - uatomic_read(&vals.c[i]) == 121
ok 85 - uatomic_read(&vals.c[i]) == 1

Test atomic ops on byte with 5 byte offset from long alignment

ok 86 - uatomic_read(&vals.c[i]) == 10
ok 87 - uatomic_read(&vals.c[i]) == (typeof((&vals.c[i])))-1UL
ok 88 - uatomic_read(&vals.c[i]) == 22
ok 89 - v == (typeof((&vals.c[i])))-1UL
ok 90 - uatomic_read(&vals.c[i]) == 22
ok 91 - v == 22
ok 92 - uatomic_read(&vals.c[i]) == 55
ok 93 - v == 22
ok 94 - uatomic_read(&vals.c[i]) == 23
ok 95 - uatomic_read(&vals.c[i]) == 22
ok 96 - v == 96
ok 97 - uatomic_read(&vals.c[i]) == 96
ok 98 - uatomic_read(&vals.c[i]) == 122
ok 99 - v == 121
ok 100 - uatomic_read(&vals.c[i]) == 119
ok 101 - uatomic_read(&vals.c[i]) == 121
ok 102 - uatomic_read(&vals.c[i]) == 1

Test atomic ops on byte with 6 byte offset from long alignment

ok 103 - uatomic_read(&vals.c[i]) == 10
ok 104 - uatomic_read(&vals.c[i]) == (typeof((&vals.c[i])))-1UL
ok 105 - uatomic_read(&vals.c[i]) == 22
ok 106 - v == (typeof((&vals.c[i])))-1UL
ok 107 - uatomic_read(&vals.c[i]) == 22
ok 108 - v == 22
ok 109 - uatomic_read(&vals.c[i]) == 55
ok 110 - v == 22
ok 111 - uatomic_read(&vals.c[i]) == 23
ok 112 - uatomic_read(&vals.c[i]) == 22
ok 113 - v == 96
ok 114 - uatomic_read(&vals.c[i]) == 96
ok 115 - uatomic_read(&vals.c[i]) == 122
ok 116 - v == 121
ok 117 - uatomic_read(&vals.c[i]) == 119
ok 118 - uatomic_read(&vals.c[i]) == 121
ok 119 - uatomic_read(&vals.c[i]) == 1

Test atomic ops on byte with 7 byte offset from long alignment

ok 120 - uatomic_read(&vals.c[i]) == 10
ok 121 - uatomic_read(&vals.c[i]) == (typeof((&vals.c[i])))-1UL
ok 122 - uatomic_read(&vals.c[i]) == 22
ok 123 - v == (typeof((&vals.c[i])))-1UL
ok 124 - uatomic_read(&vals.c[i]) == 22
ok 125 - v == 22
ok 126 - uatomic_read(&vals.c[i]) == 55
ok 127 - v == 22
ok 128 - uatomic_read(&vals.c[i]) == 23
ok 129 - uatomic_read(&vals.c[i]) == 22
ok 130 - v == 96
ok 131 - uatomic_read(&vals.c[i]) == 96
ok 132 - uatomic_read(&vals.c[i]) == 122
ok 133 - v == 121
ok 134 - uatomic_read(&vals.c[i]) == 119
ok 135 - uatomic_read(&vals.c[i]) == 121
ok 136 - uatomic_read(&vals.c[i]) == 1

Test atomic ops on short with 0 byte offset from long alignment

ok 137 - uatomic_read(&vals.s[i]) == 10
ok 138 - uatomic_read(&vals.s[i]) == (typeof((&vals.s[i])))-1UL
ok 139 - uatomic_read(&vals.s[i]) == 22
ok 140 - v == (typeof((&vals.s[i])))-1UL
ok 141 - uatomic_read(&vals.s[i]) == 22
ok 142 - v == 22
ok 143 - uatomic_read(&vals.s[i]) == 55
ok 144 - v == 22
ok 145 - uatomic_read(&vals.s[i]) == 23
ok 146 - uatomic_read(&vals.s[i]) == 22
ok 147 - v == 96
ok 148 - uatomic_read(&vals.s[i]) == 96
ok 149 - uatomic_read(&vals.s[i]) == 122
ok 150 - v == 121
ok 151 - uatomic_read(&vals.s[i]) == 119
ok 152 - uatomic_read(&vals.s[i]) == 121
ok 153 - uatomic_read(&vals.s[i]) == 1

Test atomic ops on short with 2 byte offset from long alignment

ok 154 - uatomic_read(&vals.s[i]) == 10
ok 155 - uatomic_read(&vals.s[i]) == (typeof((&vals.s[i])))-1UL
ok 156 - uatomic_read(&vals.s[i]) == 22
ok 157 - v == (typeof((&vals.s[i])))-1UL
ok 158 - uatomic_read(&vals.s[i]) == 22
ok 159 - v == 22
ok 160 - uatomic_read(&vals.s[i]) == 55
ok 161 - v == 22
ok 162 - uatomic_read(&vals.s[i]) == 23
ok 163 - uatomic_read(&vals.s[i]) == 22
ok 164 - v == 96
ok 165 - uatomic_read(&vals.s[i]) == 96
ok 166 - uatomic_read(&vals.s[i]) == 122
ok 167 - v == 121
ok 168 - uatomic_read(&vals.s[i]) == 119
ok 169 - uatomic_read(&vals.s[i]) == 121
ok 170 - uatomic_read(&vals.s[i]) == 1

Test atomic ops on short with 4 byte offset from long alignment

ok 171 - uatomic_read(&vals.s[i]) == 10
ok 172 - uatomic_read(&vals.s[i]) == (typeof((&vals.s[i])))-1UL
ok 173 - uatomic_read(&vals.s[i]) == 22
ok 174 - v == (typeof((&vals.s[i])))-1UL
ok 175 - uatomic_read(&vals.s[i]) == 22
ok 176 - v == 22
ok 177 - uatomic_read(&vals.s[i]) == 55
ok 178 - v == 22
ok 179 - uatomic_read(&vals.s[i]) == 23
ok 180 - uatomic_read(&vals.s[i]) == 22
ok 181 - v == 96
ok 182 - uatomic_read(&vals.s[i]) == 96
ok 183 - uatomic_read(&vals.s[i]) == 122
ok 184 - v == 121
ok 185 - uatomic_read(&vals.s[i]) == 119
ok 186 - uatomic_read(&vals.s[i]) == 121
ok 187 - uatomic_read(&vals.s[i]) == 1

Test atomic ops on short with 6 byte offset from long alignment

ok 188 - uatomic_read(&vals.s[i]) == 10
ok 189 - uatomic_read(&vals.s[i]) == (typeof((&vals.s[i])))-1UL
ok 190 - uatomic_read(&vals.s[i]) == 22
ok 191 - v == (typeof((&vals.s[i])))-1UL
ok 192 - uatomic_read(&vals.s[i]) == 22
ok 193 - v == 22
ok 194 - uatomic_read(&vals.s[i]) == 55
ok 195 - v == 22
ok 196 - uatomic_read(&vals.s[i]) == 23
ok 197 - uatomic_read(&vals.s[i]) == 22
ok 198 - v == 96
ok 199 - uatomic_read(&vals.s[i]) == 96
ok 200 - uatomic_read(&vals.s[i]) == 122
ok 201 - v == 121
ok 202 - uatomic_read(&vals.s[i]) == 119
ok 203 - uatomic_read(&vals.s[i]) == 121
ok 204 - uatomic_read(&vals.s[i]) == 1

Test atomic ops on int with 0 byte offset from long alignment

ok 205 - uatomic_read(&vals.i[i]) == 10
ok 206 - uatomic_read(&vals.i[i]) == (typeof((&vals.i[i])))-1UL
ok 207 - uatomic_read(&vals.i[i]) == 22
ok 208 - v == (typeof((&vals.i[i])))-1UL
ok 209 - uatomic_read(&vals.i[i]) == 22
ok 210 - v == 22
ok 211 - uatomic_read(&vals.i[i]) == 55
ok 212 - v == 22
ok 213 - uatomic_read(&vals.i[i]) == 23
ok 214 - uatomic_read(&vals.i[i]) == 22
ok 215 - v == 96
ok 216 - uatomic_read(&vals.i[i]) == 96
ok 217 - uatomic_read(&vals.i[i]) == 122
ok 218 - v == 121
ok 219 - uatomic_read(&vals.i[i]) == 119
ok 220 - uatomic_read(&vals.i[i]) == 121
ok 221 - uatomic_read(&vals.i[i]) == 1

Test atomic ops on int with 4 byte offset from long alignment

ok 222 - uatomic_read(&vals.i[i]) == 10
ok 223 - uatomic_read(&vals.i[i]) == (typeof((&vals.i[i])))-1UL
ok 224 - uatomic_read(&vals.i[i]) == 22
ok 225 - v == (typeof((&vals.i[i])))-1UL
ok 226 - uatomic_read(&vals.i[i]) == 22
ok 227 - v == 22
ok 228 - uatomic_read(&vals.i[i]) == 55
ok 229 - v == 22
ok 230 - uatomic_read(&vals.i[i]) == 23
ok 231 - uatomic_read(&vals.i[i]) == 22
ok 232 - v == 96
ok 233 - uatomic_read(&vals.i[i]) == 96
ok 234 - uatomic_read(&vals.i[i]) == 122
ok 235 - v == 121
ok 236 - uatomic_read(&vals.i[i]) == 119
ok 237 - uatomic_read(&vals.i[i]) == 121
ok 238 - uatomic_read(&vals.i[i]) == 1

Test atomic ops on long

ok 239 - uatomic_read(&vals.l) == 10
ok 240 - uatomic_read(&vals.l) == (typeof((&vals.l)))-1UL
ok 241 - uatomic_read(&vals.l) == 22
ok 242 - v == (typeof((&vals.l)))-1UL
ok 243 - uatomic_read(&vals.l) == 22
ok 244 - v == 22
ok 245 - uatomic_read(&vals.l) == 55
ok 246 - v == 22
ok 247 - uatomic_read(&vals.l) == 23
ok 248 - uatomic_read(&vals.l) == 22
ok 249 - v == 96
ok 250 - uatomic_read(&vals.l) == 96
ok 251 - uatomic_read(&vals.l) == 122
ok 252 - v == 121
ok 253 - uatomic_read(&vals.l) == 119
ok 254 - uatomic_read(&vals.l) == 121
ok 255 - uatomic_read(&vals.l) == 1

@wangjingwb
Contributor Author

I am not in favor of this patch being picked up as is in Debian without these issues being addressed first. I have voiced my concerns repeatedly to the patch submitter and they were never addressed.

The patch is already in the »unreleased« suite of the Debian distribution, as it would otherwise block many other packages.

Short-term, if you really need to deploy this patch, I recommend that you remove the UATOMIC_HAS_ATOMIC_BYTE and UATOMIC_HAS_ATOMIC_SHORT defines from include/urcu/uatomic/loongarch.h. It still needs to be tested on real hardware, but my main underlying concern is the presence of those two defines in public headers that contradict the architecture reference manual.

So no, testing within QEMU is not enough for liburcu.

There are two LoongArch machines in the GCC compile farm for which access can be obtained by any open source developers:

https://cfarm.tetaneutral.net/machines/list/

The two machines are currently offline, but I will reach out to my contacts at Loongson to get them back online.

I would be interested to have liburcu tested on real Loongson boards before I merge its support into liburcu. Please let me know how it goes.

The LoongArch machine in the GCC compile farm is currently being prepared; there will be a notification when it goes online.
Thanks.

@wangjingwb
Contributor Author

There is still one unanswered question about this patch: https://lists.lttng.org/pipermail/lttng-dev/2022-January/030119.html

Yes. char and short atomics are implemented through LL/SC instructions.

@compudj
Contributor

compudj commented Sep 5, 2023

There is still one unanswered question about this patch: https://lists.lttng.org/pipermail/lttng-dev/2022-January/030119.html

Yes. char and short atomics are implemented through LL/SC instructions.

Perfect, this makes sense. I'll apply the patch and add an extra comment stating that 8-bit and 16-bit atomic accesses are performed through ll/sc, and that the ll/sc loop may retry if the cache line is modified concurrently. This can be relevant for API users relying on strong forward progress guarantees.

@compudj
Contributor

compudj commented Sep 5, 2023

Please add a patch commit message describing the change, and a "Signed-off-by: " with your email at the end, and I will be able to merge it.

This commit completes LoongArch support.

LoongArch supports byte and short atomic operations,
and defines UATOMIC_HAS_ATOMIC_BYTE and UATOMIC_HAS_ATOMIC_SHORT.

Signed-off-by: Wang Jing <wangjing@loongson.cn>
Change-Id: I335e654939bfc90994275f2a4fad550c95f3eba4
@wangjingwb
Contributor Author

Please add a patch commit message describing the change, and a "Signed-off-by: " with your email at the end, and I will be able to merge it.

Modified and completed. Thanks.

compudj added a commit that referenced this pull request Sep 6, 2023
…LL/SC

Based on the LoongArch Reference Manual:

https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html

Section 2.2.7 "Atomic Memory Access Instructions" only lists atomic
operations for 32-bit and 64-bit integers. As detailed in Section
2.2.7.1, LL/SC instructions operating on 32-bit and 64-bit integers are
also available. Those are used by the compiler to support atomics on
byte and short types.

This means atomics on 32-bit and 64-bit types have stronger forward
progress guarantees than those operating on 8-bit and 16-bit types.

Link: #11 (comment)
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: I01569b718f7300a46d984c34065c0bbfbd2f7cc6
@compudj
Contributor

compudj commented Sep 6, 2023

It is now merged into the liburcu master branch, thanks for your contribution !

@compudj compudj closed this Sep 6, 2023