Add LoongArch support #11
Conversation
There is still one unanswered question about this patch: https://lists.lttng.org/pipermail/lttng-dev/2022-January/030119.html
Fix a deadlock for auto-resize hash tables when cds_lfht_destroy is called with RCU read-side lock held.

Example stack trace of a hang:

```
Thread 2 (Thread 0x7f21ba876700 (LWP 26114)):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007f21beba7aa0 in futex (val3=0, uaddr2=0x0, timeout=0x0, val=-1, op=0, uaddr=0x7f21bedac308 <urcu_memb_gp+8>) at ../include/urcu/futex.h:81
#2  futex_noasync (timeout=0x0, uaddr2=0x0, val3=0, val=-1, op=0, uaddr=0x7f21bedac308 <urcu_memb_gp+8>) at ../include/urcu/futex.h:90
#3  wait_gp () at urcu.c:265
#4  wait_for_readers (input_readers=input_readers@entry=0x7f21ba8751b0, cur_snap_readers=cur_snap_readers@entry=0x0, qsreaders=qsreaders@entry=0x7f21ba8751c0) at urcu.c:357
#5  0x00007f21beba8339 in urcu_memb_synchronize_rcu () at urcu.c:498
#6  0x00007f21be99f93f in fini_table (last_order=<optimized out>, first_order=13, ht=0x5651cec75400) at rculfhash.c:1489
#7  _do_cds_lfht_shrink (new_size=<optimized out>, old_size=<optimized out>, ht=0x5651cec75400) at rculfhash.c:2001
#8  _do_cds_lfht_resize (ht=ht@entry=0x5651cec75400) at rculfhash.c:2023
#9  0x00007f21be99fa26 in do_resize_cb (work=0x5651e20621a0) at rculfhash.c:2063
#10 0x00007f21be99dbfd in workqueue_thread (arg=0x5651cec74a00) at workqueue.c:234
#11 0x00007f21bd7c06db in start_thread (arg=0x7f21ba876700) at pthread_create.c:463
#12 0x00007f21bd4e961f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7f21bf285300 (LWP 26098)):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007f21be99d8b7 in futex (val3=0, uaddr2=0x0, timeout=0x0, val=-1, op=0, uaddr=0x5651d8b38584) at ../include/urcu/futex.h:81
#2  futex_async (timeout=0x0, uaddr2=0x0, val3=0, val=-1, op=0, uaddr=0x5651d8b38584) at ../include/urcu/futex.h:113
#3  futex_wait (futex=futex@entry=0x5651d8b38584) at workqueue.c:135
#4  0x00007f21be99e2c8 in urcu_workqueue_wait_completion (completion=completion@entry=0x5651d8b38580) at workqueue.c:423
#5  0x00007f21be99e3f9 in urcu_workqueue_flush_queued_work (workqueue=0x5651cec74a00) at workqueue.c:452
#6  0x00007f21be9a0c83 in cds_lfht_destroy (ht=0x5651d8b2fcf0, attr=attr@entry=0x0) at rculfhash.c:1906
```

This deadlock is easy to reproduce by rapidly adding a large number of entries to the cds_lfht, removing them, and calling cds_lfht_destroy(). The deadlock occurs if the call to cds_lfht_destroy() takes place while a resize of the hash table is ongoing.

Fix this by moving the teardown of the lfht worker thread to the libcds library destructor, so it does not have to wait on synchronize_rcu from a resize callback from within a read-side critical section. As a consequence, the atfork callbacks are left registered within each urcu flavor for which a resizeable hash table is created until the end of the executable's lifetime.

The other part of the fix is to move the hash table destruction to the worker thread for auto-resize hash tables. This prevents having to wait for resize callbacks from within an RCU read-side critical section, and is guaranteed by the fact that the worker thread serializes previously queued resize callbacks before the destroy callback.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: If8b1c3c8063dc7b9846dc5c3fc452efd917eab4d
Does this issue still block the merge of the pull request? If it does, we can provide the hardware for you to perform verification.
Yes, this still blocks the merge of the pull request. I need to understand the architecture design well enough to merge and maintain support for an architecture. For this I would need either answers to my questions, or boards available for testing and access to the architecture documentation. Testing-wise, the ideal scenario would be to add at least two test boards to our test rack at EfficiOS, so they can be wired into our CI. We can discuss access to hardware by email. Please contact me at mathieu.desnoyers@efficios.com
LoongArch is already supported by QEMU, so CI should be possible without real hardware. FWIW, this missing patch is now blocking other packages in Debian. I am adding the patch locally for now.
The proposed patch has this set in include/urcu/uatomic/loongarch.h:

```c
#define UATOMIC_HAS_ATOMIC_BYTE
```

The architecture manual states:

> "The access address of an AM* atomic access instruction is the value of the general register rj. The access address of an AM* atomic access instruction always requires natural alignment, and failure to meet this condition will trigger a non-alignment exception. Atomic access instructions ending in .W and .WU read and write memory and intermediate operations with a data length of 32 bits, while atomic access instructions ending in .D and .DU read and write memory and intermediate operations with a data length of 64 bits. Whether ending in .W or .WU, the data of a word retrieved from memory by an atomic access instruction is symbolically extended and written to the general register rd."

So there is a discrepancy between the patch implementation and the architecture manual: the AM* instructions only operate on naturally aligned 32-bit and 64-bit data, yet the patch claims byte atomics.

There is a test for this under tests/unit/test_uatomic.c. I suspect the main reason this test does not fail is that the addresses of the byte and short variables happen to be naturally aligned on the word size.

So the reason I have not merged this patch is that I think it has a bug, and that if the urcu tests all pass with it, there is a hole in the testing that needs to be filled. I am not in favor of this patch being picked up as-is in Debian without these issues being addressed first. I have voiced my concerns repeatedly to the patch submitter and they were never addressed.
I have pushed a commit into the liburcu master branch which should help trigger the "unaligned atomic trap" detailed in the Loongson architecture manual. Please try it out: commit cac31bf
Testing liburcu in a QEMU environment is not sufficient to validate the correctness of the memory barriers, as this inherently depends on the hardware implementation of the processor. Typically emulators will fall back on a stronger memory consistency model compared to the emulated hardware, which makes things "work" but fail to cover the various race conditions that can happen if the memory barriers are wrong. So no, testing within QEMU is not enough for liburcu.
The patch is already in the »unreleased« suite of Debian, as it would otherwise block many other packages.
There are two LoongArch machines in the GCC compile farm, to which any open source developer can obtain access. The two machines are currently offline, but I will reach out to my contacts at Loongson to get them back online.
Short-term, if you really need to deploy this patch, I recommend that you remove the UATOMIC_HAS_ATOMIC_BYTE and UATOMIC_HAS_ATOMIC_SHORT defines from include/urcu/uatomic/loongarch.h. It still needs to be tested on real hardware, but my main underlying concern is the presence of those two defines in public headers, which contradict the architecture reference manual.
I would be interested to have liburcu tested on real Loongson boards before I merge its support into liburcu. Please let me know how it goes.
Test passed. Results:

```
Test atomic ops on byte with 0 byte offset from long alignment
ok 1 - uatomic_read(&vals.c[i]) == 10
Test atomic ops on byte with 1 byte offset from long alignment
ok 18 - uatomic_read(&vals.c[i]) == 10
Test atomic ops on byte with 2 byte offset from long alignment
ok 35 - uatomic_read(&vals.c[i]) == 10
Test atomic ops on byte with 3 byte offset from long alignment
ok 52 - uatomic_read(&vals.c[i]) == 10
Test atomic ops on byte with 4 byte offset from long alignment
ok 69 - uatomic_read(&vals.c[i]) == 10
Test atomic ops on byte with 5 byte offset from long alignment
ok 86 - uatomic_read(&vals.c[i]) == 10
Test atomic ops on byte with 6 byte offset from long alignment
ok 103 - uatomic_read(&vals.c[i]) == 10
Test atomic ops on byte with 7 byte offset from long alignment
ok 120 - uatomic_read(&vals.c[i]) == 10
Test atomic ops on short with 0 byte offset from long alignment
ok 137 - uatomic_read(&vals.s[i]) == 10
Test atomic ops on short with 2 byte offset from long alignment
ok 154 - uatomic_read(&vals.s[i]) == 10
Test atomic ops on short with 4 byte offset from long alignment
ok 171 - uatomic_read(&vals.s[i]) == 10
Test atomic ops on short with 6 byte offset from long alignment
ok 188 - uatomic_read(&vals.s[i]) == 10
Test atomic ops on int with 0 byte offset from long alignment
ok 205 - uatomic_read(&vals.i[i]) == 10
Test atomic ops on int with 4 byte offset from long alignment
ok 222 - uatomic_read(&vals.i[i]) == 10
Test atomic ops on long
ok 239 - uatomic_read(&vals.l) == 10
```
The LoongArch machine in the GCC compilation farm is currently being prepared, and there will be a notification when it goes online. |
Yes. char and short atomics are implemented through the ll/sc instructions.
Perfect, this makes sense. I'll apply the patch and add an extra comment stating that 8-bit and 16-bit atomic accesses are performed through ll/sc, and that the ll/sc loop may retry if the cache line is modified concurrently. This can be relevant for API users relying on strong forward progress guarantees. |
Please add a commit message describing the change, and a "Signed-off-by:" line with your email at the end, and I will be able to merge it.
This commit completes LoongArch support. LoongArch supports byte and short atomic operations, and defines UATOMIC_HAS_ATOMIC_BYTE and UATOMIC_HAS_ATOMIC_SHORT.

Signed-off-by: Wang Jing <wangjing@loongson.cn>
Change-Id: I335e654939bfc90994275f2a4fad550c95f3eba4
Modified and completed. Thanks.
…LL/SC

Based on the LoongArch Reference Manual:
https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html

Section 2.2.7 "Atomic Memory Access Instructions" only lists atomic operations for 32-bit and 64-bit integers. As detailed in Section 2.2.7.1, LL/SC instructions operating on 32-bit and 64-bit integers are also available. Those are used by the compiler to support atomics on byte and short types.

This means atomics on 32-bit and 64-bit types have stronger forward progress guarantees than those operating on 8-bit and 16-bit types.

Link: #11 (comment)
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: I01569b718f7300a46d984c34065c0bbfbd2f7cc6
It is now merged into the liburcu master branch, thanks for your contribution!
Please review. Thanks.