QEMU bug with branch delay slots on ARC #54720
See comments in code. The workaround is straightforward once the issue is understood, but the path to get here was hilariously weird. Fixes zephyrproject-rtos#54720 Signed-off-by: Andy Ross <andyross@google.com>
Probably should mark this P1 as it's currently breaking CI |
Can you open an issue on the SDK to track the toolchain bug. |
This is a workaround for a compiler bug on (at least) GCC 12.1.0 in Zephyr SDK 0.15.1.

The optimizer generates this function with a last instruction that is an unconditional branch (a tail call into the chunk_set() handling). But that means that the NEXT instruction gets decoded as part of the branch delay slot, but that instruction isn't part of this function!

Some instructions aren't legal in branch delay slots. One of those is ENTER_S, which is a very common entry instruction for whatever function the linker places after us. It seems like the compiler doesn't understand this problem. Stuff a NOP in to guarantee the code is legal.

Comment above is duplicated in the code. The workaround is straightforward once the issue is understood, but the path to get here was hilariously weird.

Fixes zephyrproject-rtos#54720

Signed-off-by: Andy Ross <andyross@google.com>
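For readers who want to see the shape of the workaround, here is a minimal sketch in C. It is not the actual Zephyr heap code: `demo_chunk_set()`, `demo_set_chunk_size()`, `demo_get_chunk_size()` and the `demo_fields` array are hypothetical names made up for illustration. The idea is only that a function whose body ends in a call (which the optimizer may turn into a tail-call branch) gets a trailing NOP so that the branch is no longer the last instruction emitted for the function.

```c
#include <stdint.h>

/* Hypothetical storage standing in for heap chunk metadata. */
static uint32_t demo_fields[8];

/* Hypothetical helper playing the role of the chunk_set() handling.
 * noinline keeps it a real call so the caller has something to branch to.
 */
__attribute__((noinline))
void demo_chunk_set(uint32_t idx, uint32_t val)
{
	demo_fields[idx] = val;
}

/* Without the trailing NOP, an optimizing compiler may emit the call
 * below as a tail call, i.e. a plain branch that ends the function.
 * The NOP is meant to put a harmless instruction after that point.
 * Assumption: the compiler keeps the asm statement at the end; it is
 * not strictly obliged to, so the disassembly should be checked.
 */
void demo_set_chunk_size(uint32_t idx, uint32_t size)
{
	demo_chunk_set(idx, size);
	__asm__ volatile("nop");
}

/* Small accessor so the behaviour can be checked from a test. */
uint32_t demo_get_chunk_size(uint32_t idx)
{
	return demo_fields[idx];
}
```

The NOP does not change what the function computes; it only affects instruction layout, which is why the result has to be confirmed in the disassembly rather than by running the code.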
@andyross I'm not sure which instruction in the disassembly above supposedly has a delay-slot? |
This needs some more investigations indeed to know the root cause to see if the .d delay slot flag is set or made up by qemu. |
That makes sense too, I suppose. Here's a zephyr.exe |
That's funny, with enable
And as I expected works well in nSIM, so most likely QEMU issue. I'll file an issue in our QEMU repo once I collect all the needed data. |
It's worth pointing out that this is hugely timing dependent, so the fact that singlestep avoids the bug doesn't say much, IMHO. The function in question is (logically) branchless, and the heap code absolutely hits it many times in this binary. If I had to guess, it's something like "qemu is returning from an interrupt at the right moment and calling some function to re-populate instruction state", and that particular case mixed up the .d and not-.d instruction variants or something. |
Oh, and we should probably continue discussion in the still-open toolchain issue and not this one, which just got closed. zephyrproject-rtos/sdk-ng#627 |
I'm reopening this. As I alluded to in PR #54721 the merged workaround See https://github.com/zephyrproject-rtos/zephyr/actions/runs/4161101126/jobs/7198752243 |
Indeed, in our internal daily regression testing the issue popped up again. |
Note that this is likely not a code generation bug, but an issue in qemu.
I'll let someone else more knowledgeable with this issue rename it appropriately.
|
https://github.com/zephyrproject-rtos/zephyr/actions/runs/4171460176/jobs/7221421851#step:12:1310 It looks like this test is broken again. |
@stephanosio we're looking into it. In the meantime we'll try to submit a temporary work-around with |
Just to duplicate a comment from Discord: I can also suppress results with CONFIG_QEMU_ICOUNT=n. Both techniques are probably just changing timing, but kconfig can be done on a per-test level and might be more narrowly targeted. |
Taking off the release blocker label since, according to @ruuddw, this issue is not observed in real hardware and seems to be a QEMU-specific issue. Unless one of the maintainers can come up with a workaround by tomorrow, I will send a PR to temporarily disable the affected test-platform. |
I'm not a maintainer but I'm pushing a PR with a workaround in 5 min.
|
Use -fno-optimize-sibling-calls on QEMU builds.

This is a workaround to make tests/kernel/mem_protect/syscalls (and possibly others) work. Without this, result is:

|START - test_syscall_torture
|Running syscall torture test with 4 threads on 1 cpu(s)
|E: ***** Exception vector: 0x2, cause code: 0x1, parameter 0x0
|E: Address 0x80006e48
|E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
|E: Current thread: 0x8040095c (unknown)
|E: Halting system

With this commit applied:

|START - test_syscall_torture
|Running syscall torture test with 4 threads on 1 cpu(s)
|
| PASS - test_syscall_torture in 15.004 seconds
|[...]
|PROJECT EXECUTION SUCCESSFUL

Fixes zephyrproject-rtos#54720

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
Please consider merging PR #54876 |
Renamed this issue per overall consensus. |
Disable tests/kernel/mem_protect/syscalls for qemu_arc_em, where we trigger an ARC QEMU bug that causes an illegal instruction exception on perfectly valid ARC code. Signed-off-by: Eugeniy Paltsev <Eugeniy.Paltsev@synopsys.com> Signed-off-by: Evgeniy Paltsev <PaltsevEvgeniy@gmail.com>
Re-opening since #54910 is only a workaround for this issue until a solution (QEMU bug fix in a new Zephyr SDK release) is provided. |
This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time. |
Still open, see ARC QEMU issue. |
For the record, we're actively working on a significant rework of the delay-slot related functionality in QEMU for ARC. Apparently this is much more complex work than one might think, and it will take a bit more time than expected. |
(Submitting this to Zephyr just to have a fix to link with the full description from the workaround patch. We probably want to open a toolchain bug for the broader problem?)
As of recent commits, qemu_arc_em is failing in the tests/kernel/mem_protect/syscalls test with:
The proximate cause was (hilariously) that the patch count since the last release candidate had reached 100. This caused the version string printed by the boot banner to be one byte longer and exposed the bug. (I actually got a test rig created where two Zephyr binaries that differed ONLY in whether the last byte of a fake banner string was an "x" or a newline would differ in crash behavior.)
But as it turns out, that's all just timing interaction. The real problem happens in the heap code, and is a compiler bug. @ruuddw pointed out that the fault (at 0x800014a4) is actually flagging an illegal instruction in a branch delay slot. And indeed, the disassembly below shows that the faulting instruction is an ENTER_S (one of the forbidden instructions), and that the instruction immediately preceding is indeed an unconditional branch!
This is a clear optimizer bug. The generated code can't be allowed to emit a linker section that ends on an instruction with a branch delay slot, because it can't control the instruction the linker will place next. And indeed, stuffing a NOP instruction at the end of the function fixes the symptom.
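The condition described here can be written down as a small toy model in C. This is only a sketch of the reasoning, not a real ARC decoder: the `insn_kind` enum and the function names are hypothetical, and ENTER_S is the only forbidden delay-slot instruction modelled because it is the one named in this report (the real list is longer).

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy instruction kinds; a simplified model, not real ARC encodings. */
enum insn_kind {
	INSN_OTHER,     /* any ordinary instruction */
	INSN_NOP,       /* padding instruction, harmless in a delay slot */
	INSN_ENTER_S,   /* function entry, forbidden in a delay slot */
	INSN_BRANCH,    /* branch without a delay slot */
	INSN_BRANCH_D   /* branch with a delay slot (the .d variant) */
};

/* Only ENTER_S is modelled as illegal here (assumption for brevity). */
bool legal_in_delay_slot(enum insn_kind k)
{
	return k != INSN_ENTER_S;
}

/* True if a section's final instruction has a delay slot, so whatever
 * the linker places next would be decoded as that slot.
 */
bool section_ends_in_delay_slot(const enum insn_kind *code, size_t n)
{
	return n > 0 && code[n - 1] == INSN_BRANCH_D;
}

/* True if any delay-slot branch in the stream is directly followed by
 * an instruction that is not legal in a delay slot.
 */
bool has_illegal_delay_slot(const enum insn_kind *code, size_t n)
{
	for (size_t i = 0; i + 1 < n; i++) {
		if (code[i] == INSN_BRANCH_D &&
		    !legal_in_delay_slot(code[i + 1])) {
			return true;
		}
	}
	return false;
}
```

In this model, a function ending in `INSN_BRANCH_D` followed by a neighbour starting with `INSN_ENTER_S` is flagged, while the same layout with an `INSN_NOP` between the two is not, which is the effect the NOP workaround is aiming for.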