arc64 local variable corruption in z_init_static_threads #60071
Comments
Thanks Keith for the initial investigations, I'll assign to @evgeniy-paltsev.
As far as I can see, you are not using a release SDK version. Can I download this 0.16.2-rc1-5-g8d9b4fe SDK from somewhere, or do I need to build it myself? Thanks!
Sorry about the random SDK build. Turns out the regular 0.16.2-rc1 also shows the problem, get that from here:
Note that the error I get is this:
ASSERTION FAIL [!sys_dnode_is_linked(&to->node)] @ ZEPHYR_BASE/kernel/timeout.c:110
@keith-packard I'm wondering if this issue reproduces every time or not? I've run twister 50 times in a loop, and I'm not able to reproduce it.
May I also ask you to send the full error log?
And may I ask you to send your value of
oof, I was hoping this would be easy to reproduce... I think the only bits that should affect this are the Zephyr version and the SDK version, right?
I made sure to run 'west update' in the zephyr repo before building. Here's the commands and their output; you can see that it's just hanging today, not generating an error message.
Did a bit more debugging today. With the code as above, the fault happens when
When I place a break point at 0x8040024a (right after vfprintf returns), I notice a difference in register contents depending upon whether I also place a break point at 0x80400242 (right before vprintf is called). When the system stops at 242 and is then continued, the application works correctly. When there is no break point at 242, the application hangs, having branched to address 0x0. The register contents at 0x8040024a when the application works looks like this:
the register contents at 0x8040024a when the application fails looks like this:
Checking the differences, we see:
Critically important here is that If I reset
Ok, after even more debugging, I think I've discovered a bug in qemu? Here's the last two instructions in vfprintf:
When stopped at the first one,
I stick a break point on both addresses along with
When the rtie instruction executes, it pops all 17 values from the stack, leaving
ok, I'm pretty sure this is a bug in qemu caused by combining an interrupt and a jump with a delay slot. In this bug, the timer fires exactly as
The instruction in the delay slot increments the stack pointer. If the timer fires right on top of the
I'm thinking we should just disable branch delay slots in the qemu targets; they're certainly not an optimization for this platform and it seems that qemu doesn't have them quite right. When I do that, everything is just fine.
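For readers unfamiliar with why an interrupt landing between a jump and its delay slot is delicate, here is a toy model in C. It is not a reconstruction of the QEMU bug: the thread does not show QEMU's internals, and the failure mode modelled here (the pending-branch state being dropped across the interrupt) is an assumption picked purely for illustration. All names (toy_cpu, toy_irq_frame, toy_interrupt, toy_return_sequence) and all addresses are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy CPU state; this is NOT ARC's real register file. */
struct toy_cpu {
	uint32_t pc;
	uint32_t sp;
	uint32_t blink;          /* return address */
	bool     branch_pending; /* a taken jump is waiting for its delay slot */
	uint32_t branch_target;  /* where that jump will land */
};

/* Context saved on interrupt entry and restored on return. */
struct toy_irq_frame {
	uint32_t pc;
	bool     branch_pending;
	uint32_t branch_target;
};

/*
 * Toy interrupt entry and exit. With keep_pending == false the frame
 * forgets that a branch was pending (hypothetical emulator bug).
 */
static void toy_interrupt(struct toy_cpu *cpu, bool keep_pending)
{
	struct toy_irq_frame frame = {
		.pc = cpu->pc,
		.branch_pending = keep_pending && cpu->branch_pending,
		.branch_target = cpu->branch_target,
	};

	/* The handler runs here and is free to clobber the live state. */
	cpu->branch_pending = false;
	cpu->branch_target = 0;

	/* Return from interrupt: restore whatever the frame recorded. */
	cpu->pc = frame.pc;
	cpu->branch_pending = frame.branch_pending;
	cpu->branch_target = frame.branch_target;
}

/* Execute "jump [blink]" with the delay-slot instruction "sp += 16". */
struct toy_cpu toy_return_sequence(bool irq_in_window, bool keep_pending)
{
	struct toy_cpu cpu = { .pc = 0x100, .sp = 0x8000, .blink = 0x200 };

	/* Jump with delay slot: record the target, do not redirect yet. */
	cpu.branch_pending = true;
	cpu.branch_target = cpu.blink;
	cpu.pc += 2;

	if (irq_in_window) {
		toy_interrupt(&cpu, keep_pending);
	}

	/* Delay slot: adjust sp, then complete the pending jump if any. */
	cpu.sp += 16;
	cpu.pc = cpu.branch_pending ? cpu.branch_target : cpu.pc + 2;
	cpu.branch_pending = false;
	return cpu;
}
```

In this model, toy_return_sequence(true, true) ends at the return address just like the uninterrupted case, while toy_return_sequence(true, false) falls through past the delay slot. The only point is that an emulator has extra state to carry across an interrupt in that window; compiling without delay-slot branches removes the window entirely.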
I spent several hours debugging a weird stack pointer corruption bug and discovered that QEMU appears to mess up register contents when an interrupt fires during the execution of a branch with a delay slot. Instead of trying to fix qemu, let's just tell the compiler to not generate code that uses the branch instructions with delay slots.
Closes: zephyrproject-rtos#60071
Signed-off-by: Keith Packard <keithp@keithp.com>
Ok, looks like this qemu bug is fixed upstream and we just need a newer version in the SDK to get this to work.
Reopening, as we haven't switched to the new SDK yet (but the required fixes are merged to the SDK repo).
Re-opening as a reminder to revert the workaround when the new SDK with qemu fix is used in CI.
Describe the bug
When running the sample.portability.cmsis_rtos_v2.timer_synchronization test on arc64 under qemu using picolibc, z_init_static_threads goes into an infinite loop in the second instance of _FOREACH_STATIC_THREAD, where it's calling schedule_new_thread.
Debugging this, I discover that the register holding the termination value has been trashed and holds a completely bogus value (0x4275de2 instead of 0x80407594). Perturbing the code above the loop in essentially any way at all (I added __asm__("") just after k_sched_lock()) makes this issue go away. The difference in the assembly code is trivial -- it moves the load of __static_thread_data_list_start after the call to k_sched_lock.
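As a rough illustration of the shape of code being described, here is a minimal, self-contained C sketch. It is not the Zephyr source: the struct, the array standing in for the linker-defined list, and the names static_thread_data, thread_list, sched_lock_stub and init_static_threads_sketch are all hypothetical. It only shows a bounded pointer walk whose end address (the "termination value") typically lives in a register, with an empty asm statement placed after the lock call.

```c
#include <stddef.h>

/* Hypothetical stand-in for a per-thread static init record. */
struct static_thread_data {
	int started;
};

/*
 * Hypothetical stand-in for the linker-defined list; in Zephyr the
 * bounds are symbols such as __static_thread_data_list_start.
 */
#define THREAD_COUNT 3
static struct static_thread_data thread_list[THREAD_COUNT];

/* Hypothetical stand-in for k_sched_lock(). */
static void sched_lock_stub(void)
{
}

/* Walk the list twice; returns how many entries the second pass visited. */
int init_static_threads_sketch(void)
{
	struct static_thread_data *start = thread_list;
	struct static_thread_data *end = thread_list + THREAD_COUNT; /* termination value */
	struct static_thread_data *p;
	int scheduled = 0;

	/* First pass: set up every entry. */
	for (p = start; p < end; p++) {
		p->started = 0;
	}

	sched_lock_stub();

	/*
	 * Empty asm statement (GCC/Clang extension). It emits no
	 * instructions, but the compiler may order loads differently
	 * around it.
	 */
	__asm__("");

	/* Second pass: the loop that never terminated in the report. */
	for (p = start; p < end; p++) {
		p->started = 1;
		scheduled++;
	}

	return scheduled;
}
```

If the register holding end is overwritten with a value the pointer never reaches in practice, the second loop keeps running, which matches the symptom described. An empty asm statement changing the outcome suggests the failure depends on the exact instruction sequence rather than on the C logic.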
Please also mention any information which could help others to understand
the problem you're facing:
What target platform are you using?
What have you tried to diagnose or workaround this issue?
Modifying the code in even trivial ways appears to work around this issue. I applied this diff:
This generated the following change in the generated code:
Setting a break point on the cmp_s instruction here also makes things work. That makes me suspect that qemu might be at fault somehow?
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Test completes successfully
Impact
This will delay the switch to using picolibc as the default C library.
Environment (please complete the following information):