-
Notifications
You must be signed in to change notification settings - Fork 6.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ESP-32 preemption regressions with asm2 #6346
Comments
When returning into a different thread than we interrupted, we obviously need to spill all the existing register windows to make sure all their values are in the old thread's stack. But the code to do this forgot to reset the current stack pointer to the value it had at interrupt time (it was still pointing to the saved context below that), so the caller of the interrupted function was spilling to the wrong spot. This wouldn't show up as an instant failure, it would only happen when switching BACK to the improperly-spilled thread. And even then it would be a noop if the original interrupt handler was deep enough to have spilled that function naturally. In practice, this happened only in some instances on ESP-32 (which has more windowed registers than qemu) when interrupting the idle thread (which is very shallow) with a (very simple) timer interrupt. Trivial to see, hard to find. See zephyrproject-rtos#6346 for more detail. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
Got it. It wasn't the spill code, which had a decent unit test and was pretty simple. It wasn't the cross-stack call code, which was likewise testable in isolation but a little more subtle. It was the really obvious-looking glue between the two. Cross-stack calls return with the stack pointer still pointing to the restore area, which lives below the interrupted stack. But the register spills need the stack pointer to point to the actual function that got interrupted, otherwise its CALLER (xtensa register windows are kinda crazy) will spill to the wrong spot. Turns out qemu never managed to hit this case because it has half as many windowed registers as ESP-32. |
@locomuco Sorry, missed your question. We support only the official SDK toolchain from Esspressif on ESP-32. @lpereira can give more context, but broadly Xtensa toolchains are not portable between targets, they are hard-configured to assume specific hardware features. Part of the output of their CPU generator is an "overlay" source package you have to apply to a specific version of binutils/gcc, making a generic toolchain really hard to manage. |
When returning into a different thread than we interrupted, we obviously need to spill all the existing register windows to make sure all their values are in the old thread's stack. But the code to do this forgot to reset the current stack pointer to the value it had at interrupt time (it was still pointing to the saved context below that), so the caller of the interrupted function was spilling to the wrong spot. This wouldn't show up as an instant failure, it would only happen when switching BACK to the improperly-spilled thread. And even then it would be a noop if the original interrupt handler was deep enough to have spilled that function naturally. In practice, this happened only in some instances on ESP-32 (which has more windowed registers than qemu) when interrupting the idle thread (which is very shallow) with a (very simple) timer interrupt. Trivial to see, hard to find. See #6346 for more detail. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
In some cases (though apparently not all) the new asm2 code on esp32 is failing in situations where it is asked to switch back to a previously-interrupted thread from an otherwise working one. See #6339 for one report.
The most common case happens with the idle thread, which will be correctly interrupted and switched away from (e.g. a timer interrupt will wake up a blocked thread), but then the system will hang (or sometimes report a spurious exception) when switching back into idle.
There's a heisenbug quality to this too, where calling printk() inside the interrupt handler will magically "fix" the problem. My current theory is that the window spill code is broken somehow: calling into a complicated handler like printk() will flush the register windows for us and make the spill code a noop. The spill code is different between ESP-32 (which has 64 registers) and qemu (32), which might explain the difference.
This also is consistent with the "failure on switch back to idle" behavior -- a bad spill would only affect the old thread and not the new one.
The text was updated successfully, but these errors were encountered: