ESP-32 preemption regressions with asm2 #6346

andyross · 2018-02-23T22:36:40Z

In some cases (though apparently not all) the new asm2 code on esp32 is failing in situations where it is asked to switch back to a previously-interrupted thread from an otherwise working one. See #6339 for one report.

The most common case happens with the idle thread, which will be correctly interrupted and switched away from (e.g. a timer interrupt will wake up a blocked thread), but then the system will hang (or sometimes report a spurious exception) when switching back into idle.

There's a heisenbug quality to this too, where calling printk() inside the interrupt handler will magically "fix" the problem. My current theory is that the window spill code is broken somehow: calling into a complicated handler like printk() will flush the register windows for us and make the spill code a noop. The spill code is different between ESP-32 (which has 64 registers) and qemu (32), which might explain the difference.

This also is consistent with the "failure on switch back to idle" behavior -- a bad spill would only affect the old thread and not the new one.

locomuco · 2018-02-25T16:52:21Z

@andyross what ESP32 toolchain version is currently supported by zephyr? the documentation always points to the latest esp32 toolchain

When returning into a different thread than we interrupted, we obviously need to spill all the existing register windows to make sure all their values are in the old thread's stack. But the code to do this forgot to reset the current stack pointer to the value it had at interrupt time (it was still pointing to the saved context below that), so the caller of the interrupted function was spilling to the wrong spot. This wouldn't show up as an instant failure, it would only happen when switching BACK to the improperly-spilled thread. And even then it would be a noop if the original interrupt handler was deep enough to have spilled that function naturally. In practice, this happened only in some instances on ESP-32 (which has more windowed registers than qemu) when interrupting the idle thread (which is very shallow) with a (very simple) timer interrupt. Trivial to see, hard to find. See zephyrproject-rtos#6346 for more detail. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>

andyross · 2018-03-05T21:41:31Z

Got it. It wasn't the spill code, which had a decent unit test and was pretty simple. It wasn't the cross-stack call code, which was likewise testable in isolation but a little more subtle. It was the really obvious-looking glue between the two. Cross-stack calls return with the stack pointer still pointing to the restore area, which lives below the interrupted stack. But the register spills need the stack pointer to point to the actual function that got interrupted, otherwise its CALLER (xtensa register windows are kinda crazy) will spill to the wrong spot.

Turns out qemu never managed to hit this case because it has half as many windowed registers as ESP-32.

andyross · 2018-03-05T21:48:00Z

@locomuco Sorry, missed your question. We support only the official SDK toolchain from Esspressif on ESP-32. @lpereira can give more context, but broadly Xtensa toolchains are not portable between targets, they are hard-configured to assume specific hardware features. Part of the output of their CPU generator is an "overlay" source package you have to apply to a specific version of binutils/gcc, making a generic toolchain really hard to manage.

When returning into a different thread than we interrupted, we obviously need to spill all the existing register windows to make sure all their values are in the old thread's stack. But the code to do this forgot to reset the current stack pointer to the value it had at interrupt time (it was still pointing to the saved context below that), so the caller of the interrupted function was spilling to the wrong spot. This wouldn't show up as an instant failure, it would only happen when switching BACK to the improperly-spilled thread. And even then it would be a noop if the original interrupt handler was deep enough to have spilled that function naturally. In practice, this happened only in some instances on ESP-32 (which has more windowed registers than qemu) when interrupting the idle thread (which is very shallow) with a (very simple) timer interrupt. Trivial to see, hard to find. See #6346 for more detail. Signed-off-by: Andy Ross <andrew.j.ross@intel.com>

andyross self-assigned this Feb 23, 2018

andyross added bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug area: Xtensa Xtensa Architecture labels Feb 23, 2018

andyross mentioned this issue Feb 23, 2018

samples/drivers/gpio Sample doesn't work on ESP32 after SMP support was added #6339

Closed

linkmeyer added this to the v1.11.0 milestone Mar 5, 2018

lpereira mentioned this issue Mar 5, 2018

k_sleep() fails on ESP32 sometimes #6455

Closed

andyross mentioned this issue Mar 5, 2018

xtensa/asm2: Fix stack pointer during preemption spills #6465

Merged

nashif closed this as completed in #6465 Mar 6, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ESP-32 preemption regressions with asm2 #6346

ESP-32 preemption regressions with asm2 #6346

andyross commented Feb 23, 2018

locomuco commented Feb 25, 2018

andyross commented Mar 5, 2018

andyross commented Mar 5, 2018

ESP-32 preemption regressions with asm2 #6346

ESP-32 preemption regressions with asm2 #6346

Comments

andyross commented Feb 23, 2018

locomuco commented Feb 25, 2018

andyross commented Mar 5, 2018

andyross commented Mar 5, 2018