qemu_xtensa tests fail spuriously under sanitycheck/CI #11182
Comments
Same failures on qemu_xtensa, with the following exception:
@nategraff-sifive @andyross @andrewboie fyi, any idea what is going on here?
This is now happening more often with the new Qemu in the SDK.
I'm on commit 5addb3a and SDK 0.9.5, and I can't seem to make this test fail anymore. Running …
@nategraff-sifive does it fail for you on other commits? This is very difficult to reproduce locally; it seems to happen only in CI (under heavy load).
I seem to remember this test failing for me sometimes on another commit. Maybe I'm remembering another test failure.
I thought I had most of these issues fixed, but note that on a loaded host machine, architectures that need to reset the timer in the ISR for the next tick even in "non-tickless" mode (risc-v and xtensa are two of them!) are always going to have some exposure to this race. The root cause is that when you ask qemu for a cycle count, it gives you back "real time". So if the host OS blocks the process for a while (i.e. a host OS tick -- roughly the same size as our ticks), it will look as if the guest CPU clock got stalled, the tick that should have been scheduled won't be, and we'll miss a tick. The driver will (should) recover, but lots of the tests are looking for jitter like this and will fail.
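To make the race concrete, here is a minimal sketch of a tick ISR -- not the actual Zephyr driver code. `read_cycle_count()`, `set_compare()` and `announce_ticks()` are hypothetical stand-ins for the per-arch timer hooks, and `CYC_PER_TICK` is an illustrative constant:

```c
#include <stdint.h>

#define CYC_PER_TICK 10000u                  /* illustrative cycles per tick */

extern uint32_t read_cycle_count(void);      /* under qemu: derived from host real time */
extern void set_compare(uint32_t cyc);       /* program the next timer interrupt */
extern void announce_ticks(uint32_t n);      /* tell the kernel how many ticks elapsed */

static uint32_t last_tick_cyc;               /* cycle count of the last announced tick */

void timer_isr(const void *arg)
{
	(void)arg;

	uint32_t now = read_cycle_count();

	/* A naive reload ("last_tick_cyc += CYC_PER_TICK", announce one tick)
	 * assumes the ISR ran close to the compare point.  If the host OS
	 * parked the qemu process for a host tick, 'now' is already well past
	 * the next boundary, the reload is late, and tests that measure tick
	 * jitter see the glitch.
	 *
	 * A catch-up reload stays correct: count every whole tick that really
	 * elapsed and announce them all.
	 */
	uint32_t dticks = (now - last_tick_cyc) / CYC_PER_TICK;

	last_tick_cyc += dticks * CYC_PER_TICK;
	set_compare(last_tick_cyc + CYC_PER_TICK);
	announce_ticks(dticks);
}
```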
FWIW, the newer riscv timer driver in #10556 seems a little more robust than the old one and doesn't spuriously fail, at least not in a few iterations of sanitycheck on my build box. I'm lost on the Xtensa bug. It seems like qemu is throwing that exception almost immediately after boot; it hits about 10 qemu_xtensa tests on a given run of sanitycheck, might be dependent on host load (I couldn't make it happen on an isolated run, anyway), and doesn't seem to have anything to do with the application code at all (it randomly affects anything, at least as far as I can see). I'm tempted to blame this on a qemu bug, but it's worth investigating for sure.
Spent some time on this this morning. I think I'm comfortable putting the RISC-V bug to bed. It doesn't happen anymore. The combination of the new (still unmerged) driver and "qemu hardening" seems to have done the trick. I've done many sanitycheck runs over qemu_riscv32 and haven't seen this once.

And I have a proximate cause for the Xtensa crash: the timer interrupt is firing very early, before the _current thread pointer is set up. That's not supposed to happen. Interrupts are supposed to be disabled in early boot and into _Cstart(), and then enabled only once threading is ready and drivers are ready to come up. This matches the observed behavior well: when the host is unloaded (or when we're on real hardware), the guest proceeds through initialization quickly and doesn't get a chance to see the interrupt fire. When the host is loaded, it can be descheduled for a bit, and when the VM "catches up" it throws the interrupt. So you only see it in a loaded sanitycheck.

Yet I can't seem to work around it with anything obvious. Taking an irq_lock() at the top of _Cstart() is too late to catch all the crashes. Setting the interrupt mask bits all to zero at the top of the Xtensa reset vector doesn't work, because something (not the timer driver!) is turning it back on. I thought for a while I'd caught a Qemu bug: the timer interrupt we're using (interrupt 10) is at priority 3, which is the highest maskable priority. I thought for a bit that Qemu had goofed and made this level unmaskable, but a quick test seems to show that it's OK. I think.

Anyway, I'll play with it more on Monday. Executive summary is that I'm 80% certain this is just a VM glitch with (once I find it) an easy workaround, and not a scary bug on our side.
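For reference, here is a hedged sketch (C with GCC inline asm, not the actual Zephyr reset code) of the two Xtensa masking mechanisms being juggled above: the per-source INTENABLE mask and the PS.INTLEVEL priority mask.

```c
/* Illustrative only: the real work happens in the assembly reset vector. */
static inline void xtensa_mask_early_interrupts(void)
{
	/* 1) Clear the per-source enable mask.  This is the "set the
	 *    interrupt mask bits to zero at the top of the reset vector"
	 *    approach -- and it is defeated if anything later writes
	 *    INTENABLE again before the kernel is ready.
	 */
	__asm__ volatile("movi a2, 0\n\t"
			 "wsr  a2, INTENABLE\n\t"
			 "rsync" ::: "a2");

	/* 2) Raise the CPU interrupt level instead.  With PS.INTLEVEL >= 3,
	 *    the level-3 timer interrupt (interrupt 10 here) stays masked
	 *    regardless of what INTENABLE says.
	 */
	unsigned int old_ps;
	__asm__ volatile("rsil %0, 3" : "=r"(old_ps));
	(void)old_ps;
}
```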
Zephyr isn't ready to handle interrupts until the threading/scheduler are set up and we make our first context switch. This was a semi-hidden bug: only the timer interrupt would actually get unmasked before the system was ready, and obviously it would never have time to fire a tick before the system completed initialization. But a combination of system load and a new version of Qemu (which seems to be more sensitive to non-deterministic timing glitchery) has made this visible.

About 2-3% of the time when run under a full sanitycheck, the qemu process will get swapped away for long enough that the tick timer expires before _Cstart() has reached enable_multithreading().

It looks like the original code was cut and pasted from another implementation, which was expected to call into an "application" main() routine that wanted interrupts ready.

Fixes zephyrproject-rtos#11182

(Note also that this code is not used for ESP-32, which has its own startup path.)

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
Should have started looking yesterday. It didn't need much searching; it was pretty obvious, and even called out in comments, that the code was explicitly enabling interrupts (because the original idea was to enter an "app" that expected interrupts to be set up, AFAICT). Fixes the crazy crashes on boot in CI. I still see a few (0-2 per run) of the spurious timer asserts on xtensa, both with master and the new timer driver (though the new one seems maybe a bit better). The new qemu is definitely more sensitive to host timing.
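In other words, the boot-ordering point is roughly this (a hedged sketch reusing the _Cstart()/enable_multithreading() names from the discussion above; everything else is illustrative, not the actual Zephyr startup code):

```c
extern void _Cstart(void);      /* kernel entry named in the discussion above */

/* Hypothetical early-boot C stub reached from the reset vector. */
void boot_entry(void)
{
	/*
	 * Wrong (the cut-and-pasted behavior): unmask interrupts here,
	 * because the original code expected to jump into an application
	 * main() that wanted them ready.  Under a loaded host, qemu can
	 * deliver a timer tick in this window, before the _current thread
	 * pointer exists, and the ISR crashes.
	 *
	 *     enable_interrupts();    // removed by the fix
	 */

	/*
	 * Right: enter the kernel with interrupts still masked.  They are
	 * only unmasked once the scheduler is up, i.e. after _Cstart() has
	 * reached enable_multithreading() and made the first context switch.
	 */
	_Cstart();
}
```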
Test keeps failing randomly in CI