qemu_xtensa tests fail spuriously under sanitycheck/CI #11182
Comments
Same failures on qemu_xtensa, with the following exception:
@nategraff-sifive @andyross @andrewboie fyi, any idea what is going on here?
This is now happening more often with the new Qemu in the SDK.
I'm on commit 5addb3a and SDK 0.9.5, and I can't seem to make this test fail anymore. Running …
@nategraff-sifive does it fail for you on other commits? This is very difficult to reproduce locally; it seems to happen only in CI (under heavy load).
I seem to remember this test failing for me sometimes on another commit. Maybe I'm remembering another test failure.
I thought I had most of these issues fixed, but note that on a loaded host machine, architectures that need to reset the timer in the ISR for the next tick even in "non-tickless" mode (risc-v and xtensa are two of them!) are always going to have some exposure to this race. The root cause is that when you ask qemu for a cycle count, it gives you back "real time". So if the host OS blocks the process for a while (i.e. a host OS tick -- roughly the same size as our ticks), it will look as if the guest CPU clock got stalled, the tick that should have been scheduled won't be, and we'll miss a tick. The driver will (should) recover, but lots of the tests are looking for jitter like this and will fail.
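To make the race concrete, here is a minimal sketch of a tick ISR -- not the actual Zephyr driver code. `read_cycle_count()`, `set_compare()` and `announce_ticks()` are hypothetical stand-ins for the per-arch timer hooks, and `CYC_PER_TICK` is an illustrative constant:

```c
#include <stdint.h>

#define CYC_PER_TICK 10000u                  /* illustrative cycles per tick */

extern uint32_t read_cycle_count(void);      /* under qemu: derived from host real time */
extern void set_compare(uint32_t cyc);       /* program the next timer interrupt */
extern void announce_ticks(uint32_t n);      /* tell the kernel how many ticks elapsed */

static uint32_t last_tick_cyc;               /* cycle count of the last announced tick */

void timer_isr(const void *arg)
{
	(void)arg;

	uint32_t now = read_cycle_count();

	/* A naive reload ("last_tick_cyc += CYC_PER_TICK", announce one tick)
	 * assumes the ISR ran close to the compare point.  If the host OS
	 * parked the qemu process for a host tick, 'now' is already well past
	 * the next boundary, the reload is late, and tests that measure tick
	 * jitter see the glitch.
	 *
	 * A catch-up reload stays correct: count every whole tick that really
	 * elapsed and announce them all.
	 */
	uint32_t dticks = (now - last_tick_cyc) / CYC_PER_TICK;

	last_tick_cyc += dticks * CYC_PER_TICK;
	set_compare(last_tick_cyc + CYC_PER_TICK);
	announce_ticks(dticks);
}
```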
FWIW, the newer riscv timer driver in #10556 seems a little more robust than the old one and doesn't spuriously fail, at least not in a few iterations of sanitycheck on my build box. I'm lost on the Xtensa bug. It seems like qemu is throwing that exception almost immediately after boot; it hits about 10 qemu_xtensa tests on a given run of sanitycheck, might be dependent on host load (I couldn't make it happen on an isolated run, anyway), and doesn't seem to have anything to do with the application code at all (it randomly affects anything, at least as far as I can see). I'm tempted to blame this on a qemu bug, but it's worth investigating for sure.
Spent some time on this this morning. I think I'm comfortable putting the RISC-V bug to bed. It doesn't happen anymore. The combination of the new (still unmerged) driver and "qemu hardening" seems to have done the trick. I've done many sanitycheck runs over qemu_riscv32 and haven't seen this once.

And I have a proximate cause for the Xtensa crash: the timer interrupt is firing very early, before the _current thread pointer is set up. That's not supposed to happen. Interrupts are supposed to be disabled in early boot and into _Cstart(), and then enabled only once threading is ready and drivers are ready to come up. This matches the observed behavior well: when the host is unloaded (or when we're on real hardware), the guest proceeds through initialization quickly and doesn't get a chance to see the interrupt fire. When the host is loaded, it can be descheduled for a bit, and when the VM "catches up" it throws the interrupt. So you only see it in a loaded sanitycheck.

Yet I can't seem to work around it with anything obvious. Taking an irq_lock() at the top of _Cstart() is too late to catch all the crashes. Setting the interrupt mask bits all to zero at the top of the Xtensa reset vector doesn't work, because something (not the timer driver!) is turning it back on. I thought for a while I'd caught a Qemu bug: the timer interrupt we're using (interrupt 10) is at priority 3, which is the highest maskable priority. I thought for a bit that Qemu had goofed and made this level unmaskable, but a quick test seems to show that it's OK. I think.

Anyway, I'll play with it more on Monday. Executive summary is that I'm 80% certain this is just a VM glitch with (once I find it) an easy workaround, and not a scary bug on our side.
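For reference, here is a hedged sketch (C with GCC inline asm, not the actual Zephyr reset code) of the two Xtensa masking mechanisms being juggled above: the per-source INTENABLE mask and the PS.INTLEVEL priority mask.

```c
/* Illustrative only: the real work happens in the assembly reset vector. */
static inline void xtensa_mask_early_interrupts(void)
{
	/* 1) Clear the per-source enable mask.  This is the "set the
	 *    interrupt mask bits to zero at the top of the reset vector"
	 *    approach -- and it is defeated if anything later writes
	 *    INTENABLE again before the kernel is ready.
	 */
	__asm__ volatile("movi a2, 0\n\t"
			 "wsr  a2, INTENABLE\n\t"
			 "rsync" ::: "a2");

	/* 2) Raise the CPU interrupt level instead.  With PS.INTLEVEL >= 3,
	 *    the level-3 timer interrupt (interrupt 10 here) stays masked
	 *    regardless of what INTENABLE says.
	 */
	unsigned int old_ps;
	__asm__ volatile("rsil %0, 3" : "=r"(old_ps));
	(void)old_ps;
}
```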
Zephyr isn't ready to handle interrupts until the threading/scheduler are set up and we make our first context switch. This was a semi-hidden bug: only the timer interrupt would actually get unmasked before the system was ready, and obviously it would never have time to fire a tick before the system completed initialization. But a combination of system load and a new version of Qemu (which seems to be more sensitive to non-deterministic timing glitchery) has made this visible.

About 2-3% of the time when run under a full sanitycheck, the qemu process will get swapped away for long enough that the tick timer expires before _Cstart() has reached enable_multithreading().

It looks like the original code was cut and pasted from another implementation, which was expected to call into an "application" main() routine that wanted interrupts ready.

Fixes zephyrproject-rtos#11182

(Note also that this code is not used for ESP-32, which has its own startup path.)

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
Should have started looking yesterday. It didn't need much searching; it was pretty obvious, and even called out in comments, that the code was explicitly enabling interrupts (because the original idea was to enter an "app" that expected interrupts to be set up, AFAICT). Fixes the crazy crashes on boot in CI. I still see a few (0-2 per run) of the spurious timer asserts on xtensa, both with master and the new timer driver (though the new one seems maybe a bit better). The new qemu is definitely more sensitive to host timing.
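In other words, the boot-ordering point is roughly this (a hedged sketch reusing the _Cstart()/enable_multithreading() names from the discussion above; everything else is illustrative, not the actual Zephyr startup code):

```c
extern void _Cstart(void);      /* kernel entry named in the discussion above */

/* Hypothetical early-boot C stub reached from the reset vector. */
void boot_entry(void)
{
	/*
	 * Wrong (the cut-and-pasted behavior): unmask interrupts here,
	 * because the original code expected to jump into an application
	 * main() that wanted them ready.  Under a loaded host, qemu can
	 * deliver a timer tick in this window, before the _current thread
	 * pointer exists, and the ISR crashes.
	 *
	 *     enable_interrupts();    // removed by the fix
	 */

	/*
	 * Right: enter the kernel with interrupts still masked.  They are
	 * only unmasked once the scheduler is up, i.e. after _Cstart() has
	 * reached enable_multithreading() and made the first context switch.
	 */
	_Cstart();
}
```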
Test keeps failing randomly in CI