
qemu_xtensa tests fail spuriously under sanitycheck/CI #11182

Closed
nashif opened this issue Nov 7, 2018 · 10 comments

Labels: bug (The issue is a bug, or the PR is fixing a bug), priority: high (High impact/importance bug)

@nashif (Member) commented Nov 7, 2018

This test keeps failing randomly in CI:

***** Booting Zephyr OS zephyr-v1.13.0-1670-gf403bfa343 *****
Running test suite timer_api
===================================================================
starting test - test_timer_duration_period
PASS - test_timer_duration_period
===================================================================
starting test - test_timer_period_0

    Assertion failed at /home/buildslave/src/github.com/zephyrproject-rtos/zephyr/tests/kernel/timer/timer_api/src/main.c:159: test_timer_period_0: tdata.expire_cnt == 1 is false

FAIL - test_timer_period_0
===================================================================
starting test - test_timer_expirefn_null
PASS - test_timer_expirefn_null
===================================================================
starting test - test_timer_periodicity
PASS - test_timer_periodicity
===================================================================
starting test - test_timer_status_get
PASS - test_timer_status_get
===================================================================
starting test - test_timer_status_get_anytime
PASS - test_timer_status_get_anytime
===================================================================
starting test - test_timer_status_sync
PASS - test_timer_status_sync
===================================================================
starting test - test_timer_k_define
PASS - test_timer_k_define
===================================================================
starting test - test_timer_user_data
PASS - test_timer_user_data
===================================================================
===================================================================
PROJECT EXECUTION FAILED
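
For reference, the assertion that trips follows roughly this pattern (a hedged sketch against the 1.13-era k_timer/ztest API; DURATION, the counter name and the busy-wait margin are illustrative, not copied from tests/kernel/timer/timer_api/src/main.c):

/*
 * Sketch only -- not the actual test source.  A one-shot timer (period 0)
 * is started, the test busy-waits past the expected expiry, then asserts
 * that the expiry callback ran exactly once.
 */
#include <ztest.h>

#define DURATION 100                    /* ms until the one-shot timer expires */

static struct k_timer timer;
static volatile int expire_cnt;

static void timer_expire(struct k_timer *t)
{
	expire_cnt++;
}

void test_timer_period_0_sketch(void)
{
	expire_cnt = 0;
	k_timer_init(&timer, timer_expire, NULL);
	k_timer_start(&timer, DURATION, 0);     /* period 0 => single expiry */

	/* Busy-wait a bit past the expiry point, then check.  If the tick
	 * that delivers the expiry arrives late -- e.g. because the qemu
	 * process was descheduled on a loaded host -- expire_cnt is still 0
	 * here and the assertion fails exactly as in the log above. */
	k_busy_wait((DURATION + 1) * 1000);

	zassert_true(expire_cnt == 1, "expire_cnt == 1 is false");
	k_timer_stop(&timer);
}
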
@nashif added the "bug" label Nov 7, 2018
@nashif (Member, Author) commented Nov 7, 2018

The same test fails on qemu_xtensa, with the following exception:

 ** FATAL EXCEPTION
 ** CPU 0 EXCCAUSE 29 PS 0x00060320 PC 0x60002821 VADDR 0x00000048
 **  A0 0xa0001ef1  SP 0x600052f0  A2 0x63fffeb0  A3 0x60005784
 **  A4 0x63ffffd8  A5 0x00000000  A6 0x00000000  A7 0x00000000
 **  A8 0x00000000  A9 0x600052d0 A10 0x00000000 A11 0x60005770
 ** A12 0x00000001 A13 0x60005778 A14 0x00000000 A15 0x00000000
 ** SAR 0x00000007
Current thread ID = 0x00000000
Faulting instruction address = 0xdeaddead
Fatal fault in ISR! Spinning...

@nashif added the "priority: high" label Nov 7, 2018
@nashif (Member, Author) commented Nov 7, 2018

@nategraff-sifive @andyross @andrewboie fyi, any idea what is going on here?

@nashif (Member, Author) commented Nov 7, 2018

This is now happening more often with the new Qemu in the SDK.

@nategraff-sifive (Contributor) commented:

I'm on commit 5addb3a and SDK 0.9.5 and I can't seem to make this test fail anymore.

Running ./scripts/sanitycheck -p qemu_riscv32 -s tests/kernel/timer/timer_api/kernel.timer

@nashif (Member, Author) commented Nov 8, 2018

@nategraff-sifive does it fail for you on other commits? This is very difficult to reproduce locally; it seems to happen only in CI (under heavy load).

@nategraff-sifive (Contributor) commented Nov 8, 2018

I seem to remember this test failing for me sometimes on another commit. Maybe I'm remembering another test failure.

@andyross (Contributor) commented Nov 8, 2018

I thought I had most of these issues fixed, but note that on a loaded host machine, architectures that need to reset the timer in the ISR for the next tick even in "non-tickless" mode (RISC-V and Xtensa are among them) are always going to have some exposure to this race.

The root cause is that when you ask qemu for a cycle count, it gives you back "real time". So if the host OS blocks the process for a while (i.e. for a host OS tick, roughly the same size as our ticks), it will look as if the guest CPU clock stalled: the tick that should have been scheduled won't fire, and we'll miss a tick. The driver will (should) recover, but lots of the tests are looking for exactly this kind of jitter and will fail.
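
To make that concrete, here is a minimal standalone model of the race (a sketch under stated assumptions, not the Zephyr driver: TICK, timer_isr(), guest_runs_until() and the three-tick stall are all hypothetical). Because the "cycle counter" tracks host real time, a stall of the qemu process crosses several tick boundaries but delivers only one interrupt, so the ticks announced to the kernel fall behind real time even though ticking resumes normally afterwards; a test that compares expiry counts against a wait window measured in raw cycles sees exactly that jitter:

/* Standalone model of the missed-tick race -- not Zephyr driver code. */
#include <stdint.h>
#include <stdio.h>

#define TICK 10000u               /* cycles per OS tick (hypothetical) */

static uint32_t now;              /* qemu "cycle counter" == host real time */
static uint32_t compare = TICK;   /* next programmed compare value */
static uint32_t announced;        /* ticks announced to the kernel */

/* Non-tickless ISR pattern: announce one tick, then re-arm the compare
 * register relative to the current counter value. */
static void timer_isr(void)
{
	announced++;
	compare = now + TICK;
}

/* The guest only observes a compare match while it is actually running. */
static void guest_runs_until(uint32_t t)
{
	while (now < t) {
		now += TICK / 4;
		if ((int32_t)(now - compare) >= 0)
			timer_isr();
	}
}

int main(void)
{
	guest_runs_until(2 * TICK);   /* unloaded host: one announce per tick */

	/* Host deschedules the qemu process: real time (and therefore the
	 * guest cycle counter) jumps by several ticks, but only one timer
	 * interrupt is pending when the guest resumes. */
	now += 3 * TICK;
	if ((int32_t)(now - compare) >= 0)
		timer_isr();

	guest_runs_until(8 * TICK);

	/* Prints 8 real-time ticks elapsed but only 6 ticks announced: the
	 * ticks covered by the stall are never delivered to the kernel. */
	printf("real-time ticks elapsed: %u, ticks announced: %u\n",
	       (unsigned)(now / TICK), (unsigned)announced);
	return 0;
}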

@andyross (Contributor) commented:

FWIW, the newer riscv timer driver in #10556 seems a little more robust than the old one and doesn't spuriously fail, at least not in a few iterations of sanitycheck on my build box.

I'm lost on the Xtensa bug. It seems like qemu is tossing that exception almost immediately after boot; it hits about 10 qemu_xtensa tests on a given run of sanitycheck, might be dependent on host load (I couldn't make it happen on an isolated run, anyway), and doesn't seem to have anything to do with the application code at all (it randomly affects anything, as far as I can see). I'm tempted to blame this on a qemu bug, but it's worth investigating for sure.

@andyross self-assigned this Nov 10, 2018
@andyross (Contributor) commented:

Spent some time on this this morning. I think I'm comfortable putting the RISC-V bug to bed. It doesn't happen anymore. The combination of the new (still unmerged) driver and "qemu hardening" seems to have done the trick. I've done many sanitycheck runs over qemu_riscv32 and haven't seen this once.

And I have a proximate cause for the Xtensa crash: the timer interrupt is firing very early, before the _current thread pointer is set up. That's not supposed to happen. Interrupts are supposed to stay disabled in early boot and into _Cstart(), and then be enabled only once threading is ready and drivers are being brought up. This matches the observed behavior well: when the host is unloaded (or when we're on real hardware), the guest proceeds through initialization quickly and never gets a chance to see the interrupt fire. When the host is loaded, it can be descheduled for a bit, and when the VM "catches up" it throws the interrupt. So you only see it in a loaded sanitycheck run.

Yet I can't seem to work around it with anything obvious. Taking an irq_lock() at the top of _Cstart() is too late to catch all the crashes. Setting the interrupt mask bits all to zero at the top of the Xtensa reset vector doesn't work, because something (not the timer driver!) is turning them back on. I thought for a while I'd caught a Qemu bug: the timer interrupt we're using (interrupt 10) is at priority 3, which is the highest maskable priority, and for a bit I suspected Qemu had goofed and made this level unmaskable, but a quick test seems to show that it's OK. I think.

Anyway, I'll play with it more on Monday. Executive summary: I'm 80% certain this is just a VM glitch with (once I find it) an easy workaround, and not a scary bug on our side.
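
To illustrate the ordering problem, here is a standalone model of the crash (a sketch, not Zephyr code: struct k_thread_model, its pad/flags members, irqs_enabled and the boot sequence below are hypothetical). A tick ISR that dereferences the current-thread pointer before the scheduler has published it loads from NULL plus a small member offset, which is exactly the shape of the VADDR 0x00000048 fault in the log above:

/* Standalone model of the boot-ordering bug -- not Zephyr source. */
#include <stddef.h>
#include <stdio.h>

struct k_thread_model {
	char pad[0x48];          /* members before the one the ISR touches */
	unsigned int flags;      /* hypothetical field at offset 0x48 */
};

static struct k_thread_model *current;   /* NULL until the scheduler starts */
static int irqs_enabled;

static void tick_isr(void)
{
	if (current == NULL) {
		/* In the guest this would be a load from 0x0 + 0x48, i.e.
		 * the VADDR 0x00000048 in the fatal exception above. */
		printf("ISR before scheduler init: would fault at VADDR 0x%zx\n",
		       offsetof(struct k_thread_model, flags));
		return;
	}
	current->flags |= 1u;
}

int main(void)
{
	static struct k_thread_model main_thread;

	/* Buggy ordering: startup code enables interrupts before _Cstart()
	 * has set up threading, so a tick can arrive "too early" whenever
	 * the host stalls the VM long enough during early boot. */
	irqs_enabled = 1;
	if (irqs_enabled)
		tick_isr();              /* fires before 'current' is set */

	/* Fixed ordering: stay masked until the scheduler is ready. */
	current = &main_thread;
	if (irqs_enabled)
		tick_isr();              /* now safe */

	return 0;
}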

@andyross changed the title from "tests/kernel/timer/timer_api/kernel.timer fails sporadically on riscv32 qemu" to "qemu_xtensa tests fail spuriously under sanitycheck/CI" Nov 10, 2018
andyross pushed a commit to andyross/zephyr that referenced this issue Nov 11, 2018
Zephyr isn't ready to handle interrupts yet, until the
threading/scheduler are set up and we make our first context switch.
This was a semi-hidden bug: only the timer interrupt would actually
get unmasked before the system was ready, and obviously would never
have time to fire a tick before the system completed initialization.
But a combination of system load and a new version of Qemu (which
seems to be more sensitive to non-deterministic timing glitchery) has
made this visible.  About 2-3% of the time when run under a full
sanitycheck, the qemu process will get swapped away for long enough
that the tick timer expires before _Cstart() has reached
enable_multithreading().

It looks like the original code was cut and pasted from another
implementation, which was expected to call into an "application"
main() routine that wanted interrupts ready.

Fixes zephyrproject-rtos#11182

(Note also that this code is not used for ESP-32, which has its own
startup path)

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
@andyross (Contributor) commented:

Should have started looking yesterday. It didn't need much searching: it was pretty obvious, and even called out in comments, that the code was explicitly enabling interrupts (because the original idea was to enter an "app" that expected interrupts to be set up, AFAICT). Fixing that stops the crazy crashes on boot in CI.

I still see a few (0-2 per run) of the spurious timer asserts on xtensa, both with master and with the new timer driver (though the new one maybe seems a bit better). The new qemu is definitely more sensitive to host timing.

nashif pushed a commit that referenced this issue Nov 12, 2018
nashif pushed a commit that referenced this issue Nov 13, 2018