Configure QEMU to run independent of the host clock #14173

Closed
3 of 15 tasks
andrewboie opened this issue Mar 7, 2019 · 36 comments · Fixed by #22904

@andrewboie (Contributor) commented Mar 7, 2019

We have been struggling for years with issues related to how QEMU attempts to synchronize guest timer interrupts with the host clock, for example #12553. The symptom is that heavily loaded sanitycheck runs have tests spuriously failing due to timing-related issues.

This creates noise in our CI runs that masks true bugs in our system which manifest only intermittently, causing real issues that will happen all the time at scale to be 'swept under the rug': right now, any time a test fails, sanitycheck retries it a few times, and only consecutive failures produce an error.

QEMU does nominally support some options to decouple the guest from the host clock:

-rtc [base=utc|localtime|date][,clock=host|rt|vm][,driftfix=none|slew]
                set the RTC base and clock, enable drift fix for clock ticks (x86 only)
-icount [shift=N|auto][,align=on|off][,sleep=on|off,rr=record|replay,rrfile=<filename>,rrsnapshot=<snapshot>]
                enable virtual instruction counter with 2^N clock ticks per
                instruction, enable aligning the host and virtual clocks
                or disable real time cpu sleeping
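
For context: with -icount shift=N, the virtual clock advances 2^N nanoseconds of guest time per executed instruction, so guest time becomes a function of instructions executed rather than of the host wall clock (shift=4 models one instruction every 16 ns, i.e. roughly a 62.5 MIPS CPU). As a minimal sketch, assuming an mps2-an385 guest and an illustrative ELF path (in practice the flags come from each board.cmake), a decoupled invocation might look like:

    qemu-system-arm -machine mps2-an385 -nographic \
        -icount shift=4,align=off,sleep=off \
        -rtc clock=vm \
        -kernel build/zephyr/zephyr.elf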

Unfortunately, to date nobody has gotten this working.

The objective is as follows: for all architectures that use QEMU for sanitycheck runs (currently x86, x86_64, arm, riscv32, xtensa, nios2), configure QEMU to use icount.

Beware that there may be bugs in QEMU which prevent this from working, and whoever works on this should be prepared to dive into the QEMU source and/or work with QEMU upstream.

Also, it's worth noting that moving ANY arch to use icount is progress, so even if we can get only some but not all arches working with this, that is worth doing. At this time I recommend x86 and ARM (the qemu_x86 and mps2_an385 targets) as priorities. But eventually we need them all to finally get rid of these spurious sanitycheck failures.

If we manage to get this working for all arches, then we can remove the logic in CI which keeps retrying failed tests.

Known build targets that use QEMU emulation; each tickbox is a separate board.cmake that sets QEMU_FLAGS_${ARCH} (see the sketch after this list):

  • qemu_xtensa
  • mps2_an385
  • qemu_cortex_m0
  • qemu_cortex_m3
  • mps2_an521
  • qemu_cortex_r5
  • qemu_cortex_a53
  • qemu_nios2
  • qemu_riscv32
  • qemu_riscv64
  • hifive1
  • qemu_x86 (and related targets)
  • qemu_x86_64
  • qemu_arc_em
  • qemu_arc_hs

Check these tickboxes as patches are merged (not submitted) to indicate that they are now using icount.
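
For illustration, a minimal board.cmake sketch, assuming the mps2_an385 target and the shift value under test in this thread (the -cpu/-machine lines stand in for whatever the board already sets; only the -icount/-rtc additions are the point here):

    set(QEMU_CPU_TYPE_${ARCH} cortex-m3)
    set(QEMU_FLAGS_${ARCH}
      -cpu ${QEMU_CPU_TYPE_${ARCH}}
      -machine mps2-an385
      -nographic
      # decouple guest time from the host clock
      -icount shift=4,align=off,sleep=off
      -rtc clock=vm
      )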

@andrewboie andrewboie added Enhancement Changes/Updates/Additions to existing features priority: high High impact/importance bug labels Mar 7, 2019
@andrewboie andrewboie changed the title Configure QEMU to run independent of the host clock for all arches Configure QEMU to run independent of the host clock Mar 7, 2019
@wentongwu (Contributor) commented:

I did some sanitycheck runs with "-icount shift=4,align=off,sleep=off" on qemu_x86 and mps2_an385; only tests/kernel/critical/kernel_critical failed, on mps2_an385. Running that case (tests/kernel/critical/kernel_critical) by itself on mps2_an385, I can get it to pass, but it takes a long time to execute, which is why it times out in sanitycheck runs. This means icount mode produces deterministic output but is not cycle-accurate.
I will review the QEMU code to understand icount mode better, and then plan the next steps.

@andrewboie (Contributor, Author) commented:

@wentongwu thanks for the update :)

@andrewboie (Contributor, Author) commented:

I remember tests/kernel/critical takes a while even without icount, so perhaps we need to change the test to not take so much time.

@wentongwu (Contributor) commented:

I ran several rounds of stress tests and found more failures/timeouts; most of them are related to timer accuracy. I need to do more debugging and code review.

andrewboie pushed a commit to andrewboie/zephyr that referenced this issue Jun 6, 2019
This test is already flaky, but becomes even flakier when
coverage is enabled.

Disable until we put a stake through the QEMU timing issues
being worked on in zephyrproject-rtos#14173.

Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
@wentongwu (Contributor) commented:

With the patch below, I ran sanitycheck again with "-icount shift=4,align=off,sleep=off -rtc clock=vm" on qemu_x86 and mps2_an385; so far no error has occurred. Still reviewing the QEMU code...

diff --git a/tests/kernel/critical/src/main.c b/tests/kernel/critical/src/main.c
index 372f931fa1..38d26a987c 100644
--- a/tests/kernel/critical/src/main.c
+++ b/tests/kernel/critical/src/main.c
@@ -32,8 +32,8 @@
#include <linker/sections.h>
#include <ztest.h>

-#define NUM_MILLISECONDS 5000
-#define TEST_TIMEOUT 20000
+#define NUM_MILLISECONDS 500
+#define TEST_TIMEOUT 2000

static u32_t critical_var;
static u32_t alt_thread_iterations;

@wentongwu (Contributor) commented Jun 7, 2019

With the patch below applied in each tree, four processes ran "for i in {1..10000}; do ./scripts/sanitycheck -p qemu_x86 -p mps2_an385 -T tests/kernel ; done" individually in four different source trees on the same machine (Ubuntu 18.04): two processes with the "-icount shift=4,align=off,sleep=off -rtc clock=vm" configuration, one process with "-icount shift=4,align=off,sleep=off", and the last one with the default configuration (the original configuration, with no QEMU changes). So far the default one fails easily; the others run correctly.

diff --git a/scripts/sanitycheck b/scripts/sanitycheck
index 662172cda2..12788cf555 100755
--- a/scripts/sanitycheck
+++ b/scripts/sanitycheck
@@ -1453,7 +1453,7 @@ testcase_valid_keys = {"tags": {"type": "set", "required": False},
                        "build_on_all": {"type": "bool", "default": False},
                        "skip": {"type": "bool", "default": False},
                        "slow": {"type": "bool", "default": False},
-                       "timeout": {"type": "int", "default": 60},
+                       "timeout": {"type": "int", "default": 600},
                        "min_ram": {"type": "int", "default": 8},
                        "depends_on": {"type": "set"},
                        "min_flash": {"type": "int", "default": 32},

@andrewboie (Contributor, Author) commented:

Five minutes is a lot for a default. How many / which tests start taking longer than 60 seconds with the icount change?

@andrewboie (Contributor, Author) commented:

"-icount shift=4,align=off,sleep=off -rtc clock=vm"

Can you give some detail on how these parameters were selected? What do you observe for different shift values, and why did you settle on 4?

marc-hb added a commit to marc-hb/zephyr that referenced this issue Jun 21, 2019
We already have the info so let's show it. This helps spot intermittent
issues[*], gives an indication of the time --build-only saves, can help
spot an overloaded test system, highlights the most time-consuming tests
which may need a longer timeout in their config, shows the effective
timeout value when one occurs... all this for a dirt cheap screen estate
price and two extra lines of code.

Sample -v output:

32/81 board123  tests/testme              PASSED (qemu 2.049s)
33/81 board456  samples/hello             PASSED (build)
34/81 qemu_x3   tests/kernel.stack.usage  FAILED: timeout (qemu 60.029s)
     see: sanity-out/qemu_x3/tests/kernel.stack.usage/handler.log
35/81 board456  tests/testme              PASSED (build)
36/81 qemu_x5   tests/kernel.queue        FAILED: failed (qemu 2.191s)
     see: sanity-out/qemu_x5/tests/kernel.queue/handler.log

[*] running qemu in heavily packed cloud virtual machines comes to mind,
    also see zephyrproject-rtos#12553, zephyrproject-rtos#14173 etc.

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
marc-hb added a commit to marc-hb/zephyr that referenced this issue Jun 27, 2019
Quoting bug "Configure QEMU to run independent of the host clock zephyrproject-rtos#14173"

  We have been struggling for years with issues related to how QEMU
  attempts to synchronize guest timer interrupts with the host clock,
  for example zephyrproject-rtos#12553. The symptom is that heavily loaded sanitycheck
  runs have tests spuriously failing due to timing related issues.

  This creates noise in our CI runs which masks true bugs in our system
  which manifest only intermittently, causing real issues that will
  happen all the time at scale to be 'swept under the rug'; right now
  any time a test fails sanitycheck retries it a few times and only
  consecutive failures produce an error.

There's also a lot of relevant information and more links in:
"List of tests that keep failing sporadically" zephyrproject-rtos#12553

This new "emu_time" tag helps by letting users either select or exclude
the tests that really need accurate time to pass and have a high chance
to actually be impacted by this emulation issue. As an example, it's
only with 'sanitycheck --exclude emu_time' that I could spot and file
intermittent but non-emu_time issue zephyrproject-rtos#16915. As Andrew predicted above,
it was drowned in emu_time noise before that.

Conversely, "--tag emu_time" can be used by developers focusing on
fixing qemu's -icount feature, for instance zephyrproject-rtos#14173 or others.

Even before qemu's -icount is fixed, Continuous Integration could be
split in two separate runs: A. --tag emu_time and B. --exclude
emu_time. Only A tests would be allowed retries which would stop hiding
other, unrelated intermittent issues affecting B tests.

This initial commit does not pretend to exhaustively tag all affected
tests. However it's an already functional and useful start of 14 tests
collected from and tested over weeks of local sanitycheck runs and
_partially_ reviewed by qemu clock expert Andy Ross.

This commit also increases the individual timeout of 7 tests that have
been observed to consistently take much longer than the
median (2-3s). This flags how unusually long these are, lets users
temporarily reduce the very long 60s default timeout in their local
workspace and finally should reduce the chance of them timing out on a
heavily loaded system. Set their timeout to 3-4 times the duration
observed in CI and locally.

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
@wentongwu (Contributor) commented:

@andrewboie if 10 PRs are submitted at the same time, what is CI's behavior? Does it run shippable for each PR one by one, or for all the PRs at the same time? Thanks

@andrewboie (Contributor, Author) commented:

@wentongwu it runs it on the last patch in the series.
Having said that, your PRs need to be bisectable without breaking the build.

@wentongwu (Contributor) commented:

> @wentongwu it runs it on the last patch in the series.
> Having said that, your PRs need to be bisectable without breaking the build.

I mean that if ten different developers submit their own PRs at the same time, will CI schedule them one by one or run them concurrently? If concurrently, icount mode can't guarantee the completion time of a test case: first, it takes longer than normal mode; second, more QEMU processes on one machine means less CPU time for each QEMU process.

@andrewboie (Contributor, Author) commented:

> If concurrently, icount mode can't guarantee the completion time of a test case

I see what you mean. The build slaves are supposed to be load balanced, with the number of CPUs selected for the job matching the hardware it is running on.

Having said that, perhaps we need more sophisticated accounting of how long tests are taking to complete, to determine whether they have gotten stuck somewhere.

Are you at the point where icount is working but we just need longer test timeouts?

@wentongwu (Contributor) commented:

Quick testing suggests most of the other QEMU platforms can also work; I will pick up the few timeout/failure cases.

stephanosio added a commit to stephanosio/zephyr that referenced this issue Mar 18, 2020
This commit enables the QEMU icount emulation mode for improved timing
stability.

In normal emulation mode (without icount), the emulation timing of the
TTC system timer is particularly unstable and this results in a high CI
failure rate.

For more details, refer to the issues zephyrproject-rtos#14173 and zephyrproject-rtos#22904.

Signed-off-by: Stephanos Ioannidis <root@stephanos.io>
@maksimmasalski (Collaborator) commented:

@andrewboie Are you going to develop a test case for that QEMU configuration?

@carlescufi (Member) commented:

The current manifestation of the lack of this enhancement is the following: tests/counter/counter_basic_api occasionally failing on mps2_an385 in CI.

@carlescufi carlescufi added area: Test Framework Issues related not to a particular test, but to the framework instead and removed area: Testing labels Apr 30, 2020
@andrewboie (Contributor, Author) commented:

I found and fixed two issues that were exposed by enabling icount on x86 targets:

#24879
#24880

With #24879 applied, I have found that icount works well on 32-bit x86, and I think we should enable it once that fix is merged. I tested with -icount shift=7,align=off,sleep=off -rtc clock=vm

x86_64 unfortunately has other issues. With icount turned on and the above fix applied, I still see:

Deleting output directory /dev/shm/sanity-out
INFO    - JOBS: 48
INFO    - Building initial testcase list...
INFO    - 176 test configurations selected, 631 configurations discarded due to filters.
INFO    - Adding tasks to the queue...
INFO    - Total complete:   39/ 176  22%  skipped:   12, failed:    0
ERROR   - qemu_x86_64               tests/kernel/mbox/mbox_usage/kernel.mailbox.usage  FAILED: Failed
ERROR   - see: /dev/shm/sanity-out/qemu_x86_64/tests/kernel/mbox/mbox_usage/kernel.mailbox.usage/handler.log
INFO    - Total complete:   48/ 176  27%  skipped:   12, failed:    1
ERROR   - qemu_x86_64               tests/kernel/lifo/lifo_usage/kernel.lifo.usage     FAILED: Failed
ERROR   - see: /dev/shm/sanity-out/qemu_x86_64/tests/kernel/lifo/lifo_usage/kernel.lifo.usage/handler.log
INFO    - Total complete:   97/ 176  55%  skipped:   17, failed:    2
ERROR   - qemu_x86_64               tests/kernel/smp/kernel.multiprocessing.smp        FAILED: Failed
ERROR   - see: /dev/shm/sanity-out/qemu_x86_64/tests/kernel/smp/kernel.multiprocessing.smp/handler.log
INFO    - Total complete:  106/ 176  60%  skipped:   19, failed:    3
ERROR   - qemu_x86_64               tests/kernel/common/kernel.common.misra            FAILED: Timeout
ERROR   - see: /dev/shm/sanity-out/qemu_x86_64/tests/kernel/common/kernel.common.misra/handler.log
INFO    - Total complete:  107/ 176  60%  skipped:   19, failed:    4
ERROR   - qemu_x86_64               tests/kernel/common/kernel.common                  FAILED: Timeout
ERROR   - see: /dev/shm/sanity-out/qemu_x86_64/tests/kernel/common/kernel.common/handler.log
INFO    - Total complete:  108/ 176  61%  skipped:   19, failed:    5
ERROR   - qemu_x86_64               tests/kernel/spinlock/kernel.multiprocessing.spinlock FAILED: Timeout
ERROR   - see: /dev/shm/sanity-out/qemu_x86_64/tests/kernel/spinlock/kernel.multiprocessing.spinlock/handler.log
INFO    - Total complete:  109/ 176  61%  skipped:   19, failed:    6
ERROR   - qemu_x86_64               samples/portability/cmsis_rtos_v2/timer_synchronization/sample.portability.cmsis_rtos_v2.timer_synchronization FAILED: Timeout
ERROR   - see: /dev/shm/sanity-out/qemu_x86_64/samples/portability/cmsis_rtos_v2/timer_synchronization/sample.portability.cmsis_rtos_v2.timer_synchronization/handler.log
INFO    - Total complete:  121/ 176  68%  skipped:   19, failed:    7
ERROR   - qemu_x86_64               tests/kernel/mem_protect/userspace/kernel.memory_protection.userspace FAILED: Failed
ERROR   - see: /dev/shm/sanity-out/qemu_x86_64/tests/kernel/mem_protect/userspace/kernel.memory_protection.userspace/handler.log
INFO    - Total complete:  176/ 176  100%  skipped:   20, failed:    8
INFO    - 148 of 156 tests passed (94.87%), 8 failed, 20 skipped with 0 warnings in 116.42 seconds
INFO    - In total 818 test cases were executed on 1 out of total 243 platforms (0.41%)

@wentongwu (Contributor) commented May 1, 2020

INFO    - 20/86 qemu_x86_64               tests/kernel/smp/kernel.multiprocessing.smp        FAILED Failed (qemu 16.006s)
ERROR   - see: /home/wentongw/zephyr/zephyrproject/zephyr/sanity-out/qemu_x86_64/tests/kernel/smp/kernel.multiprocessing.smp/handler.log

This problem appears to be that, with icount mode enabled, even though QEMU is configured with multiple CPUs, there is only one vCPU thread executing guest code. In the test_coop_resched_threads case, the current cooperative thread never gives up the only vCPU, so the other high-priority cooperative threads never get vCPU time to run, and the assertion fires.

@wentongwu (Contributor) commented:

INFO    - 74/86 qemu_x86_64               tests/kernel/lifo/lifo_usage/kernel.lifo.usage     FAILED Failed (qemu 7.419s)
ERROR   - see: /home/wentongw/zephyr/zephyrproject/zephyr/sanity-out/qemu_x86_64/tests/kernel/lifo/lifo_usage/kernel.lifo.usage/handler.log
INFO    - 61/86 qemu_x86_64               tests/kernel/mbox/mbox_usage/kernel.mailbox.usage  FAILED Failed (qemu 2.324s)
ERROR   - see: /home/wentongw/zephyr/zephyrproject/zephyr/sanity-out/qemu_x86_64/tests/kernel/mbox/mbox_usage/kernel.mailbox.usage/handler.log
INFO    - 71/86 qemu_x86_64               tests/kernel/mem_protect/userspace/kernel.memory_protection.userspace FAILED Exited with 2 (qemu)
ERROR   - see: /home/wentongw/zephyr/zephyrproject/zephyr/sanity-out/qemu_x86_64/tests/kernel/mem_protect/userspace/kernel.memory_protection.userspace/handler.log

These three failures are all caused by there being only one vCPU when icount is enabled, while the kernel is built with CONFIG_MP_NUM_CPUS=2.

@wentongwu (Contributor) commented:

INFO    - 50/86 qemu_x86_64               tests/kernel/spinlock/kernel.multiprocessing.spinlock FAILED Timeout (qemu 61.099s)
ERROR   - see: /home/wentongw/zephyr/zephyrproject/zephyr/sanity-out/qemu_x86_64/tests/kernel/spinlock/kernel.multiprocessing.spinlock/handler.log

This case requires:

tests:
  kernel.multiprocessing.spinlock:
    filter: CONFIG_SMP and CONFIG_MP_NUM_CPUS > 1

but we only have one core with icount enabled.

@wentongwu (Contributor) commented:

I see only the above five failed cases with icount enabled on the qemu_x86_64 platform; if CONFIG_MP_NUM_CPUS=1 is configured for these five cases, qemu_x86_64 works well with -icount shift=5,align=off,sleep=off -rtc clock=vm
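
A hypothetical per-test workaround, pending a proper fix for SMP under icount, would be to pin the affected tests to a single CPU in their prj.conf (or an extra config fragment):

    # Illustrative workaround: icount provides only one vCPU,
    # so build the SMP-sensitive tests for a single CPU.
    CONFIG_MP_NUM_CPUS=1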

@andrewboie (Contributor, Author) commented May 1, 2020

OK. I think we should open a separate issue specifically for x86_64 SMP. We may need a completely different emulator to do this right, or this may have to be a documented limitation. Let's reduce the scope of this ticket to non-SMP targets.

@abrodkin (Collaborator) commented Jul 6, 2020

The solution for ARC should be in #26646.
