
CI testing must be retry-free #14649

Closed
andyross opened this issue Mar 18, 2019 · 7 comments
Labels
area: Continuous Integration Enhancement Changes/Updates/Additions to existing features priority: high High impact/importance bug

Comments

@andyross
Contributor

See #12553 for extensive analysis and discussion.

Essentially, there are known instabilities that cause spurious failures on tests in sanitycheck and CI that are currently being worked around by retrying failures. That is bad, and we should stop doing it. But the good news is that (we think!) all the instabilities are understood at this point and can be worked around.

For 1.15, all integration and regression testing should be free of retries and all tests should be reliable.

There are multiple paths to making this work:

One is to fix QEMU's -icount mode so that guest time advances deterministically with the instruction count rather than with host wall-clock time (see #14173).

Or we could rework the way sanitycheck runs tests, as discussed in #12553. Qemu is provably reliable if:

  1. The qemu process is truly uncontended on the host CPU. Note that sanitycheck itself runs a thread to read qemu's output, so this constraint is violated today. Run qemu with its output going to files, and with a full host CPU core available per emulated CPU (note that SMP platforms need more than one unloaded host core!).

  2. The host CPU is fast enough to emulate the target CPU in real time. Currently a 2.5 GHz Skylake core is not fast enough to keep up with the mps2_an385 board configuration in all cases. This can be worked around in the timer driver by increasing CONFIG_SYS_CLOCK_HW_CYCLES_PER_SEC, which effectively slows down the simulation rate.

  3. The test does not require realtime timing accuracy greater than what the host kernel can provide. Qemu sometimes sleeps, and the host kernel can only wake it up to within ~1/CONFIG_HZ seconds of accuracy. Ubuntu kernels use a 250 Hz tick, but some of our tests want as much as 1 kHz of accuracy; Fedora uses CONFIG_HZ=1000 and may be a better choice for CI hardware. This can be worked around with a host kernel rebuild, or by further reducing the simulation clock rate as above.
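Point 3 can be probed empirically on a candidate CI host. The sketch below (not part of sanitycheck; purely illustrative) measures the worst-case overshoot of a nominal 1 ms sleep. It is only a rough proxy for how accurately the host can wake a sleeping qemu: on a tick-driven wakeup path the overshoot can approach 1/CONFIG_HZ, while hrtimer-based wakeups are usually much tighter.

```python
import time

def worst_sleep_overshoot(requested=0.001, samples=50):
    """Return the worst-case extra delay (seconds) beyond a requested sleep.

    A rough proxy for host wakeup accuracy: on a stock Ubuntu kernel
    (CONFIG_HZ=250) a tick-driven wakeup can add up to ~4 ms, whereas
    hrtimer wakeups are usually far tighter; heavy host load widens
    the gap either way.
    """
    worst = 0.0
    for _ in range(samples):
        start = time.monotonic()
        time.sleep(requested)
        worst = max(worst, (time.monotonic() - start) - requested)
    return worst

overshoot = worst_sleep_overshoot()
print(f"worst-case overshoot for a 1 ms sleep: {overshoot * 1e3:.2f} ms")
```

A host that cannot hold this overshoot well under the guest tick period is a candidate for either a CONFIG_HZ=1000 kernel or a reduced simulation clock rate as above.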

@andyross andyross added Enhancement Changes/Updates/Additions to existing features priority: high High impact/importance bug labels Mar 18, 2019
@andyross andyross added this to the v1.15.0 milestone Mar 18, 2019
@andrewboie
Contributor

My vote is a very strong preference for #14173

@andyross
Contributor Author

Agree for sure. Plan B is mentioned because it's been prototyped and is known to work, whereas fixing -icount probably involves diving into the guts of qemu.

@marc-hb
Collaborator

marc-hb commented Apr 22, 2019

Nit: it's not clear whether this particular GitHub issue is specifically about certain qemu timing issues (as described in the description), or is more generally an "umbrella" issue meant to exhaustively track all retry-causing failures (as suggested by the title).

@andrewboie
Contributor

Nit: it's not clear whether this particular GitHub issue is specifically about certain qemu timing issues (as described in the description), or is more generally an "umbrella" issue meant to exhaustively track all retry-causing failures (as suggested by the title).

I think it's fairly clear what we are trying to accomplish:

For 1.15, all integration and regression testing should be free of retries and all tests should be reliable.

I would consider the above to be acceptance criteria for closing this.
The root cause of most of these retries is QEMU's timing issues but we need to fix them all regardless of cause so that we can abolish the retry policy completely, and report all failures even if intermittent.

@marc-hb
Collaborator

marc-hb commented Jun 27, 2019

cc @wentongwu

I've been excluding tests with clear qemu timing issues and running sanitycheck a lot. I'm still seeing a few intermittent failures (#16915, others not filed yet) and surprising 30 s timeouts for tests that almost always take less than 5 s.

I also noticed very recently (in htop, not top) that there is no thread pool or anything else limiting the number of Python threads sanitycheck spawns. When starting a run with 2000 tests, sanitycheck begins by spawning (among other things) 2000 Python threads and using 70 GB of virtual memory.

@nashif recently complained on Slack about the new "too many open files" error shown below.

I'm wondering whether all these "stress" issues could be related to each other (and unrelated to qemu).

2400 tests selected, 110651 tests discarded due to filters
Traceback (most recent call last):
  File "/home/nashif/zephyrproject/zephyr/scripts/sanitycheck", line 3527, in <module>
    main()
  File "/home/nashif/zephyrproject/zephyr/scripts/sanitycheck", line 3456, in main
    ts.instances)
  File "/home/nashif/zephyrproject/zephyr/scripts/sanitycheck", line 2420, in execute
    self.goals = mg.execute(cb, cb_context)
  File "/home/nashif/zephyrproject/zephyr/scripts/sanitycheck", line 1375, in execute
    open(self.logfile, "wt") as make_log:
OSError: [Errno 24] Too many open files: '/home/nashif/zephyrproject/zephyr/sanity-out/make.log'
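A bounded worker pool would address both symptoms at once: with N workers, only N threads, N subprocesses, and N open log files exist at any moment, no matter how many tests are queued. A minimal sketch of the idea, where `run_one` is a hypothetical stand-in for sanitycheck's per-test work (the real job builds the test, opens a make.log, and launches qemu):

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess

def run_one(test_id):
    # Hypothetical stand-in for building and running one test; the
    # real sanitycheck job is far more involved.
    proc = subprocess.run(["true"], capture_output=True)
    return test_id, proc.returncode

test_ids = range(2000)

# Only max_workers threads (and their file descriptors) are live at
# once, instead of one thread per test, so neither virtual memory nor
# the open-file limit scales with the size of the test suite.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(run_one, test_ids))

failures = [tid for tid, rc in results if rc != 0]
print(f"{len(results)} tests run, {len(failures)} failed")
```

With a pool like this, a 2000-test run holds at most 8 threads and 8 log files open, rather than 2000 of each, which would avoid both the 70 GB virtual-memory footprint and the EMFILE ("Too many open files") error above.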

@nashif
Member

nashif commented Jun 27, 2019

@marc-hb this really does not belong here; it is a completely different issue. Please open a new bug.

@ioannisg ioannisg modified the milestones: v2.0.0, v2.1.0 Sep 3, 2019
@dleach02 dleach02 modified the milestones: v2.1.0, v2.2.0 Dec 10, 2019
@jhedberg jhedberg modified the milestones: v2.2.0, v2.3.0 Mar 10, 2020
@carlescufi carlescufi modified the milestones: v2.3.0, v2.4.0 Jun 5, 2020
@stephanosio stephanosio removed this from the v2.4.0 milestone Oct 3, 2021
@stephanosio
Member

Closing this issue since:

  • With the help of the QEMU icount feature, all serious failures have been fixed and the overall failure rate has been greatly reduced.
  • Some tests still fail intermittently when the host CPU load is very high, but the failure rate is low enough that CI does not end up reporting a failure (each such test succeeds within one or two retries).
