
CI testing must be retry-free #14649

Closed
andyross opened this issue Mar 18, 2019 · 7 comments
Labels
area: Continuous Integration Enhancement Changes/Updates/Additions to existing features priority: high High impact/importance bug

Comments

@andyross
Contributor

See #12553 for extensive analysis and discussion.

Essentially, there are known instabilities that cause spurious failures on tests in sanitycheck and CI that are currently being worked around by retrying failures. That is bad, and we should stop doing it. But the good news is that (we think!) all the instabilities are understood at this point and can be worked around.

For 1.15, all integration and regression testing should be free of retries and all tests should be reliable.

There are multiple paths to making this work:

One is to fix QEMU's -icount mode so that guest time advances deterministically with the instruction count rather than with host wall-clock time (see #14173).

Or we could rework the way sanitycheck runs tests, as discussed in #12553. Qemu is provably reliable if:

  1. The qemu process is truly uncontended on the host CPU. Note that sanitycheck itself runs a thread to read qemu's output, so this constraint is violated today. Run qemu with its output going to files, and with a full host CPU core available per emulated CPU (note that SMP platforms need more than one unloaded host core!).

  2. The host CPU is fast enough to emulate the target CPU in real time. Currently a 2.5 GHz Skylake core is not fast enough to keep up with the mps2_an385 board configuration in all cases. This can be worked around in the timer driver by increasing CONFIG_SYS_CLOCK_HW_CYCLES_PER_SEC, which effectively slows down the simulation rate.

  3. The test does not require realtime timing accuracy greater than what the host kernel can provide. Qemu sometimes sleeps, and the host kernel can only wake it up to within ~1/CONFIG_HZ seconds of accuracy. Ubuntu kernels use a 250 Hz tick, but some of our tests want as much as 1 kHz of accuracy; Fedora uses CONFIG_HZ=1000 and may be a better choice for CI hardware. This can be worked around with a host kernel rebuild, or by further reducing the simulation clock rate as above.
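Point 3 can be probed empirically on a candidate CI host. The sketch below (not part of sanitycheck; purely illustrative) measures the worst-case overshoot of a nominal 1 ms sleep. It is only a rough proxy for how accurately the host can wake a sleeping qemu: on a tick-driven wakeup path the overshoot can approach 1/CONFIG_HZ, while hrtimer-based wakeups are usually much tighter.

```python
import time

def worst_sleep_overshoot(requested=0.001, samples=50):
    """Return the worst-case extra delay (seconds) beyond a requested sleep.

    A rough proxy for host wakeup accuracy: on a stock Ubuntu kernel
    (CONFIG_HZ=250) a tick-driven wakeup can add up to ~4 ms, whereas
    hrtimer wakeups are usually far tighter; heavy host load widens
    the gap either way.
    """
    worst = 0.0
    for _ in range(samples):
        start = time.monotonic()
        time.sleep(requested)
        worst = max(worst, (time.monotonic() - start) - requested)
    return worst

overshoot = worst_sleep_overshoot()
print(f"worst-case overshoot for a 1 ms sleep: {overshoot * 1e3:.2f} ms")
```

A host that cannot hold this overshoot well under the guest tick period is a candidate for either a CONFIG_HZ=1000 kernel or a reduced simulation clock rate as above.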

@andyross andyross added Enhancement Changes/Updates/Additions to existing features priority: high High impact/importance bug labels Mar 18, 2019
@andyross andyross added this to the v1.15.0 milestone Mar 18, 2019
@andrewboie
Contributor

My vote is a very strong preference for #14173

@andyross
Contributor Author

Agree for sure. Plan B is mentioned because it's been prototyped and is known to work, whereas fixing -icount probably involves diving into the guts of qemu.

@marc-hb
Collaborator

marc-hb commented Apr 22, 2019

Nit: it's not clear whether this particular GitHub issue is specifically about certain qemu timing issues (as described in the description), or is more generally an "umbrella" issue meant to exhaustively track all retry-causing failures (as suggested by the title).

@andrewboie
Contributor

Nit: it's not clear whether this particular GitHub issue is specifically about certain qemu timing issues (as described in the description), or is more generally an "umbrella" issue meant to exhaustively track all retry-causing failures (as suggested by the title).

I think it's fairly clear what we are trying to accomplish:

For 1.15, all integration and regression testing should be free of retries and all tests should be reliable.

I would consider the above to be acceptance criteria for closing this.
The root cause of most of these retries is QEMU's timing issues but we need to fix them all regardless of cause so that we can abolish the retry policy completely, and report all failures even if intermittent.

@marc-hb
Collaborator

marc-hb commented Jun 27, 2019

cc @wentongwu

I've been excluding tests with clear qemu timing issues and running sanitycheck a lot. I'm still seeing a few intermittent failures (#16915, others not filed yet) and surprising 30 s timeouts for tests that almost always take less than 5 s.

I also noticed very recently (in htop, not top) that there is no thread pool or anything else limiting the number of Python threads sanitycheck spawns. When starting a run with 2000 tests, sanitycheck begins by spawning (among other things) 2000 Python threads and using 70 GB of virtual memory.

@nashif recently complained on Slack about the new "too many open files" error shown below.

I'm wondering whether all these "stress" issues could be related to each other (and unrelated to qemu).

2400 tests selected, 110651 tests discarded due to filters
Traceback (most recent call last):
  File "/home/nashif/zephyrproject/zephyr/scripts/sanitycheck", line 3527, in <module>
    main()
  File "/home/nashif/zephyrproject/zephyr/scripts/sanitycheck", line 3456, in main
    ts.instances)
  File "/home/nashif/zephyrproject/zephyr/scripts/sanitycheck", line 2420, in execute
    self.goals = mg.execute(cb, cb_context)
  File "/home/nashif/zephyrproject/zephyr/scripts/sanitycheck", line 1375, in execute
    open(self.logfile, "wt") as make_log:
OSError: [Errno 24] Too many open files: '/home/nashif/zephyrproject/zephyr/sanity-out/make.log'
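A bounded worker pool would address both symptoms at once: with N workers, only N threads, N subprocesses, and N open log files exist at any moment, no matter how many tests are queued. A minimal sketch of the idea, where `run_one` is a hypothetical stand-in for sanitycheck's per-test work (the real job builds the test, opens a make.log, and launches qemu):

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess

def run_one(test_id):
    # Hypothetical stand-in for building and running one test; the
    # real sanitycheck job is far more involved.
    proc = subprocess.run(["true"], capture_output=True)
    return test_id, proc.returncode

test_ids = range(2000)

# Only max_workers threads (and their file descriptors) are live at
# once, instead of one thread per test, so neither virtual memory nor
# the open-file limit scales with the size of the test suite.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(run_one, test_ids))

failures = [tid for tid, rc in results if rc != 0]
print(f"{len(results)} tests run, {len(failures)} failed")
```

With a pool like this, a 2000-test run holds at most 8 threads and 8 log files open, rather than 2000 of each, which would avoid both the 70 GB virtual-memory footprint and the EMFILE ("Too many open files") error above.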

@nashif
Member

nashif commented Jun 27, 2019

@marc-hb this really does not belong here; it is a completely different issue. Please open a new bug.

@ioannisg ioannisg modified the milestones: v2.0.0, v2.1.0 Sep 3, 2019
@dleach02 dleach02 modified the milestones: v2.1.0, v2.2.0 Dec 10, 2019
@jhedberg jhedberg modified the milestones: v2.2.0, v2.3.0 Mar 10, 2020
@carlescufi carlescufi modified the milestones: v2.3.0, v2.4.0 Jun 5, 2020
@stephanosio stephanosio removed this from the v2.4.0 milestone Oct 3, 2021
@stephanosio
Member

Closing this issue since:

  • With the help of the QEMU icount feature, all serious failures have been fixed and the overall failure rate has been greatly reduced.
  • Some tests still fail intermittently when the host CPU load is very high, but the failure rate is low enough that CI does not end up reporting a failure (each such test succeeds within one or two retries).
