CI testing must be retry-free #14649
Comments
My vote is a very strong preference for #14173.
Agree for sure. Plan B is mentioned because it's been prototyped and is known to work, whereas fixing -icount probably involves diving into the guts of qemu.
Nit: it's not clear whether this particular GitHub issue is specifically about some qemu timing issues (as described in the description), or more generally an "umbrella" issue trying to exhaustively list all retry issues (as described by the title).
I think it's fairly clear what we are trying to accomplish:
I would consider the above to be acceptance criteria for closing this.
I've been excluding tests with clear qemu timing issues and running sanitycheck a lot. I'm still experiencing a few intermittent failures (#16915, others not filed yet) and surprising 30 s timeouts for tests that almost always take less than 5 s. I also noticed very recently (in htop, not top) that there is no thread pool or anything else to limit the number of Python threads spawned by sanitycheck. When starting a run with 2000 tests, sanitycheck starts by spawning (among others) 2000 Python threads and using 70 GB of virtual memory. @nashif recently complained on Slack about a new "too many files" error below. I'm wondering whether all these "stress" issues could be related to each other (and unrelated to qemu).
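For illustration only, here is a minimal sketch of what bounding the number of worker threads could look like. It is not sanitycheck's actual code: `run_bounded`, `run_one`, and `max_workers` are hypothetical names, and the sketch assumes each test can be run by a single callable.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def run_bounded(tests, run_one, max_workers=None):
    """Run run_one(test) for every test using at most max_workers threads.

    Hypothetical helper, not part of sanitycheck. Results are returned in
    the same order as the input tests.
    """
    if max_workers is None:
        # Assumption: one worker per host CPU is a reasonable default.
        max_workers = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map queues all tests but only max_workers run at a time.
        return list(pool.map(run_one, tests))
```

With a pool like this, a run of 2000 tests would queue 2000 work items but keep only `max_workers` threads alive at once.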
@marc-hb this really does not belong here; it's a completely different issue. Please open a new bug.
Closing this issue since:
See #12553 for extensive analysis and discussion.
Essentially, there are known instabilities that cause spurious failures on tests in sanitycheck and CI that are currently being worked around by retrying failures. That is bad, and we should stop doing it. But the good news is that (we think!) all the instabilities are understood at this point and can be worked around.
For 1.15, all integration and regression testing should be free of retries and all tests should be reliable.
There are multiple paths to making this work:
Or we could rework the way sanitycheck runs tests, as discussed in #12553. Qemu is provably reliable if:
The qemu process is truly uncontended on the host CPU. Note that sanitycheck itself runs a thread to read qemu's output, so this condition is not met right now. Run qemu with its output going to files and with a full host CPU core available per live test CPU (note that SMP platforms need more than one host CPU to be unloaded!)
The host CPU is fast enough to emulate the target CPU in real time. Currently a 2.5GHz Skylake core is not fast enough to keep up with the mps2_an385 board configuration in all cases. This can be worked around in the timer driver by increasing the CONFIG_SYS_CLOCK_HW_CYCLES_PER_SEC to effectively slow down the simulation rate.
The test does not require realtime timing accuracy greater than what the host kernel can provide. Qemu sometimes sleeps, and the host kernel can only wake it up to within ~1/CONFIG_HZ seconds of accuracy. Ubuntu kernels have a 250Hz tick timer, but some of our tests want as much as 1kHz of accuracy. Fedora uses CONFIG_HZ=1000 and may be a better choice for CI hardware. This can be worked around with a host kernel rebuild, or by further reducing the simulation clock rate as above.
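The tick arithmetic in the last condition can be sketched as follows. This is only an illustration of the ~1/CONFIG_HZ bound described above; the function names are hypothetical and the check is deliberately simplistic (it ignores scheduling latency and everything other than the tick rate).

```python
def wakeup_granularity_ms(config_hz: int) -> float:
    """Approximate host wakeup granularity in milliseconds (~1/CONFIG_HZ)."""
    return 1000.0 / config_hz


def host_tick_is_sufficient(config_hz: int, required_hz: int) -> bool:
    """True if the host tick rate is at least the rate a test needs.

    Hypothetical helper: required_hz is the timing accuracy a test wants,
    expressed as a frequency (e.g. 1000 for 1 kHz).
    """
    return config_hz >= required_hz


if __name__ == "__main__":
    # Values taken from the discussion above: 250 Hz vs 1000 Hz hosts,
    # with tests wanting up to 1 kHz of accuracy.
    for hz in (250, 1000):
        print(hz, wakeup_granularity_ms(hz), host_tick_is_sufficient(hz, 1000))
```

Under these assumptions a 250 Hz host wakes qemu to within about 4 ms, which is coarser than the 1 ms that a 1 kHz test expects, while a CONFIG_HZ=1000 host matches it.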