sanitycheck: don't multiply CPU count #17276
Conversation
scripts/sanitycheck (outdated)

@@ -226,7 +226,7 @@ LAST_SANITY_XUNIT = os.path.join(ZEPHYR_BASE, "scripts", "sanity_chk",
                                  "last_sanity.xml")
 RELEASE_DATA = os.path.join(ZEPHYR_BASE, "scripts", "sanity_chk",
                             "sanity_last_release.csv")
-JOBS = multiprocessing.cpu_count() * 2
+JOBS = multiprocessing.cpu_count()
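For context, the new default still leaves room for a per-run override via the --jobs option mentioned later in the thread. A minimal, hypothetical sketch of that pattern (not the actual sanitycheck option parsing):

```python
import argparse
import multiprocessing

# Default to one job per CPU instead of cpu_count() * 2; users who
# want more parallelism can still pass -j/--jobs explicitly.
parser = argparse.ArgumentParser()
parser.add_argument("-j", "--jobs", type=int,
                    default=multiprocessing.cpu_count(),
                    help="number of concurrent build/test jobs")

args = parser.parse_args([])            # no flag: one job per CPU
assert args.jobs == multiprocessing.cpu_count()

args = parser.parse_args(["-j", "12"])  # explicit override
print(args.jobs)
```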
Can we keep the *2 in -b mode?
OK
We have a number of timing sensitive tests which run correctly on a much more frequent basis if the system is not so heavily loaded. Instead of squeezing a few more crumbs of performance by doubling the CPU count, just use the number of CPUs reported by the system. Signed-off-by: Andrew Boie <andrew.p.boie@intel.com>
Force-pushed from 7051224 to d5651fd.
I'm lucky enough to have access to a system with ~100 cores (hi @andyross). I did some testing on it and noticed that sanitycheck performance caps out past JOBS≈50, which is what I've been using on this system to minimize timeouts and upset coworkers. I'm pretty sure it's because storage becomes the bottleneck past that point. The number of cores is a very crude heuristic because it obviously ignores storage, and no, I don't have anything better to suggest; I'm just making sure everyone's expectations stay very low. cc @dcpleung, who's been playing with RAM disks and/or tmpfs (on my list).
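The ~50-job plateau above suggests cpu_count() alone over-provisions very wide machines. A hypothetical site-local clamp could look like this (the cap of 50 is an observation from this one machine, not a general rule, and not something sanitycheck itself does):

```python
import multiprocessing

# cpu_count() ignores storage bandwidth; on a ~100-core box the observed
# sweet spot was around 50 jobs before I/O became the bottleneck, so a
# local wrapper might clamp the default there (machine-specific assumption).
JOBS_CAP = 50
jobs = min(multiprocessing.cpu_count(), JOBS_CAP)
```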
Do you have any evidence it helps? Keep in mind that sanitycheck starts many more threads than JOBS: 2000+ Python threads for a start with the default settings. See the discussion in #17239. If nothing else, the v2 of this patch makes it more obvious that there is a --jobs option! Even though they sometimes vary, CI timings would have been interesting.
-b mode is --build-only: no emulators are spawned, so we don't get the spurious test execution failures this patch is trying to mitigate. Keeping the doubled amount for that scenario is reasonable to me. This patch is not a fix for #17239; that fails for me even with -j12 and appears to be a function of the number of test cases selected. What this patch does is add some default stability until #14173 is resolved (which may take some time; we've been fighting it for years).
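The compromise being discussed, keeping the *2 factor only when -b/--build-only is set, could be sketched roughly like this (hypothetical helper name; the real patch would wire this into the option defaults instead):

```python
import multiprocessing

def default_jobs(build_only: bool) -> int:
    # Build-only runs spawn no emulators, so oversubscribing CPUs is
    # harmless and keeps compilers busy during I/O waits. When tests
    # actually execute, stick to one job per CPU to reduce the load
    # that makes timing-sensitive tests flaky.
    n = multiprocessing.cpu_count()
    return n * 2 if build_only else n

print(default_jobs(True) == 2 * default_jobs(False))
```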
> -b mode is --build-only, no emulators are spawned, we don't get spurious test execution failures which this patch is trying to help mitigate. So keeping the double amount for this scenario is reasonable to me.
Even if *2 is not great for --build-only either, you're right: better to change one thing at a time.
(I'm still interested in @galak's numbers if any)
> This patch is not a fix for #17239.
Not a fix but maybe it will mitigate a bit. That was just some additional background anyway.
> That fails for me even with -j12. It appears to be a function of the number of test cases selected.
I'm looking right now at yet another run with JOBS=30 and 2190 tests without any issue. I've never hit #17239 yet, strange.
Just spotted this nit, sorry for the extra email:

- sanitycheck: don't multiply CPU count
+ sanitycheck: lower JOBS default _when running tests_
The main reason for the issue in #17239 is that we recently added yet another qemu platform and the number of tests has also increased, basically hitting the maximum number of open files (each qemu thread opens at least three files, multiplied by the number of platforms, multiplied by the number of cases that need to run...). Removing one of the default qemu platforms resolves the issue... or you can increase the maximum number of open files to 4096 :)
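The file-descriptor arithmetic above can be checked, and the soft limit raised (the equivalent of `ulimit -n 4096`), from Python itself. The per-qemu and per-run counts below are illustrative assumptions, not measured values:

```python
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# Illustrative numbers only: ~3 fds per qemu instance, a handful of
# default platforms, and many cases in flight at once.
fds_per_qemu = 3
platforms = 5
cases_in_flight = 100
needed = fds_per_qemu * platforms * cases_in_flight

if needed > soft:
    # Raise the soft limit toward the hard limit; the soft limit may be
    # raised up to the hard limit without special privileges.
    resource.setrlimit(resource.RLIMIT_NOFILE, (min(needed, hard), hard))
```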
I don't know if you were just referring to qemu_x86_coverage, but in any case I think you just nailed it: I carry a private hack to exclude it. That's very likely why I haven't hit this, thanks.