Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Twister jobserver support eliminates parallel build for me #54289

Closed
keith-packard opened this issue Feb 1, 2023 · 16 comments
Closed

Twister jobserver support eliminates parallel build for me #54289

keith-packard opened this issue Feb 1, 2023 · 16 comments
Assignees
Labels
area: Twister Twister bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug Regression Something, which was working, does not anymore Stale

Comments

@keith-packard
Copy link
Collaborator

Describe the bug
The new Linux jobserver support in twister looks like it should have improved my twister runtimes by avoiding huge load spikes, but instead it has eliminated almost all parallelism during the run. Only one test appears to run at a time.

Please also mention any information which could help others to understand
the problem you're facing:

  • What target platform are you using? Debian Linux
  • What have you tried to diagnose or workaround this issue? I've reverted the jobserver patches and performance returned to previous values. https://github.com/keith-packard/zephyr/tree/revert-jobserver
  • Is this a regression? If yes, have you been able to "git bisect" it to a
    specific commit? Yes, as above, reverting the three jobserver-related patches restores the previous behavior and performance.
  • ...

To Reproduce
Steps to reproduce the behavior:

  1. ./scripts/twister -j 25
  2. Note that only one cmake process is ever active.

Expected behavior
I expect 25 cmake processes followed by hundreds of compilation processes to keep the build machine busy for about three hours for the complete twister test.

Impact
I am unable to complete twister test runs in a reasonable amount of time; I abandoned the test run after eight hours.

Environment (please complete the following information):

  • OS: Debian Unstable
  • Toolchain: Zephyr SDK 0.16 (beta)
  • Commit SHA or Version used: ef9d5bb

Additional context
I'm running tests on a 64-thread system, so I would expect a load average of around 200 for much of this test.

@keith-packard keith-packard added bug The issue is a bug, or the PR is fixing a bug area: Twister Twister labels Feb 1, 2023
@nordicjm
Copy link
Collaborator

nordicjm commented Feb 1, 2023

I expect 25 cmake processes followed by hundreds of compilation processes to keep the build machine busy for about three hours for the complete twister test.

I think this is a misunderstanding, before the patch one set of compilation tasks would spawn as many compilation processes as you have cores on a system, so building one application would use 16 threads if you had 16 cores, but combined with twister, if you specified -j4, this would run 4 sets of builds in parallel which would spawn 4 * 16 threads = 64 threads, most of which would starve because only 16 can be processed at once. I know when I did it here, my machine would basically freeze because I have 8 hyperthreading cores and a -j8 in twister would send my load average above 60.

Perhaps what would be good is a way to control how many compilation tasks can run in parallel for each concurrent build, so -j8 to to run 8 builds in parallel and -t4 to run 4 threads per parallel build (= 32 thread count if everything is busy)

@keith-packard
Copy link
Collaborator Author

I think this is a misunderstanding, before the patch one set of compilation tasks would spawn as many compilation processes as you have cores on a system, so building one application would use 16 threads if you had 16 cores, but combined with twister, if you specified -j4, this would run 4 sets of builds in parallel which would spawn 4 * 16 threads = 64 threads, most of which would starve because only 16 can be processed at once. I know when I did it here, my machine would basically freeze because I have 8 hyperthreading cores and a -j8 in twister would send my load average above 60.

I don't think I misunderstood the code. I used -j 25 on my machine to spawn 25 parallel builds, each of which would spawn 64 compile processes sending the load over 200. As this is a 64-core build machine that I don't actually use for interactive development, that wasn't great, but it did keep the machine saturated for the complete twister run. With the change, I get a single cmake run, and then ninja runs that single build using all 64 cores for compiling. The compile happens in a second or two and the system is back to running the next cmake. The overall twister run time increases from 3 hours to well over 8 as cmake ends up taking way more time than compiling.

Perhaps what would be good is a way to control how many compilation tasks can run in parallel for each concurrent build, so -j8 to to run 8 builds in parallel and -t4 to run 4 threads per parallel build (= 32 thread count if everything is busy)

We used to have this option by using -j 8 for twister and CMAKE_BUILD_PARALLEL_LEVEL=4 to set the number of processes run by ninja.

In a perfect world, twister would coordinate with ninja to keep the maximum number of parallel processes as close as practical to the specified value, but (as has been noticed by @nashif already), ninja hasn't merged that yet (ninja-build/ninja#1139).

@thedjnK
Copy link
Collaborator

thedjnK commented Feb 1, 2023

With the change, I get a single cmake run, and then ninja runs that single build using all 64 cores for compiling

Are you using the patched version of ninja, not the default/normal/repo one? Because this sounds like you're not using the patched version

@keith-packard
Copy link
Collaborator Author

ah, there's a magic version of ninja somewhere? Doesn't seem to be in the SDK version I'm running. I figured it was something simple like that, but couldn't figure out where to look.

@thedjnK
Copy link
Collaborator

thedjnK commented Feb 1, 2023

ah, there's a magic version of ninja somewhere? Doesn't seem to be in the SDK version I'm running. I figured it was something simple like that, but couldn't figure out where to look.

Yes, I made the same mistake of trying this and wondering why everything looked broken until someone pointed out you need the kitware version of ninja https://github.com/Kitware/ninja

@jeremybettis would probably be a good idea to update the getting started guide, twister documentation and release notes with links on how to get this special version, because right now it's not documented anywhere (ideally for the 3.3 release)

@keith-packard
Copy link
Collaborator Author

yeah, seems like a pretty important note. would be nice if this custom version of twister was supplied with the SDK.

@thedjnK
Copy link
Collaborator

thedjnK commented Feb 1, 2023

@stephanosio for info too

@keith-packard
Copy link
Collaborator Author

Hrm. Even with the custom version of ninja, my machine is completely underwhelmed when running twister with the jobserver support enabled. It never seems to run multiple cmake instances in parallel anymore, and as cmake is the most expensive part of the build, that means the machine never gets up a full head of steam.

@thedjnK
Copy link
Collaborator

thedjnK commented Feb 1, 2023

I'll need to check again tomorrow but pretty sure when I checked earlier, I saw as many python processes (1 for each build) as the -j x I specified

@thedjnK
Copy link
Collaborator

thedjnK commented Feb 1, 2023

Works for me, using command ./scripts/twister -p nrf52840dk_nrf52840 -T tests/kernel/ -j6 gives:

1428 pts/2    Ss     0:00 -bash
   1441 pts/1    S+     0:01 python3 ./scripts/twister -p nrf52840dk_nrf52840 -T tests/kernel/ -j
   1460 pts/1    Sl+    0:00 python3 ./scripts/twister -p nrf52840dk_nrf52840 -T tests/kernel/ -j
   1469 pts/1    S+     0:00 python3 ./scripts/twister -p nrf52840dk_nrf52840 -T tests/kernel/ -j
   1470 pts/1    S+     0:00 python3 ./scripts/twister -p nrf52840dk_nrf52840 -T tests/kernel/ -j
   1473 pts/1    S+     0:00 python3 ./scripts/twister -p nrf52840dk_nrf52840 -T tests/kernel/ -j
   1476 pts/1    S+     0:00 python3 ./scripts/twister -p nrf52840dk_nrf52840 -T tests/kernel/ -j
   1481 pts/1    S+     0:00 python3 ./scripts/twister -p nrf52840dk_nrf52840 -T tests/kernel/ -j
   1483 pts/1    S+     0:00 /usr/bin/cmake -B/tmp/aa/zephyr/twister-out/nrf52840dk_nrf52840/test
   1487 pts/1    S+     0:00 python3 ./scripts/twister -p nrf52840dk_nrf52840 -T tests/kernel/ -j
   1490 pts/1    S+     0:00 /usr/bin/cmake -B/tmp/aa/zephyr/twister-out/nrf52840dk_nrf52840/test
   1492 pts/1    S+     0:00 /usr/bin/cmake -B/tmp/aa/zephyr/twister-out/nrf52840dk_nrf52840/test
   1496 pts/1    S+     0:00 /usr/bin/cmake -B/tmp/aa/zephyr/twister-out/nrf52840dk_nrf52840/test
   1497 pts/1    S+     0:00 /usr/bin/cmake -B/tmp/aa/zephyr/twister-out/nrf52840dk_nrf52840/test
   1498 pts/1    S+     0:00 /usr/bin/cmake -B/tmp/aa/zephyr/twister-out/nrf52840dk_nrf52840/test
   1571 pts/1    R+     0:00 /usr/bin/python3.10 -m west topdir
   1572 pts/1    R+     0:00 /usr/bin/python3.10 /tmp/aa/zephyr/scripts/zephyr_module.py --zephyr
   1573 pts/1    R+     0:00 /usr/bin/python3.10 /tmp/aa/zephyr/scripts/zephyr_module.py --zephyr
   1578 pts/1    R+     0:00 /usr/bin/python3.10 /tmp/aa/zephyr/scripts/zephyr_module.py --zephyr
   1579 pts/1    R+     0:00 /usr/bin/python3.10 /tmp/aa/zephyr/scripts/dts/gen_defines.py --dts
   1580 pts/1    R+     0:00 /usr/bin/python3.10 /tmp/aa/zephyr/scripts/zephyr_module.py --zephyr
   1581 pts/2    R+     0:00 ps ax

Checked that it's different projects:

   1715 pts/1    S+     0:00 /usr/bin/cmake -B/tmp/aa/zephyr/twister-out/nrf52840dk_nrf52840/test
s/kernel/cache/kernel.cache.api -DTC_RUNID=1d6a8a8854347a6e4e6beafe5215860f -DEXTRA_CFLAGS=-Werro
r -DEXTRA_AFLAGS=-Werror -Wa,--fatal-warnings -DEXTRA_LDFLAGS=-Wl,--fatal-warnings -DEXTRA_GEN_DE
FINES_ARGS=--edtlib-Werror -GNinja -S/tmp/aa/zephyr/tests/kernel/cache -DBOARD=nrf52840dk_nrf5284
0
   1720 pts/1    S+     0:00 /usr/bin/cmake -B/tmp/aa/zephyr/twister-out/nrf52840dk_nrf52840/test
s/kernel/common/kernel.common.picolibc -DTC_RUNID=a62f1ffecb017142668b20d4c6e1f50f -DEXTRA_CFLAGS
=-Werror -DEXTRA_AFLAGS=-Werror -Wa,--fatal-warnings -DEXTRA_LDFLAGS=-Wl,--fatal-warnings -DEXTRA
_GEN_DEFINES_ARGS=--edtlib-Werror -GNinja -S/tmp/aa/zephyr/tests/kernel/common -DOVERLAY_CONFIG=/
tmp/aa/zephyr/twister-out/nrf52840dk_nrf52840/tests/kernel/common/kernel.common.picolibc/twister/
testsuite_extra.conf -DBOARD=nrf52840dk_nrf52840
   1721 pts/1    S+     0:00 /usr/bin/cmake -B/tmp/aa/zephyr/twister-out/nrf52840dk_nrf52840/test
s/kernel/common/kernel.common.nano64 -DTC_RUNID=342ff4f460988f05587c13a0856d1115 -DEXTRA_CFLAGS=-
Werror -DEXTRA_AFLAGS=-Werror -Wa,--fatal-warnings -DEXTRA_LDFLAGS=-Wl,--fatal-warnings -DEXTRA_G
EN_DEFINES_ARGS=--edtlib-Werror -GNinja -S/tmp/aa/zephyr/tests/kernel/common -DOVERLAY_CONFIG=/tm
p/aa/zephyr/twister-out/nrf52840dk_nrf52840/tests/kernel/common/kernel.common.nano64/twister/test
suite_extra.conf -DBOARD=nrf52840dk_nrf52840
   1725 pts/1    S+     0:00 /usr/bin/cmake -B/tmp/aa/zephyr/twister-out/nrf52840dk_nrf52840/test
s/kernel/common/kernel.common.nano32 -DTC_RUNID=ba512fde866504ed3b6c43dac6fcb469 -DEXTRA_CFLAGS=-
Werror -DEXTRA_AFLAGS=-Werror -Wa,--fatal-warnings -DEXTRA_LDFLAGS=-Wl,--fatal-warnings -DEXTRA_G
EN_DEFINES_ARGS=--edtlib-Werror -GNinja -S/tmp/aa/zephyr/tests/kernel/common -DOVERLAY_CONFIG=/tm
p/aa/zephyr/twister-out/nrf52840dk_nrf52840/tests/kernel/common/kernel.common.nano32/twister/test
suite_extra.conf -DBOARD=nrf52840dk_nrf52840
   1726 pts/1    S+     0:00 /usr/bin/cmake -B/tmp/aa/zephyr/twister-out/nrf52840dk_nrf52840/test
s/kernel/common/kernel.common.misra -DTC_RUNID=a92758396ec09deabca118a3b161ed5d -DEXTRA_CFLAGS=-W
error -DEXTRA_AFLAGS=-Werror -Wa,--fatal-warnings -DEXTRA_LDFLAGS=-Wl,--fatal-warnings -DEXTRA_GE
N_DEFINES_ARGS=--edtlib-Werror -GNinja -S/tmp/aa/zephyr/tests/kernel/common -DOVERLAY_CONFIG=/tmp
/aa/zephyr/twister-out/nrf52840dk_nrf52840/tests/kernel/common/kernel.common.misra/twister/testsu
ite_extra.conf -DBOARD=nrf52840dk_nrf52840

@keith-packard
Copy link
Collaborator Author

Yup, I get the correct number of python processes, but only ever one cmake. I must have some local configuration broken here. I'm running directly from main:

$ git show
commit 60a20471b561a0a3a74b377991fe1976c2ea83d7 (HEAD -> main, origin/main)
Author: Tomasz Leman <tomasz.m.leman@intel.com>
Date:   Fri Jan 27 15:05:42 2023 +0100

I just built ninja from the kitware sources:

$ ninja --version
1.11.1.git.kitware.jobserver-1

A mystery.

@keith-packard
Copy link
Collaborator Author

keith-packard commented Feb 1, 2023

ah ha! I had MAKEFLAGS=-j96 in my environment, which appears to have confused some part of the system into preventing the parallel cmake runs, even though I'm not using make anywhere. Sounds like there might be a minor bug in the jobserver support somewhere, but it looks like it can be worked around by clearing that variable.

@nashif
Copy link
Member

nashif commented Feb 1, 2023

It should be possible to make the job server optional or at least configurable, right now we just enable it for Linux. If this is an option that would help, please shout. I spent some time benchmarking this and main reason this was introduced is ninja, if you build with make, you will not get into the original problem the job-server was trying to solve.
We switched to ninja by default some time ago, because it is "faster", maybe that was the wrong choice.

Also, you get the patched ninja with pip, no need to build it yourself :)

@keith-packard
Copy link
Collaborator Author

jobserver is a huge improvement now that I've figured out what the core issue was, but for people without the interest/ability to get a fixed ninja, it might be nice to have it optional. The other issue was the stale (ah, legacy) MAKEFLAGS environment variable that messed up my fine ninja build by disabling the parallel cmake invocations. It might be nice to figure out how to detect/fix that issue, but it's definitely not a blocker.

So, it seems like there are two (minor) issues here:

  1. You need the kitware version of ninja. My recommendation is to include this with the Zephyr SDK bits, but perhaps a simple runtime check in twister that provides a warning/error if you don't have that would be good too.
  2. The MAKEFLAGS environment variable does "bad things" to the jobserver usage. Perhaps, when using ninja, it would be good to just ignore this?

I'm running a twister build and things seem to be going well -- it's holding a load average of around 64, which is optimal for this machine.

@stephanosio
Copy link
Member

Anas commented:

It should be possible to make the job server optional or at least configurable, right now we just enable it for Linux. If this is an option that would help, please shout.

(shouts very loudly)

Keith commented:

You need the kitware version of ninja. My recommendation is to include this with the Zephyr SDK bits

That would require users to manually update their PATH to point to the Zephyr SDK ninja bin dir, which will not fit nicely into the current CMake package-based automagical discovery of Zephyr SDK.

but perhaps a simple runtime check in twister that provides a warning/error if you don't have that would be good too.

I think this is a more practical solution given that the Kitware version of ninja-build can be installed through pip.

@laurenmurphyx64 laurenmurphyx64 added Regression Something, which was working, does not anymore priority: medium Medium impact/importance bug labels Feb 7, 2023
@github-actions
Copy link

github-actions bot commented Apr 9, 2023

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Twister Twister bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug Regression Something, which was working, does not anymore Stale
Projects
None yet
Development

No branches or pull requests

7 participants