-
Notifications
You must be signed in to change notification settings - Fork 13.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[llvm-exegesis] Timeout if subprocess executor hangs #132861
base: main
Are you sure you want to change the base?
Conversation
The call to llvm_exegesis_exe nondeterministically hangs, causing llvm's tests to block. Introduce a timeout when this happens.
Thank you for submitting a Pull Request (PR) to the LLVM Project! This PR will be automatically labeled and the relevant teams will be notified. If you wish to, you can add reviewers by using the "Reviewers" section on this page. If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers. If you have further questions, they may be answered by the LLVM GitHub User Guide. You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums. |
@llvm/pr-subscribers-tools-llvm-exegesis Author: Stephen Huan (stephen-huan) ChangesThe call to cmake -S llvm -B build -G Ninja -DCMAKE_BUILD_TYPE=Debug
ninja -C build check-all # hangs, sent ^C ^Cllvm-lit: .../llvm-project/llvm/utils/lit/lit/TestingConfig.py:156: fatal: unable to parse config file '.../llvm-project/llvm/test/tools/llvm-exegesis/lit.local.cfg', traceback: Traceback (most recent call last):
File ".../llvm-project/llvm/utils/lit/lit/TestingConfig.py", line 144, in load_from_path
exec(compile(data, path, "exec"), cfg_globals, None)
File ".../llvm-project/llvm/test/tools/llvm-exegesis/lit.local.cfg", line 46, in <module>
if can_use_perf_counters(
^^^^^^^^^^^^^^^^^^^^^^
File ".../llvm-project/llvm/test/tools/llvm-exegesis/lit.local.cfg", line 22, in can_use_perf_counters
return_code = subprocess.call(
^^^^^^^^^^^^^^^^
File "/nix/store/f2krmq3iv5nibcvn4rw7nrnrciqprdkh-python3-3.12.9/lib/python3.12/subprocess.py", line 393, in call
return p.wait(timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^
File "/nix/store/f2krmq3iv5nibcvn4rw7nrnrciqprdkh-python3-3.12.9/lib/python3.12/subprocess.py", line 1266, in wait
return self._wait(timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nix/store/f2krmq3iv5nibcvn4rw7nrnrciqprdkh-python3-3.12.9/lib/python3.12/subprocess.py", line 2061, in _wait
(pid, sts) = self._try_wait(0)
^^^^^^^^^^^^^^^^^
File "/nix/store/f2krmq3iv5nibcvn4rw7nrnrciqprdkh-python3-3.12.9/lib/python3.12/subprocess.py", line 2019, in _try_wait
(pid, sts) = os.waitpid(self.pid, wait_flags)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt When printing out the time it takes on cc @boomanaiden154 Full diff: https://github.com/llvm/llvm-project/pull/132861.diff 1 Files Affected:
diff --git a/llvm/test/tools/llvm-exegesis/lit.local.cfg b/llvm/test/tools/llvm-exegesis/lit.local.cfg
index a51a2d73442fa..8b7c13f5eb1b6 100644
--- a/llvm/test/tools/llvm-exegesis/lit.local.cfg
+++ b/llvm/test/tools/llvm-exegesis/lit.local.cfg
@@ -24,9 +24,10 @@ def can_use_perf_counters(mode, extra_options=[]):
+ extra_options,
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,
+ timeout=1,
)
return return_code == 0
- except OSError:
+ except (OSError, subprocess.TimeoutExpired):
print("could not exec llvm-exegesis")
return False
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you tried getting a stack trace?
I haven't seen any timeout issues with this code reported in the ~2 years it's been in tree. There was a race condition fixed about a year ago, but that just caused test failures rather than timeouts.
I would probably need some convincing that this is a problem with llvm-exegesis
itself rather than your environment to accept this patch.
Also, what LLVM version is this with? |
What should I run exactly? Sorry, not very knowledgeable, just building llvm for other projects (triton-cpu, packaging).
Sure, I'm building llvm with the nix package manager which tries to guarantee a completely deterministic build environment for reproducibility (nix builds packages in a sandbox which only allows access to other nix packages, so ideally builds would be bit-for-bit reproducible). That is to say, there are very few environment-specific impurities outside of hardware (my system doesn't have a desktop information
19.1.7 and latest master (tried a66376b, e6de45a). The earliest release I could find with the code was 17.0.0 (88bf774) but I haven't tried it with this desktop yet. Don't feel strongly about this PR, in any case we can maintain the patch downstream. |
Attach lldb to the
To me this further suggests it's something weird with your system rather than the test/code itself. These tests don't get exercised as much as they probably should as they need bare metal hardware access for performance counters (might be good to double check with other people that can't reproduce that the tests are actually running and not just disabled), but they do get run on lots of systems, and there haven't been any reports until now. |
Thanks for the detailed advice (stacktrace below). It seems it's getting stuck on llvm-project/llvm/tools/llvm-exegesis/lib/X86/X86Counter.cpp Lines 213 to 214 in 9b060d1
(lldb) bt
* thread #1, name = 'llvm-exegesis', stop reason = signal SIGSTOP
* frame #0: 0x00007f5391f0cd04 libc.so.6`__poll + 20
frame #1: 0x000000000143aa60 llvm-exegesis`llvm::exegesis::X86LbrCounter::doReadCounter(void const*, void const*) const [inlined] poll(__fds=0x00007ffc422cf100, __nfds=1, __timeout=10000) at poll2.h:44:10
frame #2: 0x000000000143aa5b llvm-exegesis`llvm::exegesis::X86LbrCounter::doReadCounter(void const*, void const*) const [inlined] pollLbrPerfEvent(FileDescriptor=<unavailable>) at X86Counter.cpp:54:14
frame #3: 0x000000000143aa42 llvm-exegesis`llvm::exegesis::X86LbrCounter::doReadCounter(this=0x00007ffc422cf680, From=0x0000000000000000, To=0x0000000000000000) const at X86Counter.cpp:224:34
frame #4: 0x000000000143b28d llvm-exegesis`llvm::exegesis::X86LbrCounter::checkLbrSupport() at X86Counter.cpp:182:62
frame #5: 0x0000000001433458 llvm-exegesis`llvm::exegesis::(anonymous namespace)::ExegesisX86Target::checkFeatureSupport(this=<unavailable>) const at Target.cpp:833:45
frame #6: 0x00000000004d638a llvm-exegesis`llvm::exegesis::benchmarkMain() at llvm-exegesis.cpp:494:60
|
Interesting. That test has nothing to do with the subprocess executor, but I can see why there might be some issues there. It doesn't seem like the tests are even running on my system at all by default (Skylake-X, so definitely has LBR) so it looks like there might have been a lack of test coverage for this for a while. Although it passes if I remove the @oontvoo I know this was a long time ago, but do you have any recollection of why you set up the timeout to be so long? |
This patch fixes the LBR check in the local lit config. The test would segfault as the loop body would be basically empty, causing a divide by zero error. More investigation is needed there so we do not actually hit that assertion and report a cleaner error somewhere. Specifying an actual opcode to benchmark fixes the problem. The test would also fail as -mcpu was set to the default x86 CPU rather than the one currently being run on, so it would always fail to find a perf counter. This patch fixes that by simply removing the -mcpu flag. Given these issues, I'm not sure these tests have ever run in the ~5 years they have been in tree. There were some issues reported in \#132861, so I guess we'll see if there are further issues when the testing becomes more broad.
It's actually interesting that this test ran at all on your system. By default in tree you end up with a divide by zero error because the loop body is empty given the input is a snippet file pointing at I pushed e696f4e to fix that. |
After applying e696f4e I still get the hang (and stack trace the same). Simpler reproduction, hangs on the following. ./build/bin/llvm-exegesis -mode latency -opcode-name=ADD64rr -x86-lbr-sample-period 123 -repetition-mode loop When it doesn't hang I get the error
So it either gives the above warning or hangs. The three calls I mentioned in #132861 (comment) map to ./build/bin/llvm-exegesis -mode latency -opcode-name=ADD64rr
./build/bin/llvm-exegesis -mode uops -opcode-name=ADD64rr
./build/bin/llvm-exegesis -mode latency -opcode-name=ADD64rr -x86-lbr-sample-period 123 -repetition-mode loop
I'm not sure which test you're referring to exactly, but to be clear I was running ninja -C build test/check-llvm-tools-llvm-exegesis
|
The patch isn't intended to fix anything, just actually get the test running on more systems. |
The call to
llvm_exegesis_exe
nondeterministically hangs, causing llvm's tests to block (even before the progress bar-- Testing: 57793 tests, 28 workers --
appears). Introduce a timeout when this happens. Reproduction:cmake -S llvm -B build -G Ninja -DCMAKE_BUILD_TYPE=Debug ninja -C build check-all # hangs, sent ^C
When printing out the time it takes on
ninja -C build check-all
, it prints three times---the first two calls take ~0.007 seconds, and the third time (if it doesn't hang) takes ~0.015--0.031 seconds (between 2-4x slower than the first two calls). If it hangs, it always hangs on the third call (waiting for minutes and still doesn't finish). Setting-DLLVM_LIT_ARGS="-j1"
doesn't seem to change anything. In 50 manual invocations it succeeded 33 times and hung 17 times.cc @boomanaiden154