Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[llvm-exegesis] Timeout if subprocess executor hangs #132861

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

stephen-huan
Copy link

The call to llvm_exegesis_exe nondeterministically hangs, causing llvm's tests to block (even before the progress bar -- Testing: 57793 tests, 28 workers -- appears). Introduce a timeout when this happens. Reproduction:

cmake -S llvm -B build -G Ninja -DCMAKE_BUILD_TYPE=Debug
ninja -C build check-all # hangs, sent ^C
^Cllvm-lit: .../llvm-project/llvm/utils/lit/lit/TestingConfig.py:156: fatal: unable to parse config file '.../llvm-project/llvm/test/tools/llvm-exegesis/lit.local.cfg', traceback: Traceback (most recent call last):
  File ".../llvm-project/llvm/utils/lit/lit/TestingConfig.py", line 144, in load_from_path
    exec(compile(data, path, "exec"), cfg_globals, None)
  File ".../llvm-project/llvm/test/tools/llvm-exegesis/lit.local.cfg", line 46, in <module>
    if can_use_perf_counters(
       ^^^^^^^^^^^^^^^^^^^^^^
  File ".../llvm-project/llvm/test/tools/llvm-exegesis/lit.local.cfg", line 22, in can_use_perf_counters
    return_code = subprocess.call(
                  ^^^^^^^^^^^^^^^^
  File "/nix/store/f2krmq3iv5nibcvn4rw7nrnrciqprdkh-python3-3.12.9/lib/python3.12/subprocess.py", line 393, in call
    return p.wait(timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/f2krmq3iv5nibcvn4rw7nrnrciqprdkh-python3-3.12.9/lib/python3.12/subprocess.py", line 1266, in wait
    return self._wait(timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/f2krmq3iv5nibcvn4rw7nrnrciqprdkh-python3-3.12.9/lib/python3.12/subprocess.py", line 2061, in _wait
    (pid, sts) = self._try_wait(0)
                 ^^^^^^^^^^^^^^^^^
  File "/nix/store/f2krmq3iv5nibcvn4rw7nrnrciqprdkh-python3-3.12.9/lib/python3.12/subprocess.py", line 2019, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt

When printing out the time it takes on ninja -C build check-all, it prints three times---the first two calls take ~0.007 seconds, and the third time (if it doesn't hang) takes ~0.015--0.031 seconds (between 2-4x slower than the first two calls). If it hangs, it always hangs on the third call (waiting for minutes and still doesn't finish). Setting -DLLVM_LIT_ARGS="-j1" doesn't seem to change anything. In 50 manual invocations it succeeded 33 times and hung 17 times.

cc @boomanaiden154

The call to llvm_exegesis_exe nondeterministically hangs, causing
llvm's tests to block. Introduce a timeout when this happens.
Copy link

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Copy link
Member

llvmbot commented Mar 25, 2025

@llvm/pr-subscribers-tools-llvm-exegesis

Author: Stephen Huan (stephen-huan)

Changes

The call to llvm_exegesis_exe nondeterministically hangs, causing llvm's tests to block (even before the progress bar -- Testing: 57793 tests, 28 workers -- appears). Introduce a timeout when this happens. Reproduction:

cmake -S llvm -B build -G Ninja -DCMAKE_BUILD_TYPE=Debug
ninja -C build check-all # hangs, sent ^C
^Cllvm-lit: .../llvm-project/llvm/utils/lit/lit/TestingConfig.py:156: fatal: unable to parse config file '.../llvm-project/llvm/test/tools/llvm-exegesis/lit.local.cfg', traceback: Traceback (most recent call last):
  File ".../llvm-project/llvm/utils/lit/lit/TestingConfig.py", line 144, in load_from_path
    exec(compile(data, path, "exec"), cfg_globals, None)
  File ".../llvm-project/llvm/test/tools/llvm-exegesis/lit.local.cfg", line 46, in &lt;module&gt;
    if can_use_perf_counters(
       ^^^^^^^^^^^^^^^^^^^^^^
  File ".../llvm-project/llvm/test/tools/llvm-exegesis/lit.local.cfg", line 22, in can_use_perf_counters
    return_code = subprocess.call(
                  ^^^^^^^^^^^^^^^^
  File "/nix/store/f2krmq3iv5nibcvn4rw7nrnrciqprdkh-python3-3.12.9/lib/python3.12/subprocess.py", line 393, in call
    return p.wait(timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/f2krmq3iv5nibcvn4rw7nrnrciqprdkh-python3-3.12.9/lib/python3.12/subprocess.py", line 1266, in wait
    return self._wait(timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nix/store/f2krmq3iv5nibcvn4rw7nrnrciqprdkh-python3-3.12.9/lib/python3.12/subprocess.py", line 2061, in _wait
    (pid, sts) = self._try_wait(0)
                 ^^^^^^^^^^^^^^^^^
  File "/nix/store/f2krmq3iv5nibcvn4rw7nrnrciqprdkh-python3-3.12.9/lib/python3.12/subprocess.py", line 2019, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt

When printing out the time it takes on ninja -C build check-all, it prints three times---the first two calls take ~0.007 seconds, and the third time (if it doesn't hang) takes ~0.015--0.031 seconds (between 2-4x slower than the first two calls). If it hangs, it always hangs on the third call (waiting for minutes and still doesn't finish). Setting -DLLVM_LIT_ARGS="-j1" doesn't seem to change anything. In 50 manual invocations it succeeded 33 times and hung 17 times.

cc @boomanaiden154


Full diff: https://github.com/llvm/llvm-project/pull/132861.diff

1 Files Affected:

  • (modified) llvm/test/tools/llvm-exegesis/lit.local.cfg (+2-1)
diff --git a/llvm/test/tools/llvm-exegesis/lit.local.cfg b/llvm/test/tools/llvm-exegesis/lit.local.cfg
index a51a2d73442fa..8b7c13f5eb1b6 100644
--- a/llvm/test/tools/llvm-exegesis/lit.local.cfg
+++ b/llvm/test/tools/llvm-exegesis/lit.local.cfg
@@ -24,9 +24,10 @@ def can_use_perf_counters(mode, extra_options=[]):
             + extra_options,
             stdout=subprocess.DEVNULL,
             stderr=subprocess.DEVNULL,
+            timeout=1,
         )
         return return_code == 0
-    except OSError:
+    except (OSError, subprocess.TimeoutExpired):
         print("could not exec llvm-exegesis")
         return False
 

Copy link
Contributor

@boomanaiden154 boomanaiden154 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tried getting a stack trace?

I haven't seen any timeout issues with this code reported in the ~2 years it's been in tree. There was a race condition fixed about a year ago, but that just caused test failures rather than timeouts.

I would probably need some convincing that this is a problem with llvm-exegesis itself rather than your environment to accept this patch.

@boomanaiden154
Copy link
Contributor

Also, what LLVM version is this with?

@stephen-huan
Copy link
Author

Have you tried getting a stack trace?

What should I run exactly? Sorry, not very knowledgeable, just building llvm for other projects (triton-cpu, packaging).

I would probably need some convincing that this is a problem with llvm-exegesis itself rather than your environment to accept this patch.

Sure, I'm building llvm with the nix package manager which tries to guarantee a completely deterministic build environment for reproducibility (nix builds packages in a sandbox which only allows access to other nix packages, so ideally builds would be bit-for-bit reproducible). That is to say, there are very few environment-specific impurities outside of hardware (my system doesn't have a /bin, /lib, or /usr for example). Unfortunately others (NixOS/nixpkgs#392651 (comment)) can't reproduce it either and I haven't experienced it on my laptop (running roughly the same system). I also tried 17.0.0 (88bf774) on a different computer (a slurm cluster) and I can't reproduce. This desktop is strange I guess.

desktop information
Dell Precision 3680

Host: Dell Inc. 0VTKY7
CPU: Intel i7-14700K (28) @ 5.500GHz
GPU: NVIDIA GeForce RTX 4070
GPU: Intel Raptor Lake-S GT1 [UHD Graphics 770]

Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   28
  On-line CPU(s) list:    0-27
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Core(TM) i7-14700K
    CPU family:           6
    Model:                183
    Thread(s) per core:   2
    Core(s) per socket:   20
    Socket(s):            1
    Stepping:             1
    CPU(s) scaling MHz:   35%
    CPU max MHz:          5600.0000
    CPU min MHz:          800.0000
    BogoMIPS:             6835.20
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat p
                          se36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdp
                          e1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl
                           xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqd
                          q dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pd
                          cm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xs
                          ave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd
                          ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_
                          ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx
                          smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves
                           split_lock_detect user_shstk avx_vnni dtherm ida arat pln pts hwp
                          hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi vnmi umip pku osp
                          ke waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_cle
                          ar serialize pconfig arch_lbr ibt flush_l1d arch_capabilities
Virtualization features:
  Virtualization:         VT-x
Caches (sum of all):
  L1d:                    768 KiB (20 instances)
  L1i:                    1 MiB (20 instances)
  L2:                     28 MiB (11 instances)
  L3:                     33 MiB (1 instance)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-27
Vulnerabilities:
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Mitigation; Clear Register File
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitizatio
                          n
  Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB fillin
                          g; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
  Srbds:                  Not affected
  Tsx async abort:        Not affected

Also, what LLVM version is this with?

19.1.7 and latest master (tried a66376b, e6de45a). The earliest release I could find with the code was 17.0.0 (88bf774) but I haven't tried it with this desktop yet.

Don't feel strongly about this PR, in any case we can maintain the patch downstream.

@boomanaiden154
Copy link
Contributor

What should I run exactly? Sorry, not very knowledgeable, just building llvm for other projects (triton-cpu, packaging).

Attach lldb to the llvm-exegesis process that is hanging and then run bt. Maybe on the subprocess too that gets spawned, although that is a lot more complicated to collect. You need to attach a debugger to the main process, manually execute a PTRACE_DETACH from within that process, make sure a SIGSTOP is sent correctly, and then attach a debugger to the child process. The debugging might be more effort than it's worth though, depending upon how deep you're willing to dig into the issue.

Unfortunately others (NixOS/nixpkgs#392651 (comment)) can't reproduce it either and I haven't experienced it on my laptop (running roughly the same system). I also tried 17.0.0 (88bf774) on a different computer (a slurm cluster) and I can't reproduce. This desktop is strange I guess.

To me this further suggests it's something weird with your system rather than the test/code itself. These tests don't get exercised as much as they probably should as they need bare metal hardware access for performance counters (might be good to double check with other people that can't reproduce that the tests are actually running and not just disabled), but they do get run on lots of systems, and there haven't been any reports until now.

@stephen-huan
Copy link
Author

Thanks for the detailed advice (stacktrace below). It seems it's getting stuck on poll (no clue why, and no clue how to determine whether this is a kernel or userspace bug). I guess the 10000 ms poll timeout * 160 tries = about half an hour of waiting theoretically. Maybe a simple fix is to just adjust the constants here to something more reasonable?

// The max number of time-outs/retries before we give up.
static constexpr int kMaxTimeouts = 160;

(lldb) bt
* thread #1, name = 'llvm-exegesis', stop reason = signal SIGSTOP
  * frame #0: 0x00007f5391f0cd04 libc.so.6`__poll + 20
    frame #1: 0x000000000143aa60 llvm-exegesis`llvm::exegesis::X86LbrCounter::doReadCounter(void const*, void const*) const [inlined] poll(__fds=0x00007ffc422cf100, __nfds=1, __timeout=10000) at poll2.h:44:10
    frame #2: 0x000000000143aa5b llvm-exegesis`llvm::exegesis::X86LbrCounter::doReadCounter(void const*, void const*) const [inlined] pollLbrPerfEvent(FileDescriptor=<unavailable>) at X86Counter.cpp:54:14
    frame #3: 0x000000000143aa42 llvm-exegesis`llvm::exegesis::X86LbrCounter::doReadCounter(this=0x00007ffc422cf680, From=0x0000000000000000, To=0x0000000000000000) const at X86Counter.cpp:224:34
    frame #4: 0x000000000143b28d llvm-exegesis`llvm::exegesis::X86LbrCounter::checkLbrSupport() at X86Counter.cpp:182:62
    frame #5: 0x0000000001433458 llvm-exegesis`llvm::exegesis::(anonymous namespace)::ExegesisX86Target::checkFeatureSupport(this=<unavailable>) const at Target.cpp:833:45
    frame #6: 0x00000000004d638a llvm-exegesis`llvm::exegesis::benchmarkMain() at llvm-exegesis.cpp:494:60

ExitOnErr(State.getExegesisTarget().checkFeatureSupport());

return X86LbrCounter::checkLbrSupport();

auto ResultOrError = counter.doReadCounter(nullptr, nullptr);

PollResult = pollLbrPerfEvent(getFileDescriptor());

return poll(&PollFd, 1 /* num of fds */, 10000 /* timeout in ms */);

@boomanaiden154
Copy link
Contributor

Interesting. That test has nothing to do with the subprocess executor, but I can see why there might be some issues there.

It doesn't seem like the tests are even running on my system at all by default (Skylake-X, so definitely has LBR) so it looks like there might have been a lack of test coverage for this for a while. Although it passes if I remove the -mcpu flag so it can actually find a sched model with perf counter mappings and I bypass the REQUIRES:. Updating the timeout thresholds I think makes sense. Something has definitely gone wrong if we're not getting results within a second. I'm not familiar enough with how LBR is handled in the perf subsystem to be able to see anything wrong by inspection.

@oontvoo I know this was a long time ago, but do you have any recollection of why you set up the timeout to be so long?

boomanaiden154 added a commit that referenced this pull request Mar 25, 2025
This patch fixes the LBR check in the local lit config. The test would
segfault as the loop body would be basically empty, causing a divide by
zero error. More investigation is needed there so we do not actually hit
that assertion and report a cleaner error somewhere. Specifying an
actual opcode to benchmark fixes the problem. The test would also fail
as -mcpu was set to the default x86 CPU rather than the one currently
being run on, so it would always fail to find a perf counter. This patch
fixes that by simply removing the -mcpu flag.

Given these issues, I'm not sure these tests have ever run in the ~5
years they have been in tree. There were some issues reported in
\#132861, so I guess we'll see if there are further issues when the
testing becomes more broad.
@boomanaiden154
Copy link
Contributor

It's actually interesting that this test ran at all on your system. By default in tree you end up with a divide by zero error because the loop body is empty given the input is a snippet file pointing at /dev/null when running the feature detection.

I pushed e696f4e to fix that.

@stephen-huan
Copy link
Author

After applying e696f4e I still get the hang (and stack trace the same). Simpler reproduction, hangs on the following.

./build/bin/llvm-exegesis -mode latency -opcode-name=ADD64rr -x86-lbr-sample-period 123 -repetition-mode loop

When it doesn't hang I get the error

llvm-exegesis error: can't run 'latency' mode, sched model does not define a cycle counter. You can pass --benchmark-phase=... to skip the actual benchmarking or --use-dummy-perf-counters to not query the kernel for real event counts.

So it either gives the above warning or hangs. The three calls I mentioned in #132861 (comment) map to

./build/bin/llvm-exegesis -mode latency -opcode-name=ADD64rr
./build/bin/llvm-exegesis -mode uops -opcode-name=ADD64rr
./build/bin/llvm-exegesis -mode latency -opcode-name=ADD64rr -x86-lbr-sample-period 123 -repetition-mode loop
llvm-exegesis error: can't run 'latency' mode, sched model does not define a cycle counter. You can pass --benchmark-phase=... to skip the actual benchmarking or --use-dummy-perf-counters to not query the kernel for real event counts.
llvm-exegesis error: can't run 'uops' mode, sched model does not define uops or issue counters. You can pass --benchmark-phase=... to skip the actual benchmarking or --use-dummy-perf-counters to not query the kernel for real event counts.
llvm-exegesis error: can't run 'latency' mode, sched model does not define a cycle counter. You can pass --benchmark-phase=... to skip the actual benchmarking or --use-dummy-perf-counters to not query the kernel for real event counts.

It's actually interesting that this test ran at all on your system.

I'm not sure which test you're referring to exactly, but to be clear I was running ninja -C build check-all which loads the configuration llvm/test/tools/llvm-exegesis/lit.local.cfg regardless of whether the tests are unsupported or not. Since I get a return code 1 for those three calls, many of the llvm-exegesis tests are unsupported. That is, I get

ninja -C build test/check-llvm-tools-llvm-exegesis
Testing Time: 1.17s

Total Discovered Tests: 102
  Unsupported: 29 (28.43%)
  Passed     : 73 (71.57%)

@boomanaiden154
Copy link
Contributor

After applying e696f4e I still get the hang (and stack trace the same). Simpler reproduction, hangs on the following.

The patch isn't intended to fix anything, just actually get the test running on more systems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants