
{llvm,triton-llvm}: fix nondeterministic hang #392651

Open · wants to merge 2 commits into base: staging

Conversation

stephen-huan
Member

I'm experiencing a nondeterministic hang while running llvm's tests (it gets stuck on a lit test and never finishes, even after waiting a long time). Other times it finishes fine. It seems to be more stable with 4 cores or fewer.

Possibly caused by llvm/llvm-project#56336 (see llvm/llvm-project@61708ec and the following comment)

# FIXME: This test is flaky and hangs randomly on multi-core systems.
# See https://github.com/llvm/llvm-project/issues/56336 for more
# details.
# REQUIRES:  less-than-4-cpu-cores-in-parallel
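
For reference, that REQUIRES clause only works because the lit test suite's own config advertises the feature on small machines. Roughly (my own sketch, not the upstream llvm/utils/lit/tests/lit.cfg, which may differ in detail):

# Sketch of how a lit suite config can advertise the feature that the
# REQUIRES line above checks. `config` is the object lit injects when it
# executes this file during test discovery.
import os

if (os.cpu_count() or 1) < 4:
    config.available_features.add("less-than-4-cpu-cores-in-parallel")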

Not sure if this is the only flaky test as the hang is nondeterministic and llvm takes a long time to build.

Of course this could be fixed with --cores 4, but that would slow down the whole build (and not just the tests).

Happy to change the PR to just disable this test (rm llvm/utils/lit/tests/max-failures.py) if that would be cleaner.

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 25.05 Release Notes (or backporting 24.11 and 25.05 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

Add a 👍 reaction to pull requests you find important.

@RossComputerGuy
Member

I haven't been able to reproduce this issue. Tests pass after a few minutes. Limiting the lit jobs would severely slow down LLVM builds for me.

@stephen-huan
Member Author

I haven't been able to reproduce this issue. Tests pass after a few minutes.

What degree of parallelism are you using? It hangs for me with reasonably high probability (roughly one in two or three runs) at 8 cores, and I suspect that more cores make it more likely. I haven't tested > 8 since memory becomes the bottleneck.

Limiting the lit jobs would severely slow down LLVM builds for me.

I noticed this as well. Would disabling the test instead be preferred (rm llvm/utils/lit/tests/max-failures.py)? I didn't originally write the PR this way because I was concerned there might be other flaky tests, but I can disable them as they arise. (It's hard to test because llvm takes a long time to build and I would have to build repeatedly to be confident.)

@RossComputerGuy
Member

What degree of parallelism are you using? It hangs for me with reasonably high probability (roughly one in two or three runs) at 8 cores, and I suspect that more cores make it more likely. I haven't tested > 8 since memory becomes the bottleneck.

My system's default of 64 cores. I've been packaging LLVM since 17 was released and I've never experienced this issue.

I noticed this as well. Would disabling the test instead be preferred (rm llvm/utils/lit/tests/max-failures.py)? I didn't originally write the PR this way because I was concerned there might be other flaky tests, but I can disable them as they arise. (It's hard to test because llvm takes a long time to build and I would have to build repeatedly to be confident.)

That is the correct approach here. The other thing is that it might not be a bad idea to investigate what exactly triggers this. I've never experienced this natively on aarch64, x86_64, or riscv64.

@stephen-huan
Member Author

the other thing is that it might not be a bad idea to investigate what exactly triggers this.

Thanks for the tip. It turns out the lit test was a red herring, as the REQUIRES directive already correctly disables the test (!).

# REQUIRES:  less-than-4-cpu-cores-in-parallel

The real cause is a random hang while loading the configuration llvm/test/tools/llvm-exegesis/lit.local.cfg before the tests are even run (see llvm/llvm-project#132861).
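
For anyone curious how a hang can happen before any test runs: lit.local.cfg files are plain Python that lit executes while discovering tests, and the llvm-exegesis one probes the host by actually invoking llvm-exegesis. The following is only my rough approximation of that pattern (the function name and the timeout are mine, not the upstream file):

# Approximate shape of a config-time probe like the one in
# llvm/test/tools/llvm-exegesis/lit.local.cfg. Illustrative, not the upstream
# code: lit runs this Python during test discovery, so if the probed binary
# hangs, the whole run hangs before a single test starts.
import subprocess

def can_use_lbr(exegesis_binary):
    # Run llvm-exegesis once with LBR sampling and check whether it succeeds.
    try:
        result = subprocess.run(
            [exegesis_binary, "-mode", "latency",
             "-opcode-name=ADD64rr", "-x86-lbr-sample-period", "123",
             "-repetition-mode", "loop"],
            capture_output=True,
            timeout=60,  # a bound like this would turn the hang into a skip
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0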

@stephen-huan
Member Author

I have a simpler reproduction, just using llvm-exegesis from pkgs.llvmPackages.llvm (no need to rebuild llvm). Run

llvm-exegesis -mode latency -opcode-name=ADD64rr -x86-lbr-sample-period 123 -repetition-mode loop

On the problematic desktop, it gives

---
mode:            latency
key:
  instructions:
    - 'ADD64rr RAX RAX R12'
  config:          ''
  register_initial_values:
    - 'RAX=0x0'
    - 'R12=0x0'
cpu_name:        alderlake
llvm_triple:     x86_64-unknown-linux-gnu
min_instructions: 10000
measurements:
  - { key: latency, value: 0.0001, per_snippet_value: 0.0001, validation_counters: {} }
error:           ''
info:            Repeating a single implicitly serial instruction
assembled_snippet: 415448B8000000000000000049BC000000000000000049B802000000000000004C01E04C01E04983C0FF75F4415CC3
...

or hangs. On my laptop (with basically the same NixOS configuration) it gives

llvm-exegesis error: LBR not supported on this kernel and/or platform

without hanging. It seems that a hardware difference determines whether LBR is available.
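
A quick way to check from userspace whether branch-stack sampling (LBR on Intel) is usable. This is my own sketch, not something from the LLVM tree; it assumes perf is on PATH and that perf fails cleanly when branch sampling is unsupported:

# Hedged sketch: ask perf for branch-stack sampling (`perf record -b`), which
# fails on kernels/CPUs that don't support it (LBR on Intel hardware).
import subprocess

def host_supports_branch_sampling():
    try:
        result = subprocess.run(
            ["perf", "record", "-b", "-e", "cycles",
             "-o", "/dev/null", "--", "true"],
            capture_output=True,
            timeout=30,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0

print("branch sampling (LBR) usable:", host_supports_branch_sampling())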
