{llvm,triton-llvm}: fix nondeterministic hang #392651

Open
wants to merge 2 commits into
base: staging

stephen-huan
Member

I'm experiencing a nondeterministic hang while running llvm's tests (it gets stuck on a lit test and doesn't finish, even after waiting for a long time). Other times it finishes fine. Seems to be more stable with 4 cores or less.

Possibly caused by llvm/llvm-project#56336 (see llvm/llvm-project@61708ec and the following comment)

# FIXME: This test is flaky and hangs randomly on multi-core systems.
# See https://github.com/llvm/llvm-project/issues/56336 for more
# details.
# REQUIRES:  less-than-4-cpu-cores-in-parallel

Not sure if this is the only flaky test as the hang is nondeterministic and llvm takes a long time to build.

Of course this could be fixed by --cores 4 but that would slow down the build as well (and not just the tests).

Happy to change the PR to just disable this test (rm llvm/utils/lit/tests/max-failures.py) if that would be cleaner.

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 25.05 Release Notes (or backporting 24.11 and 25.05 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

@github-actions github-actions bot added the 6.topic: llvm/clang Issues related to llvmPackages, clangStdenv and related label Mar 24, 2025
@github-actions github-actions bot added 10.rebuild-darwin: 5001+ This PR causes many rebuilds on Darwin and must target the staging branches. 10.rebuild-darwin: 501+ This PR causes many rebuilds on Darwin and should normally target the staging branches. 10.rebuild-linux: 5001+ This PR causes many rebuilds on Linux and must target the staging branches. 10.rebuild-linux: 501+ This PR causes many rebuilds on Linux and should normally target the staging branches. labels Mar 24, 2025
@nix-owners nix-owners bot requested review from SomeoneSerge and Madouura March 24, 2025 08:21
@stephen-huan stephen-huan changed the base branch from master to staging March 24, 2025 08:26
@RossComputerGuy
Member

I haven't been able to reproduce this issue. Tests pass after a few minutes. Limiting the lit jobs for me would severely slow LLVM builds down.

@stephen-huan
Member Author

I haven't been able to reproduce this issue. Tests pass after a few minutes.

What degree of parallelism are you using? It hangs for me with reasonably high probability (around one in two/three) at 8 cores, and I suspect the more cores the more likely. I haven't tested > 8 since memory becomes the bottleneck.

Limiting the lit jobs for me would severely slow LLVM builds down.

I noticed this as well. Would disabling the test instead be preferred (rm llvm/utils/lit/tests/max-failures.py)? I didn't originally write this PR in this way because I was concerned there may be other flaky tests, but I can disable them as they arise. (It's hard to test because llvm takes a long time to build + would have to build repeatedly to be confident.)
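
A rough way to size that repeated-build effort, assuming each build hangs independently with the probability estimated above (between one in two and one in three at 8 cores): the chance that n builds in a row all pass purely by luck is (1 - p)^n. The sketch below is illustrative only. `builds_needed` is a hypothetical helper name, and the independence assumption may not hold in practice.

```python
import math


def builds_needed(hang_probability, confidence=0.95):
    """Smallest n such that n clean runs in a row would happen by chance
    with probability at most (1 - confidence), if the hang were still present.

    Assumes every run hangs independently with the same probability.
    """
    if not 0.0 < hang_probability < 1.0:
        raise ValueError("hang_probability must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Solve (1 - p)**n <= 1 - confidence for the smallest integer n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - hang_probability))


print(builds_needed(1 / 2))  # 5
print(builds_needed(1 / 3))  # 8
```

Under these assumptions, somewhere between five and eight consecutive clean builds would be needed before a fix looks convincing at the 95% level, which is why a reproduction that avoids rebuilding llvm is valuable.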

@RossComputerGuy
Member

What degree of parallelism are you using? It hangs for me with reasonably high probability (around one in two/three) at 8 cores, and I suspect the more cores the more likely. I haven't tested > 8 since memory becomes the bottleneck.

My system's default of 64 cores. I've been packaging LLVM since 17 was released and I've never experienced this issue.

I noticed this as well. Would disabling the test instead be preferred (rm llvm/utils/lit/tests/max-failures.py)? I didn't originally write this PR in this way because I was concerned there may be other flaky tests, but I can disable them as they arise. (It's hard to test because llvm takes a long time to build + would have to build repeatedly to be confident.)

That is the correct approach here, the other thing is it might not be a bad idea to investigate into what exactly triggers this. I've never experienced this natively on aarch64, x86_64, and riscv64.

@stephen-huan
Member Author

the other thing is it might not be a bad idea to investigate into what exactly triggers this.

Thanks for the tip. It turns out the lit test was a red herring as the REQUIRES already correctly disables the test (!).

# REQUIRES:  less-than-4-cpu-cores-in-parallel

The real cause is a random hang while loading the configuration llvm/test/tools/llvm-exegesis/lit.local.cfg, before the tests are even run (see llvm/llvm-project#132861).

@stephen-huan
Member Author

I have a simpler reproduction, just using llvm-exegesis from pkgs.llvmPackages.llvm (no need to rebuild llvm). Run

llvm-exegesis -mode latency -opcode-name=ADD64rr -x86-lbr-sample-period 123 -repetition-mode loop

On the problematic desktop, it gives

---
mode:            latency
key:
  instructions:
    - 'ADD64rr RAX RAX R12'
  config:          ''
  register_initial_values:
    - 'RAX=0x0'
    - 'R12=0x0'
cpu_name:        alderlake
llvm_triple:     x86_64-unknown-linux-gnu
min_instructions: 10000
measurements:
  - { key: latency, value: 0.0001, per_snippet_value: 0.0001, validation_counters: {} }
error:           ''
info:            Repeating a single implicitly serial instruction
assembled_snippet: 415448B8000000000000000049BC000000000000000049B802000000000000004C01E04C01E04983C0FF75F4415CC3
...

or hangs. On my laptop (with basically the same nixos configuration) it gives

llvm-exegesis error: LBR not supported on this kernel and/or platform

without hanging. Seems that a hardware difference determines whether LBR is available or not.
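
To estimate how often the command hangs without watching it by hand, it can be run repeatedly under a timeout. The following is a minimal sketch and not part of the PR: `probe`, `tally`, and the timeout value are assumptions for illustration, and the command list simply mirrors the reproduction above. It also assumes that the "LBR not supported" case exits with a nonzero status, which has not been verified here.

```python
import subprocess
import sys

# Same invocation as the reproduction above, split into an argument list.
EXEGESIS_CMD = [
    "llvm-exegesis",
    "-mode", "latency",
    "-opcode-name=ADD64rr",
    "-x86-lbr-sample-period", "123",
    "-repetition-mode", "loop",
]


def probe(cmd, timeout_s):
    """Run cmd once and classify the outcome.

    Returns "ok" (exit status 0), "error" (nonzero exit status),
    "hang" (still running after timeout_s seconds, then killed),
    or "missing" (executable not found).
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "hang"
    except FileNotFoundError:
        return "missing"
    return "ok" if result.returncode == 0 else "error"


def tally(cmd, runs, timeout_s):
    """Run cmd `runs` times and count each kind of outcome."""
    counts = {"ok": 0, "error": 0, "hang": 0, "missing": 0}
    for _ in range(runs):
        counts[probe(cmd, timeout_s)] += 1
    return counts
```

For example, `print(tally(EXEGESIS_CMD, runs=20, timeout_s=60.0))` would report how many of 20 runs finished, failed, or had to be killed. The 60-second timeout is a guess and should be well above the time a successful run takes on the machine in question.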

@wegank wegank added the 2.status: merge conflict This PR has merge conflicts with the target branch label Apr 2, 2025
@github-actions github-actions bot added 6.topic: python Python is a high-level, general-purpose programming language. 6.topic: vscode A free and versatile code editor that supports almost every major programming language. 6.topic: php PHP is a general-purpose scripting language geared towards web development. labels Apr 5, 2025
@ofborg ofborg bot removed the 2.status: merge conflict This PR has merge conflicts with the target branch label Apr 5, 2025
@nix-owners

nix-owners bot commented Apr 5, 2025

The PR's base branch is set to staging, but 33 commits from the master branch are included. Make sure you know the right base branch for your changes, then:

  • If the changes should go to the master branch, change the base branch to master
  • If the changes should go to the staging branch, rebase your PR onto the merge base with the staging branch:
    # git rebase --onto $(git merge-base upstream/staging HEAD) $(git merge-base upstream/master HEAD)
    git rebase --onto 3b48b2eb41f0bcd2c0551cd1c2457fdae806c7a3 1fc111791739091bde809be0a03a628e19c2e3d6
    git push --force-with-lease

@github-actions github-actions bot removed 6.topic: python Python is a high-level, general-purpose programming language. 6.topic: vscode A free and versatile code editor that supports almost every major programming language. 6.topic: php PHP is a general-purpose scripting language geared towards web development. labels Apr 5, 2025
@wegank wegank added the 2.status: merge conflict This PR has merge conflicts with the target branch label May 17, 2025
@ofborg ofborg bot removed the 2.status: merge conflict This PR has merge conflicts with the target branch label Jun 23, 2025
@github-actions github-actions bot added the 10.rebuild-darwin-stdenv This PR causes stdenv to rebuild on Darwin and must target a staging branch. label Jun 23, 2025

4 participants