
{llvm,triton-llvm}: fix nondeterministic hang #392651

Open · wants to merge 2 commits into base: staging

Conversation

stephen-huan
Member

I'm experiencing a nondeterministic hang while running llvm's tests (it gets stuck on a lit test and never finishes, even after waiting a long time). Other times it finishes fine. It seems to be more stable with 4 cores or fewer.

Possibly caused by llvm/llvm-project#56336 (see llvm/llvm-project@61708ec and the following comment)

# FIXME: This test is flaky and hangs randomly on multi-core systems.
# See https://github.com/llvm/llvm-project/issues/56336 for more
# details.
# REQUIRES:  less-than-4-cpu-cores-in-parallel
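
For reference, that REQUIRES clause only works because the lit test suite's own config advertises the feature on small machines. Roughly (my own sketch, not the upstream llvm/utils/lit/tests/lit.cfg, which may differ in detail):

# Sketch of how a lit suite config can advertise the feature that the
# REQUIRES line above checks. `config` is the object lit injects when it
# executes this file during test discovery.
import os

if (os.cpu_count() or 1) < 4:
    config.available_features.add("less-than-4-cpu-cores-in-parallel")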

Not sure if this is the only flaky test as the hang is nondeterministic and llvm takes a long time to build.

Of course this could be fixed with --cores 4, but that would slow down the whole build (and not just the tests).

Happy to change the PR to just disable this test (rm llvm/utils/lit/tests/max-failures.py) if that would be cleaner.

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 25.05 Release Notes (or backporting 24.11 and 25.05 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

Add a 👍 reaction to pull requests you find important.

@RossComputerGuy
Member

I haven't been able to reproduce this issue. Tests pass after a few minutes. Limiting the lit jobs would severely slow down LLVM builds for me.

@stephen-huan
Member Author

I haven't been able to reproduce this issue. Tests pass after a few minutes.

What degree of parallelism are you using? It hangs for me with reasonably high probability (roughly one in two or three runs) at 8 cores, and I suspect that more cores make it more likely. I haven't tested > 8 since memory becomes the bottleneck.

Limiting the lit jobs would severely slow down LLVM builds for me.

I noticed this as well. Would disabling the test instead be preferred (rm llvm/utils/lit/tests/max-failures.py)? I didn't originally write the PR this way because I was concerned there might be other flaky tests, but I can disable them as they arise. (It's hard to test because llvm takes a long time to build and I would have to build repeatedly to be confident.)

@RossComputerGuy
Member

What degree of parallelism are you using? It hangs for me with reasonably high probability (roughly one in two or three runs) at 8 cores, and I suspect that more cores make it more likely. I haven't tested > 8 since memory becomes the bottleneck.

My system's default of 64 cores. I've been packaging LLVM since 17 was released and I've never experienced this issue.

I noticed this as well. Would disabling the test instead be preferred (rm llvm/utils/lit/tests/max-failures.py)? I didn't originally write the PR this way because I was concerned there might be other flaky tests, but I can disable them as they arise. (It's hard to test because llvm takes a long time to build and I would have to build repeatedly to be confident.)

That is the correct approach here. The other thing is that it might not be a bad idea to investigate what exactly triggers this. I've never experienced this natively on aarch64, x86_64, or riscv64.

@stephen-huan
Member Author

the other thing is that it might not be a bad idea to investigate what exactly triggers this.

Thanks for the tip. It turns out the lit test was a red herring, as the REQUIRES directive already correctly disables the test (!).

# REQUIRES:  less-than-4-cpu-cores-in-parallel

The real cause is a random hang while loading the configuration llvm/test/tools/llvm-exegesis/lit.local.cfg before the tests are even run (see llvm/llvm-project#132861).
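
For anyone curious how a hang can happen before any test runs: lit.local.cfg files are plain Python that lit executes while discovering tests, and the llvm-exegesis one probes the host by actually invoking llvm-exegesis. The following is only my rough approximation of that pattern (the function name and the timeout are mine, not the upstream file):

# Approximate shape of a config-time probe like the one in
# llvm/test/tools/llvm-exegesis/lit.local.cfg. Illustrative, not the upstream
# code: lit runs this Python during test discovery, so if the probed binary
# hangs, the whole run hangs before a single test starts.
import subprocess

def can_use_lbr(exegesis_binary):
    # Run llvm-exegesis once with LBR sampling and check whether it succeeds.
    try:
        result = subprocess.run(
            [exegesis_binary, "-mode", "latency",
             "-opcode-name=ADD64rr", "-x86-lbr-sample-period", "123",
             "-repetition-mode", "loop"],
            capture_output=True,
            timeout=60,  # a bound like this would turn the hang into a skip
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0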

@stephen-huan
Member Author

I have a simpler reproduction, just using llvm-exegesis from pkgs.llvmPackages.llvm (no need to rebuild llvm). Run

llvm-exegesis -mode latency -opcode-name=ADD64rr -x86-lbr-sample-period 123 -repetition-mode loop

On the problematic desktop, it gives

---
mode:            latency
key:
  instructions:
    - 'ADD64rr RAX RAX R12'
  config:          ''
  register_initial_values:
    - 'RAX=0x0'
    - 'R12=0x0'
cpu_name:        alderlake
llvm_triple:     x86_64-unknown-linux-gnu
min_instructions: 10000
measurements:
  - { key: latency, value: 0.0001, per_snippet_value: 0.0001, validation_counters: {} }
error:           ''
info:            Repeating a single implicitly serial instruction
assembled_snippet: 415448B8000000000000000049BC000000000000000049B802000000000000004C01E04C01E04983C0FF75F4415CC3
...

or hangs. On my laptop (with basically the same NixOS configuration) it gives

llvm-exegesis error: LBR not supported on this kernel and/or platform

without hanging. It seems that a hardware difference determines whether LBR is available.
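
A quick way to check from userspace whether branch-stack sampling (LBR on Intel) is usable. This is my own sketch, not something from the LLVM tree; it assumes perf is on PATH and that perf fails cleanly when branch sampling is unsupported:

# Hedged sketch: ask perf for branch-stack sampling (`perf record -b`), which
# fails on kernels/CPUs that don't support it (LBR on Intel hardware).
import subprocess

def host_supports_branch_sampling():
    try:
        result = subprocess.run(
            ["perf", "record", "-b", "-e", "cycles",
             "-o", "/dev/null", "--", "true"],
            capture_output=True,
            timeout=30,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0

print("branch sampling (LBR) usable:", host_supports_branch_sampling())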
