Build failure: magma #393035

Closed
3 tasks done
gshpychka opened this issue Mar 25, 2025 · 3 comments
Labels: 0.kind: build failure (A package fails to build); 6.topic: cuda (Parallel computing platform and API)

@gshpychka
Contributor

gshpychka commented Mar 25, 2025

Nixpkgs version

  • Unstable (25.05)

Steps to reproduce

Build magma-2.9.0

I have cudaSupport set to true in my nixpkgs config

Can Hydra reproduce this build failure?

I was not able to find a Hydra job for magma-2.9.0. I could find one in staging-next, but it was last run in February for 2.7.2 and it was manually canceled: https://hydra.nixos.org/job/nixpkgs/staging-next/magma.x86_64-linux

Link to Hydra build job

No response

Relevant log output

/build/cci1gAqQ.s: Fatal error: CMakeFiles/magma.dir/magmablas/slarf_batched_fused_reg.cu.o: No such file or directory
[1391/3492] Building CUDA object CMakeFiles/magma.dir/magmablas/dlarf_batched_fused_reg.cu.o
[1392/3492] Building CUDA object CMakeFiles/magma.dir/magmablas/sgetf2_kernels.cu.o
[1393/3492] Building CUDA object CMakeFiles/magma.dir/magmablas/zgeqr2_batched_fused_reg_medium.cu.o
FAILED: CMakeFiles/magma.dir/magmablas/zgeqr2_batched_fused_reg_medium.cu.o
/nix/store/m9xp0gz6a7hix8jkhqw9as0z7vn25cnl-cuda_nvcc-12.4.131/bin/nvcc -forward-unknown-to-host-compiler -ccbin=/nix/store/npgifb317zzrf99pppxa51j0s4a1ha8y-gcc-wrapper-13.3.0/bin/c++  -I/build/magma-2.9.0/build/include ->
/nix/store/71xyq87gpb2qrwn314p0sf2n002lgd91-glibc-2.40-66-dev/include/bits/mathcalls.h(158): catastrophic error: error while writing generated C file: No space left on device

1 catastrophic error detected in the compilation of "/build/magma-2.9.0/magmablas/zgeqr2_batched_fused_reg_medium.cu".
Compilation terminated.
[1394/3492] Building CUDA object CMakeFiles/magma.dir/magmablas/dgetf2_kernels.cu.o
FAILED: CMakeFiles/magma.dir/magmablas/dgetf2_kernels.cu.o
/nix/store/m9xp0gz6a7hix8jkhqw9as0z7vn25cnl-cuda_nvcc-12.4.131/bin/nvcc -forward-unknown-to-host-compiler -ccbin=/nix/store/npgifb317zzrf99pppxa51j0s4a1ha8y-gcc-wrapper-13.3.0/bin/c++  -I/build/magma-2.9.0/build/include ->
In file included from tmpxft_0000561b_00000000-6_dgetf2_kernels.compute_90a.cudafe1.stub.c:1:
/build/tmpxft_0000561b_00000000-6_dgetf2_kernels.compute_90a.cudafe1.stub.c:68:27: fatal error: error writing to /build/ccOgSeQg.s: No space left on device
   68 | static void __device_stub__Z27dgetf2_fused_kernel_batchedILi30EEviPPdiiiPPiS2_i(int, double **, int, int, int, magma_int_t **, magma_int_t *, int);
      |                           ^
compilation terminated.
[1395/3492] Building CUDA object CMakeFiles/magma.dir/magmablas/sgeqr2_batched_fused_reg.cu.o
FAILED: CMakeFiles/magma.dir/magmablas/sgeqr2_batched_fused_reg.cu.o
/nix/store/m9xp0gz6a7hix8jkhqw9as0z7vn25cnl-cuda_nvcc-12.4.131/bin/nvcc -forward-unknown-to-host-compiler -ccbin=/nix/store/npgifb317zzrf99pppxa51j0s4a1ha8y-gcc-wrapper-13.3.0/bin/c++  -I/build/magma-2.9.0/build/include ->
In file included from /build/tmpxft_00005235_00000000-6_sgeqr2_batched_fused_reg.compute_90a.cudafe1.stub.c:9,
                 from tmpxft_00005235_00000000-6_sgeqr2_batched_fused_reg.compute_90a.cudafe1.stub.c:1:
/build/tmpxft_00005235_00000000-3_sgeqr2_batched_fused_reg.fatbin.c:1983280:1: error: missing terminating " character
1983280 | ".quad 0x0000002c00081104,0x0008120400000000,0x000000000000041e,0x0000041e000
        | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/build/tmpxft_00005235_00000000-3_sgeqr2_batched_fused_reg.fatbin.c:1983279:86: error: expected ')' before 'static'
1983279 | ".quad 0x000000000000002d,0x0000002d00081104,0x0008120400000000,0x000000000000002c\n"
        |                                                                                      ^
        |                                                                                      )
In file included from /build/tmpxft_00005235_00000000-6_sgeqr2_batched_fused_reg.compute_90a.cudafe1.stub.c:8:
/build/tmpxft_00005235_00000000-6_sgeqr2_batched_fused_reg.compute_90a.cudafe1.stub.c: In function 'void __sti____cudaRegisterAll()':
/build/tmpxft_00005235_00000000-6_sgeqr2_batched_fused_reg.compute_90a.cudafe1.stub.c:541:44: error: '__fatDeviceText' was not declared in this scope
  541 | static void __sti____cudaRegisterAll(void){__cudaRegisterBinary(__nv_cudaEntityRegisterCallback);}
      |                                            ^~~~~~~~~~~~~~~~~~~~
[1396/3492] Building CUDA object CMakeFiles/magma.dir/magmablas/dgeqr2_batched_fused_reg.cu.o
[1397/3492] Building CUDA object CMakeFiles/magma.dir/magmablas/zgeqr2_batched_fused_reg.cu.o
[1398/3492] Building CUDA object CMakeFiles/magma.dir/magmablas/cgetf2_kernels.cu.o
[1399/3492] Building CUDA object CMakeFiles/magma.dir/magmablas/zgetf2_kernels_var.cu.o
[1400/3492] Building CUDA object CMakeFiles/magma.dir/magmablas/sgetf2_native_kernel.cu.o
[1401/3492] Building CUDA object CMakeFiles/magma.dir/magmablas/sgeqr2_batched_fused_reg_tall.cu.o
[1402/3492] Building CUDA object CMakeFiles/magma.dir/magmablas/sgeqr2_batched_fused_reg_medium.cu.o
[1403/3492] Building CUDA object CMakeFiles/magma.dir/magmablas/dgeqr2_batched_fused_reg_tall.cu.o
[1404/3492] Building CUDA object CMakeFiles/magma.dir/magmablas/cgeqr2_batched_fused_reg.cu.o
[1405/3492] Building CUDA object CMakeFiles/magma.dir/magmablas/dgeqr2_batched_fused_reg_medium.cu.o
[1406/3492] Building CUDA object CMakeFiles/magma.dir/magmablas/dlarf_batched_fused_reg_medium.cu.o
[1407/3492] Building CUDA object CMakeFiles/magma.dir/magmablas/slarf_batched_fused_reg_medium.cu.o
[1408/3492] Building CUDA object CMakeFiles/magma.dir/magmablas/zgeqr2_batched_fused_reg_tall.cu.o
[1409/3492] Building CUDA object CMakeFiles/magma.dir/magmablas/zlarf_batched_fused_reg_tall.cu.o
[1410/3492] Building CUDA object CMakeFiles/magma.dir/magmablas/slarf_batched_fused_reg_tall.cu.o
[1411/3492] Building CUDA object CMakeFiles/magma.dir/magmablas/cgeqr2_batched_fused_reg_tall.cu.o
[1412/3492] Building CUDA object CMakeFiles/magma.dir/magmablas/cgeqr2_batched_fused_reg_medium.cu.o
[1413/3492] Building CUDA object CMakeFiles/magma.dir/magmablas/clarf_batched_fused_reg_medium.cu.o
[1414/3492] Building CUDA object CMakeFiles/magma.dir/magmablas/clarf_batched_fused_reg_tall.cu.o
ninja: build stopped: subcommand failed.

Additional context

It says "No space left on device", but I have a couple of hundred GB free.

Not sure if this is relevant, but my RAM usage goes up to 50GB+ during the build (I have enough)

System metadata

  • system: "x86_64-linux"
  • host os: Linux 6.13.7, NixOS, 25.05 (Warbler), 25.05.20250319.2a725d4
  • multi-user?: yes
  • sandbox: yes
  • version: nix-env (Nix) 2.24.12
  • nixpkgs: /nix/store/aymywl6wcxvxfj804gck5gsli32pv7q1-source

Notify maintainers

@ConnorBaker @NixOS/cuda-maintainers


Note for maintainers: Please tag this issue in your pull request description. (i.e. Resolves #ISSUE.)

I assert that this issue is relevant for Nixpkgs

Is this issue important to you?

Add a 👍 reaction to issues you find important.

@gshpychka gshpychka added the 0.kind: build failure A package fails to build label Mar 25, 2025
@ConnorBaker
Contributor

The NixOS Hydra doesn’t build with CUDA support enabled, so that’s why it doesn’t exist there.

I’ll take a look at this later today to see if I can reproduce.

Any idea if your builds occur in a tmpfs-backed directory or on disk? I have a fuzzy recollection of the default Nix configuration using a build directory on a tmpfs mount (which is limited to half the amount of RAM)… but I’m on my phone so who knows 🤷‍♂️

Also, did you specify any cudaCapabilities in your Nixpkgs config? If not, I’d recommend doing so because it makes builds faster and closures smaller.

@ConnorBaker ConnorBaker added the 6.topic: cuda Parallel computing platform and API label Mar 25, 2025
@ConnorBaker ConnorBaker self-assigned this Mar 25, 2025
@gshpychka
Contributor Author

gshpychka commented Mar 25, 2025

> Any idea if your builds occur in a tmpfs-backed directory or on disk? I have a fuzzy recollection of the default Nix configuration using a build directory on a tmpfs mount (which is limited to half the amount of RAM)… but I’m on my phone so who knows

Ah yes, good catch - indeed they occur in tmpfs (not by default, but I've configured that myself). So now the 50GB+ RAM usage and "no space left on device" both make sense. My tmpfs was set to 16GB, though.

I will try increasing it to 32GB and building again, will report back.
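
For reference, a size increase of this kind could be expressed roughly as follows. This is only a sketch: it assumes the tmpfs in question is /tmp, managed through the NixOS `boot.tmp` options. The thread does not say how the tmpfs was actually set up, so the choice of options is an assumption and only the 32GB figure comes from the comment above.

```nix
{
  # Assumption: builds land in /tmp, which NixOS mounts as a tmpfs when this is enabled.
  boot.tmp.useTmpfs = true;

  # Raised from the earlier 16G to the 32G mentioned above.
  boot.tmp.tmpfsSize = "32G";
}
```

Since a tmpfs is backed by RAM (and swap), the size should stay below what the machine can spare while the compilers themselves are running.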

> Also, did you specify any cudaCapabilities in your Nixpkgs config? If not, I’d recommend doing so because it makes builds faster and closures smaller.

If you mean cudaSupport, then yes. Not sure what cudaCapabilities are, though - couldn't find that in the manual. Edit: found it, set it to ["8.9"] for my Ada Lovelace card, thanks.
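
A minimal sketch of what those two settings look like together, assuming they are set through the NixOS `nixpkgs.config` option (the thread only says "my nixpkgs config", so the exact location is an assumption):

```nix
{
  nixpkgs.config = {
    # Build CUDA-enabled variants of packages such as magma.
    cudaSupport = true;

    # Only generate device code for compute capability 8.9 (Ada Lovelace),
    # instead of every architecture in the default list.
    cudaCapabilities = [ "8.9" ];
  };
}
```

Restricting the capability list means fewer architectures are compiled per kernel, which is probably why it also reduces the temporary space needed during the build.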

Off topic question if I may: how can I avoid building from source in this case? I do have the following in my nix.settings:

extra-substituters = ["https://cuda-maintaners.cachix.org"];
extra-trusted-public-keys = ["cuda-maintainers.cachix.org-1:0dq3bujKpuEPMCX6U4WylrUDZ9JyUG0VpVZa7CNfq5E="];

Is it just because I'm using unstable and it's not in the cache yet?
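
One detail worth double-checking in the snippet above: the substituter URL spells the host `cuda-maintaners`, while the key name spells it `cuda-maintainers`. Assuming the key name carries the intended spelling, a mistyped URL would keep Nix from querying that cache at all. A sketch with the host spelled consistently, written against the NixOS `nix.settings` option:

```nix
{
  nix.settings = {
    # Host spelled to match the key name below (assumed to be the intended spelling).
    extra-substituters = [ "https://cuda-maintainers.cachix.org" ];
    extra-trusted-public-keys = [
      "cuda-maintainers.cachix.org-1:0dq3bujKpuEPMCX6U4WylrUDZ9JyUG0VpVZa7CNfq5E="
    ];
  };
}
```

Whether the cache actually holds a build for a given unstable revision and `cudaCapabilities` value is a separate question that this thread does not settle.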

In either case, it's always fun to see my CPU at 100% load on all cores, that is a rare sight :D

@gshpychka
Contributor Author

The issue was that my build was happening in tmpfs and I ran out of memory. Increasing my tmpfs size and setting cudaCapabilities solved the issue (not sure if both were required). Thanks!

@github-project-automation github-project-automation bot moved this from New to ✅ Done in CUDA Team Mar 25, 2025