Long build compilation time (>5 hours) for CUDA ARM build #126980
Labels
module: arm
Related to ARM architectures builds of PyTorch. Includes Apple M1
module: build
Build system issues
module: cuda
Related to torch.cuda, and CUDA support in general
triaged
This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
🐛 Describe the bug
Issue summary:
As part of the process of adding a CUDA ARM nightly wheel, we are seeing long build compilation times. The build needs ~5 hours to compile: https://github.com/pytorch/pytorch/actions/runs/9115630888/job/25074847028?pr=126174#step:16:1320
We need to figure out how to decrease the build compilation time.
Description:
Initially, we saw an OOM error while building flash_attn when adding https://github.com/pytorch/builder/pull/1775/files to nightly CI:
2024-04-26T02:20:01.5283695Z /pytorch/aten/src/ATen/../../../third_party/cutlass/include/cute/layout.hpp(988): catastrophic error: out of memory
Error link - https://github.com/pytorch/pytorch/actions/runs/8840652730/job/24276381274?pr=124112.
Relevant PR for above error - #124112.
We tried setting MAX_JOBS=4 (the default is 6): no OOM error, but the build takes >7 hours.
Link - https://github.com/pytorch/pytorch/actions/runs/8970425814/job/24633947792
Now we are going with MAX_JOBS=5.
Status:
Exploring CMake JOB_POOLS to limit build parallelism for flash-attn only, rather than lowering MAX_JOBS for the whole build.
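As a rough illustration of the job-pool idea, the fragment below caps concurrent compile jobs for one target while the rest of the build keeps its normal parallelism. It is a sketch under assumptions, not the change that was made in PyTorch: the pool name `flash_attn_pool`, the limit of 2, and the target name `flash_attention` are all hypothetical, and it presumes the flash-attn sources are built as their own target. CMake job pools only take effect with the Ninja generators.

```cmake
# Sketch only: job pools are honored by the Ninja generators and ignored by others.

# Define a pool that allows at most 2 concurrent jobs.
# The pool name and the limit are illustrative assumptions.
set_property(GLOBAL PROPERTY JOB_POOLS flash_attn_pool=2)

# "flash_attention" is a hypothetical target containing only the flash-attn sources.
# Only its compile steps are throttled; other targets still use the full -j / MAX_JOBS.
if(TARGET flash_attention)
  set_property(TARGET flash_attention PROPERTY JOB_POOL_COMPILE flash_attn_pool)
endif()
```

The point of this approach is that the memory-hungry flash-attn translation units are limited to a small number of parallel compiles, so the overall job count would not need to drop to 4 or 5 just to avoid the OOM.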
cc @malfet @seemethere @ptrblck @msaroufim @snadampal @atalman @Aidyn-A @nWEIdia
Versions
Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (aarch64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.35
Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-72-generic-aarch64-with-glibc2.35
Is CUDA available: N/A
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: NVIDIA A100-PCIE-40GB
Nvidia driver version: 525.105.17
cuDNN version: Probably one of the following:
/usr/lib/aarch64-linux-gnu/libcudnn.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_adv.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_cnn.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_graph.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_heuristic.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_ops.so.9.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A
CPU:
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 80
On-line CPU(s) list: 0-79
Vendor ID: ARM
Model name: Neoverse-N1
Model: 1
Thread(s) per core: 1
Core(s) per cluster: 80
Socket(s): -
Cluster(s): 1
Stepping: r3p1
Frequency boost: disabled
CPU max MHz: 2800.0000
CPU min MHz: 1000.0000
BogoMIPS: 50.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
L1d cache: 5 MiB (80 instances)
L1i cache: 5 MiB (80 instances)
L2 cache: 80 MiB (80 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-79
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Mitigation; CSV2, BHB
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] onnx==1.16.0
[pip3] optree==0.11.0
[pip3] pytorch-quantization==2.1.2
[pip3] pytorch-triton==3.0.0+989adb9a2
[pip3] torch==2.4.0a0+07cecf4168.nvinternal
[pip3] torch-tensorrt==2.4.0a0
[pip3] torchvision==0.19.0a0
[conda] Could not collect