
Long build compilation time (>5 hours) for CUDA ARM build #126980

Open
tinglvv opened this issue May 23, 2024 · 1 comment
Labels
module: arm (Related to ARM architecture builds of PyTorch; includes Apple M1)
module: build (Build system issues)
module: cuda (Related to torch.cuda, and CUDA support in general)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@tinglvv (Collaborator)

tinglvv commented May 23, 2024

🐛 Describe the bug

Issue summary:

As part of the process of adding a CUDA ARM nightly wheel, we are seeing long build compilation times: the build needs ~5 hours to compile. See https://github.com/pytorch/pytorch/actions/runs/9115630888/job/25074847028?pr=126174#step:16:1320.

We need to figure out how to decrease the build compilation time.

Description:

Initially, we saw an OOM error while building flash_attn when adding https://github.com/pytorch/builder/pull/1775/files to the nightly CI.

2024-04-26T02:20:01.5211732Z [6579/6896] Building CUDA object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim192_bf16_sm80.cu.o
2024-04-26T02:20:01.5215252Z FAILED: caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim192_bf16_sm80.cu.o
2024-04-26T02:20:01.5250703Z /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_CUSPARSELT -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -DTORCH_ASSERT_NO_OPERATORS -I/pytorch/build/aten/src -I/pytorch/aten/src -I/pytorch/build -I/pytorch -I/pytorch/third_party/onnx -I/pytorch/build/third_party/onnx -I/pytorch/third_party/foxi -I/pytorch/build/third_party/foxi -I/pytorch/aten/src/THC -I/pytorch/aten/src/ATen/cuda -I/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/pytorch/build/caffe2/aten/src -I/pytorch/aten/src/ATen/.. -I/pytorch/build/nccl/include -I/pytorch/c10/cuda/../.. -I/pytorch/c10/.. -I/pytorch/third_party/tensorpipe -I/pytorch/build/third_party/tensorpipe -I/pytorch/third_party/tensorpipe/third_party/libnop/include -I/pytorch/torch/csrc/api -I/pytorch/torch/csrc/api/include -isystem /pytorch/build/third_party/gloo -isystem /pytorch/cmake/../third_party/gloo -isystem /pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /pytorch/third_party/protobuf/src -isystem /pytorch/third_party/gemmlowp -isystem /pytorch/third_party/neon2sse -isystem /pytorch/third_party/XNNPACK/include -isystem /pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /pytorch/third_party/ideep/include -isystem /pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_50,code=sm_50 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_89,code=sm_89 -gencode arch=compute_90,code=sm_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda  -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -D__NEON__ -Xcompiler=-Wall,-Wextra,-Wdeprecated,-Wno-unused-parameter,-Wno-unused-function,-Wno-missing-field-initializers,-Wno-unknown-pragmas,-Wno-type-limits,-Wno-array-bounds,-Wno-unknown-pragmas,-Wno-strict-overflow,-Wno-strict-aliasing,-Wno-maybe-uninitialized -Wno-deprecated-copy -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim192_bf16_sm80.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim192_bf16_sm80.cu.o.d -x cu -c /pytorch/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim192_bf16_sm80.cu -o 
caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim192_bf16_sm80.cu.o

2024-04-26T02:20:01.5283695Z /pytorch/aten/src/ATen/../../../third_party/cutlass/include/cute/layout.hpp(988): catastrophic error: out of memory
Error link: https://github.com/pytorch/pytorch/actions/runs/8840652730/job/24276381274?pr=124112
Relevant PR for the above error: #124112.

Setting MAX_JOBS=4 (the default is 6) avoided the OOM error, but the build then took >7 hours.
Link: https://github.com/pytorch/pytorch/actions/runs/8970425814/job/24633947792

For now we are going with MAX_JOBS=5 as a compromise between memory usage and build time.

Status:

Exploring the option of using CMake JOB_POOLS to limit build parallelism for the flash-attn sources only; a sketch of the idea follows.
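For reference, this is a minimal sketch of the JOB_POOLS idea, assuming the Ninja generator (which the PyTorch build uses when available); the pool name heavy_pool, its size of 2, and the flash_attn_kernels target are illustrative assumptions, not what PyTorch's CMake actually defines:

```cmake
# Declare a named job pool allowing at most 2 concurrent compile jobs.
# Ninja enforces this cap independently of the overall -j / MAX_JOBS value.
set_property(GLOBAL PROPERTY JOB_POOLS heavy_pool=2)

# Route compilation of a memory-hungry target through that pool.
# "flash_attn_kernels" is a hypothetical target name; in PyTorch the
# flash-attn .cu files are currently compiled as part of torch_cuda.
set_property(TARGET flash_attn_kernels PROPERTY JOB_POOL_COMPILE heavy_pool)
```

This would let the rest of the build keep a higher MAX_JOBS while only the memory-heavy flash-attn compile jobs are throttled.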

cc @malfet @seemethere @ptrblck @msaroufim @snadampal @atalman @Aidyn-A @nWEIdia

Versions

Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (aarch64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-72-generic-aarch64-with-glibc2.35
Is CUDA available: N/A
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: NVIDIA A100-PCIE-40GB
Nvidia driver version: 525.105.17
cuDNN version: Probably one of the following:
/usr/lib/aarch64-linux-gnu/libcudnn.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_adv.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_cnn.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_graph.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_heuristic.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_ops.so.9.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 80
On-line CPU(s) list: 0-79
Vendor ID: ARM
Model name: Neoverse-N1
Model: 1
Thread(s) per core: 1
Core(s) per cluster: 80
Socket(s): -
Cluster(s): 1
Stepping: r3p1
Frequency boost: disabled
CPU max MHz: 2800.0000
CPU min MHz: 1000.0000
BogoMIPS: 50.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
L1d cache: 5 MiB (80 instances)
L1i cache: 5 MiB (80 instances)
L2 cache: 80 MiB (80 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-79
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Mitigation; CSV2, BHB
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] onnx==1.16.0
[pip3] optree==0.11.0
[pip3] pytorch-quantization==2.1.2
[pip3] pytorch-triton==3.0.0+989adb9a2
[pip3] torch==2.4.0a0+07cecf4168.nvinternal
[pip3] torch-tensorrt==2.4.0a0
[pip3] torchvision==0.19.0a0
[conda] Could not collect

@bdhirsh added the module: build, module: cuda, triaged, and module: arm labels on May 23, 2024
atalman added a commit to pytorch/test-infra that referenced this issue May 24, 2024
Increasing the worker size to try to improve build time for arm64 GPU machines. Here is the issue: pytorch/pytorch#126980

AWS Cost: 0.65c/hour
In comparison to our Linux c5.4xlarge worker, AWS Cost: 0.68c/hour

---------

Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
pytorchmergebot pushed a commit that referenced this issue May 27, 2024
Rebasing #124112; there were too many conflicting files, so starting a new PR.

Test pytorch/builder#1775 (merged) for ARM wheel addition
Test pytorch/builder#1828 (merged) for setting MAX_JOBS

Current issue to follow up:
#126980

Co-authored-by: Aidyn-A <aidyn.b.aitzhan@gmail.com>
Pull Request resolved: #126174
Approved by: https://github.com/nWEIdia, https://github.com/atalman
titaiwangms pushed a commit to titaiwangms/pytorch that referenced this issue May 28, 2024 (same commit message as above)
@tinglvv (Collaborator, Author)

tinglvv commented Jul 3, 2024

#129402 was able to locally modify the CMake flags for the rowwise operations alone. It is a good reference for scoping CMake flag changes to a single submodule; a sketch of the mechanism is below.
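For reference, the standard CMake mechanism for scoping compile flags to individual sources (rather than a whole target) is set_source_files_properties. This is a minimal sketch under stated assumptions: the ROWWISE_SRCS list, the file path, and the --use_fast_math option are illustrative, not the exact change made in #129402:

```cmake
# Hypothetical list of sources that need their own compile options.
set(ROWWISE_SRCS
    aten/src/ATen/native/cuda/RowwiseScaledMM.cu)

# Apply extra options to just these files; all other sources in the
# enclosing target keep the default flags. Requires CMake >= 3.11 for
# the COMPILE_OPTIONS source-file property.
set_source_files_properties(${ROWWISE_SRCS}
    PROPERTIES COMPILE_OPTIONS "--use_fast_math")
```

The same pattern could be used to tune memory pressure per file, for example by adding nvcc's --threads or -maxrregcount options only where they are needed.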
