
Long build compilation time (>5 hours) for CUDA ARM build #126980

Open
tinglvv opened this issue May 23, 2024 · 1 comment
Labels
module: arm (Related to ARM architecture builds of PyTorch; includes Apple M1)
module: build (Build system issues)
module: cuda (Related to torch.cuda, and CUDA support in general)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@tinglvv (Collaborator)

tinglvv commented May 23, 2024

🐛 Describe the bug

Issue summary:

As part of the process of adding a CUDA ARM nightly wheel, we are seeing long build compilation times: the build needs ~5 hours to compile. See https://github.com/pytorch/pytorch/actions/runs/9115630888/job/25074847028?pr=126174#step:16:1320.

We need to figure out how to decrease the build compilation time.

Description:

Initially, we saw an OOM error while building flash_attn when adding https://github.com/pytorch/builder/pull/1775/files to the nightly CI.

2024-04-26T02:20:01.5211732Z [6579/6896] Building CUDA object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim192_bf16_sm80.cu.o
2024-04-26T02:20:01.5215252Z FAILED: caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim192_bf16_sm80.cu.o
2024-04-26T02:20:01.5250703Z /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_CUSPARSELT -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -DTORCH_ASSERT_NO_OPERATORS -I/pytorch/build/aten/src -I/pytorch/aten/src -I/pytorch/build -I/pytorch -I/pytorch/third_party/onnx -I/pytorch/build/third_party/onnx -I/pytorch/third_party/foxi -I/pytorch/build/third_party/foxi -I/pytorch/aten/src/THC -I/pytorch/aten/src/ATen/cuda -I/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/pytorch/build/caffe2/aten/src -I/pytorch/aten/src/ATen/.. -I/pytorch/build/nccl/include -I/pytorch/c10/cuda/../.. -I/pytorch/c10/.. -I/pytorch/third_party/tensorpipe -I/pytorch/build/third_party/tensorpipe -I/pytorch/third_party/tensorpipe/third_party/libnop/include -I/pytorch/torch/csrc/api -I/pytorch/torch/csrc/api/include -isystem /pytorch/build/third_party/gloo -isystem /pytorch/cmake/../third_party/gloo -isystem /pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /pytorch/third_party/protobuf/src -isystem /pytorch/third_party/gemmlowp -isystem /pytorch/third_party/neon2sse -isystem /pytorch/third_party/XNNPACK/include -isystem /pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /pytorch/third_party/ideep/include -isystem /pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_50,code=sm_50 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_89,code=sm_89 -gencode arch=compute_90,code=sm_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda  -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -D__NEON__ -Xcompiler=-Wall,-Wextra,-Wdeprecated,-Wno-unused-parameter,-Wno-unused-function,-Wno-missing-field-initializers,-Wno-unknown-pragmas,-Wno-type-limits,-Wno-array-bounds,-Wno-unknown-pragmas,-Wno-strict-overflow,-Wno-strict-aliasing,-Wno-maybe-uninitialized -Wno-deprecated-copy -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim192_bf16_sm80.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim192_bf16_sm80.cu.o.d -x cu -c /pytorch/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim192_bf16_sm80.cu -o 
caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim192_bf16_sm80.cu.o

2024-04-26T02:20:01.5283695Z /pytorch/aten/src/ATen/../../../third_party/cutlass/include/cute/layout.hpp(988): catastrophic error: out of memory
Error link: https://github.com/pytorch/pytorch/actions/runs/8840652730/job/24276381274?pr=124112
Relevant PR for the above error: #124112.

Setting MAX_JOBS=4 (the default is 6) avoided the OOM error, but the build then took >7 hours.
Link: https://github.com/pytorch/pytorch/actions/runs/8970425814/job/24633947792

For now we are going with MAX_JOBS=5 as a compromise between memory usage and build time.

Status:

Exploring the option of using CMake JOB_POOLS to limit build parallelism for the flash-attn sources only; a sketch of the idea follows.
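For reference, this is a minimal sketch of the JOB_POOLS idea, assuming the Ninja generator (which the PyTorch build uses when available); the pool name heavy_pool, its size of 2, and the flash_attn_kernels target are illustrative assumptions, not what PyTorch's CMake actually defines:

```cmake
# Declare a named job pool allowing at most 2 concurrent compile jobs.
# Ninja enforces this cap independently of the overall -j / MAX_JOBS value.
set_property(GLOBAL PROPERTY JOB_POOLS heavy_pool=2)

# Route compilation of a memory-hungry target through that pool.
# "flash_attn_kernels" is a hypothetical target name; in PyTorch the
# flash-attn .cu files are currently compiled as part of torch_cuda.
set_property(TARGET flash_attn_kernels PROPERTY JOB_POOL_COMPILE heavy_pool)
```

This would let the rest of the build keep a higher MAX_JOBS while only the memory-heavy flash-attn compile jobs are throttled.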

cc @malfet @seemethere @ptrblck @msaroufim @snadampal @atalman @Aidyn-A @nWEIdia

Versions

Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (aarch64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-72-generic-aarch64-with-glibc2.35
Is CUDA available: N/A
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: NVIDIA A100-PCIE-40GB
Nvidia driver version: 525.105.17
cuDNN version: Probably one of the following:
/usr/lib/aarch64-linux-gnu/libcudnn.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_adv.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_cnn.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_graph.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_heuristic.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_ops.so.9.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 80
On-line CPU(s) list: 0-79
Vendor ID: ARM
Model name: Neoverse-N1
Model: 1
Thread(s) per core: 1
Core(s) per cluster: 80
Socket(s): -
Cluster(s): 1
Stepping: r3p1
Frequency boost: disabled
CPU max MHz: 2800.0000
CPU min MHz: 1000.0000
BogoMIPS: 50.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
L1d cache: 5 MiB (80 instances)
L1i cache: 5 MiB (80 instances)
L2 cache: 80 MiB (80 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-79
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Mitigation; CSV2, BHB
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] onnx==1.16.0
[pip3] optree==0.11.0
[pip3] pytorch-quantization==2.1.2
[pip3] pytorch-triton==3.0.0+989adb9a2
[pip3] torch==2.4.0a0+07cecf4168.nvinternal
[pip3] torch-tensorrt==2.4.0a0
[pip3] torchvision==0.19.0a0
[conda] Could not collect

@bdhirsh added the module: build, module: cuda, triaged, and module: arm labels on May 23, 2024
atalman added a commit to pytorch/test-infra that referenced this issue May 24, 2024
Increasing the worker size to try to improve build time for arm64 GPU machines. Here is the issue: pytorch/pytorch#126980

AWS Cost: 0.65c/hour
In comparison to our Linux c5.4xlarge worker, AWS Cost: 0.68c/hour

---------

Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
pytorchmergebot pushed a commit that referenced this issue May 27, 2024
Rebasing #124112; there were too many conflicting files, so starting a new PR.

Test pytorch/builder#1775 (merged) for ARM wheel addition
Test pytorch/builder#1828 (merged) for setting MAX_JOBS

Current issue to follow up:
#126980

Co-authored-by: Aidyn-A <aidyn.b.aitzhan@gmail.com>
Pull Request resolved: #126174
Approved by: https://github.com/nWEIdia, https://github.com/atalman
titaiwangms pushed a commit to titaiwangms/pytorch that referenced this issue May 28, 2024 (same commit message as above)
@tinglvv (Collaborator, Author)

tinglvv commented Jul 3, 2024

#129402 was able to locally modify the CMake flags for the rowwise operations alone. It is a good reference for scoping CMake flag changes to a single submodule; a sketch of the mechanism is below.
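For reference, the standard CMake mechanism for scoping compile flags to individual sources (rather than a whole target) is set_source_files_properties. This is a minimal sketch under stated assumptions: the ROWWISE_SRCS list, the file path, and the --use_fast_math option are illustrative, not the exact change made in #129402:

```cmake
# Hypothetical list of sources that need their own compile options.
set(ROWWISE_SRCS
    aten/src/ATen/native/cuda/RowwiseScaledMM.cu)

# Apply extra options to just these files; all other sources in the
# enclosing target keep the default flags. Requires CMake >= 3.11 for
# the COMPILE_OPTIONS source-file property.
set_source_files_properties(${ROWWISE_SRCS}
    PROPERTIES COMPILE_OPTIONS "--use_fast_math")
```

The same pattern could be used to tune memory pressure per file, for example by adding nvcc's --threads or -maxrregcount options only where they are needed.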
