aarch64 DEBUG build failure #126283
Comments
Grabbing for myself to fix it, though surprisingly enough it compiles even in debug mode with clang, which makes me wonder whether or not this is a compiler bug on the GCC side: https://godbolt.org/z/96v73Pann
I have never previously come across a situation where whether something is considered constexpr or not depends on the optimization levels. Regardless of aggressive inlining, this should be a language issue, not an optimization issue, right?
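On the language side of this question: a function parameter is never a constant expression inside the callee, whatever the optimization level, so an index that must be an immediate has to travel in a type or a template parameter. The sketch below illustrates that in portable C++14. The names `unroll`, `get_lane`, and `scale_lanes` are hypothetical and exist only for illustration; they are not the c10 or `arm_neon.h` implementations.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for an intrinsic whose lane index must be an
// immediate. The index is a template parameter, so it is a constant
// expression at every optimization level.
template <std::size_t Lane>
float get_lane(const std::array<float, 4>& v) {
  static_assert(Lane < 4, "lane index out of range");
  return v[Lane];
}

// Hypothetical unroller (not c10::ForcedUnroll): calls f once per index,
// passing std::integral_constant<std::size_t, 0> ... <N - 1> in order.
template <typename Func, std::size_t... Is>
void unroll_impl(Func& f, std::index_sequence<Is...>) {
  // Braced initializer lists evaluate their elements left to right.
  int expand[] = {0, (f(std::integral_constant<std::size_t, Is>{}), 0)...};
  (void)expand;
}

template <std::size_t N, typename Func>
void unroll(Func f) {
  unroll_impl(f, std::make_index_sequence<N>{});
}

// Multiplies each lane of acc by the matching lane of scales.
std::array<float, 4> scale_lanes(const std::array<float, 4>& acc,
                                 const std::array<float, 4>& scales) {
  std::array<float, 4> out{};
  unroll<4>([&](auto i) {
    // The index is recovered from the type of i, not from its runtime value,
    // so it can be used as a template argument even without inlining.
    constexpr std::size_t lane = decltype(i)::value;
    out[lane] = acc[lane] * get_lane<lane>(scales);
  });
  return out;
}
```

With this shape, the constant-expression requirement is met by the type system alone, so it does not depend on whether the compiler inlines or folds anything.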
@theComputeKid can you please check whether the following works for you?
diff --git a/aten/src/ATen/native/cpu/int8mm_kernel.cpp b/aten/src/ATen/native/cpu/int8mm_kernel.cpp
index bd266030b25..0d4ca58d460 100644
--- a/aten/src/ATen/native/cpu/int8mm_kernel.cpp
+++ b/aten/src/ATen/native/cpu/int8mm_kernel.cpp
@@ -249,11 +249,16 @@ inline void tinygemm_kernel_(
c_val[i] = vfmaq_f32(c_val[i], a_val.val[0], b_val_low);
});
}
-
+#if __OPTIMIZE__
float32x4_t scale_val = load_as_float32x4(scales);
c10::ForcedUnroll<BLOCK_N>{}([&](auto i) {
C[m * ldc + i] = reduce(c_val[i]) * vgetq_lane_f32(scale_val, i);
});
+#else
+ c10::ForcedUnroll<BLOCK_N>{}([&](auto i) {
+ C[m * ldc + i] = reduce(c_val[i]) * float(scales[i]);
+ });
+#endif
}
}
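The pattern in the patch above, selecting a different code path depending on whether `__OPTIMIZE__` is defined, can be sketched in portable C++ as follows. This is a minimal illustration under stated assumptions: `scale_outputs` and `built_with_optimizations` are hypothetical names, and plain arrays stand in for the NEON vector types. Both branches compute the same result, so the choice only affects how the scale values are read.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper mirroring the shape of the patch: both branches
// compute out[i] = sums[i] * scales[i]; only the way the scale is read
// differs between them.
template <std::size_t N>
void scale_outputs(const float* sums, const float* scales, float* out) {
#if defined(__OPTIMIZE__)
  // Optimized builds: preload the scales into a local block first, as a
  // stand-in for loading them into a vector register.
  float scale_block[N];
  for (std::size_t i = 0; i < N; ++i) {
    scale_block[i] = scales[i];
  }
  for (std::size_t i = 0; i < N; ++i) {
    out[i] = sums[i] * scale_block[i];
  }
#else
  // Unoptimized builds (-O0): read each scale directly from memory, so no
  // operation needs an index that is known at compile time.
  for (std::size_t i = 0; i < N; ++i) {
    out[i] = sums[i] * scales[i];
  }
#endif
}

// Reports which branch the preprocessor selected for this translation unit.
constexpr bool built_with_optimizations() {
#if defined(__OPTIMIZE__)
  return true;
#else
  return false;
#endif
}
```

The sketch uses `#if defined(__OPTIMIZE__)` rather than `#if __OPTIMIZE__`; the two behave the same here because an undefined identifier evaluates to 0 in a preprocessor condition. Compilers that do not define the macro at all simply take the fallback branch.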
By working around GCC's quirks in instantiating templates that require immediate values. Provide an alternative implementation for scaling the output if the code is compiled without any optimizations. Fixes #126283
@malfet It compiles with that patch, thanks.
By working around GCC's quirks in instantiating templates that require immediate values. Provide an alternative implementation for scaling the output if compiled without any optimizations (both GCC and clang define `__OPTIMIZE__` if invoked with anything but `-O0`). Fixes #126283 Pull Request resolved: #126290 Approved by: https://github.com/atalman, https://github.com/seemethere
By working around GCC's quirks in instantiating templates that require immediate values. Provide an alternative implementation for scaling the output if compiled without any optimizations (both GCC and clang define `__OPTIMIZE__` if invoked with anything but `-O0`). Test plan (after the change was reverted): ssh into an aarch64 runner and rebuild the given file with `-O0`. Fixes #126283 Pull Request resolved: #126290 Approved by: https://github.com/atalman, https://github.com/seemethere
🐛 Describe the bug
DEBUG builds on aarch64 have been failing since this patch by @malfet : #124023
Specifically this line that was added:
Specifically:
The error thrown is:
This can be seen when compiling pytorch with debug flags:
It seems that in release mode, the constant expression requirement of `__aarch64_vget_lane_any` is satisfied, but not so in debug. To prove this point, here is a standalone reproducer of the issue:
It works when built in release as:
Building in debug causes a compile error:
Versions
Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (aarch64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.22.1
Libc version: glibc-2.35
Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.5.0-1018-aws-aarch64-with-glibc2.35
Is CUDA available: N/A
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A
CPU:
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Vendor ID: ARM
Model: 1
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 1
Stepping: r1p1
BogoMIPS: 2100.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng
L1d cache: 4 MiB (64 instances)
L1i cache: 4 MiB (64 instances)
L2 cache: 64 MiB (64 instances)
L3 cache: 32 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-63
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Mitigation; CSV2, BHB
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] mypy==1.9.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.22.4
[pip3] onnx==1.14.1
[pip3] optree==0.11.0
[conda] Could not collect
cc @malfet @seemethere @snadampal