aarch64 DEBUG build failure #126283

Closed
theComputeKid opened this issue May 15, 2024 · 4 comments
Assignees
malfet

Labels
module: arm (Related to ARM architecture builds of PyTorch; includes Apple M1), module: build (Build system issues), triaged (This issue has been looked at by a team member and triaged into an appropriate module)

Comments


theComputeKid commented May 15, 2024

🐛 Describe the bug

DEBUG builds on aarch64 have been failing since this patch by @malfet: #124023

Specifically, these lines were added:

    c10::ForcedUnroll<BLOCK_N>{}([&](auto i) {
      C[m * ldc + i] = reduce(c_val[i]) * vgetq_lane_f32(scale_val, i);
    });

The problematic call is:

vgetq_lane_f32(scale_val, i);
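
For context, vgetq_lane_f32 extracts a single lane from a 128-bit vector, and NEON requires the lane index to be a compile-time constant that lowers to an immediate. A minimal sketch of valid and invalid usage:

    #include <arm_neon.h>

    // Minimal sketch: the NEON lane accessors require the lane index to
    // be an integer constant expression the compiler can turn into an
    // immediate operand.
    float demo(float32x4_t v) {
      float ok = vgetq_lane_f32(v, 2);        // fine: literal lane index
      // int lane = 2;
      // float bad = vgetq_lane_f32(v, lane); // error: lane index must be
      //                                      // a constant immediate
      return ok;
    }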

The error thrown is:

                 from ./pytorch/c10/util/Float8_e5m2.h:17,
                 from ./pytorch/c10/core/ScalarType.h:8,
                 from ./pytorch/c10/core/Scalar.h:9,
                 from ./pytorch/build/aten/src/ATen/core/TensorBody.h:16,
                 from ./pytorch/aten/src/ATen/core/Tensor.h:3,
                 from ./pytorch/aten/src/ATen/native/cpu/int8mm_kernel.cpp:2,
                 from ./pytorch/build/aten/src/ATen/native/cpu/int8mm_kernel.cpp.DEFAULT.cpp:1:
In function 'float32_t vgetq_lane_f32(float32x4_t, int)',
    inlined from 'at::native::{anonymous}::tinygemm_kernel_<1, 1, c10::Half>(const c10::Half*, const int8_t*, const c10::Half*, c10::Half*, int, int, int, int)::<lambda(auto:25)> [with auto:25 = std::integral_constant<int, 0>]' at ./pytorch/aten/src/ATen/native/cpu/int8mm_kernel.cpp:255:57:
/usr/lib/gcc/aarch64-linux-gnu/11/include/arm_neon.h:3271:10: error: lane index must be a constant immediate
 3271 |   return __aarch64_vget_lane_any (__a, __b);
      |          ^~~~~~~~~~~~~~~~~~~~~~~
      

This can be seen when compiling pytorch with debug flags:

BLAS=OpenBLAS CXX_FLAGS="-mcpu=neoverse-v1 -march=armv8.6-a" USE_OPENMP=1 USE_LAPACK=1 USE_CUDA=0 USE_FBGEMM=0 USE_DISTRIBUTED=0 USE_MKLDNN=1 USE_MKLDNN_ACL=1 DEBUG=1 python3 setup.py bdist_wheel

It seems that in release mode the constant-expression requirement of __aarch64_vget_lane_any is satisfied, but not in debug mode. To demonstrate, here is a standalone reproducer of the issue:

#include <arm_neon.h>
#include <c10/util/Unroll.h>

#include <iostream>

int main() {
  float16_t scales[] = {0.1, 0.2, 0.3, 0.4};

  float C[4];

  // Widen four fp16 scales to a float32x4_t, as int8mm_kernel.cpp does.
  float32x4_t scale_val = vcvt_f32_f16(vld1_f16(reinterpret_cast<const float16_t *>(scales)));

  // The unrolled lambda receives the index i, yet the lane access is
  // only accepted when building with optimizations enabled.
  c10::ForcedUnroll<4>{}([&](auto i) {
      C[i] = vgetq_lane_f32(scale_val, i);
  });

  for(int i = 0; i < 4; i++) {
    std::cout << "C[" << i << "]: " << C[i] << std::endl;
  }
}
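
For reference, c10::ForcedUnroll (in c10/util/Unroll.h) invokes the lambda with a std::integral_constant argument, so the index value is carried in the parameter's type. A simplified sketch of the mechanism (hypothetical ForcedUnrollSketch, not the exact PyTorch implementation):

    #include <type_traits>

    // Simplified sketch: f is called n times, each time with a
    // std::integral_constant whose ::value encodes the iteration index,
    // so the index is a compile-time constant at every call site.
    template <int n>
    struct ForcedUnrollSketch {
      template <typename Func>
      void operator()(const Func& f) const {
        ForcedUnrollSketch<n - 1>{}(f);
        f(std::integral_constant<int, n - 1>{});
      }
    };

    template <>
    struct ForcedUnrollSketch<1> {
      template <typename Func>
      void operator()(const Func& f) const {
        f(std::integral_constant<int, 0>{});
      }
    };

So in the reproducer, i reaches the intrinsic through std::integral_constant's constexpr conversion to int, and whether that conversion folds to an immediate before the lane check apparently depends on the optimization level.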

It works when built in release as:

g++ -O2 -I./pytorch/build/lib.linux-aarch64-cpython-38/torch/include unroll_reproducer.cpp

Building in debug causes a compile error:

g++ -O0 -I./pytorch/build/lib.linux-aarch64-cpython-38/torch/include unroll_reproducer.cpp

In file included from unroll_reproducer.cpp:1:
In function 'float32_t vgetq_lane_f32(float32x4_t, int)',
    inlined from 'main()::<lambda(auto:1)> [with auto:1 = int]' at unroll_reproducer.cpp:14:28:
/usr/lib/gcc/aarch64-linux-gnu/10/include/arm_neon.h:3271:10: error: lane index must be a constant immediate
 3271 |   return __aarch64_vget_lane_any (__a, __b);

Versions

Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (aarch64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.5.0-1018-aws-aarch64-with-glibc2.35
Is CUDA available: N/A
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Vendor ID: ARM
Model: 1
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 1
Stepping: r1p1
BogoMIPS: 2100.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng
L1d cache: 4 MiB (64 instances)
L1i cache: 4 MiB (64 instances)
L2 cache: 64 MiB (64 instances)
L3 cache: 32 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-63
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Mitigation; CSV2, BHB
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] mypy==1.9.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.22.4
[pip3] onnx==1.14.1
[pip3] optree==0.11.0
[conda] Could not collect

cc @malfet @seemethere @snadampal

malfet self-assigned this May 15, 2024
malfet added the module: build, module: arm, and triaged labels May 15, 2024

malfet commented May 15, 2024

Grabbing this for myself to fix, though surprisingly enough it compiles even in debug mode with clang, which makes me wonder whether this is a compiler bug on the GCC side: https://godbolt.org/z/96v73Pann

theComputeKid (Author) commented

I have never previously come across a situation where whether something is considered constexpr or not depends on the optimization levels. Regardless of aggressive inlining, this should be a language issue, not an optimization issue, right?
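
One plausible explanation (an assumption, not verified in this thread): GCC's arm_neon.h defines the intrinsics as always_inline wrapper functions (the error trace above shows vgetq_lane_f32 being inlined), and the lane-bounds check fires on the wrapper's parameter after inlining. The std::integral_constant argument reaches the intrinsic through its constexpr operator int(), and at -O0 GCC does not fold that conversion into an immediate. Spelling the value through the type sidesteps the conversion; a hypothetical illustration:

    #include <arm_neon.h>
    #include <type_traits>

    float lane0(float32x4_t v) {
      auto i = std::integral_constant<int, 0>{};
      // return vgetq_lane_f32(v, i);  // fails at -O0: the implicit
      //                               // conversion is not folded
      // decltype(i)::value is an integer constant expression, which
      // should behave like a literal lane index (untested assumption):
      return vgetq_lane_f32(v, decltype(i)::value);
    }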


malfet commented May 15, 2024

@theComputeKid can you please check if the following works for you:

diff --git a/aten/src/ATen/native/cpu/int8mm_kernel.cpp b/aten/src/ATen/native/cpu/int8mm_kernel.cpp
index bd266030b25..0d4ca58d460 100644
--- a/aten/src/ATen/native/cpu/int8mm_kernel.cpp
+++ b/aten/src/ATen/native/cpu/int8mm_kernel.cpp
@@ -249,11 +249,16 @@ inline void tinygemm_kernel_(
         c_val[i] = vfmaq_f32(c_val[i], a_val.val[0], b_val_low);
       });
     }
-
+#if __OPTIMIZE__
     float32x4_t scale_val = load_as_float32x4(scales);
     c10::ForcedUnroll<BLOCK_N>{}([&](auto i) {
       C[m * ldc + i] = reduce(c_val[i]) * vgetq_lane_f32(scale_val, i);
     });
+#else
+    c10::ForcedUnroll<BLOCK_N>{}([&](auto i) {
+      C[m * ldc + i] = reduce(c_val[i]) * float(scales[i]);
+    });
+#endif
   }
 }
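
The guard relies on __OPTIMIZE__, which (as the merged commit message below notes) both GCC and clang define for anything above -O0. A quick standalone check:

    #include <cstdio>

    int main() {
    #ifdef __OPTIMIZE__
      std::printf("__OPTIMIZE__ defined: compiled with optimizations\n");
    #else
      std::printf("__OPTIMIZE__ not defined: compiled with -O0\n");
    #endif
      return 0;
    }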

malfet added a commit that referenced this issue May 15, 2024
By working around GCC's quirks in instantiating templates that require immediate values.
Provide an alternative implementation for scaling the output if the code is compiled without any optimizations.

Fixes #126283
@theComputeKid
Copy link
Author

@malfet It compiles with that patch, thanks.

pytorchmergebot pushed a commit that referenced this issue May 16, 2024
By working around GCC's quirks in instantiating templates that require immediate values.
Provide an alternative implementation for scaling the output if compiled without any optimizations (both GCC and clang define `__OPTIMIZE__` if invoked with anything but `-O0`).

Fixes #126283

Pull Request resolved: #126290
Approved by: https://github.com/atalman, https://github.com/seemethere
pytorchmergebot pushed a commit that referenced this issue May 17, 2024
By working around GCC's quirks in instantiating templates that require immediate values.
Provide an alternative implementation for scaling the output if compiled without any optimizations (both GCC and clang define `__OPTIMIZE__` if invoked with anything but `-O0`).

Test plan (after the change was reverted): ssh into an aarch64 runner and rebuild the given file with `-O0`.

Fixes #126283

Pull Request resolved: #126290
Approved by: https://github.com/atalman, https://github.com/seemethere