aarch64 DEBUG build failure #126283

Closed
theComputeKid opened this issue May 15, 2024 · 4 comments
Assignees
malfet

Labels
module: arm (Related to ARM architecture builds of PyTorch; includes Apple M1), module: build (Build system issues), triaged (This issue has been looked at by a team member and triaged into an appropriate module)

Comments


theComputeKid commented May 15, 2024

🐛 Describe the bug

DEBUG builds on aarch64 have been failing since this patch by @malfet: #124023

Specifically, these lines were added:

    c10::ForcedUnroll<BLOCK_N>{}([&](auto i) {
      C[m * ldc + i] = reduce(c_val[i]) * vgetq_lane_f32(scale_val, i);
    });

The problematic call is:

vgetq_lane_f32(scale_val, i);
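
For context, vgetq_lane_f32 extracts a single lane from a 128-bit vector, and NEON requires the lane index to be a compile-time constant that lowers to an immediate. A minimal sketch of valid and invalid usage:

    #include <arm_neon.h>

    // Minimal sketch: the NEON lane accessors require the lane index to
    // be an integer constant expression the compiler can turn into an
    // immediate operand.
    float demo(float32x4_t v) {
      float ok = vgetq_lane_f32(v, 2);        // fine: literal lane index
      // int lane = 2;
      // float bad = vgetq_lane_f32(v, lane); // error: lane index must be
      //                                      // a constant immediate
      return ok;
    }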

The error thrown is:

                 from ./pytorch/c10/util/Float8_e5m2.h:17,
                 from ./pytorch/c10/core/ScalarType.h:8,
                 from ./pytorch/c10/core/Scalar.h:9,
                 from ./pytorch/build/aten/src/ATen/core/TensorBody.h:16,
                 from ./pytorch/aten/src/ATen/core/Tensor.h:3,
                 from ./pytorch/aten/src/ATen/native/cpu/int8mm_kernel.cpp:2,
                 from ./pytorch/build/aten/src/ATen/native/cpu/int8mm_kernel.cpp.DEFAULT.cpp:1:
In function 'float32_t vgetq_lane_f32(float32x4_t, int)',
    inlined from 'at::native::{anonymous}::tinygemm_kernel_<1, 1, c10::Half>(const c10::Half*, const int8_t*, const c10::Half*, c10::Half*, int, int, int, int)::<lambda(auto:25)> [with auto:25 = std::integral_constant<int, 0>]' at ./pytorch/aten/src/ATen/native/cpu/int8mm_kernel.cpp:255:57:
/usr/lib/gcc/aarch64-linux-gnu/11/include/arm_neon.h:3271:10: error: lane index must be a constant immediate
 3271 |   return __aarch64_vget_lane_any (__a, __b);
      |          ^~~~~~~~~~~~~~~~~~~~~~~
      

This can be seen when compiling pytorch with debug flags:

BLAS=OpenBLAS CXX_FLAGS="-mcpu=neoverse-v1 -march=armv8.6-a" USE_OPENMP=1 USE_LAPACK=1 USE_CUDA=0 USE_FBGEMM=0 USE_DISTRIBUTED=0 USE_MKLDNN=1 USE_MKLDNN_ACL=1 DEBUG=1 python3 setup.py bdist_wheel

It seems that in release mode the constant-expression requirement of __aarch64_vget_lane_any is satisfied, but not in debug mode. To demonstrate, here is a standalone reproducer of the issue:

#include <arm_neon.h>
#include <c10/util/Unroll.h>

#include <iostream>

int main() {
  float16_t scales[] = {0.1, 0.2, 0.3, 0.4};

  float C[4];

  // Widen four fp16 scales to a float32x4_t, as int8mm_kernel.cpp does.
  float32x4_t scale_val = vcvt_f32_f16(vld1_f16(reinterpret_cast<const float16_t *>(scales)));

  // The unrolled lambda receives the index i, yet the lane access is
  // only accepted when building with optimizations enabled.
  c10::ForcedUnroll<4>{}([&](auto i) {
      C[i] = vgetq_lane_f32(scale_val, i);
  });

  for(int i = 0; i < 4; i++) {
    std::cout << "C[" << i << "]: " << C[i] << std::endl;
  }
}
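
For reference, c10::ForcedUnroll (in c10/util/Unroll.h) invokes the lambda with a std::integral_constant argument, so the index value is carried in the parameter's type. A simplified sketch of the mechanism (hypothetical ForcedUnrollSketch, not the exact PyTorch implementation):

    #include <type_traits>

    // Simplified sketch: f is called n times, each time with a
    // std::integral_constant whose ::value encodes the iteration index,
    // so the index is a compile-time constant at every call site.
    template <int n>
    struct ForcedUnrollSketch {
      template <typename Func>
      void operator()(const Func& f) const {
        ForcedUnrollSketch<n - 1>{}(f);
        f(std::integral_constant<int, n - 1>{});
      }
    };

    template <>
    struct ForcedUnrollSketch<1> {
      template <typename Func>
      void operator()(const Func& f) const {
        f(std::integral_constant<int, 0>{});
      }
    };

So in the reproducer, i reaches the intrinsic through std::integral_constant's constexpr conversion to int, and whether that conversion folds to an immediate before the lane check apparently depends on the optimization level.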

It works when built in release as:

g++ -O2 -I./pytorch/build/lib.linux-aarch64-cpython-38/torch/include unroll_reproducer.cpp

Building in debug causes a compile error:

g++ -O0 -I./pytorch/build/lib.linux-aarch64-cpython-38/torch/include unroll_reproducer.cpp

In file included from unroll_reproducer.cpp:1:
In function 'float32_t vgetq_lane_f32(float32x4_t, int)',
    inlined from 'main()::<lambda(auto:1)> [with auto:1 = int]' at unroll_reproducer.cpp:14:28:
/usr/lib/gcc/aarch64-linux-gnu/10/include/arm_neon.h:3271:10: error: lane index must be a constant immediate
 3271 |   return __aarch64_vget_lane_any (__a, __b);

Versions

Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (aarch64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.5.0-1018-aws-aarch64-with-glibc2.35
Is CUDA available: N/A
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Vendor ID: ARM
Model: 1
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 1
Stepping: r1p1
BogoMIPS: 2100.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng
L1d cache: 4 MiB (64 instances)
L1i cache: 4 MiB (64 instances)
L2 cache: 64 MiB (64 instances)
L3 cache: 32 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-63
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Mitigation; CSV2, BHB
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] mypy==1.9.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.22.4
[pip3] onnx==1.14.1
[pip3] optree==0.11.0
[conda] Could not collect

cc @malfet @seemethere @snadampal

malfet self-assigned this May 15, 2024
malfet added the module: build, module: arm, and triaged labels May 15, 2024

malfet commented May 15, 2024

Grabbing this for myself to fix, though surprisingly enough it compiles even in debug mode with clang, which makes me wonder whether this is a compiler bug on the GCC side: https://godbolt.org/z/96v73Pann

theComputeKid (Author) commented

I have never previously come across a situation where whether something is considered constexpr or not depends on the optimization levels. Regardless of aggressive inlining, this should be a language issue, not an optimization issue, right?
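
One plausible explanation (an assumption, not verified in this thread): GCC's arm_neon.h defines the intrinsics as always_inline wrapper functions (the error trace above shows vgetq_lane_f32 being inlined), and the lane-bounds check fires on the wrapper's parameter after inlining. The std::integral_constant argument reaches the intrinsic through its constexpr operator int(), and at -O0 GCC does not fold that conversion into an immediate. Spelling the value through the type sidesteps the conversion; a hypothetical illustration:

    #include <arm_neon.h>
    #include <type_traits>

    float lane0(float32x4_t v) {
      auto i = std::integral_constant<int, 0>{};
      // return vgetq_lane_f32(v, i);  // fails at -O0: the implicit
      //                               // conversion is not folded
      // decltype(i)::value is an integer constant expression, which
      // should behave like a literal lane index (untested assumption):
      return vgetq_lane_f32(v, decltype(i)::value);
    }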


malfet commented May 15, 2024

@theComputeKid can you please check if the following works for you:

diff --git a/aten/src/ATen/native/cpu/int8mm_kernel.cpp b/aten/src/ATen/native/cpu/int8mm_kernel.cpp
index bd266030b25..0d4ca58d460 100644
--- a/aten/src/ATen/native/cpu/int8mm_kernel.cpp
+++ b/aten/src/ATen/native/cpu/int8mm_kernel.cpp
@@ -249,11 +249,16 @@ inline void tinygemm_kernel_(
         c_val[i] = vfmaq_f32(c_val[i], a_val.val[0], b_val_low);
       });
     }
-
+#if __OPTIMIZE__
     float32x4_t scale_val = load_as_float32x4(scales);
     c10::ForcedUnroll<BLOCK_N>{}([&](auto i) {
       C[m * ldc + i] = reduce(c_val[i]) * vgetq_lane_f32(scale_val, i);
     });
+#else
+    c10::ForcedUnroll<BLOCK_N>{}([&](auto i) {
+      C[m * ldc + i] = reduce(c_val[i]) * float(scales[i]);
+    });
+#endif
   }
 }
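
The guard relies on __OPTIMIZE__, which (as the merged commit message below notes) both GCC and clang define for anything above -O0. A quick standalone check:

    #include <cstdio>

    int main() {
    #ifdef __OPTIMIZE__
      std::printf("__OPTIMIZE__ defined: compiled with optimizations\n");
    #else
      std::printf("__OPTIMIZE__ not defined: compiled with -O0\n");
    #endif
      return 0;
    }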

malfet added a commit that referenced this issue May 15, 2024
By working around GCC's quirks in instantiating templates that require immediate values.
Provide an alternative implementation for scaling the output if the code is compiled without any optimizations.

Fixes #126283
@theComputeKid
Copy link
Author

@malfet It compiles with that patch, thanks.

pytorchmergebot pushed a commit that referenced this issue May 16, 2024
By working around GCC's quirks in instantiating templates that require immediate values.
Provide an alternative implementation for scaling the output if compiled without any optimizations (both GCC and clang define `__OPTIMIZE__` if invoked with anything but `-O0`).

Fixes #126283

Pull Request resolved: #126290
Approved by: https://github.com/atalman, https://github.com/seemethere
pytorchmergebot pushed a commit that referenced this issue May 17, 2024
By working around GCC's quirks in instantiating templates that require immediate values.
Provide an alternative implementation for scaling the output if compiled without any optimizations (both GCC and clang define `__OPTIMIZE__` if invoked with anything but `-O0`).

Test plan (after the change was reverted): ssh into an aarch64 runner and rebuild the given file with `-O0`.

Fixes #126283

Pull Request resolved: #126290
Approved by: https://github.com/atalman, https://github.com/seemethere