Fix aarch64 debug build with GCC (#126290)
By working around GCC's quirks in instantiating templates that require immediate values.
Provide an alternative implementation for scaling the output when compiled without any optimizations (both GCC and Clang define `__OPTIMIZE__` when invoked with any optimization level other than `-O0`).

Test plan (after change was reverted): SSH into an aarch64 runner and rebuild the given file with `-O0`.

Fixes #126283

Pull Request resolved: #126290
Approved by: https://github.com/atalman, https://github.com/seemethere
malfet authored and ZelboK committed May 19, 2024
1 parent 6708519 commit 38a85b2
Showing 1 changed file with 8 additions and 0 deletions.
8 changes: 8 additions & 0 deletions aten/src/ATen/native/cpu/int8mm_kernel.cpp
@@ -250,10 +250,18 @@ inline void tinygemm_kernel_(
});
}

#if __OPTIMIZE__
float32x4_t scale_val = load_as_float32x4(scales);
c10::ForcedUnroll<BLOCK_N>{}([&](auto i) {
C[m * ldc + i] = reduce(c_val[i]) * vgetq_lane_f32(scale_val, i);
});
#else
// Workaround GCCs inability to infer lane index at compile time
// See https://github.com/pytorch/pytorch/issues/126283
c10::ForcedUnroll<BLOCK_N>{}([&](auto i) {
C[m * ldc + i] = reduce(c_val[i]) * float(scales[i]);
});
#endif
}
}
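
The guard in the diff above can be illustrated with a portable sketch that does not need NEON headers. Everything in it is an assumption made for illustration: `get_lane` is a hypothetical stand-in for an intrinsic such as `vgetq_lane_f32` that takes a compile-time lane index, `scale_outputs` is a hypothetical name, and the block width is fixed at 4. It is not the kernel's actual code, only the shape of the technique: take the lane-extraction path when `__OPTIMIZE__` is defined, and fall back to plain scalar indexing otherwise.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for an intrinsic that needs a compile-time lane index.
template <std::size_t Lane>
inline float get_lane(const std::array<float, 4>& v) {
  static_assert(Lane < 4, "lane index out of range");
  return v[Lane];
}

// Hypothetical helper: multiplies four accumulated values by their scales.
// Both branches compute the same result; only the access pattern differs.
inline void scale_outputs(float* out, const float* acc, const float* scales) {
  assert(out != nullptr && acc != nullptr && scales != nullptr);
#if defined(__OPTIMIZE__)
  // Optimized builds: load the scales once, then read each lane by constant index.
  const std::array<float, 4> scale_val = {{scales[0], scales[1], scales[2], scales[3]}};
  out[0] = acc[0] * get_lane<0>(scale_val);
  out[1] = acc[1] * get_lane<1>(scale_val);
  out[2] = acc[2] * get_lane<2>(scale_val);
  out[3] = acc[3] * get_lane<3>(scale_val);
#else
  // Unoptimized builds: index the scales directly, no immediate operand required.
  for (std::size_t i = 0; i < 4; ++i) {
    out[i] = acc[i] * scales[i];
  }
#endif
}
```

The sketch tests `defined(__OPTIMIZE__)` rather than the bare `#if __OPTIMIZE__` used in the diff; both select the same branch, and the `defined` form simply avoids a warning under `-Wundef`. Because the two branches are numerically identical, building the same file with and without `-O0` should give the same output.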
