On neon64, `any` calls vmaxvq_u8/16/32 depending on the data width [1]. However, we can always use vmaxvq_u32, since we only want to know whether any bits are non-zero, regardless of their position or width.
vmaxvq_u8 does more work and is often slower than vmaxvq_u32. For example, on Neoverse N1 the latency of vmaxvq_u8 is 6 cycles, while vmaxvq_u32 is only 3 (see UMAXV in the Neoverse N1 software optimization guide [2]).
The same applies to the neon code path [3]: the successive pairwise folding is not necessary; see the sketch below.
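A minimal sketch of the idea (not xsimd's actual implementation; the helper name is hypothetical): whatever the element width of the original data, the register can be reinterpreted as four 32-bit lanes and reduced with a single vmaxvq_u32.

```cpp
#include <arm_neon.h>

// Hypothetical helper: returns true if any bit in the 128-bit register is set.
// The original element width (u8/u16/u32/u64) is irrelevant, because any set
// bit makes its containing 32-bit lane non-zero, so the max over the four
// u32 lanes is non-zero as well.
static inline bool any_nonzero(uint8x16_t v)
{
    return vmaxvq_u32(vreinterpretq_u32_u8(v)) != 0;
}
```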
[1] https://github.com/xtensor-stack/xsimd/blob/master/include/xsimd/arch/xsimd_neon64.hpp#L65
[2] https://developer.arm.com/documentation/swog309707/latest
[3] https://github.com/xtensor-stack/xsimd/blob/master/include/xsimd/arch/xsimd_neon.hpp#L2124