On neon64, `any` calls vmaxvq_u8/16/32 depending on the data width [1]. However, we can always use vmaxvq_u32, since we only want to know whether any bits are non-zero, regardless of their position or width.
vmaxvq_u8 does more work and is often slower than vmaxvq_u32. For example, on Neoverse N1 the latency of vmaxvq_u8 is 6 cycles, while vmaxvq_u32 is only 3 (see UMAXV in the Neoverse N1 software optimization guide [2]).
The same applies to the neon code path [3]: the successive pairwise folding is not necessary; see the sketch below.
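A minimal sketch of the idea (not xsimd's actual implementation; the helper name is hypothetical): whatever the element width of the original data, the register can be reinterpreted as four 32-bit lanes and reduced with a single vmaxvq_u32.

```cpp
#include <arm_neon.h>

// Hypothetical helper: returns true if any bit in the 128-bit register is set.
// The original element width (u8/u16/u32/u64) is irrelevant, because any set
// bit makes its containing 32-bit lane non-zero, so the max over the four
// u32 lanes is non-zero as well.
static inline bool any_nonzero(uint8x16_t v)
{
    return vmaxvq_u32(vreinterpretq_u32_u8(v)) != 0;
}
```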
[1] https://github.com/xtensor-stack/xsimd/blob/master/include/xsimd/arch/xsimd_neon64.hpp#L65
[2] https://developer.arm.com/documentation/swog309707/latest
[3] https://github.com/xtensor-stack/xsimd/blob/master/include/xsimd/arch/xsimd_neon.hpp#L2124