[NEON] Improve any/all implementation #672

@cyb70289

Description

On neon64, `any` calls `vmaxvq_u8/16/32` depending on the data width [1]. Actually, we can always use `vmaxvq_u32`, since we only want to know whether any non-zero bit is present, regardless of its position or element width.

`vmaxvq_u8` does more work and is often slower than `vmaxvq_u32`. E.g., on Neoverse N1, `vmaxvq_u8` has a latency of 6 cycles, while `vmaxvq_u32` has a latency of 3 (see UMAXV in the Neoverse N1 Software Optimization Guide [2]).
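The idea can be sketched portably (no intrinsics; `any_u32_view` is a hypothetical helper, not xsimd API). On neon64 the body would reduce to a `vreinterpretq_u32_u8`/`_u16` bit-cast plus a single `vmaxvq_u32`:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>

// Sketch: 'any' over a 128-bit register, viewed as four uint32_t lanes
// regardless of the original element width (u8/u16/u32). The bit-cast
// is free on NEON (vreinterpretq); the max-reduction models vmaxvq_u32.
// The result is non-zero iff some lane of ANY width is non-zero.
bool any_u32_view(const void* reg128) {
    uint32_t lanes[4];
    std::memcpy(lanes, reg128, sizeof(lanes));  // models vreinterpretq_u32_*
    uint32_t m = std::max(std::max(lanes[0], lanes[1]),
                          std::max(lanes[2], lanes[3]));  // models vmaxvq_u32
    return m != 0;
}
```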

The same applies to the 32-bit neon code [3]: the successive per-width folding is not necessary.
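Without `vmaxv`, a single fold still suffices. A portable sketch (hypothetical helper name; on NEON this would be `vget_low`/`vget_high` + one `vorr`, then a scalar zero test on the 64-bit result):

```cpp
#include <cstdint>
#include <cstring>

// Sketch: 'any' on 32-bit NEON without folding repeatedly by element
// width. OR the low and high 64-bit halves once (models a single vorr
// on the two halves of the register), then test the result for zero.
bool any_single_fold(const void* reg128) {
    uint64_t halves[2];
    std::memcpy(halves, reg128, sizeof(halves));  // models vget_low/vget_high
    return (halves[0] | halves[1]) != 0;          // single fold + scalar test
}
```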

[1] https://github.com/xtensor-stack/xsimd/blob/master/include/xsimd/arch/xsimd_neon64.hpp#L65
[2] https://developer.arm.com/documentation/swog309707/latest
[3] https://github.com/xtensor-stack/xsimd/blob/master/include/xsimd/arch/xsimd_neon.hpp#L2124
