aarch64 splits one NEON shift into two (missed optimization) #64048

Closed
@neldredge

Given the following aarch64 NEON intrinsic code:

#include <arm_neon.h>
void vectorized(const uint8_t* pCSI2, uint8_t* pBE, const uint8_t* pCSI2LineEnd)
{
    while (pCSI2 < pCSI2LineEnd) {
        uint8x16x3_t in = vld3q_u8(pCSI2);
        uint8x16x3_t out;
        out.val[0] = in.val[0];
        out.val[1] = vorrq_u8(vshlq_n_u8(in.val[2], 4), vshrq_n_u8(in.val[1], 4));
        out.val[2] = vorrq_u8(vshlq_n_u8(in.val[1], 4), vshrq_n_u8(in.val[2], 4));
        vst3q_u8(pBE, out);
        pCSI2 += 48;
        pBE += 48;
    }
}

For vshrq_n_u8, instead of the obvious single instruction ushr v4.16b, v1.16b, #4, clang emits two shifts:

ushr    v4.16b, v1.16b, #1
ushr    v4.16b, v4.16b, #3

This does not happen with a function consisting only of vshrq_n_u8(x, 4), so the split apparently depends on the surrounding code.
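
To make clear that this is only a missed optimization and not a miscompile, here is a minimal scalar sketch that models one byte lane of each sequence and compares them over all 256 input values. The helper names (shr_single, shr_split, count_mismatches) are made up for illustration and are not part of the reproducer or of any library; the sketch assumes ushr on .16b lanes behaves like an unsigned 8-bit right shift per lane.

```c
#include <stdint.h>

/* One lane of the expected code: ushr v4.16b, v1.16b, #4 */
static uint8_t shr_single(uint8_t x) {
    return (uint8_t)(x >> 4);
}

/* One lane of what clang emitted: ushr #1 followed by ushr #3 */
static uint8_t shr_split(uint8_t x) {
    uint8_t t = (uint8_t)(x >> 1);
    return (uint8_t)(t >> 3);
}

/* Counts the byte values for which the two sequences disagree. */
static int count_mismatches(void) {
    int mismatches = 0;
    for (unsigned v = 0; v < 256; ++v) {
        if (shr_single((uint8_t)v) != shr_split((uint8_t)v)) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

Both sequences produce the same result for every lane value, so the second instruction costs time without changing the output.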

Try on godbolt
