aarch64 splits one NEON shift into two (missed optimization)

Given the following aarch64 NEON intrinsic code:
``` C++
#include <arm_neon.h>
void vectorized(const uint8_t* pCSI2, uint8_t* pBE, const uint8_t* pCSI2LineEnd)
{
    while (pCSI2 < pCSI2LineEnd) {
        uint8x16x3_t in = vld3q_u8(pCSI2);
        uint8x16x3_t out;
        out.val[0] = in.val[0];
        out.val[1] = vorrq_u8(vshlq_n_u8(in.val[2], 4), vshrq_n_u8(in.val[1], 4));
        out.val[2] = vorrq_u8(vshlq_n_u8(in.val[1], 4), vshrq_n_u8(in.val[2], 4));
        vst3q_u8(pBE, out);
        pCSI2 += 48;
        pBE += 48;
    }
}
```
For the `vshrq_n_u8` call, instead of the obvious single `ushr v4.16b, v1.16b, #4`, clang emits two shifts:
```
ushr    v4.16b, v1.16b, #1
ushr    v4.16b, v4.16b, #3
```
This does not happen with a function consisting only of `vshrq_n_u8(x, 4)`, so the split apparently depends on the surrounding code.

[Try on godbolt](https://godbolt.org/z/z1b7z67zr)
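
To show that this is only a missed optimization and not a miscompile, here is a minimal scalar sketch of a single byte lane. It assumes `ushr` on `.16b` behaves as a per-byte logical right shift. The function names (`shift_single`, `shift_split`, `verify_shift_equivalence`) are made up for illustration and are not part of the original reproducer.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of one byte lane of the expected `ushr vD.16b, vN.16b, #4`.
constexpr std::uint8_t shift_single(std::uint8_t x) {
    return static_cast<std::uint8_t>(x >> 4);
}

// Scalar model of one byte lane of the emitted pair:
// `ushr ..., #1` followed by `ushr ..., #3`.
constexpr std::uint8_t shift_split(std::uint8_t x) {
    const std::uint8_t step1 = static_cast<std::uint8_t>(x >> 1);
    return static_cast<std::uint8_t>(step1 >> 3);
}

// Hypothetical helper: exhaustively compares both forms over all 256 byte values.
inline void verify_shift_equivalence() {
    for (int v = 0; v < 256; ++v) {
        const auto x = static_cast<std::uint8_t>(v);
        assert(shift_single(x) == shift_split(x));
    }
}
```

Calling `verify_shift_equivalence()` should pass for every byte value, since two successive logical right shifts by 1 and 3 discard the same low four bits as one shift by 4. The generated code is therefore correct, just one instruction longer than necessary.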

aarch64 splits one NEON shift into two (missed optimization) #64048
