Closed
Given the following aarch64 NEON intrinsic code:
#include <arm_neon.h>

void vectorized(const uint8_t* pCSI2, uint8_t* pBE, const uint8_t* pCSI2LineEnd)
{
    while (pCSI2 < pCSI2LineEnd) {
        uint8x16x3_t in = vld3q_u8(pCSI2);
        uint8x16x3_t out;
        out.val[0] = in.val[0];
        out.val[1] = vorrq_u8(vshlq_n_u8(in.val[2], 4), vshrq_n_u8(in.val[1], 4));
        out.val[2] = vorrq_u8(vshlq_n_u8(in.val[1], 4), vshrq_n_u8(in.val[2], 4));
        vst3q_u8(pBE, out);
        pCSI2 += 48;
        pBE += 48;
    }
}
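For reference, here is a scalar sketch of what the loop computes per 3-byte group, assuming the usual de-interleaving semantics of vld3q_u8/vst3q_u8 (lane i of val[k] holds byte 3*i + k). The names convert_group and scalar_reference are hypothetical and not part of the original report; they are only meant to make the expected result easy to check.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one 3-byte group from the NEON loop (hypothetical helper).
 * Byte 0 is copied; bytes 1 and 2 are combined with their nibbles shifted
 * by 4, with results truncated to 8 bits as in the vector code. */
static void convert_group(const uint8_t in[3], uint8_t out[3])
{
    out[0] = in[0];
    out[1] = (uint8_t)((in[2] << 4) | (in[1] >> 4));
    out[2] = (uint8_t)((in[1] << 4) | (in[2] >> 4));
}

/* Applies convert_group to `groups` consecutive 3-byte groups.
 * One iteration of the NEON loop corresponds to groups == 16 (48 bytes). */
void scalar_reference(const uint8_t *src, uint8_t *dst, size_t groups)
{
    for (size_t i = 0; i < groups; ++i)
        convert_group(src + 3 * i, dst + 3 * i);
}
```

Comparing the output of vectorized against scalar_reference on the same buffer should give identical bytes regardless of how the right shift is lowered.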
For each vshrq_n_u8(..., 4), instead of the obvious single instruction ushr v4.16b, v1.16b, #4, clang emits two shifts:
ushr v4.16b, v1.16b, #1
ushr v4.16b, v4.16b, #3
The split shift does not occur in a function consisting only of vshrq_n_u8(x, 4), so it apparently depends on the surrounding context somehow.
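Note that the two-shift sequence still produces the correct result, so this looks like a missed optimization rather than a miscompile. The portable sketch below checks this exhaustively for one 8-bit lane; the function name split_shift_matches_single_shift is hypothetical, and the code models only the unsigned right shifts, not clang's actual lowering.

```c
#include <stdint.h>

/* Returns 1 if (x >> 1) >> 3 equals x >> 4 for every 8-bit value, else 0.
 * This mirrors "ushr #1; ushr #3" versus a single "ushr #4" on one lane. */
int split_shift_matches_single_shift(void)
{
    for (unsigned v = 0; v <= 0xFF; ++v) {
        uint8_t x = (uint8_t)v;
        uint8_t split = (uint8_t)((uint8_t)(x >> 1) >> 3);
        uint8_t single = (uint8_t)(x >> 4);
        if (split != single)
            return 0;
    }
    return 1;
}
```

Since both forms agree for all 256 inputs, the only cost of the emitted code is the extra instruction per shift.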