Fixed length attributes #176

christian-herber-nxp · 2022-11-08T11:07:19Z

Has there been consideration for an attribute to specify a fixed length for vector and mask types?
A reference for such a solution could be the ACLE vector bits attribute.

Background is that many existing SIMD libraries were built around fixed-length vectors and use them in places such as classes and structs, which is not possible with the sizeless types of the RVV intrinsics. Adding an RVV port is very tricky, at least in those cases. I could imagine this would be a good step toward lowering the entry barrier for RISC-V V.

eopXD · 2022-11-22T08:45:42Z

Yes, we can have the compiler derive types like float32x4_t as an alias of vfloat32m1_t when something like rvv-fixed-length=128 is specified. This enables porting from existing SIMD libraries.

I personally don't think it is worth the effort, because this approach cannot transition to what RVV is designed for, which is a scalable vector length. I think the computations should rely on vsetvl to determine the number of elements to process, not on a presumption about the vector length.

christian-herber-nxp · 2022-11-22T09:35:51Z

I agree with your comment that this will not transition into a VLA approach. But it can help lower the burden of transitioning to RVV from current SIMD solutions, which are vector-length specific.

ARM has solved this through attributes such as __attribute__((arm_sve_vector_bits(512))); with that, you don't need to multiply the number of defined types.

eopXD · 2022-11-22T13:30:47Z

Sure, I agree with you that this would help SIMD users adopt RVV. I just added this topic to next Monday's open meeting.

topperc · 2023-01-31T16:52:42Z

I posted the first patch to support something like __attribute__((arm_sve_vector_bits(512))); here https://reviews.llvm.org/D142144

That patch only supports the command line option to specify the number of bits in a vector. It does not add the attribute yet.

topperc · 2023-03-01T18:50:34Z

A bigger patch has been posted at https://reviews.llvm.org/D145088 to support LMUL=1 types.

…sve_vector_bits.

This allows the user to set the size of the scalable vector so they can be used in structs and as the type of global variables. This works by representing the type as a fixed vector instead of a scalable vector in IR. Conversions to and from scalable vectors are made where necessary, like function arguments/returns and intrinsics.

This feature has been requested here: riscv-non-isa/rvv-intrinsic-doc#176

I know arm_sve_vector_bits is used by the Eigen library, so this could be used to port Eigen to RVV.

This patch adds a new preprocessor define `__riscv_v_fixed_vlen` that is set when -mrvv-vector-bits is passed on the command line.

The code is largely based on the AArch64 code. A lot of code was copy/pasted and then modified for RVV. There may be some opportunities for sharing.

This first patch only supports the LMUL=1 types. Additional changes will be needed to support other LMULs. I have also not supported mask vectors.

Differential Revision: https://reviews.llvm.org/D145088

camel-cdr · 2023-05-31T18:47:21Z

@eopXD

> I personally don't think it is worth the effort, because this approach cannot transition to what RVV is designed for, which is a scalable vector length.

I think there is an option to support porting old fixed size code to rvv, while keeping the generated machine code scalable.

You don't even need new attributes if you can rely on the compiler eliminating redundant vector loads/stores, which clang and gcc currently don't do.

Say you want to port the following code, which uses both 128-bit wide SSE2 intrinsics and 256-bit wide AVX2 intrinsics, to all Zvl64b compatible processors (VLEN >= 64):

typedef struct { __m128i a; __m256i b; } S;

S add(S x, S y) {
	return (S){ _mm_add_epi8(x.a, y.a), _mm256_add_epi8(x.b, y.b) };
}

This could be ported using:

#include <stdint.h>
#include <riscv_vector.h>

typedef struct { uint8_t x[16]; } __m128i;
typedef struct { uint8_t x[32]; } __m256i;

/* 16 bytes with LMUL=2, so the register group is big enough even at VLEN=64 */
static inline __m128i
_mm_add_epi8(__m128i a, __m128i b) {
	__m128i res;
	vuint8m2_t A = __riscv_vle8_v_u8m2(a.x, 16);
	vuint8m2_t B = __riscv_vle8_v_u8m2(b.x, 16);
	__riscv_vse8_v_u8m2(res.x, __riscv_vadd_vv_u8m2(A, B, 16), 16);
	return res;
}

/* 32 bytes with LMUL=4, for the same reason */
static inline __m256i
_mm256_add_epi8(__m256i a, __m256i b) {
	__m256i res;
	vuint8m4_t A = __riscv_vle8_v_u8m4(a.x, 32);
	vuint8m4_t B = __riscv_vle8_v_u8m4(b.x, 32);
	__riscv_vse8_v_u8m4(res.x, __riscv_vadd_vv_u8m4(A, B, 32), 32);
	return res;
}

See how the above uses LMUL=2 for 128-bit vectors and LMUL=4 for 256-bit vectors. This makes sure it works on all Zvl64b compatible processors (VLEN >= 64). If you only care about Zvl128b and above, you can use LMUL=1 and LMUL=2 instead.
There is a potential penalty on processors with a larger VLEN than the minimum supported by the above, because it may do more work than necessary. E.g. when VLEN=128 and you use LMUL=4 to emulate 256-bit vectors, you potentially execute two more 128-bit operations than needed. But vector processors can and do short-circuit the execution of larger LMULs when the set vl is smaller. So in practice the above approach will likely only restrict the number of available registers and not impact performance through other means.
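The LMUL choice described above comes down to a small calculation, sketched here in plain C. The helper names `lmul_for` and `idle_regs` are hypothetical and not part of any RVV API; fractional LMUL is ignored, and the "idle register" count only models the register-pressure cost mentioned above, not real hardware timing.

```c
#include <assert.h>

/* Smallest integer LMUL (1, 2, 4 or 8) whose register group can hold
 * `bits` bits on every CPU with VLEN >= min_vlen.
 * Returns 0 if even LMUL=8 is too small. */
static unsigned lmul_for(unsigned bits, unsigned min_vlen) {
    assert(min_vlen > 0);
    for (unsigned lmul = 1; lmul <= 8; lmul *= 2) {
        if (lmul * min_vlen >= bits)
            return lmul;
    }
    return 0;
}

/* Registers of the group that carry no data when the same binary runs on
 * a CPU whose actual VLEN is `vlen` (vlen >= the minimum compiled for). */
static unsigned idle_regs(unsigned bits, unsigned lmul, unsigned vlen) {
    assert(vlen > 0);
    unsigned used = (bits + vlen - 1) / vlen; /* registers holding data */
    return used >= lmul ? 0 : lmul - used;
}
```

For the example in this comment, `lmul_for(256, 64)` gives 4, and `idle_regs(256, 4, 128)` gives 2, which corresponds to the two extra 128-bit registers that are reserved but unused at VLEN=128.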

The problem with the above is that it doesn't get optimized properly at all with current compilers: https://godbolt.org/z/EEjfxvMcd

But I'd expect that future compiler versions will be able to do so, as they are already able to optimize away redundant SSE/AVX load/stores: https://godbolt.org/z/Ghs3x388P

If this isn't the case, you could still create special attributes that would look similar to the following:

typedef struct {
	vuint32m1_t [[rvv_vl(4)]] a;
	vuint32m2_t [[rvv_vl(8)]] b;
} S;

S add(S x, S y) {
	return (S){ vadd_vv_u32m1(x.a, y.a), vadd_vv_u32m2(x.b, y.b) };
}

Here rvv_vl sets the fixed vl of the respective type, and this is implicitly propagated to the vadd intrinsic.

topperc · 2023-05-31T18:59:03Z

@camel-cdr

typedef struct { __m128i a; __m256i b; } S;

S add(S x, S y) {
	return (S){ _mm_add_epi8(x.a, y.a), _mm256_add_epi8(x.b, y.b) };
}

This can also be ported as

typedef long long __m128i __attribute__((__vector_size__(16)));
typedef long long __m256i __attribute__((__vector_size__(32)));
typedef unsigned char __v16qu __attribute__((__vector_size__(16)));
typedef unsigned char __v32qu __attribute__((__vector_size__(32)));

typedef struct { __m128i a; __m256i b; } S;

__m128i _mm_add_epi8(__m128i __a, __m128i __b) {         
  return (__m128i)((__v16qu)__a + (__v16qu)__b);                                 
} 

__m256i _mm256_add_epi8(__m256i __a, __m256i __b) {         
  return (__m256i)((__v32qu)__a + (__v32qu)__b);                                 
} 

S add(S x, S y) {
	return (S){ _mm_add_epi8(x.a, y.a), _mm256_add_epi8(x.b, y.b) };
}

Which is the code from clang's emmintrin.h and avx2intrin.h headers.

https://godbolt.org/z/93qfjez4e

There has been some work in clang to add things like __builtin_elementwise_max for vector operations that don't have C operators.
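As a rough illustration of that point, the sketch below computes a byte-wise maximum, an operation with no C operator, on a GNU vector type. It uses `__builtin_elementwise_max` when the compiler reports it through `__has_builtin` and otherwise falls back to a per-lane loop, on the assumption that compilers without the builtin still accept vector subscripting. The names `u8x16`, `max_u8x16` and `max_bytes16` are made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint8_t u8x16 __attribute__((vector_size(16)));

/* __has_builtin is itself an extension, so test for it before using it. */
#ifdef __has_builtin
#  if __has_builtin(__builtin_elementwise_max)
#    define HAVE_ELEMENTWISE_MAX 1
#  endif
#endif

static u8x16 max_u8x16(u8x16 a, u8x16 b) {
#ifdef HAVE_ELEMENTWISE_MAX
    return __builtin_elementwise_max(a, b);
#else
    /* Fallback: lane-by-lane maximum via vector subscripting. */
    u8x16 r;
    for (int i = 0; i < 16; i++)
        r[i] = a[i] > b[i] ? a[i] : b[i];
    return r;
#endif
}

/* Array-based wrapper so callers do not need the vector type. */
static void max_bytes16(const uint8_t *a, const uint8_t *b, uint8_t *out) {
    assert(a != NULL && b != NULL && out != NULL);
    u8x16 va, vb, vr;
    memcpy(&va, a, sizeof va);
    memcpy(&vb, b, sizeof vb);
    vr = max_u8x16(va, vb);
    memcpy(out, &vr, sizeof vr);
}
```

Nothing in this code names a target, so the same source can be compiled for x86 or for RISC-V with the V extension; which instructions come out is up to the compiler.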

camel-cdr · 2023-05-31T19:10:51Z

@topperc

This was just an example, but the same approach would work for other, more complex instructions.

Wouldn't your comment also apply to riscv_rvv_vector_bits to the same degree?

> There has been some work in clang to add things like __builtin_elementwise_max for vector operations that don't have C operators.

This feels like it's currently very limited. Is integrating all of the intrinsics into a platform-agnostic abstraction something clang aims to do?

Edit:

Also, would your code compiled for Zvl64b also work on Zvl128b without recompiling? Because I don't quite understand how it works exactly. Does clang assume a 128 bit VLEN by default?

topperc · 2023-05-31T19:37:29Z

> @topperc
>
> This was just an example, but the same approach would work for other, more complex instructions.
>
> Wouldn't your comment also apply to riscv_rvv_vector_bits to the same degree?

riscv_rvv_vector_bits allows you to tell the compiler exactly what VLEN your CPU has. Using -mrvv-vector-bits=zvl makes the most sense. This sets the vector width to the largest Zvl*b in -march. Normally -march is treated only as a lower bound. -mrvv-vector-bits=zvl makes it an upper bound too. The generated code won't be portable to other CPUs.

You could get some of the same effect from using vector_size(__riscv_v_min_vlen/8) together with clang's vector operators and builtins, but you wouldn't be able to use any RISC-V intrinsics.
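A minimal sketch of that vector_size alternative follows. It assumes `__riscv_v_min_vlen` is defined by the compiler when the V extension is enabled, and falls back to 128 bits elsewhere so the code also builds on a non-RISC-V host. `vec_i32`, `Accumulator` and the helper functions are hypothetical names for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifdef __riscv_v_min_vlen
#  define VEC_BITS __riscv_v_min_vlen
#else
#  define VEC_BITS 128 /* assumption for hosts without RVV */
#endif
#define VEC_BYTES (VEC_BITS / 8)
#define VEC_LANES (VEC_BYTES / sizeof(int32_t))

/* Generic fixed-size vector: usable in structs, but not with RVV intrinsics. */
typedef int32_t vec_i32 __attribute__((vector_size(VEC_BYTES)));

typedef struct {
    vec_i32 acc;  /* a sized type, so it can be a struct member */
    size_t count;
} Accumulator;

static vec_i32 splat(int32_t x) {
    vec_i32 v;
    for (size_t i = 0; i < VEC_LANES; i++)
        v[i] = x;
    return v;
}

static void accumulator_init(Accumulator *s) {
    assert(s != NULL);
    s->acc = splat(0);
    s->count = 0;
}

static void accumulate(Accumulator *s, vec_i32 v) {
    assert(s != NULL);
    s->acc += v; /* element-wise add through the C operator */
    s->count++;
}

static int32_t lane_sum(const Accumulator *s) {
    int32_t total = 0;
    for (size_t i = 0; i < VEC_LANES; i++)
        total += s->acc[i];
    return total;
}
```

Because the width comes from the minimum VLEN implied by -march, the code stays valid on CPUs with a larger VLEN, at the cost of not using the extra width.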

> There has been some work in clang to add things like __builtin_elementwise_max for vector operations that don't have C operators.
>
> This feels like it's currently very limited, is integrating all of the intrinsics into a platform agnostic abstraction a thing clang aims to do?

I think it's primarily just the things that have a single intrinsic generic IR representation that can easily be supported by any target. It's probably not going to have something weird like X86's psadbw for example.

For the most part RISC-V doesn't have weird instructions, so not being able to target them may not be a big deal?

> Also, would your code compiled for Zvl64b also work on Zvl128b without recompiling? Because I don't quite understand how it works exactly. Does clang assume a 128 bit VLEN by default?

My example used V in the -march so clang was assuming VLEN>=128. If I had used Zve64 instead, it would assume VLEN>=64.

vineetgarc · 2024-03-22T17:04:00Z

FTR attribute support in gcc landed: https://gcc.gnu.org/pipermail/gcc-patches/2024-March/648204.html

commit 47de95d801c6899033c303b1fe642feb0489994f
Author: Pan Li <pan2.li@intel.com>
Date:   Fri Mar 22 14:43:47 2024 +0800

    RISC-V: Introduce gcc attribute riscv_rvv_vector_bits for RVV
    
    This patch introduces a new gcc attribute for RVV.
    The attribute is used to define fixed-length variants of
    existing sizeless RVV types.
    
    This attribute is valid if and only if -mrvv-vector-bits=zvl is given. Its
    only argument must be an integer constant, and its value is determined
    by the LMUL and the vector register bits in zvl*b.  For example:
    
    typedef vint32m2_t fixed_vint32m2_t __attribute__((riscv_rvv_vector_bits(128)));
    
    The above type definition is valid when -march=rv64gc_zve64d_zvl64b
    (aka 2(m2) * 64 = 128 for vint32m2_t), and will report an error when
    -march=rv64gcv_zvl128b, similar to the one below.
    
    "error: invalid RVV vector size '128', expected size is '256' based on
    LMUL of type and '-mrvv-vector-bits=zvl'"
    
    Meanwhile, a predefined macro __riscv_v_fixed_vlen is introduced to
    represent the fixed vlen of an RVV vector register.
    
    For the vint*m*_t below operations are allowed.
    * The sizeof.
    * The global variable(s).
    * The element of union and struct.
    * The cast to other equalities.
    * CMP: >, <, ==, !=, <=, >=
    * ALU: +, -, *, /, %, &, |, ^, >>, <<, ~, -
    
    The CMP will return vint*m*_t the same as aarch64 sve. For example:
    typedef vint32m1_t fixed_vint32m1_t __attribute__((riscv_rvv_vector_bits(128)));
    fixed_vint32m1_t less_than (fixed_vint32m1_t a, fixed_vint32m1_t b)
    {
      return a < b;
    }
    
    For the vfloat*m*_t below operations are allowed.
    * The sizeof.
    * The global variable(s).
    * The element of union and struct.
    * The cast to other equalities.
    * CMP: >, <, ==, !=, <=, >=
    * ALU: +, -, *, /, -
    
    The CMP will return vfloat*m*_t the same as aarch64 sve. For example:
    typedef vfloat32m1_t fixed_vfloat32m1_t __attribute__((riscv_rvv_vector_bits(128)));
    fixed_vfloat32m1_t less_than (fixed_vfloat32m1_t a, fixed_vfloat32m1_t b)
    {
      return a < b;
    }
    
    For the vbool*_t types only the operations below are allowed. CMP and
    ALU are excluded because those operations on vbool*_t are not
    well defined currently.
    * The sizeof.
    * The global variable(s).
    * The element of union and struct.
    * The cast to other equalities.
    
    The vint*x*m*_t tuple types are not supported in this patch, which is
    compatible with clang.
    
    This patch passed the below testsuites.
    * The riscv full regression tests.
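A hedged sketch of how the attribute might be used from portable code follows. It assumes `__riscv_v_fixed_vlen` is only defined when the compiler was given a fixed vector width, and it falls back to a 128-bit GNU vector elsewhere so the example builds on any host; that fallback is a stand-in, not part of the patch. `fixed_i32`, `Pair` and the helpers are hypothetical names, and the `+` operator on the fixed type relies on the ALU support listed in the commit message.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#ifdef __riscv_v_fixed_vlen
#  include <riscv_vector.h>
/* LMUL=1, so the attribute argument equals the fixed VLEN. */
typedef vint32m1_t fixed_i32
    __attribute__((riscv_rvv_vector_bits(__riscv_v_fixed_vlen)));
#  define FIXED_BYTES (__riscv_v_fixed_vlen / 8)
#else
/* Stand-in for other hosts (assumption): a 128-bit GNU vector type. */
typedef int32_t fixed_i32 __attribute__((vector_size(16)));
#  define FIXED_BYTES 16
#endif
#define FIXED_LANES (FIXED_BYTES / sizeof(int32_t))

/* Struct members and globals: both need a sized type. */
typedef struct { fixed_i32 lo, hi; } Pair;
static Pair g_pair;

/* Fill every lane of lo and hi with one value each, going through a plain
 * array so no lane-access syntax is needed on the vector type itself. */
static void pair_fill(Pair *p, int32_t lo, int32_t hi) {
    assert(p != NULL);
    int32_t buf[FIXED_LANES];
    for (size_t i = 0; i < FIXED_LANES; i++) buf[i] = lo;
    memcpy(&p->lo, buf, sizeof buf);
    for (size_t i = 0; i < FIXED_LANES; i++) buf[i] = hi;
    memcpy(&p->hi, buf, sizeof buf);
}

/* Element-wise sum using the C operator. */
static fixed_i32 pair_sum(const Pair *p) {
    return p->lo + p->hi;
}
```

With GCC, the RISC-V branch is expected to need -mrvv-vector-bits=zvl, as the commit message states; other LMULs would need the attribute argument scaled accordingly.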

eopXD added the "Revisit after v1.0" label (features or problems we will revisit after the v1.0 release) on Mar 2, 2023

sh1boot mentioned this issue Dec 23, 2023: How to use a class to wrap or derive from a sizeless vector type #305 (Open)

zengdage mentioned this issue Apr 11, 2024: How to support fixed-size rvv intrinsic type in gcc ? howjmay/neon2rvv#373 (Open)
