Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed length attributes #176

Open
christian-herber-nxp opened this issue Nov 8, 2022 · 10 comments
Open

Fixed length attributes #176

christian-herber-nxp opened this issue Nov 8, 2022 · 10 comments
Labels
Revisit after v1.0 Features or problems we will revisit after the v1.0 release

Comments

@christian-herber-nxp
Copy link

Has there been consideration for an attribute to specify a fixed length for vector and mask types?
A reference of such a solution could be ACLE vector bits attribute.

Background is that many existing SIMD libraries were build around fixed length vectors, and use those in places like classes and structs that are not possible with the sizeless types of the rvv intrinsics. Adding an RVV port is very tricky at least in those cases. To lower the entry barrier for RISC-V V, I could imagine this would be a good step.

@eopXD
Copy link
Collaborator

eopXD commented Nov 22, 2022

Yes, we can have the compiler derive types like float32x4_t to be an alias of vfloat32m1_t when something like rvv-fixed-length=128 is specified, this enables porting from existing SIMD libraries.

I personally don't think it is worth the effort because this approach cannot transit to what RVV is designed for, which the vector length can be scalable. I think the computations should rely on vsetvl to determine the number of elements to be processed and not a presumption to the vector length.

@christian-herber-nxp
Copy link
Author

I agree to your comment, that this will not transition into an VLA approach. But it can help lower the burden of transitioning from current SIMD solutions which are vector length specific to RVV.

ARM has solved this through such attributes: __attribute__((arm_sve_vector_bits(512)));, then you don't need to multiply the number of defined types

@eopXD
Copy link
Collaborator

eopXD commented Nov 22, 2022

Sure, I agree with you this would help SIMD users to adapt RVV. Just added this topic to next Monday's open meeting.

@topperc
Copy link
Collaborator

topperc commented Jan 31, 2023

I posted the first patch to support something like __attribute__((arm_sve_vector_bits(512))); here https://reviews.llvm.org/D142144

That patch only supports the command line option to specify the number of bits in a vector. It does not add the attribute yet.

@topperc
Copy link
Collaborator

topperc commented Mar 1, 2023

Bigger patch has been posted https://reviews.llvm.org/D145088 to support LMUL=1 types.

@eopXD eopXD added the Revisit after v1.0 Features or problems we will revisit after the v1.0 release label Mar 2, 2023
topperc added a commit to llvm/llvm-project that referenced this issue Apr 28, 2023
…sve_vector_bits.

This allows the user to set the size of the scalable vector so they
can be used in structs and as the type of global variables. This works
by representing the type as a fixed vector instead of a scalable vector
in IR. Conversions to and from scalable vectors are made where necessary
like function arguments/returns and intrinsics.

This features has been requested here
riscv-non-isa/rvv-intrinsic-doc#176
I know arm_sve_vector_bits is used by the Eigen library so this
could be used to port Eigen to RVV.

This patch adds a new preprocessor define `__riscv_v_fixed_vlen` that
is set when -mrvv_vector_bits is passed on the command line.

The code is largely based on the AArch64 code. A lot of code was
copy/pasted and then modiied to RVV. There may be some opportunities
for sharing.

This first patch only supports the LMUL=1 types. Additional changes
will be needed to support other LMULs. I have also not supported
mask vectors.

Differential Revision: https://reviews.llvm.org/D145088
@camel-cdr
Copy link
Contributor

@eopXD

I personally don't think it is worth the effort because this approach cannot transit to what RVV is designed for, which the vector length can be scalable.

I think there is an option to support porting old fixed size code to rvv, while keeping the generated machine code scalable.

You don't even need a new attributes, if you can rely on the compiler eliminating redundant vector load/stores, which clang and gcc currently don't do.

Say you want to port the following code, that uses both 128 bit wide SSE2 intrinsics and 256 wide AVX2 intrinsics, to all Zvl64b compatible processors (VLEN >= 64):

typedef struct { __m128i a; __m256i b; } S;

S add(S x, S y) {
	return (S){ _mm_add_epi8(x.a, y.a), _mm256_add_epi8(x.b, y.b) };
}

This could be ported using:

typedef struct { uint8_t x[16]; } __m128i;
typedef struct { uint8_t x[32]; } __m256i;

static inline __m128i
_mm_add_epi32(__m128i a, __m128i b) {
	__m128i res;
	vuint8m2_t A = __riscv_vle8_v_u8m2((void*)&a.x, 16);
	vuint8m2_t B = __riscv_vle8_v_u8m2((void*)&b.x, 16);
 	__riscv_vse8_v_u8m2((void*)&res, __riscv_vadd_vv_u8m2(A, B, 16), 16);
	return res;
}

static inline __m256i
_mm256_add_epi32(__m256i a, __m256i b) {
	__m256i res;
	vuint8m4_t A = __riscv_vle8_v_u8m4((void*)&a.x, 32);
	vuint8m4_t B = __riscv_vle8_v_u8m4((void*)&b.x, 32);
 	__riscv_vse8_v_u8m4((void*)&res, __riscv_vadd_vv_u8m4(A, B, 32), 32);
	return res;
}

See how the above uses LMUL=2 for 128 bit vectors and LMUL=4 for 256 bit vectors. This is to make sure it works on all Zvl64b compatible processors (VLEN >= 64). If you only care about Zvl128b and above, then you can just use LMUL=1 and LMUL=2 instead.
There is a potential penalty on processors a larger VLEN than the minimum supported by the above, because it may do more work than necessary: E.g. when VLEN=128, and you use LMUL=4 to do emulate 256 bit vectors, then you potentially do two more 128 bit instructions than needed, but vector processors can and do short circuit the execution of larger LMULs, when the set vl is smaller,. So in practice the above approach will likely only restrict the number of available registers and not impact performance through other means.

The problem with the above is that it doesn't get optimized properly at all with current compilers: https://godbolt.org/z/EEjfxvMcd

But I'd expect that future compiler versions will be able to do so, as they are already able to optimize away redundant SSE/AVX load/stores: https://godbolt.org/z/Ghs3x388P

If this isn't the case, you could still create special attributes that would look similar to the following:

typedef struct {
	vuint32m1_t [[rvv_vl(4)]] a;
	vuint32m2_t [[rvv_vl(8)]] b;
} S;

S add(S x, S y) {
	return (S){ vadd_vv_u32m1(x.a, y.a), vadd_vv_u32m2(x.b, y.b) };
}

Here rvv_vl sets the fix vl of the respective type, and this is implicitly propagated to the vadd intrinsic.

@topperc
Copy link
Collaborator

topperc commented May 31, 2023

@camel-cdr

typedef struct { __m128i a; __m256i b; } S;

S add(S x, S y) {
	return (S){ _mm_add_epi8(x.a, y.a), _mm256_add_epi8(x.b, y.b) };
}

This can also be ported as

typedef long long __m128i __attribute__((__vector_size__(16)));
typedef long long __m256i __attribute__((__vector_size__(32)));
typedef unsigned char __v16qu __attribute__((__vector_size__(16)));
typedef unsigned char __v32qu __attribute__((__vector_size__(32)));

typedef struct { __m128i a; __m256i b; } S;

__m128i _mm_add_epi8(__m128i __a, __m128i __b) {         
  return (__m128i)((__v16qu)__a + (__v16qu)__b);                                 
} 

__m256i _mm256_add_epi8(__m256i __a, __m256i __b) {         
  return (__m256i)((__v32qu)__a + (__v32qu)__b);                                 
} 

S add(S x, S y) {
	return (S){ _mm_add_epi8(x.a, y.a), _mm256_add_epi8(x.b, y.b) };
}

Which is the code from clang's emmintrin.h and avx2intrin.h headers.

https://godbolt.org/z/93qfjez4e

There has been some work in clang to add things like __builtin_elementwise_max for vector operations that don't have C operators.

@camel-cdr
Copy link
Contributor

camel-cdr commented May 31, 2023

@topperc

This was just an example, but the same approach would work the other more complex instructions.

Wouldn't your comment also apply to riscv_rvv_vector_bits to the same degree?

There has been some work in clang to add things like __builtin_elementwise_max for vector operations that don't have C operators.

This feels like it's currently very limited, is integrating all of the intrinsics into a platform agnostic abstraction a thing clang aims to do?

Edit:

Also, would your code compiled for Zvl64b also work on Zvl128b without recompiling? Because I don't quite understand how it works exactly. Does clang assume a 128 bit VLEN by default?

@topperc
Copy link
Collaborator

topperc commented May 31, 2023

@topperc

This was just an example, but the same approach would work the other more complex instructions.

Wouldn't your comment also apply to riscv_rvv_vector_bits to the same degree?

riscv_rvv_vector_bits allows you to tell the compiler exactly what VLEN your CPU has. Using -mrvv-vector-bits=zvl makes the most sense. This sets the vector width to the largest Zvl*b in -march. Normally -march is treated only as a lower bound. -mrvv-vector-bits=zvl makes it an upper bound too. The generated code won't be portable to other CPUs.

You could get some of the same effect from using vector_size(__riscv_v_min_vlen/8) and using clang's vector operators and builtins, but wouldn't be able to use any RISC-V intrinsics.

There has been some work in clang to add things like __builtin_elementwise_max for vector operations that don't have C operators.

This feels like it's currently very limited, is integrating all of the intrinsics into a platform agnostic abstraction a thing clang aims to do?

I think it's primarily just the things that have a single intrinsic generic IR representation that can easily be supported by any target. It's probably not going to have something weird like X86's psadbw for example.

For the most part RISC-V doesn't have weird instructions, so not being able to target them may not be a big deal?

Also, would your code compiled for Zvl64b also work on Zvl128b without recompiling? Because I don't quite understand how it works exactly. Does clang assume a 128 bit VLEN by default?

My example used V in the -march so clang was assuming VLEN>=128. If I had used Zve64 instead, it would assume VLEN>=64.

@vineetgarc
Copy link

FTR attribute support in gcc landed: https://gcc.gnu.org/pipermail/gcc-patches/2024-March/648204.html

commit 47de95d801c6899033c303b1fe642feb0489994f
Author: Pan Li <pan2.li@intel.com>
Date:   Fri Mar 22 14:43:47 2024 +0800

    RISC-V: Introduce gcc attribute riscv_rvv_vector_bits for RVV
    
    This patch would like to introduce one new gcc attribute for RVV.
    This attribute is used to define fixed-length variants of one
    existing sizeless RVV types.
    
    This attribute is valid if and only if the mrvv-vector-bits=zvl, the only
    one args should be the integer constant and its' value is terminated
    by the LMUL and the vector register bits in zvl*b.  For example:
    
    typedef vint32m2_t fixed_vint32m2_t __attribute__((riscv_rvv_vector_bits(128)));
    
    The above type define is valid when -march=rv64gc_zve64d_zvl64b
    (aka 2(m2) * 64 = 128 for vin32m2_t), and will report error when
    -march=rv64gcv_zvl128b similar to below.
    
    "error: invalid RVV vector size '128', expected size is '256' based on
    LMUL of type and '-mrvv-vector-bits=zvl'"
    
    Meanwhile, a pre-define macro __riscv_v_fixed_vlen is introduced to
    represent the fixed vlen in a RVV vector register.
    
    For the vint*m*_t below operations are allowed.
    * The sizeof.
    * The global variable(s).
    * The element of union and struct.
    * The cast to other equalities.
    * CMP: >, <, ==, !=, <=, >=
    * ALU: +, -, *, /, %, &, |, ^, >>, <<, ~, -
    
    The CMP will return vint*m*_t the same as aarch64 sve. For example:
    typedef vint32m1_t fixed_vint32m1_t __attribute__((riscv_rvv_vector_bits(128)));
    fixed_vint32m1_t less_than (fixed_vint32m1_t a, fixed_vint32m1_t b)
    {
      return a < b;
    }
    
    For the vfloat*m*_t below operations are allowed.
    * The sizeof.
    * The global variable(s).
    * The element of union and struct.
    * The cast to other equalities.
    * CMP: >, <, ==, !=, <=, >=
    * ALU: +, -, *, /, -
    
    The CMP will return vfloat*m*_t the same as aarch64 sve. For example:
    typedef vfloat32m1_t fixed_vfloat32m1_t __attribute__((riscv_rvv_vector_bits(128)));
    fixed_vfloat32m1_t less_than (fixed_vfloat32m1_t a, fixed_vfloat32m1_t b)
    {
      return a < b;
    }
    
    For the vbool*_t types only below operations are allowed except
    the CMP and ALU. The CMP and ALU operations on vbool*_t is not
    well defined currently.
    * The sizeof.
    * The global variable(s).
    * The element of union and struct.
    * The cast to other equalities.
    
    For the vint*x*m*_t tuple types are not suppored in this patch which is
    compatible with clang.
    
    This patch passed the below testsuites.
    * The riscv fully regression tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Revisit after v1.0 Features or problems we will revisit after the v1.0 release
Projects
None yet
Development

No branches or pull requests

5 participants