Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ABI alignment of vector types #347

Closed
kito-cheng opened this issue Oct 17, 2022 · 8 comments · Fixed by #380
Closed

ABI alignment of vector types #347

kito-cheng opened this issue Oct 17, 2022 · 8 comments · Fixed by #380
Milestone

Comments

@kito-cheng
Copy link
Collaborator

We should define the ABI alignment of (scalable) vector types, that could be separated sub-item from the full vector ABI.

Vector extension only require vector load/store align to the element width, e.g. require 8 byte alignment for element width=64, however some RISC-V core implementation might require larger alignment for best performance.

So we have following options:

  1. Alignment as the element width.
  2. Require higher alignment for vector types, e.g. align to 8 byte or 16 byte for all vector type for any element width.

And we have LMUL in the vector extension, we might also consider that in the alignment if needed.

NOTE: ABI alignment is the minimal requirement, compiler/programmer could set that alignment to larger than the ABI alignment is also conformance to the ABI

@kito-cheng kito-cheng added this to the Post 1.0 milestone Oct 17, 2022
@aswaterman
Copy link
Contributor

And we have LMUL in the vector extension, we might also consider that in the alignment if needed.

Most implementations (especially in applications processors) will be designed to run LMUL=1 code efficiently, and so using greater alignment when LMUL>1 will probably not be profitable.

My gut feeling is that using the existing stack alignment (16 bytes) is the way to go, but I don't have a strong argument for this proposal. One consideration is that V requires VLEN >= 128, and VLEN=128 will be a popular choice for many apps processors, so 16 bytes seems like a natural choice.

Of course, when optimizing for a specific target, we can increase the alignment when stack-allocating vectors, without breaking the ABI. So if we picked an ABI alignment of 16 bytes, we wouldn't be screwing the VLEN=256 implementations too badly.

@nick-knight
Copy link
Contributor

nick-knight commented Oct 17, 2022

Although not for the faint of heart, some programmers do perform type-punning on vector registers, reinterpreting the bits with the effect of fissioning or fusing consecutive vector elements. (In fact, the V-extension intrinsics provide auxiliary intrinsics to assist.) Without mandating stricter-than-element-width alignment, when compiling a C program like the following,

// vec_u8 passed on stack, according to current calling convention
void foo (vuint8m1_t vec_u8) {
  // 8-bit arithmetic with very high vector register pressure    
  vuint64m1_t vec_u64 = vreinterpret_v_u8m1_u64m1(vec_u8); // need to reload it
  // 64-bit arithmetic
}

the compiler may need to reload vec_u8 with a vl1re8.v, which may set the wrong expectations on an implementation that internally rearranges vector data for different element widths (to reduce datapath wiring costs). On the other hand, if alignment were >= 8B, the compiler could safely generate the appropriate vl1re*.v.

I admit this is a contrived scenario. I mention this only to point out a subtlety with Kito's first option,

Alignment as the element width

@kito-cheng
Copy link
Collaborator Author

Sounds 128 bits alignment should be most nature choose for the alignment of vector types for all different LMUL, and that's also resolve the potential issue which @nick-knight mentioned, actually I heard same issue during collecting issue from different community guys, so the issue might not be existing in synthetic benchmark/testcase I think.

Although one arguments is we could have VLEN=32 or VLEN=64, 128 bits alignment might waste some stack space for those 2 configurations, but that should be rare configuration, and even zve32* or zve64* configure still could have VLEN >= 128.

This topic will put into next psABI call :)

@workingjubilee
Copy link

It's worth noting, since the case of VLEN of 256 or 512 was discussed, that x86-64's AVX, the closest comparison, specifies that __m256 and __m512 are aligned to 32 and 64 bytes. However, in practice, the VEX encoding of these SIMD instructions guarantees, in general, that using these instructions off-alignment is acceptable (with I believe the notable exception of vmovaps?), whereas the SSE instructions and their legacy encodings sometimes enforces 16 byte alignment. So you would not be alone in having these types have a cut-off point at 16 byte alignment, even if it is more "effective" than "by-spec" in that case.

@programmerjake
Copy link

note that rust's project-portable-simd may define vector types (not necessarily the same ones as are used for C FFI) to require alignment be small enough that the vector types have no padding (all Rust types have size that is a multiple of alignment), this allows reinterpreting pointers to aligned portions of any valid array slice as pointers to vector types (e.g. &[u32] -> &[Simd<u32, 7>]): rust-lang/portable-simd#319

@workingjubilee
Copy link

workingjubilee commented Feb 23, 2023

That's not unambiguously decided, however, or at least, I believe the point regarding preferred type punning alignment is a good one.

@programmerjake
Copy link

well, conveniently every type combination that is valid to type pun (so doesn't try to e.g. type pun i32x3 to i64x2 where the last i64 is half undef) already satisfies the alignment requirements if we decide the alignment is gcd(sizeof(element) * length, next_power_of_2(sizeof(element) * length), some_global_constant_limit) (compatible with what i proposed for portable-simd) because:

  • assuming some_global_constant_limit >= sizeof(element) for all possible element types (handles wanting vector alignment to not increase without bound)
  • assuming element types are power-of-2 sized which afaict is true for RVV
  • assuming sizeof(target_element) * target_length <= sizeof(source_element) * source_length aka. that the type pun is valid because the target vector's elements are completely within the source elements and not taking bytes from after the end of the source elements
  • every type pun where sizeof(target_element) <= sizeof(source_element) the alignment works out so the target vector type needs the same or less alignment than the source (always the same alignment if sizeof(target_element) * target_length == sizeof(source_element) * source_length)
  • every type pun where sizeof(target_element) > sizeof(source_element) the assumption that the type pun is valid means that the source vector's length is a multiple of sizeof(target_element) / sizeof(source_element) which is a power of two therefore the alignment again works out so the target vector type needs the same or less alignment than the source (always the same alignment if sizeof(target_element) * target_length == sizeof(source_element) * source_length)

kito-cheng added a commit that referenced this issue May 18, 2023
The issue of Vector alignment is discussed in #347. It is mentioned that
aligning to 128 bytes might deliver better performance on some RISC-V cores,
but this behavior could lead to considerable stack wastage on zve32 and zve64
cores. For instance, in order to ensure a vector value in the stack conforms
to the ABI specification, we could potentially waste up to 96 bits per vector
object in stack for zve32, and the performance difference isn't always evident
across all core implementations.

Therefore, this proposal sets the alignment of vector types to element
alignment, to avoid wasting a significant amount of stack space in zve32 and
zve64 configurations. Also, since the ABI only specify the minimum alignment
and doesn't limit the compiler from adopting higher alignment for specific CPUs.

Fix #347.
@kito-cheng
Copy link
Collaborator Author

Further discussion move to here: #380 :)

kito-cheng added a commit that referenced this issue Jun 9, 2023
The issue of Vector alignment is discussed in #347. It is mentioned that
aligning to 128 bytes might deliver better performance on some RISC-V cores,
but this behavior could lead to considerable stack wastage on zve32 and zve64
cores. For instance, in order to ensure a vector value in the stack conforms
to the ABI specification, we could potentially waste up to 96 bits per vector
object in stack for zve32, and the performance difference isn't always evident
across all core implementations.

Therefore, this proposal sets the alignment of vector types to element
alignment, to avoid wasting a significant amount of stack space in zve32 and
zve64 configurations. Also, since the ABI only specify the minimum alignment
and doesn't limit the compiler from adopting higher alignment for specific CPUs.

Fix #347.
kito-cheng added a commit that referenced this issue Jan 10, 2024
The issue of Vector alignment is discussed in #347. It is mentioned that
aligning to 128 bytes might deliver better performance on some RISC-V cores,
but this behavior could lead to considerable stack wastage on zve32 and zve64
cores. For instance, in order to ensure a vector value in the stack conforms
to the ABI specification, we could potentially waste up to 96 bits per vector
object in stack for zve32, and the performance difference isn't always evident
across all core implementations.

Therefore, this proposal sets the alignment of vector types to element
alignment, to avoid wasting a significant amount of stack space in zve32 and
zve64 configurations. Also, since the ABI only specify the minimum alignment
and doesn't limit the compiler from adopting higher alignment for specific CPUs.

Fix #347.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants