Support non-power-of-two vector lengths. #63

Open · calebzulawski opened this issue Feb 7, 2021 · 27 comments · May be fixed by #422
Labels: A-cranelift (Rust's other codegen backend, coming Soon™), C-feature-request (Category: a feature request, i.e. not implemented / a PR)

Comments

@calebzulawski (Member)

In rust-lang/rust#80652 I disabled non-power-of-two vector lengths to deconflict stdsimd and cg_clif development. Power-of-two vectors are typically sufficient, but there was at least one crate using length-3 vectors (yoanlcq/vek#66). It would be nice to reenable these at some point, which would require support from cranelift. Perhaps a feature request should be filed with cranelift?

cc @bjorn3

@thomcc (Member) commented Feb 7, 2021

TBH I worry that supporting this will cause way more pain (by complicating the semantics) than the value of supporting these use cases is worth.

  1. Concretely, with this we'd have explicit always-padding/uninitialized/undefined fields in some of the simd types.

    This feels dangerous, like it will lead to (likely latent) UB in people's code, the same way working with structs that have padding bytes is a headache in low-level code, which I'd generally expect a lot of SIMD code to be.

  2. The underlying intrinsics will still read these padding/uninit fields and mix them into the result in a way that might be problematic in a number of ways.

    Consider comparing 2 vec3s and producing an integer bitmask.

    The ideal would probably be the least significant 3 bits set and the rest zeros. Comparison would produce a 3xN mask with one uninitialized field, but conversion to an integer will produce 0b0000_UZYX, where U is uninitialized.

    That is to say, there's an undef bit, but since (at least) LLVM doesn't track this stuff at the bitwise level, the whole byte is undef... So maybe simd_bitmask is implemented in a way that fixes this, but I'd be very concerned: since the real operations have to read the bit in question, it seems hard to guarantee robustly.

  3. I suspect these would be semantically second-class still. That is:

    There are several behaviors that are currently consistent across all vectors which now would not be consistent for these. Example: size being easily computable from the element count, alignment, repr(simd) being soundly transmutable to/from bytes...

    Several questions exist whose answer is "it behaves like the next power of two up", where the actual behavior should probably be that way too. For example: everything about layout again, simd_ffi, asm, ...

  4. What if there's an arch in the future that ends up modeling non-POT vectors and then doesn't map well to the "uses next power of two up"? I think this is unlikely, but I think we should shy away from defining behavior if we can avoid it.

  5. While it could make programming easier for this user, I strongly suspect that it will cause many future bugs for other users as people assume the size is just the element count times the element size, assume there are no padding bytes, ... (basically all the weird questions in 3)

  6. It seems to me like this wouldn't actually help vek's code migrate to core::simd very much anyway, would it? repr(simd) isn't stabilizing with core::simd?

    If it isn't: to use core_simd, vek would almost certainly wrap our types to define its own functionality, at which point it could hide the 4th field of the f32x4 behind its public vec3.

    vek could even keep field access by doing the fields-only-exist-via-{Deref,DerefMut} trick (see the sketch after this list): https://play.rust-lang.org/?version=nightly&mode=debug&edition=2018&gist=6b1799bb4796514d2a6823c519e71bb1

    If it is, I think a lot of design work has to be done on it that hasn't even been started.
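
For illustration, here is roughly what that {Deref,DerefMut} trick looks like; a minimal sketch assuming a hypothetical Vec3 newtype over core::simd's f32x4 (the names are illustrative, not vek's actual API):

#![feature(portable_simd)]
use core::ops::{Deref, DerefMut};
use core::simd::f32x4;

// Named fields exposed only through Deref; the 4th lane of the backing f32x4 stays hidden.
#[repr(C)]
pub struct Vec3Fields {
    pub x: f32,
    pub y: f32,
    pub z: f32,
}

pub struct Vec3(f32x4);

impl Deref for Vec3 {
    type Target = Vec3Fields;
    fn deref(&self) -> &Vec3Fields {
        // Relies on Simd<f32, 4> being laid out like [f32; 4] (modulo alignment),
        // so the first 12 bytes line up with Vec3Fields.
        unsafe { &*(self as *const Vec3 as *const Vec3Fields) }
    }
}

impl DerefMut for Vec3 {
    fn deref_mut(&mut self) -> &mut Vec3Fields {
        unsafe { &mut *(self as *mut Vec3 as *mut Vec3Fields) }
    }
}

fn main() {
    let v = Vec3(f32x4::from_array([1.0, 2.0, 3.0, 0.0]));
    // Field access "just works" via auto-deref, even though Vec3 has no named fields itself.
    println!("{} {} {}", v.x, v.y, v.z);
}

Field access then reads like v.x / v.y / v.z while the 4th lane never appears in the public API.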

That's all to say: I'm kind of... pretty strongly opposed. If we do this, it should be a decision made carefully, weighing all these things (and perhaps more) against the benefits.


One note is: if #[repr(simd)] is staying unstable forever, I think it would be fine to support this if it's unobtrusive to a compiler. My objections center on:

  1. it being bad and weird in the core::simd API, and
  2. it being semantically very bizarre and the kind of thing that seems better avoided than specified.

Neither of these applies to unstable features, which I don't have strong feelings about.

@calebzulawski (Member, Author)

In my opinion, if you are doing things that depend on the layout of a type, you need to ensure that you are properly handling the layout. The vector types already impose architecture-specific alignment implications; I don't think padding implications are particularly onerous. If you want to determine the size of a type, you should use core::mem::size_of and not make assumptions.
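
A minimal sketch of what that looks like in practice, assuming a nightly toolchain with portable_simd and a length-3 Simd type available (the printed values are target-dependent):

#![feature(portable_simd)]
use core::mem::{align_of, size_of};
use core::simd::Simd;

fn main() {
    // Query the layout rather than assuming size == lanes * size_of::<f32>();
    // a length-3 vector may be padded and over-aligned relative to [f32; 3].
    println!(
        "Simd<f32, 3>: size = {}, align = {}",
        size_of::<Simd<f32, 3>>(),
        align_of::<Simd<f32, 3>>()
    );
    println!(
        "[f32; 3]:     size = {}, align = {}",
        size_of::<[f32; 3]>(),
        align_of::<[f32; 3]>()
    );
}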

I also think it should be pointed out that LLVM entirely supports non-power-of-two vectors (ignoring floating-point exceptions). The limitation here really just comes from cranelift.

Overall I don't think this is about vek; vek only illustrates that there is a desire for non-power-of-two vector lengths. I'm a little confused why it would be weird in the std::simd API; as far as the API is concerned, I believe it should be completely length-agnostic.

I do think this is a non-critical issue though, and can definitely be ignored for the time being.

calebzulawski added the A-cranelift and C-feature-request labels on Feb 7, 2021
@programmerjake (Member)

@thomcc

What if there's an arch in the future that ends up modeling non-POT vectors and then doesn't map well to the "uses next power of two up"? I think this is unlikely, but I think we should shy away from defining behavior if we can avoid it.

Both RISC-V V and SimpleV natively support non-power-of-2 vector lengths; for SimpleV, the in-memory layout is the same as an array of the same length.

@thomcc (Member) commented Feb 7, 2021

Both RISC-V V and SimpleV natively support non-power-of-2 vector lengths; for SimpleV, the in-memory layout is the same as an array of the same length.

This is a far more compelling reason to support them IMO.

@workingjubilee (Contributor) commented Feb 8, 2021

Graphics code usually defers to the GPU, which has subtle but important differences from a typical SIMD register, but it is still useful to consider for our purposes, as a comparison point if nothing else. And observationally, GPUs handle Vec3 inputs by simply using a Vec4 during execution, with a dummy value set in the 4th lane, but only use the first 3 lanes otherwise. The savings are basically in disk space and RAM, at the cost of complicating marshalling the data into the stream processors.

That a rather large number of programmers are basically okay with this quirk, and manage just fine without anything going particularly wrong (while there's some debate over whether or not they should just shove something in the spare element slot and get the advantage of fully aligned accesses when the data is on the CPU), seems to suggest we'll be okay.

Arm SVE is even more imminent as variable-lane silicon that will be available on servers sometime Soon™ (I mean, it's probably already available in HPC land, but I mean for mere mortals who "just" have a big server), and it can also land on non-power-of-2 vector bit widths, though admittedly always a multiple of 128 bits.

@workingjubilee (Contributor)

Filed https://github.com/bjorn3/rustc_codegen_cranelift/issues/1136 for now.

Also filing that reminded me that https://github.com/WebAssembly/flexible-vectors/ exists.

@bjorn3 (Member) commented Feb 22, 2021

Glam needs non-power-of-two vectors on SPIR-V.

@XAMPPRocky (Member) commented Feb 22, 2021

Hey, just wanted to note that the restriction introduced in rust-lang/rust#80652 breaks a lot of code for the rust-gpu spirv-unknown-unknown target, and prevents us from upgrading our compiler to the latest nightly, as SPIR-V has length-3 vectors (x, y, z), which we're using through glam. In Vulkan these are defined as having a size of 12 and an alignment of 16.

The stated reason for the restriction is the lack of support in Cranelift; however, this isn't a concern for us, as our backend does support non-power-of-two SIMD types. Would it be possible to re-enable it just for the rustc_codegen_spirv backend, or when compiling for the spirv-unknown-unknown target? This would allow us to upgrade our code to the latest nightly, while keeping the restriction for cranelift.

@calebzulawski (Member, Author) commented Feb 22, 2021

Perhaps we should revert the power-of-two restriction (not the entire patch as it fixes an ICE) and just allow cg_clif to error if it encounters one? There doesn't seem to be an obvious solution other than keeping #[repr(simd)] perma-unstable (and a little hacky?) pending cranelift support. @workingjubilee suggested GPU targets as a potential user of non-power-of-two vectors from the start and it didn't seem to take long to run into that...

@workingjubilee (Contributor)

I agree. Is cranelift only used for wasm targeting right now?

@bjorn3 (Member) commented Feb 22, 2021

Cranelift doesn't have a wasm backend. It does have a wasm frontend, which is used by, for example, Wasmtime. cg_clif, which is another Cranelift frontend, targets x86_64 and, in the future, AArch64.

@thomcc (Member) commented Feb 22, 2021

In Vulkan these are defined as having a size of 12 and an alignment of 16

This is going to be tricky, since rust requires size >= alignment...

There's some discussion about it here https://rust-lang.zulipchat.com/#narrow/stream/122651-general/topic/Data.20layout.20.2F.20ABI.20expert.20needed, but it's possible another thread should be created if we need to support this.

@workingjubilee (Contributor)

@thomcc Checking up on that thread, but where does Rust require size >= alignment?

@thomcc (Member) commented Feb 22, 2021

It's required by array layout. Each element is required to be size_of::<T>() apart, but if size < alignment, this would result in unaligned items.
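
To make that concrete, a small illustration: the Vec3 below is a hypothetical stand-in for a Vulkan-style vec3, and Rust rounds its size up to a multiple of its alignment precisely so that array elements stay aligned:

use core::mem::{align_of, size_of};

// Hypothetical 16-aligned vec3; Rust rounds the size up from 12 to 16.
#[allow(dead_code)]
#[repr(C, align(16))]
struct Vec3([f32; 3]);

fn main() {
    assert_eq!(align_of::<Vec3>(), 16);
    assert_eq!(size_of::<Vec3>(), 16); // not 12: size is rounded up to a multiple of alignment

    // Array elements are size_of::<Vec3>() apart, so every element stays 16-aligned.
    let a = [Vec3([0.0; 3]), Vec3([1.0; 3])];
    let stride = &a[1] as *const Vec3 as usize - &a[0] as *const Vec3 as usize;
    assert_eq!(stride, size_of::<Vec3>());
}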

@workingjubilee (Contributor) commented Feb 22, 2021

Hmm. I know that GPUs will execute over Vec3s with some multiple of 4 lanes anyways, but does code shuffling around such data en route to actual execution regularly assume the stride is 12 and not 16?

@thomcc (Member) commented Feb 22, 2021

It's required by array layout. Each element is required to be size_of::<T>() apart, but if size < alignment, this would result in unaligned items.

Apparently this isn't actually guaranteed, but I think it would probably require an RFC to change (and plausibly would break a lot of unsafe code in the wild — or rather, turn currently-sound interfaces to unsound-if-stride-and-size-dont-match ones)

@programmerjake (Member)

Hmm. I know that GPUs will execute over Vec3s with some multiple of 4 lanes anyways, but does code shuffling around such data en route to actual execution regularly assume the stride is 12 and not 16?

That's often not true anymore (though it was usually true in the distant past, for OpenGL 2.x hardware); AMD's GPUs convert all vector operations to scalar operations on each SIMT thread, IIRC. So a vec3 will just become 3 separate f32 variables.

@thomcc (Member) commented Feb 22, 2021

though it was usually true in the distant past, for OpenGL 2.x hardware

Hmm, my understanding is that it wasn't fully true here either, but it's been a while. But yes, what you're saying is definitely accurate for modern stuff. The float3/vec3/float4/vec4 types are just abstractions that don't model the actual hardware necessarily (the actual hardware is much much wider).

@workingjubilee (Contributor)

Oh wow! That's absolutely wild.

@programmerjake (Member)

So, my understanding is that for modern AMD GPUs, vectors in SPIR-V translate like this:

+-----------------------+
|      SPIR-V vec3      |
+-------+-------+-------+
| lane0 | lane1 | lane2 |
+-------+-------+-------+----+---------+
| 0.0   | 1.0   | 2.0   | 0  |         |
+-------+-------+-------+----+         |
| 3.0   | 4.0   | 5.0   | 1  |         |
+-------+-------+-------+----+         |
| 6.0   | 7.0   | 8.0   | 2  |         |
+-------+-------+-------+----+ HW lane |
| ...   | ...   | ...   |    |         |
+-------+-------+-------+----+         |
| 186.0 | 187.0 | 188.0 | 62 |         |
+-------+-------+-------+----+         |
| 189.0 | 190.0 | 191.0 | 63 |         |
+-------+-------+-------+----+---------+

So an add of 2 vec3 values would translate to AMDGPU assembly sorta like the following:
GLSL source:

vec3 a, b, r;
r = a + b;

Assembly (made up mnemonics):

add.f32.vec64 r0, a0, b0
add.f32.vec64 r1, a1, b1
add.f32.vec64 r2, a2, b2

@kiljacken

To add to the above, modern shader compilers (and other graphics workloads, e.g. raytracers) tend to use their lanes as pseudo "threads", masking for control flow, as most work done for graphics is exactly the same for several pixels/vertices/etc., so you get much better use out of the lanes even for scalar calculations.

It also allows you to much better abstract away the width of the hardware SIMD unit while keeping your code path looking mostly like plain scalar code.

@workingjubilee (Contributor)

We determined at the meeting yesterday that breaking the SPIR-V target that others were using is an anticipated but not desirable side effect, so we will figure out how to revert this in rustc and handle the problem where necessary on the cg_clif side.

Dylan-DPC-zz pushed a commit to Dylan-DPC-zz/rust that referenced this issue Mar 2, 2021
…limit, r=nagisa

Revert non-power-of-two vector restriction

Removes the power of two restriction from rustc. As discussed in rust-lang/portable-simd#63

r? @calebzulawski

cc @workingjubilee @thomcc
JohnTitor added a commit to JohnTitor/rust that referenced this issue Mar 3, 2021
…limit, r=nagisa

Revert non-power-of-two vector restriction

Removes the power of two restriction from rustc. As discussed in rust-lang/portable-simd#63

r? @calebzulawski

cc @workingjubilee @thomcc
@programmerjake (Member)

llvm is gaining code allowing vectors to be aligned to the element's alignment, rather than to the rounded-up size of the vector: https://lists.llvm.org/pipermail/llvm-dev/2021-December/154192.html

@workingjubilee (Contributor)

Doesn't seem to have much traction yet. This thread has a higher speculation-to-information ratio than I would like. How does LLVM currently lower length-3 vectors for targets like x86?

@programmerjake (Member)

on x86, llvm rounds up to a supported vector length (as it does for length 3), though sometimes it rounds down and handles the leftover elements separately (as it does for length 5 without AVX). Note that in all cases the memory write is split into a set of valid writes, because llvm isn't allowed to write to bytes that are out of range (there could be an atomic, unmapped memory, or some memory-mapped hardware there).
https://rust.godbolt.org/z/3xEKvfxhP
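
For context, a sketch of the kind of Rust that produces this lowering (assuming nightly portable_simd with lane counts 3 and 5 available; the actual godbolt source may differ):

#![feature(portable_simd)]
use core::simd::Simd;

// Element-wise adds over non-power-of-two vectors; llvm picks the lowering shown below.
pub fn add3(a: Simd<f32, 3>, b: Simd<f32, 3>) -> Simd<f32, 3> {
    a + b
}

pub fn add5(a: Simd<f32, 5>, b: Simd<f32, 5>) -> Simd<f32, 5> {
    a + b
}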

example::add3:
        mov     rax, rdi
        movaps  xmm0, xmmword ptr [rsi]
        addps   xmm0, xmmword ptr [rdx]
        extractps       dword ptr [rdi + 8], xmm0, 2
        movlps  qword ptr [rdi], xmm0
        ret

example::add5:
        mov     rax, rdi
        movss   xmm0, dword ptr [rsi + 16]
        movaps  xmm1, xmmword ptr [rsi]
        addps   xmm1, xmmword ptr [rdx]
        addss   xmm0, dword ptr [rdx + 16]
        movss   dword ptr [rdi + 16], xmm0
        movaps  xmmword ptr [rdi], xmm1
        ret

@workingjubilee (Contributor) commented Feb 10, 2022

re: some older objections:

Concretely, with this we'd have explicit always-padding/uninitialized/undefined fields in some of the simd types.

Alas, people are already writing code that exhibits this problem via use of #[repr(simd)]. In our task we are charged with trying to resolve these problems, not necessarily to punt them to (other) Rust programmers. We may choose to do so, but it will not make the problem go away.

The underlying intrinsics will still read these padding/uninit fields and mix them into the result in a way that might be problematic in a number of ways.

It seems LLVM already has an answer for this. It may not be the answer we want, but it is an answer.

What if there's an arch in the future that ends up modeling non-POT vectors and then doesn't map well to the "uses next power of two up"? I think this is unlikely, but I think we should shy away from defining behavior if we can avoid it.

Sure, the choice may be a mismatch for some obscure hardware architecture, but Rust already rejects many of those. It's not clear we can get away with ignoring non-pow2 vectors, as both SVE2 and RVV imply them, and SVE2 has begun shipping in consumer devices.

It seems to me like this wouldn't actually help vek's code migrate to core::simd very much anyway, would it? repr(simd) isn't stabilizing with core::simd?

It's not, but it's also not actually clear that we shouldn't, in fact, commit to there being a single "SIMD container type" in the language which defines a behavior for how all vectors are stored, read, written, and passed, as an underlying device, like Layout or Box. Yes, vek can choose to wrap it somehow.

Apparently this isn't actually guaranteed, but I think it would probably require an RFC to change (and plausibly would break a lot of unsafe code in the wild — or rather, turn currently-sound interfaces to unsound-if-stride-and-size-dont-match ones)

"Sound but only by depending on implementation details that are not promised" is unsound in point of fact. The question is whether it should be made sound or not.

@calebzulawski (Member, Author)

The all_lane_counts crate feature was added in #300. We still need additional tests (particularly for things like swizzles). Once we settle on layout (or make sure there are no issues with the current implementation), I think we can enable it in rustc.
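
For reference, the kind of swizzle coverage in question, as a minimal sketch assuming a nightly toolchain where a length-3 Simd type is available (the lane counts and values are just examples):

#![feature(portable_simd)]
use core::simd::{simd_swizzle, Simd};

fn main() {
    let v = Simd::<f32, 3>::from_array([1.0, 2.0, 3.0]);

    // Swizzle within a non-power-of-two vector: reverse the three lanes.
    let reversed = simd_swizzle!(v, [2, 1, 0]);
    assert_eq!(reversed.to_array(), [3.0, 2.0, 1.0]);

    // Swizzle that changes the lane count: widen to four lanes by repeating lane 0.
    let widened = simd_swizzle!(v, [0, 1, 2, 0]);
    assert_eq!(widened.to_array(), [1.0, 2.0, 3.0, 1.0]);
}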
