Portable packed SIMD vector types #2366

@gnzlbg
Contributor

gnzlbg commented Mar 22, 2018

This RFC extends Rust with portable packed SIMD vector types.

Rendered.


Acknowledgments

It would not have happened without the invaluable contributions of @alexcrichton and @BurntSushi, the excellent feedback from the many reviewers of the initial drafts (@sunfish, @nagisa, @eddyb, and @hsivonen, amongst others), all the people using stdsimd, filing issues, and submitting PRs, and @huonw's work on the rust-lang-nursery/simd crate.

* `{Add,Sub,Mul,Div,Rem}<RHS=Self,Output=Self>`,
`{Add,Sub,Mul,Div,Rem}Assign<RHS=Self>`: vertical (lane-wise) arithmetic
operations.


@petrochenkov

petrochenkov Mar 22, 2018

Contributor

It's probably worth describing how the behavior of these operators differs (or doesn't) from the same operators on non-vector arithmetic types - e.g. wrapping vs. panicking on overflow.
(Same applies to the bit-shift traits.)


@gnzlbg

gnzlbg Mar 22, 2018

Contributor

Thanks, I'm on this. @sunfish also requested to know the behavior of Div/Rem when dividing by zero, so I'll add that as well, and also mention that Div and Rem are provided as a convenience, but that little (or no) hardware actually offers instructions for these.


@gnzlbg

gnzlbg Mar 22, 2018

Contributor

From looking at rustc's implementation, this is the behavior these operations seem to have (I still need to add stdsimd tests to exercise them everywhere), and it is what LLVM says it guarantees:

  • on integer vectors, overflow results in the correct mathematical result modulo 2^n (see the exception below);

  • on floating-point vectors, overflow and division by zero have the same behavior as for f32 or f64 (+-INFINITY for overflow, and some NaN for division by zero, IIRC);

  • on integer vectors, we map division and rem directly to LLVM's {s,u}{div,rem}, which state that if any element is divided by zero, or the operation overflows, the result is undefined behavior.

So this is what we are currently doing. Now the question: can we do better?

  • For Add, Sub, Mul, Shl, Shr: we could use nuw/nsw to generate a poison value and get the same behavior as for the scalar integers (that is, a panic). We could then expose wrapping arithmetic via wrapping_{add,sub,mul,shl,shr}. This might be a bit of a performance foot-gun, so the alternative could be to just expose the wrapping methods instead.

  • For Div and Rem: we could expose these as unsafe fn {div, rem}(self, o: Self) -> Self methods with a precondition on division by zero, and provide an implementation in the traits that checks that no element is zero before performing the operation, panicking on precondition violation (see the sketch below). This is "expensive": one vector comparison + one horizontal reduction, and whether it is better than not providing these operations at all is debatable. Another alternative might be to not expose these operations.
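For concreteness, here is a hedged sketch of that checked Div in terms of this RFC's API (eq, splat, and any per the RFC; div_unchecked is the hypothetical unsafe primitive from the bullet above, not an existing function):

use core::ops::Div;

impl Div for u32x4 {
    type Output = Self;
    fn div(self, rhs: Self) -> Self {
        // One vector comparison plus one horizontal reduction, as noted above.
        if rhs.eq(u32x4::splat(0)).any() {
            panic!("attempt to divide by zero");
        }
        // Precondition checked above, so the unchecked operation is sound.
        unsafe { self.div_unchecked(rhs) }
    }
}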


@gnzlbg

gnzlbg Mar 22, 2018

Contributor

@petrochenkov I've gone ahead and updated the RFC with what I consider the "sane" arithmetic and shift operator semantics:

  • the arithmetic and shift traits behave like the operators for the scalars. That is, for integers they panic on overflow and division by zero, and for floats they produce the same results as the scalar operations.

  • the integer vectors also implement a set of wrapping_{add,sub,...} methods that perform modulo 2^n arithmetic, just like the scalar integers do

  • the integer vectors also implement two unsafe wrapping_{div,rem}_unchecked methods that perform no checks.
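A small illustration of these defaults (a sketch; u8x4 and splat per this RFC):

let x = u8x4::splat(255);
let y = u8x4::splat(1);
let w = x.wrapping_add(y); // always wraps lane-wise: [0, 0, 0, 0]
let z = x + y;             // panics when overflow checks are enabled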


@petrochenkov

petrochenkov Mar 22, 2018

Contributor

What's sane here is a hard question.
On one hand, the only reason for SIMD's existence is improving performance and uNxM + uNxM should ideally turn into a single hardware SIMD addition instruction, but checked operations can't be lowered into a single instruction.
On the other hand, consistency with scalar arithmetic, error detection and everything...


@gnzlbg

gnzlbg Mar 22, 2018

Contributor

I think that any SIMD proposal that does not mesh with the rest of the language is not going to be "good enough". So what I meant by "sane" was the same defaults as the rest of the std library, while at the same time offering easy-to-reach, more efficient operations with slightly different semantics. While the std library typically puts these in core::intrinsics, the portable SIMD vector types have them as inherent methods, making them more "first class".

IMO the SIMD std facilities have two jobs: protecting users from easy mistakes, and allowing them to write efficient code. These two jobs are in constant tension.

I don't think that unsafe operations should become the default, but this does not mean they should be hard to reach, either.


@hsivonen

hsivonen Mar 23, 2018

On the other hand, consistency with scalar arithmetic, error detection and everything...

Being consistent with scalar arithmetic would have overflow checking for + (etc.) only in debug mode, so release mode would still lower it to a single instruction.

I think being consistent with scalar math on this point would be OK, but making + check for overflow in release mode would probably not be OK in terms of performance expectations.

If overflow is checked in debug mode, wrapping_foo variants are needed. E.g. the Rust SIMD code in Firefox for checking if a UTF-16 string contains right-to-left characters currently relies on wrapping underflow for - on u16x8.
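For illustration, a hedged sketch of that kind of check (not Firefox's actual code; u16x8, splat, lt, and any per this RFC, and the 0x0590 threshold - the start of the right-to-left blocks - is an assumption):

fn may_contain_rtl(chunk: u16x8) -> bool {
    // A lane x satisfies x >= 0x0590 iff x - 0x0590 < 0xFA70 (= 0x10000 - 0x0590):
    // lanes below the threshold wrap around to large values and fail the
    // unsigned less-than, so this relies on `-` wrapping on underflow.
    (chunk - u16x8::splat(0x0590)).lt(u16x8::splat(0xFA70)).any()
}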


@gnzlbg

gnzlbg Mar 23, 2018

Contributor

@hsivonen I've added that overflow checking only happens in debug mode.


@rkruppe

rkruppe Mar 23, 2018

Contributor

FWIW "release/debug" is not the right distinction for overflow checks. They're a separate switch (-C overflow-checks=[on|off]) that can be flipped independently of debug info, optimization level, etc. -- it's just by default tied to debug_assertions which in turn is by default tied to release/debug.

returns `true` if all lanes compare `true`. It is equivalent to
`a.eq(b).all()`.
* `PartialOrd<Self>`: compares two vectors lexicographically.
* `From/Into` lossless casts between vectors with the same number of lanes.


@Centril

Centril Mar 22, 2018

Contributor

What about From conversions [{et}{wl}; {nl}] -> {et}{wl}x{nl} (arrays) and {et}{wl}^{nl} -> {et}{wl}x{nl} (tuples)?


@gnzlbg

gnzlbg Mar 22, 2018

Contributor

I haven't given these much thought until now, so I don't know if adding these is worth it. In any case, this is how they could be implemented:

  • For arrays, one could implement these on top of the {load,store}_unaligned APIs without any issues.

  • For tuples, it depends. IIRC the layout of tuple elements is unspecified, to allow for things like reordering tuple elements to reduce their size. There was, however, a variadic RFC built on tuples that required the layout of tuples to be fully specified, to allow for things like slicing them. If the layout of tuples with multiple elements of the same type were specified to be the same as that of an array/slice, then we could implement these for tuples with unaligned memory loads/stores as well. Otherwise, we'll probably need to copy the tuple elements into an array first, and then perform the memory loads (and vice-versa for stores).

Having said this, we don't need to add these now either. Those interested in trying these out could add them to stdsimd behind an unstable feature gate. If they turn out to be useful a mini-FCP might be enough to stabilize these later on.
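For the array case, a hedged sketch of such an impl on top of the unaligned-load API (the exact signature of load_unaligned is assumed here):

impl From<[f32; 4]> for f32x4 {
    fn from(array: [f32; 4]) -> Self {
        // Arrays have a fully specified layout, so this is sound.
        f32x4::load_unaligned(&array)
    }
}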


@Centril

Centril Mar 22, 2018

Contributor

Fair enough; on the question of "is it worth it", I'd like to note my opinion that it would help with ergonomics, so I'd be 👍 for those impls.

Could you perhaps note the possibility of adding these somewhere in the RFC (like the Unresolved questions) for future reference?

##### Semantics for floating-point numbers
* `eq`: yields `true` if both operands are not a `QNAN` and `self` is equal to


@rkruppe

rkruppe Mar 22, 2018

Contributor

Why quiet NaN? Surely any NaN should result in false?

And what does "self is equal to other" mean anyway? Is it a bitwise comparison? Then that doesn't match scalar float comparisons wrt positive/negative zero.


I don't think this section (and others like it) is needed or useful at all. It's enough to simply state that each lane is compared as in the scalar case. I assume that's what you're going for anyway (if not, that's a big problem IMO).
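For concreteness, scalar-consistent semantics would mean the following (a sketch; new, eq, and extract per this RFC):

let v = f32x4::new(1.0, std::f32::NAN, 3.0, 4.0);
let m = v.eq(v); // lane-wise ==
assert!(m.extract(0));  // 1.0 == 1.0
assert!(!m.extract(1)); // NaN != NaN, exactly as for scalar f32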


@eternaleye

eternaleye Mar 22, 2018

Signaling NaN may result in the CPU trapping to the OS without giving the program a choice. LLVM recently tightened up their assumptions regarding the default floating-point environment, so only qNaN should be generated anyway.


@rkruppe

rkruppe Mar 22, 2018

Contributor

@eternaleye That's multiple kinds of false

  • signalling NaNs do not trap. Operations on them raise a floating-point exception, which is nothing like language exceptions or traps. The specific exception sNaN operations raise (invalid operation) is also raised by sqrt(-1) and myriad other operations, and is handled by returning a qNaN.
  • some LLVM people have been confused about this too but they're now finally tidying up their stuff to explicitly not consider signalling NaNs UB-producing
  • the LangRef change you link does not say sNaN is not produced; it simply spells out some consequences of assuming the program does not change the default rounding mode, exception handling, etc. One consequence of that is that qNaN and sNaN are indistinguishable, so LLVM doesn't need to preserve signalling-ness, but you can still get an sNaN bit pattern in many ways.
  • finally, in any case this issue affects vectors and scalars equally so even if what you wrote was true it still would not explain the divergence here


@eternaleye

eternaleye Mar 22, 2018

@rkruppe Ah, sorry - thanks for clarifying that for me!


@gnzlbg

gnzlbg Mar 22, 2018

Contributor

Honestly, I just documented here what we are already doing, and QNAN is what LLVM guarantees. Why? I don't know. cc'ing @sunfish and @chandlerc here - they might know why LLVM guarantees this. While what you mention makes sense, I don't have a global view of what all the architectures that LLVM supports actually do.


@gnzlbg

gnzlbg Mar 22, 2018

Contributor

I don't think this section (and others like it) is needed or useful at all. It's enough to simply state that each lane is compared as in the scalar case.

I need to check whether the way we are comparing floating-point vectors is actually the same way in which we are handling the scalar case. If not, then arguably we should, but maybe we can't for some reason. From an implementation point-of-view, the underlying LLVM intrinsics work on both scalars and vectors, so we ought to be doing the exact same thing for vectors as for scalars, but as mentioned, I haven't checked.

In any case, defining these to have the same semantics as the scalar operations makes sense.


@rkruppe

rkruppe Mar 22, 2018

Contributor

As stated earlier, quite a few LLVM developers (and consequently the LangRef) were really confused about sNaN for a long time. The reference to QNAN specifically in the LangRef probably comes from that confusion. Certainly there is no remotely justifiable reason to treat sNaN differently here, and IEEE 754 does not do so.


@gnzlbg

gnzlbg Mar 23, 2018

Contributor

@rkruppe

I don't think this section (and others like it)

Could you ping me in the other sections where wording like this is unnecessary?

than two times the number of lanes in the input vector. The length of the
resulting vector equals the number of indices provided.
Given a vector with `N` lanes, the indices in range `[0, N)` refer to the `N` elements in the vector. In the two-vector version, the indices in range `[N, 2*N)` refer to elements in the second vector.


@rkruppe

rkruppe Mar 22, 2018

Contributor

What happens on out-of-range indices? Compile time error?


@gnzlbg

gnzlbg Mar 22, 2018

Contributor

Yes, out-of-range indices are a compile-time error. Right now they are a monomorphization-time error, but before stabilization we should turn them into MIR typeck errors.
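For illustration, both forms with in-range indices (a sketch):

let a = i32x4::new(0, 1, 2, 3);
let b = i32x4::new(4, 5, 6, 7);
let r = shuffle!(a, [3, 2, 1, 0]);    // one-vector: yields [3, 2, 1, 0]
let s = shuffle!(a, b, [0, 4, 1, 5]); // two-vector: indices 4..8 pick from b; yields [0, 4, 1, 5]
// shuffle!(a, [4, 0, 0, 0]) would be rejected at compile time, as described above.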

this in one form or another, but if one future backend does not, this RFC can be
implemented on top of the architecture specific types.
## Zero-overhead requirement for backends


@rkruppe

rkruppe Mar 22, 2018

Contributor

The compiler engineer in me is a bit worried by the tone of this section. Obviously we want to make nice source code execute as efficiently as possible, but this is also true of the rest of the language and, in the end, very much best-effort. It is simply not possible to guarantee optimality in any useful sense (cf. the "full employment theorems" for compiler writers 😄).


@gnzlbg

gnzlbg Mar 22, 2018

Contributor

Shhh. We all want to be employed full time on Rust. Don't spoil this!


Kidding aside: yes, this is all best effort. The good thing is that if there are compiler backend issues that must be resolved, in most cases we can resolve them in Rust - either in the compiler or the library. Sometimes this might not be enough, but at least for stdsimd our way has always been to diagnose issues first, file backend bug reports afterwards, and in the meantime add library or rustc workarounds. As the backends and rustc have been fixed (often by ourselves), we have consistently been able to remove library workarounds.

I initially wanted to write that the best we can do here is strive for optimality without ever achieving it, but I actually think that's not true. For many trivial SIMD operations we can often reliably generate optimal code. For non-trivial operations we just have to keep trying. We will obviously never be perfect in all cases, but being good enough in most cases suffices.


@gnzlbg

gnzlbg Mar 23, 2018

Contributor

@rkruppe I've reworded this section

called scalable vectors. These include, amongst others, the NEC SX, ARM SVE,
and the RISC-V vector extension. These architectures have traditionally relied on
auto-vectorization combined with support for explicit vectorization annotations,
but newer architectures like ARM SVE and RISC-V introduce explicit vectorization


@rkruppe

rkruppe Mar 22, 2018

Contributor

SVE, sure, but RISC-V? Leaving aside that neither the vector extension nor compiler support for it is anywhere near finished, the architects have gone on record stating that they vastly prefer auto-vectorization for targeting their architecture.


@rkruppe

rkruppe Mar 22, 2018

Contributor

(I actually expect that intrinsics will wind up existing, but I'm still confused by this section implying they already exist. Although, for that matter, there are some aspects of RISC-V that might make assembly more attractive than intrinsics, but whatever.)


@gnzlbg

gnzlbg Mar 22, 2018

Contributor

the architects have gone on record stating that they vastly prefer auto-vectorization for targeting their architecture.

This is true, but it is also worth remarking 1) that ARM SVE architects prefer auto-vectorization as well, 2) that they ended up providing explicit intrinsics that map to the ISA instructions anyway, and 3) that, at least compared to "packed" SIMD vector models, the RISC-V vector ISA is not that far away from the ARM SVE one.

In any case, time will tell. This paragraph does not suggest that the current programming model applies 1:1 to "scalable" vectors in any way, but rather that scalable vectors would require a slightly different programming model, and that this model does not really interfere that much with non-scalable vectors. Nothing more, nothing less.


@rkruppe

rkruppe Mar 22, 2018

Contributor

Again my main question/confusion here is the implication someone already went and defined intrinsics for the RISC-V vector extension.


@gnzlbg

gnzlbg Mar 23, 2018

Contributor

Again my main question/confusion here is the implication someone already went and defined intrinsics for the RISC-V vector extension.

I'll reword this section then. Technically, RISC-V does not even have SIMD support yet - the second revision of the vector extension ISA is only a proposal that is still being iterated on in the workshops.

The `shuffle!` macro returns a new vector that contains a shuffle of the elements in
one or two input vectors. That is, there are two versions:
* `shuffle!(vec, [indices...])`: one-vector version


@rkruppe

rkruppe Mar 22, 2018

Contributor

What form do the indices take? Can they be arbitrary constant expressions, or do they have to be literals? Obviously we want the former but this should be mentioned, since a natural way to implement a macro would be to match on literals.


@gnzlbg

gnzlbg Mar 23, 2018

Contributor

Can they be arbitrary constant expressions, or do they have to be literals?

I'll mention that arbitrary constant expressions are allowed.

@scottmcm


Member

scottmcm commented Mar 23, 2018

Minor request: Please define the m types in the guide explanation. I think I know what they are, and [iuf] are obvious, but since there's no m16 type in Rust, it would be good to describe what it is, why it exists, and why there isn't just a boolx4.

What are the layout guarantees of these types? Are they repr(C)? Could they offer .as_array(&self) -> &[T;N] (or even Deref to arrays)?

@gnzlbg


Contributor

gnzlbg commented Mar 23, 2018

@rkruppe and @hsivonen I think this commit incorporates all of your feedback, but it would be great if you could review the changes in case I missed or misinterpreted something.

@scottmcm I've added more documentation about vector masks to the guide-level explanation in this commit. Let me know what you think.

@gnzlbg


Contributor

gnzlbg commented Mar 23, 2018

@scottmcm

What are the layout guarantees of these types? Are they repr(C)? Could they offer .as_array(&self) -> &[T;N] (or even Deref to arrays)?

Those are really good questions. The short answer is that, as the ABI section mentions, their layout is unspecified.

The ABI section does, however, explicitly forbid their use in extern functions, which means that right now they are definitely not repr(C).

Also, neither this RFC nor AFAICT the std::arch RFC defines what the behavior of this is:

union A {
    data: [f32; 4],
    vec: f32x4,
}
// Type-pun the vector's bytes as an array through the union:
let x: [f32; 4] = unsafe { A { vec: f32x4::splat(0.) }.data };

so for all practical purposes this is also currently unspecified behavior (triple-check: @alexcrichton is this statement correct or is this behavior specified somewhere?).

The only thing this RFC (and IIRC the std::arch RFC) specifies is the behavior of mem::transmute from one vector type to another vector type. For everything else one needs to use the memory load/store API ({load,store}_{aligned,unaligned}{,_unchecked}).

A future RFC specifying the layout and ABI of these types will certainly have to properly answer all these questions. Right now, there are just a couple of things that could be done differently ABI-wise and that might produce a better ABI than what we currently have. So the purpose of leaving the ABI unspecified in these initial RFCs is to not limit those efforts in any way.

# Unresolved questions
[unresolved]: #unresolved-questions
### Interaction with scalable vectors


@rkruppe

rkruppe Mar 23, 2018

Contributor

Nit but I think "scalable" is very much an Arm-ism here. The more neutral and common term is probably "Cray-style".


@gnzlbg

gnzlbg Mar 23, 2018

Contributor

Yes I agree. In the introduction I mention:

packed: means that these vectors have a compile-time fixed size. It is the opposite of scalable or "Cray vectors", which are SIMD vector types with a dynamic size, that is, whose size is only known at run-time.

and I consistently used the term "scalable" throughout the RFC because I thought "Cray vectors" would be a less approachable term. I don't like the term "scalable" much either TBH.

The floating-point semantics follow the semantics of `min` and `max` for the
scalar `f32` and `f64` types. That is:
If either operand is a `NaN`, returns the other non-NaN operand. Returns `NaN`


@rkruppe

rkruppe Mar 23, 2018

Contributor

@gnzlbg This section also unnecessarily describes behavior that matches the scalar case.

@alexcrichton


Member

alexcrichton commented Mar 23, 2018

@gnzlbg

so for all practical purposes this is also currently unspecified behavior (triple-check: @alexcrichton is this statement correct or is this behavior specified somewhere?).

I'd personally expect such union-using behavior to work and operate as intended. I think, like the non-portable SIMD types, we'll want to at least define how large these are for unsafe operations (aka: is it safe to transmute f32x4 to __m128?).

@gnzlbg


Contributor

gnzlbg commented Mar 23, 2018

I'd personally expect such union-using behavior to work and operate as intended. I think, like the non-portable SIMD types, we'll want to at least define how large these are for unsafe operations (aka: is it safe to transmute f32x4 to __m128?).

I think there are a couple of issues.

The first one is whether one can transmute a vector type into an equally-sized non-vector type like an array or a tuple. The std::arch RFC specified that the vector types are Sized and specified their exact size, and this RFC does so as well. So to me the answer is a clear yes.

The other issue is whether these types are layout compatible with arrays and tuples. That is, whether we guarantee the following assert to pass everywhere:

union A { arr: [f32; 4],  vec: f32x4, tup: (f32,f32,f32,f32) }
let x: [f32; 4] = unsafe { A { vec: f32x4::new(0., 1., 2., 3.) }.arr };
assert_eq!(x[2], 2.);  // OK?
let y: (f32,f32,f32,f32) = unsafe { A { vec: f32x4::new(0., 1., 2., 3.) }.tup };
assert_eq!(y.2, 2.);  // OK?

These two cases currently work on all platforms, but this is incidental. For example, the layout of tuples is unspecified, so we can't guarantee this for tuples AFAICT.

The layout of arrays is specified so we could do this for arrays. For that, we probably need to write down that a vector type has the same layout as an array or something like this.


Also, it is worth pointing out that transmute is a semantic memcpy. While transmuting between vector types is zero-overhead in some cases, transmuting from vector to array and vice-versa always requires an aligned/unaligned memory load/store. That's OK, but worth keeping in mind.

* `indices` must be a `const` array of type `[usize; N]` where `N` is any
power-of-two in range `(0, 2 * {vec,vec0,vec1}::lanes()]`.
* the values of `indices` must be in range `[0, N)` for the one-vector version,
and in range `[0, 2N)` for the two-vector version.


@fbstj

fbstj Mar 23, 2018

is this precondition statically checked when possible? is it dynamically enforced when necessary? if not, what are the results of indexing outside the range?

EDIT: oh, answered here


@gnzlbg

gnzlbg Mar 23, 2018

Contributor

The line below this one states:

On precondition violation a type error is produced.

The current implementation produces a monomorphization-time error, but that can be fixed.

@clarcharr


Contributor

clarcharr commented Mar 26, 2018

Curious if this RFC might provide enough of a justification for adding f16 to the language. I know that the half crate exists, but it's still harder to use considering how f16 literals can't actually be written.

@clarcharr


Contributor

clarcharr commented Mar 26, 2018

API question: should Wrapping<u32x4> and similar be allowed?

@gnzlbg


Contributor

gnzlbg commented Mar 26, 2018

Curious if this RFC might provide enough of a justification for adding f16 to the language. I know that the half crate exists, but it's still harder to use considering how f16 literals can't actually be written.

This RFC forwards that option to future RFCs proposing to add more std::arch intrinsics. Note, however, that while many architectures (x86 included) have intrinsics for working on vectors of f16s, most of them do not support operating on single f16 values.

EDIT: So just because some intrinsics allow manipulating vectors of f16s, that might not be enough motivation for adding f16 as its own free-standing type.

API question: should Wrapping<u32x4> and similar be allowed?

I think so. Given the change towards "safety" by default, allowing Wrapping<{i,u}{w}x{N}> makes sense.
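If allowed, the standard-library impls might look like this (a sketch built on the wrapping_* methods proposed in this RFC):

use core::num::Wrapping;
use core::ops::Add;

impl Add for Wrapping<u32x4> {
    type Output = Self;
    fn add(self, rhs: Self) -> Self {
        // Always lane-wise modulo 2^32, regardless of overflow-check settings.
        Wrapping(self.0.wrapping_add(rhs.0))
    }
}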

`pub unsafe fn wrapping_{div,rem}_unchecked(self, Self) -> Self`
methods which, on overflow, produce the correct mathematical result modulo `2^n`.
If any of the vector elements is divided by zero the behavior is undefined.


@clarcharr

clarcharr Mar 26, 2018

Contributor

Is there a particular reason to stray away from the existing overflow behaviour for SIMD types?


@gnzlbg

gnzlbg Mar 26, 2018

Contributor

What do you mean?


@clarcharr

clarcharr Mar 26, 2018

Contributor

There isn't a wrapping_div_unchecked for primitives; why have it for SIMD types?


@gnzlbg

gnzlbg Mar 26, 2018

Contributor

The methods for primitives are in core::intrinsics (look for _unchecked).
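For reference, a minimal illustration of the scalar counterpart (nightly-only; core::intrinsics::unchecked_div lives behind the core_intrinsics feature):

#![feature(core_intrinsics)]

fn main() {
    // Undefined behavior if the divisor is zero.
    let q = unsafe { core::intrinsics::unchecked_div(10u32, 2u32) };
    assert_eq!(q, 5);
}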

`max` and `min` reductions clash with the vertical `a.max(b)` operation, and
therefore get a `h`-prefix.
An alternative would be to prefix all horizontal operations with, for example,


@clarcharr

clarcharr Mar 26, 2018

Contributor

I'm not very fond of the prefix here; it seems easy to gloss over. Perhaps min_lane and max_lane might be best?

While horizontal/vertical terminology are good for explaining methods, I'd rather not include it in their names.


@gnzlbg

gnzlbg Mar 27, 2018

Contributor

Horizontal is pretty much the term that everybody doing SIMD actually uses. The ISAs call them horizontal operations, the intrinsics use h to denote them, that's the keyword that the literature uses, etc.

What I don't like here is the inconsistency of some horizontal operations having a prefix and some not having it. Eigen3 uses {min,max}Coeff and the C++ STL <algorithm> header uses {min,max}_element for the horizontal min/max operations, so maybe we could use {min,max}_element here as well.
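The distinction in code (a sketch; max is the vertical operation and hmax the h-prefixed horizontal reduction currently proposed by this RFC):

let a = i32x4::new(1, 5, 3, 7);
let b = i32x4::new(4, 2, 6, 0);
let v = a.max(b); // vertical: lane-wise, yields [4, 5, 6, 7]
let m = a.hmax(); // horizontal: reduces across lanes, yields 7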


@tanriol

tanriol Mar 30, 2018

You explain them as reductions, so something like reduce_to_max can also be used (which can also work for sum/product if desired).

@ralfbiedert


ralfbiedert commented Aug 6, 2018

@RalfJung the method looks really interesting for use inside of faster. From a user perspective using std::simd I'd hope I can avoid unsafe for "basic stuff" (such as adding two slices).

@ralfbiedert


ralfbiedert commented Aug 9, 2018

Quick update about the issue of safe-but-fast vector loading: I have now experimented a bit (see the crate simd_aligned) and realized it is not as bad as I thought:

  • Going from &[f32x4] to &[f32] is much nicer than trying the other way (in retrospect kind of obvious) and feels like "the right way" of doing it if I look at it from an application level.
  • It only needs two relatively isolated unsafe calls to create a number of safe abstractions on top.
  • These abstractions can easily be put in a thin 3rd party crate.

While there could still be a minimal packed_as_flat(data: &[Simd]) -> &[Simd::Element] convenience function in std::simd, I now wouldn't mind if there is not.
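For reference, a hedged sketch of that convenience function and the isolated unsafe it needs (this direction is always valid because align_of::<f32x4>() >= align_of::<f32>()):

use core::slice;

fn packed_as_flat(data: &[f32x4]) -> &[f32] {
    // 4 lanes of f32 per vector; the output borrows from the input slice.
    unsafe { slice::from_raw_parts(data.as_ptr() as *const f32, data.len() * 4) }
}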

@gnzlbg


Contributor

gnzlbg commented Aug 9, 2018

Just keep in mind that the alignment for f32x4 differs from f32x8 which differs from f32x16.
So &[f32x4] -> &[f32] -> &[f32x8] won't work unless you use unaligned loads.

One day we will be able to just write Vec<T, AlignedAlloc<Global, U32>> and get a vector that uses the Global allocator but rounds up the Layout's align of T to a 32-byte boundary (or whatever you need); until then, one will just need to be careful.

@gnzlbg


Contributor

gnzlbg commented Aug 28, 2018

So I just released version 0.2 of packed_simd with some major changes that will need to be updated in this RFC:

  • a single Simd<[T; N]> type - all types exposed in this RFC are still available, but are just type aliases, e.g., f32x4 is an alias for Simd<[f32; 4]>.
  • vectors of {i,u,m}size - these are required to handle operations on packed vectors of pointers (e.g. like offset on all elements of a vector). Type aliases for these are provided, but are not very readable, e.g., isizex8.
  • support for pointer types, e.g, Simd<[*const *mut f32; 4]> - the vectors of pointers support common operations like offset, add, etc. as well as (masked) gather and scatter operations.
  • some renames here and there (::from_slice_unaligned, ::write_to_slice_unaligned_unchecked, shuffle! does not accept a single vector anymore, a permute! macro that accepts a single vector has been added)
  • other new features (rotates, permute_dyn, more mathematical functions, ...)
  • packed vectors do not implement PartialOrd/Ord anymore - instead, there is a method that returns a wrapper type that implements a lexicographical order, and implements PartialOrd/Ord.
  • From<[T; N]> for Simd<[T; N]> implementations for easily converting between vectors and arrays
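Putting the alias and From bullets above together (a sketch; exact paths and trait availability assumed):

use packed_simd::{f32x4, Simd};

let a: Simd<[f32; 4]> = f32x4::splat(1.0); // f32x4 is just an alias
let b = f32x4::from([0.0, 1.0, 2.0, 3.0]); // the new From<[T; N]> impl
let c = a + b;                             // currently wrapping by default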

The new release also comes with:

  • more examples: all examples from the old simd crate have been ported to packed_simd, and two ISPC examples have been ported as well, showing how to use the library in combination with rayon
  • the floating-point vector math support in LLVM turned out to deliver very bad performance (all calls were turned into scalar calls). The library now supports using a short vector math library called sleef for this, which significantly improves performance for some workloads (up to ~10x). Ideally it will support Intel SVML, libmvec, and similar libraries in the future. There is also a project to re-implement sleef in Rust and avoid the "pain" of having to link a C library.

Missing RFC features in packed_simd:

  • saturating arithmetic,
  • checked arithmetic,
  • default arithmetic being checked - currently wrapping arithmetic is the default,
  • using as for numeric casts - currently a .cast() method is used for performing lane-wise numeric casts (equivalent to as per vector lane)

Open questions:

  • Currently, packed vectors of pointer types are only provided for lengths 2, 4, and 8. The largest size for a vector of 8 packed pointers is 512 bits. The library exposes vectors up to 512 bits wide with up to 64 elements, e.g., i8x64. A vector of 64 pointers would, however, be 4096 bits wide. It feels weird not to support vectors of pointers up to the maximal number of elements that the other vectors of the library expose, but at the same time, a 4096-bit vector is a very big vector.
  • Implementing as for vector types might not be a great idea. The .cast() method is a solution, but RFC XXX discusses finer-grained solutions that we could implement. I'd prefer to move the as part of this RFC to the unresolved questions section until that RFC is resolved, and then re-evaluate.
@rkruppe


Contributor

rkruppe commented Aug 28, 2018

@gnzlbg

permute!

permute_dyn

These aren't permutations, a permutation can't duplicate or drop elements.

@gnzlbg


Contributor

gnzlbg commented Aug 28, 2018

These aren't permutations, a permutation can't duplicate or drop elements.

Yeah, naming is hard. Permutations with repeating elements are called n-tuples, where n is the length of the tuple, and they allow dropping elements (https://en.wikipedia.org/wiki/Permutation#Permutations_with_repetition).

The only advantage of the permute names is that they sound familiar to those that know the vperm... family of Intel instructions. I'd like to find better names for these. Combinations with multiple elements are called multicombinations, so maybe we could use multipermutations here. I don't know. Feedback is welcome. Worst case we can just go back to using shuffle! for both cases, and call the dynamic variants shuffle_dyn(indices) and shuffle2_dyn(vec2, indices).

@Diggsey


Contributor

Diggsey commented Aug 28, 2018

I've always known that operation as swizzle

(https://www.khronos.org/opengl/wiki/Data_Type_(GLSL)#Swizzling)

@gnzlbg


Contributor

gnzlbg commented Aug 29, 2018

I've created a thread to discuss the naming issue: rust-lang-nursery/packed_simd#102

Please submit a proposal as a comment, or vote on the existing ones (you can also discuss the names there).

@alexcrichton


Member

alexcrichton commented Aug 29, 2018

@gnzlbg thanks for the update! In terms of stabilizability I think it'd be easiest to not stabilize Simd<T>, and I just wanted to confirm: is it possible with the crate currently to stabilize f32x4 and such without stabilizing Simd<T>?

@gnzlbg


Contributor

gnzlbg commented Aug 29, 2018

In terms of stabilizability I think it'd be easiest to not stabilize Simd<T>, and I just wanted to confirm: is it possible with the crate currently to stabilize f32x4 and such without stabilizing Simd<T>?

I don't know the exact rules for visibility, but maybe we could do something like:

mod __rustc_private {
    pub struct Simd<T>(/* ... */);
}
#[stable(...)]
pub type f32x4 = Simd<[f32; 4]>;

#[unstable(packed_simd_generic)]
pub type Simd<T> = __rustc_private::Simd<T>;

What we couldn't easily do without significant effort (and maybe not even in a backwards-compatible way) is use newtypes for f32x4 and the other type aliases to completely hide Simd<T>, because dereferencing vectors of pointers generically kind of relies on all of these types being the same generic type: Simd<[*mut *mut f32; 4]> => Simd<[*mut f32; 4]> => Simd<[f32; 4]>. It probably could be implemented, but without having thought a lot about this, I have a feeling that it would complicate the implementation quite a bit.

@gnzlbg


Contributor

gnzlbg commented Aug 29, 2018

Do the stable and unstable attributes work on impls? If so, what we could do is limit which arrays Simd<T> accepts, so that if we wanted to e.g. stabilize f32x4 but not f32x8, we could stabilize the impl of Simd<[f32; 4]> but not the impl of Simd<[f32; 8]>.

@alexcrichton


Member

alexcrichton commented Aug 30, 2018

Ah, unfortunately a type annotation can subvert stability (users would still have some amount of access to Simd<T>). Additionally, unstable attributes don't work on trait impls, but they do work on inherent methods of normal impls.

@gnzlbg


Contributor

gnzlbg commented Aug 30, 2018

Additionally unstable attributes don't work on trait impls, but they work on inherent methods of normal impls.

So this would also be an option. We would basically expose Simd<T> and stabilize methods for some Ts, like Simd<[f32; 4]>. This means people would be able to use Simd<[f32; 8]> e.g. in a type alias, but since ::new wouldn't be stabilized for it, they could not really construct any object of that type.
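Roughly, inside core this might look like the following (attribute payloads abbreviated and hypothetical):

impl Simd<[f32; 4]> {
    #[stable(feature = "simd_f32x4", since = "...")]
    pub fn new(x0: f32, x1: f32, x2: f32, x3: f32) -> Self { /* ... */ }
}

impl Simd<[f32; 8]> {
    // Unreachable from stable code as long as this stays unstable.
    #[unstable(feature = "simd_f32x8", issue = "...")]
    pub fn new(x0: f32, x1: f32, x2: f32, x3: f32,
               x4: f32, x5: f32, x6: f32, x7: f32) -> Self { /* ... */ }
}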

@alexcrichton


Member

alexcrichton commented Aug 30, 2018

@gnzlbg ok cool! I wonder, would it be possible to use #[repr(transparent)] to make them newtype wrappers? If that's the case then we could get a rock-solid guarantee about the stable API

@gnzlbg


Contributor

gnzlbg commented Aug 30, 2018

ok cool! I wonder, would it be possible to use #[repr(transparent)] to make them newtype wrappers? If that's the case then we could get a rock-solid guarantee about the stable API

The problem is that if we make them newtypes, then f32x4(Simd<[f32; 4]>) is not Simd<[f32; 4]>. This can be problematic: for example, if you have a Simd<[*mut f32; 4]> and load through it, you get a Simd<[f32; 4]>, which is a different type than f32x4 (the newtype around Simd<[f32; 4]>). I can try to explore in a branch whether the newtype approach can work. There might be other problems with the generic mask methods, the shuffle macro (where Simd<T> is accessed differently than NewType(Simd<T>)), etc. But maybe this all works out; one would just need to try it.

@alexcrichton


Member

alexcrichton commented Sep 3, 2018

@gnzlbg ah ok. I do suspect though that trying to work with Simd<[T; N]> will likely effectively block any stabilization until const generics is stabilized at least, and it itself will likely be pretty contentious.

@gnzlbg


Contributor

gnzlbg commented Sep 3, 2018

I do suspect though that trying to work with Simd<[T; N]> will likely effectively block any stabilization until const generics is stabilized at least, and it itself will likely be pretty contentious

I understand that such a type definition could be contentious, but it is very common in the ecosystem (e.g. SmallVec<[T; N]>, ArrayVec<[T; N]>, GenericArray<[T; N]>). Also, I don't think const generics will improve the API or the implementation of Simd<[T; N]> much... if at all... in the same way that one doesn't need const generics to use SmallVec.

@alexcrichton


Member

alexcrichton commented Sep 4, 2018

@gnzlbg and I had some discussion on IRC (pasted below), but unfortunately not a ton of conclusions :(

11:10 < gnzlbg> acrichto: the only improvement I can think of that const generics 
                would add to the ppv RFC is that instead of Simd<[T; N]> we will be 
                able to write Simd<T, N>
11:11 < gnzlbg> but that will need a lot beyond just woboats const generics proposal
11:11 <@acrichto> gnzlbg: right yeah, but it's also sort of unknown waters for simd 
                  and languages, I'm not sure if we'd be able to put it in libstd
11:11 <@acrichto> all I remember is like dozens of threads about this
11:11 <@acrichto> and would have to scour them again for arguments one way or another
11:11 < gnzlbg> my best proposal here is to expose Simd<[T; N]> in nightly only via 
                a ::__rustc_Simd<T>
11:12 < gnzlbg> the second best is to use new types to hide that
11:12 <@acrichto> I don't think the way we treat typedefs can hide the generic param
11:12 <@acrichto> in that it's just the same type to rustc
11:12 <@acrichto> and can affect things like inference and method resolution and such
11:12 < gnzlbg> yes, the error will say ::__rustc_Simd<...>
11:13 < gnzlbg> typedefs go away, so essentially we would be stabilizing a 
                __rustc_Simd<T> type, but making it illegal to use directly
11:13 <@acrichto> yeah but I'm not sure we could even do that
11:13 <@acrichto> like you could probably still use it somehow
11:13 <@acrichto> in a stable fashion
11:14 < gnzlbg> i mean, we can stabilize Simd<T>
11:14 <@acrichto> in that our stability never was rock solid really, it's always 
                  best effort
11:14 < gnzlbg> its just a type
11:14 < gnzlbg> without methods
11:14 < gnzlbg> like Simd<T> it cannot do anything
11:14 <@acrichto> oh sure, but that's not what I'm worried about
11:14 <@acrichto> I'm worried about the interactions with the type parameter
11:14 <@acrichto> across blocks and such
11:14 < gnzlbg> the type is actually Simd<T: __rustc_Simd_array>
11:14 <@acrichto> I don't really have a great articulation
11:14 < gnzlbg> where the __rustc_Simd_array is a sealed trait
11:14 < gnzlbg> so its not exposed
11:14 <@acrichto> I just know that not having generics is super simple
11:14 <@acrichto> if verbose
11:14 < gnzlbg> ah yes
11:14 <@acrichto> and is much easier to stomach stabilizing
11:14 < gnzlbg> yeah 
11:15 < gnzlbg> I agree
11:15 < gnzlbg> i think that ideally we would only stabilize the 128-bit wide types
11:15 < gnzlbg> and see how that goes first
11:16 <@acrichto> we could try?
11:16 < gnzlbg> the 8-64 bit wide, and 256-512 bit wide types could be stabilized 
                later, but with 128-bit portable simd one can do a lot already
11:16 <@acrichto> I can't really personally render a judgement on whether a type 
                  parameter will work
11:16 <@acrichto> I'm just trying to act as a proxy for all the folks who commented 
                  on historical threads
11:16 <@acrichto> where I thought there was enough pushback to generics to not do it
11:16 <@acrichto> I could be misremembering though
11:16 < gnzlbg> i recall something like that
11:17 < gnzlbg> for this, the element type and length need to be part of the type if 
                we ever want to support packed vectors of pointers 
11:17 < gnzlbg> but if we are not stabilizing that, then its not necessary
11:17 < gnzlbg> and I'd try to avoid that initially
11:18 <@acrichto> I could see that yeah
11:22 < gnzlbg> acrichto: I'll try to make the current type aliases newtypes. It 
                might not be that much work - but it would be nice if the libs team 
                could discuss this
11:22 < gnzlbg> there are a couple of libraries in the ecosystem that do this 
                (SmallVec, ArrayVec, etc. come to mind)
11:22 < gnzlbg> and just because Simd<T> is generic, it does not mean that it 
                accepts every T
11:23 < gnzlbg> Simd<T: sealed_trait> lets us control exactly which Ts it can 
                accept, and stabilize these independently of one another
11:23 < gnzlbg> So Simd<[f32; 4]> could be stabilized without stabilizing Simd<[f32; 
                8]> etc.
11:24 < gnzlbg> I don't know how that would interact with type inference though
11:24 < gnzlbg> and I see that it could be problematic
11:35 <@acrichto> gnzlbg: from an api-design perspective in a vacuum I think we'd 
                  all agree that generics are the way to go
11:35 <@acrichto> (no way we'd ever say Vec<T> shouldn't exist)
11:35 <@acrichto> but SIMD I thought was a special case
11:36 <@acrichto> in the sense that the libs team probably wouldn't have many 
                  thoughts on this
11:36 <@acrichto> other than "let's consult those using lots of SIMD"
11:36 < gnzlbg> i'm open to better suggestions
11:36 <@acrichto> and it may be the case now that with the platform apis the generics 
                  are fine
11:37 < gnzlbg> as in, we didn't use generics at first in std::simd, it just turned 
                out that I didn't figure out a way to implement some features 
                without them
11:37 < gnzlbg> and I still don't know how to implement them without them
11:37 < gnzlbg> but maybe others do / will
11:37 < gnzlbg> or in other words, the library doesn't use generics because it makes 
                the api nicer, but because afaik it currently cannot be implemented 
                without them
11:39 < gnzlbg> i wish i did not have to use generics actually, because that required 
                re-implementing the whole library :/
11:39 < gnzlbg> this is why i would be fine with hiding generics in the API
11:45 < rkruppe> gnzlbg: i think we talked about it before but remind me, what were 
                 these features that you couldn't figure out how to do without 
                 generics?
11:47 < gnzlbg> rkruppe: packed vectors of pointers
11:48 < rkruppe> going from <n x T*> to <n x T> or what
11:48 < gnzlbg> the other way around
11:48 < gnzlbg> ah no yeah, that way
11:48 < gnzlbg> <n x T*> to <n x T>, <n x T**> to <n x T*>, etc.
11:49 < rkruppe> that screams "associated type" at me
11:49 < gnzlbg> when going from <n x T**> to <n x T*> what do you put in the 
                associated type ?
11:50 < rkruppe> `<VectorType::Element as Deref>::Target`. well not literally Deref 
                 because raw pointers, but like Deref
11:51 < rkruppe> i.e. a helper trait implemented on pointers such that `<*const 
                 T>::Pointee = T`
11:51 < gnzlbg> so yes, kind of, since the result is also a vector
11:52 < gnzlbg> that's pretty much how it is currently implemented
11:52 < rkruppe> to be clear we're talking about the signature of gather?
11:52 < gnzlbg> yes, that type is used in that signature
11:53 < rkruppe> oh hm i think i see the problem
11:53 < gnzlbg> the thing is now you need to implement this trait for some concrete 
                types
11:53 < gnzlbg> *concrete vector types
11:53 < gnzlbg> and you have *mut f32 and *const f32 which deref into f32x2, f32x4, 
                ... depending on the vector length
11:54 < gnzlbg> so all of this are different types and can have different impls, so 
                far so good
11:54 < gnzlbg> the leafs are concrete types, so a pf32x4 will dereference into a 
                f32x4, which is nice
11:55 < gnzlbg> but now we wanted to also support dereferencing a ppf32x4 into a 
                pf32x4
11:55 < gnzlbg> (I think sunfish eddyb you and maybe some others talked about this, 
                that supporting vectors of pointers to pointers would be nice)
11:56 < gnzlbg> and that gives you an infinite amount of vector types that you have 
                to support
11:56 < rkruppe> `ppf32x4` is `Simd<[*const *const f32; 4]>`?
11:56 < gnzlbg> for example
11:56 < gnzlbg> but we can have `Simd<[*mut *const f32; 4]>` (all of these 
                combinations work
11:56 < rkruppe> i was actually thinking `ptrx4<T>` where T can be for example `f32` 
                 or `*mut *const f32`
11:57 < rkruppe> well maybe two variants of that type if you want both mut and const 
                 at the outer layer
11:57 < gnzlbg> i'd have to check the implementation but I think what I ended up 
                doing is having a x4<T> vector type where T can be anything
11:57 < gnzlbg> so you can have x4<*mut *const *mut *const f32>
11:58 < gnzlbg> this is all an internal implementation detail anyways, none of this 
                can be observed from outside
11:59 < rkruppe> yes that is how i understood the current solution
11:59 < gnzlbg> https://github.com/rust-lang-nursery/packed_simd/blob/master/src/vPtr.rs#L5
11:59 < gnzlbg> those types are actually exposed
11:59 < rkruppe> i am just trying to understand why it was necessary
12:00 < gnzlbg> and build on the x4 types (which are internal)
12:00 < gnzlbg> yeah, rkruppe so it was necessary to implement vector of pointers 
                for all pointers
12:00 < gnzlbg> not only for pointers to a non-pointer value, but also for pointers 
                to pointers
12:02 < rkruppe> the impls you linked seem to be generic over the pointee type tho? 
                 as in, are there really different impls for e.g. `x4<*const T>` and 
                 `x4<*const *const >`?
12:02 < rkruppe> [...] and `x4<*const *const T>`?
12:02 < gnzlbg> hmm no, they are only generic over the first *mut/*const T
12:02 < gnzlbg> depending on the first *mut/*const you can only read, or read/write
12:03 < gnzlbg> so the api of Simd<[*mut T; N]> is different than that of 
                Simd<[*const T; N]>
12:03 < gnzlbg> whether T is an f32, or another pointer (e.g. *mut f32) is irrelevant
12:04 < rkruppe> so then i don't understand why you can't have one concrete type 
                 (generic over `T`) corresponding to `Simd<[*const T; 4]>` and 
                 another corresponding to `Simd<[*mut T; 4]>`
12:05 < gnzlbg> rkruppe: we have, those are the types exposed
12:05 < gnzlbg> type cptrx2<T> = Simd<[*const T; 2]>
12:05 < rkruppe> i mean something like `struct mutptr4<T>(...);` and `struct 
                 constptr4<T>(...);`
12:06 < gnzlbg> yep, those are cptrx4 and mptrx4
12:06 < rkruppe> again i want to understand why you moved to having x4<T> / 
                 Simd<ARRAY> at all
12:06 < rkruppe> i fear the signature of gather/scatter would involve an associated 
                 type from hell but it still seems like it should be *possible*
12:07 < gnzlbg> rkruppe: ah because the generic impl that dereferences into the 
                target type is easier if mutptrx4 and f32x4 are the same type
12:07 < rkruppe> `fn gather(self) -> <Self as GatherResult>::It` where `It` is 
                 either another pointer vector or a vector of ints/floats
12:08 < gnzlbg> the impls for the pointer types are generic, there is only one 
                impl<T> for cptrx4<T> { ... } and the target type is just an 
                associated type Target = Simd<[T; 4]>
12:08 < rkruppe> yes, that is in many ways much nicer, but i don't see why it's 
                 *necessary*
12:08 < rkruppe> oh well
12:08 < gnzlbg> if T is just f32 at the leaf, then Target == f32x4 automatically, 
                while if T is *mut T, then Target is Simd<[*mut T; 4]>
12:08 < rkruppe> doesn't matter
12:08 < gnzlbg> it isn't necessary if you add all the impls automatically
12:08 < gnzlbg> *manually
12:09 < gnzlbg> the problem is, if you add a generic impl for all pointer types, and 
                you add an impl for the leaf primitive types
12:09 < gnzlbg> they will conflict
12:09 < rkruppe> no point rewriting the library once again and having the length as 
                 an explicit parameter is better for the future anyway
12:10 < gnzlbg> one might be able to work around this using a separate trait, but 
                then you want to have an impl for all T's that are pointers, and one 
                impl for all Ts that are not pointer
12:10 < rkruppe> gnzlbg: if you're talking the impl that provides gather, there 
                 would just be one`impl<T> cptr4<T>` and the selection between 
                 gather returning another vector pointer or a vector of leaf types 
                 would be made with associated types dispatched on T
12:11 < gnzlbg> i am talking about the impls for the trait that is used to find the 
                associated type
12:11 < rkruppe> that trait is implemented for `*const T`, `*mut T`, and the 
                 float/int types?
12:13 < gnzlbg> that might work I think
12:14 < gnzlbg> looking at the impl of gather currently it dispatches to a concrete 
                type
12:16 < gnzlbg> but i guess one can also add a trait with an associated type and 
                implement it for the vectors of pointers
12:18 < gnzlbg> rkruppe so the problem is that particular trait
12:18 < gnzlbg> lets call it trait VecDeref { type Target; } 
12:19 < gnzlbg> I can add  an impl<T> VecDeref for cptrx2<*mut T> { type Target = 
                cptrx2<T>; } 
12:19 < gnzlbg> and then I can add the manual impls for the concrete types, impl 
                VecDeref for cptrx2<f32> { type Target = f32x2; } 
12:20 < gnzlbg> so no, I don't think there is a problem, that should just work
12:21 < gnzlbg> I guess that can be retrofitted without many changes
12:21 < gnzlbg> so maybe we don't need to expose the Simd<[T; N]> type at all
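For reference, the resolution sketched at the end of the log, written out (names taken from the conversation; a hedged sketch, since the real library would need the full set of impls for both pointer mutabilities and all leaf element types):

trait VecDeref {
    type Target;
}

// Pointer-to-pointer case: dereferencing Simd<[*const *mut T; 2]> yields
// Simd<[*mut T; 2]>, i.e. mptrx2<T>, via one generic impl.
impl<T> VecDeref for cptrx2<*mut T> {
    type Target = mptrx2<T>;
}

// Leaf case: dereferencing Simd<[*const f32; 2]> yields f32x2, via one
// concrete impl. The two impls do not overlap, so coherence is satisfied.
impl VecDeref for cptrx2<f32> {
    type Target = f32x2;
}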
@alexcrichton


Member

alexcrichton commented Sep 4, 2018

oh, also gisted a bit with @rkruppe, which I didn't get a chance to follow up on over the weekend!

@glaebhoerl


Contributor

glaebhoerl commented Sep 5, 2018

FWIW my recollection of the internals thread(s) I participated in about this is that people were concerned about it being premature to speculate about Simd<[T; N]>, it wasn't clear whether or how well it could work, it'd be more conservative and prudent to just stabilize f32x4 et al first, unknown unknowns, and that sort of thing. Process oriented and meta-level arguments. I don't recall there having been specific technical objections to Simd<[T; N]>. "I don't remember that there were" is not the same as "I remember that there weren't", of course.

If @gnzlbg has implemented it and it turned out that it not only works but works better than the alternatives, then that answers some of those questions, I think?

@gnzlbg


Contributor

gnzlbg commented Sep 5, 2018

@glaebhoerl My main worry is that I don't know the answer to the following questions:

  • Can we put impl Foo { } blocks behind feature flags? (to only stabilize some?)
  • Can we put impl Trait for Foo {} impls behind distinct feature flags? (to only stabilize some?)

so I don't know whether it is possible to stabilize some part of the implementation without stabilizing all of it (and instantaneously stabilizing all future additions).

(EDIT: these questions are important because given a Simd<T>, what you can do with it depends on the concrete T provided, which traits it implements, etc. These are all implementation details of the library, but affect what the user can do with a Simd<T>)

If the answer to both questions is yes - those things are possible, and trait impls behind feature flags do not affect type inference on stable, nor on nightly when that particular feature flag is not active - then I think all my concerns would be resolved (cc @eddyb?).

@alexcrichton mentioned that similar discussions about generics have been had, and that other points about these issues have been raised. It might be worth digging those up: could someone share a link?

@alexcrichton


Member

alexcrichton commented Sep 5, 2018

@gnzlbg yes impl Foo { ... } blocks (and methods) can be separately stabilized. We cannot, however, selectively stabilize impl Trait for Foo { .. }.

@gnzlbg


Contributor

gnzlbg commented Sep 5, 2018

We cannot, however, selectively stabilize impl Trait for Foo { .. }.

So this might be the killer. To explain the problem: Simd<T> does not accept every T - it does not even accept every array. For example, Simd<[f32; 4]> is ok, but Simd<[f32; 3]> (and many others) is not.

The way that's implemented in the library is just by using a trait to constrain T in the Simd implementation: Simd<T: sealed::SimdArray>. The private SimdArray trait is implemented in the library for some array types, like [f32; 4], but also via some generic blanket impls like impl<T> SimdArray for [*const T; 4] where ... and similar.

This is problematic, because there are parts of the library that we don't want to stabilize at first, like 512-bit wide vectors, maybe vectors of pointers, etc. but we still want to be able to use and improve on nightly. The problem is that we can't do that, because we can't mark the impl Trait for Foo as an unstable impl behind a feature gate.

AFAICT const generics won't help. If we had a Simd<T, const N: usize> type we still have similar problems (e.g. restricting the Ns allowed, restricting T to not allow pointers at first, etc. ).

The only idea I have is that we could use conditional compilation to only put in core the pieces that are "ready" for stabilization, and people would need to use the packed_simd crate for the "unstable" pieces. But that would mean that every time we move something new into core it could become pretty much insta-stable, skipping the release train, which is a no-go because the release train has detected many issues in the past when this was in std::simd.

@hsivonen


hsivonen commented Sep 6, 2018

@gnzlbg

Sorry about the slow response time. Looks great! Thank you. (I wish we could get safe wrappers around endianness-exposing bitcasts of vectors of the same bit width but different lane configurations sooner rather than later, though.)

PartialOrd<Self>: compares two vectors lexicographically.

Is it endianness-dependent? (Asking out of curiosity, not out of complaint, because endianness-exposing bitcasts were excluded from the RFC but endianness-dependent extract, new, etc. were included.)

@ralfbiedert

/// # Panics
///
/// If `slice.len() != Self::lanes()` or `&slice[0]` is not
/// aligned to an `align_of::<Self>()` boundary.
pub fn write_aligned(self, slice: &mut [element_type]);

Out of curiosity, have you checked if the compiler is smart enough to hoist the panic condition checks out of the loop when used with exact_chunks?

@gnzlbg


Contributor

gnzlbg commented Sep 6, 2018

Is it endianness-dependent?

No. AFAIK one cannot create endianness-dependent behavior with packed_simd unless one writes unsafe code to e.g. transmute a u64 into a [u32; 2] or similar. Such issues also affect primitive types (like u64 -> [u32; 2]), so they are not specific to vector types.

Out of curiosity, have you checked if the compiler is smart enough to hoist the panic condition checks out of the loop when used with exact_chunks?

Most of the recent work in the packed_simd examples and benchmarks has gone into beating ISPC in all benchmarks, and manually removing these checks has played a non-negligible role in that. So what we have ended up doing is using the unsafe _unchecked methods in pretty much all kernels, but changing their implementation to include debug_assert!s so that they are checked in debug builds. That is, a method like vec.write_to_slice_aligned_unchecked(&[...]) does check that the target slice is large enough to hold vec, and that it is suitably aligned, if debug-assertions = true.

Having said this, there have been many changes that have improved performance in the recent weeks, so when all benchmark-related PRs have been merged it might be worth it to test this again by just enabling debug-assertions in release mode and running all benchmarks.
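A sketch of the pattern described above (the method name follows packed_simd; the body and exact checks are assumed):

pub unsafe fn write_to_slice_aligned_unchecked(self, slice: &mut [f32]) {
    // Checked only when debug-assertions are enabled; free in release builds.
    debug_assert!(slice.len() >= Self::lanes());
    debug_assert_eq!(slice.as_ptr() as usize % core::mem::align_of::<Self>(), 0);
    // ... perform the aligned store without any release-mode checks ...
}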

@alexcrichton


Member

alexcrichton commented Sep 6, 2018

@gnzlbg ah yeah that's why I was wondering if newtype wrappers would work, but I unfortunately don't know of a great solution :(
