RFC: Add a scalable representation to allow support for scalable vectors #3268

Status: Open. Wants to merge 4 commits into base `master`.
text/3268-repr-scalable.md: 191 additions, 0 deletions
- Feature Name: repr_scalable
- Start Date: 2022-05-19
- RFC PR: [rust-lang/rfcs#3268](https://github.com/rust-lang/rfcs/pull/3268)
- Rust Issue: [rust-lang/rust#0000](https://github.com/rust-lang/rust/issues/0000)

# Summary
[summary]: #summary

Expand the SIMD functionality to allow for runtime-determined vector lengths.

# Motivation
[motivation]: #motivation

Without some support in the compiler it would be impossible to use the
[ACLE](https://developer.arm.com/architectures/system-architectures/software-standards/acle)
[SVE](https://developer.arm.com/documentation/102476/latest/) intrinsics from Arm.

This RFC focuses on the Arm vector extensions, and uses them for all examples. However, a large part of what this
RFC covers amounts to emitting LLVM's `vscale` vector types, so other scalable vector extensions should work as well.
In an LLVM developer meeting it was mentioned that RISC-V would use whatever is accepted for Arm SVE for its vector extensions
\[[see slide 17](https://llvm.org/devmtg/2019-04/slides/TechTalk-Kruppe-Espasa-RISC-V_Vectors_and_LLVM.pdf)\].

# Guide-level explanation
[guide-level-explanation]: #guide-level-explanation

This is mostly an extension to [RFC 1199 SIMD Infrastructure](https://rust-lang.github.io/rfcs/1199-simd-infrastructure.html).
An understanding of that is expected from the reader of this. In addition to that, a basic understanding of
[Arm SVE](https://developer.arm.com/documentation/102476/latest/) is assumed.

Existing SIMD types are tagged with `repr(simd)` and contain an array or multiple fields that determine the size of the
vector. Scalable vectors have a size that is known (and constant) at run time, but unknown at compile time. For these we propose a
new kind of exotic type, denoted by an additional `repr()`. This additional representation, `scalable`,
accepts an integer that determines the number of elements per granule. See the definitions in
[the reference-level explanation](#reference-level-explanation) for more information.

Review comment (Contributor): "granule" is mentioned here but not defined anywhere else.

Reply (Member): "granule" is the name I made up for the `<4 x i32>` part of the LLVM IR scalable vector type `<vscale x 4 x i32>`; I don't know what it's actually called.

Reply: In LLVM, a scalable type is represented as `(ElementCount NumElts, Type EltTy)`, where an `ElementCount` is `(IsScalable, MinNumElts)`. Maybe it would be better to call it the minimum number of elements instead of "granule"?

For example, for a scalable `f32` vector type, the following could be its representation:

```rust
#[repr(simd, scalable(4))]
pub struct svfloat32_t {
    _ty: [f32],
}
```
`_ty` is purely a type marker, used to get the element type for the LLVM backend.

Review comment: I'm a bit confused about where `scalable(4)` comes into play here. I was looking at the `svfloat32_t` type in C, which is really backed by the builtin type `__SVInt64_t`, and I couldn't find how that type was tied to a minimum element count of 4. Am I missing where the C SVE intrinsics tie `svfloat32_t` to a minimum number of elements, or is this something that you are proposing Rust does that is missing in C?

Reply (Member, @RalfJung, Apr 26, 2024): This seems to be related to the fact that the LLVM representation of the type is `<vscale x 4 x f32>`, which means that we assume the hardware scales in units of 128 bits (that fit 4 `f32`). On hardware with a different scaling unit this will be suboptimal, or maybe even not work if the scaling unit is smaller than 128 bits. In other words, this type is pretty non-portable. That's my understanding based on reading the LLVM LangRef; maybe I got it all wrong. Unfortunately the RFC doesn't explain enough to be able to say: it assumes a bunch of background on how these scalable vector types work in LLVM / hardware.


This new class of type has the following properties:
* Not `Sized`, but it does exist as a value type.
* These can be returned from functions.
Review comment (Member): This seems to indicate that we need support for "unsized (r)values" to use this feature. Unfortunately the current state of unsized values is "they are a complete mess, and don't even have a consistent MIR-level semantics".

Review comment: We don't currently have support for returning unsized types from functions in Rust. I would like this RFC to better detail the impact that this will have on implementing scalable vectors in Rust. I hope I can provide some helpful information below.

If we look at this example, we can see that C/C++ can handle:

  1. Scalable types as function params
  2. Scalable types as local variables
  3. Scalable types as return values

How does C/C++ do it?

These types in C/C++ are both sizeless and scalably sized, and it seems that either property is invoked wherever convenient. For example, if you try to take `sizeof` of a scalable type:

    <source>:8:5: error: invalid application of 'sizeof' to sizeless type 'vint32m8_t' (aka '__rvv_int32m8_t')
        8 |     sizeof(vint32m8_t);

Another example of the scalable type being sizeless is in `ASTContext::getTypeInfoImpl`:

        // Because the length is only known at runtime, we use a dummy value
        // of 0 for the static length.
    #define SVE_VECTOR_TYPE(Name, MangledName, Id, SingletonId, NumEls, ElBits,    \
                            IsSigned, IsFP, IsBF)                                  \
      case BuiltinType::Id:                                                        \
        Width = 0;

But on the other hand, clang also treats these types as having a scalable size which can be resolved at runtime. There is a function `getBuiltinVectorTypeInfo`; in it you can see how a `BuiltinVectorTypeInfo` object gets created using `ElementCount::getScalable`:

    #define SVE_ELTTY(ELTTY, ELTS, NUMVECTORS)                                     \
      {ELTTY, llvm::ElementCount::getScalable(ELTS), NUMVECTORS};

    // ... snip

    #define RVV_VECTOR_TYPE_INT(Name, Id, SingletonId, NumEls, ElBits, NF,         \
                                IsSigned)                                          \
      case BuiltinType::Id:                                                        \
        return {getIntTypeForBitwidth(ElBits, IsSigned),                           \
                llvm::ElementCount::getScalable(NumEls), NF};

Then in SemaChecking.cpp, there are functions such as `areCompatibleSveTypes`, `checkRVVTypeSupport`, and `CheckImplicitConversion` which type-check these types as having a scalable size.

When it comes to code generation to LLVM IR, Rust unsized types have been tricky because it can be difficult to lower them, especially when it comes to return types. But that isn't the case with scalable types: Rust scalable types can be mapped to LLVM scalable types. I think this may allow us to sidestep a lot of the complications that come with supporting general unsized types in Rust. Using the godbolt example above, we can see that the C scalable/sizeless types are lowered as LLVM scalable types:

    %7 = load i64, ptr %4, align 8
    %8 = call <vscale x 16 x i32> @foo(__rvv_int32m8_t, unsigned long)(<vscale x 16 x i32> %6, i64 noundef %7)
    store <vscale x 16 x i32> %8, ptr %5, align 4

Relying on Builtins

One important point I want to make here is that C/C++ limits scalable/sizeless types to builtins. For example, you can't define your own scalable type. In addition, you can't define data structures using the existing builtin scalable types:

    // This is an error
    struct foo {
        vint32m8_t b;
        vint32m8_t a;
    };

As a result, the scope of handling these types is greatly reduced. As I pointed out above, functions like `areCompatibleSveTypes` and `checkRVVTypeSupport` know how to type-check specifically on these scalable types, and there is explicit lowering of the intrinsics that operate on these types. I believe that by restricting support to handling unsized scalable builtins only, we may not have to concern ourselves with what a mess general unsized types are in Rust.

What does this mean for Rust?

I hope that this RFC can clarify what it will look like to add support for scalable vectors in the context of unsized types in Rust. Some questions I would like to see clarified:

* Will we support unsized fn params, unsized local variables, and unsized return values in general, or will we limit the scope to scalable types? I am leaning towards the latter, especially because supporting unsized return values might be a massive undertaking, if it is possible at all. I think if we choose the former, then we should have an RFC on adding that feature to the language. I've started inquiring about that topic on this Zulip thread in an attempt to understand whether any work has been done yet.
* Will scalable types be builtin, or can people define their own scalable types in their own Rust programs? If we choose the builtin path, I would like this RFC to discuss adding builtins under Prior Art.
* If we sometimes treat these types as unsized and sometimes treat them as having a scalable size, what features will we need to include? Would we require something like `#![feature(unsized_fn_params, unsized_locals, unsized_ret_vals)]`, `#![feature(scalable_types)]`, or both?

* Heap allocation of these types is not possible.

Review comment: In C, heap allocation depends on `malloc`, which takes a size. You can't call `sizeof` on an unsized type in C, so it is a compiler error to write `malloc(sizeof(vint8mf8_t))`. In this sense, unsized types may seem non-heap-allocatable.

However, I took a look at the RISC-V "V" C intrinsics to understand whether this has to be the case. On RISC-V a vector register has a size, even if it is unknown at compile time (due to the vscale). The `__riscv_vlenb` C intrinsic can be used to write programs that determine the size of the vector register associated with a type at runtime. As a result, it should be possible to do something like the following pseudo-code:

    vscale = __riscv_vlenb() / 64;
    // helper func that returns the minimum vector size
    // (i.e. size without vscale, or multiplied by a vscale of 1)
    min_vec_size = get_min_size(vint8mf8_t);
    vint8mf8_t *heap_allocated_scalable = malloc(to_bytes_from_bits(vscale * min_vec_size));

So while it may be a little convoluted (and target dependent) to allocate these types on the heap, I think it is possible. Maybe it would be better to drop this as a requirement, but note that initially there will not be support for allocating these types on the heap.

* Can be passed by value, reference and pointer.
* The types can't have a `'static` lifetime.
Review comment (Member): Wait, so `svfloat32_t: 'static` is not true? But there's no lifetime in this type, so this statement must be true. What is this about?

Reply (Author): That might be poorly phrased by me; I was referring to the fact that these can't exist as a `static` variable. I can update the RFC to make that clearer.

* These types can be loaded and stored to/from memory for spilling to the stack,
and to follow any calling conventions.
* Can't be stored in a struct, enum, union or compound type.
Review comment (Member, on lines +50 to +55): This is a wild list of restrictions, and the RFC does not explain why they are needed. Further down it seems like these types are really just "slices where the length is determined by a run-time constant". Slices don't have most of these restrictions, so why do scalable SIMD types need them?

* This includes single-field structs with `#[repr(transparent)]`.
Review comment (Member): So what should happen when I do

    #[repr(transparent)]
    struct Wrap<T>(T);

    type MyTy = Wrap<svfloat32_t>;

Are scalable SIMD types not allowed to instantiate generic parameters? Are there new post-monomorphization errors for when a generic instantiation turns out to break rules like this?

* This also means that closures can't capture them by value.
* Traits can be implemented for these types.
* These types are `Unpin`.
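
To make the list above concrete, here is a sketch only (using the `svfloat32_t` type defined earlier; none of this is existing syntax or API) of uses that would and would not be accepted:

```rust
// Allowed under the rules above:
fn by_ref(v: &svfloat32_t) { /* use v */ }        // pass by reference
fn by_ptr(v: *const svfloat32_t) { /* use v */ }  // pass by pointer

// Rejected under the rules above:
// struct Wrapper { v: svfloat32_t }        // can't be a struct/enum/union field
// static GLOBAL: svfloat32_t = /* ... */;  // can't have a `'static` lifetime
// let f = move || by_ref(&v);              // (with `v: svfloat32_t`) closures
//                                          // can't capture these by value
```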

A simple example that an end user would be able to write for summing two arrays using functions from the ACLE
for SVE is shown below:

```rust
unsafe {
    let step = svcntw() as usize;
    for i in (0..SIZE).step_by(step) {
        let a = data_a.as_ptr().add(i);
        let b = data_b.as_ptr().add(i);
        let c = data_c.as_mut_ptr().add(i);

        // Predicate that masks off the lanes past SIZE on the final iteration.
        let pred = svwhilelt_b32(i as _, SIZE as _);
        let sva = svld1_f32(pred, a);
        let svb = svld1_f32(pred, b);
        let svc = svadd_f32_m(pred, sva, svb);

        svst1_f32(svc, pred, c);
    }
}
```
As can be seen from that example, the end user wouldn't necessarily interact directly with the changes
proposed by this RFC, but might use types and functions that depend on them.

# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation

This section will focus on LLVM. No investigation has been done into the alternative codegen back ends. At the time of
writing I believe Cranelift doesn't support scalable vectors ([current proposal](https://github.com/bytecodealliance/rfcs/pull/19)),
and the GCC backend is not mature enough to be thinking about this.

Review comment (Member, @RalfJung, Apr 27, 2024): This should focus on Rust, not LLVM. In other words, it should fully describe the behavior of these types without mentioning anything LLVM-specific. This is a Rust language RFC after all, so its effect needs to be described in terms of what happens at the level of Rust. It is okay to also explain how this maps to LLVM, but you cannot expect the reader to know anything about LLVM, so the text needs to make sense to someone who knows nothing about LLVM.

Most of the complexity of SVE will be handled by LLVM and the `vscale` modifier that is applied to vector types. Therefore
the changes required in Rust should be fairly minimal. From the LLVM side this is as simple as calling `LLVMScalableVectorType`
rather than `LLVMVectorType`.
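
As a rough sketch of that codegen-side choice (assuming llvm-sys-style bindings; the helper below is illustrative and not part of rustc):

```rust
use llvm_sys::core::{LLVMScalableVectorType, LLVMVectorType};
use llvm_sys::prelude::LLVMTypeRef;

/// Illustrative helper: choose the scalable vector type when the Rust type
/// carries `repr(scalable)`, and the fixed-length vector type otherwise.
unsafe fn vector_type(elt: LLVMTypeRef, elements: u32, scalable: bool) -> LLVMTypeRef {
    if scalable {
        LLVMScalableVectorType(elt, elements) // <vscale x elements x elt>
    } else {
        LLVMVectorType(elt, elements) // <elements x elt>
    }
}
```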
Review comment (Contributor, on lines +88 to +94): Given that, unfortunately, some time has elapsed since this was first proposed, I'd like to see this RFC address a little bit more about how a non-LLVM backend might handle this. It doesn't have to dwell deeply on this, but I would like to see cross-referencing with Cranelift's Dynamic Vectors implementation so that we know, before we stabilize anything, whether the design will be tractable to implement by codegen that isn't "use LLVM". LLVM has injected limitations that are not contingent on the capabilities of the CPUs in question, so what other arbitrary limitations will we need to account for?

More than just codegen, it is very convenient if Miri understands how things operate, so it can model what is UB or uninit (poison/undef). So I would like it if these intrinsics were defined as something Miri can recognize and execute during interpretation of a Rust program, as opposed to just linking raw LLVM intrinsics, even if it's just "use a Rust intrinsic which expands into raw LLVM IR, which does roughly ${description}".


In LLVM, a scalable vector type takes the form `<vscale x elements x type>`.
* `elements` multiplied by the size of `type` gives the smallest allowed register size and the increment size.
* `vscale` is a run-time constant that is used to determine the actual vector register size.

For example, with Arm SVE the scalable vector register (Z register) size has to
be a multiple of 128 bits and a power of 2 (via a retrospective change to the
architecture), therefore for `f32`, `elements` would always be four. At run time
`vscale` could be 1, 2, 4, 8, or 16, which would give register sizes of 128, 256,
512, 1024 and 2048 bits. While SVE now has the power-of-2 restriction, `vscale` could
be any value provided it gives a legal vector register size for the
architecture.
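
The arithmetic above can be restated with a small illustrative helper (not part of the RFC) for the `f32` case:

```rust
/// Illustrative only: register size in bits for a vector declared with
/// `scalable(elements)` over `f32`, at a given run-time `vscale`.
fn register_bits(vscale: usize, elements: usize) -> usize {
    let f32_bits = 32;
    vscale * elements * f32_bits
}

// With `elements = 4` (the granule used for `f32` on Arm SVE):
// register_bits(1, 4)  == 128
// register_bits(4, 4)  == 512
// register_bits(16, 4) == 2048
```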
Review comment (Member, @RalfJung, Apr 26, 2024): This sounds like a pretty bad API for portable efficient vector programming. I thought the point was to not have to know the vector size supported by the hardware, so that I could e.g. use `<vscale x i32>` to get a vector of `i32` that's the ideal size for this hardware. But now it seems like I still have to know the hardware I am writing for, so that I can use `<vscale x 4 x i32>` on Arm while using e.g. `<vscale x 8 x i32>` on some target where vscale measures multiples of 256 bits.

Ideally for Rust we should have a version of this that does not require me to know the hardware's "vector scaling unit" (i.e. the size that corresponds to an LLVM vscale of 1).

Reply:

> so I could e.g. use `<vscale x i32>` to get a vector of i32

A scalable type has a minimum size component, for example `<vscale x 4 x i32>`.

> But now it seems like I still have to know the hardware I am writing for

I'm not sure that's true in all instances. In LLVM, vector types go through type legalization in SelectionDAG or GlobalISel, which are the components responsible for translating IR into target-specific instructions. In cases where SelectionDAG or GlobalISel see a vector type that is not supported, the legalizer will try to put it into a form that the hardware can support. One example of this is on RISC-V, where all fixed vectors are legalized into scalable vectors.

> Ideally for Rust we should have a version of this that does not require me to know the hardware's "vector scaling unit" (i.e. the size that corresponds to an LLVM vscale of 1).

As LLVM scalable types exist today, we don't know what vscale is until runtime. So you are not required to know the hardware's scaling unit at compile time.

> (i.e. the size that corresponds to an LLVM vscale of 1).

This sounds like a suggestion to use fixed-size vectors instead of scalable vectors in cases where you really need it.

Reply (Member, @RalfJung, Apr 26, 2024): Everything I am saying is based on the LangRef: "For scalable vectors, the total number of elements is a constant multiple (called vscale) of the specified number of elements; vscale is a positive integer that is unknown at compile time and the same hardware-dependent constant for all scalable vectors at run time. The size of a specific scalable vector type is thus constant within IR, even if the exact size in bytes cannot be determined until run time."

In other words, this is not a minimum size. `<vscale x 4 x i32>` means "some constant times 4 x i32". And if you also have a `<vscale x 2 x i32>`, then that's the same constant times 2 x i32. So `<vscale x 4 x i32>` will always be exactly twice as large as `<vscale x 2 x i32>`. If the Arm chip has vectors of size 512 bits, then vscale = 4 and `<vscale x 2 x i32>` will be only 256 bits in size, so half the vector width was wasted. One therefore has to carefully pick the unit that is being scaled to match the hardware.

> As LLVM scalable types exist today, we don't know what vscale is until runtime. So you are not required to know the hardware's scaling unit at compile time.

I was talking about the scalable vector unit, not the scalable vector factor. (I am making up terms here, as the LangRef doesn't give me good terms to work with.) On Arm, the "unit" is 128 bits large. The factor then determines the actual size of the vector registers, in units of 128 bits. So a factor of 4 means the registers are 512 bits large. With the interface provided by LLVM, one has to know the unit (not the factor!) at compile time to generate optimal code.

Or maybe I got it all wrong. But the LangRef description is not compatible with your claim that the 4 in `<vscale x 4 x i32>` is a minimum.

Reply:

> was talking about the scalable vector unit

Do you mind giving a definition of what a unit is? Is it the fixed component of the vector type? For `<vscale x 4 x i32>`, is the unit `4 x i32`?

> ... With the interface provided by LLVM, one has to know the unit (not the factor!) at compile time to generate optimal code.

I'm not so sure about Arm, but I know that RISC-V can generate code for all different "units" regardless of the runtime vscale value. You can pick whatever "unit" you'd like to use.

> But the LangRef description is not compatible with your claim that the 4 in `<vscale x 4 x i32>` is a minimum.

It is a minimum because the smallest runtime value of vscale is 1.

Reply (Member, @RalfJung, Apr 27, 2024):

> It is a minimum because the smallest runtime value of vscale is 1.

If describing it as a minimum were a sufficient description, then `<vscale x 2 x i32>` and `<vscale x 4 x i32>` should both be vectors of size 128 bits (if the platform has registers of that size), right? I am asking for "at least 2 (or 4) i32, but ideally as many as the hardware provides". But that's not correct, according to the LangRef. Ergo, saying it is a minimum is misleading. The type is not defined as "at least that big"; it is defined as "the hardware-specific scaling factor times that base size". If you pick the base size too small (smaller than the scaling unit of the hardware), you will waste register space. If you pick it too big, presumably LLVM complains.

> Do you mind giving a definition of what a unit is?

It's how much you get when the factor is 1. I am talking about a hardware property here. Arm defines that if vscale is 1 then the registers are 128 bits large, ergo the Arm scalable vector unit is 128 bits; in other words, the size of Arm scalable vectors is measured in multiples of 128 bits. LLVM vscale types also have a unit; as you say, it is the part after `vscale x`. If that unit does not have the same size as the hardware unit then things seem weird.


The scalable representation accepts the number of `elements` rather than having the compiler calculate it, which serves
two purposes. First, it removes the need for the compiler to know about the user-defined types and how to calculate
the required element count. Second, some of these scalable types can have different element counts. For instance,
the predicates used in SVE have different element counts in LLVM depending on the types they are a predicate for.
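
As a hypothetical illustration of the second point (the type names, element types and counts below are made up for this sketch and are not defined by this RFC), predicate types guarding different element widths would be declared with different `scalable` arguments:

```rust
// Hypothetical sketch: the compiler cannot derive these counts from the
// marker field alone, which is why `scalable` takes the count explicitly.

#[repr(simd, scalable(4))]
pub struct svbool32_t {
    _ty: [bool], // one predicate lane per 32-bit element, e.g. <vscale x 4 x i1>
}

#[repr(simd, scalable(16))]
pub struct svbool8_t {
    _ty: [bool], // one predicate lane per 8-bit element, e.g. <vscale x 16 x i1>
}
```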

As mentioned previously, `vscale` is a runtime constant. With SVE the vector length can be changed at runtime (e.g. by a
[prctl()](https://www.kernel.org/doc/Documentation/arm64/sve.txt) call on Linux). However, since this would require a change
to `vscale`, doing so is considered undefined behaviour in Rust. This is consistent with the C and C++ implementations.

## Unsized rules
These types aren't `Sized`, but they need to exist in local variables, and we
need to be able to pass them to, and return them from, functions. This means
adding an exception to the rules around returning unsized types in Rust. There
are also some traits (such as `Copy`) that have a bound on being `Sized`.

We will implement `Copy` for these types within the compiler, without users having to
implement the trait when the types are defined.

This RFC also changes the rules so that function return values can be `Copy` or
`Sized` (or both). Once returning unsized values is allowed, this part of the rule
would be superseded by that mechanism. It's worth noting that if any other
types are created that are `Copy` but not `Sized`, this rule would apply to
those as well.
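
As a sketch of the kind of signature this exception is intended to allow (assuming the `svfloat32_t` type from the guide-level example, plus an ACLE-style predicate type `svbool_t` and the `svadd_f32_m` intrinsic, neither of which is defined by this RFC):

```rust
// Sketch only: `svfloat32_t` is not `Sized`, but with the compiler-provided
// `Copy` it can live in a local, be passed by value, and be returned by value.
unsafe fn masked_add(pred: svbool_t, a: svfloat32_t, b: svfloat32_t) -> svfloat32_t {
    let sum = svadd_f32_m(pred, a, b); // unsized local, allowed under this rule
    sum
}
```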
Review comment (Member): Remember that Rust has generics, so I can e.g. write a function `fn foo<T: Copy>(x: &T) -> T`. The RFC seems to say this is allowed, because the return type is `Copy`. But for most types `T` and most ABIs this can't be implemented.

You can't just say in a sentence that you allow unsized return values. That's a major language feature that needs significant design work on its own.

I think what you actually want is some extremely special cases where specifically these scalable vector types are allowed as return values, but in a non-compositional way. There is no precedent for anything like this in Rust, so it needs to be fairly carefully described and discussed.


# Drawbacks
[drawbacks]: #drawbacks

## Target Features
One difficulty with this type of approach is that vector types typically require a
target feature to be enabled. Currently, a trait implementation can't enable a
target feature, so some traits can't be implemented correctly without passing `-C
target-feature` to rustc.

However, that isn't a reason not to do this; it's a pain point that another RFC
can address.
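
As a sketch of the pain point (assuming the `svfloat32_t` and `svbool_t` types and the ACLE-style intrinsics `svadd_f32_m` and `svptrue_b32`, none of which are defined by this RFC): at the time of writing `#[target_feature]` can only be applied to `unsafe fn`, and a trait impl cannot change a method's safety, so there is nowhere to put the attribute.

```rust
use core::ops::Add;

// A free function can enable the feature itself:
#[target_feature(enable = "sve")]
unsafe fn masked_add_sve(pred: svbool_t, a: svfloat32_t, b: svfloat32_t) -> svfloat32_t {
    svadd_f32_m(pred, a, b)
}

// A trait impl cannot: `Add::add` must remain a safe fn, so this impl only
// produces correct code when the crate is built with `-C target-feature=+sve`.
impl Add for svfloat32_t {
    type Output = svfloat32_t;

    fn add(self, rhs: svfloat32_t) -> svfloat32_t {
        unsafe { svadd_f32_m(svptrue_b32(), self, rhs) }
    }
}
```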

# Prior art
[prior-art]: #prior-art

This is a relatively new concept, with not much prior art. C has taken a very
similar approach, using sizeless incomplete types to represent the SVE
types. Aligning with C here means that most of the documentation that already
exists for the C intrinsics should still be applicable to Rust.

# Future possibilities
[future-possibilities]: #future-possibilities

## Relaxing restrictions
Some of the restrictions that have been placed on these types could possibly be
relaxed at a later time. This could be done in a backwards compatible way. For
instance, we could perhaps relax the rules around placing these in other
types. It could be possible to allow a struct to contain these types by value,
with certain rules such as requiring them to be the last element(s) of the
struct. Doing this could then allow closures to capture them by value.

## Portable SIMD
For this to work with portable SIMD as it is currently
implemented, a const generic parameter would be needed in the
`repr(scalable)`. Creating this dependency would probably be a source of bugs
from an implementation point of view, as it would require support for symbols
within the attribute literals.

One potential way to have portable SIMD working in its current style would be to have a trait as follows:
```rust
pub trait RuntimeScalable {
    type Increment;
}
```

The compiler can use this to get the `elements` and `type` from.

The above representation could then be implemented as:
```rust
#[repr(simd, scalable)]
pub struct svfloat32_t {}

impl RuntimeScalable for svfloat32_t {
    type Increment = [f32; 4];
}
```

Given the differences in how scalable SIMD works across current instruction sets, it's worth experimenting with
architecture-specific implementations first. Portable scalable SIMD should therefore be fully addressed in
another RFC, as there are open questions about how it will work with adjusting the active lanes (e.g.
predication).