Should intrinsic have to model tail elements? #27

zakk0610 · 2020-06-23T16:30:52Z

current intrinsic RFC does not model tail element,
but someone maybe want to set the tail undisturbed like riscv/riscv-v-spec#157 (comment).
I think maybe we can only provide additional tail operand for reduction intrinsic functions to model tail elements.
any idea?

nick-knight · 2020-06-23T17:13:01Z

My opinion is that all functionality exposed to the RVV assembly programmer should also be exposed to the RVV intrinsics programmer; this includes the vta and vma settings. Since these bits are part of vtype (along with vsew and vlmul), it raises the question of whether they should also be encoded into the intrinsic names, passed in as arguments, etc.

rdolbeau · 2020-06-24T09:50:03Z

My 2 cents on the subject...

In most case when coding at high level, unused values in the registers (both inactive and tail) are irrelevant and the user doesn't care [1] what's in it. For peel/tail they are not loaded, not computed and not stored. For masked-out values (e.g. convergence loops), it's the same. The one obvious big exception is conditionals in loop, where usually the old value should be kept (either because the condition if false for 'if/then', or we're in the 'else' merging with the 'then' for 'if/then/else').

So in my opinion the default behavior should be 'keep whatever is in there' (undisturbed), as it should be quite enough for most use cases (as it satisfies both 'don't care' and 'merge' semantic). That's also why vundefined() is in there - it explicits the 'don't care' by inputing undefined value in the 'merge' semantic. For user in need of a specific value instead of undefined(), they can easily use an explicit intrinsic or a value as an input.

So in v0.9-speak, default current behavior should be either {mu, tu} (if the maskedoff input is valid) or {ma,ta} (if it's vundefined() or not present in the intrinsics, thus implicitly undefined, e.g. the unmasked intrinsics).

The question is, how to also model {ma,tu} and {mu,ta} in the intrinsics... and how does that interact with the above? In the current design, having {,tu} or {mu,} only works if there is a valid maskedoff parameter, otherwise the user can't specify the values to be left undisturbed... but from my reading of v0.9, the non-masked variant should be eligible for {*,tu} (not that {mu,tu} makes much sense w/o a mask...).

A simple solution would be to add a set of 'tail agnostic' intrinsics to the masked ones. So the behavior would be:

a) no maskedoff parameter, or maskedoff is vundefined(): {ma,ta}
b) valid maskedoff parameter, no tail specification: {mu, tu}
c) valid maskedoff parameter, tail agnostic: {mu, ta}

This does not give access to {ma,tu} unfortunately - though to be honest I don't see a use case at the moment for that combination. Would hardware be able to do better on {ma,tu} than on {mu,tu}? [and case b) and c) could be reversed to default to _ta and require a specification for _tu, which probably makes more sense in practice]

Another solution is to add vta/vma to vsetvl() but a) I don't like hidden state in high-level code and b) {mu,tu}, {mu,ta} and {ma,tu} don't make much sense with vundefined() or any intrinsics without a maskedoff parameter...

A third solution would be to assume for {ma,} or {mu,} as above, but rely on the hidden state for {,ta} and {,tu}. It's sort of half-and-half...

[the heavyweight solution is to have 4 variant of the intrinsics: _ma_ta (not maskedoff parameter needed), _mu_tu, _ma_tu and _mu_ta (all three with a mandatory maskedoff parameter), drop vundefined(), and let the user do all the work... perhaps with just _ta (no maskedoff) and _tu (mandatory maskedoff) for non-masked intrinsics... the _ma or _mu bit would take over the current _m prefix for the masked version].

[1] Except for security reasons perhaps (avoiding information leakage), but that's out-of-scope here I think.

rofirrim · 2020-06-25T09:47:58Z

In our prototype (which does not account for float16 and fractional LMUL yet) we already have ~7500 intrinsics. Multiplying them by 4 to cover all the combinations is a bit disheartening.

One option is to add a tail operand to the unmasked intrinsics. This way:

unmasked operations with a tail that is vundefined() would map to {*, ta}. There is no mask so either mu or ma would be OK, ma is the least defined and so it imposes the least behaviour and might be preferred. So {ma,ta}.
unmasked operations with a tail that is not a result from vundefined() would map to {ma,tu}. As Romain mentioned earlier {ma,tu} is not obviously useful but again we might want to prefer, if possible, modes that impose the least defined behaviour (i.e. mu is a valid implementation of ma).

Then, for masked operations:

If their maskedoff operand is vundefined() they would map to {ma,ta}.
If their maskedoff operand is not vundefined() they would map to {mu,ta}.

An issue with this approach is that {mu,tu} is not directly "expressible": we might want a guarantee that the tail is preserved that is stronger than "the implementation of ta is tu in this machine".

A way to address this is to add a new version of the masked intrinsics (with suffix, say, _mt), with an undisturbed operand (rather than maskedoff). We can allow vundefined() in that operand but it would not be different to the existing masked versions:

If their undisturbed operand is vundefined() they would map to {ma,ta}.
If their undisturbed operand is not vundefined() they would map to {mu,tu}.

We can look at it in the other way round, depending on what are the user needs (note that if something is not needed to be preserved it would still be correct, though unnecessary, to preserve it).

Masked?	Needs tail preserved	Needs maskedoff preserved	Intrinsic
No	No	N/A	`vadd_vv_<ty>_vl(vundefined(), a, b, vl)`
No	Yes	N/A	`vadd_vv_<ty>_vl(tail, a, b, vl)`
Yes	No	No	`vadd_vv_<ty>_m_vl(vundefined(), mask, a, b, vl)`¹
Yes	No	Yes	`vadd_vv_<ty>_m_vl(maskedoff, mask, a, b, vl)`
Yes	Yes	No	`vadd_vv_<ty>_mt_vl(undisturbed, mask, a, b, vl)`²
Yes	Yes	Yes	`vadd_vv_<ty>_mt_vl(undisturbed, mask, a, b, vl)`

¹vadd_vv_<ty>_mt_vl(vundefined(), va, vb, vl) could be allowed as an equivalent intrinsic.
²we believe this case is not very common, so implementing it as {mu,tu} might be a reasonable trade-off.

A downside with this approach is that now we have to specify the tail operand, which in many cases is going to be vundefined(). We could mitigate this by making the unmasked operations always {ma,ta} and then forcing the user to use _mt with an "all ones" mask if the tail must be preserved. The compiler could then soften the operation to a {ma,tu} unmasked operation.

The table above would look like this now:

Masked?	Needs tail preserved	Needs maskedoff preserved	Intrinsic
No	No	N/A	`vadd_vv_<ty>_vl(a, b, vl)`
No	Yes	N/A	`vadd_vv_<ty>_mt_vl(undisturbed, all-ones, a, b, vl)`
Yes	No	No	`vadd_vv_<ty>_m_vl(vundefined(), mask, a, b, vl)`
Yes	No	Yes	`vadd_vv_<ty>_m_vl(maskedoff, mask, a, b, vl)`
Yes	Yes	No	`vadd_vv_<ty>_mt_vl(undisturbed, mask, a, b, vl)`³
Yes	Yes	Yes	`vadd_vv_<ty>_mt_vl(undisturbed, mask, a, b, vl)`

³we believe this case is not very common, so implementing it as {mu,tu} might be a reasonable trade-off.

I might prefer the second approach: preserving the tail is not that usual to justify impacting all the unmasked operations. Also the softening required may also be useful in other circumstances so it seems it is something a compiler will want to implement anyways.

zakk0610 · 2020-06-29T06:51:48Z

current RFC has almost 20k intrinsic functions and 5k functions with function overloading (include Zvlsseg extension), definitely we don't want to support 4 variant of the intrinsics...
although @rofirrim 's second approach would increase the 1/3 number of functions, but I agree it's a good way to model tail element. ex. avoiding additional merging instruction if providing tail and maskedoff arguments, keep non-mask non-tail intrinsic function simple, etc.
I think the only one question is, would hardware be able to do better on {ma,tu} than on {mu,tu}? (@rdolbeau had mentioned above)
If the answer is yes, compiler can not generate {ma,tu} setting in second approach.

Hsiangkai · 2020-08-07T07:25:44Z

Agree with @rofirrim's second approach. Provide another set of intrinsics ending with _mt_vl to cover tail undisturbed cases.

rofirrim · 2020-09-09T19:35:06Z

My colleagues and I were looking today at some code that uses vfmacc and I realised that the intrinsics for instructions that use one of the input registers as output registers (such as vfmacc, vmacc, etc.) won't need any additional parameter in the _mt versions.

The reason is that the clobbered input is already what is going to be "undisturbed" in the final result.

So the intrinsic C prototype of the _mt version is exactly the same as the agnostic ones. They will be emitted using a different governing vsetvli though.

Does this make sense?

zakk0610 · 2020-09-10T00:30:52Z

Yes, it makes sense to me.

topperc · 2020-12-30T01:21:00Z

I'd like to continue the discussion of this. Sifive has been working on upstreaming the IR intrinsics into llvm.org and we'll we working on the C intrinsics soon. I'd like to make sure what we're doing is scalable to the future.

What we have in our internal branch is to always use a tail undisturbed policy. But many intrinsics don't take an argument for the undisturbed value so it is not controlled by the user. It will just be whatever is in the register that gets picked by the register allocator. It would also create a false dependency on implementations with vector register renaming. So this doesn't seem like a good long term behavior.

As we started upstreaming we initially used tail agnostic. A recent change was made upstream to use tail undisturbed for any masked intrinsic, fma intrinsic, or vcompress intrinsic and tail agnostic since for those intrinsics that user would have control and might want to preserve tail elements. We have not added detection of vundefined yet so we always use mask undisturbed.

I feel like this has put our implementation into a state that's not easy to explain as there's nothing in the name to tell the user which intrinsics use tail undisturbed and which don't. So I think I like @rofirrim's second approach where the tail undisturbed policy is explicit in each intrinsic.

I think what I would like to say is that the currently defined C interface is always tail agnostic and we can add new tail undisturbed intrinsics with a consistent naming convention. I think that matches the behavior of the non-"mt" intrinsics in @rofirrim second approach. We would need to redefine the vcompress intrinsic to drop the "maskedoff" argument which #55 proposes to rename. The current definition would be used for an t version.

I do worry that since tail undisturbed is a valid implementation of tail agnostic in hardware, that a user could use a masked or FMA intrinsic and accidentally become dependent on tail elements being preserved on today's hardware. But have it break in the future.

What do others think?

David-Horner · 2020-12-30T14:12:38Z

I share your concern. The major concern raised during design was application dependence on tail state. Especially with agnostic implementation validation is challenging, thus the recommendation was not to zero as a simple default fill value, but rather use a poison value of ones. As the allowed values are indeed agnostic, any pattern can be provided, but what is not allowed is residual from previous calculations to any of the register set. I advocated for a pseudo random fill value, however, the simplification for verification of ones superseded the "force software to understand what agnostic means". The saving point for my sanity was the opportunity for the compiler/assembler to provide 1) specify agnostic by default 2) make it the default option for associative operations with agnostic tail/mask that which is vs1,vs2 is undefined. (e.g. vadd.vv v4,v8,v12 could equivalently generate vadd.vv v4,v12,v8) 2) -Wflags warn of use of uninitialized tail elements in tail-undisturbed operations. 3) -Xflag - for any tail or mask agnostic operation randomly do any of following before transforming the operation into a tail/mask undisturbed; fully populate the target register with a) another random (but not same) register set (fast on renaming machines) b) a splat of a random X register c) "randomizing" self ops (e.g. viota.m with a random mask) The first three support high performance checking/operation. The last is a performance hit for most implementations.

…

On 2020-12-29 8:21 p.m., Craig Topper wrote: I'd like to continue the discussion of this. Sifive has been working on upstreaming the IR intrinsics into llvm.org and we'll we working on the C intrinsics soon. I'd like to make sure what we're doing is scalable to the future. What we have in our internal branch is to always use a tail undisturbed policy. But many intrinsics don't take an argument for the undisturbed value so it is not controlled by the user. It will just be whatever is in the register that gets picked by the register allocator. It would also create a false dependency on implementations with vector register renaming. So this doesn't seem like a good long term behavior. As we started upstreaming we initially used tail agnostic. A recent change was made upstream to use tail undisturbed for any masked intrinsic, fma intrinsic, or vcompress intrinsic and tail agnostic since for those intrinsics that user would have control and might want to preserve tail elements. We have not added detection of vundefined yet so we always use mask undisturbed. I feel like this has put our implementation into a state that's not easy to explain as there's nothing in the name to tell the user which intrinsics use tail undisturbed and which don't. So I think I like @rofirrim <https://github.com/rofirrim>'s second approach where the tail undisturbed policy is explicit in each intrinsic. I think what I would like to say is that the currently defined C interface is always tail agnostic and we can add new tail undisturbed intrinsics with a consistent naming convention. I think that matches the behavior of the non-"mt" intrinsics in @rofirrim <https://github.com/rofirrim> second approach. We would need to redefine the vcompress intrinsic to drop the "maskedoff" argument which #55 <#55> proposes to rename. The current definition would be used for an /t/ version. I do worry that since tail undisturbed is a valid implementation of tail agnostic in hardware, that a user could use a masked or FMA intrinsic and accidentally become dependent on tail elements being preserved on today's hardware. But have it break in the future. What do others think? — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#27 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AFAWIKJQNH7EYUS43TKTUXLSXJ6AVANCNFSM4OF2XQUQ>.

ebahapo · 2020-12-30T18:47:51Z

I am against encoding the mask and tail preservation policies explicitly in the intrinsics or builtins. Rather, I favor using new function attributes to specify such policies, defaulting to mu and ta.

topperc · 2020-12-30T19:04:40Z

I am against encoding the mask and tail preservation policies explicitly in the intrinsics or builtins. Rather, I favor using new function attributes to specify such policies, defaulting to mu and ta.

I think @rofirrim's proposal only encodes the tail policy into the intrinsic name. The mask policy is based on vundefined being passed.

Tail undisturbed requires at least one additional operand for many of the intrinsics to provide the undisturbed value. So I don't think we accomplish that with just a function attribute unless we make that extra operand always be there and ignored under tail agnostic.

rofirrim · 2020-12-30T19:16:33Z

I wonder if in the line of @ebahapo suggestion we could give an alternative interpretation to the intrinsic based on such attribute (btw I assume Evandro means __attribute__ but correct me if it is not that kind of attribute, or maybe it is an attribute in the call site such as [[...]]?).

For instance one of the operands (e.g. the first) could used for the tail undisturbed. It turns something like vsetvl ... ta,ma; vdest ← vop vsrc1, vsrc2 into something closer to vsetvl ... tu,ma; vdest ← vop vdest vsrc2 (assuming unmasked operations here). This way we avoid the explosion of intrinsics at expense of a slightly more complex semantics as now it would depend on the presence of that attribute.

If our expectation is that tail agnostic is the preferred default, this attribute acting as a modifier might be a reasonable approach.

I think this also disambiguates the case for three-input operations such as FMA, does it? Perhaps I'm missing some nuance here.

zakk0610 · 2021-01-06T07:23:22Z

current RFC has almost 20k intrinsic functions and 5k functions with function overloading (include Zvlsseg extension), definitely we don't want to support 4 variant of the intrinsics...

In fact, there are too much intrinsic functions for RVV programming..
Maybe we should teach intrinsic users how to write a vectorized program with tail agnostic api. (Ex. giving a reduction loop example)
It means rvv intrinsics programming model only support tail agnostic behavior because ta (always?) has a better performance.
I think the one thing users really care about is how to write a high performance rvv code.

tony-cole · 2021-05-10T15:28:29Z

I would like a way to set tu (Tail Undisturbed) in the intrinsics as I use it in this (and other) code: https://godbolt.org/z/cdcavjTs1

Can anyone think of a fast way of doing these operatons with ta?

I'm OK with ta most of the time, but tu is useful for in-situ operations.

In this conversation: riscv/riscv-v-spec#664 (comment) @aswaterman says:

On some implementations, the tail-undisturbed operations will have noticeable overhead, as @solomatnikov mentioned. This overhead generally shouldn't be extreme (often <= 1 cycle per instruction in practice) but will add up.

I've also commented: riscv/riscv-v-spec#664 (comment)

topperc · 2021-05-26T21:58:15Z

In order to mitigate the increase in intrinsics, I'd like to propose adding a tail policy operand to all of the masked intrinsics.

The table would then become. The define name can be renamed/shortened in the implementation.

Masked?	Needs tail preserved	Needs maskedoff preserved	Intrinsic
No	No	N/A	vadd_vv_<ty>_vl(a, b, vl)
No	Yes	N/A	vadd_vv_<ty>_m_vl(undisturbed, all-ones, a, b, vl, TAIL_UNDISTURBED)
Yes	No	No	vadd_vv_<ty>_m_vl(vundefined(), mask, a, b, vl, TAIL_AGNOSTIC)
Yes	No	Yes	vadd_vv_<ty>_m_vl(maskedoff, mask, a, b, vl, TAIL_AGNOSTIC)
Yes	Yes	No	vadd_vv_<ty>_m_vl(undisturbed, mask, a, b, vl, TAIL_UNDISTURBED)
Yes	Yes	Yes	vadd_vv_<ty>_m_vl(undisturbed, mask, a, b, vl, , TAIL_UNDISTURBED)

This still suffers from an inability to specify tu+ma. It would give tu+mu, but tu+ma is likely rare.

Since not all instructions have masked forms we would also need tail undisturbed versions with an extra source operand for at least these instructions

vmv.v.v/vmv.v.x/vmv.v.i
vmerge.vvm/vmerge.vxm/vmerge.vim/vfmerge.vfm 
vadc.vvm/vadc.vxm/vadc.vim
vsbc.vvm/vsbc.vxm

Should we add tail policy to multiply accumulate intrinsics like vfmacc or just say they are always tail agnostic and use the masked version to access tail undisturbed?

rofirrim · 2021-05-27T06:33:59Z

In order to mitigate the increase in intrinsics, I'd like to propose adding a tail policy operand to all of the masked intrinsics.

This looks like a reasonable trade-off without having to introduce m and mt versions as I suggested earlier.

This still suffers from an inability to specify tu+ma. It would give tu+mu, but tu+ma is likely rare.

Again reasonable for me (I have not seen an example where tu+ma is useful yet)

Since not all instructions have masked forms we would also need tail undisturbed versions with an extra source operand for at least these instructions

vmv.v.v/vmv.v.x/vmv.v.i
vmerge.vvm/vmerge.vxm/vmerge.vim/vfmerge.vfm 
vadc.vvm/vadc.vxm/vadc.vim
vsbc.vvm/vsbc.vxm

I'd be inclined to do so.

Bikeshedding: the new suffix for the intrinsics in the table above could be mt(rather than just m) as the extra operands are mask and tail policy. These ones as they don't have mask the suffix could be just t.

Should we add tail policy to multiply accumulate intrinsics like vfmacc or just say they are always tail agnostic and use the masked version to access tail undisturbed?

I'd say the second option as it looks to me as more consistent with the rest of intrinsics.

lattner · 2021-06-03T17:52:20Z

Pulling part of the conversation from a compiler thread, I have a few concerns here:

Vector processors are going to have very different constraints - an OoO core doesn't want register dependencies / partial dependence stalls because it doesn't know if a vector register is carrying data from its previous value or being fully over written. An in order core doesn't have any problem with it. Further future processors may have other constraints that will be different in the decades ahead.
Despite this, we need the C programming model (and the corresponding LLVM IR intrinsic semantics) to remain stable, even though the cores have different performance characteristics. We don't want low-level C intrinsics to paper over the difference, we want to give power to the programmer.
On the third hand, VLS codegen is in a gray zone. That is effectively a higher level abstraction on top of the VLA model, and vsetvli insertion etc for it can absolutely know about the target CPU and generate efficient code for it with different strategies. This is what -mtune etc are for.

I think that all of this together argues that we need:

A predictable VLA intrinsic model and a C API that provides full control over the hardware, even at the expense of programmability, ergonomics, and "compiler optimizability". The compiler wouldn't be able to reschedule or optimize things in general, because all the operations would depend on mutable global state registers.
A useable VLS instruction/intrinsic model that is higher level, effectively allowing C programmers and higher level compilers (MLIR etc) to delegate this aspect of codegen to LLVM. This would be optimizable by the compiler. This isn't just the standard LLVM IR / OpenCL generic vector instructions -- we'd likely also need to add RVV-specific intrinsics for VLS.
It's fine to have clang compiler flags that affect codegen and interpretation of these things, but they should affect the IR generated, not the interpretation of the IR.

Does this make sense or do y'all disagree with this?

-Chris

tony-cole · 2021-06-17T13:29:49Z

I use Tail Undisturbed for many things, for some examples see: riscv/riscv-v-spec#664 (comment)

I don't like the idea of adding all 4 tail options to all the intrinsics, that's madness, and I don't like the compiler choosing which tail policy to use either. I need to be able to specify the tail policy when I need to.

It was mentioned above, but surely the simplest way to do this is have all intrinsics set to a default and the programmer can switch to a specific mode either for a specific instruction (using vsetvl*()) or change the mode (maybe by a compiler switch, #pragma or pseudo function call).

Proposal either:

Add ma/mu & ta/tu parameters to vsetvl*() (This will break existing code); or
Add ma/mu & ta/tu parameters to new vsetvl*() intrinsics e.g. vsetvl*_tail(); or
Add new vsetvl intrinsics with the tail policy specified in the function names, e.g. vsetvl_*_m?_t?() and vsetvlmax_*_m?_t?()

When the programmer wants to use a specific tail policy, they simply place the required tail policy vsetvl intrinsic before the instruction requiring it.

I currently have to do this using an assembly insert, as follows in LLVM:

vl = vsetvl_e16m4(blkCnt);

/* The following seems to force tu policy for the vwadd_wv instruction, only if vsetvl was previously set to the same (_e16m4) */
asm("vsetvli %0, %1, e16,m4,tu,mu"
    : "=r" (vl)
    : "r"  (blkCnt) );

vecAccQ = vwadd_wv_i32m8(vecAccQ, vecAccW, blkCnt); /* Widen again to Q31 and accumulate the vectors */
        
/*
The above outputs double vsetvli, but at least tu is set:
   5f0cc: 57 f7 a6 04   vsetvli a4, a3, e16,m4,ta,mu
   5f0d0: 57 f7 a6 00   vsetvli a4, a3, e16,m4,tu,mu
   5f0d4: 57 28 8e d6   vwadd.wv        v16, v8, v28
*/

kito-cheng · 2021-06-17T13:59:21Z

Adding an extra argument to vsetvli intrinsic means we have same issue as implicit VL model: that introduce an implicit use-def relation between vsetvl and other vector operations.

See #60 for more detail for implicit VL vs explicit VL

topperc · 2021-06-17T15:26:13Z

@tony-cole Which compiler are you using? Adding a vsetvl via inline assembly probably shouldn't work based on the programming model we were trying to implement. If we allow inline assembly to work we can't be sure sew/lmul have been set correctly which we were trying to have the compiler manage.

In order to have the compile manage sew/lmul from the operation intrinsic, the tail policy and mask policy also need to be inferable from operation intrinsic. This way the compiler can provide each instruction with the vsetvl it needs.

My proposal was only to add the tail policy to the intrinsic not the mask policy too. And only to the masked intrinsics and a few other cases. Tail undisturbed only works for intrinsics that have an operand to provide the value for the undisturbed elements. In your example, the vwadd.wv has different destination(v16) and first source register(v8) so the undisturbed elements come from v16 rather than v8.

tony-cole · 2021-06-21T11:14:21Z

@topperc I'm using LLVM 13.0.0, I find it produces better code than the GCC version and I believe it's the one with the more support(?).

In your example, the vwadd.wv has different destination(v16) and first source register(v8) so the undisturbed elements come from v16 rather than v8.

This is the functionality I require.

After reading #60 I see that the LLVM way of doing it (Explicit VL) can produce better optimisations. But as programmer using the vector intrinsics, I need a way to specify ta/tu and ma/mu for all instruction where they can make a difference to the result.

Is there a way to do this easily?

For instance, in C++, we could have default parameters specifying ta/mu, but we can change them if required by adding the additional tu/ma parameters.

In C we could use the variable function or macro parameters (..., __VA_ARGS__ and __VA_OPT__), these could be used to effect a similar default parameter functionality for the intrinsics? This will keep compatibility with existing code, but allow the tail policies to be changed when required in the Explicit way.

topperc · 2021-06-21T16:27:55Z

@topperc I'm using LLVM 13.0.0, I find it produces better code than the GCC version and I believe it's the one with the more support(?).

In your example, the vwadd.wv has different destination(v16) and first source register(v8) so the undisturbed elements come from v16 rather than v8.

This is the functionality I require.

Are you saying the v16 is correct or are you agreeing with me? Your source says vecAccQ = vwadd_wv_i32m8(vecAccQ, vecAccW, blkCnt); so I assume the vecAccQ register for both the destination and first source.

Here's where we are in LLVM today.
-Intrinsics are implemented as builtin aliases rather than macros to keep the header size and compile time down. We're looking at making the header file even smaller.
-LLVM always uses mu today. We should change to ma based on maskedoff being vundefined, but I don't think that's been implemented yet.
-LLVM use tail undisturbed for all masked intrinsics except reductions(which don't only write element 0) and intrinsics that produce mask results like compares. This was chosen as the conservatively correct thing to do because there is a destination register operand and we don't know if the user wants the tail elements preserved. You can pass an all ones mask to these to get a tail undisturbed instruction today, but you'll tie up v0 with the mask. Passing vundefined will switch to ta/mu, though it should be ta/ma.
-LLVM also uses tail undisturbed for vfmacc, vfmadd, vmadd, vmacc, etc. instructions where one source is also the destination due to the instruction encoding. Again we chose the most conservative behavior. Ideally the compiler would switch between vfmacc/vfmadd to improve register allocation, but we can't with tail undisturbed.
-Reductions, vslide*, vcompress, vmv.s.x vfmv.s.f all use tail undisturbed unless vundefined is passed to the dest operand.

A patch #101 has been posted to make some interface changes.
-All masked intrinsics that currently default to tu gain an argument to specify the tail behavior. No default value because we can't implement it in C and the intrinsics aren't macros. We'd also have to default to tu to maintain compatibility, but that's wrong for most uses. So forcing the user to explicitly state their intention is good, but breaks compatibility. Hopefully this is acceptable at this stage since the V extension hasn't been ratified yet.
-vfmacc, vfmadd, vmadd, vmacc will also gain a tail policy argument.
-Unmasked vslide will continue to infer tail policy from dest operand. Masked will get it from the tail policy argument.
-vmv, vmerge, vadc, vsbc gain a dest operand to control tail policy and the tail element value.

We still need to address the intrinsics that don't have a dest operand and don't use a mask. The user can use the masked version to get the tail undisturbed policy at the cost of all ones in v0 and the register pressure that entails. We could teach the compiler to optimize this pattern, or we can add a dest operand to all unmasked intrinsics, or add new intrinsics that have the extra operand. If we got with adding new intrinsics we should do the same for vmv, vmerge, vadc, vsbc, I think.

tony-cole · 2021-06-22T10:59:19Z

Thank you for the info on how the policy is currently selected, is this documented somewhere?

@topperc I'm using LLVM 13.0.0, I find it produces better code than the GCC version and I believe it's the one with the more support(?).

In your example, the vwadd.wv has different destination(v16) and first source register(v8) so the undisturbed elements come from v16 rather than v8.

This is the functionality I require.

Are you saying the v16 is correct or are you agreeing with me? Your source says vecAccQ = vwadd_wv_i32m8(vecAccQ, vecAccW, blkCnt); so I assume the vecAccQ register for both the destination and first source.

Yes, vecAccQ is both the destination and first source, but the compiler uses a different destination register (V16) and moves it back to the source (V8) for the next iteration later on:

Here is my loop code (that accumulates a vector, ready for reduction-sum then mean):

    blkCnt = blockSize;

    /* Perform the accumulates */
    while (blkCnt > 0U)
    {
        vl = vsetvl_e32m4(blkCnt)

        vecIn  = vle32_v_i32m4(pSrc, blkCnt);

        /* The following vwadd_wv instruction requires tail undisturbed policy for this function to work correctly when 
            blockSize > vlmax and (blockSize % vlmax) != 0 */
        /* This seems to force tu policy for the vwadd_wv instruction on the current LLVM compiler */
        asm("vsetvli %0, %1, e32,m4,tu,mu"  : "=r" (vl)  : "r"  (blkCnt) );

        vecAccW = vwadd_wv_i64m8(vecAccW, vecIn, blkCnt); /* Accumulate the vectors */

        pSrc   += vl;
        blkCnt -= vl;
    }

   5f056: 57 f7 26 05   vsetvli a4, a3, e32,m4,ta,mu
   5f05a: 07 6e 05 02   vle32.v v28, (a0)             # Load new vector data in to V28
   5f05e: 57 f7 26 01   vsetvli a4, a3, e32,m4,tu,mu  # The asm insert adds this vsetvl (for tu policy)
   5f062: 57 28 8e d6   vwadd.wv        v16, v8, v28  # Widen V28, accumulate with V8 and store in V16 (using tu policy)
   5f066: 93 17 27 00   slli    a5, a4, 2
   5f06a: 99 8e         sub     a3, a3, a4
   5f06c: 3e 95         add     a0, a0, a5
   5f06e: 57 b4 03 9f   vmv8r.v v8, v16               # Move V16 back to V8
   5f072: f5 f2         bnez    a3, 0x5f056 <arm_mean_q31+0x16>

I want Tail Undisturbed policy so the tail of vecAccW (V16) remains undisturbed (the tail holds previously accumulated data I want to keep). V16 (dest) gets moved back to V8 (source) for the next iteration later on.

It would be better if the compiler used the same register for source and destination (is that a hardware restriction to have different source and destination registers for this instruction?) and then maybe the compiler will switch to tu automatically?

The problem is automatic tu/mu selection is not explicit. From the C level I can't tell what is going on and have to check to the assembly output. Also, will the output be the same in the future? Maybe future compiler releases will change the policy and break my code...

Using a mask register will add code and cycles to set up the mask each loop, or if the setup is taken out of the loop, then extra code is required to conditionally switch between unmasked and masked instructions. Either way not optimal.

Just a thought off the top of my head (not thought this through though): As it's the tail (or masked data) in the destination that we are interested in keeping, could we have additional types (e.g. vint64m8_tumu_t) that signal to the compiler the tail policy is undisturbed (for both tail and mask)? These would have to be fully interchangeable with the original versions so not to add any more intrinsics. Maybe implement with cleaver use of typeof() and _Generic()?

topperc · 2021-06-22T15:56:15Z

Thank you for the info on how the policy is currently selected, is this documented somewhere?

@topperc I'm using LLVM 13.0.0, I find it produces better code than the GCC version and I believe it's the one with the more support(?).
In your example, the vwadd.wv has different destination(v16) and first source register(v8) so the undisturbed elements come from v16 rather than v8.
This is the functionality I require.
Are you saying the v16 is correct or are you agreeing with me? Your source says vecAccQ = vwadd_wv_i32m8(vecAccQ, vecAccW, blkCnt); so I assume the vecAccQ register for both the destination and first source.

Yes, vecAccQ is both the destination and first source, but the compiler uses a different destination register (V16) and moves it back to the source (V8) for the next iteration later on:

Here is my loop code (that accumulates a vector, ready for reduction-sum then mean):
    blkCnt = blockSize;

    /* Perform the accumulates */
    while (blkCnt > 0U)
    {
        vl = vsetvl_e32m4(blkCnt)

        vecIn  = vle32_v_i32m4(pSrc, blkCnt);

        /* The following vwadd_wv instruction requires tail undisturbed policy for this function to work correctly when 
            blockSize > vlmax and (blockSize % vlmax) != 0 */
        /* This seems to force tu policy for the vwadd_wv instruction on the current LLVM compiler */
        asm("vsetvli %0, %1, e32,m4,tu,mu"  : "=r" (vl)  : "r"  (blkCnt) );

        vecAccW = vwadd_wv_i64m8(vecAccW, vecIn, blkCnt); /* Accumulate the vectors */

        pSrc   += vl;
        blkCnt -= vl;
    }

   5f056: 57 f7 26 05   vsetvli a4, a3, e32,m4,ta,mu
   5f05a: 07 6e 05 02   vle32.v v28, (a0)             # Load new vector data in to V28
   5f05e: 57 f7 26 01   vsetvli a4, a3, e32,m4,tu,mu  # The asm insert adds this vsetvl (for tu policy)
   5f062: 57 28 8e d6   vwadd.wv        v16, v8, v28  # Widen V28, accumulate with V8 and store in V16 (using tu policy)
   5f066: 93 17 27 00   slli    a5, a4, 2
   5f06a: 99 8e         sub     a3, a3, a4
   5f06c: 3e 95         add     a0, a0, a5
   5f06e: 57 b4 03 9f   vmv8r.v v8, v16               # Move V16 back to V8
   5f072: f5 f2         bnez    a3, 0x5f056 <arm_mean_q31+0x16>
I want Tail Undisturbed policy so the tail of vecAccW (V16) remains undisturbed (the tail holds previously accumulated data I want to keep). V16 (dest) gets moved back to V8 (source) for the next iteration later on.

Do v8 and v16 contain the same value on the first iteration of the loop? Wouldn't the code require that with the current register allocation? Or does it not matter because the first iteration has the largest VL and you won't read elements past that VL after the loop?

It would be better if the compiler used the same register for source and destination (is that a hardware restriction to have different source and destination registers for this instruction?) and then maybe the compiler will switch to tu automatically?

It's not a hardware restriction. It's the compiler not knowing what's going on.

The problem is automatic tu/mu selection is not explicit. From the C level I can't tell what is going on and have to check to the assembly output. Also, will the output be the same in the future? Maybe future compiler releases will change the policy and break my code...

Using a mask register will add code and cycles to set up the mask each loop, or if the setup is taken out of the loop, then extra code is required to conditionally switch between unmasked and masked instructions. Either way not optimal.

Just a thought off the top of my head (not thought this through though): As it's the tail (or masked data) in the destination that we are interested in keeping, could we have additional types (e.g. vint64m8_tumu_t) that signal to the compiler the tail policy is undisturbed (for both tail and mask)? These would have to be fully interchangeable with the original versions so not to add any more intrinsics. Maybe implement with cleaver use of typeof() and _Generic()?

That wouldn't support cases where the user wants the undisturbed elements to come from a third register. We need to provide an extra agument to the intrinsic to give the user full control.

tony-cole · 2021-06-29T12:02:13Z

Do v8 and v16 contain the same value on the first iteration of the loop? Wouldn't the code require that with the current register allocation?

No and no.

Or does it not matter because the first iteration has the largest VL and you won't read elements past that VL after the loop?

Note: The accumulator vecAccW (v16) is set to zero before the loop (not shown in the above code, sorry!).

Yes it doesn't matter:

If the memory array length (blockSize) is <= VLMAX then there is only 1 iteration and the array length (blockSize) is used for the following vector reduction instruction (vredsum).

If the memory array length (blockSize) is > VLMAX and (blockSize % VLMAX) == 0 then there is more than one iteration. On the last iteration the complete vector register length (VLMAX) is written to and VLMAX is used for the following vector reduction instruction (vredsum).

If the memory array length (blockSize) is > VLMAX and (blockSize % VLMAX) != 0 then there is more than one iteration and on the last iteration the complete vector register (VLMAX) is not written to, this is where I need Tail Undisturbed as I want to preserve the previous tail elements for the following vector reduction (vredsum) instruction on all (VLMAX) elements (not shown in the code above).

It would be better if the compiler used the same register for source and destination (is that a hardware restriction to have different source and destination registers for this instruction?) and then maybe the compiler will switch to tu automatically?

It's not a hardware restriction. It's the compiler not knowing what's going on.

How do I tell the compiler what's going on? Is the asm() insert incorrect or missing information?

The problem is automatic tu/mu selection is not explicit. From the C level I can't tell what is going on and have to check to the assembly output. Also, will the output be the same in the future? Maybe future compiler releases will change the policy and break my code...

Using a mask register will add code and cycles to set up the mask each loop, or if the setup is taken out of the loop, then extra code is required to conditionally switch between unmasked and masked instructions. Either way not optimal.

Just a thought off the top of my head (not thought this through though): As it's the tail (or masked data) in the destination that we are interested in keeping, could we have additional types (e.g. vint64m8_tumu_t) that signal to the compiler the tail policy is undisturbed (for both tail and mask)? These would have to be fully interchangeable with the original versions so not to add any more intrinsics. Maybe implement with cleaver use of typeof() and _Generic()?

That wouldn't support cases where the user wants the undisturbed elements to come from a third register. We need to provide an extra agument to the intrinsic to give the user full control.

I think it would support it. The tail/mask agnostic/undisturbed policy is always for the destination register, so if the undisturbed elements are to come from a third register, then that third register must be the destination register (so it's type could be, say, vint64m8_tumu_t to specify undisturbed).

See: https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#333-vector-tail-agnostic-and-vector-mask-agnostic-vta-and-vma

3.3.3. Vector Tail Agnostic and Vector Mask Agnostic vta and vma
These two bits modify the behavior of destination tail elements and destination inactive masked-off elements respectively during the execution of vector instructions.

Also from the same section is this note:

NOTE: To maintain backward compatibility in the short term and reduce software churn in the move to 0.9, when these flags are not specified on a vsetvli, they should default to mask-undisturbed/tail-undisturbed. The use of vsetvli without these flags should be deprecated, however, such that the specifying a flag setting becomes mandatory. If anything, the default should be tail-agnostic/mask-agnostic, so software has to specify when it cares about the non-participating elements, but given the historical meaning of the instruction prior to introduction of these flags, it is safest to always require them in future assembly code.

So, the compiler should default to tu/mu until a method of specifying tu/mu is available.

nick-knight · 2021-06-29T15:47:20Z

This is just another manifestation of the same problem we are seeing elsewhere (e.g., #84). The V-extension defines several control and status registers which are inadequately exposed to the C programmer.

The vl register is exposed via "explicit-VL" arguments;
The vlenb register is exposed by calling vsetvlmax* intrinsics;
The vsew and vlmul fields of the vtype register are partially[*] exposed by hardcoding them into intrinsic function names;
The vta and vma fields of the vtype register are partially[*] exposed by complicated heuristics that we're debating here;
The vill field of the vtype register is not exposed;
The vstart register is not exposed;
The vxrm and vxsat registers (also accessible through vcsr) are not exposed;
If floating-point is present, the frm and fflags registers (also accessible through fcsr) are not exposed.

[*] I say "partially" because their exposure is static, whereas the assembly language allows them to be dynamic. For example, the intrinsics programmer cannot access the vsetvl instruction (which inputs dynamic vtype).

topperc · 2021-06-29T16:11:39Z

Do v8 and v16 contain the same value on the first iteration of the loop? Wouldn't the code require that with the current register allocation?

No and no.

Or does it not matter because the first iteration has the largest VL and you won't read elements past that VL after the loop?

Note: The accumulator vecAccW (v16) is set to zero before the loop (not shown in the above code, sorry!).

Yes it doesn't matter:

If the memory array length (blockSize) is <= VLMAX then there is only 1 iteration and the array length (blockSize) is used for the following vector reduction instruction (vredsum).

If the memory array length (blockSize) is > VLMAX and (blockSize % VLMAX) == 0 then there is more than one iteration. On the last iteration the complete vector register length (VLMAX) is written to and VLMAX is used for the following vector reduction instruction (vredsum).

If the memory array length (blockSize) is > VLMAX and (blockSize % VLMAX) != 0 then there is more than one iteration and on the last iteration the complete vector register (VLMAX) is not written to, this is where I need Tail Undisturbed as I want to preserve the previous tail elements for the following vector reduction (vredsum) instruction on all (VLMAX) elements (not shown in the code above).

It would be better if the compiler used the same register for source and destination (is that a hardware restriction to have different source and destination registers for this instruction?) and then maybe the compiler will switch to tu automatically?

It's not a hardware restriction. It's the compiler not knowing what's going on.

How do I tell the compiler what's going on? Is the asm() insert incorrect or missing information?

Right now the closet way to tell the compiler what is going on is to use vwadd_wv_i32m8_m(vmset_m_b4(bclkCnt), vecAccQ, vecAccQ, vecAccW, blkCnt) or create a VLMAX all ones mask outside the loop. The vwadd_wv_i32m8_m intrinsic defaults to tu the important thing is that it has a maskedoff operand which must be the same register as the destination. I know this is not ideal and we're very likely going to change vwadd_wv_i32m8_m to default to ta so it is going to break. So the only forward compatible way right now is to use vid.v and vmslt.vx to create a mask and not use tail undisturbed. I'm very sorry about this. I think intrinsic interface was defined before tu existed and tail elements were always zeroed.

The problem is automatic tu/mu selection is not explicit. From the C level I can't tell what is going on and have to check to the assembly output. Also, will the output be the same in the future? Maybe future compiler releases will change the policy and break my code...
Using a mask register will add code and cycles to set up the mask each loop, or if the setup is taken out of the loop, then extra code is required to conditionally switch between unmasked and masked instructions. Either way not optimal.
Just a thought off the top of my head (not thought this through though): As it's the tail (or masked data) in the destination that we are interested in keeping, could we have additional types (e.g. vint64m8_tumu_t) that signal to the compiler the tail policy is undisturbed (for both tail and mask)? These would have to be fully interchangeable with the original versions so not to add any more intrinsics. Maybe implement with cleaver use of typeof() and _Generic()?

That wouldn't support cases where the user wants the undisturbed elements to come from a third register. We need to provide an extra agument to the intrinsic to give the user full control.

I think it would support it. The tail/mask agnostic/undisturbed policy is always for the destination register, so if the undisturbed elements are to come from a third register, then that third register must be the destination register (so it's type could be, say, vint64m8_tumu_t to specify undisturbed).

Like any C/C++ function, the behavior of the intrinsic must defined completely by the values and types of its operands. The assignment operator is considered a separate operation its type cannot be used to define the behavior. It's not required to be there, you could pass the result of the intrinsic directly to another instrinsic or another function call. So we need the value containing the tail undisturbed elements to be passed to the intrinsic. This requires a tail undisturbed vadd intrinsic to take 3 operands instead of 2.

See: https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#333-vector-tail-agnostic-and-vector-mask-agnostic-vta-and-vma

3.3.3. Vector Tail Agnostic and Vector Mask Agnostic vta and vma
These two bits modify the behavior of destination tail elements and destination inactive masked-off elements respectively during the execution of vector instructions.

Also from the same section is this note:

NOTE: To maintain backward compatibility in the short term and reduce software churn in the move to 0.9, when these flags are not specified on a vsetvli, they should default to mask-undisturbed/tail-undisturbed. The use of vsetvli without these flags should be deprecated, however, such that the specifying a flag setting becomes mandatory. If anything, the default should be tail-agnostic/mask-agnostic, so software has to specify when it cares about the non-participating elements, but given the historical meaning of the instruction prior to introduction of these flags, it is safest to always require them in future assembly code.

So, the compiler should default to tu/mu until a method of specifying tu/mu is available.

The compiler does default to tu/mu on every intrinsic that has a dest operand, a maskedoff operand, or is an fma instruction. That operand is needed to tell the compiler which register to pick for the destination. The data flow must be explicit in the C code.

tony-cole · 2021-07-06T10:29:37Z

In the meantime, please can we have a #pragma and/or command line switch to default all instructions to Tail and Mask Undisturbed?

Also, I don't think the compiler should be deciding on the mask/tail policy, I think it should default to the safe Undisturbed policy because as a programmer I expect registers to remain stable and not have some of their data changed without me deciding that's what I want it to do. So, my vote it to make all instruction default Mask/Tail Undisturbed and then have mechanisms to allow programmers to specify otherwise for specific speedups.

zakk0610 · 2021-07-14T02:40:39Z

I believe compiler could select the best tail policy for users because clear api design is doing what user want to do.
It's why we need a new API #101 to model tail policy.

I think #pragma is not workable because different policy have different number of argument. Usually tu need an additional argument to specific tail value.

Hsiangkai · 2021-07-14T07:45:21Z

In the meantime, please can we have a #pragma and/or command line switch to default all instructions to Tail and Mask Undisturbed?

Also, I don't think the compiler should be deciding on the mask/tail policy, I think it should default to the safe Undisturbed policy because as a programmer I expect registers to remain stable and not have some of their data changed without me deciding that's what I want it to do. So, my vote it to make all instruction default Mask/Tail Undisturbed and then have mechanisms to allow programmers to specify otherwise for specific speedups.

We have vl and mask to control which elements are relevant in the vector operations. In most cases, users should not care about the tail elements. We are defining new set of intrinsics with _t and _mt suffixes to control tail policy. With these intrinsics, we could use tail agnostic as the default policy for _m intrinsics.

eopXD · 2022-07-29T08:31:49Z

We have merged RFC #137 that is implemented in GCC and implementing in LLVM that is expected to land soon. This issue is expected to be resolved for v1.0. Closing the issue.

zakk0610 mentioned this issue Dec 29, 2020

maskedoff argument to vcompress intrinsic might not be the best name #55

Closed

topperc mentioned this issue Jun 2, 2021

Intrinsic without side-effect may be deleted by DCE unexpectedly. #96

Closed

Hsiangkai mentioned this issue Jun 17, 2021

Add an argument for tail agnostic/undisturbed. #101

Merged

nick-knight mentioned this issue Jul 6, 2021

drop tail undisturbed/mask agnostic riscv/riscv-v-spec#664

Closed

nick-knight mentioned this issue Sep 23, 2021

Non-rounding averaging add and subtract riscv/riscv-v-spec#739

Open

eopXD closed this as completed Jul 29, 2022

eopXD added the Resolve for v1.0 Feature or problems we will close before the v1.0 release label Jul 29, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Should intrinsic have to model tail elements? #27

Should intrinsic have to model tail elements? #27

zakk0610 commented Jun 23, 2020

nick-knight commented Jun 23, 2020

rdolbeau commented Jun 24, 2020 •

edited

Loading

rofirrim commented Jun 25, 2020

zakk0610 commented Jun 29, 2020

Hsiangkai commented Aug 7, 2020

rofirrim commented Sep 9, 2020

zakk0610 commented Sep 10, 2020

topperc commented Dec 30, 2020

David-Horner commented Dec 30, 2020 via email

ebahapo commented Dec 30, 2020

topperc commented Dec 30, 2020

rofirrim commented Dec 30, 2020

zakk0610 commented Jan 6, 2021

tony-cole commented May 10, 2021

topperc commented May 26, 2021

rofirrim commented May 27, 2021 •

edited

Loading

lattner commented Jun 3, 2021 •

edited

Loading

tony-cole commented Jun 17, 2021

kito-cheng commented Jun 17, 2021

topperc commented Jun 17, 2021

tony-cole commented Jun 21, 2021

topperc commented Jun 21, 2021

tony-cole commented Jun 22, 2021

topperc commented Jun 22, 2021 •

edited

Loading

tony-cole commented Jun 29, 2021

nick-knight commented Jun 29, 2021

topperc commented Jun 29, 2021

tony-cole commented Jul 6, 2021

zakk0610 commented Jul 14, 2021

Hsiangkai commented Jul 14, 2021

eopXD commented Jul 29, 2022

Should intrinsic have to model tail elements? #27

Should intrinsic have to model tail elements? #27

Comments

zakk0610 commented Jun 23, 2020

nick-knight commented Jun 23, 2020

rdolbeau commented Jun 24, 2020 • edited Loading

rofirrim commented Jun 25, 2020

zakk0610 commented Jun 29, 2020

Hsiangkai commented Aug 7, 2020

rofirrim commented Sep 9, 2020

zakk0610 commented Sep 10, 2020

topperc commented Dec 30, 2020

David-Horner commented Dec 30, 2020 via email

ebahapo commented Dec 30, 2020

topperc commented Dec 30, 2020

rofirrim commented Dec 30, 2020

zakk0610 commented Jan 6, 2021

tony-cole commented May 10, 2021

topperc commented May 26, 2021

rofirrim commented May 27, 2021 • edited Loading

lattner commented Jun 3, 2021 • edited Loading

tony-cole commented Jun 17, 2021

kito-cheng commented Jun 17, 2021

topperc commented Jun 17, 2021

tony-cole commented Jun 21, 2021

topperc commented Jun 21, 2021

tony-cole commented Jun 22, 2021

topperc commented Jun 22, 2021 • edited Loading

tony-cole commented Jun 29, 2021

nick-knight commented Jun 29, 2021

topperc commented Jun 29, 2021

tony-cole commented Jul 6, 2021

zakk0610 commented Jul 14, 2021

Hsiangkai commented Jul 14, 2021

eopXD commented Jul 29, 2022

rdolbeau commented Jun 24, 2020 •

edited

Loading

rofirrim commented May 27, 2021 •

edited

Loading

lattner commented Jun 3, 2021 •

edited

Loading

topperc commented Jun 22, 2021 •

edited

Loading