-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Should intrinsic have to model tail elements? #27
Comments
My opinion is that all functionality exposed to the RVV assembly programmer should also be exposed to the RVV intrinsics programmer; this includes the |
My 2 cents on the subject... In most case when coding at high level, unused values in the registers (both inactive and tail) are irrelevant and the user doesn't care [1] what's in it. For peel/tail they are not loaded, not computed and not stored. For masked-out values (e.g. convergence loops), it's the same. The one obvious big exception is conditionals in loop, where usually the old value should be kept (either because the condition if false for 'if/then', or we're in the 'else' merging with the 'then' for 'if/then/else'). So in my opinion the default behavior should be 'keep whatever is in there' (undisturbed), as it should be quite enough for most use cases (as it satisfies both 'don't care' and 'merge' semantic). That's also why vundefined() is in there - it explicits the 'don't care' by inputing undefined value in the 'merge' semantic. For user in need of a specific value instead of undefined(), they can easily use an explicit intrinsic or a value as an input. So in v0.9-speak, default current behavior should be either {mu, tu} (if the maskedoff input is valid) or {ma,ta} (if it's vundefined() or not present in the intrinsics, thus implicitly undefined, e.g. the unmasked intrinsics). The question is, how to also model {ma,tu} and {mu,ta} in the intrinsics... and how does that interact with the above? In the current design, having {,tu} or {mu,} only works if there is a valid maskedoff parameter, otherwise the user can't specify the values to be left undisturbed... but from my reading of v0.9, the non-masked variant should be eligible for {*,tu} (not that {mu,tu} makes much sense w/o a mask...). A simple solution would be to add a set of 'tail agnostic' intrinsics to the masked ones. So the behavior would be: a) no maskedoff parameter, or maskedoff is vundefined(): {ma,ta} This does not give access to {ma,tu} unfortunately - though to be honest I don't see a use case at the moment for that combination. Would hardware be able to do better on {ma,tu} than on {mu,tu}? [and case b) and c) could be reversed to default to _ta and require a specification for _tu, which probably makes more sense in practice] Another solution is to add vta/vma to vsetvl() but a) I don't like hidden state in high-level code and b) {mu,tu}, {mu,ta} and {ma,tu} don't make much sense with vundefined() or any intrinsics without a maskedoff parameter... A third solution would be to assume for {ma,} or {mu,} as above, but rely on the hidden state for {,ta} and {,tu}. It's sort of half-and-half... [the heavyweight solution is to have 4 variant of the intrinsics: _ma_ta (not maskedoff parameter needed), _mu_tu, _ma_tu and _mu_ta (all three with a mandatory maskedoff parameter), drop vundefined(), and let the user do all the work... perhaps with just _ta (no maskedoff) and _tu (mandatory maskedoff) for non-masked intrinsics... the _ma or _mu bit would take over the current _m prefix for the masked version]. [1] Except for security reasons perhaps (avoiding information leakage), but that's out-of-scope here I think. |
In our prototype (which does not account for float16 and fractional LMUL yet) we already have ~7500 intrinsics. Multiplying them by 4 to cover all the combinations is a bit disheartening. One option is to add a
Then, for masked operations:
An issue with this approach is that A way to address this is to add a new version of the masked intrinsics (with suffix, say,
We can look at it in the other way round, depending on what are the user needs (note that if something is not needed to be preserved it would still be correct, though unnecessary, to preserve it).
1 A downside with this approach is that now we have to specify the The table above would look like this now:
3we believe this case is not very common, so implementing it as I might prefer the second approach: preserving the tail is not that usual to justify impacting all the unmasked operations. Also the softening required may also be useful in other circumstances so it seems it is something a compiler will want to implement anyways. |
current RFC has almost 20k intrinsic functions and 5k functions with function overloading (include Zvlsseg extension), definitely we don't want to support 4 variant of the intrinsics... |
Agree with @rofirrim's second approach. Provide another set of intrinsics ending with |
My colleagues and I were looking today at some code that uses The reason is that the clobbered input is already what is going to be "undisturbed" in the final result. So the intrinsic C prototype of the Does this make sense? |
Yes, it makes sense to me. |
I'd like to continue the discussion of this. Sifive has been working on upstreaming the IR intrinsics into llvm.org and we'll we working on the C intrinsics soon. I'd like to make sure what we're doing is scalable to the future. What we have in our internal branch is to always use a tail undisturbed policy. But many intrinsics don't take an argument for the undisturbed value so it is not controlled by the user. It will just be whatever is in the register that gets picked by the register allocator. It would also create a false dependency on implementations with vector register renaming. So this doesn't seem like a good long term behavior. As we started upstreaming we initially used tail agnostic. A recent change was made upstream to use tail undisturbed for any masked intrinsic, fma intrinsic, or vcompress intrinsic and tail agnostic since for those intrinsics that user would have control and might want to preserve tail elements. We have not added detection of vundefined yet so we always use mask undisturbed. I feel like this has put our implementation into a state that's not easy to explain as there's nothing in the name to tell the user which intrinsics use tail undisturbed and which don't. So I think I like @rofirrim's second approach where the tail undisturbed policy is explicit in each intrinsic. I think what I would like to say is that the currently defined C interface is always tail agnostic and we can add new tail undisturbed intrinsics with a consistent naming convention. I think that matches the behavior of the non-"mt" intrinsics in @rofirrim second approach. We would need to redefine the vcompress intrinsic to drop the "maskedoff" argument which #55 proposes to rename. The current definition would be used for an t version. I do worry that since tail undisturbed is a valid implementation of tail agnostic in hardware, that a user could use a masked or FMA intrinsic and accidentally become dependent on tail elements being preserved on today's hardware. But have it break in the future. What do others think? |
I share your concern.
The major concern raised during design was application dependence on
tail state.
Especially with agnostic implementation validation is challenging, thus
the recommendation was not to zero as a simple default fill value, but
rather use a poison value of ones.
As the allowed values are indeed agnostic, any pattern can be provided,
but what is not allowed is residual from previous calculations to any of
the register set.
I advocated for a pseudo random fill value, however, the simplification
for verification of ones superseded the "force software to understand
what agnostic means".
The saving point for my sanity was the opportunity for the
compiler/assembler to provide
1) specify agnostic by default
2) make it the default option for associative operations with agnostic
tail/mask that which is vs1,vs2 is undefined. (e.g. vadd.vv v4,v8,v12
could equivalently generate vadd.vv v4,v12,v8)
2) -Wflags warn of use of uninitialized tail elements in
tail-undisturbed operations.
3) -Xflag - for any tail or mask agnostic operation randomly do any of
following before transforming the operation into a tail/mask undisturbed;
fully populate the target register with
a) another random (but not same) register set (fast on
renaming machines)
b) a splat of a random X register
c) "randomizing" self ops (e.g. viota.m with a random mask)
The first three support high performance checking/operation.
The last is a performance hit for most implementations.
…On 2020-12-29 8:21 p.m., Craig Topper wrote:
I'd like to continue the discussion of this. Sifive has been working
on upstreaming the IR intrinsics into llvm.org and we'll we working on
the C intrinsics soon. I'd like to make sure what we're doing is
scalable to the future.
What we have in our internal branch is to always use a tail
undisturbed policy. But many intrinsics don't take an argument for the
undisturbed value so it is not controlled by the user. It will just be
whatever is in the register that gets picked by the register
allocator. It would also create a false dependency on implementations
with vector register renaming. So this doesn't seem like a good long
term behavior.
As we started upstreaming we initially used tail agnostic. A recent
change was made upstream to use tail undisturbed for any masked
intrinsic, fma intrinsic, or vcompress intrinsic and tail agnostic
since for those intrinsics that user would have control and might want
to preserve tail elements. We have not added detection of vundefined
yet so we always use mask undisturbed.
I feel like this has put our implementation into a state that's not
easy to explain as there's nothing in the name to tell the user which
intrinsics use tail undisturbed and which don't. So I think I like
@rofirrim <https://github.com/rofirrim>'s second approach where the
tail undisturbed policy is explicit in each intrinsic.
I think what I would like to say is that the currently defined C
interface is always tail agnostic and we can add new tail undisturbed
intrinsics with a consistent naming convention. I think that matches
the behavior of the non-"mt" intrinsics in @rofirrim
<https://github.com/rofirrim> second approach. We would need to
redefine the vcompress intrinsic to drop the "maskedoff" argument
which #55 <#55>
proposes to rename. The current definition would be used for an /t/
version.
I do worry that since tail undisturbed is a valid implementation of
tail agnostic in hardware, that a user could use a masked or FMA
intrinsic and accidentally become dependent on tail elements being
preserved on today's hardware. But have it break in the future.
What do others think?
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#27 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AFAWIKJQNH7EYUS43TKTUXLSXJ6AVANCNFSM4OF2XQUQ>.
|
I am against encoding the mask and tail preservation policies explicitly in the intrinsics or builtins. Rather, I favor using new function attributes to specify such policies, defaulting to |
I think @rofirrim's proposal only encodes the tail policy into the intrinsic name. The mask policy is based on vundefined being passed. Tail undisturbed requires at least one additional operand for many of the intrinsics to provide the undisturbed value. So I don't think we accomplish that with just a function attribute unless we make that extra operand always be there and ignored under tail agnostic. |
I wonder if in the line of @ebahapo suggestion we could give an alternative interpretation to the intrinsic based on such attribute (btw I assume Evandro means For instance one of the operands (e.g. the first) could used for the tail undisturbed. It turns something like If our expectation is that tail agnostic is the preferred default, this attribute acting as a modifier might be a reasonable approach. I think this also disambiguates the case for three-input operations such as FMA, does it? Perhaps I'm missing some nuance here. |
In fact, there are too much intrinsic functions for RVV programming.. |
I would like a way to set tu (Tail Undisturbed) in the intrinsics as I use it in this (and other) code: https://godbolt.org/z/cdcavjTs1 Can anyone think of a fast way of doing these operatons with ta? I'm OK with ta most of the time, but tu is useful for in-situ operations. In this conversation: riscv/riscv-v-spec#664 (comment) @aswaterman says:
I've also commented: riscv/riscv-v-spec#664 (comment) |
In order to mitigate the increase in intrinsics, I'd like to propose adding a tail policy operand to all of the masked intrinsics. The table would then become. The define name can be renamed/shortened in the implementation.
This still suffers from an inability to specify tu+ma. It would give tu+mu, but tu+ma is likely rare. Since not all instructions have masked forms we would also need tail undisturbed versions with an extra source operand for at least these instructions
Should we add tail policy to multiply accumulate intrinsics like vfmacc or just say they are always tail agnostic and use the masked version to access tail undisturbed? |
This looks like a reasonable trade-off without having to introduce
Again reasonable for me (I have not seen an example where
I'd be inclined to do so. Bikeshedding: the new suffix for the intrinsics in the table above could be
I'd say the second option as it looks to me as more consistent with the rest of intrinsics. |
Pulling part of the conversation from a compiler thread, I have a few concerns here:
I think that all of this together argues that we need:
Does this make sense or do y'all disagree with this? -Chris |
I use Tail Undisturbed for many things, for some examples see: riscv/riscv-v-spec#664 (comment) I don't like the idea of adding all 4 tail options to all the intrinsics, that's madness, and I don't like the compiler choosing which tail policy to use either. I need to be able to specify the tail policy when I need to. It was mentioned above, but surely the simplest way to do this is have all intrinsics set to a default and the programmer can switch to a specific mode either for a specific instruction (using Proposal either:
When the programmer wants to use a specific tail policy, they simply place the required tail policy vsetvl intrinsic before the instruction requiring it. I currently have to do this using an assembly insert, as follows in LLVM:
|
Adding an extra argument to vsetvli intrinsic means we have same issue as implicit VL model: that introduce an implicit use-def relation between vsetvl and other vector operations. See #60 for more detail for implicit VL vs explicit VL |
@tony-cole Which compiler are you using? Adding a vsetvl via inline assembly probably shouldn't work based on the programming model we were trying to implement. If we allow inline assembly to work we can't be sure sew/lmul have been set correctly which we were trying to have the compiler manage. In order to have the compile manage sew/lmul from the operation intrinsic, the tail policy and mask policy also need to be inferable from operation intrinsic. This way the compiler can provide each instruction with the vsetvl it needs. My proposal was only to add the tail policy to the intrinsic not the mask policy too. And only to the masked intrinsics and a few other cases. Tail undisturbed only works for intrinsics that have an operand to provide the value for the undisturbed elements. In your example, the vwadd.wv has different destination(v16) and first source register(v8) so the undisturbed elements come from v16 rather than v8. |
@topperc I'm using LLVM 13.0.0, I find it produces better code than the GCC version and I believe it's the one with the more support(?).
This is the functionality I require. After reading #60 I see that the LLVM way of doing it (Explicit VL) can produce better optimisations. But as programmer using the vector intrinsics, I need a way to specify ta/tu and ma/mu for all instruction where they can make a difference to the result. Is there a way to do this easily? For instance, in C++, we could have default parameters specifying ta/mu, but we can change them if required by adding the additional tu/ma parameters. In C we could use the variable function or macro parameters ( |
Are you saying the v16 is correct or are you agreeing with me? Your source says Here's where we are in LLVM today. A patch #101 has been posted to make some interface changes. We still need to address the intrinsics that don't have a |
Thank you for the info on how the policy is currently selected, is this documented somewhere?
Yes, vecAccQ is both the destination and first source, but the compiler uses a different destination register (V16) and moves it back to the source (V8) for the next iteration later on: Here is my loop code (that accumulates a vector, ready for reduction-sum then mean):
I want Tail Undisturbed policy so the tail of vecAccW (V16) remains undisturbed (the tail holds previously accumulated data I want to keep). V16 (dest) gets moved back to V8 (source) for the next iteration later on. It would be better if the compiler used the same register for source and destination (is that a hardware restriction to have different source and destination registers for this instruction?) and then maybe the compiler will switch to tu automatically? The problem is automatic tu/mu selection is not explicit. From the C level I can't tell what is going on and have to check to the assembly output. Also, will the output be the same in the future? Maybe future compiler releases will change the policy and break my code... Using a mask register will add code and cycles to set up the mask each loop, or if the setup is taken out of the loop, then extra code is required to conditionally switch between unmasked and masked instructions. Either way not optimal. Just a thought off the top of my head (not thought this through though): As it's the tail (or masked data) in the destination that we are interested in keeping, could we have additional types (e.g. vint64m8_tumu_t) that signal to the compiler the tail policy is undisturbed (for both tail and mask)? These would have to be fully interchangeable with the original versions so not to add any more intrinsics. Maybe implement with cleaver use of typeof() and _Generic()? |
Do v8 and v16 contain the same value on the first iteration of the loop? Wouldn't the code require that with the current register allocation? Or does it not matter because the first iteration has the largest VL and you won't read elements past that VL after the loop?
It's not a hardware restriction. It's the compiler not knowing what's going on.
That wouldn't support cases where the user wants the undisturbed elements to come from a third register. We need to provide an extra agument to the intrinsic to give the user full control. |
No and no.
Note: The accumulator vecAccW (v16) is set to zero before the loop (not shown in the above code, sorry!). Yes it doesn't matter: If the memory array length (blockSize) is <= VLMAX then there is only 1 iteration and the array length (blockSize) is used for the following vector reduction instruction (vredsum). If the memory array length (blockSize) is > VLMAX and (blockSize % VLMAX) == 0 then there is more than one iteration. On the last iteration the complete vector register length (VLMAX) is written to and VLMAX is used for the following vector reduction instruction (vredsum). If the memory array length (blockSize) is > VLMAX and (blockSize % VLMAX) != 0 then there is more than one iteration and on the last iteration the complete vector register (VLMAX) is not written to, this is where I need Tail Undisturbed as I want to preserve the previous tail elements for the following vector reduction (vredsum) instruction on all (VLMAX) elements (not shown in the code above).
How do I tell the compiler what's going on? Is the asm() insert incorrect or missing information?
I think it would support it. The tail/mask agnostic/undisturbed policy is always for the destination register, so if the undisturbed elements are to come from a third register, then that third register must be the destination register (so it's type could be, say, vint64m8_tumu_t to specify undisturbed).
Also from the same section is this note:
So, the compiler should default to tu/mu until a method of specifying tu/mu is available. |
This is just another manifestation of the same problem we are seeing elsewhere (e.g., #84). The V-extension defines several control and status registers which are inadequately exposed to the C programmer.
[*] I say "partially" because their exposure is static, whereas the assembly language allows them to be dynamic. For example, the intrinsics programmer cannot access the |
Right now the closet way to tell the compiler what is going on is to use
Like any C/C++ function, the behavior of the intrinsic must defined completely by the values and types of its operands. The assignment operator is considered a separate operation its type cannot be used to define the behavior. It's not required to be there, you could pass the result of the intrinsic directly to another instrinsic or another function call. So we need the value containing the tail undisturbed elements to be passed to the intrinsic. This requires a tail undisturbed vadd intrinsic to take 3 operands instead of 2.
The compiler does default to tu/mu on every intrinsic that has a dest operand, a maskedoff operand, or is an fma instruction. That operand is needed to tell the compiler which register to pick for the destination. The data flow must be explicit in the C code. |
In the meantime, please can we have a Also, I don't think the compiler should be deciding on the mask/tail policy, I think it should default to the safe Undisturbed policy because as a programmer I expect registers to remain stable and not have some of their data changed without me deciding that's what I want it to do. So, my vote it to make all instruction default Mask/Tail Undisturbed and then have mechanisms to allow programmers to specify otherwise for specific speedups. |
I believe compiler could select the best tail policy for users because clear api design is doing what user want to do. I think |
We have vl and mask to control which elements are relevant in the vector operations. In most cases, users should not care about the tail elements. We are defining new set of intrinsics with |
We have merged RFC #137 that is implemented in GCC and implementing in LLVM that is expected to land soon. This issue is expected to be resolved for v1.0. Closing the issue. |
current intrinsic RFC does not model tail element,
but someone maybe want to set the tail undisturbed like riscv/riscv-v-spec#157 (comment).
I think maybe we can only provide additional tail operand for reduction intrinsic functions to model tail elements.
any idea?
The text was updated successfully, but these errors were encountered: