Skip to content
This repository has been archived by the owner on Mar 20, 2024. It is now read-only.

Non-rounding averaging add and subtract #739

Open
dsharletg opened this issue Sep 22, 2021 · 14 comments
Open

Non-rounding averaging add and subtract #739

dsharletg opened this issue Sep 22, 2021 · 14 comments
Labels
Resolve after v1.0 Does not need to be resolved for v1.0 draft

Comments

@dsharletg
Copy link

I noticed that in the v1.0 spec, rounding add and subtract averaging instructions are present (vaadd[u] and vasub[u]), but there do not appear to be any non-rounding versions. Both are very useful, because they can be used in trees to produce averaging trees of minimal bias. For example, (a + 2b + c + 2)/4 = (b + (a + c)/2 + 1)/2 = vaadd(b, missing_non_rounding_averaging(a, c)). With only the rounding vaadd, we can only compute (a + 2b + c + 3)/4 = vaadd(b, vaadd(a, c)), which has worse error.

@nick-knight
Copy link
Contributor

Is it possible to vary the fixed-point rounding mode
https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#38-vector-fixed-point-rounding-mode-register-vxrm
to simulate the desired behavior?

@dsharletg
Copy link
Author

I didn't realize the roundoff_signed functions were configurable. However, I still don't think that helps. We have large trees of these averaging ops to implement particular kernels, with the rounding mode of each one varying and carefully chosen to minimize bias and error. It seems like it would be slow/impractical to change the rounding mode frequently (possibly for each individual instruction).

@nick-knight
Copy link
Contributor

Let me try asking a different way. What are the precise semantics of your proposed missing_non_rounding_averaging(a, c)?

@dsharletg
Copy link
Author

dsharletg commented Sep 22, 2021

We would basically just want ceil_avg(a, b) = (a + b + 1)>>1 and floor_avg(a, b) = (a + b)>>1, so we can use both of these together in larger expressions. For example:

floor_avg(floor_avg(a, c), b) = (a + 2b + c + 0)>>2
floor_avg(ceil_avg(a, c), b) = (a + 2b + c + 1)>>2
ceil_avg(floor_avg(a, c), b) = (a + 2b + c + 2)>>2
ceil_avg(ceil_avg(a, c), b) = (a + 2b + c + 3)>>2

@nick-knight
Copy link
Contributor

nick-knight commented Sep 23, 2021

Sorry, I'm still not completely clear. You seem to expect rational expressions like (a+b)/2 to be interpreted as integers, so you must be assuming some kind of rounding rule when a+b is odd. Here are four possibilities:

  • Round to positive infinity. This can be implemented by vaadd with vxrm == rnu
  • Round to negative infinity. This can be implemented by vaadd with vxrm == rdn
  • Round towards zero. I don't know a single-instruction sequence.
  • Round away from zero. I don't know a single-instruction sequence.

I admit I don't have experience in computing "averaging trees of minimal bias". I'll step back and let others attempt to tackle your question.

@dsharletg
Copy link
Author

dsharletg commented Sep 23, 2021

In my examples, ceil_avg is equivalent to vaadd with vxrm == rnu, and floor_avg is equivalent to vaadd with vxrm == rdn.

The issue is I think we would need to flipping the rounding mode very frequently, which seems like a performance issue. In general, statefulness like this in instructions is quite a bit of a pain to deal with IMHO.

@dsharletg
Copy link
Author

I edited the previous comment to just use shifts (which I assume to have no clever rounding, just truncating).

@dsharletg
Copy link
Author

I see now that shifts are also affected by this rounding register... :( IMO, this really makes the instruction set harder to use for a lot of the code I'm used to writing on ARM and other targets.

@nick-knight
Copy link
Contributor

nick-knight commented Sep 23, 2021

Toggling of the vxrm CSR is performed by scalar instructions, which can often be scheduled far in advance of the associated vector instructions. This is comparable to changing the vector length (vl CSR), and the overheads of that have been discussed at length elsewhere. Moreover, although I don't know the details of your use-case, I suspect the pattern of vxrm changes are often known statically, so can be encoded by an immediate (csrwi). I believe this property could be particularly relevant in the case of out-of-order implementations.

I certainly agree that it would reduce static and dynamic code size if vxrm were encoded directly in the instruction. However, in my opinion, it is unsound reasoning to conclude a performance issue based solely on instruction count, at least without a particular implementation in mind.

@dsharletg
Copy link
Author

It's not really just about performance. There's also the practical matter of it just being kind of a pain to have global state and instructions that are not pure functions:

  • Assembly is harder to read. To know the behavior, you have to find the last time the control register was set, which may be before some control flow.
  • This is an extra constraint for the compiler and HW to consider when attempting to reorder instructions.
  • The only reasonable thing for compilers to do is emit code to change the state register every time one of the affected instructions is used (and then maybe try to optimize redundant setting of the state registers?).

@nick-knight
Copy link
Contributor

I completely agree with your practical concerns. This was already an issue for scalar floating-point's dynamic rounding mode and exception flags (frm and fflags CSRs).

Almost every vector instruction depends on global state --- the vl and vtype CSRs --- and this is already a major headache for the C intrinsics. The current LLVM implementation emits a state change (vset{i}vli) before every instruction, and then performs a succession of dizzying passes in an attempt to elide redundant ones. I spend a fraction of my working hours filing tickets about corner-cases where LLVM's vset{i}vli-elision fails. So your point certainly resonates with me.

As far as I'm aware, the C intrinsics interface hasn't even agreed on how to handle vxrm:
riscv-non-isa/rvv-intrinsic-doc#36
riscv-non-isa/rvv-intrinsic-doc#46
riscv-non-isa/rvv-intrinsic-doc#84
riscv-non-isa/rvv-intrinsic-doc#27 (comment)

@kasanovic
Copy link
Collaborator

Understand and agree with the complaint generally (except that saturation mode is part of instruction encoding). Rounding mode was put in CSR just for encoding space reasons. Also note that in a vector machine, the mode change is only single-cycle occupancy whereas vector operation could run for quite a few cycles, so actual execution time overhead can be low/zero. There were requests for more fixed-point support and plan was to add more later as extensions, and these could include static round versions, but these could chew up a lot of encoding space unless carefully selected. Compiler teams have had success removing extraneous vsetvli instructions using the suggested approach (as Nick mentions), so would definitely try that path.

@kasanovic kasanovic added the Resolve after v1.0 Does not need to be resolved for v1.0 draft label Nov 9, 2021
arcbbb added a commit to arcbbb/riscv-elf-psabi-doc that referenced this issue Mar 18, 2022
There are discussions in riscv/riscv-v-spec#739.
The reason that ISA makes the vx round mode only available in CSR is for encoding space.
I think we can make it non-preserved across calls to use it efficiently.
arcbbb added a commit to arcbbb/riscv-elf-psabi-doc that referenced this issue Mar 18, 2022
There are discussions in riscv/riscv-v-spec#739.
The reason that ISA makes the vx round mode only available in CSR is for encoding space.
I think we can make it non-preserved across calls to use it efficiently.
kito-cheng pushed a commit to riscv-non-isa/riscv-elf-psabi-doc that referenced this issue Jun 20, 2022
After discussion in #287, we realized that keeping the convention of vxsat and vxrm to be unspecified made the vector extension become unusable on the software side, because that might become an ABI breakage once we define that later, so we would like to define that within psABI 1.0 release.

Here is two potential proposals for vxrm and vxsat in #256:
- Same definition as fcsr, and followed the same convention as fenv.
- Define as "not preserved across calls and unspecified upon entry".

Based on the discussion in #256 and few off list discussions, we have better understanding of the usage for vxrm and vxsat registers now, so we propose take the second proposal: “not preserved across calls and the value is unspecified upon entry” as the ABI for the vxrm and vxsat according following reasons:

- Using same policy as floating point environment might cause unnecessary overhead for maintain the convention described in C language specification : "a function call does not alter its caller's floating-point control modes, clear its caller’s floating-point status flags, nor depend on the state of its caller’s floating-point status flags unless the function is so documented"[1]; all existing libraries are *NOT* documented as might change fixed point rounding mode, that means we must backup and restore the content of vxrm and vxsat at those library functions when we implement those functions with vector fixed-point operations to ensure we didn’t violate the convention.

- Another key difference between fixed-point rounding mode and floating point environment is fixed-point didn't have it own default rounding mode, and define one isn't meaningful because fixed-point rounding modes are likely to be setting to specific mode for most fixed-point algorithm, and might be changed frequently within a single algorithm implementation[2] - unlike floating point rounding modes are unlikely to change in most situation.

Based on the above reason, we believe that is the best choice for vxrm and vxsat.

[1] ISO/IEC 9899:2011: 7.6.3
[2] riscv/riscv-v-spec#739
kito-cheng pushed a commit to riscv-non-isa/riscv-elf-psabi-doc that referenced this issue Jun 20, 2022
…fied upon entry

After discussion in #287, we realized that keeping the convention of vxsat and vxrm to be unspecified made the vector extension become unusable on the software side, because that might become an ABI breakage once we define that later, so we would like to define that within psABI 1.0 release.

Here is two potential proposals for vxrm and vxsat in #256:
- Same definition as fcsr, and followed the same convention as fenv.
- Define as "not preserved across calls and unspecified upon entry".

Based on the discussion in #256 and few off list discussions, we have better understanding of the usage for vxrm and vxsat registers now, so we propose take the second proposal: “not preserved across calls and the value is unspecified upon entry” as the ABI for the vxrm and vxsat according following reasons:

- Using same policy as floating point environment might cause unnecessary overhead for maintain the convention described in C language specification : "a function call does not alter its caller's floating-point control modes, clear its caller’s floating-point status flags, nor depend on the state of its caller’s floating-point status flags unless the function is so documented"[1]; all existing libraries are *NOT* documented as might change fixed point rounding mode, that means we must backup and restore the content of vxrm and vxsat at those library functions when we implement those functions with vector fixed-point operations to ensure we didn't violate the convention.

- Another key difference between fixed-point rounding mode and floating point environment is fixed-point didn't have it own default rounding mode, and define one isn't meaningful because fixed-point rounding modes are likely to be setting to specific mode for most fixed-point algorithm, and might be changed frequently within a single algorithm implementation[2] - unlike floating point rounding modes are unlikely to change in most situation.

Based on the above reason, we believe that is the best choice for vxrm and vxsat.

[1] ISO/IEC 9899:2011: 7.6.3
[2] riscv/riscv-v-spec#739
kito-cheng pushed a commit to kito-cheng/riscv-elf-psabi-doc that referenced this issue Jun 20, 2022
…fied upon entry

After discussion in riscv-non-isa#287, we realized that keeping the convention of vxsat and vxrm to be unspecified made the vector extension become unusable on the software side, because that might become an ABI breakage once we define that later, so we would like to define that within psABI 1.0 release.

Here is two potential proposals for vxrm and vxsat in riscv-non-isa#256:
- Same definition as fcsr, and followed the same convention as fenv.
- Define as "not preserved across calls and unspecified upon entry".

Based on the discussion in riscv-non-isa#256 and few off list discussions, we have better understanding of the usage for vxrm and vxsat registers now, so we propose take the second proposal: “not preserved across calls and the value is unspecified upon entry” as the ABI for the vxrm and vxsat according following reasons:

- Using same policy as floating point environment might cause unnecessary overhead for maintain the convention described in C language specification : "a function call does not alter its caller's floating-point control modes, clear its caller’s floating-point status flags, nor depend on the state of its caller’s floating-point status flags unless the function is so documented"[1]; all existing libraries are *NOT* documented as might change fixed point rounding mode, that means we must backup and restore the content of vxrm and vxsat at those library functions when we implement those functions with vector fixed-point operations to ensure we didn't violate the convention.

- Another key difference between fixed-point rounding mode and floating point environment is fixed-point didn't have it own default rounding mode, and define one isn't meaningful because fixed-point rounding modes are likely to be setting to specific mode for most fixed-point algorithm, and might be changed frequently within a single algorithm implementation[2] - unlike floating point rounding modes are unlikely to change in most situation.

Based on the above reason, we believe that is the best choice for vxrm and vxsat.

[1] ISO/IEC 9899:2011: 7.6.3
[2] riscv/riscv-v-spec#739
@Findecanor
Copy link

Findecanor commented May 9, 2023

BTW. Table 5 in section 3.8 of the spec about the fixed-point rounding modes is misleading, and I think should be clarified.

While rounding mode "rdn" truncates the bits, in the context of division in 2's complement arithmetic, it is flooring, not truncating. It is what you'd want to use to get "no rounding" after shift, like srl/sra in the base set.

Rounding mode "rnu" is the "truncating" mode, giving vaadd[u] behaviour like [v]div[u].
Edit: My mistake. "rtz" rounds up only when the value is negative.

@nick-knight
Copy link
Contributor

Agree there is confusion about the term truncation, whose meaning changes when passing from sign-magnitude to two's complement.

Disagree about the relation between RNU and RTZ: they are distinct.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Resolve after v1.0 Does not need to be resolved for v1.0 draft
Projects
None yet
Development

No branches or pull requests

4 participants