Stop optimizing under the assumption of default denormal behavior #123123

Open
DemiMarie opened this issue Mar 27, 2024 · 13 comments
Labels
A-floating-point - Area: Floating point numbers and arithmetic
C-discussion - Category: Discussion or questions that doesn't represent real issues.
T-compiler - Relevant to the compiler team, which will review and decide on the PR/issue.
T-libs - Relevant to the library team, which will review and decide on the PR/issue.

Comments

@DemiMarie
Contributor

On some platforms (QNX on Arm comes to mind, but there are probably others) flush-to-zero and denormals-are-zero are turned on by default and cannot be disabled. Furthermore, PipeWire turns on FTZ and DAZ because it is the only way to get realtime behavior (otherwise, denormals are handled by microcode assists or traps that are far too slow).

Right now, this means that one cannot use Rust for realtime code that performs floating point operations, which I think is silly. Instead, I think Rust should treat compile-time operations on floating-point numbers as partial functions. When a denormal or NaN is hit, Rust should defer the operation until runtime, when the actual behavior is known.
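
A minimal sketch of the divergence being described, assuming a target where FTZ/DAZ is (or can be) enabled; the names are illustrative only:

```rust
fn main() {
    // f32::MIN_POSITIVE is the smallest positive *normal* f32 (about 1.18e-38).
    // Halving it yields a subnormal under IEEE 754 semantics.
    const FOLDED: f32 = f32::MIN_POSITIVE * 0.5; // folded at compile time: subnormal

    // The same computation, shielded from constant folding by black_box,
    // observes the hardware's actual behavior at runtime: zero if FTZ is on.
    let runtime = std::hint::black_box(f32::MIN_POSITIVE) * std::hint::black_box(0.5_f32);

    // On an FTZ target these disagree: FOLDED is subnormal, runtime is 0.0.
    println!("folded = {FOLDED:e}, runtime = {runtime:e}");
}
```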

@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Mar 27, 2024
@workingjubilee
Contributor

It has been decades since hardware implementations of subnormal numbers with latency within close range of other hardware floating point instructions were described, and at least a decade since they were implemented in hardware and became pervasive. On the other hand, data-dependent timing has become so omnipresent, even with integral operations, that one must now deliberately enable flags in hardware to make timings data-independent. So what you say is not the full story, and neither is what I just said, and it might be best if these problems were described in a more parameterized way, so that we can quantify these cases better when considering them and have a complete picture.

In particular, I suspect what you describe as a feature of QNX is moreso a feature of particular hardware implementations that QNX has historically supported. I am more willing to believe I am Simply Wrong on that count, but specifically, some 32-bit Arm FPUs are always-FTZ, if I remember correctly, and even then it's a "...sortof", often.

@the8472
Member

the8472 commented Mar 27, 2024

> It has been decades since hardware implementations of subnormal numbers with latency within close range of other hardware floating point instructions were described

Perhaps not everyone got the memo? We've had reports of SIMD performance tanking by a factor of 10 on Intel CPUs when denormals are involved. #116359 and https://users.rust-lang.org/t/unexplained-order-of-magnitude-drop-in-performance/100406

@DemiMarie
Contributor Author

> It has been decades since hardware implementations of subnormal numbers with latency within close range of other hardware floating point instructions were described

> Perhaps not everyone got the memo? We've had reports of SIMD performance tanking by a factor of 10 on Intel CPUs when denormals are involved. #116359 and https://users.rust-lang.org/t/unexplained-order-of-magnitude-drop-in-performance/100406

Indeed, I’m pretty sure that many FPUs simply don’t bother handling denormals in hardware. You either enable FTZ and DAZ, or you trap to microcode (x86) or privileged code (Arm, RISC-V) for slow software emulation.

Also, even if FTZ/DAZ isn’t necessary on new hardware (which I am not at all certain of), people will keep doing it because they want to support old hardware. Since some libraries (like PipeWire!) will do it behind your back, I think this needs to be fixed in Rust.

In PipeWire, I believe that a denormal represents a sound that is far below the threshold of hearing. In this context, gradual underflow is not at all important, whereas avoiding latency spikes is important. The hardware designers expose FTZ/DAZ for a reason, and PipeWire is absolutely correct to turn them on.

> In particular, I suspect what you describe as a feature of QNX is moreso a feature of particular hardware implementations that QNX has historically supported. I am more willing to believe I am Simply Wrong on that count, but specifically, some 32-bit Arm FPUs are always-FTZ, if I remember correctly, and even then it's a "...sortof", often.

That just means that QNX doesn’t provide the software emulation code. If you try to disable FTZ/DAZ (“RunFast mode” in Arm parlance) you either find that you cannot (if disabling requires OS support) or you get exceptions.

RunFast mode (FTZ/DAZ/no exceptions) is what Arm CPUs are optimized for. Turning it off results in a performance penalty (due to reduced reordering opportunities in hardware) even if one doesn’t encounter denormals at all. Furthermore, Rust isn’t just intended for applications. It’s also intended for libraries, and those might be called with or without FTZ/DAZ. Whether or not this is UB, people will do it anyway, because it will solve real-world performance problems in their applications. It’s much better for Rust to accept that FTZ/DAZ exist, and be prepared for them.

Proposed concrete semantics

Rust assumes that operations on normal floating point numbers obey IEEE 754 with round-to-nearest. However, Rust does not assume that operations on denormal floating point numbers, infinities, or NaNs obey IEEE 754. This is to support systems that require Flush To Zero (FTZ)/Denormals Are Zero (DAZ), either for performance reasons or because FTZ/DAZ is all that is supported.
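
For illustration only (not part of the proposed wording), a sketch of runtime probes for the two behaviors in question; `black_box` keeps each check from being folded under the compile-time semantics assumed above:

```rust
use std::hint::black_box;

/// FTZ-style behavior: a subnormal *result* is flushed to zero.
fn flushes_subnormal_results() -> bool {
    // The exact product of these normal inputs is subnormal.
    black_box(f32::MIN_POSITIVE) * black_box(0.5_f32) == 0.0
}

/// DAZ-style behavior: a subnormal *input* is treated as zero.
fn treats_subnormal_inputs_as_zero() -> bool {
    // Built from a bit pattern so no arithmetic can flush it beforehand.
    let smallest_subnormal = black_box(f32::from_bits(1)); // about 1.4e-45
    // The exact product is a *normal* number (about 1.4e-36), so a zero here
    // means the input was flushed, not the output.
    smallest_subnormal * black_box(1.0e9_f32) == 0.0
}

fn main() {
    println!("FTZ-like: {}", flushes_subnormal_results());
    println!("DAZ-like: {}", treats_subnormal_inputs_as_zero());
}
```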

Exceptions to IEEE 754

The following exceptions to IEEE 754 behavior are permitted:

  1. Denormal input(s) may or may not be flushed to zero before completing the operation.
  2. If the output is a denormal number, it may or may not be flushed to zero. The check for the output being denormal may be performed before or after rounding, or it may be performed in both places.
  3. If either input is NaN, the NaN payload may or may not be preserved.
  4. If the output is NaN, the payload is unspecified.

Interaction with FFI/inline assembler

There are two reasonable options here I can think of:

  1. Floating point operations are allowed to be reordered past FFI calls or inline assembler. Therefore, if an FFI call or inline assembler changes the floating-point control word in a way that affects the FTZ/DAZ flags, the change may take effect at an unspecified point in the past or future. This may cause non-deterministic results, but will not cause undefined behavior (see the sketch after this list).
  2. Floating point operations are not allowed to be reordered past FFI calls or inline assembler.
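
As referenced in option 1, a hedged, x86_64-only sketch of the hazard being described. `_MM_SET_FLUSH_ZERO_MODE` and its constants are real `std::arch::x86_64` items (whose use to change the float environment is itself contentious in Rust, which is part of what this issue is about); whether the compiler may move the multiply across the mode changes is the open question.

```rust
#[cfg(target_arch = "x86_64")]
fn ftz_scoped_multiply(a: f32, b: f32) -> f32 {
    use std::arch::x86_64::{_MM_FLUSH_ZERO_OFF, _MM_FLUSH_ZERO_ON, _MM_SET_FLUSH_ZERO_MODE};

    unsafe { _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON) };
    // Under option 1 the compiler may hoist or sink this multiply past the
    // surrounding mode changes, so it may observe either FTZ state:
    // non-deterministic, but not undefined behavior.
    let y = a * b;
    unsafe { _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_OFF) };
    y
}
```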

@workingjubilee
Contributor

> It has been decades since hardware implementations of subnormal numbers with latency within close range of other hardware floating point instructions were described

> Perhaps not everyone got the memo? We've had reports of SIMD performance tanking by a factor of 10 on Intel CPUs when denormals are involved. #116359 and https://users.rust-lang.org/t/unexplained-order-of-magnitude-drop-in-performance/100406

The trap-to-handle-in-kernel behavior mentioned results in a slowdown of several orders of magnitude, not merely a factor of 10. It also should not be the case that setting the x86-64-v2 featureset removes the slowdown if the problem is fundamentally denormal numbers. It seems more likely to be a problem in handling the high-half of the float registers i.e. SIMD registers, and false dependencies (possibly combined with other issues).

@workingjubilee
Contributor

> Indeed, I’m pretty sure that many FPUs simply don’t bother handling denormals in hardware. You either enable FTZ and DAZ, or you trap to microcode (x86) or privileged code (Arm, RISC-V) for slow software emulation.

AArch64 CPUs do, evidently.

And software emulation is often required for floats anyway on 32-bit CPUs, such that your choice is between "software emulation" and "software emulation". More precisely, even when such CPUs have an FPU, it often handles only 32-bit floats and not 64-bit floats, which forces the software calls in any case. It may, for instance, be more useful to change Rust's presumptive-float-inference behavior on such CPUs to presumptively infer a 32-bit float instead.

@DemiMarie
Contributor Author

> Indeed, I’m pretty sure that many FPUs simply don’t bother handling denormals in hardware. You either enable FTZ and DAZ, or you trap to microcode (x86) or privileged code (Arm, RISC-V) for slow software emulation.

> AArch64 CPUs do, evidently.

Do all of them?

In any case, I still think Rust should be able to handle environments where FTZ and DAZ are in use. There are too many things that set it.

@workingjubilee
Contributor

Possibly not all of them! But if we're willing to talk about CPUs in broad generalities and infer based on a few bug reports, I'm willing to take the aforementioned bug in https://users.rust-lang.org/t/unexplained-order-of-magnitude-drop-in-performance/100406 as suggestive, yes.

> It seems more likely to be a problem in handling the high-half of the float registers i.e. SIMD registers, and false dependencies (possibly combined with other issues).

I hadn't even looked at #116359 and lo, I was correct on it being a high-half issue... though I didn't fully expect it to be the high half of xmm, honestly. Amusingly poor form on Intel's part.

> In any case, I still think Rust should be able to handle environments where FTZ and DAZ are in use. There are too many things that set it.

With that I agree, but I think it is probably inadvisable to jump to assuming a global change to Rust's float semantics is the best way to achieve it.

@DemiMarie
Contributor Author

> In any case, I still think Rust should be able to handle environments where FTZ and DAZ are in use. There are too many things that set it.

> With that I agree, but I think it is probably inadvisable to jump to assuming a global change to Rust's float semantics is the best way to achieve it.

Code running in these environments might not know that it is going to run in these environments. It might be running as a library in an application that sets the mode, for instance.

@workingjubilee
Contributor

The same can be applied to every setting in the floating-point CSRs. Are you proposing we suspend all floating-point optimizations whatsoever?

@DemiMarie
Contributor Author

> The same can be applied to every setting in the floating-point CSRs. Are you proposing we suspend all floating-point optimizations whatsoever?

I thought FTZ and DAZ were used by way more applications than all the others combined.

@workingjubilee
Contributor

It is very common in interval arithmetic, which offers rigorous computation with strictly bounded errors, even in the face of the floating point... "quirks".

Even without that domain of mathematics, manipulation of the MXCSR is performed in order to have a fast fused-multiply-add implementation that upholds the "1 rounding" constraint on SSE2-only CPUs. Thus it occurs in almost every application that is linked against libm and executes on an old-enough x86 CPU. This is done in a scoped fashion which does not impede optimizing the rest of the program, which is why I think it is better to first consider what can be done without simply relegating the problem to global state.
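
A hedged sketch of that scoped pattern as it might look in Rust, using the real `_mm_getcsr`/`_mm_setcsr` intrinsics from `std::arch::x86_64`; the guard type itself is hypothetical, and FTZ/DAZ correspond to MXCSR bits 15 and 6 respectively:

```rust
#[cfg(target_arch = "x86_64")]
mod scoped_mxcsr {
    use std::arch::x86_64::{_mm_getcsr, _mm_setcsr};

    /// Hypothetical RAII guard: save MXCSR, set some bits, restore on drop,
    /// so the altered float environment is confined to one scope instead of
    /// becoming process-global state.
    pub struct MxcsrGuard {
        saved: u32,
    }

    impl MxcsrGuard {
        /// # Safety
        /// The caller must ensure no code that relies on the default float
        /// environment runs (or is moved by the optimizer) inside the scope.
        pub unsafe fn set_bits(bits: u32) -> Self {
            let saved = unsafe { _mm_getcsr() };
            unsafe { _mm_setcsr(saved | bits) };
            MxcsrGuard { saved }
        }
    }

    impl Drop for MxcsrGuard {
        fn drop(&mut self) {
            unsafe { _mm_setcsr(self.saved) };
        }
    }
}
```

For example, a hot loop could hold a guard created with `set_bits(0x8000 | 0x0040)` (FTZ | DAZ) and drop it at the end, leaving the rest of the program under default semantics.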

It is also often the case that usage of FTZ/DAZ is unintentional at best, due to people mindlessly copying their build configuration from a coroutine library.

Finally, audio engineers often do simply handle subnormals by e.g. zeroing the float when they do iterative operations that produce a subnormal. This is because altering the rest of the global program to fix a bug that arises in a single loop is not necessarily desirable.
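
A hedged sketch of that per-loop fix (the filter coefficients and threshold are illustrative, not from any particular codebase): the feedback state is snapped to zero before it can decay into the subnormal range, so the loop's timing no longer depends on FTZ/DAZ at all.

```rust
/// One-pole lowpass-style filter with manual denormal suppression.
fn process(samples: &mut [f32]) {
    let mut state = 0.0_f32;
    for s in samples.iter_mut() {
        state = 0.999 * state + 0.001 * *s;
        // Snap anything below the smallest normal magnitude to zero so the
        // feedback path never produces subnormals, regardless of FTZ/DAZ.
        if state.abs() < f32::MIN_POSITIVE {
            state = 0.0;
        }
        *s = state;
    }
}
```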

And Pipewire does not do the thing you described any more, though they do allow it to be configured. This is because it crashed programs that linked it in. What you assert as Pipewire's behavior only describes a span of about 4 months.

@DemiMarie
Contributor Author

I don’t think it’s reasonable to expect code that manipulates FPU flags to be written entirely in assembly. For instance, Unity requires FTZ/DAZ for its physics engine.

@workingjubilee
Contributor

That is not what I was suggesting should be the case, either, though I realize that is the current state of affairs.

@jieyouxu jieyouxu added T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue. A-floating-point Area: Floating point numbers and arithmetic C-discussion Category: Discussion or questions that doesn't represent real issues. and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Mar 28, 2024