Stop optimizing under the assumption of default denormal behavior #123123

Open
DemiMarie opened this issue Mar 27, 2024 · 13 comments
Labels
A-floating-point - Area: Floating point numbers and arithmetic
C-discussion - Category: Discussion or questions that doesn't represent real issues.
T-compiler - Relevant to the compiler team, which will review and decide on the PR/issue.
T-libs - Relevant to the library team, which will review and decide on the PR/issue.

Comments

@DemiMarie
Contributor

On some platforms (QNX on Arm comes to mind, but there are probably others) flush-to-zero and denormals-are-zero are turned on by default and cannot be disabled. Furthermore, PipeWire turns on FTZ and DAZ because it is the only way to get realtime behavior (otherwise, denormals are handled by microcode assists or traps that are far too slow).

Right now, this means that one cannot use Rust for realtime code that performs floating point operations, which I think is silly. Instead, I think Rust should treat compile-time operations on floating-point numbers as partial functions. When a denormal or NaN is hit, Rust should defer the operation until runtime, when the actual behavior is known.
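
A minimal sketch of the divergence being described, assuming a target where FTZ/DAZ is (or can be) enabled; the names are illustrative only:

```rust
fn main() {
    // f32::MIN_POSITIVE is the smallest positive *normal* f32 (about 1.18e-38).
    // Halving it yields a subnormal under IEEE 754 semantics.
    const FOLDED: f32 = f32::MIN_POSITIVE * 0.5; // folded at compile time: subnormal

    // The same computation, shielded from constant folding by black_box,
    // observes the hardware's actual behavior at runtime: zero if FTZ is on.
    let runtime = std::hint::black_box(f32::MIN_POSITIVE) * std::hint::black_box(0.5_f32);

    // On an FTZ target these disagree: FOLDED is subnormal, runtime is 0.0.
    println!("folded = {FOLDED:e}, runtime = {runtime:e}");
}
```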

@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Mar 27, 2024
@workingjubilee
Contributor

It has been decades since hardware implementations of subnormal numbers with latency within close range of other hardware floating point instructions were described, and at least a decade since they were implemented in hardware and became pervasive. On the other hand, data-dependent timing has become so omnipresent, even with integral operations, that one must now deliberately enable flags in hardware to make timings data-independent. So what you say is not the full story, and neither is what I just said, and it might be best if these problems were described in a more parameterized way, so that we can quantify these cases better when considering them and have a complete picture.

In particular, I suspect what you describe as a feature of QNX is moreso a feature of particular hardware implementations that QNX has historically supported. I am more willing to believe I am Simply Wrong on that count, but specifically, some 32-bit Arm FPUs are always-FTZ, if I remember correctly, and even then it's a "...sortof", often.

@the8472
Member

the8472 commented Mar 27, 2024

> It has been decades since hardware implementations of subnormal numbers with latency within close range of other hardware floating point instructions were described

Perhaps not everyone got the memo? We've had reports of SIMD performance tanking by a factor of 10 on Intel CPUs when denormals are involved. #116359 and https://users.rust-lang.org/t/unexplained-order-of-magnitude-drop-in-performance/100406

@DemiMarie
Contributor Author

> It has been decades since hardware implementations of subnormal numbers with latency within close range of other hardware floating point instructions were described

> Perhaps not everyone got the memo? We've had reports of SIMD performance tanking by a factor of 10 on Intel CPUs when denormals are involved. #116359 and https://users.rust-lang.org/t/unexplained-order-of-magnitude-drop-in-performance/100406

Indeed, I’m pretty sure that many FPUs simply don’t bother handling denormals in hardware. You either enable FTZ and DAZ, or you trap to microcode (x86) or privileged code (Arm, RISC-V) for slow software emulation.

Also, even if FTZ/DAZ isn’t necessary on new hardware (which I am not at all certain of), people will keep doing it because they want to support old hardware. Since some libraries (like PipeWire!) will do it behind your back, I think this needs to be fixed in Rust.

In PipeWire, I believe that a denormal represents a sound that is far below the threshold of hearing. In this context, gradual underflow is not at all important, whereas avoiding latency spikes is important. The hardware designers expose FTZ/DAZ for a reason, and PipeWire is absolutely correct to turn them on.

> In particular, I suspect what you describe as a feature of QNX is moreso a feature of particular hardware implementations that QNX has historically supported. I am more willing to believe I am Simply Wrong on that count, but specifically, some 32-bit Arm FPUs are always-FTZ, if I remember correctly, and even then it's a "...sortof", often.

That just means that QNX doesn’t provide the software emulation code. If you try to disable FTZ/DAZ (“RunFast mode” in Arm parlance) you either find that you cannot (if disabling requires OS support) or you get exceptions.

RunFast mode (FTZ/DAZ/no exceptions) is what Arm CPUs are optimized for. Turning it off results in a performance penalty (due to reduced reordering opportunities in hardware) even if one doesn’t encounter denormals at all. Furthermore, Rust isn’t just intended for applications. It’s also intended for libraries, and those might be called with or without FTZ/DAZ. Whether or not this is UB, people will do it anyway, because it will solve real-world performance problems in their applications. It’s much better for Rust to accept that FTZ/DAZ exist, and be prepared for them.

Proposed concrete semantics

Rust assumes that operations on normal floating point numbers obey IEEE 754 with round-to-nearest. However, Rust does not assume that operations on denormal floating point numbers, infinities, or NaNs obey IEEE 754. This is to support systems that require Flush To Zero (FTZ)/Denormals Are Zero (DAZ), either for performance reasons or because FTZ/DAZ is all that is supported.
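
For illustration only (not part of the proposed wording), a sketch of runtime probes for the two behaviors in question; `black_box` keeps each check from being folded under the compile-time semantics assumed above:

```rust
use std::hint::black_box;

/// FTZ-style behavior: a subnormal *result* is flushed to zero.
fn flushes_subnormal_results() -> bool {
    // The exact product of these normal inputs is subnormal.
    black_box(f32::MIN_POSITIVE) * black_box(0.5_f32) == 0.0
}

/// DAZ-style behavior: a subnormal *input* is treated as zero.
fn treats_subnormal_inputs_as_zero() -> bool {
    // Built from a bit pattern so no arithmetic can flush it beforehand.
    let smallest_subnormal = black_box(f32::from_bits(1)); // about 1.4e-45
    // The exact product is a *normal* number (about 1.4e-36), so a zero here
    // means the input was flushed, not the output.
    smallest_subnormal * black_box(1.0e9_f32) == 0.0
}

fn main() {
    println!("FTZ-like: {}", flushes_subnormal_results());
    println!("DAZ-like: {}", treats_subnormal_inputs_as_zero());
}
```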

Exceptions to IEEE 754

The following exceptions to IEEE 754 behavior are permitted:

  1. Denormal input(s) may or may not be flushed to zero before completing the operation.
  2. If the output is a denormal number, it may or may not be flushed to zero. The check for the output being denormal may be performed before or after rounding, or it may be performed in both places.
  3. If either input is NaN, the NaN payload may or may not be preserved.
  4. If the output is NaN, the payload is unspecified.

Interaction with FFI/inline assembler

There are two reasonable options here I can think of:

  1. Floating point operations are allowed to be reordered past FFI calls or inline assembler. Therefore, if an FFI call or inline assembler changes the floating-point control word in a way that affects the FTZ/DAZ flags, the change may take effect at an unspecified point in the past or future. This may cause non-deterministic results, but will not cause undefined behavior (see the sketch after this list).
  2. Floating point operations are not allowed to be reordered past FFI calls or inline assembler.
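
As referenced in option 1, a hedged, x86_64-only sketch of the hazard being described. `_MM_SET_FLUSH_ZERO_MODE` and its constants are real `std::arch::x86_64` items (whose use to change the float environment is itself contentious in Rust, which is part of what this issue is about); whether the compiler may move the multiply across the mode changes is the open question.

```rust
#[cfg(target_arch = "x86_64")]
fn ftz_scoped_multiply(a: f32, b: f32) -> f32 {
    use std::arch::x86_64::{_MM_FLUSH_ZERO_OFF, _MM_FLUSH_ZERO_ON, _MM_SET_FLUSH_ZERO_MODE};

    unsafe { _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON) };
    // Under option 1 the compiler may hoist or sink this multiply past the
    // surrounding mode changes, so it may observe either FTZ state:
    // non-deterministic, but not undefined behavior.
    let y = a * b;
    unsafe { _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_OFF) };
    y
}
```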

@workingjubilee
Contributor

> It has been decades since hardware implementations of subnormal numbers with latency within close range of other hardware floating point instructions were described

> Perhaps not everyone got the memo? We've had reports of SIMD performance tanking by a factor of 10 on Intel CPUs when denormals are involved. #116359 and https://users.rust-lang.org/t/unexplained-order-of-magnitude-drop-in-performance/100406

The trap-to-handle-in-kernel behavior mentioned results in a slowdown of several orders of magnitude, not merely a factor of 10. It also should not be the case that setting the x86-64-v2 featureset removes the slowdown if the problem is fundamentally denormal numbers. It seems more likely to be a problem in handling the high-half of the float registers i.e. SIMD registers, and false dependencies (possibly combined with other issues).

@workingjubilee
Contributor

> Indeed, I’m pretty sure that many FPUs simply don’t bother handling denormals in hardware. You either enable FTZ and DAZ, or you trap to microcode (x86) or privileged code (Arm, RISC-V) for slow software emulation.

AArch64 CPUs do, evidently.

And software emulation is often required for floats anyway on 32-bit CPUs, such that your choice is between "software emulation" and "software emulation". More precisely, even when such CPUs have an FPU, it often handles only 32-bit floats and not 64-bit floats, which forces the software calls in any case. It may, for instance, be more useful to change Rust's presumptive-float-inference behavior on such CPUs to presumptively infer a 32-bit float instead.

@DemiMarie
Contributor Author

> Indeed, I’m pretty sure that many FPUs simply don’t bother handling denormals in hardware. You either enable FTZ and DAZ, or you trap to microcode (x86) or privileged code (Arm, RISC-V) for slow software emulation.

> AArch64 CPUs do, evidently.

Do all of them?

In any case, I still think Rust should be able to handle environments where FTZ and DAZ are in use. There are too many things that set it.

@workingjubilee
Contributor

Possibly not all of them! But if we're willing to talk about CPUs in broad generalities and infer based on a few bug reports, I'm willing to take the aforementioned bug in https://users.rust-lang.org/t/unexplained-order-of-magnitude-drop-in-performance/100406 as suggestive, yes.

> It seems more likely to be a problem in handling the high-half of the float registers i.e. SIMD registers, and false dependencies (possibly combined with other issues).

I hadn't even looked at #116359 and lo, I was correct on it being a high-half issue... though I didn't fully expect it to be the high half of xmm, honestly. Amusingly poor form on Intel's part.

> In any case, I still think Rust should be able to handle environments where FTZ and DAZ are in use. There are too many things that set it.

With that I agree, but I think it is probably inadvisable to jump to assuming a global change to Rust's float semantics is the best way to achieve it.

@DemiMarie
Contributor Author

> In any case, I still think Rust should be able to handle environments where FTZ and DAZ are in use. There are too many things that set it.

> With that I agree, but I think it is probably inadvisable to jump to assuming a global change to Rust's float semantics is the best way to achieve it.

Code running in these environments might not know that it is going to run in these environments. It might be running as a library in an application that sets the mode, for instance.

@workingjubilee
Contributor

The same can be applied to every setting in the floating-point CSRs. Are you proposing we suspend all floating-point optimizations whatsoever?

@DemiMarie
Contributor Author

> The same can be applied to every setting in the floating-point CSRs. Are you proposing we suspend all floating-point optimizations whatsoever?

I thought FTZ and DAZ were used by way more applications than all the others combined.

@workingjubilee
Contributor

It is very common in interval arithmetic, which offers rigorous computation with strictly bounded errors, even in the face of the floating point... "quirks".

Even without that domain of mathematics, manipulation of the MXCSR is performed in order to have a fast fused-multiply-add implementation that upholds the "1 rounding" constraint on SSE2-only CPUs. Thus it occurs in almost every application that is linked against libm and executes on an old-enough x86 CPU. This is done in a scoped fashion which does not impede optimizing the rest of the program, which is why I think it is better to first consider what can be done without simply relegating the problem to global state.
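
A hedged sketch of that scoped pattern as it might look in Rust, using the real `_mm_getcsr`/`_mm_setcsr` intrinsics from `std::arch::x86_64`; the guard type itself is hypothetical, and FTZ/DAZ correspond to MXCSR bits 15 and 6 respectively:

```rust
#[cfg(target_arch = "x86_64")]
mod scoped_mxcsr {
    use std::arch::x86_64::{_mm_getcsr, _mm_setcsr};

    /// Hypothetical RAII guard: save MXCSR, set some bits, restore on drop,
    /// so the altered float environment is confined to one scope instead of
    /// becoming process-global state.
    pub struct MxcsrGuard {
        saved: u32,
    }

    impl MxcsrGuard {
        /// # Safety
        /// The caller must ensure no code that relies on the default float
        /// environment runs (or is moved by the optimizer) inside the scope.
        pub unsafe fn set_bits(bits: u32) -> Self {
            let saved = unsafe { _mm_getcsr() };
            unsafe { _mm_setcsr(saved | bits) };
            MxcsrGuard { saved }
        }
    }

    impl Drop for MxcsrGuard {
        fn drop(&mut self) {
            unsafe { _mm_setcsr(self.saved) };
        }
    }
}
```

For example, a hot loop could hold a guard created with `set_bits(0x8000 | 0x0040)` (FTZ | DAZ) and drop it at the end, leaving the rest of the program under default semantics.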

It is also often the case that usage of FTZ/DAZ is unintentional at best, due to people mindlessly copying their build configuration from a coroutine library.

Finally, audio engineers often do simply handle subnormals by e.g. zeroing the float when they do iterative operations that produce a subnormal. This is because altering the rest of the global program to fix a bug that arises in a single loop is not necessarily desirable.
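
A hedged sketch of that per-loop fix (the filter coefficients and threshold are illustrative, not from any particular codebase): the feedback state is snapped to zero before it can decay into the subnormal range, so the loop's timing no longer depends on FTZ/DAZ at all.

```rust
/// One-pole lowpass-style filter with manual denormal suppression.
fn process(samples: &mut [f32]) {
    let mut state = 0.0_f32;
    for s in samples.iter_mut() {
        state = 0.999 * state + 0.001 * *s;
        // Snap anything below the smallest normal magnitude to zero so the
        // feedback path never produces subnormals, regardless of FTZ/DAZ.
        if state.abs() < f32::MIN_POSITIVE {
            state = 0.0;
        }
        *s = state;
    }
}
```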

And Pipewire does not do the thing you described any more, though they do allow it to be configured. This is because it crashed programs that linked it in. What you assert as Pipewire's behavior only describes a span of about 4 months.

@DemiMarie
Contributor Author

I don’t think it’s reasonable to expect code that manipulates FPU flags to be written entirely in assembly. For instance, Unity requires FTZ/DAZ for its physics engine.

@workingjubilee
Contributor

That is not what I was suggesting should be the case, either, though I realize that is the current state of affairs.

@jieyouxu jieyouxu added T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue. A-floating-point Area: Floating point numbers and arithmetic C-discussion Category: Discussion or questions that doesn't represent real issues. and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Mar 28, 2024