Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compiler emits inappropriate SSE2 instructions for f64 math #116359

Open
HEnquist opened this issue Oct 2, 2023 · 4 comments
Open

Compiler emits inappropriate SSE2 instructions for f64 math #116359

HEnquist opened this issue Oct 2, 2023 · 4 comments
Labels
A-codegen Area: Code generation A-floating-point Area: Floating point numbers and arithmetic A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. O-x86_64 Target: x86-64 processors (like x86_64-*) T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@HEnquist
Copy link

HEnquist commented Oct 2, 2023

In my audio DSP application, the simple function that applies a biquad filter to a waveform sometimes runs much slower than expected.
The results are correct, just the performance is affected.
This only happens on Intel CPUs, when compiling with the default target settings.
See the discussion and investigation here: https://users.rust-lang.org/t/unexplained-order-of-magnitude-drop-in-performance

The affected function:

    fn process_single(&mut self, input: f64) -> f64 {
        let out = self.s1 + self.coeffs.b0 * input;
        self.s1 = self.s2 + self.coeffs.b1 * input - self.coeffs.a1 * out;
        self.s2 = self.coeffs.b2 * input - self.coeffs.a2 * out;
        out
    }

    #[inline(never)]
    fn process_waveform(&mut self, waveform: &mut [f64]) {
        for item in waveform.iter_mut() {
            *item = self.process_single(*item);
        }
    }

This is taken from the example published here: https://github.com/HEnquist/sse_subnormal_issue

The slowness has been traced down to subnormal numbers. This slows down Intel CPUs by an order of magnitude, while AMD cpus are nearly unaffected.
There are however no subnormals in any of the involved variables. The problem comes from that the compiler uses an SSE2 vector multiplication to do a single multiplication, without ensuring that the unused lane in the registers contain proper floating point values. The speed depends on what happens to be in the xmm4 register when entering the function. When it's a "bad" value, performance is terrible.

This is the assembly of the process_waveform function:

00000000000085f0 <_ZN7xmmtest6Biquad16process_waveform17hff1442b2f1181681E>:
    85f0:	f2 0f 10 47 20       	movsd  0x20(%rdi),%xmm0
    85f5:	f2 0f 10 57 28       	movsd  0x28(%rdi),%xmm2
    85fa:	66 0f 10 0f          	movupd (%rdi),%xmm1
    85fe:	66 0f 10 5f 10       	movupd 0x10(%rdi),%xmm3
    8603:	66 0f 16 67 30       	movhpd 0x30(%rdi),%xmm4
    8608:	31 c0                	xor    %eax,%eax
    860a:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
    8610:	f2 0f 10 2c c6       	movsd  (%rsi,%rax,8),%xmm5
    8615:	66 0f 28 f0          	movapd %xmm0,%xmm6
    8619:	f2 0f 59 f5          	mulsd  %xmm5,%xmm6
    861d:	f2 0f 58 f1          	addsd  %xmm1,%xmm6
    8621:	f2 0f 11 34 c6       	movsd  %xmm6,(%rsi,%rax,8)
    8626:	48 8d 48 01          	lea    0x1(%rax),%rcx
    862a:	66 0f 28 fa          	movapd %xmm2,%xmm7
    862e:	f2 0f 59 fd          	mulsd  %xmm5,%xmm7
    8632:	66 0f 15 c9          	unpckhpd %xmm1,%xmm1
    8636:	66 0f 58 cf          	addpd  %xmm7,%xmm1
    863a:	66 0f 14 fd          	unpcklpd %xmm5,%xmm7
    863e:	66 0f 59 fc          	mulpd  %xmm4,%xmm7
    8642:	66 0f c6 cf 02       	shufpd $0x2,%xmm7,%xmm1
    8647:	66 0f 14 f6          	unpcklpd %xmm6,%xmm6
    864b:	66 0f 59 f3          	mulpd  %xmm3,%xmm6
    864f:	66 0f 5c ce          	subpd  %xmm6,%xmm1
    8653:	48 89 c8             	mov    %rcx,%rax
    8656:	48 81 f9 00 04 00 00 	cmp    $0x400,%rcx
    865d:	75 b1                	jne    8610 <_ZN7xmmtest6Biquad16process_waveform17hff1442b2f1181681E+0x20>
    865f:	66 0f 11 0f          	movupd %xmm1,(%rdi)
    8663:	c3                   	ret
    8664:	66 2e 0f 1f 84 00 00 	cs nopw 0x0(%rax,%rax,1)
    866b:	00 00 00 
    866e:	66 90                	xchg   %ax

The problem occurs in xmm4. This register only occurs twice, first when loading a value into the high lane, which leaves the lower lane unchanged:

8603:	66 0f 16 67 30       	movhpd 0x30(%rdi),%xmm4

Then it is multiplied with xmm7, storing the result in xmm7:

863e:	66 0f 59 fc          	mulpd  %xmm4,%xmm7

Finally, only the high lane is copied from xmm7 to xmm1, and the garbage result in the lower lane is never used:

8642:	66 0f c6 cf 02       	shufpd $0x2,%xmm7,%xmm1

This gives the correct result, but the unused lower lane will multiply whatever was in xmm4 before entering the function. There is a high likelihood that this is not a normal floating point value, for example uint64 value 2 gets interpreted as a very small subnormal number (9.88e-324). And this makes the multiplication very slow.

I suppose it uses the MULPD instruction to keep using SSE and avoiding mixing SSE and normal FPU instructions. But when doing this, I would expect it to ensure that any unused lane contains good values that don't cause any trouble.

Meta

rustc --version --verbose:

rustc 1.75.0-nightly (e0d7ed1f4 2023-10-01)
binary: rustc
commit-hash: e0d7ed1f453fb54578cc96dfea859b0e7be15016
commit-date: 2023-10-01
host: x86_64-unknown-linux-gnu
release: 1.75.0-nightly
LLVM version: 17.0.2

and

rustc 1.72.1 (d5c2e9c34 2023-09-13)
binary: rustc
commit-hash: d5c2e9c342b358556da91d61ed4133f6f50fc0c3
commit-date: 2023-09-13
host: x86_64-unknown-linux-gnu
release: 1.72.1
LLVM version: 16.0.5

Backtrace

(compiles without issue, nothing to see)

@HEnquist HEnquist added the C-bug Category: This is a bug. label Oct 2, 2023
@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Oct 2, 2023
@workingjubilee workingjubilee added A-codegen Area: Code generation T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Oct 3, 2023
@Noratrieb Noratrieb added I-slow Issue: Problems and improvements with respect to performance of generated code. O-x86_64 Target: x86-64 processors (like x86_64-*) A-floating-point Area: Floating point numbers and arithmetic and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Oct 3, 2023
spfenwick added a commit to spfenwick/camilladsp that referenced this issue Oct 7, 2023
by zeroing all sse2 registers on x86-64 architectures at start of processing loop
@HEnquist
Copy link
Author

I found a little something that may help investigating this.
When using rustc 1.70 and later to compile the small example, setting codegen-units = 1 makes it avoid the issue. That makes things a little harder since that is also what happens when compiling with --emit asm. Using 1.69 or older makes it emit the bad instructions also with codegen-units = 1.

@nikic nikic added the A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. label Oct 10, 2023
@nikic
Copy link
Contributor

nikic commented Oct 10, 2023

Pretty sure we have an upstream LLVM issue for this, but I can't find it.

@HEnquist
Copy link
Author

I have searched the open llvm issues for all relevant words I can think of, and have not found anything that looks related to this.

When using rustc 1.70 and later to compile the small example, setting codegen-units = 1 makes it avoid the issue.

It turns out that this is incorrect. The issue is still there, it just shows up in a different form that I didn't see at first. I can post the details if that's helpful.

@HEnquist
Copy link
Author

I made a workaround that seems to avoid the problem in my project. I simply made a separate version of the troublesome function using sse2 intrinsics:
https://github.com/HEnquist/camilladsp/blob/e5a1ee35fa01a9ad3794a4d450b8e1d428921af4/src/biquad.rs#L424
Maybe other affected projects (are there any?) can use the same trick.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-codegen Area: Code generation A-floating-point Area: Floating point numbers and arithmetic A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. O-x86_64 Target: x86-64 processors (like x86_64-*) T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

5 participants