Same code is slow in Rust, but is fast in C using clang 3.9 #39446

pftbest · 2017-02-01T16:19:19Z

Simple code with loops and integer arithmetic is 15-20% slower in Rust, than the same exact (byte to byte exact) C code compiled by clang-3.9. I suspect this may be some issue in LLVM that is triggered by Rust.
Rust code: https://gist.github.com/pftbest/5c18e458cddd6a055878503c08a38848
C code: https://gist.github.com/pftbest/85ac44272eecb365ad62b7fbc4f72115
Rust version:

rustc 1.16.0-nightly (df8debf6d 2017-01-25)
binary: rustc
commit-hash: df8debf6d9afc431adbbd8311dcaf2b70eb9762e
commit-date: 2017-01-25
host: x86_64-unknown-linux-gnu
release: 1.16.0-nightly
LLVM version: 3.9

clang version:

clang version 3.9.1 (tags/RELEASE_391/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

Results from first PC (16% difference):

$ rustc -C opt-level=3 main.rs
$ time ./main
200574
60372774
120544974
180717174
240889374
301061574
309886830

real    0m0.829s
user    0m0.827s
sys     0m0.000s

$ clang -O3 main.c
$ time ./a.out
200574
60372774
120544974
180717174
240889374
301061574
309886830

real    0m0.694s
user    0m0.693s
sys     0m0.000s

Results from second PC (26% difference):

$ time ./main > /dev/null
real    0m1.593s
user    0m1.587s
sys     0m0.000s

time ./a.out > /dev/null
real    0m1.170s
user    0m1.167s
sys     0m0.000s

The text was updated successfully, but these errors were encountered:

cuviper · 2017-02-17T23:30:40Z

Note that optimized Rust treats signed overflow as wrapping, whereas in C it's undefined behavior. I found that clang -O3 -fwrapv kills most of its advantage, which suggests it is indeed taking advantage of that signed-overflow UB for optimizations.

GCC is about the same with or without -fwrapv, FWIW.

cuviper · 2017-02-17T23:50:08Z

Here's one of the hotspots with clang -O3:

        x_x = (x * x) / 200;
mov    %edx,%ebp            
imul   %ebp,%ebp            
imul   $0x51eb851f,%rbp,%rbp
shr    $0x26,%rbp

And the same with clang -O3 -fwrapv:

        x_x = (x * x) / 200;
mov    %eax,%edx            
imul   %edx,%edx            
movslq %edx,%rdx            
imul   $0x51eb851f,%rdx,%rbp
mov    %rbp,%rdx            
shr    $0x3f,%rdx           
sar    $0x26,%rbp           
add    %edx,%ebp

In both cases, it's optimizing the division into multiplication with fixups. But it looks like in the first case, with x * x assumed not to overflow and therefore always positive, it generated a much simpler replacement for effectively unsigned division.

The same line from rustc is basically identical to the clang -fwrapv case.

GCC outputs the same for me either way.

        x_x = (x * x) / 200;
mov    %edi,%r13d           
imul   %edi,%r13d           
mov    %r13d,%eax           
sar    $0x1f,%r13d          
imul   %ebx                 
sar    $0x6,%edx            
mov    %edx,%ecx            
sub    %r13d,%ecx

pftbest · 2017-02-18T08:56:13Z

Ok, if this confirmed to be a limitation caused by wrapping semantics, should be there an option in rustc to enable undefined behavior? It will be interesting to see how this affects other rust applications that do heavy computations.

ranma42 · 2017-02-18T10:42:49Z

@pftbest you might try adding unsafe { std::intrinsics::assume(x * x >= 0); } right before the x_x = (x * x) / 200;. In the playground it seems sufficient to get the optimised code.

pftbest · 2017-02-18T12:20:29Z

@ranma42 Your suggestion helped, but it wasn't enough. I've added this intrinsic to both x*x and y*y, and it reduced the run time from 1.59s to 1.37s but it is still slower than 1.17s I get from C code. Also I tried to add -fwrapv flag to clang, and this increased the time to 1.48s for C code.

ranma42 · 2017-02-18T13:10:49Z

@pftbest my suggestion was specific for the x*x (and y*y) fragment, but there might be other hot spots that need tuning to get optimised as you would like.

pftbest · 2017-02-18T13:37:50Z

@ranma42 yes, I know, that is why I was asking if there is a switch in the compiler.
This is not even my code, I was talking to a co-worker one day about Rust, how it is great and fast as C. In response he gave me this piece of code and asked what is wrong with it, and I had no good answer.

cuviper · 2017-02-18T22:50:07Z

Another way to "trick" rustc without using unsafe is to actually make it unsigned division.

--- main.rs.orig	2017-02-18 14:41:31.993231078 -0800
+++ main.rs	2017-02-18 14:43:25.862194979 -0800
@@ -43,8 +43,8 @@ fn main() {
         y_y = 0;
         i = 0;
         while (i < max_iter && x_x + y_y <= 800) {
-          x_x = (x * x) / 200;
-          y_y = (y * y) / 200;
+          x_x = ((x * x) as u32 / 200) as i32;
+          y_y = ((y * y) as u32 / 200) as i32;
           if (x_x + y_y > 800) {
             the_char = 48 + i;
             if (i > 9) {
@@ -53,9 +53,9 @@ fn main() {
           } else {
             temp = x_x - y_y + x0;
             if ((x < 0 && y > 0) || (x > 0 && y < 0)) {
-              y = (-1 * ((-1 * (x * y)) / 100)) + y0;
+              y = (-1 * ((-1 * (x * y)) as u32 / 100) as i32) + y0;
             } else {
-              y = x * y / 100 + y0;
+              y = ((x * y) as u32 / 100) as i32 + y0;
             }
             x = temp;
           }

cuviper · 2017-02-21T22:24:00Z

FWIW, I filed a GCC performance bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665

pftbest · 2017-04-01T10:56:16Z

Here is another classic example of overflow optimizations:

pub fn times_ten(num: isize) -> isize {
    let a = num * 30;
    let b = a / 3;
    b
}

gives:

example::times_ten:
        imul    rax, rdi, 30
        movabs  rcx, 6148914691236517206
        imul    rcx
        mov     rax, rdx
        shr     rax, 63
        lea     rax, [rax + rdx]
        ret

while C code knows its just times 10:

times_ten(int):                          # @times_ten(int)
        add     edi, edi
        lea     eax, [rdi + 4*rdi]
        ret

Looks like when RFC 0560 was changed no one considered the performance impact of this semantics. Or maybe they did consider, but did not mention it in the RFC.

coder543 · 2017-05-09T13:25:10Z

Just bumping this issue, it looks like GCC has issued a performance fix for this: https://www.phoronix.com/scan.php?page=news_item&px=GCC-Inefficiency-Fix-67

th0br0 · 2017-06-26T18:24:10Z

I was able to achieve similar performance gains by casting to usize.
However, this has led to the rather peculiar issue where the code builds and works fine in --release but panicks with attempted to add with overflow in --debug.

I'm running on latest nightly, but the issue is reproducable with stable on the playground:

https://play.rust-lang.org/?gist=0d67cb96632865292cdeb338ca463027&version=nightly&backtrace=0

I'm not sure whether I should open this as a separate bug or whether this is part of this one.
Note the lack of warnings; if I replace the state[1] with (-1), rustc will complain.

codyps · 2017-06-26T18:27:32Z

@th0br0 you'll need to use Wrapping<T> or .wrapping_add() to avoid the debug assertion (which is intentionally only on by default in non-release builds).

pftbest · 2017-06-26T18:44:19Z

The panic you see is intended behavior, so there is no bug here.

Also I don't see how changing isize to usize would give any performance in your case, because they compile to the same assembly code:

https://godbolt.org/g/DQrYkK

th0br0 · 2017-06-26T19:06:46Z

@pftbest the change was less a matter of isize/usize but rather explicitly casting each component of the addition to usize. Sorry that I didn't make that clear in my previous post.

see the different assembly code: (cast_once was what I had in the past)
https://godbolt.org/g/EeV9At

pftbest · 2017-06-26T19:26:06Z

Ok, I see it now, the reason this would be fast in C because it will implicitly cast each component to isize but Rust will not. In Rust you would have to manually cast them to isize to get the same performance.

steveklabnik · 2018-12-25T20:04:13Z

Triage: it's been a year and a half, and seems like this isn't a bug, just differences in semantics. Closing!

TianyiShi2001 · 2020-11-12T20:19:15Z

In case anyone is still interested in this issue:

as of 2020-11-12, the benchmark program using i32 is faster than isize. Here are the benchmarks on my computer:

rust i32: 0.94s
rust isize: 1.06s
clang -fwrapv: 0.90s

steveklabnik added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Feb 1, 2017

th0br0 mentioned this issue Jun 26, 2017

curl: final optimisations iotaledger/iota.legacy.rs#52

Merged

Mark-Simulacrum added C-enhancement Category: An issue proposing an enhancement or a PR with one. and removed C-enhancement Category: An issue proposing an enhancement or a PR with one. labels Jul 26, 2017

steveklabnik closed this as completed Dec 25, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Same code is slow in Rust, but is fast in C using clang 3.9 #39446

Same code is slow in Rust, but is fast in C using clang 3.9 #39446

pftbest commented Feb 1, 2017

cuviper commented Feb 17, 2017

cuviper commented Feb 17, 2017

pftbest commented Feb 18, 2017

ranma42 commented Feb 18, 2017

pftbest commented Feb 18, 2017

ranma42 commented Feb 18, 2017

pftbest commented Feb 18, 2017

cuviper commented Feb 18, 2017

cuviper commented Feb 21, 2017

pftbest commented Apr 1, 2017

coder543 commented May 9, 2017

th0br0 commented Jun 26, 2017 •

edited

Loading

codyps commented Jun 26, 2017

pftbest commented Jun 26, 2017

th0br0 commented Jun 26, 2017 •

edited

Loading

pftbest commented Jun 26, 2017

steveklabnik commented Dec 25, 2018

TianyiShi2001 commented Nov 12, 2020

Same code is slow in Rust, but is fast in C using clang 3.9 #39446

Same code is slow in Rust, but is fast in C using clang 3.9 #39446

Comments

pftbest commented Feb 1, 2017

cuviper commented Feb 17, 2017

cuviper commented Feb 17, 2017

pftbest commented Feb 18, 2017

ranma42 commented Feb 18, 2017

pftbest commented Feb 18, 2017

ranma42 commented Feb 18, 2017

pftbest commented Feb 18, 2017

cuviper commented Feb 18, 2017

cuviper commented Feb 21, 2017

pftbest commented Apr 1, 2017

coder543 commented May 9, 2017

th0br0 commented Jun 26, 2017 • edited Loading

codyps commented Jun 26, 2017

pftbest commented Jun 26, 2017

th0br0 commented Jun 26, 2017 • edited Loading

pftbest commented Jun 26, 2017

steveklabnik commented Dec 25, 2018

TianyiShi2001 commented Nov 12, 2020

th0br0 commented Jun 26, 2017 •

edited

Loading

th0br0 commented Jun 26, 2017 •

edited

Loading