Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Same code is slow in Rust, but is fast in C using clang 3.9 #39446

Closed
pftbest opened this issue Feb 1, 2017 · 18 comments
Closed

Same code is slow in Rust, but is fast in C using clang 3.9 #39446

pftbest opened this issue Feb 1, 2017 · 18 comments
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@pftbest
Copy link
Contributor

pftbest commented Feb 1, 2017

Simple code with loops and integer arithmetic is 15-20% slower in Rust, than the same exact (byte to byte exact) C code compiled by clang-3.9. I suspect this may be some issue in LLVM that is triggered by Rust.
Rust code: https://gist.github.com/pftbest/5c18e458cddd6a055878503c08a38848
C code: https://gist.github.com/pftbest/85ac44272eecb365ad62b7fbc4f72115
Rust version:

rustc 1.16.0-nightly (df8debf6d 2017-01-25)
binary: rustc
commit-hash: df8debf6d9afc431adbbd8311dcaf2b70eb9762e
commit-date: 2017-01-25
host: x86_64-unknown-linux-gnu
release: 1.16.0-nightly
LLVM version: 3.9

clang version:

clang version 3.9.1 (tags/RELEASE_391/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

Results from first PC (16% difference):

$ rustc -C opt-level=3 main.rs
$ time ./main
200574
60372774
120544974
180717174
240889374
301061574
309886830

real    0m0.829s
user    0m0.827s
sys     0m0.000s

$ clang -O3 main.c
$ time ./a.out
200574
60372774
120544974
180717174
240889374
301061574
309886830

real    0m0.694s
user    0m0.693s
sys     0m0.000s

Results from second PC (26% difference):

$ time ./main > /dev/null
real    0m1.593s
user    0m1.587s
sys     0m0.000s

time ./a.out > /dev/null
real    0m1.170s
user    0m1.167s
sys     0m0.000s
@steveklabnik steveklabnik added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Feb 1, 2017
@cuviper
Copy link
Member

cuviper commented Feb 17, 2017

Note that optimized Rust treats signed overflow as wrapping, whereas in C it's undefined behavior. I found that clang -O3 -fwrapv kills most of its advantage, which suggests it is indeed taking advantage of that signed-overflow UB for optimizations.

GCC is about the same with or without -fwrapv, FWIW.

@cuviper
Copy link
Member

cuviper commented Feb 17, 2017

Here's one of the hotspots with clang -O3:

        x_x = (x * x) / 200;
mov    %edx,%ebp            
imul   %ebp,%ebp            
imul   $0x51eb851f,%rbp,%rbp
shr    $0x26,%rbp           

And the same with clang -O3 -fwrapv:

        x_x = (x * x) / 200;
mov    %eax,%edx            
imul   %edx,%edx            
movslq %edx,%rdx            
imul   $0x51eb851f,%rdx,%rbp
mov    %rbp,%rdx            
shr    $0x3f,%rdx           
sar    $0x26,%rbp           
add    %edx,%ebp            

In both cases, it's optimizing the division into multiplication with fixups. But it looks like in the first case, with x * x assumed not to overflow and therefore always positive, it generated a much simpler replacement for effectively unsigned division.

The same line from rustc is basically identical to the clang -fwrapv case.

GCC outputs the same for me either way.

        x_x = (x * x) / 200;
mov    %edi,%r13d           
imul   %edi,%r13d           
mov    %r13d,%eax           
sar    $0x1f,%r13d          
imul   %ebx                 
sar    $0x6,%edx            
mov    %edx,%ecx            
sub    %r13d,%ecx           

@pftbest
Copy link
Contributor Author

pftbest commented Feb 18, 2017

Ok, if this confirmed to be a limitation caused by wrapping semantics, should be there an option in rustc to enable undefined behavior? It will be interesting to see how this affects other rust applications that do heavy computations.

@ranma42
Copy link
Contributor

ranma42 commented Feb 18, 2017

@pftbest you might try adding unsafe { std::intrinsics::assume(x * x >= 0); } right before the x_x = (x * x) / 200;. In the playground it seems sufficient to get the optimised code.

@pftbest
Copy link
Contributor Author

pftbest commented Feb 18, 2017

@ranma42 Your suggestion helped, but it wasn't enough. I've added this intrinsic to both x*x and y*y, and it reduced the run time from 1.59s to 1.37s but it is still slower than 1.17s I get from C code. Also I tried to add -fwrapv flag to clang, and this increased the time to 1.48s for C code.

@ranma42
Copy link
Contributor

ranma42 commented Feb 18, 2017

@pftbest my suggestion was specific for the x*x (and y*y) fragment, but there might be other hot spots that need tuning to get optimised as you would like.

@pftbest
Copy link
Contributor Author

pftbest commented Feb 18, 2017

@ranma42 yes, I know, that is why I was asking if there is a switch in the compiler.
This is not even my code, I was talking to a co-worker one day about Rust, how it is great and fast as C. In response he gave me this piece of code and asked what is wrong with it, and I had no good answer.

@cuviper
Copy link
Member

cuviper commented Feb 18, 2017

Another way to "trick" rustc without using unsafe is to actually make it unsigned division.

--- main.rs.orig	2017-02-18 14:41:31.993231078 -0800
+++ main.rs	2017-02-18 14:43:25.862194979 -0800
@@ -43,8 +43,8 @@ fn main() {
         y_y = 0;
         i = 0;
         while (i < max_iter && x_x + y_y <= 800) {
-          x_x = (x * x) / 200;
-          y_y = (y * y) / 200;
+          x_x = ((x * x) as u32 / 200) as i32;
+          y_y = ((y * y) as u32 / 200) as i32;
           if (x_x + y_y > 800) {
             the_char = 48 + i;
             if (i > 9) {
@@ -53,9 +53,9 @@ fn main() {
           } else {
             temp = x_x - y_y + x0;
             if ((x < 0 && y > 0) || (x > 0 && y < 0)) {
-              y = (-1 * ((-1 * (x * y)) / 100)) + y0;
+              y = (-1 * ((-1 * (x * y)) as u32 / 100) as i32) + y0;
             } else {
-              y = x * y / 100 + y0;
+              y = ((x * y) as u32 / 100) as i32 + y0;
             }
             x = temp;
           }

@cuviper
Copy link
Member

cuviper commented Feb 21, 2017

FWIW, I filed a GCC performance bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665

@pftbest
Copy link
Contributor Author

pftbest commented Apr 1, 2017

Here is another classic example of overflow optimizations:

pub fn times_ten(num: isize) -> isize {
    let a = num * 30;
    let b = a / 3;
    b
}

gives:

example::times_ten:
        imul    rax, rdi, 30
        movabs  rcx, 6148914691236517206
        imul    rcx
        mov     rax, rdx
        shr     rax, 63
        lea     rax, [rax + rdx]
        ret

while C code knows its just times 10:

times_ten(int):                          # @times_ten(int)
        add     edi, edi
        lea     eax, [rdi + 4*rdi]
        ret

Looks like when RFC 0560 was changed no one considered the performance impact of this semantics. Or maybe they did consider, but did not mention it in the RFC.

@coder543
Copy link

coder543 commented May 9, 2017

Just bumping this issue, it looks like GCC has issued a performance fix for this: https://www.phoronix.com/scan.php?page=news_item&px=GCC-Inefficiency-Fix-67

@th0br0
Copy link

th0br0 commented Jun 26, 2017

I was able to achieve similar performance gains by casting to usize.
However, this has led to the rather peculiar issue where the code builds and works fine in --release but panicks with attempted to add with overflow in --debug.

I'm running on latest nightly, but the issue is reproducable with stable on the playground:

https://play.rust-lang.org/?gist=0d67cb96632865292cdeb338ca463027&version=nightly&backtrace=0

I'm not sure whether I should open this as a separate bug or whether this is part of this one.
Note the lack of warnings; if I replace the state[1] with (-1), rustc will complain.

@codyps
Copy link
Contributor

codyps commented Jun 26, 2017

@th0br0 you'll need to use Wrapping<T> or .wrapping_add() to avoid the debug assertion (which is intentionally only on by default in non-release builds).

@pftbest
Copy link
Contributor Author

pftbest commented Jun 26, 2017

The panic you see is intended behavior, so there is no bug here.

Also I don't see how changing isize to usize would give any performance in your case, because they compile to the same assembly code:

https://godbolt.org/g/DQrYkK

@th0br0
Copy link

th0br0 commented Jun 26, 2017

@pftbest the change was less a matter of isize/usize but rather explicitly casting each component of the addition to usize. Sorry that I didn't make that clear in my previous post.

see the different assembly code: (cast_once was what I had in the past)
https://godbolt.org/g/EeV9At

@pftbest
Copy link
Contributor Author

pftbest commented Jun 26, 2017

Ok, I see it now, the reason this would be fast in C because it will implicitly cast each component to isize but Rust will not. In Rust you would have to manually cast them to isize to get the same performance.

@Mark-Simulacrum Mark-Simulacrum added C-enhancement Category: An issue proposing an enhancement or a PR with one. and removed C-enhancement Category: An issue proposing an enhancement or a PR with one. labels Jul 26, 2017
@steveklabnik
Copy link
Member

Triage: it's been a year and a half, and seems like this isn't a bug, just differences in semantics. Closing!

@TianyiShi2001
Copy link

In case anyone is still interested in this issue:

as of 2020-11-12, the benchmark program using i32 is faster than isize. Here are the benchmarks on my computer:

  • rust i32: 0.94s
  • rust isize: 1.06s
  • clang -fwrapv: 0.90s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code.
Projects
None yet
Development

No branches or pull requests

9 participants