
Moving a method from struct impl to trait causes performance degradation #69593

Open
JoshMcguigan opened this issue Feb 29, 2020 · 24 comments
@JoshMcguigan JoshMcguigan commented Feb 29, 2020

Hi all, thanks for all the work on the Rust project!

I tried this code:

use std::time::Instant;

fn main() {
    let finder = OddFinderOne;

    let start_time = Instant::now();
    let result = find_nth_odd(finder, 1_000_000_000);
    let elapsed_time = start_time.elapsed().as_millis();

    println!("{} in {} ms", result, elapsed_time);
}

fn find_nth_odd(odd_finder: OddFinderOne, n: u64) -> u64 {
    let mut i = 0;
    let mut odd_count = 0;

    while odd_count != n {
        i += 1;
        if odd_finder.is_odd(i) {
            odd_count += 1;
        }
    }

    i
}

trait OddFinder {
    fn is_odd(&self, n: u64) -> bool;
}

struct OddFinderOne;

impl OddFinder for OddFinderOne {
    fn is_odd(&self, n: u64) -> bool {
        n % 2 == 1
    }
}

I expected it to perform similarly to this code:

use std::time::Instant;

fn main() {
    let finder = OddFinder;

    let start_time = Instant::now();
    let result = find_nth_odd(finder, 1_000_000_000);
    let elapsed_time = start_time.elapsed().as_millis();

    println!("{} in {} ms", result, elapsed_time);
}

fn find_nth_odd(odd_finder: OddFinder, n: u64) -> u64 {
    let mut i = 0;
    let mut odd_count = 0;

    while odd_count != n {
        i += 1;
        if odd_finder.is_odd(i) {
            odd_count += 1;
        }
    }

    i
}

struct OddFinder;

impl OddFinder {
    fn is_odd(&self, n: u64) -> bool {
        n % 2 == 1
    }
}

Instead, this happened:

The version with the is_odd method defined directly on the struct completes in 1032 ms, while the version where is_odd is defined as part of the trait completes in 1340 ms. I would expect this abstraction to be zero-cost, since find_nth_odd in both cases takes the concrete struct as an argument (as opposed to a trait object).
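For context, the "zero-cost" expectation also holds for the generic form of this function. The sketch below (mine, not code from the linked repo) takes any OddFinder by value; the compiler monomorphizes it for OddFinderOne, so is_odd can be inlined exactly as in the inherent-impl version:

```rust
// Generic, statically dispatched variant: `F` is resolved at
// compile time, so no vtable lookup is involved.
trait OddFinder {
    fn is_odd(&self, n: u64) -> bool;
}

struct OddFinderOne;

impl OddFinder for OddFinderOne {
    fn is_odd(&self, n: u64) -> bool {
        n % 2 == 1
    }
}

fn find_nth_odd<F: OddFinder>(odd_finder: F, n: u64) -> u64 {
    let mut i = 0;
    let mut odd_count = 0;
    while odd_count != n {
        i += 1;
        if odd_finder.is_odd(i) {
            odd_count += 1;
        }
    }
    i
}

fn main() {
    // The 5th odd number is 9.
    println!("{}", find_nth_odd(OddFinderOne, 5));
}
```

This is the same static dispatch the issue relies on, just spelled with a generic parameter instead of the concrete type.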

Meta

rustc --version --verbose:

rustc 1.41.0 (5e1a79984 2020-01-27)
binary: rustc
commit-hash: 5e1a799842ba6ed4a57e91f7ab9435947482f7d8
commit-date: 2020-01-27
host: x86_64-unknown-linux-gnu
release: 1.41.0
LLVM version: 9.0

rustc +nightly --version --verbose

rustc 1.43.0-nightly (0eb878d2a 2020-02-28)
binary: rustc
commit-hash: 0eb878d2aa6e3a1cb315f3f328681b26bb4bffdb
commit-date: 2020-02-28
host: x86_64-unknown-linux-gnu
release: 1.43.0-nightly
LLVM version: 9.0

I see the same behavior with both the latest stable and nightly compilers. I run both versions of the program in release mode. The reported performance numbers have proven to be very repeatable.

Let me know if I can provide any additional useful information. You can find my repo containing this benchmark code at the link below.

https://github.com/JoshMcguigan/cost-of-indirection

@JoshMcguigan JoshMcguigan added the C-bug label Feb 29, 2020
@ehuss ehuss commented Feb 29, 2020

FWIW, if you set codegen-units to 1 you should get the same result.

@JoshMcguigan JoshMcguigan commented Feb 29, 2020

FWIW, if you set codegen-units to 1 you should get the same result.

I added codegen-units = 1 under a [profile.release] heading in my Cargo.toml file. The result of this was both versions of the program performing the same. Unfortunately it caused a regression in the performance of the baseline program, so both versions take about 1270 ms with this compiler configuration.
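For reference, the change described above amounts to this Cargo.toml fragment (a minimal sketch of the configuration being discussed):

```toml
# Cargo.toml
[profile.release]
codegen-units = 1   # single codegen unit: better cross-function optimization,
                    # slower builds; made both binaries perform identically here
```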

@vojta7 vojta7 commented Feb 29, 2020

Hi, I repeated your benchmark on my computer with the same toolchain:

rustc 1.41.0 (5e1a79984 2020-01-27)
binary: rustc
commit-hash: 5e1a799842ba6ed4a57e91f7ab9435947482f7d8
commit-date: 2020-01-27
host: x86_64-unknown-linux-gnu
release: 1.41.0
LLVM version: 9.0

and both programs behave exactly the same. I also took a look at the generated assembly, and the code is exactly the same (minus the exported trait code, which is never used), so I think you just had bad luck with the benchmark run.

@JoshMcguigan JoshMcguigan commented Feb 29, 2020

…and both programs behave exactly the same.

Hmm, this performance differential is very repeatable on my machine. Just to ensure we are doing the same thing, I am comparing the following commands (running inside the repo linked in the issue description):

cargo run --release --bin struct
cargo run --release --bin trait

@vojta7 vojta7 commented Feb 29, 2020

[image: indirection]

@JoshMcguigan JoshMcguigan commented Feb 29, 2020

[image]

I wonder if you (or I) could somehow have some global compiler/cargo setting configured which could cause this difference?

I get this same performance no matter which of the benchmarks I run first.

@jonas-schievink jonas-schievink commented Feb 29, 2020

The Cargo and rustc versions you posted mismatch. Please use the newest nightly when reporting performance bugs.

@JoshMcguigan JoshMcguigan commented Feb 29, 2020

The Cargo and rustc versions you posted mismatch. Please use the newest nightly when reporting performance bugs.

Apologies for the confusion. I was trying to convey that I tested with both stable and the latest nightly, and found the same issue. I realize now that my original issue description was misleading, so I've corrected it to show the version of rustc I was using for nightly testing.

edit - Actually, further testing with the following commands suggests that the latest nightly gives equal performance for both versions of this program, but both are slower than the struct version on 1.41 stable.

cargo run --release --bin struct -> ~1000 ms
cargo run --release --bin trait -> ~1250 ms
cargo +nightly run --release --bin struct -> ~1250 ms
cargo +nightly run --release --bin trait -> ~1250 ms

@vojta7 vojta7 commented Feb 29, 2020

I copy-pasted your code from the linked repo into godbolt; rustfmt makes it hard to read, but we can see the only differing part is the symbol <example::OddFinderOne as example::OddFinder>::is_odd: (line 48), and as far as I can see it is never called.

I think this can cause a different memory layout, which can significantly change the result of such a small benchmark, but the effect is essentially random, and I believe a slightly different layout could even speed up the resulting binary. That would explain why your result differs from mine.

The generated assembly is the same for nightly and rustc 1.41.0.

Sorry for my English, I am not a native speaker

@JoshMcguigan JoshMcguigan commented Mar 1, 2020

Thanks for your help tracking this down. It is odd that godbolt shows the same generated assembly from 1.41.0 to nightly, as I was seeing different performance between the two. I've since updated to 1.41.1 on my machine (which seemed to behave the same as 1.41.0), and I did the following test:

$ cargo build --release --bin struct && cp target/release/struct ./struct-stable && cargo +nightly build --release --bin struct && cp target/release/struct ./struct-nightly
$ ls -l
total 5352
-rw-rw-r--. 1 josh josh     150 Feb 29 04:28 Cargo.lock
-rw-rw-r--. 1 josh josh     118 Feb 29 14:07 Cargo.toml
-rw-rw-r--. 1 josh josh   10803 Feb 29 08:10 LICENSE-APACHE.txt
-rw-rw-r--. 1 josh josh   10803 Feb 29 08:10 LICENSE-MIT.txt
-rw-rw-r--. 1 josh josh     121 Feb 29 09:45 README.md
drwxrwxr-x. 3 josh josh    4096 Feb 29 05:34 src
-rwxrwxr-x. 1 josh josh 2781704 Feb 29 18:20 struct-nightly
-rwxrwxr-x. 1 josh josh 2646504 Feb 29 18:20 struct-stable
drwxrwxr-x. 5 josh josh    4096 Feb 29 04:36 target
$ ./struct-stable
1999999999 in 1025 ms
$ ./struct-nightly
1999999999 in 1283 ms

Note the difference in size of the struct-stable vs struct-nightly binaries. I tried running objdump -d $BIN on each of these files and diffing the output, but the diff was very large. Is there some other way I could meaningfully inspect these binaries locally?

@zacps zacps commented Mar 1, 2020

Is there some other way I could meaningfully inspect these binaries locally?

See cargo asm

@avl avl commented Mar 2, 2020

I tried the two programs, for fun, and got no difference. (458ms for both, on Ryzen 3900X).

@JoshMcguigan What CPU are you using? And what operating system?

It would be interesting to see what a profiler such as 'perf' sees!

If you're on Linux, you can run perf stat to get some CPU performance counter values. It would be interesting to know what differs between the two binaries.

@avl avl commented Mar 2, 2020

Interestingly on linux I get:

anders@adesk:~/cost-of-indirection$ cargo run --release --bin struct
    Finished release [optimized] target(s) in 0.00s
     Running `target/release/struct`
1999999999 in 884 ms
anders@adesk:~/cost-of-indirection$ cargo run --release --bin trait
    Finished release [optimized] target(s) in 0.00s
     Running `target/release/trait`
1999999999 in 441 ms

While on windows, I got 458ms for both tests!

@avl avl commented Mar 2, 2020

perf stat output:

struct:

 Performance counter stats for './struct':

            903,24 msec task-clock                #    0,999 CPUs utilized          
                85      context-switches          #    0,094 K/sec                  
                 0      cpu-migrations            #    0,000 K/sec                  
                79      page-faults               #    0,087 K/sec                  
     4 010 883 875      cycles                    #    4,441 GHz                      (83,18%)
           995 489      stalled-cycles-frontend   #    0,02% frontend cycles idle     (83,18%)
         2 152 816      stalled-cycles-backend    #    0,05% backend cycles idle      (83,18%)
    12 003 392 736      instructions              #    2,99  insn per cycle         
                                                  #    0,00  stalled cycles per insn  (83,58%)
     1 997 923 689      branches                  # 2211,947 M/sec                    (83,62%)
             6 810      branch-misses             #    0,00% of all branches          (83,27%)

       0,903763382 seconds time elapsed

       0,903399000 seconds user
       0,000000000 seconds sys


trait:

 Performance counter stats for './trait':

            461,13 msec task-clock                #    0,999 CPUs utilized          
                42      context-switches          #    0,091 K/sec                  
                 0      cpu-migrations            #    0,000 K/sec                  
                83      page-faults               #    0,180 K/sec                  
     2 005 966 735      cycles                    #    4,350 GHz                      (83,08%)
           479 742      stalled-cycles-frontend   #    0,02% frontend cycles idle     (83,52%)
         1 059 865      stalled-cycles-backend    #    0,05% backend cycles idle      (83,52%)
    11 976 454 941      instructions              #    5,97  insn per cycle         
                                                  #    0,00  stalled cycles per insn  (83,53%)
     1 994 858 712      branches                  # 4326,004 M/sec                    (83,52%)
             3 681      branch-misses             #    0,00% of all branches          (82,82%)

       0,461547111 seconds time elapsed

       0,461368000 seconds user
       0,000000000 seconds sys

@JoshMcguigan JoshMcguigan commented Mar 2, 2020

$ perf stat -d target/release/struct
1999999999 in 1012 ms

 Performance counter stats for 'target/release/struct':

          1,007.20 msec task-clock:u              #    0.994 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
                77      page-faults:u             #    0.076 K/sec                  
     3,187,890,051      cycles:u                  #    3.165 GHz                      (49.87%)
    11,991,292,654      instructions:u            #    3.76  insn per cycle           (62.40%)
     1,998,678,625      branches:u                # 1984.399 M/sec                    (62.51%)
            10,936      branch-misses:u           #    0.00% of all branches          (62.51%)
           211,202      L1-dcache-loads:u         #    0.210 M/sec                    (62.60%)
             9,782      L1-dcache-load-misses:u   #    4.63% of all L1-dcache hits    (62.57%)
             1,063      LLC-loads:u               #    0.001 M/sec                    (50.01%)
                46      LLC-load-misses:u         #    4.33% of all LL-cache hits     (49.93%)

       1.013577794 seconds time elapsed

       0.997534000 seconds user
       0.003969000 seconds sys


$ perf stat -d target/release/trait
1999999999 in 1283 ms

 Performance counter stats for 'target/release/trait':

          1,283.87 msec task-clock:u              #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
                75      page-faults:u             #    0.058 K/sec                  
     4,039,253,928      cycles:u                  #    3.146 GHz                      (49.88%)
    12,000,040,451      instructions:u            #    2.97  insn per cycle           (62.42%)
     2,000,434,927      branches:u                # 1558.131 M/sec                    (62.49%)
            14,333      branch-misses:u           #    0.00% of all branches          (62.57%)
           210,909      L1-dcache-loads:u         #    0.164 M/sec                    (62.61%)
             9,571      L1-dcache-load-misses:u   #    4.54% of all L1-dcache hits    (62.58%)
             1,256      LLC-loads:u               #    0.978 K/sec                    (49.97%)
               473      LLC-load-misses:u         #   37.66% of all LL-cache hits     (49.89%)

       1.284390399 seconds time elapsed

       1.273640000 seconds user
       0.001981000 seconds sys

Thanks @avl for suggesting the perf tool. Note the difference in LLC-load-misses between the two programs (~4% vs ~37%).

I think at this point we've pretty well proven this is a very hardware-specific behaviour, particularly since your machine demonstrates the exact opposite (trait 2x faster than struct).

Given this, I'm not sure whether there is anything actionable here for the compiler team, with the possible exception of not including the unused trait impl in the resulting binary at all. I'm not sure whether including all trait methods in the binary is standard behaviour, or whether this is an unusual situation where, since the function is inlined, it could be removed but the compiler doesn't take that code-size optimization. Either way, feel free to close this issue if you don't think it merits further investigation.

@avl avl commented Mar 2, 2020

The difference in LLC-load-misses was indeed very large! I wonder whether it has any impact on the program, though. Even 473 misses should be serviced very quickly and shouldn't affect the total runtime much. This program should be completely CPU-limited.

@avl avl commented Mar 2, 2020

I see the 'bad' performance on "rustc 1.41.1 (f3e1a95 2020-02-24)".
On "rustc 1.43.0-nightly (834bc56 2020-02-24)", both versions are equally fast.

@avl avl commented Mar 2, 2020

I've done some more digging. It seems the performance doesn't really have anything to do with the code in the binary: minor changes which don't affect the hot code path at all seem to make a difference. I even saw a case where running with the 'rt' scheduler on Linux made one of the programs slower, but not the other.

I've established that the 'problem' is not a startup penalty; each iteration of the loop really becomes slower. I verified this by having an environment variable determine the number of iterations and running the two binaries with different iteration counts. The percentage-wise performance difference was roughly constant.

If a static number of iterations is used, just changing the number of iterations is enough to change the performance.
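The environment-variable trick described above can be sketched as follows (the ITERATIONS variable name and the 1_000_000 fallback are my inventions, not names from the linked repo):

```rust
use std::time::Instant;

// Parse an iteration count from an optional string, falling back to
// a default. Split out from main so the parsing is easy to test.
fn parse_iterations(raw: Option<String>) -> u64 {
    raw.and_then(|v| v.parse().ok()).unwrap_or(1_000_000)
}

fn main() {
    // Read the loop bound from the environment so the same binary
    // can be benchmarked at different iteration counts.
    let n = parse_iterations(std::env::var("ITERATIONS").ok());

    let start = Instant::now();
    let mut i = 0u64;
    let mut odd_count = 0u64;
    while odd_count != n {
        i += 1;
        if i % 2 == 1 {
            odd_count += 1;
        }
    }
    println!("{} in {} ms", i, start.elapsed().as_millis());
}
```

Running it as `ITERATIONS=500000000 ./trait` versus `ITERATIONS=1000000000 ./trait` lets you check whether the slowdown scales with the loop, as avl reports it does.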

@avl avl commented Mar 2, 2020

perf annotate gives slightly different profiles for the fast and the slow runs:

fast: [screenshot]

slow: [screenshot]

@avl avl commented Mar 2, 2020

Ok, I've reduced the performance difference (on my machine at least) to just being a question of alignment of the inner loop. I created two small assembler programs (so that we can get very direct control over the generated program), and could relatively easily create one fast and one slow program.

See repo https://github.com/avl/strange_performance_repro .

@avl avl commented Mar 2, 2020

Well, actually, it turns out that if the inner loop is located entirely within one cache line, it is fast. If it spans two cache lines, it is slow.

If the compiler could somehow figure out that this is a very hot loop, it could, in this particular case, actually use this knowledge to improve performance.

On my machine, in this particular case, at least. It would be interesting to know whether this behaviour is common to all processors in all situations, or whether this particular micro-benchmark triggers something specific in my CPU.
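One crude way to see where a function's machine code lands relative to 64-byte cache lines is to cast the function to an address and take it modulo 64. This is a sketch of that idea, with the caveat that it only locates the function's entry point, not the inner loop itself, which is what actually matters here:

```rust
// Sketch: inspect a function's entry-point offset within its
// 64-byte cache line. The offset varies from build to build.
fn hot_loop(n: u64) -> u64 {
    let mut i = 0;
    let mut odd_count = 0;
    while odd_count != n {
        i += 1;
        if i % 2 == 1 {
            odd_count += 1;
        }
    }
    i
}

fn main() {
    let addr = hot_loop as usize;
    // 0 means the entry starts exactly on a cache-line boundary.
    println!("entry offset within 64-byte line: {}", addr % 64);
}
```

To find the alignment of the loop body itself you still need the disassembly (e.g. perf annotate or objdump), since the loop may start well past the function entry.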

@avl avl commented Mar 2, 2020

Okay, I tried on an "Intel Xeon E3-1575M". The difference is much smaller here, but the cache-line-spanning loop is still slightly slower, roughly in line with the original report.

@JoshMcguigan JoshMcguigan commented Mar 3, 2020

Really interesting analysis! Thanks for sharing your work @avl

@avl avl commented Mar 5, 2020

AMD's "Software Optimization Guide for AMD Family 17h Models 30h and Greater Processors"
has this to say:

"Since the processor can read an aligned 64-byte fetch block every cycle, aligning the end of the loop to the last byte of a 64-byte cache line is the best thing to do, if possible."

So at least profile-guided optimization could in theory detect the case described in this report and fix it: any small, hot loop with more than N iterations should be 64-byte aligned. I suppose this requires LLVM support and may be outside the scope of the Rust project?
