Moving a method from struct impl to trait causes performance degradation #69593
FWIW, if you set codegen-units to 1 you should get the same result.
I added
Hi, I repeated your benchmark on my computer with the same toolchain:
and both programs behave exactly the same. I also took a look at the generated assembly, and the code is exactly the same (minus the exported trait code, which is not used), so I think you just had bad luck with the benchmark result.
Hmm, this performance differential is very repeatable on my machine. Just to ensure we are doing the same thing, I am comparing the following commands (running inside the repo linked in the issue description):
The Cargo and rustc versions you posted don't match. Please use the newest nightly when reporting performance bugs.
Apologies for the confusion. I was trying to convey that I tested with both stable and the latest nightly, and found the same issue. I realize now that my original issue description was misleading, so I've corrected it to show the version of rustc I was using for nightly testing. Edit: actually, further testing with the following commands suggests the latest nightly provides equal performance for both versions of this program, but slower than the struct version on 1.41 stable.
I copy-pasted your code from the linked repo to godbolt (fmt is making it hard to read). We can see the only different part is this symbol. I think this can cause a different memory layout, which can significantly change the result of such a small benchmark, but this effect is essentially random, and I believe a slightly different layout could even speed up the resulting binary. That would explain why your result differs from mine. The generated assembly is the same for nightly and rustc 1.41.0. Sorry for my English, I am not a native speaker.
Thanks for your help tracking this down. It is odd that godbolt shows the generated assembly is the same from 1.41.0 to nightly, as I was seeing different performance between the two. I've since updated to 1.41.1 on my machine (which seemed to behave the same as 1.41.0), and I did the following test: $ cargo build --release --bin struct && cp target/release/struct ./struct-stable && cargo +nightly build --release --bin struct && cp target/release/struct ./struct-nightly
$ ls -l
total 5352
-rw-rw-r--. 1 josh josh 150 Feb 29 04:28 Cargo.lock
-rw-rw-r--. 1 josh josh 118 Feb 29 14:07 Cargo.toml
-rw-rw-r--. 1 josh josh 10803 Feb 29 08:10 LICENSE-APACHE.txt
-rw-rw-r--. 1 josh josh 10803 Feb 29 08:10 LICENSE-MIT.txt
-rw-rw-r--. 1 josh josh 121 Feb 29 09:45 README.md
drwxrwxr-x. 3 josh josh 4096 Feb 29 05:34 src
-rwxrwxr-x. 1 josh josh 2781704 Feb 29 18:20 struct-nightly
-rwxrwxr-x. 1 josh josh 2646504 Feb 29 18:20 struct-stable
drwxrwxr-x. 5 josh josh 4096 Feb 29 04:36 target
$ ./struct-stable
1999999999 in 1025 ms
$ ./struct-nightly
1999999999 in 1283 ms
Note the difference in size of the two binaries.
See cargo asm.
I tried the two programs, for fun, and got no difference (458 ms for both, on a Ryzen 3900X). @JoshMcguigan, what CPU are you using? And what operating system? It would be interesting to see what a profiler such as perf sees! If you're on Linux, you can run perf stat to get some CPU performance-counter values. It would be interesting to know what differs between the two binaries.
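For reference, the suggested perf invocation might look like the following (a sketch only; the binary names assume the repo's two build targets have been built in release mode):

```shell
# Compare hardware counters for the two benchmark binaries.
# LLC-load-misses is the counter that turned out to differ later in this thread.
cargo build --release
perf stat -e cycles,instructions,LLC-loads,LLC-load-misses ./target/release/struct
perf stat -e cycles,instructions,LLC-loads,LLC-load-misses ./target/release/trait
```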
Interestingly, on Linux I get:
While on Windows, I got 458 ms for both tests!
perf stat output: struct:
trait:
Thanks @avl for suggesting the perf tool. Note the difference in LLC-load-misses. I think at this point we've pretty well proven this is very hardware-specific behaviour, particularly since your machine demonstrates the exact opposite (trait 2x faster than struct). Given this, I'm not sure there is anything actionable here for the compiler team, with the possible exception of not including the trait impl in the resulting binary at all if it is not used. I'm not sure whether the compiler including all trait methods in the resulting binary is standard behaviour, or whether this is a weird situation where, since the function is inlined, it could be removed, but the compiler doesn't take that code-size optimization. In either case, feel free to close this issue if you don't think it merits further investigation.
The difference in LLC-load-misses was indeed very large! I wonder if it has any impact on the program though. Even 473 misses should complete very quickly, and not affect the total runtime very much. This program should be completely CPU-limited. |
I've done some more digging. It seems the performance doesn't really have anything to do with the code in the binary. Minor changes which don't affect the hot code path at all seem to give differences. I even saw a case where running with the 'rt scheduler' in Linux made one of the programs slower, but not the other. I've established that the 'problem' is not a 'startup penalty': each iteration of the loop really becomes slower. I verified this by having an environment variable determine the number of iterations, and running the two binaries with different numbers of iterations. The percentage-wise performance difference was roughly constant. If a static number of iterations is used, just changing the number of iterations is enough to change the performance.
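A minimal sketch of that experiment (names and the loop body are illustrative, not taken from the repo): the iteration count comes from an environment variable, so the same binary can be timed at several scales.

```rust
use std::time::Instant;

// Illustrative stand-in for the benchmark's hot loop: find the n-th odd number.
fn find_nth_odd(n: u64) -> u64 {
    let mut count = 0u64;
    let mut i = 0u64;
    while count < n {
        i += 1;
        if i % 2 == 1 {
            count += 1;
        }
    }
    i
}

fn main() {
    // ITERS is a hypothetical knob; the issue's benchmark hardcoded its count.
    let n: u64 = std::env::var("ITERS")
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or(1_000_000);
    let start = Instant::now();
    let result = find_nth_odd(n);
    println!("{} in {} ms", result, start.elapsed().as_millis());
}
```

Running with, say, `ITERS=1000000` and `ITERS=2000000` lets you check whether the relative slowdown stays constant per iteration.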
Ok, I've reduced the performance difference (on my machine at least) to just being a question of alignment of the inner loop. I created two small assembler programs (so that we can get very direct control over the generated program), and could relatively easily create one fast and one slow program. |
Well, actually, it turns out that if the inner loop is located entirely within one cache line, it is fast. If it spans two cache lines, it is slow. If the compiler could somehow figure out that this is a very hot loop, it could, in this particular case, actually use this knowledge to improve performance. On my machine, in this particular case. It would be interesting to know if this is common to all processors, in all situations, or if this particular micro benchmark triggers something in my CPU.
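The observation reduces to a simple address check (a sketch assuming a 64-byte line size; this is not the assembler programs themselves):

```rust
/// True if the byte range [start, start + len) fits inside a single
/// 64-byte cache line, i.e. does not cross a line boundary.
fn fits_in_one_cache_line(start: usize, len: usize) -> bool {
    const LINE: usize = 64;
    len > 0 && len <= LINE && start / LINE == (start + len - 1) / LINE
}

fn main() {
    // A 30-byte loop body starting at offset 40 spans two lines: the slow case.
    println!("{}", fits_in_one_cache_line(40, 30)); // false
    // The same loop starting at a line boundary fits in one line: the fast case.
    println!("{}", fits_in_one_cache_line(64, 30)); // true
}
```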
Okay, tried on an "Intel Xeon E3-1575M". The difference is much much smaller here, but the cacheline-spanning loop is still slightly slower, approximately in line with the original poster. |
Really interesting analysis! Thanks for sharing your work @avl |
AMD's "Software Optimization Guide for AMD Family 17h Models 30h and Greater Processors" states: "Since the processor can read an aligned 64-byte fetch block every cycle, aligning the end of the loop to the last byte of a 64-byte cache line is the best thing to do, if possible." So at least profile-guided optimization could in theory detect the case described in this report and fix it: any small loop with more than N iterations should be 64-byte aligned. I suppose this requires LLVM support and may be outside the scope of the Rust project?
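Following that guideline, the padding such an optimizer would need to insert can be computed directly (a sketch under the guide's 64-byte fetch-block assumption; the function name is made up):

```rust
/// Bytes of padding to insert before a loop of `len` bytes currently starting
/// at `start`, so that the loop's last byte lands on the last byte of a
/// 64-byte fetch block (the placement the AMD guide recommends).
fn padding_for_loop_end(start: usize, len: usize) -> usize {
    const BLOCK: usize = 64;
    (BLOCK - (start + len) % BLOCK) % BLOCK
}

fn main() {
    // A 60-byte loop at offset 0 needs 4 bytes of padding to end at byte 63.
    println!("{}", padding_for_loop_end(0, 60)); // 4
    // A loop already ending on a block boundary needs none.
    println!("{}", padding_for_loop_end(4, 60)); // 0
}
```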
Hi all, thanks for all the work on the rust project!
I tried this code:
I expected it to perform similarly to this code:
Instead, this happened:
The version with the is_odd method defined directly on the struct completes in 1032 ms, while the version where is_odd is defined as part of the trait completes in 1340 ms. I would expect this abstraction to be zero cost, since the find_nth_odd method in both cases is defined as taking the concrete struct as an argument (as opposed to a trait object).
Meta
rustc --version --verbose:
rustc +nightly --version --verbose:
I see the same behavior with both the latest stable and nightly compilers. I run both versions of the program in release mode. The reported performance numbers have proven to be very repeatable.
Let me know if I can provide any additional useful information. You can find my repo containing this benchmark code at the link below.
https://github.com/JoshMcguigan/cost-of-indirection
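Since the code blocks did not survive the copy here, the following is a hedged reconstruction of the benchmark's shape from the description above (names follow the issue text, but the details are assumed; the authoritative version is in the linked repo):

```rust
struct Number(u64);

// Variant 1: is_odd as an inherent method on the struct.
impl Number {
    fn is_odd(&self) -> bool {
        self.0 % 2 == 1
    }
}

// Variant 2: the same method supplied through a trait. The caller below
// still takes the concrete struct, not a trait object, so both variants
// should monomorphize to identical code.
trait OddCheck {
    fn is_odd(&self) -> bool;
}

impl OddCheck for Number {
    fn is_odd(&self) -> bool {
        self.0 % 2 == 1
    }
}

fn find_nth_odd_inherent(n: u64) -> u64 {
    let (mut count, mut i) = (0u64, 0u64);
    while count < n {
        i += 1;
        if Number(i).is_odd() {
            count += 1;
        }
    }
    i
}

// The inherent method shadows the trait method for `.is_odd()` calls, so the
// trait variant is invoked with explicit syntax to keep both in one file.
fn find_nth_odd_trait(n: u64) -> u64 {
    let (mut count, mut i) = (0u64, 0u64);
    while count < n {
        i += 1;
        if OddCheck::is_odd(&Number(i)) {
            count += 1;
        }
    }
    i
}

fn main() {
    let n = 1_000; // the issue benchmarked a much larger n
    assert_eq!(find_nth_odd_inherent(n), find_nth_odd_trait(n));
    println!("{}", find_nth_odd_inherent(n)); // prints "1999"
}
```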