Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nearly 50% performance regression when copying/moving small byte arrays between 1.71 and 1.72 #115212

Closed
tpelkone opened this issue Aug 25, 2023 · 9 comments · Fixed by #115236
Closed
Assignees
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. regression-from-stable-to-stable Performance or correctness regression from one stable version to another.

Comments

@tpelkone
Copy link

Summary

Nearly 50% performance regression between versions 1.71 and 1.72 when copying small byte arrays with stable-aarch64-apple-darwin platform (Macbook Pro M1). Used cargo bisect-rustc to track down the regression to fd9bf59 #111999

Code

I tried this code:

#![feature(test)]
extern crate test;
use test::Bencher;

#[bench]
fn bench(b: &mut Bencher) {
    b.iter(|| {
        let data = vec![[1u8; 24]; 1024 * 1024];
        let mut new_vec = Vec::with_capacity(1024 * 1024);
        for entry in data {
            new_vec.push(entry);
        }
    });
}

I expected to see this happen: to run about 6,200,000 ns/iter like version 1.71

$ cargo +nightly-2023-06-06-aarch64-apple-darwin bench
   Compiling benchmarks v0.1.0 (/Users/tuomas/rust/benchmarks)
    Finished bench [optimized] target(s) in 0.18s
     Running unittests src/main.rs (target/release/deps/benchmarks-0c0dfc3834e16b96)

running 1 test
test bench ... bench:   6,249,724 ns/iter (+/- 1,609,874)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 0 filtered out; finished in 5.70s

Instead, this happened: 9,159,412 ns/iter

$ cargo +nightly-2023-06-07-aarch64-apple-darwin bench
    Finished bench [optimized] target(s) in 0.00s
     Running unittests src/main.rs (target/release/deps/benchmarks-0c0dfc3834e16b96)

running 1 test
test bench ... bench:   9,159,412 ns/iter (+/- 469,568)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 0 filtered out; finished in 2.78s

Version it worked on

It most recently worked on: 1.71
More specifically nightly-2023-06-06-aarch64-apple-darwin

Version with regression

1.72
nightly-2023-06-07-aarch64-apple-darwin

rustc --version --verbose:

rustc 1.72.0 (5680fa18f 2023-08-23)
binary: rustc
commit-hash: 5680fa18feaa87f3ff04063800aec256c3d4b4be
commit-date: 2023-08-23
host: aarch64-apple-darwin
release: 1.72.0
LLVM version: 16.0.5

cargo bisect-rustc

Searched nightlies: from nightly-2023-06-06 to nightly-2023-06-07
regressed nightly: nightly-2023-06-07
searched commit range: e6d4725...b2b34bd
regressed commit: fd9bf59

bisected with cargo-bisect-rustc v0.6.7

Host triple: x86_64-apple-darwin
Reproduce with:

cargo bisect-rustc --start=2023-06-06 --end=2023-06-07 -- run --release

@rustbot modify labels: +regression-from-stable-to-stable -regression-untriaged

@tpelkone tpelkone added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels Aug 25, 2023
@rustbot rustbot added needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. I-prioritize Issue: Indicates that prioritization has been requested for this issue. regression-from-stable-to-stable Performance or correctness regression from one stable version to another. and removed regression-untriaged Untriaged performance or correctness regression. labels Aug 25, 2023
@saethlin
Copy link
Member

Here's a demo for [u8; 31]: https://godbolt.org/z/c3rdPM8K4

@saethlin saethlin added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Aug 25, 2023
@erikdesjardins
Copy link
Contributor

Here's a demo for [u8; 31]: godbolt.org/z/c3rdPM8K4

Oof, that's massively worse. But even with a nice size ([u8; 32]), it's still a small regression: https://godbolt.org/z/1PW1xbcjn

We should probably revert #111999, or limit it to very small arrays (<= 8 bytes / 1 usize? https://godbolt.org/z/6sWdYaeEx)

@saethlin
Copy link
Member

The original motivation of that PR is for arrays larger than that, so I'm not so sure.

If I strip this down to just the pointer write it's not as bad, but it's quite clear that the problem is just x86-isel: https://godbolt.org/z/qh89c85P8 Is that even good instruction selection?

@scottmcm scottmcm self-assigned this Aug 25, 2023
@scottmcm
Copy link
Member

Hmm, curiously looking at just the part of the push logic that copies the value, the assembly is entirely reasonable:

pub fn push(v: &mut Vec<[u8; 24]>, x: [u8; 24]) {
    if v.len() < v.capacity() { v.push(x); }
}

https://rust.godbolt.org/z/45fEo314o

example::push:
        mov     rax, qword ptr [rdi + 16]
        cmp     rax, qword ptr [rdi + 8]
        jae     .LBB0_2
        movups  xmm0, xmmword ptr [rsi]
        mov     rcx, qword ptr [rsi + 16]
        mov     rdx, qword ptr [rdi]
        lea     rsi, [rax + 2*rax]
        mov     qword ptr [rdx + 8*rsi + 16], rcx
        movups  xmmword ptr [rdx + 8*rsi], xmm0
        inc     rax
        mov     qword ptr [rdi + 16], rax
.LBB0_2:
        ret

But then in the context of a "real" push, it somehow still blows up: https://rust.godbolt.org/z/zW34r7h1j

@scottmcm
Copy link
Member

Here's a small tweak to try to fix this for 1.73: #115236

@scottmcm
Copy link
Member

I've opened #115242 to track the suboptimal push codegen mentioned in #115212 (comment)

@tpelkone Once a new nightly is out, please give it a shot and reopen if you're still seeing the regression.

@mati865
Copy link
Contributor

mati865 commented Aug 26, 2023

On Ubuntu VM running on Ryzen 5950X this is far worse:

# 1.71.1
test bench ... bench:   6,363,711 ns/iter (+/- 311,522)
# 1,72,0
test bench ... bench:  15,529,123 ns/iter (+/- 656,905)

2.4x regression.

@tpelkone
Copy link
Author

@scottmcm Things look much better with rust version 1.74.0-nightly (69e97df5c 2023-08-26). Thanks for the quick fix!

@scottmcm
Copy link
Member

@tpelkone Thanks for confirming! I'll see if it can make 1.73.

@apiraino apiraino removed the I-prioritize Issue: Indicates that prioritization has been requested for this issue. label Aug 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. regression-from-stable-to-stable Performance or correctness regression from one stable version to another.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants