Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

impls on arrays seem slower than equivalent tuples #80140

Closed
smmalis37 opened this issue Dec 18, 2020 · 17 comments
Closed

impls on arrays seem slower than equivalent tuples #80140

smmalis37 opened this issue Dec 18, 2020 · 17 comments
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@smmalis37
Copy link
Contributor

When running the following program:

Cargo.toml:

[dependencies]
criterion = "0.3"

main.rs:

use criterion::{black_box, Criterion};

fn main() {
    let mut c = Criterion::default();
    c.bench_function("partialeq-array", |b| {
        b.iter(|| {
            let a = black_box([0; 3]);
            let b = black_box([0; 3]);
            for _ in 0..1_000_000 {
                black_box(a.eq(&b));
            }
        })
    });
    c.bench_function("partialeq-tuple", |b| {
        b.iter(|| {
            let a = black_box((0, 0, 0));
            let b = black_box((0, 0, 0));
            for _ in 0..1_000_000 {
                black_box(a.eq(&b));
            }
        })
    });
}

I get these results in debug builds:

partialeq-array         time:   [46.308 ms 46.462 ms 46.618 ms]
partialeq-tuple         time:   [31.133 ms 31.230 ms 31.339 ms]

And these results in release builds:

partialeq-array         time:   [628.51 us 633.17 us 638.23 us]
partialeq-tuple         time:   [211.77 us 212.61 us 213.60 us]

I had expected these to be equivalent, but tuples are significantly faster than arrays. This seems to apply to any size of number (i8, u32, etc), and some other traits as well (notably Hash).

@smmalis37
Copy link
Contributor Author

@rustbot label I-slow

@rustbot rustbot added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Dec 18, 2020
@smmalis37
Copy link
Contributor Author

Tagging another interested party: @aldanor

@sfackler
Copy link
Member

The only difference in the compiled code is the array impl's extra check for pointer equality: https://rust.godbolt.org/z/1ea4e5

@sfackler
Copy link
Member

sfackler commented Dec 18, 2020

Huh, on 1.48.0 the tuple impl ends up branching on each value, and on nightly the array impl spills to the stack (?!).

https://rust.godbolt.org/z/oE41ed

@smmalis37
Copy link
Contributor Author

Correct me if I'm wrong, but since those arrays are being passed by value their addresses can never be the same right? So that check and branch are completely pointless? I could understand if it was over a slice or a ref to an array, but not on an array itself.

@smmalis37
Copy link
Contributor Author

Results on current nightly:
Debug:

partialeq-array         time:   [47.103 ms 47.252 ms 47.431 ms]
partialeq-tuple         time:   [32.375 ms 32.441 ms 32.513 ms]

Release:

partialeq-array         time:   [672.38 us 674.07 us 676.14 us]
partialeq-tuple         time:   [207.67 us 207.94 us 208.27 us]

Looks basically the same to me, just a little bit of noise.

@nikic
Copy link
Contributor

nikic commented Dec 19, 2020

IR variant of the above godbolt link is enlightening: https://rust.godbolt.org/z/Wb4dTY

Looks like this is fallout from the switch to pass values smaller than 2 registers by-value. For the array case this ends up requiring a stack copy, because the bcmp is expanded too late to avoid it. For the tuple case the value is passed as an i96 (!!!) and then parts are extracted from it using bit arithmetic. I guess in the latter case LLVM should probably fold these comparisons down to one comparison of i96.

@sfackler
Copy link
Member

Correct me if I'm wrong, but since those arrays are being passed by value their addresses can never be the same right?

That is true, but passing that information to LLVM results in miscompilations IIRC so it's not done by default.

@erikdesjardins
Copy link
Contributor

erikdesjardins commented Dec 20, 2020

The only difference in the compiled code is the array impl's extra check for pointer equality

I thought we removed this comparison because it caused other problems, but the PR languished and died; resurrected in #80209.

bors added a commit to rust-lang-ci/rust that referenced this issue Dec 26, 2020
This resurrects rust-lang#71735.

Fixes rust-lang#71602, helps with rust-lang#80140.

r? `@Mark-Simulacrum`
bors added a commit to rust-lang-ci/rust that referenced this issue Dec 26, 2020
Remove pointer comparison from slice equality

This resurrects rust-lang#71735.

Fixes rust-lang#71602, helps with rust-lang#80140.

r? `@Mark-Simulacrum`
@smmalis37
Copy link
Contributor Author

Despite that PR, the latest nightly sadly changes nothing for benchmark timings here on my system.

@erikdesjardins
Copy link
Contributor

#80209 only affects arrays that are passed indirectly. Arrays (or any aggregate, roughly) are passed indirectly when they're larger than 1 * size_of::<usize>() on 1.48, or 2 * size_of::<usize>() on nightly. In your example the array is 1.5*usize, small enough to be passed by value on nightly.

When passed by value, LLVM can tell that the arrays never have the same address, and eliminates the check. So it would otherwise generate optimal code...but it also unnecessarily spills to the stack (#80140 (comment)), which is a different problem.

@smmalis37
Copy link
Contributor Author

smmalis37 commented Dec 28, 2020

I tried incrementing the size of my benchmark to 12 elements (the largest size of tuple that has a PartialEq impl), and changed the types to i64, to ensure they'd be large enough to be passed indirectly. Same result.

use criterion::{black_box, Criterion};

fn main() {
    let mut c = Criterion::default();
    c.bench_function("partialeq-array", |b| {
        b.iter(|| {
            let a: [i64; 12] = black_box([0; 12]);
            let b: [i64; 12] = black_box([0; 12]);
            for _ in 0..1_000_000 {
                black_box(a.eq(&b));
            }
        })
    });
    c.bench_function("partialeq-tuple", |b| {
        b.iter(|| {
            let a: (i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64) =
                black_box((0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0));
            let b: (i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64) =
                black_box((0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0));
            for _ in 0..1_000_000 {
                black_box(a.eq(&b));
            }
        })
    });
}
partialeq-array         time:   [3.8812 ms 3.9138 ms 3.9492 ms]
partialeq-tuple         time:   [220.85 us 222.05 us 223.39 us]

Strangely the tuple time didn't actually get any longer despite being much bigger. Is there something wrong in the benchmark or is it seeing through black_box somehow?

@smmalis37
Copy link
Contributor Author

As of 1.51.0 these timings remain unchanged, both for small types (3 i32s) and large ones (12 i64s).

@smmalis37
Copy link
Contributor Author

This seems to be somewhat resolved sometime between 1.51 and the current nightly. While the assembly generated is not the same, the runtime of both is now roughly equivalent on my system:

partialeq-array         time:   [209.19 us 209.38 us 209.58 us]
partialeq-tuple         time:   [209.20 us 209.42 us 209.69 us]

@mati865
Copy link
Contributor

mati865 commented Apr 21, 2021

Is it fixed on beta?
LLVM 12 is likely what fixed this issue.

@smmalis37
Copy link
Contributor Author

This seems to be fixed in the current nightly. While debug builds still exhibit the issue it's to a much smaller degree, and release builds don't show it at all.

@scottmcm
Copy link
Member

and some other traits as well (notably Hash)

Sadly Hash is forced to be slower on arrays than on tuples because Borrowing arrays as slices means that hashing an array must include the length (#86131), which isn't needed for tuples.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code.
Projects
None yet
Development

No branches or pull requests

7 participants