impls on arrays seem slower than equivalent tuples #80140

smmalis37 · 2020-12-18T00:53:09Z

When running the following program:

Cargo.toml:

[dependencies]
criterion = "0.3"

main.rs:

use criterion::{black_box, Criterion};

fn main() {
    let mut c = Criterion::default();
    c.bench_function("partialeq-array", |b| {
        b.iter(|| {
            let a = black_box([0; 3]);
            let b = black_box([0; 3]);
            for _ in 0..1_000_000 {
                black_box(a.eq(&b));
            }
        })
    });
    c.bench_function("partialeq-tuple", |b| {
        b.iter(|| {
            let a = black_box((0, 0, 0));
            let b = black_box((0, 0, 0));
            for _ in 0..1_000_000 {
                black_box(a.eq(&b));
            }
        })
    });
}

I get these results in debug builds:

partialeq-array         time:   [46.308 ms 46.462 ms 46.618 ms]
partialeq-tuple         time:   [31.133 ms 31.230 ms 31.339 ms]

And these results in release builds:

partialeq-array         time:   [628.51 us 633.17 us 638.23 us]
partialeq-tuple         time:   [211.77 us 212.61 us 213.60 us]

I had expected these to be equivalent, but tuples are significantly faster than arrays. This seems to apply to any size of number (i8, u32, etc), and some other traits as well (notably Hash).

The text was updated successfully, but these errors were encountered:

smmalis37 · 2020-12-18T00:53:33Z

@rustbot label I-slow

smmalis37 · 2020-12-18T00:54:16Z

Tagging another interested party: @aldanor

sfackler · 2020-12-18T02:04:04Z

The only difference in the compiled code is the array impl's extra check for pointer equality: https://rust.godbolt.org/z/1ea4e5

sfackler · 2020-12-18T02:21:01Z

Huh, on 1.48.0 the tuple impl ends up branching on each value, and on nightly the array impl spills to the stack (?!).

https://rust.godbolt.org/z/oE41ed

smmalis37 · 2020-12-18T04:08:08Z

Correct me if I'm wrong, but since those arrays are being passed by value their addresses can never be the same right? So that check and branch are completely pointless? I could understand if it was over a slice or a ref to an array, but not on an array itself.

smmalis37 · 2020-12-18T04:13:28Z

Results on current nightly:
Debug:

partialeq-array         time:   [47.103 ms 47.252 ms 47.431 ms]
partialeq-tuple         time:   [32.375 ms 32.441 ms 32.513 ms]

Release:

partialeq-array         time:   [672.38 us 674.07 us 676.14 us]
partialeq-tuple         time:   [207.67 us 207.94 us 208.27 us]

Looks basically the same to me, just a little bit of noise.

nikic · 2020-12-19T20:30:21Z

IR variant of the above godbolt link is enlightening: https://rust.godbolt.org/z/Wb4dTY

Looks like this is fallout from the switch to pass values smaller than 2 registers by-value. For the array case this ends up requiring a stack copy, because the bcmp is expanded too late to avoid it. For the tuple case the value is passed as an i96 (!!!) and then parts are extracted from it using bit arithmetic. I guess in the latter case LLVM should probably fold these comparisons down to one comparison of i96.

sfackler · 2020-12-19T20:47:42Z

Correct me if I'm wrong, but since those arrays are being passed by value their addresses can never be the same right?

That is true, but passing that information to LLVM results in miscompilations IIRC so it's not done by default.

erikdesjardins · 2020-12-20T01:03:07Z

The only difference in the compiled code is the array impl's extra check for pointer equality

I thought we removed this comparison because it caused other problems, but the PR languished and died; resurrected in #80209.

This resurrects rust-lang#71735. Fixes rust-lang#71602, helps with rust-lang#80140. r? `@Mark-Simulacrum`

Remove pointer comparison from slice equality This resurrects rust-lang#71735. Fixes rust-lang#71602, helps with rust-lang#80140. r? `@Mark-Simulacrum`

smmalis37 · 2020-12-27T22:13:07Z

Despite that PR, the latest nightly sadly changes nothing for benchmark timings here on my system.

erikdesjardins · 2020-12-28T01:51:03Z

#80209 only affects arrays that are passed indirectly. Arrays (or any aggregate, roughly) are passed indirectly when they're larger than 1 * size_of::<usize>() on 1.48, or 2 * size_of::<usize>() on nightly. In your example the array is 1.5*usize, small enough to be passed by value on nightly.

When passed by value, LLVM can tell that the arrays never have the same address, and eliminates the check. So it would otherwise generate optimal code...but it also unnecessarily spills to the stack (#80140 (comment)), which is a different problem.

smmalis37 · 2020-12-28T02:04:50Z

I tried incrementing the size of my benchmark to 12 elements (the largest size of tuple that has a PartialEq impl), and changed the types to i64, to ensure they'd be large enough to be passed indirectly. Same result.

use criterion::{black_box, Criterion};

fn main() {
    let mut c = Criterion::default();
    c.bench_function("partialeq-array", |b| {
        b.iter(|| {
            let a: [i64; 12] = black_box([0; 12]);
            let b: [i64; 12] = black_box([0; 12]);
            for _ in 0..1_000_000 {
                black_box(a.eq(&b));
            }
        })
    });
    c.bench_function("partialeq-tuple", |b| {
        b.iter(|| {
            let a: (i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64) =
                black_box((0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0));
            let b: (i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64) =
                black_box((0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0));
            for _ in 0..1_000_000 {
                black_box(a.eq(&b));
            }
        })
    });
}

partialeq-array         time:   [3.8812 ms 3.9138 ms 3.9492 ms]
partialeq-tuple         time:   [220.85 us 222.05 us 223.39 us]

Strangely the tuple time didn't actually get any longer despite being much bigger. Is there something wrong in the benchmark or is it seeing through black_box somehow?

smmalis37 · 2021-03-26T11:25:54Z

As of 1.51.0 these timings remain unchanged, both for small types (3 i32s) and large ones (12 i64s).

smmalis37 · 2021-04-14T00:20:55Z

This seems to be somewhat resolved sometime between 1.51 and the current nightly. While the assembly generated is not the same, the runtime of both is now roughly equivalent on my system:

partialeq-array         time:   [209.19 us 209.38 us 209.58 us]
partialeq-tuple         time:   [209.20 us 209.42 us 209.69 us]

mati865 · 2021-04-21T14:25:48Z

Is it fixed on beta?
LLVM 12 is likely what fixed this issue.

smmalis37 · 2021-09-29T17:34:31Z

This seems to be fixed in the current nightly. While debug builds still exhibit the issue it's to a much smaller degree, and release builds don't show it at all.

scottmcm · 2021-12-15T02:15:38Z

and some other traits as well (notably Hash)

Sadly Hash is forced to be slower on arrays than on tuples because Borrowing arrays as slices means that hashing an array must include the length (#86131), which isn't needed for tuples.

rustbot added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Dec 18, 2020

erikdesjardins mentioned this issue Dec 20, 2020

Remove pointer comparison from slice equality #80209

Closed

bors added a commit to rust-lang-ci/rust that referenced this issue Dec 26, 2020

Remove pointer comparison from slice equality

733cb54

This resurrects rust-lang#71735. Fixes rust-lang#71602, helps with rust-lang#80140. r? `@Mark-Simulacrum`

smmalis37 closed this as completed Sep 29, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

impls on arrays seem slower than equivalent tuples #80140

impls on arrays seem slower than equivalent tuples #80140

smmalis37 commented Dec 18, 2020

smmalis37 commented Dec 18, 2020

smmalis37 commented Dec 18, 2020

sfackler commented Dec 18, 2020

sfackler commented Dec 18, 2020 •

edited

Loading

smmalis37 commented Dec 18, 2020

smmalis37 commented Dec 18, 2020

nikic commented Dec 19, 2020

sfackler commented Dec 19, 2020

erikdesjardins commented Dec 20, 2020 •

edited

Loading

smmalis37 commented Dec 27, 2020

erikdesjardins commented Dec 28, 2020

smmalis37 commented Dec 28, 2020 •

edited

Loading

smmalis37 commented Mar 26, 2021

smmalis37 commented Apr 14, 2021

mati865 commented Apr 21, 2021

smmalis37 commented Sep 29, 2021

scottmcm commented Dec 15, 2021

impls on arrays seem slower than equivalent tuples #80140

impls on arrays seem slower than equivalent tuples #80140

Comments

smmalis37 commented Dec 18, 2020

smmalis37 commented Dec 18, 2020

smmalis37 commented Dec 18, 2020

sfackler commented Dec 18, 2020

sfackler commented Dec 18, 2020 • edited Loading

smmalis37 commented Dec 18, 2020

smmalis37 commented Dec 18, 2020

nikic commented Dec 19, 2020

sfackler commented Dec 19, 2020

erikdesjardins commented Dec 20, 2020 • edited Loading

smmalis37 commented Dec 27, 2020

erikdesjardins commented Dec 28, 2020

smmalis37 commented Dec 28, 2020 • edited Loading

smmalis37 commented Mar 26, 2021

smmalis37 commented Apr 14, 2021

mati865 commented Apr 21, 2021

smmalis37 commented Sep 29, 2021

scottmcm commented Dec 15, 2021

sfackler commented Dec 18, 2020 •

edited

Loading

erikdesjardins commented Dec 20, 2020 •

edited

Loading

smmalis37 commented Dec 28, 2020 •

edited

Loading