Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slice equality is slow #16913

Closed
nham opened this Issue Sep 1, 2014 · 13 comments

Comments

Projects
None yet
@nham
Copy link
Contributor

nham commented Sep 1, 2014

If I run the following code with rustc -O --test src/slice_equality_slow.rs && ./slice_equality_slow --bench:

extern crate test;

fn eq<'a, T: PartialEq>(a: &'a [T], b: &'a [T]) -> bool {
    if a.len() != b.len() {
        return false;
    }

    for i in range(0, a.len()) {
        if a[i] != b[i] {
            return false;
        }
    }

    true
}

#[cfg(test)]
mod bench {
    use test::Bencher;

    #[bench]
    fn builtin_eq(b: &mut Bencher) {
        let mut u = vec!();
        let mut v = vec!();
        for i in range(0u, 1_000_000) {
            u.push(i);
            v.push(i);
        }

        let x = u.as_slice();
        let y = v.as_slice();

        b.iter(|| {
            assert!(x == y);
        })
    }

    #[bench]
    fn custom_eq(b: &mut Bencher) {
        let mut u = vec!();
        let mut v = vec!();
        for i in range(0u, 1_000_000) {
            u.push(i);
            v.push(i);
        }

        let x = u.as_slice();
        let y = v.as_slice();

        b.iter(|| {
            assert!(super::eq(x, y));
        })
    }
}

fn main() {}

Then I get:

running 2 tests
test bench::builtin_eq ... bench:  12513842 ns/iter (+/- 64900)
test bench::custom_eq  ... bench:   7303767 ns/iter (+/- 304090)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured

I ran into this because I was able to speed up the naive string matching algorithm in core::str by replacing the slice equality with a for loop and comparing each component manually.

@huonw huonw added the I-slow label Sep 1, 2014

@ben0x539

This comment has been minimized.

Copy link
Contributor

ben0x539 commented Sep 2, 2014

the builtin_eq x == y comparison seems to take 3x as much time as memcmp, fwiw

@dotdash

This comment has been minimized.

Copy link
Contributor

dotdash commented Sep 9, 2014

The call to eq is not even inlined at opt-level 2, only at level 3. Also, it seems that the null checks from the iterator aren't eliminated at either opt-level:

match_case.i:                                     ; preds = %"_ZN5slice57Items$LT$$x27a$C$$x20T$GT$.Iterator$LT$$BP$$x27a$x20T$GT$4next21h11506221690941188551E.exit"
  %sret_slot.sroa.0.0.i.lcssa310 = phi i8* [ %sret_slot.sroa.0.0.i, %"_ZN5slice57Items$LT$$x27a$C$$x20T$GT$.Iterator$LT$$BP$$x27a$x20T$GT$4next21h11506221690941188551E.exit" ]
  %60 = icmp eq i8* %sret_slot.sroa.0.0.i.lcssa310, null
  br i1 %60, label %.noexc76, label %then-block-191-.i.loopexit309

match_case6.i:                                    ; preds = %"_ZN5slice57Items$LT$$x27a$C$$x20T$GT$.Iterator$LT$$BP$$x27a$x20T$GT$4next21h11506221690941188551E.exit"
  %61 = icmp eq i8* %sret_slot.sroa.0.0.i, null
  br i1 %61, label %then-block-191-.i.loopexit, label %.noexc197

.noexc197:                                        ; preds = %match_case6.i
  %62 = bitcast i8* %sret_slot.sroa.0.0.i to i64*
  %63 = bitcast i8* %sret_slot.sroa.0.0.i201 to i64*
  %64 = load i64* %63, align 8
  %65 = load i64* %62, align 8
  %66 = icmp eq i64 %64, %65
  br i1 %66, label %loop_body.i, label %then-block-191-.i.loopexit

cc @zwarich -- Could the NullCheckElim pass handle that?

That aside, even the custom_eq version isn't optimal yet. Using a direct memcmp call I get:

test bench::builtin_eq ... bench:   1747797 ns/iter (+/- 22496)
test bench::custom_eq  ... bench:    964028 ns/iter (+/- 63657)
test bench::memcmp_eq  ... bench:    518912 ns/iter (+/- 2196)

AFAICT, LLVM has no optimization to optimize loops to calls to memcmp, yet (it's a TODO in the LoopIdiomRecognize xform), so we probably lose out because of that.

@bluss

This comment has been minimized.

Copy link
Contributor

bluss commented Sep 9, 2014

The custom_eq method can be improved slightly by using raw pointers (C++ style iteration). But even then, the distance to memcmp is vast. For u8 elements it's 10x.

@dotdash

This comment has been minimized.

Copy link
Contributor

dotdash commented Sep 10, 2014

Hm, I don't see that much of a difference for u8 elements. The (unmodified) custom_eq is about as fast as memcp for me.

test bench::custom_eq  ... bench:    537754 ns/iter (+/- 8113)
test bench::memcmp_eq  ... bench:    516983 ns/iter (+/- 2994)

I implemented memcmp_eq like this:

fn memcmp_eq<'a, T: PartialEq>(a: &'a [T], b: &'a [T]) -> bool {
    if a.len() != b.len() {
        return false;
    }

    unsafe {
        rlibc::memcmp(a.as_ptr() as *const _, b.as_ptr() as *const _, a.len()) == 0
    }
}
@bluss

This comment has been minimized.

Copy link
Contributor

bluss commented Sep 10, 2014

try libc memcmp instead

@thestinger

This comment has been minimized.

Copy link
Contributor

thestinger commented Oct 10, 2014

The LLVM loop idiom pass doesn't know how to generate memcmp right now. That's where the problem needs to be fixed. Otherwise, you need to target the specific CPU directly to get good SIMD from LLVM.

@cmr cmr self-assigned this Mar 25, 2015

@ranma42

This comment has been minimized.

Copy link
Contributor

ranma42 commented Sep 16, 2015

Is this fixed by #26884 ?

@bluss

This comment has been minimized.

Copy link
Contributor

bluss commented Sep 16, 2015

It seems to be fixed in the sense of the original reporter?

But, slice equality still doesn't vectorize properly or compare well to glibc's memcmp (for byte slices), so improvements remain.

@bluss

This comment has been minimized.

Copy link
Contributor

bluss commented Sep 16, 2015

@dotdash Not sure why it doesn't vectorize in nightly:

https://play.rust-lang.org/?gist=38c5ef4ccf66898cc261&version=nightly

pub fn compare(a: &[u8], b: &[u8]) -> bool {
    a == b
}

@cmr cmr removed their assignment Jan 5, 2016

@arthurprs

This comment has been minimized.

Copy link
Contributor

arthurprs commented Jan 23, 2016

As of today this is still true.

@ranma42

This comment has been minimized.

Copy link
Contributor

ranma42 commented Jan 23, 2016

Related bug in LLVM: https://llvm.org/bugs/show_bug.cgi?id=16332

Basically the problem is that the LLVM vectoriser does not know how to optimise loops with multiple exits or (which is the same) with termination guards that are not simple constraints on the index variable.

@nagisa

This comment has been minimized.

Copy link
Contributor

nagisa commented Mar 14, 2017

Seems to be fixed now with some specialisation (the slow comparison, not LLVM bug)

▪./memcmp --bench

running 2 tests
test bench::builtin_eq ... bench:     393,159 ns/iter (+/- 28,862)
test bench::custom_eq  ... bench:     581,997 ns/iter (+/- 42,720)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured
@aturon

This comment has been minimized.

Copy link
Member

aturon commented Mar 14, 2017

Thanks @nagisa, closing!

@aturon aturon closed this Mar 14, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.