Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chain() make collect very slow #63340

Open
Stargateur opened this issue Aug 6, 2019 · 17 comments
Open

chain() make collect very slow #63340

Stargateur opened this issue Aug 6, 2019 · 17 comments
Labels
A-iterators Area: Iterators C-enhancement Category: An issue proposing an enhancement or a PR with one. E-mentor Call for participation: This issue has a mentor. Use #t-compiler/help on Zulip for discussion. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@Stargateur
Copy link
Contributor

Stargateur commented Aug 6, 2019

While working on a SO question.

We was wondering if chain() would produce an acceptable speed, after some digging and benchmark, we come to the conclusion that collect() is slow because it use while let. Unfortunately, this make collect very slow, I don't really understand why but that a fact.

But we saw that for_each() (probably thank to fold()) implementation of chain() don't have this problem and produce something a lot faster.

#![feature(test)]
extern crate test;

use either::Either; // 1.5.2
use std::iter;

#[derive(Debug, Default)]
pub struct Data<X, Y> {
    head: Option<Y>,
    pairs: Vec<(X, Y)>,
    tail: Option<X>,
}

impl<X, Y> Data<X, Y> {
    pub fn iter(&self) -> impl Iterator<Item = Either<&X, &Y>> {
        let head = self.head.iter().map(Either::Right);

        let pairs = self.pairs.iter().flat_map(|(a, b)| {
            let a = iter::once(Either::Left(a));
            let b = iter::once(Either::Right(b));
            a.chain(b)
        });

        let tail = self.tail.iter().map(Either::Left);

        head.chain(pairs).chain(tail)
    }
}

#[derive(Debug)]
struct AData(usize);
#[derive(Debug)]
struct BData(usize);

#[cfg(test)]
mod tests {
    use crate::{AData, BData, Data};
    use test::Bencher; // 1.5.2

    #[bench]
    fn test_for_each(b: &mut Bencher) {
        b.iter(|| {
            let data = Data {
                head: Some(BData(84)),
                pairs: std::iter::repeat_with(|| (AData(42), BData(84)))
                    .take(20998)
                    .collect(),
                tail: Some(AData(42)),
            };

            let mut data_bis = Vec::with_capacity(21000);
            data.iter().for_each(|x| data_bis.push(x));
        });
    }

    #[bench]
    fn test_collect(b: &mut Bencher) {
        b.iter(|| {
            let data = Data {
                head: Some(BData(84)),
                pairs: std::iter::repeat_with(|| (AData(42), BData(84)))
                    .take(20998)
                    .collect(),
                tail: Some(AData(42)),
            };

            let _: Vec<_> = data.iter().collect();
        });
    }
}
test tests::test_collect  ... bench:   1,682,529 ns/iter (+/- 2,157,023)
test tests::test_for_each ... bench:     609,031 ns/iter (+/- 750,944)

So, should we change implementation of collect to use for_each() ? Note that a for loop doesn't solve the problem. For this to be optimized we need to use for_each().

@Centril Centril added I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. A-iterators Area: Iterators labels Aug 6, 2019
@Centril
Copy link
Contributor

Centril commented Aug 6, 2019

cc @scottmcm

@scottmcm scottmcm added the E-mentor Call for participation: This issue has a mentor. Use #t-compiler/help on Zulip for discussion. label Aug 6, 2019
@scottmcm
Copy link
Member

scottmcm commented Aug 7, 2019

Sounds like a good change; please make a PR, @Stargateur! Let me know if you need any assistance.

I don't really understand why but that a fact.

There are some resources about why linked from #45595

@Stargateur
Copy link
Contributor Author

I need to search a lot more because I find simple case where collect() look much better, but in some case that the exact opposite. As it's a core function I do not want to change anything if I'm not sure this is better on most case.

I find some bench test for vector, so I will try to add some and see if for_each() really improve thing. I'm not sure it will:

    fn extend_desugared<I: Iterator<Item = T>>(&mut self, iterator: I) {
        // This is the case for a general iterator.
        //
        // This function should be the moral equivalent of:
        //
        //      for item in iterator {
        //          self.push(item);
        //      }

        let (lower, upper) = iterator.size_hint();
        if let Some(upper) = upper {
            self.reserve(upper); // why not ?
        }
        else {
            self.reserve(lower)
        }

        // We use `for_each()` that should allow more efficient code
        iterator.for_each(|element| {
            let len = self.len();
            if len == self.capacity() {
                self.reserve(1);
            }
            unsafe {
                ptr::write(self.get_unchecked_mut(len), element);
                // NB can't overflow since we would have had to alloc the address space
                self.set_len(len + 1);
            }
        })
    }

The main difference (actuel code) is that we can't anymore check the lower bound of the iterator in the loop, also I wonder why not take the upper bound if it exist. Well, I suppose this is for saving memory but I don't know if it's make sense because reserve already take more memory in a lot of case. So why not just hint what should be the max ?

Also, Rust take like forever to compile on my PC, this going to take a while to do my search.

@tesuji
Copy link
Contributor

tesuji commented Aug 7, 2019

You could try https://internals.rust-lang.org/t/gcc-compile-farm-for-rustc/9511, but it will take
a while (half a month) to receive a confirm mail from the build farm moderators.

@Stargateur
Copy link
Contributor Author

#50481, well better give up for now, I can't wait 10 hours everytime to compile rust for each little change.

@hbina
Copy link
Contributor

hbina commented Oct 8, 2019

Here's my bench result

hbina085@akarin:~/Documents/issue_63340_chain_bench$ cargo bench
   Compiling issue_63340_chain_bench v0.1.0 (/home/hbina085/Documents/issue_63340_chain_bench)
    Finished bench [optimized] target(s) in 0.77s
     Running target/release/deps/issue_63340_chain_bench-51e8cb2b35e76763

running 2 tests
test tests::test_collect  ... bench:     206,178 ns/iter (+/- 213,147)
test tests::test_for_each ... bench:     216,507 ns/iter (+/- 6,650)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out

     Running target/release/deps/issue_63340_chain_bench-d3cab5b8c9523cd3

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

While the average performance is about the same, the variance is still larger. Another run...

hbina085@akarin:~/Documents/issue_63340_chain_bench$ cargo bench
    Finished bench [optimized] target(s) in 0.01s
     Running target/release/deps/issue_63340_chain_bench-51e8cb2b35e76763

running 2 tests
test tests::test_collect  ... bench:     186,086 ns/iter (+/- 13,684)
test tests::test_for_each ... bench:     223,619 ns/iter (+/- 57,158)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out

     Running target/release/deps/issue_63340_chain_bench-d3cab5b8c9523cd3

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

I don't even know anymore...

@hbina
Copy link
Contributor

hbina commented Oct 8, 2019

@rustbot claim

@rustbot rustbot self-assigned this Oct 8, 2019
@csmoe
Copy link
Member

csmoe commented Oct 9, 2019

@hbina cargo bench'd on my machine, ".for_each() faster than .collect()" still holds:

running 2 tests
test tests::test_collect  ... bench:     410,982 ns/iter (+/- 4,466)
test tests::test_for_each ... bench:     343,862 ns/iter (+/- 1,747)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out

running 2 tests
test tests::test_collect  ... bench:     410,948 ns/iter (+/- 2,077)
test tests::test_for_each ... bench:     331,491 ns/iter (+/- 1,559)

@mati865
Copy link
Contributor

mati865 commented Oct 9, 2019

It could be due to different performance characteristics between different microarchitectures (especially AMD vs Intel). Although variance in hbina results is quite high.

@popzxc
Copy link
Contributor

popzxc commented Oct 24, 2019

Results of cargo bench on my machine (core i3-8100):

cargo bench
    Finished bench [optimized] target(s) in 0.01s
     Running target/release/deps/hash_bench-140dea619a7d2d1c

running 2 tests
test tests::test_collect  ... bench:     145,083 ns/iter (+/- 3,624)
test tests::test_for_each ... bench:     172,960 ns/iter (+/- 4,167)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out

@mati865
Copy link
Contributor

mati865 commented Oct 24, 2019

Maybe something has fixed it already (like LLVM update).
@Stargateur @csmoe could you try with latest nightly and provide more details if you can still reproduce it?

current nightly on Arch Linux (Ryzen 2700X):
$  cargo bench
   Compiling either v1.5.3
   Compiling foo v0.1.0 (/tmp/foo)
    Finished bench [optimized] target(s) in 0.55s
     Running target/release/deps/foo-f86bbdc9426b849c

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

     Running target/release/deps/test-33fbe6ef386a2878

running 2 tests
test tests::test_collect  ... bench:     129,910 ns/iter (+/- 1,603)
test tests::test_for_each ... bench:     157,252 ns/iter (+/- 2,265)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out

$ rustc -vV                                           
rustc 1.40.0-nightly (4a8c5b20c 2019-10-23)
binary: rustc
commit-hash: 4a8c5b20c7772bc5342b83d4b0696ea216ef75a7
commit-date: 2019-10-23
host: x86_64-unknown-linux-gnu
release: 1.40.0-nightly
LLVM version: 9.0

@csmoe
Copy link
Member

csmoe commented Oct 24, 2019

@mati865 yes, for_each faster still holds with Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz and the same rustc as yours.

$ rustc -vV                                           
rustc 1.40.0-nightly (4a8c5b20c 2019-10-23)
binary: rustc
commit-hash: 4a8c5b20c7772bc5342b83d4b0696ea216ef75a7
commit-date: 2019-10-23
host: x86_64-unknown-linux-gnu
release: 1.40.0-nightly
LLVM version: 9.0

@mati865
Copy link
Contributor

mati865 commented Oct 24, 2019

@csmoe are you running on Windows?

On Windows 10 (the same PC) windows-gnu toolchain:

test tests::test_collect  ... bench:     327,421 ns/iter (+/- 5,594)
test tests::test_for_each ... bench:     175,176 ns/iter (+/- 4,975)

@csmoe

This comment has been minimized.

@csmoe
Copy link
Member

csmoe commented Oct 24, 2019

bench result with jemalloc as @mati865 suggested:

running 2 tests
test tests::test_collect  ... bench:     253,160 ns/iter (+/- 27,068)
test tests::test_for_each ... bench:     283,811 ns/iter (+/- 46,165)

@mati865
Copy link
Contributor

mati865 commented Oct 24, 2019

@Stargateur I compiled Rust with #63340 (comment) but it gave me slightly worse results on windows-gnu toolchain (linux-gnu toolchain doesn't reproduce the issue for me, most likely because I have latest glibc).

@hbina if you still want to work on this issue you will have to create environment where you can reproduce the slowness (maybe running old distros in Docker?).

Marwes added a commit to Marwes/rust that referenced this issue Jan 9, 2020
`for_each` are specialized for iterators such as `chain` allowing for
faster iteration than a normal `for/while` loop.

Note that since this only checks `size_hint` once at the start it may
end up needing to call `reserve` more in the case that `size_hint`
returns a larger and more accurate lower bound during iteration.

This could maybe be alleviated with an implementation closure like the current
one but the extra complexity will likely end up harming the normal case
of an accurate or 0 (think `filter`) lower bound.

```rust
while let Some(element) = iterator.next() {
    let (lower, _) = iterator.size_hint();
    self.reserve(lower.saturating_add(1));
    unsafe {
        let len = self.len();
        ptr::write(self.get_unchecked_mut(len), element);
        // NB can't overflow since we would have had to alloc the address space
        self.set_len(len + 1);
    }

    iterator.by_ref().take(self.capacity()).for_each(|element| {
        unsafe {
            let len = self.len();
            ptr::write(self.get_unchecked_mut(len), element);
            // NB can't overflow since we would have had to alloc the address space
            self.set_len(len + 1);
        }
    });
}

// OR

let (lower, _) = iterator.size_hint();
self.reserve(lower);
loop {
    let result = iterator.by_ref().try_for_each(|element| {
        if self.len() == self.capacity() {
            return Err(element);
        }
        unsafe {
            let len = self.len();
            ptr::write(self.get_unchecked_mut(len), element);
            // NB can't overflow since we would have had to alloc the address space
            self.set_len(len + 1);
        }
        Ok(())
    });

    match result {
        Ok(()) => break,
        Err(element) => {
            let (lower, _) = iterator.size_hint();
            self.reserve(lower.saturating_add(1));
            self.push(element);
        }
    }
}
```

Closes rust-lang#63340
bors added a commit that referenced this issue Jan 9, 2020
perf: Use `for_each` in `Vec::extend`

`for_each` are specialized for iterators such as `chain` allowing for
faster iteration than a normal `for/while` loop.

Note that since this only checks `size_hint` once at the start it may
end up needing to call `reserve` more in the case that `size_hint`
returns a larger and more accurate lower bound during iteration.

This could maybe be alleviated with an implementation closure like the current
one but the extra complexity will likely end up harming the normal case
of an accurate or 0 (think `filter`) lower bound.

```rust
while let Some(element) = iterator.next() {
    let (lower, _) = iterator.size_hint();
    self.reserve(lower.saturating_add(1));
    unsafe {
        let len = self.len();
        ptr::write(self.get_unchecked_mut(len), element);
        // NB can't overflow since we would have had to alloc the address space
        self.set_len(len + 1);
    }

    iterator.by_ref().take(self.capacity()).for_each(|element| {
        unsafe {
            let len = self.len();
            ptr::write(self.get_unchecked_mut(len), element);
            // NB can't overflow since we would have had to alloc the address space
            self.set_len(len + 1);
        }
    });
}

// OR

let (lower, _) = iterator.size_hint();
self.reserve(lower);
loop {
    let result = iterator.by_ref().try_for_each(|element| {
        if self.len() == self.capacity() {
            return Err(element);
        }
        unsafe {
            let len = self.len();
            ptr::write(self.get_unchecked_mut(len), element);
            // NB can't overflow since we would have had to alloc the address space
            self.set_len(len + 1);
        }
        Ok(())
    });

    match result {
        Ok(()) => break,
        Err(element) => {
            let (lower, _) = iterator.size_hint();
            self.reserve(lower.saturating_add(1));
            self.push(element);
        }
    }
}
```

Closes #63340
Marwes added a commit to Marwes/rust that referenced this issue Feb 6, 2020
`for_each` are specialized for iterators such as `chain` allowing for
faster iteration than a normal `for/while` loop.

Note that since this only checks `size_hint` once at the start it may
end up needing to call `reserve` more in the case that `size_hint`
returns a larger and more accurate lower bound during iteration.

This could maybe be alleviated with an implementation closure like the current
one but the extra complexity will likely end up harming the normal case
of an accurate or 0 (think `filter`) lower bound.

```rust
while let Some(element) = iterator.next() {
    let (lower, _) = iterator.size_hint();
    self.reserve(lower.saturating_add(1));
    unsafe {
        let len = self.len();
        ptr::write(self.get_unchecked_mut(len), element);
        // NB can't overflow since we would have had to alloc the address space
        self.set_len(len + 1);
    }

    iterator.by_ref().take(self.capacity()).for_each(|element| {
        unsafe {
            let len = self.len();
            ptr::write(self.get_unchecked_mut(len), element);
            // NB can't overflow since we would have had to alloc the address space
            self.set_len(len + 1);
        }
    });
}

// OR

let (lower, _) = iterator.size_hint();
self.reserve(lower);
loop {
    let result = iterator.by_ref().try_for_each(|element| {
        if self.len() == self.capacity() {
            return Err(element);
        }
        unsafe {
            let len = self.len();
            ptr::write(self.get_unchecked_mut(len), element);
            // NB can't overflow since we would have had to alloc the address space
            self.set_len(len + 1);
        }
        Ok(())
    });

    match result {
        Ok(()) => break,
        Err(element) => {
            let (lower, _) = iterator.size_hint();
            self.reserve(lower.saturating_add(1));
            self.push(element);
        }
    }
}
```

Closes rust-lang#63340
@JohnTitor JohnTitor added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Jul 24, 2020
@JohnTitor
Copy link
Member

Triage: Let's clean-up assignment.
@rustbot release-assignment

@rustbot rustbot removed their assignment Jul 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-iterators Area: Iterators C-enhancement Category: An issue proposing an enhancement or a PR with one. E-mentor Call for participation: This issue has a mentor. Use #t-compiler/help on Zulip for discussion. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants