chain() make collect very slow #63340

Stargateur · 2019-08-06T23:37:15Z

While working on a SO question.

We was wondering if chain() would produce an acceptable speed, after some digging and benchmark, we come to the conclusion that collect() is slow because it use while let. Unfortunately, this make collect very slow, I don't really understand why but that a fact.

But we saw that for_each() (probably thank to fold()) implementation of chain() don't have this problem and produce something a lot faster.

#![feature(test)]
extern crate test;

use either::Either; // 1.5.2
use std::iter;

#[derive(Debug, Default)]
pub struct Data<X, Y> {
    head: Option<Y>,
    pairs: Vec<(X, Y)>,
    tail: Option<X>,
}

impl<X, Y> Data<X, Y> {
    pub fn iter(&self) -> impl Iterator<Item = Either<&X, &Y>> {
        let head = self.head.iter().map(Either::Right);

        let pairs = self.pairs.iter().flat_map(|(a, b)| {
            let a = iter::once(Either::Left(a));
            let b = iter::once(Either::Right(b));
            a.chain(b)
        });

        let tail = self.tail.iter().map(Either::Left);

        head.chain(pairs).chain(tail)
    }
}

#[derive(Debug)]
struct AData(usize);
#[derive(Debug)]
struct BData(usize);

#[cfg(test)]
mod tests {
    use crate::{AData, BData, Data};
    use test::Bencher; // 1.5.2

    #[bench]
    fn test_for_each(b: &mut Bencher) {
        b.iter(|| {
            let data = Data {
                head: Some(BData(84)),
                pairs: std::iter::repeat_with(|| (AData(42), BData(84)))
                    .take(20998)
                    .collect(),
                tail: Some(AData(42)),
            };

            let mut data_bis = Vec::with_capacity(21000);
            data.iter().for_each(|x| data_bis.push(x));
        });
    }

    #[bench]
    fn test_collect(b: &mut Bencher) {
        b.iter(|| {
            let data = Data {
                head: Some(BData(84)),
                pairs: std::iter::repeat_with(|| (AData(42), BData(84)))
                    .take(20998)
                    .collect(),
                tail: Some(AData(42)),
            };

            let _: Vec<_> = data.iter().collect();
        });
    }
}

test tests::test_collect  ... bench:   1,682,529 ns/iter (+/- 2,157,023)
test tests::test_for_each ... bench:     609,031 ns/iter (+/- 750,944)

So, should we change implementation of collect to use for_each() ? Note that a for loop doesn't solve the problem. For this to be optimized we need to use for_each().

The text was updated successfully, but these errors were encountered:

Centril · 2019-08-06T23:49:17Z

cc @scottmcm

scottmcm · 2019-08-07T00:01:47Z

Sounds like a good change; please make a PR, @Stargateur! Let me know if you need any assistance.

I don't really understand why but that a fact.

There are some resources about why linked from #45595

Stargateur · 2019-08-07T02:58:13Z

I need to search a lot more because I find simple case where collect() look much better, but in some case that the exact opposite. As it's a core function I do not want to change anything if I'm not sure this is better on most case.

I find some bench test for vector, so I will try to add some and see if for_each() really improve thing. I'm not sure it will:

    fn extend_desugared<I: Iterator<Item = T>>(&mut self, iterator: I) {
        // This is the case for a general iterator.
        //
        // This function should be the moral equivalent of:
        //
        //      for item in iterator {
        //          self.push(item);
        //      }

        let (lower, upper) = iterator.size_hint();
        if let Some(upper) = upper {
            self.reserve(upper); // why not ?
        }
        else {
            self.reserve(lower)
        }

        // We use `for_each()` that should allow more efficient code
        iterator.for_each(|element| {
            let len = self.len();
            if len == self.capacity() {
                self.reserve(1);
            }
            unsafe {
                ptr::write(self.get_unchecked_mut(len), element);
                // NB can't overflow since we would have had to alloc the address space
                self.set_len(len + 1);
            }
        })
    }

The main difference (actuel code) is that we can't anymore check the lower bound of the iterator in the loop, also I wonder why not take the upper bound if it exist. Well, I suppose this is for saving memory but I don't know if it's make sense because reserve already take more memory in a lot of case. So why not just hint what should be the max ?

Also, Rust take like forever to compile on my PC, this going to take a while to do my search.

tesuji · 2019-08-07T03:01:18Z

You could try https://internals.rust-lang.org/t/gcc-compile-farm-for-rustc/9511, but it will take
a while (half a month) to receive a confirm mail from the build farm moderators.

Stargateur · 2019-08-08T06:01:59Z

#50481, well better give up for now, I can't wait 10 hours everytime to compile rust for each little change.

hbina · 2019-10-08T17:40:01Z

Here's my bench result

hbina085@akarin:~/Documents/issue_63340_chain_bench$ cargo bench
   Compiling issue_63340_chain_bench v0.1.0 (/home/hbina085/Documents/issue_63340_chain_bench)
    Finished bench [optimized] target(s) in 0.77s
     Running target/release/deps/issue_63340_chain_bench-51e8cb2b35e76763

running 2 tests
test tests::test_collect  ... bench:     206,178 ns/iter (+/- 213,147)
test tests::test_for_each ... bench:     216,507 ns/iter (+/- 6,650)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out

     Running target/release/deps/issue_63340_chain_bench-d3cab5b8c9523cd3

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

While the average performance is about the same, the variance is still larger. Another run...

hbina085@akarin:~/Documents/issue_63340_chain_bench$ cargo bench
    Finished bench [optimized] target(s) in 0.01s
     Running target/release/deps/issue_63340_chain_bench-51e8cb2b35e76763

running 2 tests
test tests::test_collect  ... bench:     186,086 ns/iter (+/- 13,684)
test tests::test_for_each ... bench:     223,619 ns/iter (+/- 57,158)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out

     Running target/release/deps/issue_63340_chain_bench-d3cab5b8c9523cd3

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

I don't even know anymore...

hbina · 2019-10-08T19:14:40Z

@rustbot claim

csmoe · 2019-10-09T16:14:00Z

@hbina cargo bench'd on my machine, ".for_each() faster than .collect()" still holds:

running 2 tests
test tests::test_collect  ... bench:     410,982 ns/iter (+/- 4,466)
test tests::test_for_each ... bench:     343,862 ns/iter (+/- 1,747)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out

running 2 tests
test tests::test_collect  ... bench:     410,948 ns/iter (+/- 2,077)
test tests::test_for_each ... bench:     331,491 ns/iter (+/- 1,559)

mati865 · 2019-10-09T17:28:35Z

It could be due to different performance characteristics between different microarchitectures (especially AMD vs Intel). Although variance in hbina results is quite high.

popzxc · 2019-10-24T06:02:11Z

Results of cargo bench on my machine (core i3-8100):

cargo bench
    Finished bench [optimized] target(s) in 0.01s
     Running target/release/deps/hash_bench-140dea619a7d2d1c

running 2 tests
test tests::test_collect  ... bench:     145,083 ns/iter (+/- 3,624)
test tests::test_for_each ... bench:     172,960 ns/iter (+/- 4,167)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out

mati865 · 2019-10-24T09:09:44Z

Maybe something has fixed it already (like LLVM update).
@Stargateur @csmoe could you try with latest nightly and provide more details if you can still reproduce it?

current nightly on Arch Linux (Ryzen 2700X):

$  cargo bench
   Compiling either v1.5.3
   Compiling foo v0.1.0 (/tmp/foo)
    Finished bench [optimized] target(s) in 0.55s
     Running target/release/deps/foo-f86bbdc9426b849c

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

     Running target/release/deps/test-33fbe6ef386a2878

running 2 tests
test tests::test_collect  ... bench:     129,910 ns/iter (+/- 1,603)
test tests::test_for_each ... bench:     157,252 ns/iter (+/- 2,265)

test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 0 filtered out

$ rustc -vV                                           
rustc 1.40.0-nightly (4a8c5b20c 2019-10-23)
binary: rustc
commit-hash: 4a8c5b20c7772bc5342b83d4b0696ea216ef75a7
commit-date: 2019-10-23
host: x86_64-unknown-linux-gnu
release: 1.40.0-nightly
LLVM version: 9.0

csmoe · 2019-10-24T09:21:36Z

@mati865 yes, for_each faster still holds with Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz and the same rustc as yours.

$ rustc -vV                                           
rustc 1.40.0-nightly (4a8c5b20c 2019-10-23)
binary: rustc
commit-hash: 4a8c5b20c7772bc5342b83d4b0696ea216ef75a7
commit-date: 2019-10-23
host: x86_64-unknown-linux-gnu
release: 1.40.0-nightly
LLVM version: 9.0

mati865 · 2019-10-24T09:57:41Z

@csmoe are you running on Windows?

On Windows 10 (the same PC) windows-gnu toolchain:

test tests::test_collect  ... bench:     327,421 ns/iter (+/- 5,594)
test tests::test_for_each ... bench:     175,176 ns/iter (+/- 4,975)

csmoe · 2019-10-24T11:16:38Z

bench result with jemalloc as @mati865 suggested:

running 2 tests
test tests::test_collect  ... bench:     253,160 ns/iter (+/- 27,068)
test tests::test_for_each ... bench:     283,811 ns/iter (+/- 46,165)

mati865 · 2019-10-24T13:06:18Z

@Stargateur I compiled Rust with #63340 (comment) but it gave me slightly worse results on windows-gnu toolchain (linux-gnu toolchain doesn't reproduce the issue for me, most likely because I have latest glibc).

@hbina if you still want to work on this issue you will have to create environment where you can reproduce the slowness (maybe running old distros in Docker?).

`for_each` are specialized for iterators such as `chain` allowing for faster iteration than a normal `for/while` loop. Note that since this only checks `size_hint` once at the start it may end up needing to call `reserve` more in the case that `size_hint` returns a larger and more accurate lower bound during iteration. This could maybe be alleviated with an implementation closure like the current one but the extra complexity will likely end up harming the normal case of an accurate or 0 (think `filter`) lower bound. ```rust while let Some(element) = iterator.next() { let (lower, _) = iterator.size_hint(); self.reserve(lower.saturating_add(1)); unsafe { let len = self.len(); ptr::write(self.get_unchecked_mut(len), element); // NB can't overflow since we would have had to alloc the address space self.set_len(len + 1); } iterator.by_ref().take(self.capacity()).for_each(|element| { unsafe { let len = self.len(); ptr::write(self.get_unchecked_mut(len), element); // NB can't overflow since we would have had to alloc the address space self.set_len(len + 1); } }); } // OR let (lower, _) = iterator.size_hint(); self.reserve(lower); loop { let result = iterator.by_ref().try_for_each(|element| { if self.len() == self.capacity() { return Err(element); } unsafe { let len = self.len(); ptr::write(self.get_unchecked_mut(len), element); // NB can't overflow since we would have had to alloc the address space self.set_len(len + 1); } Ok(()) }); match result { Ok(()) => break, Err(element) => { let (lower, _) = iterator.size_hint(); self.reserve(lower.saturating_add(1)); self.push(element); } } } ``` Closes rust-lang#63340

perf: Use `for_each` in `Vec::extend` `for_each` are specialized for iterators such as `chain` allowing for faster iteration than a normal `for/while` loop. Note that since this only checks `size_hint` once at the start it may end up needing to call `reserve` more in the case that `size_hint` returns a larger and more accurate lower bound during iteration. This could maybe be alleviated with an implementation closure like the current one but the extra complexity will likely end up harming the normal case of an accurate or 0 (think `filter`) lower bound. ```rust while let Some(element) = iterator.next() { let (lower, _) = iterator.size_hint(); self.reserve(lower.saturating_add(1)); unsafe { let len = self.len(); ptr::write(self.get_unchecked_mut(len), element); // NB can't overflow since we would have had to alloc the address space self.set_len(len + 1); } iterator.by_ref().take(self.capacity()).for_each(|element| { unsafe { let len = self.len(); ptr::write(self.get_unchecked_mut(len), element); // NB can't overflow since we would have had to alloc the address space self.set_len(len + 1); } }); } // OR let (lower, _) = iterator.size_hint(); self.reserve(lower); loop { let result = iterator.by_ref().try_for_each(|element| { if self.len() == self.capacity() { return Err(element); } unsafe { let len = self.len(); ptr::write(self.get_unchecked_mut(len), element); // NB can't overflow since we would have had to alloc the address space self.set_len(len + 1); } Ok(()) }); match result { Ok(()) => break, Err(element) => { let (lower, _) = iterator.size_hint(); self.reserve(lower.saturating_add(1)); self.push(element); } } } ``` Closes #63340

`for_each` are specialized for iterators such as `chain` allowing for faster iteration than a normal `for/while` loop. Note that since this only checks `size_hint` once at the start it may end up needing to call `reserve` more in the case that `size_hint` returns a larger and more accurate lower bound during iteration. This could maybe be alleviated with an implementation closure like the current one but the extra complexity will likely end up harming the normal case of an accurate or 0 (think `filter`) lower bound. ```rust while let Some(element) = iterator.next() { let (lower, _) = iterator.size_hint(); self.reserve(lower.saturating_add(1)); unsafe { let len = self.len(); ptr::write(self.get_unchecked_mut(len), element); // NB can't overflow since we would have had to alloc the address space self.set_len(len + 1); } iterator.by_ref().take(self.capacity()).for_each(|element| { unsafe { let len = self.len(); ptr::write(self.get_unchecked_mut(len), element); // NB can't overflow since we would have had to alloc the address space self.set_len(len + 1); } }); } // OR let (lower, _) = iterator.size_hint(); self.reserve(lower); loop { let result = iterator.by_ref().try_for_each(|element| { if self.len() == self.capacity() { return Err(element); } unsafe { let len = self.len(); ptr::write(self.get_unchecked_mut(len), element); // NB can't overflow since we would have had to alloc the address space self.set_len(len + 1); } Ok(()) }); match result { Ok(()) => break, Err(element) => { let (lower, _) = iterator.size_hint(); self.reserve(lower.saturating_add(1)); self.push(element); } } } ``` Closes rust-lang#63340

JohnTitor · 2020-07-24T11:44:05Z

Triage: Let's clean-up assignment.
@rustbot release-assignment

scottmcm added the E-mentor Call for participation: This issue has a mentor. Use #t-compiler/help on Zulip for discussion. label Aug 6, 2019

rustbot self-assigned this Oct 8, 2019

This comment has been minimized.

Sign in to view

Marwes mentioned this issue Jan 9, 2020

perf: Use for_each in Vec::extend #68046

Closed

JohnTitor added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Jul 24, 2020

rustbot removed their assignment Jul 24, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chain() make collect very slow #63340

chain() make collect very slow #63340

Stargateur commented Aug 6, 2019 •

edited by rustbot

Centril commented Aug 6, 2019

scottmcm commented Aug 7, 2019

Stargateur commented Aug 7, 2019

tesuji commented Aug 7, 2019 •

edited

Stargateur commented Aug 8, 2019

hbina commented Oct 8, 2019 •

edited

hbina commented Oct 8, 2019

csmoe commented Oct 9, 2019 •

edited

mati865 commented Oct 9, 2019

popzxc commented Oct 24, 2019

mati865 commented Oct 24, 2019

csmoe commented Oct 24, 2019 •

edited

mati865 commented Oct 24, 2019

This comment has been minimized.

csmoe commented Oct 24, 2019

mati865 commented Oct 24, 2019

JohnTitor commented Jul 24, 2020

chain() make collect very slow #63340

chain() make collect very slow #63340

Comments

Stargateur commented Aug 6, 2019 • edited by rustbot

Centril commented Aug 6, 2019

scottmcm commented Aug 7, 2019

Stargateur commented Aug 7, 2019

tesuji commented Aug 7, 2019 • edited

Stargateur commented Aug 8, 2019

hbina commented Oct 8, 2019 • edited

hbina commented Oct 8, 2019

csmoe commented Oct 9, 2019 • edited

mati865 commented Oct 9, 2019

popzxc commented Oct 24, 2019

mati865 commented Oct 24, 2019

csmoe commented Oct 24, 2019 • edited

mati865 commented Oct 24, 2019

This comment has been minimized.

csmoe commented Oct 24, 2019

mati865 commented Oct 24, 2019

JohnTitor commented Jul 24, 2020

Stargateur commented Aug 6, 2019 •

edited by rustbot

tesuji commented Aug 7, 2019 •

edited

hbina commented Oct 8, 2019 •

edited

csmoe commented Oct 9, 2019 •

edited

csmoe commented Oct 24, 2019 •

edited