
Missed optimization: repeated pointer increments don't compile to a memcpy #69187

Closed
timvermeulen opened this issue Feb 15, 2020 · 5 comments
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
C-enhancement Category: An issue proposing an enhancement or a PR with one.
I-slow Issue: Problems and improvements with respect to performance of generated code.
T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@timvermeulen
Contributor

pub unsafe fn copy(slice: &[u8], dst: *mut u8) {
    let mut src = slice.as_ptr();
    let mut dst = dst;

    for _ in 0..slice.len() {
        *dst = *src;
        src = src.add(1);
        dst = dst.add(1);
    }
}

This doesn't currently compile to a memcpy (Godbolt).
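For reference, the hoped-for lowering is a single call to the memcpy intrinsic, which on the Rust side is spelled `ptr::copy_nonoverlapping`. A minimal sketch of the equivalent code (the function name `copy_explicit` is illustrative, not part of the snippet under discussion):

use std::ptr;

// Illustrative helper: what the loop above is expected to lower to.
// `copy_nonoverlapping` maps directly to LLVM's memcpy intrinsic.
pub unsafe fn copy_explicit(slice: &[u8], dst: *mut u8) {
    ptr::copy_nonoverlapping(slice.as_ptr(), dst, slice.len());
}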

I figured this was worth filing because this seems to be the main reason that a naive iter::Zip implementation isn't optimal, which is why it's implemented the way it is.

@jonas-schievink jonas-schievink added I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Feb 15, 2020
@the8472
Member

the8472 commented Feb 15, 2020

Since pointers can point into the same allocation, it would have to be a memmove in the general case. Never mind, the aliasing rules allow a memcpy.
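For context, the difference between the two is only the overlap contract; a minimal sketch using the standard `ptr` functions (the helper names are illustrative):

use std::ptr;

// memmove semantics: the two regions may overlap.
pub unsafe fn copy_may_overlap(src: *const u8, dst: *mut u8, len: usize) {
    ptr::copy(src, dst, len);
}

// memcpy semantics: the caller promises the regions are disjoint.
// In the snippet above that promise follows from the aliasing rules:
// `slice: &[u8]` is a live shared borrow, so writing through `dst`
// into the same bytes would already be undefined behavior.
pub unsafe fn copy_disjoint(src: *const u8, dst: *mut u8, len: usize) {
    ptr::copy_nonoverlapping(src, dst, len);
}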

@cynecx
Contributor

cynecx commented Feb 15, 2020

pub unsafe fn copy(slice: &[u8], dst: *mut u8) {
    let src = slice.as_ptr();
    for i in 0..slice.len() {
        *dst.add(i) = *src.add(i);
    }
}

This does (https://godbolt.org/z/MdgeTL).

@timvermeulen
Contributor Author

@cynecx Indeed, that is exactly what Zip currently boils down to via the private TrustedRandomAccess trait. An attempt to replace it led me to the snippet above. It would be really nice if repeated pointer increments optimized in the same way, because that's essentially what slice::{Iter, IterMut}::next does.
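For readers unfamiliar with the internals, here is a simplified model of the pattern being referred to; the real `slice::Iter::next` additionally handles zero-sized types and uses `NonNull`, so treat this as a sketch only:

// Simplified sketch, not the actual standard library code.
struct RawIter<T> {
    ptr: *const T,
    end: *const T,
}

impl<T> RawIter<T> {
    unsafe fn next(&mut self) -> Option<*const T> {
        if self.ptr == self.end {
            None
        } else {
            let p = self.ptr;
            self.ptr = self.ptr.add(1); // the repeated pointer increment
            Some(p)
        }
    }
}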

@cynecx
Contributor

cynecx commented Feb 15, 2020

@timvermeulen Fwiw LLVM isn't doing a great job at detecting this particular pattern: https://godbolt.org/z/TW2bJH (C++).

@jonas-schievink jonas-schievink added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-enhancement Category: An issue proposing an enhancement or a PR with one. labels Feb 15, 2020
bors added a commit to rust-lang-ci/rust that referenced this issue Sep 3, 2020
specialize some collection and iterator operations to run in-place

This is a rebase and update of rust-lang#66383 which was closed due to inactivity.

Recent rustc changes made the compile time regressions disappear, at least for webrender-wrench. Running a stage2 compile and the rustc-perf suite takes hours on the hardware I have at the moment, so I can't do much more than that.

![Screenshot_2020-04-05 rustc performance data](https://user-images.githubusercontent.com/1065730/78462657-5d60f100-76d4-11ea-8a0b-4f3962707c38.png)

In the best case of the `vec::bench_in_place_recycle` synthetic microbenchmark these optimizations can provide a 15x speedup over the regular implementation, which allocates a new vec for every benchmark iteration. [Benchmark results](https://gist.github.com/the8472/6d999b2d08a2bedf3b93f12112f96e2f). In real code the speedups are tiny, but they also depend on the allocator used: a system allocator that uses a process-wide mutex will benefit more than one with thread-local pools.

## What was changed

* `SpecExtend` which covered `from_iter` and `extend` specializations was split into separate traits
* `extend` and `from_iter` now reuse `append_elements` when the passed iterators are backed by slices.
* A preexisting `vec.into_iter().collect::<Vec<_>>()` optimization that passed through the original vec has been generalized further to also cover cases where the original has been partially drained.
* Collecting a chain of *Vec<T> / BinaryHeap<T> / Box<[T]>* `IntoIter`s through various iterator adapters into *Vec<U>* or *BinaryHeap<U>* is now performed in place, as long as `T` and `U` have the same size and alignment and aren't ZSTs.
* To enable the above specialization, the unsafe, unstable `SourceIter` and `InPlaceIterable` traits have been added (a rough sketch of their shape follows this list). The first allows reaching through the iterator pipeline to grab a pointer to the source memory. The latter is a marker promising that the read pointer will advance at least as fast as the write pointer, which is what makes in-place operation possible in the first place.
* `vec::IntoIter` implements `TrustedRandomAccess` for `T: Copy` to allow in-place collection when there is a `Zip` adapter in the iterator. TRA had to be made an unstable public trait to support this.
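A rough sketch of the shape of the two unstable traits described above; the actual definitions in `core` may differ in detail:

// Sketch only, following the description in the list above.

// Allows reaching through an adapter pipeline to the source iterator,
// so the specialization can grab a pointer to the source memory.
pub unsafe trait SourceIter {
    type Source;
    // Unsafe because the caller must not invalidate the invariants of
    // the surrounding iterator pipeline while holding this reference.
    unsafe fn as_inner(&mut self) -> &mut Self::Source;
}

// Marker trait: promises the read pointer advances at least as fast as
// the write pointer, so writing elements back into the source buffer
// can never overwrite items that haven't been read yet.
pub unsafe trait InPlaceIterable: Iterator {}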

## In-place collectible adapters

* `Map`
* `MapWhile`
* `Filter`
* `FilterMap`
* `Fuse`
* `Skip`
* `SkipWhile`
* `Take`
* `TakeWhile`
* `Enumerate`
* `Zip` (left hand side only, `Copy` types only)
* `Peekable`
* `Scan`
* `Inspect`

## Concerns

`vec.into_iter().filter(|_| false).collect()` will no longer return a vec with 0 capacity; instead it will return its original allocation. This avoids the cost of any allocation or deallocation, but could lead to large allocations living longer than expected.
If that's not acceptable, some resizing policy at the end of the attempted in-place collect would be necessary, which in the worst case could result in one more memcpy than the non-specialized case.
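A small illustration of this concern, assuming the specialization kicks in, together with the obvious caller-side mitigation:

fn main() {
    let v: Vec<u32> = (0..1_000_000).collect();

    // With the in-place specialization the original allocation is
    // reused, so the result may report len 0 but a large capacity.
    let mut empty: Vec<u32> = v.into_iter().filter(|_| false).collect();
    println!("len = {}, capacity = {}", empty.len(), empty.capacity());

    // If keeping the allocation alive is a problem, the caller can
    // release the excess capacity explicitly:
    empty.shrink_to_fit();
}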

## Possible followup work

* split liballoc/vec.rs to remove `ignore-tidy-filelength`
* try to get trivial chains such as `vec.into_iter().skip(1).collect::<Vec<_>>()` to compile to a `memmove` (currently compiles to a pile of SIMD, see rust-lang#69187)
* improve the traits so they can be reused by other crates, e.g. itertools; I think they're currently only good enough for internal use
* allow iterators sourced from a `HashSet` to be in-place collected into a `Vec`
@nikic
Contributor

nikic commented Mar 12, 2021

This is detected as a memcpy on nightly, presumably due to the LLVM 12 upgrade.

@nikic nikic closed this as completed Mar 12, 2021