
Missed optimization: repeated pointer increments don't compile to a memcpy #69187

Closed
timvermeulen opened this issue Feb 15, 2020 · 5 comments
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
C-enhancement Category: An issue proposing an enhancement or a PR with one.
I-slow Issue: Problems and improvements with respect to performance of generated code.
T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@timvermeulen
Contributor

pub unsafe fn copy(slice: &[u8], dst: *mut u8) {
    let mut src = slice.as_ptr();
    let mut dst = dst;

    for _ in 0..slice.len() {
        *dst = *src;
        src = src.add(1);
        dst = dst.add(1);
    }
}

This doesn't currently compile to a memcpy (Godbolt).
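For reference, the hoped-for lowering is a single call to the memcpy intrinsic, which on the Rust side is spelled `ptr::copy_nonoverlapping`. A minimal sketch of the equivalent code (the function name `copy_explicit` is illustrative, not part of the snippet under discussion):

use std::ptr;

// Illustrative helper: what the loop above is expected to lower to.
// `copy_nonoverlapping` maps directly to LLVM's memcpy intrinsic.
pub unsafe fn copy_explicit(slice: &[u8], dst: *mut u8) {
    ptr::copy_nonoverlapping(slice.as_ptr(), dst, slice.len());
}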

I figured this was worth filing because this seems to be the main reason that a naive iter::Zip implementation isn't optimal, which is why it's implemented the way it is.

@jonas-schievink jonas-schievink added I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Feb 15, 2020
@the8472
Member

the8472 commented Feb 15, 2020

Since pointers can point into the same allocation, it would have to be a memmove in the general case. Never mind, the aliasing rules allow a memcpy.
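For context, the difference between the two is only the overlap contract; a minimal sketch using the standard `ptr` functions (the helper names are illustrative):

use std::ptr;

// memmove semantics: the two regions may overlap.
pub unsafe fn copy_may_overlap(src: *const u8, dst: *mut u8, len: usize) {
    ptr::copy(src, dst, len);
}

// memcpy semantics: the caller promises the regions are disjoint.
// In the snippet above that promise follows from the aliasing rules:
// `slice: &[u8]` is a live shared borrow, so writing through `dst`
// into the same bytes would already be undefined behavior.
pub unsafe fn copy_disjoint(src: *const u8, dst: *mut u8, len: usize) {
    ptr::copy_nonoverlapping(src, dst, len);
}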

@cynecx
Contributor

cynecx commented Feb 15, 2020

pub unsafe fn copy(slice: &[u8], dst: *mut u8) {
    let src = slice.as_ptr();
    for i in 0..slice.len() {
        *dst.add(i) = *src.add(i);
    }
}

This does (https://godbolt.org/z/MdgeTL).

@timvermeulen
Contributor Author

@cynecx Indeed, that is exactly what Zip currently boils down to via the private TrustedRandomAccess trait. An attempt to replace it led me to the snippet above. It would be really nice if repeated pointer increments optimized in the same way, because that's essentially what slice::{Iter, IterMut}::next does.
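For readers unfamiliar with the internals, here is a simplified model of the pattern being referred to; the real `slice::Iter::next` additionally handles zero-sized types and uses `NonNull`, so treat this as a sketch only:

// Simplified sketch, not the actual standard library code.
struct RawIter<T> {
    ptr: *const T,
    end: *const T,
}

impl<T> RawIter<T> {
    unsafe fn next(&mut self) -> Option<*const T> {
        if self.ptr == self.end {
            None
        } else {
            let p = self.ptr;
            self.ptr = self.ptr.add(1); // the repeated pointer increment
            Some(p)
        }
    }
}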

@cynecx
Contributor

cynecx commented Feb 15, 2020

@timvermeulen Fwiw LLVM isn't doing a great job at detecting this particular pattern: https://godbolt.org/z/TW2bJH (C++).

@jonas-schievink jonas-schievink added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-enhancement Category: An issue proposing an enhancement or a PR with one. labels Feb 15, 2020
bors added a commit to rust-lang-ci/rust that referenced this issue Sep 3, 2020
specialize some collection and iterator operations to run in-place

This is a rebase and update of rust-lang#66383 which was closed due to inactivity.

Recent rustc changes made the compile time regressions disappear, at least for webrender-wrench. Running a stage2 compile and the rustc-perf suite takes hours on the hardware I have at the moment, so I can't do much more than that.

![Screenshot_2020-04-05 rustc performance data](https://user-images.githubusercontent.com/1065730/78462657-5d60f100-76d4-11ea-8a0b-4f3962707c38.png)

In the best case of the `vec::bench_in_place_recycle` synthetic microbenchmark these optimizations can provide a 15x speedup over the regular implementation, which allocates a new vec for every benchmark iteration. [Benchmark results](https://gist.github.com/the8472/6d999b2d08a2bedf3b93f12112f96e2f). In real code the speedups are tiny, but they also depend on the allocator used: a system allocator that uses a process-wide mutex will benefit more than one with thread-local pools.

## What was changed

* `SpecExtend` which covered `from_iter` and `extend` specializations was split into separate traits
* `extend` and `from_iter` now reuse `append_elements` when the passed iterators are backed by slices.
* A preexisting `vec.into_iter().collect::<Vec<_>>()` optimization that passed through the original vec has been generalized further to also cover cases where the original has been partially drained.
* Collecting a chain of *Vec<T> / BinaryHeap<T> / Box<[T]>* `IntoIter`s through various iterator adapters into *Vec<U>* or *BinaryHeap<U>* is now performed in place, as long as `T` and `U` have the same size and alignment and aren't ZSTs.
* To enable the above specialization, the unsafe, unstable `SourceIter` and `InPlaceIterable` traits have been added (a rough sketch of their shape follows this list). The first allows reaching through the iterator pipeline to grab a pointer to the source memory. The latter is a marker promising that the read pointer will advance at least as fast as the write pointer, which is what makes in-place operation possible in the first place.
* `vec::IntoIter` implements `TrustedRandomAccess` for `T: Copy` to allow in-place collection when there is a `Zip` adapter in the iterator. TRA had to be made an unstable public trait to support this.
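A rough sketch of the shape of the two unstable traits described above; the actual definitions in `core` may differ in detail:

// Sketch only, following the description in the list above.

// Allows reaching through an adapter pipeline to the source iterator,
// so the specialization can grab a pointer to the source memory.
pub unsafe trait SourceIter {
    type Source;
    // Unsafe because the caller must not invalidate the invariants of
    // the surrounding iterator pipeline while holding this reference.
    unsafe fn as_inner(&mut self) -> &mut Self::Source;
}

// Marker trait: promises the read pointer advances at least as fast as
// the write pointer, so writing elements back into the source buffer
// can never overwrite items that haven't been read yet.
pub unsafe trait InPlaceIterable: Iterator {}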

## In-place collectible adapters

* `Map`
* `MapWhile`
* `Filter`
* `FilterMap`
* `Fuse`
* `Skip`
* `SkipWhile`
* `Take`
* `TakeWhile`
* `Enumerate`
* `Zip` (left hand side only, `Copy` types only)
* `Peekable`
* `Scan`
* `Inspect`

## Concerns

`vec.into_iter().filter(|_| false).collect()` will no longer return a vec with 0 capacity; instead it will return its original allocation. This avoids the cost of any allocation or deallocation, but could lead to large allocations living longer than expected.
If that's not acceptable, some resizing policy at the end of the attempted in-place collect would be necessary, which in the worst case could result in one more memcpy than the non-specialized case.
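A small illustration of this concern, assuming the specialization kicks in, together with the obvious caller-side mitigation:

fn main() {
    let v: Vec<u32> = (0..1_000_000).collect();

    // With the in-place specialization the original allocation is
    // reused, so the result may report len 0 but a large capacity.
    let mut empty: Vec<u32> = v.into_iter().filter(|_| false).collect();
    println!("len = {}, capacity = {}", empty.len(), empty.capacity());

    // If keeping the allocation alive is a problem, the caller can
    // release the excess capacity explicitly:
    empty.shrink_to_fit();
}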

## Possible followup work

* split liballoc/vec.rs to remove `ignore-tidy-filelength`
* try to get trivial chains such as `vec.into_iter().skip(1).collect::<Vec<_>>()` to compile to a `memmove` (currently compiles to a pile of SIMD, see rust-lang#69187)
* improve the traits so they can be reused by other crates, e.g. itertools; I think they're currently only good enough for internal use
* allow iterators sourced from a `HashSet` to be in-place collected into a `Vec`
@nikic
Contributor

nikic commented Mar 12, 2021

This is detected as a memcpy on nightly, presumably due to the LLVM 12 upgrade.

@nikic nikic closed this as completed Mar 12, 2021