
Vec's swap_remove is needlessly slow #52150

Closed
orlp opened this issue Jul 8, 2018 · 6 comments
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one.

Comments

orlp (Contributor) commented Jul 8, 2018

Currently Vec's swap_remove has the following implementation:

pub fn swap_remove(&mut self, index: usize) -> T {
    let length = self.len();
    self.swap(index, length - 1);
    self.pop().unwrap()
}

This is needlessly slow. The swap does a bounds check, and then pop does
another bounds check that can never fail. Furthermore, there is an actual swap
right before the pop, resulting in a lot of useless moves that don't (seem to)
get optimized out.
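
For reference, here is roughly what that costs when spelled out as a free
function (the annotations are mine, based on the slice and Vec code at the
time):

pub fn swap_remove_naive<T>(v: &mut Vec<T>, index: usize) -> T {
    let length = v.len();
    // slice::swap bounds-checks both indices, then swaps through a
    // temporary: roughly three moves.
    v.swap(index, length - 1);
    // pop checks the length again (it can never be zero here),
    // decrements it, and moves the element out: a fourth move.
    v.pop().unwrap()
}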

It's possible to write an implementation that does only one bounds check and
only two moves, using a little unsafe code:

pub fn swap_remove(&mut self, index: usize) -> T {
    unsafe {
        // A single bounds check on index ensures safety.
        let hole: *mut T = &mut self[index];
        let value = ptr::read(hole);

        // Since the bounds check on index succeeded, len >= 1,
        // so computing back cannot underflow.
        let back = self.len() - 1;
        ptr::copy(self.get_unchecked(back), hole, 1);
        self.set_len(back);
        value
    }
}
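
As a quick sanity check that the semantics match, a test along these lines
(using the free-function form swap_remove_opt from the benchmark below) can
compare it against the current swap_remove; the test itself is a sketch:

#[test]
fn swap_remove_opt_matches_std() {
    let mut a: Vec<u32> = (0..16).collect();
    let mut b = a.clone();
    // idx 13 at len 14 exercises the index == len - 1 edge case.
    for &idx in &[0usize, 7, 13, 0, 3] {
        assert_eq!(a.swap_remove(idx), swap_remove_opt(&mut b, idx));
        assert_eq!(a, b);
    }
}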

I found this quite hard to benchmark, but on my machine the following
benchmark showed the new implementation to be faster:

#![feature(test)]
extern crate test;

pub fn swap_remove_opt<T>(v: &mut Vec<T>, idx: usize) -> T {
    unsafe {
        // A single bounds check on idx ensures safety.
        let hole: *mut T = &mut v[idx];
        let value = std::ptr::read(hole);
        // The bounds check ensured len >= 1, so this cannot underflow.
        let back = v.len() - 1;
        std::ptr::copy(v.get_unchecked(back), hole, 1);
        v.set_len(back);
        value
    }
}

// Tiny LCG RNG.
fn rng_next(s: u64) -> u64 {
    // Wrapping arithmetic: the multiplication is meant to overflow.
    s.wrapping_mul(6364136223846793005).wrapping_add(1)
}

#[cfg(test)]
mod tests {
    use super::*;
    use test::Bencher;
    
    #[bench]
    fn bench_1000_overhead(b: &mut Bencher) {
        let mut rng = 43;
        let v: Vec<usize> = (0..1024).collect();
        b.iter(|| {
            let v = v.clone();
            for _ in 0..512 {
                rng = rng_next(rng);
                let idx = (rng >> 32) % 512;
                test::black_box(idx as usize);
            }
            v
        })
    }

    #[bench]
    fn bench_vec_1000_swap_remove(b: &mut Bencher) {
        let mut rng = 43;
        let v: Vec<usize> = (0..1024).collect();
        b.iter(|| {
            let mut v = v.clone();
            for _ in 0..512 {
                rng = rng_next(rng);
                let idx = (rng >> 32) % 512;
                test::black_box(v.swap_remove(idx as usize));
            }
            v
        })
    }

    #[bench]
    fn bench_vec_1000_swap_remove_opt(b: &mut Bencher) {
        let mut rng = 43;
        let v: Vec<usize> = (0..1024).collect();
        b.iter(|| {
            let mut v = v.clone();
            for _ in 0..512 {
                rng = rng_next(rng);
                let idx = (rng >> 32) % 512;
                test::black_box(swap_remove_opt(&mut v, idx as usize));
            }
            v
        })
    }
}
test tests::bench_1000_overhead            ... bench:       1,049 ns/iter (+/- 16)
test tests::bench_vec_1000_swap_remove     ... bench:       1,306 ns/iter (+/- 122)
test tests::bench_vec_1000_swap_remove_opt ... bench:       1,193 ns/iter (+/- 104)

As I said, it's quite hard to benchmark - lots of overhead and only a small difference to observe. Regardless, running it multiple times consistently shows swap_remove_opt to be faster.

Although I can't run the above benchmark on stable due to test being unstable,
I conjecture that on stable Rust the difference is much bigger. A quick look at
the disassembly should show why: https://godbolt.org/g/fU4Edu

Nightly fares a lot better in the disassembly, but as seen above it still loses out in the benchmark.

@csmoe csmoe added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Jul 8, 2018
hanna-kruppe (Contributor) commented Jul 8, 2018

Thanks a lot for looking into this!

Comparing the assembly for the two implementations in nightly on godbolt reveals that the only substantial difference is that Vec::swap_remove writes the value back (as expected -- we don't have a way to tell LLVM that nobody will access the elements beyond len and that they will be overwritten before len is increased again, and frankly that isn't even true); the bounds checks are already combined into one. All remaining differences are scheduling/isel/regalloc differences, and those are generally not correlated with source-level changes in any obvious way.

So I'm wondering whether we can get all of the benefit with less, and less tricky, unsafe? For example, I dropped the get_unchecked_mut and still got identical assembly: https://godbolt.org/g/wdG8VN
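
(Since godbolt links can rot: a variant along those lines -- using ordinary
indexing for the back element instead of get_unchecked -- could look like the
sketch below. This is a reconstruction of the idea, not necessarily the exact
code behind the link.)

pub fn swap_remove_opt<T>(v: &mut Vec<T>, idx: usize) -> T {
    unsafe {
        let hole: *mut T = &mut v[idx];
        let value = std::ptr::read(hole);
        let back = v.len() - 1;
        // Checked indexing here instead of get_unchecked; per the
        // comment above, the generated assembly is identical.
        std::ptr::copy(&v[back], hole, 1);
        v.set_len(back);
        value
    }
}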

orlp (Contributor, Author) commented Jul 8, 2018

@rkruppe Do you know what change causes nightly to be significantly better at reasoning about the bounds check than stable, and whether that functionality is coming to stable soon?

Regardless, the benchmarks I posted above are done on nightly, so there still seems to be a performance benefit.

hanna-kruppe (Contributor) commented Jul 8, 2018

> Do you know what change causes nightly to be significantly better at reasoning about the bounds check than stable

Note that my implementation, which doesn't use get_unchecked, still has the same assembly as yours on 1.26.

> and whether that functionality is coming to stable soon?

We backport serious bug fixes but not performance fixes. So any performance improvement in nightly gets into beta on the six-week cycle, and that beta becomes the next stable another six weeks later.

> Regardless, the benchmarks I posted above are done on nightly, so there still seems to be a performance benefit.

Yes, there is a performance benefit to be had by replacing the current implementation in std. I'm only talking about how much unsafe we need to use to get that benefit.

orlp (Contributor, Author) commented Jul 8, 2018

@rkruppe I think I managed to reduce the unsafe code to a minimum with the following implementation:

pub fn swap_remove(&mut self, index: usize) -> T {
    unsafe {
        let hole: *mut T = &mut self[index];
        std::ptr::replace(hole, self.pop().unwrap())
    }
}

This still compiles to essentially the same assembly on stable.

hanna-kruppe (Contributor) commented Jul 8, 2018
That's great! Do you want to add comments describing the motivation and why this unsafe code is sound and submit this optimization as a pull request?
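
(For reference, such comments might read roughly as follows; this is a sketch
of the soundness argument, not necessarily the wording that was merged.)

pub fn swap_remove(&mut self, index: usize) -> T {
    unsafe {
        // We replace self[index] with the last element. If the bounds
        // check on hole succeeds, the vector is non-empty, so pop()
        // cannot fail; the last element may be self[index] itself.
        let hole: *mut T = &mut self[index];
        // pop() only decrements the length; it never shrinks or
        // reallocates the buffer, so hole stays valid across the call.
        std::ptr::replace(hole, self.pop().unwrap())
    }
}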

bors added a commit that referenced this issue Jul 9, 2018
Performance improvement of Vec's swap_remove.

The old implementation *literally* swapped and then removed, which resulted in unnecessary move instructions. The new implementation does use unsafe code, but it is easy to see that it is correct.

Fixes #52150.
X01XX commented Oct 18, 2021

Popping the vector already subtracts one from the length, suggesting that the subtraction in the above code may not be required. And if you want to access an item in a vector before removing it, that is easy to do separately, suggesting that returning the item may not be required either. So:

    let last_item = avec.pop().unwrap();

    if inx < avec.len() {
        avec[inx] = last_item;
    }
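
Wrapped up as a standalone function (the name swap_remove_discard is made up
here), that idea looks like this:

fn swap_remove_discard<T>(avec: &mut Vec<T>, inx: usize) {
    // pop() already decrements the length, so no explicit
    // subtraction is needed.
    let last_item = avec.pop().unwrap();
    if inx < avec.len() {
        avec[inx] = last_item;
    }
    // Caveat: this panics on an empty vector, and an out-of-range inx
    // on a non-empty vector silently drops the last element instead of
    // panicking, unlike Vec::swap_remove.
}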
