Vec's swap_remove is needlessly slow #52150

orlp · 2018-07-08T05:05:13Z

Currently Vec's swap_remove has the following implementation:

pub fn swap_remove(&mut self, index: usize) -> T {
    let length = self.len();
    self.swap(index, length - 1);
    self.pop().unwrap()
}

This is needlessly slow. The swap does a bounds check, and then pop does
another bounds check that never fails. Furthermore, there is an actual swap
right before the pop - this results in a lot useless moves that don't (seem to)
get optimized out.

It's possible to write a safe implementation that only does one bounds check and
uses two moves with a little unsafe code:

pub fn swap_remove(&mut self, index: usize) -> T {    
    unsafe {
        // Singular bounds check on index ensures safety.
        let hole: *mut T = &mut self[index];
        let value = ptr::read(hole);

        // Since the bounds check on index succeeded we know back >= 0.
        let back = self.len() - 1;
        ptr::copy(self.get_unchecked(back), hole, 1);
        self.set_len(back);
        value        
    }
}

I found it quite hard to benchmark this, but the following benchmark showed on
my machine the new implementation to be faster:

#![feature(test)]
extern crate test;

pub fn swap_remove_opt<T>(v: &mut Vec<T>, idx: usize) -> T {    
    unsafe {
        // Singular bounds check on idx ensures safety.
        let hole: *mut T = &mut v[idx];
        let value = std::ptr::read(hole);
        let back = v.len() - 1;
        std::ptr::copy(v.get_unchecked(back), hole, 1);
        v.set_len(back);
        value        
    }
}

// Tiny LCG RNG.
fn rng_next(s: u64) -> u64 {
    6364136223846793005 * s  + 1
}

#[cfg(test)]
mod tests {
    use super::*;
    use test::Bencher;
    
    #[bench]
    fn bench_1000_overhead(b: &mut Bencher) {
        let mut rng = 43;
        let v: Vec<usize> = (0..1024).collect();
        b.iter(|| {
            let v = v.clone();
            for _ in 0..512 {
                rng = rng_next(rng);
                let idx = (rng >> 32) % 512;
                test::black_box(idx as usize);
            }
            v
        })
    }

    #[bench]
    fn bench_vec_1000_swap_remove(b: &mut Bencher) {
        let mut rng = 43;
        let v: Vec<usize> = (0..1024).collect();
        b.iter(|| {
            let mut v = v.clone();
            for _ in 0..512 {
                rng = rng_next(rng);
                let idx = (rng >> 32) % 512;
                test::black_box(v.swap_remove(idx as usize));
            }
            v
        })
    }

    #[bench]
    fn bench_vec_1000_swap_remove_opt(b: &mut Bencher) {
        let mut rng = 43;
        let v: Vec<usize> = (0..1024).collect();
        b.iter(|| {
            let mut v = v.clone();
            for _ in 0..512 {
                rng = rng_next(rng);
                let idx = (rng >> 32) % 512;
                test::black_box(swap_remove_opt(&mut v, idx as usize));
            }
            v
        })
    }
}

test tests::bench_1000_overhead            ... bench:       1,049 ns/iter (+/- 16)
test tests::bench_vec_1000_swap_remove     ... bench:       1,306 ns/iter (+/- 122)
test tests::bench_vec_1000_swap_remove_opt ... bench:       1,193 ns/iter (+/- 104)

As I said, it's quite hard to benchmark - lots of overhead and a small difference to be observed. Regardless, running multiple times show consistently swap_remove_opt to be faster.

Although I can't run the above benchmark on stable due to test being unstable,
I conjecture that on stable Rust the difference is much bigger. A quick look at
the disassembly should show why: https://godbolt.org/g/fU4Edu

Nightly fares a lot better in the disassembly but as seen above still loses out in the benchmark.

The text was updated successfully, but these errors were encountered:

hanna-kruppe · 2018-07-08T14:34:10Z

Thanks a lot for looking into this!

Comparing the assembly for the two implementations in nightly on godbolt reveals that the only substantial difference is that Vec::swap_remove writes the value back (as expected -- we don't have a way to tell LLVM that nobody will access the elements beyond len and that they will be overwritten before len is increased again, and frankly that isn't even true), the bounds checks are already combined into one. All remaining differences are scheduling/isel/regalloc differences and those are generally not correlated with source level changes in any obvious way. So I'm wondering whether we can get all of the benefit with less and less-tricky unsafe? For example, I dropped the get_unchecked_mut and still got identical assembly: https://godbolt.org/g/wdG8VN

orlp · 2018-07-08T15:33:54Z

@rkruppe Do you know what change causes nightly to be significantly better at reasoning about the bounds check than stable, and whether that functionality is coming to stable soon?

Regardless, the benchmarks I posted above are done on nightly, so there still seems to be a performance benefit.

hanna-kruppe · 2018-07-08T15:56:41Z

Do you know what change causes nightly to be significantly better at reasoning about the bounds check than stable

Note that my implementation, which doesn't use get_unchecked, still has the same assembly as yours on 1.26.

and whether that functionality is coming to stable soon?

We backport serious bug fixes but not performance fixes. So any performance improvements in nightly get into beta in a six week cycle, and that beta becomes the next stable another six weeks later.

Regardless, the benchmarks I posted above are done on nightly, so there still seems to be a performance benefit.

Yes, there is a performance benefit to be had by replacing the current implementation in std. I'm only talking about how much unsafe we need to use to get that benefit.

orlp · 2018-07-08T16:41:09Z

@rkruppe I think I managed to reduce the unsafe code to a minimum with the following implementation:

pub fn swap_remove(&mut self, index: usize) -> T {
    unsafe {
        let hole: *mut T = &mut self[index];
        std::ptr::replace(hole, self.pop().unwrap())
    }
}

This still compiles to the essentially the same assembly on stable.

hanna-kruppe · 2018-07-08T17:03:53Z

That's great! Do you want to add comments describing the motivation and why this unsafe code is sound and submit this optimization as a pull request?

Performance improvement of Vec's swap_remove. The old implementation *literally* swapped and then removed, which resulted in unnecessary move instructions. The new implementation does use unsafe code, but is easy to see that it is correct. Fixes #52150.

X01XX · 2021-10-18T17:00:48Z

Popping the vector subtracts one from the length, suggesting that the subtraction in the above code may not be required. If you want to access an item in a vector before removing it, that is easy to do, suggesting that returning the item in the above code may not be required. So,

    let last_item = avec.pop().unwrap();

    if inx < avec.len() {
        avec[inx] = last_item;
    }

csmoe added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Jul 8, 2018

orlp mentioned this issue Jul 9, 2018

Performance improvement of Vec's swap_remove. #52166

Merged

bors closed this as completed in #52166 Jul 9, 2018

comphead mentioned this issue Sep 3, 2024

Improve & unify validation in LogicalPlan::with_new_exprs apache/datafusion#12264

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vec's swap_remove is needlessly slow #52150

Vec's swap_remove is needlessly slow #52150

orlp commented Jul 8, 2018 •

edited

Loading

hanna-kruppe commented Jul 8, 2018 •

edited

Loading

orlp commented Jul 8, 2018

hanna-kruppe commented Jul 8, 2018

orlp commented Jul 8, 2018 •

edited

Loading

hanna-kruppe commented Jul 8, 2018

X01XX commented Oct 18, 2021 •

edited

Loading

Vec's swap_remove is needlessly slow #52150

Vec's swap_remove is needlessly slow #52150

Comments

orlp commented Jul 8, 2018 • edited Loading

hanna-kruppe commented Jul 8, 2018 • edited Loading

orlp commented Jul 8, 2018

hanna-kruppe commented Jul 8, 2018

orlp commented Jul 8, 2018 • edited Loading

hanna-kruppe commented Jul 8, 2018

X01XX commented Oct 18, 2021 • edited Loading

orlp commented Jul 8, 2018 •

edited

Loading

hanna-kruppe commented Jul 8, 2018 •

edited

Loading

orlp commented Jul 8, 2018 •

edited

Loading

X01XX commented Oct 18, 2021 •

edited

Loading