vec::from_elem with primitives should be as fast as calloc #7136

erickt · 2013-06-14T23:34:46Z

While @cmr landed a nice optimization of vec::from_elem in #6876, he said that it's still not performing as fast as doing a malloc and a ptr::set_memory. We should figure out why it is not performing as well as it should be and fix it in order to remove the temptation of using the faster unsafe functions.


erickt · 2013-06-15T00:03:12Z

It turns out this is substantially slower than I expected. With this test benchmark:

extern mod extra;

use std::vec;
use std::ptr;

#[bench]
fn bench_from_elem(b: &mut extra::test::BenchHarness) {
    do b.iter {
        let v: ~[u8] = vec::from_elem(1024, 0u8);
    }
}

#[bench]
fn bench_set_memory(b: &mut extra::test::BenchHarness) {
    do b.iter {
        let mut v: ~[u8] = vec::with_capacity(1024);
        unsafe {
            let vp = vec::raw::to_mut_ptr(v);
            ptr::set_memory(vp, 0, 1024);
            vec::raw::set_len(&mut v, 1024);
        }
    }
}

fn main() {}

I'm getting these results:

running 2 tests
test bench_from_elem ... bench: 16351 ns/iter (+/- 192)
test bench_set_memory ... bench: 384 ns/iter (+/- 6)

This is related to #6623 and #7137.
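For readers who cannot build the 2013 dialect used in the benchmark above, here is a minimal sketch of the same two strategies in present-day Rust. It assumes `vec![0u8; n]` as the modern counterpart of `vec::from_elem(n, 0u8)` and `ptr::write_bytes` as the current name for `ptr::set_memory`; the function names `zeroed_safe` and `zeroed_unsafe` are made up for illustration and do not appear in the issue.

```rust
use std::ptr;

/// Safe path: stands in for `vec::from_elem(n, 0u8)` (assumed equivalent).
pub fn zeroed_safe(n: usize) -> Vec<u8> {
    vec![0u8; n]
}

/// Unsafe path mirroring `bench_set_memory`: allocate, memset, then set the length.
pub fn zeroed_unsafe(n: usize) -> Vec<u8> {
    let mut v: Vec<u8> = Vec::with_capacity(n);
    unsafe {
        // Fill the first `n` bytes of the allocation with zeros.
        ptr::write_bytes(v.as_mut_ptr(), 0, n);
        // Sound only because all `n` bytes were initialized just above.
        v.set_len(n);
    }
    v
}
```

Both functions return the same contents; the point of the issue is that the safe one should not cost more than the unsafe one.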

erickt · 2013-06-15T04:29:39Z

@huonw mentioned on IRC that with -O the difference is:

test bench_from_elem ... bench: 799 ns/iter (+/- 0)
test bench_set_memory ... bench: 200 ns/iter (+/- 0)

Which is much better, but still not great.

emberian · 2013-06-15T12:08:10Z

memset "cheats" by using SSE. Without optimizations, from_elem spends a lot of time in move_val_init, so there's a bunch of unnecessary function calls. With optimizations, the thing that's killing it is the copies.

thestinger · 2013-06-15T19:40:58Z

LLVM knows how to optimize loops to the same code as memcpy/memmove/memset though, as long as you generate good IR.

http://llvm.org/docs/doxygen/html/LoopIdiomRecognize_8cpp_source.html
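As a rough illustration of the kind of loop that pass targets, here is a sketch in present-day Rust (not the 2013 dialect used elsewhere in this thread). The function names are hypothetical, and whether a given compiler version actually emits a `memset` for the first function depends on the optimization level and the IR it generates, so treat this as an example of the loop shape rather than a guarantee.

```rust
/// Plain store loop: one byte written per iteration, no other side effects.
/// A loop of this shape is a candidate for being rewritten into `memset`.
pub fn fill_with_loop(buf: &mut [u8], value: u8) {
    for byte in buf.iter_mut() {
        *byte = value;
    }
}

/// Library call with the same observable effect, for comparison.
pub fn fill_with_library(buf: &mut [u8], value: u8) {
    buf.fill(value);
}
```

Comparing the optimized output of the two functions is one way to check whether the loop was recognized.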

luqmana · 2013-06-15T22:28:58Z

I added a bench for the literal vec repeat syntax as well:

#[bench]
fn bench_vec_repeat(b: &mut extra::test::BenchHarness) {
    do b.iter {
        let v: ~[u8] = ~[0u8, ..1024];
    }
}

Without optimizations:

running 3 tests
test bench_from_elem ... bench: 21803 ns/iter (+/- 597)
test bench_set_memory ... bench: 381 ns/iter (+/- 6)
test bench_vec_repeat ... bench: 2125 ns/iter (+/- 12)

With optimizations (-O):

running 3 tests
test bench_from_elem ... bench: 752 ns/iter (+/- 0)
test bench_set_memory ... bench: 183 ns/iter (+/- 0)
test bench_vec_repeat ... bench: 90 ns/iter (+/- 3)

Looking at the optimized IR, both set_memory and vec_repeat become a memset, but set_memory seems to have a lot more overhead. from_elem does not become a memset, hence the bad results.

thestinger · 2013-06-28T22:31:49Z

@erickt: this gets a lot better with the optimization passes from #7466, from ~3x as slow to ~2x as slow

Before (opt-level=2):

test bench_from_elem ... bench: 986 ns/iter (+/- 3)
test bench_set_memory ... bench: 343 ns/iter (+/- 3)
test bench_vec_repeat ... bench: 175 ns/iter (+/- 3)

After (opt-level=2):

test bench_from_elem ... bench: 667 ns/iter (+/- 0)
test bench_set_memory ... bench: 343 ns/iter (+/- 0)
test bench_vec_repeat ... bench: 178 ns/iter (+/- 0)

thestinger · 2013-06-28T22:37:05Z

@luqmana: the problem is with the code bloat from the surrounding code, rather than set_memory itself - I think we actually produce some IR there with undefined behaviour

thestinger · 2013-07-10T02:28:28Z

#7682 greatly speeds up with_capacity, fixing half of the issue

Before:

test bench_from_elem ... bench: 683 ns/iter (+/- 3)
test bench_set_memory ... bench: 334 ns/iter (+/- 3)
test bench_vec_repeat ... bench: 176 ns/iter (+/- 0)

After:

test bench_from_elem ... bench: 461 ns/iter (+/- 0)
test bench_set_memory ... bench: 143 ns/iter (+/- 0)
test bench_vec_repeat ... bench: 148 ns/iter (+/- 0)

thestinger · 2013-07-28T23:06:57Z

The performance of all 3 has improved, but from_elem got slower relative to the others:

test bench_from_elem ... bench: 409 ns/iter (+/- 4)
test bench_set_memory ... bench: 83 ns/iter (+/- 1)
test bench_vec_repeat ... bench: 84 ns/iter (+/- 3)

thestinger · 2013-07-29T21:24:17Z

It looks like the remaining issue is our pointer arithmetic being slow. The offset and mut_offset functions compile to conversions to and from integers, rather than actual pointer arithmetic with the stricter semantics. I think we need to make the + and - implementations for pointers a compiler feature.
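To make the distinction concrete, here is a sketch in present-day Rust of the two ways to compute an element address. The function names are hypothetical. The first goes through integers, which is the pattern described above; the second uses the pointer `add` method, which, to my understanding, lowers to an in-bounds pointer offset that the optimizer can reason about more easily.

```rust
/// Address computed by a pointer -> integer -> pointer round trip.
pub fn fill_via_integers(buf: &mut [u8], value: u8) {
    let len = buf.len();
    let base = buf.as_mut_ptr();
    for i in 0..len {
        // The optimizer only sees an integer addition here, not a pointer offset.
        let p = (base as usize + i) as *mut u8;
        // In bounds because `i < len`.
        unsafe { p.write(value) };
    }
}

/// Address computed with real pointer arithmetic.
pub fn fill_via_offsets(buf: &mut [u8], value: u8) {
    let len = buf.len();
    let base = buf.as_mut_ptr();
    for i in 0..len {
        // `add` requires the result to stay inside the same allocation,
        // which holds because `i < len`.
        let p = unsafe { base.add(i) };
        unsafe { p.write(value) };
    }
}
```

Both functions produce the same buffer contents; the difference is only in how much the compiler knows about the addresses involved.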

Closes #8118, #7136

~~~rust
extern mod extra;

use std::vec;
use std::ptr;

fn bench_from_elem(b: &mut extra::test::BenchHarness) {
    do b.iter {
        let v: ~[u8] = vec::from_elem(1024, 0u8);
    }
}

fn bench_set_memory(b: &mut extra::test::BenchHarness) {
    do b.iter {
        let mut v: ~[u8] = vec::with_capacity(1024);
        unsafe {
            let vp = vec::raw::to_mut_ptr(v);
            ptr::set_memory(vp, 0, 1024);
            vec::raw::set_len(&mut v, 1024);
        }
    }
}

fn bench_vec_repeat(b: &mut extra::test::BenchHarness) {
    do b.iter {
        let v: ~[u8] = ~[0u8, ..1024];
    }
}
~~~

Before:

test bench_from_elem ... bench: 415 ns/iter (+/- 17)
test bench_set_memory ... bench: 85 ns/iter (+/- 4)
test bench_vec_repeat ... bench: 83 ns/iter (+/- 3)

After:

test bench_from_elem ... bench: 84 ns/iter (+/- 2)
test bench_set_memory ... bench: 84 ns/iter (+/- 5)
test bench_vec_repeat ... bench: 84 ns/iter (+/- 3)

thestinger · 2013-07-30T16:03:23Z

Fixed by #8121.

test bench_from_elem ... bench: 84 ns/iter (+/- 1)
test bench_set_memory ... bench: 84 ns/iter (+/- 3)
test bench_vec_repeat ... bench: 84 ns/iter (+/- 2)

eddyb · 2013-12-10T19:10:51Z

This has regressed since that fix, @cmr is attempting a bisect atm.

Update: bisect finished, #8780 is to blame. This loop isn't optimized. Maybe the Finallyalizer drop method needs to be inlined for further optimizations.
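For context on why inlining a drop method could matter, here is a sketch in present-day Rust of a scope guard that runs cleanup code from its destructor. This is not the actual Finallyalizer code from the standard library of the time; `Guard` and `with_cleanup` are hypothetical names, and the `#[inline]` hint only illustrates the idea that the optimizer needs to see through the destructor call before it can simplify the loop around it.

```rust
/// Hypothetical guard type: runs `cleanup` when it goes out of scope.
pub struct Guard<F: FnMut()> {
    cleanup: F,
}

impl<F: FnMut()> Drop for Guard<F> {
    // Inlining hint so the destructor body is visible at the call site.
    #[inline]
    fn drop(&mut self) {
        (self.cleanup)();
    }
}

/// Runs `body`, then `cleanup`, including when `body` unwinds.
pub fn with_cleanup<T>(body: impl FnOnce() -> T, cleanup: impl FnMut()) -> T {
    let _guard = Guard { cleanup };
    body()
}
```

If the destructor stays an opaque call, code wrapped in such a guard may not be optimized as aggressively as the same code written without it.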

emberian · 2013-12-11T09:59:13Z

Note that set_memory has regressed by ~16 ns; the cause is #9933 (df187a0).


This was referenced Jul 29, 2013

pointer arithmetic should use GEP #8118

Closed

implement pointer arithmetic with GEP #8121

Closed

thestinger closed this as completed Jul 30, 2013

huonw reopened this Dec 10, 2013

eddyb mentioned this issue Dec 11, 2013

Inline Finallyalizer::drop, allowing LLVM to optimize finally. #10918

Merged

bors closed this as completed in 09bf5de Dec 14, 2013
