
Add small-copy optimization for copy_from_slice #37573

Merged
merged 3 commits on Dec 1, 2016

Conversation

@ruuda (Contributor) commented Nov 4, 2016

Summary

During benchmarking, I found that one of my programs spent between 5 and 10 percent of the time doing memmoves. Ultimately I tracked these down to single-byte slices being copied with a memcopy. Doing a manual copy if the slice contains only one element can speed things up significantly. For my program, this reduced the running time by 20%.
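
In simplified form, the check has roughly this shape (a sketch of the idea only, not the exact patch, which lives in copy_from_slice and was later moved into the io::Read impl for &[u8]):

```rust
use std::ptr;

// Sketch: special-case a length-1 copy with a plain assignment, and fall
// back to ptr::copy_nonoverlapping (which lowers to memcpy) otherwise.
fn copy_from_slice_sketch<T: Copy>(dst: &mut [T], src: &[T]) {
    assert_eq!(dst.len(), src.len(), "source and destination must have the same length");
    if dst.len() == 1 {
        dst[0] = src[0];
    } else {
        unsafe {
            ptr::copy_nonoverlapping(src.as_ptr(), dst.as_mut_ptr(), dst.len());
        }
    }
}
```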

Background

I am optimizing a program that relies heavily on reading a single byte at a time. To avoid IO overhead, I read all data into a vector once, and then I use a Cursor around that vector to read from. During profiling, I noticed that __memmove_avx_unaligned_erms was hot, taking up 7.3% of the running time. It turns out that these were caused by calls to Cursor::read(), which calls <&[u8] as Read>::read(), which calls &[T]::copy_from_slice(), which calls ptr::copy_nonoverlapping(). This one is implemented as a memcopy. Copying a single byte with a memcopy is very wasteful, because (at least on my platform) it involves calling memcpy in libc. This is an indirect call when libc is linked dynamically, and furthermore memcpy is optimized for copying large amounts of data at the cost of a bit of overhead for small copies.
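
A minimal reproduction of that access pattern looks roughly like this (a hypothetical example, not the benchmarked program):

```rust
use std::io::{Cursor, Read};

// Every read() call with a one-byte buffer goes through
// <&[u8] as Read>::read -> copy_from_slice -> ptr::copy_nonoverlapping,
// which ends up as a call to memcpy in libc.
fn sum_bytes(data: Vec<u8>) -> std::io::Result<u64> {
    let mut cursor = Cursor::new(data);
    let mut byte = [0u8; 1];
    let mut sum = 0u64;
    while cursor.read(&mut byte)? > 0 {
        sum += u64::from(byte[0]);
    }
    Ok(sum)
}
```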

Benchmarks

Before I made this change, perf reported the following for my program. I only included the relevant functions, and how they rank. (This is on a different machine than where I ran the original benchmarks. It has an older CPU, so __memmove_sse2_unaligned_erms is called instead of __memmove_avx_unaligned_erms.)

#3   5.47%  bench_decode  libc-2.24.so      [.] __memmove_sse2_unaligned_erms
#5   1.67%  bench_decode  libc-2.24.so      [.] memcpy@GLIBC_2.2.5
#6   1.51%  bench_decode  bench_decode      [.] memcpy@plt

memcpy is eating up 8.65% of the total running time, and the overhead of dispatching to a specialized fast copy function (memcpy@GLIBC showing up) is clearly visible. The price of dynamic linking (memcpy@plt showing up) is visible too.

After this change, this is what perf reports:

#5   0.33%  bench_decode  libc-2.24.so      [.] __memmove_sse2_unaligned_erms
#14  0.01%  bench_decode  libc-2.24.so      [.] memcpy@GLIBC_2.2.5

Now only 0.34% of the running time is spent on memcopies. The dynamic linking overhead is not significant at all any more.

To add some more data, my program generates timing results for the operation in its main loop. These are the timings before and after the change:

| Time before   | Time after    | After/Before |
|---------------|---------------|--------------|
| 29.8 ± 0.8 ns | 23.6 ± 0.5 ns | 0.79 ± 0.03  |

The time is basically the total running time divided by a constant; the actual numbers are not important. This change reduced the total running time by 21% (much more than the original 9% spent on memmoves, likely because the CPU stalls far less when the data dependencies are more transparent). Of course YMMV and for most programs this will not matter at all. But when it does, the gains can be significant!

Alternatives

  • At first I implemented this in io::Cursor. I moved it to &[T]::copy_from_slice() instead, but this might be too intrusive, especially because it applies to all T, not just u8. To restrict this to io::Read, <&[u8] as Read>::read() is probably the best place.
  • I tried copying bytes in a loop up to 64 or 8 bytes before calling Read::read, but both resulted in about a 20% slowdown instead of speedup.

@rust-highfive (Collaborator)

r? @aturon

(rust_highfive has picked a reviewer for you, use r? to override)

@alexcrichton (Member)

Thanks for the PR! To get a handle on the perf impact here, can you also run some benchmarks for memcpys of 1 to maybe 10 bytes in size? Presumably this punishes other small memcpys with an extra branch, but I'd be curious to see by how much

@ruuda (Contributor, Author) commented Nov 4, 2016

I created a microbenchmark with the following results:

Before:

bench_copy_u32_slice_01_x100 ... bench:         824 ns/iter (+/- 5)
bench_copy_u32_slice_02_x100 ... bench:         990 ns/iter (+/- 1)
bench_copy_u32_slice_03_x100 ... bench:         976 ns/iter (+/- 1)
bench_copy_u32_slice_04_x100 ... bench:       1,429 ns/iter (+/- 1)
bench_copy_u32_slice_06_x100 ... bench:       1,129 ns/iter (+/- 1)
bench_copy_u32_slice_08_x100 ... bench:         997 ns/iter (+/- 0)
bench_copy_u32_slice_16_x100 ... bench:       1,905 ns/iter (+/- 39)
bench_copy_u8_slice_01_x100  ... bench:         798 ns/iter (+/- 2)
bench_copy_u8_slice_02_x100  ... bench:         844 ns/iter (+/- 0)
bench_copy_u8_slice_03_x100  ... bench:         856 ns/iter (+/- 1)
bench_copy_u8_slice_04_x100  ... bench:         794 ns/iter (+/- 2)
bench_copy_u8_slice_06_x100  ... bench:         793 ns/iter (+/- 2)
bench_copy_u8_slice_08_x100  ... bench:         844 ns/iter (+/- 0)
bench_copy_u8_slice_16_x100  ... bench:         708 ns/iter (+/- 0)

After:

bench_copy_u32_slice_01_x100 ... bench:         480 ns/iter (+/- 1)
bench_copy_u32_slice_02_x100 ... bench:       1,048 ns/iter (+/- 2)
bench_copy_u32_slice_03_x100 ... bench:         981 ns/iter (+/- 1)
bench_copy_u32_slice_04_x100 ... bench:         799 ns/iter (+/- 4)
bench_copy_u32_slice_06_x100 ... bench:         762 ns/iter (+/- 9)
bench_copy_u32_slice_08_x100 ... bench:         800 ns/iter (+/- 2)
bench_copy_u32_slice_16_x100 ... bench:       1,369 ns/iter (+/- 4)
bench_copy_u8_slice_01_x100  ... bench:         407 ns/iter (+/- 1)
bench_copy_u8_slice_02_x100  ... bench:         955 ns/iter (+/- 1)
bench_copy_u8_slice_03_x100  ... bench:         955 ns/iter (+/- 3)
bench_copy_u8_slice_04_x100  ... bench:         893 ns/iter (+/- 1)
bench_copy_u8_slice_06_x100  ... bench:         893 ns/iter (+/- 3)
bench_copy_u8_slice_08_x100  ... bench:         955 ns/iter (+/- 2)
bench_copy_u8_slice_16_x100  ... bench:         713 ns/iter (+/- 3)

These were performed with the scaling governor set to powersave to minimize variance. There is a big win for single-element slices, but for u8 slices there is indeed an extra cost for bigger slices. I am not sure how this will play out in practice, though; writing a good benchmark is hard.

The standard deviation estimates cannot be trusted; I get a difference of as much as 325 ns/iter between runs, where a deviation of +/- 4 is reported.
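
For reference, the benchmarks have roughly this shape (a sketch based on the benchmark names; the actual code may differ), with the slice length hidden behind black_box so LLVM cannot treat it as a compile-time constant:

```rust
#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

// Copy a one-element u8 slice 100 times per iteration; black_box hides the
// length from the optimizer so the memcpy cannot be constant-folded away.
#[bench]
fn bench_copy_u8_slice_01_x100(b: &mut Bencher) {
    let src = [42u8; 1];
    let mut dst = [0u8; 1];
    b.iter(|| {
        for _ in 0..100 {
            let len = black_box(src.len());
            dst[..len].copy_from_slice(&src[..len]);
        }
        black_box(&dst);
    });
}
```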

To get some more real-world data, I ran cargo bench for html5ever v0.9.0, which seems like a crate that could benefit from this.

Before:

lipsum-zh.html size    1024           ... bench:      10,654 ns/iter (+/- 45)
lipsum-zh.html size 1048576           ... bench:  10,418,459 ns/iter (+/- 7,053)
lipsum.html size    1024              ... bench:       6,289 ns/iter (+/- 9)
lipsum.html size 1048576              ... bench:   6,121,224 ns/iter (+/- 6,787)
lipsum.html size 1048576 (clone only) ... bench:      12,520 ns/iter (+/- 69)
medium-fragment.html                  ... bench:     213,680 ns/iter (+/- 843)
small-fragment.html                   ... bench:      20,141 ns/iter (+/- 197)
strong.html size    1024              ... bench:     175,902 ns/iter (+/- 13,085)
strong.html size 1048576              ... bench: 175,450,855 ns/iter (+/- 2,841,922)
tiny-fragment.html                    ... bench:       1,804 ns/iter (+/- 43)

After:

lipsum-zh.html size    1024           ... bench:      10,017 ns/iter (+/- 8)
lipsum-zh.html size 1048576           ... bench:  10,422,632 ns/iter (+/- 2,762)
lipsum.html size    1024              ... bench:       6,301 ns/iter (+/- 3)
lipsum.html size 1048576              ... bench:   6,114,793 ns/iter (+/- 2,335)
lipsum.html size 1048576 (clone only) ... bench:      12,664 ns/iter (+/- 57)
medium-fragment.html                  ... bench:     199,746 ns/iter (+/- 1,179)
small-fragment.html                   ... bench:      18,884 ns/iter (+/- 161)
strong.html size    1024              ... bench:     149,285 ns/iter (+/- 2,743)
strong.html size 1048576              ... bench: 152,433,080 ns/iter (+/- 1,018,484)
tiny-fragment.html                    ... bench:       1,715 ns/iter (+/- 63)

There are wins as big as 15% on some benchmarks, but regressions of a few percent as well (although that might be noise, given that the deviation estimates are too low).

Also, keep in mind that this is just for x86_64 on Linux.

@bluss (Member) commented Nov 4, 2016

This kind of issue is interesting, but ideally this memcpy should be so easily available to the optimizer that it can make this decision by itself. Might be worth investigating if there's anything standing in the way of that.

@ruuda (Contributor, Author) commented Nov 4, 2016

I ran the microbenchmarks again 32 times and took the minimum:

Before:

bench_copy_u32_slice_01_x100  824
bench_copy_u32_slice_02_x100  902
bench_copy_u32_slice_03_x100  906
bench_copy_u32_slice_04_x100  716
bench_copy_u32_slice_06_x100  710
bench_copy_u32_slice_08_x100  736
bench_copy_u32_slice_16_x100 1422
 bench_copy_u8_slice_01_x100  781
 bench_copy_u8_slice_02_x100  844
 bench_copy_u8_slice_03_x100  844
 bench_copy_u8_slice_04_x100  781
 bench_copy_u8_slice_06_x100  781
 bench_copy_u8_slice_08_x100  844
 bench_copy_u8_slice_16_x100  638

After:

bench_copy_u32_slice_01_x100  470
bench_copy_u32_slice_02_x100 1048
bench_copy_u32_slice_03_x100  980
bench_copy_u32_slice_04_x100  798
bench_copy_u32_slice_06_x100  761
bench_copy_u32_slice_08_x100  800
bench_copy_u32_slice_16_x100 1369
 bench_copy_u8_slice_01_x100  407
 bench_copy_u8_slice_02_x100  955
 bench_copy_u8_slice_03_x100  955
 bench_copy_u8_slice_04_x100  892
 bench_copy_u8_slice_06_x100  892
 bench_copy_u8_slice_08_x100  955
 bench_copy_u8_slice_16_x100  713

@bluss: for the case of Cursor::read, where the slice has length 1 at compile time, could it be that by the time the ptr::copy_nonoverlapping is called, the compiler has lost track of the size of the slice? But when the slice length is not known at compile time, there is nothing the optimizer would be able to do, right? (In the benchmark I pull the bounds through a black box to hide the slice length.)

@bluss (Member) commented Nov 4, 2016

I don't know. I don't really like black_box: with it you get the most pessimistic optimization, and without it the benchmark program might be compiled too optimistically (for exactly how a function is used in that executable only). Benchmarks from real programs such as your own are more interesting.

I wanted to look at how gcc and clang optimize memcpy for short sizes (I think they do insert some inline code conditionally, for example when the inputs have constant sizes), but I didn't find any documentation or examples of that yet.

@alexcrichton (Member)

Yeah LLVM will certainly optimize a memcpy of length 1 (for small sizes) but my guess is that LLVM has lost track of the size so the variable length falls back to the memcpy symbol. Once that happens I'd definitely believe that a branch here is much more performant.

Thanks for gathering the numbers @ruuda!

The data looks somewhat inconclusive to me. Certainly faster (as expected) for one element, but sometimes noticeably slower for multiple elements?
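
One way to observe the difference between a constant-length and a variable-length copy is to compare the assembly of two small functions (a hypothetical experiment; the exact codegen depends on the LLVM version and optimization level):

```rust
use std::ptr;

// With the length known at compile time, LLVM typically lowers the copy to a
// single byte load/store; with a runtime length it generally calls memcpy.
pub unsafe fn copy_one_const(dst: *mut u8, src: *const u8) {
    ptr::copy_nonoverlapping(src, dst, 1);
}

pub unsafe fn copy_n_dynamic(dst: *mut u8, src: *const u8, len: usize) {
    ptr::copy_nonoverlapping(src, dst, len);
}
```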

@ruuda (Contributor, Author) commented Nov 4, 2016

Yes, the overhead is much more than I expected. Calling memcpy involves two jumps to a dynamic address plus all of the calling-convention bookkeeping, so I did not expect the extra branch to hurt so much on top of that. But I’m not sure these microbenchmarks are realistic either.

In my program the length of the slice was known at compile time, so at least in theory the compiler would be able to omit the memcpy. Do you have any idea what could cause LLVM to lose track of the length?

@alexcrichton (Member)

Unfortunately there's nothing specific that would cause that; a general lack of inlining, or other interference, could cause it over time.

@rust-lang/libs thoughts about landing this given the benchmarks numbers? I'm slightly leaning towards no as this seems very much like "microbenchmark" territory, personally.

@BurntSushi (Member)

I've definitely seen major wins in my own code by optimizing usage of ptr::copy_nonoverlapping, specifically in a way that avoids a full blown memcpy. (In particular, the snap crate.) So I think optimizations like this might be worth it.

One thing that might be interesting is to expand this to slices of length 8 or perhaps even 16, which should permit a copy using an unaligned load/store. That is, instead of if self.len() == 1, use if self.len() <= 16. Of course, you might want to tweak the size limit based on the size of a word; e.g., I'd expect a copy of length 16 to use a SIMD register on x64.

@ruuda (Contributor, Author) commented Nov 7, 2016

To copy a fixed number of bytes (e.g. up to 16), we could check something like self.len() * mem::size_of::<T>() < 16. That is a good idea in any case, because for larger copies the relative overhead of a memcpy is small anyway.

The only issue then is when T is small and multiple elements fit in 16 bytes. Then there has to be sufficient padding after the slice to ensure that a single 16-byte copy does not overwrite anything. But this is not guaranteed in general, is it? For example if the slice is a subslice of some large byte array.

Alternatively, we could sidestep the tradeoff for copy_from_slice and do the optimization only for &[u8] as io::Read. That way copy_from_slice will at least be consistent in what it does, and it would be the responsibility of authors to avoid it for small slices.
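
A safe shape for such a size-based threshold, without relying on any padding after the slice, might look like this (a sketch, not a proposed patch):

```rust
use std::{mem, ptr};

// Sketch: copy element by element when the total size is at most 16 bytes,
// otherwise fall back to the memcpy path. A single unaligned 16-byte
// load/store would be faster still, but needs the padding guarantee
// discussed above.
fn small_copy<T: Copy>(dst: &mut [T], src: &[T]) {
    assert_eq!(dst.len(), src.len());
    if dst.len() * mem::size_of::<T>() <= 16 {
        for i in 0..dst.len() {
            dst[i] = src[i];
        }
    } else {
        unsafe {
            ptr::copy_nonoverlapping(src.as_ptr(), dst.as_mut_ptr(), dst.len());
        }
    }
}
```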

@bluss (Member) commented Nov 7, 2016

The cursor/read impls do a bunch of indexing operations, so that may well be the impediment to optimization, for example if bounds checks are not elided.

It might be far-fetched, but maybe something can be done there first? Then it seems good to either restrict this optimization entirely to the I/O code, or at least start it there as a "prototype".

@ruuda, do you have a small example program that exhibits this problem? Then I could see if Cursor/Read can be tuned in any way.

@ruuda (Contributor, Author) commented Nov 7, 2016

@bluss: I do not have a small test case at the moment, but there are the html5ever benchmarks which do profit from this too. I’ll try to put together a small test case in the next few days.

The program in which I originally found this is the bench_decode example in Claxon. Commit 8305e9e should reproduce that behavior.

@ruuda (Contributor, Author) commented Nov 7, 2016

Here is a minimal yet vaguely realistic example that suffers from this issue. It counts the number of bytes in a file excluding C-style comments.

use std::fs::File;
use std::io::{BufReader, Read};

#[derive(Copy, Clone)]
enum State {
    Outside,
    BeginSlash,
    Inside,
    EndAsterisk,
}

fn count_non_comment_bytes(fname: &str) -> std::io::Result<u32> {
    let mut input = BufReader::new(File::open(fname)?);
    let mut state = State::Outside;
    let mut buffer = [0u8];
    let mut non_comment_count = 0;
    while input.read(&mut buffer)? > 0 {
        match (state, buffer[0]) {
            (State::Outside, b'/') => state = State::BeginSlash,
            (State::Outside, _) => non_comment_count += 1,
            (State::BeginSlash, b'*') => state = State::Inside,
            (State::BeginSlash, _) => {
                state = State::Outside;
                non_comment_count += 2;
            }
            (State::Inside, b'*') => state = State::EndAsterisk,
            (State::Inside, _) => {},
            (State::EndAsterisk, b'/') => state = State::Outside,
            (State::EndAsterisk, _) => state = State::Inside,
        }
    }
    Ok(non_comment_count)
}

fn main() {
    for fname in std::env::args().skip(1) {
        match count_non_comment_bytes(&fname) {
            Ok(n) => println!("{}: {} non-comment bytes", fname, n),
            Err(err) => println!("error while processing {}: {:?}", fname, err),
        }
    }
}

I compiled this with rustc -C opt-level=3 -g main.rs and then I ran the binary on the file template_css.css linked on bootstrap.org, concatenated to itself 10000 times. These are the (very rough) timings:

|                                           | Running time | Time spent on memcpy |
|-------------------------------------------|--------------|----------------------|
| rustc 1.14.0-nightly (3caf63c 2016-10-24) | 1.36s        | 35.16%               |
| patched (forked from 0ca9967)             | 0.92s        | 0.0%                 |

Interestingly, if I change the example to use the bytes() iterator instead of reading into a slice, it does avoid the call to memcpy.
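
For comparison, the bytes()-based inner loop looks roughly like this (sketch); the iterator yields io::Result<u8> directly, so no one-byte slice, and hence no memcpy, is involved:

```rust
use std::fs::File;
use std::io::{BufReader, Read};

// Sketch of the bytes() variant of the loop.
fn count_bytes(fname: &str) -> std::io::Result<u64> {
    let input = BufReader::new(File::open(fname)?);
    let mut count = 0u64;
    for byte in input.bytes() {
        let _b = byte?;
        count += 1;
    }
    Ok(count)
}
```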

@bluss (Member) commented Nov 7, 2016

Great. There is one bounds check from fill_buf that's not optimized out there; let's see what happens if it is removed.

@alexcrichton (Member)

@ruuda hm I've always been under the assumption that if you're searching byte-by-byte through a file then std::io::Read is an inherently-slow interface to work with? That is, it'd instead be much more efficient to read blocks of text and then search through that block byte-by-byte? The interference to optimizations mentioned by @bluss is definitely happening with all the layers of indirection in I/O, unfortunately.

@alexcrichton (Member)

@BurntSushi yeah it definitely makes sense to me that small memcpys are slower than inlining them. That comes at a cost of code size, though, and for me at least it seems odd where in the stack we'd put this optimization. For example, many callers of copy_from_slice use it precisely because they want it to be a call to memcpy, not necessarily anything else.

@bluss (Member) commented Nov 8, 2016

Eliminating the bounds check had no effect on this, at least it didn't change the way memcpy was called.

@ruuda (Contributor, Author) commented Nov 8, 2016

That is, it'd instead be much more efficient to read blocks of text and then search through that block byte-by-byte?

This is not always possible or convenient. To give a concrete example: in the program that I am working on, there is a CRC-16 stored after a certain block of data (the size of which may only become known while reading it). It was very convenient to create a Crc16Reader which computes the CRC of the data it reads on the fly. Crc16Reader wraps an io::Read and implements io::Read itself. That way, at the start of the block I wrap the underlying reader in a Crc16Reader, and at the end of the block I can ask it for the CRC of all the bytes read. This way the other reading logic does not have to worry about CRCs.

When I discovered that reading a single byte is slow, I looked into using io::BufRead instead. Unfortunately it is not possible to efficiently implement such a thing on top of io::BufRead, because by the time consume() is called (which would be the point where the CRC is computed over the consumed bytes), there is no way to safely access the underlying buffer any more. I ended up building my own read abstraction that can efficiently read a single byte, but that meant throwing away the building blocks in the standard library and doing everything manually.

Also, yes, it would be possible to manually read into a buffer and keep an index into that, and on every read check if it required to refill the buffer, and deal with the edge cases of reads straddling the buffer boundary. But that is all boilerplate, and it is exactly what io::BufReader does already. With Rust aiming for zero-cost abstractions, I did not expect this to come at a price.
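
The Crc16Reader described above might look roughly like this (a hypothetical sketch; the polynomial here, reflected 0xA001, is an arbitrary choice and the real format may use a different CRC-16 variant):

```rust
use std::io::{self, Read};

// Wraps any reader and folds every byte that passes through into a
// running CRC-16, so the rest of the reading logic does not have to
// care about CRCs.
struct Crc16Reader<R> {
    inner: R,
    crc: u16,
}

impl<R: Read> Crc16Reader<R> {
    fn new(inner: R) -> Crc16Reader<R> {
        Crc16Reader { inner, crc: 0 }
    }

    fn crc(&self) -> u16 {
        self.crc
    }
}

impl<R: Read> Read for Crc16Reader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        for &byte in &buf[..n] {
            // Bitwise CRC-16 update (reflected polynomial 0xA001).
            self.crc ^= byte as u16;
            for _ in 0..8 {
                let lsb = self.crc & 1;
                self.crc >>= 1;
                if lsb != 0 {
                    self.crc ^= 0xA001;
                }
            }
        }
        Ok(n)
    }
}
```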

For example many calling copy_from_slice do so because they want it to be a call to memcpy, not necessarily other pieces perhaps.

I share this concern. Would you be more comfortable with moving the optimization to the io::Read impl of &[u8]?

Eliminating the bounds check had no effect on this, at least it didn't change the way memcpy was called.

I am not familiar with the LLVM optimizer details, but this does make sense to me: if the bounds check is there, then inside the branch bounds on the value are known, which is information that could be abused by the optimizer.

@arthurprs (Contributor)

LLVM should be seeing this if it inlined all the way to copy_from_slice. So maybe an inline is being refused somewhere?

@ruuda (Contributor, Author) commented Nov 9, 2016

@arthurprs, no, it is inlining everything properly but still generating a call to memcpy. Here’s a snippet of the main function of the example program I posted above. Everything got inlined there:

     _ZN4main23count_non_comment_bytesE():
       nop
                 (State::Inside, b'*') => state = State::EndAsterisk,
300:   cmp    $0x2a,%al
       sete   %r12b
       or     $0x2,%r12b
     _ZN3std2io8buffered8{{impl}}23fill_buf<std::fs::File>E():
30a:   mov    -0x40(%rbp),%r14
       mov    -0x38(%rbp),%r15
     _ZN3std2io8buffered8{{impl}}19read<std::fs::File>E():
       cmp    %r15,%r14
     ↓ jne    360
     _ZN3std2io8buffered8{{impl}}23fill_buf<std::fs::File>E():
       mov    -0x48(%rbp),%rcx
     _ZN3std2io8buffered8{{impl}}19read<std::fs::File>E():
       cmp    $0x1,%rcx
     ↓ jbe    3c0
     _ZN3std2io8buffered8{{impl}}23fill_buf<std::fs::File>E():
       mov    -0x50(%rbp),%rdx
       lea    -0x130(%rbp),%rdi
       lea    -0x58(%rbp),%rsi
     → callq  _$LT$std..fs..File$u20$as$u20$std..io..Read$GT$::read::h68b68e96a3ee717e
       mov    -0x128(%rbp),%r15
     _ZN4core3ops8{{impl}}89translate<usize,std::io::error::Error,core::result::Result<usize, std::io::error::Error>>E():
       cmpq   $0x1,-0x130(%rbp)
     ↓ je     540
     _ZN3std2io8buffered8{{impl}}23fill_buf<std::fs::File>E():
       mov    %r15,-0x38(%rbp)
       movq   $0x0,-0x40(%rbp)
     _ZN3std2io8buffered8{{impl}}22consume<std::fs::File>E():
       xor    %r14d,%r14d
     ↓ jmp    369
     _ZN4core5slice8{{impl}}9index<u8>E():
       nop
360:   cmp    %r14,%r15
     ↓ jb     62c
     _ZN3std2io8buffered8{{impl}}23fill_buf<std::fs::File>E():
369:   mov    -0x48(%rbp),%rsi
     _ZN4core5slice8{{impl}}9index<u8>E():
       cmp    %rsi,%r15
     ↓ ja     624
       mov    -0x50(%rbp),%rsi
       add    %r14,%rsi
     _ZN4core3cmp5impls8{{impl}}2leE():
       xor    %ebx,%ebx
       cmp    %r14,%r15
       setne  %bl
       lea    -0x29(%rbp),%rdi
     _ZN4core5slice8{{impl}}19copy_from_slice<u8>E():
       mov    %rbx,%rdx
     → callq  memcpy@plt # <--- Note the call here
     _ZN3std2io8buffered8{{impl}}22consume<std::fs::File>E():
       add    %rbx,%r14
     _ZN4core3cmp5impls8{{impl}}2leE():
       cmp    %r15,%r14

@arthurprs (Contributor) commented Nov 9, 2016

Interesting. The assembly shows a couple of bounds checks (that shouldn't really be necessary), so I guess the optimizer is getting confused somewhere.

Edit: On the other hand it uses a very specific instruction (setne) to set the memcpy length to 1 so the buffer length is correctly propagated. Why it didn't use a movb though, I have no idea.

// significant. If the element is big then the assignment is a memcopy
// anyway.
if self.len() == 1 {
    self[0] = src[0];

@eddyb (Member) commented on the diff, Nov 10, 2016

I bet you could even copy the ptr::copy_nonoverlapping call here and replace self.len() with 1.
If that works, then this is specializing for values of self.len() without it being known at compile-time.

That the overhead of calling memcpy is higher than what it does is scary. What has dynamic linking done now...

@alexcrichton added the T-libs-api label (relevant to the library API team, which will review and decide on the PR/issue) on Nov 10, 2016
@alexcrichton (Member)

@ruuda yeah I assumed that this sort of case would be handled by reading a bunch of memory (through the stack of read combinators) and then dealing with that chunk of memory at the end, processing it in an optimized fashion there. Although complicated to implement, I'd assume that it's going to be faster than byte-by-byte iteration no matter what level of optimizations we implement here (even assuming this lands). Do you perhaps have an example program that I could test this assumption on?

I'd be more comfortable, yeah, putting this higher in the stack. I'm still a little skeptical that it's what you'd want to end up with in the long run. That is, it seems like the fastest solution would still not rely on this optimization at all.

@brson (Contributor) commented Nov 11, 2016

The wins @ruuda posted in #37573 (comment) look pretty compelling to me, if it can be done without making other cases significantly worse. Better performance is good, even if it's ugly and shouldn't have to be written like this.

If this lands the doc comment should explain more clearly that this is working around LLVM's inability to see the constant.

We could maybe solve this problem generally by having LLVM emit the memcpy intrinsic by a custom symbol name that we intercept, and do this there. (Edit: rather we can sink this optimization into ptr::copy_nonoverlapping, duh).

@eddyb (Member) commented Nov 11, 2016

@brson Doing it any lower would make a lot of things worse, or at least harder to optimize.
I don't want to do this without very accurate profiling pointing out where exactly the costs lie.

@alexcrichton (Member)

@brson note that from the previous numbers collected it appears that non-1-length copies are regressing due to this change.

@eddyb (Member) commented Nov 11, 2016

That'd be expected, there's strictly more conditional branches and icache waste.

eddyb added a commit to eddyb/rust that referenced this pull request Nov 11, 2016
Remove one bounds check from BufReader

Very minor thing. Otherwise the optimizer can't be sure that pos <= cap. Added a paranoid debug_assert to ensure correctness instead.

CC rust-lang#37573
eddyb added a commit to eddyb/rust that referenced this pull request Nov 12, 2016
Remove one bounds check from BufReader

Very minor thing. Otherwise the optimizer can't be sure that pos <= cap. Added a paranoid debug_assert to ensure correctness instead.

CC rust-lang#37573
@ruuda (Contributor, Author) commented Nov 12, 2016

I moved the optimization into <&[u8] as Read> now. Rationale:

  • It is a big win if you are reading single bytes, and if you are reading large slices (on the order of a few thousand bytes) then the additional overhead is negligible, so this would be a win in the majority of situations.
  • Users might rely on copy_from_slice being a memcpy.
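
In that location the change has roughly the following shape (a sketch written as a free function standing in for <&[u8] as Read>::read; see the diff for the exact code):

```rust
use std::cmp;
use std::io;

// A one-byte read becomes a plain assignment; larger reads still go
// through copy_from_slice (i.e. memcpy).
fn read_from_slice<'a>(this: &mut &'a [u8], buf: &mut [u8]) -> io::Result<usize> {
    let amt = cmp::min(buf.len(), this.len());
    let (a, b) = this.split_at(amt);
    if amt == 1 {
        buf[0] = a[0];
    } else {
        buf[..amt].copy_from_slice(a);
    }
    *this = b;
    Ok(amt)
}
```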

Benchmark results for html5ever v0.9.0 (on a different machine than before, so the results are not comparable to my earlier benchmarks):

Before (rustc 1.14.0-nightly cae6ab1c4 2016-11-05):

  lipsum-zh.html size    1024           ... bench:       3,142 ns/iter (+/- 15)
  lipsum-zh.html size 1048576           ... bench:   3,187,696 ns/iter (+/- 327,000)
  lipsum.html size    1024              ... bench:       1,265 ns/iter (+/- 247)
  lipsum.html size 1048576              ... bench:   2,140,937 ns/iter (+/- 96,737)
  lipsum.html size 1048576 (clone only) ... bench:       5,200 ns/iter (+/- 114)
  medium-fragment.html                  ... bench:      74,914 ns/iter (+/- 37,242)
  small-fragment.html                   ... bench:       5,858 ns/iter (+/- 1,893)
  strong.html size    1024              ... bench:      56,843 ns/iter (+/- 31,700)
  strong.html size 1048576              ... bench:  54,887,503 ns/iter (+/- 4,128,378)
  tiny-fragment.html                    ... bench:         634 ns/iter (+/- 26)

With patch:

  lipsum-zh.html size    1024           ... bench:       2,427 ns/iter (+/- 192)
  lipsum-zh.html size 1048576           ... bench:   2,463,618 ns/iter (+/- 190,855)
  lipsum.html size    1024              ... bench:         819 ns/iter (+/- 40)
  lipsum.html size 1048576              ... bench:   2,109,098 ns/iter (+/- 147,097)
  lipsum.html size 1048576 (clone only) ... bench:       5,418 ns/iter (+/- 293)
  medium-fragment.html                  ... bench:      62,560 ns/iter (+/- 822)
  small-fragment.html                   ... bench:       5,397 ns/iter (+/- 478)
  strong.html size    1024              ... bench:      51,712 ns/iter (+/- 13,471)
  strong.html size 1048576              ... bench:  53,146,206 ns/iter (+/- 2,641,773)
  tiny-fragment.html                    ... bench:         626 ns/iter (+/- 67)

These are very noisy, but there is still a clear win in a number of cases.

See below for more benchmark results.


Although complicated to implement, I'd assume that it's going to be faster than byte-by-byte iteration no matter what level of optimizations we implement here (even assuming this lands).

That is, it seems like the fastest solution would still not rely on this optimization at all.

Of course, code specialized and optimized exactly for its use case is going to be faster. But I am hoping we can get 80% of the win with 20% (or ideally, 0%) of the effort on behalf of the user. Also, with “processing it in an optimized fashion”, keep in mind that even if you have a buffer it is not always possible to do better than reading byte by byte. In the byte count example I posted before you could do something like memchr, but sometimes the data is just inherently serial, which is the case with many compression formats for example.

Do you have an example program perhaps though that I could test out this assumption on?

What kind of example do you want? There is the toy example I posted earlier in this thread. For a more real-world example, take a look at the benchmarks of html5ever.

I also did exactly this (replace most of std::io with a custom reader abstraction) in the library I was working on. You can use the bench_decode example in commit 8305e9e (before my custom reader) and in 60965d1 (with my custom reader).

Here are the timing results on my machine (lower is better):

|                      | Standard library io::Cursor | Custom reader |
|----------------------|-----------------------------|---------------|
| Rustc 1.14 cae6ab1c4 | 20.1 ± 0.6 ns               | 16.2 ± 0.3 ns |
| With patch           | 16.7 ± 0.4 ns               | 16.4 ± 0.4 ns |

So here indeed this patch (which requires no change to the code at all) is responsible for 80% of the win. A fairly invasive change to the entire codebase was slightly faster, but not by a lot.

That'd be expected, there's strictly more conditional branches and icache waste.

That is true for programs that heavily use copy_from_slice, but in the microbenchmark the regressions are not due to icache misses: the loop code is very small and easily fits in L1i even with the extra instructions.

@arielb1 (Contributor) commented Nov 12, 2016

This looks like an LLVM bug to me:

You see this in optimized output:

  %_0.0.sroa.speculated.i.i.i.i = zext i1 %116 to i64
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull %37, i8* nonnull %115, i64 %_0.0.sroa.speculated.i.i.i.i, i32 1, i1 false), !noalias !65

A memcpy where the length is either a constant or 0 should be optimized to a conditional load, methinks.

@eddyb (Member) commented Nov 12, 2016

@arielb1 Introducing conditionals is not exactly something LLVM does lightly - or do you mean in the target code generation? It still seems like a non-trivial optimization, even if a useful one.

@arielb1 (Contributor) commented Nov 12, 2016

Filed bug https://llvm.org/bugs/show_bug.cgi?id=31001 - let's see how it goes.

@mrhota (Contributor) commented Nov 14, 2016

@ruuda What if you run the compare script from rustc-benchmarks? Is that an inappropriate test?

@Mark-Simulacrum (Member)

The rustc-benchmarks compare script may show some results, but it only tests compile times, which are the runtimes of the compiler, so it is sort of a valid test. I wouldn't base anything of this scale off of the results, though.

@ruuda (Contributor, Author) commented Nov 14, 2016

@mrhota I don’t know, I’ll see if I can run that and get some results. Note that now this PR should only affect code that reads single bytes, which is a very specific kind of IO. Perhaps the parser does that (I don’t know), but then again only a tiny amount of time is spent parsing, so I do not expect to see a significant difference in overall timings.

During benchmarking, I found that one of my programs spent between 5 and
10 percent of the time doing memmoves. Ultimately I tracked these down
to single-byte slices being copied with a memcopy in io::Cursor::read().
Doing a manual copy if only one byte is requested can speed things up
significantly. For my program, this reduced the running time by 20%.

Why special-case only a single byte, and not a "small" slice in general?
I tried doing this for slices of at most 64 bytes and of at most 8
bytes. In both cases my test program was significantly slower.
Ultimately copy_from_slice is being a bottleneck, not io::Cursor::read.
It might be worthwhile to move the check here, so more places can
benefit from it.
Based on the discussion in rust-lang#37573,
it is likely better to keep this limited to std::io, instead of
modifying a function which users expect to be a memcpy.
@ruuda (Contributor, Author) commented Nov 30, 2016

I rebased this on top of 8e373b4 and ran rustc-benchmarks (scaling governor set to powersave to minimize variance):

futures-rs-test 10.079s vs  9.993s --> 1.009x faster (variance: 1.003x, 1.003x)
helloworld       0.383s vs  0.383s --> 0.998x faster (variance: 1.007x, 1.003x)
html5ever-2016- 13.651s vs 13.488s --> 1.012x faster (variance: 1.002x, 1.001x)
hyper.0.5.0     12.311s vs 12.179s --> 1.011x faster (variance: 1.003x, 1.005x)
inflate-0.1.0   10.475s vs 10.395s --> 1.008x faster (variance: 1.002x, 1.002x)
issue-32062-equ  0.565s vs  0.568s --> 0.995x faster (variance: 1.006x, 1.003x)
issue-32278-big  4.036s vs  4.035s --> 1.000x faster (variance: 1.007x, 1.007x)
jld-day15-parse  3.195s vs  3.196s --> 1.000x faster (variance: 1.013x, 1.002x)
piston-image-0. 27.955s vs 27.721s --> 1.008x faster (variance: 1.008x, 1.001x)
regex-0.1.80    22.128s vs 21.978s --> 1.007x faster (variance: 1.005x, 1.004x)
regex.0.1.30     5.913s vs  5.850s --> 1.011x faster (variance: 1.002x, 1.006x)
rust-encoding-0  4.851s vs  4.808s --> 1.009x faster (variance: 1.002x, 1.003x)
syntex-0.42.2    0.137s vs  0.135s --> 1.012x faster (variance: 2.407x, 1.020x)
tuple-stress     8.931s vs  8.911s --> 1.002x faster (variance: 1.011x, 1.007x)

It looks like compiling a few crates got a percent faster, and for the others there is no significant effect. Which is expected, because I don’t think rustc heavily reads single bytes.


With no activity on the LLVM bug for three weeks, what do you all think about merging this in the meantime?

@alexcrichton (Member)

Oh sorry, I think I missed the update that pushed this into io. That seems like a fine change to me (especially as it affects the semantics of how the I/O code calls it).

@bors: r+

@bors (Contributor) commented Dec 1, 2016

📌 Commit 3be2c3b has been approved by alexcrichton

@bors (Contributor) commented Dec 1, 2016

⌛ Testing commit 3be2c3b with merge 070fad1...

bors added a commit that referenced this pull request Dec 1, 2016
Add small-copy optimization for copy_from_slice

Labels
T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.