Fix massive performance issue in read_to_end #23820

Merged
merged 1 commit into rust-lang:master from sfackler:fast_read_to_end on Mar 29, 2015

Conversation

@sfackler
Member

sfackler commented Mar 29, 2015

with_end_to_cap is enormously expensive now that it's initializing
memory, since it involves a 64k allocation + memset on every call. This is
most noticeable when calling read_to_end on very small readers, where the
new version is **4 orders of magnitude** faster.

BufReader also depended on with_end_to_cap so I've rewritten it in its
original form.

As a bonus, converted the buffered IO struct Debug impls to use the
debug builders.

I first came across this in sfackler/rust-postgres#106 where a user reported a 10x performance regression. A call to read_to_end turned out to be the culprit: sfackler/rust-postgres@9cd413d.

The new version differs from the old in a couple of ways. The buffer size used is now adaptive: it starts at 32 bytes and doubles each time EOF hasn't been reached, up to a limit of 64k. In addition, the buffer is only truncated once EOF or an error has been reached, rather than after every call to read as was the case for the old implementation.
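As a rough sketch of that growth strategy (illustrative only — the function name and the use of `resize` here are mine, not the exact code in the patch):

```rust
use std::io::{self, Read};

// Sketch: grow the zero-initialized tail of `buf` adaptively, starting
// small and doubling the growth step up to a 64k cap, and only truncate
// the unused tail once EOF or an error is hit.
fn read_to_end_sketch<R: Read>(r: &mut R, buf: &mut Vec<u8>) -> io::Result<usize> {
    const MAX_STEP: usize = 64 * 1024;
    let start_len = buf.len();
    let mut len = start_len;
    let mut step = 32;

    loop {
        if len == buf.len() {
            // Out of initialized space: extend (and zero) the buffer so
            // `read` has a valid slice to write into.
            buf.resize(len + step, 0);
            if step < MAX_STEP {
                step *= 2;
            }
        }
        match r.read(&mut buf[len..]) {
            Ok(0) => {
                buf.truncate(len);
                return Ok(len - start_len);
            }
            Ok(n) => len += n,
            Err(ref e) if e.kind() == io::ErrorKind::Interrupted => {}
            Err(e) => {
                buf.truncate(len);
                return Err(e);
            }
        }
    }
}
```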

I wrote up a benchmark to compare the old version and new version: https://gist.github.com/sfackler/e979711b0ee2f2063462

It tests a few different cases: a high bandwidth reader, a low bandwidth reader, and a low bandwidth reader that won't return more than 10k per call to `read`. The high bandwidth reader should be analogous to reading from e.g. a `BufReader` or `Vec`, and the low bandwidth readers should be analogous to reading from something like a `TcpStream`.
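For a sense of what the third case looks like, a hypothetical reader that hands back at most 10k per call to `read` (this type is my illustration, not part of the linked benchmark):

```rust
use std::io::{self, Read};

/// Hypothetical reader that serves data from an in-memory buffer but
/// hands back at most `cap` bytes per call to `read`, mimicking a
/// stream that delivers data in bounded chunks.
struct CappedReader<'a> {
    data: &'a [u8],
    cap: usize, // e.g. 10 * 1024 for the 10k case above
}

impl<'a> Read for CappedReader<'a> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.data.len().min(buf.len()).min(self.cap);
        buf[..n].copy_from_slice(&self.data[..n]);
        self.data = &self.data[n..];
        Ok(n)
    }
}
```

Calling read_to_end on such a reader forces many small read calls, which is exactly where the old per-call 64k allocation + memset hurt the most.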

Of special note, reads from a high bandwidth reader containing 4 bytes are now *4,495 times faster*.

```
~/foo ❯ cargo bench
   Compiling foo v0.0.1 (file:///home/sfackler/foo)
     Running target/release/foo-7498d7dd7faecf5c

running 13 tests
test test_new ... ignored
test new_delay_4      ... bench:    230768 ns/iter (+/- 14812)
test new_delay_4_cap  ... bench:    231421 ns/iter (+/- 7211)
test new_delay_5m     ... bench:  14495370 ns/iter (+/- 4008648)
test new_delay_5m_cap ... bench:  73127954 ns/iter (+/- 59908587)
test new_nodelay_4    ... bench:        83 ns/iter (+/- 2)
test new_nodelay_5m   ... bench:  12527237 ns/iter (+/- 335243)
test std_delay_4      ... bench:    373095 ns/iter (+/- 12613)
test std_delay_4_cap  ... bench:    374190 ns/iter (+/- 19611)
test std_delay_5m     ... bench:  17356012 ns/iter (+/- 15906588)
test std_delay_5m_cap ... bench: 883555035 ns/iter (+/- 205559857)
test std_nodelay_4    ... bench:    144937 ns/iter (+/- 2448)
test std_nodelay_5m   ... bench:  16095893 ns/iter (+/- 3315116)

test result: ok. 0 passed; 0 failed; 1 ignored; 12 measured
```

r? @alexcrichton

```diff
-    buf: Cursor<Vec<u8>>,
+    buf: Vec<u8>,
+    pos: usize,
+    cap: usize,
```

@alexcrichton

alexcrichton Mar 29, 2015

Member

I personally tried to keep this as a Cursor<Vec<u8>> because it felt like it was duplicating the functionality otherwise, but I think that would require unsafe calls to set_len, which is somewhat worrisome.

I think that the buf could be represented as Box<[u8]> as well (as the capacity isn't necessary, just the length), but I also think it's fine for now.

@sfackler

sfackler Mar 29, 2015

Author Member

Yeah I tried to think of ways to keep it as a Cursor but nothing really seemed right, especially since Cursor<Vec<u8>> extends the vec on its own.

@alexcrichton

alexcrichton Mar 29, 2015

Member

Yeah I'm fine by this :)
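A rough sketch of how the `pos`/`cap` bookkeeping from the diff above can stand in for the `Cursor` (illustrative names only, not the merged code):

```rust
use std::io::{self, Read};

// Sketch: a plain byte buffer plus explicit indices. `pos..cap` is the
// window of bytes that has been filled from the inner reader but not
// yet consumed.
struct SketchBufReader<R> {
    inner: R,
    buf: Vec<u8>, // allocated once at the chosen buffer size
    pos: usize,   // next buffered byte to hand out
    cap: usize,   // number of valid bytes currently in `buf`
}

impl<R: Read> SketchBufReader<R> {
    fn fill_buf(&mut self) -> io::Result<&[u8]> {
        if self.pos == self.cap {
            // Buffer exhausted: refill from the underlying reader.
            self.cap = self.inner.read(&mut self.buf)?;
            self.pos = 0;
        }
        Ok(&self.buf[self.pos..self.cap])
    }

    fn consume(&mut self, amt: usize) {
        self.pos = (self.pos + amt).min(self.cap);
    }
}
```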

```
Ok(0) => return Ok(read),
Ok(n) => read += n,
```

```
match r.read(&mut buf[len..]) {
```

@alexcrichton

alexcrichton Mar 29, 2015

Member

I realize now that if this panics the length of buf is probably "wrong", but I also think that's ok: with a &mut Vec<u8> we're guaranteed no one else is reading it (except possibly from another thread), so it's not really the end of the world.

@alexcrichton

alexcrichton Mar 29, 2015

Member

Note that with_end_to_cap also suffered this problem.

@sfackler

sfackler Mar 29, 2015

Author Member

Yeah, I thought about having a guard fix the length but it seems not worth it. Worst case you see some extra 0 bytes.
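The guard idea being declined here would look roughly like this (an illustrative sketch only; the PR intentionally does not include it):

```rust
// Illustrative sketch of the rejected guard: on any exit path,
// including unwinding from a panic, the guard truncates the Vec back
// to the bytes actually read, so the zeroed tail is never observed.
struct TruncateGuard<'a> {
    buf: &'a mut Vec<u8>,
    len: usize, // updated by the read loop after each successful read
}

impl<'a> Drop for TruncateGuard<'a> {
    fn drop(&mut self) {
        self.buf.truncate(self.len);
    }
}
```

The read loop would then access the buffer through the guard and bump its `len` after each successful read, at the cost of some extra indirection — which is the complexity judged not worth it here.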

@alexcrichton
Member

alexcrichton commented Mar 29, 2015

Very nice wins! Could you add some comments about the rationale of the growing behavior to the read_to_end function? Other than that r=me

Fixes #23815
@sfackler sfackler force-pushed the sfackler:fast_read_to_end branch from ed971cb to ccb4e84 Mar 29, 2015
@sfackler

Member Author

sfackler commented Mar 29, 2015

@bors r=alexcrichton ccb4e84

@bors

Contributor

bors commented Mar 29, 2015

⌛️ Testing commit ccb4e84 with merge 867fcd8...

bors added a commit that referenced this pull request Mar 29, 2015
@bors

Contributor

bors commented Mar 29, 2015

💔 Test failed - auto-mac-64-nopt-t

```
buf.reserve(DEFAULT_BUF_SIZE);
if len == buf.len() {
if buf.capacity() == buf.len() {
if cap_bump < DEFAULT_BUF_SIZE {
```

@ghost

ghost Mar 29, 2015

Why have a maximum capacity increase at all? This makes the function O(number of bytes squared) instead of O(number of bytes). As the memory that needs to be allocated gets larger than 64k, this is going to do a lot of unnecessary reallocs.

@sfackler

sfackler Mar 29, 2015

Author Member

All reserve does is ensure that the Vec has at least that much space. It's free to reallocate with whatever size it feels is best internally: http://doc.rust-lang.org/collections/vec/struct.Vec.html#method.reserve
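To make that concrete, a small demonstration (the exact capacities printed are an implementation detail and are not guaranteed):

```rust
fn main() {
    // `reserve(n)` only guarantees space for at least `n` more elements;
    // the capacity actually chosen is up to the Vec (today it is rounded
    // up, roughly doubling, to amortize repeated growth).
    let mut v: Vec<u8> = Vec::with_capacity(32);
    v.extend_from_slice(&[0; 32]);
    v.reserve(1);
    println!("after reserve(1):       capacity = {}", v.capacity());

    // `reserve_exact(n)` requests only the space asked for, so any
    // amortization policy has to live in the caller.
    let mut w: Vec<u8> = Vec::with_capacity(32);
    w.extend_from_slice(&[0; 32]);
    w.reserve_exact(1);
    println!("after reserve_exact(1): capacity = {}", w.capacity());
}
```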

@sfackler

sfackler Mar 29, 2015

Author Member

There's a cap because the memset is relatively expensive. We expect about half of the last vec extension to be unused after the last part of the stream is read. As is, that's 32k of wasted work, but if there was no cap it could be 10s or 100s of megabytes.

@ghost

ghost Mar 29, 2015

Looking at the implementation of reserve() I think there are two separate issues.

1. It looks like reserve() actually always rounds up to the nearest power of two. Because of that, the algorithm is basically equivalent to my suggestion of always doubling the capacity of the vector, and so is O(number of bytes). However, this is not guaranteed by the reserve() API, and thus a change to reserve that keeps the API consistent could make this algorithm O(n^2) (or maybe worse).

   - If the exact behavior of reserve is meant to be part of the API (and documented as such), then this algorithm as written is more complex than necessary, and can be made equivalent by always reserving 16 when the capacity equals the length.

   - However, I expect this behavior is not meant to be a guarantee of the API. Therefore, I think the doubling should probably be coded here directly with reserve_exact to make sure the algorithm is always O(number of bytes).

2. > As is, that's 32k of wasted work, but if there was no cap it could be 10s or 100s of megabytes

   Not so. Let's pretend we start with a vector that has 32k capacity and size, and need to read in 128 megabytes + 1 byte (so we are doing the maximum wasted work for the doubling algorithm). Let's assume that copying (i.e. a realloc) and memsetting are equally expensive.

   For the doubling algorithm, we need to copy and memset on (64k + 128k + 256k + 512k + ... + 64 MB + 128 MB + 256 MB) = 256(1 + 1/2 + 1/4 + ...) MB ~= 512 MB.

   For the increase-by-32k algorithm, we need to copy and memset on (64k + 96k + 128k + 160k + ... + (128 MB - 32k) + 128 MB + (128 MB + 32k)) = 64(1 + 2 + 3 + ... + 2000 + 2001) kilobytes = 64 * (2001 * 2002) / 2 kilobytes ~= 128e6 kilobytes = 128 gigabytes(!)

   Being O(n^2), this behavior gets even worse for larger amounts of data, and 128 MB is not at all an unusual amount of data to be working with.
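The asymptotic comparison above can be sanity-checked with a rough model like the following (my simplification: each growth step re-touches the whole buffer once, and the fixed increment is taken to be the patch's 64k cap):

```rust
// Rough model: count the bytes memset/copied while growing a buffer
// up to 128 MB under two policies.
fn main() {
    let target: u64 = 128 * 1024 * 1024; // 128 MB

    // Doubling: 64k -> 128k -> ... -> 128 MB, touching the whole new
    // buffer at each step. The total stays within a small constant
    // factor of the final size.
    let mut size: u64 = 64 * 1024;
    let mut doubling_total: u64 = 0;
    while size < target {
        size *= 2;
        doubling_total += size;
    }

    // Fixed 64k increments: 128k, 192k, ..., 128 MB. The total grows
    // quadratically, landing near the ~128 GB figure quoted above.
    let mut size: u64 = 64 * 1024;
    let mut fixed_total: u64 = 0;
    while size < target {
        size += 64 * 1024;
        fixed_total += size;
    }

    println!("doubling:        ~{} MB touched", doubling_total / (1024 * 1024));
    println!("fixed 64k bumps: ~{} GB touched", fixed_total / (1024 * 1024 * 1024));
}
```

This models the worst case for reserve_exact-style growth; as discussed below, the current reserve() happens to round capacities up, which is why the quadratic blowup doesn't show up in practice.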

@sfackler

sfackler Mar 29, 2015

Author Member

I'm not sure if I fully understand what you mean. I'm not able to reproduce the numbers I would expect to see if there was indeed a 250x difference in memory writes: https://gist.github.com/sfackler/81f8a7ac614e19f96cc8

@ghost

ghost Mar 29, 2015

See point 1. Due to the non-obvious and undocumented way in which reserve() decides how much memory to actually reserve, the current implementations of reserve() and read_to_end() just happen to interact in such a way that the behavior is equivalent to doubling the size of the vector at each step. If you replace reserve() here with reserve_exact() you should see the behavior I am talking about.

@ghost

ghost Mar 29, 2015

You can see it with this example:

```rust
use std::io::Read;

fn main() {
    let buf = [1u8; 1024 * 1024 + 2];
    let mut src = &buf[..];
    let mut dst = Vec::with_capacity(32);
    src.read_to_end(&mut dst).unwrap();
    println!("{}", dst.capacity());
}
```

If it were allocating 64k at a time, it would print 1024 * 1024 + 64 * 1024 = 1114112. Instead it prints 1024 * 1024 * 2 = 2097152.

This is why I think the doubling behavior should be made explicit with reserve_exact(); it is not at all obvious behavior and is not mentioned or guaranteed by the API.
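A sketch of what making the doubling explicit might look like (hypothetical code, not something in the PR):

```rust
// Hypothetical: put the doubling policy in read_to_end itself and use
// reserve_exact, so the behavior no longer depends on how reserve()
// happens to round capacities.
fn grow_explicitly(buf: &mut Vec<u8>) {
    if buf.len() == buf.capacity() {
        // Double the capacity (with a small floor), asking for exactly
        // that much additional space.
        let additional = std::cmp::max(buf.capacity(), 32);
        buf.reserve_exact(additional);
    }
}
```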

```
let mut read = 0;
let start_len = buf.len();
let mut len = start_len;
let mut cap_bump = 16;
```

@ghost

ghost Mar 29, 2015

Seems like this should be buf.capacity() rather than a constant. Otherwise, if the vector is large, you are going to be reallocating it a lot unnecessarily once the read reaches its capacity.

I think the capacity increase behavior in this function should be identical to what would happen if we used buf.push() on every byte in r (i.e. it simply doubles the buf's capacity every time it is filled to capacity). This is probably what the user will be expecting, and it allows them to be more flexible by reading in data of varying sizes from multiple sources and having it be efficient in every case.

@alexcrichton

Member

alexcrichton commented Mar 29, 2015

@bors: retry

@bors

Contributor

bors commented Mar 29, 2015

⌛️ Testing commit ccb4e84 with merge 92f3d9a...

bors added a commit that referenced this pull request Mar 29, 2015
@ghost ghost mentioned this pull request Mar 29, 2015
@bors bors merged commit ccb4e84 into rust-lang:master Mar 29, 2015
2 checks passed
continuous-integration/travis-ci/pr: The Travis CI build passed
homu: Test successful
@sfackler sfackler deleted the sfackler:fast_read_to_end branch Nov 26, 2016