Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance regression in 1.64+ when BufReader inner reader doesn't fill the buffer #102727

Closed
bgilbert opened this issue Oct 6, 2022 · 8 comments · Fixed by #102760
Closed
Assignees
Labels
C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. P-critical Critical priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@bgilbert
Copy link

bgilbert commented Oct 6, 2022

Code

I tried this code:

use std::cmp::min;
use std::io;

const TOTAL_BYTES: usize = 16 * 1024 * 1024;
// runtime increases as the difference between these values increases
const BYTES_PER_READ: usize = 8 * 1024;
const BUFREADER_CAPACITY: usize = 256 * 1024;

fn main() {
    let mut reader = io::BufReader::with_capacity(BUFREADER_CAPACITY, Reader(TOTAL_BYTES));
    let count = io::copy(&mut reader, &mut io::sink()).unwrap();
    println!("copied {count} bytes");
}

struct Reader(usize);

impl io::Read for Reader {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let len = min(min(buf.len(), self.0), BYTES_PER_READ);
        if len > 0 {
            buf[..len].copy_from_slice(&vec![0; len]);
            self.0 -= len;
        }
        Ok(len)
    }
}

I expected to see this happen:

$ rustc -o slowio slowio.rs
$ time ./slowio
copied 16777216 bytes

real	0m0.015s
user	0m0.011s
sys	0m0.003s

Instead, this happened:

$ rustc -o slowio slowio.rs
$ time ./slowio 
copied 16777216 bytes

real	0m8.909s
user	0m8.891s
sys	0m0.002s

The problem doesn't occur with optimization enabled.

Version it worked on

It most recently worked on:

rustc 1.63.0 (4b91a6ea7 2022-08-08)
binary: rustc
commit-hash: 4b91a6ea7258a947e59c6522cd5898e7c0a6a88f
commit-date: 2022-08-08
host: x86_64-unknown-linux-gnu
release: 1.63.0
LLVM version: 14.0.5

Version with regression

rustc 1.64.0 (Fedora 1.64.0-1.fc36)
binary: rustc
commit-hash: unknown
commit-date: unknown
host: x86_64-unknown-linux-gnu
release: 1.64.0
LLVM version: 14.0.5

Also fails on:

rustc 1.66.0-nightly (01af5040f 2022-10-04)
binary: rustc
commit-hash: 01af5040fdada6ef8f1b749cda798d80a8590b2c
commit-date: 2022-10-04
host: x86_64-unknown-linux-gnu
release: 1.66.0-nightly
LLVM version: 15.0.2

@rustbot modify labels: +regression-from-stable-to-stable -regression-untriaged

@bgilbert bgilbert added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels Oct 6, 2022
@rustbot rustbot added I-prioritize Issue: Indicates that prioritization has been requested for this issue. regression-from-stable-to-stable Performance or correctness regression from one stable version to another. and removed regression-untriaged Untriaged performance or correctness regression. labels Oct 6, 2022
@saethlin
Copy link
Member

saethlin commented Oct 6, 2022

Thanks for reporting this! This is an issue with a debug build so it's unlikely to be prioritized high but I'm interested in helping out because this is very goofy.

The slow code here is the Iterator impl on slices. I think in the last stable release we switched the implementation of BufReader to eventually use ReadBuf, and the slow code is ReadBuf initializing the remaining uninitialized capacity. It does this by iterating over a slice, and that iteration is the slow bit.

What I don't get here is why we need to initialize at all. That doesn't seem right.

@apiraino
Copy link
Contributor

apiraino commented Oct 6, 2022

WG-prioritization assigning highest priority since this looks a pretty visible runtime perf. regression (Zulip discussion). If the assessment should be revised, please feel free to comment.

@rustbot label -I-prioritize +P-critical i-slow

@rustbot rustbot added I-slow Issue: Problems and improvements with respect to performance of generated code. P-critical Critical priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Oct 6, 2022
@apiraino apiraino added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Oct 6, 2022
@apiraino
Copy link
Contributor

apiraino commented Oct 6, 2022

I've tried bisecting with cargo bisect and feeding this command: rustc main.rs && timeout 2 ./main.

Regression seems to be in 50166d5 and confirm #98748

bisected with cargo-bisect-rustc v0.6.0

Host triple: x86_64-unknown-linux-gnu
Reproduce with:

cargo bisect-rustc ./script.sh --start 2022-07-01 --preserve 

@saethlin
Copy link
Member

saethlin commented Oct 6, 2022

😂 I was so sure this wasn't my fault. More fool me for not bisecting it myself.

@pnkfelix
Copy link
Member

pnkfelix commented Oct 6, 2022

(This probably shows an opportunity to add a benchmark to the newly-established "runtime benchmarks" (rather than the compilation-time benchmarks that perf.rlo has classically focused on...)

@saethlin
Copy link
Member

saethlin commented Oct 6, 2022

After sleeping on this, I realized that there's no real reason to expect the worlds-worst-memset code which is the source of the slowdown to be totally optimized out with --release. And of course it isn't. This is also a substantial regression with optimizations. Here's a version of the original program that's touched up a bit:

use std::cmp::min;
use std::io;

const TOTAL_BYTES: usize = 16 * 1024 * 1024;
// runtime increases as the difference between these values increases
const BYTES_PER_READ: usize = 8 * 1024;
const BUFREADER_CAPACITY: usize = 256 * 1024;

fn main() {
    for _ in 0..1_000 {
        let mut reader = io::BufReader::with_capacity(BUFREADER_CAPACITY, Reader(TOTAL_BYTES));
        let count = io::copy(&mut reader, &mut io::sink()).unwrap();
        assert_eq!(count, 16777216)
    }
}

struct Reader(usize);

impl io::Read for Reader {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let len = min(min(buf.len(), self.0), BYTES_PER_READ);
        if len > 0 {
            buf[..len].copy_from_slice(&vec![0; len]);
            self.0 -= len;
        }
        Ok(len)
    }
}

On 1.63 I see runtime ~0.43 s, and on 1.64 ~4.8 s. So we have a slowdown of around 10x, though I suspect that number could be ratcheted up higher.

@saethlin
Copy link
Member

saethlin commented Oct 6, 2022

@rustbot claim

@bgilbert
Copy link
Author

bgilbert commented Oct 9, 2022

Awesome, thanks for the really quick fix!

Dylan-DPC added a commit to Dylan-DPC/rust that referenced this issue Oct 12, 2022
Use memset to initialize readbuf

The write loop was found to be slow in rust-lang#102727

The proper fix is in rust-lang#102760 but this might still help debug builds and code running under miri by using the write_bytes intrinsic instead of writing one byte at a time.
thomcc pushed a commit to tcdi/postgrestd that referenced this issue Feb 10, 2023
Avoid repeated re-initialization of the BufReader buffer

Fixes rust-lang/rust#102727

We accidentally removed this in rust-lang/rust#98748. It looks so redundant. But it isn't.

The default `Read::read_buf` will defensively initialize the whole buffer, if any of it is indicated to be uninitialized. In uses where reads from the wrapped `Read` impl completely fill the `BufReader`, `initialized` and `filled` are the same, and this extra member isn't required. But in the reported issue, the `BufReader` wraps a `Read` impl which will _never_ fill the whole buffer. So the default `Read::read_buf` implementation repeatedly re-initializes the extra space in the buffer.

This adds back the extra `initialized` member, which ensures that the default `Read::read_buf` only zero-initialized the buffer once, and I've tried to add a comment which explains this whole situation.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. P-critical Critical priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants