
std::io::copy performance may be 20x slower than it could be? #49921

Closed
spease opened this issue Apr 13, 2018 · 18 comments · Fixed by #78641
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@spease

spease commented Apr 13, 2018

I was using std::io::copy from an SSD to memory through a ProgressBarReader, which reported an eyebrow-raising 30 MB/s. Wrapping it in a BufReader (either as the outer or middle layer) made no real difference.

So I googled around a bit and found the recommended size was 64KB according to some random Microsoft article, so I decided to copy-paste the std::io::copy implementation and increase the buffer size. This got up to 600 MB/s at about 256KB which sounded a lot more reasonable...then it continued going upwards from there past the point of believability (unless the files were being pulled from RAM as a consequence of my repeated accesses). With a 10MB Vec, I got 228.36 GB/s 🙃.
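
Roughly the shape of the modified copy (a sketch; copy_with_buf and the sizes here are illustrative, not std's actual code):

use std::io::{self, Read, Write};

// Same loop as std::io::copy, but with a caller-chosen buffer size
// instead of the fixed default.
fn copy_with_buf<R: Read, W: Write>(reader: &mut R, writer: &mut W, buf_size: usize) -> io::Result<u64> {
    let mut buf = vec![0u8; buf_size]; // e.g. 256 * 1024
    let mut written = 0u64;
    loop {
        let len = match reader.read(&mut buf) {
            Ok(0) => return Ok(written),
            Ok(len) => len,
            Err(ref e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        writer.write_all(&buf[..len])?;
        written += len as u64;
    }
}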

Unrealistic values aside, has anybody checked the buffer value to make sure there isn't some huge low-hanging fruit for std::io::copy speeds?

@kennytm kennytm added I-slow Issue: Problems and improvements with respect to performance of generated code. C-enhancement Category: An issue proposing an enhancement or a PR with one. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Apr 13, 2018
@the8472
Member

the8472 commented Apr 14, 2018

You mention Windows. This may be platform-specific behavior. On Unix systems I would call posix_fadvise with POSIX_FADV_SEQUENTIAL on the input file before copying. Googling tells me that the Windows equivalent is FILE_FLAG_SEQUENTIAL_SCAN. Have you tried that?
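
For reference, a minimal sketch of that hint on Linux (assuming the libc crate; advise_sequential is just an illustrative helper, not part of std):

use std::fs::File;
use std::io;
use std::os::unix::io::AsRawFd;

// Hint the kernel that we'll read `file` sequentially from start to end.
fn advise_sequential(file: &File) -> io::Result<()> {
    // offset = 0, len = 0 means "the whole file".
    let ret = unsafe { libc::posix_fadvise(file.as_raw_fd(), 0, 0, libc::POSIX_FADV_SEQUENTIAL) };
    if ret == 0 { Ok(()) } else { Err(io::Error::from_raw_os_error(ret)) }
}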

@Voultapher
Contributor

After a coarse overview, my intuition is that there is some funny business going on with the length determined by seek and the subsequent inc calls on the ProgressBar, which could explain the unrealistic results.

@spease
Author

spease commented Jun 10, 2018

It’s possible. The actual times did increase dramatically from the std::io::copy default, though - my suspicion is that there is probably some caching to RAM going on, as cold runs were generally slower than hot runs.

This was for some work code that I’m no longer actively developing. I’ll leave this email marked as unread for a while in case I have some spare time to write a benchmark.

I don’t think the other flag that was mentioned is accessible with the default open-file API, but in any case it at least appears there’s some low-hanging fruit with respect to std::io::copy perf.

@ljedrz
Contributor

ljedrz commented Aug 1, 2018

I tried benchmarking io::copy on Windoze before and after changing DEFAULT_BUF_SIZE to 64KB (instead of the original 8KB) and the results were slightly worse (up to ~3%) with the modified setup, both on HDD and SSD. However, I'm not sure if my benchmarks were good enough - I was having a hard time tweaking them so that I could see any difference between results for different file sizes.

FWIW, I'm providing the benchmark code below; it can easily be expanded to different file sizes:

// Requires #![feature(test)] and `extern crate test;` at the crate root.
use std::fs::File;
use std::io::{self, Seek, SeekFrom};
use test::Bencher;

#[bench]
fn bench_copy_4MB(b: &mut Bencher) {
    let mut file = test::black_box(File::open("4MB").unwrap());
    let mut writer: Vec<u8> = Vec::with_capacity(4 * 1024 * 1024);

    b.iter(|| {
        let mut x = Ok(0);
        for _ in 0..100 {
            // Rewind so every copy reads the full file instead of hitting EOF.
            file.seek(SeekFrom::Start(0)).unwrap();
            x = io::copy(&mut file, &mut writer);
            writer.clear();
        }
        x
    })
}

@spease
Author

spease commented Aug 1, 2018

4MB isn’t a whole lot. That’s well within the size of hard drive caches or possibly an OS memory-based cache. I was noticing the difference with files several gigabytes in size.

As a sanity check, maybe look at what the actual time is, not just the % difference between timings with different buffer sizes. The actual time may be wildly different from the steady-state performance of the disk if the data is being read from a cache.

It may also be possible at a low level to pass a flag to encourage the OS to disable or flush the cache, but I don’t know if there’s any way to instruct the hard drive to do the same.

@ljedrz
Contributor

ljedrz commented Aug 1, 2018

I only pasted the benchmark for 4MB as a reference - I tested files from 4KB up to 16MB. I'm not an expert with low-level hard drive stuff, just hoping my attempt can help someone else.

@retep998
Member

retep998 commented Aug 6, 2018

Ideally you would test with a file approximately twice as big as you have RAM, so if you have 8GB of RAM you would want to test with a 16GB file.

@rbtcollins
Contributor

@spease extended open flags are accessible via https://doc.rust-lang.org/std/os/windows/fs/trait.OpenOptionsExt.html
FWIW, lots of benchmarks exist showing that synchronous IO calls improve when the IO operations match the underlying block size (e.g. 4MB for modern SSDs) across multiple OSes, so it's very likely that there are improvements to gain here. However, 4MB is a stack-breaking size by default, and heap allocations will destroy potential improvements (and further, if we're optimising, we probably want overlapped IO and a ring of buffers rather than just one buffer, which will always leave one device idle)... so it's complex ;)
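
A minimal sketch of setting that flag on Windows (the flag value is from the Windows SDK headers; the path is a placeholder):

use std::fs::OpenOptions;
use std::io;
use std::os::windows::fs::OpenOptionsExt;

// FILE_FLAG_SEQUENTIAL_SCAN, as defined in the Windows SDK headers.
const FILE_FLAG_SEQUENTIAL_SCAN: u32 = 0x0800_0000;

fn main() -> io::Result<()> {
    let _file = OpenOptions::new()
        .read(true)
        .custom_flags(FILE_FLAG_SEQUENTIAL_SCAN)
        .open("big_input.bin")?;
    Ok(())
}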

@main--
Contributor

main-- commented Apr 11, 2020

heap allocations will destroy potential improvements

How come? io::copy would allocate once at the beginning. Heap memory isn't inherently slower, and one memory allocation is basically free compared to the heavy IO that happens afterwards.

@retep998
Member

But we also don't want to be allocating 4MB every time we do io::copy, as someone might be doing a bunch of small copies.

@the8472
Member

the8472 commented Apr 11, 2020

Possible approaches:

  • thread-local buffers or a buffer pool
  • stat the input before deciding on the buffer size (this is what GNU cp does, I think)
  • heap-allocate a bigger buffer after a few iterations of the copy loop with an on-stack buffer
  • specialize copy operations from file to memory to read directly into the already-allocated target, thus skipping the need for a buffer (sketched below)
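
A minimal sketch of that last idea (hypothetical; copy_file_to_vec is not an actual std API):

use std::fs::File;
use std::io::{self, Read};

// Hypothetical fast path: the destination is already-allocated memory,
// so read straight into it and skip the intermediate copy buffer.
fn copy_file_to_vec(file: &mut File, dst: &mut Vec<u8>) -> io::Result<u64> {
    let n = file.read_to_end(dst)?;
    Ok(n as u64)
}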

@main--
Contributor

main-- commented Apr 11, 2020

specialize copy operations from file to memory to read directly into the already allocated target, thus skipping the need for a buffer

I think some kind of specialization for BufRead sources would be ideal. Because in this case, io::copy should not buffer at all and simply use the reader's buffer. Effectively this also gives you a customizable buffer size by wrapping your File in a BufReader with whatever size you want.

As of today, if your source is BufRead then io::copy is always the wrong answer. I got a substantial performance improvement using a simple loop instead:

use std::io::{self, BufRead, Write};

fn copy_from_bufread(reader: &mut impl BufRead, file_out: &mut impl Write) -> io::Result<()> {
    loop {
        let buf = reader.fill_buf()?; // borrow the reader's internal buffer: no extra copy
        if buf.is_empty() { break; }
        let consumed = file_out.write(buf)?;
        reader.consume(consumed);
    }
    Ok(())
}

@rbtcollins
Contributor

@main-- a single heap allocation is entirely reasonable for a single invocation of copy on a lot of data. For many copies of small amounts of data, one ends up with a lot of allocations again (consider for instance copying data out of a tar, where we have views on the data and there's no actual synchronous IO etc. happening).

I like some of the suggested possibilities with specialisation.

@the8472
Member

the8472 commented Apr 13, 2020

I think some kind of specialization for BufRead sources would be ideal.

It's a nice general optimization, but not necessarily ideal if you use a Vec as the Write. Adding a BufRead on the reader side does let you control the buffer size, but it still introduces an unnecessary memcpy compared to File::read_to_end.

Perhaps a writer-side optimization would be more flexible. Then you can specialize either for BufWriter or Vec.
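
To illustrate the difference (a sketch; the path is a placeholder):

use std::fs::File;
use std::io::{self, BufReader, Read};

fn main() -> io::Result<()> {
    // Direct: data goes kernel -> Vec, one copy.
    let mut direct = Vec::new();
    File::open("input.bin")?.read_to_end(&mut direct)?;

    // Via BufReader + io::copy: even with a reader-side specialization,
    // data goes kernel -> BufReader's buffer -> Vec, an extra memcpy.
    let mut via_copy = Vec::new();
    io::copy(&mut BufReader::new(File::open("input.bin")?), &mut via_copy)?;
    Ok(())
}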

@main--
Contributor

main-- commented Apr 14, 2020

That sounds rather uncommon to me, though. After all, read_to_end is provided by the Read trait, so it's always available. The only case I can see where you would invoke io::copy with a Vec as destination is some generic interface that takes any W: Write. But IMO such interfaces are ill-designed, as they could simply implement Read instead and offer a much more flexible API.

@the8472
Member

the8472 commented Apr 14, 2020

Yes, generic code taking a Write was the concern. E.g. tar-rs has a tar archive Builder which takes a Write and is fed with Read items. I'm not saying that creating an in-memory archive would be a good idea, just pointing out that these patterns do exist.

@Niketin

Niketin commented Jun 26, 2020

I had the same kind of issue but with a different library (indicatif). The copy rate was around 70 MB/s (within a single NVMe SSD). The problem was actually that I was running the debug version of my program. When I built and ran the release version, the speed jumped to over 1 GB/s.

henryboisdequin pushed a commit to henryboisdequin/rust that referenced this issue Feb 1, 2021
Let io::copy reuse BufWriter buffers

This optimization will allow users to implicitly set the buffer size for io::copy by wrapping the writer into a `BufWriter` if the default block size is insufficient, which should fix rust-lang#49921

Due to min_specialization limitations this approach only works with `BufWriter` but not for `BufReader<R>` since `R` is unconstrained and thus the necessary specialization on `R: Read` is not always applicable. Once specialization becomes more powerful this optimization could be extended to look at the reader and writer side and use whichever buffer is larger.
@bors bors closed this as completed in a7a6f01 Feb 1, 2021
@the8472
Member

the8472 commented Feb 1, 2021

With #78641 merged you can now (on the next nightly) use BufWriter to control the chunk size of io::copy.
Since this relies on specialization, it only works if you pass a BufWriter directly; any other wrapper around it would disable this optimization.
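
For example (a sketch; the file names are placeholders):

use std::fs::File;
use std::io::{self, BufWriter};

fn main() -> io::Result<()> {
    let mut reader = File::open("src.bin")?;
    // io::copy will write through this 1 MiB buffer, so it effectively
    // copies in 1 MiB chunks instead of the 8 KiB default.
    let mut writer = BufWriter::with_capacity(1 << 20, File::create("dst.bin")?);
    io::copy(&mut reader, &mut writer)?;
    writer.into_inner()?; // flush whatever is still buffered
    Ok(())
}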
