
std::io::copy performance may be 20x slower than it could be? #49921

Closed
spease opened this issue Apr 13, 2018 · 18 comments · Fixed by #78641
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@spease

spease commented Apr 13, 2018

I was using std::io::copy from an SSD to memory through a ProgressBarReader, which reported an eyebrow-raising 30 MB/s. Wrapping it in a BufReader (either as the outer or middle layer) made no real difference.

So I googled around a bit and found the recommended size was 64KB according to some random Microsoft article, so I decided to copy-paste the std::io::copy implementation and increase the buffer size. This got up to 600 MB/s at about 256KB which sounded a lot more reasonable...then it continued going upwards from there past the point of believability (unless the files were being pulled from RAM as a consequence of my repeated accesses). With a 10MB Vec, I got 228.36 GB/s 🙃.
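
Roughly the shape of the modified copy (a sketch; copy_with_buf and the sizes here are illustrative, not std's actual code):

use std::io::{self, Read, Write};

// Same loop as std::io::copy, but with a caller-chosen buffer size
// instead of the fixed default.
fn copy_with_buf<R: Read, W: Write>(reader: &mut R, writer: &mut W, buf_size: usize) -> io::Result<u64> {
    let mut buf = vec![0u8; buf_size]; // e.g. 256 * 1024
    let mut written = 0u64;
    loop {
        let len = match reader.read(&mut buf) {
            Ok(0) => return Ok(written),
            Ok(len) => len,
            Err(ref e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        writer.write_all(&buf[..len])?;
        written += len as u64;
    }
}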

Unrealistic values aside, has anybody checked the buffer value to make sure there isn't some huge low-hanging fruit for std::io::copy speeds?

@kennytm kennytm added I-slow Issue: Problems and improvements with respect to performance of generated code. C-enhancement Category: An issue proposing an enhancement or a PR with one. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Apr 13, 2018
@the8472
Member

the8472 commented Apr 14, 2018

You mention Windows. This may be platform-specific behavior. On Unix systems I would call posix_fadvise with POSIX_FADV_SEQUENTIAL on the input file before copying. Googling tells me that the Windows equivalent is FILE_FLAG_SEQUENTIAL_SCAN. Have you tried that?
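
For reference, a minimal sketch of that hint on Linux (assuming the libc crate; advise_sequential is just an illustrative helper, not part of std):

use std::fs::File;
use std::io;
use std::os::unix::io::AsRawFd;

// Hint the kernel that we'll read `file` sequentially from start to end.
fn advise_sequential(file: &File) -> io::Result<()> {
    // offset = 0, len = 0 means "the whole file".
    let ret = unsafe { libc::posix_fadvise(file.as_raw_fd(), 0, 0, libc::POSIX_FADV_SEQUENTIAL) };
    if ret == 0 { Ok(()) } else { Err(io::Error::from_raw_os_error(ret)) }
}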

@Voultapher
Contributor

After a coarse overview, my intuition is that there is some funny business going on with the length determined by seek and the subsequent inc calls on the ProgressBar, which could explain the unrealistic results.

@spease
Author

spease commented Jun 10, 2018

It’s possible. The actual times did increase dramatically from the std::io::copy default, though - my suspicion is that there is probably some caching to RAM going on, as cold runs were generally slower than hot runs.

This was for some work code that I’m no longer actively developing. I’ll leave this email marked as unread for a while in case I have some spare time to write a benchmark.

I don’t think the other flag that was mentioned is accessible with the default open-file API, but in any case it at least appears there’s some low-hanging fruit with respect to std::io::copy perf.

@ljedrz
Contributor

ljedrz commented Aug 1, 2018

I tried benchmarking io::copy on Windoze before and after changing DEFAULT_BUF_SIZE to 64KB (instead of the original 8KB) and the results were slightly worse (up to ~3%) with the modified setup, both on HDD and SSD. However, I'm not sure if my benchmarks were good enough - I was having a hard time tweaking them so that I could see any difference between results for different file sizes.

FWIW, I'm providing the benchmark code below; it can easily be expanded to different file sizes:

// Requires #![feature(test)] and `extern crate test;` at the crate root.
use std::fs::File;
use std::io::{self, Seek, SeekFrom};
use test::Bencher;

#[bench]
fn bench_copy_4MB(b: &mut Bencher) {
    let mut file = test::black_box(File::open("4MB").unwrap());
    let mut writer: Vec<u8> = Vec::with_capacity(4 * 1024 * 1024);

    b.iter(|| {
        let mut x = Ok(0);
        for _ in 0..100 {
            // Rewind so every copy reads the full file instead of hitting EOF.
            file.seek(SeekFrom::Start(0)).unwrap();
            x = io::copy(&mut file, &mut writer);
            writer.clear();
        }
        x
    })
}

@spease
Author

spease commented Aug 1, 2018

4MB isn’t a whole lot. That’s well within the size of hard drive caches or possibly an OS memory-based cache. I was noticing the difference with files several gigabytes in size.

As a sanity check, maybe look at what the actual time is, not just the % difference between timings with different buffer sizes. The actual time may be wildly different from the steady-state performance of the disk if the data is being read from a cache.

It may also be possible at a low level to pass a flag to encourage the OS to disable or flush the cache, but I don’t know if there’s any way to instruct the hard drive to do the same.

@ljedrz
Contributor

ljedrz commented Aug 1, 2018

I only pasted the benchmark for 4MB as a reference - I tested files from 4KB up to 16MB. I'm not an expert with low-level hard drive stuff, just hoping my attempt can help someone else.

@retep998
Member

retep998 commented Aug 6, 2018

Ideally you would test with a file approximately twice as big as you have RAM, so if you have 8GB of RAM you would want to test with a 16GB file.

@rbtcollins
Contributor

@spease extended open flags are accessible via https://doc.rust-lang.org/std/os/windows/fs/trait.OpenOptionsExt.html
FWIW, lots of benchmarks exist showing that synchronous IO calls improve when the IO operations match the underlying block size (e.g. 4MB for modern SSDs) across multiple OSes, so it's very likely that there are improvements to gain here. However, 4MB is a stack-breaking size by default, and heap allocations will destroy potential improvements (and further, if we're optimising, we probably want overlapped IO and a ring of buffers rather than just one buffer, which will always leave one device idle)... so it's complex ;)
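
A minimal sketch of setting that flag on Windows (the flag value is from the Windows SDK headers; the path is a placeholder):

use std::fs::OpenOptions;
use std::io;
use std::os::windows::fs::OpenOptionsExt;

// FILE_FLAG_SEQUENTIAL_SCAN, as defined in the Windows SDK headers.
const FILE_FLAG_SEQUENTIAL_SCAN: u32 = 0x0800_0000;

fn main() -> io::Result<()> {
    let _file = OpenOptions::new()
        .read(true)
        .custom_flags(FILE_FLAG_SEQUENTIAL_SCAN)
        .open("big_input.bin")?;
    Ok(())
}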

@main--
Contributor

main-- commented Apr 11, 2020

heap allocations will destroy potential improvements

How come? io::copy would allocate once at the beginning. Heap memory isn't inherently slower, and one memory allocation is basically free compared to the heavy IO that happens afterwards.

@retep998
Member

But we also don't want to be allocating 4MB every time we do io::copy, as someone might be doing a bunch of small copies.

@the8472
Member

the8472 commented Apr 11, 2020

Possible approaches:

  • thread-local buffers or a buffer pool
  • stat the input before deciding on the buffer size (this is what GNU cp does, I think)
  • heap-allocate a bigger buffer after a few iterations of the copy loop with an on-stack buffer
  • specialize copy operations from file to memory to read directly into the already-allocated target, thus skipping the need for a buffer (sketched below)
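
A minimal sketch of that last idea (hypothetical; copy_file_to_vec is not an actual std API):

use std::fs::File;
use std::io::{self, Read};

// Hypothetical fast path: the destination is already-allocated memory,
// so read straight into it and skip the intermediate copy buffer.
fn copy_file_to_vec(file: &mut File, dst: &mut Vec<u8>) -> io::Result<u64> {
    let n = file.read_to_end(dst)?;
    Ok(n as u64)
}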

@main--
Contributor

main-- commented Apr 11, 2020

specialize copy operations from file to memory to read directly into the already allocated target, thus skipping the need for a buffer

I think some kind of specialization for BufRead sources would be ideal. Because in this case, io::copy should not buffer at all and simply use the reader's buffer. Effectively this also gives you a customizable buffer size by wrapping your File in a BufReader with whatever size you want.

As of today, if your source is BufRead then io::copy is always the wrong answer. I got a substantial performance improvement using a simple loop instead:

use std::io::{self, BufRead, Write};

fn copy_from_bufread(reader: &mut impl BufRead, file_out: &mut impl Write) -> io::Result<()> {
    loop {
        let buf = reader.fill_buf()?; // borrow the reader's internal buffer: no extra copy
        if buf.is_empty() { break; }
        let consumed = file_out.write(buf)?;
        reader.consume(consumed);
    }
    Ok(())
}

@rbtcollins
Contributor

@main-- a single heap allocation is entirely reasonable for a single invocation of copy on a lot of data. For many copies of small amounts of data, one ends up with a lot of allocations again (consider for instance copying data out of a tar, where we have views on the data and there's no actual synchronous IO etc. happening).

I like some of the suggested possibilities with specialisation.

@the8472
Member

the8472 commented Apr 13, 2020

I think some kind of specialization for BufRead sources would be ideal.

It's a nice general optimization, but not necessarily ideal if you use a Vec as the Write. Adding a BufRead on the reader side does let you control the buffer size, but it still introduces an unnecessary memcpy compared to File::read_to_end.

Perhaps a writer-side optimization would be more flexible. Then you can specialize either for BufWriter or Vec.
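
To illustrate the difference (a sketch; the path is a placeholder):

use std::fs::File;
use std::io::{self, BufReader, Read};

fn main() -> io::Result<()> {
    // Direct: data goes kernel -> Vec, one copy.
    let mut direct = Vec::new();
    File::open("input.bin")?.read_to_end(&mut direct)?;

    // Via BufReader + io::copy: even with a reader-side specialization,
    // data goes kernel -> BufReader's buffer -> Vec, an extra memcpy.
    let mut via_copy = Vec::new();
    io::copy(&mut BufReader::new(File::open("input.bin")?), &mut via_copy)?;
    Ok(())
}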

@main--
Contributor

main-- commented Apr 14, 2020

That sounds rather uncommon to me, though. After all, read_to_end is provided by the Read trait, so it's always available. The only case I can see where you would invoke io::copy with a Vec as destination is some generic interface that takes any W: Write. But IMO such interfaces are ill-designed, as they could simply implement Read instead and offer a much more flexible API.

@the8472
Member

the8472 commented Apr 14, 2020

Yes, generic code taking a Write was the concern. E.g. tar-rs has a tar archive Builder which takes a Write and is fed with Read items. I'm not saying that creating an in-memory archive would be a good idea, just pointing out that these patterns do exist.

@Niketin

Niketin commented Jun 26, 2020

I had the same kind of issue but with a different library (indicatif). The copy rate was around 70 MB/s (within a single NVMe SSD). The problem was actually that I was running the debug version of my program. When I built and ran the release version, the speed jumped to over 1 GB/s.

henryboisdequin pushed a commit to henryboisdequin/rust that referenced this issue Feb 1, 2021
Let io::copy reuse BufWriter buffers

This optimization will allow users to implicitly set the buffer size for io::copy by wrapping the writer into a `BufWriter` if the default block size is insufficient, which should fix rust-lang#49921

Due to min_specialization limitations this approach only works with `BufWriter` but not for `BufReader<R>` since `R` is unconstrained and thus the necessary specialization on `R: Read` is not always applicable. Once specialization becomes more powerful this optimization could be extended to look at the reader and writer side and use whichever buffer is larger.
@bors bors closed this as completed in a7a6f01 Feb 1, 2021
@the8472
Member

the8472 commented Feb 1, 2021

With #78641 merged you can now (on the next nightly) use BufWriter to control the chunk size of io::copy.
Since this relies on specialization, it only works if you pass a BufWriter directly; any other wrapper around it would disable this optimization.
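
For example (a sketch; the file names are placeholders):

use std::fs::File;
use std::io::{self, BufWriter};

fn main() -> io::Result<()> {
    let mut reader = File::open("src.bin")?;
    // io::copy will write through this 1 MiB buffer, so it effectively
    // copies in 1 MiB chunks instead of the 8 KiB default.
    let mut writer = BufWriter::with_capacity(1 << 20, File::create("dst.bin")?);
    io::copy(&mut reader, &mut writer)?;
    writer.into_inner()?; // flush whatever is still buffered
    Ok(())
}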
