Use a larger buffer size for the incremental write benchmarks? #1

oconnor663 · 2020-02-12T17:06:10Z

As you noted, the Rust Hasher implementation doesn't allocate a large internal buffer, so small incremental writes end up going down a slower serial codepath that doesn't benefit as much from SIMD. Would it be possible to use a larger buffer size for benchmarking that implementation? 8 KiB would be enough to get AVX2 going, and 16 KiB is enough if your machine supports AVX-512. There are also marginal benefits to using something bigger (b3sum on our master branch uses 64 KiB), even when IO system calls aren't a factor, because it means more of the parent nodes can be hashed in parallel.

As it happens, we're about to land some handwritten assembly implementations from @sneves in BLAKE3-team/BLAKE3. When we release version 0.2 shortly, it'll be interesting to see whether we get any measurable improvement in these benchmarks.

It's really exciting to see the performance numbers of this crate. Thank you for building it!

The text was updated successfully, but these errors were encountered:

zeebo · 2020-02-12T17:45:50Z

I wanted to highlight that there's a performance "gotcha": you have to make sure to use the API correctly to get the expected performance. While I was working on this, that fact surprised me, and I'd expect it would surprise many users unfamiliar with the implementation details. Having a large internal buffer allows for more flexible (uneducated lol) user behavior. There's a downside to having a large internal buffer though, and that's shown in the much worse performance at the (arguably more important) small string case.

I'm happy to work with you on how to express the tradeoff, but I'd like to highlight it somehow. For example, I could add more exposition or pull all of the small string cases into their own graph to show the downsides more prominantly.

Handwritten assembly? Exciting! I'll be sure to check it out and pull any applicable techniques. 😄

Edit: I just noticed that the Rust crate now does have a reset method. I'll be sure to try out the benchmarks with that and update my readme.

oconnor663 · 2020-02-12T18:42:41Z

abstract thoughts about buffering

Yeah there are a few different tradeoffs between having a big buffer and not having one. The advantage to having one is exactly this scenario: the user is writing with a relatively small copy buffer, and we can accumulate writes internally until we have enough chunks to drive the fast SIMD implementation. But there are two immediate disadvantages:

Having a huge struct isn't great. The current Hasher is a bit under 2 KB on the call stack. Expanding that to 18 KB would be a pretty big jump. For comparison, that's almost the entire memory footprint of a compact TLS implementation. Most callers probably don't care very much about this in practice, but the implementation aims to support at least some reasonably constrained embedded use cases, if not all. There's also the performance penalty you noted for the short input case.
Copying into an internal buffer isn't free. To get maximum performance, we'd want to identify the case where the caller's buffer is big enough to drive SIMD without any further buffering, and we'd want to compress directly from there. (That's what we do today, but here it would be a separate codepath from the internal buffering one.) Performance-conscious callers would still want to use a large buffer of their own, so that Hasher's internal buffering codepath never triggers.

Another factor to consider here is multithreading. Once multithreading is in the picture, the optimal buffer size is a much more difficult question. It's generally on the order of 1 MiB, with the specific size differing substantially from one processor to another. Handling this with internal buffering would probably require dynamic allocation and some kind of growth strategy. And then at some point we have to confront the fact that, if we really want optimized incremental hashing, we need to make sure that IO happens on a background thread, so that all the workers don't have to go to sleep every time the buffer gets refilled. This is a lot of complexity, and I think leaving it up to the caller works better.

So yeah, all those things put together made me decide not to include internal buffering.

thinking about highlighting the difference

Maybe we could benchmark the incremental Rust version twice, once with a large buffer size and once with a small one? That would be kind of analogous to "Go" and "Go (Reset)", using two different colors in the same chart.

zeebo · 2020-02-12T19:19:06Z

Yeah, this library has the advantage of not having to worry about the embedded use case (thanks Go!), does a trick to use the buffer directly without a copy if there's enough extra data provided (https://github.com/zeebo/blake3/blob/master/blake3.go#L33-L35), and I don't plan to support multithreading. And while it's 3-4x slower for small strings, in absolute terms it's on the order of ~100ns and you can reclaim almost all of that time by using Reset.

It sounds like the worry is that the graph shows that you can't have good performance with any size of incremental writes and I definitely don't want to give that impression. Given that, I will

Separate out the small size case to better highlight the downside of the tradeoff
Do the "Rust (1k)" and "Rust (8k)" suggestion
Include some better discussion of the tradeoffs involved
Put the changes in a PR so there's opportunity for feedback

Let me know if you'd like to see something else happen, otherwise I'll link the PR here in a bit.

oconnor663 · 2020-02-12T19:28:27Z

That sounds great! Thanks for getting to it so quickly.

zeebo · 2020-02-12T22:24:38Z

I've opened the pull request: #2

zeebo · 2020-02-13T15:55:25Z

Closing as resolved by #2.

zeebo closed this as completed Feb 13, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use a larger buffer size for the incremental write benchmarks? #1

Use a larger buffer size for the incremental write benchmarks? #1

oconnor663 commented Feb 12, 2020 •

edited

zeebo commented Feb 12, 2020 •

edited

oconnor663 commented Feb 12, 2020

zeebo commented Feb 12, 2020

oconnor663 commented Feb 12, 2020

zeebo commented Feb 12, 2020

zeebo commented Feb 13, 2020

Use a larger buffer size for the incremental write benchmarks? #1

Use a larger buffer size for the incremental write benchmarks? #1

Comments

oconnor663 commented Feb 12, 2020 • edited

zeebo commented Feb 12, 2020 • edited

oconnor663 commented Feb 12, 2020

abstract thoughts about buffering

thinking about highlighting the difference

zeebo commented Feb 12, 2020

oconnor663 commented Feb 12, 2020

zeebo commented Feb 12, 2020

zeebo commented Feb 13, 2020

oconnor663 commented Feb 12, 2020 •

edited

zeebo commented Feb 12, 2020 •

edited