Use a larger buffer size for the incremental write benchmarks? #1
I wanted to highlight that there's a performance "gotcha": you have to use the API correctly to get the expected performance. That fact surprised me while I was working on this, and I'd expect it would surprise many users unfamiliar with the implementation details. Having a large internal buffer allows for more flexible (uneducated, lol) user behavior. There's a downside to having a large internal buffer, though, and that's shown in the much worse performance in the (arguably more important) small-string case. I'm happy to work with you on how to express the tradeoff, but I'd like to highlight it somehow. For example, I could add more exposition, or pull all of the small-string cases into their own graph to show the downsides more prominently.

Handwritten assembly? Exciting! I'll be sure to check it out and pull in any applicable techniques. 😄

Edit: I just noticed that the Rust crate now does have a reset method. I'll be sure to try out the benchmarks with that and update my readme.
abstract thoughts about buffering

Yeah, there are a few different tradeoffs between having a big buffer and not having one. The advantage of having one is exactly this scenario: the user is writing with a relatively small copy buffer, and we can accumulate writes internally until we have enough chunks to drive the fast SIMD implementation. But there are two immediate disadvantages:
Another factor to consider here is multithreading. Once multithreading is in the picture, the optimal buffer size is a much more difficult question. It's generally on the order of 1 MiB, with the specific size differing substantially from one processor to another. Handling this with internal buffering would probably require dynamic allocation and some kind of growth strategy. And then at some point we have to confront the fact that, if we really want optimized incremental hashing, we need to make sure that IO happens on a background thread, so that all the workers don't have to go to sleep every time the buffer gets refilled. This is a lot of complexity, and I think leaving it up to the caller works better. So yeah, all of those things put together made me decide not to include internal buffering.

thinking about highlighting the difference

Maybe we could benchmark the incremental Rust version twice, once with a large buffer size and once with a small one? That would be kind of analogous to "Go" and "Go (Reset)", using two different colors in the same chart.
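The "benchmark it twice" idea above boils down to varying the caller's copy-buffer size, which bounds how much data the hasher sees per `Write` call. Here's a minimal sketch of that setup, under assumptions: `crypto/sha256` stands in for the BLAKE3 implementations (both expose `hash.Hash`), and `hashWithCopyBuffer` is an illustrative name, not part of either library.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
)

// hashWithCopyBuffer feeds src to a fresh hasher through Write calls of
// at most bufSize bytes, mimicking an incremental caller with a copy
// buffer of that size.
func hashWithCopyBuffer(src []byte, bufSize int) string {
	h := sha256.New()
	buf := make([]byte, bufSize)
	// io.CopyBuffer issues one h.Write per read, so bufSize bounds how
	// much data the hasher receives at a time.
	io.CopyBuffer(h, bytes.NewReader(src), buf)
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	data := bytes.Repeat([]byte("x"), 1<<20) // 1 MiB of input
	small := hashWithCopyBuffer(data, 64)    // tiny writes: serial path
	large := hashWithCopyBuffer(data, 16384) // 16 KiB writes: SIMD-friendly
	fmt.Println(small == large)              // prints "true": same digest, different speed
}
```

Timing the two calls (e.g. with `testing.B`) would give the two data series for the chart.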
Yeah, this library has the advantage of not having to worry about the embedded use case (thanks Go!), does a trick to use the buffer directly without a copy if there's enough extra data provided (https://github.com/zeebo/blake3/blob/master/blake3.go#L33-L35), and I don't plan to support multithreading. And while it's 3-4x slower for small strings, in absolute terms it's on the order of ~100ns and you can reclaim almost all of that time by using Reset. It sounds like the worry is that the graph shows that you can't have good performance with any size of incremental writes and I definitely don't want to give that impression. Given that, I will
Let me know if you'd like to see something else happen; otherwise I'll link the PR here in a bit.

That sounds great! Thanks for getting to it so quickly.

I've opened the pull request: #2
Closing as resolved by #2. |
As you noted, the Rust `Hasher` implementation doesn't allocate a large internal buffer, so small incremental writes end up going down a slower serial codepath that doesn't benefit as much from SIMD. Would it be possible to use a larger buffer size for benchmarking that implementation? 8 KiB would be enough to get AVX2 going, and 16 KiB is enough if your machine supports AVX-512. There are also marginal benefits to using something bigger (`b3sum` on our master branch uses 64 KiB), even when IO system calls aren't a factor, because it means more of the parent nodes can be hashed in parallel.

As it happens, we're about to land some handwritten assembly implementations from @sneves in BLAKE3-team/BLAKE3. When we release version 0.2 shortly, it'll be interesting to see whether we get any measurable improvement in these benchmarks.
It's really exciting to see the performance numbers of this crate. Thank you for building it!