Add simd sha256 intrinsics for x86 machines #1962
Thanks for this! I confirm I have similar benchmark improvements on my machine. The error in CI is just formatting, a Any rationale for basing the implementation on this code rather than on the code in Bitcoin Core or in the sha2 crate? Is having some fuzzing between the hardware and software implementations a good idea?
@RCasatta the cited code is public domain. If we took from Bitcoin Core we'd have to ask the author to let us relicense under CC0 1.0 (which would probably be fine, I assume it's sipa or bluematt, but it's still extra work).
ACK 957d162
No strong reason. Bitcoin core code is also based on the same implementation. The
As for the sha2 crate, I was not sure whether we would need permission to relicense it under CC0. I am sure we could just copy it and they would be okay with it, but this one just seemed simpler. I don't mind switching to another implementation if that makes things easier.
As for fuzzing, I don't think it would be required, because this code either works or fails completely. There are no data-dependent branches. Maybe @apoelstra has thoughts here.
I agree with that. I think in the past we've had fuzz tests that check compatibility between our crate and the RustCrypto one, and it (unsurprisingly) never ever found anything.
I notice the C implementation doesn't create a new function call for each block, although I suppose creating 64 new stack frames probably doesn't impact performance.
A private function that is only called once is pretty much guaranteed to be inlined.
Yeah, I didn't want to touch other parts of the code or change the architecture of the code by moving the loop into
Optimizing just sha is nice, but there's also a lot to be gained interleaving multiple sha256 operations at least in merkle tree building - we are very often taking 2/4/8/etc neighboring blocks of 64 bytes and hashing each block at once. IIRC in Bitcoin Core pre-sha-ni it made a huge difference, though I'm not sure what the improvement was interleaving post-sha-ni. No reason to hold this PR up on that or anything, just noting it in case you're excited to do more :)
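To make the merkle-tree remark concrete, here is a minimal sketch of one tree level. It assumes the usual Bitcoin layout (each parent is the hash of a 64-byte buffer made of two 32-byte children, with the last node paired with itself when the count is odd). `merkle_level` and the `hash64` parameter are hypothetical names for illustration, not part of this crate. The point is only that every call to `hash64` within a level is independent of the others, which is what a 2/4/8-way interleaved transform could exploit.

```rust
/// Compute one level of a merkle tree.
///
/// `hash64` is a caller-supplied function over a 64-byte buffer (for Bitcoin
/// this would be double SHA-256; here it is left abstract on purpose).
fn merkle_level<H>(nodes: &[[u8; 32]], hash64: H) -> Vec<[u8; 32]>
where
    H: Fn(&[u8; 64]) -> [u8; 32],
{
    nodes
        .chunks(2)
        .map(|pair| {
            let left = pair[0];
            // With an odd number of nodes the last one is paired with itself.
            let right = *pair.get(1).unwrap_or(&pair[0]);
            let mut buf = [0u8; 64];
            buf[..32].copy_from_slice(&left);
            buf[32..].copy_from_slice(&right);
            // No pair depends on any other pair in the same level, so these
            // calls could be batched 2, 4 or 8 at a time.
            hash64(&buf)
        })
        .collect()
}
```

A batched variant would take the same `nodes` slice but hand several 64-byte buffers to the hash engine per call instead of one.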
ACK 546c012
I did some superficial simd instruction counting for 2/4/8 blocks interleaving methods from bitcoin core. Overall, one sha256d (3 transforms) computation done without transform2 takes about 192 _mm_sha256rnds2_epu32 and 84 _mm_sha256msg1_epu32. Based on my superficial evaluation, there is still some performance benefit to be gained, but certainly not a significant one.
I'd be lying if I said I grokked this code. Can we have a test (possibly during fuzzing) that hashes with the software implementation and the intrinsics one and asserts they are the same? (Although I forget why we stopped fuzzing against RustCrypto's version, and whether the reason we did that counters my request.)
#[cfg(all(feature = "std", any(target_arch = "x86", target_arch = "x86_64")))]
#[target_feature(enable = "sha,sse2,ssse3,sse4.1")]
unsafe fn process_block_simd_x86_intrinsics(&mut self) {
// Code translated and based on from
// Code translated and based on from
// Code translated from and based on
My expectation is that all the fixed tests would fail if the implementation were wrong. The SIMD path is checked indirectly, since we get the same hash value in all fixed tests with and without simd. I wanted to add some tests for this, but I don't know how to test it in CI. Add target=native and hope that the CI machine has the instructions to test it?
If you can get a test vector that passes and a test vector that fails such a test, you'd pretty much need to break the hash function. I don't think there's any need for testing correctness beyond the existing fixed test vectors.
Oh yes, my bad, I should have
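For readers who still want the cross-check discussed above, the following is a rough sketch, not the code from this PR. It pairs a portable SHA-256 compression function with a compact loop-based restatement of the usual SHA-NI round sequence, and selects between them at runtime with `is_x86_feature_detected!`, so it also runs on machines (or CI hosts) without the instructions. The names `compress_soft`, `compress_shani` and `compress` are hypothetical.

```rust
#[cfg(target_arch = "x86")]
use core::arch::x86::*;
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;

const K: [u32; 64] = [
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
];
/// SHA-256 initial state.
const IV: [u32; 8] = [
    0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a, 0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19,
];

/// Portable reference compression function.
fn compress_soft(state: &mut [u32; 8], block: &[u8; 64]) {
    let mut w = [0u32; 64];
    for i in 0..16 {
        w[i] = u32::from_be_bytes([block[4 * i], block[4 * i + 1], block[4 * i + 2], block[4 * i + 3]]);
    }
    for i in 16..64 {
        let s0 = w[i - 15].rotate_right(7) ^ w[i - 15].rotate_right(18) ^ (w[i - 15] >> 3);
        let s1 = w[i - 2].rotate_right(17) ^ w[i - 2].rotate_right(19) ^ (w[i - 2] >> 10);
        w[i] = w[i - 16].wrapping_add(s0).wrapping_add(w[i - 7]).wrapping_add(s1);
    }
    let [mut a, mut b, mut c, mut d, mut e, mut f, mut g, mut h] = *state;
    for i in 0..64 {
        let big_s1 = e.rotate_right(6) ^ e.rotate_right(11) ^ e.rotate_right(25);
        let ch = (e & f) ^ (!e & g);
        let t1 = h.wrapping_add(big_s1).wrapping_add(ch).wrapping_add(K[i]).wrapping_add(w[i]);
        let big_s0 = a.rotate_right(2) ^ a.rotate_right(13) ^ a.rotate_right(22);
        let t2 = big_s0.wrapping_add((a & b) ^ (a & c) ^ (b & c));
        h = g; g = f; f = e; e = d.wrapping_add(t1);
        d = c; c = b; b = a; a = t1.wrapping_add(t2);
    }
    for (s, v) in state.iter_mut().zip([a, b, c, d, e, f, g, h]) {
        *s = s.wrapping_add(v);
    }
}

/// SHA-NI path: 16 groups of 4 rounds, message schedule kept in four registers.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "sha,sse2,ssse3,sse4.1")]
unsafe fn compress_shani(state: &mut [u32; 8], block: &[u8; 64]) {
    // Byte shuffle that turns big-endian message words into native lanes.
    let mask = _mm_set_epi64x(0x0c0d0e0f08090a0b, 0x0405060700010203);
    let tmp = _mm_shuffle_epi32::<0xB1>(_mm_loadu_si128(state.as_ptr() as *const __m128i)); // CDAB
    let s1 = _mm_shuffle_epi32::<0x1B>(_mm_loadu_si128(state.as_ptr().add(4) as *const __m128i)); // EFGH
    let abef = _mm_alignr_epi8::<8>(tmp, s1);
    let cdgh = _mm_blend_epi16::<0xF0>(s1, tmp);
    let (mut st0, mut st1) = (abef, cdgh);
    let mut w = [_mm_setzero_si128(); 4];
    for i in 0..16 {
        if i < 4 {
            w[i] = _mm_shuffle_epi8(_mm_loadu_si128(block.as_ptr().add(16 * i) as *const __m128i), mask);
        } else {
            // Slots hold groups i-4, i-3, i-2, i-1 at indices i, i+1, i+2, i+3 (mod 4).
            let t = _mm_add_epi32(
                _mm_sha256msg1_epu32(w[i % 4], w[(i + 1) % 4]),
                _mm_alignr_epi8::<4>(w[(i + 3) % 4], w[(i + 2) % 4]),
            );
            w[i % 4] = _mm_sha256msg2_epu32(t, w[(i + 3) % 4]);
        }
        let k = _mm_loadu_si128(K.as_ptr().add(4 * i) as *const __m128i);
        let msg = _mm_add_epi32(w[i % 4], k);
        st1 = _mm_sha256rnds2_epu32(st1, st0, msg); // rounds 4i, 4i+1
        st0 = _mm_sha256rnds2_epu32(st0, st1, _mm_shuffle_epi32::<0x0E>(msg)); // rounds 4i+2, 4i+3
    }
    let tmp = _mm_shuffle_epi32::<0x1B>(_mm_add_epi32(st0, abef)); // FEBA
    let st1 = _mm_shuffle_epi32::<0xB1>(_mm_add_epi32(st1, cdgh)); // DCHG
    _mm_storeu_si128(state.as_mut_ptr() as *mut __m128i, _mm_blend_epi16::<0xF0>(tmp, st1)); // DCBA
    _mm_storeu_si128(state.as_mut_ptr().add(4) as *mut __m128i, _mm_alignr_epi8::<8>(st1, tmp)); // HGFE
}

/// Runtime dispatch: use SHA-NI when the CPU reports it, else fall back.
fn compress(state: &mut [u32; 8], block: &[u8; 64]) {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    if is_x86_feature_detected!("sha") && is_x86_feature_detected!("sse2")
        && is_x86_feature_detected!("ssse3") && is_x86_feature_detected!("sse4.1")
    {
        // Safety: the required CPU features were just detected.
        return unsafe { compress_shani(state, block) };
    }
    compress_soft(state, block)
}
```

A differential test would then feed the same blocks to `compress` and `compress_soft` starting from `IV` and assert the resulting states are equal. On a host without SHA-NI both calls take the software path, so the comparison is trivially true there, which is the CI limitation raised above.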
What's the plan from here @sanket1729, are we going to do the other hashes from noloader code too (sha1, sha512)?
@tcharding, can do. But I don't think those would be nearly as impactful as the sha256 change. Other hash functions are just not used that frequently in bitcoin. sha256 is used in transaction hashes, block hashes, checksums, etc. I think our efforts are better spent improving the sha256 hash engine. I think that having functions for Transform2, Transform4 and Transform8 might be more useful. I don't plan on doing that anytime soon, so it's up for grabs if anyone is interested.
Fully agreed that we only really need to put work into sha2, and any spare effort that people want to put into this should be focused on sha2. (I wouldn't refuse work on other hash functions if somebody did it, of course, but if they're making a choice upfront, sha2 is all that really matters.)
Cool, cheers.
So my point about performance here wasn't about the instruction count, but rather about data interleaving allowing the CPU to better parallelize internally, which can allow for much higher throughput even without fewer instructions.
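The interleaving idea can be illustrated with a toy sketch. `toy_round` below is a made-up stand-in for a hash round, not SHA-256. What it shares with a real round is that each step needs the previous state, so a single chain is strictly serial. Advancing two independent chains in the same loop body executes the same number of round operations, but gives the CPU two dependency chains it can overlap. Whether that pays off in practice, and by how much, would have to be measured.

```rust
/// Toy stand-in for one hash round (NOT SHA-256): the output depends on the
/// previous state, so consecutive rounds of one chain cannot overlap.
#[inline(always)]
fn toy_round(state: u32, word: u32) -> u32 {
    state.rotate_left(5).wrapping_add(word) ^ 0x9e37_79b9
}

/// One message, processed serially.
fn chain(mut state: u32, words: &[u32]) -> u32 {
    for &w in words {
        state = toy_round(state, w);
    }
    state
}

/// Two independent messages processed in lockstep. The total number of
/// rounds is the same as two `chain` calls, but the two updates in the loop
/// body do not depend on each other.
fn chain2(mut s: [u32; 2], words: [&[u32]; 2]) -> [u32; 2] {
    assert_eq!(words[0].len(), words[1].len());
    for (&w0, &w1) in words[0].iter().zip(words[1]) {
        s[0] = toy_round(s[0], w0);
        s[1] = toy_round(s[1], w1);
    }
    s
}
```

The interleaved version must produce exactly the same results as two separate `chain` calls; only the scheduling differs.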
Ah thanks for elaborating. Makes much more sense now.
What's the status of this one @apoelstra? Are you hoping for another review?
Oh, I don't think I noticed your ACK @tcharding. Can you re-ack with the commit ID?
ACK 546c012
I must have clicked "approve" before somehow, whoops. All good now.
This is my first time dabbling in architecture-specific code and simd. The algorithm is a word-for-word translation of the C code from https://github.com/noloader/SHA-Intrinsics/blob/4899efc81d1af159c1fd955936c673139f35aea9/sha256-x86.c .
Some benchmarks:
With simd
Without simd