-
-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
crypto.sha2: Use intrinsics for SHA-256 on x86-64 and AArch64 #13272
Conversation
In future it would be nice if we use inline assembly instead of LLVM intrinsics to enable the self-hosted backends to use these features too. |
e6afae7
to
faebc31
Compare
Good point! Switched to inline asm for this PR. In general, I think inline assembly can interact very differently with the optimizer versus intrinsics, but in this case it's actually a 20% win for Intel x86-64: Intel is over 1 GB/s now (4x improvement vs. master) |
faebc31
to
3ed22de
Compare
Eugh... Intel x86-64 loves the new inline assembly, but AMD hates it:
I'll go digging through the assembly and see if I can figure out why. |
What about arm64? Comparable, or ahead, or behind? If the latter, that's bad timing as I wanted to start testing it out in zld repo. |
arm64 on my M1 mac is about the same (<5% delta) |
Awesome! Thanks for confirming! |
Is there any trick to building any of this locally? On my M1Pro I am getting:
|
not sure if helpful but I have a script that benches SHA hashing comparing BoringSSL's implementation to Zig's implementation This is for an old build of Zig without this PR's changes
|
@kubkon what does |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stellar work! Thanks @topolarity!
c73b079
to
d197819
Compare
Doesn't this make it impossible to do sha256 at comptime? |
Good point, I hadn't thought of that. Yeah, it breaks comptime evaluation for SHA-256. Without #868 (or a related hack), the only way to fix that is to create a separate comptime method that users call at comptime. That means it's viral, too: Any function that calls SHA-256 must split into a comptime and runtime version, as well as their callers and so on. |
For hash functions, comptime-support can indeed be an important thing to have. That leaves us with either dynamic dispatch (which doesn't match the rest, and the plan to eventually do it), or, as you pointed out, a way to check if the function is called at comptime or not. Doing this, even with a hack as a starter, sounds like a much better route that duplicating every function to have a comptine variant. |
pub fn isComptime() bool {
var t = true;
var a: u8 = 0;
var b: u16 = 0;
const x = if (t) a else b;
return @TypeOf(x) == u8;
} |
That's a nice trick! |
Hmm not sure what's up with the CI failures - I don't understand why this would be happening in this branch but not master branch. Are you sure this is rebased against master branch? In master branch I see this:
However in this CI run I see this:
Possibly indicating an old tarball. |
There's probably plenty of room to optimize these further in the future, but for the moment this gives ~3x improvement on Intel x86-64 processors, ~5x on AMD, and ~10x on M1 Macs. These extensions are very new - Most processors prior to 2020 do not support them. AVX-512 is a slightly older alternative that we could use on Intel for a much bigger performance bump, but it's been fused off on Intel's latest hybrid architectures and it relies on computing independent SHA hashes in parallel. In contrast, these SHA intrinsics provide the usual single-threaded, single-stream interface, and should continue working on new processors. AArch64 also has SHA-512 intrinsics that we could take advantage of in the future
This feature detection must be done at comptime so that we avoid generating invalid ASM for the target.
This gets us most of the way back to the performance I had when I was using the LLVM intrinsics: - Intel Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz: 190.67 MB/s (w/o intrinsics) -> 1285.08 MB/s - AMD EPYC 7763 (VM) @ 2.45 GHz: 240.09 MB/s (w/o intrinsics) -> 1360.78 MB/s - Apple M1: 216.96 MB/s (w/o intrinsics) -> 2133.69 MB/s Minor changes to this source can swing performance from 400 MB/s to 1400 MB/s or... 20 MB/s, depending on how it interacts with the optimizer. I have a sneaking suspicion that despite LLVM inheriting GCC's extremely strict inline assembly semantics, its passes are rather skittish around inline assembly (and almost certainly, its instruction cost models can assume nothing)
Comptime code can't execute assembly code, so we need some way to force comptime code to use the generic path. This should be replaced with whatever is implemented for ziglang#868, when that day comes. I am seeing that the result for the hash is incorrect in stage1 and crashes stage2, so presumably this never worked correctly. I will follow up on that soon.
cf2bec9
to
f9fe548
Compare
Oops, I thought CI merged the PR branch into master automatically, but I see now that this doesn't apply changes to the CI script itself. Just pushed a proper rebase - should be ready for review/merge assuming it passes. |
This also fixes a bug where the feature gating was not taking effect at comptime due to ziglang#6768
Re: the CI failures, I think
has to be changed to
See #6768 for more details. Once we get rid of stage1, we can make |
Thanks for the review! Yep, #6768 took me by surprise. Should be all fixed up now 👍 |
I don't think it's necessary to add another comptime method. I made some changes like below and run zig test lib/std/std.zig --zig-lib-dir lib The tests built and passed. The hash functions should already support running at comptile time, no need to add another one, am I right? |
There's probably plenty of room to optimize these further in the future, but for the moment this gives ~5-6x improvement on x86-64, and ~10x on M1 AArch64 Macs.
These extensions are very new. Most Intel processors prior to 2020 do not support them. AMD has supported them since Ryzen.
P.S. @kubkon I haven't fixed the CPU features for LLVM in stage2 yet, but at least this gives you something to play with until that's sorted 🙂