Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

About 10% speed up: tweak diagonal shuffle #4

Merged
merged 1 commit into from
May 3, 2019

Conversation

grafi-tt
Copy link
Contributor

@grafi-tt grafi-tt commented May 3, 2019

Hi, thank you for this awesome Blake2 implementation! When I read the source, I've come up with an idea to improvement.

This PR changed diagonal routines to shift a, c and d, instead of b, c and d. Since data dependency on b is critical, the change improves performance. I confirmed the code passes all self-check tests.

On my Broadwell Skylake laptop, Blake2b cps improved from ~3.0 to ~2.7. I'm running Linux on VMWare, so measurement was unstable.

BTW the tweak is applicable to other Blake2 / Chacha implementations. I guess someone already did the same thing, but I couldn't find any clue...

@grafi-tt
Copy link
Contributor Author

grafi-tt commented May 3, 2019

BTW the tweak is applicable to other Blake2 / Chacha implementations. I guess someone already did the same thing, but I couldn't find any clue...

I found it in the asm code. Sorry for having made noise. https://github.com/floodyberry/chacha-opt/blob/master/app/extensions/chacha/chacha_ssse3-64.inc#L586-L590

@sneves
Copy link
Owner

sneves commented May 3, 2019

Thank you, this is very nice! The latency improvement up is indeed real and significant, and it looks like a couple of ChaCha implementations did use this trick before, though I had never noticed it.

@sneves sneves merged commit b372392 into sneves:master May 3, 2019
SergiySW added a commit to SergiySW/raiblocks that referenced this pull request May 14, 2019
sneves/blake2-avx2#4
About 10% speed up: tweak diagonal shuffle
SergiySW added a commit to nanocurrency/nano-node that referenced this pull request May 17, 2019
sneves/blake2-avx2#4
About 10% speed up: tweak diagonal shuffle
@oconnor663
Copy link

oconnor663 commented Jun 13, 2019

Novice question: Why is the data dependency on b critical? Also, would it make sense to port similar optimizations to the SSE implementations of BLAKE2s?

oconnor663 added a commit to oconnor663/blake2_simd that referenced this pull request Jun 14, 2019
The original source for these optimizations is
sneves/blake2-avx2#4

Libsodium committed them at
jedisct1/libsodium@80206ad
@sneves
Copy link
Owner

sneves commented Jun 17, 2019

You have the row round

a += b; d ^= a; d >>>= 32;
c += d; b ^= c; b >>>= 24;
a += b; d ^= a; d >>>= 16;
c += d; b ^= c; b >>>= 63;

Because b is the last element to be changed in the round (in each row in parallel), the next diagonal round needs to wait for the result of b to finish, then shuffle it, then proceed. But if you shuffle the other words instead, these shuffles can be done earlier, for example in parallel with the computation of b.

If you draw a DAG of all the computations involved here, this optimization reduces the maximum depth of the graph, assuming infinite parallelism is available. The longest path from input to output values in this graph is the critical path, and is what determines the minimal possible latency of the computation (subject to other CPU parallelism constraints, etc etc).

And yes, it makes perfect sense to do the same with BLAKE2s and SSE. I did this on the Wireguard implementation a while ago.

@oconnor663
Copy link

oconnor663/blake2_simd@e26796e, contributed by Sean Gulley, implements the SSE4.1 version of this optimization for BLAKE2s.

@xtremertx
Copy link

Hi, I have found this commit by a mistake but it got my interest. You have mentioned that this optimalization can be used for chacha algorithm as there is clearly same computation involved. It would be cool if you could review my testing chacha implementation, will try to upload it as repository during this weekend, hopefully I will find some time to finish it.

str4d added a commit to str4d/stream-ciphers that referenced this pull request Aug 9, 2021
The `b` state word is on the hot path, so we pivot the diagonalization
to move the shuffles onto the other state words. See the code comment,
or sneves/blake2-avx2#4 for additional details.
tarcieri pushed a commit to RustCrypto/stream-ciphers that referenced this pull request Aug 9, 2021
* chacha20: Add a `backend::avx2::StateWord` helper union

This removes a bunch of instructions for accessing the 128-bit lanes.

* chacha20: Rename backend state words to match RFC 7539

* chacha20: Optimise diagonalization in SSE2 and AVX2 backends

The `b` state word is on the hot path, so we pivot the diagonalization
to move the shuffles onto the other state words. See the code comment,
or sneves/blake2-avx2#4 for additional details.
@JayDDee
Copy link

JayDDee commented Aug 20, 2023

Novice question: Why is the data dependency on b critical? Also, would it make sense to port similar optimizations to the SSE implementations of BLAKE2s?

Late but maybe still useful...

There are 3 main reasons why B is special:

  • It's the last variable written to before the shuffles and the first one read after,
  • B does more work than the other variables such as message injection,
  • the 63 bit rotation of B, just before the shuffles, is particularly slow compared to 16, 24 & 32 bits that can be optimized using byte shuffle.

Shuffling A instead of B addresses all these issues:

  • eliminates a critical dependency on B between the last write before shuffles and the first read after shuffles,
  • more balanced workload among all variables by shifting some work from overworked B to underworked A,
  • shuffle of other variables can be done in parallel with slow 63 bit rotation of B.

This appllies to all Blakes, Salsa and Chaha except that Chacha & Salsa don't have message injection so they don't suffer as much latency to start with and benefit less from these changes.

The bit rotation optimization issue is moot with AVX512 or the upcoming AVX10 due to the availability of the VROR instruction.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants