About 10% speed up: tweak diagonal shuffle #4

grafi-tt · 2019-05-03T07:03:52Z

Hi, thank you for this awesome Blake2 implementation! While reading the source, I came up with an idea for an improvement.

This PR changes the diagonalization routines to shuffle a, c and d instead of b, c and d. Since the data dependency on b is on the critical path, the change improves performance. I confirmed that the code passes all self-check tests.

On my ~~Broadwell~~ Skylake laptop, Blake2b cps improved from ~3.0 to ~2.7. I'm running Linux on VMware, so the measurements were unstable.

BTW the tweak is applicable to other Blake2 / Chacha implementations. I guess someone already did the same thing, but I couldn't find any clue...

grafi-tt · 2019-05-03T08:01:05Z

> BTW the tweak is applicable to other Blake2 / Chacha implementations. I guess someone already did the same thing, but I couldn't find any clue...

I found it in the asm code. Sorry for the noise. https://github.com/floodyberry/chacha-opt/blob/master/app/extensions/chacha/chacha_ssse3-64.inc#L586-L590

sneves · 2019-05-03T16:37:27Z

Thank you, this is very nice! The latency improvement is indeed real and significant, and it looks like a couple of ChaCha implementations did use this trick before, though I had never noticed it.

oconnor663 · 2019-06-13T18:10:24Z

Novice question: Why is the data dependency on b critical? Also, would it make sense to port similar optimizations to the SSE implementations of BLAKE2s?

sneves · 2019-06-17T01:45:01Z

You have the row round

a += b; d ^= a; d >>>= 32;
c += d; b ^= c; b >>>= 24;
a += b; d ^= a; d >>>= 16;
c += d; b ^= c; b >>>= 63;

Because b is the last element to be changed in the round (in each row in parallel), the next diagonal round needs to wait for the result of b to finish, then shuffle it, then proceed. But if you shuffle the other words instead, these shuffles can be done earlier, for example in parallel with the computation of b.
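A minimal sketch in Python of why the two shuffle choices are interchangeable. This is not code from the PR: the names `g`, `columns`, `rotl`, `round_shuffle_bcd` and `round_shuffle_acd` are made up for illustration, the state is modelled as four rows of four 64-bit words, and message injection is left out (so it is closer to ChaCha with BLAKE2b rotation constants). In a real BLAKE2 implementation the message words for the diagonal step would presumably have to be loaded in a matching rotated order.

```python
MASK = (1 << 64) - 1

def rotr64(x, n):
    """Rotate a 64-bit word right by n bits."""
    return ((x >> n) | (x << (64 - n))) & MASK

def g(a, b, c, d):
    """The mixing step quoted above, without message words (assumption)."""
    a = (a + b) & MASK; d = rotr64(d ^ a, 32)
    c = (c + d) & MASK; b = rotr64(b ^ c, 24)
    a = (a + b) & MASK; d = rotr64(d ^ a, 16)
    c = (c + d) & MASK; b = rotr64(b ^ c, 63)
    return a, b, c, d

def columns(rows):
    """Apply g to each of the four columns of the 4x4 state."""
    cols = [g(*col) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def rotl(row, n):
    """Rotate the lanes of one row left by n positions (negative n = right)."""
    n %= 4
    return row[n:] + row[:n]

def round_shuffle_bcd(rows):
    """Column step, then diagonal step by shuffling b, c and d."""
    a, b, c, d = columns(rows)
    a, b, c, d = columns([a, rotl(b, 1), rotl(c, 2), rotl(d, 3)])
    return [a, rotl(b, -1), rotl(c, -2), rotl(d, -3)]

def round_shuffle_acd(rows):
    """Column step, then diagonal step by shuffling a, c and d; b stays put."""
    a, b, c, d = columns(rows)
    a, b, c, d = columns([rotl(a, -1), b, rotl(c, 1), rotl(d, 2)])
    return [rotl(a, 1), b, rotl(c, -1), rotl(d, -2)]

# Arbitrary test state: 16 distinct 64-bit words.
state = [[(0x9E3779B97F4A7C15 * (4 * r + i + 1)) & MASK for i in range(4)]
         for r in range(4)]
same = round_shuffle_bcd(state) == round_shuffle_acd(state)
```

Both variants feed the same four diagonals to `g`; the second one just places them one lane over, so `b` never has to pass through a shuffle between the two steps.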

If you draw a DAG of all the computations involved here, this optimization reduces the maximum depth of the graph, assuming infinite parallelism is available. The longest path from input to output values in this graph is the critical path, and is what determines the minimal possible latency of the computation (subject to other CPU parallelism constraints, etc etc).
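To make the DAG argument concrete, here is a rough Python model of the critical path. It assumes every addition, XOR, rotation and shuffle costs exactly one time unit and that unlimited parallelism is available; real instruction latencies differ, and message injection is ignored. The function names `g_step`, `shuffle` and `critical_path` are made up for this sketch.

```python
def g_step(t):
    """Advance the ready time of each word through one G; each op costs 1."""
    t = dict(t)
    for _ in range(2):                      # G has two halves of the same shape
        t["a"] = max(t["a"], t["b"]) + 1    # a += b
        t["d"] = max(t["d"], t["a"]) + 1    # d ^= a
        t["d"] += 1                         # d >>>= r
        t["c"] = max(t["c"], t["d"]) + 1    # c += d
        t["b"] = max(t["b"], t["c"]) + 1    # b ^= c
        t["b"] += 1                         # b >>>= r
    return t

def shuffle(t, words):
    """Shuffling a word delays its availability by one unit."""
    t = dict(t)
    for w in words:
        t[w] += 1
    return t

def critical_path(words, half_rounds):
    """Depth of the dependency graph when `words` are shuffled after each G."""
    t = dict.fromkeys("abcd", 0)
    for _ in range(half_rounds):
        t = shuffle(g_step(t), words)
    return max(t.values())

depth_bcd = critical_path("bcd", 2)   # shuffle b, c, d (before the PR)
depth_acd = critical_path("acd", 2)   # shuffle a, c, d (after the PR)
```

In this model two half-rounds have depth 26 when b, c and d are shuffled and 24 when a, c and d are: the shuffle of b sits on the longest path, while the shuffles of a, c and d finish before the last rotation of b does.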

And yes, it makes perfect sense to do the same with BLAKE2s and SSE. I did this in the WireGuard implementation a while ago.

oconnor663 · 2019-07-22T18:43:49Z

oconnor663/blake2_simd@e26796e, contributed by Sean Gulley, implements the SSE4.1 version of this optimization for BLAKE2s.

xtremertx · 2020-07-24T13:10:50Z

Hi, I found this commit by accident, but it caught my interest. You mentioned that this optimization can be used for the ChaCha algorithm, as it clearly involves the same computation. It would be cool if you could review my test ChaCha implementation. I will try to upload it as a repository this weekend; hopefully I will find some time to finish it.

* chacha20: Add a `backend::avx2::StateWord` helper union
  This removes a bunch of instructions for accessing the 128-bit lanes.
* chacha20: Rename backend state words to match RFC 7539
* chacha20: Optimise diagonalization in SSE2 and AVX2 backends
  The `b` state word is on the hot path, so we pivot the diagonalization to move the shuffles onto the other state words. See the code comment, or sneves/blake2-avx2#4 for additional details.

JayDDee · 2023-08-20T17:10:22Z

> Novice question: Why is the data dependency on b critical? Also, would it make sense to port similar optimizations to the SSE implementations of BLAKE2s?

Late but maybe still useful...

There are 3 main reasons why B is special:

* It's the last variable written to before the shuffles and the first one read after.
* B does more work than the other variables, such as message injection.
* The 63-bit rotation of B, just before the shuffles, is particularly slow compared to the 16, 24 and 32-bit rotations, which can be optimized using a byte shuffle.

Shuffling A instead of B addresses all these issues:

* It eliminates a critical dependency on B between the last write before the shuffles and the first read after the shuffles.
* It balances the workload among all variables by shifting some work from overworked B to underworked A.
* The shuffles of the other variables can be done in parallel with the slow 63-bit rotation of B.
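A small Python illustration of the rotation point, as a sketch only: rotations by a multiple of 8 bits are pure byte permutations (which is what a SIMD byte shuffle can do in one instruction), while the 63-bit rotation needs arithmetic on the bits, for example a shift combined with `x + x`. The helper names `rotr64`, `rotr64_bytes` and `rotr64_63` are made up for this example.

```python
MASK = (1 << 64) - 1

def rotr64(x, n):
    """Reference rotate-right of a 64-bit word."""
    return ((x >> n) | (x << (64 - n))) & MASK

def rotr64_bytes(x, n):
    """Rotate right by a multiple of 8 bits using only a byte permutation."""
    if n % 8 != 0:
        raise ValueError("only whole-byte rotations are byte permutations")
    k = n // 8
    b = x.to_bytes(8, "little")
    return int.from_bytes(b[k:] + b[:k], "little")

def rotr64_63(x):
    """Rotate right by 63 (= left by 1): a shift plus a doubling, no permutation."""
    return (x >> 63) | ((x + x) & MASK)

samples = [0, 1, 0x8000000000000000, 0x0123456789ABCDEF, MASK]
byte_ok = all(rotr64_bytes(x, n) == rotr64(x, n)
              for x in samples for n in (16, 24, 32))
r63_ok = all(rotr64_63(x) == rotr64(x, 63) for x in samples)
```

The 63-bit case needs two dependent-free operations plus a combine, so it adds more latency to B than the byte-shuffle rotations add to D.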

This applies to all Blakes, Salsa and ChaCha, except that ChaCha and Salsa don't have message injection, so they don't suffer as much latency to start with and benefit less from these changes.

The bit rotation optimization issue is moot with AVX512 or the upcoming AVX10 due to the availability of the VPROR instructions (VPRORD/VPRORQ).

tweak diagonal shuffle in order to relax dependency on b

b0da1ff

sneves merged commit b372392 into sneves:master May 3, 2019

grafi-tt mentioned this pull request May 4, 2019

Tweak Chacha and Blake diagonalization to hide latency on b cryptocorrosion/cryptocorrosion#18

Merged

SergiySW added a commit to SergiySW/raiblocks that referenced this pull request May 14, 2019

Apply Blake2b AVX2 changes

b2a595d

sneves/blake2-avx2#4 About 10% speed up: tweak diagonal shuffle

SergiySW mentioned this pull request May 17, 2019

Apply Blake2b AVX2 changes nanocurrency/nano-node#1994

Merged

SergiySW added a commit to nanocurrency/nano-node that referenced this pull request May 17, 2019

Apply Blake2b AVX2 changes (#1994)

c0e2692

sneves/blake2-avx2#4 About 10% speed up: tweak diagonal shuffle

oconnor663 added a commit to oconnor663/blake2_simd that referenced this pull request Jun 14, 2019

port BLAKE2b AVX2 optimizations from libsodium 1.0.18

32065b5

The original source for these optimizations is sneves/blake2-avx2#4. Libsodium committed them at jedisct1/libsodium@80206ad.

oconnor663 mentioned this pull request Jul 22, 2019

Diagonal shuffle optimization for BLAKE2s BLAKE2/BLAKE2#56

Open

grafi-tt mentioned this pull request Aug 3, 2019

Updated Blake2s code to match diagonal shuffle tweak done in blake2-a… BLAKE2/BLAKE2#57

Merged

saucecontrol added a commit to saucecontrol/Blake2Fast that referenced this pull request Jun 13, 2020

faster SIMD implementations

6313bcd

based on: BLAKE2/BLAKE2#57 sneves/blake2-avx2#4

oconnor663 mentioned this pull request Apr 10, 2022

I need to clarify some things BLAKE3-team/BLAKE3#241

Open
