[For review] ChaCha SSE2 optimizations #616

randombit · 2016-09-01T17:33:34Z

Keeps the same 16-word state, but each time ChaCha runs to expand output it generates 4 blocks at once instead of just 1. Combined with unrolling and interleaving, this allows a lot of ILP in the SSE2 version, which seems important. My original single-block SSE2 version in 858e3be was only ~5-10% faster than the scalar version, I suspect because ILP there was very low (almost every result depended on the result of some earlier instruction). The same principle can be extended to an 8-way AVX2 ChaCha later on.
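For readers who want to see the shape of the 4-blocks-per-call idea, here is a minimal portable sketch. It is not the code in this PR: it uses plain scalar C++ rather than SSE2 intrinsics, the names (`State`, `chacha_block`, `chacha_x4`, `with_counter_offset`) are made up for illustration, and it assumes a 64-bit block counter held in state words 12 and 13. Each quarter round is issued for all four blocks before moving on, so consecutive operations belong to independent dependency chains.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative sketch only; names and layout are assumptions, not this PR's code.
using State = std::array<std::uint32_t, 16>;

inline std::uint32_t rotl32(std::uint32_t x, int n) {
    return (x << n) | (x >> (32 - n));
}

// One ChaCha quarter round on words a, b, c, d of a single block.
inline void quarter_round(State& x, int a, int b, int c, int d) {
    x[a] += x[b]; x[d] = rotl32(x[d] ^ x[a], 16);
    x[c] += x[d]; x[b] = rotl32(x[b] ^ x[c], 12);
    x[a] += x[b]; x[d] = rotl32(x[d] ^ x[a], 8);
    x[c] += x[d]; x[b] = rotl32(x[b] ^ x[c], 7);
}

// Copy of `input` with the block counter advanced by n.
// Assumption: 64-bit counter stored in words 12 (low) and 13 (high).
inline State with_counter_offset(const State& input, std::uint64_t n) {
    State s = input;
    std::uint64_t ctr = (static_cast<std::uint64_t>(s[13]) << 32) | s[12];
    ctr += n;
    s[12] = static_cast<std::uint32_t>(ctr);
    s[13] = static_cast<std::uint32_t>(ctr >> 32);
    return s;
}

// Reference path: one block per call.
inline State chacha_block(const State& input, std::size_t rounds) {
    assert(rounds % 2 == 0);
    State x = input;
    for (std::size_t i = 0; i < rounds; i += 2) {
        quarter_round(x, 0, 4, 8, 12);  quarter_round(x, 1, 5, 9, 13);   // columns
        quarter_round(x, 2, 6, 10, 14); quarter_round(x, 3, 7, 11, 15);
        quarter_round(x, 0, 5, 10, 15); quarter_round(x, 1, 6, 11, 12);  // diagonals
        quarter_round(x, 2, 7, 8, 13);  quarter_round(x, 3, 4, 9, 14);
    }
    for (std::size_t k = 0; k < 16; ++k) x[k] += input[k];
    return x;
}

// Four blocks per call: block b uses counter + b. The four blocks never
// depend on each other, so their quarter rounds can be interleaved.
inline std::array<State, 4> chacha_x4(State& input, std::size_t rounds) {
    assert(rounds % 2 == 0);
    std::array<State, 4> in, x;
    for (std::size_t b = 0; b < 4; ++b) in[b] = x[b] = with_counter_offset(input, b);

    for (std::size_t i = 0; i < rounds; i += 2) {
        for (auto& s : x) quarter_round(s, 0, 4, 8, 12);
        for (auto& s : x) quarter_round(s, 1, 5, 9, 13);
        for (auto& s : x) quarter_round(s, 2, 6, 10, 14);
        for (auto& s : x) quarter_round(s, 3, 7, 11, 15);
        for (auto& s : x) quarter_round(s, 0, 5, 10, 15);
        for (auto& s : x) quarter_round(s, 1, 6, 11, 12);
        for (auto& s : x) quarter_round(s, 2, 7, 8, 13);
        for (auto& s : x) quarter_round(s, 3, 4, 9, 14);
    }
    for (std::size_t b = 0; b < 4; ++b)
        for (std::size_t k = 0; k < 16; ++k) x[b][k] += in[b][k];

    // The shared state must move past all four blocks that were consumed.
    input = with_counter_offset(input, 4);
    return x;
}
```

Calling `chacha_x4(state, 20)` should give the same four blocks as four `chacha_block` calls with the counter advanced by 0 to 3. In a vectorized version the inner `for (auto& s : x)` loops would presumably become straight-line SIMD instructions on separate registers, which is where the extra ILP would come from.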

This version is ~60% faster than scalar on Skylake; I'm seeing about 830 MB/s here. Performance is probably not great on x86-32, which has only 8 architectural SSE registers.

The main driver of this work is having a very fast RNG for use by NewHope (#613), but a faster ChaCha is useful in lots of places.

Also worth noting: this takes a different tack from the traditional providers model used elsewhere. If SSE2 is available we jump to it, otherwise we use the scalar version, but it all lives in just ChaCha, with no ChaCha_SSE2 and company. I expect adopting this model across the library will be the final resolution to #477.
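A rough sketch of that dispatch model, under stated assumptions: `Mixer`, `mix4`, `cpu_has_sse2` and the `SKETCH_HAS_SSE2` macro are hypothetical names invented for this example, and the "cipher" is a toy add-and-rotate step rather than ChaCha. The point is only the shape: one class, a build-time guard around the SIMD code, and a runtime CPU check that falls through to the scalar path.

```cpp
#include <cstdint>

#if defined(__SSE2__)
  #include <emmintrin.h>
  #define SKETCH_HAS_SSE2 1  // hypothetical build-time feature macro
#endif

// Hypothetical runtime check. Uses the GCC/Clang builtin where the SSE2
// path was compiled in; reports false everywhere else.
inline bool cpu_has_sse2() {
#if defined(SKETCH_HAS_SSE2) && (defined(__GNUC__) || defined(__clang__))
    return __builtin_cpu_supports("sse2") != 0;
#else
    return false;
#endif
}

class Mixer {
public:
    // Single public entry point. Toy operation: out[i] = rotl(in[i] + k, 7).
    static void mix4(std::uint32_t out[4], const std::uint32_t in[4], std::uint32_t k) {
#if defined(SKETCH_HAS_SSE2)
        if (cpu_has_sse2()) {
            mix4_sse2(out, in, k);
            return;
        }
#endif
        mix4_scalar(out, in, k);
    }

    // Portable fallback, always compiled.
    static void mix4_scalar(std::uint32_t out[4], const std::uint32_t in[4], std::uint32_t k) {
        for (int i = 0; i < 4; ++i) {
            const std::uint32_t v = in[i] + k;
            out[i] = (v << 7) | (v >> 25);
        }
    }

#if defined(SKETCH_HAS_SSE2)
    // Same operation on all four words at once using SSE2 intrinsics.
    static void mix4_sse2(std::uint32_t out[4], const std::uint32_t in[4], std::uint32_t k) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
        v = _mm_add_epi32(v, _mm_set1_epi32(static_cast<int>(k)));
        v = _mm_or_si128(_mm_slli_epi32(v, 7), _mm_srli_epi32(v, 25));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out), v);
    }
#endif
};
```

Callers only ever see `Mixer::mix4`; there is no separate `Mixer_SSE2` type to register or select, which is the difference from a providers-style design.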


randombit · 2016-09-02T09:23:51Z

Checking on an older processor (2.4 GHz Westmere), best of 5 runs for 1 second: scalar was 240.5 MB/s and SSE2 was 402.4 MB/s, so again a speedup in the 60% range (about 1.67x here) from the current work.

It would be easy to translate this method to AVX2, but I will keep that for a future PR.

randombit · 2016-09-02T12:20:32Z

I don't plan further commits on this branch.

codecov-io · 2016-09-02T13:21:35Z

Current coverage is 79.14% (diff: 80.00%)

Merging #616 into master will increase coverage by 0.07%

@@             master       #616   diff @@
==========================================
  Files           376        377     +1   
  Lines         34005      34216   +211   
  Methods        3922       3923     +1   
  Messages          0          0          
  Branches       3719       3727     +8   
==========================================
+ Hits          26890      27081   +191   
- Misses         7086       7106    +20   
  Partials         29         29

Powered by Codecov. Last update e4656be...ac3d1ea

randombit added 6 commits September 1, 2016 13:20

SSE2 ChaCha

858e3be

ChaCha 4 ways

e358acf

4x interleaved SSE2

fc4b34d

Missing increment in SSE2 version, broke ChaCha20Poly1305 tests

But not any ChaCha20 tests due to no long test inputs. Add one.

3a887fa

Correct macro check

3fc924b

Avoid _mm_set_epi64x which is missing on 32-bit MSVC 12

ac3d1ea

randombit merged commit ac3d1ea into master Sep 5, 2016

randombit added a commit that referenced this pull request Sep 5, 2016

Merge GH #616 ChaCha SSE2 optimizations

5178ba7

randombit deleted the chacha-vec branch September 5, 2016 18:03
