[For review] ChaCha SSE2 optimizations #616

Merged
merged 6 commits into from Sep 5, 2016

randombit
Owner

This keeps the same 16-word state, but each time ChaCha runs to expand output it generates 4 blocks at once instead of just 1. Combined with unrolling and interleaving, that exposes a lot of instruction-level parallelism (ILP) in the SSE2 version, which seems to be important. My original single-block SSE2 version in 858e3be was only ~5-10% faster than the scalar version, I suspect because ILP there was very low (almost every result depended on the result of some earlier instruction). The same principle can be extended to an 8-way AVX2 ChaCha later on.
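To make the 4-blocks-at-once idea concrete, here is a minimal portable sketch. It is not the code from this PR: `Vec4` is a plain array standing in for one `__m128i`, all names (`quarter_round`, `twenty_rounds`, `chacha_block`, `chacha_4blocks`) are hypothetical, only word 12 is treated as the block counter with no carry into word 13, and output is left as 32-bit words rather than serialized bytes.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using State  = std::array<uint32_t, 16>;  // one 16-word ChaCha state
using Vec4   = std::array<uint32_t, 4>;   // stand-in for one __m128i (4 x 32-bit lanes)
using State4 = std::array<Vec4, 16>;      // word i of four blocks, side by side

inline uint32_t rotl(uint32_t v, unsigned n) { return (v << n) | (v >> (32 - n)); }

// Scalar quarter round: each step needs the result of the previous one,
// so a single block offers very little ILP.
inline void quarter_round(State& x, int a, int b, int c, int d) {
   x[a] += x[b]; x[d] = rotl(x[d] ^ x[a], 16);
   x[c] += x[d]; x[b] = rotl(x[b] ^ x[c], 12);
   x[a] += x[b]; x[d] = rotl(x[d] ^ x[a], 8);
   x[c] += x[d]; x[b] = rotl(x[b] ^ x[c], 7);
}

// 4-way quarter round: the same chain, applied to four independent lanes.
// With SSE2 each inner statement would be one vector add, xor, or shift/or.
inline void quarter_round(State4& x, int a, int b, int c, int d) {
   for(size_t l = 0; l != 4; ++l) {
      x[a][l] += x[b][l]; x[d][l] = rotl(x[d][l] ^ x[a][l], 16);
      x[c][l] += x[d][l]; x[b][l] = rotl(x[b][l] ^ x[c][l], 12);
      x[a][l] += x[b][l]; x[d][l] = rotl(x[d][l] ^ x[a][l], 8);
      x[c][l] += x[d][l]; x[b][l] = rotl(x[b][l] ^ x[c][l], 7);
   }
}

// 20 rounds = 10 x (column round + diagonal round); works for either layout.
template <typename S>
void twenty_rounds(S& x) {
   for(int i = 0; i != 10; ++i) {
      quarter_round(x, 0, 4,  8, 12); quarter_round(x, 1, 5,  9, 13);
      quarter_round(x, 2, 6, 10, 14); quarter_round(x, 3, 7, 11, 15);
      quarter_round(x, 0, 5, 10, 15); quarter_round(x, 1, 6, 11, 12);
      quarter_round(x, 2, 7,  8, 13); quarter_round(x, 3, 4,  9, 14);
   }
}

// Reference: expand a single block from the input state.
State chacha_block(const State& in) {
   State x = in;
   twenty_rounds(x);
   for(size_t i = 0; i != 16; ++i) x[i] += in[i];
   return x;
}

// Expand four consecutive blocks (counter, counter+1, ..., counter+3) in one pass.
std::array<State, 4> chacha_4blocks(const State& in) {
   State4 x;
   for(size_t i = 0; i != 16; ++i)
      for(size_t l = 0; l != 4; ++l) x[i][l] = in[i];
   for(size_t l = 0; l != 4; ++l) x[12][l] += static_cast<uint32_t>(l);  // per-lane counter

   const State4 start = x;
   twenty_rounds(x);

   std::array<State, 4> out;
   for(size_t l = 0; l != 4; ++l)
      for(size_t i = 0; i != 16; ++i) out[l][i] = x[i][l] + start[i][l];
   return out;
}
```

Block `k` returned by `chacha_4blocks(in)` should equal `chacha_block` applied to `in` with word 12 advanced by `k`. A real SSE2 version would also need to transpose or interleave the lanes when writing the keystream out, which this sketch skips.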

This version is ~60% faster than scalar on Skylake; I'm seeing about 830 MB/s here. Performance is probably not great on x86-32, which has only 8 architectural SSE registers.

The main driver of this work is having a very fast RNG for use by NewHope (#613), but faster ChaCha is useful in lots of places.

Also worth noting: this takes a different tack from the traditional providers model used elsewhere. If SSE2 is available we jump to it, otherwise we use the scalar version, but it all lives in plain ChaCha, with no ChaCha_SSE2 and company. I expect adopting this model across the library will be the final resolution to #477.
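A rough sketch of what "one class, internal dispatch" could look like follows. Everything in it is hypothetical and not the library's API: `StreamSketch`, `cpu_has_sse2`, and the placeholder `block` function are made up for illustration, and the CPU check is a compile-time stand-in for what would be a runtime CPUID query.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a runtime CPU feature check. A real
// implementation would query CPUID; this only inspects compiler macros.
inline bool cpu_has_sse2() {
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
   return true;
#else
   return false;
#endif
}

class StreamSketch {
   public:
      explicit StreamSketch(bool use_wide = cpu_has_sse2()) : m_use_wide(use_wide) {}

      // Single public entry point: callers never name an implementation.
      std::vector<uint32_t> expand(uint32_t counter, size_t blocks) const {
         std::vector<uint32_t> out(blocks);
         if(m_use_wide)
            expand_wide(out, counter);
         else
            expand_scalar(out, counter);
         return out;
      }

      bool uses_wide_path() const { return m_use_wide; }

   private:
      // Placeholder "block function" (NOT ChaCha), one word per counter value.
      static uint32_t block(uint32_t counter) { return counter * 2654435761u; }

      // Scalar path: one block per iteration.
      static void expand_scalar(std::vector<uint32_t>& out, uint32_t counter) {
         for(size_t i = 0; i != out.size(); ++i)
            out[i] = block(counter + static_cast<uint32_t>(i));
      }

      // Wide path: four blocks per iteration, then a scalar tail.
      static void expand_wide(std::vector<uint32_t>& out, uint32_t counter) {
         size_t i = 0;
         for(; i + 4 <= out.size(); i += 4)
            for(size_t l = 0; l != 4; ++l)
               out[i + l] = block(counter + static_cast<uint32_t>(i + l));
         for(; i != out.size(); ++i)
            out[i] = block(counter + static_cast<uint32_t>(i));
      }

      bool m_use_wide;
};
```

Both paths must produce identical output, so forcing each one through the constructor flag (`StreamSketch(true)` versus `StreamSketch(false)`) gives a simple way to test the fast path against the scalar one without a separate provider class.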

@randombit
Owner Author

Checking on an older processor, a 2.4 GHz Westmere, taking the best of 5 runs of 1 second each: scalar reached 240.5 MB/s and SSE2 reached 402.4 MB/s, a speedup of about 67%, in line with the roughly 60% seen on Skylake.

It would be easy to translate this method to AVX2, but I will keep that for a future PR.

@randombit
Owner Author

I don't plan further commits on this branch

@codecov-io

Current coverage is 79.14% (diff: 80.00%)

Merging #616 into master will increase coverage by 0.07%

@@             master       #616   diff @@
==========================================
  Files           376        377     +1   
  Lines         34005      34216   +211   
  Methods        3922       3923     +1   
  Messages          0          0          
  Branches       3719       3727     +8   
==========================================
+ Hits          26890      27081   +191   
- Misses         7086       7106    +20   
  Partials         29         29          

Powered by Codecov. Last update e4656be...ac3d1ea

@randombit randombit merged commit ac3d1ea into master Sep 5, 2016
randombit added a commit that referenced this pull request Sep 5, 2016
@randombit randombit deleted the chacha-vec branch September 5, 2016 18:03