[For review] ChaCha SSE2 optimizations #616
Merged
This keeps the same 16-word state, but each time ChaCha runs to expand an output it generates 4 blocks at once instead of just 1. Unrolling and interleaving the four blocks allows a lot of ILP in the SSE2 version, which seems important. My original single-block SSE2 version in 858e3be was only ~5-10% faster than the scalar version, I suspect because ILP there was very low (almost every result depended on the result of some earlier instruction). The same principle can be extended to an 8-way AVX2 ChaCha later on.
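To make the four-blocks-at-once idea concrete, here is a minimal sketch, not the code in this PR. It assumes a 32-bit block counter in word 12 with no carry into word 13, 20 rounds, and output stored one block per row; the names `chacha20_block`, `chacha20_x4`, `qr_x4` and `rotl_x4` are made up for illustration. Each `__m128i` holds the same state word for four consecutive blocks, so the four quarter rounds of a column or diagonal step are independent of each other and can overlap in the pipeline.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics; x86 / x86-64 only

inline uint32_t rotl32(uint32_t x, int n) { return (x << n) | (x >> (32 - n)); }

// Scalar quarter round: each step consumes the result of the previous one.
inline void qr(uint32_t& a, uint32_t& b, uint32_t& c, uint32_t& d) {
    a += b; d ^= a; d = rotl32(d, 16);
    c += d; b ^= c; b = rotl32(b, 12);
    a += b; d ^= a; d = rotl32(d, 8);
    c += d; b ^= c; b = rotl32(b, 7);
}

// Scalar reference: one 16-word output block from one 16-word input state.
inline void chacha20_block(const uint32_t in[16], uint32_t out[16]) {
    uint32_t x[16];
    for (int i = 0; i < 16; ++i) x[i] = in[i];
    for (int r = 0; r < 10; ++r) {  // 10 double rounds = 20 rounds
        qr(x[0], x[4], x[8], x[12]);  qr(x[1], x[5], x[9], x[13]);
        qr(x[2], x[6], x[10], x[14]); qr(x[3], x[7], x[11], x[15]);
        qr(x[0], x[5], x[10], x[15]); qr(x[1], x[6], x[11], x[12]);
        qr(x[2], x[7], x[8], x[13]);  qr(x[3], x[4], x[9], x[14]);
    }
    for (int i = 0; i < 16; ++i) out[i] = x[i] + in[i];
}

// Rotate each of the four 32-bit lanes left by N bits.
template <int N>
inline __m128i rotl_x4(__m128i v) {
    return _mm_or_si128(_mm_slli_epi32(v, N), _mm_srli_epi32(v, 32 - N));
}

// Quarter round on four blocks at once: lane i of every register is block i.
inline void qr_x4(__m128i& a, __m128i& b, __m128i& c, __m128i& d) {
    a = _mm_add_epi32(a, b); d = _mm_xor_si128(d, a); d = rotl_x4<16>(d);
    c = _mm_add_epi32(c, d); b = _mm_xor_si128(b, c); b = rotl_x4<12>(b);
    a = _mm_add_epi32(a, b); d = _mm_xor_si128(d, a); d = rotl_x4<8>(d);
    c = _mm_add_epi32(c, d); b = _mm_xor_si128(b, c); b = rotl_x4<7>(b);
}

// Four consecutive blocks from one input state. Assumption: word 12 is a
// 32-bit counter and counter+3 does not need to carry into word 13.
inline void chacha20_x4(const uint32_t in[16], uint32_t out[4][16]) {
    __m128i init[16], x[16];
    for (int i = 0; i < 16; ++i) init[i] = _mm_set1_epi32(static_cast<int>(in[i]));
    // Lane 0 keeps the counter, lanes 1..3 get counter+1..counter+3.
    init[12] = _mm_add_epi32(init[12], _mm_set_epi32(3, 2, 1, 0));
    for (int i = 0; i < 16; ++i) x[i] = init[i];
    for (int r = 0; r < 10; ++r) {
        // The four calls in each group touch disjoint registers.
        qr_x4(x[0], x[4], x[8], x[12]);  qr_x4(x[1], x[5], x[9], x[13]);
        qr_x4(x[2], x[6], x[10], x[14]); qr_x4(x[3], x[7], x[11], x[15]);
        qr_x4(x[0], x[5], x[10], x[15]); qr_x4(x[1], x[6], x[11], x[12]);
        qr_x4(x[2], x[7], x[8], x[13]);  qr_x4(x[3], x[4], x[9], x[14]);
    }
    for (int i = 0; i < 16; ++i) {
        uint32_t lanes[4];
        _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes),
                         _mm_add_epi32(x[i], init[i]));
        for (int b = 0; b < 4; ++b) out[b][i] = lanes[b];  // un-transpose
    }
}
```

Calling `chacha20_x4(state, out)` should give the same words as calling `chacha20_block` four times with the counter in word 12 incremented by 0 through 3. A tuned version would likely avoid the per-word store and copy at the end, which is kept here only for clarity.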
This version is ~60% faster than scalar on Skylake, seeing about 830 MB/s here. Performance is probably not great on x86-32, which has only 8 architectural SSE registers.
Main driver of this work is having a very fast RNG for use by NewHope (#613) but faster ChaCha is useful in lots of places.
Also worth noting: this takes a different tack from the traditional providers model used elsewhere. If SSE2 is available we jump to it, otherwise we use the scalar version, but it is all within just `ChaCha`, with no `ChaCha_SSE2` and company. I expect adopting this model across the library will be the final resolution to #477.
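The single-type dispatch model described above can be sketched roughly as follows. This is a toy illustration of the pattern, not this library's code: `PadXor`, `apply`, `cpu_has_sse2` and the `SKETCH_HAS_SSE2` macro are hypothetical names, and a simple XOR stands in for the cipher core. The point is that callers see one type, and the choice between the scalar and SSE2 paths happens inside it.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
  #include <emmintrin.h>
  #define SKETCH_HAS_SSE2 1  // hypothetical build flag for this sketch
#endif

// One public type; no separate PadXor_SSE2 class for callers to pick.
class PadXor {
public:
    // XORs a repeating 16-byte pad into buf.
    // Assumption: len is a multiple of 16.
    static void apply(uint8_t* buf, size_t len, const uint8_t pad[16]) {
#if defined(SKETCH_HAS_SSE2)
        if (cpu_has_sse2()) { apply_sse2(buf, len, pad); return; }
#endif
        apply_scalar(buf, len, pad);
    }

private:
    // Stand-in for a runtime CPU feature query.
    static bool cpu_has_sse2() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
        return __builtin_cpu_supports("sse2") != 0;
#else
        return true;  // assumption: SSE2 is baseline wherever this path is built
#endif
    }

    static void apply_scalar(uint8_t* buf, size_t len, const uint8_t pad[16]) {
        for (size_t i = 0; i < len; ++i) buf[i] ^= pad[i % 16];
    }

#if defined(SKETCH_HAS_SSE2)
    static void apply_sse2(uint8_t* buf, size_t len, const uint8_t pad[16]) {
        const __m128i p = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pad));
        for (size_t i = 0; i < len; i += 16) {
            __m128i* dst = reinterpret_cast<__m128i*>(buf + i);
            _mm_storeu_si128(dst, _mm_xor_si128(_mm_loadu_si128(dst), p));
        }
    }
#endif
};
```

Both paths produce identical output, so callers never need to know which one ran. The trade-off is that the accelerated path can no longer be selected or excluded by provider name, only by build flags and CPU detection.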