
[Optim] SIMD for Pornin's GCD inverse #62

Closed
mratsim opened this issue Mar 9, 2021 · 3 comments

Comments

@mratsim
Contributor

mratsim commented Mar 9, 2021

The current implementation of inversion closely follows @pornin's paper and uses an inner loop with four 31-bit integers stored in two 64-bit registers:

$code.=<<___;
.type __inner_loop_31,\@abi-omnipotent
.align 32
__inner_loop_31: ################# by Thomas Pornin
mov \$0x7FFFFFFF80000000, $fg0 # |f0|=1, |g0|=0
mov \$0x800000007FFFFFFF, $fg1 # |f1|=0, |g1|=1
mov \$0x7FFFFFFF7FFFFFFF, $bias
.Loop_31:

https://github.com/pornin/bingcd/blob/700cc4d/src/gf25519.c#L1098-L1145

[screenshot of the inner loop in gf25519.c from the link above]
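
To illustrate the packing behind those constants, here is a minimal C sketch (one plausible reading of the snippet above, not code from bingcd or this repo): the two signed update factors share one 64-bit word, each stored with an additive bias of 2^31 - 1, so that pack_fg(1, 0) and pack_fg(0, 1) reproduce the fg0 and fg1 initial values shown above.

#include <stdint.h>

#define BIAS31 0x7FFFFFFFu  /* 2^31 - 1, the additive bias applied to each half */

/* f in the low 32 bits, g in the high 32 bits, both stored as value + BIAS31 */
static inline uint64_t pack_fg(int32_t f, int32_t g) {
    return (uint64_t)((uint32_t)f + BIAS31)
         | ((uint64_t)((uint32_t)g + BIAS31) << 32);
}

static inline void unpack_fg(uint64_t fg, int32_t *f, int32_t *g) {
    *f = (int32_t)((uint32_t)fg         - BIAS31);
    *g = (int32_t)((uint32_t)(fg >> 32) - BIAS31);
}

/* pack_fg(1, 0) == 0x7FFFFFFF80000000 (fg0), pack_fg(0, 1) == 0x800000007FFFFFFF (fg1) */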

Instead, it is possible to store four 62-bit integers in two 128-bit SIMD registers, which are available on x86-64 (which implies SSE2) and ARM Neon.
This would allow the inner loop to run 62 iterations instead of 31 and should remove the need for two distinct loops.

The main benefits would be:

  • faster variable-time inversion, since we could eliminate more than 31 zero bits in a single iteration.
  • less bookkeeping, since the hot loop is bigger and there is no longer a separate slow loop.
  • lower general-purpose register pressure on x86-64 (if that was ever an issue).
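
A minimal sketch of the intended layout, assuming SSE2 and a hypothetical lane assignment (f in lane 0, g in lane 1); a single 64-bit lane-wise add or subtract then updates a whole (f, g) pair at once:

#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

typedef __m128i fg_pair;  /* lane 0: f, lane 1: g (both up to 62 bits) */

static inline fg_pair fg_set(int64_t f, int64_t g) {
    return _mm_set_epi64x(g, f);  /* _mm_set_epi64x takes the high lane first */
}

/* (f0, g0) += (f1, g1) and (f0, g0) -= (f1, g1), one instruction each */
static inline fg_pair fg_add(fg_pair a, fg_pair b) { return _mm_add_epi64(a, b); }
static inline fg_pair fg_sub(fg_pair a, fg_pair b) { return _mm_sub_epi64(a, b); }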

However, I do not know whether CMOV and a conditional swap can be emulated efficiently in pure SSE2.
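
A mask-based emulation would look roughly like this (a hedged sketch, not benchmarked): `mask` must be all-zeros or all-ones in every bit, e.g. a compare result or a broadcast sign bit.

#include <emmintrin.h>  /* SSE2 */

/* CMOV emulation: returns a where mask bits are 0, b where mask bits are 1 */
static inline __m128i sel128(__m128i a, __m128i b, __m128i mask) {
    return _mm_or_si128(_mm_andnot_si128(mask, a), _mm_and_si128(mask, b));
}

/* conditional swap of *a and *b when mask is all-ones */
static inline void cswap128(__m128i *a, __m128i *b, __m128i mask) {
    __m128i t = _mm_and_si128(_mm_xor_si128(*a, *b), mask);
    *a = _mm_xor_si128(*a, t);
    *b = _mm_xor_si128(*b, t);
}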

In terms of use-cases:

  • inversion is used once per pairing but only represents a small portion of the total cost
  • inversion will likely be used for the inverse FFTs in polynomial commitments
@pornin

pornin commented Mar 9, 2021

Take care not to do too many iterations here. If you do t iterations in the inner loop, you need to work with words of at least 2t+2 bits (t "low bits" and t+2 "high bits"). My code with 64-bit words thus uses 31 iterations, for a 33/31 split. The two extra "high" bits are needed for the approximation to remain valid. It is conceivable that a single extra bit would be enough, but the proof in the paper does not cover that. It is known that with no extra bit, there are "bad cases" in which the algorithm does not work.

With SSE2, the following are tempting:

  • Use 128-bit-or-so approximations, for maybe up to 63 inner iterations.
  • Pack the "update factors" together for faster swaps (i.e. f0 and g0 in the same register, and similarly for f1 and g1).

Unfortunately, there is no conditional move in SSE2 (or even AVX2); you need to emulate that with boolean operations. Compounding the effect is that there is no unsigned comparison opcode. You have pcmpgtd for 32-bit signed comparison, and (on SSE4.2 and later) pcmpgtq for 64-bit signed comparison (the latter has a large 3-cycle latency, though). You can also use psubq to make a 64-bit subtraction and then "extend" the sign bit, but the extension requires some extra instructions (and thus latency), of course. In all cases, you cannot do an unsigned 64-bit comparison, which forces you to use at most 63 bits for the "high bits", and thus a maximum of 61 iterations in the inner loop, not 62 (unless you take care to prove that a single extra bit is enough, but this does not seem to be easy and it might not even be true, for all I know).

You can make a sort-of unsigned 64-bit comparison by adding an offset (i.e. to compare x with y, both unsigned over 64 bits, you can do a signed comparison between x-2^63 and y-2^63) but that leads to extra additions at some point in the algorithm, which is usually not worth it.
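
A rough sketch of that offset trick with intrinsics (assuming SSE4.2's pcmpgtq, i.e. _mm_cmpgt_epi64; XOR-ing the sign bit is the same as subtracting 2^63 modulo 2^64):

#include <immintrin.h>  /* _mm_cmpgt_epi64 requires SSE4.2 */
#include <stdint.h>

/* per-lane unsigned 64-bit "x < y" mask, via biased signed comparison */
static inline __m128i cmplt_u64(__m128i x, __m128i y) {
    const __m128i sign = _mm_set1_epi64x(INT64_MIN);  /* 0x8000000000000000 */
    /* x < y (unsigned)  <=>  (x ^ 2^63) < (y ^ 2^63) (signed) */
    return _mm_cmpgt_epi64(_mm_xor_si128(y, sign), _mm_xor_si128(x, sign));
}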

I have tried it for a few hours and could not make the inner loop as fast as the version with integer registers, but that does not mean that it is not possible, only that it is likely to require some effort. Also, I tried only on Skylake CPUs; things may be different on other kinds of cores (and especially on ARM/Neon).

@dot-asm
Collaborator

dot-asm commented Mar 11, 2021

Note that the ctq and arm modules do use wider approximations. It's all about a delicate balance between how fast you can get the inner loop to spin vs. how fast or slow the multiplications are in comparison. The inner loop, in turn, is mostly about the instruction-level parallelism the target processor can achieve, which is also quite a delicate matter. Thomas's inner loop happens to achieve close to the maximum, and in combination with fast multiplications it turns out to be the best choice for mulx-capable processors. You won't do better in SIMD, because the instruction count won't be significantly lower than in the scalar version (the instruction capabilities are not as rich), while the maximum achievable ILP would be lower, especially on most Cortex cores. One can mix scalar and vector, but then you introduce additional dependencies (and instructions) as you "move" flag values to the vector registers. [Also, mixed instruction scheduling tends to be more problematic for out-of-order execution logic on x86, which manifests as slightly lower than expected throughput.]

@mratsim mratsim closed this as completed Mar 11, 2021
@mratsim
Contributor Author

mratsim commented Mar 11, 2021

Thank you both for the detailed analysis. Closing then.
