
[Optim] SIMD for Pornin's GCD inverse #62

Closed
mratsim opened this issue Mar 9, 2021 · 3 comments

Comments

@mratsim
Contributor

mratsim commented Mar 9, 2021

The current implementation of inversion closely follows @pornin's paper and uses an inner loop with four 31-bit integers stored in two 64-bit registers:

$code.=<<___;
.type __inner_loop_31,\@abi-omnipotent
.align 32
__inner_loop_31: ################# by Thomas Pornin
mov \$0x7FFFFFFF80000000, $fg0 # |f0|=1, |g0|=0
mov \$0x800000007FFFFFFF, $fg1 # |f1|=0, |g1|=1
mov \$0x7FFFFFFF7FFFFFFF, $bias
.Loop_31:

https://github.com/pornin/bingcd/blob/700cc4d/src/gf25519.c#L1098-L1145

[screenshot of the inner loop in gf25519.c from the link above]
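
To illustrate the packing behind those constants, here is a minimal C sketch (one plausible reading of the snippet above, not code from bingcd or this repo): the two signed update factors share one 64-bit word, each stored with an additive bias of 2^31 - 1, so that pack_fg(1, 0) and pack_fg(0, 1) reproduce the fg0 and fg1 initial values shown above.

#include <stdint.h>

#define BIAS31 0x7FFFFFFFu  /* 2^31 - 1, the additive bias applied to each half */

/* f in the low 32 bits, g in the high 32 bits, both stored as value + BIAS31 */
static inline uint64_t pack_fg(int32_t f, int32_t g) {
    return (uint64_t)((uint32_t)f + BIAS31)
         | ((uint64_t)((uint32_t)g + BIAS31) << 32);
}

static inline void unpack_fg(uint64_t fg, int32_t *f, int32_t *g) {
    *f = (int32_t)((uint32_t)fg         - BIAS31);
    *g = (int32_t)((uint32_t)(fg >> 32) - BIAS31);
}

/* pack_fg(1, 0) == 0x7FFFFFFF80000000 (fg0), pack_fg(0, 1) == 0x800000007FFFFFFF (fg1) */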

Instead, it is possible to store four 62-bit integers in two 128-bit SIMD registers, which are available on x86-64 (which implies SSE2) and ARM Neon.
This would allow the inner loop to run 62 iterations instead of 31 and should remove the need for two distinct loops.

The main benefits would be:

  • faster variable-time inversion, since we could eliminate more than 31 zero bits in a single iteration.
  • less bookkeeping, since the hot loop is bigger and there is no longer a separate slow loop.
  • lower general-purpose register pressure on x86-64 (if that was ever an issue).
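
A minimal sketch of the intended layout, assuming SSE2 and a hypothetical lane assignment (f in lane 0, g in lane 1); a single 64-bit lane-wise add or subtract then updates a whole (f, g) pair at once:

#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

typedef __m128i fg_pair;  /* lane 0: f, lane 1: g (both up to 62 bits) */

static inline fg_pair fg_set(int64_t f, int64_t g) {
    return _mm_set_epi64x(g, f);  /* _mm_set_epi64x takes the high lane first */
}

/* (f0, g0) += (f1, g1) and (f0, g0) -= (f1, g1), one instruction each */
static inline fg_pair fg_add(fg_pair a, fg_pair b) { return _mm_add_epi64(a, b); }
static inline fg_pair fg_sub(fg_pair a, fg_pair b) { return _mm_sub_epi64(a, b); }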

However, I do not know whether CMOV and a conditional swap can be emulated efficiently in pure SSE2.
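
A mask-based emulation would look roughly like this (a hedged sketch, not benchmarked): `mask` must be all-zeros or all-ones in every bit, e.g. a compare result or a broadcast sign bit.

#include <emmintrin.h>  /* SSE2 */

/* CMOV emulation: returns a where mask bits are 0, b where mask bits are 1 */
static inline __m128i sel128(__m128i a, __m128i b, __m128i mask) {
    return _mm_or_si128(_mm_andnot_si128(mask, a), _mm_and_si128(mask, b));
}

/* conditional swap of *a and *b when mask is all-ones */
static inline void cswap128(__m128i *a, __m128i *b, __m128i mask) {
    __m128i t = _mm_and_si128(_mm_xor_si128(*a, *b), mask);
    *a = _mm_xor_si128(*a, t);
    *b = _mm_xor_si128(*b, t);
}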

In terms of use-cases:

  • inversion is used once per pairing but only represents a small portion of the total cost
  • inversion will likely be used for the inverse FFTs in polynomial commitments
@pornin

pornin commented Mar 9, 2021

Take care not to do too many iterations here. If you do t iterations in the inner loop, you need to work with words of at least 2t+2 bits (t "low bits" and t+2 "high bits"). My code with 64-bit words thus uses 31 iterations, for a 33/31 split. The two extra "high" bits are needed for the approximation to remain valid. It is conceivable that a single extra bit would be enough, but the proof in the paper does not cover that. It is known that with no extra bit, there are "bad cases" in which the algorithm does not work.

With SSE2, the following are tempting:

  • Use 128-bit-or-so approximations, for maybe up to 63 inner iterations.
  • Pack the "update factors" together for faster swaps (i.e. f0 and g0 in the same register, and similarly for f1 and g1).

Unfortunately, there is no conditional move in SSE2 (or even AVX2); you need to emulate that with boolean operations. Compounding the effect is that there is no unsigned comparison opcode. You have pcmpgtd for 32-bit signed comparison, and (on SSE4.2 and later) pcmpgtq for 64-bit signed comparison (the latter has a large 3-cycle latency, though). You can also use psubq to make a 64-bit subtraction and then "extend" the sign bit, but the extension requires some extra instructions (and thus latency), of course. In all cases, you cannot do an unsigned 64-bit comparison, which forces you to use at most 63 bits for the "high bits", and thus a maximum of 61 iterations in the inner loop, not 62 (unless you take care to prove that a single extra bit is enough, but this does not seem to be easy and it might not even be true, for all I know).

You can make a sort-of unsigned 64-bit comparison by adding an offset (i.e. to compare x with y, both unsigned over 64 bits, you can do a signed comparison between x-2^63 and y-2^63) but that leads to extra additions at some point in the algorithm, which is usually not worth it.
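
A rough sketch of that offset trick with intrinsics (assuming SSE4.2's pcmpgtq, i.e. _mm_cmpgt_epi64; XOR-ing the sign bit is the same as subtracting 2^63 modulo 2^64):

#include <immintrin.h>  /* _mm_cmpgt_epi64 requires SSE4.2 */
#include <stdint.h>

/* per-lane unsigned 64-bit "x < y" mask, via biased signed comparison */
static inline __m128i cmplt_u64(__m128i x, __m128i y) {
    const __m128i sign = _mm_set1_epi64x(INT64_MIN);  /* 0x8000000000000000 */
    /* x < y (unsigned)  <=>  (x ^ 2^63) < (y ^ 2^63) (signed) */
    return _mm_cmpgt_epi64(_mm_xor_si128(y, sign), _mm_xor_si128(x, sign));
}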

I have tried it for a few hours and could not make the inner loop as fast as the version with integer registers, but that does not mean that it is not possible, only that it is likely to require some effort. Also, I tried only on Skylake CPUs; things may be different on other kinds of cores (and especially on ARM/Neon).

@dot-asm
Collaborator

dot-asm commented Mar 11, 2021

Note that the ctq and arm modules do use wider approximations. It's all about a delicate balance between how fast you can get the inner loop to spin vs. how fast or slow the multiplications are in comparison. The inner loop, in turn, is mostly about the instruction-level parallelism the target processor can achieve, which is also quite a delicate matter. Thomas's inner loop happens to achieve close to the maximum, and in combination with fast multiplications it turns out to be the best choice for mulx-capable processors. You won't do better in SIMD, because the instruction count won't be significantly lower than in the scalar version (the instruction capabilities are not as rich), while the maximum achievable ILP would be lower, especially on most Cortex cores. One can mix scalar and vector, but then you introduce additional dependencies (and instructions) as you "move" flag values to the vector registers. [Also, mixed instruction scheduling tends to be more problematic for out-of-order execution logic on x86, which manifests as slightly lower than expected throughput.]

@mratsim mratsim closed this as completed Mar 11, 2021
@mratsim
Contributor Author

mratsim commented Mar 11, 2021

Thank you both for the detailed analysis. Closing then.
