Benchmark vs MCL on x86 #47

Closed · mratsim opened this issue Apr 8, 2020 · 3 comments

mratsim commented Apr 8, 2020

How to reproduce

MCL

git clone https://github.com/herumi/mcl
cd mcl

make -j ${ncpu}
make bin/bls12_test.exe # even on Linux

bin/bls12_test.exe

nim-blscurve

git clone https://github.com/status-im/nim-blscurve
cd nim-blscurve
nimble bench

Results

On an i9-9980XE. Note: the CPU is overclocked to 4.1 GHz while the nominal clock is 3.0 GHz, so cycle counts derived from the nominal clock are off by a factor of 4.1/3.0 ≈ 1.37.

Reading the results:

  • G2 multiplication corresponds to signing
  • pairing corresponds to verification; a pairing is composed, among other things, of a Miller loop and a final exponentiation
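
For context, in the BLS scheme used here (public keys in G1, signatures in G2, hash-to-G2), signing computes S = sk · H(m) in G2, so its cost is dominated by one G2 scalar multiplication, and verification checks e(g1, S) = e(pk, H(m)), i.e. two pairings, usually evaluated as a double/multi-pairing so that the two Miller loops share a single final exponentiation.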

MCL using JIT (x86-only)

The important lines are highlighted with arrows.

$  bin/bls12_test.exe
JIT 1
ctest:module=size
ctest:module=naive
i=0 curve=BLS12_381
G1
G2
GT
G1::mulCT      198.538Kclk <-------------------------
G1::mul        207.139Kclk <-------------------------
G1::add          1.252Kclk <-------------------------
G1::dbl        880.09 clk
G2::mulCT      383.361Kclk <-------------------------
G2::mul        407.538Kclk <-------------------------
G2::add          3.424Kclk <-------------------------
G2::dbl          2.169Kclk
GT::pow        802.285Kclk
G1::setStr chk 289.740Kclk
G1::setStr       2.606Kclk
G2::setStr chk 723.913Kclk
G2::setStr       5.467Kclk
hashAndMapToG1 225.867Kclk
hashAndMapToG2 467.456Kclk
Fp::add         14.27 clk
Fp::sub          9.28 clk
Fp::neg          8.05 clk
Fp::mul        107.01 clk
Fp::sqr         98.88 clk
Fp::inv          4.542Kclk
Fp2::add        19.23 clk
Fp2::sub        14.30 clk
Fp2::neg        13.90 clk
Fp2::mul       305.68 clk
Fp2::mul_xi     22.51 clk
Fp2::sqr       228.53 clk
Fp2::inv         5.111Kclk
FpDbl::addPre   13.24 clk
FpDbl::subPre   13.21 clk
FpDbl::add      19.52 clk
FpDbl::sub      12.88 clk
FpDbl::mulPre   46.70 clk
FpDbl::sqrPre   40.28 clk
FpDbl::mod      58.60 clk
Fp2Dbl::mulPre  203.47 clk
Fp2Dbl::sqrPre  113.19 clk
GT::add        114.57 clk
GT::mul          7.663Kclk
GT::sqr          5.490Kclk
GT::inv         16.616Kclk
FpDbl::mulPre   46.79 clk
pairing          2.237Mclk <-------------------------
millerLoop     906.605Kclk <-------------------------
finalExp         1.329Mclk <-------------------------
precomputeG2   201.104Kclk
precomputedML  686.266Kclk
millerLoopVec    4.616Mclk
ctest:module=finalExp
finalExp   1.338Mclk
ctest:module=mul_012
ctest:module=pairing
ctest:module=multi
BN254
calcBN1  32.201Kclk
naiveG2  17.132Kclk
calcBN2  64.564Kclk
naiveG2  46.680Kclk
BLS12_381
calcBN1  77.113Kclk
naiveG1  56.032Kclk
calcBN2 157.989Kclk
naiveG2 129.702Kclk
ctest:module=eth2
mapToG2  org-cofactor 904.585Kclk
mapToG2 fast-cofactor 476.747Kclk
ctest:module=deserialize
verifyOrder(1)
deserializeG1 345.128Kclk
deserializeG2 842.046Kclk
verifyOrder(0)
deserializeG1  57.978Kclk
deserializeG2 121.257Kclk
ctest:name=bls12_test, module=8, total=3600, ok=3600, ng=0, exception=0

MCL using Assembly from LLVM i256 and i384 (x86 and ARM)

$  bin/bls12_test.exe -m llvm_mont
JIT 1
ctest:module=size
ctest:module=naive
i=0 curve=BLS12_381
G1
G2
GT
G1::mulCT      208.328Kclk <-------------------------
G1::mul        212.953Kclk <-------------------------
G1::add          1.279Kclk <-------------------------
G1::dbl        919.59 clk
G2::mulCT      522.604Kclk <-------------------------
G2::mul        537.523Kclk <-------------------------
G2::add          4.638Kclk <-------------------------
G2::dbl          2.704Kclk
GT::pow        916.182Kclk
G1::setStr chk 301.920Kclk
G1::setStr       2.684Kclk
G2::setStr chk 935.164Kclk
G2::setStr       5.723Kclk
hashAndMapToG1 239.817Kclk
hashAndMapToG2 582.365Kclk
Fp::add         13.37 clk
Fp::sub         13.86 clk
Fp::neg          8.78 clk
Fp::mul        110.30 clk
Fp::sqr        110.73 clk
Fp::inv          4.594Kclk
Fp2::add        25.68 clk
Fp2::sub        27.10 clk
Fp2::neg        19.76 clk
Fp2::mul       453.66 clk
Fp2::mul_xi     35.83 clk
Fp2::sqr       254.18 clk
Fp2::inv         5.168Kclk
FpDbl::addPre   14.72 clk
FpDbl::subPre   14.64 clk
FpDbl::add      19.02 clk
FpDbl::sub      17.55 clk
FpDbl::mulPre   54.56 clk
FpDbl::sqrPre   51.65 clk
FpDbl::mod      80.36 clk
Fp2Dbl::mulPre  243.72 clk
Fp2Dbl::sqrPre  145.11 clk
GT::add        168.85 clk
GT::mul          8.658Kclk
GT::sqr          6.252Kclk
GT::inv         20.416Kclk
FpDbl::mulPre   53.03 clk
pairing          2.717Mclk <-------------------------
millerLoop       1.105Mclk <-------------------------
finalExp         1.550Mclk <-------------------------
precomputeG2   269.778Kclk
precomputedML  852.854Kclk
millerLoopVec    5.598Mclk
ctest:module=finalExp
finalExp   1.547Mclk
ctest:module=mul_012
ctest:module=pairing
ctest:module=multi
BN254
calcBN1  34.282Kclk
naiveG2  17.954Kclk
calcBN2  65.749Kclk
naiveG2  48.280Kclk
BLS12_381
calcBN1  82.024Kclk
naiveG1  60.722Kclk
calcBN2 167.170Kclk
naiveG2 139.621Kclk
ctest:module=eth2
mapToG2  org-cofactor   1.150Mclk
mapToG2 fast-cofactor 585.033Kclk
ctest:module=deserialize
verifyOrder(1)
deserializeG1 366.132Kclk
deserializeG2   1.152Mclk
verifyOrder(0)
deserializeG1  62.822Kclk
deserializeG2 130.251Kclk
ctest:name=bls12_test, module=8, total=3600, ok=3600, ng=0, exception=0

Nim-blscurve using Milagro

Warmup: 0.9038 s, result 224 (displayed to avoid compiler optimizing warmup away)


Compiled with GCC
Optimization level => no optimization: false | release: true | danger: true
Using Milagro with 64-bit limbs
Running on Intel(R) Core(TM) i9-9980XE CPU @ 3.00GHz



⚠️ Cycles measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
i.e. a 20% overclock will be about 20% off (assuming no dynamic frequency scaling)

=================================================================================================================

Scalar multiplication G1                                   2399.307 ops/s       416787 ns/op      1250378 cycles
Scalar multiplication G2                                    890.692 ops/s      1122722 ns/op      3368208 cycles
EC add G1                                                911577.028 ops/s         1097 ns/op         3291 cycles
EC add G2                                                323310.702 ops/s         3093 ns/op         9281 cycles
Pairing (Milagro builtin double pairing)                    420.039 ops/s      2380729 ns/op      7142276 cycles
Pairing (Multi-Pairing with delayed Miller and Exp)         417.711 ops/s      2393999 ns/op      7182089 cycles

⚠️ Warning: using draft v5 of IETF Hash-To-Curve (HKDF-based).
           This is an outdated draft.

Hash to G2 (Draft #5)                                       937.034 ops/s      1067197 ns/op      3201632 cycles

Conclusion

(Our cycles and MCL clocks/clk are the same unit.)

Scalar multiplication in G2 (signing) is about 8x slower.
Pairing (verification) is about 3x slower.
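
Working out the ratios from the tables above: G2 scalar multiplication is 3368208 cycles (Milagro) vs 407.538 Kclk (MCL JIT), about 8.3x; "Pairing (Milagro builtin double pairing)" is 7142276 cycles vs MCL's 2.237 Mclk pairing, about 3.2x.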

mratsim commented Jun 4, 2020

Explanation for the scalar multiplication slowness:

  • The order r has a bit width of 255, but the benchmark uses a full 381-bit scalar:

    proc benchScalarMultG1*(iters: int) =
      var x = generator1()
      var scal: BIG_384
      random(scal)
      bench("Scalar multiplication G1", iters):
        x.mul(scal)

    proc benchScalarMultG2*(iters: int) =
      var x = generator2()
      var scal: BIG_384
      random(scal)
      bench("Scalar multiplication G2", iters):
        x.mul(scal)

    This means roughly 50% more work per scalar multiplication (see the sketch below for sampling the scalar below the curve order instead).
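
A possible fix, sketched below at the Milagro C level rather than through the nim-blscurve wrapper (whose API may differ): sample the benchmark scalar uniformly below the curve order r so that the measured G2 multiplication uses a ~255-bit exponent, as signing does. The helper name bench_scalar_mult_g2_reduced is hypothetical; CREATE_CSPRNG, BIG_384_58_randomnum and ECP2_BLS381_generator are assumed to be available as in upstream AMCL.

#include "pair_BLS381.h"   /* AMCL BLS12-381: BIG_384_58, ECP2_BLS381, CURVE_Order_BLS381 */
#include "randapi.h"       /* csprng, CREATE_CSPRNG */

/* Sketch only: benchmark G2 scalar multiplication with a scalar sampled
   below the ~255-bit curve order r, instead of a full-width 381-bit BIG. */
void bench_scalar_mult_g2_reduced(int iters)
{
    char raw[32] = {0};                       /* fixed seed is fine for a benchmark */
    octet seed = {sizeof(raw), sizeof(raw), raw};
    csprng rng;
    CREATE_CSPRNG(&rng, &seed);

    BIG_384_58 r, scal;
    BIG_384_58_rcopy(r, CURVE_Order_BLS381);  /* r is ~255 bits */
    BIG_384_58_randomnum(scal, r, &rng);      /* uniform in [0, r) */

    ECP2_BLS381 P;
    ECP2_BLS381_generator(&P);

    for (int i = 0; i < iters; i++)
        ECP2_BLS381_mul(&P, scal);            /* generic windowed scalar mul */
}

With a scalar sampled like this, the G1/G2 lines should be directly comparable to MCL's G1::mul / G2::mul numbers.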

mratsim commented Jun 14, 2020

Actually, looking at the code, the scalar multiplication uses nbits, which does check (and leaks) the number of exponent bits:

int BIG_384_58_nbits(BIG_384_58 a)
{
    int bts, k = NLEN_384_58 - 1;
    BIG_384_58 t;
    chunk c;
    BIG_384_58_copy(t, a);
    BIG_384_58_norm(t);
    while (k >= 0 && t[k] == 0) k--;
    if (k < 0) return 0;
    bts = BASEBITS_384_58 * k;
    c = t[k];
    while (c != 0)
    {
        c /= 2;
        bts++;
    }
    return bts;
}

nb = 1 + (BIG_384_58_nbits(t) + 3) / 4;
/* convert exponent to signed 4-bit window */
for (i = 0; i < nb; i++)
{
    w[i] = BIG_384_58_lastbits(t, 5) - 16;
    BIG_384_58_dec(t, w[i]);
    BIG_384_58_norm(t);
    BIG_384_58_fshr(t, 4);
}
w[nb] = BIG_384_58_lastbits(t, 5);
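
With nb = 1 + (nbits + 3)/4, a full-width 381-bit scalar gives 1 + 384/4 = 97 signed 4-bit windows, while a scalar reduced below the ~255-bit order r would give 1 + 258/4 = 65, which matches the roughly 50% extra work noted in the previous comment (97/65 ≈ 1.49).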

However, there is a special path in pair_BLS381.c that uses endomorphism acceleration when it is enabled:

/* Multiply P by e in group G2 */
void PAIR_BLS381_G2mul(ECP2_BLS381 *P, BIG_384_58 e)
{
#ifdef USE_GS_G2_BLS381 /* Well I didn't patent it :) */
    int i, np, nn;
    ECP2_BLS381 Q[4];
    FP2_BLS381 X;
    FP_BLS381 fx, fy;
    BIG_384_58 x, y, u[4];

    FP_BLS381_rcopy(&fx, Fra_BLS381);
    FP_BLS381_rcopy(&fy, Frb_BLS381);
    FP2_BLS381_from_FPs(&X, &fx, &fy);
#if SEXTIC_TWIST_BLS381==M_TYPE
    FP2_BLS381_inv(&X, &X);
    FP2_BLS381_norm(&X);
#endif
    BIG_384_58_rcopy(y, CURVE_Order_BLS381);
    gs(u, e);

    ECP2_BLS381_copy(&Q[0], P);
    for (i = 1; i < 4; i++)
    {
        ECP2_BLS381_copy(&Q[i], &Q[i - 1]);
        ECP2_BLS381_frob(&Q[i], &X);
    }

    for (i = 0; i < 4; i++)
    {
        np = BIG_384_58_nbits(u[i]);
        BIG_384_58_modneg(x, u[i], y);
        nn = BIG_384_58_nbits(x);
        if (nn < np)
        {
            BIG_384_58_copy(u[i], x);
            ECP2_BLS381_neg(&Q[i]);
        }
        BIG_384_58_norm(u[i]);
    }

    ECP2_BLS381_mul4(P, Q, u);
#else
    ECP2_BLS381_mul(P, e);
#endif
}
This path should be benchmarked as well, and probably be used instead of the generic ECP2 mul (a rough timing sketch follows).
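
A rough timing sketch, not the repo's bench harness: the helper name cycles_per_g2mul is hypothetical, __rdtsc is x86-only, ECP2_BLS381_generator is assumed to exist as in upstream AMCL, and USE_GS_G2_BLS381 must be defined for the endomorphism path to be compiled in.

#include <x86intrin.h>      /* __rdtsc(), x86 only */
#include "pair_BLS381.h"

/* Sketch: average cycles per G2 scalar multiplication, comparing the
   endomorphism-accelerated PAIR_BLS381_G2mul with the generic windowed
   ECP2_BLS381_mul on the same scalar. */
static unsigned long long cycles_per_g2mul(int use_endo, BIG_384_58 scal, int iters)
{
    ECP2_BLS381 P;
    ECP2_BLS381_generator(&P);

    unsigned long long t0 = __rdtsc();
    for (int i = 0; i < iters; i++)
    {
        if (use_endo)
            PAIR_BLS381_G2mul(&P, scal);  /* 4-dimensional decomposition path */
        else
            ECP2_BLS381_mul(&P, scal);    /* plain signed 4-bit window */
    }
    return (__rdtsc() - t0) / iters;
}

Called with the same reduced scalar for both variants, this would show how much the endomorphism path helps on this hardware.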

mratsim commented Sep 24, 2020

Updated MCL performance as of https://github.com/herumi/mcl/tree/b390d6d4ffded57727ff752fb0209e0d397ed946 (Sept 21):

JIT 1
ctest:module=size
ctest:module=naive
i=0 curve=BLS12_381
G1
G2
GT
G1::mulCT      224.863Kclk
G1::mul        207.081Kclk
G1::add          1.225Kclk
G1::dbl        900.69 clk
G2::mulCT      464.260Kclk
G2::mul        402.097Kclk
G2::add          3.414Kclk
G2::dbl          2.120Kclk
GT::pow        684.362Kclk
G1::setStr chk 289.395Kclk
G1::setStr       2.485Kclk
G2::setStr chk 702.704Kclk
G2::setStr       5.159Kclk
hashAndMapToG1 221.856Kclk
hashAndMapToG2 453.844Kclk
Fp::add         13.62 clk
Fp::sub          9.28 clk
Fp::neg          8.05 clk
Fp::mul        101.33 clk
Fp::sqr         98.98 clk
Fp::inv          4.536Kclk
Fp::pow         52.610Kclk
Fr::add         10.24 clk
Fr::sub          9.57 clk
Fr::neg          5.88 clk
Fr::mul         53.96 clk
Fr::sqr         54.48 clk
Fr::inv          1.895Kclk
Fr::pow         19.889Kclk
Fp2::add        21.71 clk
Fp2::sub        16.92 clk
Fp2::neg        14.85 clk
Fp2::mul       310.27 clk
Fp2::mul_xi     29.50 clk
Fp2::sqr       225.28 clk
Fp2::inv         5.093Kclk
FpDbl::addPre    9.66 clk
FpDbl::subPre    9.60 clk
FpDbl::add      15.96 clk
FpDbl::sub      10.69 clk
FpDbl::mulPre   46.81 clk
FpDbl::sqrPre   40.39 clk
FpDbl::mod      58.99 clk
Fp2Dbl::mulPre  187.35 clk
Fp2Dbl::sqrPre  111.38 clk
GT::add        114.84 clk
GT::mul          6.364Kclk
GT::sqr          4.603Kclk
GT::inv         15.891Kclk
FpDbl::mulPre   46.84 clk
pairing          1.980Mclk
millerLoop     841.630Kclk
finalExp         1.115Mclk
precomputeG2   202.196Kclk
precomputedML  613.377Kclk
millerLoopVec    4.277Mclk
ctest:module=finalExp
finalExp   1.103Mclk
ctest:module=mul_012
ctest:module=pairing
ctest:module=multi
BN254
calcBN1  32.363Kclk
naiveG2  17.237Kclk
calcBN2  64.593Kclk
naiveG2  46.927Kclk
BLS12_381
calcBN1  76.468Kclk
naiveG1  55.494Kclk
calcBN2 156.801Kclk
naiveG2 128.367Kclk
ctest:module=deserialize
verifyOrder(1)
deserializeG1 346.770Kclk
deserializeG2 818.290Kclk
verifyOrder(0)
deserializeG1  59.815Kclk
deserializeG2 119.463Kclk
ctest:module=verifyG1
ctest:module=verifyG2
ctest:name=bls12_test, module=9, total=3727, ok=3727, ng=0, exception=0

Pairing has been accelerated by 14%

mratsim closed this as completed Oct 6, 2020