
Benchmarks vs MCL on ARM #28

Closed
mratsim opened this issue Nov 12, 2019 · 3 comments

mratsim commented Nov 12, 2019

For reference, this is MCL's speed on an ARM-32 Raspberry Pi 4:

https://github.com/mratsim/mcl/blob/2b318e84/bench-arm32-pi4.log

JIT 0
ctest:module=size
ctest:module=naive
i=0 curve=BLS12_381
G1
G2
GT
G1::mulCT        1.932msec
G1::mul          1.976msec
G1::add         11.765usec
G1::dbl          8.713usec
G2::mulCT        4.389msec
G2::mul          4.527msec
G2::add         39.784usec
G2::dbl         21.466usec
GT::pow          7.413msec
G1::setStr chk   2.834msec
G1::setStr      12.919usec
G2::setStr chk   7.554msec
G2::setStr      26.452usec
hashAndMapToG1   2.722msec
hashAndMapToG2   5.766msec
Fp::add         47.45nsec
Fp::sub         56.04nsec
Fp::neg         29.36nsec
Fp::mul        894.09nsec
Fp::sqr          1.371usec
Fp::inv        121.102usec
Fp2::add        93.46nsec
Fp2::sub       109.45nsec
Fp2::neg        58.77nsec
Fp2::mul         4.102usec
Fp2::mul_xi    116.80nsec
Fp2::sqr         1.948usec
Fp2::inv       129.694usec
FpDbl::addPre   83.59nsec
FpDbl::subPre   84.77nsec
FpDbl::add      83.48nsec
FpDbl::sub      86.79nsec
FpDbl::mulPre  888.74nsec
FpDbl::sqrPre  846.63nsec
FpDbl::mod     525.81nsec
Fp2Dbl::mulPre    3.015usec
Fp2Dbl::sqrPre    1.910usec
GT::add        570.59nsec
GT::mul         71.228usec
GT::sqr         49.861usec
GT::inv        258.619usec
FpDbl::mulPre  888.75nsec
pairing         21.020msec
millerLoop       9.109msec
finalExp        11.902msec
precomputeG2     2.191msec
precomputedML    6.881msec
millerLoopVec   47.449msec
ctest:module=finalExp
finalExp  11.990msec
ctest:module=mul_012
ctest:module=pairing
ctest:module=multi
BN254
calcBN1 499.901usec
naiveG2 214.120usec
calcBN2 981.032usec
naiveG2 725.049usec
BLS12_381
calcBN1   1.122msec
naiveG1 690.237usec
calcBN2   2.271msec
naiveG2   1.818msec
ctest:module=eth2
mapToG2  org-cofactor  10.949msec
mapToG2 fast-cofactor   6.192msec
ctest:name=bls12_test, module=7, total=832, ok=832, ng=0, exception=0

Reproduction commands:

git clone git@github.com:herumi/mcl
cd mcl
make bin/bls12_test.exe MCL_USE_GMP=0 MCL_USE_OPENSSL=0
bin/bls12_test.exe
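The MCL log above mixes msec/usec/nsec suffixes on each line. A minimal parsing sketch (the line format is assumed from the log above, not from MCL's source) to normalize everything to nanoseconds for comparison:

```python
import re

# Scale factors for the unit suffixes that appear in the MCL bench output.
UNITS = {"nsec": 1, "usec": 1_000, "msec": 1_000_000}

def to_ns(line):
    """Extract the timing from one MCL bench line, in nanoseconds.

    Returns None for lines without a timing (e.g. "ctest:module=naive").
    """
    m = re.search(r"([\d.]+)\s*(nsec|usec|msec)", line)
    if not m:
        return None
    return float(m.group(1)) * UNITS[m.group(2)]

print(to_ns("G1::mulCT        1.932msec"))  # 1932000.0 ns
print(to_ns("Fp::add         47.45nsec"))   # 47.45 ns
```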
@mratsim mratsim changed the title Benchmarks vs MCL Benchmarks vs MCL on ARM Nov 12, 2019

mratsim commented Apr 9, 2020

On the same Raspberry Pi 4B on ARM
(Broadcom BCM2711, quad-core Cortex-A72 (ARMv8) 64-bit SoC @ 1.5 GHz)

Note: I am using 32-bit Raspbian, so Milagro was compiled with 32-bit limbs; you can expect roughly 2x the performance in 64-bit mode:

Compiled with GCC
Optimization level => no optimization: false | release: true | danger: true                                          
⚠️ Warning: using Milagro with 32-bit limbs

=================================================================================================================    

Scalar multiplication G1                                    194.349 ops/s      5145377 ns/op                         
Scalar multiplication G2                                     71.983 ops/s     13892224 ns/op                         
EC add G1                                                 75746.099 ops/s        13202 ns/op                         
EC add G2                                                 25377.490 ops/s        39405 ns/op                         
Pairing (Milagro builtin double pairing)                     38.146 ops/s     26214868 ns/op                         
Pairing (Multi-Pairing with delayed Miller and Exp)          38.008 ops/s     26310474 ns/op                         

⚠️ Warning: using draft v5 of IETF Hash-To-Curve (HKDF-based).                                                        
           This is an outdated draft.

Hash to G2 (Draft #5)                                        75.663 ops/s     13216584 ns/op
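As a sanity check on the table above, the two columns are reciprocals of each other (1 second = 1e9 ns), so either one can be derived from the other. A small sketch using two rows copied from the output:

```python
# (ops/s, ns/op) pairs copied from the Milagro benchmark table above.
rows = {
    "Scalar multiplication G1": (194.349, 5_145_377),
    "EC add G1":                (75_746.099, 13_202),
}

for name, (ops_per_s, ns_per_op) in rows.items():
    derived = 1e9 / ns_per_op  # ops/s recomputed from ns/op
    print(f"{name}: {derived:.3f} ops/s (reported {ops_per_s})")
```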


mratsim commented Apr 9, 2020

On a Huawei P20 Lite phone (a 2018 entry/mid-range phone)
Processor: HiSilicon Kirin 659 @ 2360 MHz (ARMv8)

Milagro in 64-bit mode

Compiled with GCC
Optimization level => no optimization: false | release: true | danger: true
Using Milagro with 64-bit limbs

=================================================================================================================

Scalar multiplication G1                                    284.394 ops/s      3516247 ns/op
Scalar multiplication G2                                    102.853 ops/s      9722601 ns/op
EC add G1                                                105864.916 ops/s         9446 ns/op
EC add G2                                                 36911.265 ops/s        27092 ns/op
Pairing (Milagro builtin double pairing)                     50.957 ops/s     19624477 ns/op
Pairing (Multi-Pairing with delayed Miller and Exp)          50.068 ops/s     19972640 ns/op

⚠️ Warning: using draft v5 of IETF Hash-To-Curve (HKDF-based).
           This is an outdated draft.

Hash to G2 (Draft #5)                                       117.156 ops/s      8535621 ns/op


mratsim commented Apr 9, 2020

MCL on the same phone, also in 64-bit mode:

EC add G2 is 2x faster than Milagro
Scalar multiplication is 2x faster than Milagro
Pairing is 2.5x faster than Milagro

JIT 0
ctest:module=size
ctest:module=naive
i=0 curve=BLS12_381
G1
G2
GT
G1::mulCT      741.964usec
G1::mul        760.436usec
G1::add          4.797usec
G1::dbl          3.202usec
G2::mulCT        1.627msec
G2::mul          1.675msec
G2::add         14.409usec
G2::dbl          8.494usec
GT::pow          2.598msec
G1::setStr chk   1.078msec
G1::setStr      11.047usec
G2::setStr chk   2.926msec
G2::setStr      22.819usec
hashAndMapToG1   1.373msec
hashAndMapToG2   2.767msec
Fp::add         23.30nsec
Fp::sub         22.19nsec
Fp::neg         18.86nsec
Fp::mul        410.98nsec
Fp::sqr        412.34nsec
Fp::inv        147.942usec
Fp2::add        47.42nsec
Fp2::sub        45.46nsec
Fp2::neg        34.94nsec
Fp2::mul         1.395usec
Fp2::mul_xi     70.49nsec
Fp2::sqr       917.64nsec
Fp2::inv       151.215usec
FpDbl::addPre   41.34nsec
FpDbl::subPre   43.81nsec
FpDbl::add      44.43nsec
FpDbl::sub      42.14nsec
FpDbl::mulPre  243.68nsec
FpDbl::sqrPre  159.24nsec
FpDbl::mod     228.10nsec
Fp2Dbl::mulPre  912.43nsec
Fp2Dbl::sqrPre  552.47nsec
GT::add        293.75nsec
GT::mul         24.773usec
GT::sqr         17.505usec
GT::inv        202.780usec
FpDbl::mulPre  243.44nsec
pairing          7.627msec
millerLoop       3.251msec
finalExp         4.361msec
precomputeG2   823.918usec
precomputedML    2.424msec
millerLoopVec   18.560msec
ctest:module=finalExp
finalExp   4.355msec
ctest:module=mul_012
ctest:module=pairing
ctest:module=multi
BN254
calcBN1 386.328usec
naiveG2  75.901usec
calcBN2 730.271usec
naiveG2 456.292usec
BLS12_381
calcBN1 695.115usec
naiveG1 243.682usec
calcBN2   1.440msec
naiveG2 935.823usec
ctest:module=eth2
mapToG2  org-cofactor   4.951msec
mapToG2 fast-cofactor   3.194msec
ctest:module=deserialize
verifyOrder(1)
deserializeG1   1.500msec
deserializeG2   3.913msec
verifyOrder(0)
deserializeG1 432.423usec
deserializeG2   1.009msec
ctest:name=bls12_test, module=8, total=3600, ok=3600, ng=0, exception=0
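As a rough cross-check (not part of the original benchmark output), the headline ratios above can be derived from the two logs by comparing matching operations. Timings below are copied from the Milagro table (ns/op) and the MCL log (G2::add 14.409 usec, pairing 7.627 msec), converted to nanoseconds:

```python
# Milagro timings on the Kirin 659 phone, in ns/op (from the table above).
MILAGRO_NS = {
    "EC add G2": 27_092,
    "Pairing":   19_624_477,  # builtin double pairing
}
# Matching MCL timings from the log above, converted to ns.
MCL_NS = {
    "EC add G2": 14_409,     # G2::add, 14.409 usec
    "Pairing":   7_627_000,  # pairing, 7.627 msec
}

for op in MILAGRO_NS:
    ratio = MILAGRO_NS[op] / MCL_NS[op]
    print(f"{op}: MCL is {ratio:.2f}x faster than Milagro")
```

This reproduces the ~2x (EC add G2) and ~2.5x (pairing) figures quoted in the comment.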
