
question: modular multiplication/square look slower than GMP (in my benchs) #266

Closed
mario-tux opened this issue Feb 4, 2019 · 2 comments


@mario-tux

I have usually considered MPIR faster than GMP in almost all operations, but I recently ran into some disorienting benchmarks of more complex operations (I was trying to implement, on top of MPIR/GMP primitives, a modular exponentiation with a sliding window over a fixed base, exploiting precomputation).
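For context, here is a minimal sketch of the kind of fixed-base precomputation I mean, written against the mpz interface (illustrative only: the names fb_table_t, fb_init and fb_powm are mine, and this is the plain bit-wise variant rather than the sliding-window one I am actually implementing):

```c
/* Minimal sketch: fixed-base modular exponentiation with precomputation.
 * Plain bit-wise variant, not the sliding-window one; names are illustrative. */
#include <gmp.h>      /* or the MPIR-provided header when building against MPIR */
#include <stdlib.h>

typedef struct {
    mpz_t *pow2;      /* pow2[i] = base^(2^i) mod mod */
    size_t nbits;     /* number of precomputed powers */
    mpz_t  mod;
} fb_table_t;

/* Precompute base^(2^i) mod mod for i = 0 .. nbits-1 (done once per base). */
static void fb_init(fb_table_t *t, const mpz_t base, const mpz_t mod, size_t nbits)
{
    t->nbits = nbits;
    t->pow2  = malloc(nbits * sizeof(mpz_t));
    mpz_init_set(t->mod, mod);
    mpz_init_set(t->pow2[0], base);
    mpz_mod(t->pow2[0], t->pow2[0], mod);
    for (size_t i = 1; i < nbits; i++) {
        mpz_init(t->pow2[i]);
        mpz_mul(t->pow2[i], t->pow2[i-1], t->pow2[i-1]);  /* square ...      */
        mpz_mod(t->pow2[i], t->pow2[i], mod);             /* ... then reduce */
    }
}

/* rop = base^exp mod mod, using only mul+mod at run time (rop must be initialized). */
static void fb_powm(mpz_t rop, const fb_table_t *t, const mpz_t exp)
{
    mpz_set_ui(rop, 1);
    for (size_t i = 0; i < t->nbits && i < mpz_sizeinbase(exp, 2); i++) {
        if (mpz_tstbit(exp, i)) {
            mpz_mul(rop, rop, t->pow2[i]);   /* these mul+mod pairs are exactly */
            mpz_mod(rop, rop, t->mod);       /* the operations benchmarked below */
        }
    }
}
```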

In order to investigate I did some micro-benchmarks on the elementary operations. These are the reports:

cpu: Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz NO-TURBO
compiler: GNU GCC v.8.2.1
mpi library: GMP v.6.1.2
modulus size: 2048 bits
small exponent size: 144 bits
medium exponent size: 256 bits
large exponent size: 2048 bits
modular exponentiation with small exponent: 322.150 μs (±22.734 μs = 0.07%)
modular exponentiation with medium exponent: 560.644 μs (±68.556 μs = 0.12%)
modular exponentiation with full exponent: 4219.308 μs (±375.934 μs = 0.09%)
modular exponentiation with 2^k medium exponent: 479.741 μs (±35.888 μs = 0.07%)
integer multiplication between two large numbers: 0.943 μs (±0.437 μs = 0.46%)
modular reduction of a 2x large number: 0.248 μs (±0.175 μs = 0.71%)
modular multiplication between two large numbers: 3.107 μs (±0.609 μs = 0.20%)
unbalanced modular multiplication between a large number and a medium one: 0.671 μs (±0.260 μs = 0.39%)
modular square of a large number: 2.765 μs (±0.589 μs = 0.21%)
integer addition between two large numbers: 0.046 μs (±0.083 μs = 1.81%)
modular addition between two large numbers: 0.317 μs (±0.217 μs = 0.68%)
cpu: Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz NO-TURBO
compiler: GNU GCC v.8.2.1
mpi library: MPIR v.3.0.0
modulus size: 2048 bits
small exponent size: 144 bits
medium exponent size: 256 bits
large exponent size: 2048 bits
modular exponentiation with small exponent: 272.561 μs (±36.348 μs = 0.13%)
modular exponentiation with medium exponent: 476.236 μs (±66.424 μs = 0.14%)
modular exponentiation with full exponent: 3502.220 μs (±259.195 μs = 0.07%)
modular exponentiation with 2^k medium exponent: 405.858 μs (±51.426 μs = 0.13%)
integer multiplication between two large numbers: 0.761 μs (±0.425 μs = 0.56%)
modular reduction of a 2x large number: 0.132 μs (±0.154 μs = 1.17%)
modular multiplication between two large numbers: 9.734 μs (±1.721 μs = 0.18%)
unbalanced modular multiplication between a large number and a medium one: 0.760 μs (±0.405 μs = 0.53%)
modular square of a large number: 9.556 μs (±1.988 μs = 0.21%)
integer addition between two large numbers: 0.045 μs (±0.101 μs = 2.25%)
modular addition between two large numbers: 0.348 μs (±0.219 μs = 0.63%)

All the exponentiations are faster with MPIR (as expected). What is slower in MPIR is the modular multiplication (mul+mod) and the modular square (mul+mod).

Some details:

  • I locally recompiled both MPI libraries using only the "fat mode" option;
  • in the "integer multiplication" test I use mpz_mul(n3, n1, n2), i.e. a non-in-place operation;
  • in the "modular reduction" test I use mpz_mod(n3, n1, mod), where n1 is twice the size of mod;
  • in the "modular multiplication" test I use mpz_mul(n3, n1, n2); mpz_mod(n3, n3, mod); (see the sketch after this list). It looks slower than the plain sum of the two previous tests: my guess is that this is related to cache usage; but the main point is that this operation is VERY slow in MPIR w.r.t. GMP!
  • in the "modular square" test I use mpz_mul(n3, n1, n1); mpz_mod(n3, n3, mod);
  • all the tests are benchmarked over a 6-second sampling period using the CPU cycle counter read from assembly; the reported time is the median, followed by the standard deviation.
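For clarity, this is the exact operation sequence of the modular multiplication and modular square cases (variable names as in the list above; operand setup and the cycle-counter harness are omitted):

```c
/* The two slow cases from the benchmark: mul followed by mod. */
#include <gmp.h>      /* or the MPIR-provided header when building against MPIR */

static void modmul(mpz_t n3, const mpz_t n1, const mpz_t n2, const mpz_t mod)
{
    mpz_mul(n3, n1, n2);    /* full 2048x2048-bit product (~4096 bits) */
    mpz_mod(n3, n3, mod);   /* reduce back to the size of mod          */
}

static void modsqr(mpz_t n3, const mpz_t n1, const mpz_t mod)
{
    mpz_mul(n3, n1, n1);    /* square via mpz_mul, exactly as in the test */
    mpz_mod(n3, n3, mod);
}
```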

I would like to understand this behavior and, maybe, get better performance out of my fixed-base exponentiation with MPIR (right now it is very slow because of the square/multiplication steps).

Side note: why have GMP/MPIR never included a modular exponentiation that exploits a fixed base? :)

@wbhart
Owner

wbhart commented Feb 4, 2019

Your processor looks very recent, but we don't have a lot of support for recent CPUs. Are you sure it is supported properly in MPIR? What does ./config.guess return?

Basically MPIR development ground to a halt some years ago, as we made the project community supported. So far, few people have stepped up to continue development.

I'm not sure that fat mode is a good idea if you want performance. You are best off letting the build system configure MPIR for your processor automatically. Fat mode will add extra overhead on every operation.

Incidentally, Intel CPU cycle counting is no longer reliable. It no longer counts CPU cycles, and the numbers you get from the counter depend on many factors. You are better off using an actual clock for timings.
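Something like this, for example (a minimal sketch using POSIX clock_gettime; the iteration count and names are illustrative):

```c
/* Minimal sketch: timing mul+mod with a monotonic wall clock
 * instead of the CPU cycle counter. */
#include <gmp.h>
#include <time.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Average time of one mul+mod pair over many iterations (illustrative count). */
static double time_modmul(mpz_t n3, const mpz_t n1, const mpz_t n2,
                          const mpz_t mod, int iters)
{
    double t0 = now_sec();
    for (int i = 0; i < iters; i++) {
        mpz_mul(n3, n1, n2);
        mpz_mod(n3, n3, mod);
    }
    return (now_sec() - t0) / iters;
}
```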

@mario-tux
Author

mario-tux commented Feb 4, 2019

> Your processor looks very recent, but we don't have a lot of support for recent CPUs. Are you sure it is supported properly in MPIR? What does ./config.guess return?

This is the result with MPIR: skylakeavx-unknown-linux-gnu. GMP recognizes it as skylake-pc-linux-gnu.

> I'm not sure that fat mode is a good idea if you want performance. You are best off letting the build system configure MPIR for your processor automatically. Fat mode will add extra overhead on every operation.

Oh, this is new to me. I was misled by the description: I didn't think the CPU detection and version management were that bad. I disabled it and all the strange behaviors are gone (with better numbers too). Thanks.

> Incidentally, Intel CPU cycle counting is no longer reliable. It no longer counts CPU cycles, and the numbers you get from the counter depend on many factors. You are better off using an actual clock for timings.

I studied the topic a bit: in the past there were problems, but on modern CPUs that support the rdtscp instruction and (using the Linux flag names) the constant_tsc and nonstop_tsc features, all those issues should be gone. It should be the most precise method, with low (sampling) overhead. I think the high-precision clock implementations of all the major OSes are based on it.
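Concretely, I mean something like this (a minimal sketch for GCC/Clang on x86-64, using the __rdtscp intrinsic from <x86intrin.h>; converting the counts to seconds still requires knowing the TSC reference frequency):

```c
/* Minimal sketch: serialized TSC read on x86-64 (GCC/Clang).
 * __rdtscp waits for prior instructions to retire before reading the TSC;
 * with constant_tsc/nonstop_tsc the counter ticks at a fixed reference
 * frequency, independent of the current core clock. */
#include <stdint.h>
#include <x86intrin.h>

static inline uint64_t tsc_read(void)
{
    unsigned int aux;              /* receives IA32_TSC_AUX (core/node id) */
    return __rdtscp(&aux);
}

/* Usage: uint64_t c0 = tsc_read(); ... work ...; uint64_t c1 = tsc_read();
 * elapsed reference cycles = c1 - c0. */
```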
