Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fast modular exponentiation #40160

Closed
trevp mannequin opened this issue Apr 17, 2004 · 27 comments
Closed

fast modular exponentiation #40160

trevp mannequin opened this issue Apr 17, 2004 · 27 comments
Assignees
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement

Comments

@trevp
Copy link
Mannequin

trevp mannequin commented Apr 17, 2004

BPO 936813
Nosy @tim-one, @gpshead, @mdickinson, @tiran
Files
  • long_pow.diff
  • longobject_diffs.tar.gz
  • long_mont.diff
  • long_mont2.diff
  • long_mont3.diff
  • long_mont4.diff
  • long_pow_2005_9_28.diff: update to CVS head
  • long_pow_2009_12_30.diff
  • long_pow_2009_12_31.diff
  • time_powmod.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/mdickinson'
    closed_at = <Date 2009-12-31.19:39:35.222>
    created_at = <Date 2004-04-17.08:16:31.000>
    labels = ['interpreter-core', 'type-feature']
    title = 'fast modular exponentiation'
    updated_at = <Date 2009-12-31.19:39:35.220>
    user = 'https://bugs.python.org/trevp'

    bugs.python.org fields:

    activity = <Date 2009-12-31.19:39:35.220>
    actor = 'mark.dickinson'
    assignee = 'mark.dickinson'
    closed = True
    closed_date = <Date 2009-12-31.19:39:35.222>
    closer = 'mark.dickinson'
    components = ['Interpreter Core']
    creation = <Date 2004-04-17.08:16:31.000>
    creator = 'trevp'
    dependencies = []
    files = ['5930', '5931', '5932', '5933', '5934', '5935', '5936', '15703', '15712', '15713']
    hgrepos = []
    issue_num = 936813
    keywords = ['patch']
    message_count = 27.0
    messages = ['45791', '45792', '45793', '45794', '45795', '45796', '45797', '45798', '45799', '45800', '45801', '45802', '45803', '59785', '59825', '82063', '82498', '82535', '82536', '97003', '97031', '97054', '97098', '97101', '97102', '97104', '97108']
    nosy_count = 5.0
    nosy_names = ['tim.peters', 'gregory.p.smith', 'mark.dickinson', 'trevp', 'christian.heimes']
    pr_nums = []
    priority = 'normal'
    resolution = 'accepted'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue936813'
    versions = ['Python 3.1', 'Python 2.7']

    @trevp
    Copy link
    Mannequin Author

    trevp mannequin commented Apr 17, 2004

    For crypto-sized numbers, Python mod-exp is several
    times slower than GMP or OpenSSL (6x or more). Those
    libraries do crazy special-case stuff, + assembly,
    platform-specific tuning, and so on.

    However, there's some low-hanging fruit: this patch has
    a few basic optimizations giving a 2-3x speedup for
    numbers in the 1000-8000 bit range (that's what I've
    mostly tested; but the patch should improve, or at
    least not hurt, everything else):

    • x_mul() is special-cased for squaring, which is
      almost twice as fast as general multiplication.

    • x_mul() uses pointers instead of indices for
      iteration, giving ~10% speedup (under VC6).

    • the right-to-left square-and-multiply exponentiation
      algorithm is replaced with a left-to-right
      square-and-multiply, which takes advantage of small bases.

    • when the exponent is above a certain size, "k-ary"
      exponentiation is used to reduce the number of
      multiplications via precalculation.

    • when the modulus is odd, Montgomery reduction is used.

    • the Karatsuba cutoff seems too low. For
      multiplicands in the range of 500-5000 bits, Karatsuba
      slows multiplication by around ~25% (VC6sp4, Intel P4M
      1.7 Ghz). For larger numbers, the benefits of
      Karatsuba are less than they could be.

    Currently, the cutoff is 35 digits (525 bits). I've
    tried 70, 140, 280, and 560. 70, 140, and 280 are
    roughly the same: they don't slow down small values,
    and they have good speedup on large ones. 560 is not
    quite as good for large values, but at least it doesn't
    hurt small ones.

    I know this is platform-dependent, but I think we
    should err on the side of making the cutoff too high
    and losing some optimization, instead of putting it too
    low and slowing things down. I suggest 70.

    A couple misc. things:

    • Negative exponents with a modulus no longer give an
      error, when the base is coprime with the modulus.
      Instead, it calculates the multiplicative inverse of
      the base with respect to the modulus, using the
      extended euclidean algorithm, and exponentiates that.

    Libraries like GMP and LibTomMath work the same way.
    Being able to take inverses mod a number is useful for
    cryptography (e.g. RSA, DSA, and Elgamal).

    • The diff includes patch 923643, which supports
      converting longs to byte-strings. Ignore the last few
      diff entries, if you don't want this.

    • I haven't looked into harmonizing with int_pow().
      Something may have to be done.

    @trevp trevp mannequin assigned tim-one Apr 17, 2004
    @trevp trevp mannequin added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Apr 17, 2004
    @trevp trevp mannequin assigned tim-one Apr 17, 2004
    @trevp trevp mannequin added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Apr 17, 2004
    @trevp
    Copy link
    Mannequin Author

    trevp mannequin commented Jul 13, 2004

    Logged In: YES
    user_id=973611

    Uploading 2nd version of longobject.diff - the only change
    is that patch 923643 is removed from this diff. That was a
    diff for converting longs to byte-strings, which I
    unnecessarily left in.

    @tim-one
    Copy link
    Member

    tim-one commented Jul 17, 2004

    Logged In: YES
    user_id=31435

    Notes after a brief eyeball scan:

    Note that the expression

    a & 1 == 1

    groups as

    a & (1 == 1)

    in C -- comparisons have higher precedence in C than bit-
    fiddling operators. Stuff like that is usually best resolved by
    explicitly parenthesizing any "impure" expression fiddling with
    bits. In this case, in a boolean expression plain

    a & 1

    has the hoped-for effect. and is clearer anyway.

    Would be better to use "**" than "^" in comments when
    exponentiation is intended, since "^" means xor in both
    Python and C.

    Doc changes are needed, because you're changing visible
    semantics in some cases.

    Tests are needed, especially for new semantics.

    l_invmod can return NULL for more than one reason, but one
    of its callers ignores this, assuming that all NULL returns are
    due to lack of coprimality. It's unreasonable to, e.g., replace
    a MemoryError with a complaint about coprimality; this needs
    reworking. l_invmod should probably set an exception in
    the "not coprime" case. As is, it's a weird function,
    sometimes setting an exception when it returns NULL, but not
    setting one when coprimality doesn't obtain. That makes life
    difficult for callers (which isn't apparent in the patch,
    because its callers are currently ignoring this issue).

    The Montgomery reduction gimmicks grossly complicate this
    patch -- they're long-winded and hard to follow. That may
    come with the territory, but it's the only part of the patch
    that made me want to vomit <wink>. I'd be happier if it
    weren't there, for aesthetic, clarity, and maintainability
    reasons. How much of a speedup does it actually buy?

    You're right that int pow must deliver the same results as
    long pow, so code is needed for that too. "short int"
    versus "unbounded int" is increasingly meant to be an invisible
    internal implementation detail in Python. I'm also in favor of
    giving this meaning to modular negative exponents, btw, so
    no problem with that. An easy way would be to change int
    pow to delegate to long pow when this is needed.

    Pragmatics: there's a better chance of making 2.4 if the
    patch were done in bite-size stages. For example, no doc
    changes are needed to switch to 5-ary left-to-right
    exponentation, and that has no effect on the int
    implementation either, etc. A patch that did just that much
    probably would have gone in a long time ago.

    @trevp
    Copy link
    Mannequin Author

    trevp mannequin commented Jul 19, 2004

    Logged In: YES
    user_id=973611

    Tim, thanks for the feedback. I'm uploading a new patch
    against CVS latest that fixes those issues, and adds docs
    and tests. Also, I cleaned up the code quite a bit, and got
    it properly handling (I hope) all the varied combinations of
    ints/longs, positives/negatives/zeros etc..

    Unfortunately, Montgomery is the bulk of the speedup:
    http://trevp.net/long_pow/

    But I could split out the negative exponent handling into a
    separate patch, if that would be easier.

    Anyways, I'd like to add more tests for the exponentiation
    stuff. Aside from that, I think the patch is complete. And
    more robust than previously, though I still wouldn't trust
    it until another person or two gives it a serious
    looking-over....

    @tim-one
    Copy link
    Member

    tim-one commented Jul 21, 2004

    Logged In: YES
    user_id=31435

    Pragmatics are a real problem here, Trevor. I don't foresee
    being able to make a solid block of sufficient hours to give to
    reviewing this before Python 2.4 is history (which is why I've
    left this patch unassigned, BTW -- I just can't promise to
    make enough time). So if nobody else can volunteer to
    review it, that alone is likely to leave the patch sitting here
    unapplied.

    But there are several independent changes in this patch, and
    it *could* be broken into several smaller patches. I tossed
    that bait out before, but you didn't bite. You should <wink>.

    @trevp
    Copy link
    Mannequin Author

    trevp mannequin commented Jul 22, 2004

    Logged In: YES
    user_id=973611

    Pragmatics isn't my strong suit... but I get your drift :-).
    I split it into 3 diffs:

    1. x_mul optimizations: (pointers instead of indices,
      special-case squaring, changing Karatsuba cutoff)
    2. rewriting long_pow() for left-to-right 5-ary
    3. Montgomery reduction. This also includes l_invmod(),
      since it's necessary for Montgomery.

    I've left out the code which exposes l_invmod() to the user
    (and associated docs, tests, and intobject changes). We
    could slap that on afterwards or not...

    Anyways, these are applied sequentially:
    longobject.c + longobject1.diff = longobject1.c
    longobject1.c + longobject2.diff = longobject2.c
    longobject2.c + longobject2.diff = longobject3.c

    Should I open new tracker items for them?

    @tim-one
    Copy link
    Member

    tim-one commented Aug 29, 2004

    Logged In: YES
    user_id=31435

    Checked in the first part of the patch, with major format
    changes (Python's C coding standard is hard 8-column tabs),
    and minor code changes:

    Include/longintrepr.h 2.15
    Misc/ACKS 1.280
    Misc/NEWS 1.1119
    Objects/longobject.c 1.162

    I don't know whether it's possible for me to get to part 2 of
    the patch before 2.4a3, but I would like to. It seems plainly
    impossible that I'll be able to get to part 3 before 2.4a3.

    @tim-one
    Copy link
    Member

    tim-one commented Aug 30, 2004

    Logged In: YES
    user_id=31435

    Same deal with the 2nd part of the patch (major format
    changes, minor code changes). Incidentally fixed an old leak
    bug in long_pow() during the review. Added code to raise a
    compile-time error (C) if SHIFT isn't divisible by 5, and
    removed long_pow's new hardcoded assumption that SHIFT is
    exactly 15.

    Include/longintrepr.h 2.16
    Misc/NEWS 1.1120
    Objects/longobject.c 1.163

    This is cool stuff (& thank you!), but I'm sorry to say I can't
    foresee making time for the 3rd part of the patch for weeks.

    @trevp
    Copy link
    Mannequin Author

    trevp mannequin commented Sep 13, 2004

    Logged In: YES
    user_id=973611

    Here's the 3rd part of the patch (long_mont.diff; Montgomery
    Reduction), diff'd against 2.4a3 and cleaned up a bit.

    Note that this doesn't include negative exponent handling.
    If this patch is accepted, I'll make a new tracker item for
    that, since it's not an optimization, just an "opportunistic
    feature" (it builds on one of the helper functions needed
    for Montgomery).

    @trevp
    Copy link
    Mannequin Author

    trevp mannequin commented Oct 4, 2004

    Logged In: YES
    user_id=973611

    I did more code review, testing, and timing. The only
    change in this new patch (long_mont2.diff) is a couple
    "int"s were changed to "digits"s, and it's against CVS head.

    As far as testing, I used the random module and GMPY to
    check it on ~3 million random input values. That's about an
    hour of testing. I'll leave the tests running for a few
    days and see if anything crops up.

    As far as timing, I updated the benchmarks with a new
    machine (OpenBSD):
    http://trevp.net/long_pow/
    On 3 different machines, Montgomery gives a speedup of 2x,
    3x, and 4x. That dwarfs what we've done so far, so I'm
    crossing my fingers for 2.4. Let me know if I can explain
    or improve the code, or anything..

    (The below crypto library comes with a "book" which has an
    explanation of Montgomery I found helpful):
    http://math.libtomcrypt.org/download.html

    @trevp
    Copy link
    Mannequin Author

    trevp mannequin commented Oct 4, 2004

    Logged In: YES
    user_id=973611

    oops. Good thing for random testing, carry propagation was
    buggy. Submitting long_mont3.diff.

    @trevp
    Copy link
    Mannequin Author

    trevp mannequin commented Oct 5, 2004

    Logged In: YES
    user_id=973611

    Montgomery has a fixed cost, so it slows down small
    exponents. For example modular squaring is slowed ~5x. I
    added a MONTGOMERY_CUTOFF to take care of this. Submitting
    long_mont4.diff.

    @trevp
    Copy link
    Mannequin Author

    trevp mannequin commented Sep 29, 2005

    Logged In: YES
    user_id=973611

    I updated this patch to CVS head, but didn't change it
    otherwise. It's still a bit hairy. However, it's also
    still a big speedup (see benchmarks from 2004-10-03).

    If I can do anything to help this make it in 2.5, let me know.

    @tiran
    Copy link
    Member

    tiran commented Jan 12, 2008

    Re-targeting for 2.6
    We should discuss it at the bug day.

    @tiran tiran added type-feature A feature request or enhancement labels Jan 12, 2008
    @tiran
    Copy link
    Member

    tiran commented Jan 12, 2008

    Mark, as the second math guru in our team, you are probably interested
    in these patches. Trevor has written an interesting patch to optimize
    longs for cryptographic problems. In fact it might be two patches now,
    one for the Montgomery Reduction and one containing other optimizations.
    The 2005 patch applies almost cleanly.

    @mdickinson
    Copy link
    Member

    I'll see if I can find time to look at this; I'm currently looking at
    various ways to improve long integer arithmetic in 2.7/3.1.

    @mdickinson mdickinson assigned mdickinson and unassigned tim-one Feb 14, 2009
    @trevp
    Copy link
    Mannequin Author

    trevp mannequin commented Feb 19, 2009

    Hi Mark,

    Let me know if I can give you any help with this. The original patch
    was split into 3 parts. The only part remaining unapplied is the
    Montgomery Reduction.

    It appeared to be a significant speedup when I was last testing, and is
    frequently used in other modular exponentiation libraries, but it does,
    admittedly, complicate the code.

    @mdickinson
    Copy link
    Member

    Thanks, Trevor. I'm currently working on the 15-bit -> 30-bit digit
    change (bpo-4258), since it seems sensible to get that in before
    considering other optimizations, but I certainly hope to find time to look
    at this before the 3.1 release cycle gets underway.

    @mdickinson
    Copy link
    Member

    By the way, I'd be interested to know if you (Trevor) have any thoughts on
    the multiplication optimizations that are in the patch

    30bit_longdigit13+optimizations.patch

    in the bpo-4258 discussion. These have been giving me some quite
    spectacular speedups (4 or 5 times faster) for multiplication on
    64-bit machines; smaller speedups on 32-bit. Currently, those speedups
    render your earlier special-case-squaring patch obsolete; I wonder
    whether there's a way to get the same speedup for squaring.

    @mdickinson
    Copy link
    Member

    This patch still(!) applies almost perfectly cleanly to trunk. On a 64-
    bit machine, I'm getting a failure in test_auto_overflow, coming from:

    >>> pow(0L, 0, 9223372036854775807)
    28051505152L

    I haven't looked hard to figure out where this is coming from, but my
    guess is that the 15-bitness of digits is hard-coded in the patch
    somewhere.

    My general feeling is that three-argument pow is such a little-used
    operation in Python that it's not worth the extra code to speed it up.

    @mdickinson
    Copy link
    Member

    Looks like the test failure is a result of a misplaced (twodigits) cast: replacing the line

    carry += (twodigits) ( (*aptr) + (u * (*mptr++)) );

    in function mont_reduce with

    carry += *aptr + (twodigits)u * *mptr++;

    fixes this.

    @mdickinson
    Copy link
    Member

    Here's a slightly modified version of Trevor's patch:

    • update to apply cleanly to trunk
    • fix misplaced twodigits cast described above
    • add 'PyLong_' prefix to BASE, MASK and SHIFT
    • no need for _PyLong_Copy in l_invmod: just copy and INCREF the
      pointers
    • don't calculate d - q*c by hand in l_invmod, since l_divmod computes
      the remainder already
    • simplify reference counting in l_invmod
    • add a '* 1U' to a digit-by-digit multiplication, to avoid possible
      (in theory; unlikely in practice) undefined behaviour resulting
      from signed integer overflow.

    The rest of the patch looks fine to me, modulo some minor style issues.

    Two points:

    • it seems that l_invmod is only ever used to compute the inverse of a
      single-digit long modulo PyLong_BASE. Mightn't it be more efficient to
      simply do a twodigit/digit-based calculation here instead, to reduce the
      overhead of setting up the Montgomery reductions?

    • the current algorithm for three-argument pow involves many allocations
      and deallocations of integers; it seems to me that these could almost
      all be eliminated: for pow(a, b, c) with c an n-digit PyLong, just
      allocate 2*n (or possibly 2*n+1) digits of workspace to begin with and
      use that to store intermediate products; the Montgomery reduction could
      then be done more-or-less in place. This doesn't work with the non-
      Montgomery method, though, since l_divmod doesn't operate in place.

    @mdickinson
    Copy link
    Member

    Here's a second revision of Trevor's patch:

    • factor out the code for creating Montgomery representatives; this
      simplifies the changes to the main long_pow function
    • get rid of l_invmod and use a simple function for computing the
      negation of an inverse of a single digit modulo PyLong_BASE instead
    • Montgomery reduction wasn't being used for multidigit exponent b if
      the last digit of the exponent is small. Fix this.
    • add 'static' qualifier to mont_reduce function
    • use type Py_ssize_t instead of int for digit indices in mont_reduce
    • various other unimportant style fixes

    @mdickinson
    Copy link
    Member

    Some timings on my machine (OS X 10.6, 64-bit nondebug build, trunk r77157). These are just
    doing an RSA-like powmod pow(c, d, n), with n the product of two similarly-sized primes, d the
    inverse of 7 modulo eulerPhi(n), and c of similar magnitude to n.

    Without the patch:

    Mark-Dickinsons-MacBook-Pro:trunk dickinsm$ ./python.exe ../time_powmod.py
    64-bit modulus: 0.000031
    253-bit modulus: 0.000274
    1023-bit modulus: 0.008032

    With the patch:

    Mark-Dickinsons-MacBook-Pro:trunk-issue936813 dickinsm$ ./python.exe ../time_powmod.py
    64-bit modulus: 0.000025
    253-bit modulus: 0.000209
    1023-bit modulus: 0.006717

    So I'm seeing a speedup of 20-30%.

    I've attached the (rather primtive) timing script. Anyone else want to contribute timings?

    @mdickinson
    Copy link
    Member

    Hmm. For smaller inputs, I'm actually getting significant slowdowns:

    Unpatched:

    >>> timeit('pow(123, 123456789, 123456789L)')
    7.355183839797974

    Patched:

    >>> timeit('pow(123, 123456789, 123456789L)')
    8.873976945877075

    @mdickinson
    Copy link
    Member

    One more lot of timings, from Trevor's pow_benchmark.txt:

    Unpatched
    ---------
    1024 bits: 0.008256
    2048 bits: 0.052324
    3072 bits: 0.159689
    4096 bits: 0.357264

    Patched (percent speedup)
    -------
    1024 bits: 0.006576 (+25.5%)
    2048 bits: 0.045878 (+14.1%)
    3072 bits: 0.135740 (+17.6%)
    4096 bits: 0.310756 (+15.0%)

    I'm not quite sure why I'm not seeing the same level of speedup that Trevor originally
    reported. Perhaps this a result of some of the other optimizations that have been
    applied to the long implementation (30-bit digits, a reworked division algorithm), or
    perhaps the same level of speedup isn't available on 64-bit machines for some reason.
    I hope I haven't somehow negated the effects of the patch in my refactoring.

    @mdickinson
    Copy link
    Member

    Okay, I retested the original patch without any of my refactoring (besides
    fixing the twodigits cast), and got pretty much the same numbers.

    On a 32-bit non-debug trunk build (still on OS X 10.6), I get:

    Unpatched
    ---------

    Mark-Dickinsons-MacBook-Pro:trunk dickinsm$ ./python.exe
    ../pow_benchmark.py
    1024 bits: 0.033691
    2048 bits: 0.224796
    3072 bits: 0.712510
    4096 bits: 1.691484
    Mark-Dickinsons-MacBook-Pro:trunk dickinsm$ ./python.exe ../time_powmod.py
    64-bit modulus: 0.000054
    253-bit modulus: 0.000981
    1023-bit modulus: 0.034314

    Patched
    -------

    Mark-Dickinsons-MacBook-Pro:trunk-issue936813 dickinsm$ ./python.exe
    ../pow_benchmark.py
    1024 bits: 0.027317 (+23.3%)
    2048 bits: 0.181053 (+24.2%)
    3072 bits: 0.571688 (+24.6%)
    4096 bits: 1.251051 (+35.2%)
    Mark-Dickinsons-MacBook-Pro:trunk-issue936813 dickinsm$ ./python.exe
    ../time_powmod.py
    64-bit modulus: 0.000045 (+20.0%)
    253-bit modulus: 0.000773 (+26.9%)
    1023-bit modulus: 0.026983 (+27.2%)

    I must admit I was hoping for a bit more than this. IMO, these speedups
    aren't big enough to justify the extra complexity for what's really a bit
    of a niche function, so I'm going to reject this 3rd part and close this
    issue (but marking it as 'accepted' because most of the original 2004
    patch went in).

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants