Conversation

@oscarbenjamin (Contributor) commented Sep 26, 2022

As identified in gh-95778, the algorithm used for decimal to binary conversion by int(string) has quadratic complexity. Following on from the refactor of PyLong_FromString in gh-96808, this commit implements a subquadratic algorithm for parsing strings in decimal and other bases, leveraging the subquadratic complexity of integer multiplication.

This PR presents the algorithm discussed here:
https://discuss.python.org/t/int-str-conversions-broken-in-latest-python-bugfix-releases/18889/14?u=oscarbenjamin
Roughly this is algorithm 1.25 (FastIntegerInput) for base conversion as described in Richard P. Brent and Paul Zimmermann, Modern Computer Arithmetic:
https://members.loria.fr/PZimmermann/mca/mca-cup-0.5.9.pdf

The algorithm here delegates the computational work of base conversion to integer multiplication, so that the cost of base conversion is O(M(n)*log(n)), where M(n) is the cost of multiplying n-bit integers. CPython's implementation of multiplication currently tops out at the Karatsuba algorithm, meaning that for large integers M(n) ~ n**1.58, but any improvement in multiplication would also speed up the routine added here. For now, what this means is that doubling the size of a decimal (or other base) input roughly triples the run time, rather than quadrupling it as in the quadratic case. For example, the times.py script shown below gives:
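The recursive splitting can be sketched in pure Python (an illustration of the idea only, not the C code in this PR; the 64-digit cutoff is an arbitrary placeholder for the real tuned threshold):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def _pow10(k):
    # Powers of 10 are reused across the recursion, so cache them.
    return 10 ** k

def str_to_int(s):
    """Parse a decimal string by divide and conquer.

    Splitting the string in half and combining the two halves with one
    big multiplication pushes almost all of the work into int * int, so
    the cost is driven by the multiplication cost M(n) rather than being
    quadratic in the number of digits.
    """
    if len(s) <= 64:  # small inputs: the builtin quadratic parse is fine
        return int(s)
    mid = len(s) // 2
    # value = high_part * 10**mid + low_part
    return str_to_int(s[:-mid]) * _pow10(mid) + str_to_int(s[-mid:])
```

The recursion has depth O(log n), and the single large multiplication at each level is what gives the O(M(n)*log(n)) bound.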

$ ./python times.py  # with the PR
N, times
1000, 2.6769300166051835e-05
2000, 7.497349997720448e-05
4000, 0.0002722875000472413
8000, 0.0009964619999664136
16000, 0.002764432699950703
32000, 0.005058207399997627
64000, 0.013556575300026453
128000, 0.04013246459999209
256000, 0.120897090499966
512000, 0.3636410644001444
1024000, 1.100168641199889
2048000, 3.272053180000512
4096000, 9.863306613000532
8192000, 29.544350371999826
16384000, 88.77739223000026

For comparison with main we have:

$ ./python times.py  # with main
N, times
1000, 2.0493500051088633e-05
2000, 6.1034600003040394e-05
4000, 0.00017917639997904188
8000, 0.000584913099919504
16000, 0.0018660276000446174
32000, 0.007193109999934677
64000, 0.027857269900050597
128000, 0.1132198066001365
256000, 0.4555842253999799
512000, 1.7836244715999783
1024000, 7.321736346999387
2048000, 29.017359752999255
4096000, 116.7394315090005

The CVE referenced by gh-95778 is:
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-10735
and has the text:

A flaw was found in python. In algorithms with quadratic time complexity
using non-binary bases, when using int("text"), a system could take 50ms
to parse an int string with 100,000 digits and 5s for 1,000,000 digits (float,
decimal, int.from_bytes(), and int() for binary bases 2, 4, 8, 16, and 32 are
not affected). The highest threat from this vulnerability is to system availability.

For comparison, with this PR:

>>> import sys
>>> sys.set_int_max_str_digits(0)
>>> import time
>>> s = '1'*100_000
>>> t = time.time(); i = int(s); print(time.time() - t)
0.07359123229980469
>>> s = '1'*1_000_000
>>> t = time.time(); i = int(s); print(time.time() - t)
1.1337623596191406

That shows a slowdown of 15x when increasing the input size by 10x, whereas the CVE refers to a 100x slowdown. The actual slowdown I measure depends on the precise size, but for very large inputs the expectation (as shown in the graph below) is that a 10x bigger input is slower by roughly:

>>> 10**1.58
38.018939632056124

It should be understood, though, that these relative slowdowns compound as the input size grows further: for large enough inputs this PR can be arbitrarily many times faster than main, but the slowdown can still be arbitrarily many times worse than linear.
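To make the comparison concrete, the expected slowdown for a 10x larger input is just the growth exponent applied to the factor (a toy cost model, not a measurement):

```python
def slowdown(exponent, factor=10.0):
    # Run-time ratio when the input grows by `factor`,
    # assuming cost ~ n**exponent.
    return factor ** exponent

print(slowdown(2.0))   # quadratic: the 100x regime the CVE describes
print(slowdown(1.58))  # Karatsuba-limited: roughly 38x
```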

I have not subjected this to any significant testing or benchmarking yet (any help with testing is obviously appreciated). There are probably no examples in the existing test suite that exercise the codepaths introduced here, because of the high threshold at which the algorithm kicks in. There are also two key parameters that could probably be optimised, but I've only tried a few values (and only on one system) to gauge roughly what the best values might be.

This script can give an idea of what this PR could mean for performance:

# times.py

from time import perf_counter
import sys

try:
    sys.set_int_max_str_digits(0)  # remove the digit limit (new in 3.11)
except AttributeError:
    pass  # older Pythons have no limit to remove

def main(use_gmpy2=False):
    if use_gmpy2:
        import gmpy2
        intfunc = gmpy2.mpz
    else:
        intfunc = int

    nrepeats = 10
    N = 1000
    T = 0

    print("N, times")
    while T < 60:
        s = '1' * N
        start = perf_counter()
        for _ in range(nrepeats):
            i = intfunc(s)
        T = (perf_counter() - start) / nrepeats
        print(N, ', ', T, sep='')

        if T > 1:
            nrepeats = 1
        if T > 60:
            break
        N *= 2

main(*sys.argv[1:])

This script can plot the results:

# plot.py

import matplotlib.pyplot as plt
from os.path import splitext
import sys

filenames = sys.argv[1:]
data = {}
for filename in filenames:
    name = splitext(filename)[0]
    Ns = []
    Ts = []
    with open(filename) as fin:
        next(fin)
        for line in fin:
            N, T = line.split(',')
            Ns.append(int(N))
            Ts.append(float(T))
    data[name] = {'N': Ns, 'T': Ts}

fig, ax = plt.subplots(1, 1)
for name, values in data.items():
    ax.loglog(values['N'], values['T'], label=name, marker='.')
ax.legend()
ax.set_xlabel('Decimal digits')
ax.set_ylabel('Time (seconds)')

fig.savefig('times.svg')
plt.show()

With those scripts you can do:

$ ./python times.py > pr.times # With the PR
$ ./python times.py > main.times # With main
$ ./python times.py gmp > gmp.times # With gmpy2 installed
$ ./python plot.py *.times # With matplotlib installed

Then you should see this plot showing the different asymptotic complexity for this PR as compared to the quadratic algorithm used in main:
[plot: times.svg, log-log scale, parse time against digit count for pr, main and gmp]
There are a few factors that give GMP better performance than this PR, but the most notable for larger integers is its use of FFT-based multiplication, giving M(n) ~ n*log(n). In any case this PR has much better performance than main, as is more dramatically illustrated in a plot using a linear scale:
[plot: the same timings on a linear scale]

As identified in pythongh-95778 the algorithm used for decimal to binary
conversion by int(string) has quadratic complexity. Following on from
the refactor of PyLong_FromString in pythongh-96808 this commit implements a
subquadratic algorithm for parsing strings from decimal and other bases
leveraging the subquadratic complexity of integer multiplication.
@tim-one (Member) commented Sep 26, 2022

@oscarbenjamin, please look at Neil's gh-96673. There's no need to have two PRs aiming at the same thing. The str->int there is also limited by Karatsuba, but takes special advantage of base 10. The int->str there takes advantage of the decimal module's asymptotically much-better-still multiplication. There's another version of str->int I'm looking at that also exploits decimal, but, alas, is slower before the string reaches some millions of digits.

Note that Neil's PR doesn't try to improve the speed of str->int for non-power-of-2 bases other than 10. Nobody cares 😉. But, if you care, I suggest building on Neil's PR's approach, which leaves this miserable code in Python instead of adding mountains of micro-level C fiddling. For inputs large enough for this code to pay off at all, essentially all the time is spent in CPython's int * code (whether CPython's Karatsuba or decimal's FFT-like transform) - interpreter overhead for leaving the code in Python is insignificant in context.

@oscarbenjamin (Contributor, Author):

The str->int there is also limited by Karatsuba, but takes special advantage of base 10.

Does the special advantage exceed anything in this PR?

@oscarbenjamin (Contributor, Author):

The int->str there takes advantage of the decimal module's asymptotically much-better-still multiplication

Okay I see where you're going with this. I'll take a closer look at what the decimal module has...

@bjorn-martinsson commented Sep 26, 2022

One thing I want to note is that saying that the time complexity is O(M(n)*log(n)) is not technically correct.

If you do the analysis carefully, you will notice that if you assume M(n) = O(n*log(n)*loglog(n)) then you do get O(M(n)*log(n)).
But if you assume M(n) = O(n**1.58) then you get O(M(n)).
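For reference, here is a sketch of the standard analysis behind this point (my summary, not from either PR): at recursion depth $k$ the algorithm performs $2^k$ multiplications of numbers with about $n/2^k$ digits, so the total cost is

```latex
T(n) \approx \sum_{k=0}^{\log_2 n} 2^k \, M\!\left(\frac{n}{2^k}\right)
```

With quasi-linear $M(n) \sim n \log n \log\log n$, every level costs about $M(n)$ and there are $\log_2 n$ levels, giving $O(M(n)\log n)$. With Karatsuba's $M(n) \sim n^{1.58}$, the terms satisfy $2^k M(n/2^k) = M(n)/2^{0.58k}$, a geometric series dominated by the top level, giving $O(M(n))$.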

@tim-one (Member) commented Sep 26, 2022

The str->int there is also limited by Karatsuba, but takes special advantage of base 10.

Does the special advantage exceed anything in this PR?

You need to decide for yourself. At least 5 people have contributed to the code in his PR so far, and that's what I'm encouraging you to join in on.

I don't have the time or the inclination to stare at piles of C code here 😉. The str->int in the other PR, for example, doesn't need to allocate memory for lists. It does need memory to store a dict with O(log log n) entries, to cache exact powers of 10. Except they're really powers of 5. 10**j = (5*2)**j = 5**j * 2**j = (5**j) << j, and those trailing zero bits are a lot cheaper to obtain via shifting than via asking multiplication to produce them.
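That trick can be sketched in a few lines of Python (a hypothetical illustration; the names are mine, not from either PR):

```python
_pow5_cache = {}

def pow10_shifted(j):
    """Compute 10**j as (5**j) << j.

    Only the 5**j part needs real multiplication; the 2**j factor is a
    left shift, which produces the trailing zero bits far more cheaply
    than asking * to generate them.
    """
    if j not in _pow5_cache:
        _pow5_cache[j] = 5 ** j  # cache exact powers of 5 for reuse
    return _pow5_cache[j] << j
```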

The other PR's algorithms are recursive, which allows them to be compact and elegant. Since we spend almost all cycles in * implementations, the overheads of recursion in Python just don't matter.

@oscarbenjamin (Contributor, Author):

please look at Neil's gh-96673. The str->int there is also limited by Karatsuba, but takes special advantage of base 10.

My timings don't show any benefit resulting from taking "special advantage of base 10". Both PRs seem to be about the same for large inputs and gh-96673 seems to be slower than this PR for inputs of around 1000-2000 digits:
[plot: timings of this PR compared with gh-96673]
That's timing int(string) with strings generated by:

import random

def randstr(N):
    return ''.join(random.choices('123456789', k=N))

I don't have the time or the inclination to stare at piles of C code here

Fair enough but I don't think that the code here is particularly complicated.

The other PR's algorithms are recursive, which allows them to be compact and elegant. Since we spend almost all cycles in * implementations, the overheads of recursion in Python just don't matter.

There is also the question of memory usage. I think this PR could be made to have a peak memory usage not much bigger than the input string (as mentioned in the code comments). I haven't analysed the approach in gh-96673 in detail, but from a quick look I expect its memory overhead is a few times greater.

@FedericoStra commented Sep 28, 2022

I don't have the time or the inclination to stare at piles of C code here

Fair enough but I don't think that the code here is particularly complicated.

Just a quick question (I haven't read any actual code).

Is there any reason why Cython can't be used to get the C-like performance whilst coding in ~Python (i.e. without having to worry about the various PyMem_Malloc, PyMem_Free, Py_DECREF, Py_XDECREF)? Is someone exploring this path?

[Edit] The licensing seems fine: from cython/COPYING.txt:

The output of a Cython compilation is NOT considered a derivative
work of Cython. Specifically, though the compilation process may
embed snippets of varying lengths into the final output, these
snippets, as embedded in the output, do not encumber the resulting
output with any license restrictions.

@pochmann (Contributor) commented Oct 2, 2022

and gh-96673 seems to be slower than this PR for inputs of around 1000-2000 digits:

If I'm not mistaken, it's not intended to be used for such short strings:

cpython/Objects/longobject.c

Lines 2696 to 2705 in 59c81da

#if WITH_PYLONG_MODULE
    if (digits > 6000 && base == 10) {
        /* Switch to _pylong.int_from_string() */
        return pylong_int_from_string(start, end, res);
    }
#endif
    /* Use the quadratic algorithm for non binary bases. */
    return long_from_non_binary_base(start, end, digits, base, res);
    }
}

@oscarbenjamin (Contributor, Author):

and gh-96673 seems to be slower than this PR for inputs of around 1000-2000 digits:

If I'm not mistaken

Yes, that's what I thought. It's probably just a blip in the timing script I used (I did repeat the timings because the result seemed odd, but got the same numbers again). The script was not really intended to measure such small run times.

@oscarbenjamin (Contributor, Author):

Closing after gh-96673 was merged.

@oscarbenjamin oscarbenjamin deleted the pr_subquadratic branch May 8, 2023 12:13
@oscarbenjamin oscarbenjamin restored the pr_subquadratic branch May 8, 2023 12:14