gh-90716: Refactor PyLong_FromString to separate concerns #96808

oscarbenjamin · 2022-09-13T21:36:48Z

This is a preliminary PR to refactor PyLong_FromString, which is currently quite messy: its spaghetti-like code mixes up different concerns and duplicates logic.

In particular:

- PyLong_FromString now only handles sign, base and prefix detection, and calls a new function long_from_string_base to parse the main body of the string.
- The long_from_string_base function handles all string validation and then calls long_from_binary_base or a new function long_from_non_binary_base to construct the actual PyLong.
- The existing long_from_binary_base function is simplified by factoring duplicated logic out into long_from_string_base.
- The new function long_from_non_binary_base factors out much of the code from PyLong_FromString, in particular the quadratic algorithm referred to in #95778 (CVE-2020-10735: Prevent DoS by large int<->str conversions), so that it can be seen separately from unrelated concerns such as string validation.
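To make the layering described in the list above concrete, here is a rough Python model of it. This is only a sketch under stated assumptions, not the C code from the PR: the names parse_int, from_string_base, from_binary_base and from_non_binary_base are hypothetical stand-ins for the C functions, base-range checks are omitted, and the base=0 rules are simplified (for example, leading zeros are not rejected).

```python
import string

# Digit values for bases up to 36 (0-9 then a-z), as accepted by int().
DIGIT_VALUES = {c: i for i, c in enumerate(string.digits + string.ascii_lowercase)}


def from_binary_base(digits, base):
    """Power-of-two base: every digit contributes a fixed number of bits."""
    bits_per_digit = base.bit_length() - 1
    value = 0
    for d in digits:
        value = (value << bits_per_digit) | d
    return value


def from_non_binary_base(digits, base):
    """Any other base: multiply-and-add, quadratic in the number of digits."""
    value = 0
    for d in digits:
        value = value * base + d
    return value


def from_string_base(body, base):
    """Validate the digit string, then dispatch on the kind of base."""
    digits = []
    prev = "_"  # pretend an underscore precedes the body, so a leading one is rejected
    for ch in body:
        if ch == "_":
            if prev == "_":
                raise ValueError("misplaced underscore")
        else:
            d = DIGIT_VALUES.get(ch.lower(), base)  # unknown characters count as invalid
            if d >= base:
                raise ValueError(f"invalid digit {ch!r} for base {base}")
            digits.append(d)
        prev = ch
    if prev == "_":  # trailing underscore, or an empty body
        raise ValueError("no digits, or trailing underscore")
    if (base & (base - 1)) == 0:  # 2, 4, 8, 16 or 32
        return from_binary_base(digits, base)
    return from_non_binary_base(digits, base)


def parse_int(text, base=10):
    """Top level: only sign, base and prefix handling live here."""
    s = text.strip()
    sign = 1
    if s[:1] in ("+", "-"):
        sign = -1 if s[0] == "-" else 1
        s = s[1:]
    prefix_base = {"0x": 16, "0o": 8, "0b": 2}.get(s[:2].lower())
    if base == 0:
        base = prefix_base or 10
    if prefix_base == base:
        s = s[2:]
        if s[:1] == "_":  # a single underscore is allowed right after the prefix
            s = s[1:]
    return sign * from_string_base(s, base)
```

With this split, only from_non_binary_base contains the quadratic loop, so it is the only piece that would need to change for a faster algorithm.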

I intend to follow up on this with a PR to improve the algorithm used for decimal and other non-binary bases, but I think that would be a lot easier to do after this refactoring. I could also submit that algorithm in the same PR, but I thought it would be easier to review this refactoring separately from a change of algorithm.

Issue: Quadratic time internal base conversions #90716

bedevere-bot · 2022-09-13T21:36:51Z

Most changes to Python require a NEWS entry.

Please add it using the blurb_it web app or the blurb command-line tool.

ghost · 2022-09-13T21:37:17Z

All commit authors signed the Contributor License Agreement.

gpshead · 2022-09-14T00:59:31Z

Please create a new github feature request issue to track this work.

oscarbenjamin · 2022-09-14T01:15:33Z

> Please create a new github feature request issue to track this work.

I don't really understand this workflow but does #96812 work for this?

gpshead · 2022-09-14T02:01:46Z

yep, that makes sense, though we've already got an issue for that - edited/redirected to it. :)

mdickinson · 2022-09-14T16:59:34Z

Thanks for the PR! I can look at this this weekend.

mdickinson

@oscarbenjamin It would be useful to have a summary of the changes in the PR description. If I'm understanding correctly:

- long_from_binary_base has been modified to no longer count characters or normalise its result
- there's a new function long_from_non_binary_base which has exactly the same signature as long_from_binary_base, but is intended for conversions from strings not of base 2, 4, 8, 16 or 32
- another new function long_from_string_base encapsulates parsing, validation, digit counting, and length validation, and dispatches to long_from_binary_base or long_from_non_binary_base as appropriate
- the top-level PyLong_FromString handles signs, base-wrangling (including the special case base=0) and prefixes like 0x, and then hands things off to long_from_string_base; it also performs the final normalization

Is the above reasonably accurate? I assume the point is that long_from_non_binary_base can now be a target for optimization.
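For readers unfamiliar with the "unnormalized" result mentioned in the summary above, the following Python sketch illustrates the idea. It is a model under assumptions rather than the C implementation: the names pack_binary_base, normalize and limbs_to_int are hypothetical, and SHIFT = 30 is an assumed internal digit width (CPython builds use 15 or 30 bits).

```python
SHIFT = 30  # assumed width of one internal digit ("limb"); not taken from the PR
MASK = (1 << SHIFT) - 1


def pack_binary_base(chars, base):
    """Pack power-of-two-base digit characters into little-endian limbs.

    The limb count is fixed up front from the number of characters, so
    leading zero characters can leave zero limbs at the top of the result.
    """
    bits_per_char = base.bit_length() - 1
    n = (len(chars) * bits_per_char + SHIFT - 1) // SHIFT
    limbs = [0] * n
    accum = 0
    bits_in_accum = 0
    i = 0
    for ch in reversed(chars):  # least significant character first
        accum |= int(ch, base) << bits_in_accum
        bits_in_accum += bits_per_char
        if bits_in_accum >= SHIFT:
            limbs[i] = accum & MASK
            i += 1
            accum >>= SHIFT
            bits_in_accum -= SHIFT
    if bits_in_accum:
        limbs[i] = accum
    return limbs


def normalize(limbs):
    """Drop zero limbs from the most significant end."""
    n = len(limbs)
    while n > 0 and limbs[n - 1] == 0:
        n -= 1
    return limbs[:n]


def limbs_to_int(limbs):
    """Reassemble a Python int from little-endian limbs (for checking only)."""
    return sum(limb << (SHIFT * i) for i, limb in enumerate(limbs))
```

In this model, pack_binary_base("0000000000ff", 16) yields two limbs, the upper one zero, and normalize trims it. Doing that trimming once at the top level is what lets the lower-level helpers skip it.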

mdickinson · 2022-09-20T17:12:18Z

Objects/longobject.c

-/* *str points to the first digit in a string of base `base` digits.  base
- * is a power of 2 (2, 4, 8, 16, or 32).  *str is set to point to the first
- * non-digit (which may be *str!).  A normalized int is returned.
+/* `start` and `end` point to the start end of a string of base `base` digits.


Suggested change

-/* `start` and `end` point to the start end of a string of base `base` digits.
+/* `start` and `end` point to the start and end of a string of base `base` digits.

mdickinson · 2022-09-20T17:13:30Z

Objects/longobject.c

- * is a power of 2 (2, 4, 8, 16, or 32).  *str is set to point to the first
- * non-digit (which may be *str!).  A normalized int is returned.
+/* `start` and `end` point to the start end of a string of base `base` digits.
+ * base is a power of 2 (2, 4, 8, 16, or 32). An unnormalized int is returned.


Please could you add a description of the new digits parameter? (It would be good to clarify that it ignores underscores, so is not the same as end - start.)
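A small Python illustration of the distinction raised here, assuming the digit count is what sizes the result: count_digits and limbs_needed are hypothetical helper names, and the 30-bit default for shift is an assumption about the internal digit width.

```python
def count_digits(body):
    """Number of actual digit characters; underscores are separators only."""
    return sum(1 for ch in body if ch != "_")


def limbs_needed(body, base, shift=30):
    """Internal digits needed for a power-of-two base, sized from the digit count."""
    bits_per_char = base.bit_length() - 1
    return (count_digits(body) * bits_per_char + shift - 1) // shift


body = "fff_ffff"
print(len(body), count_digits(body), limbs_needed(body, 16))
```

For "fff_ffff" the span length (the analogue of end - start) is 8 but the digit count is 7. In base 16 that is the difference between 32 bits and 28 bits, so sizing from the span length would ask for one more 30-bit limb than the value needs.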

mdickinson · 2022-09-20T17:28:20Z

Objects/longobject.c

+    /*
+     * long_from_string_base is the main workhorse. It sets str to the first
+     * null byte or the first invalid character and either:
+     *
+     * - Returns -1 for a SyntaxError.
+     * - Returns 0 and sets z to NULL for MemoryError/OverflowError.
+     * - Sets z to an unsigned, unnormalized PyLong (success!).
+     */


We could probably lose most of this comment given the comprehensive description just before the long_from_string_base function itself.

mdickinson · 2022-09-20T17:45:18Z

@oscarbenjamin

> It would be useful to have a summary of the changes in the PR description.

Gah. Please ignore. Reading fail.

oscarbenjamin · 2022-09-20T20:06:54Z

Thanks @mdickinson for the review. I think the last commit addresses all comments (I had to rebase to get CI to run but otherwise the first two commits are unchanged).

> Is the above reasonably accurate? I assume the point is that long_from_non_binary_base can now be a target for optimization.

Yes, exactly. I have a branch with an implementation of subquadratic int(string). Exact details can vary, but the idea would be to rename long_from_non_binary_base to long_from_base_quadratic and add a separate long_from_base_subquadratic. The subquadratic function would call the quadratic one to parse segments of the string up to the size of the Karatsuba cutoff and then build those up to the main result using integer multiplication.
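The idea in the comment above can be sketched in Python as a divide-and-conquer routine. This is a hedged illustration, not the code from the branch mentioned: the function names mirror the proposed C names but are Python stand-ins, CUTOFF is a hypothetical segment length rather than the real Karatsuba cutoff, and the input is assumed to be a list of already-validated digit values, most significant first.

```python
CUTOFF = 64  # hypothetical segment length in digits; not the real Karatsuba cutoff


def from_base_quadratic(digits, base):
    """Schoolbook multiply-and-add; used only on short segments."""
    value = 0
    for d in digits:
        value = value * base + d
    return value


def from_base_subquadratic(digits, base):
    """Split the digits, convert each part, and combine with one big multiplication."""
    if len(digits) <= CUTOFF:
        return from_base_quadratic(digits, base)
    half = len(digits) // 2  # number of digits in the low part
    high = from_base_subquadratic(digits[:-half], base)
    low = from_base_subquadratic(digits[-half:], base)
    # The multiplication is where a subquadratic integer multiply pays off.
    # A real implementation would likely cache the powers of the base.
    return high * base ** half + low


text = "1234567890" * 100
digits = [int(ch) for ch in text]
print(from_base_subquadratic(digits, 10) == int(text))
```

The recursion bottoms out in the quadratic routine on segments of at most CUTOFF digits, so the overall cost is dominated by the large multiplications used to combine the parts.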

mdickinson

LGTM.

Before this goes in, do you want to edit Misc/ACKS to add your name in? (Completely optional.)

bedevere-bot · 2022-09-21T15:04:00Z

🤖 New build scheduled with the buildbot fleet by @mdickinson for commit e7b3ac1 🤖

If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.

mdickinson · 2022-09-21T15:04:15Z

Running this on all buildbots, to be on the safe side.

oscarbenjamin · 2022-09-21T16:59:30Z

> do you want to edit Misc/ACKS to add your name in?

Looks like I'm already in there:

cpython/Misc/ACKS

Line 151 in 4b81139

Oscar Benjamin

That's from a long time ago though.

mdickinson · 2022-09-21T17:34:48Z

@nascheme This PR will likely conflict with #96673; do those conflicts look manageable?

nascheme · 2022-09-21T17:38:42Z

It will conflict but I can fix up my PR, shouldn't be too hard to do.

nascheme · 2022-09-21T19:24:44Z

I have a rebased version of #96673 on top of this. So that shouldn't hold up merging this one, if we think this improves the code. I think the idea is good but I haven't reviewed the actual code.

mdickinson · 2022-09-25T09:07:06Z

Merging. Refactoring of the longobject.c core isn't something we do very often, for the usual reasons (risk of unintended consequences, disruption of existing PRs, etc.), but I think it's warranted here, and we have long enough before 3.12 is released to find and iron out any issues. The code LGTM.

As identified in pythongh-95778, the algorithm used for decimal-to-binary conversion by int(string) has quadratic complexity. Following on from the refactor of PyLong_FromString in pythongh-96808, this commit implements a subquadratic algorithm for parsing strings from decimal and other bases, leveraging the subquadratic complexity of integer multiplication.

oscarbenjamin · 2022-09-26T00:23:05Z

Thanks @mdickinson.

I've opened #97550 as a follow-up. While working on it I realised that the separation of concerns introduced here makes the return values of long_from_binary_base and long_from_non_binary_base redundant: they always return 0. Those functions could be changed to return PyLongObject* rather than int. I can make that change in #97550 if it seems worthwhile.

bedevere-bot added the awaiting review label Sep 13, 2022

oscarbenjamin force-pushed the pr_pylong branch from 80fa8d4 to 4005568 Compare September 13, 2022 21:39

oscarbenjamin force-pushed the pr_pylong branch from c50f304 to 361f7e7 Compare September 13, 2022 21:47

gpshead changed the title ~~gh-95778: Refactor PyLong_FromString to separate concerns~~ Refactor PyLong_FromString to separate concerns Sep 14, 2022

gpshead requested review from mdickinson and tim-one September 14, 2022 01:00

gpshead added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Sep 14, 2022

oscarbenjamin changed the title ~~Refactor PyLong_FromString to separate concerns~~ gh-96812: Refactor PyLong_FromString to separate concerns Sep 14, 2022

gpshead changed the title ~~gh-96812: Refactor PyLong_FromString to separate concerns~~ gh-90716: Refactor PyLong_FromString to separate concerns Sep 14, 2022

mdickinson reviewed Sep 20, 2022

oscarbenjamin added 3 commits September 20, 2022 20:59

Refactor PyLong_FromString to separate concerns

5edf818

Add news entry

f9869ba

Refactor PyLong_FromString: update comments

e7b3ac1

oscarbenjamin force-pushed the pr_pylong branch from a1ff3b2 to e7b3ac1 Compare September 20, 2022 19:59

mdickinson approved these changes Sep 21, 2022

bedevere-bot added awaiting merge and removed awaiting review labels Sep 21, 2022

mdickinson added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Sep 21, 2022

bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Sep 21, 2022

mdickinson requested a review from nascheme September 21, 2022 17:35

mdickinson merged commit 817fa28 into python:main Sep 25, 2022

bedevere-bot removed the awaiting merge label Sep 25, 2022

nascheme mentioned this pull request Sep 25, 2022

gh-90716: add _pylong.py module #96673

Merged

oscarbenjamin mentioned this pull request Sep 26, 2022

gh-90716: Use subquadratic algorithms for int(string) #97550

Closed

oscarbenjamin deleted the pr_pylong branch September 26, 2022 00:23
