gh-90716: Refactor PyLong_FromString to separate concerns #96808
|
Most changes to Python require a NEWS entry. Please add it using the blurb_it web app or the blurb command-line tool. |
80fa8d4 to
4005568
Compare
c50f304 to
361f7e7
Compare
|
Please create a new github feature request issue to track this work. |
I don't really understand this workflow but does #96812 work for this? |
|
yep, that makes sense, though we've already got an issue for that - edited/redirected to it. :) |
|
Thanks for the PR! I can look at this this weekend. |
mdickinson
left a comment
@oscarbenjamin It would be useful to have a summary of the changes in the PR description. If I'm understanding correctly:
- `long_from_binary_base` has been modified to no longer count characters or normalise its result
- there's a new function `long_from_non_binary_base` which has exactly the same signature as `long_from_binary_base`, but is intended for conversions from strings not of base `2`, `4`, `8`, `16` or `32`
- another new function `long_from_string_base` encapsulates parsing, validation, digit counting, and length validation, and dispatches to `long_from_binary_base` or `long_from_non_binary_base` as appropriate
- The top-level `PyLong_FromString` handles signs, base-wrangling (including the special case base=0) and prefixes like `0x`, and then hands things off to `long_from_string_base`; it also performs the final normalization
Is the above reasonably accurate? I assume the point is that long_from_non_binary_base can now be a target for optimization.
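The layering described in the summary above can be modelled in a few lines of Python. This is only an illustrative sketch, not the C code in `Objects/longobject.c`: the Python function names mirror the C ones, but their signatures, the `DIGITS` table, and the simplified error handling (for example, no leading-zero check for base=0) are assumptions made for the example.

```python
import string

# Hypothetical digit table for this sketch: '0'-'9' then 'a'-'z' (values 0..35).
DIGITS = {c: i for i, c in enumerate(string.digits + string.ascii_lowercase)}


def long_from_binary_base(digits, base):
    """Power-of-two path: every digit contributes a fixed number of bits."""
    bits_per_digit = base.bit_length() - 1
    value = 0
    for d in digits:
        value = (value << bits_per_digit) | d
    return value


def long_from_non_binary_base(digits, base):
    """General path: plain multiply-and-add (the target for optimization)."""
    value = 0
    for d in digits:
        value = value * base + d
    return value


def long_from_string_base(body, base):
    """Validate the body, drop underscores, then dispatch on the base."""
    digits = []
    prev = "_"  # pretend the body starts after '_' so a leading '_' is rejected
    for ch in body:
        if ch == "_":
            if prev == "_":
                raise ValueError("invalid underscore placement")
        else:
            d = DIGITS.get(ch.lower(), base)
            if d >= base:
                raise ValueError(f"invalid digit {ch!r} for base {base}")
            digits.append(d)
        prev = ch
    if prev == "_":  # empty body or trailing underscore
        raise ValueError("invalid literal")
    if (base & (base - 1)) == 0:
        return long_from_binary_base(digits, base)
    return long_from_non_binary_base(digits, base)


def pylong_from_string(text, base=10):
    """Top level: sign, base=0 detection and prefixes only (simplified)."""
    text = text.strip()
    sign = 1
    if text[:1] in ("+", "-"):
        sign = -1 if text[0] == "-" else 1
        text = text[1:]
    prefix_base = {"0x": 16, "0o": 8, "0b": 2}.get(text[:2].lower())
    if base == 0:
        base = prefix_base or 10
    if prefix_base == base:
        text = text[2:]
        if text.startswith("_"):  # one underscore is allowed after a prefix
            text = text[1:]
    return sign * long_from_string_base(text, base)
```

In this model only `long_from_non_binary_base` would need to change to swap in a faster algorithm, which is the point of the separation.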
Objects/longobject.c
Outdated
| /* *str points to the first digit in a string of base `base` digits. base | ||
| * is a power of 2 (2, 4, 8, 16, or 32). *str is set to point to the first | ||
| * non-digit (which may be *str!). A normalized int is returned. | ||
| /* `start` and `end` point to the start end of a string of base `base` digits. |
| /* `start` and `end` point to the start end of a string of base `base` digits. | |
| /* `start` and `end` point to the start and end of a string of base `base` digits. |
Objects/longobject.c
Outdated
| * is a power of 2 (2, 4, 8, 16, or 32). *str is set to point to the first | ||
| * non-digit (which may be *str!). A normalized int is returned. | ||
| /* `start` and `end` point to the start end of a string of base `base` digits. | ||
| * base is a power of 2 (2, 4, 8, 16, or 32). An unnormalized int is returned. |
Please could you add a description of the new digits parameter? (It would be good to clarify that it ignores underscores, so is not the same as end - start.)
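To illustrate the distinction being asked for: underscores take up bytes between `start` and `end` but contribute no digits. A tiny Python model, where `count_digits` is a hypothetical helper for this example and not a function in the PR:

```python
def count_digits(body: str) -> int:
    """Count the characters that are real digits, skipping '_' separators."""
    return sum(1 for ch in body if ch != "_")


body = "1_000_000"
span = len(body)              # plays the role of end - start
digits = count_digits(body)   # plays the role of the digits parameter
print(span, digits)
```

Any size estimate for the result would be based on `digits`, since `span` overcounts whenever underscores are present.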
Objects/longobject.c
Outdated
| /* | ||
| * long_from_string_base is the main workhorse. It sets str to the first | ||
| * null byte or the first invalid character and either: | ||
| * | ||
| * - Returns -1 for a SyntaxError. | ||
| * - Returns 0 and sets z to NULL for MemoryError/OverflowError. | ||
| * - Sets z to an unsigned, unnormalized PyLong (success!). | ||
| */ |
We could probably lose most of this comment given the comprehensive description just before the long_from_string_base function itself.
Gah. Please ignore. Reading fail. |
a1ff3b2 to
e7b3ac1
Compare
|
Thanks @mdickinson for the review. I think the last commit addresses all comments (I had to rebase to get CI to run but otherwise the first two commits are unchanged).
Yes, exactly. I have a branch with an implementation of subquadratic |
mdickinson
left a comment
LGTM.
Before this goes in, do you want to edit Misc/ACKS to add your name in? (Completely optional.)
|
🤖 New build scheduled with the buildbot fleet by @mdickinson for commit e7b3ac1 🤖 If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again. |
|
Running this on all buildbots, to be on the safe side. |
Looks like I'm already in there: Line 151 in 4b81139
That's from a long time ago though. |
|
It will conflict but I can fix up my PR, shouldn't be too hard to do. |
|
I have a rebased version of #96673 on top of this. So that shouldn't hold up merging this one, if we think this improves the code. I think the idea is good but I haven't reviewed the actual code. |
|
Merging. Refactoring of the |
As identified in pythongh-95778, the algorithm used for decimal to binary conversion by int(string) has quadratic complexity. Following on from the refactor of PyLong_FromString in pythongh-96808, this commit implements a subquadratic algorithm for parsing strings from decimal and other bases, leveraging the subquadratic complexity of integer multiplication.
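For readers unfamiliar with the idea, a divide-and-conquer conversion can be sketched in Python as follows. This illustrates the general technique only; it is not the algorithm from the referenced commit, and the function name `str_to_int_dc` and the `cutoff` threshold are assumptions chosen for the example.

```python
def str_to_int_dc(digits: str, base: int = 10, cutoff: int = 64) -> int:
    """Convert a string of plain digits (no sign, prefix or underscores)."""
    n = len(digits)
    if n <= cutoff:
        # Small inputs: the simple conversion is cheap enough.
        return int(digits, base)
    half = n // 2
    # Split so that value = high * base**len(low) + low, with len(low) == half.
    high = str_to_int_dc(digits[: n - half], base, cutoff)
    low = str_to_int_dc(digits[n - half:], base, cutoff)
    return high * base**half + low
```

The work is dominated by a few large multiplications rather than many small multiply-and-add steps, so the overall cost follows the cost of big-integer multiplication. A practical version would probably also cache the powers of `base` instead of recomputing them at every level.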
|
Thanks @mdickinson. I've opened gh-90716 as a follow up. I realised while working through that that the separation of concerns introduced here makes the return values of |
This is a preliminary PR to refactor `PyLong_FromString`, which is currently quite messy and has spaghetti-like code that mixes up different concerns as well as duplicating logic.

In particular:
- `PyLong_FromString` now only handles sign, base and prefix detection and calls a new function `long_from_string_base` to parse the main body of the string.
- The `long_from_string_base` function handles all string validation and then calls `long_from_binary_base` or a new function `long_from_non_binary_base` to construct the actual `PyLong`.
- The `long_from_binary_base` function is simplified by factoring duplicated logic to `long_from_string_base`.
- The `long_from_non_binary_base` function factors out much of the code from `PyLong_FromString`, including in particular the quadratic algorithm referred to in CVE-2020-10735: Prevent DoS by large int<->str conversions #95778, so that this can be seen separately from unrelated concerns such as string validation.

I intend to follow up on this with a PR to improve the algorithm used for decimal and other non-binary bases, but I think that would be a lot easier to do after this refactoring. I could also submit that algorithm in the same PR, but I thought it would be easier to review this refactoring separately from a change of algorithm.
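To make the quadratic behaviour mentioned above concrete, here is a rough Python model of a multiply-and-add conversion that counts limb-level operations. It is a simplification written for illustration, not a transcription of the C code: the 30-bit limb size, the one-digit-per-pass loop, and the names `horner_limbs` and `limbs_to_int` are all assumptions.

```python
SHIFT = 30                 # assumption: 30-bit limbs
MASK = (1 << SHIFT) - 1


def horner_limbs(digits: str, base: int = 10):
    """Return (limbs, limb_ops) for a multiply-and-add conversion."""
    limbs = []             # little-endian limbs of the running value
    limb_ops = 0
    for ch in digits:
        carry = int(ch, base)
        # Multiply the whole running value by base and add the new digit.
        for i in range(len(limbs)):
            carry += limbs[i] * base
            limbs[i] = carry & MASK
            carry >>= SHIFT
            limb_ops += 1
        if carry:
            limbs.append(carry)
    return limbs, limb_ops


def limbs_to_int(limbs):
    """Reassemble the limbs into a Python int, for checking the result."""
    return sum(limb << (SHIFT * i) for i, limb in enumerate(limbs))


# Each new digit touches every limb produced so far, so doubling the input
# length roughly quadruples the number of limb operations.
_, ops_small = horner_limbs("7" * 200)
_, ops_large = horner_limbs("7" * 400)
print(ops_small, ops_large)
```

Having this loop isolated in one function is what lets it be examined, and later replaced, without touching the validation code.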