Skip to content

Conversation

@oscarbenjamin
Copy link
Contributor

@oscarbenjamin oscarbenjamin commented Sep 13, 2022

This is a preliminary PR to refactor PyLong_FromString which is currently quite messy and has spaghetti like code that mixes up different concerns as well as duplicating logic.

In particular:

  • PyLong_FromString now only handles sign, base and prefix detection and calls a new function long_from_string_base to parse the main body of the string.
  • The long_from_string_base function handles all string validation and then calls long_from_binary_base or a new function long_from_non_binary_base to construct the actual PyLong.
  • The existing long_from_binary_base function is simplified by factoring duplicated logic to long_from_string_base.
  • The new function long_from_non_binary_base factors out much of the code from PyLong_FromString including in particular the quadratic algorithm reffered to in CVE-2020-10735: Prevent DoS by large int<->str conversions #95778 so that this can be seen separately from unrelated concerns such as string validation.

I intend to follow up on this with a PR to improve the algorithm used for decimal and other non binary bases but I think that would be a lot easier to do after this refactoring. I could also submit that algorithm in the same PR but I thought it would be easier to review this refactoring separately from a change of algorithm.

@bedevere-bot
Copy link

Most changes to Python require a NEWS entry.

Please add it using the blurb_it web app or the blurb command-line tool.

@ghost
Copy link

ghost commented Sep 13, 2022

All commit authors signed the Contributor License Agreement.
CLA signed

@bedevere-bot
Copy link

Most changes to Python require a NEWS entry.

Please add it using the blurb_it web app or the blurb command-line tool.

@gpshead
Copy link
Member

gpshead commented Sep 14, 2022

Please create a new github feature request issue to track this work.

@gpshead gpshead changed the title gh-95778: Refactor PyLong_FromString to separate concerns Refactor PyLong_FromString to separate concerns Sep 14, 2022
@gpshead gpshead added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Sep 14, 2022
@oscarbenjamin oscarbenjamin changed the title Refactor PyLong_FromString to separate concerns gh-96812: Refactor PyLong_FromString to separate concerns Sep 14, 2022
@oscarbenjamin
Copy link
Contributor Author

Please create a new github feature request issue to track this work.

I don't really understand this workflow but does #96812 work for this?

@gpshead gpshead changed the title gh-96812: Refactor PyLong_FromString to separate concerns gh-90716: Refactor PyLong_FromString to separate concerns Sep 14, 2022
@gpshead
Copy link
Member

gpshead commented Sep 14, 2022

yep, that makes sense, though we've already got an issue for that - edited/redirected to it. :)

@mdickinson
Copy link
Member

Thanks for the PR! I can look at this this weekend.

Copy link
Member

@mdickinson mdickinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@oscarbenjamin It would be useful to have a summary of the changes in the PR description. If I'm understanding correctly:

  • long_from_binary_base has been modified to no longer count characters or normalise its result
  • there's a new function long_from_non_binary_base which has exactly the same signature as long_from_binary_base, but is intended for conversions from strings not of base 2, 4, 8, 16 or 32
  • another new function long_from_string_base encapsulates parsing, validation, digit counting, and length validation, and dispatches to long_from_binary_base or long_from_non_binary_base as appropriate
  • The top-level PyLong_FromString handles signs, base-wrangling (including the special case base=0) and prefixes like 0x, and then hands things off to long_from_string_base; it also performs the final normalization

Is the above reasonably accurate? I assume the point is that long_from_non_binary_base can now be a target for optimization.

/* *str points to the first digit in a string of base `base` digits. base
* is a power of 2 (2, 4, 8, 16, or 32). *str is set to point to the first
* non-digit (which may be *str!). A normalized int is returned.
/* `start` and `end` point to the start end of a string of base `base` digits.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/* `start` and `end` point to the start end of a string of base `base` digits.
/* `start` and `end` point to the start and end of a string of base `base` digits.

* is a power of 2 (2, 4, 8, 16, or 32). *str is set to point to the first
* non-digit (which may be *str!). A normalized int is returned.
/* `start` and `end` point to the start end of a string of base `base` digits.
* base is a power of 2 (2, 4, 8, 16, or 32). An unnormalized int is returned.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please could you add a description of the new digits parameter? (It would be good to clarify that it ignores underscores, so is not the same as end - start.)

Comment on lines 2649 to 2656
/*
* long_from_string_base is the main workhorse. It sets str to the first
* null byte or the first invalid character and either:
*
* - Returns -1 for a SyntaxError.
* - Returns 0 and sets z to NULL for MemoryError/OverflowError.
* - Sets z to an unsigned, unnormalized PyLong (success!).
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could probably lose most of this comment given the comprehensive description just before the long_from_string_base function itself.

@mdickinson
Copy link
Member

@oscarbenjamin

It would be useful to have a summary of the changes in the PR description.

Gah. Please ignore. Reading fail.

@oscarbenjamin
Copy link
Contributor Author

Thanks @mdickinson for the review. I think the last commit addresses all comments (I had to rebase to get CI to run but otherwise the first two commits are unchanged).

Is the above reasonably accurate? I assume the point is that long_from_non_binary_base can now be a target for optimization.

Yes, exactly. I have a branch with an implementation of subquadratic int(string). Exact details can vary but the idea would be something like rename long_from_non_binary_base to long_from_base_quadratic and add a separate long_from_base_subquadratic. The subquadratic function would call the quadratic one to parse segments of the string up to the size of the Karatsuba cutoff and then build those up to the main result using integer multiplication.

Copy link
Member

@mdickinson mdickinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Before this goes in, do you want to edit Misc/ACKS to add your name in? (Completely optional.)

@mdickinson mdickinson added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Sep 21, 2022
@bedevere-bot
Copy link

🤖 New build scheduled with the buildbot fleet by @mdickinson for commit e7b3ac1 🤖

If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Sep 21, 2022
@mdickinson
Copy link
Member

Running this on all buildbots, to be on the safe side.

@oscarbenjamin
Copy link
Contributor Author

do you want to edit Misc/ACKS to add your name in?

Looks like I'm already in there:

Oscar Benjamin

That's from a long time ago though.

@mdickinson
Copy link
Member

@nascheme This PR will likely conflict with #96673; do those conflicts look manageable?

@mdickinson mdickinson requested a review from nascheme September 21, 2022 17:35
@nascheme
Copy link
Member

It will conflict but I can fix up my PR, shouldn't be too hard to do.

@nascheme
Copy link
Member

I have a rebased version of #96673 on top of this. So that shouldn't hold up merging this one, if we think this improves the code. I think the idea is good but I haven't reviewed the actual code.

@mdickinson
Copy link
Member

mdickinson commented Sep 25, 2022

Merging. Refactoring of the longobject.c core isn't something we do very often, for the usual reasons (risk of unintended consequences, disruption of existing PRs, etc.), but I think it's warranted here, and we have long enough before 3.12 is released to find and iron out any issues. The code LGTM.

@mdickinson mdickinson merged commit 817fa28 into python:main Sep 25, 2022
oscarbenjamin added a commit to oscarbenjamin/cpython that referenced this pull request Sep 25, 2022
As identified in pythongh-95778 the algorithm used for decimal to binary
conversion by int(string) has quadratic complexity. Following on from
the reafctor of PyLong_FromString in pythongh-96808 this commit implements a
subquadratic algorithm for parsing strings from decimal and other bases
leveraging the subquadratic complexity of integer multiplication.
oscarbenjamin added a commit to oscarbenjamin/cpython that referenced this pull request Sep 25, 2022
As identified in pythongh-95778 the algorithm used for decimal to binary
conversion by int(string) has quadratic complexity. Following on from
the reafctor of PyLong_FromString in pythongh-96808 this commit implements a
subquadratic algorithm for parsing strings from decimal and other bases
leveraging the subquadratic complexity of integer multiplication.
oscarbenjamin added a commit to oscarbenjamin/cpython that referenced this pull request Sep 25, 2022
As identified in pythongh-95778 the algorithm used for decimal to binary
conversion by int(string) has quadratic complexity. Following on from
the reafctor of PyLong_FromString in pythongh-96808 this commit implements a
subquadratic algorithm for parsing strings from decimal and other bases
leveraging the subquadratic complexity of integer multiplication.
@oscarbenjamin
Copy link
Contributor Author

Thanks @mdickinson.

I've opened gh-90716 as a follow up. I realised while working through that that the separation of concerns introduced here makes the return values of long_from_binary_base and long_from_non_binary_base redundant: they always return 0. Those functions could be changed to return PyLongObject* rather than int. I can make that change in gh-90716 if it seems worthwhile.

@oscarbenjamin oscarbenjamin deleted the pr_pylong branch September 26, 2022 00:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

interpreter-core (Objects, Python, Grammar, and Parser dirs)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants