
Reduce memory usage of urllib.unquote and unquote_to_bytes #88500

Closed
mustafaelagamey mannequin opened this issue Jun 7, 2021 · 4 comments · Fixed by #96763
Assignees
gpshead

Labels
3.12 (bugs and security fixes), performance (Performance or resource usage), stdlib (Python modules in the Lib dir)

Comments

mustafaelagamey (Mannequin) commented Jun 7, 2021

BPO 44334
Nosy @terryjreedy, @gpshead, @orsenthil, @mustafaelagamey
PRs
  • bpo-44334: Use bytearray in urllib.unquote_to_bytes #26576
  Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2021-06-07.15:07:38.990>
    labels = ['extension-modules', '3.11', '3.9', '3.10', 'performance']
    title = 'Use bytearray in urllib.unquote_to_bytes'
    updated_at = <Date 2021-06-07.20:09:11.522>
    user = 'https://github.com/mustafaelagamey'

    bugs.python.org fields:

    activity = <Date 2021-06-07.20:09:11.522>
    actor = 'gregory.p.smith'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Extension Modules']
    creation = <Date 2021-06-07.15:07:38.990>
    creator = 'eng.mustafaelagamey'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 44334
    keywords = ['patch']
    message_count = 2.0
    messages = ['395280', '395281']
    nosy_count = 4.0
    nosy_names = ['terry.reedy', 'gregory.p.smith', 'orsenthil', 'eng.mustafaelagamey']
    pr_nums = ['26576']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue44334'
    versions = ['Python 3.9', 'Python 3.10', 'Python 3.11']

    @mustafaelagamey (mannequin) added the 3.8 (only security fixes), extension-modules (C modules in the Modules dir), and performance (Performance or resource usage) labels on Jun 7, 2021
    @mustafaelagamey (mannequin) changed the title from "urllib cannot parse large data" to "urllib.parse.parse_qsl cannot parse large data" on Jun 7, 2021
    terryjreedy (Member) commented Jun 7, 2021

    'eng' claimed in the original title that "urllib.parse.parse_qsl cannot parse large data". On the original PR, he said the problem appeared with 6-7 million bytes.

    The claim should be backed up by a generated example that fails with the original code and succeeds with the new code. Claims of 'faster' also need some examples.

    PRs must nearly always propose merging a branch created from main into main. Performance enhancements are often not backported.
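A generated example of the kind requested here might look like the following sketch; the payload is illustrative (only the roughly 6 MB size comes from the original report), and `tracemalloc` is used simply as one convenient way to observe peak allocations:

```python
# A sketch of the kind of generated example requested above; the payload is
# illustrative, and only the ~6 MB size comes from the original report.
import tracemalloc
from urllib.parse import unquote_to_bytes

payload = "%41%42%43" * 700_000  # roughly 6.3 MB of percent-escapes

tracemalloc.start()
result = unquote_to_bytes(payload)
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"decoded {len(result)} bytes; peak traced allocation: {peak / 1e6:.1f} MB")
```

Running this against the old and new implementations would show whether the peak drops, which is the comparison being asked for.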

    @terryjreedy added the 3.11 (only security fixes) label and removed the 3.8 (only security fixes) label on Jun 7, 2021
    @terryjreedy changed the title from "urllib.parse.parse_qsl cannot parse large data" to "Use bytearray in urllib.unquote_to_bytes" on Jun 7, 2021
    gpshead (Member) commented Jun 7, 2021

    FWIW, this sort of thing may be reasonable to backport to 3.9, as it is more than just a performance enhancement: it is also a resource-consumption bug fix and should result in no behavior change.

    """
    In case of form contain very large data ( in my case the string to parse was about 6000000 byte )
    Old code use list of bytes during parsing consumes a lot of memory
    New code will use bytearry , which use less memory
    """ - text from the original PR

    @gpshead added the 3.9 (only security fixes) and 3.10 (only security fixes) labels on Jun 7, 2021
    @ezio-melotti transferred this issue from another repository on Apr 10, 2022
    iritkatriel (Member) commented Sep 11, 2022

    The PR was closed due to technicalities (pointing to the wrong branch, CLA) and the OP didn't follow up.

    Unless someone objects, I will close this issue as well.

    @iritkatriel added the pending (The issue will be closed if no feedback is provided) label on Sep 11, 2022
    gpshead added a commit to gpshead/cpython that referenced this issue Sep 12, 2022
    `urllib.unquote_to_bytes` and `urllib.unquote` could both potentially
    generate `O(len(string))` intermediate `bytes` or `str` objects while
    computing the unquoted final result depending on the input provided. As
    Python objects are relatively large, this could consume a lot of RAM.
    
    This switches the implementation to using an expanding `bytearray` and a
    generator internally instead of precomputed `split()` style operations.
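The technique the commit message names can be sketched like this; `_chunks` and `unquote_to_bytes_sketch` are hypothetical names used only for illustration, and the real CPython change is more involved:

```python
# Rough sketch of "an expanding bytearray and a generator" as named in the
# commit message above; the actual CPython implementation is more involved.
def _chunks(string: bytes):
    # Lazily yield literal runs and decoded '%XX' escapes instead of
    # materializing a precomputed split() list up front.
    i = 0
    while True:
        j = string.find(b'%', i)
        if j < 0:
            yield string[i:]
            return
        yield string[i:j]
        try:
            yield bytes.fromhex(string[j + 1:j + 3].decode('ascii'))
            i = j + 3
        except ValueError:
            yield b'%'  # malformed escape: keep the '%' literally
            i = j + 1

def unquote_to_bytes_sketch(string: bytes) -> bytes:
    out = bytearray()  # one expanding buffer instead of a list of chunks
    for chunk in _chunks(string):
        out += chunk
    return bytes(out)
```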
    @gpshead added the stdlib (Python modules in the Lib dir) and 3.12 (bugs and security fixes) labels and removed the extension-modules (C modules in the Modules dir) and 3.9 (only security fixes) labels on Sep 12, 2022
    @gpshead changed the title from "Use bytearray in urllib.unquote_to_bytes" to "Reduce memory usage of urllib.unquote and unquote_to_bytes" on Sep 12, 2022
    @gpshead removed the 3.11 (only security fixes) and 3.10 (only security fixes) labels on Sep 12, 2022
    gpshead (Member) commented Sep 12, 2022

    I created a new PR that fixes a similar legacy design issue in unquote() as well as the original report's unquote_to_bytes(). Some performance microbenchmarks need to be run before I'll consider moving forward with it.

    If someone wanted to consider this a security issue, it could be backported. It is at most a fixed constant factor of extra memory consumption (roughly $\text{len(input)} \times \text{sizeof(PyObject)}$) against a maximally antagonistic input, though. That doesn't smell DoS-worthy.
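To make that constant factor concrete: every small intermediate object carries CPython's per-object overhead, which is why `O(len(input))` tiny chunks add up. A quick check (numbers are from a 64-bit CPython build and vary by version):

```python
# Each tiny intermediate chunk pays CPython's per-object overhead, which is
# why O(len(input)) small objects add up; sizes vary by build and version.
import sys

print(sys.getsizeof(b''))    # header overhead of an empty bytes object
print(sys.getsizeof(b'A'))   # a one-byte chunk still costs tens of bytes
```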

    @iritkatriel removed the pending (The issue will be closed if no feedback is provided) label on Sep 12, 2022
    @gpshead self-assigned this on Nov 11, 2022
    gpshead added a commit that referenced this issue Dec 11, 2022
    `urllib.unquote_to_bytes` and `urllib.unquote` could both potentially generate `O(len(string))` intermediate `bytes` or `str` objects while computing the unquoted final result, depending on the input provided. As Python objects are relatively large, this could consume a lot of RAM.
    
    This switches the implementation to using an expanding `bytearray` and a generator internally instead of precomputed `split()` style operations.
    
    Microbenchmarks with some antagonistic inputs like `mess = "\u0141%%%20a%fe"*1000` show this is 10-20% slower for unquote and unquote_to_bytes, and no different for typical inputs that are short or lack much unicode or % escaping. But the functions are already quite fast anyway, so this is not a big deal. The slowdown scales consistently linearly with input size, as expected.
    
    Memory usage was observed manually using `/usr/bin/time -v` on `python -m timeit` runs with larger inputs. Unit-testing memory consumption is difficult and does not seem worthwhile.
    
    Observed memory usage is ~1/2 for `unquote()` and <1/3 for `unquote_to_bytes()` using `python -m timeit -s 'from urllib.parse import unquote, unquote_to_bytes; v="\u0141%01\u0161%20"*500_000' 'unquote_to_bytes(v)'` as a test.
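The same microbenchmark can also be approximated from inside Python; a minimal sketch, where the `mess` input is taken verbatim from the commit message and absolute timings will vary by machine and Python version:

```python
# Sketch of the microbenchmark quoted above; `mess` comes from the commit
# message, and absolute timings vary by machine and Python version.
import timeit
from urllib.parse import unquote, unquote_to_bytes

mess = "\u0141%%%20a%fe" * 1000  # antagonistic: unicode plus dense % escaping

for func in (unquote, unquote_to_bytes):
    elapsed = timeit.timeit(lambda: func(mess), number=1000)
    print(f"{func.__name__}: {elapsed * 1000:.2f} ms for 1000 calls")
```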