
Reduce memory usage of urllib.unquote and unquote_to_bytes #88500

Closed
mustafaelagamey mannequin opened this issue Jun 7, 2021 · 4 comments · Fixed by #96763
Assignees
gpshead

Labels
3.12 (bugs and security fixes), performance (Performance or resource usage), stdlib (Python modules in the Lib dir)

Comments

mustafaelagamey (Mannequin) commented Jun 7, 2021

BPO 44334
Nosy @terryjreedy, @gpshead, @orsenthil, @mustafaelagamey
PRs
  • bpo-44334: Use bytearray in urllib.unquote_to_bytes #26576
  Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2021-06-07.15:07:38.990>
    labels = ['extension-modules', '3.11', '3.9', '3.10', 'performance']
    title = 'Use bytearray in urllib.unquote_to_bytes'
    updated_at = <Date 2021-06-07.20:09:11.522>
    user = 'https://github.com/mustafaelagamey'

    bugs.python.org fields:

    activity = <Date 2021-06-07.20:09:11.522>
    actor = 'gregory.p.smith'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Extension Modules']
    creation = <Date 2021-06-07.15:07:38.990>
    creator = 'eng.mustafaelagamey'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 44334
    keywords = ['patch']
    message_count = 2.0
    messages = ['395280', '395281']
    nosy_count = 4.0
    nosy_names = ['terry.reedy', 'gregory.p.smith', 'orsenthil', 'eng.mustafaelagamey']
    pr_nums = ['26576']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue44334'
    versions = ['Python 3.9', 'Python 3.10', 'Python 3.11']

    @mustafaelagamey (mannequin) added the 3.8 (only security fixes), extension-modules (C modules in the Modules dir), and performance (Performance or resource usage) labels on Jun 7, 2021
    @mustafaelagamey (mannequin) changed the title from "urllib cannot parse large data" to "urllib.parse.parse_qsl cannot parse large data" on Jun 7, 2021
    terryjreedy (Member) commented Jun 7, 2021

    'eng' claimed in the original title that "urllib.parse.parse_qsl cannot parse large data". On the original PR, he said the problem appeared with 6-7 million bytes.

    The claim should be backed up by a generated example that fails with the original code and succeeds with the new code. Claims of 'faster' also need some examples.

    PRs must nearly always propose merging a branch created from main into main. Performance enhancements are often not backported.
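A generated example of the kind requested here might look like the following sketch; the payload is illustrative (only the roughly 6 MB size comes from the original report), and `tracemalloc` is used simply as one convenient way to observe peak allocations:

```python
# A sketch of the kind of generated example requested above; the payload is
# illustrative, and only the ~6 MB size comes from the original report.
import tracemalloc
from urllib.parse import unquote_to_bytes

payload = "%41%42%43" * 700_000  # roughly 6.3 MB of percent-escapes

tracemalloc.start()
result = unquote_to_bytes(payload)
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"decoded {len(result)} bytes; peak traced allocation: {peak / 1e6:.1f} MB")
```

Running this against the old and new implementations would show whether the peak drops, which is the comparison being asked for.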

    @terryjreedy added the 3.11 (only security fixes) label and removed the 3.8 (only security fixes) label on Jun 7, 2021
    @terryjreedy changed the title from "urllib.parse.parse_qsl cannot parse large data" to "Use bytearray in urllib.unquote_to_bytes" on Jun 7, 2021
    gpshead (Member) commented Jun 7, 2021

    FWIW, this sort of thing may be reasonable to backport to 3.9, as it is more than just a performance enhancement: it is also a resource-consumption bug fix and should result in no behavior change.

    """
    In case of form contain very large data ( in my case the string to parse was about 6000000 byte )
    Old code use list of bytes during parsing consumes a lot of memory
    New code will use bytearry , which use less memory
    """ - text from the original PR

    @gpshead added the 3.9 (only security fixes) and 3.10 (only security fixes) labels on Jun 7, 2021
    @ezio-melotti transferred this issue from another repository on Apr 10, 2022
    iritkatriel (Member) commented Sep 11, 2022

    The PR was closed due to technicalities (pointing to the wrong branch, CLA) and the OP didn't follow up.

    Unless someone objects, I will close this issue as well.

    @iritkatriel added the pending (The issue will be closed if no feedback is provided) label on Sep 11, 2022
    gpshead added a commit to gpshead/cpython that referenced this issue Sep 12, 2022
    `urllib.unquote_to_bytes` and `urllib.unquote` could both potentially
    generate `O(len(string))` intermediate `bytes` or `str` objects while
    computing the unquoted final result depending on the input provided. As
    Python objects are relatively large, this could consume a lot of RAM.
    
    This switches the implementation to using an expanding `bytearray` and a
    generator internally instead of precomputed `split()` style operations.
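The technique the commit message names can be sketched like this; `_chunks` and `unquote_to_bytes_sketch` are hypothetical names used only for illustration, and the real CPython change is more involved:

```python
# Rough sketch of "an expanding bytearray and a generator" as named in the
# commit message above; the actual CPython implementation is more involved.
def _chunks(string: bytes):
    # Lazily yield literal runs and decoded '%XX' escapes instead of
    # materializing a precomputed split() list up front.
    i = 0
    while True:
        j = string.find(b'%', i)
        if j < 0:
            yield string[i:]
            return
        yield string[i:j]
        try:
            yield bytes.fromhex(string[j + 1:j + 3].decode('ascii'))
            i = j + 3
        except ValueError:
            yield b'%'  # malformed escape: keep the '%' literally
            i = j + 1

def unquote_to_bytes_sketch(string: bytes) -> bytes:
    out = bytearray()  # one expanding buffer instead of a list of chunks
    for chunk in _chunks(string):
        out += chunk
    return bytes(out)
```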
    @gpshead added the stdlib (Python modules in the Lib dir) and 3.12 (bugs and security fixes) labels and removed the extension-modules (C modules in the Modules dir) and 3.9 (only security fixes) labels on Sep 12, 2022
    @gpshead changed the title from "Use bytearray in urllib.unquote_to_bytes" to "Reduce memory usage of urllib.unquote and unquote_to_bytes" on Sep 12, 2022
    @gpshead removed the 3.11 (only security fixes) and 3.10 (only security fixes) labels on Sep 12, 2022
    gpshead (Member) commented Sep 12, 2022

    I created a new PR that fixes a similar legacy design issue in unquote() as well as the original report's unquote_to_bytes(). Some performance microbenchmarks need to be run before I'll consider moving forward with it.

    If someone wanted to consider this a security issue, it could be backported. It is at most a fixed constant factor of extra memory consumption (roughly $\text{len(input)} \times \text{sizeof(PyObject)}$) against a maximally antagonistic input, though. That doesn't smell DoS-worthy.
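To make that constant factor concrete: every small intermediate object carries CPython's per-object overhead, which is why `O(len(input))` tiny chunks add up. A quick check (numbers are from a 64-bit CPython build and vary by version):

```python
# Each tiny intermediate chunk pays CPython's per-object overhead, which is
# why O(len(input)) small objects add up; sizes vary by build and version.
import sys

print(sys.getsizeof(b''))    # header overhead of an empty bytes object
print(sys.getsizeof(b'A'))   # a one-byte chunk still costs tens of bytes
```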

    @iritkatriel removed the pending (The issue will be closed if no feedback is provided) label on Sep 12, 2022
    @gpshead self-assigned this on Nov 11, 2022
    gpshead added a commit that referenced this issue Dec 11, 2022
    `urllib.unquote_to_bytes` and `urllib.unquote` could both potentially generate `O(len(string))` intermediate `bytes` or `str` objects while computing the unquoted final result, depending on the input provided. As Python objects are relatively large, this could consume a lot of RAM.
    
    This switches the implementation to using an expanding `bytearray` and a generator internally instead of precomputed `split()` style operations.
    
    Microbenchmarks with some antagonistic inputs like `mess = "\u0141%%%20a%fe"*1000` show this is 10-20% slower for unquote and unquote_to_bytes, and no different for typical inputs that are short or lack much unicode or % escaping. But the functions are already quite fast anyway, so this is not a big deal. The slowdown scales consistently linearly with input size, as expected.
    
    Memory usage was observed manually using `/usr/bin/time -v` on `python -m timeit` runs with larger inputs. Unit-testing memory consumption is difficult and does not seem worthwhile.
    
    Observed memory usage is ~1/2 for `unquote()` and <1/3 for `unquote_to_bytes()` using `python -m timeit -s 'from urllib.parse import unquote, unquote_to_bytes; v="\u0141%01\u0161%20"*500_000' 'unquote_to_bytes(v)'` as a test.
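The same microbenchmark can also be approximated from inside Python; a minimal sketch, where the `mess` input is taken verbatim from the commit message and absolute timings will vary by machine and Python version:

```python
# Sketch of the microbenchmark quoted above; `mess` comes from the commit
# message, and absolute timings vary by machine and Python version.
import timeit
from urllib.parse import unquote, unquote_to_bytes

mess = "\u0141%%%20a%fe" * 1000  # antagonistic: unicode plus dense % escaping

for func in (unquote, unquote_to_bytes):
    elapsed = timeit.timeit(lambda: func(mess), number=1000)
    print(f"{func.__name__}: {elapsed * 1000:.2f} ms for 1000 calls")
```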