urllib.parse.parse_qsl does not handle unicode data properly #74668

Closed
maxking opened this issue May 26, 2017 · 6 comments
Labels
3.11 (only security fixes); 3.12 (bugs and security fixes); 3.13 (new features, bugs and security fixes); stdlib (Python modules in the Lib dir); type-bug (An unexpected behavior, bug, or error)

Comments

@maxking
Contributor

maxking commented May 26, 2017

BPO 30483
Nosy @maxking, @csabella, @cyrkov

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2017-05-26.09:49:46.950>
labels = ['3.8', 'type-bug', '3.7']
title = 'urllib.parse.parse_qsl does not handle unicode data properly'
updated_at = <Date 2020-04-16.10:52:01.356>
user = 'https://github.com/maxking'

bugs.python.org fields:

activity = <Date 2020-04-16.10:52:01.356>
actor = 'cyrkov'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = []
creation = <Date 2017-05-26.09:49:46.950>
creator = 'maxking'
dependencies = []
files = []
hgrepos = []
issue_num = 30483
keywords = []
message_count = 3.0
messages = ['294541', '318050', '366592']
nosy_count = 3.0
nosy_names = ['maxking', 'cheryl.sabella', 'cyrkov']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'test needed'
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue30483'
versions = ['Python 3.6', 'Python 3.7', 'Python 3.8']

@maxking
Contributor Author

maxking commented May 26, 2017

After decoding the percent-encoded names and values in the query string, parse_qsl calls _coerce_result, which encodes the result to ASCII (the value of _implicit_encoding).

  File "/usr/lib/python3.6/urllib/parse.py", line 691, in parse_qsl
    value = _coerce_result(value)
  File "/usr/lib/python3.6/urllib/parse.py", line 95, in _encode_result
    return obj.encode(encoding, errors)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 1: ordinal not in range(128)

As the partial traceback above shows, parsing fails when the query string values contain percent-encoded non-ASCII (Unicode) characters.
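The two steps described above can be sketched in isolation using only public functions. This is an illustration of the mechanism, not the code in urllib/parse.py, and the sample value caf%C3%A9 is made up for the example rather than taken from the report.

```python
from urllib.parse import unquote

# Step 1: percent-decoding works and yields a str containing a non-ASCII
# character. The input value is hypothetical.
decoded = unquote("caf%C3%A9", encoding="utf-8", errors="replace")

# Step 2: re-encoding that str as ASCII (the implicit encoding applied to
# results for bytes input) is the operation that raises.
try:
    decoded.encode("ascii", "strict")
    error = None
except UnicodeEncodeError as exc:
    error = exc

print(repr(decoded))
print(type(error).__name__)
```

The decoding step itself is fine; only the final ASCII encoding of the already decoded text fails.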

@csabella
Contributor

Would you be able to include an example for recreating this?

Looking at the code, it uses the ascii encoding for bytes (which can only contain ASCII literal characters) and should not be using that encoding for strings.

Thanks!

@csabella csabella added 3.7 (EOL) end of life 3.8 only security fixes type-bug An unexpected behavior, bug, or error labels May 29, 2018
@cyrkov
Mannequin

cyrkov mannequin commented Apr 16, 2020

I have recently stumbled upon this bug, and I can present the example and a solution I've used.
The issue occurs when we try to parse x-www-form-urlencoded data passed as bytes:

>>> from urllib.parse import urlencode, parse_qs
>>> urlencode([('v', 'ö')])
'v=%C3%B6'
>>> parse_qs('v=%C3%B6')
{'v': ['ö']}
>>> parse_qs(b'v=%C3%B6')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.6/urllib/parse.py", line 669, in parse_qs
    encoding=encoding, errors=errors)
  File "/usr/lib64/python3.6/urllib/parse.py", line 722, in parse_qsl
    value = _coerce_result(value)
  File "/usr/lib64/python3.6/urllib/parse.py", line 103, in _encode_result
    return obj.encode(encoding, errors)
UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 0: ordinal not in range(128)

This happens in the parse_qsl function because, for bytes input, _coerce_result is an alias for _encode_result and is called with its default parameter encoding='ascii'. As far as I understand, it should instead be called with the encoding and errors parameters of the parse_qsl function:

742c742
<             name = _coerce_result(name)
---
>             name = _coerce_result(name, encoding=encoding, errors=errors)
745c745
<             value = _coerce_result(value)
---
>             value = _coerce_result(value, encoding=encoding, errors=errors)

I am not sure whether I should commit this to the repo and create a pull request, as described in the devguide.
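For interpreters that still carry the bug, a caller-side workaround along the same lines is possible without touching private functions. The sketch below is an assumption-laden illustration: parse_qsl_bytes is a hypothetical helper name, and it assumes the bytes input contains only ASCII on the wire (non-ASCII data appears percent-encoded), as in the example above.

```python
from urllib.parse import parse_qsl


def parse_qsl_bytes(qs, encoding="utf-8", errors="replace"):
    """Hypothetical helper: parse a bytes query string into bytes pairs."""
    # Assumption: the raw query string is pure ASCII, with any non-ASCII
    # data percent-encoded. Raw non-ASCII bytes would fail here.
    text = qs.decode("ascii")
    # Parsing a str avoids the implicit ASCII re-encoding of the results.
    pairs = parse_qsl(text, encoding=encoding, errors=errors)
    # Re-encode with the caller's encoding, as the proposed patch intends.
    return [
        (name.encode(encoding, errors), value.encode(encoding, errors))
        for name, value in pairs
    ]


print(parse_qsl_bytes(b"v=%C3%B6"))  # [(b'v', b'\xc3\xb6')]
```

With the default errors='replace', percent-encoded bytes that are not valid in the chosen encoding are replaced during decoding, so the round trip is not lossless for arbitrary data.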

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
roaanv pushed a commit to roaanv/cpython that referenced this issue Jun 13, 2022
roaanv added a commit to roaanv/cpython that referenced this issue Jun 13, 2022
@frostburn

Can confirm that this issue exists in Python 3.8.10 and that cyrkov's solution works. (I manually re-implemented parse_qsl as a quick fix.)

@michaelkedar

I've also run into this issue on Python 3.11.5, in code that calls parse_qsl on a flask.request.query_string that contains %-encoded Unicode characters.

@iritkatriel iritkatriel added the stdlib Python modules in the Lib dir label Nov 26, 2023
@serhiy-storchaka serhiy-storchaka added 3.11 only security fixes 3.12 bugs and security fixes 3.13 new features, bugs and security fixes and removed 3.8 only security fixes 3.7 (EOL) end of life labels Feb 21, 2024
@serhiy-storchaka
Member

Both decoding and encoding can fail or lose information. To avoid this, we should either use a lossless encoding or error handler ('latin1' or 'surrogateescape') in both directions, or omit decoding and encoding altogether. The latter is usually more efficient, and can even be simpler, as in this case. #115771 supports arbitrary raw and percent-encoded bytes.
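The difference between lossless and lossy round trips can be checked directly on a small byte string. This is a minimal sketch of the general point, not the implementation in #115771; the sample bytes are chosen for illustration, and unquote_to_bytes is used only to show a bytes-only path that never decodes to str.

```python
from urllib.parse import unquote_to_bytes

# Valid UTF-8 for 'ö' followed by a byte that is invalid in UTF-8.
raw = b"\xc3\xb6\xff"

# latin-1 maps every byte to exactly one code point, so it round-trips.
latin1_roundtrip = raw.decode("latin-1").encode("latin-1")

# surrogateescape smuggles undecodable bytes through as lone surrogates.
surrogate_roundtrip = raw.decode("utf-8", "surrogateescape").encode(
    "utf-8", "surrogateescape"
)

# 'replace' substitutes U+FFFD for the bad byte, so the original is lost.
lossy_roundtrip = raw.decode("utf-8", "replace").encode("utf-8", "replace")

# Staying in bytes avoids the decode/encode pair entirely.
bytes_only = unquote_to_bytes(b"%C3%B6%FF")

print(latin1_roundtrip == raw, surrogate_roundtrip == raw)
print(lossy_roundtrip == raw, bytes_only == raw)
```

Only the 'replace' round trip changes the data; the other three approaches preserve the original bytes.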

serhiy-storchaka added a commit that referenced this issue Mar 5, 2024
urllib.parse functions parse_qs() and parse_qsl() now support bytes
arguments containing raw and percent-encoded non-ASCII data.
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Mar 5, 2024
…honGH-115771)

urllib.parse functions parse_qs() and parse_qsl() now support bytes
arguments containing raw and percent-encoded non-ASCII data.
(cherry picked from commit bdba8ef)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
serhiy-storchaka added a commit that referenced this issue Mar 5, 2024
…-115771) (GH-116366)

urllib.parse functions parse_qs() and parse_qsl() now support bytes
arguments containing raw and percent-encoded non-ASCII data.
(cherry picked from commit bdba8ef)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
serhiy-storchaka added a commit that referenced this issue Mar 5, 2024
…-115771) (GH-116367)

urllib.parse functions parse_qs() and parse_qsl() now support bytes
arguments containing raw and percent-encoded non-ASCII data.
(cherry picked from commit bdba8ef)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Mar 14, 2024
* Restore support of None and other false values (fix regression
  introduced in pythongh-74668).
* Raise TypeError for non-zero integers and non-empty sequences.
serhiy-storchaka added a commit that referenced this issue Mar 16, 2024
* Restore support of None and other false values.
* Raise TypeError for non-zero integers and non-empty sequences.

The regressions were introduced in gh-74668
(bdba8ef).
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Mar 16, 2024
…H-116801)

* Restore support of None and other false values.
* Raise TypeError for non-zero integers and non-empty sequences.

The regressions were introduced in pythongh-74668
(bdba8ef).
(cherry picked from commit 1069a46)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
serhiy-storchaka added a commit that referenced this issue Mar 16, 2024
) (GH-116894)

* Restore support of None and other false values.
* Raise TypeError for non-zero integers and non-empty sequences.

The regressions were introduced in gh-74668
(bdba8ef).
(cherry picked from commit 1069a46)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
serhiy-storchaka added a commit that referenced this issue Mar 16, 2024
) (GH-116895)

* Restore support of None and other false values.
* Raise TypeError for non-zero integers and non-empty sequences.

The regressions were introduced in gh-74668
(bdba8ef).
(cherry picked from commit 1069a46)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
vstinner pushed a commit to vstinner/cpython that referenced this issue Mar 20, 2024
…H-116801)

* Restore support of None and other false values.
* Raise TypeError for non-zero integers and non-empty sequences.

The regressions were introduced in pythongh-74668
(bdba8ef).
adorilson pushed a commit to adorilson/cpython that referenced this issue Mar 25, 2024
…honGH-115771)

urllib.parse functions parse_qs() and parse_qsl() now support bytes
arguments containing raw and percent-encoded non-ASCII data.
adorilson pushed a commit to adorilson/cpython that referenced this issue Mar 25, 2024
…H-116801)

* Restore support of None and other false values.
* Raise TypeError for non-zero integers and non-empty sequences.

The regressions were introduced in pythongh-74668
(bdba8ef).
diegorusso pushed a commit to diegorusso/cpython that referenced this issue Apr 17, 2024
…honGH-115771)

urllib.parse functions parse_qs() and parse_qsl() now support bytes
arguments containing raw and percent-encoded non-ASCII data.
diegorusso pushed a commit to diegorusso/cpython that referenced this issue Apr 17, 2024
…H-116801)

* Restore support of None and other false values.
* Raise TypeError for non-zero integers and non-empty sequences.

The regressions were introduced in pythongh-74668
(bdba8ef).