urllib.request.urlopen does not handle non-ASCII characters #48241

abadger · 2008-09-28T18:47:16Z

BPO	3991
Nosy	@orsenthil, @vstinner, @devdanzin, @ezio-melotti, @abadger, @bitdancer, @vadmium, @remilapeyre
Dependencies	bpo-9679: unicode DNS names in urllib, urlopen
Files	non_ascii_path.diff: Calls quote() on the request path if path.encode('ascii') fails issue3991.diff issue3991_2017-01-27.diff: Diff against 3.7.0a0

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2008-09-28.18:47:16.056>
labels = ['extension-modules', '3.8', 'type-bug', '3.7', 'easy']
title = 'urllib.request.urlopen does not handle non-ASCII characters'
updated_at = <Date 2019-02-20.20:02:12.619>
user = 'https://github.com/abadger'

bugs.python.org fields:

activity = <Date 2019-02-20.20:02:12.619>
actor = 'remi.lapeyre'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Extension Modules']
creation = <Date 2008-09-28.18:47:16.056>
creator = 'a.badger'
dependencies = ['9679']
files = ['12986', '29194', '46440']
hgrepos = []
issue_num = 3991
keywords = ['patch', 'easy']
message_count = 14.0
messages = ['73982', '74046', '74053', '74085', '74088', '74110', '74117', '81423', '182785', '219325', '285721', '286386', '286423', '286444']
nosy_count = 11.0
nosy_names = ['janssen', 'orsenthil', 'vstinner', 'ajaksu2', 'ezio.melotti', 'a.badger', 'r.david.murray', 'martin.panter', 'thezulk', 'Graham.Oliver', 'remi.lapeyre']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue3991'
versions = ['Python 3.4', 'Python 3.5', 'Python 3.6', 'Python 3.7', 'Python 3.8']

Linked PRs

abadger · 2008-09-28T18:47:15Z

Tested on python-3.0rc1 -- Linux Fedora 9

I wanted to make sure that python3.0 would handle url's in different
encodings. So I created two files on an apache server which were named
½ñ.html. One of the filenames was encoded in utf-8 and the other in
latin-1. Then I tried the following::

from urllib.request import urlopen
url = 'http://localhost/u/½ñ.html'
urlopen(url.encode('utf-8')).read()

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.0/urllib/request.py", line 122, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python3.0/urllib/request.py", line 350, in open
    req.timeout = timeout
AttributeError: 'bytes' object has no attribute 'timeout'

The same thing happens if I give None for the two optional arguments
(data and timeout).

Next I tried using a raw Unicode string:

>>> urlopen(url).read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.0/urllib/request.py", line 122, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python3.0/urllib/request.py", line 359, in open
    response = self._open(req, data)
  File "/usr/lib/python3.0/urllib/request.py", line 377, in _open
    '_open', req)
  File "/usr/lib/python3.0/urllib/request.py", line 337, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.0/urllib/request.py", line 1082, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.0/urllib/request.py", line 1068, in do_open
    h.request(req.get_method(), req.get_selector(), req.data, headers)
  File "/usr/lib/python3.0/http/client.py", line 843, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python3.0/http/client.py", line 860, in _send_request
    self.putrequest(method, url, **skips)
  File "/usr/lib/python3.0/http/client.py", line 751, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position
7-8: ordinal not in range(128)

So, in python-3.0rc1, this method is badly broken.

janssen · 2008-09-29T20:47:31Z

As I read RFC 2396,

1.5: "A URI is a sequence of characters from a very
limited set, i.e. the letters of the basic Latin alphabet, digits,
and a few special characters."

2.4: "Data must be escaped if it does not have a representation using an
unreserved character; this includes data that does not correspond to
a printable character of the US-ASCII coded character set, or that
corresponds to any US-ASCII character that is disallowed, as
explained below."

So your URL string is invalid. You need to escape the characters properly.

(RFC 2396 is what the HTTP RFC cites as its authority on URLs.)

abadger · 2008-09-29T22:27:44Z

Possibly. This is a change from python-2.x's urlopen() which escaped
the URL automatically, though. I can see the case for having the user
call an escape function themselves instead of having urlopen() perform
the escape for them. However, that function would need to be written.
(The present parse.quote() method only quotes correctly if only the path
component is passed; there's no function to take a full URL and quote it
appropriately.)

Without such a function, a whole lot of code bases will have to reinvent
the wheel creating functions to parse the path out, run it through
urllib.parse.quote() and then pass the result to urlib.urlopen().

janssen · 2008-09-30T17:44:00Z

It's not immediately clear to me how an auto-quote function can be
written; as you say (and as the URI spec points out), you have to take
a URL apart before quoting it, and you can't parse an invalid URL,
which is what the input is.

Best to think of this as a difference from 2.x.

abadger · 2008-09-30T18:11:23Z

The purpose of such a function would be to take something that is not a
valid uri but 1) is a common way of expressing the way to get to the
resource and 2) follows certain rules and turns that into something that
is a valid uri. non-ASCii strings in the path are a good example of
this since there is a well defined method to encode the strings into the
URL if you are given a character encoding to apply to it.

My first, naive thought is that if the input can be parsed by
urlparse(), then there is a very good chance that we have the ability to
escape the string properly. Looking at the invalid uri that I gave, for
instance, if you additionally specified an encoding for the path element
there's no reason a function couldn't do the escaping.

What are example inputs that you are concerned about? I'll see if I can
come up with code that works with them.

janssen · 2008-10-01T03:24:17Z

I'm not concerned about any example inputs. I was just trying to
explain why this isn't a bug.

On the other hand, the IRI spec (RFC 3897) is another thing we might
try to implement for Python.

abadger · 2008-10-01T06:39:57Z

Oh, that's cool. I've been fine with this being a request for a needed
function to quote and unquote full urls rather than a bug in urlopen().

I think iri's are a distraction here, though. The RFC for iris even
says that specifications that call for uris and do not mention iris
should not take iris. So there's definitely a need for a function to
quote a full uri.

devdanzin · 2009-02-08T21:50:19Z

I think Toshio's usecase is important enough to deserve a fix (patch
attached) or a special-cased error message. IMO, newbies trying to fix
failures from urlopen may have a hard time figuring out the maze:

urlopen -> _opener -> open -> _open -> _call_chain -> http_open ->
do_open (and that's before leaving urllib!).

>>> from urllib.request import urlopen
>>> url = 'http://localhost/ñ.html'
>>> urlopen(url).read()
Traceback (most recent call last):
[...]
UnicodeEncodeError: 'ascii' codec can't encode character '\xf1' in
position 5: ordinal not in range(128)


If the newbie isn't completely lost by then, how about:
>>> from urllib.parse import quote
>>> urlopen(quote(url)).read()
Traceback (most recent call last):
[...]
ValueError: unknown url type: http%3A//localhost/%C3%B1.html

thezulk · 2013-02-23T17:58:20Z

This is a patch against 3.2 adding urllib.parse.quote_uri

It splits the URI in 5 parts (protocol, authentication, hostname, port and path) then runs urllib.parse.quote on the path and encodes the hostname to punycode if it's not in ascii.

It's not perfect, but should be usable in most cases.
I created some test cases aswell.

GrahamOliver · 2014-05-29T01:21:54Z

hello
I came across this bug when using 'ā' in a url
To get around the problem I used the 'URL encoded' version '%C4%81' instead of 'ā'
See this page
http://www.charbase.com/0101-unicode-latin-small-letter-a-with-macron
I tried using the 'puny code' for 'ā' 'xn--yda' but that didn't work

vadmium · 2017-01-18T11:40:38Z

bpo-9679: Focusses on encoding just the DNS name
bpo-20559: Maybe a duplicate, or opportunity for better documentation or error message as a bug fix?

Andreas’s patch just proposes a new function called quote_uri(). It would need documentation. We already have a quote() and quote_plus() function. Since it sounds like this is for IRIs (https://tools.ietf.org/html/rfc3987), would it be more appropriate to call it quote_iri()?

See revision cb09fdef19f5, especially the quote(safe=...) parameter, for how I avoided the double encoding problem.

thezulk · 2017-01-27T21:58:29Z

Changed the patch after pointers from vadmium.
And quote_uri is changed to quote_iri as martin.panter thought it was more appropriate.

vadmium · 2017-01-29T01:06:53Z

I’m not really an expert on non-ASCII URLs / IRIs. Maybe it is obvious to other people that this is a good general implementation, but for me to thoroughly review it I would need time to research the relevant RFCs, other implementations, suitability for the URL schemes listed at <https://docs.python.org/dev/library/urllib.parse.html\>, security implications, etc.

One problem problem with using urlunsplit() is it would strip empty URL components, e.g. quote_iri("http://example/file#") -> "http://example/file". See bpo-22852. This is highlighted by the file:///[. . .] → file:/[. . .] test case.

FYI Martin Panter and vadmium are both just me, no need to get too excited. :) I just updated my settings for Rietveld (code review), so hopefully that is more obvious now.

bitdancer · 2017-01-29T14:09:42Z

I believe the last time this subject was discussed the conclusion was that we really needed a full IRI module that conformed to the relevant RFCs, and that putting something on pypi would be one way to get there.

Someone should research the existing packages. It might be that we need something simpler than what exists, but whatever we do should be informed by what exists, I think.

mblahay · 2023-04-24T21:32:26Z

I am suggesting that this be changed to a documentation issue.

ambv · 2023-04-24T21:37:57Z

Agreed. The domain name would have to be encoded in Punycode, while the rest of the path uses percent-encoding, which depends on the byte representation that the server expects. It's unclear in general to which bytes, say, "ł" should translate on the server side.

Documenting that we're not doing automatic encoding of either domains or path elements is the most reasonable way forward given that it's been 15 years since this issue got created.

mblahay · 2023-04-25T19:39:35Z

@ambv Is this enough?

Open the URL url, which can be either a string containing a valid, properly encoded, URL or a Request object.

I am attempting to avoid specifically calling out the encodings to use as it may be difficult to communicate succinctly and they may change over time.

ambv · 2023-04-25T20:15:44Z

Yes, looks good. Let's see it on a PR.

…d Request (#103855) Co-authored-by: Łukasz Langa <lukasz@langa.pl>

…pen and Request (pythonGH-103855) (cherry picked from commit 44010d0) Co-authored-by: Michael Blahay <mblahay@users.noreply.github.com> Co-authored-by: Łukasz Langa <lukasz@langa.pl>

…open and Request (GH-103855) (#103891) (cherry picked from commit 44010d0) Co-authored-by: Michael Blahay <mblahay@users.noreply.github.com> Co-authored-by: Łukasz Langa <lukasz@langa.pl>

erlend-aasland · 2023-05-09T22:23:03Z

Resolved by:

gh-48241: Clarify URL needs to be encoded when provided to urlopen and Request #103855
[3.11] gh-48241: Clarify URL needs to be encoded when provided to urlopen and Request (GH-103855) #103891

abadger mannequin added the topic-unicode label Sep 28, 2008

janssen mannequin added the type-feature A feature request or enhancement label Oct 1, 2008

devdanzin mannequin added easy stdlib Python modules in the Lib dir labels Feb 12, 2009

vadmium added the 3.7 (EOL) end of life label Jan 18, 2017

dfrojas mannequin added extension-modules C modules in the Modules dir 3.8 only security fixes type-bug An unexpected behavior, bug, or error and removed stdlib Python modules in the Lib dir topic-unicode type-feature A feature request or enhancement labels Feb 20, 2019

ezio-melotti transferred this issue from another repository Apr 10, 2022

ambv removed type-bug An unexpected behavior, bug, or error extension-modules C modules in the Modules dir 3.8 only security fixes 3.7 (EOL) end of life labels Apr 24, 2023

ambv added docs Documentation in the Doc dir needs backport to 3.11 only security fixes labels Apr 24, 2023

AlexWaygood mentioned this issue Apr 25, 2023

gh-48241: Adding note about url needing to be encoded when provided to the urlopen and Request #103850

Closed

bedevere-bot mentioned this issue Apr 25, 2023

gh-48241: Clarify URL needs to be encoded when provided to urlopen and Request #103855

Merged

ambv added a commit that referenced this issue Apr 26, 2023

gh-48241: Clarify URL needs to be encoded when provided to urlopen an…

44010d0

…d Request (#103855) Co-authored-by: Łukasz Langa <lukasz@langa.pl>

bedevere-bot mentioned this issue Apr 26, 2023

[3.11] gh-48241: Clarify URL needs to be encoded when provided to urlopen and Request (GH-103855) #103891

Merged

erlend-aasland closed this as completed May 9, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

urllib.request.urlopen does not handle non-ASCII characters #48241

urllib.request.urlopen does not handle non-ASCII characters #48241

abadger mannequin commented Sep 28, 2008 •

edited by bedevere-bot

Loading

abadger mannequin commented Sep 28, 2008

janssen mannequin commented Sep 29, 2008

abadger mannequin commented Sep 29, 2008

janssen mannequin commented Sep 30, 2008

abadger mannequin commented Sep 30, 2008

janssen mannequin commented Oct 1, 2008

abadger mannequin commented Oct 1, 2008

devdanzin mannequin commented Feb 8, 2009

thezulk mannequin commented Feb 23, 2013

GrahamOliver mannequin commented May 29, 2014

vadmium commented Jan 18, 2017

thezulk mannequin commented Jan 27, 2017

vadmium commented Jan 29, 2017

bitdancer commented Jan 29, 2017

mblahay commented Apr 24, 2023

ambv commented Apr 24, 2023

mblahay commented Apr 25, 2023

ambv commented Apr 25, 2023

erlend-aasland commented May 9, 2023

urllib.request.urlopen does not handle non-ASCII characters #48241

urllib.request.urlopen does not handle non-ASCII characters #48241

Comments

abadger mannequin commented Sep 28, 2008 • edited by bedevere-bot Loading

Linked PRs

abadger mannequin commented Sep 28, 2008

janssen mannequin commented Sep 29, 2008

abadger mannequin commented Sep 29, 2008

janssen mannequin commented Sep 30, 2008

abadger mannequin commented Sep 30, 2008

janssen mannequin commented Oct 1, 2008

abadger mannequin commented Oct 1, 2008

devdanzin mannequin commented Feb 8, 2009

thezulk mannequin commented Feb 23, 2013

GrahamOliver mannequin commented May 29, 2014

vadmium commented Jan 18, 2017

thezulk mannequin commented Jan 27, 2017

vadmium commented Jan 29, 2017

bitdancer commented Jan 29, 2017

mblahay commented Apr 24, 2023

ambv commented Apr 24, 2023

mblahay commented Apr 25, 2023

ambv commented Apr 25, 2023

erlend-aasland commented May 9, 2023

abadger mannequin commented Sep 28, 2008 •

edited by bedevere-bot

Loading