Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

urllib.request.urlopen does not handle non-ASCII characters #48241

Closed
abadger mannequin opened this issue Sep 28, 2008 · 19 comments
Closed

urllib.request.urlopen does not handle non-ASCII characters #48241

abadger mannequin opened this issue Sep 28, 2008 · 19 comments
Labels
docs Documentation in the Doc dir easy needs backport to 3.11 only security fixes

Comments

@abadger
Copy link
Mannequin

abadger mannequin commented Sep 28, 2008

BPO 3991
Nosy @orsenthil, @vstinner, @devdanzin, @ezio-melotti, @abadger, @bitdancer, @vadmium, @remilapeyre
Dependencies
  • bpo-9679: unicode DNS names in urllib, urlopen
  • Files
  • non_ascii_path.diff: Calls quote() on the request path if path.encode('ascii') fails
  • issue3991.diff
  • issue3991_2017-01-27.diff: Diff against 3.7.0a0
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2008-09-28.18:47:16.056>
    labels = ['extension-modules', '3.8', 'type-bug', '3.7', 'easy']
    title = 'urllib.request.urlopen does not handle non-ASCII characters'
    updated_at = <Date 2019-02-20.20:02:12.619>
    user = 'https://github.com/abadger'

    bugs.python.org fields:

    activity = <Date 2019-02-20.20:02:12.619>
    actor = 'remi.lapeyre'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Extension Modules']
    creation = <Date 2008-09-28.18:47:16.056>
    creator = 'a.badger'
    dependencies = ['9679']
    files = ['12986', '29194', '46440']
    hgrepos = []
    issue_num = 3991
    keywords = ['patch', 'easy']
    message_count = 14.0
    messages = ['73982', '74046', '74053', '74085', '74088', '74110', '74117', '81423', '182785', '219325', '285721', '286386', '286423', '286444']
    nosy_count = 11.0
    nosy_names = ['janssen', 'orsenthil', 'vstinner', 'ajaksu2', 'ezio.melotti', 'a.badger', 'r.david.murray', 'martin.panter', 'thezulk', 'Graham.Oliver', 'remi.lapeyre']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue3991'
    versions = ['Python 3.4', 'Python 3.5', 'Python 3.6', 'Python 3.7', 'Python 3.8']

    Linked PRs

    @abadger
    Copy link
    Mannequin Author

    abadger mannequin commented Sep 28, 2008

    Tested on python-3.0rc1 -- Linux Fedora 9

    I wanted to make sure that python3.0 would handle url's in different
    encodings. So I created two files on an apache server which were named
    ½ñ.html. One of the filenames was encoded in utf-8 and the other in
    latin-1. Then I tried the following::

    from urllib.request import urlopen
    url = 'http://localhost/u/½ñ.html'
    urlopen(url.encode('utf-8')).read()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python3.0/urllib/request.py", line 122, in urlopen
        return _opener.open(url, data, timeout)
      File "/usr/lib/python3.0/urllib/request.py", line 350, in open
        req.timeout = timeout
    AttributeError: 'bytes' object has no attribute 'timeout'

    The same thing happens if I give None for the two optional arguments
    (data and timeout).

    Next I tried using a raw Unicode string:

    >>> urlopen(url).read()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python3.0/urllib/request.py", line 122, in urlopen
        return _opener.open(url, data, timeout)
      File "/usr/lib/python3.0/urllib/request.py", line 359, in open
        response = self._open(req, data)
      File "/usr/lib/python3.0/urllib/request.py", line 377, in _open
        '_open', req)
      File "/usr/lib/python3.0/urllib/request.py", line 337, in _call_chain
        result = func(*args)
      File "/usr/lib/python3.0/urllib/request.py", line 1082, in http_open
        return self.do_open(http.client.HTTPConnection, req)
      File "/usr/lib/python3.0/urllib/request.py", line 1068, in do_open
        h.request(req.get_method(), req.get_selector(), req.data, headers)
      File "/usr/lib/python3.0/http/client.py", line 843, in request
        self._send_request(method, url, body, headers)
      File "/usr/lib/python3.0/http/client.py", line 860, in _send_request
        self.putrequest(method, url, **skips)
      File "/usr/lib/python3.0/http/client.py", line 751, in putrequest
        self._output(request.encode('ascii'))
    UnicodeEncodeError: 'ascii' codec can't encode characters in position
    7-8: ordinal not in range(128)

    So, in python-3.0rc1, this method is badly broken.

    @abadger abadger mannequin added the topic-unicode label Sep 28, 2008
    @janssen
    Copy link
    Mannequin

    janssen mannequin commented Sep 29, 2008

    As I read RFC 2396,

    1.5: "A URI is a sequence of characters from a very
    limited set, i.e. the letters of the basic Latin alphabet, digits,
    and a few special characters."

    2.4: "Data must be escaped if it does not have a representation using an
    unreserved character; this includes data that does not correspond to
    a printable character of the US-ASCII coded character set, or that
    corresponds to any US-ASCII character that is disallowed, as
    explained below."

    So your URL string is invalid. You need to escape the characters properly.

    (RFC 2396 is what the HTTP RFC cites as its authority on URLs.)

    @abadger
    Copy link
    Mannequin Author

    abadger mannequin commented Sep 29, 2008

    Possibly. This is a change from python-2.x's urlopen() which escaped
    the URL automatically, though. I can see the case for having the user
    call an escape function themselves instead of having urlopen() perform
    the escape for them. However, that function would need to be written.
    (The present parse.quote() method only quotes correctly if only the path
    component is passed; there's no function to take a full URL and quote it
    appropriately.)

    Without such a function, a whole lot of code bases will have to reinvent
    the wheel creating functions to parse the path out, run it through
    urllib.parse.quote() and then pass the result to urlib.urlopen().

    @janssen
    Copy link
    Mannequin

    janssen mannequin commented Sep 30, 2008

    It's not immediately clear to me how an auto-quote function can be
    written; as you say (and as the URI spec points out), you have to take
    a URL apart before quoting it, and you can't parse an invalid URL,
    which is what the input is.

    Best to think of this as a difference from 2.x.

    @abadger
    Copy link
    Mannequin Author

    abadger mannequin commented Sep 30, 2008

    The purpose of such a function would be to take something that is not a
    valid uri but 1) is a common way of expressing the way to get to the
    resource and 2) follows certain rules and turns that into something that
    is a valid uri. non-ASCii strings in the path are a good example of
    this since there is a well defined method to encode the strings into the
    URL if you are given a character encoding to apply to it.

    My first, naive thought is that if the input can be parsed by
    urlparse(), then there is a very good chance that we have the ability to
    escape the string properly. Looking at the invalid uri that I gave, for
    instance, if you additionally specified an encoding for the path element
    there's no reason a function couldn't do the escaping.

    What are example inputs that you are concerned about? I'll see if I can
    come up with code that works with them.

    @janssen
    Copy link
    Mannequin

    janssen mannequin commented Oct 1, 2008

    I'm not concerned about any example inputs. I was just trying to
    explain why this isn't a bug.

    On the other hand, the IRI spec (RFC 3897) is another thing we might
    try to implement for Python.

    @janssen janssen mannequin added the type-feature A feature request or enhancement label Oct 1, 2008
    @abadger
    Copy link
    Mannequin Author

    abadger mannequin commented Oct 1, 2008

    Oh, that's cool. I've been fine with this being a request for a needed
    function to quote and unquote full urls rather than a bug in urlopen().

    I think iri's are a distraction here, though. The RFC for iris even
    says that specifications that call for uris and do not mention iris
    should not take iris. So there's definitely a need for a function to
    quote a full uri.

    @devdanzin
    Copy link
    Mannequin

    devdanzin mannequin commented Feb 8, 2009

    I think Toshio's usecase is important enough to deserve a fix (patch
    attached) or a special-cased error message. IMO, newbies trying to fix
    failures from urlopen may have a hard time figuring out the maze:

    urlopen -> _opener -> open -> _open -> _call_chain -> http_open ->
    do_open (and that's before leaving urllib!).

    >>> from urllib.request import urlopen
    >>> url = 'http://localhost/ñ.html'
    >>> urlopen(url).read()
    Traceback (most recent call last):
    [...]
    UnicodeEncodeError: 'ascii' codec can't encode character '\xf1' in
    position 5: ordinal not in range(128)
    
    
    If the newbie isn't completely lost by then, how about:
    >>> from urllib.parse import quote
    >>> urlopen(quote(url)).read()
    Traceback (most recent call last):
    [...]
    ValueError: unknown url type: http%3A//localhost/%C3%B1.html

    @devdanzin devdanzin mannequin added easy stdlib Python modules in the Lib dir labels Feb 12, 2009
    @thezulk
    Copy link
    Mannequin

    thezulk mannequin commented Feb 23, 2013

    This is a patch against 3.2 adding urllib.parse.quote_uri

    It splits the URI in 5 parts (protocol, authentication, hostname, port and path) then runs urllib.parse.quote on the path and encodes the hostname to punycode if it's not in ascii.

    It's not perfect, but should be usable in most cases.
    I created some test cases aswell.

    @GrahamOliver
    Copy link
    Mannequin

    GrahamOliver mannequin commented May 29, 2014

    hello
    I came across this bug when using 'ā' in a url
    To get around the problem I used the 'URL encoded' version '%C4%81' instead of 'ā'
    See this page
    http://www.charbase.com/0101-unicode-latin-small-letter-a-with-macron
    I tried using the 'puny code' for 'ā' 'xn--yda' but that didn't work

    @vadmium
    Copy link
    Member

    vadmium commented Jan 18, 2017

    bpo-9679: Focusses on encoding just the DNS name
    bpo-20559: Maybe a duplicate, or opportunity for better documentation or error message as a bug fix?

    Andreas’s patch just proposes a new function called quote_uri(). It would need documentation. We already have a quote() and quote_plus() function. Since it sounds like this is for IRIs (https://tools.ietf.org/html/rfc3987), would it be more appropriate to call it quote_iri()?

    See revision cb09fdef19f5, especially the quote(safe=...) parameter, for how I avoided the double encoding problem.

    @vadmium vadmium added the 3.7 (EOL) end of life label Jan 18, 2017
    @thezulk
    Copy link
    Mannequin

    thezulk mannequin commented Jan 27, 2017

    Changed the patch after pointers from vadmium.
    And quote_uri is changed to quote_iri as martin.panter thought it was more appropriate.

    @vadmium
    Copy link
    Member

    vadmium commented Jan 29, 2017

    I’m not really an expert on non-ASCII URLs / IRIs. Maybe it is obvious to other people that this is a good general implementation, but for me to thoroughly review it I would need time to research the relevant RFCs, other implementations, suitability for the URL schemes listed at <https://docs.python.org/dev/library/urllib.parse.html\>, security implications, etc.

    One problem problem with using urlunsplit() is it would strip empty URL components, e.g. quote_iri("http://example/file#") -> "http://example/file". See bpo-22852. This is highlighted by the file:///[. . .] → file:/[. . .] test case.

    FYI Martin Panter and vadmium are both just me, no need to get too excited. :) I just updated my settings for Rietveld (code review), so hopefully that is more obvious now.

    @bitdancer
    Copy link
    Member

    I believe the last time this subject was discussed the conclusion was that we really needed a full IRI module that conformed to the relevant RFCs, and that putting something on pypi would be one way to get there.

    Someone should research the existing packages. It might be that we need something simpler than what exists, but whatever we do should be informed by what exists, I think.

    @dfrojas dfrojas mannequin added extension-modules C modules in the Modules dir 3.8 only security fixes type-bug An unexpected behavior, bug, or error and removed stdlib Python modules in the Lib dir topic-unicode type-feature A feature request or enhancement labels Feb 20, 2019
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @ambv ambv removed type-bug An unexpected behavior, bug, or error extension-modules C modules in the Modules dir 3.8 only security fixes 3.7 (EOL) end of life labels Apr 24, 2023
    @ambv ambv added docs Documentation in the Doc dir needs backport to 3.11 only security fixes labels Apr 24, 2023
    @mblahay
    Copy link
    Contributor

    mblahay commented Apr 24, 2023

    I am suggesting that this be changed to a documentation issue.

    @ambv
    Copy link
    Contributor

    ambv commented Apr 24, 2023

    Agreed. The domain name would have to be encoded in Punycode, while the rest of the path uses percent-encoding, which depends on the byte representation that the server expects. It's unclear in general to which bytes, say, "ł" should translate on the server side.

    Documenting that we're not doing automatic encoding of either domains or path elements is the most reasonable way forward given that it's been 15 years since this issue got created.

    @mblahay
    Copy link
    Contributor

    mblahay commented Apr 25, 2023

    @ambv Is this enough?

    Open the URL url, which can be either a string containing a valid, properly encoded, URL or a Request object.

    I am attempting to avoid specifically calling out the encodings to use as it may be difficult to communicate succinctly and they may change over time.

    @ambv
    Copy link
    Contributor

    ambv commented Apr 25, 2023

    Yes, looks good. Let's see it on a PR.

    ambv added a commit that referenced this issue Apr 26, 2023
    …d Request (#103855)
    
    Co-authored-by: Łukasz Langa <lukasz@langa.pl>
    miss-islington pushed a commit to miss-islington/cpython that referenced this issue Apr 26, 2023
    …pen and Request (pythonGH-103855)
    
    (cherry picked from commit 44010d0)
    
    Co-authored-by: Michael Blahay <mblahay@users.noreply.github.com>
    Co-authored-by: Łukasz Langa <lukasz@langa.pl>
    erlend-aasland pushed a commit that referenced this issue May 9, 2023
    …open and Request (GH-103855) (#103891)
    
    (cherry picked from commit 44010d0)
    
    Co-authored-by: Michael Blahay <mblahay@users.noreply.github.com>
    Co-authored-by: Łukasz Langa <lukasz@langa.pl>
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    docs Documentation in the Doc dir easy needs backport to 3.11 only security fixes
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants