Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

urlparse does not correctly handle signs, underscores, and whitespace in port numbers #96035

Closed
kenballus opened this issue Aug 16, 2022 · 8 comments
Assignees
Labels
3.10 only security fixes 3.11 only security fixes 3.12 bugs and security fixes stdlib Python modules in the Lib dir triaged The issue has been accepted as valid by a triager. type-bug An unexpected behavior, bug, or error

Comments

@kenballus
Copy link
Contributor

Background

RFC 3986 (spec for URIs) defines a valid port string with the following grammar rule:

  • port = *DIGIT

Here's the WHATWG URL spec definition:
"""
A URL-port string must be one of the following:

  • the empty string
  • one or more ASCII digits representing a decimal number no greater than $2^{16} − 1$.

"""1

The bug

This is the port string parsing code from Lib/urllib/parse.py:166-176:

def port(self):
    port = self._hostinfo[1]
    if port is not None:
        try:
            port = int(port, 10)
        except ValueError:
            message = f'Port could not be cast to integer value as {port!r}'
            raise ValueError(message) from None
        if not ( 0 <= port <= 65535):
            raise ValueError("Port out of range 0-65535")
    return port

This will erroneously validate strings "-0" and f"+{x}" for any value of x in the valid range. Given that + and - are not digits, this behavior is in violation of both specifications.

This bug is easily reproducible with the following snippet:

from urllib.parse import urlparse
url1 = urlparse("http://python.org:-0")
url2 = urlparse("http://python.org:+80")
print(url1.port) # prints 0, but error is expected
print(url2.port) # prints 80, but error is expected

Happy to submit a PR, but don't want to step on any toes over at #25774.

My environment

  • CPython version tested on:
    • 3.10.6
  • Operating system and architecture:
    • Arch Linux x86_64

Footnotes

  1. Given that this is urlparse and not uriparse, it seems appropriate that we do not accept port numbers outside range(2**16), even though such numbers are allowed by RFC 3986.

@kenballus kenballus added the type-bug An unexpected behavior, bug, or error label Aug 16, 2022
@JelleZijlstra
Copy link
Member

Similarly we accept urlparse("http://python.org:10_000").port.

Do you think this has any practical effect, for example on security? Most likely we'll have to go through a deprecation cycle to remove the noncompliant behavior.

@kenballus
Copy link
Contributor Author

We also accept urlparse("http://python.org:\r\n80\r\n").port, because int trims whitespace.

I think it's plausible that there are security concerns stemming from desyncing a front-end server using urlparse from a backend server using something else. I can imagine, for instance, another implementation of a permissive URL parser interpreting "http://python.org:80_80" as having port 80, leading to disagreement between the servers as to port in question. I could also see the use of whitespace in the port string (especially "\r\n") as a potential source of confusion for HTTP servers, because even though urlparse has no problem with "http://python.org:80\r\n", a CRLF ends the HTTP 1.1 start line and would lead to an invalid HTTP message. I'm missing the creativity right now to string all of this together into something tangible, but I feel like it could be done.

@JelleZijlstra JelleZijlstra changed the title urlparse does not correctly handle signs in port numbers urlparse does not correctly handle signs, underscores, and whitespace in port numbers Aug 16, 2022
@lightswitch05
Copy link

I just ran into this issue with whitespace. It didn't cause a security issue, but rather a bunch of unexpected errors:

>>> urlparse('  https://example.com  ')
ParseResult(scheme='', netloc='', path='  https://example.com  ', params='', query='', fragment='')

@kenballus
Copy link
Contributor Author

I just ran into this issue with whitespace. It didn't cause a security issue, but rather a bunch of unexpected errors:

>>> urlparse('  https://example.com  ')
ParseResult(scheme='', netloc='', path='  https://example.com  ', params='', query='', fragment='')

This is definitely weird behavior, but maybe not a bug. RFC3986 says a URI scheme must begin with an ALPHA, so I think we're right in determining that there is no scheme (even though the RFC says there must be a scheme, I think the WHATWG spec says you can omit it in some situations. It's unclear to me.) The WHATWG spec also says that any ASCII string is a valid URL path, which is also in disagreement with the RFC. If we're going with the WHATWG interpretation, then ' https://example.com ' is a perfectly valid URL path and nothing has been parsed incorrectly. If we're going with the RFC, then urlparse's behavior is not good.

Another similar bug:

>>> urlparse("https:// example.com ")
ParseResult(scheme='https', netloc=' example.com ', path='', params='', query='', fragment='')

This time, putting space after the :// leads to spaces in the hostname. That's also wrong according to RFC, and okay according to WHATWG.

@lightswitch05
Copy link

Hmmm... interesting, what about the case without a scheme with port?

>>> urlparse('www.example.com:80')
ParseResult(scheme='www.example.com', netloc='', path='80', params='', query='', fragment='')

@kenballus
Copy link
Contributor Author

Hmmm... interesting, what about the case without a scheme with port?

>>> urlparse('www.example.com:80')
ParseResult(scheme='www.example.com', netloc='', path='80', params='', query='', fragment='')

This is just objectively wrong, and should probably be its own bug report. Good find.

@gpshead
Copy link
Member

gpshead commented Sep 13, 2022

Most likely we'll have to go through a deprecation cycle to remove the noncompliant behavior.

I don't think so. This is clearly a bug, not a feature someone should depend on.

@gpshead gpshead added 3.11 only security fixes 3.10 only security fixes 3.12 bugs and security fixes stdlib Python modules in the Lib dir triaged The issue has been accepted as valid by a triager. labels Sep 13, 2022
kenballus added a commit to kenballus/cpython that referenced this issue Oct 15, 2022
kenballus added a commit to kenballus/cpython that referenced this issue Oct 15, 2022
kenballus added a commit to kenballus/cpython that referenced this issue Oct 15, 2022
kenballus added a commit to kenballus/cpython that referenced this issue Oct 15, 2022
JelleZijlstra added a commit that referenced this issue Oct 20, 2022
Co-authored-by: Jelle Zijlstra <jelle.zijlstra@gmail.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Oct 20, 2022
…ythonGH-98273)

Co-authored-by: Jelle Zijlstra <jelle.zijlstra@gmail.com>
(cherry picked from commit 6f15ca8)

Co-authored-by: Ben Kallus <49924171+kenballus@users.noreply.github.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Oct 20, 2022
…ythonGH-98273)

Co-authored-by: Jelle Zijlstra <jelle.zijlstra@gmail.com>
(cherry picked from commit 6f15ca8)

Co-authored-by: Ben Kallus <49924171+kenballus@users.noreply.github.com>
@JelleZijlstra JelleZijlstra self-assigned this Oct 20, 2022
miss-islington added a commit that referenced this issue Oct 20, 2022
Co-authored-by: Jelle Zijlstra <jelle.zijlstra@gmail.com>
(cherry picked from commit 6f15ca8)

Co-authored-by: Ben Kallus <49924171+kenballus@users.noreply.github.com>
miss-islington added a commit that referenced this issue Oct 20, 2022
Co-authored-by: Jelle Zijlstra <jelle.zijlstra@gmail.com>
(cherry picked from commit 6f15ca8)

Co-authored-by: Ben Kallus <49924171+kenballus@users.noreply.github.com>
@JelleZijlstra
Copy link
Member

All the backports are merged, thanks for reporting and fixing this issue!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.10 only security fixes 3.11 only security fixes 3.12 bugs and security fixes stdlib Python modules in the Lib dir triaged The issue has been accepted as valid by a triager. type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

4 participants