
Parsing functions in urllib.parse remove the delimiters of empty URI components in violation of RFC 3986 #99962

geryogam commented Dec 3, 2022

Issue

Here is the generic syntax of a URI, consisting of five components:

URI = scheme ":" ["//" authority] path ["?" query] ["#" fragment]
  • A component is undefined if it has an associated delimiter and the delimiter does not appear in the URI; the scheme and path components are always defined (cf. RFC 3986, § 5.2.1).
  • A component is empty if it has no characters; the scheme component is always non-empty (cf. RFC 3986, § 3).

So we need two distinct representations for undefined components and empty components.

Currently the functions urlparse(), urlsplit(), and urldefrag() of the urllib.parse submodule use the empty string '' for representing undefined components and empty components, which results in lossy parsing for URIs with empty components (their associated delimiters are removed):

>>> from urllib.parse import urldefrag, urlparse, urlsplit
>>> 
>>> urldefrag('scheme://authority/path?query#fragment').geturl()
'scheme://authority/path?query#fragment'
>>> urldefrag('scheme:///path?query#fragment').geturl()  # empty authority
'scheme:/path?query#fragment'
>>> urldefrag('scheme://authority/path?#fragment').geturl()  # empty query
'scheme://authority/path#fragment'
>>> urldefrag('scheme://authority/path?query#').geturl()  # empty fragment
'scheme://authority/path?query'
>>> 
>>> urlparse('scheme://authority/path?query#fragment').geturl()
'scheme://authority/path?query#fragment'
>>> urlparse('scheme:///path?query#fragment').geturl()  # empty authority
'scheme:/path?query#fragment'
>>> urlparse('scheme://authority/path?#fragment').geturl()  # empty query
'scheme://authority/path#fragment'
>>> urlparse('scheme://authority/path?query#').geturl()  # empty fragment
'scheme://authority/path?query'
>>> 
>>> urlsplit('scheme://authority/path?query#fragment').geturl()
'scheme://authority/path?query#fragment'
>>> urlsplit('scheme:///path?query#fragment').geturl()  # empty authority
'scheme:/path?query#fragment'
>>> urlsplit('scheme://authority/path?#fragment').geturl()  # empty query
'scheme://authority/path#fragment'
>>> urlsplit('scheme://authority/path?query#').geturl()  # empty fragment
'scheme://authority/path?query'

urlparse() also uses the empty string '' for representing undefined subcomponents and empty subcomponents when the scheme component is http, which also results in lossy parsing for URIs with empty parameter subcomponents (their associated delimiters are removed):

>>> from urllib.parse import urlparse
>>> 
>>> urlparse('http://authority/path;parameter?query#fragment').geturl()
'http://authority/path;parameter?query#fragment'
>>> urlparse('http://authority/path;?query#fragment').geturl()  # empty parameter
'http://authority/path?query#fragment'

In each of those pairs, the URI before parsing and the URI recomposed by geturl() are not equivalent, so the current parsing violates the URI specification by removing the delimiters of empty components (cf. RFC 3986, § 6.2.3):

Normalization should not remove delimiters when their associated component is empty unless licensed to do so by the scheme specification. For example, the URI http://example.com/? cannot be assumed to be equivalent to any of the examples above. Likewise, the presence or absence of delimiters within a userinfo subcomponent is usually significant to its interpretation. The fragment component is not subject to any scheme-based normalization; thus, two URIs that differ only by the suffix # are considered different regardless of the scheme.
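
The exact URI from the quoted passage already demonstrates the problem: with current CPython, the round trip through urlsplit() conflates it with the delimiter-less form (a quick check in the same style as above):

>>> from urllib.parse import urlsplit
>>> 
>>> urlsplit('http://example.com/?').geturl()  # empty query, '?' is dropped
'http://example.com/'
>>> urlsplit('http://example.com/').geturl()  # undefined query, same result
'http://example.com/'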

Solution

In PR #15642, I proposed to use None for representing undefined components and keep using '' for representing empty components, so that urlparse(), urlsplit(), and urldefrag() are able to distinguish between the two cases.

The benefit of the None representation for undefined components is that it does not break the False in Boolean context assumption. The drawback is that it breaks pip, and potentially many more Python programs which rely on the current '' representation for undefined components. So the None representation is not backward compatible.
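
For illustration, here is how the two cases would be told apart under that proposal (the None results are hypothetical; current CPython returns '' in both cases):

from urllib.parse import urlsplit

# Current behaviour: both fragments come back as '', so the distinction is lost.
assert urlsplit('scheme://authority/path?query').fragment == ''
assert urlsplit('scheme://authority/path?query#').fragment == ''

# Under the PR #15642 proposal (hypothetical, not current CPython):
#   urlsplit('scheme://authority/path?query').fragment   would be None  (undefined)
#   urlsplit('scheme://authority/path?query#').fragment  would be ''    (empty)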

In PR #15642 (comment), @nicktimko suggested an alternative representation for undefined components:

It's maybe a bit surprising to have some of the tuple fields sometimes be None (typing.Tuple[typing.Optional[str]] instead of typing.Tuple[str]), but I'm not sure of a more obvious solution.

The other alternative I thought about was to just explicitly dump in the delimiter if it's empty (e.g. 'http://example.com/?#' → ('http', 'example.com', '/', '?', '#')), but that's probably more surprising, rebuilding the URL is more complex, and what then if there's a URL like http://example.com/??.

I think you need to also describe the breaking change very clearly (haven't done it before, but I think that's what bedevere/news is for, i.e. these things), and leave hints in the actual documentation about the change ("changed in 3.9")

Housekeeping: I'd squash all the commits.

As he noted, '?' would not be a good representation for undefined components since this character is already allowed in the query and fragment components (cf. RFC 3986, § A). However, '#' would work, because it is not allowed in any URI component. It is actually the only character in the URI character set with this property. Characters outside the URI character set could also do the job (e.g. ", <, >).

The benefit of a str representation for undefined components compared to None is that it does not break the str type assumption in other programs. The drawback is that it breaks the False in Boolean context assumption and the '' value assumption, so it might still be backward incompatible.
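
A quick check of those assumptions against a plain '#' sentinel (illustration only; the names below are not part of any proposal):

sentinel = '#'  # candidate value for an undefined component
assert isinstance(sentinel, str)  # str type assumption still holds
assert bool(sentinel) is True     # breaks the False in Boolean context assumption
assert sentinel != ''             # breaks the '' value assumption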

In PR #15642 (comment), @nicktimko suggested another representation for undefined components:

Can always engage in some fence-sitting and support both compatible and compliance options. Pass/set a flag, use a different function/method. How about a new urllib2 😉?

Re: returning None and how it would break things, for compatibility, some ✨Magic™️ objects (or even just a uniquely referenced sentinel string?) could be used that otherwise work like an empty string but serve as a sentinel object for checking.

For representing undefined components, we could use a member of an enum.Enum like this:

>>> from enum import Enum
>>> class Component(str, Enum):
...     UNDEFINED = ''
... 
>>> legacy = ''
>>> assert Component.UNDEFINED is not legacy  # distinct representation assumption
>>> assert bool(Component.UNDEFINED) is bool(legacy)  # False in Boolean context assumption
>>> assert isinstance(Component.UNDEFINED, type(legacy))  # str type assumption
>>> assert Component.UNDEFINED == legacy  # '' value assumption
>>> 

The benefit of that enum.Enum representation for undefined components compared to the previous str representation is that it does not break the False in Boolean context assumption nor the '' value assumption. Do you see any drawback?
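
As a usage illustration (a hypothetical helper, not the urllib.parse API), such a sentinel would let a recomposition function decide whether to re-emit a delimiter:

from enum import Enum

class Component(str, Enum):
    UNDEFINED = ''

# Hypothetical recomposition helper: emits '?' or '#' only when the
# corresponding component is defined, even if it is empty.
def unsplit(scheme, authority, path, query, fragment):
    url = scheme + '://' + authority + path
    if query is not Component.UNDEFINED:
        url = url + '?' + query
    if fragment is not Component.UNDEFINED:
        url = url + '#' + fragment
    return url

assert unsplit('scheme', 'authority', '/path', 'query', Component.UNDEFINED) == 'scheme://authority/path?query'
assert unsplit('scheme', 'authority', '/path', 'query', '') == 'scheme://authority/path?query#'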
