
Parsing functions in urllib.parse remove the delimiters of empty URI components in violation of RFC 3986 #99962

geryogam commented Dec 3, 2022

Issue

Here is the generic syntax of a URI, consisting of five components:

URI = scheme ":" ["//" authority] path ["?" query] ["#" fragment]
  • A component is undefined if it has an associated delimiter and the delimiter does not appear in the URI; the scheme and path components are always defined (cf. RFC 3986, § 5.2.1).
  • A component is empty if it has no characters; the scheme component is always non-empty (cf. RFC 3986, § 3).

So we need two distinct representations for undefined components and empty components.

Currently the functions urlparse(), urlsplit(), and urldefrag() of the urllib.parse submodule use the empty string '' for representing undefined components and empty components, which results in lossy parsing for URIs with empty components (their associated delimiters are removed):

>>> from urllib.parse import urldefrag, urlparse, urlsplit
>>> 
>>> urldefrag('scheme://authority/path?query#fragment').geturl()
'scheme://authority/path?query#fragment'
>>> urldefrag('scheme:///path?query#fragment').geturl()  # empty authority
'scheme:/path?query#fragment'
>>> urldefrag('scheme://authority/path?#fragment').geturl()  # empty query
'scheme://authority/path#fragment'
>>> urldefrag('scheme://authority/path?query#').geturl()  # empty fragment
'scheme://authority/path?query'
>>> 
>>> urlparse('scheme://authority/path?query#fragment').geturl()
'scheme://authority/path?query#fragment'
>>> urlparse('scheme:///path?query#fragment').geturl()  # empty authority
'scheme:/path?query#fragment'
>>> urlparse('scheme://authority/path?#fragment').geturl()  # empty query
'scheme://authority/path#fragment'
>>> urlparse('scheme://authority/path?query#').geturl()  # empty fragment
'scheme://authority/path?query'
>>> 
>>> urlsplit('scheme://authority/path?query#fragment').geturl()
'scheme://authority/path?query#fragment'
>>> urlsplit('scheme:///path?query#fragment').geturl()  # empty authority
'scheme:/path?query#fragment'
>>> urlsplit('scheme://authority/path?#fragment').geturl()  # empty query
'scheme://authority/path#fragment'
>>> urlsplit('scheme://authority/path?query#').geturl()  # empty fragment
'scheme://authority/path?query'

urlparse() also uses the empty string '' for representing undefined subcomponents and empty subcomponents when the scheme component is http, which also results in lossy parsing for URIs with empty parameter subcomponents (their associated delimiters are removed):

>>> from urllib.parse import urlparse
>>> 
>>> urlparse('http://authority/path;parameter?query#fragment').geturl()
'http://authority/path;parameter?query#fragment'
>>> urlparse('http://authority/path;?query#fragment').geturl()  # empty parameter
'http://authority/path?query#fragment'

In each of those pairs, the URI before parsing and the URI recomposed by geturl() are not equivalent, so the current parsing violates the URI specification by removing the delimiters of empty components (cf. RFC 3986, § 6.2.3):

Normalization should not remove delimiters when their associated component is empty unless licensed to do so by the scheme specification. For example, the URI http://example.com/? cannot be assumed to be equivalent to any of the examples above. Likewise, the presence or absence of delimiters within a userinfo subcomponent is usually significant to its interpretation. The fragment component is not subject to any scheme-based normalization; thus, two URIs that differ only by the suffix # are considered different regardless of the scheme.
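
The exact URI from the quoted passage already demonstrates the problem: with current CPython, the round trip through urlsplit() conflates it with the delimiter-less form (a quick check in the same style as above):

>>> from urllib.parse import urlsplit
>>> 
>>> urlsplit('http://example.com/?').geturl()  # empty query, '?' is dropped
'http://example.com/'
>>> urlsplit('http://example.com/').geturl()  # undefined query, same result
'http://example.com/'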

Solution

In PR #15642, I proposed to use None for representing undefined components and keep using '' for representing empty components, so that urlparse(), urlsplit(), and urldefrag() are able to distinguish between the two cases.

The benefit of the None representation for undefined components is that it does not break the False in Boolean context assumption. The drawback is that it breaks pip, and potentially many more Python programs which rely on the current '' representation for undefined components. So the None representation is not backward compatible.
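
For illustration, here is how the two cases would be told apart under that proposal (the None results are hypothetical; current CPython returns '' in both cases):

from urllib.parse import urlsplit

# Current behaviour: both fragments come back as '', so the distinction is lost.
assert urlsplit('scheme://authority/path?query').fragment == ''
assert urlsplit('scheme://authority/path?query#').fragment == ''

# Under the PR #15642 proposal (hypothetical, not current CPython):
#   urlsplit('scheme://authority/path?query').fragment   would be None  (undefined)
#   urlsplit('scheme://authority/path?query#').fragment  would be ''    (empty)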

In PR #15642 (comment), @nicktimko suggested an alternative representation for undefined components:

It's maybe a bit surprising to have some of the tuple fields sometimes be None (typing.Tuple[typing.Optional[str]] instead of typing.Tuple[str]), but I'm not sure of a more obvious solution.

The other alternative I thought about was to just explicitly dump in the delimiter if it's empty (e.g. 'http://example.com/?#' → ('http', 'example.com', '/', '?', '#')), but that's probably more surprising, rebuilding the URL is more complex, and what then if there's a URL like http://example.com/??.

I think you need to also describe the breaking change very clearly (haven't done it before, but I think that's what bedevere/news is for, i.e. these things), and leave hints in the actual documentation about the change ("changed in 3.9")

Housekeeping: I'd squash all the commits.

As he noted, '?' would not be a good representation for undefined components since this character is already allowed in the query and fragment components (cf. RFC 3986, § A). However, '#' would work, because it is not allowed in any URI component. It is actually the only character in the URI character set with this property. Characters outside the URI character set could also do the job (e.g. ", <, >).

The benefit of a str representation for undefined components compared to None is that it does not break the str type assumption in other programs. The drawback is that it breaks the False in Boolean context assumption and the '' value assumption, so it might still be backward incompatible.
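
A quick check of those assumptions against a plain '#' sentinel (illustration only; the names below are not part of any proposal):

sentinel = '#'  # candidate value for an undefined component
assert isinstance(sentinel, str)  # str type assumption still holds
assert bool(sentinel) is True     # breaks the False in Boolean context assumption
assert sentinel != ''             # breaks the '' value assumption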

In PR #15642 (comment), @nicktimko suggested another representation for undefined components:

Can always engage in some fence-sitting and support both compatible and compliance options. Pass/set a flag, use a different function/method. How about a new urllib2 😉?

Re: returning None and how it would break things, for compatibility, some ✨Magic™️ objects (or even just a uniquely referenced sentinel string?) could be used that otherwise work like an empty string but serve as a sentinel object for checking.

For representing undefined components, we could use a member of an enum.Enum like this:

>>> from enum import Enum
>>> class Component(str, Enum):
...     UNDEFINED = ''
... 
>>> legacy = ''
>>> assert Component.UNDEFINED is not legacy  # distinct representation assumption
>>> assert bool(Component.UNDEFINED) is bool(legacy)  # False in Boolean context assumption
>>> assert isinstance(Component.UNDEFINED, type(legacy))  # str type assumption
>>> assert Component.UNDEFINED == legacy  # '' value assumption
>>> 

The benefit of that enum.Enum representation for undefined components compared to the previous str representation is that it does not break the False in Boolean context assumption nor the '' value assumption. Do you see any drawback?
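
As a usage illustration (a hypothetical helper, not the urllib.parse API), such a sentinel would let a recomposition function decide whether to re-emit a delimiter:

from enum import Enum

class Component(str, Enum):
    UNDEFINED = ''

# Hypothetical recomposition helper: emits '?' or '#' only when the
# corresponding component is defined, even if it is empty.
def unsplit(scheme, authority, path, query, fragment):
    url = scheme + '://' + authority + path
    if query is not Component.UNDEFINED:
        url = url + '?' + query
    if fragment is not Component.UNDEFINED:
        url = url + '#' + fragment
    return url

assert unsplit('scheme', 'authority', '/path', 'query', Component.UNDEFINED) == 'scheme://authority/path?query'
assert unsplit('scheme', 'authority', '/path', 'query', '') == 'scheme://authority/path?query#'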
