You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
A component is undefined if it has an associated delimiter and the delimiter does not appear in the URI; the scheme and path components are always defined (cf. RFC 3986, § 5.2.1).
A component is empty if it has no characters; the scheme component is always non-empty (cf. RFC 3986, § 3).
So we need two distinct representations for undefined components and empty components.
Currently the functions urlparse(), urlsplit(), and urldefrag() of the urllib.parse submodule use the empty string '' for representing undefined components and empty components, which results in lossy parsing for URIs with empty components (their associated delimiters are removed):
urlparse() also uses the empty string '' for representing undefined subcomponents and empty subcomponents when the scheme component is http, which also results in lossy parsing for URIs with empty parameter subcomponents (their associated delimiters are removed):
All those URI pairs (before and after parsing) with empty components are not equivalent so the current parsing violates the URI specification by removing the associated delimiters (cf. RFC 3986, § 6.2.3):
Normalization should not remove delimiters when their associated component is empty unless licensed to do so by the scheme specification. For example, the URI http://example.com/? cannot be assumed to be equivalent to any of the examples above. Likewise, the presence or absence of delimiters within a userinfo subcomponent is usually significant to its interpretation. The fragment component is not subject to any scheme-based normalization; thus, two URIs that differ only by the suffix # are considered different regardless of the scheme.
Solution
In PR #15642, I proposed to use None for representing undefined components and keep using '' for representing empty components, so that urlparse(), urlsplit(), and urldefrag() are able to distinguish between the two cases.
The benefit of the None representation for undefined components is that it does not break the False in Boolean context assumption. The drawback is that it breaks pip, and potentially many more Python programs which rely on the current '' representation for undefined components. So the None representation is not backward compatible.
It's maybe a bit surprising to have some of the tuple fields sometimes be None (typing.Tuple[typing.Optional[str]] instead of typing.Tuple[str]), but I'm not sure of a more obvious solution.
The other alternative I thought about was to just explicitly dump in the delimiter if it's empty (e.g. 'http://example.com/?#' → 'http', 'example.com', '/', '?', '#'), but that's probably more surprising, rebuilding the URL is more complex, and what then if there's a URL like http://example.com/??.
I think you need to also describe the breaking change very clearly (haven't done it before, but I think that's what bedevere/news is for, i.e. these things), and leave hints in the actual documentation about the change ("changed in 3.9")
Housekeeping: I'd squash all the commits.
As he noted, '?' would not be a good representation for undefined components since this character is already allowed in the query and fragment components (cf. RFC 3986, § A). However '#' would because it is not allowed in any URI components. It is actually the only character in the URI character set that has this property. Characters outside the URI character set could also do the job (e.g. ", <, >).
The benefit of an str representation for undefined components compared to None is that is does not break the str type assumption in other programs. The drawback is that it breaks the False in Boolean context assumption and the '' value assumption, so it might still be backward incompatible.
Can always engage in some fence-sitting and support both compatible and compliance options. Pass/set a flag, use a different function/method. How about a new urllib2 😉?
Re: returning None and how it would break things, for compatiblity, some ✨Magic™️ objects (or even just a uniquely referenced sentinel string?) could be used and otherwise work like an empty string but is some sentinel object for checking.
For representing undefined components, we could use a member of an enum.Enum like this:
>>> from enum import Enum
>>> class Component(str, Enum):
... UNDEFINED = ''
...
>>> legacy = ''
>>> assert Component.UNDEFINED is not legacy # distinct representation assumption
>>> assert bool(Component.UNDEFINED) is bool(legacy) # False in Boolean context assumption
>>> assert isinstance(Component.UNDEFINED, type(legacy)) # str type assumption
>>> assert Component.UNDEFINED == legacy # '' value assumption
>>>
The benefit of that enum.Enum representation for undefined components compared to the previous str representation is that it does not break the False in Boolean context assumption nor the '' value assumption. Do you see any drawback?
The text was updated successfully, but these errors were encountered:
Issue
Here is the generic syntax of a URI, consisting of five components:
So we need two distinct representations for undefined components and empty components.
Currently the functions
urlparse()
,urlsplit()
, andurldefrag()
of theurllib.parse
submodule use the empty string''
for representing undefined components and empty components, which results in lossy parsing for URIs with empty components (their associated delimiters are removed):urlparse()
also uses the empty string''
for representing undefined subcomponents and empty subcomponents when the scheme component ishttp
, which also results in lossy parsing for URIs with empty parameter subcomponents (their associated delimiters are removed):All those URI pairs (before and after parsing) with empty components are not equivalent so the current parsing violates the URI specification by removing the associated delimiters (cf. RFC 3986, § 6.2.3):
Solution
In PR #15642, I proposed to use
None
for representing undefined components and keep using''
for representing empty components, so thaturlparse()
,urlsplit()
, andurldefrag()
are able to distinguish between the two cases.The benefit of the
None
representation for undefined components is that it does not break theFalse
in Boolean context assumption. The drawback is that it breakspip
, and potentially many more Python programs which rely on the current''
representation for undefined components. So theNone
representation is not backward compatible.In PR #15642 (comment), @nicktimko suggested an alternative representation for undefined components:
As he noted,
'?'
would not be a good representation for undefined components since this character is already allowed in the query and fragment components (cf. RFC 3986, § A). However'#'
would because it is not allowed in any URI components. It is actually the only character in the URI character set that has this property. Characters outside the URI character set could also do the job (e.g."
,<
,>
).The benefit of an
str
representation for undefined components compared toNone
is that is does not break thestr
type assumption in other programs. The drawback is that it breaks theFalse
in Boolean context assumption and the''
value assumption, so it might still be backward incompatible.In PR #15642 (comment), @nicktimko suggested another representation for undefined components:
For representing undefined components, we could use a member of an
enum.Enum
like this:The benefit of that
enum.Enum
representation for undefined components compared to the previousstr
representation is that it does not break theFalse
in Boolean context assumption nor the''
value assumption. Do you see any drawback?The text was updated successfully, but these errors were encountered: