bpo-37969: Correct urllib.parse functions dropping the delimiters of empty URI components #15642

Open: wants to merge 15 commits into base: master

@maggyero
Contributor

commented Sep 2, 2019

This PR will make the following changes:

  • update the urlsplit and urlunsplit functions of the urllib.parse module to keep the ? delimiter in a URI with an empty query component and keep the # delimiter in a URI with an empty fragment component (currently the delimiters are dropped):

      >>> from urllib.parse import urlsplit, urlunsplit
      >>> urlunsplit(urlsplit('http://example.com/?'))
      'http://example.com/?'  # currently: 'http://example.com/'
      >>> urlunsplit(urlsplit('http://example.com/#'))
      'http://example.com/#'  # currently: 'http://example.com/'
    

    This is required by RFC 3986:

    Normalization should not remove delimiters when their associated component is empty unless licensed to do so by the scheme specification. For example, the URI "http://example.com/?" cannot be assumed to be equivalent to any of the examples above. Likewise, the presence or absence of delimiters within a userinfo subcomponent is usually significant to its interpretation. The fragment component is not subject to any scheme-based normalization; thus, two URIs that differ only by the suffix "#" are considered different regardless of the scheme.

    To do so:

    • the urlsplit function now decodes an absent query component (no '?' delimiter) as None and an absent fragment component (no '#' delimiter) as None (e.g., urlsplit('http://example.com/') → ('http', 'example.com', '/', None, None)), and still decodes an empty '?' query component as '' and an empty '#' fragment component as '' (e.g., urlsplit('http://example.com/?#') → ('http', 'example.com', '/', '', ''));
    • the urlunsplit function now encodes a None query component as an absent query component and a None fragment component as an absent fragment component (e.g., urlunsplit(('http', 'example.com', '/', None, None)) → 'http://example.com/'), and now encodes a '' query component as an empty '?' query component and a '' fragment component as an empty '#' fragment component (e.g., urlunsplit(('http', 'example.com', '/', '', '')) → 'http://example.com/?#');
  • add and update the corresponding unit tests in the test.test_urlparse module;

  • update a unit test in the test.test_urllib2 module;

  • update the urllib.parse documentation accordingly.
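
For readers who want to try the proposed None-versus-'' semantics without building this branch, here is a minimal standalone sketch. It is not the code in this PR: split_uri and unsplit_uri are hypothetical helper names, and the splitting relies on the regular expression from RFC 3986, Appendix B rather than on urllib.parse. Scheme and authority are mapped to '' when absent, as an assumption to mirror the tuples shown above.

```python
import re

# Regular expression from RFC 3986, Appendix B.
# Groups: 2 = scheme, 4 = authority, 5 = path, 7 = query, 9 = fragment.
_URI_RE = re.compile(
    r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?'
)


def split_uri(uri):
    """Hypothetical helper: split a URI, keeping absent and empty apart.

    An absent query or fragment (no delimiter) is returned as None;
    an empty one (delimiter present, nothing after it) is returned as ''.
    """
    m = _URI_RE.match(uri)
    scheme = m.group(2) or ''   # assumption: '' when absent, like urlsplit
    netloc = m.group(4) or ''   # assumption: '' when absent, like urlsplit
    path = m.group(5)
    query = m.group(7)          # None when there is no '?'
    fragment = m.group(9)       # None when there is no '#'
    return scheme, netloc, path, query, fragment


def unsplit_uri(parts):
    """Hypothetical helper: recompose a URI from a 5-tuple.

    A delimiter is written whenever the component is not None,
    so an empty query or fragment survives the round trip.
    """
    scheme, netloc, path, query, fragment = parts
    uri = ''
    if scheme:
        uri += scheme + ':'
    if netloc:
        uri += '//' + netloc
    uri += path
    if query is not None:
        uri += '?' + query
    if fragment is not None:
        uri += '#' + fragment
    return uri


if __name__ == '__main__':
    for uri in ('http://example.com/', 'http://example.com/?',
                'http://example.com/#', 'http://example.com/?#'):
        parts = split_uri(uri)
        print(uri, parts, unsplit_uri(parts) == uri)
```

With this sketch, each of the four URIs in the loop round-trips unchanged, which is the property the quoted RFC 3986 passage asks for.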

https://bugs.python.org/issue37969

maggyero added 6 commits Sep 2, 2019

@maggyero maggyero changed the title bpo-37969: Update parse.py bpo-37969: Correct urllib.parse functions reporting false equivalent URIs Sep 2, 2019

maggyero added 7 commits Sep 2, 2019

@maggyero maggyero marked this pull request as ready for review Sep 2, 2019

@nicktimko

commented Sep 3, 2019

It's maybe a bit surprising to have some of the tuple fields sometimes be None (typing.Tuple[typing.Optional[str]] instead of typing.Tuple[str]), but I'm not sure of a more obvious solution.

The other alternative I thought about was to just explicitly dump in the delimiter if the component is empty (e.g. 'http://example.com/?#' → 'http', 'example.com', '/', '?', '#'), but that's probably more surprising, it makes rebuilding the URL more complex, and it's unclear what should happen with a URL like http://example.com/??.
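
To make the http://example.com/?? concern concrete, here is a small sketch comparing the two encodings. Both functions are hypothetical illustrations (neither is part of urllib.parse or of this PR), and the query is extracted with the regular expression from RFC 3986, Appendix B.

```python
import re

# Regular expression from RFC 3986, Appendix B; group 7 is the query.
_URI_RE = re.compile(
    r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?'
)


def query_as_none(uri):
    """Hypothetical: absent query -> None, empty query -> ''."""
    return _URI_RE.match(uri).group(7)


def query_as_delimiter(uri):
    """Hypothetical: absent query -> '', empty query -> '?'."""
    query = query_as_none(uri)
    if query is None:
        return ''
    return query if query else '?'


if __name__ == '__main__':
    for uri in ('http://example.com/', 'http://example.com/?',
                'http://example.com/??'):
        print(uri, repr(query_as_none(uri)), repr(query_as_delimiter(uri)))
```

Under the delimiter encoding, 'http://example.com/?' and 'http://example.com/??' both yield '?', so the two URLs can no longer be told apart after splitting. The None encoding keeps them distinct ('' versus '?').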

I think you need to also describe the breaking change very clearly (haven't done it before, but I think that's what bedevere/news is for, i.e. these things), and leave hints in the actual documentation about the change ("changed in 3.9")

Housekeeping: I'd squash all the commits.

@maggyero

Contributor Author

commented Sep 3, 2019

Thank you for reviewing this, @nicktimko! Yes, the None solution for an absent query/fragment seemed the most straightforward and natural to me.

I have updated the PR description to detail the exact changes. Nice suggestion: I will add the news entry and the documentation version note, and squash the commits. But first I would like to fix an issue: the documentation tests in Travis CI failed for an obscure reason (see below). Do you have any idea why?

@nicktimko

commented Sep 3, 2019

I don't know, but the docs build looks like it's installing blurb, which might be related to the news, so maybe adding a news item would fix it? Does it run locally? Just guessing though.

@ned-deily ned-deily requested a review from orsenthil Sep 4, 2019

@maggyero

Contributor Author

commented Sep 11, 2019

Thanks @nicktimko, I have added a news entry, but documentation tests still fail in Travis-CI.

@maggyero maggyero changed the title bpo-37969: Correct urllib.parse functions reporting false equivalent URIs bpo-37969: Correct urllib.parse functions dropping the delimiters of empty URI components Sep 11, 2019

@orsenthil

Member

commented Sep 11, 2019

I don't think the documentation failure is related to the code in this PR. Perhaps this PR needs to be rebased?

@orsenthil
Member

left a comment

This is going to be a breaking change and will affect plenty of downstream libraries and frameworks that have been relying upon the previous behavior.

  • I don't have any code comments, and the code changes look good to me.
  • I find the rationale OK.
  • I will request reviews from more active core developers and want to hear their opinion on this change too.
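
As an illustration of the kind of downstream breakage meant here, consider a caller that assumes the query field is always a str. This is a hypothetical sketch: SplitResult below is a stand-in namedtuple, and query_keys and query_keys_safe are made-up helper names, not code taken from any real library.

```python
from collections import namedtuple

# Stand-in for the 5-tuple returned by urlsplit (hypothetical, for illustration).
SplitResult = namedtuple('SplitResult', 'scheme netloc path query fragment')


def query_keys(parts):
    """Hypothetical pre-change caller: assumes parts.query is always a str."""
    return [pair.split('=')[0] for pair in parts.query.split('&') if pair]


def query_keys_safe(parts):
    """Same logic, but tolerant of a None (absent) query."""
    return query_keys(parts._replace(query=parts.query or ''))


if __name__ == '__main__':
    present = SplitResult('http', 'example.com', '/', 'a=1&b=2', None)
    absent = SplitResult('http', 'example.com', '/', None, None)
    print(query_keys(present))       # works under either behavior
    print(query_keys_safe(absent))   # works with a None query
    try:
        query_keys(absent)           # None has no .split()
    except AttributeError as exc:
        print('breaks:', exc)
```

Callers written like query_keys would need a guard such as the one in query_keys_safe once absent components are reported as None.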

@orsenthil orsenthil requested review from vstinner and serhiy-storchaka Sep 11, 2019
