Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

urlsplit can't round-trip relative-host urls. #59214

Closed
BuckGolemon mannequin opened this issue Jun 5, 2012 · 7 comments
Closed

urlsplit can't round-trip relative-host urls. #59214

BuckGolemon mannequin opened this issue Jun 5, 2012 · 7 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@BuckGolemon
Copy link
Mannequin

BuckGolemon mannequin commented Jun 5, 2012

BPO 15009
Nosy @orsenthil, @ezio-melotti, @bukzor, @vadmium
Superseder
  • bpo-22852: urllib.parse wrongly strips empty #fragment, ?query, //netloc
  • Files
  • parse.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/orsenthil'
    closed_at = <Date 2015-05-31.04:28:09.891>
    created_at = <Date 2012-06-05.22:28:27.232>
    labels = ['type-bug', 'library']
    title = "urlsplit can't round-trip relative-host urls."
    updated_at = <Date 2015-05-31.04:28:09.890>
    user = 'https://bugs.python.org/BuckGolemon'

    bugs.python.org fields:

    activity = <Date 2015-05-31.04:28:09.890>
    actor = 'martin.panter'
    assignee = 'orsenthil'
    closed = True
    closed_date = <Date 2015-05-31.04:28:09.891>
    closer = 'martin.panter'
    components = ['Library (Lib)']
    creation = <Date 2012-06-05.22:28:27.232>
    creator = 'Buck.Golemon'
    dependencies = []
    files = ['25862']
    hgrepos = []
    issue_num = 15009
    keywords = []
    message_count = 7.0
    messages = ['162378', '162507', '162509', '164320', '164322', '164712', '235585']
    nosy_count = 6.0
    nosy_names = ['orsenthil', 'ezio.melotti', 'bukzor', 'martin.panter', 'Buck.Golemon', 'ankitoshniwal']
    pr_nums = []
    priority = 'normal'
    resolution = 'duplicate'
    stage = 'resolved'
    status = 'closed'
    superseder = '22852'
    type = 'behavior'
    url = 'https://bugs.python.org/issue15009'
    versions = ['Python 2.7', 'Python 3.2', 'Python 3.3']

    @BuckGolemon
    Copy link
    Mannequin Author

    BuckGolemon mannequin commented Jun 5, 2012

    1. As long as x is valid, I expect that urlunsplit(urlsplit(x)) == x
    2. yelp:///foo is a well-formed (albeit odd) url. It it similar to file:///tmp: it specifies the /foo resource, on the "current" host, using the yelp protocol (defined on mobile devices).
    >>> from urlparse import urlsplit, urlunsplit
    >>> urlunsplit(urlsplit('yelp:///foo'))
    'yelp:/foo'

    Urlparse / unparse has the same bug:

    >>> urlunparse(urlparse('yelp:///foo'))
    'yelp:/foo'

    The file: protocol seems to be special-case, in an inappropriate manner:

    >>> urlunsplit(urlsplit('file:///tmp'))
    'file:///tmp'

    @BuckGolemon BuckGolemon mannequin added the stdlib Python modules in the Lib dir label Jun 5, 2012
    @ankitoshniwal
    Copy link
    Mannequin

    ankitoshniwal mannequin commented Jun 7, 2012

    Hello,

    Did some initial investigation, so looks like as per the code in parse.py, under the function urlunsplit, we take the 5-tuple returned by urlsplit . In the case of foo we get:
    SplitResult(scheme='yelp', netloc='', path='/foo', query='', fragment='')

    Now this tuple is passed to urlunsplit. We have a if statement under the urlunsplit function

    if netloc or (scheme and scheme in uses_netloc and url[:2] != '//'):

    which checks if the netloc exists in the url (in our case it does not) then we check if the scheme in the url is part of the uses_netloc list (predefined list in parse.py with the list of common types of schemes used like http, ftp, file, rsync etc). In our case since yelp is not part of it we fail at the if statement and then we just return the url instead of modifying it. What we need was that if the above statement fails we do an else which does something like this:

        if netloc or (scheme and scheme in uses_netloc and url[:2] != '//'):
            if url and url[:1] != '/':
              url = '/' + url
            url = '//' + (netloc or '') + url
        else:
            if url and url[:1] != '/':
              url = '/' + url
            url = '//' + (netloc or '') + url

    In that case we get the right url back.

    After changing the code here is what i get on local dev machines:
    >>> urlunparse(urlparse('yelp:///foo'))
    'yelp:///foo'
    >>> urlunsplit(urlsplit('file:///tmp'))
    'file:///tmp'
    >>> urlunsplit(urlsplit('yelp:///foo'))
    'yelp:///foo'

    Thanks,
    Ankit.

    P.S : I am new to python trying to learn it and also work on small projects let me know what you think if this is the right approach.

    @BuckGolemon
    Copy link
    Mannequin Author

    BuckGolemon mannequin commented Jun 7, 2012

    Well i think the real issue is that you can't enumerate the protocals that "use netloc". All protocols are allowed to have a netloc. the smb: protocol certainly does, but it's not in the list.

    The core issue is that smb:/foo and smb:///foo are different urls, and should be represented differently when split. The /// form has a netloc, it's just the empty-string. The single-slash form has no netloc, so I propose that urlsplit('smb:/foo') return SplitResult(scheme='smb', netloc=None, path='/foo', query='', fragment='')

    @orsenthil orsenthil self-assigned this Jun 8, 2012
    @ezio-melotti ezio-melotti added the type-bug An unexpected behavior, bug, or error label Jun 15, 2012
    @orsenthil
    Copy link
    Member

    Let me address this one thing at a time, the point on smb really confused me and I got into thinking that how smb (being more common), the issue was not raised. Looks smb url will always start with smb:// (// are the requirements for identified netloc, empty or otherwise) and cases for smb are fine - http://tools.ietf.org/html/draft-crhertel-smb-url-00

    That said, the dependency on uses_netloc has come many times and I am still looking for way to remove the dependency without affecting the previous parsing behaviors and ofcourse tests.

    @orsenthil
    Copy link
    Member

    Look at the following two bugs which dwelt on similar issues: bpo-8339 and bpo-7904 and in one message particular, msg102737, I seem to have come to a conclusion that " I don't see that 'x://' and 'x:///y' qualifies as valid URLS as per RFC 3986" and it applies to this bug too where the url is requested as yelp:///x

    Does yelp://localhost/x be a way to access in your case? That would be consistent with specification. Or in your code, you can add 'yelp' to uses_netloc list and then expect the desired behavior.

    from urlparse import uses_netloc
    uses_netloc.append('yelp')

    I understand that, using of the uses_netloc is a limitation, but given the requirements of both absolute and relative parsing that lists has served a useful behavior.

    I would like to close this one for the above mention points and open a feature request (or convert this to a feature request) which asks to remove the dependency of uses_netloc in urlparse. Does this resolution sound okay?

    @bukzor
    Copy link
    Mannequin

    bukzor mannequin commented Jul 6, 2012

    Let's examine x://

    absolute-URI = scheme ":" hier-part [ "?" query ]
    hier-part = "//" authority path-abempty

    So this is okay if authority and path-abempty can both be empty strings.

    authority     = [ userinfo "@" ] host [ ":" port ]
    host          = IP-literal / IPv4address / reg-name
    reg-name      = *( unreserved / pct-encoded / sub-delims )
    path-abempty  = *( "/" segment )

    Yep.

    And the same applies for x:///y, except that path-abempty matches /y
    instead of nothing.

    This means these are in fact valid urls per RFC3986, counter to your claim.

    @vadmium
    Copy link
    Member

    vadmium commented Feb 9, 2015

    Fixing bpo-22852 or bpo-5843 should help fixing this.

    @vadmium vadmium closed this as completed May 31, 2015
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants