New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
urlparse.urlsplit mishandles novel schemes #52152
Comments
urlparse.urlsplit('s3://example/files/photos/161565.jpg')
returns
('s3', '', '//example/files/photos/161565.jpg', '', '')
instead of
('s3', 'example', '/files/photos/161565.jpg', '', '') according to rfc 3986 's3' is a valid scheme name, so the '://' indicates a URL with netloc and path parts. |
Thanks for the report, could you provide a patch with unit tests? |
Does s3 stand for the amazon s3 services? urlparse does not have it under its list of known schemes yet. Does s3 have any specifications as such or is aligned towards any of the known schemes (like http or ftp)? s3 is valid scheme name according to rfc 3986, but urlparse module does not seem to recognize it. If we say, s3 to be much similar to http, then it can be added to list of known schemes. Does Amazon say anything about it? |
it's not actually necessary to have a list of known schemes. any url that has a double slash after the colon is expected to follow that with an authority section (what urlparse calls "netloc"), optionally followed by a path, which starts with a slash. there are various defined schemes with their own syntax within the URL framework, but one is free to invent new ones with the general form |
i have attached an svn diff of my (very simple!) fix and added unit test for python 2.7. |
Hello Mark, However there are reasons why the check is: "if scheme in uses_netloc and url[:2] == '//':" Different protocols have different parsing requirements. (for e.g. some wish to consider (or act as if), after the scheme, the rest is their path) The better way is to add 's3' to uses_netloc list and it should be fine too. I shall add it and include your tests. Thanks. |
I think Mark is correct. RFC 3986 says: When authority is present, the path must either be empty or begin with a slash ("/") character. When authority is not present, the path cannot begin with two slash characters ("//"). I think it would make sense to have urlparse fall back to doing a generic RFC 3986 parse when it does not recognize the scheme. |
The case which prompted this issue was a purely private set of URLs, sent to me by a client but never sent to Amazon or anywhere else outside our systems (though I'm sure many others have invented this particular scheme for their own use). It would have been convenient if urlparse had handled it properly. That is true for any scheme one may invent at need. On second thought it does make sense to enforce the use of :// for the schemes in uses_netloc, but still not to ignore its meaning for other schemes. It also makes sense to add s3 to uses_netloc despite the fact that it is not (afaik) registered, since it is an obvious invention. I'll make another patch, but I don't have time to do it just now. |
Doing a fallback test for // would look like but this is equivalent to i.e., an authority appears if and only if there is a // after the scheme. This still allows a uses_netloc scheme to appear without //. I have attached a patch against Python 2.7, svn revision 78212. It adds s3 to netloc. |
Fixed in the r78234 and merged back to other branches. Thanks for the patch. |
The fix for this bug breaks any code which worked with non-standard |
I remember seeing a discussion on python-dev archives about that months or years ago. Someone pointed to Guido that the new RFC removed the need for uses_netloc thanks to the generic syntax. Isn’t there already a bug about that? |
Though msg104261 suggests this change be documented in NEWS.txt, it doesn't appear to have made it. Sure enough, we just found application code that this broke. |
On Fri, Dec 03, 2010 at 10:33:50PM +0000, Fred L. Drake, Jr. wrote:
Better late than never. I just added the NEWS in r87014 (py3k) |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: