-
-
Notifications
You must be signed in to change notification settings - Fork 31.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
urljoin fails with messy relative URLs #66316
Comments
Not sure if this is desired behavior, but it's making my code break, so I figured I'd get it filed. I'm trying to crawl this website: https://www.appeals2.az.gov/ODSPlus/recentDecisions2.cfm Unfortunately, most of the URLs in the HTML are relative, taking the form: '../../some/path/to/some/pdf.pdf' I'm using lxml's make_links_absolute() function, which calls urljoin creating invalid urls like: https://www.appeals2.az.gov/../Decisions/CR20130096OPN.pdf If you put that into Firefox or wget or whatever, it works, despite being invalid and making no sense. **It works because those clients fix the problem,** joining the invalid path and the URL into: https://www.appeals2.az.gov/Decisions/CR20130096OPN.pdf I know this will mean making urljoin have a workaround to fix bad HTML, but this seems to be what wget, Chrome, Firefox, etc. all do. I've never filed a Python bugs before, but is this something we could consider? |
FWIW, the workaround that I've just created for this problem is this: u = 'https://www.appeals2.az.gov/../Decisions/CR20130096OPN.pdf'
# Split the url and rejoin it, nuking any '/..' patterns at the
# beginning of the path.
s = urlsplit(u)
urlunsplit(s[:2] + (re.sub('^(/\.\.)+', '', s.path),) + s[3:]) |
The thing is, urljoin() isn't HTTP-specific, and such URLs *may* make sense for other protocols. |
@pitrou, I haven't delved into URLs in a long while, but the general idea is: scheme://domain:port/path?query_string#fragment_id When would it ever make sense to have something up a level from the root of the domain? |
Actually, according to RFC 3986, it seems you are right and we should remove extraneous leading "/../" and "/./" components in the path. |
I've only had a couple minutes to look into this so far, but the bug does indeed seem to be valid across all versions. In fact, the line "# XXX The stuff below is bogus in various ways..." in urljoin tipped me off to something potentially being askew ;) Unless the OP wants to provide a patch, I was going to take a look at possibly updating urljoin (at least the part below that line) to follow http://tools.ietf.org/html/rfc3986#section-5.2 a little more closely. |
@demian.brecht, that'd make me very pleased if you took this over. I won't have time to devote to it, unfortunately. |
Here's an initial patch with a fix that passes all the test cases other than the ones that are incorrect based on examples and pseudocode in RFC 3986. I haven't checked obsoleted RFCs yet to ensure consistency, but will do so when I get a chance (likely tomorrow morning). Posting this now to see if anyone's opposed to the change, or at least the direction of. Note: It /does/ break backwards compatibility, but it seems that previous logic was incorrect (based on my upcoming checking for consistency between RFCs). As such, I'm not sure that it should be fixed < 3.5. Thoughts? |
Actually, the logic seems to be correct according to RFC 1808 (which the variable names in the tests seem to hint at, as well). I think it's fine to upgrade the semantics to newer RFCs, but we should only do it in 3.5 indeed. |
Update based on review comments, added legacy support, fixed the tests up. |
Demian, thanks for the update! I'm not sure the rfc1808 flag is necessary. I would be fine with switching wholesale to the new semantics. Nick, since you've done work in the past on URIs, what do you think? |
FWIW, I think that it would be ever so slightly preferable to maintain support for the old behaviour but perhaps add a deprecation note in the docs. It's a little difficult to tell how much of an impact this change will have to existing code. Maintaining support for the old behaviour for 3.5 would make the change a little more easy to deal with. That said, it's nothing that I'd argue heatedly over. |
Unfortunately, I haven't looked at the RFC compliance side of things in These days, I'd be more inclined to just fix it for 3.5, and suggest anyone Extracting the test case inputs from that old "uriparse" rewrite might |
Uploaded new patch. Removed support for RFC1808-specific behaviour. Extracted non-compliant tests into comment blocks indicating the behaviour is no longer supported. |
Just hopping in here to say that the work going down here is beautiful. I've filed a lot of bugs. This one's not particularly difficult, but damn, I appreciate the speed and quality going into fixing it. Glad to see the Python language is a happy place with fast, quality fixes. |
Thanks Mike, it's always nice to get positive feedback :) |
New changeset b116489d31ff by Antoine Pitrou in branch 'default': |
The patch is now committed to the future Python 3.5. Thank you very much for this contribution! |
I'm now getting duplicated slashes in URLs, e.g.: https://new//foo.html In both cases, the base URL that gets joined with the postfix had a trailing slash, e.g. "http://my.little.server/url/" + "logo.gif" -> "http://my.little.server/url//logo.gif" |
Issue bpo-1500504 (the "urischemes" proposal that never got turned into a PyPI package) has several additional test cases. This particular attachment is one AMK cleaned up to run on Py3k: http://bugs.python.org/file32591/urischemes.py The module itself likely isn't worth salvaging, but the additional examples in the test cases were very useful in flushing out various RFC compliance issues. |
New changeset 901e4e52b20a by Senthil Kumaran in branch 'default': |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: