canonicalize_url should not accept whitespace at the beginning of a URI #132
Comments
Hi @abmxer, I think in practice the question is where users get URLs with whitespace from: they usually get them from the href attributes of a elements. HTML5 explicitly allows whitespace before and after the href value, and requires browsers to strip it. There are several places where we could fix it. You're right that URL parsing functions should be stricter; it is better to handle whitespace before it reaches urlparse. See also: scrapy/scrapy#838, scrapy/scrapy#1603.
I talked about this with @moisesguimaraes; he ran some tests locally to see what browsers do and checked some RFCs, and it seems to make sense to strip spaces from the beginning and end of the URL. @moisesguimaraes, can you add some more detail? Thank you!
Hi @eliasdorneles, Yes, according to RFC 1808, whitespace is not part of the URL, so it is safe to strip it.
I would replace:
with
in w3lib, url.py, line 426
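A minimal sketch of the stripping approach discussed in the comments above, in case it helps. This is not the actual w3lib change; `clean_href` and `HTML_WHITESPACE` are hypothetical names, and the character set assumes the HTML standard's definition of ASCII whitespace (tab, LF, FF, CR, space).

```python
from urllib.parse import urljoin

# ASCII whitespace per the HTML standard: tab, LF, FF, CR, space.
HTML_WHITESPACE = "\t\n\x0c\r "


def clean_href(base_url, href):
    """Hypothetical helper: strip surrounding whitespace from an href
    value before resolving it against the page URL."""
    return urljoin(base_url, href.strip(HTML_WHITESPACE))


if __name__ == "__main__":
    print(clean_href("http://example.com/a/", "  http://example.com/b \n"))
    print(clean_href("http://example.com/a/", " page.html "))
```

Stripping before the join means an absolute URL with stray leading spaces is still recognised as absolute, regardless of how strict the parser is.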
@moisesguimaraes RFC 1808 is quite old. According to RFC 3986, section 3.1 (Scheme), the scheme should begin with a letter.
Accepting whitespace at the beginning of a URI introduces a crawling issue: unique URIs are generated at each depth and the crawler gets into a loop.
When an extracted URI begins with whitespace, it is treated as a relative URI and is appended to the end of the current URI, which produces a new unique URI at each crawl depth.
But according to RFC 2396 (page 38), the scheme part of a URI should begin with an ALPHA character.
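To make the scheme rule concrete, here is a rough sketch of a check based on the RFC 3986 section 3.1 grammar, `scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )`. The names `SCHEME_RE` and `starts_with_scheme` are made up for illustration and are not part of Scrapy or w3lib.

```python
import re

# RFC 3986, section 3.1:
#   scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
# followed by the ":" that separates it from the rest of the URI.
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:")


def starts_with_scheme(url):
    """Return True if url begins with a syntactically valid scheme."""
    return SCHEME_RE.match(url) is not None


if __name__ == "__main__":
    for candidate in ("http://example.com", " http://example.com", "/relative/path"):
        print(repr(candidate), starts_with_scheme(candidate))
```

A string with a leading space fails this check, which is why a parser that applies the grammar ends up treating it as a relative reference rather than an absolute URI.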
After further investigation, the core Python urlparse function doesn't follow the RFC strictly; it only takes some of the rules into account. Like on Python 3.5 they say:
However, a better implementation should be in accordance with RFC 3986 (which obsoletes RFC 2396, mentioned above): urlparse should not accept whitespace at the beginning of a URI.
By comparison, the Ruby implementation of RFC 2396 doesn't allow whitespace at the beginning of absolute or relative URIs and throws an exception: "bad URI(is not URI?)".
In order to fix that at the Scrapy level, I suggest the following.
In scrapy/utils/url.py, add at the beginning of canonicalize_url:
And a test in tests/test_utils_url.py:
But raising an exception breaks a lot of tests and can introduce unexpected behavior. You could simply .strip() the URL, but that won't make it RFC compliant.