`canonicalize_url` should not accept whitespace at the begin of an URI #132
Accepting whitespace in the begin of URI introduces a crawling issue. Unique URIs are generated on each depth and crawler gets in a loop.
When a URI is extracted and begins with whitespace it is being considered a relative URI and its concatenated at the end of current URI, which leads to unique URI on each crawler depth.
But, accordingly to RFC2396 [ Page 38], the scheme part of an URI should begin with an ALPHA character.
After further investigation, the core python
However a better implementation should be in accordance with RFC 3986 (which is an update of RFC2396 mentioned before), the
When compared to the with Ruby implementation of RFC 2396, they don't allow whitespace at the begin of absolute or relative URIs and throws an exception "bad URI(is not URI?)".
In order to fix that on Scrapy level, I suggest the following.
An a test in
But raising an exception breaks a lot of tests and can introduce unexpected behaviors. You could simple
The text was updated successfully, but these errors were encountered:
I think in practice the question is where users get URLs with whitespaces from - they usually get them from a href attributes. HTML5 explicitly allows whitespaces before and after href value, and requires browsers to strip them.
There are several places we may fix it -
You're right that URL parsing functions should be more strict; it is better to handle whitespaces before it hits urlparse.
@moisesguimaraes RFC 1808 is quite old. Accordingly to RFC 3986 - 3.1 Scheme, the scheme should begin with a letter.