Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

canonicalize_url should not accept whitespace at the begin of an URI #132

Closed
abmxer opened this issue Nov 24, 2015 · 5 comments · Fixed by #136
Closed

canonicalize_url should not accept whitespace at the begin of an URI #132

abmxer opened this issue Nov 24, 2015 · 5 comments · Fixed by #136
Labels

Comments

@abmxer
Copy link

abmxer commented Nov 24, 2015

Accepting whitespace in the begin of URI introduces a crawling issue. Unique URIs are generated on each depth and crawler gets in a loop.

When a URI is extracted and begins with whitespace it is being considered a relative URI and its concatenated at the end of current URI, which leads to unique URI on each crawler depth.

But, accordingly to RFC2396 [ Page 38], the scheme part of an URI should begin with an ALPHA character.

"The syntax for URI scheme has been changed to require that all schemes begin with an alpha character."

After further investigation, the core python urlparse function doesn't follow RFC strictly, they consider some rules. Like on Python 3.5 they say:

"Changed in version 3.3: The fragment is now parsed for all URL schemes (unless allow_fragment is false), in accordance with RFC 3986. Previously, a whitelist of schemes that support fragments existed."

However a better implementation should be in accordance with RFC 3986 (which is an update of RFC2396 mentioned before), the urlparse should not accept whitespace at the begin of URI.

When compared to the with Ruby implementation of RFC 2396, they don't allow whitespace at the begin of absolute or relative URIs and throws an exception "bad URI(is not URI?)".

In order to fix that on Scrapy level, I suggest the following.

In scrapy/utils/url.py add at the begin of canonicalize_url:

if re.match('[a-zA-Z]', url[:1]) == None:
    raise ValueError('Bad URI (is not a valid URI?)')

An a test in tests/test_utils_url.py:

def test_rfc2396_scheme_should_begin_with_alpha(self):
    self.assertRaises(ValueError, canonicalize_url," http://www.example.com")

But raising an exception breaks a lot of tests and can introduce unexpected behaviors. You could simple .strip() url, but won't make it RFC compliant.

@abmxer abmxer changed the title canonicalize_url should not accept whitespace in the begin of an URI canonicalize_url should not accept whitespace at the begin of an URI Nov 24, 2015
@kmike
Copy link
Member

kmike commented Jan 23, 2016

Hi @abmxer,

I think in practice the question is where users get URLs with whitespaces from - they usually get them from a href attributes. HTML5 explicitly allows whitespaces before and after href value, and requires browsers to strip them.

There are several places we may fix it - canonicalize_url, link extractors (it won't help with links returned by resp.xpath('//a/@href').extract() though), safe_download_url, maybe somewhere else. I'm not sure about the best place to fix it.

You're right that URL parsing functions should be more strict; it is better to handle whitespaces before it hits urlparse.

See also: scrapy/scrapy#838, scrapy/scrapy#1603.

@eliasdorneles
Copy link
Member

I talked about this with @moisesguimaraes, he did some tests locally to see what browsers do and checked some RFC, it seems that it makes sense to strip spaces from beginning and end of the URL.

@moisesguimaraes can you add some more detail? Thank you!

@moisesguimaraes
Copy link

Hi @eliasdorneles,

Yes, according to RFC 1808, white spaces are not part of the URL, so it is safe to strip them.

@moisesguimaraes
Copy link

I would replace:

return urlparse(to_unicode(url, encoding))

with

return urlparse(to_unicode(url.strip(), encoding))

in w3lib, url.py, line 426

@abmxer
Copy link
Author

abmxer commented Jan 12, 2017

@moisesguimaraes RFC 1808 is quite old. Accordingly to RFC 3986 - 3.1 Scheme, the scheme should begin with a letter.

Scheme names consist of a sequence of characters beginning with a
letter and followed by any combination of letters, digits, plus
("+"), period ("."), or hyphen ("-").

@Gallaecio Gallaecio transferred this issue from scrapy/scrapy Jul 8, 2019
@Gallaecio Gallaecio added the bug label Jul 9, 2019
@wRAR wRAR closed this as completed in #136 Oct 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants