canonicalize_url in linkextractor: not what browsers do #1941

Closed
Digenis opened this issue Apr 20, 2016 · 7 comments

@Digenis (Member) commented Apr 20, 2016

By default, the link extractor calls canonicalize_url on the collected links.
The following is not what browsers do:

canonicalize_url('http://example.com/index.php?/a/=/o/')
'http://example.com/index.php?%2Fa%2F=%2Fo%2F'  # encoding forward slashes
canonicalize_url('http://example.com/index.php?a')
'http://example.com/index.php?a='  # appending = on empty arguments

I doubt this is a problem in canonicalize_url
because it's not meant to mimic browsers in the first place, is it?

However this is a problem for the link extractor
because it can potentially end up extracting urls
that are wrong from the server's perspective.
In this example, the server doesn't recognise the extractor's url, only the browser's:

# http://forum.laptop.bg/index.php?/discover/
LinkExtractor(restrict_xpaths=('//a[contains(@href, "/topic")]',)).extract_links(response)[0].url
# Extractor: http://forum.laptop.bg/index.php?%2Ftopic%2F57339-%D0%BB%D0%B0%D0%BF%D1%82%D0%BE%D0%BF-asus-w90vp%2F=&comment=221153&do=findComment
# Browser:   http://forum.laptop.bg/index.php?/topic/57339-%D0%BB%D0%B0%D0%BF%D1%82%D0%BE%D0%BF-asus-w90vp/&do=findComment&comment=221153

Was this a design decision or a bug?

@kmike (Member) commented Apr 20, 2016

@Digenis +1 to not using canonicalize_url by default in link extractors. It causes other issues - see e.g. #1202.

See also: scrapy/w3lib#25 (comment).

@kmike (Member) commented Apr 20, 2016

This change would be backwards incompatible, and link extractors beg for a rewrite. I think it should be fixed as part of a link extractor rewrite. We shouldn't alter URLs by default; IMHO canonicalize_url is for duplicate filters, not for preprocessing URLs before sending them to a server.
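
To illustrate the difference between deduplication and request URLs, here is a minimal stdlib-only sketch using the two URLs from the report above. It is not Scrapy's duplicate filter; `query_fingerprint` is a hypothetical helper introduced only for this example. The two URLs compare equal once their query strings are decoded and sorted, which is all a duplicate filter needs, yet the raw query strings that a server would receive differ.

```python
from urllib.parse import parse_qsl, urlsplit

# The two URLs quoted in the original report.
EXTRACTOR_URL = (
    "http://forum.laptop.bg/index.php?"
    "%2Ftopic%2F57339-%D0%BB%D0%B0%D0%BF%D1%82%D0%BE%D0%BF-asus-w90vp%2F="
    "&comment=221153&do=findComment"
)
BROWSER_URL = (
    "http://forum.laptop.bg/index.php?"
    "/topic/57339-%D0%BB%D0%B0%D0%BF%D1%82%D0%BE%D0%BF-asus-w90vp/"
    "&do=findComment&comment=221153"
)


def query_fingerprint(url):
    """Hypothetical helper: decoded (key, value) pairs in sorted order."""
    query = urlsplit(url).query
    # keep_blank_values=True so that a bare key and "key=" both survive.
    return sorted(parse_qsl(query, keep_blank_values=True))


# Equal after decoding and sorting: good enough for spotting duplicates.
same_for_dedup = query_fingerprint(EXTRACTOR_URL) == query_fingerprint(BROWSER_URL)
# Not equal as raw strings: a server reading the raw query sees two requests.
same_on_the_wire = urlsplit(EXTRACTOR_URL).query == urlsplit(BROWSER_URL).query

print(same_for_dedup, same_on_the_wire)
```

A server that routes on the raw query string, as this forum apparently does, only sees the second comparison.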

@redapple (Contributor) commented Apr 20, 2016

FWIW, regarding percent-escaping of /,
for a link such as

<a href="http://localhost:8001/query?a=/&b=?&c=@&d=:">/query?a=/&b=?&c=@&d=:</a>

this is what Chrome (Version 50.0.2661.75 (64-bit)) and Firefox (45.0.2) request (as received by the HTTP server):

------------------------------------------------------------------------------------------------------------------------
Chrome
127.0.0.1 - - [20/Apr/2016 18:05:57] "GET /query?a=/&b=?&c=@&d=: HTTP/1.1" 200 -
------------------------------------------------------------------------------------------------------------------------
Firefox
127.0.0.1 - - [20/Apr/2016 18:07:15] "GET /query?a=/&b=?&c=@&d=: HTTP/1.1" 200 -
------------------------------------------------------------------------------------------------------------------------

The encoding actually happens in Python's urlencode():

Python2, with urlencode() having no safe arg,

$ python2
Python 2.7.10 (default, Oct 14 2015, 16:09:02) 
[GCC 5.2.1 20151010] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib
>>> urllib.urlencode([('a', '/'), ('b', '?'), ('c', '@'), ('d', ':')])
'a=%2F&b=%3F&c=%40&d=%3A'

Python3, where urlencode() has safe so it's easier to get closer to browsers:

$ python3
Python 3.4.3+ (default, Oct 14 2015, 16:03:50) 
[GCC 5.2.1 20151010] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.parse
>>> urllib.parse.urlencode([('a', '/'), ('b', '?'), ('c', '@'), ('d', ':')])
'a=%2F&b=%3F&c=%40&d=%3A'
>>> urllib.parse.urlencode([('a', '/'), ('b', '?'), ('c', '@'), ('d', ':')], safe='/?:@')
'a=/&b=?&c=@&d=:'
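
Building on the sessions above, the following sketch reproduces both rewrites from the original report with the Python 3 standard library alone. It assumes a parse-then-re-encode round trip (`parse_qsl` followed by `urlencode`); `reencode_query` is a hypothetical helper written for this example, not w3lib's actual `canonicalize_url`.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def reencode_query(url, safe=""):
    """Hypothetical helper: parse the query string, sort it, re-encode it."""
    parts = urlsplit(url)
    # keep_blank_values=True turns a bare "a" into the pair ("a", "").
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    # urlencode always writes "key=value", so ("a", "") comes back as "a=".
    query = urlencode(pairs, safe=safe)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


print(reencode_query("http://example.com/index.php?/a/=/o/"))
print(reencode_query("http://example.com/index.php?a"))
print(reencode_query("http://example.com/index.php?/a/=/o/", safe="/"))
```

Under these assumptions, passing `safe` only avoids the slash escaping; the trailing `=` still appears because it comes from serialising the `(key, "")` pair, not from quoting.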

@Digenis (Member, Author) commented Apr 21, 2016

I've seen this and I tried fixing it
by manually encoding the url which takes up only 4 lines
(I ran the debugger and indeed after all the if/else and function calls, it's just 4 statements).
But then canonicalize_url also appends = to empty arguments
which is also not what browsers do
so I thought canonicalize_url is not even meant to mimic browsers
(there's a test for the appended =).
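
For comparison, here is a sketch of the opposite approach: leave the query string as written and percent-encode only characters that cannot appear in a URL at all. The helper `escape_query_minimally` and its `QUERY_SAFE` set are assumptions made for this example, not code from Scrapy or w3lib, and the result only approximates what browsers send.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched in the query (an assumed set for illustration).
# "%" is included so that existing percent-escapes are not double-encoded.
QUERY_SAFE = "/?:@=&%+"


def escape_query_minimally(url):
    """Hypothetical helper: escape only what must be escaped in the query."""
    parts = urlsplit(url)
    # No parsing into key/value pairs, so bare keys stay bare and the
    # original parameter order is preserved.
    query = quote(parts.query, safe=QUERY_SAFE)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


print(escape_query_minimally("http://example.com/index.php?/a/=/o/"))
print(escape_query_minimally("http://example.com/index.php?a"))
```

Because the query is never split into pairs, neither the slash escaping nor the appended `=` occurs here; the trade-off is that equivalent URLs are no longer normalised to a single form.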

@kmike (Member) commented Apr 21, 2016

@Digenis just to be clear, are we talking about changing the default for LinkExtractor, or maybe even removing canonicalize support from link extractors, to provide a better experience? One can already work around the issue by turning canonicalization off (the canonicalize=False LinkExtractor argument).

@Digenis (Member, Author) commented Apr 21, 2016

OK,
should I close this in favour of some other ticket?

@kmike (Member) commented Apr 21, 2016

No, I think the issue is valid.
