
Scrapy should handle "invalid" relative URLs better #1304

Open
kmike opened this issue Jun 15, 2015 · 14 comments

@kmike (Member) commented Jun 15, 2015

Currently Scrapy can't extract links from the http://scrapy.org/ page correctly because the URLs in the page header are relative to a non-existent parent: ../download/, ../doc/, etc. Browsers resolve these links as http://scrapy.org/download/ and http://scrapy.org/doc/, while response.urljoin, urlparse.urljoin and our link extractors resolve them as http://scrapy.org/../download/, etc. This results in 400 Bad Request responses.

urlparse.urljoin is not correct (or at least not modern) here. The URL Living Standard that browsers follow says:

If buffer is "..", remove url’s path’s last entry, if any, and then if c is neither "/" nor "", append the empty string to url’s path.
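
For reference, a minimal illustration of the divergence (output as seen on Python 2.7; later comments in this thread confirm that recent Python 3 versions behave differently):

>>> from urlparse import urljoin   # Python 2 standard library
>>> urljoin('http://scrapy.org/', '../download/')
'http://scrapy.org/../download/'

A browser following the URL Living Standard drops the extra ".." and resolves the same link to http://scrapy.org/download/.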

@kmike (Member, Author) commented Jun 15, 2015

By the way, this also fails:

yield scrapy.Request("http://scrapy.org", self.parse)

- the request is filtered out because of #1225, so self.parse is never called. Crawling scrapy.org with Scrapy is not straightforward :)
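
As a side note: assuming the filtering mentioned above happens in the scheduler's duplicate filter (an assumption on my part about what #1225 describes), a spider-side workaround would be to mark the request as non-filterable:

yield scrapy.Request("http://scrapy.org", self.parse, dont_filter=True)  # dont_filter=True bypasses Scrapy's duplicate filter

That only works around the filtering, though, not the relative-URL resolution this issue is about.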

@kmike (Member, Author) commented Jun 15, 2015

The most evil spider ever: looks innocent, but doesn't work for multiple reasons

import scrapy

class ScrapySpider(scrapy.Spider):
    name = 'scrapyspider'

    def start_requests(self):
        # start URL without a trailing slash: filtered out because of #1225
        yield scrapy.Request("http://scrapy.org", self.parse_main)

    def parse_main(self, response):
        # the header links are relative to a non-existent parent (../download/, ../doc/, ...),
        # so urljoin produces URLs like http://scrapy.org/../download/
        for href in response.xpath("//a/@href").extract():
            yield scrapy.Request(response.urljoin(href), self.parse_link)

    def parse_link(self, response):
        print(response.url)

@jlbren commented Nov 2, 2016

Hello, is this still an open issue? If so, my colleagues and I would like to look into it.

@kmike (Member, Author) commented Nov 2, 2016

@jlbren yeah, it is still open; help is appreciated!

@jlbren commented Nov 2, 2016

@kmike great, we will post updates as they come!

@kmike (Member, Author) commented Nov 2, 2016

Thanks @jlbren! Feel free to ask any questions and discuss it here. The issue doesn't look easy because it is a bug in urlparse.urljoin from the standard library. Likely we'll have to add a workaround to https://github.com/scrapy/w3lib, but maybe there is another way to solve this; I don't know.
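
To make the intended behavior concrete, here is a minimal sketch of the kind of dot-segment normalization (RFC 3986, section 5.2.4) that such a workaround would need to apply to the joined path. It is a simplified, hypothetical helper, not the actual w3lib fix, and it glosses over trailing "/." and "/.." edge cases:

def remove_dot_segments(path):
    # Collapse "." and ".." path segments; ".." segments that would climb
    # above the root are dropped, which is what browsers do.
    output = []
    for segment in path.split('/'):
        if segment == '.':
            continue
        if segment == '..':
            if len(output) > 1:
                output.pop()  # step back one segment, but never above "/"
            continue
        output.append(segment)
    return '/'.join(output)

For example:

>>> remove_dot_segments('/../download/')
'/download/'
>>> remove_dot_segments('/a/b/../c')
'/a/c'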

@jlbren commented Nov 2, 2016

@kmike thanks, that is a helpful starting point. We will definitely take you up on posting questions. I am working with a team of two other CS grad students, so our aim is to make significant progress as a semester project. I imagine they will post any questions they encounter as well.

@creider commented Nov 16, 2016

Hello. I'm working with @jlbren on this issue and we can't seem to replicate it. We aren't getting any 400 Bad Request responses. Here is our output from running the 'evil spider' above:

http://pastebin.com/CM28zw9W

Could anyone shed some light on why we aren't seeing the expected erroneous behavior?

@redapple (Contributor) commented Nov 16, 2016

@creider, if I understood this issue correctly, you can reproduce it with something like this:

>>> resp = scrapy.http.response.html.HtmlResponse('http://www.example.com',
...     body='''<html>
...                 <body>
...                     <a href="../index.html">some link</a>
...                 </body>
...         </html>''')
>>> 
>>> links = scrapy.linkextractors.LinkExtractor().extract_links(resp)
>>> links
[Link(url='http://www.example.com/../index.html', text='some link', fragment='', nofollow=False)]
>>> fetch(links[0].url)
2016-11-16 21:03:14 [scrapy] DEBUG: Crawled (404) <GET http://www.example.com/../index.html> (referer: None)
>>> fetch('http://www.example.com/index.html')
2016-11-16 21:03:22 [scrapy] DEBUG: Crawled (200) <GET http://www.example.com/index.html> (referer: None)
>>> 

Here, the link extractor generates a link to http://www.example.com/../index.html (this is one place where it could do better), and Scrapy does not "clean up" the URL either; it asks for the URL as-is:

GET /../index.html HTTP/1.1
Host: www.example.com
Accept-Language: en
Accept-Encoding: gzip,deflate
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: Scrapy/1.2.1 (+http://scrapy.org)

HTTP/1.1 404 Not Found
Content-Encoding: gzip
Accept-Ranges: bytes
Cache-Control: max-age=604800
Content-Type: text/html
Date: Wed, 16 Nov 2016 20:04:30 GMT
Etag: "359670651"
Expires: Wed, 23 Nov 2016 20:04:30 GMT
Last-Modified: Fri, 09 Aug 2013 23:54:35 GMT
Server: ECS (iad/19C1)
Vary: Accept-Encoding
X-Cache: HIT
x-ec-custom-error: 1
Content-Length: 606
...

@redapple (Contributor) commented Nov 16, 2016

Also note that in Python 3 (3.5 at least), urljoin seems to do a better job:

$ scrapy version -v
Scrapy    : 1.2.1
lxml      : 3.6.4.0
libxml2   : 2.9.4
Twisted   : 16.5.0
Python    : 3.5.2 (default, Sep 10 2016, 08:21:44) - [GCC 5.4.0 20160609]
pyOpenSSL : 16.2.0 (OpenSSL 1.0.2g  1 Mar 2016)
Platform  : Linux-4.4.0-47-generic-x86_64-with-Ubuntu-16.04-xenial

$ scrapy shell
(...)
>>> resp = scrapy.http.response.html.HtmlResponse('http://www.example.com',
...     body=b'''<html>
...                 <body>
...                     <a href="../index.html">some link</a>
...                 </body>
...         </html>''')
>>> 
>>> links = scrapy.linkextractors.LinkExtractor().extract_links(resp)
>>> links
[Link(url='http://www.example.com/index.html', text='some link', fragment='', nofollow=False)]
>>> 
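
The difference comes from the standard library itself rather than from Scrapy. A quick check outside Scrapy (output as on Python 3.5.2, per the version info above):

>>> from urllib.parse import urljoin
>>> urljoin('http://www.example.com/', '../index.html')
'http://www.example.com/index.html'

On Python 2.7, urlparse.urljoin keeps the ".." and returns 'http://www.example.com/../index.html' instead.
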
@kmike (Member, Author) commented Nov 17, 2016

I'm fine with closing this issue if the bug is Python 2-only.
We're not yet ready for Python 3 (e.g. telnet is not ported), but I think Scrapy is close enough.

@kmike (Member, Author) commented Jul 15, 2017

FTR: this issue is fixed in Python 3.5.2; it is still present in Python 3.4.6 and 2.7.13.

@NoExitTV (Contributor) commented Oct 7, 2017

I looked into this and think I've managed to fix this issue for Python 2.7.13.
I'll create a pull request; it would be nice if some of you could take a look at it and give me some feedback.

@nctl144 (Member) commented Mar 5, 2018

I made a PR at scrapy/w3lib#104. Hope that helps solve this problem :)
