
LinkExtractor chokes when only one link is bogus #907

redapple opened this issue Sep 29, 2014 · 2 comments


@redapple commented on Sep 29, 2014

`_extract_links()` either returns all extracted links (which can be an empty list) or fails entirely.

It would be nice to wrap the extraction in a try/except so that it returns whatever could be extracted and skips bogus links.

Example session:

paul@desktop:~$ scrapy shell
2014-09-29 13:45:58+0200 [scrapy] INFO: Scrapy 0.24.4 started (bot: scrapybot)
2014-09-29 13:45:58+0200 [scrapy] INFO: Optional features available: ssl, http11, boto
2014-09-29 13:45:58+0200 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-09-29 13:45:58+0200 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-09-29 13:45:58+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-09-29 13:45:58+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-09-29 13:45:58+0200 [scrapy] INFO: Enabled item pipelines: 
2014-09-29 13:45:58+0200 [scrapy] DEBUG: Telnet console listening on
2014-09-29 13:45:58+0200 [scrapy] DEBUG: Web service listening on
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fab4bcd6fd0>
[s]   item       {}
[s]   settings   <scrapy.settings.Settings object at 0x7fab51714450>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [1]: from scrapy.http import HtmlResponse

In [2]: r = HtmlResponse(body='<html><body><a href="">link1</a><a href="">link2</a></body></html>', status=200, url="")

In [3]: from scrapy.contrib.linkextractors import LinkExtractor

In [4]: lx = LinkExtractor()

In [5]: lx.extract_links(r)
[Link(url='', text='link1', fragment='', nofollow=False),
 Link(url='', text='link2', fragment='', nofollow=False)]

In [6]: r = HtmlResponse(body='<html><body><a href="">link1</a><a href="http://[">link2</a></body></html>', status=200, url="")

In [7]: lx.extract_links(r)
ValueError                                Traceback (most recent call last)
<ipython-input-7-297e7bca14b8> in <module>()
----> 1 lx.extract_links(r)

/usr/local/lib/python2.7/dist-packages/scrapy/contrib/linkextractors/lxmlhtml.pyc in extract_links(self, response)
    105         all_links = []
    106         for doc in docs:
--> 107             links = self._extract_links(doc, response.url, response.encoding, base_url)
    108             all_links.extend(self._process_links(links))
    109         return unique_list(all_links)

/usr/local/lib/python2.7/dist-packages/scrapy/linkextractor.pyc in _extract_links(self, *args, **kwargs)
     93     def _extract_links(self, *args, **kwargs):
---> 94         return self.link_extractor._extract_links(*args, **kwargs)

/usr/local/lib/python2.7/dist-packages/scrapy/contrib/linkextractors/lxmlhtml.pyc in _extract_links(self, selector, response_url, response_encoding, base_url)
     50         for el, attr, attr_val in self._iter_links(selector._root):
     51             # pseudo lxml.html.HtmlElement.make_links_absolute(base_url)
---> 52             attr_val = urljoin(base_url, attr_val)
     53             url = self.process_attr(attr_val)
     54             if url is None:

/usr/lib/python2.7/urlparse.pyc in urljoin(base, url, allow_fragments)
    259             urlparse(base, '', allow_fragments)
    260     scheme, netloc, path, params, query, fragment = \
--> 261             urlparse(url, bscheme, allow_fragments)
    262     if scheme != bscheme or scheme not in uses_relative:
    263         return url

/usr/lib/python2.7/urlparse.pyc in urlparse(url, scheme, allow_fragments)
    141     Note that we don't break the components up in smaller bits
    142     (e.g. netloc is a single string) and we don't expand % escapes."""
--> 143     tuple = urlsplit(url, scheme, allow_fragments)
    144     scheme, netloc, url, query, fragment = tuple
    145     if scheme in uses_params and ';' in url:

/usr/lib/python2.7/urlparse.pyc in urlsplit(url, scheme, allow_fragments)
    189                 if (('[' in netloc and ']' not in netloc) or
    190                         (']' in netloc and '[' not in netloc)):
--> 191                     raise ValueError("Invalid IPv6 URL")
    192             if allow_fragments and '#' in url:
    193                 url, fragment = url.split('#', 1)

ValueError: Invalid IPv6 URL

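The root cause can be reproduced without Scrapy: the standard library's `urljoin` (via `urlsplit`) raises `ValueError` for an href whose netloc contains an unmatched bracket. A minimal sketch using Python 3's `urllib.parse` (the Python 2 `urlparse` module in the traceback above behaves the same way):

```python
from urllib.parse import urljoin

# urlsplit treats "[" in the netloc as the start of an IPv6 literal;
# an unmatched bracket makes it raise ValueError("Invalid IPv6 URL").
try:
    urljoin("http://example.com/", "http://[")
except ValueError as e:
    print(e)  # Invalid IPv6 URL
```

Because `LinkExtractor._extract_links()` calls `urljoin` inside its loop with no error handling, one such href aborts extraction for the whole page.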
@redapple changed the title from `LinkExtractor` chokes when only one link is bogus to LinkExtractor chokes when only one link is bogus on Sep 29, 2014

@kmike commented on Sep 30, 2014

From the user's perspective we should definitely skip bogus links, so +1 to this feature.

I think it is better to catch specific errors one by one. That is, to fix this issue, wrap only the `urljoin` call in a try/except, not the whole loop body.
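A sketch of that suggestion, using a simplified stand-in for the extractor loop (the function name and signature here are illustrative, not Scrapy's actual code): only the `urljoin` call sits inside the try/except, so a single malformed href is skipped while the remaining links are still returned.

```python
from urllib.parse import urljoin

def extract_hrefs(base_url, hrefs):
    """Resolve hrefs against base_url, skipping ones urljoin rejects."""
    links = []
    for href in hrefs:
        try:
            # Only the urljoin call is guarded, per kmike's suggestion;
            # other errors in the loop would still propagate.
            url = urljoin(base_url, href)
        except ValueError:
            # Bogus link (e.g. "http://[") -- skip it, keep the rest.
            continue
        links.append(url)
    return links
```

With the page from the session above, `extract_hrefs` would return the one valid link instead of raising `ValueError: Invalid IPv6 URL`.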

dangra added a commit that referenced this issue Aug 16, 2015
[MRG+1] [LinkExtractors] Ignore bogus links (#907)

@dangra commented on Aug 16, 2015

Fixed in #1352.

@dangra dangra closed this Aug 16, 2015