
Unicode Link Extractor #2010

Closed
k-m-engin opened this issue May 25, 2016 · 5 comments

@k-m-engin

When using the following to extract all of the links from a response:

self.link_extractor = LinkExtractor()
...
links = self.link_extractor.extract_links(response)

On rare occasions, the following error is thrown:

2016-05-25 12:13:55,432 [root] [ERROR]  Error on http://detroit.curbed.com/2016/5/5/11605132/tiny-house-designer-show, traceback: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1203, in mainLoop
    self.runUntilCurrent()
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 825, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 393, in callback
    self._startRunCallbacks(result)
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 501, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/var/www/html/DomainCrawler/DomainCrawler/spiders/hybrid_spider.py", line 223, in parse
    items.extend(self._extract_requests(response))
  File "/var/www/html/DomainCrawler/DomainCrawler/spiders/hybrid_spider.py", line 477, in _extract_requests
    links = self.link_extractor.extract_links(response)
  File "/usr/local/lib/python2.7/site-packages/scrapy/linkextractors/lxmlhtml.py", line 111, in extract_links
    all_links.extend(self._process_links(links))
  File "/usr/local/lib/python2.7/site-packages/scrapy/linkextractors/__init__.py", line 103, in _process_links
    link.url = canonicalize_url(urlparse(link.url))
  File "/usr/local/lib/python2.7/site-packages/scrapy/utils/url.py", line 85, in canonicalize_url
    parse_url(url), encoding=encoding)
  File "/usr/local/lib/python2.7/site-packages/scrapy/utils/url.py", line 46, in _safe_ParseResult
    to_native_str(parts.netloc.encode('idna')),
  File "/usr/local/lib/python2.7/encodings/idna.py", line 164, in encode
    result.append(ToASCII(label))
  File "/usr/local/lib/python2.7/encodings/idna.py", line 73, in ToASCII
    raise UnicodeError("label empty or too long")
exceptions.UnicodeError: label empty or too long

I was able to find some information concerning the error here.
My question is: What is the best way to handle this? Even if there is one bad link in the response, I'd want all of the other good links to be extracted.
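
(A minimal sketch of one possible workaround, assuming Scrapy's FilteringLinkExtractor behavior shown in the traceback above — the TolerantLinkExtractor name is hypothetical, and _process_links is an internal method, so this may break across Scrapy versions. It probes each URL and drops only the links whose canonicalization raises, instead of letting one bad href abort the whole extraction:)

from scrapy.linkextractors import LinkExtractor
from scrapy.utils.url import canonicalize_url

class TolerantLinkExtractor(LinkExtractor):
    """Drop individual links whose URLs cannot be canonicalized,
    instead of letting one bad href abort the whole extraction."""

    def _process_links(self, links):
        safe = []
        for link in links:
            try:
                canonicalize_url(link.url)  # probe: raises UnicodeError on bad IDNA labels
            except UnicodeError:
                continue  # skip just this link, keep the rest
            safe.append(link)
        # let the stock filtering/canonicalization run on the survivors
        return super(TolerantLinkExtractor, self)._process_links(safe)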

@kmike
Member

kmike commented May 25, 2016

I think it's worth fixing in Scrapy and/or in w3lib. There was a similar issue in the past (#1352). Do you have an example of a URL where this fails?

@eLRuLL
Member

eLRuLL commented May 25, 2016

$ scrapy shell http://detroit.curbed.com/2016/5/5/11605132/tiny-house-designer-show
...
In [1]: from scrapy.linkextractors import LinkExtractor
In [2]: le = LinkExtractor()
In [3]: le.extract_links(response)
---------------------------------------------------------------------------
UnicodeError                              Traceback (most recent call last)
<ipython-input-3-bd594ee31d5c> in <module>()
----> 1 le.extract_links(response)
...
UnicodeError: label empty or too long 

@k-m-engin
Author

I also know that erroneous strings such as:

print '.google.com'.encode('idna')

will fail.
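
(For context: Python's idna codec splits the host on dots and requires every label to be non-empty and at most 63 octets, the DNS label limit, so both of these raise the same error on Python 2.7:)

>>> u'.google.com'.encode('idna')         # leading dot -> empty first label
Traceback (most recent call last):
  ...
UnicodeError: label empty or too long
>>> (u'a' * 64 + u'.com').encode('idna')  # 64 octets exceeds the 63-octet limit
Traceback (most recent call last):
  ...
UnicodeError: label empty or too long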

@redapple
Contributor

redapple commented Jun 6, 2016

I believe we should fix this at the canonicalize_url level, for example by catching the exception and returning the URL as-is if IDNA encoding of the domain name fails.

It's a shame there's no explicit exception type for wrong label lengths (we could test the exception message, but that feels hacky):

>>> from scrapy.utils.url import canonicalize_url
>>> canonicalize_url('http://www.'+'a'*63+'.com')
'http://www.aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.com/'
>>> canonicalize_url('http://www.'+'a'*64+'.com')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "scrapy/utils/url.py", line 85, in canonicalize_url
    parse_url(url), encoding=encoding)
  File "scrapy/utils/url.py", line 46, in _safe_ParseResult
    to_native_str(parts.netloc.encode('idna')),
  File "/home/paul/.virtualenvs/scrapydev/lib/python2.7/encodings/idna.py", line 164, in encode
    result.append(ToASCII(label))
  File "/home/paul/.virtualenvs/scrapydev/lib/python2.7/encodings/idna.py", line 73, in ToASCII
    raise UnicodeError("label empty or too long")
UnicodeError: label empty or too long
>>> 
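
(A minimal sketch of the fallback described above — a hypothetical safe_canonicalize_url helper, not the actual patch that landed:)

from scrapy.utils.url import canonicalize_url

def safe_canonicalize_url(url):
    """Return the canonicalized URL, falling back to the original
    URL unchanged if IDNA-encoding the domain name fails."""
    try:
        return canonicalize_url(url)
    except UnicodeError:
        return url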

@kmike kmike added this to the v1.1.1 milestone Jun 6, 2016
@kmike
Member

kmike commented Jun 6, 2016

I'm adding the 1.1.1 milestone because this is a Scrapy 1.1 regression.
