UnicodeEncodeError in SgmlLinkExtractor when using restrict_xpaths #562
I know nothing about this issue, but I have something to say :)
I guess we need to figure out where exactly things go wrong. lxml should parse binary data, the source encoding should be passed to the tostring method, and HTMLParser should be created with the proper encoding set. The code in https://github.com/scrapy/scrapy/blob/master/scrapy/selector/lxmldocument.py is suspicious: why is the body encoded to utf-8, and why is the HTML parser created with utf-8 encoding, instead of using the known response encoding? I'm sure it is there to fix some issue Scrapy had in the past, but I don't know which one; maybe it is possible to fix it in another way?
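For concreteness, here is a minimal sketch of the alternative being suggested: pass the response's detected encoding through to lxml instead of re-encoding everything to utf-8. The `_factory` signature comes from the linked file; the body of this variant is an assumption, not a tested fix.

```python
from lxml import etree

def _factory(response, parser_cls):
    # Hypothetical variant: let lxml decode the raw bytes itself, using the
    # encoding Scrapy detected for the response, instead of round-tripping
    # the body through unicode and utf-8.
    body = response.body or '<html/>'
    parser = parser_cls(recover=True, encoding=response.encoding)
    return etree.fromstring(body, parser=parser, base_url=response.url)
```

As the notes below show, a naive version like this chokes on bodies whose bytes do not actually match the declared encoding, which is presumably why the utf-8 round trip exists.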
Agreed. The offending code is this block: https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/linkextractors/sgml.py#L118, specifically the statement `.encode(response.encoding)`.
@kmike I was thinking about this today, and maybe …
notes about lxmldocument.py, and especially about these lines:

```python
def _factory(response, parser_cls):
    url = response.url
    body = response.body_as_unicode().strip().encode('utf8') or '<html/>'
    parser = parser_cls(recover=True, encoding='utf8')
    return etree.fromstring(body, parser=parser, base_url=url)
```

There are some reasons to recode everything to utf-8 instead of relying on lxml's encoding support; some of them may be obsolete in recent lxml versions, but I bet others are still true:
```python
>>> parser = lxml.etree.HTMLParser(recover=True, encoding='utf-8')
>>> lxml.etree.fromstring('<span>\xf0it\xe2\x80\x99s</span>')
Traceback (most recent call last):
  ...
  File "<string>", line unknown
XMLSyntaxError: Input is not proper UTF-8, indicate encoding !
Bytes: 0xF0 0x69 0x74 0xE2, line 1, column 7
```

The above example was taken from this SO question; it was motivated by real response bodies found while scraping. I think …
@darkrho: It can be fixed in SgmlLinkExtractor with a similar hack to what lxml does:

```python
>>> u'\u2665'.encode('iso8859-15', errors='xmlcharrefreplace').decode('iso8859-15')
u'&#9829;'
```

the only trick is to encode using `errors='xmlcharrefreplace'`
@kmike: what is the internal use case you see for this?
@nramirezuy: show me a case where returning bytes is worth it instead of encoding outside.
proposed patch:

```diff
diff --git a/scrapy/contrib/linkextractors/sgml.py b/scrapy/contrib/linkextractors/sgml.py
index d8f6ae4..9a63fcd 100644
--- a/scrapy/contrib/linkextractors/sgml.py
+++ b/scrapy/contrib/linkextractors/sgml.py
@@ -121,7 +121,7 @@ class SgmlLinkExtractor(BaseSgmlLinkExtractor):
             body = u''.join(f
                             for x in self.restrict_xpaths
                             for f in sel.xpath(x).extract()
-                            ).encode(response.encoding)
+                            ).encode(response.encoding, errors='xmlcharrefreplace')
         else:
             body = response.body
```

and sample script:

```python
>>> from scrapy.http import HtmlResponse
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> resp = HtmlResponse('http://example.com', encoding='iso8859-15', body='<html><body><p>♥</p></body></html>')
>>> SgmlLinkExtractor(restrict_xpaths='//p').extract_links(resp)
[]
```
another example for my proposed patch:

```python
>>> body = '<html><body><p><a href="/♥/you?c=€">♥</a></p></body></html>'
>>> resp = HtmlResponse('http://example.com', encoding='iso8859-15', body=body)
>>> SgmlLinkExtractor(restrict_xpaths='//p').extract_links(resp)
[Link(url='http://example.com/%E2%99%A5/you?c=%E2%82%AC', text=u'', fragment='', nofollow=False)]
```

doing it with lxml:

```python
>>> parser = lxml.etree.HTMLParser(encoding='iso8859-15')
>>> fragment = lxml.html.fromstring(resp.body, parser=parser)
>>> lxml.html.tostring(fragment, encoding='iso8859-15')
'<html><body><p><a href="/%E2%99%A5/you?c=%E2%82%AC">♥</a></p></body></html>'
```

What makes me sad is that SgmlLinkExtractor fails if restrict_xpaths is not used (even without my patch):

```python
>>> SgmlLinkExtractor().extract_links(resp)
[Link(url='http://example.com/♥/you?c=&euro=', text=u'', fragment='', nofollow=False)]
```

^^ this is another bug that hopefully will be fixed by #559.
@dangra, the patch makes sense to me
+1 to the proposed patch. I think it's simple enough to backport to 0.22, right?
Yes. We can consider it a bug fix. I'll submit a PR later if nobody else does.
@dangra a common case is when you select a URL to create a request: we always use the returned unicode directly. That works because URLs should be ASCII-compatible, but it is wrong.
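For context, a minimal sketch of the pattern being described (the spider snippet and XPath are illustrative assumptions, not code from the thread):

```python
from urlparse import urljoin  # Python 2, matching the scrapy.contrib era

from scrapy.http import Request
from scrapy.selector import Selector

def parse(self, response):
    sel = Selector(response)
    # .extract() returns unicode, and it gets passed straight to Request.
    # That only works while the href is ASCII-compatible; a non-ASCII path
    # or query string would need to be encoded in the page's encoding (or
    # percent-encoded) first.
    for href in sel.xpath('//a/@href').extract():
        yield Request(urljoin(response.url, href))
```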
This issue has been addressed before in #285, but at the time none of the proposed alternative solutions made it into Scrapy.
Even though the solution proposed in #285 was a good workaround, it was discarded because it returned a different response.
But by the same argument, the `restrict_xpaths` argument already modifies the body content, so the link extractor acts on a different body than the original. Here is an example:
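(The example block itself appears to have been lost from this copy of the issue. A minimal reproduction, consistent with the error in the title and the sample scripts above; the exact markup here is an assumption:)

```python
>>> from scrapy.http import HtmlResponse
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> body = '<html><body><p><a href="/x">&hearts;</a></p></body></html>'
>>> resp = HtmlResponse('http://example.com', encoding='iso8859-15', body=body)
>>> SgmlLinkExtractor(restrict_xpaths='//p').extract_links(resp)
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2665' in position 16: character maps to <undefined>
```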
So the link extractor, by using the selector to build a fragment of the HTML, gets all the entities converted to unicode characters, which is what makes the re-encoding of the body fail.
It's a bummer that the most widely used link extractor, and the only one that supports the handy `restrict_xpaths` argument, fails in such a simple and very common case. I don't know what the best solution is (one that can be backported to 0.22), but it seems that if the `Selector` supported working in encodings other than unicode via a parameter, it might handle this case well, because `lxml` can handle the entities in different encodings nicely (see the sketch below). In this way, even though the entities are still converted, re-encoding the extracted fragment won't fail.
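(The lxml example block also appears to be missing from this copy. A small sketch of the behavior being described, with illustrative markup: an entity with no iso8859-15 counterpart survives re-encoding as a character reference, while one that is representable becomes a raw byte.)

```python
>>> import lxml.etree, lxml.html
>>> parser = lxml.etree.HTMLParser(encoding='iso8859-15')
>>> root = lxml.html.fromstring('<p>&hearts; &euro;</p>', parser=parser)
>>> # Serializing back to iso8859-15 does not fail: the heart (not
>>> # representable) becomes a character reference, while the euro sign
>>> # is emitted as its iso8859-15 byte (0xa4).
>>> lxml.html.tostring(root, encoding='iso8859-15')
'<p>&#9829; \xa4</p>'
```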