Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UnicodeEncodeError in SgmlLinkExtractor when using restrict_xpaths #562

Closed
rmax opened this issue Jan 24, 2014 · 14 comments · Fixed by #565
Closed

UnicodeEncodeError in SgmlLinkExtractor when using restrict_xpaths #562

rmax opened this issue Jan 24, 2014 · 14 comments · Fixed by #565

Comments

@rmax
Copy link
Contributor

rmax commented Jan 24, 2014

This issue has been addressed before by #285 but at the time none of the proposed alternative solution have made it into Scrapy.

Even though the solution proposed in #285 was a good workaround, it was discarded because it returned a different response.

But by the same argument, the restrict_xpaths argument already causes to modify the body content and thus the link extractor acts on a different body than the original.

Here is an example:

In [6]: from scrapy.selector import Selector

In [7]: from scrapy.http import HtmlResponse

In [8]: resp = HtmlResponse('http://example.com', body='<html><body><p>&hearts;</p></body></html>')

In [9]: Selector(resp).extract()
Out[9]: u'<html><body><p>\u2665</p></body></html>'

So, the link extractor by using the selector to build a fragment of the html get all the entities converted and thus causing the failure when trying to re-encode the body.

In [10]: from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

In [11]: resp = HtmlResponse('http://example.com', encoding='iso8859-15', body='<html><body><p>&hearts;</p></body></html>')

In [12]: SgmlLinkExtractor(restrict_xpaths='//p').extract_links(resp)
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-5-7d00fa2c152d> in <module>()
----> 1 SgmlLinkExtractor(restrict_xpaths='//p').extract_links(resp)

/home/rolando/Projects/scrapy/scrapy/contrib/linkextractors/sgml.pyc in extract_links(self, response)
    122                             for x in self.restrict_xpaths
    123                             for f in sel.xpath(x).extract()
--> 124                             ).encode(response.encoding)
    125         else:
    126             body = response.body

/usr/lib/python2.7/encodings/iso8859_15.pyc in encode(self, input, errors)
     10 
     11     def encode(self,input,errors='strict'):
---> 12         return codecs.charmap_encode(input,errors,encoding_table)
     13 
     14     def decode(self,input,errors='strict'):

UnicodeEncodeError: 'charmap' codec can't encode character u'\u2665' in position 3: character maps to <undefined>

It's a bummer that the most widely used link extractor and the only one that supports the handy argument restrict_xpaths fails in such simple and very common case.

I don't know what's the best solution (a solution that can be backported to 0.22), but it seems if the Selector have the support to work in other encodings than unicode by a given parameter it might work well for this case. This because lxml can handle the entities in different encodings nicely:

In [46]: import lxml

In [47]: parser = lxml.etree.HTMLParser()

In [48]: parser.feed('<p>&hearts;</p>')

In [52]: fragment = parser.close()

In [53]: lxml.html.tostring(fragment)
Out[53]: '<html><body><p>&#9829;</p></body></html>'

In [54]: lxml.html.tostring(fragment, encoding='unicode')
Out[54]: u'<html><body><p>\u2665</p></body></html>'

In [55]: lxml.html.tostring(fragment, encoding='ascii')
Out[55]: '<html><body><p>&#9829;</p></body></html>'

In [56]: lxml.html.tostring(fragment, encoding='iso8859-15')
Out[56]: '<html><body><p>&#9829;</p></body></html>'

In this way, even though the entities still are being converted, the re-encoding of the extracted fragment won't fail.

@kmike
Copy link
Member

kmike commented Jan 24, 2014

I know nothing about this issue, but has something to say :)

  1. Unicode is not an encoding at all, you can't encode to unicode.
  2. Lxml's "encoding=unicode" parameter looks like a quick hack to return unicode strings for short html snippets, it is not good to rely on it unless we have a compelling reason.
  3. To parse the data lxml needs binary data encoded to a proper encoding.
  4. lxml.etree.HTMLParser() has an 'encoding' argument which should be filled when encoding is known (e.g. from response headers).

I guess we need to figure out where exactly things went wrong. Lxml should parse binary data, source encoding should be passed to tostring method, HTMLParser should be created with a proper encoding set.

The code here: https://github.com/scrapy/scrapy/blob/master/scrapy/selector/lxmldocument.py is suspicious - why is body encoded to utf8 and html parser is created with utf8 encoding instead of a known response encoding? I'm sure it is there to fix some issue Scrapy had in past, but I don't know what issues is it - maybe it is possible to fix it in another way?

Selector.extract method returns unicode by calling tostring(..., encoding='unicode'), and this is handy, but it could make sense to have an another method (or an option) (for internal use?) - it should return byte strings in the response encoding.

@rmax
Copy link
Contributor Author

rmax commented Jan 24, 2014

Agreed.

The offending code is this block: https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/linkextractors/sgml.py#L118

Specifically the statement: u''.join(f for x in self.restrict_xpaths for f in sel.xpath(x).extract()).encode(response.encoding).

@nramirezuy
Copy link
Contributor

@kmike I was thinking in this today and maybe .extract() can accept encoding as an argument and default unicode, sometimes we need the bytes.

@dangra
Copy link
Member

dangra commented Jan 25, 2014

notes about lxmldocument.py, and specially about this lines:

def _factory(response, parser_cls):
    url = response.url
    body = response.body_as_unicode().strip().encode('utf8') or '<html/>'
    parser = parser_cls(recover=True, encoding='utf8')
    return etree.fromstring(body, parser=parser, base_url=url)

there are some reasons to recode everything to utf-8 instead of relying on lxml encoding support, some reasons may be obsolete in recent lxml versions but I bet others are still true:

  1. lxml doesn't support all the encodings python support: https://gist.github.com/dangra/8610777
    Ignoring the obvious wrong encodings, there are still some legit encodings found in the wild.
  2. Sometimes the encoding declared isn't accurate and lxml can't handle invalid encoded bodies,
    Using response.body_as_unicode() ensures the body is decoded (trough w3lib.encoding)
    and we can trust lxml to handle valid utf8 correctly. By the time I ported from libxml2 to lxml,
    lxml was always recoding its input to utf-8, we encode to utf-8 outside because lxml fails with
    unicode input that contains encoding declarations.
  3. The only encoding that really makes sense to pass directly to lxml is utf-8,
    but even in this case lxml fails to parse invalid utf8 encoded strings.
>>> parser = lxml.etree.HTMLParser(recover=True, encoding='utf-8')
>>> lxml.etree.fromstring('<span>\xf0it\xe2\x80\x99s</span>')
  File "<string>", line unknown
XMLSyntaxError: Input is not proper UTF-8, indicate encoding !
Bytes: 0xF0 0x69 0x74 0xE2, line 1, column 7

The above example was taken from this SO question, it was motivate by real response bodies found while scraping. I think w3lib.encoding does a good job in this topic, and it took time to tune it for the wild web.

@dangra
Copy link
Member

dangra commented Jan 25, 2014

@darkrho: It can be fixed in SGMLLinkExtractor with a simlar hack to what lxml does

>>> u'\u2665'.encode('iso8859-15', errors='xmlcharrefreplace').decode('iso8859-15')
u'&#9829;'

the only trick is to encode using errors='xmlcharrefreplace'

@dangra
Copy link
Member

dangra commented Jan 25, 2014

Selector.extract method returns unicode by calling tostring(..., encoding='unicode'), and this is handy, but it could make sense to have an another method (or an option) (for internal use?) - it should return byte strings in the response encoding.

@kmike : what is the internal use case you see for this?

I was thinking in this today and maybe .extract() can accept encoding as an argument and default unicode, sometimes we need the bytes.

@nramirezuy : show me a case where returning bytes worth it instead of encoding outside.

@dangra
Copy link
Member

dangra commented Jan 25, 2014

It's a bummer that the most widely used link extractor and the only one that supports the handy argument restrict_xpaths fails in such simple and very common case.

Let's put our eyes on #559 ! (@redapple) :)

@dangra
Copy link
Member

dangra commented Jan 25, 2014

proposed patch:

diff --git a/scrapy/contrib/linkextractors/sgml.py b/scrapy/contrib/linkextractors/sgml.py
index d8f6ae4..9a63fcd 100644
--- a/scrapy/contrib/linkextractors/sgml.py
+++ b/scrapy/contrib/linkextractors/sgml.py
@@ -121,7 +121,7 @@ class SgmlLinkExtractor(BaseSgmlLinkExtractor):
             body = u''.join(f
                             for x in self.restrict_xpaths
                             for f in sel.xpath(x).extract()
-                            ).encode(response.encoding)
+                            ).encode(response.encoding, errors='xmlcharrefreplace')
         else:
             body = response.body

and sample script:

>>> from scrapy.http import HtmlResponse
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> resp = HtmlResponse('http://example.com', encoding='iso8859-15', body='<html><body><p>&hearts;</p></body></html>')
>>> SgmlLinkExtractor(restrict_xpaths='//p').extract_links(resp)
[]

@dangra
Copy link
Member

dangra commented Jan 25, 2014

another exampel for my proposed patch:

>>> body = '<html><body><p><a href="/&hearts;/you?c=&euro;">&hearts;</a></p></body></html>'
>>> resp = HtmlResponse('http://example.com', encoding='iso8859-15', body=body)
>>> SgmlLinkExtractor(restrict_xpaths='//p').extract_links(resp)
[Link(url='http://example.com/%E2%99%A5/you?c=%E2%82%AC', text=u'', fragment='', nofollow=False)]

doing it with lxml:

>>> parser = lxml.etree.HTMLParser(encoding='iso8859-15')
>>> fragment = lxml.html.fromstring(resp.body, parser=parser)
>>> lxml.html.tostring(fragment, encoding='iso8859-15')
'<html><body><p><a href="/%E2%99%A5/you?c=%E2%82%AC">&#9829;</a></p></body></html>'

What makes me sad is that SgmlLinkExtractor fails if restrict_paths is not used (even without my patch)

>>> SgmlLinkExtractor().extract_links(resp)
[Link(url='http://example.com/&hearts;/you?c=&euro=', text=u'', fragment='', nofollow=False)]

^^ this is another bug that hopefully will be fixed by #559.

@dangra
Copy link
Member

dangra commented Jan 25, 2014

/cc @stav as you originally proposed #285

@redapple
Copy link
Contributor

@dangra , patch makes sense to me

@rmax
Copy link
Contributor Author

rmax commented Jan 25, 2014

+1 to proposed patch. I think it's simple enough to backport it to 0.22, right?

@dangra
Copy link
Member

dangra commented Jan 25, 2014

Yes. We can consider it a bug fix. I'll submit a PR later if nobody else
does.
On Jan 25, 2014 11:15 AM, "Rolando Espinoza La fuente" <
notifications@github.com> wrote:

+1 to proposed patch. I think it's simple enough to backport it to 0.22,
right?


Reply to this email directly or view it on GitHubhttps://github.com//issues/562#issuecomment-33288687
.

@nramirezuy
Copy link
Contributor

@dangra a common case is when you select to create a request, we always use the returned unicode directly, this works because urls should be ascii compatible but this is wrong.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants