UnicodeEncodeError in SgmlLinkExtractor when using restrict_xpaths #562

rmax · 2014-01-24T17:28:23Z

This issue has been addressed before by #285 but at the time none of the proposed alternative solution have made it into Scrapy.

Even though the solution proposed in #285 was a good workaround, it was discarded because it returned a different response.

But by the same argument, the restrict_xpaths argument already causes to modify the body content and thus the link extractor acts on a different body than the original.

Here is an example:

In [6]: from scrapy.selector import Selector

In [7]: from scrapy.http import HtmlResponse

In [8]: resp = HtmlResponse('http://example.com', body='<html><body><p>&hearts;</p></body></html>')

In [9]: Selector(resp).extract()
Out[9]: u'<html><body><p>\u2665</p></body></html>'

So, the link extractor by using the selector to build a fragment of the html get all the entities converted and thus causing the failure when trying to re-encode the body.

In [10]: from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

In [11]: resp = HtmlResponse('http://example.com', encoding='iso8859-15', body='<html><body><p>&hearts;</p></body></html>')

In [12]: SgmlLinkExtractor(restrict_xpaths='//p').extract_links(resp)
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-5-7d00fa2c152d> in <module>()
----> 1 SgmlLinkExtractor(restrict_xpaths='//p').extract_links(resp)

/home/rolando/Projects/scrapy/scrapy/contrib/linkextractors/sgml.pyc in extract_links(self, response)
    122                             for x in self.restrict_xpaths
    123                             for f in sel.xpath(x).extract()
--> 124                             ).encode(response.encoding)
    125         else:
    126             body = response.body

/usr/lib/python2.7/encodings/iso8859_15.pyc in encode(self, input, errors)
     10 
     11     def encode(self,input,errors='strict'):
---> 12         return codecs.charmap_encode(input,errors,encoding_table)
     13 
     14     def decode(self,input,errors='strict'):

UnicodeEncodeError: 'charmap' codec can't encode character u'\u2665' in position 3: character maps to <undefined>

It's a bummer that the most widely used link extractor and the only one that supports the handy argument restrict_xpaths fails in such simple and very common case.

I don't know what's the best solution (a solution that can be backported to 0.22), but it seems if the Selector have the support to work in other encodings than unicode by a given parameter it might work well for this case. This because lxml can handle the entities in different encodings nicely:

In [46]: import lxml

In [47]: parser = lxml.etree.HTMLParser()

In [48]: parser.feed('<p>&hearts;</p>')

In [52]: fragment = parser.close()

In [53]: lxml.html.tostring(fragment)
Out[53]: '<html><body><p>&#9829;</p></body></html>'

In [54]: lxml.html.tostring(fragment, encoding='unicode')
Out[54]: u'<html><body><p>\u2665</p></body></html>'

In [55]: lxml.html.tostring(fragment, encoding='ascii')
Out[55]: '<html><body><p>&#9829;</p></body></html>'

In [56]: lxml.html.tostring(fragment, encoding='iso8859-15')
Out[56]: '<html><body><p>&#9829;</p></body></html>'

In this way, even though the entities still are being converted, the re-encoding of the extracted fragment won't fail.

The text was updated successfully, but these errors were encountered:

kmike · 2014-01-24T19:09:17Z

I know nothing about this issue, but has something to say :)

Unicode is not an encoding at all, you can't encode to unicode.
Lxml's "encoding=unicode" parameter looks like a quick hack to return unicode strings for short html snippets, it is not good to rely on it unless we have a compelling reason.
To parse the data lxml needs binary data encoded to a proper encoding.
lxml.etree.HTMLParser() has an 'encoding' argument which should be filled when encoding is known (e.g. from response headers).

I guess we need to figure out where exactly things went wrong. Lxml should parse binary data, source encoding should be passed to tostring method, HTMLParser should be created with a proper encoding set.

The code here: https://github.com/scrapy/scrapy/blob/master/scrapy/selector/lxmldocument.py is suspicious - why is body encoded to utf8 and html parser is created with utf8 encoding instead of a known response encoding? I'm sure it is there to fix some issue Scrapy had in past, but I don't know what issues is it - maybe it is possible to fix it in another way?

Selector.extract method returns unicode by calling tostring(..., encoding='unicode'), and this is handy, but it could make sense to have an another method (or an option) (for internal use?) - it should return byte strings in the response encoding.

rmax · 2014-01-24T19:56:45Z

Agreed.

The offending code is this block: https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/linkextractors/sgml.py#L118

Specifically the statement: u''.join(f for x in self.restrict_xpaths for f in sel.xpath(x).extract()).encode(response.encoding).

nramirezuy · 2014-01-24T21:28:07Z

@kmike I was thinking in this today and maybe .extract() can accept encoding as an argument and default unicode, sometimes we need the bytes.

dangra · 2014-01-25T02:39:57Z

notes about lxmldocument.py, and specially about this lines:

def _factory(response, parser_cls):
    url = response.url
    body = response.body_as_unicode().strip().encode('utf8') or '<html/>'
    parser = parser_cls(recover=True, encoding='utf8')
    return etree.fromstring(body, parser=parser, base_url=url)

there are some reasons to recode everything to utf-8 instead of relying on lxml encoding support, some reasons may be obsolete in recent lxml versions but I bet others are still true:

lxml doesn't support all the encodings python support: https://gist.github.com/dangra/8610777
Ignoring the obvious wrong encodings, there are still some legit encodings found in the wild.
Sometimes the encoding declared isn't accurate and lxml can't handle invalid encoded bodies,
Using response.body_as_unicode() ensures the body is decoded (trough w3lib.encoding)
and we can trust lxml to handle valid utf8 correctly. By the time I ported from libxml2 to lxml,
lxml was always recoding its input to utf-8, we encode to utf-8 outside because lxml fails with
unicode input that contains encoding declarations.
The only encoding that really makes sense to pass directly to lxml is utf-8,
but even in this case lxml fails to parse invalid utf8 encoded strings.

>>> parser = lxml.etree.HTMLParser(recover=True, encoding='utf-8')
>>> lxml.etree.fromstring('<span>\xf0it\xe2\x80\x99s</span>')
  File "<string>", line unknown
XMLSyntaxError: Input is not proper UTF-8, indicate encoding !
Bytes: 0xF0 0x69 0x74 0xE2, line 1, column 7

The above example was taken from this SO question, it was motivate by real response bodies found while scraping. I think w3lib.encoding does a good job in this topic, and it took time to tune it for the wild web.

dangra · 2014-01-25T02:58:00Z

@darkrho: It can be fixed in SGMLLinkExtractor with a simlar hack to what lxml does

>>> u'\u2665'.encode('iso8859-15', errors='xmlcharrefreplace').decode('iso8859-15')
u'&#9829;'

the only trick is to encode using errors='xmlcharrefreplace'

dangra · 2014-01-25T03:01:58Z

Selector.extract method returns unicode by calling tostring(..., encoding='unicode'), and this is handy, but it could make sense to have an another method (or an option) (for internal use?) - it should return byte strings in the response encoding.

@kmike : what is the internal use case you see for this?

I was thinking in this today and maybe .extract() can accept encoding as an argument and default unicode, sometimes we need the bytes.

@nramirezuy : show me a case where returning bytes worth it instead of encoding outside.

dangra · 2014-01-25T03:03:23Z

It's a bummer that the most widely used link extractor and the only one that supports the handy argument restrict_xpaths fails in such simple and very common case.

Let's put our eyes on #559 ! (@redapple) :)

dangra · 2014-01-25T03:11:29Z

proposed patch:

diff --git a/scrapy/contrib/linkextractors/sgml.py b/scrapy/contrib/linkextractors/sgml.py
index d8f6ae4..9a63fcd 100644
--- a/scrapy/contrib/linkextractors/sgml.py
+++ b/scrapy/contrib/linkextractors/sgml.py
@@ -121,7 +121,7 @@ class SgmlLinkExtractor(BaseSgmlLinkExtractor):
             body = u''.join(f
                             for x in self.restrict_xpaths
                             for f in sel.xpath(x).extract()
-                            ).encode(response.encoding)
+                            ).encode(response.encoding, errors='xmlcharrefreplace')
         else:
             body = response.body

and sample script:

>>> from scrapy.http import HtmlResponse
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> resp = HtmlResponse('http://example.com', encoding='iso8859-15', body='<html><body><p>&hearts;</p></body></html>')
>>> SgmlLinkExtractor(restrict_xpaths='//p').extract_links(resp)
[]

dangra · 2014-01-25T03:33:43Z

another exampel for my proposed patch:

>>> body = '<html><body><p><a href="/&hearts;/you?c=&euro;">&hearts;</a></p></body></html>'
>>> resp = HtmlResponse('http://example.com', encoding='iso8859-15', body=body)
>>> SgmlLinkExtractor(restrict_xpaths='//p').extract_links(resp)
[Link(url='http://example.com/%E2%99%A5/you?c=%E2%82%AC', text=u'', fragment='', nofollow=False)]

doing it with lxml:

>>> parser = lxml.etree.HTMLParser(encoding='iso8859-15')
>>> fragment = lxml.html.fromstring(resp.body, parser=parser)
>>> lxml.html.tostring(fragment, encoding='iso8859-15')
'<html><body><p><a href="/%E2%99%A5/you?c=%E2%82%AC">&#9829;</a></p></body></html>'

What makes me sad is that SgmlLinkExtractor fails if restrict_paths is not used (even without my patch)

>>> SgmlLinkExtractor().extract_links(resp)
[Link(url='http://example.com/&hearts;/you?c=&euro=', text=u'', fragment='', nofollow=False)]

^^ this is another bug that hopefully will be fixed by #559.

dangra · 2014-01-25T03:38:14Z

/cc @stav as you originally proposed #285

redapple · 2014-01-25T11:31:57Z

@dangra , patch makes sense to me

rmax · 2014-01-25T13:15:40Z

+1 to proposed patch. I think it's simple enough to backport it to 0.22, right?

dangra · 2014-01-25T18:27:55Z

Yes. We can consider it a bug fix. I'll submit a PR later if nobody else
does.
On Jan 25, 2014 11:15 AM, "Rolando Espinoza La fuente" <
notifications@github.com> wrote:

+1 to proposed patch. I think it's simple enough to backport it to 0.22,
right?

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/562#issuecomment-33288687
.

…and scrapy#285

nramirezuy · 2014-02-03T17:06:56Z

@dangra a common case is when you select to create a request, we always use the returned unicode directly, this works because urls should be ascii compatible but this is wrong.

redapple added a commit to redapple/scrapy that referenced this issue Jan 26, 2014

Adding proposed fix for issue scrapy#562

ce5af82

dangra added a commit to dangra/scrapy that referenced this issue Jan 27, 2014

replace unencodeable codepoints with html entities. fixes scrapy#562 …

66829c9

…and scrapy#285

dangra mentioned this issue Jan 27, 2014

replace unencodeable codepoints with html entities #565

Merged

pablohoffman closed this as completed in #565 Jan 28, 2014

dangra added a commit that referenced this issue Feb 5, 2014

replace unencodeable codepoints with html entities. fixes #562 and #285

5f87b17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UnicodeEncodeError in SgmlLinkExtractor when using restrict_xpaths #562

UnicodeEncodeError in SgmlLinkExtractor when using restrict_xpaths #562

rmax commented Jan 24, 2014

kmike commented Jan 24, 2014

rmax commented Jan 24, 2014

nramirezuy commented Jan 24, 2014

dangra commented Jan 25, 2014

dangra commented Jan 25, 2014

dangra commented Jan 25, 2014

dangra commented Jan 25, 2014

dangra commented Jan 25, 2014

dangra commented Jan 25, 2014

dangra commented Jan 25, 2014

redapple commented Jan 25, 2014

rmax commented Jan 25, 2014

dangra commented Jan 25, 2014

nramirezuy commented Feb 3, 2014

UnicodeEncodeError in SgmlLinkExtractor when using restrict_xpaths #562

UnicodeEncodeError in SgmlLinkExtractor when using restrict_xpaths #562

Comments

rmax commented Jan 24, 2014

kmike commented Jan 24, 2014

rmax commented Jan 24, 2014

nramirezuy commented Jan 24, 2014

dangra commented Jan 25, 2014

dangra commented Jan 25, 2014

dangra commented Jan 25, 2014

dangra commented Jan 25, 2014

dangra commented Jan 25, 2014

dangra commented Jan 25, 2014

dangra commented Jan 25, 2014

redapple commented Jan 25, 2014

rmax commented Jan 25, 2014

dangra commented Jan 25, 2014

nramirezuy commented Feb 3, 2014