HTML entity causes UnicodeEncodeError in LxmlLinkExtractor #998

Closed
fgpietersz opened this issue Dec 30, 2014 · 4 comments · Fixed by #4321
Comments

@fgpietersz

A page containing 〈 (U+2329) causes the following error in LxmlLinkExtractor.extract_links():

          File "/[path]/spiders/search_spider.py", line 112, in parse
            self.link_extractor.extract_links(response) if
          File "/[path to virtualenv]/local/lib/python2.7/site-packages/scrapy/contrib/linkextractors/lxmlhtml.py", line 107, in extract_links
            links = self._extract_links(doc, response.url, response.encoding, base_url)
          File "/[path to virtualenv]/local/lib/python2.7/site-packages/scrapy/linkextractor.py", line 94, in _extract_links
            return self.link_extractor._extract_links(*args, **kwargs)
          File "/[path to virtualenv]/local/lib/python2.7/site-packages/scrapy/contrib/linkextractors/lxmlhtml.py", line 57, in _extract_links
            url = url.encode(response_encoding)
          File "/[path to virtualenv]/lib/python2.7/encodings/cp1252.py", line 12, in encode
            return codecs.charmap_encode(input,errors,encoding_table)
        exceptions.UnicodeEncodeError: 'charmap' codec can't encode character u'\u2329' in position 87: character maps to <undefined>

The encoding is correct and works for displaying the page in a browser, but fails as above in Scrapy.
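
The failing step in the traceback is the encode call itself, so it can be isolated without Scrapy. The following is a minimal sketch in Python 3 (the traceback above is from Python 2.7); the URL and the `try_encode` helper are hypothetical and only illustrate the codec behaviour:

```python
# Hypothetical URL that contains U+2329, the character named in the traceback.
url = "http://example.com/page\u2329"


def try_encode(text, encoding):
    """Return the encoded bytes, or None if the codec cannot represent text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


# cp1252 has no mapping for U+2329, so encoding fails.
cp1252_result = try_encode(url, "cp1252")

# UTF-8 can represent every code point, so the same URL encodes fine.
utf8_result = try_encode(url, "utf-8")
```

This suggests the page encoding can be perfectly valid for the page body while still being unable to represent a URL produced from a decoded entity.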

@fgpietersz
Author

I have a workaround:

cleaned_response = response.replace(
    body=response.body_as_unicode().encode('utf-32'),
    encoding='utf-32'
)

and then pass cleaned_response to link_extractor.extract_links().

I hope it means a fix will be straightforward.

I am using Scrapy 0.24.4, by the way.

@nramirezuy
Contributor

Can you give us a snippet to reproduce it?

@Digenis
Member

Digenis commented Oct 27, 2015

Here is a snippet to reproduce:
file: snowman.html

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=Windows-1251"/>
  </head>
  <body>
    <a title="unicode snowman" href="./&#9731;">
      &#x0441;&#x043d;&#x0435;&#x0436;&#x0435;&#x043d;
      &#x0447;&#x043e;&#x0432;&#x0435;&#x043a;
    </a>
  </body>
</html>

run: scrapy shell ~/snowman.html
The above doesn't work (probably a bug in utils.url),
and on 1.1 scrapy shell file:///home/digenis/snowman.html doesn't work either.
Only this form works:
scrapy shell file://localhost/home/digenis/snowman.html

from scrapy.linkextractors import LinkExtractor
LinkExtractor().extract_links(response)

UnicodeEncodeError: 'charmap' codec can't encode character u'\u2603' in position 24: character maps to <undefined>

lxml decodes the entity inside the href attribute,
and since Windows-1251 doesn't have the ☃ character (U+2603),
the URL can't be encoded "back" to cp1251.
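
The same effect can be sketched with the standard library alone. Here `html.unescape` stands in for the entity decoding that lxml performs while parsing (an assumption for illustration, not lxml's actual code path), and the strings are taken from snowman.html above:

```python
import html

# Decode the numeric entities, roughly as an HTML parser would.
href = html.unescape("./&#9731;")
text = html.unescape("&#x0441;&#x043d;&#x0435;&#x0436;&#x0435;&#x043d;")

# The Cyrillic link text fits in cp1251, one byte per character.
text_bytes = text.encode("cp1251")

# The decoded href contains U+2603, which cp1251 cannot represent.
try:
    href.encode("cp1251")
    href_encodable = True
except UnicodeEncodeError:
    href_encodable = False
```

So the document is valid Windows-1251 as served, but the href obtained after parsing is not.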

As @kmike suggests in #1403 (comment),
the path part needs to be encoded to UTF-8, as browsers do.
It shouldn't need to be encodable in the response's encoding.

@redapple
Contributor

redapple commented Sep 14, 2016

This issue is fixed in #1949, where Link objects are created with Unicode URLs (and no longer converted to native str, which chokes in Python 2).
Encoding is left to Request's __init__ (where safe_url_string is called, with the path encoded as UTF-8 and the query/fragment in the response/document encoding) or to canonicalize_url (on by default) when it kicks in.
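
For illustration only, the split between a UTF-8 path and a query/fragment in the document encoding can be sketched with the standard library. This is not the safe_url_string implementation; `encode_url_like_browser` is a hypothetical helper, the example URL is made up, and the sketch simply raises if the query or fragment cannot be represented in the document encoding:

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url_like_browser(url, document_encoding):
    """Percent-encode the path as UTF-8 and the query/fragment in document_encoding."""
    parts = urlsplit(url)
    # The path is always UTF-8, regardless of the page encoding.
    path = quote(parts.path.encode("utf-8"), safe="/%")
    # Query and fragment use the encoding of the page the link came from.
    query = quote(parts.query.encode(document_encoding), safe="=&%")
    fragment = quote(parts.fragment.encode(document_encoding), safe="%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))


# Snowman in the path, a Cyrillic letter in the query, page declared as cp1251.
encoded = encode_url_like_browser("http://example.com/\u2603?q=\u0441", "cp1251")
```

With this split, a character that the page encoding cannot represent no longer causes a failure as long as it appears in the path.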
