
Exception in LxmLinkExtractor.extract_links 'charmap' codec can't encode character #1403

Closed
aldarund opened this issue Aug 2, 2015 · 8 comments · Fixed by #4321

aldarund commented Aug 2, 2015

Traceback (most recent call last):

  File "scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
    for x in result:
  File "scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
    for x in result:
  File "scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "scrapy/spiders/crawl.py", line 69, in _parse_response
    for requests_or_item in iterate_spider_output(cb_res):
  File "ex_link_crawl/spiders/external_link_spider.py", line 45, in parse_obj
    for link in LxmlLinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
  File "scrapy/linkextractors/lxmlhtml.py", line 108, in extract_links
    links = self._extract_links(doc, response.url, response.encoding, base_url)
  File "scrapy/linkextractors/__init__.py", line 103, in _extract_links
    return self.link_extractor._extract_links(*args, **kwargs)
  File "scrapy/linkextractors/lxmlhtml.py", line 57, in _extract_links
    url = url.encode(response_encoding)
  File "encodings/cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'…' in position 7: character maps to <undefined>

My use of the extractor is as follows:

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LxmlLinkExtractor

# spider callback
def parse_obj(self, response):
    if not isinstance(response, HtmlResponse):
        return
    for link in LxmlLinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
        if not link.nofollow:
            yield LinkCrawlItem(domain=link.url)  # LinkCrawlItem: the project's item class
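
For reference, the failure reproduces without Scrapy at all. A minimal sketch, assuming the offending character is something cp1252 cannot represent (the exact character from the original URL is elided above):

    # Hypothetical URL; any character absent from cp1252 (here a zero-width
    # space) triggers the same UnicodeEncodeError from the charmap codec.
    url = u'http://example.com/\u200b'
    url.encode('cp1252')  # UnicodeEncodeError: 'charmap' codec can't encode character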
aldarund changed the title from "Exception in LxmLinkExtractor.extract_links 'charmap' codec can't encode character u'…' in position 7: character maps to <undefined>" to "Exception in LxmLinkExtractor.extract_links 'charmap' codec can't encode character" on Aug 2, 2015
kmike (Member) commented Aug 27, 2015

There is a test for it, but it is disabled for LxmlLinkExtractor. I've marked it as xfail here: #1461; apparently it passes in Python 3.

Digenis (Member) commented Oct 19, 2015

It passes in Python 3 because to_native_str at

    url = to_native_str(url, encoding=response_encoding)

doesn't encode the link to the response's encoding.
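
For readers who don't know the helper: to_native_str returns the "native" string type, bytes on Python 2 and unicode on Python 3. A simplified sketch of its behaviour (not the exact w3lib/scrapy source):

    import six

    def to_native_str(text, encoding='utf-8', errors='strict'):
        """Simplified: coerce text to the native str type of the running Python."""
        if six.PY2:
            # native str is bytes: encode unicode input
            if isinstance(text, six.text_type):
                return text.encode(encoding, errors)
            return text
        # native str is unicode: decode bytes input
        if isinstance(text, bytes):
            return text.decode(encoding, errors)
        return text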

It's a case where lxml manages to extract a unicode link with characters that the response's encoding doesn't support, but URLs need to be all ASCII. How does the URL get properly encoded before being fetched in Python 3?

I tried to_unicode(url, response_encoding) and the test passes on both 2 and 3, but on Python 2 Link expects bytes, so it encodes unicode URLs to UTF-8 with a warning.

Why bytes in Python 2 and unicode in Python 3? If encoding is meant to be handled by the downloader, shouldn't it all be unicode? If no such abstraction was intended, shouldn't it all be bytes? This all looks like there's an abstraction over URL encoding that is handled in the extractor on Python 2 and elsewhere on Python 3.

kmike (Member) commented Oct 19, 2015

> Why bytes in Python 2 and unicode in Python 3?

As I recall, the main reason is that in Python 2 urllib/urlparse work on bytes, while in Python 3 they work on unicode. Also, we can't make URLs unicode in Python 2 because it is inconvenient and backwards incompatible, and URLs as bytes are hard to work with in Python 3 (urllib.parse doesn't work with bytes, you have to pass the original encoding everywhere if you work with bytes, and selectors and link extractors return unicode).

Non-ASCII URL encoding is trickier than one may think: the domain should be encoded using IDNA, the path should be encoded to UTF-8 and then escaped to ASCII, and the query should be encoded to the response encoding and then escaped to ASCII - this is how browsers work.
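
A minimal Python 3 sketch of those three rules (my illustration with a made-up URL, not Scrapy code):

    from urllib.parse import quote, urlsplit, urlunsplit

    def browser_style_encode(url, page_encoding):
        """Encode a non-ASCII URL roughly the way browsers do."""
        parts = urlsplit(url)
        host = parts.netloc.encode('idna').decode('ascii')           # domain: IDNA
        path = quote(parts.path.encode('utf-8'))                     # path: UTF-8, then %-escaped
        query = quote(parts.query.encode(page_encoding), safe='=&')  # query: page encoding, then %-escaped
        return urlunsplit((parts.scheme, host, path, query, parts.fragment))

    print(browser_style_encode(u'http://пример.com/путь?q=мёд', 'cp1251'))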

We haven't really solved that. In the Python 3 port we decided that URLs should be bytes in Python 2 and unicode in Python 3 and updated all code accordingly, marking 'hard' test cases as xfail to unblock further porting.

Digenis (Member) commented Oct 27, 2015

I don't understand the claim that urlparse doesn't work with unicode in Python 2:

    >>> urlparse.urlparse(u'http://\N{SNOWFLAKE}.com/\N{SNOWMAN}')
    ParseResult(scheme=u'http', netloc=u'\u2744.com', path=u'/\u2603', params='', query='', fragment='')

Regarding backwards compatibility, where exactly is it broken by a transition to unicode URLs?

Regarding the final encoding of the URL before downloading, requiring links to be encodable to lesser encodings shouldn't be a necessary step, because the path can be encoded to UTF-8 + %-escapes later.

kmike (Member) commented Oct 27, 2015

> I don't understand the claim that urlparse doesn't work with unicode in Python 2

I think we've seen other functions from the urllib and urlparse modules which don't work correctly; sorry, no concrete examples right now.

> Regarding backwards compatibility, where exactly is it broken by a transition to unicode URLs?

Besides obvious data type differences (other stuff may be unexpectedly promoted to unicode), the difference is that with urls-as-bytes users must encode them, while with urls-as-unicode Scrapy encodes them, and the result may differ.

> requiring links to be encodable to lesser encodings shouldn't be a necessary step

Sorry, I didn't get what you mean; could you please clarify it?

> because the path can be encoded to UTF-8 + %-escapes later.

The problem is that in /path?query, the path should be encoded to UTF-8 and then %-escaped (even if the URL is extracted from a non-UTF-8 page), but the query should be encoded to the response encoding and then %-escaped (even though the path was encoded to UTF-8).
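
A concrete Python 3 illustration of that asymmetry (my example; imagine the link u'/путь?q=мёд' extracted from a cp1251 page):

    from urllib.parse import quote

    path = quote(u'/путь'.encode('utf-8'))         # '/%D0%BF%D1%83%D1%82%D1%8C' - always UTF-8
    query = 'q=' + quote(u'мёд'.encode('cp1251'))  # 'q=%EC%B8%E4' - the page's encoding
    print(path + '?' + query)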

kmike (Member) commented Oct 27, 2015

Another example: urllib.quote(u'привет')
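
For context, on Python 2.7 that call fails outright; quote's internal table only maps byte characters, so the first non-ASCII code point raises a KeyError:

    >>> import urllib
    >>> urllib.quote(u'привет')
    Traceback (most recent call last):
      ...
    KeyError: u'\u043f'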

Digenis (Member) commented Oct 28, 2015

> > requiring links to be encodable to lesser encodings shouldn't be a necessary step
>
> Sorry, I didn't get what you mean; could you please clarify it?

In the link extractor, unicode objects are encoded with url.encode(response_encoding); then in the downloader they are decoded again, I suppose with url.decode(response_encoding), and the path is finally encoded to UTF-8 after parsing, with parsed_url.path.encode('utf8').

Unless I misunderstood, and what you explained about unicode URLs is not what Scrapy does, and the request is made with a path encoded to the response's encoding.
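
A sketch of the round trip being described, as I understand it (assumed flow for Python 2, not verbatim Scrapy code):

    # -*- coding: utf-8 -*-
    import urlparse  # Python 2 module

    response_encoding = 'cp1251'            # pretend the page is cp1251
    url = u'http://example.com/путь'        # lxml hands back a unicode link

    encoded = url.encode(response_encoding)      # link extractor: encode to the page encoding
    decoded = encoded.decode(response_encoding)  # downloader: decode it right back
    path_utf8 = urlparse.urlparse(decoded).path.encode('utf8')  # path finally re-encoded to UTF-8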
