Exception in LxmlLinkExtractor.extract_links: 'charmap' codec can't encode character #1403
Comments
There is a test for it, but it is disabled for LxmlLinkExtractor. I've marked it as xfail here: #1461; apparently it passes in Python 3.
It passes in Python 3 because scrapy/linkextractors/lxmlhtml.py (line 59, as of 8dc400c) doesn't encode the link to the response's encoding. It's a case where lxml manages to extract a unicode link with symbols that the response's encoding doesn't support. But URLs need to be all ASCII.
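For reference, a minimal sketch of that failure mode, assuming a cp1252 page encoding (the actual encoding from the report is not shown here):

```python
# Hypothetical example: lxml hands back the href as unicode, and encoding it
# to the response's encoding fails because the codec can't represent U+2603.
url = u'http://example.com/page?q=\u2603'  # U+2603 SNOWMAN is not in cp1252
try:
    url.encode('cp1252')  # what the Python 2 code path attempted
except UnicodeEncodeError as exc:
    print(exc)  # 'charmap' codec can't encode character ... maps to <undefined>
```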
As I recall, the main reason is that in Python 2 urllib/urlparse work on bytes, while in Python 3 they work on unicode. Also, we can't make URLs unicode in Python 2 because it is inconvenient and backwards incompatible, and URLs as bytes are hard to work with in Python 3 (urllib.parse doesn't work with bytes; you have to pass the original encoding everywhere if you work with bytes, while selectors and link extractors return unicode).

Non-ASCII URL encoding is trickier than one may think: the domain should be encoded using IDNA, the path should be encoded to UTF-8 and then escaped to ASCII, and the query should be encoded to the response encoding and then escaped to ASCII; this is how browsers work. We haven't really solved that.

In the Python 3 port we decided that URLs should be bytes in Python 2 and unicode in Python 3 and updated all the code accordingly, marking the 'hard' test cases as xfail to unblock further porting.
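For illustration only, here is a rough sketch of that browser-style scheme using Python 3's urllib.parse (this is not Scrapy's actual code); the page_encoding argument stands in for the response encoding, and userinfo, ports and fragments are ignored for brevity:

```python
from urllib.parse import urlsplit, urlunsplit, quote

def browser_style_encode(url, page_encoding='utf-8'):
    """Roughly how browsers serialize a non-ASCII URL: IDNA for the host,
    UTF-8 + percent-escaping for the path, and the page encoding +
    percent-escaping for the query."""
    parts = urlsplit(url)
    host = parts.hostname.encode('idna').decode('ascii')
    path = quote(parts.path.encode('utf-8'), safe='/%')
    query = quote(parts.query.encode(page_encoding, 'replace'), safe='=&%')
    return urlunsplit((parts.scheme, host, path, query, parts.fragment))

print(browser_style_encode(u'http://пример.испытание/раздел?q=значение', 'cp1251'))
# http://xn--e1afmkfd.xn--80akhbyknj4f/%D1%80%D0%B0%D0%B7%D0%B4%D0%B5%D0%BB?q=%E7%ED%E0%F7%E5%ED%E8%E5
```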
I don't understand how urlparse doesn't work with bytes in Python 3. In Python 2:

Regarding the backwards compatibility,

Regarding the final encoding of the url before downloading,
I think we've seen other functions from the urllib and urlparse modules that don't work correctly; sorry, no concrete examples right now.
Besides obvious data type differences (other stuff may be unexpectedly promoted to unicode), the difference is that with urls-as-bytes users should encode them, but with urls-as-unicode scrapy encodes them, and the result may be different.
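A small made-up illustration of that difference, assuming the user would escape the value from the page's cp1251 bytes while the unicode URL ends up escaped from UTF-8 bytes (both encodings are only examples; Python 3's urllib.parse.quote does the escaping):

```python
from urllib.parse import quote

value = u'значение'
print(quote(value.encode('cp1251')))  # %E7%ED%E0%F7%E5%ED%E8%E5
print(quote(value.encode('utf-8')))   # %D0%B7%D0%BD%D0%B0%D1%87%D0%B5%D0%BD%D0%B8%D0%B5
# The two percent-escaped forms are not equivalent, so who does the encoding matters.
```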
Sorry, I didn't get what you mean; could you please clarify it?
The problem is that here:
Another example:
In the link extractor, unicode objects are encoded with

Unless I misunderstood, and what you explained about unicode urls is not done in scrapy.
My use of the extractor is the following:
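A purely hypothetical sketch of that kind of usage (the URL, body and cp1252 encoding below are invented, not taken from this report):

```python
from scrapy.http import HtmlResponse
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

# A made-up page: the href contains a character reference that the declared
# cp1252 encoding cannot represent once lxml resolves it to U+2603.
body = b'<html><body><a href="/page?q=&#9731;">snowman</a></body></html>'
response = HtmlResponse(url='http://example.com/', body=body, encoding='cp1252')

# On Python 2 a call like this could raise the UnicodeEncodeError from the
# issue title ("'charmap' codec can't encode character").
links = LxmlLinkExtractor().extract_links(response)
```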