HTML entity causes UnicodeEncodeError in LxmlLinkExtractor #998
Comments
I have a workaround:
and then pass cleaned_response to link_extractor(). I hope this means a fix will be straightforward. I am using Scrapy 0.24.4, by the way.
Can you give us a snippet to reproduce it?
Here is a snippet to reproduce:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1251"/>
</head>
<body>
<a title="unicode snowman" href="./&#9731;">
снежен
човек
</a>
</body>
</html>
from scrapy.linkextractors import LinkExtractor
# `response` is an HtmlResponse whose body is the HTML above
LinkExtractor().extract_links(response)
lxml decodes the entity inside the href attribute. See @kmike's suggestion in #1403 (comment).
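To make the decoding step concrete, here is a minimal stdlib-only sketch. It is not Scrapy's or lxml's code. It assumes the href in the page source holds a numeric character reference for the snowman (U+2603), uses `html.unescape` as a stand-in for the decoding an HTML parser performs, and assumes the decoded value is later encoded with the page's declared charset. `try_encode` is a hypothetical helper for illustration only.

```python
from html import unescape

# href as written in the page source. Assumption: the snowman is given as a
# numeric character reference, since Windows-1251 cannot represent it directly.
raw_href = "./&#9731;"

# Stand-in for what an HTML parser returns: the attribute value with the
# character reference already decoded to U+2603.
decoded_href = unescape(raw_href)


def try_encode(text, encoding):
    """Hypothetical helper: return the encoded bytes, or the error raised."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        return exc


# Encoding with the page's declared charset fails, because U+2603 has no
# Windows-1251 representation.
cp1251_result = try_encode(decoded_href, "windows-1251")

# Keeping the URL as text and encoding as UTF-8 only when bytes are needed
# does not fail.
utf8_result = try_encode(decoded_href, "utf-8")
```

The point of the sketch is only that the decoded URL can contain characters outside the page's charset, so any step that forces it back into that charset (or into ASCII) can raise UnicodeEncodeError.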
This issue is fixed in #1949, where Link objects are created with Unicode URLs (and are no longer converted to native str, which chokes in Python 2).
A page containing the HTML entity for 〈 causes an error in LxmlLinkExtractor.link_extractor()
The encoding is correct and works for displaying the page in a browser, but fails as above in Scrapy.