Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changed feed to HTMLParser to unicode instead of str. #756

Closed
wants to merge 1 commit into from

Conversation

bijzz
Copy link

@bijzz bijzz commented Jun 22, 2014

A lot of sites yielded UnicodeDecodeError when using HtmlParserLinkExtractor().extract_links(response). When the HTMLParser receives the response.body as unicode the exceptions dissappear. Maybe you can still replicate this with one or the other url i posted on stackedit (but these might work on your system depending on the system default encoding).

Also have a look at the HTMLParser documentation in Python 2 docs.python.org/2.7 stating data can be either unicode or str, but passing unicode is advised..

@kmike
Copy link
Member

kmike commented Jun 22, 2014

It'd be great to have a test case for this - it should fail before the change, but pass after it.

The code in _extract_links looks suspicious - if the body is unicode it shouldn't be necessary to pass response encoding. check _extract_link: link.text is assumed to be a bytestring there; if the input is unicode then link.text also should become unicode (I haven't checked that), and link.text.decode(response_encoding) is likely to fail - it will become equivalent to link.text.encode(sys.getdefaultencoding()).decode(response_encoding) which usually means link.text.encode('ascii').decode(response_encoding).

See also: #559. Don't want to discourage you (your change is in right direction), but I think it doesn't worth it to maintain multiple link extractor implementations.

@dangra
Copy link
Member

dangra commented Jul 2, 2014

I coincide with @kmike, it is not worth maintaining multiple link extractors and #559 result is a promising replacement to rule them all. I think it is time to deprecate other than the lxml linkextractor.

@kmike kmike mentioned this pull request Jul 2, 2014
@pablohoffman
Copy link
Member

+1 to @dangra & @kmike.

But I'm happy to change my mind if someone shows me a good reason to leave any link extractor other than the lxml-based one.

@pablohoffman
Copy link
Member

@bijzz what was the reason you decided to use the HtmlParserLinkExtractor instead of the default one?

@elacuesta
Copy link
Member

This link extractor was deprecated (and moved) in Scrapy 1.0 (https://github.com/scrapy/scrapy/blob/1.0.0/docs/news.rst#changelog, #1205).
We are now close to 2.0, I think there's a high chance this class gets removed before that.

@elacuesta elacuesta closed this Dec 20, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants