Changed feed to HTMLParser to unicode instead of str. #756

bijzz · 2014-06-22T20:07:30Z

A lot of sites raised UnicodeDecodeError when using HtmlParserLinkExtractor().extract_links(response). When HTMLParser receives response.body as unicode, the exceptions disappear. You may still be able to reproduce this with one of the URLs I posted on stackedit (though these might work on your system, depending on the system default encoding).

Also have a look at the HTMLParser documentation in the Python 2 docs (docs.python.org/2.7), which states that data can be either unicode or str, but that passing unicode is advised.
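
A minimal sketch of the idea (decode the body before feeding the parser), not the code from this PR. It assumes Python 3's `html.parser` rather than the Python 2 `HTMLParser` module discussed here, and `LinkCollector`, `extract_hrefs` and the sample body are hypothetical names made up for illustration:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical minimal parser that records the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_hrefs(body, encoding="utf-8"):
    # Decode the raw bytes first, so the parser only ever sees text.
    # The encoding is an assumption here; a real response would supply its own.
    text = body.decode(encoding, errors="replace")
    parser = LinkCollector()
    parser.feed(text)
    parser.close()
    return parser.links


# Sample body containing a non-ASCII character in the link.
raw_body = '<a href="/caf\u00e9">menu</a>'.encode("utf-8")
hrefs = extract_hrefs(raw_body)
```

Decoding once at the boundary keeps every later step working on text, which is what avoids the implicit conversions behind the UnicodeDecodeError.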

kmike · 2014-06-22T21:53:12Z

It'd be great to have a test case for this - it should fail before the change, but pass after it.

The code in _extract_links looks suspicious - ~~if the body is unicode it shouldn't be necessary to pass response encoding.~~ check _extract_link: link.text is assumed to be a bytestring there; if the input is unicode then link.text also should become unicode (I haven't checked that), and link.text.decode(response_encoding) is likely to fail - it will become equivalent to link.text.encode(sys.getdefaultencoding()).decode(response_encoding) which usually means link.text.encode('ascii').decode(response_encoding).
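
The failure mode described above can be sketched as follows. This is an illustration only: it runs on Python 3, which has no `unicode.decode`, so the hypothetical helper `py2_style_decode` spells out the implicit encode step that Python 2 performs with the default codec (assumed to be ASCII here):

```python
def py2_style_decode(text, response_encoding, default_encoding="ascii"):
    """Emulate Python 2's unicode.decode(): an implicit encode with the
    default codec, followed by the requested decode."""
    return text.encode(default_encoding).decode(response_encoding)


ascii_text = "Home"
non_ascii_text = "Caf\u00e9"

# Pure-ASCII link text survives the round trip unchanged.
ok = py2_style_decode(ascii_text, "utf-8")

# Non-ASCII link text fails in the implicit encode step,
# before the response encoding is ever consulted.
try:
    py2_style_decode(non_ascii_text, "utf-8")
    failed = False
except UnicodeEncodeError:
    failed = True
```

So if link.text becomes unicode, the decode call only keeps working for pages whose link text happens to be pure ASCII.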

See also: #559. I don't want to discourage you (your change is in the right direction), but I think it isn't worth maintaining multiple link extractor implementations.

dangra · 2014-07-02T15:48:26Z

I agree with @kmike: it is not worth maintaining multiple link extractors, and the result of #559 is a promising replacement to rule them all. I think it is time to deprecate every link extractor other than the lxml one.

pablohoffman · 2014-07-02T20:55:05Z

+1 to @dangra & @kmike.

But I'm happy to change my mind if someone shows me a good reason to leave any link extractor other than the lxml-based one.

pablohoffman · 2014-07-02T20:56:40Z

@bijzz what was the reason you decided to use the HtmlParserLinkExtractor instead of the default one?

elacuesta · 2019-12-20T23:33:34Z

This link extractor was deprecated (and moved) in Scrapy 1.0 (https://github.com/scrapy/scrapy/blob/1.0.0/docs/news.rst#changelog, #1205).
We are now close to 2.0, and I think there's a high chance this class gets removed before then.

Changed feed to HTMLParser to unicode instead of str.

613c92b

kmike mentioned this pull request Jul 2, 2014

Deprecate SgmlLinkExtractor #777

Merged

kmike mentioned this pull request Aug 29, 2014

SgmlLinkExtractor - fix for parsing <area> tag with Unicode present #865

Merged

elacuesta closed this Dec 20, 2019
