content() can't handle Chinese text with encoding "GB2312" correctly #209
This problem arises from the function parse_text, which presumes that every encoding name returned by iconvlist() is upper-case, or is also listed in an upper-case form.
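A minimal sketch of a case-insensitive lookup, to illustrate the kind of fix this points at. `find_encoding` is a hypothetical helper, not httr's actual code; it relies only on base R's iconvlist(), toupper() and match().

```r
# Hypothetical helper (not part of httr): look up an encoding name without
# assuming anything about the case used by iconvlist() on this platform.
find_encoding <- function(encoding, known = iconvlist()) {
  # Compare upper-cased copies, but return the name exactly as it is listed.
  hit <- match(toupper(encoding), toupper(known))
  if (is.na(hit)) NA_character_ else known[hit]
}

# Usage with an explicit list, so the result does not depend on the platform:
find_encoding("GB2312", known = c("UTF-8", "gb2312"))
```

Returning the listed spelling rather than the user's spelling means the name handed to iconv() is one the platform itself advertises.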
Additionally, in the parse_auto function, the encoding parameter isn't passed on to the parser function in the last line. That might be OK for JPEG, PNG, etc., but for HTML and XML, the parser functions in the XML package that are eventually called will not handle the text correctly, even when the documents declare their encoding themselves.
Of course, there is a workaround: content(xxx, "text") %>% htmlParse(encoding="yyy"), presuming the bug I mentioned above has been fixed, but it's not consistent.
Sorry about my second comment; I didn't make it clear enough, and it may contain mistakes.
What I meant is this: the XML package (maybe ultimately libxml2) mishandles non-English characters (at least Chinese ones), so even when the text has already been converted to UTF-8, we may get a wrong result if we don't pass the 'encoding' parameter to 'htmlParse' or 'xmlParse' explicitly.
Since you already convert all text to UTF-8 before htmlParse or xmlParse, all that's needed is to add the 'encoding="UTF-8"' parameter to each htmlParse or xmlParse call made by the parser functions.
So no longer auto-parsing text formats into text first would not help with this problem. Actually, auto-parsing text formats into text first is a beautiful design, at least from my point of view.
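The suggestion above, converting to UTF-8 first and then telling the parser explicitly that the text is UTF-8, might look roughly like this. Both function names are hypothetical and this is not httr's implementation; it assumes the XML package is installed and that the platform's iconv knows the source encoding.

```r
# Hypothetical helpers (not part of httr or XML), sketching the proposal.

# Step 1: decode the raw response bytes into a UTF-8 string.
to_utf8 <- function(bytes, from) {
  iconv(rawToChar(bytes), from = from, to = "UTF-8")
}

# Step 2: parse the converted text, passing the encoding explicitly instead
# of leaving the parser to guess it.
parse_html_utf8 <- function(bytes, from) {
  text <- to_utf8(bytes, from)
  XML::htmlParse(text, asText = TRUE, encoding = "UTF-8")
}

# The two characters of "Chinese text" as GB2312 bytes (D6 D0, CE C4):
gb_bytes <- as.raw(c(0xD6, 0xD0, 0xCE, 0xC4))
utf8ToInt(to_utf8(gb_bytes, "GB2312"))
```

Keeping the conversion and the parse as separate steps leaves the "text first" design intact; only the parser call gains an argument.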
here is an example
the html code
with encoding parameter
content(a,"text",encoding="GB2312") %>% htmlParse(encoding="UTF-8") %>% xmlRoot() %>% xmlValue()
Without the encoding parameter, we get nothing:
content(a,"text",encoding="GB2312") %>% htmlParse() %>% xmlRoot() %>% xmlValue()
GET("http://finance.sina.com.cn/china/20150704/122922591331.shtml") %>% content("text",encoding="gb2312") %>% htmlParse(encoding="utf-8") %>% xmlRoot %>% xmlValue()
Sometimes httr doesn't work without the encoding parameter:
GET("http://world.huanqiu.com/article/2015-07/6849047.html?from=bdwz") %>% content("text")
And sometimes the encoding has to be specified in both GET and htmlParse:
GET("http://www.pbc.gov.cn") %>% content("text")