Encoding problem on Windows with non-US English locale #54

Closed
jennybc opened this Issue Oct 12, 2015 · 13 comments

5 participants
@jennybc
Member

jennybc commented Oct 12, 2015

I have various reports of problems with listing Google Sheets or specifying worksheet names or (possibly) reading Sheet data from [1] Windows and [2] non-US English locales. Examples: Spanish_Spain.1252, Danish_Denmark.1252 and ... someone in Colombia who didn't provide session info.

I think all the problematic text has been successfully processed with httr::content(as = "text", encoding = "UTF-8") but then a problem is introduced in xml2::read_xml().

The most relevant issue is jennybc/googlesheets#151. I apologize for some noise there -- a recent example Sheet posted there seems to have a different problem. If you look there, focus on the comments from @krose. He has done a lot of digging and posed a related question on Stack Overflow. From his work it seems the problem might come from

read_xml.raw(charToRaw(enc2utf8(x)), "UTF-8", ..., as_html = as_html)

@krose

krose commented Oct 12, 2015

I think it's the doc_parse_raw function that uses CP1252 to parse UTF-8.

See this comment that describes the issue, but with base functions.
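
The kind of mis-decoding suspected here can be illustrated with base functions alone. The sketch below is a hypothetical illustration, not xml2's internal code: it takes the UTF-8 bytes of "vægt" and deliberately decodes them as CP1252, which is an assumption about what goes wrong rather than a confirmed diagnosis.

```r
# Hypothetical illustration in base R; this is not xml2's internal code.
x <- "v\u00e6gt"                       # "vægt", written with a Unicode escape
utf8_bytes <- charToRaw(enc2utf8(x))   # 76 c3 a6 67 74: "æ" is the two bytes c3 a6

# Treat the UTF-8 bytes as if they were CP1252 text and convert them to UTF-8.
# Each of the two bytes of "æ" is then decoded as a separate character.
misread <- iconv(rawToChar(utf8_bytes), from = "CP1252", to = "UTF-8")

nchar(x, type = "chars")        # 4
nchar(misread, type = "chars")  # 5: one character became two
charToRaw(misread)              # byte-level view of the garbled result
```

Comparing character counts and raw bytes avoids relying on how the console happens to render the strings.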

@hadley

Member

hadley commented Oct 12, 2015

Could you provide a simple reproducible example for me? (i.e. an xml file that fails to parse uploaded to dropbox or similar)

@krose

krose commented Oct 12, 2015

Sure. This small script should do it. I've set the default text encoding to UTF-8 in RStudio under Tools > Global Options > Default text encoding.

library(xml2)

test_xml <- "<note><weight>vægt</weight></note>"

xml2::read_xml(test_xml)
xml2::read_xml(enc2utf8(test_xml), encoding = "UTF-8")
xml2::read_xml(test_xml, encoding = "UTF-8")

Results in:

> library(xml2)
> 
> test_xml <- "<note><weight>vægt</weight></note>"
> 
> xml2::read_xml(test_xml)
{xml_document}
<note>
[1] <weight>vægt</weight>
> xml2::read_xml(enc2utf8(test_xml), encoding = "UTF-8")
{xml_document}
<note>
[1] <weight>vægt</weight>
> xml2::read_xml(test_xml, encoding = "UTF-8")
{xml_document}
<note>
[1] <weight>vægt</weight>
@hadley

Member

hadley commented Oct 12, 2015

The char to raw stuff is completely expected, so I deleted that.

I'm pretty sure the problem is with xml_text(), not read_xml(). It looks like I'm not returning the correct string type in node_text() and in node_name()
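
In R, a string's declared encoding matters as much as its bytes. The base-R sketch below shows the general idea behind returning "the correct string type": UTF-8 bytes that arrive without an encoding mark are interpreted in the native encoding until they are marked. It is an assumption about the kind of problem meant here, not the code changed in node_text() or node_name().

```r
# Hypothetical illustration in base R; not the actual xml2 fix.
bytes <- as.raw(c(0x76, 0xc3, 0xa6, 0x67, 0x74))  # UTF-8 bytes of "vægt"

s <- rawToChar(bytes)        # bytes are right, but no encoding is declared
unmarked <- Encoding(s)      # "unknown": R will assume the native encoding

Encoding(s) <- "UTF-8"       # declare what the bytes really are
marked <- Encoding(s)        # "UTF-8"

nchar(s, type = "chars")     # 4, regardless of the session locale
identical(charToRaw(s), bytes)  # marking does not change the bytes
```

On a CP1252 session the unmarked string would be displayed using the native code page, which is why the same bytes can look correct on one machine and garbled on another.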

@hadley hadley closed this in 67e1704 Oct 12, 2015

@hadley

Member

hadley commented Oct 12, 2015

I didn't actually test it, but I'm pretty sure this should fix the problem.

@krose

krose commented Oct 12, 2015

I just updated the package to the master version on GitHub, ran the script again, and the issue is still there:

> library(xml2)
> 
> test_xml <- "<note><weight>vægt</weight></note>"
> 
> xml2::read_xml(test_xml)
{xml_document}
<note>
[1] <weight>vægt</weight>
> xml2::read_xml(enc2utf8(test_xml), encoding = "UTF-8")
{xml_document}
<note>
[1] <weight>vægt</weight>
> xml2::read_xml(test_xml, encoding = "UTF-8")
{xml_document}
<note>
[1] <weight>vægt</weight>
> 
> charToRaw(enc2utf8(test_xml))
 [1] 3c 6e 6f 74 65 3e 3c 77 65 69 67 68 74 3e 76 c3 a6 67 74 3c 2f 77 65 69 67 68 74 3e 3c 2f 6e 6f 74 65 3e
> charToRaw(test_xml)
 [1] 3c 6e 6f 74 65 3e 3c 77 65 69 67 68 74 3e 76 e6 67 74 3c 2f 77 65 69 67 68 74 3e 3c 2f 6e 6f 74 65 3e
> 
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Danish_Denmark.1252  LC_CTYPE=Danish_Denmark.1252    LC_MONETARY=Danish_Denmark.1252
[4] LC_NUMERIC=C                    LC_TIME=Danish_Denmark.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] xml2_0.1.2.9000

loaded via a namespace (and not attached):
[1] tools_3.2.2 Rcpp_0.12.1
> devtools::session_info()
Session info ---------------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.2.2 (2015-08-14)
 system   x86_64, mingw32             
 ui       RStudio (0.99.441)          
 language (EN)                        
 collate  Danish_Denmark.1252         
 tz       Europe/Berlin               
 date     2015-10-12                  

Packages -------------------------------------------------------------------------------------------------------------
 package  * version    date       source                          
 devtools   1.9.0      2015-07-25 Github (hadley/devtools@2881db5)
 digest     0.6.8      2014-12-31 CRAN (R 3.2.1)                  
 memoise    0.2.1      2014-04-22 CRAN (R 3.2.1)                  
 Rcpp       0.12.1     2015-09-10 CRAN (R 3.2.2)                  
 xml2     * 0.1.2.9000 2015-10-12 Github (hadley/xml2@21fbf96)    

@jeroen jeroen reopened this Oct 12, 2015

@jeroen

Member

jeroen commented Oct 12, 2015

@krose I pushed a fix. Can you test if this works?
@hadley I'm not sure if this is the most elegant solution, feel free to rewrite.

@hadley

Member

hadley commented Oct 12, 2015

@jeroenooms that looks good to me

@krose

krose commented Oct 12, 2015

@jeroenooms @hadley @jennybc It works like a charm. Thanks! This also fixed the issue in googlesheets.

> library(xml2)
> 
> test_xml <- "<note><weight>vægt</weight></note>"
> 
> xml2::read_xml(test_xml)
{xml_document}
<note>
[1] <weight>vægt</weight>
> xml2::read_xml(enc2utf8(test_xml), encoding = "UTF-8")
{xml_document}
<note>
[1] <weight>vægt</weight>
> xml2::read_xml(test_xml, encoding = "UTF-8")
{xml_document}
<note>
[1] <weight>vægt</weight>

@katossky

katossky commented Feb 12, 2017

The problem seems to happen again :'(

library(xml2)
write('<div>Léa Ravon</div>', file='test.html')
xml_text(read_html('test.html'))

gives:

[1] "Léa Ravon"

R version 3.1.3 (2015-03-09)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)

@krose

krose commented Feb 13, 2017

Just for info: I ran @katossky's example with the CRAN and GitHub versions of xml2 on Windows and it's not an issue here.

@hadley

Member

hadley commented Feb 13, 2017

@katossky You need to provide a bit more evidence that the problem is with xml2, and not with write. Regardless, please create a new issue and use the reprex package to make your reprex.
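
One way to gather that evidence is to look at the bytes on disk before any parser touches them. The sketch below uses only base R; the temporary file and the explicit UTF-8 write are assumptions made for illustration and are not part of the original example.

```r
# Hypothetical check in base R: which bytes actually ended up in the file?
path <- tempfile(fileext = ".html")

# Write known UTF-8 bytes so the file content does not depend on the locale.
writeLines(enc2utf8("<div>L\u00e9a Ravon</div>"), path, useBytes = TRUE)

bytes <- readBin(path, what = "raw", n = file.size(path))

# "é" is c3 a9 in UTF-8 but a single e9 in Latin-1 / CP1252.
idx <- which(bytes == as.raw(0xc3))
has_utf8_e_acute <- length(idx) > 0 && any(bytes[idx + 1] == as.raw(0xa9))
has_latin1_e_acute <- any(bytes == as.raw(0xe9))

has_utf8_e_acute    # TRUE for this file
has_latin1_e_acute  # FALSE for this file
```

Running the same byte check on a file produced by write(), or on an external file, shows whether the file itself is UTF-8, which helps separate a writing problem from a parsing problem.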

@katossky

katossky commented Feb 13, 2017

The problem first occurred with an external file I did not write. You can open test.html in any text editor to check its encoding. I'll open a new issue.
