Encoding problem on Windows with non-US English locale #54

Closed
jennybc opened this issue Oct 12, 2015 · 13 comments

@jennybc
Member

jennybc commented Oct 12, 2015

I have various reports of problems with listing Google Sheets or specifying worksheet names or (possibly) reading Sheet data from [1] Windows and [2] non-US English locales. Examples: Spanish_Spain.1252, Danish_Denmark.1252 and ... someone in Colombia who didn't provide session info.

I think all the problematic text has been successfully processed with httr::content(as = "text", encoding = "UTF-8") but then a problem is introduced in xml2::read_xml().

The most relevant issue is jennybc/googlesheets#151. I apologize for some noise there -- a recent example Sheet posted seems to have a different problem. If you look there, focus on the comments from @krose. He has done a lot of digging and posed a related question on Stack Overflow. From his work, it seems the problem might come from

read_xml.raw(charToRaw(enc2utf8(x)), "UTF-8", ..., as_html = as_html)

@krose

krose commented Oct 12, 2015

I think it's the doc_parse_raw function that uses CP1252 to parse UTF-8.

See this comment that describes the issue, but with base functions.
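
The suspected mechanism can be illustrated with base R alone. This is a minimal sketch, not xml2's internal code: it assumes the platform's iconv recognizes the name "CP1252", and the variable names are made up for illustration. It takes the UTF-8 bytes of "vægt" and decodes them once as CP1252 and once as UTF-8.

```r
# UTF-8 bytes of "vægt"; the "æ" is the two-byte sequence c3 a6.
utf8_bytes <- as.raw(c(0x76, 0xc3, 0xa6, 0x67, 0x74))
s <- rawToChar(utf8_bytes)

# Wrong: treat the bytes as CP1252, so c3 and a6 become two separate characters.
misread <- iconv(s, from = "CP1252", to = "UTF-8")

# Right: declare the bytes to be UTF-8, so c3 a6 is read as one character.
correct <- s
Encoding(correct) <- "UTF-8"

nchar(misread)  # 5 characters: one too many
nchar(correct)  # 4 characters
```

If a parser made the first choice on a CP1252 system, the extra character is what would show up as garbled text in place of "æ".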

@hadley
Member

hadley commented Oct 12, 2015

Could you provide a simple reproducible example for me? (i.e. an xml file that fails to parse uploaded to dropbox or similar)

@krose

krose commented Oct 12, 2015

Sure. This small script should do it. I've set the default text encoding to UTF-8 in RStudio (Tools > Global Options > Default text encoding).

library(xml2)

test_xml <- "<note><weight>vægt</weight></note>"

xml2::read_xml(test_xml)
xml2::read_xml(enc2utf8(test_xml), encoding = "UTF-8")
xml2::read_xml(test_xml, encoding = "UTF-8")

Results in:

> library(xml2)
> 
> test_xml <- "<note><weight>vægt</weight></note>"
> 
> xml2::read_xml(test_xml)
{xml_document}
<note>
[1] <weight>vægt</weight>
> xml2::read_xml(enc2utf8(test_xml), encoding = "UTF-8")
{xml_document}
<note>
[1] <weight>vægt</weight>
> xml2::read_xml(test_xml, encoding = "UTF-8")
{xml_document}
<note>
[1] <weight>vægt</weight>

@hadley
Member

hadley commented Oct 12, 2015

The char to raw stuff is completely expected, so I deleted that.

I'm pretty sure the problem is with xml_text(), not read_xml(). It looks like I'm not returning the correct string type in node_text() and in node_name().
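
For context on "string type": R strings carry an encoding mark, and a string that holds UTF-8 bytes without that mark is interpreted in the session's native encoding (CP1252 in the locales reported above). The sketch below shows the mark using base R only. It illustrates the concept under that assumption and is not the code in node_text() or node_name().

```r
bytes <- as.raw(c(0x76, 0xc3, 0xa6, 0x67, 0x74))  # "vægt" encoded as UTF-8

# rawToChar() returns a string with no encoding mark.
unmarked <- rawToChar(bytes)
Encoding(unmarked)  # "unknown"

# Declaring the encoding changes only the mark, not the bytes.
marked <- unmarked
Encoding(marked) <- "UTF-8"
Encoding(marked)    # "UTF-8"
same_bytes <- identical(charToRaw(unmarked), charToRaw(marked))
```

Both strings hold identical bytes. Only the marked one is guaranteed to be displayed as "vægt" regardless of the locale.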

hadley closed this as completed in 67e1704 Oct 12, 2015
@hadley
Member

hadley commented Oct 12, 2015

I didn't actually test it, but I'm pretty sure this should fix the problem.

@krose

krose commented Oct 12, 2015

I just updated the package to the master version on GitHub and ran the script again, and the issue is still there:

> library(xml2)
> 
> test_xml <- "<note><weight>vægt</weight></note>"
> 
> xml2::read_xml(test_xml)
{xml_document}
<note>
[1] <weight>vægt</weight>
> xml2::read_xml(enc2utf8(test_xml), encoding = "UTF-8")
{xml_document}
<note>
[1] <weight>vægt</weight>
> xml2::read_xml(test_xml, encoding = "UTF-8")
{xml_document}
<note>
[1] <weight>vægt</weight>
> 
> charToRaw(enc2utf8(test_xml))
 [1] 3c 6e 6f 74 65 3e 3c 77 65 69 67 68 74 3e 76 c3 a6 67 74 3c 2f 77 65 69 67 68 74 3e 3c 2f 6e 6f 74 65 3e
> charToRaw(test_xml)
 [1] 3c 6e 6f 74 65 3e 3c 77 65 69 67 68 74 3e 76 e6 67 74 3c 2f 77 65 69 67 68 74 3e 3c 2f 6e 6f 74 65 3e
> 
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Danish_Denmark.1252  LC_CTYPE=Danish_Denmark.1252    LC_MONETARY=Danish_Denmark.1252
[4] LC_NUMERIC=C                    LC_TIME=Danish_Denmark.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] xml2_0.1.2.9000

loaded via a namespace (and not attached):
[1] tools_3.2.2 Rcpp_0.12.1
> devtools::session_info()
Session info ---------------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.2.2 (2015-08-14)
 system   x86_64, mingw32             
 ui       RStudio (0.99.441)          
 language (EN)                        
 collate  Danish_Denmark.1252         
 tz       Europe/Berlin               
 date     2015-10-12                  

Packages -------------------------------------------------------------------------------------------------------------
 package  * version    date       source                          
 devtools   1.9.0      2015-07-25 Github (hadley/devtools@2881db5)
 digest     0.6.8      2014-12-31 CRAN (R 3.2.1)                  
 memoise    0.2.1      2014-04-22 CRAN (R 3.2.1)                  
 Rcpp       0.12.1     2015-09-10 CRAN (R 3.2.2)                  
 xml2     * 0.1.2.9000 2015-10-12 Github (hadley/xml2@21fbf96)    

jeroen reopened this Oct 12, 2015
@jeroen
Member

jeroen commented Oct 12, 2015

@krose I pushed a fix. Can you test if this works?
@hadley I'm not sure if this is the most elegant solution, feel free to rewrite.

@hadley
Member

hadley commented Oct 12, 2015

@jeroenooms that looks good to me

@krose

krose commented Oct 12, 2015

@jeroenooms @hadley @jennybc It works like a charm. Thanks! This also fixed the issue in googlesheets.

> library(xml2)
> 
> test_xml <- "<note><weight>vægt</weight></note>"
> 
> xml2::read_xml(test_xml)
{xml_document}
<note>
[1] <weight>vægt</weight>
> xml2::read_xml(enc2utf8(test_xml), encoding = "UTF-8")
{xml_document}
<note>
[1] <weight>vægt</weight>
> xml2::read_xml(test_xml, encoding = "UTF-8")
{xml_document}
<note>
[1] <weight>vægt</weight>

@katossky

The problem seems to happen again :'(

library(xml2)
write('<div>Léa Ravon</div>', file='test.html')
xml_text(read_html('test.html'))

gives:

[1] "Léa Ravon"

R version 3.1.3 (2015-03-09)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)

@krose

krose commented Feb 13, 2017

Just for info: I ran @katossky's example with the CRAN and GitHub versions of xml2 on Windows, and it's not an issue here.

@hadley
Member

hadley commented Feb 13, 2017

@katossky You need to provide a bit more evidence that the problem is with xml2, and not with write. Regardless, please create a new issue and use the reprex package to make your reprex.

@katossky

The problem first occurred with an external file I did not write. You can open test.html in any text editor to check its encoding. I will open a new issue.
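
One way to separate the two possibilities raised here (a file stored in an unexpected encoding versus a parsing problem) is to inspect the file's raw bytes. This is a minimal base-R sketch. It assumes a temporary file rather than test.html and forces UTF-8 on write, which the original example did not do.

```r
path <- tempfile(fileext = ".html")

# Convert to UTF-8 and write the bytes unchanged, independent of the locale.
txt <- enc2utf8("<div>L\u00e9a Ravon</div>")
writeLines(txt, path, useBytes = TRUE)

# Read the file back as raw bytes; "<div>L" occupies bytes 1 to 6.
bytes <- readBin(path, what = "raw", n = file.size(path))

# Bytes 7 and 8 are c3 a9 when "é" is stored as UTF-8.
# A single e9 at byte 7 would indicate Latin-1 or CP1252 instead.
e_acute <- bytes[7:8]
```

If the bytes are c3 a9, the file is UTF-8 on disk and the writing step can be ruled out as the source of the garbling.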
