UTF-8 string breaks R-session in Windows 10 #90

Closed
amatsuo opened this issue Aug 22, 2017 · 7 comments

Comments

@amatsuo

amatsuo commented Aug 22, 2017

While investigating the cause of quanteda/spacyr#69, I found that the R session terminates when a UTF-8 string is passed to Python through r_to_py() on Windows (Windows 10).

Here is a minimal example that reproduces the issue:

library(reticulate)
# this works fine
(a <- r_to_py("my name is george orwell"))
# but this terminates the R session
(b <- r_to_py("my name is 伊能忠敬"))

Here is the session info

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reticulate_1.0

loaded via a namespace (and not attached):
[1] compiler_3.4.1 tools_3.4.1    Rcpp_0.12.12   jsonlite_1.5  

The version of python:

Python 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 18:41:36) [MSC v.1900 64 bit (AMD64)] on win32

(downloaded from python.org)

Is there any workaround?

@jjallaire
Member

This might require some changes to reticulate unless you can somehow pass a different permutation of the character vector that doesn't trigger the issue. We will investigate and fix this after our next release hits CRAN (should be the next few days).

@amatsuo
Author

amatsuo commented Aug 24, 2017

Thanks for the response. In fact, many non-ASCII characters crash the R session, so it would be great if you could fix this issue.

@antuki

antuki commented Aug 24, 2017

Hello,

Yesterday I tried a package that uses reticulate to read .msg emails through a module written in Python. For some of the emails I tested (all of which seem to contain special characters, such as Russian letters), the R session is also aborted (see the package and issue at hrbrmstr/msgxtractr#1). My problem may be linked to this issue (I am also on Windows 10, but I use reticulate's "import" function), so I am also interested in a fix :)

@jjallaire
Member

jjallaire commented Aug 25, 2017 via email

@antuki

antuki commented Aug 25, 2017

OK, I've created two example emails (.msg): one contains special characters (the same symbols as in amatsuo's example, plus a smiley) and the other doesn't.
You can download them here: https://github.com/antuki/encoding_issues

Here is the R code:

library(reticulate)
xm <- import("ExtractMsg")
msg <- xm$Message(path.expand("msg_without_special_characters.msg"))
msg$body  # no problem for this one
#> [1] "Without special characters\r\n \r\nAntuki\r\n \r\n"
msg <- xm$Message(path.expand("msg_with_special_characters.msg"))
msg$body  # R session aborted

Here is the session info, in case you need it:

sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252
[3] LC_MONETARY=French_France.1252 LC_NUMERIC=C
[5] LC_TIME=French_France.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] reticulate_1.0

loaded via a namespace (and not attached):
[1] compiler_3.4.1 parallel_3.4.1 tools_3.4.1 NLP_0.1-11 Rcpp_0.12.12
[6] slam_0.1-40 jsonlite_1.5 tm_0.7-1

@jjallaire
Member

Thank you!

What's particularly interesting about this example is that the string conversion piece appears to bypass reticulate entirely. It's almost as if the Python runtime embedded by reticulate doesn't know enough about the locale, and that might be causing the crash.

Will update this when we know more.
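
As a rough illustration of the locale hypothesis (a sketch only; it does not reproduce the crash and says nothing about what the eventual fix changes): both session infos in this thread report a Windows-1252 locale, and that code page has no representation for the characters in the failing string. The helper name `can_encode` below is hypothetical, introduced just for this example.

```python
# Sketch: check whether the strings from the original report are
# representable in Windows-1252 (the code page of both reported locales).
# `can_encode` is a hypothetical helper, not part of reticulate or Python.

def can_encode(text, codec):
    """Return True if every character of `text` can be encoded with `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


ascii_text = "my name is george orwell"  # the string that worked
cjk_text = "my name is 伊能忠敬"          # the string that ended the R session

for label, text in [("ascii", ascii_text), ("cjk", cjk_text)]:
    for codec in ("utf-8", "cp1252"):
        print(label, codec, can_encode(text, codec))
```

Only the combination of the second string with `cp1252` fails to encode, which matches the observation that plain ASCII input works while non-ASCII input does not. Whether this mismatch is the actual cause of the crash is an assumption here, not something the thread confirms.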

@jjallaire
Member

Fixed here: 8917eeb
