Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encoding, std::string, and Rcpp::String #56

Closed
knapply opened this issue Nov 3, 2019 · 7 comments
Closed

Encoding, std::string, and Rcpp::String #56

knapply opened this issue Nov 3, 2019 · 7 comments

Comments

@knapply
Copy link
Contributor

knapply commented Nov 3, 2019

Thank you so much for working on this.

I took the package for a test drive and noticed it encounters an issue I haven't been able to solve. Perhaps you have some ideas.

My understanding is that by using std::string, strings will automatically use native encoding when they are returned to R (at least on Windows). The result is that strings can be irreparably mangled.

I haven't found a way to avoid this with std::string and have stuck to Rcpp::String myself.

I posed the question on StackOverflow with examples at the C++ level: https://stackoverflow.com/questions/58126425/rcppstring-keep-utf-8-encoding-but-stdstring-does-not

and here's an example of the behavior with {jsonify}:

example_json <- '{"name":"回收站"}'

jsonify::from_json(example_json)
#> $name
#> [1] "回收站"

sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18362)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_3.6.1  magrittr_1.5    tools_3.6.1     htmltools_0.3.6
#>  [5] yaml_2.2.0      Rcpp_1.0.2      stringi_1.4.3   rmarkdown_1.16 
#>  [9] highr_0.8       knitr_1.25      stringr_1.4.0   xfun_0.10      
#> [13] digest_0.6.22   evaluate_0.14
SymbolixAU pushed a commit that referenced this issue Nov 3, 2019
@SymbolixAU
Copy link
Collaborator

Ah that's interesting! I hadn't picked up on this issue because I've been testing on MacOS and Ubuntu, where it works

Screen Shot 2019-11-04 at 8 01 41 am

Am I right in understanding in the link to the Rcpp discussion where it suggests using CharacterVector in place of StringVector?

If that's the case I've just created a branch where I've changed all instances of StringVector to CharacterVector. Can you test if this works?

devtools::install_github("SymbolixAU/jsonify", ref = "issue56")

@knapply
Copy link
Contributor Author

knapply commented Nov 3, 2019

Thanks for the effort, but unfortunately it didn't work.

I've never used StringVector before, but after taking a quick peak it's just an alias...

https://github.com/RcppCore/Rcpp/blob/f3c5a34e06e774532227470b01c63a8f08ce4313/inst/include/Rcpp/vector/instantiation.h#L34

typedef Vector<STRSXP> CharacterVector ;
typedef Vector<STRSXP> StringVector ;

As far as I can tell, std::strings need to be explicitly converted to Rcpp::Strings for it to work.

#include <Rcpp.h>

// [[Rcpp::export]]
Rcpp::List test_encoding() {
  std::string std_string("回收站");
  
  Rcpp::List out(3);
  
  out[0] = std_string;
  out[1] = Rcpp::String(std_string);
  out[2] = Rcpp::String(std_string, CE_UTF8);
  
  out.attr("names") = Rcpp::CharacterVector::create(
    "auto wrapped", "no encoding", "explicit encoding"
  );
  
  return out;
}

/*** R
test_encoding()

#>$`auto wrapped`
#>[1] "回收站"
#>
#>$`no encoding`
#>[1] "回收站"
#>
#>$`explicit encoding`
#>[1] "回收站"
*/

This discussion is also relevant, RcppCore/Rcpp#263. Based on that, I'm wondering if there is indeed a macro floating around that I'm missing, so that's what I'm looking for next.

@ChrisMuir
Copy link
Contributor

Wanted to chime in and echo what knapply said. I dealt with this in past, and found this helpful (object holding is a deque of std::string). That example is related to this issue.

@dcooley
Copy link
Collaborator

dcooley commented Nov 4, 2019

@ChrisMuir I don't suppose you or @knapply are able to make a fix for this? I don't have access to a Windows machine so won't get a chance to test & fix in the near future? Happy to help find the relevant pieces of code though.

@ChrisMuir
Copy link
Contributor

I have an old PC that I haven't touched in years. If it will boot, I can take a stab at this tonight. I'll update my progress here.

@knapply
Copy link
Contributor Author

knapply commented Nov 4, 2019

I didn't notice anything that they wouldn't play along with, so I wrapped all the string-related operations that eventually return to R in Rcpp::String().

That (seems) to be all that was needed; PR inbound: #57

example_json <- '{"name":"回收站","arabic_alphabet":"غ ظ ض ذ خ ث ت ش ر ق ص ف ع س ن م ل ك ي ط ح ز و ه د ج ب أ"}'

jsonify::from_json(example_json)
#> $name
#> [1] "回收站"
#> 
#> $arabic_alphabet
#> [1] "غ ظ ض ذ خ ث ت ش ر ق ص ف ع س ن م ل ك ي ط ح ز و ه د ج ب أ"

jsonify::from_json(example_json, simplify = FALSE)
#> $name
#> [1] "回收站"
#> 
#> $arabic_alphabet
#> [1] "غ ظ ض ذ خ ث ت ش ر ق ص ف ع س ن م ل ك ي ط ح ز و ه د ج ب أ"

temp_file <- tempfile(fileext = ".json")
readr::write_file(example_json, temp_file)
jsonify::from_json(temp_file)
#> $name
#> [1] "回收站"
#> 
#> $arabic_alphabet
#> [1] "غ ظ ض ذ خ ث ت ش ر ق ص ف ع س ن م ل ك ي ط ح ز و ه د ج ب أ"

sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18362)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#> [1] jsonify_1.0.0          readr_1.3.1            compiler_3.6.1         backports_1.1.5        R6_2.4.0              
#> [6] hms_0.5.1              tools_3.6.1            pillar_1.4.2.9001      rstudioapi_0.10.0-9000 tibble_2.1.3          
#> [11] crayon_1.3.4           Rcpp_1.0.2             vctrs_0.2.0            zeallot_0.1.0          packrat_0.5.0         
#> [16] pkgconfig_2.0.3        rlang_0.4.1  

@SymbolixAU
Copy link
Collaborator

closed in #57

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants