Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encoding, std::string, and Rcpp::String #56

Closed
knapply opened this issue Nov 3, 2019 · 7 comments
Closed

Encoding, std::string, and Rcpp::String #56

knapply opened this issue Nov 3, 2019 · 7 comments

Comments

@knapply
Copy link
Contributor

@knapply knapply commented Nov 3, 2019

Thank you so much for working on this.

I took the package for a test drive and noticed it encounters an issue I haven't been able to solve. Perhaps you have some ideas.

My understanding is that by using std::string, strings will automatically use native encoding when they are returned to R (at least on Windows). The result is that strings can be irreparably mangled.

I haven't found a way to avoid this with std::string and have stuck to Rcpp::String myself.

I posed the question on StackOverflow with examples at the C++ level: https://stackoverflow.com/questions/58126425/rcppstring-keep-utf-8-encoding-but-stdstring-does-not

and here's an example of the behavior with {jsonify}:

example_json <- '{"name":"回收站"}'

jsonify::from_json(example_json)
#> $name
#> [1] "回收站"

sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18362)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_3.6.1  magrittr_1.5    tools_3.6.1     htmltools_0.3.6
#>  [5] yaml_2.2.0      Rcpp_1.0.2      stringi_1.4.3   rmarkdown_1.16 
#>  [9] highr_0.8       knitr_1.25      stringr_1.4.0   xfun_0.10      
#> [13] digest_0.6.22   evaluate_0.14
SymbolixAU added a commit that referenced this issue Nov 3, 2019
@SymbolixAU
Copy link
Owner

@SymbolixAU SymbolixAU commented Nov 3, 2019

Ah that's interesting! I hadn't picked up on this issue because I've been testing on MacOS and Ubuntu, where it works

Screen Shot 2019-11-04 at 8 01 41 am

Am I right in understanding in the link to the Rcpp discussion where it suggests using CharacterVector in place of StringVector?

If that's the case I've just created a branch where I've changed all instances of StringVector to CharacterVector. Can you test if this works?

devtools::install_github("SymbolixAU/jsonify", ref = "issue56")
@knapply
Copy link
Contributor Author

@knapply knapply commented Nov 3, 2019

Thanks for the effort, but unfortunately it didn't work.

I've never used StringVector before, but after taking a quick peak it's just an alias...

https://github.com/RcppCore/Rcpp/blob/f3c5a34e06e774532227470b01c63a8f08ce4313/inst/include/Rcpp/vector/instantiation.h#L34

typedef Vector<STRSXP> CharacterVector ;
typedef Vector<STRSXP> StringVector ;

As far as I can tell, std::strings need to be explicitly converted to Rcpp::Strings for it to work.

#include <Rcpp.h>

// [[Rcpp::export]]
Rcpp::List test_encoding() {
  std::string std_string("回收站");
  
  Rcpp::List out(3);
  
  out[0] = std_string;
  out[1] = Rcpp::String(std_string);
  out[2] = Rcpp::String(std_string, CE_UTF8);
  
  out.attr("names") = Rcpp::CharacterVector::create(
    "auto wrapped", "no encoding", "explicit encoding"
  );
  
  return out;
}

/*** R
test_encoding()

#>$`auto wrapped`
#>[1] "回收站"
#>
#>$`no encoding`
#>[1] "回收站"
#>
#>$`explicit encoding`
#>[1] "回收站"
*/

This discussion is also relevant, RcppCore/Rcpp#263. Based on that, I'm wondering if there is indeed a macro floating around that I'm missing, so that's what I'm looking for next.

@ChrisMuir
Copy link
Collaborator

@ChrisMuir ChrisMuir commented Nov 4, 2019

Wanted to chime in and echo what knapply said. I dealt with this in past, and found this helpful (object holding is a deque of std::string). That example is related to this issue.

@dcooley
Copy link
Collaborator

@dcooley dcooley commented Nov 4, 2019

@ChrisMuir I don't suppose you or @knapply are able to make a fix for this? I don't have access to a Windows machine so won't get a chance to test & fix in the near future? Happy to help find the relevant pieces of code though.

@ChrisMuir
Copy link
Collaborator

@ChrisMuir ChrisMuir commented Nov 4, 2019

I have an old PC that I haven't touched in years. If it will boot, I can take a stab at this tonight. I'll update my progress here.

@knapply
Copy link
Contributor Author

@knapply knapply commented Nov 4, 2019

I didn't notice anything that they wouldn't play along with, so I wrapped all the string-related operations that eventually return to R in Rcpp::String().

That (seems) to be all that was needed; PR inbound: #57

example_json <- '{"name":"回收站","arabic_alphabet":"غ ظ ض ذ خ ث ت ش ر ق ص ف ع س ن م ل ك ي ط ح ز و ه د ج ب أ"}'

jsonify::from_json(example_json)
#> $name
#> [1] "回收站"
#> 
#> $arabic_alphabet
#> [1] "غ ظ ض ذ خ ث ت ش ر ق ص ف ع س ن م ل ك ي ط ح ز و ه د ج ب أ"

jsonify::from_json(example_json, simplify = FALSE)
#> $name
#> [1] "回收站"
#> 
#> $arabic_alphabet
#> [1] "غ ظ ض ذ خ ث ت ش ر ق ص ف ع س ن م ل ك ي ط ح ز و ه د ج ب أ"

temp_file <- tempfile(fileext = ".json")
readr::write_file(example_json, temp_file)
jsonify::from_json(temp_file)
#> $name
#> [1] "回收站"
#> 
#> $arabic_alphabet
#> [1] "غ ظ ض ذ خ ث ت ش ر ق ص ف ع س ن م ل ك ي ط ح ز و ه د ج ب أ"

sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18362)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#> [1] jsonify_1.0.0          readr_1.3.1            compiler_3.6.1         backports_1.1.5        R6_2.4.0              
#> [6] hms_0.5.1              tools_3.6.1            pillar_1.4.2.9001      rstudioapi_0.10.0-9000 tibble_2.1.3          
#> [11] crayon_1.3.4           Rcpp_1.0.2             vctrs_0.2.0            zeallot_0.1.0          packrat_0.5.0         
#> [16] pkgconfig_2.0.3        rlang_0.4.1  
@SymbolixAU SymbolixAU closed this Nov 5, 2019
@SymbolixAU
Copy link
Owner

@SymbolixAU SymbolixAU commented Nov 5, 2019

closed in #57

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
4 participants
You can’t perform that action at this time.