[R] Encoding info losses for non-ASCII column names #335

Open
shrektan opened this Issue Apr 9, 2018 · 0 comments

shrektan commented Apr 9, 2018

If the column names contain non-ASCII characters, their encoding marks are lost when the data is read back from a local feather file: Encoding() reports "unknown" for every column name. The following example was run on my Mac.

It's even worse on Windows, where feather seems to convert the column names from the unknown encoding to the native encoding, which leads to garbage column names that can never be converted back.
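
For context, here is a minimal base-R sketch of what the encoding mark is and why losing it matters. No feather is involved, and the variable names are only illustrative. The same text has different bytes in UTF-8 and latin1, and the mark is the only thing that tells R how to interpret them.

```r
# "\u00e7" is c-cedilla; using the escape makes the literal UTF-8 in any locale.
x_utf8   <- "fa\u00e7ile"
x_latin1 <- iconv(x_utf8, from = "UTF-8", to = "latin1")

# Each string carries its own encoding mark.
Encoding(c(x_utf8, x_latin1))

# The underlying bytes differ: two bytes (c3 a7) in UTF-8, one byte (e7) in latin1.
charToRaw(x_utf8)
charToRaw(x_latin1)

# Dropping the marks leaves the bytes untouched, but R now has to guess
# (it assumes the native encoding), which is the state of the names after reading.
stripped <- c(x_utf8, x_latin1)
Encoding(stripped) <- "unknown"
Encoding(stripped)
```

Once the marks are gone, the latin1 bytes are only displayed correctly if the native encoding happens to be latin1-compatible, and the UTF-8 bytes only if the session runs in a UTF-8 locale.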

Minimal Reproducible Example

utf8_strings <- c("çile", "façile", "El. paÅ¡tas", "¡tas", "Þ")
latin1_strings <- iconv(utf8_strings, from = "UTF-8", to = "latin1")
tbl <- data.frame(utf8_strings, latin1_strings, stringsAsFactors = FALSE)
colnames(tbl) <- c(utf8_strings[2], latin1_strings[2])
tbl2 <- local({
  tmp_file <- tempfile(fileext = ".feather")
  on.exit(unlink(tmp_file), add = TRUE)
  feather::write_feather(tbl, tmp_file)
  feather::read_feather(tmp_file)
})
colnames(tbl)
#> [1] "façile" "façile"
colnames(tbl2)
#> [1] "façile"    "fa\xe7ile" ############SEE HERE############
Encoding(colnames(tbl))
#> [1] "UTF-8"  "latin1"
Encoding(colnames(tbl2))
#> [1] "unknown" "unknown"
Encoding(colnames(tbl2)) <- c("UTF-8", "latin1")
colnames(tbl2)
#> [1] "façile" "façile"

sessionInfo()

#> R version 3.4.3 (2017-11-30)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS High Sierra 10.13.4
#> 
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_0.12.16    digest_0.6.15   rprojroot_1.3-2 backports_1.1.2
#>  [5] formatR_1.5     magrittr_1.5    evaluate_0.10.1 pillar_1.2.1   
#>  [9] rlang_0.2.0     stringi_1.1.7   rmarkdown_1.9   tools_3.4.3    
#> [13] stringr_1.3.0   feather_0.3.1   hms_0.4.2       yaml_2.1.18    
#> [17] compiler_3.4.3  pkgconfig_2.0.1 htmltools_0.3.6 knitr_1.20     
#> [21] tibble_1.4.2

On Windows, the output becomes:

> colnames(tbl)
[1] "façile"    "fa<e7>ile"
> colnames(tbl2)
[1] "fa<U+00E7>ile" "fa<e7>ile"    
> Encoding(colnames(tbl))
[1] "UTF-8"  "latin1"
> Encoding(colnames(tbl2))
[1] "unknown" "unknown"
> Encoding(colnames(tbl2)) <- c("UTF-8", "latin1")
> colnames(tbl2)
[1] "fa<U+00E7>ile" "fa<e7>ile"  ######NOTICE THE FIRST ONE###########
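
A possible user-side workaround, sketched under assumptions: translate all column names to UTF-8 before writing, so that a single mark can be restored after reading. The helpers names_to_utf8(), mark_names_utf8() and strip_marks() are hypothetical and not part of feather. strip_marks() only simulates the loss of marks seen in the macOS output above, so the block runs without feather installed.

```r
# Hypothetical helper: translate every column name to UTF-8 before writing.
names_to_utf8 <- function(df) {
  names(df) <- enc2utf8(names(df))
  df
}

# Hypothetical helper: after reading, declare the name bytes to be UTF-8.
# ASCII-only names are unaffected and stay "unknown".
mark_names_utf8 <- function(df) {
  nms <- names(df)
  Encoding(nms) <- "UTF-8"
  names(df) <- nms
  df
}

# Stand-in for the write/read round trip (an assumption based on the macOS
# output): the bytes survive, only the encoding marks are dropped.
strip_marks <- function(df) {
  nms <- names(df)
  Encoding(nms) <- "unknown"
  names(df) <- nms
  df
}

tbl <- data.frame(a = 1:2, b = 3:4)
names(tbl) <- c("fa\u00e7ile",
                iconv("fa\u00e7ile2", from = "UTF-8", to = "latin1"))

out <- mark_names_utf8(strip_marks(names_to_utf8(tbl)))
Encoding(names(out))

# With feather installed, the same idea would look like:
#   feather::write_feather(names_to_utf8(tbl), path)
#   tbl2 <- mark_names_utf8(feather::read_feather(path))
```

This only matches the macOS behaviour shown earlier. On Windows the UTF-8 name already comes back as "fa<U+00E7>ile" and re-marking does not recover it, so the sketch is not expected to help there.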