filepaths with umlauts not handled correctly #10507

Closed
4 tasks done
dpprdan opened this issue Feb 4, 2022 · 10 comments
Labels
backport Issues whose associated fixes will need to be backported for a previous release. bug locale internationalization, localization, keyboard, region, and language windows

dpprdan commented Feb 4, 2022

System details

RStudio Edition : Desktop [Open Source]
RStudio Version : 2022.2.0.421
OS Version      : Windows 10 x64 (build 19041)
R Version       : R version 4.1.2 (2021-11-01)

Steps to reproduce the problem

RStudio 2021.09.2+382 and 2022.02.0+421 cause problems with filenames with umlauts, so that tidyverse/non-base packages do not handle them correctly:

> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
> owd <- setwd(tempdir())
> dir.create("tmp")
> # write test file
> write.csv(mtcars, "tmp/mtcärs.csv")
> # can be read fine with read.csv()
> head(read.csv("tmp/mtcärs.csv"))
                  X  mpg cyl disp  hp drat    wt  qsec vs am gear carb
1         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
4    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
5 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
6           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
> 
> # readr and data.table cannot process the filename with umlaut correctly
> readr::read_csv("tmp/mtcärs.csv")
Error in file(con, "rb") : cannot open the connection
In addition: Warning message:
In file(con, "rb") :
  cannot open file 'C:/Users/WDAGUtilityAccount/AppData/Local/Temp/RtmpA9L12X/tmp/mtc�rs.csv': No such file or directory
> data.table::fread("tmp/mtcärs.csv")
Error in data.table::fread("tmp/mtcärs.csv") : 
  File not found: tmp/mtcärs.csv
> 
> # filenames with umlauts written by readr or writexl are garbled
> readr::write_csv(mtcars, "tmp/mtcörs.csv")
> writexl::write_xlsx(mtcars, "tmp/mtcörs.xlsx")                                           
> dir("tmp")
[1] "mtc�rs.xlsx"  "mtcärs.csv"   "mtc�rs.csv"
> 
> # readxl also cannot handle filenames with umlauts
> file.rename("tmp/mtc�rs.xlsx", "tmp/mtcärs.xlsx")
[1] TRUE
> dir("tmp")
[1] "mtcärs.csv"   "mtcärs.xlsx"  "mtc�rs.csv"
> readxl::read_xlsx("tmp/mtcärs.xlsx")
Error: Evaluation error: zip file 'C:\Users\WDAGUtilityAccount\AppData\Local\Temp\RtmpA9L12X\tmp\mtc�rs.xlsx' cannot be opened.
In addition: Warning message:
In normalizePath(path.expand(path), winslash, mustWork) :
  path[1]="tmp/mtcärs.xlsx": The system cannot find the file specified

Session info
> sessioninfo::session_info()
- Session info ----------------------------------------------------------------------------
 setting  value
 version  R version 4.1.2 (2021-11-01)
 os       Windows 10 x64 (build 19041)
 system   x86_64, mingw32
 ui       RStudio
 language (EN)
 collate  English_United States.1252
 ctype    English_United States.1252
 tz       Europe/Berlin
 date     2022-02-04
 rstudio  2022.02.0+421 Prairie Trillium (desktop)
 pandoc   NA

- Packages --------------------------------------------------------------------------------
 package     * version date (UTC) lib source
 bit           4.0.4   2020-08-04 [1] CRAN (R 4.1.2)
 bit64         4.0.5   2020-08-30 [1] CRAN (R 4.1.2)
 cellranger    1.1.0   2016-07-27 [1] CRAN (R 4.1.2)
 cli           3.1.1   2022-01-20 [1] CRAN (R 4.1.2)
 crayon        1.4.2   2021-10-29 [1] CRAN (R 4.1.2)
 data.table    1.14.2  2021-09-27 [1] CRAN (R 4.1.2)
 ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.2)
 fansi         1.0.2   2022-01-14 [1] CRAN (R 4.1.2)
 glue          1.6.1   2022-01-22 [1] CRAN (R 4.1.2)
 hms           1.1.1   2021-09-26 [1] CRAN (R 4.1.2)
 lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.2)
 magrittr      2.0.2   2022-01-26 [1] CRAN (R 4.1.2)
 pillar        1.7.0   2022-02-01 [1] CRAN (R 4.1.2)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.1.2)
 purrr         0.3.4   2020-04-17 [1] CRAN (R 4.1.2)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.1.2)
 Rcpp          1.0.8   2022-01-13 [1] CRAN (R 4.1.2)
 readr         2.1.2   2022-01-30 [1] CRAN (R 4.1.2)
 readxl        1.3.1   2019-03-13 [1] CRAN (R 4.1.2)
 rlang         1.0.0   2022-01-26 [1] CRAN (R 4.1.2)
 rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.1.2)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
 tibble        3.1.6   2021-11-07 [1] CRAN (R 4.1.2)
 tidyselect    1.1.1   2021-04-30 [1] CRAN (R 4.1.2)
 tzdb          0.2.0   2021-10-27 [1] CRAN (R 4.1.2)
 utf8          1.2.2   2021-07-24 [1] CRAN (R 4.1.2)
 vctrs         0.3.8   2021-04-29 [1] CRAN (R 4.1.2)
 vroom         1.5.7   2021-11-30 [1] CRAN (R 4.1.2)
 withr         2.4.3   2021-11-30 [1] CRAN (R 4.1.2)
 writexl       1.4.0   2021-04-20 [1] CRAN (R 4.1.2)

 [1] C:/Program Files/R/R-4.1.2/library

-------------------------------------------------------------------------------------------

Code to reproduce
Sys.getlocale()
owd <- setwd(tempdir())
dir.create("tmp")
# write test file
write.csv(mtcars, "tmp/mtcärs.csv")
# can be read fine with read.csv()
head(read.csv("tmp/mtcärs.csv"))

# readr and data.table cannot process the filename with umlaut correctly
readr::read_csv("tmp/mtcärs.csv")
data.table::fread("tmp/mtcärs.csv")

# filenames with umlauts written by readr or writexl are garbled
readr::write_csv(mtcars, "tmp/mtcörs.csv")
writexl::write_xlsx(mtcars, "tmp/mtcörs.xlsx")
dir("tmp")

# readxl also cannot handle filenames with umlauts
file.rename("tmp/mtc�rs.xlsx", "tmp/mtcärs.xlsx")
dir("tmp")
readxl::read_xlsx("tmp/mtcärs.xlsx")

Describe the problem in detail

RStudio Desktop (I tested versions 2021.09.2+382 and 2022.02.0+421) apparently handles file names and paths containing umlauts or other non-ASCII characters differently from previous versions. As a result, some non-base packages cannot handle such paths correctly. The problem does not occur in the R console, when the code is run with {reprex} from RStudio, or with RStudio 2021.09.0+351, which is why I think this is a regression in RStudio. I used several packages to show that the problem is not isolated to one of them.

The issue also does not occur with R devel (2022-02-03 r81650 ucrt) - but that is not released yet and I assume that RStudio is supposed to work with R < v4.2 for a little longer?

Describe the behavior you expected

RStudio and non-base packages work well together, so that file paths with non-ASCII characters do not cause errors.

  • I have read the guide for submitting good bug reports.
  • I have installed the latest version of RStudio, and confirmed that the issue still persists.
  • If I am reporting an RStudio crash, I have included a diagnostics report.
  • I have done my best to include a minimal, self-contained set of instructions for consistently reproducing the issue.
@kevinushey kevinushey added the bug label Feb 4, 2022
@kevinushey kevinushey self-assigned this Feb 4, 2022
@kevinushey
Contributor

Thanks for the bug report! For what it's worth, forcing the strings to be re-encoded as UTF-8 seems to dodge the issue:

path <- file.path(tempdir(), "mtcärs.csv")
write.csv(mtcars, path)
read.csv(path)

library(readr)
read_csv(path)            # fails 
read_csv(enc2utf8(path))  # succeeds

I'll see if I can learn more.
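
For anyone who needs a stopgap, the enc2utf8() workaround above can be wrapped in a small helper. This is only a sketch: read_with_utf8_path() is a hypothetical name, not part of readr or any other package, and the example uses an ASCII file name and base R's read.csv() so that it runs in any locale. Whether it helps for a given package on an affected setup is an assumption that would need testing there.

```r
# Hypothetical helper (not part of any package): re-encode a path as UTF-8
# before passing it to a reader function that takes the path as its first argument.
read_with_utf8_path <- function(path, reader, ...) {
  stopifnot(is.character(path), length(path) == 1L, is.function(reader))
  reader(enc2utf8(path), ...)
}

# Usage with base R's read.csv(); the file name is ASCII here so the
# example runs everywhere, but the call looks the same for non-ASCII paths.
path <- file.path(tempdir(), "mtcars-example.csv")
write.csv(mtcars, path, row.names = FALSE)
head(read_with_utf8_path(path, read.csv))
```

On an affected setup the intended call would be something like read_with_utf8_path(path, readr::read_csv).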

@oloverm

oloverm commented Feb 7, 2022

A bunch of people with similar issues in this thread: https://community.rstudio.com/t/rstudio-cant-deal-with-file-names-with-unicode-characters/126601/20

@kevinushey
Contributor

A similar issue, with saveRDS():

str(l10n_info())
setwd(tempdir())
name <- "æ.rds"
Encoding(name)
saveRDS(list(), name)
list.files()

When run with R-devel + UCRT everything is fine. Unfortunately, with older versions of R, the file name is now mis-encoded.

> str(l10n_info())
List of 4
 $ MBCS    : logi FALSE
 $ UTF-8   : logi FALSE
 $ Latin-1 : logi TRUE
 $ codepage: int 1252
> setwd(tempdir())
> name <- "æ.rds"
> Encoding(name)
[1] "latin1"
> saveRDS(list(), name)
> list.files()
[1] "�.rds"                                         
[2] "rs-graphics-dbda5c83-bc5a-4fad-8671-154f2ab449b1"

My suspicion is that this is because RStudio is now compiled to use the Windows UTF-8 code page, but R is still trying to use the "default" system code page. This leads to æ being misinterpreted or misencoded when trying to write the file.

As far as I can see, the only way to set the application encoding is in the application's manifest file; that is, it must be set at build time, and cannot be changed at run time.

https://docs.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page

If this is the case, then I think we need to back out the UTF-8 change for Windows, and later consider distributing two separate builds of RStudio: one for R (>= 4.2.0) and one for R (< 4.2.0).
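
To make the suspected mismatch concrete, here is a minimal sketch of how the same character is stored under the two encodings. It only illustrates the general mechanism (a latin1 byte sequence is not valid UTF-8, and vice versa). It does not show what RStudio or R actually do internally, which is still just a suspicion at this point.

```r
# The character "æ" written with a Unicode escape, so the script itself stays ASCII.
ae_utf8 <- "\u00e6"

# The same character converted to latin1 (the encoding family of code page 1252 for this character).
ae_latin1 <- iconv(ae_utf8, from = "UTF-8", to = "latin1")

charToRaw(ae_utf8)    # two bytes:  c3 a6
charToRaw(ae_latin1)  # one byte:   e6

# The single latin1 byte is not a valid UTF-8 sequence on its own, so code
# that assumes UTF-8 cannot decode it and may show a replacement character.
validUTF8(ae_latin1)  # FALSE
```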

@ronblum ronblum added this to the Prairie Trillium (2022.02.0) milestone Feb 9, 2022
@ronblum ronblum self-assigned this Feb 9, 2022
@ronblum ronblum added test locale internationalization, localization, keyboard, region, and language windows labels Feb 9, 2022
@ronblum
Contributor

ronblum commented Feb 9, 2022

Using RStudio Desktop 2022.02.0-431 on Windows 10, I'm still seeing

> Encoding(name)
[1] "latin1"

Is this still supposed to be this way, or should it be UTF-8?

@kevinushey
Contributor

That's fine; it should still be latin1, for versions of R (< 4.2.0).

The most important thing for testing is that list.files() prints the file name as it was set + created; e.g.

> list.files()
[1] "æ.rds"
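
If it helps with verification, the check could be scripted roughly like this. It is a sketch only: check_roundtrip() is a hypothetical helper, not an existing function, and it assumes the session's native encoding can represent the file name at all.

```r
# Hypothetical helper: create a file with the given name in a fresh temporary
# directory and report whether list.files() returns that same name.
check_roundtrip <- function(name, dir = tempfile("enc-check-")) {
  dir.create(dir)
  on.exit(unlink(dir, recursive = TRUE), add = TRUE)
  saveRDS(list(), file.path(dir, name))
  listed <- list.files(dir)
  # Compare in UTF-8 so that differing encoding marks do not matter.
  enc2utf8(name) %in% enc2utf8(listed)
}

check_roundtrip("\u00e6.rds")  # the "æ.rds" case from above
check_roundtrip("plain.rds")   # ASCII control case
```

A FALSE result for the first call alongside TRUE for the control would point at the file name being mis-encoded on the way to disk.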

@ronblum
Contributor

ronblum commented Feb 10, 2022

In that case, verified fixed. Thanks!

@ronblum ronblum closed this as completed Feb 10, 2022
@ronblum ronblum removed the test label Feb 10, 2022
@dpprdan
Author

dpprdan commented Feb 10, 2022

Will a patched Ghost Orchid version be released as well?
Or rather, why are the Windows dailies with (I assume) the patch not being built, while the builds for the other platforms are? The last Windows version is from a month ago; all others were apparently built 3 days ago, as per https://dailies.rstudio.com/rstudio/ghost-orchid/

@jmcphers
Member

Just fixed the dailies page! You can get the Windows build here now: https://dailies.rstudio.com/rstudio/ghost-orchid/desktop/windows/2021-09-3-384/

@ronblum ronblum added the backport Issues whose associated fixes will need to be backported for a previous release. label Feb 10, 2022
@ronblum ronblum reopened this Apr 26, 2022
@ronblum ronblum added the test label Apr 26, 2022
@ronblum
Contributor

ronblum commented Apr 26, 2022

Reopening and putting in "test" for Ghost Orchid backport.

@ronblum
Contributor

ronblum commented Apr 27, 2022

Verified in RStudio Desktop 2021.09.3+396 (Ghost Orchid) with R 3.6.3 and 4.2.0 on Windows 11 using the three repros above: the original post and the two examples by @kevinushey.

@ronblum ronblum closed this as completed Apr 27, 2022
@ronblum ronblum removed the test label Apr 27, 2022