Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encode characters as UTF-8 when locale is not set #730

Merged
merged 4 commits into from Dec 7, 2017

Conversation

@yutannihilation
Copy link
Member

@yutannihilation yutannihilation commented Oct 29, 2017

As reported on #729, read_csv() etc fails to handle the encoding on non-UTF-8 MBCS locale when file is a text:

df <- read_csv("中文,英文\n英文,德文")

In this case, "中文,英文\n英文,德文" is encoded as the default encoding but the default_locale() assumes that file is encoded as UTF-8. So, it would be better to encode the text as UTF-8 before passing to other functions.

This PR fixes the problem, but I'm not 100% sure this PR is OK since this may break some existing use cases where file is a UTF-8 text marked improper encoding (locale(encoding = "UTF-8") is identical to default_encoding()):

read_csv(utf8_but_improperly_encoded_text, locale = locale(encoding = "UTF-8"))

Users will have to encode the text properly before passing it to read_csv() etc.

Encoding(properly_encoded_text) <- "UTF-8"
read_csv(properly_encoded_text)

What do you think?

R/read_delim.R Outdated
@@ -177,7 +177,13 @@ read_delimited <- function(file, tokenizer, col_names = TRUE, col_types = NULL,
if (empty_file(file)) {
return(tibble::data_frame())
}
data <- file
if (is.character(file) && identical(locale, default_locale())) {
Copy link
Member

@jimhester jimhester Nov 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a rare case where missing(locale) might be the right thing to do.

Loading

Copy link
Member Author

@yutannihilation yutannihilation Nov 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I didn't know missing() propagates to the function called inside... Thanks for the suggestion!

Loading

Copy link
Member Author

@yutannihilation yutannihilation Nov 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, no. In this case, locale is never "missing" because the default value is specified for locale in the function that call read_delimited() (e.g. read_csv(), read_delim()...).

reprex::reprex_info()
#> Created by the reprex package v0.1.1.9000 on 2017-11-03

# this is ok
f1 <- function(x) missing(x)
f2 <- function(x) f1(x = x)
f2()
#> [1] TRUE

# missing() returns FALSE if x has the default value in the caller function
f1 <- function(x) missing(x)
f2 <- function(x = 1L) f1(x = x)
f2()
#> [1] FALSE

Loading

@yutannihilation
Copy link
Member Author

@yutannihilation yutannihilation commented Nov 6, 2017

Although this PR may breaks some use cases, now I'm sure this is justifiable since

  1. copying and pasting is the most primitive form of reproducibility, so df <- read_csv("中文,英文\n英文,德文") should return the same result regardless of which platform we execute the code on.
  2. readr expects UTF-8 inputs, so it's the user's responsibility to either
    1. encode the input into UTF-8 before passing it to read_*(). This PR is OK for this.
    2. specify the proper encoding for the input. This PR is almost OK for this; it fails if the input is actually UTF-8 but is marked some improper encoding. I bet this use case is rare and the problem can be avoidable by Encoding<-() or iconv().

Loading

@jimhester
Copy link
Member

@jimhester jimhester commented Dec 6, 2017

Ok I think this is fine, can you add a note to NEWS.md with the issue # and mentioning your GitHub username and I will merge it.

Loading

@yutannihilation
Copy link
Member Author

@yutannihilation yutannihilation commented Dec 7, 2017

Thanks! I've added it.

Loading

@jimhester jimhester merged commit fdaed60 into tidyverse:master Dec 7, 2017
1 of 2 checks passed
Loading
@jimhester
Copy link
Member

@jimhester jimhester commented Dec 7, 2017

Thanks!

Loading

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

2 participants