select() mistakenly converts colnames to UTF-8 #2284

Closed
yutannihilation opened this issue Dec 2, 2016 · 8 comments

@yutannihilation yutannihilation commented Dec 2, 2016

I've faced this behavior with R 3.3.2 and dplyr 0.5.0 on my Windows machine.

library(dplyr)

d <- data.frame("φ" = 1:10)

Encoding(colnames(d))
#> [1] "unknown"

d2 <- d %>% select(φ)
Encoding(colnames(d2))
#> [1] "UTF-8"

This is due to a quirk of the c() function in base R: the names of a named vector are unintentionally converted to UTF-8.

x <- "φ"
names(x) <- "φ"

Encoding(names(x))
#> [1] "unknown"

Encoding(names(c(x)))
#> [1] "UTF-8"
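For anyone who wants to check this on another setup, here is a minimal sketch. The helper name names_encoding_via_c() is made up for illustration, and the result depends on the locale and R version, so no output is shown.

```r
# Hypothetical helper: report the encoding mark of a vector's names
# before and after the vector is passed through c().
names_encoding_via_c <- function(x) {
  c(before = Encoding(names(x)), after = Encoding(names(c(x))))
}

x <- 1L
names(x) <- "\u03c6"              # Greek small phi, written as an escape
names(x) <- enc2native(names(x))  # re-encode to the session's native encoding
names_encoding_via_c(x)
```

If both entries agree in your session, c() did not change the mark for that name there.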

This behavior is not a big problem in itself and is not dplyr's fault, but until #1950 is solved we will keep seeing errors with functions like group_by() and distinct() (e.g. #2277, #2005).

d %>%
  select(φ) %>%
  group_by(φ)
#> Error in grouped_df_impl(data, unname(vars), drop) : unknown column 'φ' 

So, could you consider not using c() here? https://github.com/hadley/dplyr/blob/1ba25c03826372f41e7d7b850a174d881f0ac15b/R/select-vars.R#L86

@krlmlr krlmlr commented Dec 9, 2016

Thanks. Can you please confirm that enc2native("φ") != "φ" in your installation? What happens if you assign names(d) <- enc2native(names(d))? This could fix your issues in the short term, and is important for us to understand if our notion of fixing #1950 is correct.

@yutannihilation yutannihilation commented Dec 9, 2016

Sure! Here is the result.

enc2native("φ") != "φ"
#> [1] FALSE

I got the same error.

d <- data.frame("φ" = 1:10)
names(d) <- enc2native(names(d))

Encoding(colnames(d))
#> [1] "unknown"

d2 <- d %>% select(φ)
Encoding(colnames(d2))
#> [1] "UTF-8"

d %>%
  select(φ) %>%
  group_by(φ)
#> Error in grouped_df_impl(data, unname(vars), drop) : unknown column 'φ' 

@yutannihilation yutannihilation commented Dec 10, 2016

I guess you already know this, but let me share the little I know about locales. (I learned this from a question on SO.)

A character can be marked as either "unknown" (native) or "UTF-8", depending on the user's locale. Here is an example with a Chinese/Japanese character.

My default locale is this:

strsplit(Sys.getlocale(), ";")
#> [[1]]
#> [1] "LC_COLLATE=Japanese_Japan.932"  "LC_CTYPE=Japanese_Japan.932"   
#> [3] "LC_MONETARY=Japanese_Japan.932" "LC_NUMERIC=C"                  
#> [5] "LC_TIME=Japanese_Japan.932"  

Since I'm in a locale whose code page contains a code for "猫", I get "unknown".

# a Chinese/Japanese character for "cat"
x <- "猫"
Encoding(x)
#> [1] "unknown"

But in a locale where "猫" is not native, I get "UTF-8":

# code page 1253 is for Greek (cf. https://en.wikipedia.org/wiki/Windows-1253)
Sys.setlocale(locale = "English_US.1253")

x <- "猫"
Encoding(x)
#> [1] "UTF-8"
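To make the locale dependence easier to inspect, something like the following could be used. describe_string() is a hypothetical helper, not part of base R or dplyr; it only reports the encoding mark and the raw bytes of a string as stored, after enc2native(), and after enc2utf8(). Only the UTF-8 bytes are the same everywhere; the rest depends on the session's locale, so no output is shown.

```r
# Hypothetical helper: show how one string looks in its stored form,
# in the native encoding, and in UTF-8.
describe_string <- function(x) {
  stopifnot(is.character(x), length(x) == 1L)
  forms <- list(as_is = x, native = enc2native(x), utf8 = enc2utf8(x))
  lapply(forms, function(s) list(mark = Encoding(s), bytes = charToRaw(s)))
}

cat_chr <- "\u732b"  # the character for "cat", written as an escape
str(describe_string(cat_chr))
```

Running this before and after Sys.setlocale() should show whether the native form changes with the locale.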

But since "猫" cannot be represented natively in that locale, I get an error when trying to evaluate it (as you commented on #1950 (comment)).

d %>% select(猫)
#> Error: unexpected input in "d %>% select(\"

By the way, it seems very hard to create a data.frame with "猫" in its colnames in that locale... Hope I can help you test your idea 👍

# data.frame() sucks as usual...
data.frame("猫" = 1)
#>   X.U.732B.
#> 1         1

d <- tibble::data_frame("猫" = 1)
d
#> # A tibble: 1 × 1
#>   `<U+732B>`
#>        <dbl>
#> 1          1

colnames(d)
#> [1] "<U+732B>"

# This is possible
d <- structure(
    list("\u732B" = 1),
    .Names = "\u732B",
    row.names = c(NA,-1L),
    class = "data.frame"
  )
colnames(d)
#> [1] "猫"

@krlmlr krlmlr commented Dec 10, 2016

@yutannihilation: I certainly need to get myself a Japanese/Chinese/Korean/... Windows for testing these details. I'll try to set up one on Azure; my only concern is that I might be unable to understand the system messages in such a Windows installation. We'll see.

@yutannihilation yutannihilation commented Dec 10, 2016

> my only concern is that I might be unable to understand the system messages

I think you can choose whether to localize system messages or not. Actually, I get English messages in Japanese locale. (But, I forgot where to change this setting...)

@krlmlr krlmlr commented Dec 10, 2016

Thanks for your input. I was able to install a Japanese Windows with English messages (Sys.getlocale() shows the Japanese locale in a fresh R session), but it looks like I can fully emulate it on my "regular" system using Sys.setlocale("LC_CTYPE", "Japanese_Japan.932").

The φ character seems to have a native representation in your locale, but it's different from the Unicode representation:

charToRaw("φ")
## [1] 83 d3
charToRaw(enc2utf8("φ"))
## [1] cf 86
charToRaw(enc2native(enc2utf8("φ")))
## [1] 83 d3
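The same kind of mismatch can be reproduced in any locale with a Latin-1 string, which may help show why a name can fail to match if the two forms are compared byte for byte. This is an illustrative sketch with made-up variable names, not a description of dplyr's internals: the bytes differ until both sides are normalized with enc2utf8().

```r
# "u" with diaeresis as a single Latin-1 byte, explicitly marked as latin1
latin1_u <- rawToChar(as.raw(0xfc))
Encoding(latin1_u) <- "latin1"

# The same character entered as a Unicode escape
unicode_u <- "\u00fc"

# Raw bytes of the latin1 form and of the UTF-8 form
charToRaw(latin1_u)
charToRaw(enc2utf8(unicode_u))

# After normalizing both sides to UTF-8, the bytes agree
same_bytes <- identical(charToRaw(enc2utf8(latin1_u)),
                        charToRaw(enc2utf8(unicode_u)))
same_bytes
```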

In contrast, when I try this with an accented Latin-1 character, I get:

"ü"
## "u"

It's converted to ASCII right away, without warning.

Please see my SO reply for more detailed recommendations. We really need to fix #1950; I'll look at your pull request, it might resolve a particular use case but uncover other problems.

@krlmlr krlmlr commented Dec 10, 2016

Actually, you should be able to work around this with a simple helper. The following works in a Japanese locale:

fix_names <- function(x) {
  names(x) <- enc2native(names(x))
  x
}

data_frame(φ = 1:5) %>%
  select(φ) %>%
  fix_names %>%
  group_by(φ)

@yutannihilation yutannihilation commented Dec 12, 2016

Thanks! I agree that fixing #1950 is the most important, though maybe I don't fully understand the point yet... So, should I close this and my pull request?

(No reply is needed for the following. I don't intend to blame you. I know the problem is very difficult)

I am happy to close this one and my pull request. But, one thing, I am a bit frustrated that we are still suffering from the same issue (#1507) which I once believed was solved half a year ago.

I know this kind of workaround, and I've recommended a similar one to my friends every time I was asked about the error :-P. Yes, I'm comfortable with this workaround. But many users don't even notice that a workaround exists. They just give up and leave dplyr (and R).

So, could you please consider adopting a temporary workaround like mine and #2058 if the problem is going to last for long?

@krlmlr krlmlr closed this in #2382 Jan 27, 2017
@lock lock bot locked as resolved and limited conversation to collaborators Jun 8, 2018