Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mutate()/transmute() fails to handle column names in UTF-8 on Windows #2387

Closed
yutannihilation opened this issue Jan 28, 2017 · 14 comments
Closed
Labels
bug an unexpected problem or unintended behavior
Milestone

Comments

@yutannihilation
Copy link
Member

yutannihilation commented Jan 28, 2017

As suggested in #2285 (comment), the current GitHub version of mutate() and transmute() behaves strangely for column names in UTF-8 on Windows. Though I couldn't find the cause yet, I'm afraid this is a regression bug related to #1950.

Here are the error details with reprexes:

Case 1) Error with a data.frame that contains non-ASCII colnames

I found that mutate() adds a strange column for non-ASCII columns.

df1 <- data_frame(Φ = 1)
df1
#> # A tibble: 1 × 1
#>      Φ
#>   <dbl>
#> 1     1

df1_mutated <- df1 %>% mutate(Φ = Φ * 2)
df1_mutated
#> # A tibble: 1 × 2
#>      Φ    ホヲ
#>   <dbl> <dbl>
#> 1     1     2

Case 2) Error with a data.frame that contains non-ASCII colnames in UTF-8

If the non-ASCII colname is UTF-8-encoded, mutate() does not add but replaces the existing columns with the strange column.

df2 <- df1
colnames(df2) <- enc2utf8(colnames(df2))

df2_mutated <- df2 %>% mutate(Φ = Φ * 2)
df2_mutated
#> # A tibble: 1 × 1
#>      ホヲ
#>   <dbl>
#> 1     2

Details

Then, what is this mysterious character ホヲ?

This is actually a UTF-8-converted Φ, but unfortunately it lost Encoding attribute. This is why the non-ASCII columns are mistakenly handled in non-UTF-8 environments.

charToRaw(colnames(df2_mutated))
#> [1] ce a6
charToRaw(enc2utf8("Φ"))
#> [1] ce a6

Encoding(colnames(df2_mutated))
#> [1] "unknown"

So if I set Encoding() as "UTF-8", it starts to work fine again.

colnames(df2_mutated) <- `Encoding<-`(colnames(df2_mutated), "UTF-8")
df2_mutated
#> # A tibble: 1 × 1
#>      Φ
#>   <dbl>
#> 1     2

My environment

library(dplyr)

packageVersion("dplyr")
#> [1] '0.5.0.9000'

# the installed version is at the point of this commit:
#    https://github.com/hadley/dplyr/commit/8aa1bdb8fe95b741fb9411dbccd1b3af2f631dfc
packageDescription("dplyr")$GithubSHA1
#> [1] "8aa1bdb8fe95b741fb9411dbccd1b3af2f631dfc"

Sys.getlocale()
#> [1] "LC_COLLATE=Japanese_Japan.932;LC_CTYPE=Japanese_Japan.932;LC_MONETARY=Japanese_Japan.932;LC_NUMERIC=C;LC_TIME=Japanese_Japan.932"
@krlmlr
Copy link
Member

krlmlr commented Jan 28, 2017

Thanks. Could you please double-check that this is indeed a regression, perhaps with dplyr 0.5.0 and/or an older version 0d57562:

remotes::install_github("hadley/devtools@0d57562")

@yutannihilation
Copy link
Member Author

@krlmlr Thanks for your quick response! I will.

@yutannihilation
Copy link
Member Author

Here is the result: https://gist.github.com/yutannihilation/e458b0fcf6ef784260e9d5558fa3ec68

  • dplyr 0.5.0 (on CRAN) works fine
  • dplyr at 0d57562 won't work

So, it doesn't seem that this behavior is introduced by the recent changes in this few days.

@krlmlr
Copy link
Member

krlmlr commented Jan 28, 2017

Thanks. The version I suggested to test doesn't contain the most recent encoding-related changes. I'll make sure to test this particular example when fixing #1950.

@krlmlr
Copy link
Member

krlmlr commented Jan 28, 2017

Could you please test 2a86e24? It's the predecessor of b205d1f (#2185) that might have introduced this.

@yutannihilation
Copy link
Member Author

Oh, 2a86e24 works fine.

https://gist.github.com/yutannihilation/e458b0fcf6ef784260e9d5558fa3ec68#file-dplyr_test3-md

@yutannihilation
Copy link
Member Author

@yutannihilation
Copy link
Member Author

I guess you've already found the cause, but let me write my expectation.

Since the lazyeval::all_dots() returns the result with UTF-8 names, UTF-8-encoded colnames are passed as dots to mutate_impl() here:

mutate_.tbl_df  <- function(.data, ..., .dots) {
  dots <- lazyeval::all_dots(.dots, ..., all_named = TRUE)
  mutate_impl(.data, dots)
}

Therefore, the name here in the C++ function mutate_not_grouped() is UTF-8 encoded:

 Symbol name = lazy.name();

But, later, it is set as Symbol here, which will lose Encoding information when we convert it to a character.

 accumulator.set(name, results[i]);

@krlmlr
Copy link
Member

krlmlr commented Jan 28, 2017

Thanks for confirming this. I've started to replace all uses of Symbol by String in #2388, but it seems to break quite a few tests in the current state. Also, I think we should encapsulate symbols in a custom class that wraps String, so I'm probably going to change quite a bit before this is ready for review.

@yutannihilation
Copy link
Member Author

yutannihilation commented Jan 28, 2017

Thanks for your information!

I think we should encapsulate symbols in a custom class that wraps String

I got it. Now I understood your comment on #1950 a bit clearer. I will keep watching #2388 and am always happy to help you test the change with my Windows :)

@krlmlr
Copy link
Member

krlmlr commented Jan 28, 2017

Oh, I was referring to a C++ class. At the R level we might use a simple wrapper around character; we can't use language symbols to store column names because this is potentially lossy.

@krlmlr
Copy link
Member

krlmlr commented Jan 28, 2017

Anyway, thanks a lot for your patience and for testing!

@yutannihilation
Copy link
Member Author

Ah, sorry, I was confused with C++ level and R level... I see.

No problem!👍

@hadley hadley added bug an unexpected problem or unintended behavior data frame labels Jan 31, 2017
@krlmlr krlmlr added this to the data frame 1 milestone Feb 21, 2017
@yutannihilation
Copy link
Member Author

Confirmed this has been fixed. Kudos to you for this great job! Thanks so much!!! 🍣 🍣 🍣

@lock lock bot locked as resolved and limited conversation to collaborators Jun 8, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug an unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

3 participants