mutate()/transmute() fails to handle column names in UTF-8 on Windows #2387

@yutannihilation

As suggested in #2285 (comment), the current GitHub version of mutate() and transmute() mishandles non-ASCII column names on Windows. I have not found the cause yet, but I suspect this is a regression related to #1950.

Here are the error details with reprexes:

Case 1) Error with a data.frame that contains non-ASCII colnames

I found that mutate() adds a new, garbled column instead of overwriting the existing non-ASCII column.

df1 <- data_frame(Φ = 1)
df1
#> # A tibble: 1 × 1
#>      Φ
#>   <dbl>
#> 1     1

df1_mutated <- df1 %>% mutate(Φ = Φ * 2)
df1_mutated
#> # A tibble: 1 × 2
#>      Φ    ホヲ
#>   <dbl> <dbl>
#> 1     1     2

Case 2) Error with a data.frame that contains non-ASCII colnames in UTF-8

If the non-ASCII colname is UTF-8-encoded, mutate() does not add a column; instead, it replaces the existing column with the garbled one.

df2 <- df1
colnames(df2) <- enc2utf8(colnames(df2))

df2_mutated <- df2 %>% mutate(Φ = Φ * 2)
df2_mutated
#> # A tibble: 1 × 1
#>      ホヲ
#>   <dbl>
#> 1     2

Details

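So what is this mysterious string ホヲ?

These are actually the UTF-8 bytes of Φ (ce a6), but the string has lost its encoding mark. R therefore reads the bytes in the native encoding, which is CP932 in my locale, where they display as katakana. This is why non-ASCII column names are mishandled in non-UTF-8 environments.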

charToRaw(colnames(df2_mutated))
#> [1] ce a6
charToRaw(enc2utf8("Φ"))
#> [1] ce a6

Encoding(colnames(df2_mutated))
#> [1] "unknown"
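
The same effect can be reproduced without dplyr. The following is a minimal sketch, not a trace of what dplyr does internally: it strips the encoding mark by round-tripping through raw bytes, then decodes those bytes as CP932. It assumes that the platform's iconv() supports the "CP932" encoding name; the variable names are only for illustration.

```r
# "\u03a6" is Φ; strings written with \u escapes are always marked as UTF-8.
phi <- "\u03a6"
Encoding(phi)
#> [1] "UTF-8"

# Round-tripping through raw bytes keeps the bytes but drops the mark.
unmarked <- rawToChar(charToRaw(phi))
charToRaw(unmarked)
#> [1] ce a6
Encoding(unmarked)
#> [1] "unknown"

# Decode the same two bytes as CP932 (assumption: iconv() knows "CP932").
# In CP932, 0xce and 0xa6 are the half-width katakana HO and WO.
as_cp932 <- iconv(unmarked, from = "CP932", to = "UTF-8")
identical(as_cp932, "\uff8e\uff66")
#> [1] TRUE
```

In a UTF-8 locale the unmarked string still prints as Φ, because the native encoding happens to match the bytes. The garbling is only visible where the native encoding differs, as on Windows.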

So if I mark the column names as "UTF-8" with Encoding(), they are displayed correctly again.

colnames(df2_mutated) <- `Encoding<-`(colnames(df2_mutated), "UTF-8")
df2_mutated
#> # A tibble: 1 × 1
#>      Φ
#>   <dbl>
#> 1     2
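
Until the cause is fixed, the re-marking step could be wrapped in a small helper. This is only a sketch of a user-side workaround, not a proposed fix for dplyr: mark_names_utf8() is a hypothetical name, and it assumes that any unmarked name that is valid UTF-8 really is UTF-8. validUTF8() requires R 3.3.0 or later.

```r
# Hypothetical helper (not part of dplyr): mark column names as UTF-8 when
# they carry no encoding mark but are valid UTF-8 byte sequences.
mark_names_utf8 <- function(df) {
  nms <- names(df)
  fix <- Encoding(nms) == "unknown" & validUTF8(nms)
  # ASCII names are unaffected: R never attaches a mark to ASCII strings.
  Encoding(nms[fix]) <- "UTF-8"
  names(df) <- nms
  df
}

# Simulate the broken result: a column name holding the bytes ce a6, unmarked.
df <- data.frame(x = 2)
names(df) <- rawToChar(as.raw(c(0xce, 0xa6)))
Encoding(names(df))
#> [1] "unknown"

df <- mark_names_utf8(df)
Encoding(names(df))
#> [1] "UTF-8"
```

The assumption can fail: a name that is genuinely in the native encoding and happens to be valid UTF-8 would be re-marked incorrectly, so this only fits data known to come from the bug above.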

My environment

library(dplyr)

packageVersion("dplyr")
#> [1] '0.5.0.9000'

# the installed version is at the point of this commit:
#    https://github.com/hadley/dplyr/commit/8aa1bdb8fe95b741fb9411dbccd1b3af2f631dfc
packageDescription("dplyr")$GithubSHA1
#> [1] "8aa1bdb8fe95b741fb9411dbccd1b3af2f631dfc"

Sys.getlocale()
#> [1] "LC_COLLATE=Japanese_Japan.932;LC_CTYPE=Japanese_Japan.932;LC_MONETARY=Japanese_Japan.932;LC_NUMERIC=C;LC_TIME=Japanese_Japan.932"

Labels: bug (an unexpected problem or unintended behavior)
