mutate_all() and UTF-8 names #2967

Closed
krlmlr opened this issue Jul 13, 2017 · 9 comments

@krlmlr krlmlr commented Jul 13, 2017

library(magrittr)
suppressPackageStartupMessages(library(dplyr))

data_frame(a = 1) %>%
  setNames(enc2native("\u4e2d")) %>%
  mutate_all(funs(as.character))
#> # A tibble: 1 x 2
#>   `<U+4E2D>`  `中`
#>        <dbl> <chr>
#> 1          1     1

data_frame(a = 1) %>%
  setNames("\u4e2d") %>%
  mutate_all(funs(as.character))
#> Warning: Mangling the following names: <U+4E2D> -> <U+4E2D>. Use
#> enc2native() to avoid the warning.
#> Error in mutate_impl(.data, dots): Evaluation error: variable '<U+4E2D>' not found.

(I manually edited the output of the first example to include the Chinese character.)

In the first example, mutate_all() receives a data frame with Unicode-escaped column names as input. Do we want to repair the encoding of column names in each dplyr verb?

In the second example, the column name is properly UTF-8 encoded, but r-lib/rlang@ff87439 seems to get confused.
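
To make the difference between the two examples concrete, here is a minimal base-R sketch (an illustration, not code from dplyr or rlang) of how the same name can be held as a UTF-8-marked string or translated to the native encoding. The variable names are assumptions made for the example, and what enc2native() returns depends on the session locale.

```r
# Illustration only: inspect how a non-ASCII column name is stored.
nm <- "\u4e2d"

# Strings created from \u escapes are marked as UTF-8.
Encoding(nm)

# The name is a single code point, U+4E2D.
sprintf("U+%04X", utf8ToInt(nm))

# enc2utf8() keeps the UTF-8 form; enc2native() translates to the locale's
# encoding and may fall back to an escape such as <U+4E2D> when the locale
# cannot represent the character.
nm_utf8   <- enc2utf8(nm)
nm_native <- enc2native(nm)

# TRUE in a UTF-8 locale; may be FALSE elsewhere (e.g. LC_CTYPE = "C").
same_name <- identical(nm_utf8, nm_native)
```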

CC @lionel-.

krlmlr added a commit to krlmlr/dplyr that referenced this issue Jul 13, 2017
krlmlr added a commit that referenced this issue Jul 28, 2017
* fix Windows tests for Unicode column names

* fix rename for foreign column names

@lionel-: Perhaps exprs() and quo() should unescape names?

* Revert "fix rename for foreign column names"

now successful with current dev version of rlang.

This reverts commit 1db519c.

* exclude failing test on Windows

for #2967
@krlmlr krlmlr added this to the 0.7.3 milestone Aug 16, 2017

@krlmlr krlmlr commented Aug 22, 2017

Locale-independent reprex:

suppressPackageStartupMessages(library(dplyr))
withr::with_locale(
  c(LC_CTYPE = "C"),
  data_frame(a = 1) %>%
    setNames("\u4e2d") %>%
    mutate_all(funs(as.character))
)
#> Warning: Mangling the following names: <U+4E2D> -> <U+4E2D>. Use
#> enc2native() to avoid the warning.
#> Error in mutate_impl(.data, dots): Evaluation error: variable '中' not found.


@krlmlr krlmlr commented Aug 23, 2017

Seems this will be easier to fix with objectables.


@krlmlr krlmlr commented Aug 25, 2017

Also need to bump rlang dependency and reenable test (#3049).


@krlmlr krlmlr commented Mar 12, 2018

The error looks different with current rlang, will revisit.


@krlmlr krlmlr commented Mar 13, 2018

Postponing until after #2311: we always need to create a data mask if we're not on a UTF-8 system.
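
As a rough illustration of that condition, the following base-R sketch checks whether the session is UTF-8 and whether any column name contains non-ASCII characters. The names `is_utf8_session`, `has_non_ascii` and `needs_mask` are hypothetical and are not part of dplyr; this is an assumption about how such a check could look, not the actual implementation.

```r
# Illustration only: decide whether column names would need translating.
# l10n_info() reports whether the current locale is UTF-8.
is_utf8_session <- isTRUE(l10n_info()[["UTF-8"]])

col_names <- c("a", "\u4e2d")

# A name is non-ASCII if any of its code points is above 127.
has_non_ascii <- any(vapply(
  col_names,
  function(s) any(utf8ToInt(enc2utf8(s)) > 127L),
  logical(1)
))

# Hypothetical rule: a data mask with translated names is only needed when
# the locale cannot represent the names directly.
needs_mask <- !is_utf8_session && has_non_ascii
```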

Even simpler reprex:

library(tidyverse)
withr::with_locale(
  c(LC_CTYPE = "C"),
  {
    varname <- "\u4e2d"
    data_frame(!!varname := 1) %>%
      mutate(!!paste0("new-", varname) := as.character(!!sym(varname)))
  }
)
#> Warning: Mangling the following names: <U+4E2D> -> <U+4E2D>. Use
#> enc2native() to avoid the warning.
#> Error in mutate_impl(.data, dots): Evaluation error: variable '中' not found.

Created on 2018-03-13 by the reprex package (v0.2.0).

The problem here is that the data frame has UTF-8 names, so we need a proper data mask with mangled names (i.e., <U+4E2D> instead of 中). Currently we don't build the data mask for ungrouped operations, but we need to if we have foreign column names that cannot be translated.

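
The idea of a mask with mangled names can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not dplyr's or rlang's implementation: `escape_non_ascii()` is a hypothetical helper that reproduces the <U+xxxx> escape form by hand, and the "mask" is just an environment built with list2env().

```r
# Hypothetical helper: replace every non-ASCII code point in each name by
# its <U+xxxx> escape, leaving ASCII characters untouched.
escape_non_ascii <- function(x) {
  vapply(x, function(s) {
    cp <- utf8ToInt(enc2utf8(s))
    parts <- ifelse(
      cp < 128L,
      intToUtf8(cp, multiple = TRUE),
      sprintf("<U+%04X>", cp)
    )
    paste0(parts, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

# A data frame whose only column has a UTF-8 name.
df <- data.frame(x = 1)
names(df) <- "\u4e2d"

# Build an environment that exposes the columns under their mangled names.
mask <- list2env(setNames(as.list(df), escape_non_ascii(names(df))))

# An expression that refers to the mangled name now finds the column.
result <- eval(quote(as.character(`<U+4E2D>`)), mask)
```

With such a mask, an expression that refers to the escaped name resolves to the column even though the data frame itself stores the name in UTF-8.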

@romainfrancois romainfrancois commented Mar 26, 2018

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
data_frame(a = 1) %>%
  setNames(enc2native("\u4e2d")) %>%
  mutate_all(funs(as.character))
#> # A tibble: 1 x 1
#>   中   
#>   <chr>
#> 1 1

data_frame(a = 1) %>%
  setNames("\u4e2d") %>%
  mutate_all(funs(as.character))
#> # A tibble: 1 x 1
#>   中   
#>   <chr>
#> 1 1

Created on 2018-03-26 by the reprex package (v0.2.0).


@romainfrancois romainfrancois commented Mar 26, 2018

🤔 ok the second example still fails for me:

suppressPackageStartupMessages(library(dplyr))
withr::with_locale(
  c(LC_CTYPE = "C"),
  data_frame(a = 1) %>%
    setNames("\u4e2d") %>%
    mutate_all(funs(as.character))
)
#> Warning: Mangling the following names: <U+4E2D> -> <U+4E2D>. Use
#> enc2native() to avoid the warning.
#> Error in mutate_impl(.data, dots): Evaluation error: variable '中' not found.

Created on 2018-03-26 by the reprex package (v0.2.0).


@romainfrancois romainfrancois commented Sep 14, 2018

Now getting this:

suppressPackageStartupMessages(library(dplyr))
withr::with_locale(
  c(LC_CTYPE = "C"),
  data_frame(a = 1) %>%
    setNames("\u4e2d") %>%
    mutate_all(funs(as.character))
)
#> # A tibble: 1 x 1
#>   中   
#>   <chr>
#> 1 1


@lock lock bot commented Mar 16, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Mar 16, 2019