Encoding problems on Windows caused by character -> symbol -> character roundtrip #1950

krlmlr · 2016-06-21T18:57:43Z

Some of the encoding problems (e.g., grouping) seem to be caused by converting character to symbol and back. On Linux (UTF-8 locale):

> "ä" %>% Encoding
[1] "UTF-8"
> "ä" %>% as.name %>% as.character %>% Encoding
[1] "UTF-8"
> "ä" %>% iconv(to = "latin1") %>% as.name %>% as.character %>% Encoding
[1] "unknown"
> "ä" %>% iconv(to = "latin1") %>% as.name %>% as.character %>% enc2utf8
[1] "<e4>"

On Windows (latin-1 locale):

> "ä" %>% Encoding
[1] "latin1"
> "ä" %>% as.name %>% as.character %>% Encoding
[1] "latin1"
> "ä" %>% iconv(to = "UTF-8") %>% as.name %>% as.character %>% Encoding
[1] "unknown"
> "ä" %>% iconv(to = "UTF-8") %>% as.name %>% as.character %>% enc2utf8
[1] "Ã¤"

On Windows, one test fails because of that.

So, currently we should suggest using ASCII (or at least native-encoded) column names, and not fiddle with the encoding, in particular not set it to UTF-8 on Windows.

A lot of internal dplyr logic seems to be based on the symbol type, so I don't see a quick solution here.
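A rough way to check whether a given set of column names is affected by the roundtrip described above might look like the sketch below. The helper name `survives_roundtrip()` is hypothetical and not part of dplyr; it only uses base R. The result for non-ASCII names depends on the locale and the R version, so none is shown here.

```r
# Hypothetical helper: TRUE for each name that comes back unchanged
# from the character -> symbol -> character roundtrip.
survives_roundtrip <- function(x) {
  stopifnot(is.character(x))
  vapply(x, function(nm) {
    back <- as.character(as.name(nm))
    # Compare after converting both sides to UTF-8, so that a lost or
    # wrong encoding mark shows up as a mismatch.
    identical(enc2utf8(back), enc2utf8(nm))
  }, logical(1), USE.NAMES = FALSE)
}

# ASCII names are expected to pass everywhere; the second name is "ä"
# written as an escape, and its result is platform-dependent.
survives_roundtrip(c("col_name", "\u00e4"))
```

Names for which this returns `FALSE` are the ones that would need to be renamed or re-encoded before being used as symbols.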


kevinushey · 2016-06-21T19:19:07Z

IIUC, the over-arching issue is that the as.name => as.character roundtrip loses the encoding, and so we no longer know how to properly translate from the original encoding to the desired encoding (hence why enc2utf8 doesn't produce the desired result)?

krlmlr · 2016-06-21T19:33:30Z

Yes -- if it's not the native encoding, the roundtrip loses it, and there's no way to recover.

hadley · 2016-06-23T12:26:41Z

Maybe dplyr could have its own symbol coercion methods that always assumed UTF-8?

krlmlr · 2016-06-23T12:29:52Z

We could do that, but I think the safest option is not to use symbols to represent column names -- a simple S3 class should do the same job.
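To illustrate the S3 idea, here is a minimal sketch under stated assumptions: the class name `col_name` and the constructor `new_col_name()` are invented for this example and do not exist in dplyr. The point is only that the name is stored as a UTF-8 string and never passes through `as.name()`.

```r
# Hypothetical constructor: keep the column name as a UTF-8 string
# instead of converting it to a symbol.
new_col_name <- function(x) {
  stopifnot(is.character(x), length(x) == 1L, !is.na(x))
  structure(list(name = enc2utf8(x)), class = "col_name")
}

# Return the stored string unchanged.
as.character.col_name <- function(x, ...) x$name

# Compact display for interactive use.
print.col_name <- function(x, ...) {
  cat("<col_name> ", x$name, "\n", sep = "")
  invisible(x)
}

# The name from the examples in this thread, written with escapes so the
# source file itself stays ASCII.
nm <- new_col_name("\u6210\u4ea4\u65e5\u671f")
identical(as.character(nm), "\u6210\u4ea4\u65e5\u671f")
```

The trade-off is that such an object is not a language object, so it cannot be dropped directly into an expression the way a symbol can.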

hadley · 2016-06-23T12:35:07Z

I think symbols are the correct way to represent column names for a number of reasons.

krlmlr · 2016-06-23T12:44:57Z

We could settle for a combination:

> structure(as.name("col_name"), class = "encoded_symbol")
col_name
attr(,"class")
[1] "encoded_symbol"

With a suitable as.character.encoded_symbol() function.

krlmlr · 2016-08-09T22:54:15Z

Windows users are confined to their native encoding for column names anyway if they want to use them in expressions. Expressions are always in the native encoding, so the following doesn't work on Windows:

~成交日期
## Error: unexpected input in "~\"

For characters that can be represented in the native encoding, . %>% as.name %>% as.character works as expected and reliably maintains the declared encoding. I think we should use (and expect) the native encoding whenever we interact with language objects, and UTF-8 otherwise.
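The rule proposed here (native encoding at the language-object boundary, UTF-8 everywhere else) could be sketched roughly as follows. The helper names `name_to_symbol()` and `symbol_to_name()` are hypothetical, and the sketch assumes the name can be represented in the native encoding; characters that cannot are not preserved by `enc2native()`.

```r
# Hypothetical helpers, not part of dplyr.

# Character -> symbol: convert to the native encoding first, so the
# symbol is created from a string in the encoding the parser expects.
name_to_symbol <- function(x) {
  stopifnot(is.character(x), length(x) == 1L)
  as.name(enc2native(x))
}

# Symbol -> character: the symbol name is taken to be native-encoded,
# so enc2utf8() can convert it to UTF-8 for use outside expressions.
symbol_to_name <- function(sym) {
  stopifnot(is.symbol(sym))
  enc2utf8(as.character(sym))
}

symbol_to_name(name_to_symbol("col_name"))
```

Keeping both conversions in one pair of helpers means the encoding decision is made in a single place rather than at every call to `as.name()`.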

I have submitted a bug report to R's bugzilla concerning the behavior of as.name() for strings in non-native encoding.

krlmlr · 2016-08-10T22:06:42Z

A comment in the R source leaves little hope for an upstream fix, but we'll see.

krlmlr · 2016-11-08T07:20:09Z

Related: #1885, joining strings with different encodings.

krlmlr · 2017-01-25T20:53:21Z

But the following works on Windows:

~"成交日期"

CC @hadley

krlmlr · 2017-01-25T20:55:43Z

> data_frame(a = 1) %>% setNames("成交日期")
# A tibble: 1 × 1
  `<U+6210><U+4EA4><U+65E5><U+671F>`
                               <dbl>
1                                  1


krlmlr mentioned this issue Jun 21, 2016

Re-encode character columns and column names to UTF-8 tidyverse/tibble#87

Closed

krlmlr mentioned this issue Jul 29, 2016

Backtick non-semantic variable names tidyverse/tibble#131

Closed

krlmlr mentioned this issue Aug 8, 2016

Support for UTF-8 encoded language symbols krlmlr/enc#9

Closed

krlmlr mentioned this issue Nov 8, 2016

Strings with and without encoding are not matched when joining #1885

Closed

This was referenced Dec 1, 2016

UTF-8 problem with group_by (once again) #2277

Closed

select() mistakenly converts colnames to UTF-8 #2284

Closed

krlmlr mentioned this issue Jan 6, 2017

joining when a tibble has duplicate column names causes one column to overwrite the others #2353

Closed

This was referenced Jan 27, 2017

Fix group_by() for column names in UTF-8 on Windows #2382

Merged

New group_names() to return group names as character vector #2384

Merged

yutannihilation mentioned this issue Jan 28, 2017

mutate()/transmute() fails to handle column names in UTF-8 on Windows #2387

Closed

krlmlr added a commit to krlmlr/dplyr that referenced this issue Jan 28, 2017

Merge branch 'master' into b-tidyverse#1950-symbols

3921662

This was referenced Jan 31, 2017

nest fails with space in column name tidyverse/tidyr#244

Closed

WIP: A more consistent way to specify query arguments #2386

Closed

hadley added the bug and data frame labels Feb 2, 2017

krlmlr mentioned this issue Feb 20, 2017

Use String instead of Symbol #2388

Merged

krlmlr added a commit to krlmlr/dplyr that referenced this issue Feb 20, 2017

Merge branch 'master' into b-tidyverse#1950-symbols

5ede6b5

krlmlr added a commit to krlmlr/dplyr that referenced this issue Feb 20, 2017

Merge branch 'master' into b-tidyverse#1950-symbols

c3ac212

krlmlr added a commit to krlmlr/dplyr that referenced this issue Feb 20, 2017

Merge branch 'master' into b-tidyverse#1950-symbols

8fd252e

krlmlr modified the milestone: data frame 1 Feb 21, 2017

krlmlr added a commit to krlmlr/dplyr that referenced this issue Feb 21, 2017

Merge remote-tracking branch 'origin/master' into b-tidyverse#1950-symbols

dfecb7f

krlmlr closed this as completed in #2388 Feb 21, 2017

krlmlr added a commit to krlmlr/dplyr that referenced this issue Feb 23, 2017

Merge branch 'b-tidyverse#1950-symbols' into master-nosquash

71302e5

krlmlr mentioned this issue Feb 24, 2017

Full support for foreign UTF-8 characters on non-UTF-8 locales #2469

Closed

lock bot locked as resolved and limited conversation to collaborators Jun 8, 2018
