Encoding problems on Windows caused by character -> symbol -> character roundtrip #1950
Comments
IIUC, the over-arching issue is that the
Yes -- if it's not the native encoding, the roundtrip loses it, and there's no way to recover.
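The roundtrip described here can be checked directly in base R. The following is a minimal sketch: `roundtrip_ok()` is a hypothetical helper name, not part of dplyr, and its result for non-ASCII input depends on the locale of the session.

```r
# Hypothetical helper: does a string survive character -> symbol -> character?
roundtrip_ok <- function(x) {
  stopifnot(is.character(x), length(x) == 1L)
  sym <- as.symbol(x)        # symbol names are held in the native encoding
  back <- as.character(sym)  # comes back as a native-encoded string
  # Compare both sides as UTF-8 so that differing encoding marks alone
  # do not count as a mismatch
  identical(enc2utf8(back), enc2utf8(x))
}

roundtrip_ok("abc")                       # ASCII names do not depend on the locale
roundtrip_ok("\u6210\u4ea4\u65e5\u671f")  # result depends on the native encoding
```

The `\u` escapes are used so that the example itself does not depend on the encoding of the source file.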
Maybe dplyr could have its own symbol coercion methods that always assumed UTF-8?
We could do that, but I think the safest option is not to use symbols to represent column names -- a simple S3 class should do the same job.
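As a rough illustration of the S3 idea, a column name could be carried as a UTF-8 string with a class attribute instead of a symbol. This is only a sketch under that assumption: `new_colname()` and `same_colname()` are hypothetical names and do not exist in dplyr.

```r
# Hypothetical S3 class that keeps column names as UTF-8 strings, never as symbols
new_colname <- function(x) {
  stopifnot(is.character(x))
  structure(enc2utf8(x), class = "colname")
}

# Compare on the UTF-8 representation, dropping the class attribute first
same_colname <- function(a, b) {
  identical(enc2utf8(unclass(a)), enc2utf8(unclass(b)))
}

nm <- new_colname("\u6210\u4ea4\u65e5\u671f")
Encoding(unclass(nm))  # the encoding mark is kept because no symbol is created
```

Because the name never passes through `as.symbol()`, no translation to the native encoding takes place.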
I think symbols are the correct way to represent column names for a number of reasons.
We could settle for a combination:
With a suitable
Windows users are confined to their native encoding for column names anyway if they want to use them in expressions. These are always in the native encoding; the following doesn't work on Windows:
~成交日期
## Error: unexpected input in "~\"
For characters that can be represented in the native encoding, I have submitted a bug report to R's bugzilla concerning the behavior of
A comment in the R source leaves little hope for an upstream fix, but we'll see.
Related: #1885, joining strings with different encodings.
But the following works on Windows:
~"成交日期"
CC @hadley
> data_frame(a = 1) %>% setNames("成交日期")
# A tibble: 1 × 1
  `<U+6210><U+4EA4><U+65E5><U+671F>`
                               <dbl>
1                                  1
Some of the encoding problems (e.g., grouping) seem to be caused by converting character to symbol and back.
On Linux (UTF-8 locale):
On Windows (latin-1 locale):
On Windows, one test fails because of that.
So, currently we should suggest using ASCII (or at least native-encoded) column names, and not fiddle with the encoding, in particular not set it to UTF-8 on Windows.
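One way to act on this advice is to flag column names that are not plain ASCII before they reach dplyr. The helper below is a sketch only; `non_ascii_names()` is a hypothetical name, not an existing dplyr or base R function.

```r
# Hypothetical helper: return the column names that contain non-ASCII characters
non_ascii_names <- function(df) {
  nms <- names(df)
  is_ascii <- vapply(
    nms,
    # TRUE only if every code point is below 128; invalid strings count as non-ASCII
    function(n) isTRUE(all(utf8ToInt(enc2utf8(n)) < 128L)),
    logical(1)
  )
  nms[!is_ascii]
}

df <- data.frame(a = 1, b = 2)
names(df)[2] <- "\u6210\u4ea4\u65e5\u671f"
non_ascii_names(df)  # only the second name is flagged
```

This checks only for ASCII, the stricter of the two options mentioned above; names that are non-ASCII but representable in the native encoding are flagged as well.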
A lot of internal dplyr logic seems to be based on the symbol type, so I don't see a quick solution here.