-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Re-encode character columns and column names to UTF-8 #87
Comments
Also need to provide a function to alert people to this problem, and to fix it themselves. Might be worth considering detecting the problem on print and warning the user. OTOH that's a rather expensive check because you have spider the complete list (including all attributes). We should at least have a way for the user to call this themselves. |
We could definitely have a warning if the column names are not all UTF-8 - that's one of the most common causes of problems. |
Not sure, see tidyverse/dplyr#1950. |
I think the most important thing is to check the encoding of the column names. I think this means:
|
I'd like to release utf8 before we can proceed here, because there we have efficient methods for checking the encoding of a character value. |
Apart from this and #151, I think the package is ready for another CRAN release. To recap: Column names should be in the native encoding (with the ugly How about making sure that column headers are in native encoding first, and postponing the encoding of the column data until later? |
Yeah, that sounds good. But I think we can leave to the next release? I need to think through the tradeoffs. |
A comment from @JanSchulz (#104 (comment)): regarding windows |
A way to find a locale with latin-1 encoding: system2("locale", "-a", stdout = TRUE) %>% grep("8859", ., value = TRUE) Works on my Ubuntu and OS X machines. This can then be used for withr::with_locale(
c(LC_CTYPE = system2("locale", "-a", stdout = TRUE) %>% grep(8859, ., value = TRUE) %>% .[[1]]),
Encoding(enc2native("ä")))
## [1] "latin1" |
Another useful locale from Winston: https://gist.github.com/wch/3a629cfe575846a14207 |
Fixing encoding here (i.e. encoding as native) should be responsibility of |
Fixes #463 Exposure (or lack thereof) of syntactic argument in tibble() and as_tibble() Fixes #461 Advertise .tidy_names Fixed #446 Justify tidy_names --- Other name-related tibble issues * Finalize interface of `.name_repair = <A_FUNCTION>` #476 *let's decide/discuss there* * tibble doesn't make it easy to get variable names that match our style guide #472 *let's solve in follow-up PR* * names<- method should protect against invalid names #466 *let's solve in follow-up PR* * Move name repair to vctrs #464 *PR lays groundwork for this long-term goal* * revisit strategy for making names syntactic #459 *let's solve in follow-up PR* * Re-encode character columns and column names to UTF-8 #87 *relevant, but I don't touch this* Related issue in [tidyverse/principles](https://github.com/tidyverse/principles): * [Vector names](tidyverse/design#4) --- Overview: * Defines and motivates 3 levels of repaired names: - `minimal`: not `NULL`, no `NA`s, à la `rlang::names2()` - `unique`: `minimal` + no `""`s and no dupes (cf `tidy_names(syntactic = FALSE)`) - `syntactic`: `unique` + syntactic (cf `tidy_names(syntactic = TRUE)`) * Internal utility functions for each level of name repair, i.e. all combinations of "repair vs check" and "operate on the names vs the named vector". * `tibble()` and `as_tibble()` expose name repair via `.name_repair`. Replaces `.tidy_names` (piloted only in dev) and `as_tibble(..., validate = TRUE/FALSE)` (in CRAN version). * Default behaviour (`.name_repair = "assert_unique")`): do no repair but check that the names are `unique`. * `.name_repair = "none"` means just make sure the names are `minimal` * `.name_repair = "unique"` and `.name_repair = "syntactic"` mean what you think * `.name_repair` can also take a user-supplied function. --- To discuss: * ~Should `valid` just be `unique`?~ YES and I have made it so in the source. * Let us review the "names degrees of freedom" and facilities offered by base, tidyverse for independent control: - exist or not - unique or not - syntactic or not * Automatic naming, based on expression. We currently offer no control over this in `tibble()`. * How much to reveal re: long-term plans. For example, why `tidy_names()` and `set_tidy_names()` are still here, but haven't gained the obvious counterparts like `unique_names()` and `set_syntactic_names()`. Do we need a Life Cycle section in the docs? * My backwards compatibility report
We now have: library(magrittr)
loc_info <- c(LC_CTYPE = system2("locale", "-a", stdout = TRUE) %>% grep(8859, ., value = TRUE) %>% .[[1]])
loc_info
#> LC_CTYPE
#> "en_US.iso88591"
alien <- "\u5e78\u798f"
alien
#> [1] "幸福"
withr::with_locale(
loc_info,
{
print(alien)
print(make.names(alien))
print(tibble:::make_syntactic(alien))
}
)
#> [1] "<U+5E78><U+798F>"
#> [1] "X.U.5E78..U.798F."
#> [1] ".U.5E78..U.798F." Created on 2018-10-27 by the reprex package (v0.2.1.9000) This means that "universal" names won't have any problems here, and I'm not sure if the transliteration should be part of a less restrictive strategy. The only remaining question here is if we should attempt to make these names look nicer, e.g. |
Name repair is in vctrs now and seems to do well with different encodings. |
This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary. |
tidyverse/dplyr#1885 (comment)
During construction and coercion. If possible, also during assignment.
The text was updated successfully, but these errors were encountered: