Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-encode character columns and column names to UTF-8 #87

Closed
krlmlr opened this issue Jun 4, 2016 · 15 comments
Closed

Re-encode character columns and column names to UTF-8 #87

krlmlr opened this issue Jun 4, 2016 · 15 comments

Comments

@krlmlr
Copy link
Member

krlmlr commented Jun 4, 2016

tidyverse/dplyr#1885 (comment)

During construction and coercion. If possible, also during assignment.

@hadley hadley changed the title Add UTF-8 encoding to character columns Re-encode character columns to UTF-8 Jun 8, 2016
@hadley
Copy link
Member

hadley commented Jun 8, 2016

Also need to provide a function to alert people to this problem, and to fix it themselves.

Might be worth considering detecting the problem on print and warning the user. OTOH that's a rather expensive check because you have spider the complete list (including all attributes). We should at least have a way for the user to call this themselves.

@hadley
Copy link
Member

hadley commented Jun 9, 2016

We could definitely have a warning if the column names are not all UTF-8 - that's one of the most common causes of problems.

@krlmlr krlmlr added the ready label Jun 13, 2016
krlmlr referenced this issue in krlmlr/enc Jun 14, 2016
@hadley: Can we use this for tibble?
@krlmlr
Copy link
Member Author

krlmlr commented Jun 21, 2016

Not sure, see tidyverse/dplyr#1950.

@hadley
Copy link
Member

hadley commented Aug 8, 2016

I think the most important thing is to check the encoding of the column names. I think this means:

  • always encode as UTF-8 on creation
  • provide a helper method to re-encode if necessary
  • on print, warn about non-UTF8 names

@krlmlr krlmlr modified the milestone: 2.0 Aug 8, 2016
@krlmlr krlmlr changed the title Re-encode character columns to UTF-8 Re-encode character columns and column names to UTF-8 Aug 9, 2016
@krlmlr
Copy link
Member Author

krlmlr commented Aug 16, 2016

I'd like to release utf8 before we can proceed here, because there we have efficient methods for checking the encoding of a character value.

@krlmlr
Copy link
Member Author

krlmlr commented Aug 16, 2016

Apart from this and #151, I think the package is ready for another CRAN release.

To recap: Column names should be in the native encoding (with the ugly <U+xxxx> formatting if necessary), because all language objects are natively encoded anyway. Column data should be in UTF-8.

How about making sure that column headers are in native encoding first, and postponing the encoding of the column data until later?

@hadley
Copy link
Member

hadley commented Aug 16, 2016

Yeah, that sounds good. But I think we can leave to the next release? I need to think through the tradeoffs.

@krlmlr krlmlr removed this from the 1.2 milestone Aug 16, 2016
@krlmlr
Copy link
Member Author

krlmlr commented Aug 16, 2016

#150 -> #151, but releasing now is probably safer anyway.

@krlmlr
Copy link
Member Author

krlmlr commented Aug 19, 2016

A comment from @JanSchulz (#104 (comment)):

regarding windows <U+....> printing: r-lib/evaluate#66 This problem seems to happen quite deep in R. We had quite a lot of fun in IRkernel (or better in repr) with that because we use sink (or better evaluate does) to get the output of a computation: https://github.com/IRkernel/repr/blob/master/R/repr_matrix_df.r#L16-L27

@krlmlr krlmlr modified the milestone: 1.3 Aug 26, 2016
@krlmlr krlmlr self-assigned this Aug 29, 2016
@krlmlr krlmlr added in progress and removed ready labels Aug 29, 2016
@krlmlr
Copy link
Member Author

krlmlr commented Jan 25, 2017

A way to find a locale with latin-1 encoding:

system2("locale", "-a", stdout = TRUE) %>% grep("8859", ., value = TRUE)

Works on my Ubuntu and OS X machines. This can then be used for LC_CTYPE:

withr::with_locale(
  c(LC_CTYPE = system2("locale", "-a", stdout = TRUE) %>% grep(8859, ., value = TRUE) %>% .[[1]]),
  Encoding(enc2native("ä")))
## [1] "latin1"

@hadley
Copy link
Member

hadley commented Jan 25, 2017

Another useful locale from Winston: https://gist.github.com/wch/3a629cfe575846a14207

@krlmlr
Copy link
Member Author

krlmlr commented Aug 21, 2018

Fixing encoding here (i.e. encoding as native) should be responsibility of tidy_names() too. Need to double-check if this should be done for all tibbles.

@krlmlr krlmlr modified the milestones: 1.4.0, 1.5.0 Aug 21, 2018
krlmlr pushed a commit that referenced this issue Aug 30, 2018
Fixes #463 Exposure (or lack thereof) of syntactic argument in tibble() and as_tibble() 
Fixes #461 Advertise .tidy_names
Fixed #446 Justify tidy_names

---

Other name-related tibble issues

  * Finalize interface of `.name_repair = <A_FUNCTION>` #476 *let's decide/discuss there*
  * tibble doesn't make it easy to get variable names that match our style guide #472 *let's solve in follow-up PR*
  * names<- method should protect against invalid names #466 *let's solve in follow-up PR*
  * Move name repair to vctrs #464 *PR lays groundwork for this long-term goal*
  * revisit strategy for making names syntactic #459 *let's solve in follow-up PR*
  * Re-encode character columns and column names to UTF-8 #87 *relevant, but I don't touch this*

Related issue in [tidyverse/principles](https://github.com/tidyverse/principles):

  * [Vector names](tidyverse/design#4)

---

Overview:

  * Defines and motivates 3 levels of repaired names:
    - `minimal`: not `NULL`, no `NA`s, à la `rlang::names2()`
    - `unique`: `minimal` + no `""`s and no dupes (cf `tidy_names(syntactic = FALSE)`)
    - `syntactic`: `unique` + syntactic (cf `tidy_names(syntactic = TRUE)`)
  * Internal utility functions for each level of name repair, i.e. all combinations of "repair vs check" and "operate on the names vs the named vector".
  * `tibble()` and `as_tibble()` expose name repair via `.name_repair`. Replaces `.tidy_names` (piloted only in dev) and `as_tibble(..., validate = TRUE/FALSE)` (in CRAN version).
  * Default behaviour (`.name_repair = "assert_unique")`): do no repair but check that the names are `unique`.
  * `.name_repair = "none"` means just make sure the names are `minimal`
  * `.name_repair = "unique"` and `.name_repair = "syntactic"` mean what you think
  * `.name_repair` can also take a user-supplied function.

---

To discuss:

  * ~Should `valid` just be `unique`?~ YES and I have made it so in the source.
  * Let us review the "names degrees of freedom" and facilities offered by base, tidyverse for independent control:
    - exist or not
    - unique or not
    - syntactic or not
  * Automatic naming, based on expression. We currently offer no control over this in `tibble()`.
  * How much to reveal re: long-term plans. For example, why `tidy_names()` and `set_tidy_names()` are still here, but haven't gained the obvious counterparts like `unique_names()` and `set_syntactic_names()`. Do we need a Life Cycle section in the docs?
  * My backwards compatibility report
@krlmlr
Copy link
Member Author

krlmlr commented Oct 27, 2018

We now have:

library(magrittr)

loc_info <- c(LC_CTYPE = system2("locale", "-a", stdout = TRUE) %>% grep(8859, ., value = TRUE) %>% .[[1]])
loc_info
#>         LC_CTYPE 
#> "en_US.iso88591"

alien <- "\u5e78\u798f"
alien
#> [1] "幸福"

withr::with_locale(
  loc_info,
  {
    print(alien)
    print(make.names(alien))
    print(tibble:::make_syntactic(alien))
  }
)
#> [1] "<U+5E78><U+798F>"
#> [1] "X.U.5E78..U.798F."
#> [1] ".U.5E78..U.798F."

Created on 2018-10-27 by the reprex package (v0.2.1.9000)

This means that "universal" names won't have any problems here, and I'm not sure if the transliteration should be part of a less restrictive strategy.

The only remaining question here is if we should attempt to make these names look nicer, e.g. ".." in the example above. CC @jennybc.

@krlmlr
Copy link
Member Author

krlmlr commented Feb 23, 2021

Name repair is in vctrs now and seems to do well with different encodings.

@krlmlr krlmlr closed this as completed Feb 23, 2021
@github-actions
Copy link
Contributor

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.

@github-actions github-actions bot locked and limited conversation to collaborators Feb 24, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants