Re-encode character columns and column names to UTF-8 #87

krlmlr · 2016-06-04T20:14:20Z

During construction and coercion. If possible, also during assignment.

hadley · 2016-06-08T19:26:44Z

Also need to provide a function to alert people to this problem, and to fix it themselves.

Might be worth considering detecting the problem on print and warning the user. OTOH that's a rather expensive check because you have spider the complete list (including all attributes). We should at least have a way for the user to call this themselves.

hadley · 2016-06-09T13:01:46Z

We could definitely have a warning if the column names are not all UTF-8 - that's one of the most common causes of problems.

@hadley

@hadley: Can we use this for tibble?

krlmlr · 2016-06-21T23:26:48Z

Not sure, see tidyverse/dplyr#1950.

hadley · 2016-08-08T13:34:09Z

I think the most important thing is to check the encoding of the column names. I think this means:

always encode as UTF-8 on creation
provide a helper method to re-encode if necessary
on print, warn about non-UTF8 names

krlmlr · 2016-08-16T09:30:31Z

I'd like to release utf8 before we can proceed here, because there we have efficient methods for checking the encoding of a character value.

krlmlr · 2016-08-16T22:34:37Z

Apart from this and #151, I think the package is ready for another CRAN release.

To recap: Column names should be in the native encoding (with the ugly <U+xxxx> formatting if necessary), because all language objects are natively encoded anyway. Column data should be in UTF-8.

How about making sure that column headers are in native encoding first, and postponing the encoding of the column data until later?

hadley · 2016-08-16T22:41:18Z

Yeah, that sounds good. But I think we can leave to the next release? I need to think through the tradeoffs.

krlmlr · 2016-08-16T22:50:17Z

#150 -> #151, but releasing now is probably safer anyway.

krlmlr · 2016-08-19T18:17:42Z

A comment from @JanSchulz (#104 (comment)):

regarding windows <U+....> printing: r-lib/evaluate#66 This problem seems to happen quite deep in R. We had quite a lot of fun in IRkernel (or better in repr) with that because we use sink (or better evaluate does) to get the output of a computation: https://github.com/IRkernel/repr/blob/master/R/repr_matrix_df.r#L16-L27

krlmlr · 2017-01-25T20:24:51Z

A way to find a locale with latin-1 encoding:

system2("locale", "-a", stdout = TRUE) %>% grep("8859", ., value = TRUE)

Works on my Ubuntu and OS X machines. This can then be used for LC_CTYPE:

withr::with_locale(
  c(LC_CTYPE = system2("locale", "-a", stdout = TRUE) %>% grep(8859, ., value = TRUE) %>% .[[1]]),
  Encoding(enc2native("ä")))
## [1] "latin1"

hadley · 2017-01-25T22:08:28Z

Another useful locale from Winston: https://gist.github.com/wch/3a629cfe575846a14207

krlmlr · 2018-08-21T07:14:49Z

Fixing encoding here (i.e. encoding as native) should be responsibility of tidy_names() too. Need to double-check if this should be done for all tibbles.

Fixes #463 Exposure (or lack thereof) of syntactic argument in tibble() and as_tibble() Fixes #461 Advertise .tidy_names Fixed #446 Justify tidy_names --- Other name-related tibble issues * Finalize interface of `.name_repair = <A_FUNCTION>` #476 *let's decide/discuss there* * tibble doesn't make it easy to get variable names that match our style guide #472 *let's solve in follow-up PR* * names<- method should protect against invalid names #466 *let's solve in follow-up PR* * Move name repair to vctrs #464 *PR lays groundwork for this long-term goal* * revisit strategy for making names syntactic #459 *let's solve in follow-up PR* * Re-encode character columns and column names to UTF-8 #87 *relevant, but I don't touch this* Related issue in [tidyverse/principles](https://github.com/tidyverse/principles): * [Vector names](tidyverse/design#4) --- Overview: * Defines and motivates 3 levels of repaired names: - `minimal`: not `NULL`, no `NA`s, à la `rlang::names2()` - `unique`: `minimal` + no `""`s and no dupes (cf `tidy_names(syntactic = FALSE)`) - `syntactic`: `unique` + syntactic (cf `tidy_names(syntactic = TRUE)`) * Internal utility functions for each level of name repair, i.e. all combinations of "repair vs check" and "operate on the names vs the named vector". * `tibble()` and `as_tibble()` expose name repair via `.name_repair`. Replaces `.tidy_names` (piloted only in dev) and `as_tibble(..., validate = TRUE/FALSE)` (in CRAN version). * Default behaviour (`.name_repair = "assert_unique")`): do no repair but check that the names are `unique`. * `.name_repair = "none"` means just make sure the names are `minimal` * `.name_repair = "unique"` and `.name_repair = "syntactic"` mean what you think * `.name_repair` can also take a user-supplied function. --- To discuss: * ~Should `valid` just be `unique`?~ YES and I have made it so in the source. * Let us review the "names degrees of freedom" and facilities offered by base, tidyverse for independent control: - exist or not - unique or not - syntactic or not * Automatic naming, based on expression. We currently offer no control over this in `tibble()`. * How much to reveal re: long-term plans. For example, why `tidy_names()` and `set_tidy_names()` are still here, but haven't gained the obvious counterparts like `unique_names()` and `set_syntactic_names()`. Do we need a Life Cycle section in the docs? * My backwards compatibility report

krlmlr · 2018-10-27T15:31:39Z

We now have:

library(magrittr)

loc_info <- c(LC_CTYPE = system2("locale", "-a", stdout = TRUE) %>% grep(8859, ., value = TRUE) %>% .[[1]])
loc_info
#>         LC_CTYPE 
#> "en_US.iso88591"

alien <- "\u5e78\u798f"
alien
#> [1] "幸福"

withr::with_locale(
  loc_info,
  {
    print(alien)
    print(make.names(alien))
    print(tibble:::make_syntactic(alien))
  }
)
#> [1] "<U+5E78><U+798F>"
#> [1] "X.U.5E78..U.798F."
#> [1] ".U.5E78..U.798F."

^{Created on 2018-10-27 by the reprex package (v0.2.1.9000)}

This means that "universal" names won't have any problems here, and I'm not sure if the transliteration should be part of a less restrictive strategy.

The only remaining question here is if we should attempt to make these names look nicer, e.g. ".." in the example above. CC @jennybc.

krlmlr · 2021-02-23T06:06:25Z

Name repair is in vctrs now and seems to do well with different encodings.

github-actions · 2022-02-24T01:03:45Z

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.

hadley changed the title ~~Add UTF-8 encoding to character columns~~ Re-encode character columns to UTF-8 Jun 8, 2016

krlmlr added the ready label Jun 13, 2016

krlmlr referenced this issue in krlmlr/enc Jun 14, 2016

show resistance to encoding updates

2bcf680

@hadley: Can we use this for tibble?

krlmlr modified the milestone: 2.0 Aug 8, 2016

krlmlr changed the title ~~Re-encode character columns to UTF-8~~ Re-encode character columns and column names to UTF-8 Aug 9, 2016

krlmlr removed this from the 1.2 milestone Aug 16, 2016

krlmlr mentioned this issue Aug 19, 2016

Table headers with UTF-8 characters outside current locale #135

Closed

krlmlr added the encoding 🗺️ label Aug 26, 2016

krlmlr modified the milestone: 1.3 Aug 26, 2016

krlmlr self-assigned this Aug 29, 2016

krlmlr added in progress and removed ready labels Aug 29, 2016

krlmlr modified the milestones: 1.4.0, 1.5.0 Aug 21, 2018

jennybc mentioned this issue Aug 24, 2018

Refactor and revise name repair #469

Merged

wush978 mentioned this issue Dec 9, 2018

Korean letter encoding error on Windows r-dbi/RSQLite#233

Closed

krlmlr removed this from the 2.0.0 milestone Jan 8, 2019

krlmlr closed this as completed Feb 23, 2021

github-actions bot locked and limited conversation to collaborators Feb 24, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Re-encode character columns and column names to UTF-8 #87

Re-encode character columns and column names to UTF-8 #87

krlmlr commented Jun 4, 2016 •

edited

Loading

hadley commented Jun 8, 2016

hadley commented Jun 9, 2016

krlmlr commented Jun 21, 2016

hadley commented Aug 8, 2016

krlmlr commented Aug 16, 2016

krlmlr commented Aug 16, 2016 •

edited

Loading

hadley commented Aug 16, 2016

krlmlr commented Aug 16, 2016

krlmlr commented Aug 19, 2016

krlmlr commented Jan 25, 2017

hadley commented Jan 25, 2017

krlmlr commented Aug 21, 2018

krlmlr commented Oct 27, 2018

krlmlr commented Feb 23, 2021

github-actions bot commented Feb 24, 2022

Re-encode character columns and column names to UTF-8 #87

Re-encode character columns and column names to UTF-8 #87

Comments

krlmlr commented Jun 4, 2016 • edited Loading

hadley commented Jun 8, 2016

hadley commented Jun 9, 2016

krlmlr commented Jun 21, 2016

hadley commented Aug 8, 2016

krlmlr commented Aug 16, 2016

krlmlr commented Aug 16, 2016 • edited Loading

hadley commented Aug 16, 2016

krlmlr commented Aug 16, 2016

krlmlr commented Aug 19, 2016

krlmlr commented Jan 25, 2017

hadley commented Jan 25, 2017

krlmlr commented Aug 21, 2018

krlmlr commented Oct 27, 2018

krlmlr commented Feb 23, 2021

github-actions bot commented Feb 24, 2022

krlmlr commented Jun 4, 2016 •

edited

Loading

krlmlr commented Aug 16, 2016 •

edited

Loading