Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Offer backwards compatibility to < 1.2.0 in name repair #546

Closed
HughParsonage opened this issue Feb 2, 2019 · 6 comments
Closed

Offer backwards compatibility to < 1.2.0 in name repair #546

HughParsonage opened this issue Feb 2, 2019 · 6 comments

Comments

@HughParsonage
Copy link

Since the introduction of .name_repair changes the output of read_excel and since it is quite difficult to reproduce the same effect, it would be nice to offer a package option, say getOption("readxl.old_repair_names"), to provide backwards compatibility. Otherwise, the only alternative appears to be a hack using assignInNamespace. That is, instead of

assignInNamespace(ns = "readxl",
                  x = "set_readxl_names",
                  value = function (l, .name_repair = "unique") {
    tibble_version <- utils::packageVersion("tibble")
    if (FALSE && tibble_version > "1.4.2") {
        if (is.null(.name_repair)) {
            return(tibble::as_tibble(l))
        }
        else {
            return(tibble::as_tibble(l, .name_repair = .name_repair))
        }
    }
    tibble::repair_names(tibble::as_tibble(l, validate = FALSE), 
        prefix = "X", sep = "__")
})

Just have options("readxl.old_repair_names" = TRUE) and then modify set_readxl_names like

set_readxl_names <- function (l, .name_repair = "unique") {
    tibble_version <- utils::packageVersion("tibble")
    if (!getOption("readxl.old_repair_names", FALSE) && tibble_version > "1.4.2") {
        if (is.null(.name_repair)) {
            return(tibble::as_tibble(l))
        }
        else {
            return(tibble::as_tibble(l, .name_repair = .name_repair))
        }
    }
    tibble::repair_names(tibble::as_tibble(l, validate = FALSE), 
        prefix = "X", sep = "__")
})
@raredd
Copy link

raredd commented Feb 4, 2019

This update has broken so much code, how could this be overlooked?

@jennybc
Copy link
Member

jennybc commented Feb 13, 2019

@HughParsonage

If you want the old style of name repair in the presence of the current CRAN versions of both readxl and tibble, I suggest that you define a function like legacy_repair(). Below I inline the logic from tibble::repair_names(). And then you can call it .name_repair argument.

library(readxl)

legacy_repair <- function(nms, prefix = "X", sep = "__") {
  if (length(nms) == 0) return(character())
  blank <- nms == ""
  nms[!blank] <- make.unique(nms[!blank], sep = sep)
  new_nms <- setdiff(paste(prefix, seq_along(nms), sep = sep), nms)
  nms[blank] <- new_nms[seq_len(sum(blank))]
  nms
}

legacy_repair(rep_len("", 3))
#> [1] "X__1" "X__2" "X__3"
legacy_repair(c("x", "", "x"))
#> [1] "x"    "X__1" "x__1"

readxl_test_sheet <- "~/rrr/readxl/tests/testthat/sheets/names-need-repair-xlsx.xlsx"

read_excel(readxl_test_sheet)
#> New names:
#> * `a b` -> `a b..1`
#> * `a b` -> `a b..2`
#> * `` -> `..3`
#> # A tibble: 2 x 4
#>   `a b..1` `a b..2` ..3   `c%&$`
#>      <dbl> <chr>    <chr>  <dbl>
#> 1        1 a        one      1.1
#> 2        2 b        two      2.2

read_excel(readxl_test_sheet, .name_repair = legacy_repair)
#> # A tibble: 2 x 4
#>   `a b` `a b__1` X__1  `c%&$`
#>   <dbl> <chr>    <chr>  <dbl>
#> 1     1 a        one      1.1
#> 2     2 b        two      2.2

Or you could decline name repair and use tibble::repair_names() after the fact.

library(readxl)

readxl_test_sheet <- "~/rrr/readxl/tests/testthat/sheets/names-need-repair-xlsx.xlsx"

x <- read_excel(readxl_test_sheet, .name_repair = "minimal")
x
#> # A tibble: 2 x 4
#>   `a b` `a b` ``    `c%&$`
#>   <dbl> <chr> <chr>  <dbl>
#> 1     1 a     one      1.1
#> 2     2 b     two      2.2

tibble::repair_names(
  tibble::as_tibble(x, validate = FALSE), prefix = "X", sep = "__"
)
#> # A tibble: 2 x 4
#>   `a b` `a b__1` X__1  `c%&$`
#>   <dbl> <chr>    <chr>  <dbl>
#> 1     1 a        one      1.1
#> 2     2 b        two      2.2

Or maybe stay on an earlier version of readxl or tibble until the time is right to take control of variable names? This should not affect sheet reads where the names are well-formed to begin with.

We really want to reduce the inconsistencies across tidyverse packages re: how column names are repaired. I know this requires some adjustments for people, but this was a considered move. It was also pre-announced (see comment I'm about to write in response to @raredd).

@jennybc
Copy link
Member

jennybc commented Feb 13, 2019

@raredd

This is not an oversight. readxl has been downloaded almost 600K times since its December 2018 release, from RStudio CRAN mirrors alone. This is the very first issue opened about name repair, about 6 weeks later. We go through complete revdep checks for readxl and for tibble and we search & analyze non-package code on GitHub before we make such changes. Changes to column name repair were pre-announced when v1.1.0 was released in April of 2018. There are costs to changing and to not changing and we think making this change is a net improvement to the ecosystem.

@jennybc jennybc closed this as completed Feb 13, 2019
@HughParsonage
Copy link
Author

Thank you very much @jennybc -- that solution works for the sheets I had had trouble with.

(Apologies for the lack of reprex and for seeming ungrateful -- I'm truly spoiled by this package.)

@jennybc
Copy link
Member

jennybc commented Feb 13, 2019

No worries @HughParsonage, I know this is a disruption and the average level of pain is not indicative of what any specific user experiences.

@raredd
Copy link

raredd commented Feb 14, 2019

@jennybc There are already four built-in name repairs, why not simply add a 5th that was the standard for however long since readxl was released?
Yes, it is an entire almost three months since Dec 2018, but only recently have I had to go back and update several projects which predate Dec 2018, all of which have thrown me errors about variable names. If I am the only one, I will send you a check.
As a package developer you know the vast majority of users will never even look at the other options of name repair much less use them--yet you support those three plus the new default, and already have a solution for legacy code. I honestly do not care what the default is--I'm sure you only have the best names--but adding a 5th option for name repair that hundreds of thousands of users have been using since readxl's first release seems apparent.

lionel- added a commit to lionel-/vctrs that referenced this issue Jun 19, 2019
lionel- added a commit to lionel-/vctrs that referenced this issue Jun 20, 2019
@lock lock bot locked and limited conversation to collaborators Feb 14, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants