Skip to content

Commit

Permalink
Refactor and revise name repair (#469)
Browse files Browse the repository at this point in the history
Fixes #463 Exposure (or lack thereof) of syntactic argument in tibble() and as_tibble() 
Fixes #461 Advertise .tidy_names
Fixed #446 Justify tidy_names

---

Other name-related tibble issues

  * Finalize interface of `.name_repair = <A_FUNCTION>` #476 *let's decide/discuss there*
  * tibble doesn't make it easy to get variable names that match our style guide #472 *let's solve in follow-up PR*
  * names<- method should protect against invalid names #466 *let's solve in follow-up PR*
  * Move name repair to vctrs #464 *PR lays groundwork for this long-term goal*
  * revisit strategy for making names syntactic #459 *let's solve in follow-up PR*
  * Re-encode character columns and column names to UTF-8 #87 *relevant, but I don't touch this*

Related issue in [tidyverse/principles](https://github.com/tidyverse/principles):

  * [Vector names](tidyverse/design#4)

---

Overview:

  * Defines and motivates 3 levels of repaired names:
    - `minimal`: not `NULL`, no `NA`s, à la `rlang::names2()`
    - `unique`: `minimal` + no `""`s and no dupes (cf `tidy_names(syntactic = FALSE)`)
    - `syntactic`: `unique` + syntactic (cf `tidy_names(syntactic = TRUE)`)
  * Internal utility functions for each level of name repair, i.e. all combinations of "repair vs check" and "operate on the names vs the named vector".
  * `tibble()` and `as_tibble()` expose name repair via `.name_repair`. Replaces `.tidy_names` (piloted only in dev) and `as_tibble(..., validate = TRUE/FALSE)` (in CRAN version).
  * Default behaviour (`.name_repair = "assert_unique")`): do no repair but check that the names are `unique`.
  * `.name_repair = "none"` means just make sure the names are `minimal`
  * `.name_repair = "unique"` and `.name_repair = "syntactic"` mean what you think
  * `.name_repair` can also take a user-supplied function.

---

To discuss:

  * ~Should `valid` just be `unique`?~ YES and I have made it so in the source.
  * Let us review the "names degrees of freedom" and facilities offered by base, tidyverse for independent control:
    - exist or not
    - unique or not
    - syntactic or not
  * Automatic naming, based on expression. We currently offer no control over this in `tibble()`.
  * How much to reveal re: long-term plans. For example, why `tidy_names()` and `set_tidy_names()` are still here, but haven't gained the obvious counterparts like `unique_names()` and `set_syntactic_names()`. Do we need a Life Cycle section in the docs?
  * My backwards compatibility report
  • Loading branch information
jennybc authored and krlmlr committed Aug 30, 2018
1 parent c2af751 commit e54aeeb
Show file tree
Hide file tree
Showing 17 changed files with 935 additions and 352 deletions.
10 changes: 5 additions & 5 deletions API
Expand Up @@ -7,9 +7,9 @@ add_column(.data, ..., .before = NULL, .after = NULL)
add_row(.data, ..., .before = NULL, .after = NULL)
as.tibble(x, ...)
as_data_frame(x, ...)
as_tibble(x, ..., .rows = NULL, .tidy_names = NULL, rownames = pkgconfig::get_config("tibble::rownames", NULL))
as_tibble(x, ..., .rows = NULL, .name_repair = c("assert_unique", "unique", "syntactic", "none", "minimal"), rownames = pkgconfig::get_config("tibble::rownames", NULL))
column_to_rownames(.data, var = "rowname")
data_frame(..., .rows = NULL, .tidy_names = NULL)
data_frame(..., .rows = NULL, .name_repair = c("assert_unique", "unique", "syntactic", "none", "minimal"))
data_frame_(xs)
deframe(x)
enframe(x, name = "name", value = "value")
Expand All @@ -29,7 +29,7 @@ rowid_to_column(.data, var = "rowid")
rownames_to_column(.data, var = "rowname")
set_tidy_names(x, syntactic = FALSE, quiet = FALSE)
tbl_sum(x)
tibble(..., .rows = NULL, .tidy_names = NULL)
tibble(..., .rows = NULL, .name_repair = c("assert_unique", "unique", "syntactic", "none", "minimal"))
tibble_(xs)
tidy_names(name, syntactic = FALSE, quiet = FALSE)
tribble(...)
Expand All @@ -39,9 +39,9 @@ view(x, title = NULL, ...)
## Own S3 methods

as_tibble.NULL(x, ...)
as_tibble.data.frame(x, validate = TRUE, ..., .rows = NULL, .tidy_names = if (validate) NULL else FALSE, rownames = pkgconfig::get_config("tibble::rownames", NULL))
as_tibble.data.frame(x, validate = TRUE, ..., .rows = NULL, .name_repair = c("assert_unique", "unique", "syntactic", "none", "minimal"), rownames = pkgconfig::get_config("tibble::rownames", NULL))
as_tibble.default(x, ...)
as_tibble.list(x, validate = TRUE, ..., .rows = NULL, .tidy_names = if (validate) NULL else FALSE)
as_tibble.list(x, validate = TRUE, ..., .rows = NULL, .name_repair = c("assert_unique", "unique", "syntactic", "none", "minimal"))
as_tibble.matrix(x, ...)
as_tibble.poly(x, ...)
as_tibble.table(x, `_n` = "n", ..., n = `_n`)
Expand Down
2 changes: 1 addition & 1 deletion DESCRIPTION
Expand Up @@ -24,12 +24,12 @@ Imports:
rlang (>= 0.2.2),
utils
Suggests:
bench,
covr,
dplyr,
htmltools,
import,
knitr (>= 1.5.32),
microbenchmark,
mockr,
nycflights13,
rmarkdown,
Expand Down
21 changes: 10 additions & 11 deletions NEWS.md
Expand Up @@ -27,17 +27,16 @@ The `tibble()` and `as_tibble()` functions, and the low-level `new_tibble()` con

- The new `.rows` argument to `tibble()` and `as_tibble()` allows specifying the expected number of rows explicitly, even if it's evident from the data. This allows writing more defensive code.

- Tibbles are now allowed to carry duplicate, empty or `NA` names, but checking is enabled by default, even when passing tibbles to `as_tibble()`. The new `.tidy_names` argument to `tibble()` and `as_tibble()` controls renaming:
- Column name repair has more direct support, via the new `.name_repair` argument to `tibble()` and `as_tibble()`. It takes the following values:

- `NULL`: default, throw an error if there are any missing or duplicated names,
- `FALSE`: deliberately request a tibble with invalid names,
- `TRUE`: apply `tidy_names()` to the names,
- a function: apply custom name repair (e.g., `.tidy_names = make.names`
to get base R equivalence).

Non-syntactic names will not be changed even with `.tidy_names = TRUE`, but might be when using a custom name repair such as `make.names`.

The `validate` argument is deprecated but supported (with a warning).
- `"none"`: No name repair or checks, beyond basic existence.
- `"minimal"`: Same as `"none"`.
- `"unique"`: Make sure names are unique and not empty.
- `"assert_unique"`: (default value), no name repair, but check they are `unique`.
- `"syntactic"`: Make the names `unique` and syntactic.
- a function: apply custom name repair (e.g., `.name_repair = make.names` for names in the style of base R).

The `validate` argument of `as_tibble()` is deprecated but supported (emits a message). Use `.name_repair = "none"` instead of `validate = FALSE` and `.name_repair = "assert_unique"` instead of `validate = TRUE` (#469, @jennybc).

- Row name handling is stricter. Row names are never (and never were) supported in `tibble()` and `new_tibble()`, and are now stripped by default in `as_tibble()`. The `rownames` argument to `as_tibble()` supports:

Expand All @@ -50,7 +49,7 @@ The `tibble()` and `as_tibble()` functions, and the low-level `new_tibble()` con

- `new_tibble()` and `as_tibble()` now also strip the `"dim"` attribute from columns that are one-dimensional arrays. (`tibble()` already did this before.)

- Internally, all `as_tibble()` implementation forward all extra arguments and `...` to `as_tibble.list()` where they are handled. This means that the common `.rows` and `.tidy_names` can be used for all inputs. We suggest that your implementations of this method do the same.
- Internally, all `as_tibble()` implementation forward all extra arguments and `...` to `as_tibble.list()` where they are handled. This means that the common `.rows` and `.name_repair` can be used for all inputs. We suggest that your implementations of this method do the same.

- `enframe()` (with `name = NULL`) and `deframe()` now support one-column tibbles (#449).

Expand Down
86 changes: 30 additions & 56 deletions R/as_tibble.R
Expand Up @@ -25,7 +25,8 @@
#' [.onLoad()] function.
#'
#' @seealso
#' [enframe()] converts a vector to a data frame with values in rows.
#' [enframe()] converts a vector to a data frame with values in rows,
#' [name-repair] documents the details of name repair.
#'
#' [pkgconfig::set_config()]
#'
Expand All @@ -48,32 +49,26 @@
#' colnames(m) <- c("a", "b", "c", "d", "e")
#' df <- as_tibble(m)
#'
#' as_tibble(1:3, .tidy_names = TRUE)
#' as_tibble(1:3, .name_repair = "unique")
#'
#' # as_tibble is considerably simpler than as.data.frame
#' # making it more suitable for use when you have things that are
#' # lists
#' # For list-like inputs, as_tibble() is considerably simpler than as.data.frame()
#' \dontrun{
#' if (requireNamespace("microbenchmark", quietly = TRUE)) {
#' if (requireNamespace("bench", quietly = TRUE)) {
#' l2 <- replicate(26, sample(letters), simplify = FALSE)
#' names(l2) <- letters
#' microbenchmark::microbenchmark(
#' as_tibble(l2, validate = FALSE),
#' bench::mark(
#' as_tibble(l2, .name_repair = "syntactic"),
#' as_tibble(l2, .name_repair = "unique"),
#' as_tibble(l2, .name_repair = "none"),
#' as_tibble(l2),
#' as.data.frame(l2)
#' as.data.frame(l2),
#' check = FALSE
#' )
#' }
#'
#' if (requireNamespace("microbenchmark", quietly = TRUE)) {
#' m <- matrix(runif(26 * 100), ncol = 26)
#' colnames(m) <- letters
#' microbenchmark::microbenchmark(
#' as_tibble(m),
#' as.data.frame(m)
#' )
#' }
#' }
as_tibble <- function(x, ..., .rows = NULL, .tidy_names = NULL,
as_tibble <- function(x, ...,
.rows = NULL,
.name_repair = c("assert_unique", "unique", "syntactic", "none", "minimal"),
rownames = pkgconfig::get_config("tibble::rownames", NULL)) {
UseMethod("as_tibble")
}
Expand All @@ -82,17 +77,18 @@ as_tibble <- function(x, ..., .rows = NULL, .tidy_names = NULL,
#' @rdname as_tibble
as_tibble.data.frame <- function(x, validate = TRUE, ...,
.rows = NULL,
.tidy_names = if (validate) NULL else FALSE,
.name_repair = c("assert_unique", "unique", "syntactic", "none", "minimal"),
rownames = pkgconfig::get_config("tibble::rownames", NULL)) {
if (!missing(validate)) {
warn("The `validate` argument to `as_tibble()` is deprecated. Please use `.tidy_names` to control column names.")
message("The `validate` argument to `as_tibble()` is deprecated. Please use `.name_repair` to control column names.")
.name_repair <- if (isTRUE(validate)) "assert_unique" else "none"
}

old_rownames <- raw_rownames(x)
if (is.null(.rows)) {
.rows <- nrow(x)
}
result <- as_tibble(unclass(x), ..., .rows = .rows, .tidy_names = .tidy_names)
result <- as_tibble(unclass(x), ..., .rows = .rows, .name_repair = .name_repair)

if (is.null(rownames)) {
result
Expand All @@ -110,47 +106,20 @@ as_tibble.data.frame <- function(x, validate = TRUE, ...,
#' @export
#' @rdname as_tibble
as_tibble.list <- function(x, validate = TRUE, ..., .rows = NULL,
.tidy_names = if (validate) NULL else FALSE) {
.name_repair = c("assert_unique", "unique", "syntactic", "none", "minimal")) {
if (!missing(validate)) {
warn("The `validate` argument to `as_tibble()` is deprecated. Please use `.tidy_names` to control column names.")
message("The `validate` argument to `as_tibble()` is deprecated. Please use `.name_repair` to control column names.")
.name_repair <- if (isTRUE(validate)) "assert_unique" else "none"
}

x <- auto_tidy_names(x, .tidy_names)
x <- set_repaired_names(x, .name_repair)
x <- check_valid_cols(x)
recycle_columns(x, .rows)
}

auto_tidy_names <- function(x, .tidy_names) {
if (is.null(.tidy_names)) {
x <- check_tibble(x)
} else if (isTRUE(.tidy_names)) {
names(x) <- tidy_names(names2(x))
} else if (is_function(.tidy_names)) {
names(x) <- .tidy_names(names2(x))
} else if (identical(.tidy_names, FALSE)) {
if (has_null_names(x)) {
names(x) <- rlang::rep_along(x, "")
}
} else {
abort(error_tidy_names_arg())
}
x
}

check_tibble <- function(x) {
# Names
check_valid_cols <- function(x) {
names_x <- names2(x)
bad_name <- which(is.na(names_x) | names_x == "")
if (has_length(bad_name)) {
abort(error_column_must_be_named(bad_name))
}

dups <- which(duplicated(names_x))
if (has_length(dups)) {
abort(error_column_must_have_unique_name(names_x[dups]))
}

# Types
is_xd <- which(!map_lgl(x, is_1d))
is_xd <- which(!map_lgl(x, is_1d_or_2d))
if (has_length(is_xd)) {
classes <- map_chr(x[is_xd], function(x) class(x)[[1]])
abort(error_column_must_be_vector(names_x[is_xd], classes))
Expand All @@ -164,6 +133,11 @@ check_tibble <- function(x) {
x
}

is_1d_or_2d <- function(x) {
# dimension check is for matrices and data.frames
(is_vector(x) && !needs_dim(x)) || is.data.frame(x) || is.matrix(x)
}

recycle_columns <- function(x, .rows) {
lengths <- col_lengths(x)
nrow <- guess_nrow(lengths, .rows)
Expand Down
16 changes: 12 additions & 4 deletions R/msg.R
Expand Up @@ -84,12 +84,16 @@ error_inconsistent_new_rows <- function(names) {
)
}

error_names_must_be_non_null <- function() {
"The `names` must not be `NULL`."
}

error_column_must_be_named <- function(names) {
invalid_df("must be named", names)
}

error_column_must_have_unique_name <- function(names) {
invalid_df("must have [a ]unique name(s)", names)
error_column_names_must_be_unique <- function(names) {
pluralise_commas("Column name(s) ", tick(names), " must not be duplicated.")
}

error_column_must_be_vector <- function(names, classes) {
Expand Down Expand Up @@ -180,15 +184,19 @@ error_tribble_non_rectangular <- function(cols, cells) {
)
}

error_name_length_required <- function() {
"`n` must be specified, when the `names` attribute is `NULL`."
}

error_frame_matrix_list <- function(pos) {
bullets(
"All values in `frame_matrix()` must be atomic:",
pluralise_commas("Found list-valued element(s) at position(s) ", pos, ".")
)
}

error_tidy_names_arg <- function() {
"The `.tidy_names` argument must be NULL, TRUE, FALSE, or a function."
error_name_repair_arg <- function() {
"The `.name_repair` argument must be a string or a function that specifies the name repair strategy."
}

error_new_tibble_needs_nrow <- function() {
Expand Down

0 comments on commit e54aeeb

Please sign in to comment.