New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor and revise name repair #469

Merged
merged 44 commits into from Aug 30, 2018

Conversation

Projects
None yet
4 participants
@jennybc
Copy link
Member

jennybc commented Aug 24, 2018

Fixes #463 Exposure (or lack thereof) of syntactic argument in tibble() and as_tibble()
Fixes #461 Advertise .tidy_names
Fixed #446 Justify tidy_names


Other name-related tibble issues

  • Finalize interface of .name_repair = <A_FUNCTION> #476 let's decide/discuss there
  • tibble doesn't make it easy to get variable names that match our style guide #472 let's solve in follow-up PR
  • names<- method should protect against invalid names #466 let's solve in follow-up PR
  • Move name repair to vctrs #464 PR lays groundwork for this long-term goal
  • revisit strategy for making names syntactic #459 let's solve in follow-up PR
  • Re-encode character columns and column names to UTF-8 #87 relevant, but I don't touch this

Related issue in tidyverse/principles:


Overview:

  • Defines and motivates 3 levels of repaired names:
    • minimal: not NULL, no NAs, à la rlang::names2()
    • unique: minimal + no ""s and no dupes (cf tidy_names(syntactic = FALSE))
    • syntactic: unique + syntactic (cf tidy_names(syntactic = TRUE))
  • Internal utility functions for each level of name repair, i.e. all combinations of "repair vs check" and "operate on the names vs the named vector".
  • tibble() and as_tibble() expose name repair via .name_repair. Replaces .tidy_names (piloted only in dev) and as_tibble(..., validate = TRUE/FALSE) (in CRAN version).
  • Default behaviour (.name_repair = "assert_unique")): do no repair but check that the names are unique.
  • .name_repair = "none" means just make sure the names are minimal
  • .name_repair = "unique" and .name_repair = "syntactic" mean what you think
  • .name_repair can also take a user-supplied function.

To discuss:

  • Should valid just be unique? YES and I have made it so in the source.
  • Let us review the "names degrees of freedom" and facilities offered by base, tidyverse for independent control:
    • exist or not
    • unique or not
    • syntactic or not
  • Automatic naming, based on expression. We currently offer no control over this in tibble().
  • How much to reveal re: long-term plans. For example, why tidy_names() and set_tidy_names() are still here, but haven't gained the obvious counterparts like unique_names() and set_syntactic_names(). Do we need a Life Cycle section in the docs?
  • My backwards compatibility report

@jennybc jennybc requested review from krlmlr and hadley Aug 24, 2018

#'
#' All `tidy` names are `valid`, all `valid` names are `minimal`.
#'
#' For each LEVEL in ("minimal", "valid", "tidy"), there is a family of utility

This comment has been minimized.

@jennybc

jennybc Aug 24, 2018

Member

This is currently a lie, because there are no checkers for "tidy" names (yet). Because I don't think we've truly decided what the strategy will be.

This comment has been minimized.

@hadley

hadley Aug 24, 2018

Member

I'd prefer to not export these now because they seem like they might more naturally belong in vctrs, and in tibble we can expose through an argument

This comment has been minimized.

@jennybc

jennybc Aug 24, 2018

Member

OK I get that. Note that tidy_names() and set_tidy_names() already exist in tibble and are exported. And this PR changes their meaning, by creating this minimal < valid < tidy system. So we have to deal with that.

This comment has been minimized.

@krlmlr

krlmlr Aug 25, 2018

Member

In tidy_names() I see "tidy" as a verb. Maybe clean_names() and validate_names() ?

make_syntactic <- function(name, syntactic) {
if (!syntactic) return(name)

make_syntactic <- function(name) {

This comment has been minimized.

@jennybc

jennybc Aug 24, 2018

Member

There's still lots to still think about and maybe even do here. But I haven't gotten into it.

This comment has been minimized.

@jennybc

jennybc Aug 24, 2018

Member

For example, seems like tidy_names(c(n1, n2)) should be same as tidy_names(c(tidy_names(n1), tidy_names(n2))). Or if not, that should be justified. Perhaps this is already true? But that's one example of what I feel like I haven't thought through yet.

This comment has been minimized.

@hadley

hadley Aug 24, 2018

Member

And tidying should be idempotent - i.e. tidy_names(x) == tidy_names(tidy_names(x))

This comment has been minimized.

@jennybc

jennybc Aug 25, 2018

Member

Relevant promise from base::make.unique():

The algorithm used by make.unique has the property that make.unique(c(A, B)) == make.unique(c(make.unique(A), B)).

In other words, you can append one string at a time to a vector, making it unique each time, and get the same result as applying make.unique to all of the strings at once.

If character vector A is already unique, then make.unique(c(A, B)) preserves A.

This comment has been minimized.

@krlmlr

krlmlr Aug 25, 2018

Member

Seems that these desirable properties are true for some of our operations. Should we document them and add run time checks, so that we get an alert if we missed anything?

This comment has been minimized.

@jennybc

jennybc Aug 26, 2018

Member

I haven't focused on the "make syntactic" part yet. So I'm still not sure if this is the exact guarantee we'd want but we definitely want to be guided by one or more principles like this.

bad_name <- which(is.na(names_x) | names_x == "")
if (has_length(bad_name)) {
abort(error_column_must_be_named(bad_name))
if (is_function(.name_repair)) {

This comment has been minimized.

@jennybc

jennybc Aug 24, 2018

Member

Perhaps I should extract the all name repair below here into a function defined in repair-names.R, so this logic isn't locked up in as_tibble.list. Function would take (potentially named) object x and .name_repair as inputs.

This comment has been minimized.

@hadley

hadley Aug 24, 2018

Member

Yes, I think that would be a good idea. That would also mean that we don't have to expose valid_names() etc; otherwise we'll need to spend some time discussing a good naming strategy for those functions (i.e. we now have a family of functions where the family name is a suffix, not a prefix, which is undesirable)

This comment has been minimized.

@jennybc

jennybc Aug 24, 2018

Member

I know the names aren't great. The existence tidy_names() and set_tidy_names() must be reckoned with, when we do tackle it.

if (!missing(validate)) {
warn("The `validate` argument to `as_tibble()` is deprecated. Please use `.tidy_names` to control column names.")
warn("The `validate` argument to `as_tibble()` is deprecated. Please use `.name_repair` to control column names.")
.name_repair <- if (isTRUE(validate)) "none_passive" else "none"

This comment has been minimized.

@jennybc

jennybc Aug 24, 2018

Member

This might need refinement with an eye towards not breaking existing user code, but moving towards the light.

This comment has been minimized.

@hadley

hadley Aug 24, 2018

Member

Agreed - a soft deprecation might be a better strategy here.

This comment has been minimized.

@krlmlr

krlmlr Aug 24, 2018

Member

Same in as_tibble.data.frame() .

This comment has been minimized.

@jennybc

jennybc Aug 24, 2018

Member

I was talking about .name_repair <- ... but I suspect @hadley may be talking about the use of validate. Is that right? If so, what exact form of "soft depracation" do we want?

This comment has been minimized.

@hadley

hadley Aug 24, 2018

Member

A soft-dep here would just be to just turn the warning into a message so it doesn't break downstream packages

This comment has been minimized.

@krlmlr

krlmlr Aug 25, 2018

Member

How would the warning break downstream dependencies?

I'd rather be silent here for now, and start giving messages/warnings only after we have a version on CRAN where .name_repair works. This buys more time to adapt.

bad_name <- which(is.na(names_x) | names_x == "")
if (has_length(bad_name)) {
abort(error_column_must_be_named(bad_name))
if (is_function(.name_repair)) {

This comment has been minimized.

@hadley

hadley Aug 24, 2018

Member

Yes, I think that would be a good idea. That would also mean that we don't have to expose valid_names() etc; otherwise we'll need to spend some time discussing a good naming strategy for those functions (i.e. we now have a family of functions where the family name is a suffix, not a prefix, which is undesirable)

if (!missing(validate)) {
warn("The `validate` argument to `as_tibble()` is deprecated. Please use `.tidy_names` to control column names.")
warn("The `validate` argument to `as_tibble()` is deprecated. Please use `.name_repair` to control column names.")
.name_repair <- if (isTRUE(validate)) "none_passive" else "none"

This comment has been minimized.

@hadley

hadley Aug 24, 2018

Member

Agreed - a soft deprecation might be a better strategy here.

dups <- which(duplicated(names_x))
if (has_length(dups)) {
abort(error_column_must_have_unique_name(names_x[dups]))
name_fixer <- switch(

This comment has been minimized.

@hadley

hadley Aug 24, 2018

Member

I think this would be simpler if you put it in the else block of the above if statement - then you wouldn't need the dummy custom_fixer

#' Accomplished by setting all names to `""`. Enforced internally by
#' [tibble()] and [as_tibble()] and there is no opt-out. The objective is to
#' produce a substrate of the correct length for other name repair strategies.
#' TODO: decide if this should include replacing `NA` names with `""`

This comment has been minimized.

@hadley

hadley Aug 24, 2018

Member

Yes, it should (to be consistent with rlang::names2())

#'
#' All `tidy` names are `valid`, all `valid` names are `minimal`.
#'
#' For each LEVEL in ("minimal", "valid", "tidy"), there is a family of utility

This comment has been minimized.

@hadley

hadley Aug 24, 2018

Member

I'd prefer to not export these now because they seem like they might more naturally belong in vctrs, and in tibble we can expose through an argument

#'
#' @param prefix A string, the prefix to use for new column names.
#' @param sep A string inserted between the column name and de-duplicating
#' number.
#' number.
#' @export
repair_names <- function(x, prefix = "V", sep = "") {
if (length(x) == 0) {

This comment has been minimized.

@hadley

hadley Aug 24, 2018

Member

This should probably now be formally deprecated

This comment has been minimized.

@jennybc

jennybc Aug 25, 2018

Member

I went with soft deprecation (message()) because dplyr::bind_cols() calls this.

@krlmlr
Copy link
Member

krlmlr left a comment

Thanks for working on it.

if (!missing(validate)) {
warn("The `validate` argument to `as_tibble()` is deprecated. Please use `.tidy_names` to control column names.")
warn("The `validate` argument to `as_tibble()` is deprecated. Please use `.name_repair` to control column names.")
.name_repair <- if (isTRUE(validate)) "none_passive" else "none"

This comment has been minimized.

@krlmlr

krlmlr Aug 24, 2018

Member

Same in as_tibble.data.frame() .

valid = valid_names,
tidy = tidy_names,
custom = custom_fixer,
abort(error_name_repair_arg())

This comment has been minimized.

@krlmlr

krlmlr Aug 24, 2018

Member

Do we want to support formulas and coerce with rlang::as_function() , to get a purrr-like syntax?

This comment has been minimized.

@jennybc

jennybc Aug 29, 2018

Member

I opened a separate issue for this: #476

R/msg.R Outdated
@@ -84,6 +84,10 @@ error_inconsistent_new_rows <- function(names) {
)
}

error_names_must_be_non_null <- function() {
"The `names` must not be NULL"

This comment has been minimized.

@krlmlr

krlmlr Aug 24, 2018

Member

This (and the message below) needs a full stop.

R/msg.R Outdated
error_tidy_names_arg <- function() {
"The `.tidy_names` argument must be NULL, TRUE, FALSE, or a function."
error_name_repair_arg <- function() {
"The `.name_repair` argument must be NULL, \"none\", \"valid\", \"tidy\", or a function."

This comment has been minimized.

@krlmlr

krlmlr Aug 24, 2018

Member

Do we explicitly support "none_passive"? What's a good way to keep error message and option list in sync?

This comment has been minimized.

@jennybc

jennybc Aug 24, 2018

Member

I view "none_passive" as an ugly temporary measure that lives as long as the validate argument is tolerated. But it's strictly an internal matter.

I don't have a plan re: keeping the error message and the possible values of .name_repair in sync 🤔

This comment has been minimized.

@jennybc

jennybc Aug 24, 2018

Member

Wait, "none_passive" is very important as it is our long-term default.

I think this is the question: is it defined implicitly as "the thing that happens when you don't specify .repair_names" (remains an internal matter) or do we give it name the public can see?

This comment has been minimized.

@krlmlr

krlmlr Aug 25, 2018

Member

Maybe rename "none_passive" to "check" and expose?

This comment has been minimized.

@krlmlr

krlmlr Aug 26, 2018

Member

The supported values could be an argument to the error reporting function (to keep the definitions in one place).

This comment has been minimized.

@jennybc

jennybc Aug 27, 2018

Member

I went with "assert_valid". Haven't dealt with propagating the supported values into the error message yet. but can do that as this matures.

if (is.null(name)) {
abort(error_names_must_be_non_null())
}
name

This comment has been minimized.

@krlmlr

krlmlr Aug 24, 2018

Member

This function would be called for its side effects, if at all we should return invisibly (here and below).

Show resolved Hide resolved tests/testthat/test-data-frame.R

@hadley hadley referenced this pull request Aug 24, 2018

Closed

Tibble documentation improvements #470

11 of 11 tasks complete

@jennybc jennybc referenced this pull request Aug 25, 2018

Closed

Vector names #4

@codecov

This comment has been minimized.

Copy link

codecov bot commented Aug 25, 2018

Codecov Report

Merging #469 into master will decrease coverage by 0.18%.
The diff coverage is 88.59%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #469      +/-   ##
==========================================
- Coverage   90.77%   90.59%   -0.19%     
==========================================
  Files          26       26              
  Lines        1084     1127      +43     
==========================================
+ Hits          984     1021      +37     
- Misses        100      106       +6
Impacted Files Coverage Δ
R/utils.r 100% <ø> (ø) ⬆️
R/tibble.R 100% <100%> (ø) ⬆️
R/msg.R 100% <100%> (ø) ⬆️
R/as_tibble.R 95.16% <77.77%> (+0.49%) ⬆️
R/repair-names.R 90.98% <89%> (-9.02%) ⬇️
src/coerce.c 100% <0%> (ø) ⬆️
src/matrixToDataFrame.c 85.18% <0%> (+0.65%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c2af751...abc7e3a. Read the comment docs.

if (is_function(.name_repair)) {
repair_fun <- .name_repair
} else {
repair_fun <- switch(

This comment has been minimized.

@krlmlr

krlmlr Aug 25, 2018

Member

The error message isn't helpful if .name_repair is not a string, see also r-lib/rlang#589.

as_tibble(iris, .name_repair = TRUE)

@jennybc jennybc force-pushed the name-repair branch from 9e4ff82 to b3260cc Aug 26, 2018

API Outdated
as_tibble.matrix(x, ...)
as_tibble.poly(x, ...)
as_tibble.table(x, `_n` = "n", ..., n = `_n`)
as_tibble.table(x, _n = "n", ..., n = _n)

This comment has been minimized.

@krlmlr

krlmlr Aug 26, 2018

Member

This needs r-lib/pkgapi#26, but I can fix after merging.

R/msg.R Outdated
error_tidy_names_arg <- function() {
"The `.tidy_names` argument must be NULL, TRUE, FALSE, or a function."
error_name_repair_arg <- function() {
"The `.name_repair` argument must be NULL, \"none\", \"valid\", \"tidy\", or a function."

This comment has been minimized.

@krlmlr

krlmlr Aug 26, 2018

Member

The supported values could be an argument to the error reporting function (to keep the definitions in one place).

Show resolved Hide resolved vignettes/tibble.Rmd
@jennybc

This comment has been minimized.

Copy link
Member

jennybc commented Aug 26, 2018

Also need to think about the connection to using expressions to name columns, which happens before all of the name repair I'm working on 🤔:

library(tibble)

x <- 3:1

tibble(1:3, c(4,5,6), x)
#> # A tibble: 3 x 3
#>   `1:3` `c(4, 5, 6)`     x
#>   <int>        <dbl> <int>
#> 1     1            4     3
#> 2     2            5     2
#> 3     3            6     1

tibble(1:3, c(4,5,6), x, .name_repair = "tidy")
#> New names:
#> * `1:3` -> X1.3
#> * `c(4, 5, 6)` -> c.4..5..6.
#> # A tibble: 3 x 3
#>    X1.3 c.4..5..6.     x
#>   <int>      <dbl> <int>
#> 1     1          4     3
#> 2     2          5     2
#> 3     3          6     1

Created on 2018-08-26 by the reprex package (v0.2.0.9000).

@hadley

hadley approved these changes Aug 29, 2018

Copy link
Member

hadley left a comment

Looks good to me. I think any remaining concerns can be addressed in follow-up PRs.

@krlmlr

This comment has been minimized.

Copy link
Member

krlmlr commented Aug 29, 2018

I will review and merge tomorrow. Does the following NEWS entry describe the changes well?

  • Repair of column names has been revised thoroughly. The new .name_repair argument to as_tibble() and tibble() controls the renaming, the following values are supported:

    • "none": Do nothing: no name repair, no name checking,
    • "valid": Eliminate missing or duplicated names,
    • "assert_valid": (default value), do not repair the names, but check they are "valid",
    • "syntactic": Use TODO: fill this in to create valid and syntactic
      names
    • a function: apply custom name repair (e.g., .name_repair = make.names for names in the style of base R).

    The validate argument to as_tibble() and tibble() is deprecated, use .name_repair = "none" instead of validate = FALSE (#469, @jennybc).

More things to look at:

  • Use rlang::as_function() to support anonymous functions with the formula notation
  • Message only once?

Thanks a lot for working on this!

R/tibble.R Outdated
#' - `"none"`: Do nothing: no name repair, no name checking,
#' - `"valid"`: Eliminate missing or duplicated names,
#' - `"assert_valid"`: (default value), do not repair the names, but check they are `valid`,
#' - `"syntactic"`: Use `TODO: fill this in` to create `valid` and syntactic

This comment has been minimized.

@krlmlr

krlmlr Aug 29, 2018

Member

What's missing here?

This comment has been minimized.

@jennybc

jennybc Aug 29, 2018

Member

I've reworded now. If we had exported new name-fixers, I would list the relevant one here. But we are holding off on that.

jennybc added some commits Aug 29, 2018

Say `unique`, instead of `valid`
Conveys a lot more about what we actually mean

@jennybc jennybc referenced this pull request Aug 30, 2018

Closed

Justify tidy_names #446

@jennybc jennybc force-pushed the name-repair branch from 4501588 to e4e2771 Aug 30, 2018

@jennybc

This comment has been minimized.

Copy link
Member

jennybc commented Aug 30, 2018

@krlmlr I am done here. The AppVeyor failure seems unrelated and ?possibly tic-related?.

I edited the NEWS and opened issues re: function interface and maybe messaging once.

Please squash!

@krlmlr krlmlr merged commit e54aeeb into master Aug 30, 2018

2 of 4 checks passed

continuous-integration/appveyor/branch AppVeyor build failed
Details
continuous-integration/appveyor/pr AppVeyor build failed
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details
@krlmlr

This comment has been minimized.

Copy link
Member

krlmlr commented Aug 30, 2018

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment