Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make ...j the standard disambiguating suffix in name repair #566

Closed
jennybc opened this issue Jan 25, 2019 · 31 comments
Closed

Make ...j the standard disambiguating suffix in name repair #566

jennybc opened this issue Jan 25, 2019 · 31 comments
Milestone

Comments

@jennybc
Copy link
Member

@jennybc jennybc commented Jan 25, 2019

I think we should go ahead and use ...j as the standard suffix when .name_repair = "unique". These names don't have to be syntactic but it's so easy to make them so (they are already ugly and have lots of dots) and it will reduce the downstream burden of dealing with non-syntactic names (example: tidyverse/dplyr#4094).

This would be good to do ASAP I think in a tiny release. What do you think @krlmlr?

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Jan 25, 2019

Can we instead recommend to use .name_repair = "universal" as a general rule, or .name_repair = "unique" and a mandatory cleaning of column names?

  • I'd rather think carefully about another change of the behavior
  • It won't help if the column name is a keyword
  • Non-syntactic names prompt the user to do something about it

@jennybc
Copy link
Member Author

@jennybc jennybc commented Jan 25, 2019

I'm motivated by the case of empty names and a phenomenon I think will be fairly prevalent: the nonsyntactic names are short-lived or beside the point, but they prevent someone from doing something that seems totally unrelated. Like the dplyr issue above where, surprisingly, nonsyntactic names prevent mutate_all() from working.

@jennybc
Copy link
Member Author

@jennybc jennybc commented Jan 25, 2019

I agree it feels icky to make another change but we're still in early days of the tibble 2.x series and I suspect it's a net positive, long-term.

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Jan 25, 2019

I still believe we can educate users to default to .name_repair = "universal", and use .name_repair = "unique" only if they are comfortable with non-syntactic names. The invariant here is that "unique" doesn't do more than necessary, and both "unique" and "universal" use ..# as disambiguator. (This holds no matter what the disambiguator is.) I don't think that another release is a good use of our time, even in the long run.

dplyr is being fixed and now also works for names like if, $ and o o p s .

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Feb 15, 2019

tidyverse/dplyr#4094 seems difficult to fix for names like ..1. If we avoid creating them, that issue becomes much less of a problem.

How about a renaming strategy that sits between "unique" and "universal" and creates ...1 instead of ..1, and .... instead of ..., but does not touch names that contain keywords? The features of each repair strategy would be:

  • Minimal: Proper names
  • Unique: Unique names
  • Quotable: Names can be used with backticks around them
  • Universal: Names can be used without any backticks

What would quotable repair do to backticks?

library(tibble)
library(magrittr)

data <- tibble(`\`` = 1)
data %>% names()
#> [1] "`"
as_tibble(data, .name_repair = "unique")
#> # A tibble: 1 x 1
#>    `\``
#>   <dbl>
#> 1     1
as_tibble(data, .name_repair = "universal")
#> New names:
#> * `\`` -> .
#> # A tibble: 1 x 1
#>       .
#>   <dbl>
#> 1     1

Created on 2019-02-15 by the reprex package (v0.2.1.9000)

@jennybc
Copy link
Member Author

@jennybc jennybc commented Feb 15, 2019

I still feel like there's time to just make three dots the standard suffix. I really think we should.

@nathancday

This comment has been hidden.

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Feb 16, 2019

@jennybc: Why three dots? Why not just one?

@krlmlr

This comment has been hidden.

@jennybc
Copy link
Member Author

@jennybc jennybc commented Feb 16, 2019

Well, .1 is not syntactic either, so we gain nothing.

@jennybc
Copy link
Member Author

@jennybc jennybc commented Feb 16, 2019

Just to get it all written out, the reason to reconsider the "disambiguating numeric suffix" is that, in cases where the name was missing, it becomes the entire name. And even though we don't promise to make names syntactic when .name_repair = "unique", I think the world would probably be a better place if we did, for these cases.

@jennybc
Copy link
Member Author

@jennybc jennybc commented Feb 16, 2019

Recording a bit more motivation ...

We are happy for our automatically repaired names to be ugly, because that helps create awareness that the user is working with names created by someone else.

But in many cases, part of the reason people are oblivious -- and can safely remain so -- is that they were never going to use those weird variable names anyway. But the mere presence of non-syntactic names can cause friction in unexpected places, such as mutate_all(). So that's why I'd like to make the form of the suffix one that produces syntactic names (albeit ugly).

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Feb 16, 2019

.1 is syntactic, but doesn't mean what I thought it means :-\

I hear you, and I agree now that changing the default suffix is the best way to handle that problem in the short-medium term.

The problem with ..1 and ... (and also with ``) is that these names not only not syntactic, they're invalid even as a quoted name. That's worse -- mutate_all() doesn't have a problem with unquoted names, but does with non-quotable names. Maybe we can even support a naming strategy that's conceptually between "minimal" and "unique", and make that naming strategy the default. (This means that all empty names become a dot -- that's not too bad for tibbles IMO.)

I'll implement the changes and see how many revdep failures we have. If we can't fix all downstream failures in the very short term, I'd rather add a compatibility mode.

@jennybc
Copy link
Member Author

@jennybc jennybc commented Feb 18, 2019

.1 is syntactic, but doesn't mean what I thought it means :-\

Can you spell this out more? It's so easy to overlook weird stuff and edge cases in this naming domain, that I want to make sure I'm holding the important facts in my head.

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Feb 18, 2019

It's a number. That's why .1x is forbidden, too.

.1 <- 5
#> Error in 0.1 <- 5: invalid (do_set) left-hand side to assignment
a <- .1
a
#> [1] 0.1

Created on 2019-02-18 by the reprex package (v0.2.1.9000)

@jennybc
Copy link
Member Author

@jennybc jennybc commented Feb 18, 2019

OK so you mean it's a syntactic expression but not a syntactic name.

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Feb 18, 2019

Yes. I'm already sold on the idea that we should use three dots.

What do you think about converting `` to . by default? Perhaps not in minimal, but at least in unique repair mode?

@jennybc
Copy link
Member Author

@jennybc jennybc commented Feb 18, 2019

What do you think about converting `` to . by default? Perhaps not in minimal, but at least in unique repair mode?

Well, this is already addressed for "unique", i.e. an empty name is always "repaired" as if it is a duplicate. So I think we're already good there, i.e. it will come along for the ride when the disambiguating suffix gets three dots.

tibble:::unique_names("")
#> New names:
#> * `` -> `..1`
#> [1] "..1"

Created on 2019-02-18 by the reprex package (v0.2.1.9000)

@jennybc
Copy link
Member Author

@jennybc jennybc commented Feb 18, 2019

We got this "for free" with the motivating principle for "unique" names: "All columns can be accessed by name via df[["name"]]."

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Feb 18, 2019

Nice, I wasn't aware of that. If we also expand ... to .... in "unique", we can extend this guarantee to "...accessed by name via ... and df$`name` .

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Feb 18, 2019

Or perhaps we always append a number to ... ?

@jennybc
Copy link
Member Author

@jennybc jennybc commented Feb 18, 2019

My feelings about ... are less strong than what happens with empty names, just because I think it's comparatively much more rare.

But I'm not sure that it's worth special casing it for "unique" (because I suspect it's so rare).

@jennybc
Copy link
Member Author

@jennybc jennybc commented Feb 18, 2019

..accessed by name via ... and df$`name`

But I do like the idea of having some operational principle whereby each level of repair has this sort of practical significance.

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Feb 18, 2019

So how about always converting ... to ...# for "unique"? It is rare, but then we have that nice guarantee.

@jennybc
Copy link
Member Author

@jennybc jennybc commented Feb 18, 2019

If it were up to me, I wouldn't do it. Or I would treat it like "" and give it the full suffix and let it have an awful lot of dots. That still delivers the backtick guarantee. I'm not really keen on the special case, code- and documentation-wise.

But I can let it go if you feel strongly.

@krlmlr krlmlr added this to the 2.x.y milestone Feb 22, 2019
@dpprdan
Copy link

@dpprdan dpprdan commented Feb 22, 2019

Coming here from community.rstudio.com, I assume the new behaviour means that instead of

x <- tibble::as_tibble(list(1), .name_repair = "unique")
#> New names:
#> * `` -> `..1`
dplyr::rename(x, a = `..1`)
#> Error in .f(.x[[i]], ...): ..1 used in an incorrect context, no ... to look in

there will be this?

x <- tibble::as_tibble(list(1), .name_repair = "unique")
#> New names:
#> * `` -> `...1`
dplyr::rename(x, a = `...1`)
#> # A tibble: 1 x 1
#>       a 
#>   <dbl> 
#> 1     1 

(BTW, this originated for me as a problem with readxl).

@jennybc
Copy link
Member Author

@jennybc jennybc commented Feb 22, 2019

@dpprdan Yeah, exactly. Yes, readxl is also fueling my interest in both name repair and this specific issue thread, as Excel seems to be the R world's main supplier of horrifically-named tibbles.

@jennybc
Copy link
Member Author

@jennybc jennybc commented Feb 22, 2019

@krlmlr Does this summarize our current status?

  • We will make ...j the standard disambiguating suffix.
  • ... joins "" as a special value that always gets the suffix when .name_repair = "unique", i.e. they are are considered duplicates if they appear even once
  • One of us needs to convince ourselves and the other that, when we're done, the rationale for "unique" names becomes: "All columns can be accessed by name via df[["name"]] or df$`name`."

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Feb 22, 2019

Agreed.

FWIW, users shouldn't rely on these auto-generated names by default. One way to rename is by position:

library(tidyverse)

data <-
  list(a = 1, 2, 3, q = 4) %>%
  as_tibble(.name_repair = "minimal")

data
#> # A tibble: 1 x 4
#>       a    ``    ``     q
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     1     2     3     4

data %>%
  as_tibble(.name_repair = "unique") %>%
  rename(b = 2, c = 3)
#> New names:
#> * `` -> `..2`
#> * `` -> `..3`
#> # A tibble: 1 x 4
#>       a     b     c     q
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     1     2     3     4

Created on 2019-02-22 by the reprex package (v0.2.1.9000)

@jennybc
Copy link
Member Author

@jennybc jennybc commented Feb 22, 2019

FWIW, users shouldn't rely on these auto-generated names by default

Agreed. In the medium term, I'd like to add an article in readxl on this. And it will definitely contain the advice that if your downstream code has hard-wired variable names, you need to be intentional at import time. Many names won't need repair but if they do and you refer to them by name later, you should choose .name_repair deliberately.

I think our main goal with this change to the disambiguating suffix is to prevent some mysterious problems, like we've seen with the scoped dplyr verbs. In that example, the user is not making any explicit name use, but turns out it matters because of an implementation detail in the scoped verbs.

@krlmlr krlmlr closed this in #584 Mar 8, 2019
krlmlr added a commit that referenced this issue Mar 8, 2019
- Three dots are used even for `"unique"` name repair (#566).
@github-actions
Copy link

@github-actions github-actions bot commented Dec 8, 2020

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.

@github-actions github-actions bot locked and limited conversation to collaborators Dec 8, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

4 participants