Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

repair_names() feature requests #217

Closed
jennybc opened this issue Feb 6, 2017 · 15 comments 路 Fixed by #248
Closed

repair_names() feature requests #217

jennybc opened this issue Feb 6, 2017 · 15 comments 路 Fixed by #248

Comments

@jennybc
Copy link
Member

jennybc commented Feb 6, 2017

I've just started running readxl's output through repair_names() so it stops producing tibbles with empty, NA, or duplicated column names 馃帀.

But I noticed that tibble::repair_names() and readr are not consistent.

x <- list(1:3,
          var2 = letters[1:3],
          c(1, 2, 3) + 0.1,
          var2 = letters[26:24])
names(x)[3] <- NA

tibble::repair_names(tibble::as_data_frame(x, validate = FALSE))
#> # A tibble: 3 脳 4
#>      V1  var2    V2 var21
#>   <int> <chr> <dbl> <chr>
#> 1     1     a   1.1     z
#> 2     2     b   2.1     y
#> 3     3     c   3.1     x

readr::read_csv(",var2,,var2\n1,'a',1.1,'z'\n2,'b',2.1,'y'\n3,'c',3.1,'x'")
#> Warning: Missing column names filled in: 'X1' [1], 'X3' [3]
#> Warning: Duplicated column names deduplicated: 'var2' => 'var2_1' [4]
#> # A tibble: 3 脳 4
#>      X1  var2    X3 var2_1
#>   <int> <chr> <dbl>  <chr>
#> 1     1   'a'   1.1    'z'
#> 2     2   'b'   2.1    'y'
#> 3     3   'c'   3.1    'x'

Feature requests, some inspired by readr:

  • Report what and where re: modifications
  • Number wrt absolute column position, as opposed to relative within the missing names
  • Separate the de-deduplicating suffix with something regex-friendly
  • Use a double underscore as prefix for new names and separator in de-duplication, e.g. __X1 and var2__1.

The first one is for a better interactive experience. Otherwise, aimed at programmatic work with tibbles that may have been subjected to name repair.

@krlmlr
Copy link
Member

krlmlr commented Feb 6, 2017

We could add arguments absolute = FALSE and verbose = interactive(), in addition to the existing prefix and sep arguments. This doesn't change the default behavior, but allows to replicate what readr does. Would that help?

@jennybc
Copy link
Member Author

jennybc commented Feb 6, 2017

Yes, it certainly helps.

It feels like the tidyverse should have a preferred remedy for nonexistent and duplicated column names. And tibble is currently the logical place to implement that. So I'd still like to reach a consensus across packages.

What do you think of exporting make_unique()? Sometimes you have the names before you have the object.

@hadley
Copy link
Member

hadley commented Feb 6, 2017

I'd like to make repair_names() do what readr currently does so readr can just call it.

I think it's probably a good idea to have two variants: one that works on a character vector of names, and the other that works on a named vector.

@jennybc
Copy link
Member Author

jennybc commented Feb 6, 2017

Perhaps my suggestion about leading underscores is not so bright. It creates non-syntactic names. I still think the convention should be something easily detectable via regex, i.e. unlikely to be present in original column names.

I expected sep to only be used during de-duplication.

x <- list(1:3,
          var2 = letters[1:3],
          c(1, 2, 3) + 0.1,
          var2 = letters[26:24])
names(x)[3] <- NA

tibble::repair_names(tibble::as_data_frame(x, validate = FALSE),
                     prefix = "__X", sep = "__")
#> # A tibble: 3 脳 4
#>   `__X__1`  var2 `__X__2` var2__1
#>      <int> <chr>    <dbl>   <chr>
#> 1        1     a      1.1       z
#> 2        2     b      2.1       y
#> 3        3     c      3.1       x

@jimhester
Copy link
Collaborator

A leading . would work for this and is not very likely to happen by chance, you can even use two .. and still be syntactically valid.

@jennybc
Copy link
Member Author

jennybc commented Feb 6, 2017

@jimhester Do you propose . or .. as the separator to be used for deduplication as well? Seems like a good idea.

@jennybc
Copy link
Member Author

jennybc commented Feb 6, 2017

Moving towards a consensus on what should happen in current example:

..X1, var2, ..X3, var2..1

Or we could decide whenever we incorporate a number, it refers to absolute column position

..X1, var2, ..X3, var2..4

Even more consistent? Drop the V or X.

..1, var2, ..3, var2..4

Now every name that needs repair simply gets ..j appended.

@jimhester
Copy link
Collaborator

FWIW I think the doubled identifiers are quite ugly, but I suppose that is part of the point...

@jennybc
Copy link
Member Author

jennybc commented Feb 6, 2017

Yeah, I think it is a feature if the repaired names are unsightly.

@hadley
Copy link
Member

hadley commented Feb 6, 2017

I don't like the use of dots because it goes against the usual convention to use _. But I don't see a good alternative here, given that you can't start a syntactic name with _.

@krlmlr
Copy link
Member

krlmlr commented Feb 7, 2017

The way dplyr::rename() currently works, the names don't need to be syntactic. On the contrary, making them non-syntactic might encourage the user to use proper names as soon as possible, or to drop with select(-starts_with("__")).

I particularly like @jennybc's idea of simply appending the absolute position, without using V or X. I'm mildly concerned about backward compatibility, we might want to use pkgconfig to simplify transition from the old behavior.

@hadley
Copy link
Member

hadley commented Feb 9, 2017

I think we should:

  • Switch to from relative to absolute positions.
  • Use .. as common prefix. I think it's the best compromise of ugliness and utility.
  • Show what transformation is happening (a la readr). We can add quiet = FALSE to suppress
  • Add a syntactic argument which, when TRUE, ensures that all names are syntactic

I'm not sure what we should do with respect to backward compatibility. I suspect few packages depend on this behaviour; it's going to be more user code. I think as long as there's a clear way to get to the old behaviour it shouldn't cause much hassle (especially since we're now describing what's happening)

@jennybc
Copy link
Member Author

jennybc commented Apr 1, 2017

@krlmlr What's the outlook on this re: next tibble release?

@krlmlr
Copy link
Member

krlmlr commented Apr 1, 2017

The next tibble release will have a better repair_names(), but probably only after the next dplyr release.

@krlmlr krlmlr mentioned this issue May 6, 2017
krlmlr added a commit that referenced this issue May 10, 2017
- New `set_tidy_names()` and `tidy_names()`, a simpler version of `repair_names()` which works unchanged for now (#217).
krlmlr added a commit that referenced this issue May 13, 2017
- The `print()`, `format()`, and `tbl_sum()` methods are now implemented for class `"tbl"` and not for `"tbl_df"`. This allows subclasses to use tibble's formatting facilities. The formatting of the header can be tweaked by implementing `tbl_sum()` for the subclass.
- New `set_tidy_names()` and `tidy_names()`, a simpler version of `repair_names()` which works unchanged for now (#217).
- Printing now uses `x` again instead of the Unicode multiplication sign, to avoid encoding issues (#216).
- `glimpse()` now properly displays tibbles with foreign characters in column names (#235).
krlmlr added a commit that referenced this issue May 17, 2017
- Subsetting zero columns no longer returns wrong number of rows (#241, @echasnovski).

- New `set_tidy_names()` and `tidy_names()`, a simpler version of `repair_names()` which works unchanged for now (#217).
- New `rowid_to_column()` that adds a `rowid` column as first column and removes row names (#243, @barnettjacob).
- The `all.equal.tbl_df()` method has been removed, calling `all.equal()` now forwards to `base::all.equal.data.frame()`. To compare tibbles ignoring row and column order, please use `dplyr::all_equal()` (#247).

- Printing now uses `x` again instead of the Unicode multiplication sign, to avoid encoding issues (#216).
- String values are now quoted when printing if they contain non-printable characters or quotes (#253).
- The `print()`, `format()`, and `tbl_sum()` methods are now implemented for class `"tbl"` and not for `"tbl_df"`. This allows subclasses to use tibble's formatting facilities. The formatting of the header can be tweaked by implementing `tbl_sum()` for the subclass, which is expected to return a named character vector. The `print.tbl_df()` method is still implemented for compatibility with downstream packages, but only calls `NextMethod()`.
- Own printing routine, not relying on `print.data.frame()` anymore. Now providing `format.tbl_df()` and full support for Unicode characters in names and data, also for `glimpse()` (#235).

- Improve formatting of error messages (#223).
- Using `rlang` instead of `lazyeval` (#225, @lionel-), and `rlang` functions (#244).
- `tribble()` now handles values that have a class (#237, @NikNakk).
- Minor efficiency gains by replacing `any(is.na())` with `anyNA()` (#229, @csgillespie).
- The `microbenchmark` package is now used conditionally (#245).
- `pkgdown` website.
jennybc added a commit to tidyverse/readxl that referenced this issue Apr 15, 2018
jennybc added a commit to tidyverse/readxl that referenced this issue Apr 15, 2018
jennybc added a commit to tidyverse/readxl that referenced this issue Apr 15, 2018
jennybc added a commit to tidyverse/readxl that referenced this issue Apr 25, 2018
jennybc added a commit to tidyverse/readxl that referenced this issue Apr 25, 2018
@github-actions
Copy link
Contributor

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.

@github-actions github-actions bot locked and limited conversation to collaborators Dec 13, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants