New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

repair_names() feature requests #217

Closed
jennybc opened this Issue Feb 6, 2017 · 14 comments

Comments

Projects
4 participants
@jennybc
Member

jennybc commented Feb 6, 2017

I've just started running readxl's output through repair_names() so it stops producing tibbles with empty, NA, or duplicated column names 馃帀.

But I noticed that tibble::repair_names() and readr are not consistent.

x <- list(1:3,
          var2 = letters[1:3],
          c(1, 2, 3) + 0.1,
          var2 = letters[26:24])
names(x)[3] <- NA

tibble::repair_names(tibble::as_data_frame(x, validate = FALSE))
#> # A tibble: 3 脳 4
#>      V1  var2    V2 var21
#>   <int> <chr> <dbl> <chr>
#> 1     1     a   1.1     z
#> 2     2     b   2.1     y
#> 3     3     c   3.1     x

readr::read_csv(",var2,,var2\n1,'a',1.1,'z'\n2,'b',2.1,'y'\n3,'c',3.1,'x'")
#> Warning: Missing column names filled in: 'X1' [1], 'X3' [3]
#> Warning: Duplicated column names deduplicated: 'var2' => 'var2_1' [4]
#> # A tibble: 3 脳 4
#>      X1  var2    X3 var2_1
#>   <int> <chr> <dbl>  <chr>
#> 1     1   'a'   1.1    'z'
#> 2     2   'b'   2.1    'y'
#> 3     3   'c'   3.1    'x'

Feature requests, some inspired by readr:

  • Report what and where re: modifications
  • Number wrt absolute column position, as opposed to relative within the missing names
  • Separate the de-deduplicating suffix with something regex-friendly
  • Use a double underscore as prefix for new names and separator in de-duplication, e.g. __X1 and var2__1.

The first one is for a better interactive experience. Otherwise, aimed at programmatic work with tibbles that may have been subjected to name repair.

@krlmlr

This comment has been minimized.

Member

krlmlr commented Feb 6, 2017

We could add arguments absolute = FALSE and verbose = interactive(), in addition to the existing prefix and sep arguments. This doesn't change the default behavior, but allows to replicate what readr does. Would that help?

@jennybc

This comment has been minimized.

Member

jennybc commented Feb 6, 2017

Yes, it certainly helps.

It feels like the tidyverse should have a preferred remedy for nonexistent and duplicated column names. And tibble is currently the logical place to implement that. So I'd still like to reach a consensus across packages.

What do you think of exporting make_unique()? Sometimes you have the names before you have the object.

@hadley

This comment has been minimized.

Member

hadley commented Feb 6, 2017

I'd like to make repair_names() do what readr currently does so readr can just call it.

I think it's probably a good idea to have two variants: one that works on a character vector of names, and the other that works on a named vector.

@jennybc

This comment has been minimized.

Member

jennybc commented Feb 6, 2017

Perhaps my suggestion about leading underscores is not so bright. It creates non-syntactic names. I still think the convention should be something easily detectable via regex, i.e. unlikely to be present in original column names.

I expected sep to only be used during de-duplication.

x <- list(1:3,
          var2 = letters[1:3],
          c(1, 2, 3) + 0.1,
          var2 = letters[26:24])
names(x)[3] <- NA

tibble::repair_names(tibble::as_data_frame(x, validate = FALSE),
                     prefix = "__X", sep = "__")
#> # A tibble: 3 脳 4
#>   `__X__1`  var2 `__X__2` var2__1
#>      <int> <chr>    <dbl>   <chr>
#> 1        1     a      1.1       z
#> 2        2     b      2.1       y
#> 3        3     c      3.1       x
@jimhester

This comment has been minimized.

Member

jimhester commented Feb 6, 2017

A leading . would work for this and is not very likely to happen by chance, you can even use two .. and still be syntactically valid.

@jennybc

This comment has been minimized.

Member

jennybc commented Feb 6, 2017

@jimhester Do you propose . or .. as the separator to be used for deduplication as well? Seems like a good idea.

@jennybc

This comment has been minimized.

Member

jennybc commented Feb 6, 2017

Moving towards a consensus on what should happen in current example:

..X1, var2, ..X3, var2..1

Or we could decide whenever we incorporate a number, it refers to absolute column position

..X1, var2, ..X3, var2..4

Even more consistent? Drop the V or X.

..1, var2, ..3, var2..4

Now every name that needs repair simply gets ..j appended.

@jimhester

This comment has been minimized.

Member

jimhester commented Feb 6, 2017

FWIW I think the doubled identifiers are quite ugly, but I suppose that is part of the point...

@jennybc

This comment has been minimized.

Member

jennybc commented Feb 6, 2017

Yeah, I think it is a feature if the repaired names are unsightly.

@hadley

This comment has been minimized.

Member

hadley commented Feb 6, 2017

I don't like the use of dots because it goes against the usual convention to use _. But I don't see a good alternative here, given that you can't start a syntactic name with _.

@krlmlr

This comment has been minimized.

Member

krlmlr commented Feb 7, 2017

The way dplyr::rename() currently works, the names don't need to be syntactic. On the contrary, making them non-syntactic might encourage the user to use proper names as soon as possible, or to drop with select(-starts_with("__")).

I particularly like @jennybc's idea of simply appending the absolute position, without using V or X. I'm mildly concerned about backward compatibility, we might want to use pkgconfig to simplify transition from the old behavior.

@hadley

This comment has been minimized.

Member

hadley commented Feb 9, 2017

I think we should:

  • Switch to from relative to absolute positions.
  • Use .. as common prefix. I think it's the best compromise of ugliness and utility.
  • Show what transformation is happening (a la readr). We can add quiet = FALSE to suppress
  • Add a syntactic argument which, when TRUE, ensures that all names are syntactic

I'm not sure what we should do with respect to backward compatibility. I suspect few packages depend on this behaviour; it's going to be more user code. I think as long as there's a clear way to get to the old behaviour it shouldn't cause much hassle (especially since we're now describing what's happening)

@jennybc

This comment has been minimized.

Member

jennybc commented Apr 1, 2017

@krlmlr What's the outlook on this re: next tibble release?

@krlmlr

This comment has been minimized.

Member

krlmlr commented Apr 1, 2017

The next tibble release will have a better repair_names(), but probably only after the next dplyr release.

@krlmlr krlmlr added this to To Do in krlmlr May 4, 2017

@krlmlr krlmlr referenced this issue May 6, 2017

Merged

New tidy_names() #248

@krlmlr krlmlr closed this in #248 May 10, 2017

krlmlr added a commit that referenced this issue May 10, 2017

Merge pull request #248 from tidyverse/f-#217-tidy-names
- New `set_tidy_names()` and `tidy_names()`, a simpler version of `repair_names()` which works unchanged for now (#217).

@krlmlr krlmlr moved this from To Do to Done in krlmlr May 12, 2017

krlmlr added a commit that referenced this issue May 13, 2017

Merge tag 'v1.3.0.9003'
- The `print()`, `format()`, and `tbl_sum()` methods are now implemented for class `"tbl"` and not for `"tbl_df"`. This allows subclasses to use tibble's formatting facilities. The formatting of the header can be tweaked by implementing `tbl_sum()` for the subclass.
- New `set_tidy_names()` and `tidy_names()`, a simpler version of `repair_names()` which works unchanged for now (#217).
- Printing now uses `x` again instead of the Unicode multiplication sign, to avoid encoding issues (#216).
- `glimpse()` now properly displays tibbles with foreign characters in column names (#235).

krlmlr added a commit that referenced this issue May 17, 2017

Merge tag 'v1.3.1'
- Subsetting zero columns no longer returns wrong number of rows (#241, @echasnovski).

- New `set_tidy_names()` and `tidy_names()`, a simpler version of `repair_names()` which works unchanged for now (#217).
- New `rowid_to_column()` that adds a `rowid` column as first column and removes row names (#243, @barnettjacob).
- The `all.equal.tbl_df()` method has been removed, calling `all.equal()` now forwards to `base::all.equal.data.frame()`. To compare tibbles ignoring row and column order, please use `dplyr::all_equal()` (#247).

- Printing now uses `x` again instead of the Unicode multiplication sign, to avoid encoding issues (#216).
- String values are now quoted when printing if they contain non-printable characters or quotes (#253).
- The `print()`, `format()`, and `tbl_sum()` methods are now implemented for class `"tbl"` and not for `"tbl_df"`. This allows subclasses to use tibble's formatting facilities. The formatting of the header can be tweaked by implementing `tbl_sum()` for the subclass, which is expected to return a named character vector. The `print.tbl_df()` method is still implemented for compatibility with downstream packages, but only calls `NextMethod()`.
- Own printing routine, not relying on `print.data.frame()` anymore. Now providing `format.tbl_df()` and full support for Unicode characters in names and data, also for `glimpse()` (#235).

- Improve formatting of error messages (#223).
- Using `rlang` instead of `lazyeval` (#225, @lionel-), and `rlang` functions (#244).
- `tribble()` now handles values that have a class (#237, @NikNakk).
- Minor efficiency gains by replacing `any(is.na())` with `anyNA()` (#229, @csgillespie).
- The `microbenchmark` package is now used conditionally (#245).
- `pkgdown` website.

jennybc added a commit to tidyverse/readxl that referenced this issue Apr 15, 2018

Switch to tibble::set_tidy_names() from tibble::repair_names()
Resolves #357

Our discussion about name repair is here: tidyverse/tibble#217

jennybc added a commit to tidyverse/readxl that referenced this issue Apr 15, 2018

Switch to tibble::set_tidy_names() from tibble::repair_names()
Resolves #357

Our discussion about name repair is here: tidyverse/tibble#217

jennybc added a commit to tidyverse/readxl that referenced this issue Apr 15, 2018

Switch to tibble::set_tidy_names() from tibble::repair_names()
Resolves #357

Our discussion about name repair is here: tidyverse/tibble#217

jennybc added a commit to tidyverse/readxl that referenced this issue Apr 25, 2018

Switch to tibble::set_tidy_names() from tibble::repair_names()
Resolves #357

Our discussion about name repair is here: tidyverse/tibble#217

jennybc added a commit to tidyverse/readxl that referenced this issue Apr 25, 2018

Switch to tibble::set_tidy_names() from tibble::repair_names()
Resolves #357

Our discussion about name repair is here: tidyverse/tibble#217

jennybc added a commit to tidyverse/readxl that referenced this issue Apr 25, 2018

@jennybc jennybc referenced this issue Jun 13, 2018

Closed

Vector names #4

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment