New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider an abbreviated account of "New names" #406

Closed
jennybc opened this Issue Apr 29, 2018 · 8 comments

Comments

Projects
None yet
3 participants
@jennybc
Copy link
Member

jennybc commented Apr 29, 2018

I've switched to tibble::set_tidy_names() now in the dev version of readxl. And frequently readxl::read_excel() does a lot of name repair. Especially before you really dial in the range or skip values that isolate the data rectangle. I'm beginning to wonder if the messaging around new names should have some upper limit, after which you get a summary. Below is a very modest example. It can get much worse.

readxl::read_excel("investigations/Medicare Part D 2015 Plan Report 03182015.xls")
#> New names:
#>  -> ..2
#>  -> ..3
#>  -> ..4
#>  -> ..5
#>  -> ..6
#>  -> ..7
#>  -> ..8
#>  -> ..9
#>  -> ..10
#>  -> ..11
#>  -> ..12
#>  -> ..13
#>  -> ..14
#>  -> ..15
#>  -> ..16
#>  -> ..17
#>  -> ..18
#>  -> ..19
#>  -> ..20
#>  -> ..21
#>  -> ..22
#>  -> ..23
#>  -> ..24
#>  -> ..25
#>  -> ..26
#> # A tibble: 16,665 x 26
#>    `2015 Plan and P… ..2   ..3   ..4   ..5   ..6   ..7   ..8   ..9   ..10 
#>    <chr>             <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#>  1 Alabama to Monta… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#>  2 * See next works… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#>  3 State             Coun… Orga… Plan… Cont… Plan… Segm… Orga… Plan… Spec…
#>  4 Alabama           (All… Aetn… Aetn… S5810 182   0     PDP   Medi… No   
#>  5 Alabama           (All… Aetn… Aetn… S5810 46    0     PDP   Medi… No   
#>  6 Alabama           (All… Blue… Blue… S1030 6     0     PDP   Medi… No   
#>  7 Alabama           (All… Blue… Blue… S1030 1     0     PDP   Medi… No   
#>  8 Alabama           (All… Cign… Cign… S5617 220   0     PDP   Medi… No   
#>  9 Alabama           (All… Cign… Cign… S5617 182   0     PDP   Medi… No   
#> 10 Alabama           (All… Cign… Cign… S5617 257   0     PDP   Medi… No   
#> # ... with 16,655 more rows, and 16 more variables: ..11 <chr>,
#> #   ..12 <chr>, ..13 <chr>, ..14 <chr>, ..15 <chr>, ..16 <chr>,
#> #   ..17 <chr>, ..18 <chr>, ..19 <chr>, ..20 <chr>, ..21 <chr>,
#> #   ..22 <chr>, ..23 <chr>, ..24 <chr>, ..25 <chr>, ..26 <chr>

Created on 2018-04-28 by the reprex package (v0.2.0).

@krlmlr

This comment has been minimized.

Copy link
Member

krlmlr commented Jun 6, 2018

Agreed, in particular for duplicate names we should (also) print the index.

According to the tidyverse style guide, we should only print the first five entries with an ellipsis.

Will the "rename info" object be available as an attribute in the returned tibble? This object could contain the full set of operations, and print all of them with e.g. print(..., n = Inf).

@jennybc

This comment has been minimized.

Copy link
Member

jennybc commented Jun 6, 2018

Agreed, in particular for duplicate names we should (also) print the index.

FWIW the index is recoverable from the repaired name, if I follow you.

Will the "rename info" object be available as an attribute in the returned tibble? This object could contain the full set of operations, and print all of them with e.g. print(..., n = Inf).

I had not thought about this, but perhaps it is a good idea. I don't have a strong opinion. It would be a 2-column tibble with var names "before" and "after" repair?

@batpigandme

This comment has been minimized.

Copy link
Member

batpigandme commented Jun 6, 2018

but perhaps it is a good idea

This would get my vote for sure. I've seen people request a Ctrl+z for R transformations before, which makes me think a before and after could be useful.

@jennybc

This comment has been minimized.

Copy link
Member

jennybc commented Jun 6, 2018

It's true that I have already had people who want to apply their own name repair logic in readxl, which isn't really possible. But this attribute would allow them to do this post hoc. I think it should be a full tibble of "before" vs. "after" var names, if it exists, i.e. not just the ones that were repaired.

@krlmlr

This comment has been minimized.

Copy link
Member

krlmlr commented Jun 6, 2018

I wonder if we should return a tibble with "broken" column names and provide repair_broken() and ignore_broken() functions that applies or discards the transformations that were guessed. We wouldn't need an undo, and the broken tibble shows instructions about how to fix it when printed.

The new column renaming change in RSQLite causes trouble: r-dbi/RSQLite#259.

@krlmlr

This comment has been minimized.

Copy link
Member

krlmlr commented Aug 20, 2018

For now, I'd suggest that readxl and friends use tidy_names() and apply the renaming manually, if the rename info needs to be returned to the client. I have updated the documentation to include this pattern.

@jennybc

This comment has been minimized.

Copy link
Member

jennybc commented Aug 21, 2018

I see this:

To make the rename information available to callers, call tidy_names() and assign the result via names() or rlang::set_names().

but that doesn't seem to address how we're going to return the rename info. Feels like we should take a position on this: in an attribute? if so, with what name? 2-colum tibble or named vector?

Do you have an opinion? If not, I can think/experiment and make a proposal.

@krlmlr

This comment has been minimized.

Copy link
Member

krlmlr commented Aug 21, 2018

Like in #458, this could be the responsibility of a different package. I'd rather not add this to tibble for now, but e.g. readxl could suggest an implementation which we then adopt. Would that work?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment