Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider an abbreviated account of "New names" #406

Closed
jennybc opened this issue Apr 29, 2018 · 9 comments
Closed

Consider an abbreviated account of "New names" #406

jennybc opened this issue Apr 29, 2018 · 9 comments

Comments

@jennybc
Copy link
Member

@jennybc jennybc commented Apr 29, 2018

I've switched to tibble::set_tidy_names() now in the dev version of readxl. And frequently readxl::read_excel() does a lot of name repair. Especially before you really dial in the range or skip values that isolate the data rectangle. I'm beginning to wonder if the messaging around new names should have some upper limit, after which you get a summary. Below is a very modest example. It can get much worse.

readxl::read_excel("investigations/Medicare Part D 2015 Plan Report 03182015.xls")
#> New names:
#>  -> ..2
#>  -> ..3
#>  -> ..4
#>  -> ..5
#>  -> ..6
#>  -> ..7
#>  -> ..8
#>  -> ..9
#>  -> ..10
#>  -> ..11
#>  -> ..12
#>  -> ..13
#>  -> ..14
#>  -> ..15
#>  -> ..16
#>  -> ..17
#>  -> ..18
#>  -> ..19
#>  -> ..20
#>  -> ..21
#>  -> ..22
#>  -> ..23
#>  -> ..24
#>  -> ..25
#>  -> ..26
#> # A tibble: 16,665 x 26
#>    `2015 Plan and P… ..2   ..3   ..4   ..5   ..6   ..7   ..8   ..9   ..10 
#>    <chr>             <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#>  1 Alabama to Monta… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#>  2 * See next works… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#>  3 State             Coun… Orga… Plan… Cont… Plan… Segm… Orga… Plan… Spec…
#>  4 Alabama           (All… Aetn… Aetn… S5810 182   0     PDP   Medi… No   
#>  5 Alabama           (All… Aetn… Aetn… S5810 46    0     PDP   Medi… No   
#>  6 Alabama           (All… Blue… Blue… S1030 6     0     PDP   Medi… No   
#>  7 Alabama           (All… Blue… Blue… S1030 1     0     PDP   Medi… No   
#>  8 Alabama           (All… Cign… Cign… S5617 220   0     PDP   Medi… No   
#>  9 Alabama           (All… Cign… Cign… S5617 182   0     PDP   Medi… No   
#> 10 Alabama           (All… Cign… Cign… S5617 257   0     PDP   Medi… No   
#> # ... with 16,655 more rows, and 16 more variables: ..11 <chr>,
#> #   ..12 <chr>, ..13 <chr>, ..14 <chr>, ..15 <chr>, ..16 <chr>,
#> #   ..17 <chr>, ..18 <chr>, ..19 <chr>, ..20 <chr>, ..21 <chr>,
#> #   ..22 <chr>, ..23 <chr>, ..24 <chr>, ..25 <chr>, ..26 <chr>

Created on 2018-04-28 by the reprex package (v0.2.0).

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Jun 6, 2018

Agreed, in particular for duplicate names we should (also) print the index.

According to the tidyverse style guide, we should only print the first five entries with an ellipsis.

Will the "rename info" object be available as an attribute in the returned tibble? This object could contain the full set of operations, and print all of them with e.g. print(..., n = Inf).

@jennybc
Copy link
Member Author

@jennybc jennybc commented Jun 6, 2018

Agreed, in particular for duplicate names we should (also) print the index.

FWIW the index is recoverable from the repaired name, if I follow you.

Will the "rename info" object be available as an attribute in the returned tibble? This object could contain the full set of operations, and print all of them with e.g. print(..., n = Inf).

I had not thought about this, but perhaps it is a good idea. I don't have a strong opinion. It would be a 2-column tibble with var names "before" and "after" repair?

@batpigandme
Copy link
Member

@batpigandme batpigandme commented Jun 6, 2018

but perhaps it is a good idea

This would get my vote for sure. I've seen people request a Ctrl+z for R transformations before, which makes me think a before and after could be useful.

@jennybc
Copy link
Member Author

@jennybc jennybc commented Jun 6, 2018

It's true that I have already had people who want to apply their own name repair logic in readxl, which isn't really possible. But this attribute would allow them to do this post hoc. I think it should be a full tibble of "before" vs. "after" var names, if it exists, i.e. not just the ones that were repaired.

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Jun 6, 2018

I wonder if we should return a tibble with "broken" column names and provide repair_broken() and ignore_broken() functions that applies or discards the transformations that were guessed. We wouldn't need an undo, and the broken tibble shows instructions about how to fix it when printed.

The new column renaming change in RSQLite causes trouble: r-dbi/RSQLite#259.

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Aug 20, 2018

For now, I'd suggest that readxl and friends use tidy_names() and apply the renaming manually, if the rename info needs to be returned to the client. I have updated the documentation to include this pattern.

@jennybc
Copy link
Member Author

@jennybc jennybc commented Aug 21, 2018

I see this:

To make the rename information available to callers, call tidy_names() and assign the result via names() or rlang::set_names().

but that doesn't seem to address how we're going to return the rename info. Feels like we should take a position on this: in an attribute? if so, with what name? 2-colum tibble or named vector?

Do you have an opinion? If not, I can think/experiment and make a proposal.

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Aug 21, 2018

Like in #458, this could be the responsibility of a different package. I'd rather not add this to tibble for now, but e.g. readxl could suggest an implementation which we then adopt. Would that work?

@github-actions
Copy link

@github-actions github-actions bot commented Dec 10, 2020

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.

@github-actions github-actions bot locked and limited conversation to collaborators Dec 10, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

3 participants