New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collapse multiple rows of headers (feature request) #486

Closed
brianwdavis opened this Issue Jun 8, 2018 · 5 comments

Comments

Projects
None yet
2 participants
@brianwdavis
Copy link

brianwdavis commented Jun 8, 2018

I frequently receive Excel files that have "complex" headers, where there are multiple rows of cells encoding a single column name. Usually there's included units, labels, or additional metadata, but rarely can the extra rows be thrown away and only the first kept. Here's an example:

image

  • In the workflow below, the first attempt erroneously puts the metadata in the first row of the tibble. This incorrectly coerces the parser to guess that 4 columns are character, not numeric. Additionally, we lose the name of the 6th column, converted to X__1, and the second treatment column is coerced to Trt__1.
  • The second attempt correctly parses the column types, but we lose the names on the first 5 columns, and keep only the metadata. Since there are multiple columns with the unit "grams", that's not enough to reconstruct the original headers.
  • The third attempt is a flavor of the first, manually removing the metadata row and forcing a reparse. Again, this loses one of the column names, and confuses the two treatment columns.
  • The fourth attempt is my current workflow: I read in the top few rows that I know contain both names and metadata, collapse them with paste, and unlist them to get a vector of column names. Then I read in the data part of the sheet separately and add the names.

My desire: an additional argument to read_excel that's like row_collapse = 1:2, telling it to do that.


library(readxl)                                                             
library(dplyr)                                                              
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
                                                                            
file_path <- "https://github.com/brianwdavis/public/raw/master/example.xlsx"
                                                                            
download.file(file_path, "example.xlsx", mode = "wb")                       
                                                                            
read_excel("example.xlsx")                                                  
#> # A tibble: 7 x 6
#>   Trt      Trt__1   `Dry wt` `Fresh wt` Temp  X__1     
#>   <chr>    <chr>    <chr>    <chr>      <chr> <chr>    
#> 1 Factor 1 Factor 2 grams    grams      C     replicate
#> 2 a        d        3        5          20    1        
#> 3 a        d        5        10         22    2        
#> 4 a        e        8        15         21    3        
#> 5 b        e        4        9          30    1        
#> 6 b        f        6        11         31    2        
#> 7 b        f        2        4          32    3
read_excel("example.xlsx", skip = 1)                                        
#> # A tibble: 6 x 6
#>   `Factor 1` `Factor 2` grams grams__1     C replicate
#>   <chr>      <chr>      <dbl>    <dbl> <dbl>     <dbl>
#> 1 a          d              3        5    20         1
#> 2 a          d              5       10    22         2
#> 3 a          e              8       15    21         3
#> 4 b          e              4        9    30         1
#> 5 b          f              6       11    31         2
#> 6 b          f              2        4    32         3
                                                                            
                                                                            
                                                                            
read_excel("example.xlsx") %>%                                              
slice(-1) %>%                                                               
mutate_all(readr::parse_guess)                                              
#> # A tibble: 6 x 6
#>   Trt   Trt__1 `Dry wt` `Fresh wt`  Temp  X__1
#>   <chr> <chr>     <int>      <int> <int> <int>
#> 1 a     d             3          5    20     1
#> 2 a     d             5         10    22     2
#> 3 a     e             8         15    21     3
#> 4 b     e             4          9    30     1
#> 5 b     f             6         11    31     2
#> 6 b     f             2          4    32     3
                                                                            
                                                                            
nms <-                                                                      
read_excel("example.xlsx", range = cell_rows(1:2), col_names = F) %>%       
summarise_all(funs(paste(na.omit(.), collapse = "_"))) %>%                  
unlist()                                                                    
nms                                                                         
#>             X__1             X__2             X__3             X__4 
#>   "Trt_Factor 1"   "Trt_Factor 2"   "Dry wt_grams" "Fresh wt_grams" 
#>             X__5             X__6 
#>         "Temp_C"      "replicate"
                                                                            
read_excel("example.xlsx", skip = 2) %>%                                    
magrittr::set_colnames(nms)                                                 
#> # A tibble: 5 x 6
#>   `Trt_Factor 1` `Trt_Factor 2` `Dry wt_grams` `Fresh wt_grams` Temp_C
#>   <chr>          <chr>                   <dbl>            <dbl>  <dbl>
#> 1 a              d                           5               10     22
#> 2 a              e                           8               15     21
#> 3 b              e                           4                9     30
#> 4 b              f                           6               11     31
#> 5 b              f                           2                4     32
#> # ... with 1 more variable: replicate <dbl>
@brianwdavis

This comment has been minimized.

Copy link

brianwdavis commented Jun 18, 2018

Having searched again, I saw that I missed #379 (closed) somehow, which starts out dealing with this exact sort of issue and then veers off into how to process pivot tables/wide data.

@jennybc

This comment has been minimized.

Copy link
Member

jennybc commented Jun 18, 2018

readxl could perhaps gain a convenience function to deal with such sheets, that would be some version of what you are doing with your two reads: one of header rows and another of data.

So, it would be a thin wrapper around 2 calls to read_excel(). I don't think it's worth it to deal with any of this on C/C++ side. It would be a bit more efficient to do the first read, make colnames, then provide these colnames to the second read (vs. applying them after the fact).

Main question is what should the interface be re: describing how to make column names from multiple header rows.

cc @richfitz, who I think had a similar question recently

@brianwdavis

This comment has been minimized.

Copy link

brianwdavis commented Jun 18, 2018

It is indeed true that two calls to read_excel is about 2x as fast as one call to read_excel and then readr::parse_guess on each column (which surprised me! the C/C++ side must be very fast).

My stab would be something like below. In all the examples I've come across in the wild, it would be fine to have a single integer argument row_collapse that counts the number of rows to concatenate. I'm curious what other use cases people have where that wouldn't fit though.


read_excel_multiline <- function(filename, row_collapse = 1, ...) {
  nms <- read_excel(filename, range = cell_rows(seq_len(row_collapse)), col_names = F)  
  nms <- lapply(nms, na.omit)
  nms <- lapply(nms, paste, collapse = "_")

  read_excel(filename, skip = row_collapse, col_names = unlist(nms), ...)                                     
}

read_excel_multiline("example.xlsx", row_collapse = 2)

> # A tibble: 6 x 6
>   `Trt_Factor 1` `Trt_Factor 2` `Dry wt_grams` `Fresh wt_grams` Temp_C replicate
>   <chr>          <chr>                   <dbl>            <dbl>  <dbl>     <dbl>
> 1 a              d                           3                5     20         1
> 2 a              d                           5               10     22         2
> 3 a              e                           8               15     21         3
> 4 b              e                           4                9     30         1
> 5 b              f                           6               11     31         2
> 6 b              f                           2                4     32         3
@jennybc

This comment has been minimized.

Copy link
Member

jennybc commented Jul 10, 2018

@brianwdavis would you subscribe/watch #492? Once that mini vignette exists and is merged, perhaps you could add a second worked example, based on the above? While we ruminate on whether to add something to readxl itself and, if so, what, it's nice to have clear worked examples available for users right now.

@jennybc

This comment has been minimized.

Copy link
Member

jennybc commented Dec 14, 2018

I still think building this into readxl itself is a non-starter. I still welcome worked examples in the articles of the pkgdown so that people who share this problem can customize some general patterns for tackling this.

@jennybc jennybc closed this Dec 14, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment