New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Empty columns: to drop or not to drop? #157

Closed
huftis opened this Issue Jan 13, 2016 · 13 comments

Comments

Projects
None yet
7 participants
@huftis

huftis commented Jan 13, 2016

When importing Excel files, empty columns seem to be deleted if one uses a non-zero skip argument in read_excel(). Here’s an example file:

http://huftis.org/nedlasting/excel/row-col-test.xlsx

The file contains two similar worksheets/tabs. Both have 8 columns (named A–H) and 8 rows. The diagonal is filled with the numbers 1 to 8 for the first worksheet. The second worksheet is similar, but is missing the number 4 and 7, resulting in column 4 and and 7 being empty.

Reading the first worksheet works fine:

> read_excel("row-col-test.xlsx", col_names = FALSE, sheet=1, skip=2)
  X1 X2 X3 X4 X5 X6 X7
1  2 NA NA NA NA NA NA
2 NA  3 NA NA NA NA NA
3 NA NA  4 NA NA NA NA
4 NA NA NA  5 NA NA NA
5 NA NA NA NA  6 NA NA
6 NA NA NA NA NA  7 NA
7 NA NA NA NA NA NA  8

But reading the second worksheet results in a data frame where the empty columns are missing:

> read_excel("row-col-test.xlsx", col_names = FALSE, sheet=2, skip=2)
  X1 X2 X4 X5 X7
1  2 NA NA NA NA
2 NA  3 NA NA NA
3 NA NA  5 NA NA
4 NA NA NA  6 NA
5 NA NA NA NA  8
6 NA NA NA NA NA
7 NA NA NA NA NA

Note that the (automatically generated) column names indicate that there once was an X3 and X6 column.

If one uses skip=0, empty columns are not dropped:

> read_excel("row-col-test.xlsx", col_names = FALSE, sheet=2, skip=0)
    X0   X1   X2   X3   X4   X5   X6   X7
1    A    B    C    D    E    F    G    H
2    1 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 <NA>    2 <NA> <NA> <NA> <NA> <NA> <NA>
4 <NA> <NA>    3 <NA> <NA> <NA> <NA> <NA>
5 <NA> <NA> <NA> <NA>    5 <NA> <NA> <NA>
6 <NA> <NA> <NA> <NA> <NA>    6 <NA> <NA>
7 <NA> <NA> <NA> <NA> <NA> <NA> <NA>    8
8 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
9 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>

(Here the values have been converted to character values, since the first row contains non-numeric data.)

@huftis

This comment has been minimized.

huftis commented Jan 13, 2016

I now see that dropping empty rows and columns is actually mentioned as a feature in the readxl README. But I’d consider it a _mis_feature. Having columns disappear or appear depending on their content is almost a recipe for disaster, IMO. (It certainly caused some hard-to-debug problems for me, resulting in this bug report and bug report #156.)

Also, the ‘feature’ of automatically dropping empty columns (arguably) doesn’t actually work, as shown by the following code:

> read_excel("row-col-test.xlsx",
             col_names = TRUE, sheet=2, skip=0)
   A  B  C  D  E  F  G  H
1  1 NA NA NA NA NA NA NA
2 NA  2 NA NA NA NA NA NA
3 NA NA  3 NA NA NA NA NA
4 NA NA NA NA  5 NA NA NA
5 NA NA NA NA NA  6 NA NA
6 NA NA NA NA NA NA NA  8
7 NA NA NA NA NA NA NA NA
8 NA NA NA NA NA NA NA NA

Columns D and G are empty but not dropped.

@hadley

This comment has been minimized.

Member

hadley commented Jan 13, 2016

This seems mutually exclusive with #156 ;)

The problem is that in excel there's no way to tell the difference between an empty cell and a cell that does not exist, but that difference exists in the on-disk serialisation of the workbook. I think the intent is for readxl to treat them both the same way, as excel does, but it looks like we're inconsistent wrt cols and rows.

@huftis

This comment has been minimized.

huftis commented Jan 13, 2016

Hmm. I’m not sure how this is mutually exclusive with #156. The preferred solution for both bugs would be that both empty rows and columns should be included in the resulting data frame (perhaps with an option to drop them, which I agree can be useful sometimes, especially for messy Excel data).

BTW, I didn’t know how the data was being stored in the Excel file. I have now taken a look. It’s not pretty. :/

@jcbannon

This comment has been minimized.

jcbannon commented Apr 9, 2016

I came here because I was trying to read a folder of excel files produced in a poorly formatted way. The given column names were paragraphs, so I did the skip=1, produced my own names and types -- and got the same problem with dropped columns, specifically that the number of names and columns were not equal. Also, got a number of weird date formatting warnings (expecting date but got '12/28/1986'), but that's another topic.

@MichaelChirico

This comment has been minimized.

MichaelChirico commented Apr 16, 2016

Adding my woes here.

I have a file where the row names in the Excel file are not analysis-friendly -- spaces, punctuation, overly-long names, etc.

So I tried to read it by using skip = 1 to ignore those column names and supply col_names from a separate register I made by hand

Unfortunately a lot of the columns are empty besides the header, which caused the hard-to-detect error mentioned by huftis.

Seems inconsistent to get a data.frame with 100 columns when omitting skip but to get one with 50 columns when using skip, as well as being counterintuitive that tinkering with a rows parameter would have an effect on the columns of the output.

An easy fix might be to ignore skip when determining if a column is empty?

For now I'm setting skip = 0 and changing the names ex-post, but it's pretty duct-tape-y.

@anders-sjogren

This comment has been minimized.

anders-sjogren commented Sep 16, 2016

Hi!

This is also a major problem for me. I'm importing a bunch of data, that are filled in template xlsx files that has empty columns. However, sometimes users can have entered something in a cell of some empty column, and then all columns is shifted, and there is no convenient and safe way to see how that has been done.

What would be most convenient is to be able to ask for the column names to be set according to the excel standard (i.e. "A", "B", ...). One can then write expressions such as df[4,"B"] to get that cell value, no matter if the "A"-column exists or not.

An open question: What is the purpose of naming the columns X1,...,XN when empty columns are not counted? In read.xlsx of the xlsx package, empty columns are counted and the colnames can for example be X2,X3,X4 if the first column is missing (and thus skipped). In read_excel however, they are always named X1,..., even if some columns are skipped. To me, these names seem rather redundant since one might as well access a column with 1 as with "X1", and in the current setting they are always 1-to-1!?

Suggestion: Create a from_excel variable with a special class (with the same name) or other attribute, and then allow that to be sent to read_excel as in read_excel(path, col_names=from_excel). If read_excel detects the special class or attribute of col_names it then takes the path of returning colnames of the type "A","B" (but making sure to adjust for lost columns, so that the colnames of existing columns match those in the xlsx-file).

How difficult would it be to detect what actual column each read column corresponds to? Since MS Excel and read.xlsx does it, it must be feasible somehow!?

@jennybc

This comment has been minimized.

Member

jennybc commented Jan 31, 2017

The row skipping behaviour of xlsx is about to get straightened out (#240), so I'm editing the title of this issue to better reflect the important points of discussion. Which is really about column retention and naming.

@jennybc jennybc changed the title from When skipping rows, empty columns are deleted to Handling of columns with no names and/or no data Jan 31, 2017

@jennybc jennybc changed the title from Handling of columns with no names and/or no data to Empty columns: to drop or not to drop? Feb 5, 2017

@jennybc

This comment has been minimized.

Member

jennybc commented Feb 5, 2017

I'm beginning to think that readxl should NOT drop empty columns automatically. readr does not do so either:

readr::read_csv(",x,,y,\n,1,,2,\n,,,,\n")
#> Warning: Missing column names filled in: 'X1' [1], 'X3' [3], 'X5' [5]
#> # A tibble: 2 × 5
#>      X1     x    X3     y    X5
#>   <chr> <int> <chr> <int> <chr>
#> 1  <NA>     1  <NA>     2  <NA>
#> 2  <NA>    NA  <NA>    NA  <NA>

I am talking specifically about leading or embedded empty columns, as opposed to trailing. Trailing empty, unnamed columns are impossible to get because they are basically undefined. I think that's what @hadley means by

no way to tell the difference between an empty cell and a cell that does not exist

readr even retains trailing empty rows (see above), which, for the same reason as for columns, isn't really possible in the Excel domain.

If someone does NOT want an embedded empty column, it is already possible to skip it by setting its type to "blank", though the column specification system needs some work (#198).

path <- system.file("extdata/datasets.xlsx", package = "readxl")
readxl::read_excel(path, col_types = c("blank", "blank", "numeric", "text", 'text'))
#> # A tibble: 150 × 3
#>    Petal.Length Petal.Width Species
#>           <dbl>       <chr>   <chr>
#> 1           1.4         0.2  setosa
#> 2           1.4         0.2  setosa
#> 3           1.3         0.2  setosa
#> 4           1.5         0.2  setosa
#> 5           1.4         0.2  setosa
#> 6           1.7         0.4  setosa
#> 7           1.4         0.3  setosa
#> 8           1.5         0.2  setosa
#> 9           1.4         0.2  setosa
#> 10          1.5         0.1  setosa
#> # ... with 140 more rows

@hadley Do you agree that readxl should not drop an embedded empty column, even if it is unnamed?

If people want to read from a specific cell rectangle, even if there's no data there in the current sheet, that is a matter for #8.

@hadley

This comment has been minimized.

Member

hadley commented Feb 5, 2017

Agreed. I think the current behaviour is mostly due to my naïveté about how excel records empty cells.

@stephhazlitt

This comment has been minimized.

stephhazlitt commented Jul 19, 2018

I think I have run into something related to this thread. Sometimes header information falls into 2 or more rows in an Excel file, akin to this helpful new vignette.

In wrangling some data with a similar format, I think I have found a situation where read_xlsx() is dropping the leading empty cells in a target row.

Using clippy.xlsx from readxl v1.1.0.9000, you can pull the rows with the header info:

library(readxl) #using dev vesion from GitHub for clippy.xlsx with Sheet #2
library(magrittr)

clippy_shipped_colnames1 <- readxl_example("clippy.xlsx") %>% 
  read_excel(sheet = 2, col_names = FALSE, n_max = 1)
#> New names:
#>  -> ..1
#>  -> ..2
#>  -> ..3
#>  -> ..4

clippy_shipped_colnames2 <- readxl_example("clippy.xlsx") %>% 
  read_excel(sheet = 2, col_names = FALSE, n_max = 1, skip = 1)
#> New names:
#>  -> ..1
#>  -> ..2
#>  -> ..3
#>  -> ..4

Created on 2018-07-19 by the reprex
package
(v0.2.0).

If you have a file with some leading NAs in one of the target rows, they are dropped. Using this edited clippy file as an example: clippy_foo.xlsx

foo_clippy <- read_xlsx("clippy_foo.xlsx", sheet = 2)

foo_clippy_colnames1 <- read_xlsx("clippy_foo.xlsx", sheet = 2, col_names = FALSE, n_max = 1)
## returns 6 columns 

foo_clippy_colnames2 <- read_xlsx("clippy_foo.xlsx", sheet = 2, col_names = FALSE, n_max = 1, skip = 1)
## returns 4 columns

This might be intentional and there are other ways to pull the row w/o dropping the NAs (I landed on using range = ). Just adding the observation here as it seems related and I think different to how readr::read_csv behaves?

@jennybc

This comment has been minimized.

Member

jennybc commented Jul 19, 2018

I note that you're using pretty different arguments than the vignette, i.e. col_names = FALSE and n_max = 1 instead of n_max = 0 (and accepting the default col_names = TRUE). But perhaps that is neither here nor there.

But yes, by default, readxl "shrink-wraps" data. So leading or trailing blank cells could cause the initial col-name read to shrink to the contiguous non-empty cells and produce too few column names. If that is an issue, you'll need to provide more external info about your desired geometry. This vignette explains the various ways to express this:

https://readxl.tidyverse.org/articles/sheet-geometry.html

@stephhazlitt

This comment has been minimized.

stephhazlitt commented Jul 19, 2018

Thanks @jennybc - my apologies for not reading deeper into the available documentation, I would have stopped at "shrink-wrapped"! In this case, cell_limits() saves the day.

@jennybc

This comment has been minimized.

Member

jennybc commented Jul 19, 2018

No worries! It is hard to know what to do re: blank cells but the current "shrink wrap by default" approach was chosen with much thought and has held up pretty well since it went info full force >1 year ago. But yeah it means you will sometimes have to take the range reins if you need a two-pass solution to deal with awkward data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment