New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reads an empty column (NA) #162

Closed
rappster opened this Issue Feb 6, 2016 · 8 comments

Comments

Projects
None yet
5 participants
@rappster

rappster commented Feb 6, 2016

Either when writing to Excel XLConnect and/or openxlsx or when manipulationg Excel files via the Java API addressed by these packages (necessary when writing formulas), I sometimes seem to end up with a xlsx file that has an empty column.

While this certainly is not a good sign (probably still a minor cell offset bug in my code), it surprised me that the "empty" column is only recognized by readxl::read_excel, but none of the other functions that will read from xlsx.

> read_excel(path, sheet = sheet)
Source: local data frame [42 x 5]

   variable 2016-10-01 2016-10-01 2016-11-01      
      (chr)      (chr)      (chr)      (chr) (dbl)
1     var_1       asdf       asdf       asdf    NA
2     var_2       asdf       asdf       asdf    NA
3     var_3       asdf       asdf       asdf    NA
4     var_4       asdf       asdf       asdf    NA
5     var_5       asdf       asdf       asdf    NA
6     var_6       asdf       asdf       asdf    NA
7     var_7       asdf       asdf       asdf    NA
8     var_8       asdf       asdf       asdf    NA
9     var_9       asdf       asdf       asdf    NA
10   var_10       asdf       asdf       asdf    NA
..      ...        ...        ...        ...   ...

I've "examplified" one of those files that have an empty column and uploaded it.

Currently, I'm loading the data via read_excel and selecting the "valid" data range via the following function as it's still faster than any of the other functions out there (see microbenchmark):

ensureNonNaRange <- function(dat) {
  idx_col <- ! dat %>% sapply(function(ii) ii %>% is.na() %>% all())
  idx_row <- ! sapply(1:nrow(dat),
    function(ii) unlist(dat[ii, ]) %>% is.na() %>% all())
  dat[idx_row, idx_col]
}

Microbenchmark:

microbenchmark::microbenchmark(
  xlconnect = {
    wb <- XLConnect::loadWorkbook(path)
    readWorksheet(wb, sheet = sheet)
  },
  openxlsx = readWorkbook(path, sheet = sheet),
  openxlsx_2 = openxlsx::read.xlsx(path, sheet = sheet),
  readxl = read_excel(path, sheet = sheet),
  readxl_2 = ensureNonNaRange(read_excel(path, sheet = sheet)),
  xlsx = read.xlsx2(path, sheetName = sheet, check.names = FALSE)
)

Unit: milliseconds
       expr       min        lq      mean    median        uq        max neval
  xlconnect 17.495881 18.994216 21.121950 19.487015 20.566011  70.583209   100
   openxlsx 16.785055 17.826075 19.057666 18.397903 18.982633  80.022474   100
 openxlsx_2 16.160443 17.476526 20.848930 17.994984 18.577368 164.240113   100
     readxl  4.437091  4.730923  5.014602  4.847048  5.151876   8.763629   100
   readxl_2 11.283191 12.240050 13.082205 12.560126 12.907476  36.800497   100
       xlsx 15.674242 16.540928 17.723509 17.022730 18.393797  22.991293   100
@rappster

This comment has been minimized.

rappster commented Feb 6, 2016

This version of ensureNonNaRange has a bit less overhead due to avoiding magrittr pipes, so we're quite close to the actual time that read_excel needs:

ensureNonNaRange2 <- function(dat) {
  idx_col <- ! sapply(dat, function(ii) all(is.na(ii)))
  idx_row <- ! sapply(1:nrow(dat), function(ii) all(is.na(unlist(dat[ii, ]))))
  dat[idx_row, idx_col]
}

microbenchmark::microbenchmark(
  xlconnect = {
    wb <- XLConnect::loadWorkbook(path)
    readWorksheet(wb, sheet = sheet)
  },
  openxlsx = readWorkbook(path, sheet = sheet),
  openxlsx_2 = openxlsx::read.xlsx(path, sheet = sheet),
  readxl = read_excel(path, sheet = sheet),
  readxl_2 = ensureNonNaRange(read_excel(path, sheet = sheet)),
  readxl_3 = ensureNonNaRange2(read_excel(path, sheet = sheet)),
  xlsx = read.xlsx2(path, sheetName = sheet, check.names = FALSE)
)
Unit: milliseconds
       expr       min        lq      mean    median        uq        max neval
  xlconnect 17.380050 18.858885 20.343006 19.664869 21.079482  38.049721   100
   openxlsx 16.219092 17.519634 21.949898 18.156268 18.684843 170.887099   100
 openxlsx_2 16.076281 17.633120 18.934128 18.358901 18.689975  61.733366   100
     readxl  4.236512  4.559228  4.893129  4.817576  5.065222   9.611987   100
   readxl_2 11.428347 12.102811 12.857676 12.513648 13.026680  21.758785   100
   readxl_3  5.939386  6.377347  6.743487  6.671619  6.927769  10.387914   100
       xlsx 15.434662 16.293577 17.298867 16.807488 17.595584  24.602382   100

Would it make sense/be possible to include a non-NA check of this sort in read_excel?

@rappster rappster changed the title from Reads a NA column to Reads an empty column (NA) Feb 6, 2016

@RMHogervorst

This comment has been minimized.

RMHogervorst commented Apr 29, 2016

I also end up with a empty column in readxl. However I'm sure my xlsx files are created by excel itself.
The problem manifests when using dplyr: error:
Error in filter_impl(.data, dots) : attempt to use zero-length variable name

@RMHogervorst

This comment has been minimized.

RMHogervorst commented Apr 29, 2016

This also happens in the dev version of readxl from github.

Session info -----------------------------------------------------------------------------------------------------------------------
 setting  value                                      
 version  R version 3.2.4 Revised (2016-03-16 r70336)
 system   x86_64, mingw32                            
 ui       RStudio (0.99.893)                         
 language (EN)                                       
 collate  Dutch_Netherlands.1252                     
 tz       Europe/Berlin                              
 date     2016-04-29   
Packages ------------------selection-----------------------------
 readxl         0.1.1.9000 2016-04-29 Github (hadley/readxl@3ba34ab)
 dplyr        * 0.4.3      2015-09-01 CRAN (R 3.2.3)
@burchill

This comment has been minimized.

burchill commented Dec 12, 2016

I opened up the .xlsx document, and for whatever reason, it has the dimensions of the document including that column. I honestly have no idea why--maybe there's something formatted somewhere in it or something?

@RMHogervorst

This comment has been minimized.

RMHogervorst commented Dec 14, 2016

The file provided by @rappster must contain some information in the final column. When I delete that seemingly empty column (remove with excel) and read it in again, it returns 4 columns in stead of 5.

@jennybc

This comment has been minimized.

Member

jennybc commented Feb 4, 2017

Note to self: The file provided by @rappster has style information for a subset of the seemingly empty cells in column E. Here's the xml for an example, cell E22:

<row r="22" spans="1:5" x14ac:dyDescent="0.25">
<c r="A22" t="s">
<v>28</v>
</c>
<c r="B22" t="s">
<v>50</v>
</c>
<c r="C22" t="s">
<v>50</v>
</c>
<c r="D22" t="s">
<v>50</v>
</c>
<c r="E22" s="1"/>
</row>

In Excel, when you enter such a cell, you can see the style switch from "General" to "Custom".

@jennybc

This comment has been minimized.

Member

jennybc commented Feb 4, 2017

@RMHogervorst Your problem sounds much more like #199. Of course, it's possible you are suffering from both.

This was referenced Feb 4, 2017

@stla

This comment has been minimized.

stla commented Feb 5, 2017

@rappster
A faster one:

ensureNonNaRange2 <- function(dat) {
  idx_col <- ! sapply(dat, function(ii) all(is.na(ii)))
  idx_row <- ! sapply(1:nrow(dat), function(ii) all(is.na(unlist(dat[ii, ]))))
  dat[idx_row, idx_col]
}
ensureNonNaRange3 <- function(dat) {
  notna <- !is.na(dat)
  idx_col <- apply(notna, 2, any)
  idx_row <- apply(notna, 1, any)
  dat[idx_row, idx_col]
}
path <- system.file("extdata", "datasets.xlsx", package="readxl")
microbenchmark::microbenchmark(
  readxl_3 = ensureNonNaRange2(read_excel(path, sheet = 1)),
  readxl_4 = ensureNonNaRange3(read_excel(path, sheet = 1))
)
#
Unit: milliseconds
     expr      min       lq     mean   median       uq      max neval cld
 readxl_3 34.67895 35.93513 37.04277 36.82717 37.99789 41.43955   100   b
 readxl_4 24.49031 25.08962 25.81716 25.50953 26.02584 31.22101   100  a 

@jennybc jennybc closed this in c6e0f8f Feb 6, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment