New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

excel_sheets does not work with *.xltx/*.xltm #411

Closed
thothal opened this Issue Dec 7, 2017 · 10 comments

Comments

Projects
None yet
2 participants
@thothal

thothal commented Dec 7, 2017

Problem

While excel_sheets cannot handle Excel templates (*.xlt[xm]), the underlying workhorse readxl:::xlsx_sheets can. The error comes from readxl:::excel_format which throws an error as soon as the file is not a *.(xls|xlsx|xlsm).

Possible Fix

Possible fix would be to change readxl:::excel_format to:

function (path) 
{
    ext <- tolower(tools::file_ext(path))
    switch(ext, xls = "xls", xlsx = "xlsx", xlsm = "xlsx", xltx = "xlsx", xltm = "xlsx", 
        if (nzchar(ext)) {
            stop("Unknown file extension: ", ext, call. = FALSE)
         } else {
            stop("Missing file extension.", call. = FALSE)
        })
}

There are of course other Excel formats which may deserve to be handled too, Excel 97-2003 templates for instance (*.xlt).

Use Case

If you are wondering, why somebody wants to read in templates, let me give you my use case: in a Shiny app I want the user to upload a file containing some data. In order to make sure that the data is compliant with my Shiny app, I do some checks.

Basically, I compare the structure of the user data with the structure of a template stored in my app folder. This approach avoids hardcoding all the checks. As I am providing an Excel template anyways to the users of my tool, I gain quite some flexibility. I could of course fall back to a *.xlsx file, but this does not feel right conceptually.

Reprex

Sample file:
Book1.zip

download.file("https://github.com/tidyverse/readxl/files/1539048/Book1.zip", "./Book1.zip")
unzip("Book1.zip")

readxl:::excel_sheets("Book1.xltx") 
# Error: Unknown file extension: xltx

readxl:::xlsx_sheets("Book1.xltx")
# [1] "Sheet1"

unlink("Book1.xltx")
unlink("Book1.zip")
@jennybc

This comment has been minimized.

Member

jennybc commented Dec 7, 2017

Agree this would be good to fix, i.e. to not fail too early simply because we've ruled out a format. I don't have any experience with templates....

@jennybc jennybc added the feature label Apr 15, 2018

jennybc added a commit that referenced this issue Apr 16, 2018

@jennybc

This comment has been minimized.

Member

jennybc commented Apr 16, 2018

Will you see if the improvements in #454 work for you?

devtools::install_github("tidyverse/readxl#454")

@jennybc jennybc added this to the v1.1.0 milestone Apr 16, 2018

jennybc added a commit that referenced this issue Apr 17, 2018

@jennybc

This comment has been minimized.

Member

jennybc commented Apr 17, 2018

Will you see if the improvements in #457 work for you?

devtools::install_github("tidyverse/readxl#457")

jennybc added a commit that referenced this issue Apr 17, 2018

@jennybc jennybc closed this in #457 Apr 17, 2018

jennybc added a commit that referenced this issue Apr 17, 2018

@thothal

This comment has been minimized.

thothal commented Apr 17, 2018

I get:

devtools::install_github("tidyverse/readxl#457")
Downloading GitHub repo tidyverse/readxl@refactor-format-guessing
from URL https://api.github.com/repos/tidyverse/readxl/zipball/refactor-format-guessing
Installation failed: 404: Not Found
 (404)

But I guess I was too late to test it?

@jennybc

This comment has been minimized.

Member

jennybc commented Apr 17, 2018

I’d still love for you to test! It’s in master now.

@thothal

This comment has been minimized.

thothal commented Apr 18, 2018

Works like a charm now. Thanks!

@jennybc

This comment has been minimized.

Member

jennybc commented Apr 18, 2018

Great! Submitting to CRAN today.

@jennybc

This comment has been minimized.

Member

jennybc commented Apr 22, 2018

@thothal I'm interested to learn more about your workflow. Especially this part:

In order to make sure that the data is compliant with my Shiny app, I do some checks.

Basically, I compare the structure of the user data with the structure of a template stored in my app folder. This approach avoids hardcoding all the checks.

I have vague aspirations, along with Aaron Berg at RStudio, to provide some workflow-level guidance when a keen R user needs to work with people more comfortable in a spreadsheet application. It sounds like you might have some interesting insights.

What sort of checks do you on the uploaded user file? I'm curious to hear more.

@thothal

This comment has been minimized.

thothal commented Apr 23, 2018

For the Shiny tools we are developing, the user (mostly non stats scientists) is typically asked to upload their data. For this task, we provide them with an Excel template where they can fill the data and sometimes these templates also includes some VBA checks to ensure data consistency.

For the Shiny app it is of course important that the required data is where it supposed to be. That is, we would like to have the right columns available. Thus, the approach we have pursued in the past is the following:

  1. Compare whether all sheets in the template (which we store in a Shiny sub directory) are also present in the uploaded file. There is also an option to define a blacklist of sheets which do not need to be checked.
  2. If the former test passes, we loop through all sheets and compare column headings. We check only if these headings are present and not whether they are in the same order. We typically start to read and compare the second row (which is usually hidden to the user), but the user can specify any other starting point on the Shiny side. In the first row, we keep long human readable names, but in the second row, we store R compliant column headings (cf. screenshot)

image

Idea behind this approach is that if the template changes slightly, the Shiny app ideally does not need to be changed - all what is needed that the new template is also in the relevant sub folder. The code is modular and does not require any hard coding on the Shiny side (unless the column headers do not start in the default row). If we start a new project, all we have to do is to provide an Excel template and the testing procedure. With this set up we ensure

  1. Data consistency: checked on an Excel basis, i.e. there where the user enters the data
  2. Uploaded Data structure consistency: check files for the formal requirements, that is all the columns we expect (as defined in the template) are present.

For future iterations we thought about providing a template template, from which project specific templates derive and which has one ".SYSTEM" sheet (say), where all the project specific settings (e.g. which sheets not to check, what are the cell boundaries for comparing etc.) are stored. In this way we should be able to eliminate the last bit of Shiny hard coding completely.

Does that make sense to you?

@jennybc

This comment has been minimized.

Member

jennybc commented Apr 23, 2018

Interesting! Thanks for providing more details. The hidden row of "true" column names is interesting.

I also want to make named ranges more of a "first class" addressable object in future. I'm baking this in to the reboot of googlesheets. Because I think that is another tool for this sort of workflow, where you can mark off an area in the spreadsheet that needs to be managed your way (and possibly force certain practices there, with data validation, etc.) and then let people do whatever they want elsewhere.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment