New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

readxl skip and column designation not working if there are blanks #101

Closed
AmyMikhail opened this Issue May 18, 2015 · 3 comments

Comments

Projects
None yet
3 participants
@AmyMikhail

AmyMikhail commented May 18, 2015

I have an MS excel sheet containing around 16,000 rows and 16 columns. There is some un-needed text above the column headers, so I need to skip the first three rows when reading the data in (4th row contains column headers, 5th row is the start of the data).

However I noticed that using skip=3 resulted in the column headers not being read in and everything started on the 5th row.

Also, the 2nd and 3rd column of my data contain text (names) but in one file the first 1500 cells in these columns are blank because the names are missing. These columns are misidentified as numeric unless I specify col_types. The 4th and subsequent columns contain data for all rows. I tried adding an initial column with the sequence number but the result was the same.

I think both behaviors are due to the missing data; not sure if it would be possible to correct this by going to the first cell that has data in it for each column and then determining the column type from there and below (I guess if the data is mixed numeric and text, the column type would default to text and in order to do this multiple cells would have to be evaluated?)

If the automated column type assignation would work even when there is some missing data, this would be helpful as although not too much of a problem to manually specify the types for 16 columns, I often have many more than this.

Would be great if there was a fix for this - otherwise the package works great and I found it much faster than the java based ones.

@DavoOZ

This comment has been minimized.

DavoOZ commented Jul 7, 2015

I have experienced a similar glitch.
If skip>0 and one or more of those header rows is empty, then the read starts at the wrong row.
For example, if skip=9 but there are two empty rows among those 9, then the read starts two rows further down the data than intended.
In other words, as long as some text is present somewhere in every skipped row, then the read will be fine.

@jennybc

This comment has been minimized.

Member

jennybc commented Jan 7, 2017

Concrete examples of the above would be helpful to work against.

How to provide a readxl reprex

We're in a much better position to address your issue if you can provide a reprex (reproducible example). Provide as much of this as you can:

  • An actual xls or xlsx file. Preferably stripped down to the minimal size and complexity to demonstrate your point. And, obviously, stripped of any sensitive data.
  • A small bit of R code that uses readxl on the provided xls or xlsx file and demonstrates your point.
    • Consider using the reprex package to prepare this. In addition to nice formatting, this ensures your reprex is self-contained. Currently GitHub-only, but going to CRAN soon.
  • Any details about your environment that seem clearly relevant, such as operating system.
    reprex(..., si = TRUE)
    will append a standard summary, folded neatly away, at the bottom of your reprex.

How to provide the xls/xlsx file? In order of preference:

  1. Attach the file directly to your issue. Instructions are always at the bottom of the issue or comment box.
  2. Share via DropBox or Google Drive and provide the link in your issue.
  3. Explain you absolutely cannot provide a relevant file via github.com and offer to provide privately.
  4. Don't share a file and realize you're hoping for, e.g., a bug fix with no concrete example to go on.

@jennybc jennybc added the reprex label Jan 7, 2017

@jennybc

This comment has been minimized.

Member

jennybc commented Jan 31, 2017

Will be fixed by #240. And the interaction with column names is covered by #157.

@jennybc jennybc closed this Jan 31, 2017

jennybc added a commit that referenced this issue Jan 31, 2017

Load cells at xlsx worksheet ingest; handle skipping and/or blank row… (
#240)

* Load cells at xlsx worksheet ingest; handle skipping and/or blank rows; fixes #224

* Actually these *should* be the same

* Make skipping tests more challenging (blank row btwn col names and data, plus another embedded blank row)

* Simplify return of 0x0 tibble for completely empty worksheet

* Mark cells to start reading from at the time of worksheet construction

* Remove vestigial, internal cell printing function

* Be quiet about empty worksheet

* More tests of nothingness

* Inform worksheet about its own name

* Tighten up the NEWs bullet for this PR

* Make some accessor member functions const

* Use skip() to issue note-to-future-self

Expedient place to park a summary of what this entire PR does.

* Improved parsing of sheet geometry for xlsx. (#240, @jennybc).

    - Better handling of leading and embedded blank rows and explicit row skipping. (#224, #194, #178, #156, #101)
    - Worksheets that are completely empty or that contain only column names no longer error, but return a tibble with zero rows. (#222, #144, #65)
    - Location is inferred for cells that do not declare their location (e.g. xlsx written by JMP). (#163, #102)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment