New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

n blank rows in xlsx lead to n trailing NA rows in data frame #224

Closed
jennybc opened this Issue Jan 7, 2017 · 1 comment

Comments

Projects
None yet
1 participant
@jennybc
Member

jennybc commented Jan 7, 2017

Moving from #203, which is about a different problem with blank cells.

Comment I deleted there from @burchill:

readxl gets the initial dimensions of the worksheet, which it thinks ALWAYS starts at A1, and then goes to whatever the maximum row and column are.

But .xlsx files don't store empty rows, so if the first few rows are entirely empty, readCols starts from the first NON-EMPTY row, but then still iterates until it gives values for every cell in a matrix of the dimensions it originally measured, You can replicate it with any simple .xlsx file where the data doesn't start until a few rows down. E.g.:

A B C
val1 val2 val3
val4 val5 val6

becomes:

val1 val2 val3
val4 val5 val6
NA NA NA
NA NA NA
NA NA NA
@jennybc

This comment has been minimized.

Member

jennybc commented Jan 7, 2017

If there are n leading blank rows in xlsx, the data frame readxl produces will have n trailing rows of NA. Reprex:

library(readxl)
excel_sheets("leading-blank-rows.xlsx")
#> [1] "one_blank_row"   "four_blank_rows"
df <- read_excel("leading-blank-rows.xlsx", sheet = "one_blank_row")
df
#> # A tibble: 2 × 3
#>   `val1,1` `val1,2` `val1,3`
#>      <chr>    <chr>    <chr>
#> 1   val2,1   val2,2   val2,3
#> 2     <NA>     <NA>     <NA>

df2 <- read_excel("leading-blank-rows.xlsx", sheet = "four_blank_rows")
df2
#> # A tibble: 5 × 3
#>   `val1,1` `val1,2` `val1,3`
#>      <chr>    <chr>    <chr>
#> 1   val2,1   val2,2   val2,3
#> 2     <NA>     <NA>     <NA>
#> 3     <NA>     <NA>     <NA>
#> 4     <NA>     <NA>     <NA>
#> 5     <NA>     <NA>     <NA>

@jennybc jennybc changed the title from Leading blank rows in xlsx lead to trailing NA rows in data frame to n leading blank rows in xlsx lead to n trailing NA rows in data frame Jan 7, 2017

@jennybc jennybc added bug blanks labels Jan 7, 2017

@jennybc jennybc changed the title from n leading blank rows in xlsx lead to n trailing NA rows in data frame to n blank rows in xlsx lead to n trailing NA rows in data frame Jan 7, 2017

@jennybc jennybc closed this in #240 Jan 31, 2017

jennybc added a commit that referenced this issue Jan 31, 2017

Load cells at xlsx worksheet ingest; handle skipping and/or blank row… (
#240)

* Load cells at xlsx worksheet ingest; handle skipping and/or blank rows; fixes #224

* Actually these *should* be the same

* Make skipping tests more challenging (blank row btwn col names and data, plus another embedded blank row)

* Simplify return of 0x0 tibble for completely empty worksheet

* Mark cells to start reading from at the time of worksheet construction

* Remove vestigial, internal cell printing function

* Be quiet about empty worksheet

* More tests of nothingness

* Inform worksheet about its own name

* Tighten up the NEWs bullet for this PR

* Make some accessor member functions const

* Use skip() to issue note-to-future-self

Expedient place to park a summary of what this entire PR does.

* Improved parsing of sheet geometry for xlsx. (#240, @jennybc).

    - Better handling of leading and embedded blank rows and explicit row skipping. (#224, #194, #178, #156, #101)
    - Worksheets that are completely empty or that contain only column names no longer error, but return a tibble with zero rows. (#222, #144, #65)
    - Location is inferred for cells that do not declare their location (e.g. xlsx written by JMP). (#163, #102)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment