New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reads blank rows after end of data #203

Closed
tcquinn opened this Issue Sep 17, 2016 · 7 comments

Comments

Projects
None yet
5 participants
@tcquinn

tcquinn commented Sep 17, 2016

With some Excel files, readxl seems to read a large number of blank rows after the end of the visible data. For example, when reading this file from the New York City Department of Finance (with skip=4), readxl reads 25,141 rows, but 19,990 of them are blank. I'm using the CRAN version (0.1.1). (Sorry if it's verboten to report issues with the CRAN version here; I posted on Stack Overflow but didn't get any useful responses)

@smckenna

This comment has been minimized.

smckenna commented Nov 2, 2016

I have seen same behavior and it's a killer! Deleting rows by selecting the cells and hitting the delete key does not seem to delete the content. It disappears, but readxl finds it somehow! I did find that if you delete rows by selecting rows using their row numbers, right-clicking, and doing delete, that seems to really delete the rows.

@jamesui

This comment has been minimized.

jamesui commented Nov 18, 2016

Dealing with the same issue, though it is easy to fix by using tidyr::drop.na(df) after using read_excel.

@jennybc

This comment has been minimized.

Member

jennybc commented Nov 18, 2016

@smckenna @jamesui Do you see this with xls also (as the example @tcquinn links to) or with xlsx? If xlsx, can you share or link an example, to complete the set?

@jennybc jennybc referenced this issue Jan 7, 2017

Closed

Skip empty rows #22

@jennybc jennybc added bug blanks labels Jan 7, 2017

@jennybc

This comment has been minimized.

Member

jennybc commented Jan 7, 2017

library(readxl)
#download.file("https://www1.nyc.gov/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2012/2012_bronx.xls", destfile = "2012_bronx.xls")

df <- read_excel("2012_bronx.xls")
#> DEFINEDNAME: 21 00 00 01 0b 00 00 00 01 00 00 00 00 00 00 0d 3b 00 00 04 00 04 00 00 00 14 00 
#> and so on and so forth
df
#> # A tibble: 25,145 × 21
#>    `BRONX ANNUALIZE SALE FOR YEAR 2012.  All Sales From  Jan 1, 2012 - Dec 31, 2012.`
#>                                                                                 <chr>
#> 1                      Sales File as of 03/31/2013  Coop Sales Files as of 03/31/2013
#> 2                            Neighborhood Name and Descriptive Data is as of 04/19/13
#> 3                 Building Class Category is based on Building Class at Time of Sale.
#> 4                                                                           BOROUGH\n
#> 5                                                                            2.000000
#> 6                                                                            2.000000
#> 7                                                                            2.000000
#> 8                                                                            2.000000
#> 9                                                                            2.000000
#> 10                                                                           2.000000
#> # ... with 25,135 more rows, and 20 more variables: `NA` <chr>,
#> #   `NA` <chr>, `NA` <chr>, `NA` <chr>, `NA` <chr>, `` <chr>, `` <chr>,
#> #   `` <chr>, `` <chr>, `` <chr>, `` <chr>, `` <chr>, `` <chr>, `` <chr>,
#> #   `` <chr>, `` <chr>, `` <chr>, `` <chr>, `` <chr>, `` <chr>
dim(df)
#> [1] 25145    21

df2 <- tidyr::drop_na(df)
dim(df2)
#> [1] 5152   21

as.data.frame(tail(df))
#>   BRONX ANNUALIZE SALE FOR YEAR 2012.  All Sales From  Jan 1, 2012 - Dec 31, 2012.
#> 1                                                                             <NA>
#> 2                                                                             <NA>
#> 3                                                                             <NA>
#> 4                                                                             <NA>
#> 5                                                                             <NA>
#> 6                                                                             <NA>
#>     NA   NA   NA   NA   NA                                             
#> 1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 5 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 6 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#>                                
#> 1 <NA> <NA> <NA> <NA> <NA> <NA>
#> 2 <NA> <NA> <NA> <NA> <NA> <NA>
#> 3 <NA> <NA> <NA> <NA> <NA> <NA>
#> 4 <NA> <NA> <NA> <NA> <NA> <NA>
#> 5 <NA> <NA> <NA> <NA> <NA> <NA>
#> 6 <NA> <NA> <NA> <NA> <NA> <NA>
@jennybc

This comment has been minimized.

Member

jennybc commented Jan 7, 2017

I've "moved" @burchill's comment to #224, because it's about a distinct problem.

@jennybc

This comment has been minimized.

Member

jennybc commented Feb 5, 2017

This comes from taking the declared dimensions literally. I can already fix for xlsx, by replacingcacheDimensions() with computeDimensions() here:

cacheDimension();

Verified with xlsx saved from 2012_bronx.xls sheet from above.

@jennybc jennybc closed this in c6e0f8f Feb 6, 2017

@zbrainsoft

This comment was marked as spam.

zbrainsoft commented Oct 18, 2018

You can highlight or delete empty or blank rows easily using Dose for Excel Add-In which provides more than +100 Features, check their website in below:
https://www.zbrainsoft.com/excel-delete-rows.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment