New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing column when using read_excel #99

Closed
reinierv4 opened this Issue May 14, 2015 · 18 comments

Comments

Projects
None yet
@reinierv4

reinierv4 commented May 14, 2015

I try to read in a .xls file that is downloaded from the internet. When I use

df=read_excel("FILEPATH",col_names=FALSE)

The df is created but is missing the last column. When I manually open the .xls file first and than click on save this issue disappears and the file is read in normally.To reproduce the behavior it is therefore important to not open the excel file and save "changes" manually before reading it in to R.

An example of the file I try to read in can be downloaded here:

http://www.screencast.com/users/Reinier7840/folders/Default/media/8ab2a43b-6560-4730-890f-e93c1a5ef931

I tried different things like skipping the first line, declaring the columnames and types, and setting col_names equal to true, but that doesn't seem to solve the issue.

Edit: Loading the same file with the xlsReadWrite package leads to the following error message:

"Error: IOException (Java): block[ 2 ] already removed - does your POIFS have circular or duplicate block references?"

@drpeterfoster

This comment has been minimized.

drpeterfoster commented Jun 18, 2015

bump. experienced the same thing. re-saving .xls fixed the issue.

@sfirke

This comment has been minimized.

sfirke commented Jul 9, 2015

I have the same problem, caused by some columns having many empty rows before the first value. E.g., 671 empty rows before first value.

Re-saving did not help, did not matter if it was .xls or .xlsx, it was only reading 117 of 184 variables. I added a row of dummy values across the whole data set in Excel and then it loaded fine.

@jomuller

This comment has been minimized.

jomuller commented Aug 18, 2015

I had the same issue with a .xls file. When the last column don't have a value on the row just abowe the header, the column is not imported with read_excel. If I add a value (e.g. "NA"), the column is correctly imported. To re-save the .xls file with Excel 2010, it also fixed the problem.

@morsem

This comment has been minimized.

morsem commented Aug 25, 2015

I'm having a similar issue too. There are 17 columns in the data, but only the first 11 appear. This is because column 12 has empty cells until row 641. My work-around has been just adding place-holders to the first row.

> library(readxl)
> 
> # https://www.dropbox.com/sh/7ldzfux4epgbwfv/AAC6ytAMgO5nt08nyugwwxpVa?dl=0
> 
> data <- read_excel("~/Dropbox/readxl_example/de_lose_cols.xlsx", col_names = F)
> dim(data)
[1] 7165   11
> data[1, ]
Source: local data frame [1 x 11]

                 X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 UNITED STATES SENATOR NA NA NA NA NA NA NA NA NA  NA
> 
> data2<- read_excel("~/Dropbox/readxl_example/de_placeholders.xlsx", col_names = F)
> dim(data2)
[1] 7165   17
> data2[1, ]
Source: local data frame [1 x 17]

                 X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16
1 UNITED STATES SENATOR  X  X  X  X  X  X  X  X  X   X   X   X   X   X   X   X

> # yes, this data is really messy - here's the first instance of a data in would-be-missing col
> data2[640:642, ]
Source: local data frame [3 x 17]

                      X0    X1         X2 X3 X4         X5 X6 X7        X8 X9 X10        X11 X12 X13 X14 X15 X16
1 REPRESENTATIVE IN CONGRESS    NA         NA NA NA         NA NA NA        NA NA  NA         NA  NA  NA  NA  NA  NA
2                   District Total   CARNEY J NA NA     IZZO R NA NA  AUGUST B NA  NA    GESTY S  NA  NA  NA  NA  NA
3                         NA    NA DEMOCRATIC NA NA REPUBLICAN NA NA     GREEN NA  NA LIBERTARIN  NA  NA  NA  NA  NA
@bjornerstedt

This comment has been minimized.

bjornerstedt commented Oct 16, 2015

Here is an example of the problem. The script test.R opens test.xls correctly with 3 columns with the xlsx package but only with two using readxl. Unfortunately I could not upload the files, so I temporarily put them here:

https://github.com/bjornerstedt/test

@danielsjf

This comment has been minimized.

danielsjf commented Feb 18, 2016

I experienced the same problem. Here is an example file:
https://www.entsoe.eu/fileadmin/template/other/statistical_database/excel.php?pid=137&&opt_period=2&opt_Country=3&opt_Month=1&opt_Year=2008&send=send&opt_Response=99&dataindx=0

If I do

tmp_dest <- link_to_file
read_excel(tmp_dest, sheet=1, skip=9) # data begins on 10th line

the last column never loads.

@aranryan

This comment has been minimized.

aranryan commented Apr 21, 2016

Thanks drpeterfoster. I was having the same issue, read_excel would miss the last column. Re-saving the Excel file did the trick. It is a pretty ordinary Excel file, no rows with missing data.

@matteodefelice

This comment has been minimized.

matteodefelice commented Jun 15, 2016

I had the same problem with the same data by @danielsjf , last column never load. The package xlsx gives me also an error

Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl,  : 
  org.apache.poi.hssf.OldExcelFormatException: The supplied spreadsheet seems to be Excel 5.0/7.0 (BIFF5) format. POI only supports BIFF8 format (from Excel versions 97/2000/XP/2003)

perhaps the files are not proper Microsoft Excel files...
UPDATE: I succeeded by using gdata package

@petersmp

This comment has been minimized.

petersmp commented Nov 10, 2016

Of note, this appears to still be an issue. I am not sure where the cause is -- these are files generated by a third-party tool as with some of the others (particularly those in other, related, issues). Re-saving (without any other change) worked. Note, I am running Ubuntu 16.04 and using LibreOffice Calc to view the files. The column loss happens whether or not I have opened the file, but goes away once I save.

@jennybc

This comment has been minimized.

Member

jennybc commented Jan 31, 2017

Once #240 is merged, I expect matters to improve for xlsx (please open a new issue if not!) or be covered by #157. I'm going to regard this issue as specific to xls going forward.

@jennybc

This comment has been minimized.

Member

jennybc commented Feb 24, 2017

I can reproduce this and at the moment #271 does not appear to fix it.

@jennybc

This comment has been minimized.

Member

jennybc commented Feb 25, 2017

Duplicate of #152

@kamilien1

This comment has been minimized.

kamilien1 commented Mar 14, 2017

nice catch!

@ea-guerette

This comment has been minimized.

ea-guerette commented Mar 29, 2017

EDIT: I installed the github version. This one imports all the columns (yay!), BUT col_names seems broken and, more importantly, it reads my Date and Time columns as NAs whatever col_types I assign to them...

I would attach an example, but this platform does not accept excel files! -which is probably a policy I should adopt myself! :P

Original post:
Was this issue resolved for xls files?
I just installed readxl today (from CRAN) and am encountering this issue.
readxl won't import the last column, even though it has a header.
Opening the file and saving it solves the issue, as mentioned by other users. This is impractical as I have hundreds of files and they keep coming...
Thanks.

@jennybc

This comment has been minimized.

Member

jennybc commented Mar 29, 2017

Sounds like you need to open a new issue demonstrating the problems you see.

How to provide a readxl reprex

We're in a much better position to address your issue if you can provide a reprex (reproducible example). Provide as much of this as you can:

  • An actual xls or xlsx file. Pick one:
    • Your personal xls or xlsx: try to strip it down to the minimal size and complexity to demonstrate your point. And, obviously, remove any sensitive data.
    • A publicly available xls or xlsx: provide URL and the code you used to download.
  • A small bit of R code that uses readxl on the provided xls or xlsx file and demonstrates your point.
    • Consider using the reprex package to prepare this. In addition to nice formatting, this ensures your reprex is self-contained.
  • Any details about your environment that seem clearly relevant, such as operating system.
    reprex(..., si = TRUE)
    will append a standard summary, folded neatly away, at the bottom of your reprex.

How to provide your own xls/xlsx file? In order of preference:

  1. Attach the file directly to your issue. Instructions are always at the bottom of the issue or comment box.
  2. Share via DropBox or Google Drive and provide the link in your issue.
  3. Explain you absolutely cannot provide a relevant file via github.com and offer to provide privately.
  4. Don't share a file and realize you're hoping for, e.g., a bug fix with no concrete example to go on.
@ea-guerette

This comment has been minimized.

ea-guerette commented Mar 30, 2017

@pranathichunduru

This comment has been minimized.

pranathichunduru commented Oct 2, 2017

Hi ,
I have had the same issue. The R-script read_excel, read_xls,read.xls(from gdata pakcage) only imports say 100 columns of 150 column excel file (50 columns missing) .
I tried re-saving the excel file with dummy values (NA) in the missing rows ,still no luck.
Is there anything that I am missing or can anyone suggest an alternative.

Thanks

@jennybc

This comment has been minimized.

Member

jennybc commented Oct 2, 2017

@pranathichunduru Are you absolutely sure you are using the most recent released version of readxl? What exactly is your version number? Use packageVersion("readxl").If you follow the link above, you'll see that a readxl fix did address with that person's problem and sheet.

If this is still a problem with the CRAN version, then I need access to your sheet. It is probably written by some non-Excel tool?

Also, are you missing rows or columns? Big difference!

You should open a new issue and do this:

How to provide a readxl reprex

We're in a much better position to address your issue if you can provide a reprex (reproducible example). Provide as much of this as you can:

  • An actual xls or xlsx file. Pick one:
    • Your personal xls or xlsx: try to strip it down to the minimal size and complexity to demonstrate your point. And, obviously, remove any sensitive data.
    • A publicly available xls or xlsx: provide URL and the code you used to download.
  • A small bit of R code that uses readxl on the provided xls or xlsx file and demonstrates your point.
    • Consider using the reprex package to prepare this. In addition to nice formatting, this ensures your reprex is self-contained.
  • Any details about your environment that seem clearly relevant, such as operating system.
    reprex(..., si = TRUE)
    will append a standard summary, folded neatly away, at the bottom of your reprex.

How to provide your own xls/xlsx file? In order of preference:

  1. Attach the file directly to your issue. Instructions are always at the bottom of the issue or comment box. .xlsx is a supported file type. You'll need to zip or gzip .xls so it appears as .zip or .gz.
  2. Share via DropBox or Google Drive and provide the link in your issue.
  3. Explain you absolutely cannot provide a relevant file via github.com and offer to provide privately.
  4. Don't share a file and realize you're hoping for, e.g., a bug fix with no concrete example to go on.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment