New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem reading some xlsx - Error: Couldn't find 'xl/worksheets/sheet1.xml' #80

Closed
e-kotov opened this Issue Apr 16, 2015 · 28 comments

Comments

Projects
None yet
@e-kotov

e-kotov commented Apr 16, 2015

readxl version 0.1.0

There is a problem with reading some xlsx files (not the ones generated by Google Docs - this is some other xlsx generating BI)

please try the following:

download.file("https://raw.githubusercontent.com/e-kotov/tmp/master/Ekaterinburg_IP_9.xlsx",
                              method = "curl", dest = "Ekaterinburg_IP_9.xlsx")
                x <- read_excel("Ekaterinburg_IP_9.xlsx", skip = 5)

It returns " Error: Couldn't find 'xl/worksheets/sheet1.xml' ".

As soon as the file in question is opened in Excel and re-saved, it is loaded with no problem.

@e-kotov

This comment has been minimized.

e-kotov commented Apr 16, 2015

excel_sheets("Ekaterinburg_IP_9.xlsx")

returns

  [1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
 [42] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
 [83] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""

After re-saving it returns the correct number of sheets, which in this case is one:

[1] "СПАРК - Список"

read.xlsx2 deals just fine with this kind of files.

@kcandrews

This comment has been minimized.

kcandrews commented Jul 2, 2015

I am getting the same issue, and can resolve it the same way (e.g. by resaving them with Excel). openxlsx has the same issue, but the Java-based XLConnect::getSheets works.

Emacs tells me the zipped structure for the "broken" xlsx file is:

M Filemode      Length  Date         Time      File
- ----------  --------  -----------  --------  --------------------------
  -rw-rw-rw-       365   2-Jul-2015  15:04:08  xl/workbook.xml
  -rw-rw-rw-       296   2-Jul-2015  15:04:08  _rels/.rels
  -rw-rw-rw-       125   2-Jul-2015  15:04:08  xl/styles.xml
  -rw-rw-rw-       436   2-Jul-2015  15:04:08  xl/_rels/workbook.xml.rels
  -rw-rw-rw-     48684   2-Jul-2015  15:04:08  xl/worksheets/sheet.xml
  -rw-rw-rw-       599   2-Jul-2015  15:04:08  [Content_Types].xml
- ----------  --------  -----------  --------  --------------------------
                 50505                         6 files

whereas, for the file I saved over, it became:

M Filemode      Length  Date         Time      File
- ----------  --------  -----------  --------  ---------------------------------------
  -rw-rw-rw-      1409   1-Jan-1980  00:00:00  [Content_Types].xml
  -rw-rw-rw-       735   1-Jan-1980  00:00:00  _rels/.rels
  -rw-rw-rw-       698   1-Jan-1980  00:00:00  xl/_rels/workbook.xml.rels
  -rw-rw-rw-       554   1-Jan-1980  00:00:00  xl/workbook.xml
  -rw-rw-rw-      3625   1-Jan-1980  00:00:00  xl/sharedStrings.xml
  -rw-rw-rw-       322   1-Jan-1980  00:00:00  xl/worksheets/_rels/sheet1.xml.rels
  -rw-rw-rw-      7079   1-Jan-1980  00:00:00  xl/theme/theme1.xml
  -rw-rw-rw-      1260   1-Jan-1980  00:00:00  xl/styles.xml
  -rw-rw-rw-     12371   1-Jan-1980  00:00:00  xl/worksheets/sheet1.xml
  -rw-rw-rw-       791   1-Jan-1980  00:00:00  docProps/app.xml
  -rw-rw-rw-       505   1-Jan-1980  00:00:00  docProps/core.xml
  -rw-rw-rw-      7420   1-Jan-1980  00:00:00  xl/printerSettings/printerSettings1.bin
  -rw-rw-rw-      1502   1-Jan-1980  00:00:00  docProps/custom.xml
- ----------  --------  -----------  --------  ---------------------------------------
                 38271                         13 files
@e-kotov

This comment has been minimized.

e-kotov commented Jul 2, 2015

The problem is that I have hundreds of these xlsx files form the same source (and consequently with the same error), and it takes ages and lots of memory (compared to readxl) to read those with XLConnect::getSheets or xlsx package.

Any thoughts on how to work around that?

@kcandrews

This comment has been minimized.

kcandrews commented Jul 7, 2015

For XLConnect you can increase the amount of memory by setting options(java.parameters = "-Xmx4096m") before loading the library.

@e-kotov

This comment has been minimized.

e-kotov commented Jul 7, 2015

Thanks, I know about the memory workaround, but this is still inconvenient and very slow.

@hadley

This comment has been minimized.

Member

hadley commented Jul 7, 2015

It must be a non-standard xlsx file - I'll take a look when I'm next working on readxl (which unfortunately is unlikely to be for at least a month)

@e-kotov

This comment has been minimized.

e-kotov commented Jul 7, 2015

Your work is very much appreciated!

@bilydr

This comment has been minimized.

bilydr commented Nov 6, 2015

Just encountered the same issue. Looking forward to a solution with readxl :)

@dmirman

This comment has been minimized.

dmirman commented Jan 22, 2016

Just encountered the same issue and fixed it the same way. In case it helps, it was an xlsx file created by the Mac version of Excel.

@hrbrmstr

This comment has been minimized.

hrbrmstr commented Jan 28, 2016

Just had this happen with a CDC FOOD database .xlsx. It has sheet.xml vs sheet1.xml. I'll fix it and shoot a PR

@kamilien1

This comment has been minimized.

kamilien1 commented Feb 1, 2016

I just encountered the same problem. I download from google spreadsheets using the googlesheets package:

   # load in the sheets available to read
    allSheets <- gs_url(url)
    # get the specific spreadsheet with the name equal to the title variable
    spreadsheet <- gs_title(title)
    # download the worksheet from this spreadsheet with name equal to the worksheetName variable
    gs_download(spreadsheet, ws = worksheetName, to = downloadPath)

and then I try to autoread in the spreadsheet

rawData <- read_excel(filein)
Which prints the error
Error: Couldn't find 'xl/worksheets/sheet1.xml' in '(my local filepath to an xlsx file)'

Opening up the xlsx file and saving it locally seems to do the trick. However, I'd love to automatically open it, or find a better workaround that doesn't require opening/closing a file manually.

@hrbrmstr

This comment has been minimized.

hrbrmstr commented Feb 1, 2016

Current code is in C++ and in a numeric for loop so it just requires a bit of logic testing to handle this edge case.

@jennybc

This comment has been minimized.

Member

jennybc commented Feb 1, 2016

I suspect the sheet-identifying problems with spreadsheets obtained from Google are duplicates of #104.

@hrbrmstr The likely problem has been identified by @richfitz in that issue -- it's about mapping the i-th sheet to a specific .xml file. There's an assumption in readxl that's natural and seems to hold for spreadsheets from Excel, but that does not hold for .xlsx generated by Google.

@hrbrmstr

This comment has been minimized.

hrbrmstr commented Feb 1, 2016

Aye (that's the for loop I mentioned and it's cool to have confirmation
:-)

I'll ref #104 as well if I can get to fixing the mappings this week. I
(thankfully?) don't have a wide corpus of Excel/GSheets files to test
mappings against, but I think the issue is both with GSheets and ODS
converted sheets in OpenOffice so I'll generate a few from both (along with
Excel) to test with.

On Mon, Feb 1, 2016 at 1:59 AM, Jennifer (Jenny) Bryan <
notifications@github.com> wrote:

I suspect the sheet-identifying problems with spreadsheets obtained from
Google are duplicates of #104
#104.

@hrbrmstr https://github.com/hrbrmstr The likely problem has been
identified by @richfitz https://github.com/richfitz in that issue --
it's about mapping the i-th sheet to a specific .xml file. There's an
assumption in readxl that's natural and seems to hold for spreadsheets
from Excel, but that does not hold for .xlsx generated by Google.


Reply to this email directly or view it on GitHub
#80 (comment).

@kamilien1

This comment has been minimized.

kamilien1 commented Feb 8, 2016

Good catch, sorry I didn't read close enough!

@jennybc

This comment has been minimized.

Member

jennybc commented May 22, 2016

I looked at the Ekaterinburg_IP_9.xlsx sheet shared by OP @e-kotov. In addition to the file list differences already identified (e.g. sheet.xml vs sheet1.xml), many of the usual files have different namespaces from typical xlsx. And some files have different structure. For example xl/sharedStrings.xml contains font information that is normally found in only xl/styles.xml. Correction: the more complicated structure of string items is a consequence of the Cyrillic alphabet and this persists even when I resave the file with Excel. But the namespaces are also different. So there would need to be many adaptations to parse sheets like this one.

@e-kotov Can you be more specific about how this file was created? If the packages that use the Java API can read it w/o trouble, there's probably a proper name for this, which would be helpful.

It would also be helpful if others on this thread posted actual example files. I somehow doubt these sheets all have the same problem. For example, I know from personal experience that the trouble with xlsx files generated by Google Sheets is usually different and much simpler (#104).

@e-kotov

This comment has been minimized.

e-kotov commented May 22, 2016

@jennybc xlsx package, that relies on Java and is therefore vey slow, reads this file just fine:

install.package("xlsx")
download.file("https://raw.githubusercontent.com/e-kotov/tmp/master/Ekaterinburg_IP_9.xlsx",
                              method = "curl", dest = "Ekaterinburg_IP_9.xlsx")
x <- xlsx::read.xlsx("Ekaterinburg_IP_9.xlsx", sheetIndex = 1, startRow = 6)

While openxlsx also fails to work with this file, saying there are no sheets. :(

install.package("openxlsx")
download.file("https://raw.githubusercontent.com/e-kotov/tmp/master/Ekaterinburg_IP_9.xlsx",
                              method = "curl", dest = "Ekaterinburg_IP_9.xlsx")
openxlsx::getSheetNames("Ekaterinburg_IP_9.xlsx")

As for how this file was created, I'm afraid there isn't much more to say than that it was created by a BI system that generates subsets of a large database into xlsx files for downloading.

I suppose it is a rather small deal, as this problem is specific to me and this BI system I'm using.

Moreover, as I have just checked, this BI has upgraded the way it generates those xlsx files, so the problem is now only with my old files (though I have a few hundred of them...)

Anyway, I think it would benefit the readxl package if it were able to read problematic files like mine, when there is in fact a sheet in the file, but it is a little bit broken.

@jennybc

This comment has been minimized.

Member

jennybc commented Jun 9, 2016

@e-kotov Is it OK if @richfitz and I include Ekaterinburg_IP_9.xlsx as an internal testing/example sheet in https://github.com/rsheets/rexcel? It's very valuable to us in its weirdness and I can't easily get my hands on similar sheets. Though it sounds like @hrbrmstr might be able to hook me up with something from this CDC FOOD database?

@e-kotov

This comment has been minimized.

e-kotov commented Jun 9, 2016

@jennybc yes, it is no problem at all! This file is not classified, which is why I uploaded it to GitHub and shared openly.

@pchtsp

This comment has been minimized.

pchtsp commented Jul 7, 2016

Hello, I have just seen this same problem with an excel file:
0. I am using the CRAN version of readxl (0.1.1)

  1. When I run excel_sheets on it, I get the correct sheet name as a result.
  2. When I try to run read_excel I get the Error: Couldn't find 'xl/worksheets/sheet1.xml' in file error.
  3. I have tried: giving no arguments to sheets, giving sheet=1 and giving it sheet="name".
  4. I have confirmed that opening and closing th excel file with Excel fixes the issue.

Is there any other workaround that doesn't involve a user having to do anything manual like opening the file?

Thank you for the work in this library! it's being really useful so far.

@jennybc

This comment has been minimized.

Member

jennybc commented Jul 7, 2016

@pchtsp I doubt there is another workaround. Is your sheet something you can share? Or can you say how it was created?

@pchtsp

This comment has been minimized.

pchtsp commented Jul 7, 2016

Unfortunately, I cannot share it nor know how it was generated.
I know it was generated automatically by a program somewhere I don't have access by people whom I will never meet.
I also know that this problem is consisted: all files produced by the same system show the same behavior / error.

I have uncompressed the excel file and it does have the sheet1.xml where it is complaining:

file/xl/worksheets/Sheet1.xml

more information

R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=Spanish_Spain.1252  LC_CTYPE=Spanish_Spain.1252   
[3] LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C                  
[5] LC_TIME=Spanish_Spain.1252    

attached base packages:
[1] tools     stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] readr_0.2.2         futile.logger_1.4.1 taskscheduleR_1.0   stringr_1.0.0      
 [5] zoo_1.7-13          RMySQL_0.10.9       DBI_0.4-1           magrittr_1.5       
 [9] tidyr_0.5.1         dplyr_0.5.0         readxl_0.1.1        data.table_1.9.6   

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.5          lattice_0.20-33      assertthat_0.1      
 [4] chron_2.3-47         grid_3.3.1           R6_2.1.2            
 [7] futile.options_1.0.0 stringi_1.1.1        lazyeval_0.2.0      
[10] lambda.r_1.1.7       rsconnect_0.4.3      tibble_1.1          
@pchtsp

This comment has been minimized.

pchtsp commented Jul 10, 2016

I tried with the "openxlsx" package's read.xlsx function and it did read these files correctly. Not sure if this helps though.

I guess I'll be using this function for these files while keeping read_excel for the rest.

Thanks anyway!

@almartin82

This comment has been minimized.

almartin82 commented Jul 29, 2016

I can source one of the 'in the wild' areas where this appears to be showing up - the "View in Excel 2007 Data" output of IBM Cognos Viewer, v 10.2.1
I have an example that I can share of a no-row (headers only) output file for testing (would love a second set of eyes on it to verify that it's not leaking any sensitive metadata before it gets added to the public tests - I am at almartin@gmail.com)

@jennybc

This comment has been minimized.

Member

jennybc commented Jul 29, 2016

@almartin82 I'll send you an email. I'm interested in getting this example.

@mcastagnaa

This comment has been minimized.

mcastagnaa commented Jan 14, 2017

Hi, as requested by your email related to a different issue, this is the file which throws the error I mentioned. You can get the error using the following code:

The XL spreadsheet is created by Bloomberg.

library(readxl)
fullFileName <- list.files(pattern = "*var.+xls$")[1]
rawSheet <- read_excel(fullFileName, sheet = 1)

GMAITvar_3735618_20161019103822.xls.zip

@jennybc

This comment has been minimized.

Member

jennybc commented Jan 16, 2017

Thanks @mcastagnaa.

jennybc added a commit that referenced this issue Jan 24, 2017

More robust logic for xlsx sheet lookup; fixes #104, relates to #80 (#…
…233)

* Add test for sheet data xml lookup

* Initial implementation of more correct Sheet Referencing logic, by @jimhester

* Add test workbook with embedded chartsheet

* Sheet indexing for .xls is still 0-indexed

* Tests for bad worksheet requests; add a check on the R side

* Braces

* Make sure requested sheet is not past the end

* sheet_id_ is not actually used and is easily confused with id_

* Switch to std::map

* Cautionary note in benchmarks.cpp

* Indentation of long lines in tests

* Follow convention re: linking tests and issues

* Remove file that is not used in tests

* Missing &

* Comment to motivate SheetRelations class
@jennybc

This comment has been minimized.

Member

jennybc commented Mar 21, 2017

The Ekaterinburg sheet discussed above presents 3 different challenges: worksheet rels (fixed in #233), XML namespaces (fixed in #295), and one last thing for which I opened a separate issue (#294).

@jennybc jennybc closed this in da6f8be Mar 21, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment