New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error when a sheet doesn't have a file in /xl/worksheets #408

Closed
nacnudus opened this Issue Nov 27, 2017 · 8 comments

Comments

Projects
None yet
2 participants
@nacnudus
Contributor

nacnudus commented Nov 27, 2017

One of the Enron files has a sheet, Module1, that doesn't have a corresponding file in /xl/worksheets. This causes an error when importing the sheet, e.g. when importing every sheet of every file in a directory.

daren_farmer__6529__egmnom-Jan.xlsx

readxl::excel_sheets("daren_farmer__6529__egmnom-Jan.xlsx")
#> [1] "EGM 60"       "DELV SUMMARY" "EGM 201"      "EGM 202"     
#> [5] "RHODIA"       "Module1"

readxl::read_excel("daren_farmer__6529__egmnom-Jan.xlsx", "Module1")
#> Error in read_fun(path = path, sheet = sheet, limits = limits, shim = shim, : `` not found

The problem disappears (for me) when resaving the file with Excel 2010, because Excel creates a file /xl/worksheets/sheet6.xml. So this is presumably rare.

It is named by excel_sheets() because it is named in /xl/workbook.xml, with an empty r:id. So one way to handle it would be not to include a sheet in excel_sheets() if the r:id is empty. Or maybe a graceful failure would be preferable -- I don't really have an opinion.

<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:x15="http://schemas.microsoft.com/office/spreadsheetml/2010/11/main" mc:Ignorable="x15">
<fileVersion appName="xl" lastEdited="6" lowestEdited="6" rupBuild="14420"/>
<workbookPr codeName="ThisWorkbook" defaultThemeVersion="153222"/>
<mc:AlternateContent xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006">
<mc:Choice Requires="x15">
<x15ac:absPath xmlns:x15ac="http://schemas.microsoft.com/office/spreadsheetml/2010/11/ac" url="C:\Users\Felienne\Enron\EnronSpreadsheets\"/>
</mc:Choice>
</mc:AlternateContent>
<bookViews>
<workbookView xWindow="240" yWindow="90" windowWidth="9135" windowHeight="4965" tabRatio="721" activeTab="2"/>
</bookViews>
<sheets>
<sheet name="EGM 60" sheetId="20" r:id="rId1"/>
<sheet name="DELV SUMMARY" sheetId="21" r:id="rId2"/>
<sheet name="EGM 201" sheetId="22" r:id="rId3"/>
<sheet name="EGM 202" sheetId="23" r:id="rId4"/>
<sheet name="RHODIA" sheetId="24" r:id="rId5"/>
<sheet name="Module1" sheetId="19" state="veryHidden" r:id=""/>
</sheets>
<definedNames>
<definedName name="_xlnm.Print_Area" localSheetId="2">'EGM 201'!$A$1:$J$75</definedName>
<definedName name="_xlnm.Print_Area" localSheetId="3">'EGM 202'!$A$1:$D$53</definedName>
<definedName name="_xlnm.Print_Area" localSheetId="0">'EGM 60'!$A$1:$D$34</definedName>
<definedName name="_xlnm.Print_Area" localSheetId="4">RHODIA!$A$1:$D$26</definedName>
</definedNames>
<calcPr calcId="152511" iterate="1" iterateCount="1"/>
</workbook>

I don't think the state="veryHidden" property is anything to do with it, at least not with modern versions of Excel. It doesn't prevent there being an /xl/worksheets/sheet6.xml file. The property controls the visibility of the sheet in the Excel UI, and veryHidden means it is invisible in both the normal view and the VBA window, and can only be accessed via VBA code (or by unzipping the file).

@jennybc

This comment has been minimized.

Member

jennybc commented Nov 27, 2017

How do you think such an xlsx file comes to be 🤔? What I really mean is, do you think this happens in the wild or is something deeply peculiar to the history of that file? If it happens in the wild, I'm surprised we've never had a report before but who knows.

@nacnudus

This comment has been minimized.

Contributor

nacnudus commented Nov 27, 2017

I was wondering the same thing, because Enron predates the xlsx format. The file came from https://gitlab.com/rsheets/enron_corpus -- do you remember whether they were converted from xls to xls first? Anyway, I'd say it's deeply peculiar and I'm not bothered by it, just thought it worth noting.

@jennybc

This comment has been minimized.

Member

jennybc commented Nov 27, 2017

I'm 99% sure @richfitz did not do the xls to xlsx conversion, i.e. that was done by Felienne Hermanns group or earlier, when the corpus was assembled.

@jennybc

This comment has been minimized.

Member

jennybc commented Apr 15, 2018

Does this one sheet remain our sole example of this? If so, I am inclined to not act on it.

@jennybc

This comment has been minimized.

Member

jennybc commented Apr 25, 2018

Something I did seems to have fixed this. Probably eeeebf8.

🎉

@jennybc jennybc closed this Apr 25, 2018

@jennybc

This comment has been minimized.

Member

jennybc commented Apr 25, 2018

Oh nevermind. I still can't read that exact sheet, although I can read others. I think, until I see more examples of this, I am not going to worry about it.

@jennybc

This comment has been minimized.

Member

jennybc commented Apr 25, 2018

I did some excelgesis on it. Here's the sheets node in workbook.xml:

<sheets>
<sheet name="EGM 60" sheetId="20" r:id="rId1"/>
<sheet name="DELV SUMMARY" sheetId="21" r:id="rId2"/>
<sheet name="EGM 201" sheetId="22" r:id="rId3"/>
<sheet name="EGM 202" sheetId="23" r:id="rId4"/>
<sheet name="RHODIA" sheetId="24" r:id="rId5"/>
<sheet name="Module1" sheetId="19" state="veryHidden" r:id=""/>
</sheets>

So this is a matter of: how to handle "veryHidden" sheets?

Update: I now see you @nacnudus said most of this above yourself.

@jennybc jennybc reopened this Apr 25, 2018

@jennybc

This comment has been minimized.

Member

jennybc commented Apr 25, 2018

I think having a stance on hidden sheets is important. But I doubt this spreadsheet is a good example to base such work on. I suspect that this hit some weird edge case in whatever tool was used to create the Enron corpus, which we suspect involved some xls to xlsx conversion.

If I resave this file, the "Module1" sheet remains listed in workbook.xml, remains veryHidden, gains a proper r:id attribute, and gains its own XML file below xl/worksheets/ (but with empty sheetData node). As @nacnudus saw as well.

Back to this position: unless we have ongoing reports of this, I will let this be. I could change the sheet parsing to only acknowledge a sheet if the r:id attribute has more than zero characters but I think it's not worth complicating the code for this. It seems reasonable to use tryCatch() when doing something ambitious, like reading every putative worksheet in the Enron corpus.

@jennybc jennybc closed this Apr 25, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment