Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error when a sheet doesn't have a file in /xl/worksheets #408

Closed
nacnudus opened this issue Nov 27, 2017 · 8 comments
Closed

Error when a sheet doesn't have a file in /xl/worksheets #408

nacnudus opened this issue Nov 27, 2017 · 8 comments

Comments

@nacnudus
Copy link
Contributor

@nacnudus nacnudus commented Nov 27, 2017

One of the Enron files has a sheet, Module1, that doesn't have a corresponding file in /xl/worksheets. This causes an error when importing the sheet, e.g. when importing every sheet of every file in a directory.

daren_farmer__6529__egmnom-Jan.xlsx

readxl::excel_sheets("daren_farmer__6529__egmnom-Jan.xlsx")
#> [1] "EGM 60"       "DELV SUMMARY" "EGM 201"      "EGM 202"     
#> [5] "RHODIA"       "Module1"

readxl::read_excel("daren_farmer__6529__egmnom-Jan.xlsx", "Module1")
#> Error in read_fun(path = path, sheet = sheet, limits = limits, shim = shim, : `` not found

The problem disappears (for me) when resaving the file with Excel 2010, because Excel creates a file /xl/worksheets/sheet6.xml. So this is presumably rare.

It is named by excel_sheets() because it is named in /xl/workbook.xml, with an empty r:id. So one way to handle it would be not to include a sheet in excel_sheets() if the r:id is empty. Or maybe a graceful failure would be preferable -- I don't really have an opinion.

<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:x15="http://schemas.microsoft.com/office/spreadsheetml/2010/11/main" mc:Ignorable="x15">
<fileVersion appName="xl" lastEdited="6" lowestEdited="6" rupBuild="14420"/>
<workbookPr codeName="ThisWorkbook" defaultThemeVersion="153222"/>
<mc:AlternateContent xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006">
<mc:Choice Requires="x15">
<x15ac:absPath xmlns:x15ac="http://schemas.microsoft.com/office/spreadsheetml/2010/11/ac" url="C:\Users\Felienne\Enron\EnronSpreadsheets\"/>
</mc:Choice>
</mc:AlternateContent>
<bookViews>
<workbookView xWindow="240" yWindow="90" windowWidth="9135" windowHeight="4965" tabRatio="721" activeTab="2"/>
</bookViews>
<sheets>
<sheet name="EGM 60" sheetId="20" r:id="rId1"/>
<sheet name="DELV SUMMARY" sheetId="21" r:id="rId2"/>
<sheet name="EGM 201" sheetId="22" r:id="rId3"/>
<sheet name="EGM 202" sheetId="23" r:id="rId4"/>
<sheet name="RHODIA" sheetId="24" r:id="rId5"/>
<sheet name="Module1" sheetId="19" state="veryHidden" r:id=""/>
</sheets>
<definedNames>
<definedName name="_xlnm.Print_Area" localSheetId="2">'EGM 201'!$A$1:$J$75</definedName>
<definedName name="_xlnm.Print_Area" localSheetId="3">'EGM 202'!$A$1:$D$53</definedName>
<definedName name="_xlnm.Print_Area" localSheetId="0">'EGM 60'!$A$1:$D$34</definedName>
<definedName name="_xlnm.Print_Area" localSheetId="4">RHODIA!$A$1:$D$26</definedName>
</definedNames>
<calcPr calcId="152511" iterate="1" iterateCount="1"/>
</workbook>

I don't think the state="veryHidden" property is anything to do with it, at least not with modern versions of Excel. It doesn't prevent there being an /xl/worksheets/sheet6.xml file. The property controls the visibility of the sheet in the Excel UI, and veryHidden means it is invisible in both the normal view and the VBA window, and can only be accessed via VBA code (or by unzipping the file).

@jennybc
Copy link
Member

@jennybc jennybc commented Nov 27, 2017

How do you think such an xlsx file comes to be 🤔? What I really mean is, do you think this happens in the wild or is something deeply peculiar to the history of that file? If it happens in the wild, I'm surprised we've never had a report before but who knows.

@nacnudus
Copy link
Contributor Author

@nacnudus nacnudus commented Nov 27, 2017

I was wondering the same thing, because Enron predates the xlsx format. The file came from https://gitlab.com/rsheets/enron_corpus -- do you remember whether they were converted from xls to xls first? Anyway, I'd say it's deeply peculiar and I'm not bothered by it, just thought it worth noting.

@jennybc
Copy link
Member

@jennybc jennybc commented Nov 27, 2017

I'm 99% sure @richfitz did not do the xls to xlsx conversion, i.e. that was done by Felienne Hermanns group or earlier, when the corpus was assembled.

@jennybc
Copy link
Member

@jennybc jennybc commented Apr 15, 2018

Does this one sheet remain our sole example of this? If so, I am inclined to not act on it.

@jennybc
Copy link
Member

@jennybc jennybc commented Apr 25, 2018

Something I did seems to have fixed this. Probably eeeebf8.

🎉

@jennybc jennybc closed this Apr 25, 2018
@jennybc
Copy link
Member

@jennybc jennybc commented Apr 25, 2018

Oh nevermind. I still can't read that exact sheet, although I can read others. I think, until I see more examples of this, I am not going to worry about it.

@jennybc
Copy link
Member

@jennybc jennybc commented Apr 25, 2018

I did some excelgesis on it. Here's the sheets node in workbook.xml:

<sheets>
<sheet name="EGM 60" sheetId="20" r:id="rId1"/>
<sheet name="DELV SUMMARY" sheetId="21" r:id="rId2"/>
<sheet name="EGM 201" sheetId="22" r:id="rId3"/>
<sheet name="EGM 202" sheetId="23" r:id="rId4"/>
<sheet name="RHODIA" sheetId="24" r:id="rId5"/>
<sheet name="Module1" sheetId="19" state="veryHidden" r:id=""/>
</sheets>

So this is a matter of: how to handle "veryHidden" sheets?

Update: I now see you @nacnudus said most of this above yourself.

@jennybc jennybc reopened this Apr 25, 2018
@jennybc
Copy link
Member

@jennybc jennybc commented Apr 25, 2018

I think having a stance on hidden sheets is important. But I doubt this spreadsheet is a good example to base such work on. I suspect that this hit some weird edge case in whatever tool was used to create the Enron corpus, which we suspect involved some xls to xlsx conversion.

If I resave this file, the "Module1" sheet remains listed in workbook.xml, remains veryHidden, gains a proper r:id attribute, and gains its own XML file below xl/worksheets/ (but with empty sheetData node). As @nacnudus saw as well.

Back to this position: unless we have ongoing reports of this, I will let this be. I could change the sheet parsing to only acknowledge a sheet if the r:id attribute has more than zero characters but I think it's not worth complicating the code for this. It seems reasonable to use tryCatch() when doing something ambitious, like reading every putative worksheet in the Enron corpus.

@jennybc jennybc closed this Apr 25, 2018
@lock lock bot locked and limited conversation to collaborators Oct 10, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants