Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

file descriptors are not closed #14

Closed
benoit-pierre opened this issue May 3, 2017 · 9 comments
Closed

file descriptors are not closed #14

benoit-pierre opened this issue May 3, 2017 · 9 comments

Comments

@benoit-pierre
Copy link
Contributor

For example, when running the following code on Linux:

import os

import pyexcel

pyexcel.get_book_dict(file_name='test.xlsx')

fd_dir = '/proc/%u/fd' % os.getpid()
for fd_name in os.listdir(fd_dir):
    print(fd_name, '-> ', end='')
    try:
        print(os.readlink('%s/%s' % (fd_dir, fd_name)))
    except FileNotFoundError:
        print()

The last file descriptor open point to test.xlsx.

Beside the file descriptor leak, this is really problematic in Windows, as it make it impossible to concurrently modify the spreadsheet in Office when it has been read in another (still running) application.

@chfw
Copy link
Member

chfw commented May 3, 2017

the close() function should be called for read-only and write-only mode, which were used in this library. will run your code to test a fix.

@benoit-pierre
Copy link
Contributor Author

I've been trying this, but it's not enough:

 pyexcel_xlsx/xlsx.py | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git i/pyexcel_xlsx/xlsx.py w/pyexcel_xlsx/xlsx.py
index ed01c1b..bf8a036 100644
--- i/pyexcel_xlsx/xlsx.py
+++ w/pyexcel_xlsx/xlsx.py
@@ -101,7 +101,7 @@ class XLSXBook(BookReader):
 
     def read_sheet(self, native_sheet):
         sheet = XLSXSheet(native_sheet, **self._keywords)
-        return {sheet.name: sheet.to_array()}
+        return {sheet.name: list(sheet.to_array())}
 
     def _load_the_excel_file(self, file_alike_object):
         self._native_book = openpyxl.load_workbook(
@@ -111,6 +111,9 @@ class XLSXBook(BookReader):
         self.skip_hidden_sheets = self._keywords.get(
             'skip_hidden_sheets', True)
 
+    def close(self):
+        self._native_book.close()
+        self._native_book = None
 
 class XLSXSheetWriter(SheetWriter):
     """

There's an issue in openpyxl, it's closing its ZipFile archive, but internally, ZipFile tracks reference counts, and the descriptor is not closed because there's still at least one reference.

@benoit-pierre
Copy link
Contributor Author

This additional patch in openpyxl is necessary:

 openpyxl/worksheet/read_only.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git i/openpyxl/worksheet/read_only.py w/openpyxl/worksheet/read_only.py
index fd533569..93438328 100644
--- i/openpyxl/worksheet/read_only.py
+++ w/openpyxl/worksheet/read_only.py
@@ -89,7 +89,8 @@ class ReadOnlyWorksheet(object):
     def xml_source(self):
         """Parse xml source on demand, default to Excel archive"""
         if self._xml is None:
-            return self.parent._archive.open(self.worksheet_path)
+            from six import BytesIO
+            return BytesIO(self.parent._archive.read(self.worksheet_path))
         return self._xml

Now to see if I clean it up to make a proper PR...

@chfw
Copy link
Member

chfw commented May 4, 2017

The fix seems to be complex. My thought was to allow the developer to consume the data on demand so as to avoid reading the data into memory on behalf of the developer. "streaming=True" would enable on-demand feature. For this feature, the file handle should be kept open and eventually will be closed by garbage collector.

As in your changes, you have noticed that close method alone would cause a RuntimeError: Attempt to read ZIP archive that was already closed. So to cater the need to close the file handle manually, I would have to think of some alternatives.

@benoit-pierre
Copy link
Contributor Author

I certainly don't expect pyexcel.get_book_dict to only read on demand, as opposed to something like iget_records. I don't think letting the garbage collector close file descriptors is a good idea, they should be closed right after I'm done with consuming whatever data it is I'm asking pyexcel for, either explicitly through a call to close on the context I'm using, or better yet with should be supported support so it's closed at the end of the block.

@chfw
Copy link
Member

chfw commented May 4, 2017

yes, you had your point there. the fix would be make those file handle closure explicit. with get_* functions, auto closure is expected whereas iget_*, the closure will be left to the developer after the generator has been consumed.

@benoit-pierre
Copy link
Contributor Author

benoit-pierre commented May 4, 2017

OK, for now I'll be using a patched version of pyexcel-xlsx (with the above patch) and of openpyxl (with the patch mentioned here).

@chfw
Copy link
Member

chfw commented May 5, 2017

I will put this fix in 0.4.0 branch, next major release. In this way, I could focus the effort on one branch instead of supporting two branches.

chfw added a commit that referenced this issue May 6, 2017
chfw added a commit to pyexcel/pyexcel-ods that referenced this issue May 31, 2017
chfw added a commit to pyexcel/pyexcel-ods3 that referenced this issue Jun 1, 2017
chfw added a commit to pyexcel/pyexcel-odsr that referenced this issue Jun 1, 2017
@chfw
Copy link
Member

chfw commented Jun 4, 2017

Inconsistent behaviour was found in this test run:

#. file handle from iget_data was closed in python 3.6, pypy only but not with other python versions
#. file handle from get_data was left open still in pypy .

openpyxl version is v2.4.8.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants