New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simple and memory-efficient way to read the whole Excel book as "records" #144

Closed
patarapolw opened this Issue Jul 30, 2018 · 11 comments

Comments

Projects
None yet
2 participants
@patarapolw

patarapolw commented Jul 30, 2018

If it is possible, it want to use pyexcel.iget_records().

What I have got now is:-

data = OrderedDict()
raw_data = pyexcel.get_book(file_name='filename.xlsx')
for sheet_name in raw_data.sheet_names():
    raw_data[sheet_name].name_columns_by_row(0)
    data[sheet_name] = raw_data[sheet_name].to_records()
pyexcel.free_resources()

But .to_records() doesn't seem to be a memory-efficient way as it generates a list, not a generator. Also, why do I have to call raw_data[sheet_name].name_columns_by_row(0) explicitly?

@chfw

This comment has been minimized.

Member

chfw commented Jul 30, 2018

please have a look at these docs: http://docs.pyexcel.org/en/v0.6.0/two-liners.html

@patarapolw

This comment has been minimized.

patarapolw commented Jul 30, 2018

Of course I can use

record_iter = pyexcel.iget_records(file_name='filename.xlsx', sheet_name=???)
...
pyexcel.free_resources()

But how would I know the sheet names?

Two liners for get data from big excel files: Get a list of dictionaries / array

is oblivious to the fact that excel have multiple sheets.

@chfw

This comment has been minimized.

Member

chfw commented Jul 30, 2018

Interesting use case. Yes, it is impossible to get sheet name at this point.

please try use sheet_index

@patarapolw

This comment has been minimized.

patarapolw commented Jul 30, 2018

please try use sheet_index

It is hard to know sheet_index as well, unless I pre-read the Excel and store sheet_names in a list, and then use list.index(sheet_name).

@chfw

This comment has been minimized.

Member

chfw commented Aug 1, 2018

So, it is possible to get sheet names before consuming a lot of memory. but I do not recommend to use for now because internal api are subjected to change:

>>> from pyexcel.internal.core import get_book_stream
>>> ibook = get_book_stream(file_name="test-book.xlsx", on_demand=True)
>>> ibook.sheets
OrderedDict([(u'Sheet A', <pyexcel.internal.generators.SheetStream object at 0x0000000002E211D0>), (u'Sheet B', <pyexcel.internal.generators.SheetStream object at 0x00000000024EA320>), (u'Sheet C', <pyexcel.internal.generators.SheetStream object at 0x0000000002E44EB8>)])
>>> ibook.sheets.keys()
[u'Sheet A', u'Sheet B', u'Sheet C']

Please wait for official support for the access to internal book streams.

With that said, if you really want to save memory consumption or want a more efficient xlsx reader, you can consider pyexcel-xlsxr, which is under-used, zero stars at the moment. With above code, pyexcel-xlsxr returns faster than pyexcel-xlsx.

@chfw chfw added the enhancement label Aug 1, 2018

@chfw

This comment has been minimized.

Member

chfw commented Aug 1, 2018

Alternatively, here is the officially supported APIs, doing the same thing but with pyexcel-io, a bit lower level. I do not want you to use two set of APIs. However, it's better than you choose the internal/private apis of pyexcel.

>>> from pyexcel_io import iget_data
>>> data_dict = iget_data("test-book.xlsx")
>>> data_dict
(OrderedDict([(u'Sheet A', <generator object to_array at 0x000000000309EFC0>), (u'Sheet B', <generator object to_array at 0x00000000030D91F8>), (u'Sheet C', <generator object to_array at 0x00000000030D9090>)]), <pyexcel_xls.xlsr.XLSBook object at 0x000000000309BDD8>)

again, please note that pyexcel-xlsxr give better reading performance than pyexcel-xlsx.

Finally, the choice is yours.

@patarapolw

This comment has been minimized.

patarapolw commented Aug 1, 2018

pyexcel_io, but I also need to install pyexcel-xlsxr? No iget_records()?

Also, writing to Excel file is troublesome for big files. Maybe I want something like Google Sheets performance. Maybe, only a real database object (like SQLite) can do this?

Lastly, I am still looking for a good text wrapper and a custom renderer (like Handsontable), but again, maybe I shouldn't think about this one too much.

@chfw

This comment has been minimized.

Member

chfw commented Aug 1, 2018

first of all, here's the architecture of pyexcel, pyexcel_io is the base module for read and write. pyexcel provides high level manipulations as well as read and write.

In pyexcel, it is possible that we have multiple readers and writers for the same file format. If you do not want to use pyexcel-xlsxr, that's fine. you are safe to use pyexcel-xlsx. the code works no matter which plugin is installed but just make sure you have at least one reader for xlsx format.

@chfw

This comment has been minimized.

Member

chfw commented Aug 1, 2018

Regarding the data serialization, SQLite is a bad choice. HDF format should be considered. But for now, pyexcel does not support such an output.

To write into xlsx, pyexcel-xlsx and pyexcel-xlsxw can help you.

@patarapolw

This comment has been minimized.

patarapolw commented Aug 1, 2018

Maybe I should look at http://www.pytables.org/index.html or http://docs.h5py.org/en/stable/ but that's a lot of things to learn. Maybe I should stick with SQLAlchemy + SQLite for now.

To write into xlsx, pyexcel-xlsx and pyexcel-xlsxw can help you.

Yeah, rather than to write, I just want to .insert(i, data) and .remove(data).

chfw added a commit that referenced this issue Aug 25, 2018

chfw added a commit that referenced this issue Aug 25, 2018

@chfw

This comment has been minimized.

Member

chfw commented Aug 30, 2018

>>> book_stream = pe.iget_book(file_name=test_file)
>>> book_stream.sheet_names()
["Sheet1", "Sheet2", "Sheet3"])
>>> pe.iget_array(sheet_stream=book_stream["Sheet1"])
Generator of an array

@chfw chfw closed this Aug 30, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment