Dispose of DOM-based XML parser [reading] #62

Closed
Mercury13 opened this issue Sep 7, 2016 · 4 comments

Comments

@Mercury13

Some large XLSX files are about 20 MB in size, with ≈50 MB of data and 200 MB of XML, and the DOM does not fit into 32-bit memory at all. The only reasonable way to parse them is with a stream-based parser, like Java's StAX.

@tfussell
Owner

tfussell commented Sep 7, 2016

This and #63 are going to be significantly more time-consuming to implement than all of the other issues. I'm focused on implementing core functionality at the moment before thinking about performance. I do understand, however, that performance is a major reason for using C++ for handling XLSX files in the first place. I will create a demo branch using a SAX parser in the near future. Maybe another contributor will be willing to flesh it out; otherwise, this may take a few months.

@Mercury13
Author

IMHO, SAX is shit, it’s better to use something StAX-like.

@tfussell
Owner

tfussell commented Sep 9, 2016

Hmm, I had never heard of StAX. Researching that led me to libstudxml which looks really nice and implements pull parsing like StAX. I'll just have to figure out if I can avoid loading the full XML from the XLSX during parsing to save memory.
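
For readers unfamiliar with the term, here is a minimal sketch of what "pull parsing" means. It is a toy reader written only for illustration and is not libstudxml's API: the names `toy_pull_reader` and `count_elements` are hypothetical, and attributes, self-closing tags, comments, CDATA and entities are ignored. The point is that the caller asks for one event at a time, so only the current token is held in memory instead of a whole tree.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Toy pull-style reader (hypothetical, for illustration only).
// Handles plain <tag>, </tag> and text; nothing else.
class toy_pull_reader {
public:
    enum class event { start_element, end_element, characters, eof };

    explicit toy_pull_reader(std::istream &in) : in_(in) {}

    // Advance to the next event. Afterwards name() is valid for element
    // events and value() is valid for character events.
    event next() {
        const int end_of_input = std::char_traits<char>::eof();
        name_.clear();
        value_.clear();
        int c = in_.peek();
        if (c == end_of_input) return event::eof;
        if (c == '<') {
            in_.get();                                  // consume '<'
            const bool closing = (in_.peek() == '/');
            if (closing) in_.get();                     // consume '/'
            std::getline(in_, name_, '>');              // read name, drop '>'
            return closing ? event::end_element : event::start_element;
        }
        // Text runs until the next tag or the end of the input.
        while (c != end_of_input && c != '<') {
            value_.push_back(static_cast<char>(in_.get()));
            c = in_.peek();
        }
        return event::characters;
    }

    const std::string &name() const { return name_; }
    const std::string &value() const { return value_; }

private:
    std::istream &in_;
    std::string name_;
    std::string value_;
};

// Count elements with the given name without ever building a tree.
std::size_t count_elements(std::istream &in, const std::string &wanted) {
    assert(!wanted.empty());
    toy_pull_reader reader(in);
    std::size_t n = 0;
    for (auto e = reader.next(); e != toy_pull_reader::event::eof; e = reader.next()) {
        if (e == toy_pull_reader::event::start_element && reader.name() == wanted) ++n;
    }
    return n;
}
```

Unlike a SAX-style push parser, there are no callbacks here: the consuming code drives the loop and can stop early or hand the reader to a sub-routine for a nested element.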

@tfussell
Owner

I was able to automate much of the necessary refactoring for this issue and #63, so I decided to go ahead and do it. With commit dadf852, parsing and serialization using libstudxml (a non-DOM XML library) are effectively at feature parity with the old DOM-based XML library, so I've merged it into master.

I will note that the XML is held in several places while reading and writing an XLSX file: compressed in the ZIP archive on disk, compressed in the ZIP archive in memory, uncompressed as a string, and another copy in the std::(i/o)stringstream used by the parser/serializer. There is thus still room for improving memory usage by writing a custom stream that reads/writes compressed data directly to/from the ZIP archive. That will be a separate issue.
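
A rough sketch of what such a custom stream could look like on the reading side, assuming a chunk-at-a-time decompressor is available. This is not the implementation in this repository: `chunk_source`, `chunked_streambuf` and `make_string_source` are hypothetical names, and the stand-in source below just slices a string where a real one would inflate bytes from the ZIP entry. The idea is that the parser reads from a `std::istream` whose buffer holds only one decompressed chunk at a time.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <istream>
#include <streambuf>
#include <string>
#include <utility>
#include <vector>

// Hypothetical source signature: write up to `cap` decompressed bytes into
// `dst` and return how many were written; 0 means end of data.
using chunk_source = std::function<std::size_t(char *dst, std::size_t cap)>;

// Input-only streambuf that refills a small fixed buffer on demand.
class chunked_streambuf : public std::streambuf {
public:
    chunked_streambuf(chunk_source source, std::size_t chunk_size)
        : source_(std::move(source)), buffer_(chunk_size) {
        assert(chunk_size > 0);
        // Empty get area, so the first read calls underflow().
        setg(buffer_.data(), buffer_.data(), buffer_.data());
    }

protected:
    int_type underflow() override {
        if (gptr() < egptr()) return traits_type::to_int_type(*gptr());
        const std::size_t n = source_(buffer_.data(), buffer_.size());
        if (n == 0) return traits_type::eof();
        setg(buffer_.data(), buffer_.data(), buffer_.data() + n);
        return traits_type::to_int_type(*gptr());
    }

private:
    chunk_source source_;
    std::vector<char> buffer_;
};

// Stand-in for a real inflater: hands out a string a few bytes at a time.
chunk_source make_string_source(std::string data) {
    std::size_t pos = 0;
    return [data, pos](char *dst, std::size_t cap) mutable -> std::size_t {
        const std::size_t n = std::min(cap, data.size() - pos);
        std::copy_n(data.data() + pos, n, dst);
        pos += n;
        return n;
    };
}
```

With this shape, `chunked_streambuf buf(source, 4096); std::istream in(&buf);` gives a stream that could be handed to any parser that accepts a `std::istream`, which would remove the uncompressed string and the stringstream copy from the list above.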
