Dispose of DOM-based XML parser [reading] #62
This and #63 are going to be significantly more time-consuming to implement than all of the other issues. I'm focused on implementing core functionality at the moment before thinking about performance. I do understand, however, that performance is a major reason for using C++ to handle XLSX files in the first place. I will create a demo branch using a SAX parser in the near future. Maybe another contributor will be willing to flesh it out; otherwise this may take a few months.
IMHO, SAX is shit, it's better to use something StAX-like.
Hmm, I had never heard of StAX. Researching that led me to libstudxml, which looks really nice and implements pull parsing like StAX. I'll just have to figure out if I can avoid loading the full XML from the XLSX during parsing to save memory.
I was able to automate much of the necessary refactoring for this issue and #63, so I decided to go ahead and do it. With commit dadf852, parsing and serialization using libstudxml (a non-DOM XML library) is effectively at feature parity with the old DOM-based XML library, so I've merged it into master. I will note that the XML is held in several places while reading and writing an XLSX file: compressed in the ZIP archive on disk, compressed in the ZIP archive in memory, uncompressed as a string, and another copy in the std::(i/o)stringstream used by the parser/serializer. There is thus still room for improving memory usage by writing a custom stream that reads/writes compressed data directly to/from the ZIP archive. That will be a separate issue.
Some large XLSX files are 20 MB on disk, holding ≈50 MB of cell data that expands to ≈200 MB of XML, and a DOM for that does not fit into a 32-bit address space at all. The only reasonable way to parse them is a stream-based parser, like Java's StAX.