Dispose of DOM-based XML parser [reading] #62

Mercury13 · 2016-09-07T09:14:01Z

Some large XLSX files are 20M in size, with ≈50M of data and 200M of XML, and a DOM of that size does not fit into 32-bit memory at all. The only reasonable way to parse them is with a stream-based parser, like Java StAX.

tfussell · 2016-09-07T21:23:20Z

This and #63 are going to be significantly more time-consuming to implement than all of the other issues. I'm focused on implementing core functionality at the moment before thinking about performance. I do understand, however, that performance is a major reason for using C++ for handling XLSX files in the first place. I will create a demo branch using a SAX parser in the near future. Maybe another contributor will be willing to flesh it out; otherwise this may take a few months.

Mercury13 · 2016-09-08T04:25:43Z

IMHO, SAX is shit, it’s better to use something StAX-like.

tfussell · 2016-09-09T00:52:29Z

Hmm, I had never heard of StAX. Researching that led me to libstudxml which looks really nice and implements pull parsing like StAX. I'll just have to figure out if I can avoid loading the full XML from the XLSX during parsing to save memory.

tfussell · 2016-09-21T23:53:13Z

I was able to automate much of the necessary refactoring for this issue and #63, so I decided to go ahead and do it. With commit dadf852, parsing and serialization using libstudxml (a non-DOM XML library) is effectively at feature parity with the old DOM-based XML library, so I've merged it into master.

I will note that the XML is held in several places while reading and writing an XLSX file: compressed in the ZIP archive on disk, compressed in the ZIP archive in memory, uncompressed as a string, and another copy in the std::(i/o)stringstream used by the parser/serializer. There is thus still room for improving memory usage by writing a custom stream that reads/writes compressed data directly to/from the ZIP archive. That will be a separate issue.
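To make the custom-stream idea concrete, here is a minimal sketch of a read-side `std::streambuf` that hands the parser one small chunk at a time, so no full uncompressed string or `std::istringstream` copy has to exist. Everything named here (`chunk_source`, `chunked_streambuf`, `make_string_source`) is hypothetical and not part of the library. The in-memory string source only stands in for a real decompressor, which would call the ZIP library's inflate routine inside the callback.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <istream>
#include <streambuf>
#include <string>
#include <utility>
#include <vector>

// Hypothetical callback: writes up to `capacity` bytes of uncompressed data
// into `out` and returns the number of bytes written (0 means end of data).
// A real reader would wrap the ZIP library's inflate call here.
using chunk_source = std::function<std::size_t(char *out, std::size_t capacity)>;

// Read-only stream buffer that keeps at most `buffer_size` bytes in memory.
class chunked_streambuf : public std::streambuf
{
public:
    explicit chunked_streambuf(chunk_source source, std::size_t buffer_size = 4096)
        : source_(std::move(source)), buffer_(buffer_size)
    {
        assert(buffer_size > 0);
        // Empty get area, so the first read calls underflow().
        setg(buffer_.data(), buffer_.data(), buffer_.data());
    }

protected:
    int_type underflow() override
    {
        if (gptr() < egptr())
        {
            return traits_type::to_int_type(*gptr());
        }

        // Refill the buffer with the next chunk from the source.
        const std::size_t n = source_(buffer_.data(), buffer_.size());

        if (n == 0)
        {
            return traits_type::eof();
        }

        setg(buffer_.data(), buffer_.data(), buffer_.data() + n);
        return traits_type::to_int_type(*gptr());
    }

private:
    chunk_source source_;
    std::vector<char> buffer_;
};

// Stand-in for a decompressor: serves an in-memory string piece by piece.
chunk_source make_string_source(std::string data)
{
    std::size_t pos = 0;

    return [data = std::move(data), pos](char *out, std::size_t capacity) mutable {
        const std::size_t n = std::min(capacity, data.size() - pos);
        std::copy_n(data.data() + pos, n, out);
        pos += n;
        return n;
    };
}
```

Wrapping the buffer as `std::istream in(&buf);` gives an ordinary input stream, so any parser that reads from a `std::istream` should be able to consume it unchanged. The write side would need a matching `overflow()` implementation that feeds a compressor, which this sketch does not cover.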

tfussell self-assigned this Sep 7, 2016

tfussell added enhancement help wanted performance labels Sep 7, 2016

tfussell mentioned this issue Sep 21, 2016

Dispose of DOM-based XML objects [writing] #63

Closed

tfussell closed this as completed Sep 21, 2016
