CPU usage #34

albertdb · 2016-07-27T15:59:25Z

Hi!

It seems that at last I found an unzip library that doesn't leak memory, so thank you.

Having said that, I'm observing that yauzl takes about twice as much CPU time as the unzip command-line utility to decompress the same ZIP. The code I used is the example you provide in the README.

The ZIP file is public: https://downloads.mariadb.org/f/mariadb-10.1.16/winx64-packages/mariadb-10.1.16-winx64.zip

While unzip takes 16 seconds of CPU time, yauzl takes 33.

Is there any room for improvement?

Thank you,

Albert


thejoshwolfe · 2016-07-30T03:00:06Z

My experimentation shows a much smaller difference between unzip and yauzl. Something more like ~10s vs ~12s, not the factor of 2 that you're reporting. However, performance is a very complicated topic, and it gets more complicated when there are "managed languages" like JavaScript involved, so below is my attempt to approach this topic as a well-educated and open-minded programmer.

It's expected that the more metadata there is in a zipfile, the more the "control code" written in JavaScript will slow down the processing. The more data there is in large compressed files, the more processing will be happening in C (specifically the zlib module) for unzip and yauzl alike. So for zip archives with a high average file size, the two should be very close, and for archives with a very low average file size, yauzl is expected to be slower. There's really nothing I can do about the small-file-size end of the spectrum, since the point of this module is to be written in JavaScript, which is an environment with fundamental design choices that make it impossible to be as performant as C.

So to test how much room there is for improvement, we can examine the two ends of the spectrum. Let's begin with a test case where we start with 1GB of 0's and compress it as a single file into a .zip archive. Extracting that should have a trivial amount of metadata, and most of the time should be spent in the zlib module inflating all those 0's.

unzip: 3.958s, yauzl: 3.191s. Here yauzl is actually performing better than unzip, probably due to skipping the CRC check.
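To get a feel for why this case is dominated by zlib rather than by JavaScript, here is a minimal sketch using only node's built-in `zlib` module. It is not the benchmark I ran: the 16 MiB size is an arbitrary stand-in for the 1GB file, and the synchronous calls are used only to keep the example short.

```javascript
const zlib = require('zlib');

// Assumption: 16 MiB of zeros stands in for the 1GB test file.
const SIZE = 16 * 1024 * 1024;
const zeros = Buffer.alloc(SIZE); // Buffer.alloc zero-fills

// ZIP entries store raw DEFLATE streams (no zlib header), hence the Raw variants.
const compressed = zlib.deflateRawSync(zeros);

const start = process.hrtime.bigint();
const inflated = zlib.inflateRawSync(compressed);
const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;

console.log(`compressed ${SIZE} bytes down to ${compressed.length} bytes`);
console.log(`inflating took ${elapsedMs.toFixed(1)} ms`);
```

Nearly all of the elapsed time here is spent inside zlib's native code, which is the same work unzip has to do.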

Next, we can use random bits instead of all 0's. We'll skip the zlib compression, since we can guess that it will consistently fall back to the literal encoding of data, so that won't be very interesting. Let's just use the STORE mode in the .zip file to skip zlib entirely. Since this zip file also has minimal metadata, we expect that yauzl's logic will play a trivial role. Most of the work should be done in node's fs module for piping a file read stream to a file write stream.

unzip: 4.598s, yauzl: 1.555s. Here yauzl is actually performing much better than unzip, possibly due to both sides skipping zlib, and yauzl skipping the CRC check. To be honest, I'm kind of upset that I don't know how to turn off the CRC check in unzip so that I can get a more accurate comparison.
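For reference, this is roughly the work a CRC check involves: every extracted byte goes through one table lookup. The sketch below is a plain table-driven CRC-32 in JavaScript using the polynomial that the ZIP format uses. The function name `crc32` is my own for illustration, and this is not code from yauzl or unzip.

```javascript
// Precompute the 256-entry lookup table for the reflected polynomial 0xEDB88320.
const CRC_TABLE = new Uint32Array(256);
for (let n = 0; n < 256; n++) {
  let c = n;
  for (let k = 0; k < 8; k++) {
    c = (c & 1) ? (0xEDB88320 ^ (c >>> 1)) : (c >>> 1);
  }
  CRC_TABLE[n] = c >>> 0;
}

// Hypothetical helper: returns the CRC-32 of `buf` as an unsigned 32-bit integer.
// Pass the previous result as `previous` to continue over the next chunk of a stream.
function crc32(buf, previous = 0) {
  let c = (previous ^ 0xFFFFFFFF) >>> 0;
  for (let i = 0; i < buf.length; i++) {
    c = CRC_TABLE[(c ^ buf[i]) & 0xFF] ^ (c >>> 8);
  }
  return (c ^ 0xFFFFFFFF) >>> 0;
}

console.log(crc32(Buffer.from('123456789')).toString(16));
```

A caller that wanted to verify an entry would feed each chunk through `crc32` as it is written and compare the final value with the CRC-32 stored in the archive's metadata.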

Now we get to the other end of the spectrum, which is much more grim. Let's create a zipfile comprised of 100000 empty files. Extracting these files is mostly an exercise in metadata parsing (and file system metadata writing), which means the difference between JavaScript and C will be quite noticeable.

unzip: 1.146s, yauzl: 12.556s. As expected yauzl is much worse than unzip.
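To illustrate what "metadata parsing" means per entry, here is a simplified sketch of reading one central directory file header, following the field offsets in the ZIP format specification. It is not yauzl's actual code: the function names are hypothetical, it assumes UTF-8 file names, and it ignores ZIP64 and the extra field contents.

```javascript
const CDFH_SIGNATURE = 0x02014b50; // "PK\x01\x02" read as a little-endian uint32
const CDFH_FIXED_SIZE = 46;        // size of the fixed part of each header

// Hypothetical helper: parse one central directory header starting at `offset`.
function parseCentralDirectoryHeader(buf, offset = 0) {
  if (buf.readUInt32LE(offset) !== CDFH_SIGNATURE) {
    throw new Error('invalid central directory file header signature');
  }
  const fileNameLength = buf.readUInt16LE(offset + 28);
  const extraFieldLength = buf.readUInt16LE(offset + 30);
  const fileCommentLength = buf.readUInt16LE(offset + 32);
  const nameStart = offset + CDFH_FIXED_SIZE;
  return {
    compressionMethod: buf.readUInt16LE(offset + 10), // 0 = STORE, 8 = DEFLATE
    crc32: buf.readUInt32LE(offset + 16),
    compressedSize: buf.readUInt32LE(offset + 20),
    uncompressedSize: buf.readUInt32LE(offset + 24),
    localHeaderOffset: buf.readUInt32LE(offset + 42),
    // Assumption: the name is UTF-8 (the spec also allows CP437).
    fileName: buf.toString('utf8', nameStart, nameStart + fileNameLength),
    // Where the next entry's header begins.
    nextOffset: nameStart + fileNameLength + extraFieldLength + fileCommentLength,
  };
}

// Build a synthetic header for an empty, stored file so the sketch can run alone.
function makeEmptyEntryHeader(name) {
  const nameBytes = Buffer.from(name, 'utf8');
  const buf = Buffer.alloc(CDFH_FIXED_SIZE + nameBytes.length);
  buf.writeUInt32LE(CDFH_SIGNATURE, 0);
  buf.writeUInt16LE(nameBytes.length, 28);
  nameBytes.copy(buf, CDFH_FIXED_SIZE);
  return buf;
}

console.log(parseCentralDirectoryHeader(makeEmptyEntryHeader('empty.txt')));
```

With 100000 entries, this kind of parsing, plus allocating an object and a string per entry and making a file system call for each one, runs 100000 times in JavaScript.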

Between these two ends of the spectrum is some (mostly) monotonic function of the ratio of C performance to JavaScript performance. However, system capabilities and limitations (such as SSD vs HDD, RAM availability, and even node version and configuration) can affect what this curve looks like. The zipfile usecase you provided gives us a point in the middle of the spectrum where we can evaluate the ratio of performances. It appears that for you, the ratio is about 2:1, and for me the ratio is closer to 6:5.
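If you want to check the ratio on your own machine from inside node, a minimal sketch along these lines may help. It uses `process.cpuUsage()`, which reports user and system CPU time for the whole process in microseconds. The helper names `cpuSecondsSince` and `measureCpu` are made up for this example, and the busy loop is only a placeholder for the real extraction.

```javascript
// Hypothetical helper: CPU seconds (user + system) consumed since `start`,
// where `start` is a value previously returned by process.cpuUsage().
function cpuSecondsSince(start) {
  const delta = process.cpuUsage(start);
  return (delta.user + delta.system) / 1e6;
}

// Hypothetical helper: run an async task and report CPU time and wall-clock time.
async function measureCpu(task) {
  const cpuStart = process.cpuUsage();
  const wallStart = process.hrtime.bigint();
  await task();
  return {
    cpuSeconds: cpuSecondsSince(cpuStart),
    wallSeconds: Number(process.hrtime.bigint() - wallStart) / 1e9,
  };
}

// Placeholder workload; replace it with the extraction you want to measure.
measureCpu(async () => {
  let x = 0;
  for (let i = 0; i < 1e7; i++) x += Math.sqrt(i);
  return x;
}).then((result) => console.log(result));
```

Comparing the CPU figure with the `user` and `sys` numbers that `time unzip ...` prints should give a like-for-like ratio.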

And now that I've said all that and tried to sound educated, I must admit I have no idea why there's a bigger performance difference on your machine than on mine. All I know for sure is that when I designed yauzl's code, I tried to make sure not to waste any CPU cycles doing something that could have easily been avoided (extensive API argument validation, for example). I tried to design yauzl with (reasonably) optimal performance, and I knew I wouldn't be able to compete with C for metadata-heavy test cases. But since yauzl does so well for the cases where the metadata is minimal, I think it's a good sign that I didn't do anything stupid, like buffer entire files in RAM or copy everything to a temporary file for no reason, or anything like that. I'm sure there are countless tiny optimizations that could be made by taking advantage of the subtleties of JavaScript and v8, but I don't feel that pouring effort in that direction will be worthwhile.

So I guess my summary of all this and my direct answer to your question is: No, I don't think there's any room for improvement.

However, I've got an open mind, and if someone has a pull request that demonstrates a performance boost, I will be thrilled!

albertdb · 2016-08-02T07:54:49Z

Thank you very much for the extensive explanation, I wasn't expecting it. So the more control code runs in JavaScript, the greater the slowdown; it comes down to how many files are to be decompressed, because a piece of JavaScript code is executed for every file contained in the zip.

So, if there were a plain extraction method (inputZip, outputDir) that does all the work in C, I guess it would be quite a bit faster. Most of the time, extraction doesn't need to be selective, nor do the contents of the files matter on the programming side; you just want to unzip a zip.

Would that additional approach be possible?

Thanks again.

thejoshwolfe · 2016-08-02T08:30:27Z

I suppose someone could write an unzip implementation in C and provide JavaScript bindings for it. That would speed things up pretty much optimally. That's not the goal of yauzl though, so I don't think I will do that. You may be able to find bindings like that already out there somewhere. And of course you could always shell out to the unzip command line program, if you really needed to.
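A minimal sketch of the shell-out approach, assuming the Info-ZIP `unzip` program is installed and on the PATH. The function names are hypothetical; `-q` (quiet), `-o` (overwrite without prompting), and `-d` (target directory) are standard unzip options.

```javascript
const { execFile } = require('child_process');

// Hypothetical helper: argument list for `unzip -q -o <zip> -d <dir>`.
function unzipArgs(zipPath, outputDir) {
  return ['-q', '-o', zipPath, '-d', outputDir];
}

// Hypothetical helper: extract the whole archive by delegating to unzip.
// execFile does not go through a shell, so the paths need no quoting.
function extractWithUnzip(zipPath, outputDir) {
  return new Promise((resolve, reject) => {
    execFile('unzip', unzipArgs(zipPath, outputDir), (err) => {
      if (err) reject(err);
      else resolve();
    });
  });
}
```

It would be called as `extractWithUnzip('archive.zip', 'out').then(...)`, where both paths are placeholders. The tradeoff is that you depend on an external program being present and lose per-entry control.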

For yauzl, I want to keep everything in JavaScript.

thejoshwolfe added the question label Jul 30, 2016

thejoshwolfe closed this as completed Jul 30, 2016

thejoshwolfe mentioned this issue Dec 20, 2016

CRC-32 checks #49

Open
