fs: do bulk file reads to optimize cache extraction #3539
Conversation
It looks like the test suite is just generally not passing right now? If anyone points me to the file(s) I should look at, or flags any new breakage, I'll look into fixing it ❤️
events.onProgress(data.dest);
cleanup();
},
err => {
This version seems to swallow the error while the original version would re-throw it after the cleanup(). Am I mis-reading this?
No you read that right. That's my mistake. Fixed 😁
Also dropped the forwarding of the resolved value in the resolve handler. Maybe it's not actually used anyway?
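For reference, here is a minimal sketch of that shape (illustrative only; copyOne, data, events, and cleanup are assumed names based on the snippet above, not the exact diff): run cleanup() on both paths, re-throw on failure so the rejection still propagates, and don't forward the resolved value.

```js
// Illustrative sketch, not the Yarn source.
const fs = require('fs');
const {promisify} = require('util');
const readFile = promisify(fs.readFile);
const writeFile = promisify(fs.writeFile);

function copyOne(data, events, cleanup) {
  return readFile(data.src)
    .then(buf => writeFile(data.dest, buf))
    .then(
      () => {
        events.onProgress(data.dest);
        cleanup();
        // No return value: the resolved value is intentionally not forwarded.
      },
      err => {
        cleanup();
        // Re-throw so callers still see the failure instead of a
        // silently resolved promise.
        throw err;
      },
    );
}
```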
I can only assume my version is slower (another reason why I wasn't sure whether to PR it). It probably comes down to how much complexity you want to add (to maintenance, deployment, etc.) vs raw perf. It might also be worth benchmarking it on the latest node8, since this patch will go faster in step with libuv perf boosts.
Yeah, I'm just curious about how large the difference is. This patch is still a good idea, even if it's not as fast as using native code 😃 We'll probably always have some "pure JS" fallback in case the native code doesn't work for whatever reason.
I ran it through my test project and got: yarn master: 25s. I think that this seems well worth it, considering it requires no native code. Keeping 16x full files rather than the streams should be okay. Not quite sure what kind of file sizes there are in the larger node modules; I have a list of some pretty big ones somewhere that I've used as test cases in a few projects, but can't check them right now. :)
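To make the memory trade-off concrete, here is a minimal sketch of a concurrency-limited copy pool (the limit of 16 mirrors the "16x full files" figure above; copyAll and copyOne are illustrative names, not Yarn's actual queue): at most 16 full file buffers are held in memory at any one time.

```js
// Illustrative sketch, not Yarn's queue: each worker copies one file at a
// time, so at most `limit` full file buffers are in memory simultaneously.
async function copyAll(jobs, copyOne, limit = 16) {
  const queue = jobs.slice();
  const workers = Array.from({length: limit}, async () => {
    while (queue.length > 0) {
      const job = queue.shift();
      await copyOne(job);
    }
  });
  await Promise.all(workers);
}
```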
src/util/fs.js (Outdated)
resolve();
}
reporter.verbose(reporter.lang('verboseFileCopy', data.src, data.dest));
return (currentlyWriting[data.dest] = readFile(data.src)
You need to use readFileBuffer, otherwise the output will be a utf8 string instead of a raw buffer, which will corrupt the data if it happens to contain invalid utf8 sequences (hence the corrupted tarball warnings in the tests). On the plus side, using buffers should help make your PR even faster! 😃
Thanks a lot! Really nice optimization you found there 👍
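The underlying fs behaviour behind that comment, as a rough sketch (plain node fs rather than Yarn's readFileBuffer helper; package.tgz is a hypothetical path): passing an encoding makes readFile decode the bytes, which is lossy for binary data, while omitting it returns a raw Buffer.

```js
const fs = require('fs');

// Hypothetical binary input, e.g. a gzipped tarball from the cache.
const src = 'package.tgz';

// Decoding as utf8 replaces invalid byte sequences, so writing the string
// back out corrupts binary data such as gzip streams.
const asString = fs.readFileSync(src, 'utf8');
fs.writeFileSync('copy-corrupt.tgz', asString);

// With no encoding, readFile returns the raw bytes untouched.
const asBuffer = fs.readFileSync(src);
fs.writeFileSync('copy-ok.tgz', asBuffer); // byte-for-byte copy
```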
This is awesome. Thank you so much @zkat.
Yeah, thanks for this! One thing I just thought of (that @cpojer kinda reminded me of) is that using native file copying will provide a pretty large speed boost on any systems that use a copy-on-write filesystem such as zfs or btrfs (or Apple's new APFS, apparently). This is something we can't do with a pure JS implementation, as the file system has no idea that we're actually copying a file - it just sees a read followed by a write. We'd need native code like in #3290 to take advantage of CoW. On the other hand, maybe simply generating a shell script with all the required…
See nodejs/node#12902 and libuv/libuv#925 for the node-side issues on CoW support in node.
Help on making libuv/libuv#925 happen would be deeply appreciated :)
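For context on the copy-on-write idea above, a rough illustration (not part of this PR, and not Yarn code): on filesystems with clone support the copy can be delegated to the platform's cp, which can reflink/clone blocks instead of duplicating them.

```js
const {execFile} = require('child_process');

// Illustrative only: let the platform's cp do the copy so the filesystem
// can clone blocks where it supports CoW.
// GNU coreutils uses --reflink=auto (btrfs/XFS); macOS cp uses -c (APFS).
function cowCopy(src, dest, callback) {
  const args = process.platform === 'darwin'
    ? ['-c', src, dest]
    : ['--reflink=auto', src, dest];
  execFile('cp', args, callback);
}
```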
Summary
This patch boosts cache extraction by ~2x+ by letting node do more parallelization work. It makes nearly all of the file copy work happen in the C++ code with minimal boundary-crossing (at least compared to node streams).

Streams in node.js are ~3x slower, especially for small files, than just doing fs.writeFile/readFile, because of this boundary. This is something Yarn might want to take into account in other places.

The reason this is OK is that pretty much any file this would handle fits neatly into memory (any npm package MUST fit into memory by definition, because of the way npm@<5 does extracts). If you really want to make doubleplus sure to minimize memory usage, you could do an fs.stat to find the file size and then use heuristics to only use streams for files bigger than <X> MB.
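A minimal sketch of that size heuristic (the 10 MB threshold and the copyWithHeuristic name are assumptions for illustration; Yarn's actual copy code differs): stat first, do a bulk read/write for small files, and fall back to streams above the cutoff.

```js
const fs = require('fs');

// Hypothetical threshold -- tune against real package file-size data.
const STREAM_THRESHOLD = 10 * 1024 * 1024; // 10 MB

function copyWithHeuristic(src, dest, callback) {
  fs.stat(src, (statErr, stat) => {
    if (statErr) return callback(statErr);
    if (stat.size <= STREAM_THRESHOLD) {
      // Small file: one bulk read + one bulk write, keeping nearly all of
      // the work in C++ with minimal JS/C++ boundary crossing.
      fs.readFile(src, (readErr, buf) => {
        if (readErr) return callback(readErr);
        fs.writeFile(dest, buf, callback);
      });
    } else {
      // Large file: stream it to keep memory usage bounded.
      const reader = fs.createReadStream(src);
      const writer = fs.createWriteStream(dest);
      reader.on('error', callback);
      writer.on('error', callback);
      writer.on('finish', () => callback(null));
      reader.pipe(writer);
    }
  });
}
```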
Test plan
I guess I should probably spend more time on making sure this passes the test suite, but it's a pretty benign patch, imo 👍