
Consider hardlinks rather than separate copy of packages per app #499

Closed
Daniel15 opened this issue Oct 5, 2016 · 73 comments

@Daniel15 (Member) commented Oct 5, 2016

This was touched on in a comment on #480, but I thought it's worth pulling into its own separate issue.

Currently, each app that uses Yarn (or npm) has its own node_modules directory with its own copies of all the modules. This results in a lot of duplicate files across the filesystem. If I have 10 sites that use the same version of Jest or React or Lodash or whatever else you want to install from npm, why do I need 10 identical copies of that package's contents on my system?

We should instead consider extracting packages into a central location (eg. ~/.yarn/cache) and hardlinking them. Note that this would be a hardlink rather than symlink, so that deleting the cache directory does not break the packages.

@cpojer (Contributor) commented Oct 5, 2016

Yarn initially used to use symlinks and we changed it because our internal tooling (watchman etc.) doesn't work well with symlinks. Are hardlinks different in this case? If yes, that might be worth doing.

I think the initial release should continue to use the copy approach; it is more consistent with the rest of the ecosystem and we should evaluate this behavior for a future major release.

@dxu (Contributor) commented Oct 5, 2016

Upon thinking about it further, another issue that might come up is that people may try to modify their local node_modules for local debugging or testing purposes, and not expect that they're actually modifying the node module linked to everywhere else. I don't know how often this happens with others, but I've definitely done it (though rarely) in the past. Apart from that, hardlinks seem to make sense. I'd guess that the tooling would be fine, since a hardlinked file should behave the same as any other file.

The primary issue this was intended to address was the cache causing issues with hardcoded paths that result from building packages (#480).

@Daniel15 (Member, Author) commented Oct 5, 2016

Are hardlinks different in this case?

Not sure, might be worth asking @wez whether Watchman can handle hardlinks.

another issue that might come up is that people may try to modify their local node_modules for local debugging purposes or testing purposes, and not expect that they're actually modifying the node module linked to everywhere else

I think this is the use case for npm link or Yarn's equivalent though, right? You're never supposed to directly modify files in node_modules. We could have an option to make a local copy if people want to do this though.

@cpojer (Contributor) commented Oct 5, 2016

I totally agree with @dxu and actually wanted to write the same thing. I do this often: I manually add some debugging code into a random node_module (that I don't have checked out locally). Once I'm done, I wipe it away and do npm install. It would be a mental change for me to remember it would also affect other projects.

@Daniel15 (Member, Author) commented Oct 5, 2016

Yeah, that's a use case I didn't really think about... Hmm...

Oh well, we can still hold on to this idea. Maybe it could be an optional configuration setting for people that don't directly edit node_modules and would like Yarn to run faster (less copying of the same data = less disk IO = stuff is faster and disk cache efficiency is improved) 😄

@sebmck (Contributor) commented Oct 5, 2016

Going to close this since we decided long ago to move away from symlinks. Copying is required for compatibility with the existing ecosystem, as even projects like ESLint rely on this directory structure to load rules etc. There are also a lot of problems with existing tooling not supporting symlinks. For example, when Yarn initially used Jest, it would fail and produce extremely long paths. Jest is much better now and that bug is likely fixed, but small issues like this exist in a lot of tools.

@sebmck closed this as completed Oct 5, 2016
@Daniel15 (Member, Author) commented Oct 5, 2016

Sebastian, this task is for hardlinks not symlinks. Hardlinks shouldn't
have any of the problems you mentioned.


@sebmck (Contributor) commented Oct 5, 2016

Hardlinks have the exact same problems and are semantically the same in this scenario. Why do you think they don't have any of the same issues?

@yunxing (Contributor) commented Oct 5, 2016

@kittens I haven't really tested hardlinks. But once you hardlink a file, in theory, from the filesystem's perspective it should be exactly the same as the original file -- you can remove the original file and the hardlinked file will still work.

This is different from symlinks, whose content is just a pointer to the original file.

@sebmck (Contributor) commented Oct 5, 2016

You can have cycles though which is extremely problematic if tools aren't designed to handle them (most JavaScript tools aren't, and how would they?). Hardlinks and symlinks on Windows both require admin privileges (NTFS junctions don't but they're more synonymous with symlinks) which is a non-starter for a lot of environments.

@yunxing (Contributor) commented Oct 5, 2016

Good point about Windows. We could maybe have platform-specific logic if we decide to go down this path.

How do you create a cycle with hardlinks? Note that there are no hardlinks for directories.

@Daniel15 (Member, Author) commented

Going to reopen this for tracking purposes. It should be doable as hardlinked files look identical to the file system. I might prototype it.

@wycats (Member) commented Oct 12, 2016

@Daniel15 one thing to keep in mind is that since hardlinks pretend to be the file system so well, deleting them usually deletes way more files than you're expecting. Since rm -rf node_modules is a common pattern, I'd want us to have some mitigation for that likelihood before unleashing this into the wild (even on an opt-in basis).

I remember unexpected deletions hitting users of n back in the day and it has left a permanent scar 😛 (not directly analogous, but it gave me serious fear about giving people rope that could cause mass deletions of shared files)

@Daniel15 (Member, Author) commented Oct 12, 2016

I remember unexpected deletions hitting users of n back in the day and it has left a permanent scar 😛

Good point, I remember Steam on Linux accidentally running rm -rf /* too: http://www.pcworld.com/article/2871653/scary-steam-for-linux-bug-erases-all-the-personal-files-on-your-pc.html

Maybe we need a safer "clean" function rather than just doing rm -rf node_modules

@also commented Oct 13, 2016

Are issues with hard links and rm -rf node_modules actually possible? While you can create symlinks to directories, you can't create hard links to them*, so you shouldn't be able to recurse into some global directory while running rm -rf.

* On macOS you can, but you shouldn't

@vjpr commented Oct 20, 2016

Symlinking to a global cache is essential. The copying approach is very slow for large projects (which I would argue are very common), extremely slow on VMs, very slow on Windows, and insanely slow on a virtualized Windows VM running on a macOS host in Parallels/VMware.

I have a relatively simple frontend/backend project and the node_modules is 270K files and about ~300MB.

With a warm global cache, the "Linking dependencies..." step takes about 5 minutes. Symlinking would take a couple of seconds.

rm -rf node_modules takes about 15 seconds.

So when I am building my Docker images, it's taking me 5 minutes every time when it could be seconds.

It seems every package manager's authors flatly ignore real-world performance.

Is there a plan to support symlinking any time soon? I feel like it would be a simple implementation: just add a --symlink flag. Where can I find the issue?

@Daniel15 (Member, Author) commented

With a warm global cache, the "Linking dependencies..." step takes about 5 minutes. Symlinking would take a couple of seconds.

I wonder how long hardlinking would take. Definitely longer than symlinking as you need to hardlink each individual file, but it should be faster than copying the files over while avoiding some of the disadvantages of symlinks. I think it's worth having both a hardlink and a symlink mode, both of them opt-in.

@tlbdk commented Oct 21, 2016

We could also use a symlink or hardlink feature when doing builds on our build server, as copying node modules is by far the slowest part of the build. For example, our build time drops from 3 minutes with npm install (1:45 with yarn) to 15 seconds if we cache and symlink the node_modules folder between builds (we hash package.json to know when to invalidate the cache). A raw copy with cp takes 45 seconds.

@AlicanC (Contributor) commented Oct 24, 2016

Yarn initially used to use symlinks and we changed it because our internal tooling (watchman etc.) doesn't work well with symlinks.

Lack of symlink support in Watchman blocks more than Yarn: facebook/react-native#637

I develop React Native, Browser and Electron apps and I only had problems with symlinks in React Native and that was because of Watchman.

The reason we can't have symlinking in Yarn shouldn't be Watchman or some other internal Facebook tooling. The rest of the ecosystem appears to support it well.

Symlinking should be opt-out.

@Daniel15 (Member, Author) commented Oct 24, 2016

Hardlinks should work fine with Watchman, and any other tool, since they look identical to "regular" files. That's one reason I suggested trying hardlinks rather than symlinks.

@dhakehurst commented

use paths relative to the current file

Therein lies the problem. A good module system is not simply about importing files, and paths.

However, as you say, "this issue is about Yarn", so problems with node/javascript are out of scope, sorry to have brought it up.

As the title of this thread is about hard links rather than copying: if hardlinks would work, then so would softlinks/symlinks. Otherwise hardlinks have the same peer-dependency problem described by @vjpr.

@jpeg729 commented May 4, 2018

There seems to be a lot of confusion about what hardlinks actually are. Most filesystems store files in two parts: the filename, which points to the storage location, and the actual data. A symlink is a special file that tells you to go look for another file. A hardlink is a second (or third, or ...) filename that points to the same data location.

Therefore hardlinked files do not suffer the same problems as symlinks, because they truly look like copies of the original files.

Also, assuming I only hardlink files and not directories, then if I do rm -rf node_modules my system will delete the filename my-hardlink, notice that the underlying data storage is still referenced by yarn-cache/original-file, and leave the original file alone.

Basically, unless you are examining inode numbers, hardlinks look exactly like files copied from the originals, but they share the same storage location as the original files. So we will need to warn people not to modify the contents of their node_modules directories.

Another potential problem is that on Linux you can't make a hardlink across filesystem boundaries. I don't know about Windows or macOS. So we would need to fall back on true copying when hardlinking doesn't work.

Until something like this is implemented, I am going with the following approach...

hardlink -t -x '.*' -i '^(.*/node_modules/|/home/user/.cache/yarn/v1/npm-)' ~/.cache/yarn ~/code

Where ~/code is the directory I store all my projects in.

@ljharb commented May 4, 2018

@jpeg729 one problem that causes, though, is that you're supposed to be able to edit any file inside node_modules and see that change when you run your program - and if you have two places in node_modules that point to the same data location, editing one will end up editing the other, which might not be desired.

@KSXGitHub commented

@ljharb

  1. You are not supposed to manually edit files in node_modules.

  2. We can have that feature configurable, so you can turn it off if you really want to edit node_modules.

@ljharb commented May 4, 2018

@KSXGitHub "you are not supposed to" where does that rule come from? It's always been both possible, and something node and npm explicitly supports.

As for being configurable, the problem is that users aren't going to know that this normal node ecosystem behavior behaves differently, and they could end up silently getting surprising behavior.

@Pauan commented May 4, 2018

As for being configurable, the problem is that users aren't going to know that this normal node ecosystem behavior behaves differently, and they could end up silently getting surprising behavior.

If the default is to not use hard-links, and the user has to manually enable it, then that's not a problem: they know they're using weird yarn-specific behavior.

@jrz commented Nov 21, 2018

APFS supports clonefiles, which are copy-on-write hardlinks. Even though I like rubygems/bundler's package management more, node_modules is here to stay for a while.

The upside of using clonefiles is that if a malicious or misbehaving package/script (or user!) tries to update a file, it gets copied first.

This is a safe way to enable the space/time-saving feature, and it can be enabled by default.

@Daniel15 (Member, Author) commented

APFS supports clonefiles, which are copy-on-write hardlinks.

@jrz - Btrfs and ZFS both support Copy-on-Write too. Unfortunately, very few people are using CoW filesystems at the moment. Apple users are a relatively small proportion of the population, and many are still on HFS+. On Linux, ext3/4 is still much more common than Btrfs and ZFS.

I think there's an issue somewhere to support CoW copies in Yarn, but I really think the "plug'n'play" functionality makes this obsolete anyways: yarnpkg/rfcs#101

@KSXGitHub commented

Alternate solution: nodejs/node#25581

@dhakehurst commented Mar 10, 2019 via email

@Kogia-sima commented May 10, 2019

We should instead consider extracting packages into a central location (eg. ~/.yarn/cache) and hardlinking them.

Temporary workaround

⚠️ As @dxu mentioned, applying this change may cause other problems.

--- a/src/package-linker.js
+++ b/src/package-linker.js
@@ -232,6 +232,7 @@ export default class PackageLinker {
     const copyQueue: Map<string, CopyQueueItem> = new Map();
     const hardlinkQueue: Map<string, CopyQueueItem> = new Map();
     const hardlinksEnabled = linkDuplicates && (await fs.hardlinksWork(this.config.cwd));
+    const forceLinks = true;

     const copiedSrcs: Map<string, string> = new Map();
     const symlinkPaths: Map<string, string> = new Map();
@@ -302,7 +303,7 @@ export default class PackageLinker {
       }

       const copiedDest = copiedSrcs.get(src);
-      if (!copiedDest) {
+      if (!forceLinks && !copiedDest) {
         // no point to hardlink to a symlink
         if (hardlinksEnabled && type !== 'symlink') {
           copiedSrcs.set(src, dest);
@@ -319,7 +320,7 @@ export default class PackageLinker {
         });
       } else {
         hardlinkQueue.set(dest, {
-          src: copiedDest,
+          src: forceLinks ? src : copiedDest,
           dest,
           onFresh() {
             if (ref) {

@forresthopkinsa commented

It seems like everyone came to a consensus on this years ago:

  • don't use symlinks because the ecosystem isn't expecting it
  • don't use copy-on-write because very few people could utilize it
  • enable hardlinking on an explicitly opt-in basis
  • don't hardlink directories even on the platforms that allow it

What's blocking this? Are we just waiting for someone to submit a PR?

@dhakehurst commented May 20, 2020

There is an even better option.

Use a virtual filesystem.

The virtual FS can map the real directories to whatever location is needed, as many times as wanted with multiple versions. The modules exist once on the real FS.

@Artoria2e5 commented

The copy-on-write thing is getting better with APFS.

@jrz commented Jun 15, 2020

I'm not sure if yarn already uses clonefiles, but it looks like it:

const ficloneFlag = constants.COPYFILE_FICLONE || 0;

https://github.com/yarnpkg/yarn/blob/3fc13c15a89f93661e0957ed15081131924c8a47/src/util/fs-normalized.js

@Artoria2e5 commented

Yeah, that thing gets passed down to libuv, but it only supports Linux for now. And oops I have two stale PRs (libuv/libuv#2577, libuv/libuv#2578)... maybe I should check the reviewer comments and stuff.

@Hecatron commented

Any news on whether this (hardlinks) is a thing in Yarn? While I like some of Yarn's features, there's no way I can move away from pnpm on my laptop given the amount of disk space that gets used otherwise.

There is an even better option.

Use a virtual filesystem.

The virtual FS can map the real directories to whatever location is needed, as many times as wanted with multiple versions. The modules exist once on the real FS.

That wouldn't be a viable option for a lot of people, or at the very least would be very awkward to set up. If you're developing under Windows with an NTFS filesystem, for example, or under Linux with ext4, going through the hassle of mounting a block of storage as a "special" filesystem just for development wouldn't be practical for a lot of folks. Most people tend to follow the easiest path to a solution.

I'm thinking of giving Plug’n’Play a go to see if there are any compatibility issues with the code I'm running.

@Daniel15 (Member, Author) commented Oct 11, 2020 via email

@Hecatron commented

I've just been playing around with PnP under Yarn and I do like it. I think I'll be moving to that, since it's a bit quicker at installing and using dependencies than pnpm, given the way it works by generating a .pnp.js file. It also only requires some minimal changes to the webpack configs, so I'm kind of sold on it now.

@dhakehurst commented

That wouldn't be a viable option for a lot of people, or at the very least would be very awkward to setup, if your developing
under Windows for example with a ntfs filesystem or Linux with an ext4 filesystem going through the hassle of mounting a
block of storage as a "special" filesystem just for developing wouldn't be practical for a lot of folks.
Most people tend to follow the easiest path to a solution.

I was not suggesting it as a user solution.

It is something that should be implemented inside yarn.

@Hecatron commented

That wouldn't be a viable option for a lot of people, or at the very least would be very awkward to setup, if your developing
under Windows for example with a ntfs filesystem or Linux with an ext4 filesystem going through the hassle of mounting a
block of storage as a "special" filesystem just for developing wouldn't be practical for a lot of folks.
Most people tend to follow the easiest path to a solution.

I was not suggesting it as a user solution.

It is something that should be implemented inside yarn.

In a way this is what they have sort of done with Yarn PnP.
Typically a package manager such as yarn, npm or pnpm will download files to a node_modules directory local to the project; the obvious problem is that this can use up a lot of disk space.
Let's say you implement a virtual file system inside of Yarn: that doesn't help JavaScript being run by node, or being sourced by webpack, since those tools wouldn't know anything about the virtual filesystem without some kind of wrapper.

pnpm's approach is to have a single global directory on the disk to store all the libs, then create a node_modules directory in the project and create hardlinks from there to the global store to save on disk space. This works but I've found can be a bit slower than yarn's pnp approach.

The Yarn PnP approach is to download the libs as compressed zips into either a single global directory store (like pnpm) or local to the project (.yarn\releases),
depending on which setting you've configured. It then generates a .pnp.js file.

You then have a wrapper that reads in the generated .pnp.js file from Yarn.
The wrapper makes it look like you're accessing files from a node_modules directory, but in actual fact you're going through .pnp.js, which in turn reads the files from the zips without needing to store them uncompressed on disk.

The type of wrapper depends on the tool in use.

  • For webpack there's a plugin to the resolver called PnpWebpackPlugin
  • For some tools they already have inbuilt support - https://yarnpkg.com/features/pnp#native-support
  • For tools that don't have native support (like the typescript compiler) you can use a tool called pnpify - such as "yarn pnpify tsc" for running the typescript compiler.

I think there may be some edge cases where it doesn't work, but I always have pnpm for that.
npm's approach at something similar is called "tink", I think, but it doesn't seem to be getting much development at the moment.

@jeffbski commented Oct 13, 2020

Hard links work well to reduce disk usage and they work fine on Linux, macOS, and Windows. I haven't found any disadvantages to using them. I currently use a package I wrote called pkglink to create them on my node_modules directories, but it would sure be nice if this was integrated into Yarn so one didn't have to run things separately. You can start using them today with pkglink: https://github.com/jeffbski/pkglink. Just run it on your JS repos, or even give it the folder above all of your repos, and it will create hard links for the duplicate node_modules files; it verifies versions, file sizes, and dates before linking to make sure files are the same and can be linked.

@Artoria2e5 commented

one thing to keep in mind is that since hardlinks pretend to be the file system so well, deleting them usually deletes way more files than you're expecting

I know I am replying to something very old, but just for the record (since @also only said it vaguely): hardlinks should not cause accidental deletions at all. When anything is "deleted" on a filesystem with an inode table (including NTFS with its MFT), all that happens is the inode's reference count drops by one; the data is only removed when the count reaches 0.

The only case where accidental deletion can happen is with directory hard links. Almost nothing besides macOS supports those, so I haven't tested it. For the record, the n problem stems from a symlink: rm only removes the symlink itself by default, but when you ask it to delete something under the symlink, it will follow it.

@Hecatron commented

Typically pnpm uses junctions instead of hardlinks for directory pointers/links on Windows.

So I suspect if you deleted the files inside a junction directory it would also delete them from the directory the junction points to (similar to a symlinked directory under Linux).

@Artoria2e5 commented

@grbd, yes that's how junctions work.

@merceyz (Member) commented Jan 3, 2021

Closing, as PnP is a thing now and is the most space-efficient option. If this were to be implemented it would happen in v2; it is tracked here: yarnpkg/berry#1845

@merceyz closed this as completed Jan 3, 2021
@larixer (Member) commented Aug 22, 2022

Update. This feature is supported starting from Yarn 3, via nmMode: hardlinks-global:
https://yarnpkg.com/configuration/yarnrc#nmMode

When enabled, the files inside node_modules will be hardlinked into a central content-addressable store at ~/yarn/berry/store, thus occupying disk space only once across all projects that have this option enabled.

@callaginn commented

@larixer Using nmMode: hardlinks-global doesn't seem to be linking to a global store in Yarn 4.

Right now my .yarnrc.yml file looks like this:

enableGlobalCache: true
nmMode: hardlinks-global
nodeLinker: pnpm

Is there something I'm missing? It appears to be linking to a hidden ".store" directory within my local node_modules folder, which isn't what I expected.

@larixer (Member) commented Feb 27, 2024

@callaginn The information I provided was for node-modules linker. The pnpm linker has different behaviours.
