[spec: webpack 5] - A module disk cache between build processes #6527
Current Problems & Scenarios
Users get fast webpack builds on large code bases by running continuous processes that watch the file system with webpack-dev-server or webpack's watch option. Starting one of those continuous processes can take a lot of time, since all the modules for the code base must be built anew to fill the memory cache webpack uses to make rebuilds fast. Some uses of webpack cannot benefit from a continuous process at all, such as running tests on a Continuous Integration instance that is stopped after the run, or making a production build for staging or release, which is needed less frequently than development builds. The production build simply uses more resources, while the development build completes faster because it skips the optimization plugins.
Community solutions and workarounds help remedy this with cache-loader, DllReferencePlugin, auto-dll-plugin, thread-loader, happypack, and hard-source-webpack-plugin. Many workarounds also include option tweaks that trade small losses in file size or feature power for larger improvements to build time. All of these require a lot of knowledge about webpack and its community, or finding really good articles on what others have already figured out. webpack itself does not offer a simpler option that can be turned on, or that is on by default.
Alongside the module memory cache, there is a second cache in webpack that is important for build performance: the resolver's unsafe cache. The unsafe cache is memory-only too, and is an example of a performance workaround that is on by default in webpack's core. It trades resolving accuracy for fast repeated resolutions. That trade means continuous webpack processes need to be restarted to pick up changes to file resolutions. The option can be disabled, but resolutions change in that way rarely enough that restarting occasionally saves more time overall than leaving the option off.
Freeze all modules in a build at the needed stages during compilation and write them to disk. Later iterative builds (the first build of a continuous process using an existing on-disk module cache) read the cache, validate the modules, and thaw them during the build. The graph relations between modules are not explicitly cached. The module relations also need to be validated, and validating the relations is equivalent to rebuilding them through webpack's normal dependency tracing behaviour.
The resolver's cache can also be frozen and validated along with saved missing paths. The validated resolver's "safe" cache allows retracing dependencies to execute quickly. Any resolutions that were invalidated will be run through the resolver normally, allowing file path changes to be discovered in iterative builds and rebuilds.
Plain JSON data is easiest to write to and read from disk, and also provides a state the module's data can be in during validation. Fully thawing that data into its original shape requires a running Compilation, so the Module, Dependency, and other webpack instances can be created according to how that Compilation is configured, producing a copy of the past Module indistinguishable from the last build.
Creating this data will likely involve two sets of APIs. The first creates the duplicates and constructs thawed instances from the duplicates read from disk. The second uses the first to handle the variation in subclassed types in webpack. As an example, webpack 3 has 49 Dependency subclasses that can be used by the core of webpack and core plugins. The first API, duplicating a NormalModule, doesn't handle the Dependency instances in the module's dependencies list; it calls the second API to create duplicates of those values, and the second API uses the first to create those duplicates. To keep this from running in a circular cycle, uses of the first API are responsible for not duplicating cyclical references, and for recreating them while thawing using passed state information, like webpack's Parser uses.
The first data API will likely be a library used to implement a schema of a Module or Dependency. The second data API may use webpack's dependencyFactories strategy or Tapable hooks. A Tapable or similar approach may present opportunities to let plugin authors cache plugin information that is not tracked by default.
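The interplay between the two API sets might be sketched as follows. Everything here is an illustrative assumption (the `TypeRegistry` name, the handler shape, and the example types are not real webpack APIs); the point is how the per-type schema API and the type-dispatch API call into each other without recursing forever.

```javascript
// Hypothetical sketch of the two data API sets. The second API dispatches
// on the subclassed type; the first API implements a schema per type and
// delegates nested webpack values back to the second API.
class TypeRegistry {
  constructor() {
    this.handlers = new Map();
  }
  register(typeName, handler) {
    this.handlers.set(typeName, handler);
  }
  // Second API: wrap the per-type duplicate with a type tag for dispatch.
  freeze(instance, context) {
    const handler = this.handlers.get(instance.type);
    return { type: instance.type, data: handler.freeze(instance, this, context) };
  }
  thaw(frozen, context) {
    return this.handlers.get(frozen.type).thaw(frozen.data, this, context);
  }
}

const registry = new TypeRegistry();

// First API: schema for a single (hypothetical) dependency type.
registry.register('HarmonyImportDependency', {
  freeze: dep => ({ request: dep.request }),
  thaw: data => ({ type: 'HarmonyImportDependency', request: data.request }),
});

// Schema for a module type. Nested dependencies are delegated back to the
// registry rather than duplicated inline, which is what leaves cyclical
// references (e.g. a dependency's pointer to its owning module) as the
// caller's responsibility.
registry.register('NormalModule', {
  freeze: (mod, reg, ctx) => ({
    request: mod.request,
    dependencies: mod.dependencies.map(d => reg.freeze(d, ctx)),
  }),
  thaw: (data, reg, ctx) => ({
    type: 'NormalModule',
    request: data.request,
    dependencies: data.dependencies.map(f => reg.thaw(f, ctx)),
  }),
});
```

In a real implementation the context argument would carry the running Compilation, so thawed instances are constructed the way that Compilation is configured.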
A file system API is needed to write and read the duplicates. This API organizes them and uses systems and libraries to operate efficiently, or to provide an important point for debugging to loader authors, plugin authors, and core maintainers. This API may also act as a layer that separates some information into a common shape so its strategy can be changed. Asset objects may be treated this way if they are found to be best stored and loaded with a different mechanism than the rest of the module data.
This must be a safe cache. Any cached information must be able to be validated.
Modules validate their build and rendered source through timestamps and hashes. Timestamps cannot always be validated: either a file changed in a way that didn't change its timestamp, or the timestamp decreased, as when a file is deleted in a context dependency or a file is renamed to the path of the old file. Hashes of the content, like those the rendered source and chunk source use, can be validated. All timestamp checks in modules and elsewhere must be replaced with hash or other content-representative comparisons instead of filesystem metadata comparisons. File dependency timestamps can be replaced with hashes of their original content. Context dependency timestamps can be replaced with hashes of all the sorted relative paths deeply nested under them.
The cached resolver information needs to validate the filesystem shape, and can do that by stating every resolved path and every originally missing path.
Two larger "validations" also need to be performed.
webpack's build configuration needs to be the same as the previous build's. Instead of invalidating the cache in case of a different build configuration, though, a separate cache is stored adjacent to the modules cached under other configurations. Webpack configurations can frequently switch, like in cases of using
The second larger validation is ensuring that dependencies stored in folders like node_modules have not changed. yarn and npm 5 can help here: they can be trusted to do this check by hashing their content. A backup can hash the combined content of all package.json files at the first depth of directories under node_modules. webpack tracks the content of built modules, but it does not track the source of loaders, plugins, and the dependencies used by those and by webpack itself. A change to those may have an effect on how a built module looks. Any change to these not-tracked-by-webpack files currently means the entire cache is no longer valid. A sibling cache could be created instead, if keeping the old cache can be determined to be regularly useful.
User Stories (that speak to the spirit of solving these problem areas)
1. As a plugin or loader author, I can use a strategy or provided tools to test with the cache. In addition, I have a strategy or means to have the cache invalidate entirely, or invalidate specific modules, as I am editing a loader or plugin.
1. As a user, I can rely on the cache to speed up iterative builds and notify me when an uncached build is starting. I can also turn off the notifications if I desire. I should never need to personally delete the cache for some performance trade-off; the cache should reset itself as necessary without my input. I understand I may need to do this in case of bugs. Best such bugs be squashed quickly.
1. As a user, I should be able to use loaders and plugins that don't work with the cache. Modules with uncacheable loaders will not be cached. Modules with nested objects that cannot be duplicated or thawed, because they contain values not registered in the second data API, will produce a warning about their cacheability status and will be built in the normal uncached fashion.
1. As a core maintainer, I can test and debug other webpack core features and core plugins in use with the cache to make sure it can validate and verify itself for use.
This RFC will not look into using a cache built with different node_modules dependencies than those last installed. This would be a large effort on its own likely involving trade offs and may best be its own RFC.
This cache will be portable. Reusable on different CI instances or in different repo clones on the same or different computers. This RFC will not figure out the specifics of sharing a cache between multiple systems and leaves this to users to best figure out.
This spec can be bridged into other proposed new features with its module caching behaviour. This document and issue does not intend to make those leaps.
An API or library to create duplicates of specific webpack types and to later turn those back into the specific types, with some given helper state like the compilation, the related module, etc. Uses of this API must handle not duplicating cyclical references, like a dependency to its owning module, and must thaw the reference given the helper state.
A data relation API that has duplication/thaw handlers registered by some predicate, in the style of dependencyFactories, or through Tapable hooks.
A (disk) cache organization API that creates objects to handle writing to and reading from disk, somewhat like the FileSystem types. This API is for reading and writing the duplicate objects. Its shape needs to support writing only changed objects. This might be done in a batch, database-like operation, letting the cache system send a list of changed items to write so the cache organization API doesn't need to redo work to discover what did and did not change. It will likely need to read all of the cached objects from disk during an iterative build. Two core implementations of this API will likely be needed: one, a debug implementation, and two, a space- and time-efficient implementation.
JSON is at least the initial resting format written to disk. The organization API might be used to wrap the actual disk implementation. The wrapping implementation turns the JSON objects into strings or buffers, and back, for the wrapped implementation. That can be JSON.stringify and JSON.parse, or some other means to do this work quickly, as this step is a lot of work. Beating JSON.parse performance is pretty tricky.
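As a sketch, the wrapping implementation might look like this. The class names are illustrative, and a real inner store would write buffers to disk rather than into a Map:

```javascript
// Wrapping implementation: turns JSON objects into buffers and back for
// the wrapped byte-oriented store.
class SerializingStore {
  constructor(inner) {
    this.inner = inner;
  }
  write(key, obj) {
    this.inner.write(key, Buffer.from(JSON.stringify(obj)));
  }
  read(key) {
    return JSON.parse(this.inner.read(key).toString('utf8'));
  }
}

// Stand-in for the disk implementation behind the organization API.
class MemoryStore {
  constructor() {
    this.map = new Map();
  }
  write(key, buf) {
    this.map.set(key, buf);
  }
  read(key) {
    return this.map.get(key);
  }
}
```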
Either in watchpack or another module, timestamps need to be replaced with hashes for file and context dependencies, or hashes can be added to the callback arguments. With a disk cache, timestamps are not a useful comparison for deciding whether work needs to be redone, because timestamps are not guaranteed to represent changes to file or directory content.
Use file and context dependency hashes in needRebuild instead of timestamps.
Hash a representative value of the environment: dependencies in node_modules and the like. A value different from the last time the cache was used means no items in the cache can be used; they must be destroyed and replaced by freshly built items.
Hash webpack's compiler configuration and use it as a cache id so multiple adjacent caches can be stored. The right cache needs to be selected early on, at some point during plugins being applied to the compiler, after defaults are set and configuration changes are made by tools like webpack-dev-server.
These adjacent caches should be automatically cleaned up by default to keep the cache from running away in size as each one adds to a larger sum. This might happen automatically, say, if there are more than 5 caches including the one in use, or if cumulatively they use more than 500 MB. The oldest ones are deleted first until the cumulative size comes under the 500 MB threshold. As an alternative to the cumulative size rule: if there are more than 5 caches and some are older than 2 weeks, the caches older than 2 weeks are deleted.
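The cleanup policy above can be sketched as a pure selection function (`selectCachesToDelete` is a hypothetical name; performing the deletion is left to the caller):

```javascript
// Keep the newest caches while the count stays at or under maxCount and
// the cumulative size stays at or under maxBytes; everything older is
// returned for deletion.
function selectCachesToDelete(caches, { maxCount = 5, maxBytes = 500 * 1024 * 1024 } = {}) {
  const newestFirst = caches.slice().sort((a, b) => b.mtime - a.mtime);
  const toDelete = [];
  let totalSize = 0;
  let kept = 0;
  for (const cache of newestFirst) {
    totalSize += cache.size;
    kept += 1;
    // Once either limit is exceeded, this cache and every older one go.
    if (kept > maxCount || totalSize > maxBytes) {
      toDelete.push(cache);
    }
  }
  return toDelete;
}
```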
Replace the resolver's unsafe cache with a safe cache that validates a resolution by stating every resolved file and every originally attempted check. Doing this in bulk skips the logic flow the resolver normally executes. Very little time is spent doing this as it doesn't rely on js logic to build the paths. The paths are already built. The resolver's cached items may be stored with their respective module, consolidating all of the data for a cached module into one object for debugging and cleanup. If a module is no longer used in builds, removing it also removes the resolutions that would lead to it, and less information will need to be read from disk.
Iterative builds, builds with a saved cache, should complete significantly faster than an uncached build. An uncached build saving a cache will be a small margin slower than one not writing a cache, as writing the cache is an additional task webpack does not yet perform. A rebuild, a build in the same process that ran an uncached or iterative build, should be a hard to measure amount slower, saving only the changed cache state and not the whole cache.
The cache carries similar security considerations to how third-party dependencies are fetched for a project.
@TheLarkInn ooo, that's a good question. I added a more specific question relating that to the environment or configuration hash comparison.
I think the node version and OS would be answered separately. It would probably be wise to consider the node version to be comparable to npm/yarn/bower installed dependencies and be part of the environment hash. The OS I think gets to a deeper aspect of why the environment and configuration hashes are needed.
If we could transform the webpack object information into a general shape when saving a cache, and transform it back into any possible specific shape decided by all of the installed dependencies, node versions, and webpack configuration, we wouldn't need the hashes. We would have a hermetic cache like the one the webpackGraph spec conversation is talking about. Since we can't, we represent that idea as hashes and instead verify that the stored specific shape is usable in the executing webpack instance.
There is some representation of this in the above cache spec that may need some expansion, but we can reduce the surface area of the hash comparison in areas that can do that specific-to-generic and back transformation. There may be something I'm overlooking, but I think the OS difference can be handled through such a transformation, like webpack records does, by generalizing the file paths. I'd figure that if a webpack project used multiple drives on, say, Windows, parts of the cache would not be usable on Mac or Linux, but the parts that can be transformed would be. The missing parts would just be ignored, since Mac and Linux would never resolve Windows-like paths or be able to make comparable ones.
For the OS I think we only need the file system to be able to
I haven't tested a Mac cache on a Windows machine with
I mostly lurk here, so take these with a grain of salt:
This would be a huge win for the ecosystem, and the amount of third party solutions shows the high demand for a feature like this. Seeing as most of them rely on internals, it seems wise to provide a solution that is built into webpack itself.
Regarding happypack, I would consider compatibility a nice-to-have, not a hard requirement. In other words, if the optimal solution is not compatible and we can't find a secondary solution that is only slightly slower, then maybe we shouldn't go for it.
In any case, what are the next steps for this spec? How can we build some momentum for it? Do the developers behind the community solutions (like @amireh) know about it? Anything that I or others not familiar with the internals can do to help?
Regarding the cache size, I expect it to be quite big on big projects with lots of chained loaders, which are also the projects that would most benefit from this cache. Deleting the cache after it reaches a certain size can lead to thrashing, where caches keep getting deleted and re-created. To avoid this, a user would need to go check individual caches and try to increase the size limit to what they believe would allow a few extra caches to be stored.
For this reason I don't see cache size as being a useful criterion in determining whether caches should be deleted. The size of a cache is a function of the configuration and sources, and cannot be determined by a user in advance. I think it's best to keep to the number of caches and cache age.
Including the node or operating system in either hash would greatly diminish its portability. It's very common for team members to use different OSs or node versions.
Also worth mentioning that, IIRC, yarn can produce different folder structures for the same lockfile. At least I remember people had different deduping behaviours, which meant packages were in different places.
When hoisting is taken into account, I don't think the backup described here (hash first level package.jsons) is enough. Things can change at any level, and that can affect loader/plugin behaviour. For this reason the backup method needs to take into consideration the resolved packages at any level.
Agreed. I made a change to hard-source recently to auto-prune that relies on only cache age.
I think as long as yarn guarantees the same versions of each dependency's own dependencies, a different folder structure should be fine.
You're right. I think projects using hoisting will need to customize their cache configuration so it uses the right yarn.lock or package-lock.json. Lock files would be the best. Otherwise they would need a custom list of top-level node_modules directories to hash the first level of. The best strategy in hoisting situations may need to be plugins or optional values; I'm not sure a strategy that checks for hoisting would make a good default.
I would like to see another user story for this kind of change which is to reduce the memory requirement for large builds, especially in a CI / cold cache scenario.
At Atlassian we recently tried to upgrade from Webpack 3 to Webpack 4 for Jira's frontend build but were unable to because we either a) couldn't even get the build to complete due to out of memory errors b) the build would take anywhere from 150% of the Webpack 3 time to 800%+ depending on source map / optimisation options (Webpack 3 currently takes about 15 minutes for a full production build)
Significant memory usage - we believe due to Webpack storing the sources, bundles, and source maps in memory the whole time - was the cause. If the build completed, it was slower due to significant GC thrashing. Our build produces some 50 separate bundles (multiple entry points + code splitting), and these issues occur even with an 8 GB heap.
One similar existing issue I found was #7703.
It'd be interesting if it were an advanced tuning thing: you could trade off disk for memory. For smaller builds using memory is fine and makes things faster, where using disk would slow them down. For a build the size of ours, offloading to disk and freeing memory would probably actually make things faster due to the reduction in GC thrashing.