Cache build information between webpack processes #250

Closed
zkat opened this Issue May 1, 2014 · 47 comments

@zkat

zkat commented May 1, 2014

The current watch mode for webpack is fairly good at caching load results and general workload between file changes, as long as you keep the same webpack process running. The initial build for some projects, on the other hand, can take more than a minute. This makes the build process very slow if for some reason you need to stop your watcher, node crashes, or you simply want to restart your connect server with different flags.

It would be nice if webpack were able to persist compilation information between invocations of the node process to the filesystem and reload that cache later -- so ideally, performance between separate $ webpack calls would be close to how long --watch takes to recompile when it detects a change.

@jhnns

Member

jhnns commented May 3, 2014

👍

@sokra

Member

sokra commented May 6, 2014

We need a way to serialize the cache object to the filesystem, but the cache is not a simple JSON object. It should be possible with a cool serialization module that can rebuild the object structure.
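
A tiny illustration of why a plain JSON round-trip is not enough (the Module shape here is made up for demonstration, not webpack's real class):

    // Hypothetical stand-in for a cache entry: real entries are instances of
    // webpack classes and reference each other.
    function Module(request) {
      this.request = request;
      this.reasons = [];
    }
    Module.prototype.needRebuild = function() { return false; };

    var a = new Module('./a.js');
    a.reasons.push({ module: a });          // circular, like real cache entries

    // JSON.stringify(a) throws "Converting circular structure to JSON", and even
    // without the cycle, JSON.parse(JSON.stringify(a)) is a plain object with no
    // needRebuild() method, so the compiler can't use it.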

@jhnns

Member

jhnns commented May 6, 2014

What properties of the cache object can't be serialized with JSON.stringify? I guess toSrc isn't an option either 😉

@sokra sokra referenced this issue May 15, 2014

Closed

write cache #270

@geddski

geddski commented May 15, 2014

This would be great. Currently, if you have the bail option set, some JS errors cause the watch process to crash and you have to do an entire new build to start it again.

sokra referenced this issue in webpack/concord Oct 20, 2014

@damassi

damassi commented Jan 6, 2015

Running into this problem as well. Once it's going it's instant, but right now our initial build takes over a minute to warm up.

@jtangelder

jtangelder commented Feb 12, 2015

This would be really cool.

@nickdima

Contributor

nickdima commented Sep 3, 2015

This could also be very useful for speeding up incremental builds in a continuous deployment setup.
@sokra could you point us in the right direction if we want to help with this?

@marcello3d

marcello3d commented Sep 22, 2015

Any pointers on where to start to solve this issue?

@bholloway

bholloway commented Oct 15, 2015

I took a look at serialising the cache based on a TodoMVC project (babel, sass) that I have lying around and Webpack 1.12.0. Here is what I found.

The compiler.cache is provided by CachePlugin, which is created by WebpackOptionsApply. In CachePlugin it is trivial to serialise the cache on the existing "after-compile" step and reload the same file (if it exists) in the constructor. Sure, it has circular references, but cycle may be used.
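
A minimal sketch of that idea, assuming a webpack 1-style plugin, the decycle/retrocycle API of the cycle module, and that the cache is reachable as compilation.cache (the plugin name and cache file are made up):

    var fs = require('fs');
    var cycle = require('cycle');            // Crockford's decycle/retrocycle

    function PersistentCachePlugin(cacheFile) {
      this.cacheFile = cacheFile;
    }

    PersistentCachePlugin.prototype.apply = function(compiler) {
      var cacheFile = this.cacheFile;
      compiler.plugin('after-compile', function(compilation, callback) {
        // decycle() rewrites circular references as {$ref: ...} paths so the
        // structure survives JSON.stringify; retrocycle() restores them on load.
        var plain = cycle.decycle(compilation.cache);
        fs.writeFile(cacheFile, JSON.stringify(plain), callback);
      });
    };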

However, this doesn't get us anywhere. The reason is that the cache is not a plain data structure. It is composed of Classes that would need to be identified on serialisation and reconstructed on deserialisation. Naively deserialised JSON does not implement the requisite methods, and the compiler quickly fails on cacheModule.needRebuild().

So how big is this problem? I hacked some code into the CachePlugin "after-compile" step to see (a rough reconstruction is sketched below).
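
Something along these lines (a hypothetical reconstruction, not the original snippet), just to count which classes show up in the cache:

    compiler.plugin('after-compile', function(compilation, callback) {
      // Dump the constructor name of every cache entry to see how many distinct
      // classes a serialiser would have to know about.
      Object.keys(compilation.cache).forEach(function(key) {
        var entry = compilation.cache[key];
        var name = entry && entry.constructor ? entry.constructor.name : typeof entry;
        console.log(name + ' -> ' + key);
      });
      callback();
    });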

I found that all my interesting files (in the small Todo project) were entries in the cache hash, each with a value instanceof ContextModule. Note that ContextModule inherits from Module, which inherits from DependenciesBlock, which uses DependenciesBlockVariable.

Enumerable on these module instances were also instances of Parser, OriginalSource, and SourceMapSource (or Arrays thereof).

At first glance it is plausible (but suboptimal) to write an external monolithic serialiser/deserialiser that can reconstruct these objects. It would support only this set of Classes but not necessitate any changes to the Webpack codebase.
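
For example, a registry-based reviver that only knows a hand-picked set of classes might look like this (a sketch; the __class__ marker is assumed to be written out at serialisation time):

    var NormalModule = require('webpack/lib/NormalModule');
    var ContextModule = require('webpack/lib/ContextModule');

    var registry = { NormalModule: NormalModule, ContextModule: ContextModule };

    function revive(plain) {
      var Ctor = registry[plain.__class__];          // class name recorded by the serialiser
      if (!Ctor) return plain;                       // unknown classes stay plain objects
      var instance = Object.create(Ctor.prototype);  // restore the prototype without running the constructor
      for (var key in plain) {
        if (key !== '__class__') instance[key] = plain[key];
      }
      return instance;
    }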

Likely such an implementation would only be useful for a narrow set of use cases. Given the large number of Classes in the Webpack /lib, I imagine there would be a quickly growing list of other classes that would need to be included in the monolith.

The alternative is the one where @sokra brings the difficult tag: much or all of the Webpack /lib needs to support some sort of serialisation interface, so that the cache items may be recursively serialised/deserialised. This would be the more consistent solution, but it needs architectural buy-in.

Please let me know if any of this is misinformed. I'm interested to hear other thoughts on this.

I would like to try the monolithic serialiser, partly because, as its author, I can suit it to my own use case, but also because it might inform a more general solution.

If you know of any other Classes that would need inclusion (or you can discover them by running a similar test) then please let me know. But my hope is that the full cache would not need to be reinstated to get some performance benefit.

@bholloway

bholloway commented Oct 16, 2015

Ok, so I have a really naive implementation of the monolithic serialisation. Overall... interesting but currently unusable.

There were lots of classes in the compiler.cache that I initially didn't see. But thankfully I was able to use the require.cache to discover pretty much every Class, and @sokra unintentionally did an absolutely beautiful job of structuring all the classes so that they could be easily reconstituted.
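
A rough sketch of what the require.cache trick could look like (assumed approach, not the plugin's actual code): index every constructor node has already loaded from webpack/lib by name, so serialised class names can be mapped back to real prototypes.

    var classByName = {};

    Object.keys(require.cache).forEach(function(filename) {
      // naive path check: only consider webpack's own /lib classes
      if (filename.indexOf('webpack/lib/') === -1) return;
      var exported = require.cache[filename].exports;
      if (typeof exported === 'function' && exported.name) {
        classByName[exported.name] = exported;       // e.g. 'NormalModule' -> constructor
      }
    });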

Without optimisation serialisation is 2 orders of magnitude in the red, which doesn't bode well. Deserialisation is quite fast, leading me to believe it is the class discovery hurting the serialisation. In the short term I added a persist:false option to turn off serialisation and just use any existing cache file.

Even under these conditions the performance gain is only 10% on a medium project. For very small projects this is completely offset by the time to deserialise the cache. At this stage it's pretty amazing it produces workable output at all, but what we are all expecting is incremental-size compile times. So I have to conclude that right now it is not useful.

Some learnings:

  • The cache is much more complex than I expected from my experience working with the Browserify cache. I am sure this is essential complexity but I'm hoping we can be selective in what we persist. This really needs some expert knowledge.
  • This possibly doesn't need an architectural solution, due to the clean nature of @sokra's classes (notwithstanding the possibility that the whole concept just doesn't work).
  • Currently I'm writing JSON to disk. I'm seeing a 200MB cache file on a medium-sized project with a built size of around 4MB. Binary will help, but how much (?).
  • With lots of copies of Webpack code it was difficult to detect Classes from the correct paths, making require.cache essential. I need to re-confirm this wasn't something stupid I was doing during development which is now fixed. I'd like to require-dir on the Webpack /lib folder instead.
  • The small project mentioned above builds correctly but that is no guarantee for other use cases. I don't have any medium/large projects in a stable state (migrating from Browserify) but I suspect a problem with at least one. All are AngularJS projects. It would be good if others can test their use-case to see where things stand.

I've spent a couple of days on this and need to let it sit for a while. But if it crashes for you certainly raise an issue and I will try to look at it right away.

This is such a missing link that it would be awesome to crack it. If anyone has feedback or a PR that can move this plugin forward, please let me know.

@marcello3d

marcello3d commented Oct 16, 2015

@bholloway Nice work and writeup! Apologies if this is obvious, but how much of the 200MB is duplicated? Are the same references being serialized multiple times? You might also be running into memory pressure/gc just from serializing/deserializing the entire thing into memory at once (as opposed to streaming/multiple cache files).

@sokra

Member

sokra commented Oct 17, 2015

@bholloway pretty awesome work.

Nice idea of using require.cache for serialization. I initially thought about adding a static property to each class storing the __filename of the module.

One of the reasons why the cache is so big is probably that modules are stored in the cache while still connected in the graph. They are disconnected when they are read from the cache (search for "disconnect" in the webpack source). I'll try to change this in webpack (it may reduce memory usage too) and check how much effect it has on the serialization.
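
A sketch of the static-property idea (the property name __ctorFile and the helpers are hypothetical):

    // At class definition time, record which file the constructor lives in,
    // e.g. markSerializable(NormalModule, __filename).
    function markSerializable(Ctor, filename) {
      Ctor.__ctorFile = filename;
      return Ctor;
    }

    // At deserialisation time, assuming the serialiser copied __ctorFile onto
    // the plain object, require() the constructor back by path and restore the
    // prototype without re-running the constructor.
    function reviveByCtorFile(plain) {
      if (!plain || !plain.__ctorFile) return plain;
      var Ctor = require(plain.__ctorFile);
      var instance = Object.create(Ctor.prototype);
      for (var key in plain) instance[key] = plain[key];
      return instance;
    }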

@markfinger

markfinger commented Oct 21, 2015

I tried a similar approach to @bholloway's and found the serialization/deserialization costs were too high; it's really nice to see that the problem wasn't as insurmountable as it appeared to me.

I ended up wrapping webpack, caching its output, and adding a dozen or so lines of cache validation logic (checking mtimes on dependencies and emitted assets, package versions, etc.). For our purposes (automated deploys with remote builds) this dropped webpack's build times from ~30 seconds to < 1 second, most of which is spent booting the node process, loading the modules and cache, and hitting the file system.

The same wrapper is used on a bunch of local dev machines as well. We serve from the cache while spawning a watching compiler in the background. Once the compiler is ready, the cache delegates to the compiler, and the compiler keeps the cache populated.
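
A rough sketch of that kind of wrapper (not the actual code; the file name and cache shape are made up): serve the previous output if nothing relevant changed, otherwise fall through to a real build.

    var fs = require('fs');
    var execSync = require('child_process').execSync;

    var CACHE_FILE = '.webpack-wrapper-cache.json';   // { files: [{path, mtime}], output: '...' }

    function isCacheFresh(cache) {
      return cache.files.every(function(entry) {
        try {
          return fs.statSync(entry.path).mtime.getTime() <= entry.mtime;
        } catch (err) {
          return false;                               // a dependency disappeared: rebuild
        }
      });
    }

    var cache = fs.existsSync(CACHE_FILE) && JSON.parse(fs.readFileSync(CACHE_FILE, 'utf8'));
    if (cache && isCacheFresh(cache)) {
      process.stdout.write(cache.output);             // reuse the previous build's output
    } else {
      execSync('webpack', { stdio: 'inherit' });      // real build (webpack assumed on PATH)
    }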

@markfinger

markfinger commented Oct 21, 2015

Probably repeating @bholloway's points, but the blockers that I can remember encountering when trying to add a persistent cache to webpack were:

  • Cyclical structures in both the cache and the state
  • Mixtures of simple objects and instances with constructors + prototypes

The cyclical structures ensured that you had to walk the entire tree to find any cycles, which is slow on both ends of the serializer.
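
For illustration, a safe serialiser has to do something like this for every object in the graph before it can emit anything (generic sketch, not webpack code):

    // Walk the whole graph, remembering every object already visited, and
    // replace back-references with a placeholder. On large caches this walk
    // alone is expensive.
    function decycle(value, seen) {
      seen = seen || new Set();
      if (value === null || typeof value !== 'object') return value;
      if (seen.has(value)) return { $circular: true };
      seen.add(value);
      var out = Array.isArray(value) ? [] : {};
      Object.keys(value).forEach(function(key) {
        out[key] = decycle(value[key], seen);
      });
      return out;
    }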

The objects with constructors and prototypes ensured that you would need a large amount of domain knowledge to cover all the use cases in both webpack and the ecosystem.

If those two issues were solved, performant caching would be pretty viable.

The last thing I remember considering was adding an interface for plugins and loaders to populate and read from the cache.

@mzgoddard

Contributor

mzgoddard commented Nov 3, 2015

I've been looking at this too. Here's my current implementation. https://github.com/mzgoddard/webpack-cache-module-plugin

I dove into webpack's source while trying my hand at this. I realized the cyclical parts of the cache are caused by Compilation handling built modules and primarily storing Reasons that point to the module that depends on the newly built module. If a cached module is used, the old Reasons are thrown away and new ones are computed during an iterative build.

From reading some other plugins I thought I might use this detail and hard-code the members of modules to be serialized. To deserialize them I'd let webpack's normalModuleFactory do a bunch of the lifting, and in the factory's "module" plugin interface wrap the NormalModule in a proxy object that uses the specifically selected serialized members for the first run, provided the timestamps of its dependencies aren't out of date.
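
A very reduced sketch of that shape (hook names follow the webpack 1/2 plugin API; the persisted cache layout is made up, and the real plugin wraps the module in a proxy rather than copying members onto it):

    var fs = require('fs');

    function CacheModulePlugin(persistedCache) {      // { request: { mtime, members } }
      this.persistedCache = persistedCache;
    }

    CacheModulePlugin.prototype.apply = function(compiler) {
      var persistedCache = this.persistedCache;
      compiler.plugin('normal-module-factory', function(factory) {
        factory.plugin('module', function(module) {
          var cached = persistedCache[module.request];
          if (cached && fs.statSync(module.resource).mtime.getTime() <= cached.mtime) {
            // Reuse the serialized members for the first run instead of rebuilding.
            Object.keys(cached.members).forEach(function(key) {
              module[key] = cached.members[key];
            });
          }
          return module;
        });
      });
    };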

This works, at least in my generated sandbox. I think the takeaway either way is to treat the modules as disconnected when serializing or deserializing the cache (see Module.prototype.disconnect below):

webpack/lib/Module.js, lines 29 to 37 at f7d799a:

    Module.prototype.disconnect = function() {
        this.reasons.length = 0;
        this.lastId = this.id;
        this.id = null;
        this.index = null;
        this.index2 = null;
        this.chunks.length = 0;
        DependenciesBlock.prototype.disconnect.call(this);
    };
As far as I can tell, Module#disconnect removes the circular references, and it's essentially the state the modules are in when they are first created and the state Compilation wants when using a cached module.

The other thing I noticed (I'm not sure if it affects @bholloway's work, but I think it does) is that even with these cached proxies around the normal modules I saw maybe a 10% improvement in performance. It wasn't until some more poking around and logging timestamps that I noticed the biggest consumer of time was resolving the locations of dependencies. Implementing a stored UnsafeCache (https://github.com/mzgoddard/webpack-cache-module-plugin/blob/0f09a55b01715c8e70dbc64098a220b6f5c67f68/lib/CachePlugin.js#L80-L97) gave me the performance improvement we're all hoping for from a persistent cache. When Compilation processes modules for dependencies and recurses into building the depended-on modules, it has to resolve what each dependency is, or more specifically the module factory has to. I thought I knew how the iterative builds avoided this, but looking back at webpack's source I'm not as sure. Just thinking about it now, I'd guess the CachedInputFileSystem is helping here once primed from the first run.

Freezing and thawing the cache to disk is pretty straightforward. I think caching dependency resolutions, or possibly part of the CachedInputFileSystem, is where a large win may be waiting, but they are less straightforward. One thought is that if we could list all attempted filepaths for each context and dependency pair, we might be able to watch or check those paths on start to invalidate path resolutions. Another thought is that maybe CachedInputFileSystem can help here. If we persist its stat, readlink, and readdir info to disk and revalidate it on start, we could gain a similar impact to UnsafeCache but be ... safer. I imagine, though, that revalidating info for the CachedInputFileSystem could take a lot of time considering the number of paths involved in resolving dependencies and loaders.
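
A conceptual sketch of persisting such a resolution cache (not the linked plugin's actual code; the cache file name is made up):

    var fs = require('fs');
    var CACHE_FILE = '.resolve-cache.json';

    var resolveCache = fs.existsSync(CACHE_FILE)
      ? JSON.parse(fs.readFileSync(CACHE_FILE, 'utf8'))
      : {};

    // Answer (context, request) pairs from the persisted map before falling
    // back to the real resolver. "Unsafe" because nothing re-checks that the
    // resolution is still valid when files move.
    function cachedResolve(context, request, realResolve, callback) {
      var key = context + '!' + request;
      if (resolveCache[key]) return callback(null, resolveCache[key]);
      realResolve(context, request, function(err, result) {
        if (!err) resolveCache[key] = result;
        callback(err, result);
      });
    }

    process.on('exit', function() {
      fs.writeFileSync(CACHE_FILE, JSON.stringify(resolveCache));
    });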

To reiterate, from what I've found: the Reason and Chunk objects (and the others removed in disconnect) stored on Modules are the circular references, and they are not wanted during iterative compilations, so they can be safely ignored. Resolving dependencies is very elaborate and takes a lot of time; to approach iterative build times, somehow safely caching dependency path resolutions or the CachedInputFileSystem is likely a needed step.

Hope this helps.

@Globegitter

Contributor

Globegitter commented Jan 21, 2016

@sokra Is there anything in webpack 2 that would make caching to disk easier?

@wkentdag

This comment has been minimized.

wkentdag commented Sep 2, 2016

@jhnns check out static-dev/spike-core#156. Spike generates static sites using webpack as its core bundler/compiler, and adding a persistent cache feature to webpack core would, for example, directly affect my use case: compiling a massive WordPress site into a static site that I can distribute through a CDN. With a persistent cache, I can run the initial build all day if need be, and then look forward to more normal build times throughout the rest of the development process, since 90% of the static pages I generate on the first run will never get touched again.

This thread has been my first foray into webpack core, so it totally could be too much work to integrate a somewhat tangential feature. By the way, this hard-source plugin looks awesome; I'm going to try integrating it ASAP, and that might work well enough for me. I just thought I'd throw my two cents in and say that I think this would be a really great feature for anybody working on large webpack projects, including ones that are leveraging other libraries built on top of webpack (e.g. Spike). Thanks to everybody here for your hard work on the issue 🎉

@gdborton

This comment has been minimized.

Contributor

gdborton commented Sep 2, 2016

@jhnns Any tips for breaking down time spent in various parts of the build?

Our project is massive at 451 chunks and 7217 modules, but even building a subset of that leads to long rebuild times in dev.

2016-09-02_18:28:16.85185 chunks 12
2016-09-02_18:28:16.85227 modules 1658
2016-09-02_18:28:16.85246 Build complete in 10343ms

At least some of this can be attributed to our use of vagrant/NFS, but I don't expect all of the slowness to be related. I can get numbers from my host machine as well.

@jhnns

This comment has been minimized.

Member

jhnns commented Sep 12, 2016

I think flame graphs could give us better insights into where so much time is spent. I'll try to get some from my current project and post instructions on how to get them (if you do not already know how ^^).

We talked about this issue at our last weekly meeting. We're planning to deprecate all parts of the loader APIs that make parallel compilations impossible (namely sync APIs and _compiler/_compilation). However, in order to deprecate these we need to provide better alternatives for loaders that had to access these internal objects (namely TypeScript loaders, but probably some others too).

@sokra had the idea of a loader API that allows hooking into different compilation states. Loaders would thus be more like plugins, but still on a per-file basis. He wanted to create a proposal so that we can discuss it with other loader authors.

Since webpack@2 already uses a separate module to execute loaders, we could write a parallel-loader which spawns multiple processes (see our meeting notes for details). The parallel-loader would need to provide a webpack-like loader context and handle all the communication in the background. I think it's a good idea to push this into user-space instead of embedding it into webpack core. This way we can keep the core simple, and parallel compilation is probably not always desired due to the costs of spawning a new process.
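
An extremely reduced sketch of what such a parallel-loader could look like (all names are hypothetical; a real implementation would also ship a serialisable subset of the loader context to the workers and manage their lifecycle):

    const { fork } = require('child_process');

    // A tiny round-robin pool of worker processes (worker.js is hypothetical:
    // it would run the wrapped loader and send the result back).
    const workers = [fork(require.resolve('./worker.js')), fork(require.resolve('./worker.js'))];
    const pending = new Map();
    let next = 0;
    let nextId = 0;

    workers.forEach(worker => worker.on('message', msg => {
      const done = pending.get(msg.id);
      pending.delete(msg.id);
      done(msg.error ? new Error(msg.error) : null, msg.source);
    }));

    module.exports = function parallelLoader(source) {
      const callback = this.async();                // standard async loader API
      const id = nextId++;
      pending.set(id, callback);
      workers[next++ % workers.length].send({ id, source, resource: this.resourcePath });
    };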

@jhnns

Member

jhnns commented Sep 12, 2016

The proposal for the loader API can be found here

@amireh

amireh commented Sep 12, 2016

Since webpack@2 already uses a separate module to execute loaders, we could write a parallel-loader which spawns multiple processes (see our meeting notes for details). The parallel-loader would need to provide a webpack-like loader context and handle all the communication in the background.

Honestly, that's exactly what happypack does, although it doesn't use loader-runner, since at the time I saw it, it didn't actually work for a distributed application (at least in happypack's context). I can't recall the details, but it's probably something easily amendable.

@jhnns

Member

jhnns commented Sep 12, 2016

Honestly, that's exactly what happypack does

That's perfect. You could give us valuable feedback on whether the proposed loader API would make it possible to use happypack as a loader rather than a plugin.

@abergs

abergs commented Feb 1, 2017

@jhnns I'm investigating our webpack 2 build that takes around 2 minutes (.tsx project with about 800 modules).

Could you please describe how to get those flame graphs?

I think flame graphs could give us better insights into where so much time is spent. I'll try to get some from my current project and post instructions on how to get them (if you do not already know how ^^).

@amireh

amireh commented Feb 1, 2017

@abergs run your webpack with node's inspector (a command like the following):

node --debug-brk --inspect webpack

Then launch your browser and visit the URL that command shows you (make sure you're on node 6+, preferably 7).

In the browser page, find the Profiles panel and click on "Start Profile", then let the debugger continue (you may need to go back to the Sources panel to do this). Once the run is complete, go to the Profiles panel and click on "Stop Profile" or such, then you'll find the collected profile in a list to the left. That profile will contain the flame graph you're after.

@abergs

abergs commented Feb 1, 2017

Thank you @amireh

@ghost

ghost commented Mar 6, 2017

Is there filesystem caching yet?

@abdelbk

abdelbk commented Apr 15, 2017

Has any progress been made on making this feature part of webpack core?
The Rails asset pipeline does it perfectly. Is it something worth investigating?

@sod

sod commented Jun 26, 2017

The new webpack-contrib/cache-loader reduces consecutive builds in our environment from 50s to 27s. But you have to ensure that one cache folder isn't used for different configs (like dev/prod).
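
For example, one way to keep the caches apart is to give cache-loader a separate cacheDirectory per build flavour (the directory layout here is just an illustration):

    const path = require('path');

    module.exports = (env = 'dev') => ({
      module: {
        rules: [
          {
            test: /\.js$/,
            use: [
              {
                loader: 'cache-loader',
                options: {
                  // one cache directory per config, so dev and prod entries
                  // never overwrite each other
                  cacheDirectory: path.resolve('.cache', 'cache-loader-' + env),
                },
              },
              'babel-loader',
            ],
          },
        ],
      },
    });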

@jedwards1211

jedwards1211 commented Nov 27, 2017

@sod that's an improvement; how much faster do you think it would be if Webpack could load its entire previous state from disk on startup, see what's changed, and then run an incremental rebuild in the exact same way it would in watch mode? That way the time to invoke webpack again after a single file change would be reduced to just

<time to load previous state from disk> + 
<time to determine which files have changed> +
<time to run incremental rebuild for the files that changed>

In my environment incremental rebuilds take on the order of 10s, and I would hope that loading the previous state and determining which files have changed take far less time.

@MagicDuck

Contributor

MagicDuck commented Nov 29, 2017

I agree with @jedwards1211
We have like 8000 modules in our build and most of the startup time seems to be taken by reading the files and doing the acorn parsing to determine deps. We are already using a babel cache so cache-loader would not help.
I am willing to pitch in and help, but I'm hoping somebody with more experience with the codebase/problem space could take the lead or at least give some direction on getting started. The investigations above looked very impressive! I think this feature would make a lot of people very happy and help alleviate performance issues.
@sokra @mzgoddard @markfinger

@jedwards1211

jedwards1211 commented Nov 29, 2017

We have like 8000 modules in our build

Hot damn, I wonder if that's a record! 😅

@MagicDuck

Contributor

MagicDuck commented Nov 29, 2017

That's what you get when legacy code moves to webpack, lol. It's gotten slightly smaller now, but for the longest time we had a massive hairball chunk about 50MB in size, for which I had to write a custom plugin emulating @sokra's aggressive splitting plugin to split it into slices that get loaded in parallel. Good times... 😂

@filipesilva

Contributor

filipesilva commented Nov 29, 2017

I was running some tests on how hard-source-webpack-plugin could be added to the Angular CLI pipeline and the results seem promising.

It does seem fairly sensitive to Webpack internals, though. Since it relies on serializing Webpack sources, any change in them can break the plugin.

This unfortunately means that it will necessarily lag behind the latest stable Webpack until its changes are integrated into hard-source-webpack-plugin. Webpack 4 also seems to be bringing in a fair number of performance improvements, which will probably affect sources and thus need to be integrated into the plugin.

This makes me think that the only way a plugin such as hard-source-webpack-plugin could be stable is if it's integrated directly into Webpack, or has access to a stable API (e.g. sources supporting serialization). Maybe the second option would provide plugin authors with more flexibility.

@evilebottnawi

Member

evilebottnawi commented Jun 6, 2018

This is a very old issue and a duplicate of the more detailed issue #6527. Please leave your opinions on how we can solve this in #6527. Thanks!
