Lockfile concerns #579
I wanted to open up this issue as a discussion item. I don't think there's anything blockery about this concern (more below), but I do think that the structure of the lockfile will become problematic in the long term.
First, here's a quick example of one of the consequences of the problem, from Yarn's own lockfile:
There are two dependencies described here (
This doesn't always happen; there is some code that tries to "dedup" packages:
But it's hardly the only example:
All of these versions of
All of these patterns can unify to
Hopefully I've made my point
The current yarn strategy successfully unifies dependencies opportunistically, but doesn't have any backtracking mechanism, which is necessary for the unification opportunities described above. It also doesn't provide any way for resolvers to participate in the necessary backtracking. This means that even if we successfully unify the graph we have now via post-processing, there are likely even better opportunities that we'll miss.
The major change in the lockfile format I propose is that the lockfile represents a graph of packages, where the entries are exact package identifiers. This is a change from the current format, where the lockfile is a list of patterns and associated packages.
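As a rough illustration of the difference (the shapes below are hypothetical JavaScript sketches, not the real yarn.lock syntax, and the package names and versions are made up):

```javascript
// Today: entries keyed by the requested *pattern*, so one resolved
// package can appear under several keys (hypothetical data).
const patternKeyed = {
  'left-pad@^1.0.0': { version: '1.1.3' },
  'left-pad@~1.1.0': { version: '1.1.3' }, // same package, second entry
};

// Proposed: entries keyed by *exact package identifier*, forming a graph.
// Each resolved package appears exactly once, and its edges point at
// other exact identifiers (again, made-up data).
const identifierKeyed = {
  'left-pad@1.1.3': { dependencies: [] },
  'line-numbers@3.0.2': { dependencies: ['left-pad@1.1.3'] },
};
```

In the identifier-keyed form, the patterns that led to a given exact package can be recorded as metadata on the node rather than serving as top-level keys.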
The reason I think the current status isn't an urgent priority is that a list of seen patterns and associated packages is still deterministic, which is the number one priority of the lockfile. Lower priorities, like conservative updating (
Ultimately, I think a simplified version of the Cargo strategy (the Cargo strategy supports a nice optional dependency mechanism called "features" which we don't support, and which complicates the algorithm somewhat) is the right way to go. The Cargo strategy is a refined version of the Bundler strategy, which was also used (and further refined) by Cocoapods. The Cargo strategy adds support for package duplication, which is necessary (and desired) in the Node ecosystem.
This would mean version-bumping the Yarn lockfile, but the good news is that Yarn's lockfile is already versioned, so this should be fine.
The Rough Strategy
Here's a rough description of the algorithm. (It's slightly modified from Cargo and Bundler's algorithm because of the amount of duplication expected as a matter of course in the npm ecosystem.)
Dependency resolution always starts with a top-level
Each dependency (
The first step in the algorithm adds a list of all dependencies in the top-level package.json to a
Repeatedly activate a dependency by popping it from the list of remaining dependencies:
If a dependency cannot be unified with an existing instance of the same package, consider backtracking if:
Keep going until the set of remaining dependencies is empty. At this point, a complete graph of unique packages will exist, and be ready to be written out to a lockfile.
A note about backtracking: because dependency graphs with thousands of packages are common, it is not advisable to use recursion when implementing this algorithm in a language with limited stack depth. Instead, we would maintain a side stack (see the same algorithm in Cargo) that allows us to implement the recursive algorithm without blowing the stack.
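The loop above can be sketched as follows. This is a toy illustration, not the proposed implementation: the registry, the version data, and the "same major version" compatibility rule are all made-up simplifications, real semver matching and the duplication fallback are omitted, and the recursion stands in for the explicit side stack a real implementation would use.

```javascript
// Made-up in-memory registry: name -> version -> dependencies.
const registry = {
  a: { '1.0.0': { b: '^1.0.0' }, '2.0.0': { b: '^2.0.0' } },
  b: { '1.0.0': {}, '2.0.0': {} },
};

// Toy compatibility rule: "^X.y.z" matches any version whose major is X.
function satisfies(version, range) {
  return version.split('.')[0] === range.replace('^', '').split('.')[0];
}

// remaining: list of [name, range] pairs still to activate
// activated: map of package name -> chosen exact version
function resolve(remaining, activated) {
  if (remaining.length === 0) return activated; // complete graph found
  const [[name, range], ...rest] = remaining;

  // First, try to unify with an already-activated instance of the package.
  // (The real algorithm could instead duplicate the package here.)
  if (activated[name] !== undefined) {
    return satisfies(activated[name], range) ? resolve(rest, activated) : null;
  }

  // Otherwise activate candidates newest-first; a null result from the
  // recursive call triggers backtracking to the next candidate.
  const candidates = Object.keys(registry[name]).sort().reverse();
  for (const version of candidates) {
    if (!satisfies(version, range)) continue;
    const deps = Object.entries(registry[name][version]);
    const result = resolve([...rest, ...deps], { ...activated, [name]: version });
    if (result !== null) return result;
  }
  return null; // no candidate worked; the caller backtracks
}
```

Here, returning null when unification fails is the simplification: the algorithm described above would consider duplicating the package at that point rather than giving up on the branch.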
I really like the proposal in #422 to use a weighted SAT solver assuming it runs fast enough on large dependency graphs. Then we could optimize for as much deduping as possible, or using the latest packages as much as possible, and so on. It might end up being impractical but modern SAT solvers run impressively fast and can find an optimal solution when a solution exists.
@ide the usual SAT solvers treat constraints as absolute: either two dependencies conflict or they don't. This is more-or-less Bundler's strategy. The strategy described above tries to unify as much as possible, while still allowing duplicates to exist. It's also a back-of-the-envelope strategy (Cargo does something similar, but the ecosystem is much more reliant on semver, defaults to the
@ide I'd also like to add that the Cargo algorithm is a few hundred lines of code in total (including comments) and is performant enough in practice. The bundler algorithm is a similar size, is written in Ruby, and is performant enough in practice (again: written in Ruby!).
I don't think we need to wait to integrate an existing SAT solver library (unless someone is motivated to do the work and make sure it works reliably on the supported platforms) to do this improvement.
@wycats I was thinking about https://github.com/meteor/logic-solver (linked in that issue), which Meteor uses to compute its dependency graph. They've done the work of making it run on many systems via emscripten I think. There's a method called
I don't really have anything useful to add and I'm probably not clever enough to help solve this, but I just wanted to mention that your description is fantastic and describes the problem very well
For reference, here's how NuGet handles dependency resolution: https://docs.nuget.org/ndocs/consume-packages/dependency-resolution#dependency-resolution-rules. In general, it uses the lowest version that satisfies all version requirements (so in your example of
Not sure if this quite matches the topic -- but the "resolved" field doesn't seem appropriate for a lockfile. I mean, you do need to store the resolved version / file hash, but storing the absolute URL makes lockfiles generated on different machines (say, if a user is using a local sinopia cache) have different resolved URLs depending on user settings.
In the worst case, a lockfile could be generated entirely with localhost URLs, even if fetching from the public registry would have the same result (same downloaded hash); so nobody would be able to install from the lockfile without also setting up a local cache server.
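To illustrate the concern, here are two hypothetical entries for the same package as generated on two different machines (the shape approximates the v1 yarn.lock format; the package, version, port, and hash are placeholders):

```
left-pad@^1.1.3:
  version "1.1.3"
  resolved "https://registry.yarnpkg.com/left-pad/-/left-pad-1.1.3.tgz#<sha1>"

left-pad@^1.1.3:
  version "1.1.3"
  resolved "http://localhost:4873/left-pad/-/left-pad-1.1.3.tgz#<sha1>"
```

Both entries describe the same tarball with the same hash, but the differing resolved URLs make the lockfiles differ, and the second one only installs for users who have the local registry running.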
The origin of the file is not important, what matters is that the hashes match.
Just saw this, I'm doing some work to implement a basic version of #422 that uses the aforementioned logic solver by Meteor. Seems like both your proposed solution and what I'm implementing might be trying to solve the same problem.
Possibly unimportant question - the algorithm will always output a single package that satisfies all constraints, but what is the advantage of having the graph of packages on top of this? At that point, couldn't you just keep the list of installed exact packages? My understanding of the lockfile is that it's a file that records the packages you already have installed, to compare against versions or ranges in the future, so definitely correct me if I'm wrong here.
Sorry, just to clarify, by "package duplication", do you just mean multiple versions of the same package?
Cool! A few questions off the top of my head - it seems like this could potentially result in a "non-optimal" configuration. Given that large repos can have many "valid" configurations, due to things like transitive dependencies or overlapping/conflicting requirements at various parts of the dependency graph, you could end up in a situation where you've optimistically chosen the "most recent" version of the first dependency, but multiple packages further down the line suffer as a result. A simple example would be something where you have 3 packages -
The top level package.json requires
With the algorithm described above, it'd look at
Note that the example is very contrived, but hopefully it gets my question across. You can extend it with more packages with interwoven dependencies to simulate the backtracking checks - but the idea here is that the resolver would never consider backtracking, because it's already in a "valid" state: the dependencies are technically already satisfied by the existing set of activated packages - it just hasn't explored the other options.
I'll talk about the approach that @yunxing and I discussed offline, and was going forward with for #422 - I think both approaches are trying to solve the same underlying problem. Though, note that this was intended to only be applied during the
With the SAT solver approach, the plan is to implement a basic version of this by modeling package version resolution as a logic problem, and using logic-solver to handle the actual solution-finding. If you take a look at the API for modeling the problem, you should be able to express it as a series of
The algorithm for setting up the logic problem would be as follows:
Theoretically, setting this up and telling the solver to solve the problem should result in a set of valid "choices". At this point, you can define the "best choice" heuristic however you want, by adding weights (as @ide mentions above), and running another solver on this smaller set of choices (for example, weighting more recent versions higher). The nice part is, all of the complexity of generating/traversing up and down the dependency graph no longer needs to be handled by us, the logic solver will just "handle" all of that for us. Because we've already fetched all of the information anyway, we just need to keep track of the final list of packages and their versions, and use the manifests we already fetched to install only those packages.
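To make the encoding concrete, here's a tiny self-contained sketch of the idea, with a brute-force search standing in for logic-solver and made-up package data: each `name@version` pair becomes a boolean variable, with "pick one version of each required package" and "choosing a version implies choosing a satisfying version of each of its dependencies" constraints.

```javascript
// Boolean variables: one per candidate name@version pair (made-up data).
const vars = ['a@1', 'a@2', 'b@1', 'b@2'];

// Constraints, expressed as predicates over an assignment m.
// A real SAT solver would take these as clauses instead.
const clauses = [
  (m) => m['a@1'] !== m['a@2'], // exactly one version of required package a
  (m) => !(m['b@1'] && m['b@2']), // at most one version of b
  (m) => !m['a@1'] || m['b@1'], // a@1 depends on b@1
  (m) => !m['a@2'] || m['b@2'], // a@2 depends on b@2
];

// Brute-force "solver": enumerate all assignments, return the first
// that satisfies every clause, or null if none exists.
function solve() {
  for (let bits = 0; bits < 1 << vars.length; bits++) {
    const m = {};
    vars.forEach((v, i) => (m[v] = Boolean(bits & (1 << i))));
    if (clauses.every((c) => c(m))) return m;
  }
  return null;
}
```

The enumeration is exponential and only there to show the shape of the encoding; logic-solver (or any CNF-based SAT solver) would replace it, and the weighting step described above would then rank the resulting solutions.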
Note that I'm still working on this, so I don't know the performance implications yet. It looks like Meteor is using this within their package manager, with more complex logic.
Love this! Thanks for thinking about the problem. The proposed algorithm looks good to me; my only concern is the performance of the algorithm with thousands of dependencies (this is NP-complete!). But even then we can add a separate command to generate the lock file first (so all subsequent installations can just read the lock file).
@dxu has already started working on the SAT solver approach. It may not be as flexible as the proposed algorithm, but it could be a great fit when in "--flat" mode. I think we should share our plans here to avoid duplicated efforts.
@dxu these are some great thoughts! lots to chew on. some quick responses to a couple points:
So yeah, I'm gonna make that argument
The algorithm @wycats has described should be able to guarantee that, if a solution is found, then the most recent version of each package is used that satisfies all mutual constraints. (Assuming highlander/no duplication - I haven't yet absorbed the implications of allowing duplicates into my thinking). If the goal were to find the solution with the "minimal" distance from selected to newest versions for the entire depgraph, that would require an extra layer where some sort of cost function (probably having a similar shape to meteor's version pricer) defines those distances, then we look for multiple solutions and pick the one with the minimum total cost. (Or, some entirely different approach.)
Either way, though, I think such an approach would necessarily have to explore the entire combinatorial version space to guarantee it had found the "optimal distance." That would eliminate one of the main escape valves a solver can use to minimize running time: we only have to find one solution, not all possible ones. That makes it worth making strategic choices about the order in which we explore deps, since the right order is likely to (further) minimize the amount of the space we have to explore.
In a broader sense, though, I'm sorta on the fence about whether it's a Good Thing™ to keep as close to the latest release as possible. I mean, as a goal within dev groups, I think that's probably reasonable, and it's certainly the way most teams I've been on have operated. But that doesn't necessarily mean tooling should operate in the same way. Absent user intent to the contrary, it might be better to default in a conservative direction - one that instead focuses on minimizing change, and thereby minimizing risk. (That's the motivation behind preferred versions, which I'm still experimenting with.)
For what it's worth I think this is extremely important. I have also found empirically that the longer it takes to find a solution, the more likely it is that the solution is conceptually wrong ("I found a version of Rails that is compatible with this version of ActiveMerchant. It's Rails 0.0.1, which didn't depend on ActiveSupport!")
Bundler and Cargo both implement "conservative updates", which optimizes for minimal change rather than newest versions. I strongly prefer that in the absence of a signal to the contrary (
Yeah, I should be mostly free. Perhaps we can coordinate through discord/email on a time that works for all of us.
This is true. With the logic-solver approach, the heuristic pricing becomes a more important problem than the solution generation, and most likely one that is easier to implement incorrectly.
Probably the main advantage of the solver approach is that it offloads the complexity of solving the dependency graph to the SAT solver and lets us focus on the heuristics, which allows for more flexibility compared to a deterministic algorithm. Practically, that flexibility might be a very rare need. I'm guessing performance of the algorithm probably isn't an issue for 99% of cases, given that it also powers Bundler and Cargo.
Fwiw the semver solver I prototyped on top of z3 a while ago when this came up in discussion for cargo allows for this; in general z3 permits weighting and optimal-solving by user-defined metrics, as of at least whenever the "z3opt" branch was integrated. It's definitely on trunk these days.