Consider using BFS instead of DFS for backtracking #10884
Have you tried implementing BFS? Do you have some results of how well it behaves in general? I have a list of problematic requirements here you might want to try it on: #10481 (comment). In my investigation of trying to improve backtracking I initially also thought BFS might be a good solution, but in my PoC code it ended up being slower in most cases (though that could have just been a buggy implementation). I personally think that backjumping might be the way to improve performance, or at least taking some part of it, as I write here. I have a very hacky backjumping implementation here; I tested your requirements and they install very quickly compared to current pip (in fact I can't even install your requirements on pip 22.0.3, because when hacking backtracks to such an old version it errors out generating metadata).
Backjumping seems interesting from the performance point of view. But my main concern is not actually speed; it's consistency and the ability to install more or less fresh versions of all the packages, i.e. not driving one of them to the ground. As for BFS, I didn't try to implement it as I'm not familiar with the code; it's more of a suggestion based on user experience. I understand that BFS may not be optimal if the answer is really at the bottom of the search tree, but I hope that performance stays within sane boundaries for normal cases. And if the performance is not horrible, I, as a user, am willing to sacrifice speed for stability and predictability. We're using pip to install requirements while running automated CI, and it's much more important for CI to be stable than a bit faster. I'm able to install flake8 and hacking with 22.0.3 just fine, even though it takes a few seconds (I only had to install wheel beforehand). And since you mentioned it, that is another problem of DFS and, likely, backjumping too: a good chance of a build failure or some other error while trying to build a very old version of a package. Current pip will abort backtracking at this point, leaving the user with no packages installed at all. BFS, I think, should be better protected from such failures, since it will not go that deep in most cases.
That definitely feels intuitive, however in real-world dependency trees I am not convinced this would be true, because the version depth of one package cannot be compared to the version depth of another: going, say, five versions back may span weeks for one package and years for another. I think the only real way to protect yourself against ancient versions of packages, for stability, is to put in lower bounds, either in your requirements file or constraints file. Pip can't know ahead of time whether the metadata will fail to build on your machine or not. I believe one intention, beyond the primary reason of catching missing system dependencies, of erroring out when a package fails to build metadata is to reveal backtracking-too-far issues and encourage you, and the ecosystem in general, to put in sensible lower bounds. But I am not a pip maintainer; I have just been working a little bit on improving pip's backtracking performance. You should know this project is volunteer driven, and ultimately changes are made by people committing PRs so they can be evaluated in practice.
You summed up the situation quite well, actually. One of the reasons DFS was initially chosen over BFS is that both can have terrible performance in some situations, but it is relatively easy to identify and work around cases where DFS has disastrous performance (by adding a lower bound when you see pip is going too deep for a package), while it is not obvious at all how you could improve the worst-case scenario for BFS (it is possible, but far less obvious without deeper investigation).
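To make the tradeoff concrete, here is a toy sketch (mine, for illustration; not pip's resolver) comparing the two visit orders on an invented two-package problem shaped like the flake8/hacking case: the newest pair conflicts, and a near-newest pair works. The version range and the `is_compatible` rule are both made up.

```python
from itertools import product

VERSIONS = list(range(10, 0, -1))  # 10 = newest, 1 = oldest

def is_compatible(first, second):
    # Invented rule: `second` requires `first` to be exactly one version
    # behind it, so keeping `first` pinned at 10 can never succeed.
    return first == second - 1

def count_tries(order):
    for tries, (first, second) in enumerate(order, start=1):
        if is_compatible(first, second):
            return tries, (first, second)

# DFS-like: pin the first package, drive the second to the ground, repeat.
dfs = product(VERSIONS, VERSIONS)

# BFS-like: visit pairs in order of total downgrade distance from newest.
bfs = sorted(product(VERSIONS, VERSIONS), key=lambda p: 20 - p[0] - p[1])

print("DFS:", count_tries(dfs))  # (11, (9, 10)) -- 11 combinations tried
print("BFS:", count_tries(bfs))  # (3, (9, 10))  -- 3 combinations tried
```

If, instead, the only working pair is very old, both orders degrade to sweeping nearly the whole space, which is part of why neither is universally better.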
Is the package version release date available in metadata? Maybe that could be part of the picture?
It is not. A list of information available at resolution time can be found here: https://packaging.python.org/en/latest/specifications/core-metadata/ Additionally we can obtain the URL a package is downloaded from (obviously) and the artifact hash.
In general I think it would be good to try possible different strategies outside pip. The problem is that so many people use pip that every possible edge case is touched almost immediately after a release is made, and there isn't currently a good way for pip to test any of these. I'm working on a personal project to implement the bare-bones functionality of pip (get, resolve, install) where each section has a clear API and can be easily replaced or extended. I don't know if it'll get to the state where I can share it, but I think someone needs to do something like this so more radical ideas can be tested without impacting millions of developers. For example, one approach is to use the PyPI JSON API, which does include the upload date. However, this API isn't standardized and would only be useful for PyPI and not other indexes, so it's not something that can be added to pip.
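For what it's worth, a minimal sketch of reading upload dates from that PyPI JSON API (`https://pypi.org/pypi/<project>/json`); the project and version queried at the end are just examples:

```python
import json
from urllib.request import urlopen

def upload_dates(project):
    # PyPI-specific endpoint, not a packaging standard -- which is exactly
    # why other indexes (and therefore pip itself) can't rely on it.
    with urlopen(f"https://pypi.org/pypi/{project}/json") as resp:
        data = json.load(resp)
    dates = {}
    for version, files in data["releases"].items():
        if files:  # some releases have had all of their files removed
            dates[version] = min(f["upload_time_iso_8601"] for f in files)
    return dates

print(upload_dates("hacking").get("4.1.0"))  # e.g. an ISO 8601 timestamp
```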
I think that idea could really pay off, not only in allowing easier experimentation but also in helping clarify where it makes sense to draw API boundaries (what things should be decoupled or combined). In the meanwhile, I'm left wondering if it makes sense to make a GitHub project, or at least a label, for resolver/backtracking related things? Tickets like this one would fit under that umbrella.
I think it's fairly unlikely that we'd make such a change in pip, especially given that we don't have the same amount of available time for change management and developer time as we did when we rolled out the resolver. As noted already, there are specific advantages to DFS (notably, that it's easier for an end user to control and to reason about). Switching to BFS does not solve all the issues; it just changes the set of tradeoffs involved, and it's not universally better either. Further, pip already does a first round of breadth-first-style sweep of the top-level requirements, to allow users to have the control that they wish to have on the resolution process -- so we do already have some of the top-level benefits. If folks wish to implement alternative dependency resolution algorithms and feed those into pip, that can be done today, by spewing out a requirements.txt file that's installed with `pip install -r`.
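As a sketch of that workflow (the `my_resolver` function below is a hypothetical stand-in for whatever alternative algorithm you want to try):

```python
import subprocess
import sys

def my_resolver(requirements):
    # Hypothetical: resolve with your own strategy (BFS, backjumping, ...)
    # and return exact pins, e.g. {"flake8": "4.0.1", "hacking": "4.1.0"}.
    raise NotImplementedError

def install(requirements):
    pins = my_resolver(requirements)
    with open("resolved-requirements.txt", "w") as f:
        for name, version in pins.items():
            f.write(f"{name}=={version}\n")
    # Let pip install the already-resolved set; --no-deps keeps it from
    # re-resolving on top of our pins.
    subprocess.check_call([
        sys.executable, "-m", "pip", "install",
        "--no-deps", "-r", "resolved-requirements.txt",
    ])
```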
I'm going to go ahead and close this, based on this rationale. Appreciate the discussion here folks! ^>^
Thanks for the discussion, and sorry for not participating much. I agree that it doesn't look possible to make a significant improvement by changing the algorithm, since any implementation will have its own corner cases. At least, until the actual release date is available during the backtracking, so some meaningful estimations can be made. Closing for now is OK with me.
FYI, I'm still of the opinion that some form of backjumping would significantly improve most current heavy-backtracking cases. My hacky implementation worked very well on your original example and other examples posted to pip's issues. The problem is that backjumping is difficult to implement and to understand how it will affect backtracking in general, so it is not as clear a win as #10481 (comment) was. That, combined with only a few heavy-backtracking cases being posted to pip's issues since 21.3, not personally running into heavy backtracking in my own projects, and good lower bounds on requirements fixing most issues, means I've not been motivated to develop or push for it. But if someone else is very motivated to improve the situation, I would humbly suggest they read my latest thoughts at sarugaku/resolvelib#64 (comment).
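For anyone curious, the core idea can be shown in a few lines (my simplified illustration, not the implementation from the linked work or from resolvelib):

```python
# `pins` is the stack of (package, version) decisions, oldest first;
# `conflict_causes` is the set of package names implicated in the
# failure that just occurred.
def backjump(pins, conflict_causes):
    for i in range(len(pins) - 1, -1, -1):
        if pins[i][0] in conflict_causes:
            # Jump straight back past every unrelated decision. Plain
            # backtracking would only undo pins[-1], even when the real
            # culprit was pinned much earlier.
            return pins[:i]
    return []  # no pinned package is involved: the conflict is inherent

pins = [("hacking", "4.1.0"), ("six", "1.16.0"), ("flake8", "3.8.3")]
print(backjump(pins, {"hacking"}))  # [] -- skips past six and flake8
```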
What's the problem this feature will solve?
Current pip will drive the version of one of the packages to the ground looking for a compatible one, then reduce the version of a second package a little and drive the first package to the ground again, i.e. perform a kind of depth-first search (DFS). This approach has a few issues; among them, the result depends on the order of packages in the `pip install` command.

An example of the behavior:
Here it downloads a huge pile of different versions of hacking, from 4.1 down to 0.5, taking a lot of time and also ending up with a broken combination. To be clear, I'm not blaming pip for the incompatibility of the result; this is probably an issue with the packaging of that very old version of `hacking`. However, time is wasted and the result is a broken flake8 command.

While:
Here we can see that pip started downgrading from the flake8 package and found a working combination almost immediately, with reasonably fresh versions of both packages.
Describe the solution you'd like
A solution might be to use breadth-first search (BFS) instead, i.e. not driving one package to the ground, but gradually lowering the versions of all of them, one at a time, until a good combination is found (see the sketch below).

So, for the `flake8` and `hacking` problem, a solution can be found quickly regardless of the order in which these packages appear on the command line. This search order should also provide better compatibility between installed packages, because we should end up with more or less fresh versions of all of them. And it should help avoid badly packaged, very old versions of some software.
Alternative Solutions
A workaround is to manually limit the versions of some packages, or shuffle them around in the command line until `pip` installs what you need. But that's no fun if you're not a developer of these packages and hardly know what the version requirements should be. And it gets even worse when you need a big number of packages installed and it's unclear which of them is actually causing the backtracking problem. A minimal requirements-file example of the version-limiting workaround follows.
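The `hacking>=3.0` bound below matches the fix described in the next section; the other contents are illustrative:

```
# requirements.txt: the lower bound keeps pip from backtracking into
# ancient, badly packaged hacking releases
flake8
hacking>=3.0
```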
Additional context

Along with the aborted-backtracking problem, we had to spend a significant amount of time -- while trying to fix CI for the openvswitch project -- trying to figure out why `flake8` checks were not performed during the build. We ended up with a `>=3.0` limit for the `hacking` package; otherwise pip installs `hacking-0.5.4` and `flake8` just throws an exception on startup. At this point our configuration script decides that `flake8` is broken and unavailable: openvswitch/ovs@d545300

With the version limit in place, pip installs a working combination.