
Resolver performance regression in 21.2 #10201

Closed
1 task done
bblommers opened this issue Jul 25, 2021 · 101 comments
Assignees
Labels
C: dependency resolution About choosing which dependencies to install

Comments

@bblommers

bblommers commented Jul 25, 2021

Description

As of 21.2, the resolver decides it needs to download every possible version of multiple dependencies to determine which version is compatible with other requirements.

Note that this works fine for some dependencies (coverage, pytest_cov), as it only downloads a few.
For dependencies botocore and boto3, however, it downloads every possible version.

A minimal reproducible example can be found here:
https://github.com/bblommers/pipresolverbug

I'm not sure what the offending dependency/installation method is - I could only reproduce it with this specific workflow.

The CI for this project runs this specific example for both 21.1 and 21.2.
Version 21.1 completes in 15 seconds - 21.2 is still running after 20 minutes.

The CI logs can be found here:
https://github.com/bblommers/pipresolverbug/actions/runs/1064350454

Expected behavior

To only download the latest version of each dependency

pip version

21.2

Python version

3.7

OS

Linux (Ubuntu)

How to Reproduce

git clone git@github.com:bblommers/pipresolverbug.git
pip install -r req.txt

Output

Collecting botocore>=1.12.201
  Using cached botocore-1.21.2-py3-none-any.whl (7.7 MB)
Collecting boto3>=1.9.201
  Using cached boto3-1.18.1-py3-none-any.whl (131 kB)
Collecting botocore>=1.12.201
  Using cached botocore-1.21.1-py3-none-any.whl (7.7 MB)
Collecting boto3>=1.9.201
  Using cached boto3-1.18.0-py3-none-any.whl (131 kB)
Collecting botocore>=1.12.201
  Using cached botocore-1.21.0-py3-none-any.whl (7.7 MB)
Collecting boto3>=1.9.201
  Using cached boto3-1.17.112-py2.py3-none-any.whl (131 kB)
Collecting s3transfer<0.5.0,>=0.4.0
  Using cached s3transfer-0.4.2-py2.py3-none-any.whl (79 kB)
Collecting botocore>=1.12.201
  Using cached botocore-1.20.112-py2.py3-none-any.whl (7.7 MB)
INFO: pip is looking at multiple versions of boto3 to determine which version is compatible with other requirements. This could take a while.
Collecting boto3>=1.9.201
  Using cached boto3-1.17.111-py2.py3-none-any.whl (131 kB)
Collecting botocore>=1.12.201

@bblommers bblommers added S: needs triage Issues/PRs that need to be triaged type: bug A confirmed bug or unintended behavior labels Jul 25, 2021
@uranusjr
Member

This is not a bug. Please kindly read this section of the documentation: https://pip.pypa.io/en/stable/user_guide/#dependency-resolution-backtracking

To propose improvements to the dependency resolution algorithm, please see #9187.

@uranusjr uranusjr added resolution: duplicate Duplicate of an existing issue/PR and removed S: needs triage Issues/PRs that need to be triaged type: bug A confirmed bug or unintended behavior labels Jul 25, 2021
@uranusjr
Member

Reopening since I missed the part where you said 21.1 worked reasonably well. I'll investigate whether we can bring back the 21.1 performance for you.

@uranusjr uranusjr reopened this Jul 25, 2021
@uranusjr uranusjr added C: dependency resolution About choosing which dependencies to install C: new resolver and removed resolution: duplicate Duplicate of an existing issue/PR labels Jul 25, 2021
@uranusjr uranusjr modified the milestones: 21.2, 21.2.2 Jul 25, 2021
@uranusjr uranusjr self-assigned this Jul 25, 2021
@uranusjr uranusjr changed the title [21.2.x] Resolver downloading every version known to mankind Resolver performance regression in 21.2 Jul 25, 2021
@pradyunsg
Member

Collecting botocore>=1.12.201
  Using cached botocore-1.21.2-py3-none-any.whl (7.7 MB)
Collecting boto3>=1.9.201
  Using cached boto3-1.18.1-py3-none-any.whl (131 kB)
Collecting botocore>=1.12.201
  Using cached botocore-1.21.1-py3-none-any.whl (7.7 MB)
Collecting boto3>=1.9.201
  Using cached boto3-1.18.0-py3-none-any.whl (131 kB)
Collecting botocore>=1.12.201
  Using cached botocore-1.21.0-py3-none-any.whl (7.7 MB)

Looks like a state pollution issue!

@bblommers
Author

Reopening since I missed the part where you said 21.1 worked reasonably well. I'll investigate whether we can bring back the 21.1 performance for you.

Thanks @uranusjr. Checking the logs, it wasn't backtracking at all in 21.1 - it immediately accepted the latest dependency version (as it should).
So to my untrained eyes, 21.2 is backtracking when it shouldn't.

Hopefully the repo and CI logs help in narrowing this down, but let me know if there's anything else I can do.

@uranusjr
Member

Looks like a state pollution issue!

I think it’s simply trying to find a boto3 version that specifies a botocore version that is compatible with another dependency. The situation is probably something like this:

  1. A package depends on one of the dependencies of botocore, and is incompatible with botocore 1.21.1 (but botocore 1.21.1 itself is not excluded).
  2. Another package depends on boto3>=1.9.201. A lot of boto3 versions have a very specific botocore pin that's only satisfiable by botocore 1.21.1.
  3. So the resolver tries each boto3 version and repeatedly
    1. Try to find a compatible version of botocore (and find 1.21.1 compatible)
    2. Try to pin botocore 1.21.1
    3. Fail because botocore 1.21.1 contains an incompatibility with already pinned dependencies
    4. Backtrack to unpin boto3

Steps 3-ii and 3-iii can be optimised with tree pruning, but that still leaves 3-i. To have a useful optimisation, we need to find what exactly is causing botocore 1.21.1 to fail, and come up with something to avoid backtracking.
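
To make that loop concrete, here is a tiny, self-contained toy simulation of the pattern above - purely illustrative Python with made-up version data, not pip internals:

# Toy simulation of the backtracking pattern described above; not pip code.
# Each boto3 release carries a tight botocore pin, and some already-pinned
# package happens to exclude each of those botocore versions.
boto3_to_botocore_pin = {
    "1.18.1": "1.21.2",
    "1.18.0": "1.21.1",
    "1.17.112": "1.21.0",
}
botocore_conflicts_with_pinned = {"1.21.2", "1.21.1", "1.21.0"}

for boto3_version, botocore_version in boto3_to_botocore_pin.items():
    # 3-i: find the botocore version the pinned boto3 allows
    if botocore_version in botocore_conflicts_with_pinned:
        # 3-ii/3-iii: pinning botocore fails against already-pinned packages
        # 3-iv: unpin boto3 and move on to the next (older) release
        continue
    print("resolved with boto3", boto3_version, "and botocore", botocore_version)
    break
else:
    print("backtracked through every boto3 version without success")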

@uranusjr
Member

uranusjr commented Jul 25, 2021

Checking the logs, it wasn't backtracking at all in 21.1 - it immediately accepted the latest dependency version (as it should).

If you look closely, 21.1 still backtracks, but on different packages (specifically coverage and pytest-cov). It's simply “more lucky” (lacking a better description) and only needed to backtrack a few versions before finding one that works, aided by the fact that both coverage and pytest-cov are significantly smaller in size than boto3 and botocore. So it's more complicated than you likely imagined 🙂

@bblommers
Author

If you look closely, 21.1 still backtracks, but on different packages (specifically coverage and pytest-cov).

Yeah, I should have been more specific - I meant that 21.1 is not backtracking on botocore/boto3, so I'm surprised it needs to backtrack now with 21.2, considering the botocore/boto3 versions are the same.

So it's more complicated than you likely imagined 🙂

It always is 🙂

@pfmoore
Member

pfmoore commented Jul 25, 2021

One of the difficulties here is that botocore has so many releases (1224 in my extract, but it's probably gone up since then). With the current limitations of the packaging infrastructure (getting metadata needs a download), that's always going to be an issue.

Less than 0.8% of packages on PyPI have more than 100 releases, so in principle it's entirely reasonable for the trade-offs that pip makes to favour projects with fewer releases. Unfortunately, projects like botocore are too popular to be considered "outliers" in a general sense.

I wonder if we could prioritise projects with fewer releases over those with more? Or is that something we've already tried?

On the other hand, given that it is only a few projects, maybe we should be looking at asking those projects to meet us half way? Maybe botocore/boto3 could yank older, unsupported versions? The support page isn't very clear on precisely which minor versions are supported for how long, but I can't imagine Amazon support all 1224 versions of botocore...

@pradyunsg
Member

Didn't we try to resolve projects with fewer releases before projects with lots of releases? Or am I remembering it the other way round?

@uranusjr
Member

We only resolve packages with "tighter constraints" first, but don't actually count the number of releases. I think the main reason is there's currently no mechanism to do that; the information is of course available on the index page, but we don't have a way to bubble it up past PackageFinder.
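
Purely as an illustration of what bubbling that information up could enable (the names below are hypothetical, not pip's actual provider interface): the release count could simply become one more component of the preference sort key, e.g.

# Hypothetical sketch only - not pip's real preference code.
# Smaller keys are preferred: tighter-constrained names first, and among
# those, names with fewer available releases (less to download/backtrack).
def preference_key(name, requirement_count, release_count):
    return (-requirement_count, release_count, name)

projects = [
    ("botocore", 1, 1224),  # one constraint, ~1224 releases
    ("coverage", 1, 60),    # one constraint, far fewer releases
]
projects.sort(key=lambda p: preference_key(*p))
print(projects[0][0])  # coverage would be resolved before botocore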

@notatallshaw
Member

notatallshaw commented Jul 26, 2021

Less than 0.8% of packages on PyPI have more than 100 releases, so in principle it's entirely reasonable for the trade-offs that pip makes to favour projects with fewer releases. Unfortunately, projects like botocore are too popular to be considered "outliers" in a general sense.

I wonder if we could prioritise projects with fewer releases over those with more? Or is that something we've already tried?

So while it's true that most packages only have a few releases, unfortunately, looking at the top 10 packages on PyPI, 50% have over 100 releases: botocore, awscli, boto3, requests, setuptools.

In the past (not sure if it's still true) setuptools has been identified in the resolver as a special case - I think it was to deprioritize checking for older versions? IMO it makes sense to add the other 4 packages listed here to that special case, but at the very least add botocore, awscli, and boto3, as it looks like all 3 packages release daily.

What makes the pip resolver's dependency resolution interesting vs. other dependency resolvers like conda is that exploring the dependency graph has a non-trivial cost. It makes intuitive sense that packages with very frequent releases (e.g. daily) will not change their dependency requirements every release, and therefore backtracking through those packages is going to be highly costly (this is my intuition and unsubstantiated by any data; I'll have a think about whether there's any way I could empirically collect this information). But as @uranusjr points out, this information currently isn't propagated to the pip resolver. So maybe it makes sense to special-case very popular, very frequently released packages?

My 2 cents.

@notatallshaw
Member

notatallshaw commented Aug 29, 2021

I put together another commit that focuses only on Hypothesis 2, i.e. it tries to resolve the current conflicts before anything else.

It prefers the causes, and parents of causes, of the last known conflicts. In my testing it fixes the OP's issue and the other test cases I have to hand. The hypothesis here is that "if package A version n depends on package B, then package A version n-1 probably also depends on package B". Therefore, "if packages A and C both depend on package B but on different versions of package B, the resolution will probably be faster if it focuses on packages A and C and not on some package X which does not depend on package B".

Feel free to try here, I make no promises it will improve your particular case: python -m pip install git+git://github.com/notatallshaw/pip@third_attempt_at_prefer_non_conflicts

Code diff: notatallshaw/pip@21.2.3...notatallshaw:third_attempt_at_prefer_non_conflicts

Compared to before, it doesn't try to hack the resolution steps of pip, so its worst case is the same as the current worst case. I think this approach is now simple enough that it's worth me trying to work on a real PR rather than just PoC commits. FYI it also includes a fix for #10415, which I think may also help some use cases.
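
The core of the idea can be sketched in a few self-contained lines - an illustration of the hypothesis only, not the actual diff linked above:

# Illustrative sketch, not the real PoC: names involved in the most recent
# conflict (the causes and their parents) sort to the front of the
# preference order, so the resolver works on the actual conflict first.
def prefer_conflict_causes(name, conflict_names):
    involved = name in conflict_names
    return (not involved, name)  # False < True, so conflict names come first

conflict_names = {"boto3", "botocore"}  # causes + parents of causes
pending = ["pytest-cov", "botocore", "coverage", "boto3"]
pending.sort(key=lambda n: prefer_conflict_causes(n, conflict_names))
print(pending)  # ['boto3', 'botocore', 'coverage', 'pytest-cov']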

@notatallshaw
Member

notatallshaw commented Aug 29, 2021

Fourth reproducible test case:

django-python3-ldap==0.12.0
django-ses==2.2.1
dj-database-url==0.5.0
kiwitcms-github-app==1.3.0
kiwitcms-tenants==1.6.0
raven==6.10.0
social-auth-app-django==5.0.0
social-auth-kerberos==0.2.4
python3-saml==1.12.0
flake8
kiwitcms
pylint-django
textdistance
twine
readme_renderer[md]
wheel

Further, the below test case has no solution (kiwitcms changed to kiwitcms>=10.3), but my PoC commit finds it to be impossible much faster, presumably because focusing on the failure causes narrows down the possible solution space much more quickly:

django-python3-ldap==0.12.0
django-ses==2.2.1
dj-database-url==0.5.0
kiwitcms-github-app==1.3.0
kiwitcms-tenants==1.6.0
raven==6.10.0
social-auth-app-django==5.0.0
social-auth-kerberos==0.2.4
python3-saml==1.12.0
flake8
kiwitcms>=10.3
pylint-django
textdistance
twine
readme_renderer[md]
wheel

@notatallshaw
Member

notatallshaw commented Aug 30, 2021

Second ResolutionImpossible test case, taken from #9187.

I don't think anyone has run the current pip resolver long enough for this to reach ResolutionImpossible. But on my latest patched version of pip it gets to ResolutionImpossible in a few minutes. Again, I think that because it prefers the failure causes it narrows down the possible search space much more efficiently, at least for real world dependency trees (it's definitely possible to construct an artificial dependency tree where this approach is much slower).

Linux only; the Python package six must be installed first, along with the OS packages libssl-dev, libmysqlclient-dev, and enchant.

bleach==3.1.1
borgbackup==1.1.9
celery[redis]==4.4.5
cssselect==1.0.0
Cython==0.29.14
diff-match-patch==20200713
Django==3.1
django-appconf==1.0.3
django-compressor==2.4
django-crispy-forms==1.9.0
django-filter==2.4.0
django-redis==4.11.0
djangorestframework==3.11.0
filelock==3.0.0
GitPython==2.1.15
hiredis==1.0.1
html2text==2019.8.11
jellyfish==0.7.2
jsonschema==3.0.0
lxml==4.4.0
methodtools==0.4.2
misaka==2.1.0
openpyxl==2.6.0
Pillow==6.0.0
pycairo==1.15.3
pygobject==3.27.0
pyparsing==2.4.0
python-dateutil==2.8.1
python-redis-lock==3.4.0
requests==2.20.0
sentry-sdk==0.13.0
setuptools==36.0.1
siphashc==1.2
social-auth-app-django==4.0.0
social-auth-core==3.3.3
translate-toolkit==3.1.1
translation-finder==2.6
user-agents==2.0
weblate-language-data==2020.13
weblate-schemas==0.4
Whoosh==2.7.4
aeidon==1.6.0
akismet==1.0.1
bleach==3.1.1
borgbackup==1.1.9
boto3==1.15.0
celery[redis]==4.4.5
chardet==3.0.4
cssselect==1.0.0
Cython==0.29.14
diff-match-patch==20200713
Django==3.1
django-appconf==1.0.3
django-auth-ldap==1.3.0
django-compressor==2.4
django-crispy-forms==1.9.0
django-filter==2.4.0
django-redis==4.11.0
djangorestframework==3.11.0
filelock==3.0.0
git-review==1.27.0
GitPython==2.1.15
google-cloud-translate==3.0.0
hiredis==1.0.1
html2text==2019.8.11
iniparse==0.5
jellyfish==0.7.2
jsonschema==3.0.0
lxml==4.4.0
Mercurial==5.2
methodtools==0.4.2
misaka==2.1.0
mysqlclient==2.0.1
openpyxl==2.6.0
phply==1.2.5
Pillow==6.0.0
psycopg2-binary==2.7.7
pycairo==1.15.3
pygobject==3.27.0
pyparsing==2.4.0
python-dateutil==2.8.1
python-redis-lock==3.4.0
python3-saml==1.2.1
requests==2.20.0
ruamel.yaml==0.16.0
sentry-sdk==0.13.0
setuptools==36.0.1
siphashc==1.2
social-auth-app-django==4.0.0
social-auth-core==3.3.3
tesserocr==2.3.0
translate-toolkit==3.1.1
translation-finder==2.6
user-agents==2.0
weblate-language-data==2020.13
weblate-schemas==0.4
Whoosh==2.7.4
zeep==3.2.0

@bblommers
Author

Thanks for continuing to look into this @notatallshaw - much appreciated.
I can confirm that the third approach fixes the problem for both my simplified POC and the original, more complex, project. 👍

@notatallshaw
Member

notatallshaw commented Aug 30, 2021

I tried this known extremely problematic test case:

apache-airflow[all]==1.10.13

While I believe my patched version of pip will solve this many orders of magnitude faster than the current resolver, on my computer it still didn't resolve after an hour.

The problem is that one of the failing causes is moto, which by the time it becomes a failing cause was pinned 55 packages earlier, and there is still a large search space to go through before pip will backtrack far enough to unpin moto, so it doesn't resolve for a long time. I believe this kind of situation could be brought down to a "reasonable" time by "unpinning" any failure causes (in this case moto) immediately rather than waiting for the backtrack, but this would require mutating the state in a very careful way.

Mutating the state like this would be very complicated and require a lot of code / test cases, whereas my current proposed PoC is just a few extra lines and probably just a few extra test cases. I therefore think that while I've shown many real world test cases to be dramatically improved in performance by the current PoC, its limit is where failure causes are already pinned and changing the preference order won't cause them to be backtracked any time soon. That case would need a different, more carefully thought out optimization.

Thanks for continuing to look into this @notatallshaw - much appreciated.
I can confirm that the third approach fixes the problem for both my simplified POC and the original, more complex, project. 👍

Thanks for testing, really appreciate it!

@potiuk
Contributor

potiuk commented Aug 30, 2021

Side comment. I really like how apache-airflow[all]==1.10.13 became the testing ground for PIP ;)

@notatallshaw
Member

notatallshaw commented Aug 31, 2021

Fifth real world test case (taken from #9187 (comment)), where my PoC installs the requirements in a few minutes but the current resolver gets stuck, taking an unknown amount of time to install:

(Linux only, may require installing some OS libs)

PyPDF2
amzqr
catsql
cloudscraper
colorama
convertdate
coursera-dl
daudin
ddgr
dephell[full]
ezgmail
fanficfare
fastapi[all]
fire
gallery-dl
gay
gensim
git+https://github.com/NightMachinary/BrishGarden.git
git+https://github.com/NightMachinary/brish.git
git+https://github.com/wustho/epr.git
glances[action,browser,cloud,cpuinfo,docker,export,folders,gpu,graph,ip,raid,snmp,web,wifi]
gnureadline
google-api-python-client
google-auth-httplib2
google-auth-oauthlib
google_images_download
hachoir
hiredis
html5lib
httpie
https://github.com/jakubroztocil/httpie/archive/master.tar.gz
icecream
imdbpy
instaloader
ipydex
ipython
iredis
jedi
jill
json-rpc
libgen.py
libgenapi
lightnovel-crawler
magic-wormhole
metadata-parser
metadata_parser
neovim
nest_asyncio
numba
numpy
ocrmypdf
opencv-python
orger
pandocfilters
passlib[bcrypt,argon2,totp]
pdfCropMargins
peewee
pexpect
piep
pipx
pixcat
plumbum
praw
psaw
pypyp
python-jose[cryptography]
python-multipart
python-telegram-bot
pyyaml
rich
scipy
service_factory
spotdl
subliminal
sumy
syncer
text2num
tgcrypto
tqdm
waybackpack
wifi-password
wikipedia
wikiquotes
xkcdpass
yapf
yq
zxcvbn

@pfmoore
Member

pfmoore commented Aug 31, 2021

First of all, the work you're doing is really appreciated, so please don't take this comment the wrong way. But have you collected any evidence as to whether your approach makes any cases that are currently fast, slower? Obviously it's almost impossible to know how to check that - maybe it would be worth looking at some of the issues that were slow before 21.2, which the changes in 21.2 improved? Confirming that we don't make them worse again would be a helpful data point.

Actually, what would be perfect would be if we could add performance tests to the test suite, so we could capture some of these bad cases and ensure we don't regress. But I haven't the foggiest idea how we could do this 🙁

@edmorley
Contributor

what would be perfect would be if we could add performance tests to the test suite, so we could capture some of these bad cases and ensure we don't regress. But I haven't the foggiest idea how we could do this

Instead of measuring performance in terms of, say, duration, how about doing so in terms of the expected number of backtracks/...? This would presumably be less flaky than traditional performance tests.
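
As a rough sketch of what such a test could look like (the resolve_offline helper and the hook name below are assumptions for illustration, not an existing pip test API), a reporter-style callback could count discarded pins against a fixed offline index snapshot, and the test would assert a budget rather than a duration:

# Hedged sketch only: count backtracks via a reporter-style callback
# instead of timing the run. resolve_offline() is a hypothetical helper
# that resolves a captured requirements file against a local index snapshot.
class CountingReporter:
    def __init__(self):
        self.backtracks = 0

    def backtracking(self, candidate):
        # called whenever the resolver discards a previously pinned candidate
        self.backtracks += 1

def test_boto3_case_stays_within_backtrack_budget():
    reporter = CountingReporter()
    resolve_offline("tests/data/resolver/boto3_case.txt", reporter=reporter)
    assert reporter.backtracks <= 10  # regression budget, not wall-clock time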

@notatallshaw
Member

notatallshaw commented Aug 31, 2021

First of all, the work you're doing is really appreciated, so please don't take this comment the wrong way. But have you collected any evidence as to whether your approach makes any cases that are currently fast, slower?

Not taken the wrong way - I would really like to test this and find examples. But a couple of points: 1) if the resolver never backtracks there will be no difference; 2) as my latest version only changes the preference order, the worst possible times are the same.

In theory, I think, the worst possible case for my optimization (and therefore presumably slower than the current resolver) would be: package A version n depends on packages B and C, but they are both 100% incompatible; package A version n-1 depends on 2 different packages, D and E, which are also incompatible; and so on, until eventually package A version 1 installs. Packages B, C, D, E, ... have similar dependency incompatibilities to package A.

But I don't think this represents a real world dependency tree. The assumption of the optimization is that if package A version n depends on package B, then package A version n-1 probably also depends on package B. Where this doesn't hold the optimization will be slower, but I think changing dependencies usually only happens once in a while, and when the dependencies do change the user presumably changes them to a valid installable set, in which case my optimization should still be faster.

Obviously it's almost impossible to know how to check that - maybe it would be worth looking at some of the issues that were slow before 21.2, which the changes in 21.2 improved? Confirming that we don't make them worse again would be a helpful data point.

I have been backtracking through open and closed issues that look performance based, trying to find any real world reproducible examples; there are shockingly few - most users just write something to the effect of "takes too long" and don't provide an actual reproducible example.

If you have specific cases that improved from 21.1 to 21.2 I would love to test them. But I would like to point out that my optimization only adds 1 extra parameter to the pinning preference; most of the improvements that came with 21.2 are also included in my optimization (in fact, even better, I fixed a bug in one of those improvements: #10415).

Actually, what would be perfect would be if we could add performance tests to the test suite, so we could capture some of these bad cases and ensure we don't regress. But I haven't the foggiest idea how we could do this 🙁

An idea I had was to build a tool that takes a real world test case and turns it into an offline, repeatable test case. I imagine this would be pretty tricky but hopefully possible.

But as of right now I cannot get pip's test suite to even start to run on my machine; I spent an hour trying yesterday and tox/pytest just keep throwing cryptic errors at me. I'll give it another shot at the weekend when I have more time to work on it.

Instead of measuring performance in terms of say duration, how about doing so in terms of expected number of backtracks/...? This would presumably be less flaky than traditional performance tests.

@edmorley I think the issue is having any kind of benchmark suite at all right now; yes, the performance should probably be measured in iterations, not time.

@notatallshaw
Member

I have summarized my proposal to assist with dependency trees like this and linked the PRs here: #10479

If anyone has any ideas of how to improve or where I need to make a stronger argument let me know.

@pradyunsg
Member

Dropping this from the release milestone, since there are no clear actionable items listed here -- and #10481 is already part of the milestone independently.

@pradyunsg pradyunsg removed this from the 21.3 milestone Oct 6, 2021
@uranusjr
Member

uranusjr commented Oct 6, 2021

@notatallshaw Are all (applicable) reports in this issue in your test cases already? If that's the case, I feel we can just close this one and track those cases on their own.

@notatallshaw
Member

@notatallshaw Are all (applicable) reports in this issue in your test cases already? If that's the case, I feel we can just close this one and track those cases on their own.

Yes, of the ones I could reproduce. Some I simply could never reproduce, and some have had their dependencies fixed (in those cases I've done my best to keep the test cases working by adding {package} <= {last known bad version} to my test requirement files).

I agree on closing, and once #10481 lands I think all pip issues related to heavy backtracking should be closed (e.g. #10373 #9187 #9215), as any new reports should be tested against that; if they are still experiencing heavy backtracking it will take further analysis, and the reasoning will not be the same.

@uranusjr uranusjr closed this as completed Oct 6, 2021
@takluyver
Member

what would be perfect would be if we could add performance tests to the test suite, so we could capture some of these bad cases and ensure we don't regress. But I haven't the foggiest idea how we could do this 🙁

The scientific Python world has a tool, airspeed velocity, which runs the same benchmarks against different commits in a project's history, trying to identify performance regressions. It's more usually used for CPU-bound number crunching (I'm guessing this is network bound?), but when you're looking at order-of-magnitude changes from different recursive algorithms, it's probably easy to distinguish signal from noise.

And of course you could take the machinery for walking a project's history, running benchmarks, storing & viewing results, and change what it actually measures when running benchmarks.
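
For instance, a benchmark along these lines (run_resolver_offline is a hypothetical helper that resolves a captured case against a local index snapshot) could report resolution rounds via asv's track_* mechanism instead of wall-clock time:

# Sketch of an asv-style benchmark file, e.g. benchmarks/resolver.py.
# asv records whatever number a track_* method returns, so the metric can
# be resolution rounds rather than seconds. run_resolver_offline() is an
# assumed helper, not an existing one.
class ResolverBacktracking:
    def setup(self):
        self.case = "benchmarks/cases/boto3_regression.txt"

    def track_resolution_rounds(self):
        return run_resolver_offline(self.case).rounds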

@pradyunsg
Member

See #9187 (comment), if you're here to report the behaviour of pip's dependency resolver.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 12, 2021