New resolver takes 1-2 hours to install a large requirements file #12314
Comments
Hey, I'll take a look at this in more detail tonight but just quickly:
If I'm reading your graph correctly 9% of the time is spent on version objects? So, if I understand correctly, that's not the same as creating and not really the bulk of the time. But feel free to correct me. Edit: I see you already have a PR that proves your point, exciting! Please ignore everything I said 🙂 |
I was doing profiling on my laptop which means the ssl read / network latency overhead is much higher than what actually happens in production. The top of the profile calls to |
Well I got home and tried to reproduce this (had a lot of fun learning how to install Python 3.11 into an Alpine WSL 2 distro) but I could not produce a significant performance improvement with your branch as you suggest. Some notes:
So first install pip/main and build cache
And then collect timing:
Which produced:
And then installing your branch and collect timing:
Which produced:
Which potentially is a real performance gain, but it seems less than 2%. Can I suggest you test again on your side using pip/main instead of pip/23.2.1, perhaps the performance difference is something that already landed? Or maybe a problem in the way alpine has packaged/vendored pip? |
Thanks for testing. I think the major difference is going to be that we are running our image builds for arm under qemu which is much more performance sensitive. |
But why just the version object creation? Wouldn't being on a worse-performing machine cause everything to be slower? Have you tried running vanilla pip/main with no profiling enabled? As a side note if performance is so critical to you here you might want to try https://github.com/prefix-dev/rip once it is released. I tried testing now but getting it to work under Alpine defeated me. |
It would probably be helpful to see an actual production run : https://github.com/home-assistant/core/actions/runs/6408628446/job/17398078969
It was the thing that made the most difference in testing
Yes, everything will be worse
Yes
Thanks for that. |
I'm going to blow away my test env, and try again with your testing steps above. |
I'm going to try the following. Create a fresh venv, and install the packages
Once the packages are in the venv, just do the resolution with:
|
One thing that stands out is the image builds run with |
With the above testing main once all packages installed:
version_cache once all packages installed:
|
So it looks like the pip install isn't actually working |
well I forgot to change the path. it should be |
Even with that the file isn't getting updated? |
Weird, when I next get a chance I will confirm the correct version is installed in my tests, try doing "--force" on the install if you haven't already. |
I manually checked it out and installed it with
|
Well that's very definitive, I will rerun my tests as soon as I can, sorry if this was a rabbit hole I led you down. |
Retested this and made sure I was testing the right version this time:
Still get no performance improvement, I also tried installing directly from source, when I get a chance I will try running profiling on the two versions. Otherwise I guess there is something fundamentally slow about this version parsing in this "arm under qemu" environment vs. an x86 environment 😕 |
I can reproduce the performance difference! I had to take network out of the equation:
Implementing this methodology I get ~3m 40s on Pip main and ~2m 30s on your branch. Further this is the call graph I get on Pip main: Looking at this it seems that the work for me is split into two categories, finding candidates and dealing with specifiers. And Version construction looks to be the majority of the latter, for some reason in your environment I guess finding candidates is not slow. I am investigating if a caching layer can be added at the Pip level instead of the lower vendor level like your PR proposed, also I'm looking at both finding candidates and specifiers. But this is way outside my expertise so if someone else would like to submit a PR instead that would be great. |
Excellent. I don't know this codebase well enough to tackle a solution at the pip level that would ultimately avoid all the creation of the same Version objects. Hopefully someone is interested in taking on the challenge as I'm likely to strike out on another attempt at a solution given my lack of familiarity with this code and the design choices. |
Can you give this branch a try and let me know if you see performance improvement or not? https://github.com/notatallshaw/pip/tree/version_cache It basically tries to implement caching at all the entry points into the packaging library (and also path to url functions): main...notatallshaw:pip:version_cache I actually think the only two significant performance improvements it lands is the path to url function and caching specifier contains result. But if you do see performance improvements I will break out each cache layer addition, test them, and submit a PR if appropriate. I have also made this issue on the packaging repo as there doesn't seem to be a way to cache one of the biggest offenders: pypa/packaging#729 |
Will give it a shot tomorrow after some sleep. 👍 |
This conversation has got quite long and it is difficult to follow how to reproduce. I am going to make issue/PR pairs that highlight very specific performance issues I have found and with detailed instructions on how to reproduce (without requiring Alpine). I am going to start with the simplest to solve and wait for feedback from Pip maintainers before continuing to more complex ones. |
It's better than |
Okay great, well at least I'm not completely on the wrong track. However based on the feedback from the first PR I opened it's probably going to be less trivial than adding caches everywhere it's possible. I'm going to investigate a bit more to see if I can come up with a more elegant solution to solving that first bottleneck I've identified. But if anyone else wants to jump in with PRs I would be highly supportive! |
I'm very confused here. This issue talks about creating Is there a usable analysis anywhere (sorry, I don't know how to read callgrind data) that pinpoints what is going on here, explains why @notatallshaw is getting vastly faster results, and confirms that a significant portion of the reported 1-2 hours runtime is being taken by pip (as opposed to by network traffic, for example) and what parts of pip are at fault? The screenshot says that 9.37% of something is taken up by an init call linked to version, and it's called 9 million times. That seems a lot, but a lot of what? Can we pinpoint what the 9 million "things" are whose versions are being calculated? Are we recalculating versions multiple times? Are we scanning candidates that we could avoid by tighter pinning of requirements? The reproduction @notatallshaw quotes is using fully-pinned requirements - is the same true of the original issue? I don't personally have the time or resources to reproduce a 1-2 hour install that needs me to set up an Alpine Linux environment. And I'm uncomfortable extrapolating from a reproduction that takes 2-3 minutes - even if the latter demonstrates places we could improve performance, I don't know how we establish whether we're addressing the problem reported here. I'm happy to support the general exercise of improving performance in pip where there are ways of doing so. But I think we need a better way of measuring performance improvements in that case (for example, I don't want to improve performance of 1000-dependency installs at the cost of hurting performance in the more common case of something like
|
That's one of the reasons I created a separate issue, I am not sure fixing the performance bottleneck I see with I can only go off what I can reproduce, and with a scenario where I am able to identify Long story short, as I already said in #12314 (comment), my plan was to fix performance bottlenecks that I can identify and reproduce.
I agree, in my future PR work I plan to show the performance impact across 4 scenarios, which will probably be:
I am also looking to see if I can show memory impact as well as relative time performance improvement.
This was going to be my next scenario I tested, profiling with cache I see a lot of opportunities for improvement. However even with cache fully populated there are a lot of network calls involved, I planned to create an environment that used simpleindex to store all the relevant projects thus reducing the amount of time random network fluctuations affected the relative performance. I have not yet constructed this environment to do this testing. |
It's confusing because there are multiple bottlenecks at scale. Some of the newer testing is using
In the use case here, everything is running through qemu so the run time of pip can be 2-3 orders of magnitude worse. With only avoiding
The construction of 9 million version objects was from re-processing the same versions of the same packages over and over for the purposes of comparing them. Since the version is constructed each time it needs to be compared, the object/memory is a significant bottleneck.
That's completely understandable, it's not surprising that pip isn't tested to scale at 1000s of packages as that test cycle is painful. The goal was to generate a test case that was not painful to reproduce that encapsulated a portion of the run time problem. Testing is happening on systems that are 2-3 orders of magnitude faster than the actual use case under qemu so any improvement in seconds can quickly become an improvement in minutes in actual production. |
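To illustrate the repeated-construction cost described in this comment, here is a minimal sketch of memoizing version parsing so that each distinct version string is parsed only once. `ToyVersion`, `parse_cached` and `newest` are hypothetical names invented for this example; they are not pip's or packaging's classes, and this is not necessarily how pip ended up addressing the problem.

```python
from functools import lru_cache


class ToyVersion:
    """Hypothetical stand-in for an expensive-to-construct version object."""

    constructed = 0  # counts how many times parsing actually ran

    def __init__(self, text: str) -> None:
        ToyVersion.constructed += 1
        # Only handles plain dotted integers such as "1.2.3".
        self.release = tuple(int(part) for part in text.split("."))


@lru_cache(maxsize=None)
def parse_cached(text: str) -> ToyVersion:
    """Return one shared ToyVersion per distinct version string."""
    return ToyVersion(text)


def newest(candidates: list[str]) -> str:
    # Every comparison key goes through the cache, so a string that
    # shows up repeatedly is parsed once across all calls.
    return max(candidates, key=lambda text: parse_cached(text).release)


if __name__ == "__main__":
    versions = ["1.0.0", "1.2.0", "1.10.0"] * 1000
    for _ in range(10):
        newest(versions)
    # 30,000 comparison keys were requested, but only 3 objects were built.
    print(ToyVersion.constructed)
```

The trade-off is memory: an unbounded cache keeps every parsed object alive, which is why a cache like this would need to be bounded or scoped to a single resolution in practice.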
Took a look at this again today, I think there are two issues:
From what OP has explained I believe that 2. is the issue OP is facing, but there is no hard data in this issue to prove it. As it happens @pradyunsg has amazingly created a new benchmark tool https://github.com/pradyunsg/pip-resolver-benchmarks, and it should be possible to use this to directly show performance improvements and issues across a range of scenarios. When I have some time I will try and turn this use case into a benchmark scenario. |
I've created a draft PR on what I think might be a solution to 2: sarugaku/resolvelib#148. But it is making an assumption about resolution convergence which I'm not sure is true, so I need to do a bunch of testing and carefully pick through failing test cases in resolvelib. I've created an experimental Pip branch here based on that PR: https://github.com/notatallshaw/pip/tree/optimize-resolve-steps. I've tested installing home assistant's full requirements, and I see the number of times the @bdraco it would be helpful, if you get a chance, to test against that branch and post back if you get any unexpected errors. |
The I ran an updated (Python 3.12) scenario based on the steps to reproduce in #12320, as it largely takes out IO and sdist building as a factor. Here were my results (run on my very fast PC):
I think it's still worth investigating bottlenecks, but performance should be a lot better now, and since this thread was opened uv has been launched which aims for super fast performance with a pip-like interface, which took less than 1 second for this scenario on my computer. |
@notatallshaw Are you comparing against uv with a warm cache? |
No, this compares pip against uv for resolving a large number of pre-downloaded wheels using find-links, no cache, no installing. Performance compared to pip main is about 100x faster, at least on my machine. |
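For anyone wanting to try a similar "pre-downloaded wheels, find-links, no cache, no installing" measurement, the sketch below builds one plausible pair of pip command lines. The exact commands used in this thread are not shown here, so the flag combination and the helper names `download_command` and `offline_resolve_command` are assumptions for illustration only.

```python
import shlex
import sys


def download_command(requirements: str, wheel_dir: str) -> list[str]:
    """Command that pre-downloads every distribution into wheel_dir (needs network once)."""
    return [sys.executable, "-m", "pip", "download",
            "--dest", wheel_dir, "-r", requirements]


def offline_resolve_command(requirements: str, wheel_dir: str) -> list[str]:
    """Command that resolves against local files only and installs nothing."""
    return [sys.executable, "-m", "pip", "install",
            "--dry-run",           # resolve only, do not install
            "--ignore-installed",  # do not let already-installed packages short-circuit the resolve
            "--no-index",          # never contact a package index
            "--no-cache-dir",      # bypass pip's cache
            "--find-links", wheel_dir,
            "-r", requirements]


if __name__ == "__main__":
    # Paths are placeholders; pass each list to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(download_command("requirements_all.txt", "wheels")))
    print(shlex.join(offline_resolve_command("requirements_all.txt", "wheels")))
```

Timing only the second command keeps network latency and sdist builds out of the measurement, which is the point of the methodology described in the comment.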
FWIW that same home assistant scenario still using no cache but doing a regular install from PyPI on my machine uv completes in ~2 mins and pip main in ~16 mins. A lot of the time is spent building sdists and downloading, both of which uv does in parallel, so it's hard to determine in that scenario if uv is actually being more efficient than pip or just faster due to the actions being done in parallel. |
It almost certainly is, since it's using partial downloads as well during the resolve phase (basically pip's fast-deps behaviour). |
There hasn't been any feedback from OP since there have been 2 pip releases, both of which should have seen significant performance improvement, the latest of which directly tackles the complaint about construction of the Version object too many times. And as no one was able to reproduce OP's issue exactly this thread has been left open. However given the work done, I feel it is worth closing at this point, and tackling any existing performance problems in a new issue. And to be clear, there are definitely O(n²+) issues in pip that rear their head under certain circumstances that are not always obvious to the pip maintainers. So please do raise new issues, ideally with clear steps to reproduce. |
Description
This is a followup to #10788
Attached is a callgrind
pip_callgrind.zip
The bulk of the time is spent creating Version objects
Expected behavior
No response
pip version
pip 23.2.1 from /root/ztest/lib/python3.11/site-packages/pip (python 3.11)
Python version
3.11
OS
alpine 3.18 linux
How to Reproduce
Must use alpine 3.18 linux
python3 -m cProfile -o all.pstats -m pip install --no-cache-dir --only-binary=:all: --index-url https://wheels.home-assistant.io/musllinux-index/ -r /usr/src/homeassistant/requirements_all.txt
https://github.com/home-assistant/core/blob/dev/requirements_all.txt
or
Dockerfile https://github.com/home-assistant/core/blob/dev/Dockerfile
Output
No response