Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Checking if installed packages are up to date is slow and use lots of CPU #2207

Open
Tenzer opened this issue May 16, 2018 · 16 comments

Comments

@Tenzer
Copy link
Contributor

@Tenzer Tenzer commented May 16, 2018

I've been trying to migrate a project to use Pipenv but I'm slightly blocked on how much longer it takes for Pipenv to check if the installed dependencies are up to date, compared to pointing pip install at a requirements file.

In our setup we run tests inside Docker containers. The image we run the tests on is one that comes pre-installed with the dependencies our project has at the time the image is built. Before we run any tests we then make sure the dependencies are up to date, in case any new dependencies are needed for any new code/tests that might have been added. For this we have just been using pip install -r requirements.txt, which normally completes in around 30 seconds when there's no new dependencies to install.

I then tried to switch this to Pipenv and pre-installed the dependences in the image using a Pipfile and Pipfile.lock and then running pipenv install --deploy --dev --system against the files. That works fine and I got an image created, but the problem comes to when we want to run tests and want to check if dependencies are up to date first. I've done this using the same pipenv install --deploy --dev --system command and instead of 30 seconds it now takes 5 minutes and 30 seconds! On top of that the CPU usage it much, much higher.

I've made a small test with the Pipfile and Pipfile.lock we are using (only slightly modified): https://github.com/Tenzer/pipenv-test. Some simple tests that can be run with it is for instance to first install the dependencies and then afterwards check that they are up to date in the local environment, than then see how long time and CPU the second operation takes:

$ docker run -it --rm -v $(pwd):/test python bash
root@9f6ecaf12cf8:/# cd /test
root@9f6ecaf12cf8:/test# pip install pipenv
[...]
root@9f6ecaf12cf8:/test# pipenv install --deploy --dev --system
Installing dependencies from Pipfile.lock (f4e26d)…
Ignoring appnope: markers 'sys_platform == "darwin"' don't match your environment
  🐍   ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 212/212 — 00:02:25
root@9f6ecaf12cf8:/test# time pipenv install --deploy --dev --system
Installing dependencies from Pipfile.lock (f4e26d)…
Ignoring appnope: markers 'sys_platform == "darwin"' don't match your environment
  🐍   ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 212/212 — 00:01:04

real	1m7.166s
user	1m49.520s
sys	0m15.000s

Note that this was run on my laptop rather than our CI system, and with a slightly simpler Pipfile, hence it's faster than what I described above. It can however be compared to checking if all packages are installed with pip:

root@9f6ecaf12cf8:/test# pip freeze > requirements.txt
root@9f6ecaf12cf8:/test# pip install -r requirements.txt
[...]
real	0m1.836s
user	0m1.610s
sys	0m0.130s

So according to this non-scientific test, Pipenv is taking 36 times as long and using 94 times more CPU than pip.

I know that there's a big difference between what's going on under the hood, but my point here is that the vastly longer time and resource usage may be a deal breaker for some with lots of dependencies.

While digging into this, I noticed that Pipenv is spawning one pip process for each package, and I wonder how much of a slowdown that is compared to pip doing everything inside one process. Would it potentially make sense to split the list of dependencies into 16 (or whatever PIPENV_MAX_SUBPROCESS is set to), in order to avoid having to spawn 212 pip processes - like it's the case here?

It might also be that this is all down to pip and trying to make it faster for the operations that Pipenv runs. I just thought I would start here and see if there perhaps could be some possible optimisations on the Pipenv side of things.

@kalekseev

This comment has been minimized.

Copy link

@kalekseev kalekseev commented May 26, 2018

I'm also concerned about slow pipenv install in already created venv and would like to understand the difference between pipenv install --ignore-pipfile and pipenv lock -r > req.txt && pip install -r req.txt?
The second approach looks the same but it's much faster.

@uranusjr

This comment has been minimized.

Copy link
Member

@uranusjr uranusjr commented May 26, 2018

@kalekseev Pipenv caches your dependency tree, so if you run the two in order, the second one will be way faster b cause it just uses the cache. You need to test them in fully isolated environments (e.g. Docker images).

@kalekseev

This comment has been minimized.

Copy link

@kalekseev kalekseev commented May 26, 2018

@uranusjr well I'm talking mostly about this case:
On deploy in 99% cases we don't have any new deps in Pipfile so pipenv or pip should just check that all deps were already installed and exit.
In deploy logs pipenv lock -r && pip install is way faster than pipenv install --ignore-pipfile. I don't see how cache can affect results in that case.

@Tenzer

This comment has been minimized.

Copy link
Contributor Author

@Tenzer Tenzer commented Jun 26, 2018

I tried to use pyflame to get a better idea about what takes so much time when pipenv is checking for new packages. It's mainly the pip invocations which is taking time so I patched pipenv to run pyflame on each pip process with this patch against the current master branch (a2a1936):

diff --git a/pipenv/core.py b/pipenv/core.py
index 6fe230c8..732f92d8 100644
--- a/pipenv/core.py
+++ b/pipenv/core.py
@@ -1480,7 +1480,8 @@ def pip_install(
     quoted_pip = which_pip(allow_global=allow_global)
     quoted_pip = escape_grouped_arguments(quoted_pip)
     upgrade_strategy = '--upgrade --upgrade-strategy=only-if-needed' if selective_upgrade else ''
-    pip_command = '{0} install {4} {5} {6} {7} {3} {1} {2} --exists-action w'.format(
+    from uuid import uuid4
+    pip_command = 'pyflame --threads --output=/pipenv/flamegraph/{8}.txt --trace {0} install {4} {5} {6} {7} {3} {1} {2} --exists-action w'.format(
         quoted_pip,
         install_reqs,
         ' '.join(prepare_pip_source_args(sources)),
@@ -1489,6 +1490,7 @@ def pip_install(
         src,
         verbose_flag,
         upgrade_strategy,
+        uuid4(),
     )
     if verbose:
         click.echo('$ {0}'.format(pip_command), err=True)

I then concatenated the resulting files together to one and created a Flamegraph out of it. I did this both for a fresh install of the packages from the test repository, where all packages had to be installed, and then ran it again afterwards where no new packages were installed. The resulting Flamegraphs are here:
https://files.tenzer.dk/pipenv/fresh-install.svg
https://files.tenzer.dk/pipenv/no-new-packages.svg

The output from time is included in the sub-title of the graphs.

I am by no means an expert on pyflame output, but from what I can grasp by looking at no-new-packages.svg, it seems like a big proportion of the time spent by pip is spent on importing code that it needs in order to run. Potentially in the area of 60-70% of the time. This becomes clearer if you search for "importlib" in the top right corner, and then compare it to the fresh-install.svg graph.

Could we change pipenv to only spawn as many pip processes it needs, instead of one per package? I think the start up cost for pip makes it worth spending a bit more time up front on preparing a list of dependencies for pip, rather than only sending one to each process.

As far as I can see, the downside of not having a pip process per package, is that the progress bar becomes less useful as it will potentially jump from 0% to 100% due to one invocation of pip, but I see that as a small downside compared to the current resource usage.

If there's support for changing this, I wouldn't mind giving it a go. Any hints as to what to be aware of would be appreciated as well.

@techalchemy

This comment has been minimized.

Copy link
Member

@techalchemy techalchemy commented Jun 26, 2018

Hey @Tenzer seems like we came to the same conclusion! I don’t think there’s much to watch out for; we were hoping to tackle this in the coming weeks. For the sake of keeping progress bars useful I was considering consuming requirements with a threadpool or something,havent really had time to experiment. If you want to take a stab at this it would be much appreciated!

@Tenzer

This comment has been minimized.

Copy link
Contributor Author

@Tenzer Tenzer commented Jun 28, 2018

I had a chance to look through the code today to get an idea about how it could be implemented. My idea for now is:

  1. Make a parallel version of pip_install() which can take a batch of packages to install. It should end up being simpler as I don't want it to support "special" dependencies. I classify special dependencies as ones which require extra flags added to the pip command in order to be installed, such as editable dependencies, dependencies which point at VCS, dependencies which are requirements.txt files, etc.
  2. Make a function in utils.py which can be fed the list of dependencies from do_install_dependencies(), and split them into the ones which can be installed by the bulk pip_install() function and the ones which need special handling.
  3. Add a call in do_install_dependencies() which calls the bulk install command for the packages which can be installed like that, and then afterwards use the existing code to loop over the dependencies needing special handling.

Does that sound like an okay approach?

I imagine the majority of dependencies people have don't need any special pip flags, so this should speed up the majority of packages to install.

The alternative would be to try and group the list of dependencies by which pip flags they need, and then run each of them in batches. I think this can get tricky though, but can probably be done by creating sets of the arguments needed for pip so they can be grouped more easily, but that would require bigger code changes then, as most of the current logic inside pip_install() would have to be moved out of that function.

@Tenzer

This comment has been minimized.

Copy link
Contributor Author

@Tenzer Tenzer commented Jul 12, 2018

@techalchemy Do you have any feedback on the above idea?

@uranusjr

This comment has been minimized.

Copy link
Member

@uranusjr uranusjr commented Jul 12, 2018

One problem I can think of immediately is that you’d broke the installation progress bar by breaking them into chunks. How would the parent process know how far a process is into installing its chunk?

@Tenzer

This comment has been minimized.

Copy link
Contributor Author

@Tenzer Tenzer commented Jul 12, 2018

I mentioned that in my comment in #2207 (comment):

As far as I can see, the downside of not having a pip process per package, is that the progress bar becomes less useful as it will potentially jump from 0% to 100% due to one invocation of pip, but I see that as a small downside compared to the current resource usage.

In my mind the resource usage and speed of pipenv is more important than the progress bar.

An alternative idea would be to use pip as a library rather than calling it as a separate binary for each package, but that would likely be more work than installing packages in batches.

@uranusjr

This comment has been minimized.

Copy link
Member

@uranusjr uranusjr commented Jul 12, 2018

While I wholy agree the speed is more important than the progress bar, this is unfortunately one of those endowment effect situations. From a project management standpoint, if we need to sacrafise something to achieve the gain, the gain needs to be greatly outweight the loss, otherwise it would be not worth it to deal with the backslash from users. This is more of a political situation than technical, but no less important for a project.

Using pip as a library would be a good solution to this problem. pip, however, does not really encourages these kind of usages, and the script calling pip would need constant maintainence. It would also need to be a standalone script (instead of part of Pipenv), because the installation would need to be performed in the virtual environment’s Python, not the one Pipenv is installed into.

We do already tap into pip internals a little bit (but in other parts, not the installation), so this is not impossible. I would, however, encourage starting a new, standalone project for this (given it is a standalone script anyway). We will be able to integrate the script once it is usable.

@Tenzer

This comment has been minimized.

Copy link
Contributor Author

@Tenzer Tenzer commented Jul 12, 2018

Would adding a flag or environment variable to change the behaviour be acceptable? Meaning that the current behaviour is kept as the default, and then people who want the speed boost instead of the progress bar can switch to using a batched behaviour instead.

It could be thought of as a feature flag and perhaps help assess how big a difference it makes to the package installation speed.

@uranusjr

This comment has been minimized.

Copy link
Member

@uranusjr uranusjr commented Jul 12, 2018

That could work. Maybe introduce a variable called PIPENV_INSTALL_CHUNK (probably a bad name, please make suggestions) that indicates how many packages each pip process should install. The default is 1, which is the current behaviour, but you can increase it to whatever you think is good. The progress bar would still show, but would always just tell you how many processes have finished, so less useful when you dial the number up.

This should be quite simple to implement IMO. I would be very interested to hear how other maintainers think about this idea @techalchemy @erinxocon

@techalchemy

This comment has been minimized.

Copy link
Member

@techalchemy techalchemy commented Jul 13, 2018

A couple of things --

I've actually implemented the batched approach in the past for comparison and it wasn't faster, I didn't spend any time profiling it because it wasn't that straightforward either. One fundamentally negative element of this entire problem is that pip's own installation itself uses a subprocess call, so there will never really be a way to avoid this issue unless we do something incredibly drastic

Breaking the progress bar is not an option IMO. I would rather do something drastic like have a thread-safe construct that allows us to pass requirements around and consume those.

@max-wittig

This comment has been minimized.

Copy link

@max-wittig max-wittig commented Sep 13, 2018

Installing these packages using the following command is downloading some dependencies, not working at all for some and for others just reaĺly slow, but only with the python:3.7-alpine image. Crashes after 10+ minutes. It is reasonably fast with the python:3.7-stretch image and works fine. Why could that be?

I think it sometimes tried to connect to pypi, even though it's not possible. Please let me know, if I should create a new issue for this:

version-externally-managed --compile --install-headers /builds/project/venv/include/site/python3.7/bcrypt
    Download error on https://pypi.python.org/simple/cffi/: [Errno 99] Address not available -- Some packages may not be found!
    Couldn't find index page for 'cffi' (maybe misspelled?)
    Download error on https://pypi.python.org/simple/: [Errno 99] Address not available -- Some packages may not be found!

Pipfile

```toml
[[source]]
url = "https://some-artifactory/artifactory/api/pypi/pypi-all/simple"
verify_ssl = true
name = "artifactory"

[packages]
docker = "*"
paramiko = "*"
python-gitlab = "*"
python-gnupg = "*"
GitPython = "*"
pytest = "*"
pyyaml = "*"

[dev-packages]
pycodestyle = "*"

[requires]
python_version = "3.7"

[scripts]
lint = "pycodestyle --exclude=venv/ ."
test = "pytest --junitxml=results.xml"

Command used

pipenv install --verbose --pypi-mirror $PIP_INDEX --dev --deploy
@Tenzer

This comment has been minimized.

Copy link
Contributor Author

@Tenzer Tenzer commented Sep 13, 2018

@max-wittig It's a different issue to what's otherwise been talked about here, so you are probably better off creating another issue.

@lahwran

This comment has been minimized.

Copy link

@lahwran lahwran commented Jul 31, 2019

Any update on this? Could an option be provided to simply disable the progress bar, if it's so expensive? If the bottleneck is process invocations, I'd think that switching from O(n) process invocations should make it so much faster in the already-installed case that the progress bar becomes irrelevant in the already-installed case, does that seem right?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.