Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perf regression: pipenv lock creates many full-tree copies of the project #4403

Closed
mjpieters opened this issue Jul 28, 2020 · 4 comments · Fixed by #4421
Closed

Perf regression: pipenv lock creates many full-tree copies of the project #4403

mjpieters opened this issue Jul 28, 2020 · 4 comments · Fixed by #4421
Labels

Comments

@mjpieters
Copy link

mjpieters commented Jul 28, 2020

Issue description

While trying to debug why pipenv update was taking a monumental amount of time, I noticed that the pipenv/resolver.py process was very busy with copying across data to /tmp/reqlib-src[randomvalue] directories. The project includes several GBs of log files, so this was eating up a lot of /tmp disk space and taking a lot of time.

I traced this to the following SetupInfo.from_ireq() lines:

if is_file and not is_vcs and path is not None and os.path.isdir(path):
target = os.path.join(kwargs["src_dir"], os.path.basename(path))
shutil.copytree(path, target, symlinks=True)
ireq.source_dir = target

This creates multiple copies of the source tree (in apparent defiance of the LRU decorator), even though the project is listed just once, with no other entries under [packages]:

[packages]
project-name = {editable = true,path = "."}

It appears to create a copy per dependency; we have 104:

$ pipenv graph --json | grep '"key"' | cut -d':' -f2 | sort -u | wc -l
104

and while the temp dirs get cleaned up once the process finishes, towards the end I counted nearly as many temp directories:

$ ls -1ld /tmp/reqlib-src*/project-name | wc -l
85

I'm assuming at this point that had I put in a breakpoint somewhere I'd have seen 104 directories before pipenv completes. (addendum: my later timing tests show that this isn't quite the case, but still close).

While having log files in a project source directory is not ideal, it was not otherwise a problem as setuptools, .gitignore and MANIFEST.in files are configured to ignore the log files. We can't work around the issue with a symlink either as shutil.copytree() is called with symlinks=True.

Even then, nearly 100 copies of the full project is a huge waste of resources.

Expected result

If pipenv must create an isolated environment, it should either attempt to enumerate the source distribution files to copy, or warn or clearly document that a full copy is created of the whole tree. It should then create just one copy.

By temporarily removing the log files, pipenv update completed (albeit still slowly) without completely trashing the filesystem.

Actual result

Either pipenv update times out (in pexpect), or you run out of disk space on your temp partition.


Note: I've cut sensitive information out of the --support output; the project dependencies are simply:

install_requires =
    apache-airflow[celery,postgres,redis] >= 1.10.11
    airflow_multi_dagrun
    airflow-prometheus-exporter
$ pipenv --support

Pipenv version: '2020.6.2'

Pipenv location: '/home/ubuntu/.local/lib/python3.6/site-packages/pipenv'

Python location: '/home/ubuntu/miniconda3/envs/project-name/bin/python'

Python installations found:

  • 3.7.4: /home/ubuntu/miniconda3/bin/python3.7m
  • 3.7.4: /home/ubuntu/miniconda3/bin/python3
  • 3.7.4: /home/ubuntu/miniconda3/bin/python3.7
  • 3.6.9: /usr/bin/python3
  • 3.6.9: /usr/bin/python3.6m
  • 3.6.9: /usr/bin/python3.6
  • 2.7.17: /usr/bin/python2
  • 2.7.17: /usr/bin/python2.7

PEP 508 Information:

{'implementation_name': 'cpython',
 'implementation_version': '3.6.10',
 'os_name': 'posix',
 'platform_machine': 'x86_64',
 'platform_python_implementation': 'CPython',
 'platform_release': '5.3.0-1027',
 'platform_system': 'Linux',
 'platform_version': '#29~18.04.1-Ubuntu SMP Mon Jun 22 15:19:42 UTC 2020',
 'python_full_version': '3.6.10',
 'python_version': '3.6',
 'sys_platform': 'linux'}

System environment variables:

  • CONDA_SHLVL
  • LC_ALL
  • LS_COLORS
  • LD_LIBRARY_PATH
  • CONDA_EXE
  • SSH_CONNECTION
  • LESSCLOSE
  • LANG
  • CONDA_PREFIX
  • S_COLORS
  • _CE_M
  • XDG_SESSION_ID
  • USER
  • PWD
  • HOME
  • CONDA_PYTHON_EXE
  • LC_CTYPE
  • LC_TERMINAL
  • SSH_CLIENT
  • TMUX
  • LC_TERMINAL_VERSION
  • XDG_DATA_DIRS
  • COMBILEXDIR
  • _CE_CONDA
  • LADSPA_PATH
  • CONDA_PROMPT_MODIFIER
  • SSH_TTY
  • MAIL
  • TERM
  • SHELL
  • TMUX_PANE
  • SHLVL
  • LOGNAME
  • XDG_RUNTIME_DIR
  • PATH
  • CONDA_DEFAULT_ENV
  • LESSOPEN
  • _
  • OLDPWD
  • PIP_DISABLE_PIP_VERSION_CHECK
  • PYTHONDONTWRITEBYTECODE
  • PIP_SHIMS_BASE_MODULE
  • PIP_PYTHON_PATH
  • PYTHONFINDER_IGNORE_UNSUPPORTED

Pipenv–specific environment variables:

Debug–specific environment variables:

  • PATH: /home/ubuntu/.local/bin:/home/ubuntu/bin:/home/ubuntu/.local/bin:/home/ubuntu/bin:/home/ubuntu/miniconda3/bin:/home/ubuntu/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
  • SHELL: /bin/bash
  • LANG: C.UTF-8
  • PWD: /home/ubuntu/project-name

Contents of Pipfile ('/home/ubuntu/project-name/Pipfile'):

[[source]]
name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true

[dev-packages]
flake8 = "*"
flake8-bugbear = "*"
black = "==19.10b0"
pre-commit = "*"
pytest = "*"
pytest-cov = "*"

[packages]
project-name = {editable = true,path = "."}

[requires]
python_version = "3.6"
@mjpieters
Copy link
Author

mjpieters commented Aug 3, 2020

Note that this change is recent, it was introduced in c4a165b, so the affected versions are:

  • v2020.4.1b2
  • v2020.5.28
  • v2020.6.2

@techalchemy, you made this change; the repeated directory copying is causing pipenv to time out on install, lock or update commands, regularly, unless we minimize what's in the project directory, and even then it just takes way too much time. This used to run in under a minute, now we are looking at 15-20 minute run times.

For 'regular' projects on developer machines, the shutil.copytree command will also copy across various caches such as the __pycache__ directories, pytest and coverage cache directories, egg files, built distribution files, the full .git repository history, etc., and this many times over. Even at a few MB that's a lot of time wasted.

@mjpieters
Copy link
Author

mjpieters commented Aug 3, 2020

Here is a timing example with two separate pipenv versions, one before the change, one after, local project with cached files (note the size of the .git repository!). I had to run the 2020.6.2 attempt with PIPENV_INSTALL_TIMEOUT set to a high value to avoid timing out:

$ du -hs . .git
3.2M	.
2.7M	.git
$ ~/.local/bin/pipenv --version
pipenv, version 2018.11.26
$ pipenv --versions
pipenv, version 2020.6.2
$ time ~/.local/bin/pipenv lock
Locking [dev-packages] dependencies…
✔ Success!
Locking [packages] dependencies…
✔ Success!
Updated Pipfile.lock (e0aade)!

real	0m49.705s
user	0m43.434s
sys	0m4.617s
$ export PIPENV_INSTALL_TIMEOUT=99999999
$ time pipenv lock
Locking [dev-packages] dependencies…
Building requirements...
Resolving dependencies...
✔ Success!
Locking [packages] dependencies…
Building requirements...
Resolving dependencies...
✔ Success!
Updated Pipfile.lock (e0aade)!

real	18m51.186s
user	17m42.817s
sys	3m3.365s

So instead of 50 seconds, I had to wait nearly 19 minutes before completion.

A separate Python script tracked how many directories were created:

import glob, tempfile, time
maxcount = 0
pattern = f"{tempfile.gettempdir()}/reqlib-src*"
while True:
    count = len(glob.glob(pattern))
    if count > maxcount:
        maxcount = count
        print("Current maximum directory count:", maxcount)
    time.sleep(1)

which reached:

Current maximum directory count: 86

@mjpieters mjpieters changed the title SetupInfo.from_ireq() creates multiple full tree copies of editable path package Perf regression: pipenv lock creates many full-tree copies of the project Aug 3, 2020
This was referenced Aug 3, 2020
@tleonhardt
Copy link

tleonhardt commented Aug 5, 2020

I didn't dive into why, but I have noticed that it takes an extremely long time to do an update and that this has gotten worse in recent versions.

Specifically its bad enough that I've realized its way faster to just blow away the entire .venv and re-create it from scratch than it is to do an update, so I've resorted to that.

@luuk-codebeez
Copy link

luuk-codebeez commented Aug 7, 2020

Our team is affected by this. We are either rolling back to pipenv 2018 or migrating to poetry. The minimal activity on this thread does not give me much confidence. Thanks mjpieters for the diagnosis.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants