
Package locking is crazy slow for scikit-learn #1785

Closed
jleclanche opened this issue Mar 19, 2018 · 43 comments

@jleclanche
Contributor

jleclanche commented Mar 19, 2018

I'm sorry I don't have a reduced test case for this, but it's so crazy slow that it's hard to actually debug.

Steps to reproduce

  1. Clone https://github.com/HearthSim/hearthsim-vagrant
  2. Inside that directory, clone https://github.com/HearthSim/HSReplay.net
  3. Run docker-compose build django (this builds an image based on the python-stretch docker image, which will also install the latest pipenv systemwide, cf. Dockerfile).
  4. Finally, run docker-compose run django, which runs pipenv install --dev
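Condensed into commands, the reproduction is roughly (a sketch assuming docker-compose is installed; the service name comes from the steps above):

git clone https://github.com/HearthSim/hearthsim-vagrant
cd hearthsim-vagrant
git clone https://github.com/HearthSim/HSReplay.net
docker-compose build django   # image based on python-stretch, installs the latest pipenv systemwide
docker-compose run django     # runs `pipenv install --dev` and sits at "Locking [packages]"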

On Linux, this stays stuck at Locking [packages] for over 15 minutes, with no output even when run with --verbose. Then after ~15 minutes, it gives me the full output of what it's been doing for all that time.
When run outside of docker, that step is still slow, but takes at most 1-2 minutes. I have a pretty beefy CPU and SSD, so I don't know why it would take this long in the first place.

I also see a lot of Warning: Error generating hash for ... in the verbose output; I don't know if that's related.

Any ideas? How can I debug this further?

@kennethreitz
Contributor

you shouldn't be running lock in docker in the first place…

@jleclanche
Contributor Author

I don't see why not. But that aside there's still a performance issue here I'd like to figure out.

@jleclanche
Contributor Author

I've added --skip-lock to my compose file for now. Let me know how I can help debug this.

@pikeas

pikeas commented Mar 19, 2018

@kennethreitz Running language-specific tools in Docker is normal. Specifically:

docker run --rm -v `pwd`/Pipfile.lock:/app/Pipfile.lock my_app pipenv <cmd> should work; this is a common workflow. This runs pipenv inside the application container, with the result (i.e. changes to the lockfile) reflected on the host for VCS check-in.

The goal here is that devs never need to worry about local/system Python versions, pipenv versions, etc.

I'd also love to help debug this.

@techalchemy
Member

Cheers @jleclanche I used to play hearthstone in its early days before there was much tooling around it so thanks for building some community tools!

Generally speaking, when you run pipenv lock you're resolving the entire dependency graph. For Django projects this can be quite large and complex -- I use Django at work and it is quite a process. To be quite frank, unless you're developing inside a docker container it doesn't make a lot of sense to run pipenv lock in your compose script. This is kind of defeating the purpose of pipenv.

If you run pipenv install or pipenv install --dev locally for your dev process, you will already have a lockfile you can deploy. That's the whole purpose here -- you already have an isolated python environment in a virtualenv which is managed by pipenv, so it can handle dependency graphs across platforms, OSes and whatnot. Since I imagine you install things locally when you develop, there's really no reason at all to force docker to re-resolve a dependency graph that has already been resolved.

Resolvers are hard, and sometimes in Python we need to download the whole package to parse its setup.py to crawl the dependency graph. If you don't pass a lockfile to your deployment you're basically handicapping yourself.

@jleclanche
Contributor Author

Cheers @jleclanche I used to play hearthstone in its early days before there was much tooling around it so thanks for building some community tools!

:)

That's the whole purpose here -- you already have an isolated python environment in a virtualenv which is managed by pipenv, so it can handle dependency graphs across platforms, OSes and whatnot.

I fully understand what pipenv brings to the table. Just to explain why I'm using it in docker:

  • I need docker because the stack I'm running locally is complex. It's not just a single Python app; it's a Python app, a database, mock servers, a Redis server, etc. All of these need to be available, cross-platform, consistently for all devs on the team. Docker solves that.
  • I need (want) pipenv because I need (want) to track my dependencies in a Pipfile rather than a requirements.txt. That is to say, I'm moving the app to pipenv anyway. So now my choice is to either duplicate the dependencies, or use pipenv consistently in docker as well.

With that said, I'm not interested in solving my problem. I solved my problem by adding --skip-lock. I'm interested in solving, or helping solve, the egregious difference in performance between inside and outside of the container. Or at least coming out of this with a "there's a very good reason for this difference and here it is".

But yarn is also running inside that same container and managing 1-2 orders of magnitude more dependencies than pipenv, so I think we can do better. And if that takes me PRing setup.py/setup.cfg fixes to 30 different projects so be it :)

@jtratner
Collaborator

jtratner commented Mar 20, 2018

@jleclanche - just to get at why the docker version is slower than the local version, can you try running the command twice within the container? Generally, my assumption is that the time is being taken up in downloading/caching all the wheels (and possibly also building wheels).

i.e.,

docker run -it -v $PWD/Pipfile:/Pipfile python-stretch-image bash
$ time pipenv lock --clear
$ time pipenv lock --clear
$ time pipenv lock --clear

I'm relatively certain the second / third / fourth times will be much faster. On my system (outside of Docker), when I bust my wheel/pipenv/pip caches, locking goes from taking 1.5 minutes to taking 5-10 minutes.

If it is much faster the second (or third!) time, then the issue becomes how to persist the cache locally to get to that faster speed. In my (totally unofficial) opinion, I think you should be able to do this by explicitly setting the PIPENV_CACHE_DIR environment variable (for pipenv) and also the XDG_CACHE_HOME environment variable (for pip) to places that are mounted on your local filesystem, so they aren't wiped.


Semi-crazy cache-maintaining solution for docker

If the slowness is from (lack of) caching between runs, something like the below might work (you'll need to have a pre-existing Pipfile.lock in your working directory so you can dump that out at the end):

export LOCAL_PIPENV_CACHE_DIR=$PWD/pipenv-cache
export CONTAINER_PIPENV_CACHE_DIR=/.pipenv-cache
export PIP_LOCAL_CACHE_DIR=$PWD/.pip-cache
export PIP_REMOTE_CACHE_DIR=/.xdg-cache
mkdir -p $LOCAL_PIPENV_CACHE_DIR
mkdir -p $PIP_LOCAL_CACHE_DIR
docker run \
  -v $PWD/Pipfile:/example/Pipfile \
  -v $PWD/Pipfile.lock:/example/Pipfile.lock \
  -v $PIP_LOCAL_CACHE_DIR:$PIP_REMOTE_CACHE_DIR \
  -v $LOCAL_PIPENV_CACHE_DIR:$CONTAINER_PIPENV_CACHE_DIR \
  -e PIPENV_VENV_IN_PROJECT=1 \
  -e PIPENV_CACHE_DIR=$CONTAINER_PIPENV_CACHE_DIR \
  -e XDG_CACHE_HOME=$PIP_REMOTE_CACHE_DIR \
  -i myimage /bin/bash -c "env | grep -i cache && cd /example && time pipenv lock --clear"

@jleclanche
Contributor Author

Uh, what's going on?
[screenshot]

@jleclanche
Contributor Author

Huh, looks like Pipenv really doesn't like the Pipfile being in /.

@jleclanche
Contributor Author

mmhm.

# docker run -it --mount type=bind,source=$(pwd)/Pipfile,target=/tmp/Pipfile python:3.6-stretch bash

## cd /tmp; pip install --upgrade pip pipenv
...
root@664de6dd435e:/tmp# time pipenv lock --clear && time pipenv lock --clear && time pipenv lock --clear
Creating a virtualenv for this project…
Using /usr/local/bin/python3.6m (3.6.4) to create virtualenv…
⠋Running virtualenv with interpreter /usr/local/bin/python3.6m
Using base prefix '/usr/local'
New python executable in /root/.local/share/virtualenvs/tmp-XVr6zr33/bin/python3.6m
Also creating executable in /root/.local/share/virtualenvs/tmp-XVr6zr33/bin/python
Installing setuptools, pip, wheel...done.

Virtualenv location: /root/.local/share/virtualenvs/tmp-XVr6zr33
Locking [dev-packages] dependencies…
Locking [packages] dependencies…
Updated Pipfile.lock (5e2e8f)!
 
real    15m22.755s
user    1m57.008s
sys     0m15.110s
Locking [dev-packages] dependencies…
Locking [packages] dependencies…
Updated Pipfile.lock (5e2e8f)!

real    1m31.972s
user    0m49.017s
sys     0m4.507s
Locking [dev-packages] dependencies…
Locking [packages] dependencies…
Updated Pipfile.lock (5e2e8f)!

real    1m9.732s
user    0m48.146s
sys     0m4.628s

@jtratner
Collaborator

@jleclanche - so I guess a possible (again, semi-hacky) solution would be to generate your lockfile separately, using a docker run that mounts your local filesystem (thus allowing you to reuse your cache), and then ADD the Pipfile.lock so that pipenv does not need to do any resolution when you build the final image. Builder-container style...
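A rough sketch of that two-step idea (the image tag and the .lock-cache directory name are assumptions, not something from this thread; a dedicated cache directory sidesteps the pip cache-ownership caveat discussed further down):

# Step 1: resolve the lockfile outside the image build, persisting the caches on the host
docker run --rm \
  -v "$PWD":/app \
  -v "$PWD/.lock-cache":/root/.cache \
  -w /app \
  python:3.6-stretch \
  bash -c "pip install pipenv && pipenv lock"

# Step 2: the actual image build then only needs the committed Pipfile.lock,
# e.g. RUN pipenv install --deploy --system, with no resolution step at all.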

@jleclanche
Contributor Author

That's essentially what I'm doing right now, by having the venv live in the project and not be regenerated between containers. But again, I'm good on my side, I just want to figure out where exactly the performance regression is happening :)

@techalchemy
Member

If you attach to the container while it’s locking, can you see what it’s doing? How’s your *nix-fu? We can check processes, open files, network connections, load

@jleclanche
Contributor Author

My nix-fu is alright, my docker-fu not so much. I'll try to debug this some more and see what I can find out. I don't think it's all network because it takes super long even with a local pypi mirror.

@techalchemy
Member

If you can spin up the docker container you can have it just idly run bash or whatever. Then I think you can attach to it and run screen or some such and run your pipenv stuff in one window and watch stuff in the other

@jtratner
Collaborator

@jleclanche - just for context: if you have a pre-existing Pipfile.lock, how long does it take to just install it directly?

I.e. something like:

<stuff>
ADD Pipfile ./
ADD Pipfile.lock ./
RUN pipenv install --deploy --system

I believe that's the minimal baseline for timing (because it's just time for pip to install stuff) - until PyPI's API is good enough to be trusted for dependency info without installation.

@jleclanche
Contributor Author

Yeah, it's considerably faster like that:

root@9f96a2d15e58:/tmp# time pipenv install --deploy --system
Installing dependencies from Pipfile.lock (5e2e8f)…
     ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 57/57 — 00:01:04

real    2m10.580s
user    0m54.517s
sys     0m6.822s

@jtratner
Collaborator

@jleclanche - that's with zero cache? (i.e., everything is already downloaded, etc)

@jtratner
Collaborator

err, nothing is already downloaded, no pipenv or pip use except for pip install --upgrade pipenv?

@jleclanche
Contributor Author

jleclanche commented Mar 25, 2018

Yes, fresh container + pip install --upgrade pip pipenv. This is with cache (running the command again):

Installing dependencies from Pipfile.lock (5e2e8f)…
     ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 57/57 — 00:00:10

real    0m11.887s
user    0m30.609s
sys     0m2.533s

@jtratner
Collaborator

wow, that's super fast - guessing that's with a local PyPI instance or something?

But also really helpful to separate out where slowness is coming from :)

@jleclanche
Contributor Author

Both of those commands are with regular pypi. I think running the command again might not reinstall though, so it's probably faster than an actual reinstall would be.

@jleclanche
Contributor Author

jleclanche commented Mar 25, 2018

Ok, good call on the cache.

Pipfile:

[[source]]
url = "https://pypi.python.org/simple"
verify_ssl = true
name = "pypi"

[packages]
scikit-learn = {extras = ["alldeps"]}

[requires]
python_version = "3.6"

Time in docker:

root@f4d55b37a3a0:/tmp# time pipenv lock --clear
Creating a virtualenv for this project…
Using /usr/local/bin/python3.6m (3.6.4) to create virtualenv…
⠋Running virtualenv with interpreter /usr/local/bin/python3.6m
Using base prefix '/usr/local'
New python executable in /root/.local/share/virtualenvs/tmp-XVr6zr33/bin/python3.6m
Also creating executable in /root/.local/share/virtualenvs/tmp-XVr6zr33/bin/python
Installing setuptools, pip, wheel...done.

Virtualenv location: /root/.local/share/virtualenvs/tmp-XVr6zr33
Locking [dev-packages] dependencies…
Locking [packages] dependencies…
Updated Pipfile.lock (96e4fc)!
 
real    11m54.703s
user    1m19.665s
sys     0m13.403s

Time on host after removing ~/.cache (different system/setup though):

(tmp-4dad6bef6fe737b) [2:37:43] adys@azura ~/tmp % time pipenv lock --clear
Courtesy Notice: Pipenv found itself running within a virtual environment, so it will automatically use that environment, instead of creating its own for any project.
Locking [dev-packages] dependencies…
Locking [packages] dependencies…
Updated Pipfile.lock (96e4fc)!
pipenv lock --clear  83.75s user 12.97s system 13% cpu 12:13.08 total

So I guess it wasn't a docker thing after all; scikit-learn was getting cached systemwide, which wasn't happening for docker. I'm reinvestigating the network stuff, because my pip cache is now 1.1G and I'm starting to think my local mirror setup wasn't working in docker.

For the record, replacing scikit-learn = {extras = ["alldeps"]} with requests = "*", I get:

Virtualenv location: /root/.local/share/virtualenvs/tmp-XVr6zr33
Locking [dev-packages] dependencies…
Locking [packages] dependencies…
Updated Pipfile.lock (7b8df8)!
 
real    0m14.732s
user    0m5.226s
sys     0m0.505s

@jleclanche jleclanche changed the title Package locking is crazy slow in docker Package locking is crazy slow for scikit-learn Mar 25, 2018
@jleclanche
Contributor Author

I'm leaving the debugging here, have to take care of RL for a while -- it's tough getting network logging set up here; I might be better off using Wireshark...

The good news is it's now apparent why re-locking in docker would take so long: Because ~/.cache is ephemeral in docker, every pipenv call that uses that cache would have to redownload everything. I still think there is a legitimate issue here though: If both the lockfile and the venv are present and set up, pipenv should not have to redownload anything, even if there's no cache.

@uranusjr
Member

@jleclanche Ah, that makes sense. Thanks for the hard work! Unfortunately Pipenv really needs to download things because of how Python's packaging system currently works. There's no way to know what dependencies a package has without actually running its setup.py, and the only way to get the file is from PyPI (or the cache). We would need to jump through a lot of hoops to prevent this from happening, if it is at all possible to prevent.

Can you verify whether mounting the host's cache directory into Docker works (using -v; I'm not sure what the correct term is here)? At least we could add a section to the documentation describing possible pitfalls and workarounds.
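Something along these lines should exercise it (a sketch with assumed image and paths; note the ownership caveat in the reply below):

# mount the host's pip and pipenv caches into the container and time a clean lock
docker run --rm -it \
  -v "$PWD/Pipfile":/app/Pipfile \
  -v "$HOME/.cache/pip":/root/.cache/pip \
  -v "$HOME/.cache/pipenv":/root/.cache/pipenv \
  -w /app \
  python:3.6-stretch \
  bash -c "pip install pipenv && time pipenv lock --clear"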

@jleclanche
Contributor Author

There’s no way to know what dependencies a package has without actually running its setup.py

On first download, sure, but that first download is fine anyway because you have to download the package somehow.

How about adding the list of each package's dependencies to Pipfile.lock? That seems to be the missing key here. Coming back to yarn, that is in fact exactly what yarn does:

...

aws4@^1.2.1, aws4@^1.6.0:
  version "1.6.0"
  resolved "https://registry.yarnpkg.com/aws4/-/aws4-1.6.0.tgz#83ef5ca860b2b32e4a0deedee8c771b9db57471e"

babel-code-frame@^6.22.0, babel-code-frame@^6.26.0:
  version "6.26.0"
  resolved "https://registry.yarnpkg.com/babel-code-frame/-/babel-code-frame-6.26.0.tgz#63fd43f7dc1e3bb7ce35947db8fe369a3f58c74b"
  dependencies:
    chalk "^1.1.3"
    esutils "^2.0.2"
    js-tokens "^3.0.2"

babel-core@^6.26.0:
  version "6.26.0"
  resolved "https://registry.yarnpkg.com/babel-core/-/babel-core-6.26.0.tgz#af32f78b31a6fcef119c87b0fd8d9753f03a0bb8"
  dependencies:
    babel-code-frame "^6.26.0"
    babel-generator "^6.26.0"
    babel-helpers "^6.24.1"
    babel-messages "^6.23.0"
    babel-register "^6.26.0"
    babel-runtime "^6.26.0"
    babel-template "^6.26.0"
    babel-traverse "^6.26.0"
    babel-types "^6.26.0"
    babylon "^6.18.0"
    convert-source-map "^1.5.0"
    debug "^2.6.8"
    json5 "^0.5.1"
    lodash "^4.17.4"
    minimatch "^3.0.4"
    path-is-absolute "^1.0.1"
    private "^0.1.7"
    slash "^1.0.0"
    source-map "^0.5.6"

babel-generator@^6.26.0:
  version "6.26.1"
  resolved "https://registry.yarnpkg.com/babel-generator/-/babel-generator-6.26.1.tgz#1844408d3b8f0d35a404ea7ac180f087a601bd90"
  dependencies:
    babel-messages "^6.23.0"
    babel-runtime "^6.26.0"
    babel-types "^6.26.0"
    detect-indent "^4.0.0"
    jsesc "^1.3.0"
    lodash "^4.17.4"
    source-map "^0.5.7"
    trim-right "^1.0.1"

babel-helper-builder-binary-assignment-operator-visitor@^6.24.1:
  version "6.24.1"
  resolved "https://registry.yarnpkg.com/babel-helper-builder-binary-assignment-operator-visitor/-/babel-helper-builder-binary-assignment-operator-visitor-6.24.1.tgz#cce4517ada356f4220bcae8a02c2b346f9a56664"
  dependencies:
    babel-helper-explode-assignable-expression "^6.24.1"
    babel-runtime "^6.22.0"
    babel-types "^6.24.1"
...

@jleclanche
Contributor Author

Can you verify whether mounting the host’s cache directory into Docker works (using -v; I’m not sure what the correct term is here)?

Bind mounts. It would (probably) work, but there are other issues with that. Namely, the docker service runs as root (meaning it writes files as root), and the container itself runs as root as well (as is common). Pip will refuse to use a cache from a mismatching UID, so mounting ~1000/.cache/pip will do nothing unless the container itself runs as UID 1000, and even if it does, it will write files as root, which is not something you want.
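One possible workaround sketch (an assumption, not something verified in this thread): run the container as the host user, with a host directory as HOME, so both pip and pipenv see a cache they own and files written back to the bind mounts keep the host UID. This assumes an image (my-pipenv-image is hypothetical) that already has pipenv installed:

docker run --rm -it \
  --user "$(id -u):$(id -g)" \
  -e HOME=/home/host \
  -v "$HOME":/home/host \
  -v "$PWD":/app -w /app \
  my-pipenv-image \
  pipenv lock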

@gsemet
Contributor

gsemet commented Mar 25, 2018

I usually use pipenv in docker and indeed, you do not lock inside the docker container. You run --deploy to build your application from the Pipfile.lock. If you want to update the lock, you run pipenv update/lock from outside of the docker container (what is the use of running it inside?), but you do it less often and in a controlled way (you really need to check the locked file afterwards) and run a new set of non-regression tests. Once committed, you then run your app with the updated lock file.

TL;DR: lock the dependencies of your app from your dev env, run from within docker using --deploy

PS: this should maybe be better documented in the pipenv docs; lots of people make this mistake, no?

PPS: I tend to favour RUN pipenv install --deploy without the --system. I prefer not to mess with the system python, even inside docker. And you put ENTRYPOINT ["pipenv", "run"] so that you can use CMD in a transparent way.
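A minimal Dockerfile sketch of that workflow (the project layout and the CMD are assumptions for illustration):

FROM python:3.6-stretch
RUN pip install pipenv
WORKDIR /app
# copy only the manifests first so the dependency layer caches well
COPY Pipfile Pipfile.lock ./
# install strictly from the committed lockfile; fails if it is out of date
RUN pipenv install --deploy
COPY . .
ENTRYPOINT ["pipenv", "run"]
# hypothetical default command for a Django project
CMD ["python", "manage.py", "runserver", "0.0.0.0:8000"]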

@jleclanche
Contributor Author

@gsemet Docker is my dev env. I'm not using pipenv in production at the moment (and once I am, I will be following that workflow indeed).

I agree with the premise and with documenting it, but let's keep this issue on topic. I, and the whole Python community alike, want pipenv to be blazing fast if we're going to use it daily :)

Waiting to hear some thoughts re. adding dependencies to the lockfile.

@tsiq-oliver
Contributor

I +1'ed @jleclanche's most recent comment, but to be specific on which part: Docker as a dev env for Python is a very powerful idea. Pipenv is like the missing link that makes this workflow plausible, so going forward, it would be great to understand the cause of the apparent perf issue.

@gsemet
Contributor

gsemet commented Mar 25, 2018

I first saw that in the case of Traefik, to carry all the Go deps and build tools in a reproducible way. I'd like to better understand what the "docker as dev env" approach has that a traditional pipenv/virtualenv does not provide already. Except for system dependencies that cannot be packaged inside the venv (ex: python gtk/dbus/...), when correctly controlled, pipenv and virtualenv can provide a fully reproducible environment.

@jleclanche
Contributor Author

I'd like to better understand what the "docker as dev env" approach has that a traditional pipenv/virtualenv does not provide already

I'll invite you to look at the environment I posted in the original issue. It is about system dependencies.

Happy to discuss docker further by email if you have questions but at this point I'd like to ask people to keep it out of this particular Github thread and stay on topic.

@techalchemy
Member

@jleclanche the thing is, explicitly calling lock is a specific instruction to pipenv to inform it that the dependency graph needs to be recalculated. By nature that requires that we ask our index for updated dependencies. If you want to trust the lockfile as-is and only install a new package, then you shouldn't explicitly call pipenv lock but rather pipenv install.

As I said in the paragraph above, explicit calls to pipenv lock are a way of specifically telling pipenv to re-download and recalculate the dependency graph. I'm not completely sure about the best way forward, so I am going to summon the magical genie @ncoghlan -- do you have any thoughts or concerns about storing the dependency graph in a nested format (I know we already agreed to stop doing that), but this time organized hierarchically by top-level dependency? You've thought a lot more about this than I have; can we safely store some info about top-level packages such that we can trust their dependency graph if re-locking wouldn't update that specific package? That would save the long setup times folks are seeing when dealing with ephemeral ~/.cache folders in docker containers.

The concern I would have here is that if we ever decide to flatten the dependency graph to sub-dependencies, we're right back to square one.

@ncoghlan
Member

Something we would like to be faster is pipenv lock --keep-outdated, where we're only updating the lock file to account for changes in Pipfile, not for new releases on PyPI. Even for the plain pipenv lock case, it would be nice to avoid the download to recheck the dependencies of any releases which are the same version we already locked last time.

I don't think that implies storing hierarchical dependency metadata, though - we just need a cache of "project-name-and-version" -> "declared-dependencies" metadata, similar to the way the lock file already stores "project-name-and-version" -> "expected-artifact-hashes".

That way when pipenv starts a lockfile update operation, it can read the old lock file to prepopulate an internal cache, and only have to resort to the local artifact cache or the index server for different versions of packages (wherever those may appear in the overall dependency hierarchy).
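Purely for illustration (a hypothetical shape, not the actual Pipfile.lock format): each locked entry would additionally carry the dependencies it declared at lock time, so a re-lock of the same version never needs the sdist/wheel again, e.g.:

"requests": {
    "version": "==2.18.4",
    "hashes": ["sha256:..."],
    "dependencies": [
        "certifi>=2017.4.17",
        "chardet>=3.0.2,<3.1.0",
        "idna>=2.5,<2.7",
        "urllib3>=1.21.1,<1.23"
    ]
}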

@techalchemy
Member

@ncoghlan under that approach, what happens if we install a new package and run into a dependency conflict? Should we be caching the per-top-level-package graph somewhere as well with the un-pinned specs so that we can check to see if conflicts can be resolved without re-downloading?

@ncoghlan
Member

@techalchemy I'd personally separate more efficient conflict resolution out as a "later" problem, since you're getting into the same kind of territory as https://github.com/pradyunsg/zazo at that point.

The initial naive cache wouldn't help with that; it would purely be about making "Get me the declared dependencies for version X of project Y" faster when both the previous lock file and the updated lock file include the same version. In the --keep-outdated case, that should get a lot of hits, and even for the upgrading case, it will still get hits for packages that haven't changed version.

@exequiel09

Even when not inside docker, locking packages is too slow. This was fast af during pipenv v9. It's not only scikit-learn; installing other packages is affected too.

@techalchemy
Member

@exequiel09 I'm not sure this was ever exactly fast. The tradeoff we made was to make sure we include dependencies from setup.py files which are not specific to our own system when crawling the dependency graph, in order to build a robust resolver. This means we have to grab any possible candidates from the specified index and resolve them.

I am not sure how we can speed this up, but I agree it is quite slow.

@techalchemy
Member

@jleclanche FYI, if you copy your pipenv cache (~/.cache/pipenv on POSIX) it will be speedy

@techalchemy
Member

Closing for now -- the recommendation is to mount persistent cache volumes for pip and pipenv.

@jleclanche
Contributor Author

For the record, copying the cache isn't enough because of permissions. What I've done is share the project folder, then put the .venv in the project with PIPENV_VENV_IN_PROJECT and set XDG_CACHE_HOME to that venv:

    environment:
      PIPENV_VENV_IN_PROJECT: "1"
      XDG_CACHE_HOME: "/path/to/project/source/.venv/.cache"

@jleclanche
Contributor Author

Note that PIPENV_CACHE_DIR does not need to be explicitly set since XDG_CACHE_HOME is. AFAIK there's no way to tell pip itself where to cache.

@uranusjr
Member

@jleclanche There is PIP_CACHE_DIR.
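For example, extending the compose snippet above (the path is an assumption):

    environment:
      PIP_CACHE_DIR: "/path/to/project/source/.venv/.cache/pip"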
