
poetry install does not populate cache #2203

Closed · 3 tasks done
seansfkelley opened this issue Mar 19, 2020 · 9 comments

Labels
kind/question User questions (candidates for conversion to discussion)

Comments

@seansfkelley

  • I am on the latest Poetry version.
  • I have searched the issues of this repo and believe that this is not a duplicate.
  • If an exception occurs when executing a command, I executed it again in debug mode (-vvv option).
  • OS version and name: macOS 10.14.6
  • Poetry version: 1.0.3
  • Contents of your pyproject.toml file:
[tool.poetry]
name = "poetry-test"
version = "0.1.0"
description = ""
authors = ["redacted"]

[tool.poetry.dependencies]
python = "^3.6"
pathlib2 = "^2.3.5"

[tool.poetry.dev-dependencies]

[build-system]
requires = ["poetry>=0.12"]
build-backend = "poetry.masonry.api"

Issue

poetry install does not populate the filesystem cache, but poetry add does. Aside from being a little surprising, this makes caching on CI machines effectively impossible since they only ever install and the cache never changes, so they always have to start with a blank slate.

A shell session demonstrating the issue:

$ poetry init

# ...snip...

Generated file

[tool.poetry]
name = "poetry-test"
version = "0.1.0"
description = ""
authors = ["<redacted>"]

[tool.poetry.dependencies]
python = "^3.6"

[tool.poetry.dev-dependencies]

[build-system]
requires = ["poetry>=0.12"]
build-backend = "poetry.masonry.api"


Do you confirm generation? (yes/no) [yes] yes

$ ls /Users/seankelley/Library/Caches/pypoetry/cache
# no output: the cache directory starts empty

$ poetry add pathlib2
Creating virtualenv poetry-test-D6aiWa4y-py3.6 in /Users/seankelley/Library/Caches/pypoetry/virtualenvs
Using version ^2.3.5 for pathlib2

Updating dependencies
Resolving dependencies... (0.7s)

Writing lock file


Package operations: 2 installs, 0 updates, 0 removals

  - Installing six (1.14.0)
  - Installing pathlib2 (2.3.5)

# the cache directory has some stuff in it from pypi
$ ls /Users/seankelley/Library/Caches/pypoetry/cache/repositories
pypi

# clear the cache manually
$ rm -r /Users/seankelley/Library/Caches/pypoetry/cache/repositories

# delete the virtualenv too for good measure
$ rm -r /Users/seankelley/Library/Caches/pypoetry/virtualenvs/poetry-test-D6aiWa4y-py3.6

$ poetry install
Creating virtualenv poetry-test-D6aiWa4y-py3.6 in /Users/seankelley/Library/Caches/pypoetry/virtualenvs
Installing dependencies from lock file


Package operations: 2 installs, 0 updates, 0 removals

  - Installing six (1.14.0)
  - Installing pathlib2 (2.3.5)


$ ls /Users/seankelley/Library/Caches/pypoetry/cache
# no output: the cache directory is empty even though we just installed stuff!
@seansfkelley added the kind/bug (Something isn't working as expected) label Mar 19, 2020
@bobwhitelock

Does anyone have a suggested workaround for this? I'd like to be able to cache my dependencies in CI, but this issue makes that difficult.

darwin added a commit to darwin/trezor-firmware that referenced this issue Oct 28, 2020
The problem: the firmware is currently built in 4 variants and
each variant downloads all submodules again, which wastes bandwidth and
is pretty slow on my machine.

Also poetry was downloading all its python deps again and again.

We create one docker data volume which persists. It is mapped inside
the container to '/root/.cache'. So all tools aware of this standard location
should naturally benefit. Poetry should benefit in the future when they
fix python-poetry/poetry#2203. Pip is already
using it today.

To improve the git submodules situation I adopted the following strategy.

1. On first run I clone the repo once into
/root/.cache/repos/trezor-firmware. Let's call it the canonical repo.
2. On each run I perform an initial update step which checks out the requested tag
and brings this canonical repo up to date (including its submodules).
3. When building each variant of the firmware, we copy the canonical repo
to /tmp/trezor-firmware instead of cloning it. This is faster. We do a copy
because we want to work from scratch, e.g. there should be no left-over
files from previous compilations, because the source canonical repo is clean.

To blow the caches one can run BLOW_CACHES=1 ./build-docker.sh
darwin added a commit to darwin/trezor-firmware that referenced this issue Nov 16, 2020
The problem: the firmware is currently built in 4 variants and
each variant downloads all submodules again which wastes bandwidth and
is pretty slow on my machine.

Also poetry was downloading all its python deps again and again.

We create one docker data volume which persists. It is mapped inside
the container to '/root/.cache'. So all tools aware of this standard location
should naturally benefit. Poetry should benefit in the future when they
fix python-poetry/poetry#2203. Pip is already
using it today.

To improve the git submodules situation I adopted the following strategy.

1. On first run I clone the repo once into
/root/.cache/repos/trezor-firmware. Let's call it the canonical repo.
In the case of a local repo, we can be even faster by doing a direct rsync from /local.

2. On each run I perform an initial update step which checks out the requested tag
and brings this canonical repo up to date (including its submodules).

3. When building each variant of the firmware, we overlay the canonical repo
in /repo/.cache/ws instead of copying or cloning it. This is instant.
Using an overlayfs also keeps the canonical repo read-only, and we can easily wipe any leftover build files/artifacts before each run.

To blow the caches one can run BLOW_CACHES=1 ./build-docker.sh
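
For illustration, a minimal sketch of the shared-cache idea from these commits, assuming a named Docker volume called poetry_cache and a build image called my-build-image (those names, and the build.sh entry point, are placeholders rather than parts of the actual trezor-firmware scripts):

# Create the persistent volume once, then mount it at the standard cache location
# so tools that honor ~/.cache (pip today, Poetry once this issue is addressed)
# can reuse downloads across container runs.
$ docker volume create poetry_cache
$ docker run --rm -v poetry_cache:/root/.cache my-build-image ./build.sh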
@dmvianna

I confirm the same behaviour using stock Debian containers.

@sdispater
Member

This is the expected behavior. poetry install will only use the information present in the lock file for installation.

The data in {cache-dir}/cache/repositories is the cache for the remote metadata so I am not sure why you would want it when doing poetry install.

The lock file is the source of truth for poetry install so that's what should be tracked for changes.

@seansfkelley
Author

I don't really understand the relevance of these parts of the comment:

poetry install will only use the information present in the lock file for installation.

(and)

The lock file is the source of truth for poetry install so that's what should be tracked for changes.

This issue is about caching, not correctness. I'm not objecting to what poetry install produces or how that relates to the lockfile (because it's correct); I'm just observing that it doesn't produce the side effects that I expected and that appear to be helpful for CI optimization.

As for the rest of your comment:

The data in {cache-dir}/cache/repositories is the cache for the remote metadata so I am not sure why you would want it when doing poetry install.

It's been a long time since I wrote this issue, so perhaps I misunderstood the cache or the behavior has changed, but IIRC the presence of that cache dramatically improved install times. As for why I want it, I said so in the original comment:

...this makes caching on CI machines effectively impossible since they only ever install and the cache never changes, so they always have to start with a blank slate.

To clarify further, I'm using Docker-based builds (in Travis), which archive a bunch of files on disk so they can be unarchived in later builds in fresh Docker containers, thereby skipping a bunch of work for those later builds.

If there is another disk location that would be useful for CI caching, instead of or in addition to that one, it would be helpful to know. I don't see any documentation along these lines.

It's also worth noting that I opened this issue before the addition of --remove-untracked, and I specifically did not want to cache the installed virtualenv because there was not yet a mechanism to remove any dependencies that existed in the Travis-provided archive but had been removed for the current build, thereby making the build impure. That should not be a problem now, but unfortunately I don't have any sizable projects I can try it out on to verify any performance improvements. That said, I imagine that the cached metadata can still help improve performance, especially in the case where it's expensive to retrieve.
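
For what it's worth, a rough sketch of that combination, assuming the virtualenv is kept inside the project (via the virtualenvs.in-project setting) so the CI archive step can pick up ./.venv; the archive/restore mechanism itself is whatever the CI system provides and is not shown:

# Keep the virtualenv at ./.venv so it lands in the CI cache archive.
$ poetry config virtualenvs.in-project true
# After restoring ./.venv from a previous build, reconcile it with the lock file,
# removing anything the archive contains that the lock file no longer tracks.
$ poetry install --remove-untracked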

@sdispater
Member

It's been a long time since I wrote this issue, so perhaps I misunderstood the cache or the behavior has changed, but IIRC the presence of that cache dramatically improved install times. As for why I want it, I said so in the original comment:

The data in {cache-dir}/cache/repositories is only relevant for the dependency resolution process, not the installation process. Basically, it contains cached remote metadata. It is no longer relevant at installation time because the lock file already contains that metadata and, as such, the cache no longer needs to be consulted.

That being said, there is a cache that matters for the installation: {cache-dir}/artifacts, which holds the cached distributions to avoid downloading them again on subsequent installations.
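
As a sketch of how that might be wired into CI (the archive filename is a placeholder; only the poetry config call below is an actual Poetry command):

# Locate the cache root, then archive just the artifacts subdirectory between builds.
$ CACHE_DIR=$(poetry config cache-dir)
$ tar -czf poetry-artifacts.tgz -C "$CACHE_DIR" artifacts
# ...store poetry-artifacts.tgz with the CI cache, and at the start of the next
# build extract it back into the same location:
$ tar -xzf poetry-artifacts.tgz -C "$CACHE_DIR"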

@bobwhitelock

That being said, there is a cache that matters for the installation: {cache-dir}/artifacts, which holds the cached distributions to avoid downloading them again on subsequent installations.

I didn't know (and couldn't find anything) about this directory when I originally commented on this issue, and I ended up caching my entire virtualenv to speed up CI builds instead. This works fine but isn't very granular, and also isn't really how virtualenvs are intended to be used.

If caching this directory would work instead, then possibly the only thing that could be improved for this issue is surfacing this directory in the docs? (If it's not there already; maybe I was just looking in the wrong place.)

@seansfkelley
Author

Thanks @sdispater, that is very helpful! I think between the existence of --remove-untracked and knowing about the artifacts folder, my performance problems around here should be solved (though again, I don't have a good project to try it out on right now).

@bobwhitelock I agree, having it documented somewhere would be really handy for those of us in charge of the build systems. I did a quick skim but no docs sections jumped out at me as being appropriate. Candidates: the entry on the cache-dir configuration, the FAQ, or a new section just about CI. Thoughts?

@dimbleby
Contributor

Per the comments in the thread, there's no bug here: this can be closed.

@neersighted added the kind/question (User questions; candidates for conversion to discussion) label and removed the kind/bug (Something isn't working as expected) label Oct 30, 2022

github-actions bot commented Mar 1, 2024

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions bot locked as resolved and limited conversation to collaborators Mar 1, 2024