
Poetry downloading same wheels multiple times within a single invocation #2415

Closed
bb opened this issue May 13, 2020 · 18 comments · Fixed by #7693
Labels
kind/enhancement Not a bug or feature, but improves usability or performance

Comments

@bb

bb commented May 13, 2020

  • I am on the latest Poetry version.
  • I have searched the issues of this repo and believe that this is not a duplicate.
  • If an exception occurs when executing a command, I executed it again in debug mode (-vvv option).

Issue

When adding a new dependency, it is downloaded multiple times; I observed three downloads, two of which are unnecessary.

Starting with a pyproject.toml as in the Gist given above, I run

poetry add https://github.com/oroszgy/spacy-hungarian-models/releases/download/hu_core_ud_lg-0.3.1/hu_core_ud_lg-0.3.1-py3-none-any.whl

Then I see the following output (XXX added as markers for explanation below):

Updating dependencies XXX
Resolving dependencies... (276.1s)

Writing lock file
XXX

Package operations: 0 installs, 7 updates, 0 removals

  - Updating certifi (2019.11.28 -> 2020.4.5.1)
  - Updating urllib3 (1.25.8 -> 1.25.9)
  - Updating asgiref (3.2.3 -> 3.2.7)
  - Updating pytz (2019.3 -> 2020.1)
  - Updating django (3.0.4 -> 3.0.6)
  - Updating hu-core-ud-lg (0.3.1 -> 0.3.1 https://github.com/oroszgy/spacy-hungarian-models/releases/download/hu_core_ud_lg-0.3.1/hu_core_ud_lg-0.3.1-py3-none-any.whl)
XXX  - Updating psycopg2-binary (2.8.4 -> 2.8.5)

At the positions where the marker XXX is inserted, the same 1.3GB download is done again and again.

Similarly, when adding another package later, XXX again marks the point at which the big download happens:

$ poetry add djangorestframework

Using version ^3.11.0 for djangorestframework

Updating dependencies
Resolving dependencies... (0.4s)

Writing lock file
XXX

Package operations: 1 install, 1 update, 0 removals

  - Installing djangorestframework (3.11.0)
  - Updating hu-core-ud-lg (0.3.1 -> 0.3.1 https://github.com/oroszgy/spacy-hungarian-models/releases/download/hu_core_ud_lg-0.3.1/hu_core_ud_lg-0.3.1-py3-none-any.whl)
XXX

I'd expect the file to be downloaded at most once and reused.

Slightly related but different issues: #999, #2094

@bb bb added kind/bug Something isn't working as expected status/triage This issue needs to be triaged labels May 13, 2020
@finswimmer finswimmer added kind/enhancement Not a bug or feature, but improves usability or performance and removed kind/bug Something isn't working as expected status/triage This issue needs to be triaged labels May 14, 2020
@dimbleby
Contributor

The first two downloads happen in

def get_package_from_url(cls, url: str) -> Package:
    file_name = os.path.basename(urllib.parse.urlparse(url).path)
    with tempfile.TemporaryDirectory() as temp_dir:
        dest = Path(temp_dir) / file_name
        download_file(url, str(dest))
        package = cls.get_package_from_file(dest)

which indeed uses a temporary location that is immediately thrown away.

Presumably the right thing to share with would be the artifact cache as used by the Chef?
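
A minimal sketch of what a persistent, URL-keyed download cache along those lines might look like; the cache location, the cached_download name, and the use of urllib.request.urlretrieve are assumptions for illustration, not Poetry's actual code:

import hashlib
import os
import urllib.parse
import urllib.request
from pathlib import Path

# Hypothetical cache root; Poetry's real artifact cache lives elsewhere.
CACHE_DIR = Path.home() / ".cache" / "example-artifacts"

def cached_download(url: str) -> Path:
    """Fetch url into a URL-keyed cache directory, reusing the file on later calls."""
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    file_name = os.path.basename(urllib.parse.urlparse(url).path)
    dest = CACHE_DIR / key / file_name
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(dest))  # stand-in for Poetry's download_file helper
    return dest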

@tall-josh

tall-josh commented Sep 15, 2022

What's the reasoning for dumping the downloads to a temp_dir as @dimbleby shows in the snippet? Is it so the cache doesn't blow out to a massive size?

I'd be happy to try and contribute. Naively, I'd check a cache wherever download_file is called (puzzle/provider.py and repositories/http.py), but there are likely some considerations I'm missing. If someone could advise, I could put together a PR.

@dimbleby
Contributor

Suspect that code fragment uses a temporary directory for no particularly good reason.

poetry has a cache of downloaded files that it uses during installation, as managed by the curiously named Chef class. I'd think that is the right thing to share with.

Couple of problems though:

  • the chef uses such things as the current interpreter version to decide where to put these files, which is an unwanted complication
  • it's not entirely clear how to refactor to make the chef cache available during solving

I'd start with an MR that updates the chef so that get_cache_directory_for_link only cares about the URL that the link is downloaded from - that should be straightforward, and will get maintainer opinion on whether this is a sensible track.

Then if that's accepted, follow up with some sort of rearrangement so that this cache can be shared by the chef and the solving code.
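
Roughly, a URL-only cache key of the kind suggested above could be derived like this; this is a sketch only, and the function name and directory layout are illustrative rather than the actual get_cache_directory_for_link implementation:

import hashlib
from pathlib import Path

def cache_directory_for_url(cache_root: Path, url: str) -> Path:
    """Derive a cache subdirectory from the download URL alone,
    ignoring interpreter version and other environment details."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    # Split the digest so no single directory accumulates too many entries.
    return cache_root / digest[:2] / digest[2:4] / digest[4:]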

@tall-josh

Thanks @dimbleby I'll take a look and see what I can do.

@tall-josh tall-josh mentioned this issue Sep 24, 2022
@dhdaines

This is a serious problem with packages like PyTorch, which are extremely large. Unless there's a workaround for this, I will definitely never use Poetry.

@rbracco

rbracco commented Oct 12, 2022

Any update on a fix for this? I really like poetry but locking or adding a new dependency now takes > 5 minutes because I have to download wheels for torch, torchaudio, and torchvision. Is there a short-term workaround while a more permanent fix is made? Thank you.

@neersighted
Member

I suspect many are reading this issue without actually having experienced the issue -- Poetry downloads Torch once for metadata + hashing, and a second time for actual installation. After the cache is created, Poetry will not re-download Torch. We are downloading distfiles more often than needed as two parts of the code do not share a common cache, but we are not downloading every time poetry add occurs or anything similar.

@rbracco

rbracco commented Oct 12, 2022

I suspect many are reading this issue without actually having experienced the issue -- Poetry downloads Torch once for metadata + hashing, and a second time for actual installation. After the cache is created, Poetry will not re-download Torch. We are downloading distfiles more often than needed as two parts of the code do not share a common cache, but we are not downloading every time poetry add occurs or anything similar.

Thanks for the reply. I am an active user of poetry running 1.2.1, and I do experience the issue: the pytorch wheel downloads every time I do add or lock, and it takes around 80 seconds to download.

Kazam_screencast_00002.mp4

@chopeen
Contributor

chopeen commented Oct 13, 2022

Every time I run poetry update in my project, a large spaCy model gets downloaded.

It is added to [tool.poetry.dependencies] this way:

en_core_web_lg = { url = "https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.0.0/en_core_web_lg-2.0.0.tar.gz" }

@nicolascedilnik

I think this is related: in a project I have these conditional URL dependencies defined

torch = [
    {url = "https://download.pytorch.org/whl/cu113/torch-1.12.1%2Bcu113-cp39-cp39-linux_x86_64.whl", markers = "sys_platform == 'linux'"},
    {url = "https://download.pytorch.org/whl/cpu/torch-1.12.1-cp39-none-macosx_10_9_x86_64.whl", markers = "sys_platform == 'darwin' and platform_machine == 'x86_64'"},
    {url = "https://download.pytorch.org/whl/cpu/torch-1.12.1-cp39-none-macosx_11_0_arm64.whl", markers = "sys_platform == 'darwin' and platform_machine == 'arm64'"}
]

Every poetry lock operation ends up redownloading the 3 wheels, which are quite large. Isn't there a way to have them cached by poetry?

@a-gn

a-gn commented Oct 24, 2022

I think this is related: in a project I have these conditional URL dependencies defined

torch = [
    {url = "https://download.pytorch.org/whl/cu113/torch-1.12.1%2Bcu113-cp39-cp39-linux_x86_64.whl", markers = "sys_platform == 'linux'"},
    {url = "https://download.pytorch.org/whl/cpu/torch-1.12.1-cp39-none-macosx_10_9_x86_64.whl", markers = "sys_platform == 'darwin' and platform_machine == 'x86_64'"},
    {url = "https://download.pytorch.org/whl/cpu/torch-1.12.1-cp39-none-macosx_11_0_arm64.whl", markers = "sys_platform == 'darwin' and platform_machine == 'arm64'"}
]

Every poetry lock operation ends up redownloading the 3 wheels, which are quite large. Isn't there a way to have them cached by poetry?

On my system also, this seemed to make Poetry re-download torch every time it resolved dependencies. It did not happen with other dependencies that were given by name (to be downloaded from PyPI, no URL).

Since PyTorch URLs have to be hard-coded to install properly and PyTorch's wheel takes more than 1GB, this prevents me from migrating the team to Poetry.

@neersighted
Member

Ah, looking at this, I realize that all the metadata caching happens in the repository layer. So if you're using direct URL dependencies, Poetry has no caching whatsoever. I personally got turned around here on whether this was a bug or as-designed behavior (currently, the latter is true).

Ideally the artifacts cache could be made agnostic to repositories so that it is keyed on URLs only and we can share it, as @dimbleby has mentioned. On top of that, I wonder if some mechanism to cache metadata (maybe a direct CachedRepository?) could be implemented, as all that code is currently tied to indexes.
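
As a rough illustration of that metadata-caching idea, a URL-keyed metadata store could look like the following; the class name and on-disk layout are hypothetical, not Poetry's actual classes:

import hashlib
import json
from pathlib import Path

class UrlMetadataCache:
    """Hypothetical store for package metadata keyed on the artifact URL, so repeated
    solves could skip re-downloading a file just to read its metadata."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def _path(self, url: str) -> Path:
        return self.root / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".json")

    def get(self, url: str) -> dict | None:
        path = self._path(url)
        return json.loads(path.read_text()) if path.exists() else None

    def put(self, url: str, metadata: dict) -> None:
        self.root.mkdir(parents=True, exist_ok=True)
        self._path(url).write_text(json.dumps(metadata))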

@jace-ys
Contributor

jace-ys commented Nov 10, 2022

I'm also experiencing this issue, and it's unfortunate as I now have to choose between installing an architecture-specific wheel like torch = { url = "https://download.pytorch.org/whl/cpu/torch-1.9.0-cp38-none-macosx_11_0_arm64.whl", markers = "platform_machine == 'arm64' and platform_system == 'Darwin'" }, which is quicker to install but makes dependency resolution super slow, or installing the full torch = "1.9.0", which takes longer to install but avoids the slow resolution times.

It would be great if there were caching for direct URL dependencies as well, as neither option is ideal right now 😞

@leoitcode

I'm having the same problem in my project: if any package is specified with { url = ... }, then poetry add, poetry lock, and poetry update download it again every time. As a temporary workaround, I'm using requirements.txt for URL packages and pyproject.toml for the rest while waiting for a fix.

@neersighted
Member

I think we've pretty firmly established what is going on and what is needed to improve -- I'd ask that people please refrain from "me too" as it's just adding noise right now.

@leoitcode

This comment was marked as off-topic.

@neersighted

This comment was marked as off-topic.


This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 29, 2024