Idea: When backtracking prefer packages with METADATA file available #12035

Open
1 task done
notatallshaw opened this issue May 18, 2023 · 12 comments
Labels
S: needs triage Issues/PRs that need to be triaged type: feature request Request for a new feature

Comments

@notatallshaw
Member

notatallshaw commented May 18, 2023

What's the problem this feature will solve?

A common user complaint during heavy backtracking is that Pip downloads lots of large files.

Describe the solution you'd like

PEP 658 has just been rolled out on PyPI; once it starts backfilling, a lot of packages will hopefully have an associated METADATA file.

To some extent Pip could prefer those packages when backtracking. This criterion would be less important than whether the package is part of a conflict, but perhaps more important than most other backtracking preference criteria.

It may have a few benefits:

  1. Less data needs to be downloaded by the user
  2. Less time needs to be spent building packages to extract metadata
  3. Backtracking is less likely to go back to releases so old that they do not support, or do not have correct, METADATA files

If we look at an example like #12028 (comment), then hopefully either the error is arrived at much quicker, or a requirements solution is found, because Pip prefers not to backtrack to awscli-0.8.0, whose metadata cannot be extracted.

Alternative Solutions

Do nothing; Pip is still likely to get lots of benefits from METADATA files being available.

Additional context

I'm making a broad assumption that packages with accessible METADATA files are better for dependency resolution than packages without. This could be untrue, or at least untrue in specific cases.

@pradyunsg
Member

I'm not sure that's particularly valuable. pypi.org will eventually backfill this metadata across all releases and, IIRC, we need to decide on the preference order before this information is available.

We could prefer candidates, but there's a well-defined ordering for trying those, so I don't think this is a tractable approach. 😅

@notatallshaw
Member Author

notatallshaw commented May 20, 2023

> pypi.org will eventually backfill this metadata across all releases

I guess this is where I am a little confused: would PyPI be able to extract metadata from this release: https://pypi.org/project/awscli/0.8.0/#files? If so, how? Pip seemingly can't.

> IIRC, we need to decide on the preference order before this information is available.

If this approach would be helpful, I can think of possible workarounds, e.g. making what is passed to get_preference customizable, such as having a causes object with a property like has_metadata_file that is lazily evaluated and cached.
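
A minimal sketch of the "lazily evaluate and cache" idea, assuming a hypothetical CauseInfo wrapper (not an existing pip or resolvelib class) and a made-up fake_index_lookup that stands in for the real index query. The lookup result for each name is invented for illustration only:

```python
from functools import cached_property


class CauseInfo:
    """Hypothetical wrapper around one project involved in a conflict."""

    def __init__(self, name, fetch_has_metadata):
        self.name = name
        # Callable that performs the (possibly slow) index lookup.
        self._fetch_has_metadata = fetch_has_metadata

    @cached_property
    def has_metadata_file(self):
        # Runs only on first access; the result is then stored on the instance.
        return self._fetch_has_metadata(self.name)


lookups = []  # records every real lookup, to show the caching


def fake_index_lookup(name):
    lookups.append(name)
    return name != "awscli"  # invented answer, for illustration only


cause = CauseInfo("pylint", fake_index_lookup)
```

Nothing is looked up when `cause` is created. The first read of `cause.has_metadata_file` calls `fake_index_lookup` once, and later reads reuse the stored value.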

@pfmoore
Member

pfmoore commented May 20, 2023

> I guess this is where I am a little confused: would PyPI be able to extract metadata from this release: https://pypi.org/project/awscli/0.8.0/#files?

No, because it's sdist-only. But all wheels will ultimately have a metadata file, once the backfilling is complete. And we select wheels as candidates over sdists whenever possible already.

Or are you suggesting that pip would install an older version with a wheel (and hence with a metadata file) in preference to a newer version with only a sdist? Because that would be a significant breaking change. It's basically the same as making --prefer-binary :all: the default. For that, you might want to review #9140 (which is even stronger, making --only-binary :all: the default), which is something we've agreed to in principle, but it's waiting on getting funding (to handle the transition in particular).

@notatallshaw
Member Author

> No, because it's sdist-only. But all wheels will ultimately have a metadata file, once the backfilling is complete. And we select wheels as candidates over sdists whenever possible already.

But I don't think Pip prefers one project's release over another project's based on whether it has a wheel or not, right?

> Or are you suggesting that pip would install an older version with a wheel (and hence with a metadata file) in preference to a newer version with only a sdist?

No, I am suggesting that when backtracking you need to choose which project to backtrack on. In #12028, for example, Pip needs to decide whether to backtrack on awscli or pylint; as all other things are equal, Pip ends up going with the order the user provided. What I am suggesting is that, with the advent of PEP 658, it makes sense to consider whether a particular release has a METADATA file available before considering user order.

This will save build time and download time, as releases that are sdist-only will not be preferred. And I believe that in general it will be more likely to find a solution.

> Because that would be a significant breaking change. It's basically the same as making --prefer-binary :all: the default. For that, you might want to review #9140 (which is even stronger, making --only-binary :all: the default), which is something we've agreed to in principle, but it's waiting on getting funding (to handle the transition in particular).

No, this would not be a breaking change. It would only affect the preference of how to find a solution when backtracking; it would not choose older versions of the same project over newer versions of the same project.
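
To make the ordering concrete, here is a rough sketch of a preference key that checks wheel/METADATA availability before user order. ProjectState and preference_key are hypothetical names, not pip's actual provider code, and the has_wheel values and the third project are made up for illustration:

```python
from dataclasses import dataclass


@dataclass
class ProjectState:
    """Hypothetical summary of one project at the point of a conflict."""
    name: str
    is_conflict_cause: bool
    has_wheel: bool   # could equally be "has a METADATA file"
    user_order: int   # position in the user's requirements


def preference_key(p: ProjectState):
    # Lower tuples sort first: conflict causes, then projects whose
    # release has a wheel, then the order the user provided.
    return (not p.is_conflict_cause, not p.has_wheel, p.user_order)


projects = [
    ProjectState("awscli", True, has_wheel=False, user_order=0),
    ProjectState("pylint", True, has_wheel=True, user_order=1),
    ProjectState("requests", False, has_wheel=True, user_order=2),
]
ordered = [p.name for p in sorted(projects, key=preference_key)]
```

The key only ranks different projects against each other; it says nothing about which version of a single project is tried first, which is why this is not the same as changing version selection.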

@pfmoore
Member

pfmoore commented May 20, 2023

Ah, I get you now - thanks for the clarification.

Given backfilling, maybe we just prefer projects with (compatible) wheels over sdist-only ones? I'm handwaving a lot here as it's been a while since I reviewed that code and I need to refresh my memory on candidates vs projects vs releases vs... The main point I'm making is that as PyPI will ultimately backfill, let's plan for the future and think in terms of wheels rather than metadata files [1]. Even so, I don't know how effective such a preference might be - resolvelib's get_preference [2] was always fairly deep magic to me, I'm afraid.

Footnotes

  1. Yes, that has different effects on indexes that either don't supply metadata files, or only supply them for some wheels. But I think it's reasonable to optimise for what PyPI does.

  2. I assume that's what we're looking at here?

@notatallshaw
Member Author

Yeah, I am talking about get_preference, and in those terms I would describe it as "when two or more releases are part of a conflict cause, prefer one project's release that has a compatible wheel over a different project's release that is sdist-only".

I will try to come up with a PoC PR once PyPI has backfilled. I may be wrong in thinking it's possible to get this information in time without wildly rearchitecting Pip.

@notatallshaw
Member Author

notatallshaw commented Jan 7, 2024

I was thinking a bit about this and I believe it would be possible to do this in get_preference by iterating through candidates, if you could make the various implementations of candidates like LinkCandidate more lazy.

The problem now is that by inspecting candidates you end up downloading and building the sdists to get the information you need to avoid downloading and building the sdists.

My idea is that if you were only interested in information that could be determined from the sdist/wheel filename or simple API then you would not actually need to download the package metadata and inspect it. If possible, I think this could be used in other parts of the resolution process to speed things up without worrying about giant wheel downloads or build failures.
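
As a rough illustration of what can be known without downloading anything, here is a sketch that classifies a file from its filename and from a JSON Simple API file entry. The helper names are hypothetical and the sample entries are invented; the key names follow my reading of PEP 691 ("dist-info-metadata") and PEP 714 ("core-metadata"):

```python
def is_wheel(filename: str) -> bool:
    # Wheels always use the .whl extension; everything else is treated
    # as an sdist here (a simplification that ignores legacy formats).
    return filename.lower().endswith(".whl")


def has_separate_metadata(file_entry: dict) -> bool:
    # The value may be a bool or a dict of hashes; any truthy value is
    # taken to mean a separate METADATA file is advertised.
    for key in ("core-metadata", "dist-info-metadata"):
        if file_entry.get(key):
            return True
    return False


# Invented entries shaped like items of the "files" list in a JSON response.
wheel_entry = {"filename": "example_pkg-1.0-py3-none-any.whl", "core-metadata": True}
sdist_entry = {"filename": "example_pkg-0.8.0.tar.gz"}
```

Both checks only touch data that is already in the index response, so no wheel download or sdist build is triggered.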

@uranusjr
Member

Now that PyPI has backfilled most (all?) metadata in past wheels, would this simply become preferring projects with wheels over those that only publish sdists? I guess that’s still somewhat useful, but probably niche at best.

@pfmoore
Member

pfmoore commented Mar 28, 2024

I agree - the "downloading lots of large files" issue is solved by separate METADATA files, so unless there's evidence that this approach solves some other real-world issue, I'm not sure it's worth it.

@notatallshaw
Member Author

notatallshaw commented Mar 28, 2024

Yes, this is still very worthwhile. Essentially the idea is that, all other things being equal, metadata files are still less costly; large downloads were just one part of that. In fact I was convinced by @pfmoore that the distinction should be wheels/sdists, not metadata files/no metadata.

When comparing two candidates, one an sdist and the other a metadata file, sdists are significantly more expensive:

  1. sdists will generally be larger
  2. Extracting metadata from sdists is expensive (an arbitrarily long build process)
  3. sdist builds can fail, which breaks the whole resolution (which could be considered an infinite cost)

In the original example I gave, this likely would have saved the user's resolution, which broke because it failed while trying to build an sdist when it could have chosen a different project that had wheels: #12028 (comment).

And it's not just pip: this approach would make sense generally for dependency resolution algorithms for Python packages. uv suffers from different examples where it fails during backtracking on old sdists when it could have checked wheels first: astral-sh/uv#1560

To be clear though, this is not the same as the "prefer binary" option. This does not pick older versions of the same package; these candidate choices are always between different projects, and choosing one candidate or the other is otherwise considered relatively equal. sdists are definitely more costly than metadata/wheels.

@uranusjr
Member

Thinking about this a bit, preferring backtracking toward wheels may be a good idea regardless of the metadata situation. Versions without one are likely to be ancient and “not actually what you want” in practice (with a few exceptions e.g. pyspark). But this is armchair deduction and requires real-world evidence.

@notatallshaw
Member Author

> Versions without one are likely to be ancient and “not actually what you want” in practice (with a few exceptions e.g. pyspark). But this is armchair deduction and requires real-world evidence.

Well this is exactly what happens with both examples I linked to, one for pip, and one for uv (and uv's resolution isn't so significantly different that the same behavior can't happen with pip), so it would be easy to confirm with a real example if someone made a PR.

My plan, if no one else attempts this first, is to make a PR to implement this within the current framework of resolvelib / pip. It would iterate candidates in get_preference, change candidates to download lazily, and allow looking up attributes of a candidate to determine whether it's an sdist without downloading the file (this can be determined entirely from the filename, so this should be fine).

But I think there are more impactful changes to the resolution to focus on first, so I don't know when I'll try tackling this.
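
For anyone who wants to pick this up, a very rough sketch of the lazy candidate idea. LazyCandidate and next_is_sdist are hypothetical and are not pip's LinkCandidate or resolvelib's API; the download is simulated with a counter, and the filenames are only illustrative:

```python
class LazyCandidate:
    """Hypothetical candidate that defers any download until needed."""

    downloads = 0  # class-wide counter, only for demonstration

    def __init__(self, filename):
        self.filename = filename
        self._metadata = None

    @property
    def is_sdist(self):
        # Decided from the filename alone; nothing is fetched.
        return not self.filename.endswith(".whl")

    @property
    def metadata(self):
        if self._metadata is None:
            LazyCandidate.downloads += 1  # stands in for download/build
            self._metadata = {"source": self.filename}
        return self._metadata


def next_is_sdist(candidates):
    # Peek at the first candidate the resolver would try next.
    first = next(iter(candidates), None)
    return first is not None and first.is_sdist


awscli = [LazyCandidate("awscli-0.8.0.tar.gz")]
pylint = [LazyCandidate("pylint-1.0-py3-none-any.whl")]
```

A get_preference-style function could call `next_is_sdist` for each conflicting project and rank sdist-only ones later, and `LazyCandidate.downloads` stays at zero until something actually reads `metadata`.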
