Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't search secondary repositories if not required #5984

Conversation

dimbleby
Copy link
Contributor

Not sure whether this is controversial or just obviously sensible...

Always searching secondary sources is bad for performance: and generally just isn't what people mostly expect.

Here's a behaviour change that skips secondary sources altogether, for packages that were provided by primary sources.

It's possible that this was the intended behaviour all along - the testcase test_solver_does_not_choose_from_secondary_repository_by_default suggests as much. I'm not sure where the code that makes that test pass is! But anyway this is surely a more efficient way of achieving that goal.

this would fix eg #5959, #5096

@radoering
Copy link
Member

If I'm not wrong it is controversial. What if two sources provide different versions of the same package? That's a common use case.

@dimbleby
Copy link
Contributor Author

dimbleby commented Jul 10, 2022

What if two sources provide different versions of the same package?

Per the new testcase: if the primary can meet the requirement, then only versions from the primary are considered.

What does "secondary" even mean, if packages from secondary sources are considered although a primary source has met a requirement already?

The implication of this MR would be that if they care about getting a particular version of a package, folk should say so in their pyproject.toml. But if they've asked for a broad requirement that a primary source can satisfy: they should not expect poetry to go and look in secondary sources.

@dimbleby
Copy link
Contributor Author

dimbleby commented Jul 10, 2022

To expand on the trade that is being made here. This MR can introduce failures to find a solution in a case like this:

  • the immediate requirement is for any version of foo
  • primary source provides only foo 1.0, so that's all that we consider
  • later in the search we encounter something in conflict with foo 1.0
  • even though a secondary source has foo 2.0 which would have been fine: too bad, we now don't find that

The implicit claim of this MR is that this is an unusual case, which anyway can easily be solved eg by declaring that the requirement on foo is >=2.0, or by specifying the source for foo.

Most of the time when people are using secondary sources, surely it's because they want to get most of their packages from one place (probably PyPI), but they have a small number of private or otherwise special packages in some secondary place. This MR prioritises that use case.

@mkniewallner
Copy link
Member

While I don't have data to back that, I would tend to agree with @dimbleby, as I believe that the main usage of secondary is to declare private dependencies, or additional public ones not available on PyPI, for specific packages, for which users would generally declare the source in the dependency definitions.

If people still want the custom index to be searched for for all dependencies, as you report @radoering, wouldn't not setting secondary achieve that? Although the slight difference is that the custom index would take precedence over PyPI, but it possibly may be fine for most people.

Personally, before needing to use a custom index as secondary for private packages, I would have expected it to only be searched for for dependencies that explicitly declare it as their source, and was kind of surprised that it was not the case (and there's no way to work around that, AFAIK).

Even if there is a need to keep the existing behavior for some users, I still think that the changes in this PR are a more sensible default, since this is what, IMO, most people would expect.

@radoering
Copy link
Member

I agree about the main use case. However, another common use case is to have patched/modified versions of public available packages in your private repository. Especially, if these patches/modifications are not required for the latest version but only for some older ones and these are transient dependencies, this change might make things more complicated. Probably, this use case can still be covered by being more explicit.

If people still want the custom index to be searched for for all dependencies, as you report @radoering, wouldn't not setting secondary achieve that? Although the slight difference is that the custom index would take precedence over PyPI, but it possibly may be fine for most people.

I don't think so. That's exactly what is changed here. Secondary sources are not searched anymore if a dependency is found in the primary source. With the change you can't achieve searching all sources anymore. You have to be more explicit (which might be good or not) and pin a depdendency to a specific source.

That said I'm not completely against this change. I just have concerns it's a breaking change. If we decide to do this, we should at least consider having a setting that restores the old behavior.

@ngnpope
Copy link

ngnpope commented Jul 12, 2022

What about an isolated = true option for sources that are secondary = true? This would only use the source for the packages that explicitly use that source and never use it for any others. Adding such an option would provide a backward compatible way to enforce this behaviour.

There are some private packages that I only ever want installed from a private repository, never from PyPI - everything else can come from PyPI. After all, someone could register some package with the same name as one of my private packages in PyPI and a version of ^999.0.0. It's not just about performance, there's security too.


Semi-related, I had an issue recently where I have three packages referencing a private repository via a secondary source, everything else doesn't specify the source and I'd expect them to use the default, implicit pypi source. Weirdly, however, I had a problem when I added django-q[sentry]. The sentry extra adds django-q-sentry, but for some reason when updating the lock file it specified the source for django-q-sentry as the secondary source. This later caused installation to fail as the package doesn't exist under the secondary source. django-q was still pointing to the default source even though its extra package wasn't. I worked around this by adding an explicit dependency for django-q-sentry with source = "pypi". But it seems to point toward a bug in this behaviour of searching in all repositories. Alas, I haven't managed to work out a minimal reproducer yet.

@mayeroa
Copy link

mayeroa commented Jul 21, 2022

I would like to add that such a feature is important, specifically as a PyTorch user.

For a quick demo, I have some packages that need to be retrieved from pypi and some others (torch and torchvision) need to be retrieved from another source.
The pyproject.toml looks like this:

[tool.poetry]
name = "test-package"
version = "0.1.0"
description = ""
authors = ["test@gmail.com"]
readme = "README.md"

[tool.poetry.dependencies]
python = ">=3.8,<3.11"
numpy = { version = "~1.23", source = "pypi" }
rich = { version = "12.5.1", source = "pypi" }
torchvision = { version = "~0.13.0+cu116", source = "pytorch"}

[[tool.poetry.source]]
name = "pytorch"
url = "https://download.pytorch.org/whl/cu116/"
secondary = true

[build-system]
requires = ["poetry-core>=1.1.0.b3"]
build-backend = "poetry.core.masonry.api"

It results with the following output:

image

For torchvision dependencies, it will also try to use the secondary source.
Moreover, using the secondary source is extremely slow...

Having the proposed isolated keyword may be interested to only restrict specific packages to use the secondary source, instead of looking at all the sources and all the packages during the install.

Poetry version : 1.2.0.b3
Poetry-core version : 1.1.0.b3
Python version : 3.8.13

@pierresouchay
Copy link

Indeed, this is very important to us. As mentioned, on our side we want to use this to use private package and as mentioned, perf is really bad with secondaries (because the way the resolution is done with different APIs than the one's of pypi.

As long as a secondary is present somewhere, the performance drops completely

We patched poetry to have this behavior in: #5472

Here are our results:

Poetry 1.1.13 patched with #5472

poetry update -> ~30s

Poetry 1.1.14 (not patched)

poetry update -> ~1 min 30s

Poetry 1.2.0b3 (not patched)

poetry update -> ~1 min

@mkniewallner
Copy link
Member

What about an isolated = true option for sources that are secondary = true? This would only use the source for the packages that explicitly use that source and never use it for any others. Adding such an option would provide a backward compatible way to enforce this behaviour.

There are some private packages that I only ever want installed from a private repository, never from PyPI - everything else can come from PyPI. After all, someone could register some package with the same name as one of my private packages in PyPI and a version of ^999.0.0. It's not just about performance, there's security too.

This sounds like a reasonable proposal.

Regardless of what we decide, it is important to keep backward compatibility, so we could either:

  • provide isolated = true as suggested, or something similar, to specify that the repository should not be searched in
  • not search into secondary repositories by default, but provide an option to maintain the current behavior -> since this proposal would be a breaking change, I don't see it landing in a patch version (and I don't think we want to delay 1.2 even more for that specific change, but core team may have a more precise opinion on that)

Both proposals make sense, as long as in the end, users have an option to either opt in for this new feature, or can go back to the current behavior, especially since right now, if users have no other choice than to rely on a secondary source, and this source is slow network-wise, this could drastically slow down dependencies resolution.

Personally, letting breaking changes concerns aside, I still think that not searching in secondary sources is a more sensible default, because I think that most of the time, people use secondary sources for specific packages (whether they are private, or because some libraries are not published to PyPI), and may even assume that this is actually what Poetry does.

@dimbleby
Copy link
Contributor Author

Just noting that I'm not personally motivated to drive this one further - I never use secondary repositories myself, so taking this from a quick-n-dirty MR to a redesigned set of configuration options doesn't much interest me.

I think this has been useful in kicking the discussion and establishing that there is interest: but it will likely now need someone who cares more about this to step up and do some work if it is to go further.

@set92
Copy link

set92 commented Aug 5, 2022

A related problem with this, if someone doesn't have access to the secondary source it doesn't work at all, because it always needs to try all the sources.

In this example I didn't turn on the VPN so I didn't have access to the secondary source, but I didn't needed it. And when I tried to force apache-airflow to pipy, with source='pypi', it also breaks due to the derived packages trying to download from the secondary source.

pyproject.toml:

[tool.poetry.dependencies]
python = "~3.7.13"
apache-airflow = { version = "2.2.5", platform = "linux" } 

[[tool.poetry.source]]
name = "external_source"
url = "https://nexus.___.io/repository/___-pypi/simple"
secondary = true

poetry lock:

Updating dependencies
Resolving dependencies... (0.7s)

  RepositoryError

  403 Client Error: Forbidden for url: https://nexus.___.io/repository/___-pypi/simple/apache-airflow/

  at ~/.local/share/pypoetry/venv/lib/python3.8/site-packages/poetry/repositories/legacy_repository.py:393 in _get
      389│             if response.status_code == 404:
      390│                 return
      391│             response.raise_for_status()
      392│         except requests.HTTPError as e:
    → 393│             raise RepositoryError(e)
      394│ 
      395│         if response.status_code in (401, 403):
      396│             self._log(
      397│                 "Authorization error accessing {url}".format(url=response.url),

@dimbleby dimbleby force-pushed the dont-always-search-secondary-repo branch from 4db3ca1 to ac80f1d Compare August 28, 2022 09:37
@neersighted
Copy link
Member

neersighted commented Sep 5, 2022

I think, to preserve backwards compatibility, the long-term path forward is probably two different options (in my example, supplemental and private). I would propose something like the following, as an overview of behaviors:

  • No options means that the repository overlays PyPI, but we will still fall back to PyPI if necessary (e.g. suitable for internal mirrors), pip --extra-index-url-style. This is known internally as primary, and is the current behavior.
  • default = true means that the repository replaces PyPI, pip --index-url-style. This is the current behavior.
  • supplemental = true means that this source is only consulted after a lookup in the default (implicitly PyPI unless configured otherwise) and primary sources fail.
  • private = true means that the repo will only be considered for packages that are explicitly configured with source =, and should be preferred by users for private packages to avoid dependency confusion attacks.

secondary = true would be kept around as a deprecated option (likely with a warning), and would maintain the legacy behavior of being searched exhaustively for backwards compatibility.

@neersighted neersighted added area/solver Related to the dependency resolver status/needs-consensus Consensus among maintainers required impact/changelog Requires a changelog entry impact/deprecation Introduces or relates to a deprecation labels Sep 5, 2022
@graipher
Copy link

graipher commented Sep 6, 2022

secondary = true would be kept around as a deprecated option (likely with a warning), and would maintain the legacy behavior of being searched exhaustively for backwards compatibility.

What is the difference between that new supplemental and the old secondary? Is it that it is only consulted if a lookup actually fails, so no request is made during the discovery phase of the version solving?

@neersighted
Copy link
Member

neersighted commented Sep 6, 2022

supplemental would only be consulted after a failure to find packages in the primary and default sources. Currently, secondary repositories are unconditionally searched, even if another secondary (or default/primary) has results. We cannot change this behavior without breaking existing projects, but we can deprecate and migrate away from it.

@GuillaumeDesforges
Copy link

GuillaumeDesforges commented Sep 16, 2022

@neersighted great proposal, that would be a huge progress! Is there a PR to watch/contribute to for these changes?

@Daniel451
Copy link

@neersighted @graipher this is a very good proposal and would save me lots of headache. What is the progress on this?

@neersighted
Copy link
Member

@neersighted @graipher this is a very good proposal and would save me lots of headache. What is the progress on this?

It is a fairly hairy refactor, in order to introduce these new behaviors while permitting backwards compatibility. It likely wouldn't be looked at until 1.4, and no one has expressed interest in implementing it yet.

@Daniel451
Copy link

@neersighted I understand. Thanks for the update!

@neersighted
Copy link
Member

Closing in favor of #6713

@neersighted neersighted closed this Oct 5, 2022
@b-kamphorst
Copy link
Contributor

I think, to preserve backwards compatibility, the long-term path forward is probably two different options (in my example, supplemental and private). I would propose something like the following, as an overview of behaviors:

  • No options means that the repository overlays PyPI, but we will still fall back to PyPI if necessary (e.g. suitable for internal mirrors), pip --extra-index-url-style. This is known internally as primary, and is the current behavior.
  • default = true means that the repository replaces PyPI, pip --index-url-style. This is the current behavior.
  • supplemental = true means that this source is only consulted after a lookup in the default (implicitly PyPI unless configured otherwise) and primary sources fail.
  • private = true means that the repo will only be considered for packages that are explicitly configured with source =, and should be preferred by users for private packages to avoid dependency confusion attacks.

secondary = true would be kept around as a deprecated option (likely with a warning), and would maintain the legacy behavior of being searched exhaustively for backwards compatibility.

I've been doing some preparatory work to implement this. I'm now at the point where I write tests that express the desired behavior and found out that the private case needs to be made more explicit. Specifically: if a package is configured with source and targets a private repository, what does that mean for finding that package's dependencies?

Currently, such dependencies are not bound to the source, meaning that they may be retrieved from any configured repo including the source repo. Perhaps the most natural option is to, when we introduce private repos and a package's source is set to that private repo, its dependencies can be retrieved from all repositories including the private repo that is the package source (and no other private repos). However, there are alternative scenarios as well. For example, only might suggest that dependencies may not consider any private repo.

The first suggestion seems most appropriate. My private packages have private dependencies, so I'd expect that to work just fine. Still, I'd like to hear your thoughts before I make assumptions.

@neersighted
Copy link
Member

@b-kamphorst would you mind discussing this on #6713 instead? I'd like to centralize the discussion there.

@dimbleby dimbleby deleted the dont-always-search-secondary-repo branch September 9, 2023 11:09
Copy link

github-actions bot commented Mar 3, 2024

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 3, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area/solver Related to the dependency resolver impact/changelog Requires a changelog entry impact/deprecation Introduces or relates to a deprecation status/needs-consensus Consensus among maintainers required
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet