pip should record the original index url or find links url of the package #10736
Comments
Please clarify this statement. What do you mean by "alternative" here? It's a packaging/distribution error if two wheels or sdists, claiming to be for the same project/version, are different. So I don't understand how it can be important to know where a package came from, for any legitimate usage. |
There are various places where this might come into play -
In cases like these and others, it is vital to know where pip downloaded the package from, because you cannot directly trace a package/version (plus local version, in cases like the above) to the package repository it originated from if that is not the default repo set in the pip config. It also cannot be determined from the pip config, since pip could have been given extra options on the command line, such as pip3 install torch==1.10.1+cu113 torchvision==0.11.2+cu113 torchaudio==0.10.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html (from https://pytorch.org/).
Those are all the sort of "not legitimate" usages I tried to point out in my reply. If the builds differ, they should be distinguished using different local version identifiers added to the version, or something similar. Or the install should be from a direct URL rather than a name/version specifier, if the file has to come from a particular location. Maybe there are issues with doing things like this. But these are the principles on which the current packaging standards are based, and if they aren't sufficient, this should be fixed by changing the standards that all tools follow, not by changing a single tool (pip) to support a non-standard usage. Note that the torch example you give uses a local identifier, so they are following the standards. But you shouldn't be assuming that the package must come from So maybe the issue here isn't with how packages like torch distribute their binaries, but how your workflow is set up to manage package sources. |
Again, the goal of this ticket is to find provenance data so that we can construct an accurate bill of materials for a given set of packages, which can later be scanned and identified. Even if you have a package with a version and local version identifier, it's still impossible to map that back to where it was downloaded from if you want to reproduce your environment.
Yes, and it is exactly for cases like these that I wish to record where pip downloaded the original package from. For example, say you are given a package and you would like to verify its integrity after the fact, or reproduce the original environment. You cannot do that unless you know the original source the package was downloaded from.
I am not asking pip to support anything other than recording some metadata about where it fetched packages from. It will not change the install behavior, nor will it support any new ways of installing packages. This is just to have a record or audit log of the operations that pip performed. It can also be stored in some pip-specific file instead of dist-info, in case that violates certain specs.
Pip doesn't provide that level of audit trail for where installed packages come from. You need to manage that sort of provenance outside of pip, IMO. Longer term, you may be interested in https://discuss.python.org/t/pep-665-take-2-a-file-format-to-list-python-dependencies-for-reproducibility-of-an-application/11736 which is proposing a reproducible lockfile format that might suit your needs. |
As I understand this, this request is born out of wanting to generate an SBOM (software bill of materials), which is increasingly being used in corporate spaces to track what pieces of software are being used. anchore/syft#680 is probably where this came out of, and that project tries to generate the SBOM post-facto IIUC. I tend to agree that this isn't something that pip can solve on its own -- this needs someone to look into how this metadata should be captured for Python packages as a whole (i.e. design a general standard, similar to https://www.python.org/dev/peps/pep-0610/; expanding something like that to non-direct-url installations).
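For context on what PEP 610 already covers: a direct-URL install leaves a direct_url.json file in the package's dist-info directory, while an install by name from an index or find-links page does not. The following is a minimal sketch of reading that file with the standard library; the helper names origin_from_direct_url and recorded_origin are made up for illustration.

```python
import json
from importlib import metadata
from typing import Optional


def origin_from_direct_url(text: Optional[str]) -> Optional[str]:
    """Return the "url" field of a PEP 610 direct_url.json document, if any."""
    if text is None:
        return None
    return json.loads(text).get("url")


def recorded_origin(dist_name: str) -> Optional[str]:
    """Look up direct_url.json for an installed distribution.

    Returns None when the distribution is not installed, or when it has no
    direct_url.json (for example, it was installed by name from an index).
    """
    try:
        dist = metadata.distribution(dist_name)
    except metadata.PackageNotFoundError:
        return None
    # read_text() returns None when the file is absent from dist-info.
    return origin_from_direct_url(dist.read_text("direct_url.json"))
```

A None result for a package installed from an index or a find-links page is exactly the gap this issue is about.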
/cc @woodruffw who worked on https://github.com/trailofbits/pip-audit/, and likely has a better idea of what generating an SBOM for Python environments looks like / needs. |
Ah, that's interesting as background. And that does tend to push me even further towards thinking that this should be handled outside of pip. Organisations which want this sort of tracking can mandate that a specific tool is used for all installs, and that tool can layer the necessary audit trail management on top of the basic install. I also agree with @pradyunsg in terms of being uncomfortable with the idea of volunteers designing and developing, and even more so maintaining, such a mechanism, which is directed squarely at commercial users of pip. Even if someone came up with funding for the development of such a feature within pip, it would still be volunteers handling the support, and dealing with frustrated users when their auditors are pushing for data they don't have. I'd much rather see this sort of requirement satisfied by a commercial, paid-for service that uses pip internally than have it become a pip feature.
That metadata would need to be standardised, and only then would pip implement that standard. If someone wants to propose such a metadata standard, then that's perfectly fine. In that case we'd do what the standard says, and it wouldn't be on us as volunteers to debate or decide whether the metadata we write addresses the "bill of materials" requirement.
With #53 and the new Note that the format for that file is experimental for now (to allow us to make changes based on initial feedback) and that this will need to live in the "build" tooling that you have -- you'll need to wrap pip with something that ensures the installation report is generated by pip, and that transforms the information in the installation report into whatever SBOM format you like/use. I'll flag that https://github.com/trailofbits/pip-audit exists as well. The relevant documentation on the installation report format is available at https://pip.pypa.io/en/stable/reference/installation-report/
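As a rough illustration of the wrapping described above, here is a minimal sketch that pulls per-package origin URLs out of an already-parsed installation report. It assumes the report layout described in the linked documentation (a top-level "install" list whose items carry "metadata" and "download_info"). The function name origins_from_report is hypothetical, and the inline sample report is fabricated purely to exercise the code.

```python
from typing import Any, Dict, List


def origins_from_report(report: Dict[str, Any]) -> List[Dict[str, str]]:
    """Collect name, version and download URL for each item in a report."""
    rows = []
    for item in report.get("install", []):
        meta = item.get("metadata", {})
        info = item.get("download_info", {})
        rows.append(
            {
                "name": meta.get("name", ""),
                "version": meta.get("version", ""),
                # Missing URLs become "" rather than raising, since the
                # report format is documented as experimental.
                "url": info.get("url", ""),
            }
        )
    return rows


# Made-up sample in the assumed shape, not real pip output.
sample_report = {
    "version": "1",
    "install": [
        {
            "metadata": {"name": "example-pkg", "version": "1.0"},
            "download_info": {"url": "https://example.invalid/example_pkg-1.0-py3-none-any.whl"},
        }
    ],
}

for row in origins_from_report(sample_report):
    print(row["name"], row["version"], row["url"])
```

In practice the dictionary would come from json.load on a file written by something like pip install --dry-run --ignore-installed --report report.json, and the rows would then be mapped to the SBOM format of choice.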
I labeled this issue with |
For folks who are following this issue thread, there exists a PEP for such a standard: https://discuss.python.org/t/pep-710-recording-the-provenance-of-installed-packages/25428 |
What's the problem this feature will solve?
Currently, users may download packages from various package repositories (apart from PyPI). These repositories may contain alternative versions of packages that are on PyPI. In order to better capture origin information, pip should record which package repository it downloaded a specific package from.
Describe the solution you'd like
pip should record this information in the dist-info or egg-info folders it generates
Alternative Solutions
Do nothing
Additional context
This would be useful in capturing the original info and provenance data about pip packages. Specifically, package-url (a well-defined package ID spec) relies on a parameter called repository_url to define this information. If pip records this information, downstream SBOM generation tools can use it. Related issue - anchore/syft#680
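To make the repository_url point concrete, here is a minimal sketch of how a downstream tool might build a package-url once the origin is known. The helper pypi_purl is hypothetical. The name normalization and percent-encoding are a simplified reading of the package-url spec and should be checked against it before real use.

```python
from urllib.parse import quote


def pypi_purl(name: str, version: str, repository_url: str = "") -> str:
    """Build a pkg:pypi package-url, optionally with a repository_url qualifier."""
    # Assumption: lowercase the name and replace underscores with dashes.
    normalized = name.lower().replace("_", "-")
    purl = f"pkg:pypi/{quote(normalized, safe='')}@{quote(version, safe='')}"
    if repository_url:
        # Assumption: the qualifier value is fully percent-encoded.
        purl += f"?repository_url={quote(repository_url, safe='')}"
    return purl


print(pypi_purl("torch", "1.10.1+cu113", "https://download.pytorch.org/whl/cu113/torch_stable.html"))
```

Without a recorded origin, the tool can only emit the qualifier-free form, which does not distinguish a PyPI download from one fetched elsewhere.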