Validate the contents of identity centric metadata #8635

Open
dstufft opened this issue Sep 30, 2020 · 11 comments
Labels
needs discussion (a product management/policy issue maintainers and users should discuss)


@dstufft
Member

dstufft commented Sep 30, 2020

Currently if I'm looking at a project on PyPI, it can be difficult to determine if it's "real" or not. I can look and see the user names that are publishing the project as well as certain key pieces of metadata such as the project home page, the source repository, etc.

Unfortunately, there's no way to verify that a project that has, say, https://github.com/pypa/pip as its home page is actually the real pip and not an imposter. The same goes for other URLs, email addresses, etc. It would therefore be useful to have some way to prove ownership of those URLs/emails, and either differentiate them in the UI somehow, or hide them completely unless they've been proven to be owned by one of the publishing users.

@jamadden
Contributor

I recall issues like this coming up at least once, if not a few times, over in pypi-support. Someone would fork a repository, change the name in setup.py, and then upload it to PyPI, with or without any other changes. All of the documentation and other links would remain pointed to the originals. This was confusing for users and frustrating to owners of the original package.

So I'm 👍 for some sort of blue verified checkmark or something from that perspective.

With my publisher hat on, though, I would hope this would be completely automated and I wouldn't have to do anything special to earn that blue checkmark.

@di added the "needs discussion" label Sep 30, 2020
@ewjoachim
Contributor

ewjoachim commented Oct 2, 2020

One idea: we could add a blue checkmark to every link in the sidebar whose target page contains a link back to the project's PyPI page or the text pip install <project-name>. This would force us to load those links on the server, but it would be zero-effort for most packages.
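As a rough illustration of that check (not anything Warehouse actually does), here is a minimal sketch that takes the already-fetched HTML of a linked page and looks for either a pypi.org project link or a "pip install" line naming the project. The function names and the regular expressions are assumptions made for this example; fetching the page is deliberately left out.

```python
import re
from html.parser import HTMLParser


class _LinkCollector(HTMLParser):
    """Collects <a href> values and visible text from an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)


def normalize(name):
    # PEP 503 style normalization, so "My_Project" and "my-project" compare equal.
    return re.sub(r"[-_.]+", "-", name).lower()


def page_links_back(html, project_name):
    """Return True if the page links to the PyPI project or shows 'pip install <name>'."""
    collector = _LinkCollector()
    collector.feed(html)
    wanted = normalize(project_name)

    # 1. An explicit link to https://pypi.org/project/<name>/
    for href in collector.hrefs:
        match = re.match(r"https?://pypi\.org/project/([^/?#]+)", href, re.IGNORECASE)
        if match and normalize(match.group(1)) == wanted:
            return True

    # 2. A "pip install <name>" instruction somewhere in the page text
    text = " ".join(collector.text_parts)
    for match in re.finditer(r"pip3?\s+install\s+([A-Za-z0-9._-]+)", text):
        if normalize(match.group(1).rstrip(".")) == wanted:
            return True
    return False
```

A page that only mentions a different project, or that links to a similarly named one, would not earn the checkmark under this sketch.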

That being said, it wouldn't help if the links point to forked versions, but in that case the GitHub star count might be a tell.

@calebbrown

👍

Any progress on this issue?

I've been looking at malware from PyPI and it is common for the author_email to be "spoofed" (either pointing to nowhere, or using somebody else's email address).

Some related context is this HN discussion: https://news.ycombinator.com/item?id=33438678 Many commenters are asking about providing this sort of information.

I see some considerations that need discussion:

  • how do you support small personal projects (one or two maintainers) and large projects (big team, company) at the same time?
  • how do you provide verification signals without creating a false sense of security? (e.g. hide unvalidated data from the UI versus showing a blue checkmark against it)

Some validation is easier than others as well - e.g. email validation is pretty straightforward, but homepage validation would require something like the ACME protocol.
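To make the homepage case concrete, here is a minimal sketch of an ACME-like HTTP challenge: the index hands out a random token, the owner serves it from their domain, and the index fetches it back. This is not the real ACME protocol (which also binds the token to an account key), and the `/.well-known/pypi-verification/` path, the function names, and the `fetch` callable are all hypothetical.

```python
import hmac
import secrets


def issue_challenge(domain):
    """Create a random token and the URL where the domain owner must serve it."""
    token = secrets.token_urlsafe(32)
    # Hypothetical well-known path, chosen for this example only.
    url = f"https://{domain}/.well-known/pypi-verification/{token}"
    return token, url


def check_challenge(url, token, project_name, fetch):
    """Fetch the challenge URL and compare the body to the expected value.

    `fetch` is any callable taking a URL and returning the response body as str;
    it is injected so the sketch needs no network access.
    """
    expected = f"{token}.{project_name}"
    try:
        body = fetch(url)
    except OSError:
        return False
    # Constant-time comparison, as is usual when comparing secrets.
    return hmac.compare_digest(body.strip().encode(), expected.encode())
```

Binding the expected body to the project name means a token served for one project cannot be reused to claim the same domain for another.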

@ewjoachim
Contributor

ewjoachim commented Nov 9, 2022

Haha, rereading my 2-year-old comment above, the bit about blue check marks resonates strangely in today's terms 😅

Who would have guessed...

@di
Member

di commented Nov 15, 2022

My general thought here is that for metadata that we can 'verify', we should probably elevate that metadata in the UI over 'unverified' metadata.

We can already validate email addresses that correspond to verified emails of maintainers. That won't include the ability to verify mailing-list-style emails, but that could potentially be added to organizations once that feature lands.

With #12465, we'll be able to 'validate' the source repository as well, so any metadata that references the given upstream source repository can be considered verified too.
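One way to read "references the given upstream source repository" is a host-and-path prefix match. The sketch below assumes exactly that; the function name, the case-insensitive comparison, and the handling of a trailing ".git" are choices made for this example rather than anything decided in this thread.

```python
from urllib.parse import urlsplit


def _host_and_segments(url):
    """Split a URL into a lowercase host and its non-empty, lowercase path segments."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    segments = [s.lower() for s in parts.path.split("/") if s]
    return host, segments


def _strip_git(segment):
    return segment[:-4] if segment.endswith(".git") else segment


def references_repository(url, verified_repo_url):
    """Return True if `url` points at the verified repository or a page inside it."""
    host, segments = _host_and_segments(url)
    repo_host, repo_segments = _host_and_segments(verified_repo_url)
    if not repo_segments or host != repo_host:
        return False

    candidate = segments[: len(repo_segments)]
    if len(candidate) != len(repo_segments):
        return False  # e.g. only the organization, not the repository

    # Compare whole segments so "pip-fork" never matches "pip".
    candidate[-1] = _strip_git(candidate[-1])
    expected = repo_segments[:-1] + [_strip_git(repo_segments[-1])]
    return candidate == expected
```

With `https://github.com/pypa/pip` as the verified repository, an issue-tracker link such as `https://github.com/pypa/pip/issues` would match, while `https://github.com/pypa/pip-fork` would not.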

I agree that domains/URLs will need to use the ACME protocol or something similar. I think there's probably a UX question about how this would be done per-project, if we wanted to go that route.

@ewjoachim
Contributor

Mastodon has a link verification system; something like that might be nice.
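For reference, Mastodon marks a profile link as verified when the linked page contains a link back to the profile carrying rel="me". A minimal sketch of the same idea applied to a project page could look like this; the function name and the choice to compare URLs only up to a trailing slash are assumptions for this example, and the page HTML is passed in rather than fetched.

```python
from html.parser import HTMLParser


class _RelMeCollector(HTMLParser):
    """Collects href values of <a> and <link> tags whose rel attribute contains "me"."""

    def __init__(self):
        super().__init__()
        self.rel_me_hrefs = set()

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        rel_tokens = (attr.get("rel") or "").lower().split()
        href = attr.get("href")
        if "me" in rel_tokens and href:
            self.rel_me_hrefs.add(href.rstrip("/"))


def has_rel_me_backlink(page_html, project_url):
    """Return True if the page declares a rel="me" link to the project URL."""
    collector = _RelMeCollector()
    collector.feed(page_html)
    return project_url.rstrip("/") in collector.rel_me_hrefs
```

A plain link without rel="me" is ignored, which is what separates a deliberate ownership claim from an incidental mention.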

That's never going to be foolproof though.

@miketheman
Member

Related: #8462 #10917

@jayaddison
Contributor

From attempting to perform identity-assurance checks on packages manually: bidirectional references can be a reassuring indicator.

In context here: when a PyPI package points to a GitHub repository as its source code, then that's interpretable as a useful but as-yet-untrusted statement. When up-to-date references are inspected within the contents of the cloned linked repository and they point back to the same original package on PyPI, then confidence in the statement increases.

For reproducible-build-compliant packages the situation improves further: any third party can confirm not only that the source origin and package destination are in concordance, but also whether the published artifact from the destination is bit-for-bit genuine, by comparing it to a from-scratch build of the corresponding raw origin source materials. This can be verified on both a historic and ongoing basis.

So that's two orthogonal identity validation mechanisms:

  • Is-where-it-claims-to-be (it's possible to navigate a graph from the content distribution location A to the source at B and then from B back again to A).
  • Is-what-it-claims-to-be (the claimed source code for a package and version builds into the bit-for-bit identical content found at the content distribution location for the same package and version).
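Both checks can be sketched in a few lines. The example below is illustrative only: the `links` mapping stands in for whatever references have been extracted from the package page and the repository, the rebuilt artifact is assumed to come from a reproducible build done elsewhere, and all function names are hypothetical.

```python
import hashlib


def is_where_it_claims_to_be(links, package, source):
    """Check that package -> source and source -> package references both exist.

    `links` maps each location to the set of locations it references.
    """
    return source in links.get(package, set()) and package in links.get(source, set())


def sha256_of(data):
    """Hex digest of an artifact's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def is_what_it_claims_to_be(published_artifact, rebuilt_artifact):
    """Bit-for-bit comparison of the published file and an independent rebuild."""
    return sha256_of(published_artifact) == sha256_of(rebuilt_artifact)
```

A fork that republishes under a new name but leaves its links pointing at the original fails the first check, because the original repository does not point back at the fork's package.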

These don't prevent an attacker copying the source in entirety and creating a duplicate under a different name with an internally-consistent reference graph. Given widespread free communication I think it's reasonable to expect that enough of the package consumer population will be (or become) aware of and gravitate towards the authentic package to solve that problem.

@di
Member

di commented Mar 19, 2024

Following up on my previous comment, here's a mockup of what I'm imagining to separate the metadata we can verify today (source repository, maintainer email, GitHub statistics, Owner/Maintainers) from the unverifiable metadata:

image

  • "Source" should only be verified as a project link if the project has a Trusted Publisher with a matching URL, and the publisher has been used at least once to publish to this project
  • "Metadata" should only be emails that are included in Author-Email or Maintainer-Email that are also verified user emails for any collaborator on the project
  • "GitHub Statistics" should only be included if "Source" is verified
  • "Owner"/"Maintainers" should always be included.
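The four rules above can be sketched as a small decision function. This is only an illustration under simplifying assumptions: the `Project` and `TrustedPublisher` classes are hypothetical stand-ins rather than Warehouse models, and URL matching is reduced to a case-insensitive comparison that ignores a trailing slash.

```python
from dataclasses import dataclass, field
from email.utils import getaddresses


@dataclass
class TrustedPublisher:  # hypothetical stand-in
    repository_url: str
    times_used: int = 0


@dataclass
class Project:  # hypothetical stand-in
    source_url: str
    author_email: str = ""
    maintainer_email: str = ""
    collaborator_verified_emails: set = field(default_factory=set)
    publishers: list = field(default_factory=list)


def _same_url(a, b):
    return a.rstrip("/").lower() == b.rstrip("/").lower()


def source_is_verified(project):
    """A publisher with a matching URL exists and has published at least once."""
    return any(
        _same_url(p.repository_url, project.source_url) and p.times_used >= 1
        for p in project.publishers
    )


def verified_emails(project):
    """Author-Email / Maintainer-Email addresses that a collaborator has verified."""
    headers = [h for h in (project.author_email, project.maintainer_email) if h]
    verified = {e.lower() for e in project.collaborator_verified_emails}
    declared = {addr.lower() for _name, addr in getaddresses(headers) if addr}
    return sorted(declared & verified)


def verified_sections(project):
    """Decide which sidebar sections would appear in the verified area."""
    source_ok = source_is_verified(project)
    return {
        "Source": source_ok,
        "Metadata": bool(verified_emails(project)),
        "GitHub Statistics": source_ok,  # only shown when "Source" is verified
        "Owner/Maintainers": True,  # always included
    }
```

Note that a publisher that is configured but has never been used leaves both "Source" and "GitHub Statistics" out of the verified area.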

Over time we can move things from below the fold to above it, but this should be a big improvement as-is for now.

I pushed the diff for the mockup here. There's some hacky stuff in there just to get the mockup to look good, but it could be a good starting point.

@javanlacerda
Contributor

javanlacerda commented Mar 21, 2024

I'm starting to work on this, creating the verified section and adding "Owner"/"Maintainers" to it :)

@ewjoachim
Contributor

ewjoachim commented Mar 22, 2024

I wonder whether it makes more sense to have verified details followed by unverified details, or to have each category with a verified sub-section and a non-verified sub-section. It feels weird to break the project links apart from one another. When your eyes have reached the place where the repository is, it's not very clear that if the documentation isn't there, you have to look somewhere else entirely to find a different link section that might contain the link to the docs.

I'd even argue that in this case, the whole thing would look more readable if the project doesn't use trusted publishers, which is

What about something like this? (not arguing it's better, just a suggestion for the discussions)
image
(also, this needs a link, or a hover infobox or something to lead people to the documentation that says what this means, what this certifies, and how to get the various parts of their metadata certified)

(Would @nlhkabu have an opinion on the matter?)
