Validate the contents of identity centric metadata #8635

dstufft · 2020-09-30T14:06:51Z

Currently if I'm looking at a project on PyPI, it can be difficult to determine if it's "real" or not. I can look and see the user names that are publishing the project as well as certain key pieces of metadata such as the project home page, the source repository, etc.

Unfortunately, there's no way to verify that a project that has, say, https://github.com/pypa/pip as its home page is actually the real pip and not an impostor. The same goes for other URLs, email addresses, etc. Thus it would be useful if there was some way to actually prove ownership of those URLs/emails, and either differentiate them in the UI somehow, or hide them completely unless they've been proven to be owned by one of the publishing users.

jamadden · 2020-09-30T15:12:53Z

I recall issues like this coming up at least once, if not a few times, over in pypi-support. Someone would fork a repository, change the name in setup.py, and then upload it to PyPI, with or without any other changes. All of the documentation and other links would remain pointed to the originals. This was confusing for users and frustrating to owners of the original package.

So I'm 👍 for some sort of blue verified checkmark or something from that perspective.

With my publisher hat on, though, I would hope this would be completely automated and I wouldn't have to do anything special to earn that blue checkmark.

ewjoachim · 2020-10-02T12:17:38Z

One idea: we could add a blue checkmark for all links in the sidebar that contain a link back to the project's pypi page or pip install <project-name>. This would force us to load those links on the server, but it would be zero-effort for most packages.

That being said, it wouldn't help if they point to forked versions, but in that case, the github star count might be a tell.
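
The link-back check suggested above could be sketched roughly as follows. This is only an illustration, not Warehouse code: the function name `links_back_to_project` and the helper class are hypothetical, the HTML is assumed to have been fetched already, and project names are compared after PEP 503 normalization.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse


def normalize(name):
    """PEP 503 name normalization: runs of '-', '_', '.' become '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


class _LinkCollector(HTMLParser):
    """Collects the href of every <a> tag plus all text nodes."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.hrefs.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)


def links_back_to_project(html, project_name):
    """Return True if the page links to the project's PyPI page
    or mentions `pip install <project-name>` (hypothetical helper)."""
    wanted = normalize(project_name)
    collector = _LinkCollector()
    collector.feed(html)
    collector.close()

    # 1. An <a href> pointing at https://pypi.org/project/<name>/
    for href in collector.hrefs:
        parsed = urlparse(href)
        parts = [p for p in parsed.path.split("/") if p]
        if (
            parsed.netloc.lower() == "pypi.org"
            and len(parts) >= 2
            and parts[0] == "project"
            and normalize(parts[1]) == wanted
        ):
            return True

    # 2. A "pip install <name>" mention somewhere in the page text
    text = " ".join(" ".join(collector.text_parts).split()).lower()
    for candidate in re.findall(r"pip install ([a-z0-9][a-z0-9._-]*)", text):
        if normalize(candidate.rstrip("._-")) == wanted:
            return True
    return False
```

As noted, a fork that also rewrites its own README would still pass this check; it only shows that the linked page acknowledges the package.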

calebbrown · 2022-11-02T23:23:05Z

👍

Any progress on this issue?

I've been looking at malware from PyPI and it is common for the author_email to be "spoofed" (either pointing to nowhere, or using somebody else's email address).

Some related context is this HN discussion: https://news.ycombinator.com/item?id=33438678 (many commenters there are asking for this sort of information).

I see some considerations that need discussion:

how do you support small personal projects (one or two maintainers) and large projects (big team, company) at the same time?
how do you provide verification signals without creating a false sense of security? (e.g. hide unvalidated data from the UI versus showing a blue checkmark against it)

Some validation is easier than others as well - e.g. email validation is pretty straightforward, but homepage validation would require something like the ACME protocol.

ewjoachim · 2022-11-09T13:22:18Z

Haha, rereading my 2-year-old comment above, the bit about blue check marks resonates strangely in today's terms 😅

Who would have guessed...

di · 2022-11-15T21:32:22Z

My general thought here is that for metadata that we can 'verify', we should probably elevate that metadata in the UI over 'unverified' metadata.

We can already validate email addresses that correspond to verified emails of maintainers. That won't include the ability to verify mailinglist-style emails, but that could potentially be added to organizations once that feature lands.

With #12465, we'll be able to 'validate' the source repository as well, so any metadata that references the given upstream source repository can be considered verified too.

I agree that domains/urls will need to use the ACME protocol or something similar. I think there's probably a UX question on how these would be done per-project, if we wanted to go that route.
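
For the domain/URL case, a challenge flow loosely modelled on ACME's HTTP-01 challenge might look like the sketch below. Everything specific here is an assumption for illustration: the `/.well-known/pypi-verification/` path, the function names, and the response format are hypothetical, and the `fetch` callable stands in for whatever HTTP client would be used.

```python
import hmac
import secrets

# Hypothetical well-known path; not something PyPI defines.
WELL_KNOWN_PREFIX = "/.well-known/pypi-verification/"


def new_challenge(domain, project_name):
    """Create a one-off challenge the domain owner must publish."""
    token = secrets.token_urlsafe(32)
    return {
        "url": f"https://{domain}{WELL_KNOWN_PREFIX}{token}",
        # Binding the project name to the token means a response
        # published for one project cannot be reused for another.
        "expected_body": f"{token}.{project_name}",
    }


def check_challenge(challenge, fetch):
    """Verify the challenge.

    `fetch` is any callable taking a URL and returning the response
    body as a string, or None if the request failed.
    """
    body = fetch(challenge["url"])
    if body is None:
        return False
    # Constant-time comparison of the published and expected values.
    return hmac.compare_digest(
        body.strip().encode(), challenge["expected_body"].encode()
    )
```

Passing `fetch` in keeps the sketch free of network code; the per-project UX question raised above (who triggers the challenge, and when it is re-checked) is left open.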

ewjoachim · 2022-11-19T11:36:24Z

Mastodon has a link verification system, that might be nice.

That's never going to be foolproof though.

miketheman · 2023-09-11T14:15:11Z

Related: #8462 #10917

jayaddison · 2024-01-28T19:00:38Z

From attempting to perform identity-assurance checks on packages manually: bidirectional references can be a reassuring indicator.

In context here: when a PyPI package points to a GitHub repository as its source code, then that's interpretable as a useful but as-yet-untrusted statement. When up-to-date references are inspected within the contents of the cloned linked repository and they point back to the same original package on PyPI, then confidence in the statement increases.

For reproducible-build-compliant packages the situation improves further: any third party can confirm not only that the source origin and package destination are in concordance, but also whether the published artifact from the destination is bit-for-bit genuine, by comparing it to a from-scratch build of the corresponding raw origin source materials. This can be verified on both a historic and ongoing basis.

So that's two orthogonal identity validation mechanisms:

Is-where-it-claims-to-be (it's possible to navigate a graph from the content distribution location A to the source at B and then from B back again to A).
Is-what-it-claims-to-be (the claimed source code for a package and version builds into the bit-for-bit identical content found at the content distribution location for the same package and version).
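
As a rough sketch of these two checks, assuming the links and artifact bytes have already been collected by some other means (the function names and input shapes are hypothetical):

```python
import hashlib


def is_where_it_claims_to_be(package_url, links_from_package, links_from_source):
    """Check the A -> B -> A round trip.

    links_from_package: source URLs the package page (A) points to.
    links_from_source: mapping of each source URL (B) to the set of
        URLs found in that source's contents.
    """
    return any(
        package_url in links_from_source.get(source_url, ())
        for source_url in links_from_package
    )


def sha256_hex(data):
    """Hex SHA-256 digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def is_what_it_claims_to_be(published_artifact, rebuilt_artifact):
    """Bit-for-bit comparison of the published artifact against a
    from-scratch rebuild of the claimed source, via their digests."""
    return sha256_hex(published_artifact) == sha256_hex(rebuilt_artifact)
```

The second check is only meaningful when the build is reproducible; otherwise a mismatch says nothing about authenticity.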

These don't prevent an attacker copying the source in entirety and creating a duplicate under a different name with an internally-consistent reference graph. Given widespread free communication I think it's reasonable to expect that enough of the package consumer population will be (or become) aware of and gravitate towards the authentic package to solve that problem.

di · 2024-03-19T21:27:06Z

Following on to my previous comment, here's a mockup of what I'm imagining to separate the metadata we can verify today (source repository, maintainer email, GitHub statistics, Owner/Maintainers) from the unverifiable metadata:

"Source" should only be verified as a project link if the project has a Trusted Publisher with a matching URL, and the publisher has been used at least once to publish to this project
"Metadata" should only be emails that are included in Author-Email or Maintainer-Email that are also verified user emails for any collaborator on the project
"GitHub Statistics" should only be included if "Source" is verified
"Owner"/"Maintainers" should always be included.

Over time we can move things from below the fold to above it, but this should be a big improvement as-is for now.
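
The four rules above could be expressed roughly like this. It is a sketch under assumptions, not the mockup's actual diff: `ProjectView`, its fields, and `split_metadata` are hypothetical names, and URL matching is simplified to a case-insensitive comparison that ignores a trailing slash.

```python
from dataclasses import dataclass


def _norm_url(url):
    # Simplified matching: ignore trailing slash and letter case.
    return url.rstrip("/").lower()


@dataclass
class ProjectView:
    project_links: dict                # label -> URL from the metadata
    metadata_emails: set               # Author-Email / Maintainer-Email
    collaborator_verified_emails: set  # verified emails of collaborators
    used_trusted_publisher_urls: set   # publishers used at least once


def split_metadata(project):
    """Split metadata into verified and unverified parts."""
    trusted = {_norm_url(u) for u in project.used_trusted_publisher_urls}
    verified_links, unverified_links = {}, {}
    for label, url in project.project_links.items():
        # A link is verified only if it matches a used Trusted Publisher.
        target = verified_links if _norm_url(url) in trusted else unverified_links
        target[label] = url

    known = {e.lower() for e in project.collaborator_verified_emails}
    claimed = {e.lower() for e in project.metadata_emails}
    return {
        "verified_links": verified_links,
        "unverified_links": unverified_links,
        "verified_emails": claimed & known,
        "unverified_emails": claimed - known,
        # GitHub statistics are shown only when a source link is verified.
        "show_github_statistics": bool(verified_links),
    }
```

Owners and maintainers are not modelled here because they are always shown.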

I pushed the diff for the mockup here; there's some hacky stuff in there just to get the mockup to look good, but it could be a good starting point.

javanlacerda · 2024-03-21T19:36:57Z

I'm starting to work on this, creating the verified section and adding "Owner"/"Maintainers" to it :)

ewjoachim · 2024-03-22T22:27:07Z

I wonder if it makes more sense to have verified details and then unverified details, or to have each category with a verified sub-section and a non-verified sub-section. It feels weird to break the project links apart from one another. When your eyes have reached the place where the repository is, it's not very clear that if the documentation isn't there, you have to look at some place else entirely to find a different link section that might contain the link to the docs.

I'd even argue that in this case, the whole thing would look more readable if the project doesn't use trusted publishers, which is

What about something like this? (Not arguing it's better, just a suggestion for the discussion.)

(also, this needs a link, or a hover infobox or something to lead people to the documentation that says what this means, what this certifies, and how to get the various parts of their metadata certified)

(Would @nlhkabu have an opinion on the matter ?)

di added the "needs discussion" label (a product management/policy issue maintainers and users should discuss) on Sep 30, 2020

javanlacerda mentioned this issue Apr 8, 2024

Creating verified information section for packages #15737

Merged

javanlacerda mentioned this issue Apr 25, 2024

Verified project links for project details. #15862

Merged

javanlacerda mentioned this issue May 4, 2024

Create and populate verified field for ReleaseUrl #15891

Open

cofiem mentioned this issue May 5, 2024

Important project links are pushed down into the "unverified details" section #15903

Open
