Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract and host wheel METADATA on upload #9972

Closed
wants to merge 11 commits into from

Conversation

abitrolly
Copy link
Contributor

@abitrolly abitrolly commented Aug 30, 2021

Extract METADATA on wheel upload as *.whl.metadata (#8254)

This allows to download just the .metadata file for dependency
resolution instead of full wheels as it happens today
#8254 (comment)

The filename convention and download location is covered by PEP-0658
https://www.python.org/dev/peps/pep-0658/#specification

@abitrolly abitrolly changed the title Reuse temporary_filename in upload code Extract and host wheel METADATA on upload Aug 30, 2021
@abitrolly abitrolly marked this pull request as ready for review August 31, 2021 04:39
@@ -1330,11 +1331,19 @@ def file_upload(request):
"Binary wheel '{filename}' has an unsupported "
"platform tag '{plat}'.".format(filename=filename, plat=plat),
)
# Extract .metadata file
# https://www.python.org/dev/peps/pep-0658/#specification
with zipfile.ZipFile(temporary_filename) as zfp:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the user uploads a non-zip file as a wheel? Is it possible to have zip files that effectively DoS the server with a maliciously crafted file?

(that's totally possible, especially if someone's trying to break PyPI or something)

Copy link
Contributor Author

@abitrolly abitrolly Aug 31, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check if the uploaded wheel is valid is made here - https://github.com/pypa/warehouse/blob/b9c2e2a45529b64a44c4e8b225c20efab0442bfd/warehouse/forklift/legacy.py#L711 - It could be extended to include all kind of checks, but..

Is it possible to have zip files that effectively DoS the server with a maliciously crafted file?

Yes, because the zipfile doesn't protect against zip bombs. We can try to throw a zip bomb at PyPI, but such security hazards are better be managed in another issue - this PR doesn't increase the attack surface.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, there is https://nvd.nist.gov/vuln/detail/CVE-2019-9674 which Python core developers decided to fix with a warning. How do you suppose me to fix this in my free time?

@abitrolly
Copy link
Contributor Author

Can't say I didn't expect the tests to fail, but I had a hope that the new functionality will just not be covered.

Test are failing, because they are monkeypatching a lot of calls including .whl file itself, and because new zipfile.ZipFile is not monkeypatched, it fails with E zipfile.BadZipFile: File is not a zip file. Looks like I need to figure out monkeypatching section and copy/paste it to 29 failing tests. A price to pay for linear readable piece of code.

@abitrolly
Copy link
Contributor Author

I would really prefer a fixture with real wheel, to ensure that my code works. Otherwise I feel like I am testing my expectations how my code should work, and not what would really happen after deployment.

@abitrolly abitrolly mentioned this pull request Sep 2, 2021
@abitrolly
Copy link
Contributor Author

It appeared that the fix needs to be applied to just one test - 29 failures are just parametrisation for different platforms.

It is now ready for review.

@abitrolly
Copy link
Contributor Author

Fixed black warning and rebased.

@abitrolly
Copy link
Contributor Author

@pradyunsg can you unblock the tests?

# Extract .metadata file
# https://www.python.org/dev/peps/pep-0658/#specification
with zipfile.ZipFile(temporary_filename) as zfp:
metafile = wheel_info.group("namever") + ".dist-info/METADATA"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, is this the way PIP reads the metadata ? Is there a way we have multiple metadata files in a wheel such that the metadata says we depend on package A, but in reality when installing via pip, we depend on package B ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe upload API should do these checks too. Right now it checks for the WHEEL presence and allowed platform tags .

https://github.com/pypa/warehouse/blob/e5cb0e251898e2b60e9d801054d97e12ae3b8e7b/warehouse/forklift/legacy.py#L711-L727

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way we have multiple metadata files in a wheel such that the metadata says we depend on package A, but in reality when installing via pip, we depend on package B ?

According to this code pip should fail if there are multiple .dist-info dirs
https://github.com/pypa/pip/blob/bf91a079791f2daf4339115fb39ce7d7e33a9312/src/pip/_internal/utils/wheel.py#L100

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically a wheel can contain multiple .dist-info directories as long as only one matches the name-version pair in the wheel’s file name. But pip adds this additional restriction to simplify implementation, and afaik no-one has complained yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those extra .dist-info directories will not be used, because a wheel ships only one package-version build, and the .dist-info that will is used should match the name-version pair in filename.

Copy link
Member

@ewdurbin ewdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the major concern missing so far in implementation is storing a boolean on the File model and rendering the data-dist-info-metadata value in the simple detail template/view per the PEP.

The rendering isn't strictly required for this PR, but we should at least be able to keep track of what files we have stored metadata for.

Ideally this implementation would be as a general purpose utility that we could point at a given wheel file on disk, that would allow for extracting/storing the metadata during uploads, but also allow us to backfill over time.

@ewdurbin
Copy link
Member

ewdurbin commented Sep 3, 2021

Oh, hmmm, and it also seems like we should be calculating/storing a hash for the metadata file.

@abitrolly
Copy link
Contributor Author

The rendering isn't strictly required for this PR, but we should at least be able to keep track of what files we have stored metadata for.

Yes, I would like to avoid big changes and https://warehouse.readthedocs.io/development/submitting-patches.html recommends keeping them small. After this PR is merged, all new wheels will have the metadata stored, so the upload datetime will be enough for now.

Ideally this implementation would be as a general purpose utility that we could point at a given wheel file on disk, that would allow for extracting/storing the metadata during uploads, but also allow us to backfill over time.

Yes, this should be available as a library, but I don't know which. Exactly for backfilling job.

Oh, hmmm, and it also seems like we should be calculating/storing a hash for the metadata file.

That's not a problem to calculate them when DB part is ready.

@pradyunsg pradyunsg removed their request for review September 7, 2021 21:26
@abitrolly
Copy link
Contributor Author

@ewdurbin looks like it is easy to add hashes to .metadata upload. Are all three are needed?

@abitrolly
Copy link
Contributor Author

abitrolly commented Sep 8, 2021

Oh no. There is no File entry created for .metadata file. A migration will be needed in any case. Is it possible not to touch the DB in the PR, splitting it into 3 instead?

  1. Extract and host metadata (this one)
  2. DB changes to store metadata hash
  3. View changes to show metadata hash in simple API

@dstufft
Copy link
Member

dstufft commented Sep 14, 2021

I would probably change the order of those slightly to add the DB changes first. Since storing the data doesnt help us without data in the DB to tell us that there is a metadata file associated with this release.

@abitrolly
Copy link
Contributor Author

@dstufft all new wheel files after this change will have the metadata file, so it is not strictly necessary to store info that metadata exists in DB. The only thing that needs to be stored is the hash, but I would prefer to have a the discussion about that in a separate issue.

If you can point me to example how to traverse storage filesystem, I can try to make migration script that will check and add metadata for the wheels that are already uploaded.

@ewdurbin
Copy link
Member

I agree with @dstufft that without DB storage... extracting/storing the metadata file isn't as useful unless the data-dist-info-metadata part of the PEP is implemented, clients would need to issue many requests that would end with a 404.

Also for backfill, rather than needing to traverse our file storage, we'd only need to iterate over File objects where packagetype is bdist_wheel and has_metadata is NULL or metadata_hash is NULL to backfill.

@abitrolly
Copy link
Contributor Author

@ewdurbin I can try to add NULLable metadata_hash.

Is it enough to just add new field to File class https://github.com/pypa/warehouse/blob/3802138b4eed60a1b8218168919e6de3792e9105/warehouse/packaging/models.py#L558-L560 and then generate migration as described in https://warehouse.readthedocs.io/development/database-migrations.html ?

@abitrolly abitrolly force-pushed the extract-wheel-metadata branch 3 times, most recently from f84c28b to 33a030c Compare September 17, 2021 11:35
"""
global _wheel_file_re
filename = os.path.basename(path)
namever = _wheel_file_re.match(filename).group("namever")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the group ever be None here?

@abitrolly
Copy link
Contributor Author

#12702 created conflicts. Trying to resolve them now.

This allows to download just the .metadata file for dependency
resolution instead of full wheels as it happens today
pypi#8254 (comment)

The filename convention and download location is covered by PEP-0658
https://www.python.org/dev/peps/pep-0658/#specification
Tests were failing, because wheel upload test used .tar.gz
content, and zipfile could not open that content to extract
METADATA (which was also absent).

The fix adds two helpers to avoid repeated code.
I changed tested content-type to the more appropriate for .zip
Twine sends application/octet-stream.
This will be useful for backfill scripts.
@abitrolly abitrolly requested a review from a team as a code owner April 30, 2023 08:23
@abitrolly
Copy link
Contributor Author

Rebased and squashed some CI linter commits.

@ewdurbin ewdurbin mentioned this pull request May 11, 2023
@ewdurbin
Copy link
Member

superseded by #13649

@ewdurbin ewdurbin closed this May 11, 2023
@abitrolly
Copy link
Contributor Author

@ewdurbin you could at least give me a credit when copying the code into new PR, so that I could reference that trying to find a job.

Now that this code is merged, it is at least possible to test .whl bomb protection live #10504.

Leaving aside an acute seizure of justice, I am glad to see it finally merged. Thanks everyone for support.

@astrojuanlu
Copy link

I also found it very weird to see this PR coming "out of the blue" (which of course it was not, I'm sure lots of discussion went internally about it - it was just not visible for bystanders) with parts that were almost identical to this one, without at least retaining attribution in 1 commit, or acknowledgement of some sort.

But anyway, glad to see #8254 addressed. Thanks everyone.

@ewdurbin
Copy link
Member

the long story short here is that abitrolly has consistently behaved across multiple contribution attempts in ways that were dismissive of maintainer feedback, overscoped, financially dubious, and often made me feel uncomfortable or unnecessarily pressured.

i have no interest in collaborating with this contributor.

PEP 658 is important as Metadata 2.2 (Dynamic) near implementation completion and getting back in the quagmire of interacting with abitrolly wasn't something i have capacity for.

@abitrolly
Copy link
Contributor Author

@ewdurbin given that I've never offended or held anything personal towards you, that's very unprofessional, man. Not everybody is a nice person when it comes to doing extra work in their free time, but you guys are at least being paid for maintaining things here, and "uncomfortable" and "unnecessarily pressured" is a normal job situation for many of us paid professionals. I take the blame and full responsibility for pushing these thing while you guys are finding other things to do.

With that being said, I apologize to you for my bad and uncomfortable behavior. Should I know that it is the reason to hold back my PR, I would of course take the measures to apologize earlier, just to deal with it and forget. Even if I may sound passive aggresive, I appreciate you maintainership work that resulted in these changes finally merged in.

@cosmicexplorer
Copy link

cosmicexplorer commented Jul 27, 2023

Thank you so much @ewdurbin for courageously speaking directly and publicly about the reasons this work was so difficult to accept, and for putting in the effort yourself on #13649 to get this team effort across the frontend and backend over the finish line. I am replying here because a lot of what I hope are really significant perf improvements for pip and pex on very common tasks were blocking on this data being published by pypi and I'm super excited to try them out now. I'm very sorry you were subjected to that treatment and really appreciate your impactful contributions here that have enabled so much other exciting work.

Please others don't reply further on this thread, I shouldn't have done it myself but I wanted the importance of this work to be known to others.

@abitrolly
Copy link
Contributor Author

image

@cosmicexplorer
Copy link

I'm glad that the python community can now move on from interactions like these. Sorry, everyone.

@pypi pypi locked as resolved and limited conversation to collaborators Jul 27, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.