Extract and host wheel METADATA on upload #9972
Conversation
temporary_filename in upload code
warehouse/forklift/legacy.py
@@ -1330,11 +1331,19 @@ def file_upload(request):
    "Binary wheel '{filename}' has an unsupported "
    "platform tag '{plat}'.".format(filename=filename, plat=plat),
)
# Extract .metadata file
# https://www.python.org/dev/peps/pep-0658/#specification
with zipfile.ZipFile(temporary_filename) as zfp:
What happens if the user uploads a non-zip file as a wheel? Is it possible to have zip files that effectively DoS the server with a maliciously crafted file?
(that's totally possible, especially if someone's trying to break PyPI or something)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check that the uploaded wheel is valid is made here - https://github.com/pypa/warehouse/blob/b9c2e2a45529b64a44c4e8b225c20efab0442bfd/warehouse/forklift/legacy.py#L711 - and it could be extended to include all kinds of checks, but...

Is it possible to have zip files that effectively DoS the server with a maliciously crafted file?

Yes, because the zipfile module doesn't protect against zip bombs. We can try to throw a zip bomb at PyPI, but such security hazards are better managed in another issue - this PR doesn't increase the attack surface.
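(For what it's worth, a minimal sketch of one possible first-line guard, not something this PR adds: reject uploads that aren't valid zip archives and archives whose declared member sizes are implausibly large. The size cap and error handling here are assumptions.)

```python
import zipfile

# Illustrative only: a rough pre-check before reading anything out of the wheel.
# The 100 MiB cap is an arbitrary example, and declared sizes can be spoofed,
# so this is a first line of defense rather than a complete zip-bomb mitigation.
MAX_UNCOMPRESSED_SIZE = 100 * 1024 * 1024

def open_wheel_safely(path):
    try:
        zfp = zipfile.ZipFile(path)
    except zipfile.BadZipFile:
        raise ValueError("uploaded file is not a valid wheel (zip) archive")
    if sum(info.file_size for info in zfp.infolist()) > MAX_UNCOMPRESSED_SIZE:
        zfp.close()
        raise ValueError("wheel declares an implausibly large uncompressed size")
    return zfp
```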
Well, there is https://nvd.nist.gov/vuln/detail/CVE-2019-9674, which Python core developers decided to fix with a warning. How do you expect me to fix this in my free time?
Can't say I didn't expect the tests to fail, but I had hoped the new functionality would just not be covered. Tests are failing because they monkeypatch a lot of calls.
I would really prefer a fixture with a real wheel, to ensure that my code works. Otherwise I feel like I am testing my expectations of how my code should work, and not what would really happen after deployment.
It turned out that the fix needed to be applied to just one test - the 29 failures are just parametrisation over different platforms. It is now ready for review.
Force-pushed from 4a8404d to e5cb0e2
Fixed.
@pradyunsg can you unblock the tests?
warehouse/forklift/legacy.py
# Extract .metadata file
# https://www.python.org/dev/peps/pep-0658/#specification
with zipfile.ZipFile(temporary_filename) as zfp:
    metafile = wheel_info.group("namever") + ".dist-info/METADATA"
Just curious, is this the way pip reads the metadata? Is there a way to have multiple metadata files in a wheel such that the metadata says we depend on package A, but in reality, when installing via pip, we depend on package B?
Here's how pip finds the .dist-info folder as far as I can tell https://github.com/pypa/pip/blob/bf91a079791f2daf4339115fb39ce7d7e33a9312/src/pip/_internal/utils/wheel.py#L84-L114
Maybe the upload API should do these checks too. Right now it checks for the presence of the WHEEL file and allowed platform tags.
Is there a way we have multiple metadata files in a wheel such that the metadata says we depend on package A, but in reality when installing via pip, we depend on package B ?
According to this code, pip should fail if there are multiple .dist-info dirs: https://github.com/pypa/pip/blob/bf91a079791f2daf4339115fb39ce7d7e33a9312/src/pip/_internal/utils/wheel.py#L100
Technically a wheel can contain multiple .dist-info directories, as long as only one matches the name-version pair in the wheel's file name. But pip adds this additional restriction to simplify the implementation, and afaik no one has complained yet.
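(Roughly what that restriction looks like, paraphrased from the linked pip code rather than copied from it; the function and its error messages are illustrative:)

```python
import zipfile

# Paraphrase of the restriction discussed above: exactly one top-level
# *.dist-info directory, and its prefix must match the canonical project name.
def find_dist_info_dir(zfp: zipfile.ZipFile, canonical_name: str) -> str:
    top_level = {name.split("/", 1)[0] for name in zfp.namelist()}
    dist_infos = [d for d in top_level if d.endswith(".dist-info")]
    if len(dist_infos) != 1:
        raise ValueError(f"expected exactly one .dist-info directory, found {dist_infos}")
    info_dir = dist_infos[0]
    # "foo_bar-1.0.dist-info" -> "foo-bar"
    name = info_dir[: -len(".dist-info")].split("-")[0].replace("_", "-").lower()
    if name != canonical_name:
        raise ValueError(f"{info_dir!r} does not belong to project {canonical_name!r}")
    return info_dir
```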
Those extra .dist-info directories will not be used, because a wheel ships only one package-version build, and the .dist-info that is used should match the name-version pair in the filename.
I think the major concern missing so far in the implementation is storing a boolean on the File model and rendering the data-dist-info-metadata value in the simple detail template/view per the PEP.

The rendering isn't strictly required for this PR, but we should at least be able to keep track of which files we have stored metadata for.

Ideally this implementation would be a general-purpose utility that we could point at a given wheel file on disk; that would allow for extracting/storing the metadata during uploads, but also allow us to backfill over time.
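Something like the sketch below might fit that bill; the function name is hypothetical, and it just generalizes the extraction code already shown in the diff above so that a backfill job could call it on wheels already on disk:

```python
import zipfile

def extract_wheel_metadata(wheel_path, namever):
    """Return the raw METADATA bytes from a wheel on disk.

    `namever` is the "{name}-{version}" prefix from the wheel file name, so the
    member read is "{namever}.dist-info/METADATA" (see PEP 658). Usable both
    from the upload path and from a later backfill job.
    """
    with zipfile.ZipFile(wheel_path) as zfp:
        return zfp.read(namever + ".dist-info/METADATA")
```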
Oh, hmmm, and it also seems like we should be calculating/storing a hash for the metadata file.
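(For example - a sketch, with sha256 being an assumption about the digest PyPI would choose - PEP 658 lets the data-dist-info-metadata value carry a hash of the metadata file:)

```python
import hashlib

# Sketch: hash the extracted METADATA bytes so the simple index can later
# advertise them as data-dist-info-metadata="sha256=<hexdigest>" per PEP 658.
def metadata_hash(metadata_bytes: bytes) -> str:
    return "sha256=" + hashlib.sha256(metadata_bytes).hexdigest()
```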
Yes, I would like to avoid big changes, and https://warehouse.readthedocs.io/development/submitting-patches.html recommends keeping them small. After this PR is merged, all new wheels will have the metadata stored, so the upload datetime will be enough for now.

Yes, this should be available as a library, but I don't know which one - exactly for the backfilling job.

It's not a problem to calculate them once the DB part is ready.
@ewdurbin looks like it is easy to add hashes to
Oh no. There is no
I would probably change the order of those slightly, to add the DB changes first, since storing the data doesn't help us without data in the DB to tell us that there is a metadata file associated with this release.
@dstufft all new wheel files after this change will have the metadata file, so it is not strictly necessary to store the fact that metadata exists in the DB. The only thing that needs to be stored is the hash, but I would prefer to have the discussion about that in a separate issue. If you can point me to an example of how to traverse the storage filesystem, I can try to make a migration script that will check and add metadata for the wheels that are already uploaded.
I agree with @dstufft that without DB storage, extracting/storing the metadata file isn't as useful. Also, for backfill, rather than needing to traverse our file storage, we'd only need to iterate over the existing File records.
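(A very rough sketch of such a backfill pass, assuming hypothetical fetch_wheel/store_metadata callables for whatever storage API Warehouse exposes; none of this is existing Warehouse code:)

```python
import zipfile

# Hypothetical backfill: `wheel_rows` would come from querying the File table
# for already-uploaded "*.whl" files; each entry is (stored_filename, namever).
def backfill_metadata(wheel_rows, fetch_wheel, store_metadata):
    for filename, namever in wheel_rows:
        local_path = fetch_wheel(filename)  # download/locate the wheel
        with zipfile.ZipFile(local_path) as zfp:
            metadata = zfp.read(namever + ".dist-info/METADATA")
        # PEP 658 places the metadata next to the wheel as "<filename>.metadata"
        store_metadata(filename + ".metadata", metadata)
```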
@ewdurbin I can try to add a NULLable column. Is it enough to just add a new field to the File model?
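(Concretely, something along these lines is what I have in mind; the table and column names below are placeholders for illustration only, not the actual Warehouse model definition:)

```python
from sqlalchemy import Column, Integer, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

# Standalone sketch -- the real File model has many more columns. The new field
# is nullable so that files uploaded before this change simply read as NULL,
# meaning "no metadata extracted (yet)".
class File(Base):
    __tablename__ = "release_files"
    id = Column(Integer, primary_key=True)
    metadata_file_sha256_digest = Column(Text, nullable=True)  # placeholder name
```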
Force-pushed from f84c28b to 33a030c
""" | ||
global _wheel_file_re | ||
filename = os.path.basename(path) | ||
namever = _wheel_file_re.match(filename).group("namever") |
Can the group ever be None here?
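(It can fail one step earlier: if the filename doesn't match the regex at all, .match() returns None and the chained .group() raises AttributeError. A tiny illustration with a simplified stand-in for the real wheel-filename regex:)

```python
import re

# Simplified stand-in for _wheel_file_re; in the real regex the "namever"
# group is mandatory, so if the match succeeds the group cannot be None.
_wheel_file_re = re.compile(r"^(?P<namever>.+?-.+?)-.+\.whl$")

print(_wheel_file_re.match("foo-1.0-py3-none-any.whl").group("namever"))  # foo-1.0
print(_wheel_file_re.match("not-a-wheel.tar.gz"))  # None -> .group() would raise AttributeError
```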
#12702 created conflicts. Trying to resolve them now.
This allows downloading just the .metadata file for dependency resolution, instead of the full wheels that have to be downloaded today: pypi#8254 (comment). The filename convention and download location are covered by PEP 658: https://www.python.org/dev/peps/pep-0658/#specification
Tests were failing because the wheel upload test used .tar.gz content, and zipfile could not open that content to extract METADATA (which was also absent). The fix adds two helpers to avoid repeated code.
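(The helpers aren't reproduced here, but roughly, building a minimal valid wheel in memory for the tests can look like this sketch; the names and file contents are illustrative:)

```python
import io
import zipfile

def build_test_wheel(namever="sampleproject-1.0"):
    """Return bytes of a minimal wheel containing WHEEL and METADATA files."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zfp:
        zfp.writestr(f"{namever}.dist-info/WHEEL", "Wheel-Version: 1.0\n")
        zfp.writestr(
            f"{namever}.dist-info/METADATA",
            "Metadata-Version: 2.1\nName: sampleproject\nVersion: 1.0\n",
        )
    return buf.getvalue()
```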
I changed the tested content-type to one more appropriate for .zip; Twine sends application/octet-stream.
This will be useful for backfill scripts.
Force-pushed from a10d3f5 to cab426f
Co-authored-by: Éric <merwok@netwok.org>
Force-pushed from cab426f to 86cc141
Rebased and squashed some CI linter commits.
Superseded by #13649
@ewdurbin you could at least give me credit when copying the code into the new PR, so that I could reference it when trying to find a job. Now that this code is merged, it is at least possible to test it. Leaving aside an acute seizure of justice, I am glad to see it finally merged. Thanks everyone for the support.
I also found it very weird to see that PR come "out of the blue" (which of course it was not - I'm sure a lot of discussion went on internally about it, it was just not visible to bystanders) with parts that were almost identical to this one, without at least retaining attribution in one commit, or an acknowledgement of some sort. But anyway, glad to see #8254 addressed. Thanks everyone.
The long story short here is that abitrolly has consistently behaved, across multiple contribution attempts, in ways that were dismissive of maintainer feedback, overscoped, financially dubious, and often made me feel uncomfortable or unnecessarily pressured. I have no interest in collaborating with this contributor. PEP 658 is important as Metadata 2.2 (Dynamic) nears implementation completion, and getting back into the quagmire of interacting with abitrolly wasn't something I had capacity for.
@ewdurbin given that I've never offended you or held anything personal against you, that's very unprofessional, man. Not everybody is a nice person when it comes to doing extra work in their free time, but you guys are at least being paid for maintaining things here, and "uncomfortable" and "unnecessarily pressured" is a normal job situation for many of us paid professionals. I take the blame and full responsibility for pushing these things while you guys were finding other things to do. With that being said, I apologize to you for my bad and uncomfortable behavior. Had I known that this was the reason for holding back my PR, I would of course have taken measures to apologize earlier, just to deal with it and forget. Even if I may sound passive-aggressive, I appreciate your maintainership work that resulted in these changes finally being merged in.
Thank you so much @ewdurbin for courageously speaking directly and publicly about the reasons this work was so difficult to accept, and for putting in the effort yourself on #13649 to get this team effort, across the frontend and backend, over the finish line. I am replying here because a lot of what I hope are really significant perf improvements for pip and pex on very common tasks were blocked on this data being published by PyPI, and I'm super excited to try them out now. I'm very sorry you were subjected to that treatment and really appreciate your impactful contributions here that have enabled so much other exciting work. Please, others, don't reply further on this thread; I shouldn't have done so myself, but I wanted the importance of this work to be known.
I'm glad that the Python community can now move on from interactions like these. Sorry, everyone.
Extract METADATA on wheel upload as *.whl.metadata (#8254)
This allows downloading just the .metadata file for dependency
resolution instead of the full wheels that have to be downloaded today:
#8254 (comment)
The filename convention and download location are covered by PEP 658:
https://www.python.org/dev/peps/pep-0658/#specification