diff --git a/docs/blog/posts/2025-08-07-wheel-archive-confusion-attacks.md b/docs/blog/posts/2025-08-07-wheel-archive-confusion-attacks.md new file mode 100644 index 000000000000..1c04096faf84 --- /dev/null +++ b/docs/blog/posts/2025-08-07-wheel-archive-confusion-attacks.md @@ -0,0 +1,144 @@ +--- +title: Preventing ZIP parser confusion attacks on Python package installers +description: PyPI will begin warning and will later reject wheels that contain differentiable ZIP features or incorrect RECORD files. +authors: + - sethmlarson +date: 2025-08-07 +tags: + - security + - publishing + - deprecation +--- + +The Python Package Index is introducing new restrictions to protect +Python package installers and inspectors from confusion attacks arising +from ZIP parser implementations. This has been done in response to +the discovery that the popular installer uv has a different extraction behavior +to many Python-based installers that use the ZIP parser implementation +provided by the `zipfile` standard library module. + +## Summary + +* ZIP archives constructed to exploit ZIP confusion attacks are now rejected by PyPI. +* There is no evidence that this vulnerability has been exploited using PyPI. +* PyPI is deprecating wheel distributions with incorrect `RECORD` files. + +Please see [this blog post](https://astral.sh/blog/uv-security-advisory-cve-2025-54368) and [CVE-2025-54368](https://github.com/astral-sh/uv/security/advisories/GHSA-8qf3-x8v5-2pj8) +for more information on uv's patch. + + + +## Wheels are ZIPs, and ZIPs are complicated + +Python package "wheels" (or "binary distributions"), like many other file formats, +are actually ZIPs in disguise. The [ZIP archive standard](https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT) was created in 1989, when large archives +might need to be stored across multiple distinct storage units due to size constraints. 
This requirement influenced +the design of the ZIP archive standard: for example, already-archived files can be updated or deleted +by appending new records to the end of a ZIP instead of rewriting the entire ZIP +from scratch, which might be stored on another disk. + +These design considerations mean that the ZIP standard is complicated to implement, and +in many ways is ambiguous about what the "result" of extracting a valid ZIP file should be. + +The ["Binary Distribution Format" specification](https://packaging.python.org/en/latest/specifications/binary-distribution-format/#binary-distribution-format) +defines how a wheel is [meant to be installed](https://packaging.python.org/en/latest/specifications/binary-distribution-format/#installing-a-wheel-distribution-1-0-py32-none-any-whl). +However, the specification leaves many of the details on how exactly to extract the archive +and handle ZIP-specific features to implementations. The most detail provided is: + +> Although a specialized installer is recommended, a wheel file may be installed by simply unpacking into site-packages with the standard ‘unzip’ tool while preserving enough information to spread its contents out onto their final paths at any later time. + +This means that ZIP ambiguities are unlikely to be caught by installers, as there are no +restrictions on which ZIP features are allowed in a valid wheel archive. + +There's also a Python-packaging-specific mechanism for declaring which files are meant to be included +in a wheel. The `RECORD` file included inside wheel `.dist-info` directories +lists files by name and, optionally, a checksum (like SHA256). 
+The [specification for the `.dist-info` directory](https://packaging.python.org/en/latest/specifications/binary-distribution-format/#the-dist-info-directory) +details how installers are supposed to check the contents of the ZIP archive against `RECORD`: + +> Apart from `RECORD` and its signatures, installation will fail if any file in the archive is not both mentioned and correctly hashed in `RECORD`. + +However, most Python installers today do not do this check and extract the contents +of the ZIP archive similar to `unzip` and then amend the installed `RECORD` within the +virtual environment so that uninstalling the package works as expected. + +This means that there is no forcing function on Python projects and +packaging tools to follow packaging standards or normalize their use of ZIP archive features. +This leads to the ambiguous situation today where no one installer can start +enforcing standards without accidentally "breaking" projects and archives +that already exist on PyPI. + +PyPI is adopting a few measures to prevent attackers from abusing the complexities +of ZIP archives and installers not checking `RECORD` files to smuggle files past +manual review processes and automated detection tools. + +## What is PyPI doing to prevent ZIP confusion attacks? + +The correct method to unpack a ZIP is to first check the Central Directory +of files before extracting entries. See this [blog post](https://www.crowdstrike.com/en-us/blog/how-to-prevent-zip-file-exploitation/) +for a more detailed explanation of ZIP confusion attacks. + +PyPI is implementing the following logic to prevent ZIP confusion attacks on +the upload of wheels and ZIPs: + +* Rejecting ZIP archives with invalid record and framing information. +* Rejecting ZIP archives with duplicate filenames in Local File and Central Directory headers. +* Rejecting ZIP archives where files included in Local File and Central Directory headers don't match. 
+* Rejecting ZIP archives with trailing data or multiple End of Central Directory headers. +* Rejecting ZIP archives with incorrect End of Central Directory Locator values. + +PyPI already implements ZIP and tarball compression-bomb detection +as a part of upload processing. + +PyPI will also begin sending emails to **warn users when wheels are published +whose ZIP contents don't match the included `RECORD` metadata file**. After 6 months of warnings, +on February 1st, 2026, PyPI will begin **rejecting** newly uploaded wheels whose ZIP contents +don't match the included `RECORD` metadata file. + +We encourage all Python installers to use this opportunity to +implement cross-checking of extracted wheel contents with the `RECORD` metadata file. + +## `RECORD` and ZIP issues in top Python packages + +Almost all the top 15,000 Python packages by downloads (of which 13,468 publish wheels) +have no issues with the ZIP format or the `RECORD` metadata file. +This makes us confident that we can deploy +these changes without major disruption of existing Python project +development. + +| Status | Number of Projects | +|-------------------------------------|--------------------| +| No `RECORD` or ZIP issues | 13,460 | +| Missing file from `RECORD` | 4 | +| Mismatched `RECORD` and ZIP headers | 2 | +| Duplicate files in ZIP headers | 2 | +| Other ZIP format issues | 0 | + +Note that there are more occurrences of ZIP and `RECORD` issues +that have been reported for other projects on PyPI, but those projects +are not in the top 15,000 by downloads. + +## What actions should I take? + +The mitigations above mean that +users of PyPI, regardless of their installer, don't need to take immediate action +to be safe. We recommend the following actions to users of PyPI to ensure +compliance with Python package and ZIP standards: + +* **For users installing PyPI projects**: Make sure your installer tools are up-to-date. 
+* **For maintainers of PyPI projects**: If you encounter an error during upload, + read the error message and update your own build process or report the issue + to your build tool, if applicable. +* **For maintainers of installer projects**: Ensure that your ZIP implementation follows the ZIP standard + and checks the Central Directory before proceeding with decompression. + See the CPython `zipfile` module for a ZIP implementation that implements this + logic. Begin checking the `RECORD` file against ZIP contents and erroring + or warning the user that the wheel is incorrectly formatted. + +## Acknowledgements + +Thanks to Caleb Brown (Google Open Source Security Team) and Tim Hatch (Netflix) for reporting this issue. + +This level of coordination across Python ecosystem projects requires significant +engineering time investment. Thanks to [Alpha-Omega](https://alpha-omega.dev) who sponsors the security-focused +[Developer-in-Residence](https://www.python.org/psf/developersinresidence/) positions at the Python Software Foundation. 
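The recommendation to installer maintainers above (cross-check `RECORD` against the ZIP contents) can be sketched roughly as follows. This is a minimal illustration and not the check PyPI itself runs: the function name `record_mismatches` is hypothetical, only file names are compared, and hashes and sizes are not verified.

```python
import csv
import io
import zipfile


def record_mismatches(wheel_bytes: bytes) -> tuple[set[str], set[str]]:
    """Hypothetical helper: compare ZIP member names against RECORD entries.

    Returns (files in the ZIP but not in RECORD,
             files in RECORD but not in the ZIP).
    """
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as zfp:
        # Directory members are skipped: RECORD lists files only.
        zip_files = {zi.filename for zi in zfp.infolist() if not zi.is_dir()}
        # Assumption: exactly one top-level "*.dist-info/RECORD" exists.
        record_names = [
            name
            for name in zip_files
            if name.endswith(".dist-info/RECORD") and name.count("/") == 1
        ]
        if len(record_names) != 1:
            raise ValueError("expected exactly one RECORD file")
        record_text = zfp.read(record_names[0]).decode("utf-8")

    # RECORD is CSV with the columns: path, hash, size.
    # Backslash path separators are normalized to forward slashes.
    rows = csv.reader(io.StringIO(record_text))
    record_files = {row[0].replace("\\", "/") for row in rows if row}
    return zip_files - record_files, record_files - zip_files
```

A wheel is consistent under this sketch when both returned sets are empty. A real installer would also need to verify the hashes listed in `RECORD`.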
diff --git a/tests/unit/email/test_init.py b/tests/unit/email/test_init.py index e871b26440ac..8c9fc824ade3 100644 --- a/tests/unit/email/test_init.py +++ b/tests/unit/email/test_init.py @@ -6092,6 +6092,92 @@ def test_pep427_emails( ) ] + def test_wheel_record_mismatch_email( + self, + pyramid_request, + pyramid_config, + monkeypatch, + ): + stub_user = pretend.stub( + id="id", + username="username", + name="", + email="email@example.com", + primary_email=pretend.stub(email="email@example.com", verified=True), + ) + subject_renderer = pyramid_config.testing_add_renderer( + "email/wheel-record-mismatch-email/subject.txt" + ) + subject_renderer.string_response = "Email Subject" + body_renderer = pyramid_config.testing_add_renderer( + "email/wheel-record-mismatch-email/body.txt" + ) + body_renderer.string_response = "Email Body" + html_renderer = pyramid_config.testing_add_renderer( + "email/wheel-record-mismatch-email/body.html" + ) + html_renderer.string_response = "Email HTML Body" + + send_email = pretend.stub( + delay=pretend.call_recorder(lambda *args, **kwargs: None) + ) + pyramid_request.task = pretend.call_recorder(lambda *args, **kwargs: send_email) + monkeypatch.setattr(email, "send_email", send_email) + + pyramid_request.db = pretend.stub( + query=lambda a: pretend.stub( + filter=lambda *a: pretend.stub( + one=lambda: pretend.stub(user_id=stub_user.id) + ) + ), + ) + pyramid_request.user = stub_user + pyramid_request.registry.settings = {"mail.sender": "noreply@example.com"} + + project_name = "Test_Project" + filename = "Test_Project-1.0-py3-none-any.whl" + + result = email.send_wheel_record_mismatch_email( + pyramid_request, + {stub_user}, + project_name=project_name, + filename=filename, + ) + + assert result == { + "project_name": project_name, + "filename": filename, + } + subject_renderer.assert_(project_name=project_name) + body_renderer.assert_(project_name=project_name) + html_renderer.assert_(project_name=project_name) + + assert 
pyramid_request.task.calls == [pretend.call(send_email)] + assert send_email.delay.calls == [ + pretend.call( + f"{stub_user.username} <{stub_user.email}>", + { + "sender": None, + "subject": "Email Subject", + "body_text": "Email Body", + "body_html": ( + "\n
\n" + "Email HTML Body
\n\n" + ), + }, + { + "tag": "account:email:sent", + "user_id": stub_user.id, + "additional": { + "from_": "noreply@example.com", + "to": stub_user.email, + "subject": "Email Subject", + "redact_ip": False, + }, + }, + ) + ] + class TestUserTermsOfServiceUpdateEmail: def test_user_terms_of_service_updated( diff --git a/tests/unit/forklift/test_legacy.py b/tests/unit/forklift/test_legacy.py index 496023bab443..06d4e0176df2 100644 --- a/tests/unit/forklift/test_legacy.py +++ b/tests/unit/forklift/test_legacy.py @@ -10,6 +10,7 @@ import zipfile from cgi import FieldStorage +from textwrap import dedent from unittest import mock import pretend @@ -91,6 +92,17 @@ def _get_whl_testdata(name="fake_package", version="1.0"): zfp.writestr( f"{name}-{version}.dist-info/licenses/LICENSE.APACHE", "Fake License" ) + zfp.writestr( + f"{name}-{version}.dist-info/RECORD", + dedent( + f"""\ + {name}-{version}.dist-info/METADATA, + {name}-{version}.dist-info/licenses/LICENSE.MIT, + {name}-{version}.dist-info/licenses/LICENSE.APACHE, + {name}-{version}.dist-info/RECORD, + """, + ), + ) return temp_f.getvalue() @@ -3367,7 +3379,7 @@ def test_upload_fails_with_unsupported_wheel_plat( "400 Binary wheel .* has an unsupported platform tag .*", resp.status ) - def test_upload_fails_with_missing_metadata_wheel( + def test_upload_fails_with_missing_record_wheel( self, monkeypatch, pyramid_config, db_request ): user = UserFactory.create() @@ -3413,6 +3425,278 @@ def test_upload_fails_with_missing_metadata_wheel( resp = excinfo.value + assert resp.status_code == 400 + assert re.match( + "400 Wheel .* does not contain the required RECORD file: .*", resp.status + ) + + def test_upload_warns_with_mismatched_wheel_and_zip_contents( + self, monkeypatch, pyramid_config, db_request + ): + user = UserFactory.create() + pyramid_config.testing_securitypolicy(identity=user) + db_request.user = user + db_request.user_agent = "warehouse-tests/6.6.6" + EmailFactory.create(user=user) + project = 
ProjectFactory.create() + release = ReleaseFactory.create(project=project, version="1.0") + RoleFactory.create(user=user, project=project) + + temp_f = io.BytesIO() + project_name = project.normalized_name.replace("-", "_") + with zipfile.ZipFile(file=temp_f, mode="w") as zfp: + zfp.writestr("some_file", "some_data") + zfp.writestr(f"{project_name}-{release.version}.dist-info/METADATA", "") + zfp.writestr( + f"{project_name}-{release.version}.dist-info/RECORD", + f"{project_name}-{release.version}.dist-info/RECORD,", + ) + + filename = "{}-{}-cp34-none-any.whl".format( + project.normalized_name.replace("-", "_"), + release.version, + ) + filebody = temp_f.getvalue() + + db_request.POST = MultiDict( + { + "metadata_version": "1.2", + "name": project.name, + "version": release.version, + "filetype": "bdist_wheel", + "pyversion": "cp34", + "md5_digest": hashlib.md5(filebody).hexdigest(), + "content": pretend.stub( + filename=filename, + file=io.BytesIO(filebody), + type="application/zip", + ), + } + ) + + monkeypatch.setattr( + legacy, "_is_valid_dist_file", lambda *a, **kw: (True, None) + ) + + storage_service = pretend.stub(store=lambda path, filepath, meta: None) + db_request.find_service = lambda svc, name=None, context=None: { + IFileStorage: storage_service, + }.get(svc) + + send_email = pretend.call_recorder(lambda *a, **kw: None) + monkeypatch.setattr(legacy, "send_wheel_record_mismatch_email", send_email) + + resp = legacy.file_upload(db_request) + assert send_email.calls == [ + pretend.call( + db_request, + {user}, + project_name=project.name, + filename=filename, + ), + ] + assert resp.status_code == 200 + + def test_upload_record_does_not_warn_with_zip_dir( + self, monkeypatch, pyramid_config, db_request + ): + """ + ZIP archives can contain directory "members". + These shouldn't cause a warning, as RECORD + only contains files, not directories. 
+ """ + + user = UserFactory.create() + pyramid_config.testing_securitypolicy(identity=user) + db_request.user = user + db_request.user_agent = "warehouse-tests/6.6.6" + EmailFactory.create(user=user) + project = ProjectFactory.create() + release = ReleaseFactory.create(project=project, version="1.0") + RoleFactory.create(user=user, project=project) + + temp_f = io.BytesIO() + project_name = project.normalized_name.replace("-", "_") + with zipfile.ZipFile(file=temp_f, mode="w") as zfp: + zfp.mkdir("some-dir/") # Directories! + zfp.mkdir(f"{project_name}-{release.version}.dist-info/") + + zfp.writestr(f"{project_name}-{release.version}.dist-info/METADATA", "") + zfp.writestr( + f"{project_name}-{release.version}.dist-info/RECORD", + dedent( + f"""\ + {project_name}-{release.version}.dist-info/METADATA, + {project_name}-{release.version}.dist-info/RECORD, + """, + ), + ) + + filename = "{}-{}-cp34-none-any.whl".format( + project.normalized_name.replace("-", "_"), + release.version, + ) + filebody = temp_f.getvalue() + + db_request.POST = MultiDict( + { + "metadata_version": "1.2", + "name": project.name, + "version": release.version, + "filetype": "bdist_wheel", + "pyversion": "cp34", + "md5_digest": hashlib.md5(filebody).hexdigest(), + "content": pretend.stub( + filename=filename, + file=io.BytesIO(filebody), + type="application/zip", + ), + } + ) + + monkeypatch.setattr( + legacy, "_is_valid_dist_file", lambda *a, **kw: (True, None) + ) + + storage_service = pretend.stub(store=lambda path, filepath, meta: None) + db_request.find_service = lambda svc, name=None, context=None: { + IFileStorage: storage_service, + }.get(svc) + + send_email = pretend.call_recorder(lambda *a, **kw: None) + monkeypatch.setattr(legacy, "send_wheel_record_mismatch_email", send_email) + + resp = legacy.file_upload(db_request) + + assert send_email.calls == [] + assert resp.status_code == 200 + + def test_upload_record_does_not_warn_windows_path_separators( + self, monkeypatch, 
pyramid_config, db_request + ): + """ + RECORD files can use '/' or '\\' for path separators. + We should handle both and not send unnecessary emails. + """ + + user = UserFactory.create() + pyramid_config.testing_securitypolicy(identity=user) + db_request.user = user + db_request.user_agent = "warehouse-tests/6.6.6" + EmailFactory.create(user=user) + project = ProjectFactory.create() + release = ReleaseFactory.create(project=project, version="1.0") + RoleFactory.create(user=user, project=project) + + temp_f = io.BytesIO() + project_name = project.normalized_name.replace("-", "_") + with zipfile.ZipFile(file=temp_f, mode="w") as zfp: + zfp.writestr(f"{project_name}-{release.version}.dist-info/METADATA", "") + zfp.writestr( + f"{project_name}-{release.version}.dist-info/RECORD", + dedent( + f"""\ + {project_name}-{release.version}.dist-info\\METADATA, + {project_name}-{release.version}.dist-info\\RECORD, + """, + ), + ) + + filename = "{}-{}-cp34-none-any.whl".format( + project.normalized_name.replace("-", "_"), + release.version, + ) + filebody = temp_f.getvalue() + + db_request.POST = MultiDict( + { + "metadata_version": "1.2", + "name": project.name, + "version": release.version, + "filetype": "bdist_wheel", + "pyversion": "cp34", + "md5_digest": hashlib.md5(filebody).hexdigest(), + "content": pretend.stub( + filename=filename, + file=io.BytesIO(filebody), + type="application/zip", + ), + } + ) + + monkeypatch.setattr( + legacy, "_is_valid_dist_file", lambda *a, **kw: (True, None) + ) + + storage_service = pretend.stub(store=lambda path, filepath, meta: None) + db_request.find_service = lambda svc, name=None, context=None: { + IFileStorage: storage_service, + }.get(svc) + + send_email = pretend.call_recorder(lambda *a, **kw: None) + monkeypatch.setattr(legacy, "send_wheel_record_mismatch_email", send_email) + + resp = legacy.file_upload(db_request) + + assert send_email.calls == [] + assert resp.status_code == 200 + + def 
test_upload_fails_with_missing_metadata_wheel( + self, monkeypatch, pyramid_config, db_request + ): + user = UserFactory.create() + pyramid_config.testing_securitypolicy(identity=user) + db_request.user = user + EmailFactory.create(user=user) + project = ProjectFactory.create() + release = ReleaseFactory.create(project=project, version="1.0") + RoleFactory.create(user=user, project=project) + + temp_f = io.BytesIO() + project_name = project.normalized_name.replace("-", "_") + with zipfile.ZipFile(file=temp_f, mode="w") as zfp: + zfp.writestr("some_file", "some_data") + zfp.writestr( + f"{project_name}-{release.version}.dist-info/RECORD", + dedent( + f"""\ + some_file, + {project_name}-{release.version}.dist-info/RECORD, + """, + ), + ) + + filename = "{}-{}-cp34-none-any.whl".format( + project_name, + release.version, + ) + filebody = temp_f.getvalue() + + db_request.POST = MultiDict( + { + "metadata_version": "1.2", + "name": project.name, + "version": release.version, + "filetype": "bdist_wheel", + "pyversion": "cp34", + "md5_digest": hashlib.md5(filebody).hexdigest(), + "content": pretend.stub( + filename=filename, + file=io.BytesIO(filebody), + type="application/zip", + ), + } + ) + + monkeypatch.setattr( + legacy, "_is_valid_dist_file", lambda *a, **kw: (True, None) + ) + + with pytest.raises(HTTPBadRequest) as excinfo: + legacy.file_upload(db_request) + + resp = excinfo.value + assert resp.status_code == 400 assert re.match( "400 Wheel .* does not contain the required METADATA file: .*", resp.status diff --git a/tests/unit/utils/test_zipfiles.py b/tests/unit/utils/test_zipfiles.py new file mode 100644 index 000000000000..6e64d5b8f363 --- /dev/null +++ b/tests/unit/utils/test_zipfiles.py @@ -0,0 +1,153 @@ +# SPDX-License-Identifier: Apache-2.0 + +import io +import os +import pathlib +import struct + +import pytest + +from warehouse.forklift.legacy import _is_valid_dist_file +from warehouse.utils import zipfiles + +ZIPDATA_DIR = 
pathlib.Path(__file__).absolute().parent / "zipdata" + + +def zippath(filename: str): + return str(ZIPDATA_DIR / filename) + + +@pytest.mark.parametrize( + ("filename", "error"), + [ + ("reject/8bitcomment.zip", "Filename not in central directory"), + ("reject/cd_extra_entry.zip", "Duplicate filename in central directory"), + ("reject/cd_missing_entry.zip", "Filename not in central directory"), + ("reject/data_descriptor_bad_crc_0.zip", "Unknown record signature"), + ("reject/dupe_eocd.zip", "Truncated central directory"), + ( + "reject/eocd64_locator_mismatch.zip", + "Mis-matched EOCD64 record and locator offset", + ), + ("reject/eocd64_non_locator.zip", "Malformed zip file"), + ("reject/eocd64_without_eocd.zip", "Malformed zip file"), + ("reject/eocd64_without_locator.zip", "Malformed zip file"), + ("reject/missing_local_file.zip", "Missing filename in local headers"), + ("reject/extra3byte.zip", "Malformed zip file"), + ("reject/non_ascii_original_name.zip", "Filename not unicode"), + ("reject/not.zip", "File is not a zip file"), + ("reject/prefix.zip", "Unknown record signature"), + ("reject/second_unicode_extra.zip", "Filename not in central directory"), + ("reject/shortextra.zip", "Corrupt extra field 7075 (size=9)"), + ("reject/suffix_not_comment.zip", "Trailing data"), + ("reject/unicode_extra_chain.zip", "Filename not in central directory"), + ("reject/wheel-1.0-py3-none-any.whl", "Duplicate filename in local headers"), + ("reject/zip64_eocd_confusion.zip", "Filename not in central directory"), + ("reject/zip64_eocd_extensible_data.zip", "Bad offset for central directory"), + ("reject/zip64_extra_csize.zip", "Malformed zip file"), + ("reject/zip64_extra_too_long.zip", "Mis-matched data size"), + ( + "reject/zip64_extra_too_short.zip", + "Corrupt zip64 extra field. 
Compress size not found.", + ), + ("reject/zip64_extra_usize.zip", "Malformed zip file"), + ("reject/zipinzip.zip", "Filename not in central directory"), + ], +) +def test_bad_zips(filename, error): + result = zipfiles.validate_zipfile(zippath(filename)) + assert result[0] is False, error + assert result[1] == error + + # Also test as a ZIP provided as a dist + # is rejected if uploaded. The message + # might be different, as this function + # also checks ZIP validity. + result = _is_valid_dist_file(zippath(filename), "sdist") + assert result[0] is False + + +@pytest.mark.parametrize("filename", list(os.listdir(ZIPDATA_DIR / "accept"))) +def test_good_zips(filename): + result = zipfiles.validate_zipfile(zippath(f"accept/{filename}")) + assert result[0] is True + assert result[1] is None + + +def test_local_file_header(): + # Positive case! + header = struct.pack("+ This email is notifying you of an upcoming deprecation that we have + determined may affect you as a result of your recent upload to + '{{ project_name }}'. +
+
+ In the future, PyPI will require the file contents of all newly uploaded
+ wheel distributions to match their included RECORD file.
+ Specifically, your recent upload of '{{ filename }}' has file contents
+ that do not match the included RECORD file.
+
+ Any files already uploaded can remain in place as-is and do + not need to be updated or removed. +
++ In most cases, this can be resolved by upgrading the version of your build + tooling to a later version that correctly records wheel contents. You do + not need to remove the file. +
++ Please read this + PyPI blog post + for more information. If you have questions, you can email + admin@pypi.org to communicate with the + PyPI administrators. +
+{% endblock %} diff --git a/warehouse/templates/email/wheel-record-mismatch/body.txt b/warehouse/templates/email/wheel-record-mismatch/body.txt new file mode 100644 index 000000000000..51d412e1d330 --- /dev/null +++ b/warehouse/templates/email/wheel-record-mismatch/body.txt @@ -0,0 +1,19 @@ +{# SPDX-License-Identifier: Apache-2.0 -#} + +{% extends "email/_base/body.txt" %} + +{% block content %} + +This email is notifying you of an upcoming deprecation that we have determined may affect you as a result of your recent upload to '{{ project_name }}'. + +Specifically, your recent upload of '{{ filename }}' has file contents that do not match the included RECORD file. In the future, PyPI will require the file contents of all newly uploaded wheel distributions to match their included RECORD file. + +Any files already uploaded will remain in place as-is and do not need to be updated or removed. + +In most cases, this can be resolved by upgrading the version of your build tooling to a later version that correctly records wheel contents. You do not need to remove the file. + +Please read this PyPI blog post for more information: https://blog.pypi.org/posts/2025-08-07-wheel-archive-confusion-attacks + +If you have questions, you can email admin@pypi.org to communicate with the PyPI administrators. 
+ +{% endblock %} diff --git a/warehouse/templates/email/wheel-record-mismatch/subject.txt b/warehouse/templates/email/wheel-record-mismatch/subject.txt new file mode 100644 index 000000000000..9fe5c6dc571c --- /dev/null +++ b/warehouse/templates/email/wheel-record-mismatch/subject.txt @@ -0,0 +1,5 @@ +{# SPDX-License-Identifier: Apache-2.0 -#} + +{% extends "email/_base/subject.txt" %} + +{% block subject %}Deprecation notice for recent wheel distribution upload to '{{ project_name }}'{% endblock %} diff --git a/warehouse/utils/zipfiles.py b/warehouse/utils/zipfiles.py new file mode 100644 index 000000000000..8ca10422f3bc --- /dev/null +++ b/warehouse/utils/zipfiles.py @@ -0,0 +1,312 @@ +# SPDX-License-Identifier: Apache-2.0 + +import os +import struct +import typing +import zipfile + +RECORD_SIG_CENTRAL_DIRECTORY = b"\x50\x4b\x01\x02" +RECORD_SIG_LOCAL_FILE = b"\x50\x4b\x03\x04" +RECORD_SIG_EOCD = b"\x50\x4b\x05\x06" +RECORD_SIG_EOCD64 = b"\x50\x4b\x06\x06" +RECORD_SIG_EOCD64_LOCATOR = b"\x50\x4b\x06\x07" +RECORD_SIG_DATA_DESCRIPTOR = b"\x50\x4b\x07\x08" + + +class InvalidZipFileError(Exception): + """Internal exception used by this module""" + + +def _seek_check(fp: typing.IO[bytes], amt: int, /) -> None: + """Seek forward by ``amt`` bytes from the current + position. Raises InvalidZipFileError if ``amt`` is + negative; the resulting position is not verified. + """ + if amt < 0: # pragma: no cover + raise InvalidZipFileError("Negative offset") + fp.seek(amt, os.SEEK_CUR) + + +def _read_check(fp: typing.IO[bytes], amt: int, /) -> bytes: + """Read and assert there was enough data available.""" + if amt < 0: # pragma: no cover + raise InvalidZipFileError("Negative offset") + data = fp.read(amt) + if len(data) != amt: + raise InvalidZipFileError("Malformed zip file") + return data + + +def _handle_local_file_header( + fp: typing.IO[bytes], zipfile_files_and_sizes: dict[str, int] +) -> bytes: + """ + Parses the body of a Local File header. Returns + the contained filename field of the record. 
+ + See section 4.3.7 of APPNOTE.TXT. + """ + data = _read_check(fp, 26) + gpbf, compress_method, compressed_size, filename_size, extra_size = struct.unpack( + "bytes: + """ + Parses the body of a Central Directory (CD) header. + Returns the contained filename field of the record. + + See section 4.3.12 of APPNOTE.TXT. + """ + data = _read_check(fp, 42) + compressed_size, filename_size, extra_size, comment_size, offset = struct.unpack( + "None: + """ + Parses the body of an End of Central Directory (EOCD) record. + + See section 4.3.16 of APPNOTE.TXT. + """ + data = _read_check(fp, 18) + ( + cd_offset, + comment_size, + ) = struct.unpack(" None: + """ + Parses the body of an ZIP64 End of Central Directory (EOCD64) record. + + See section 4.3.14 of APPNOTE.TXT. + """ + data = _read_check(fp, 8) + (eocd64_size,) = struct.unpack(" int: + """ + Parses the body of an ZIP64 End of Central Directory Locator record. + + See section 4.3.15 of APPNOTE.TXT. + """ + data = _read_check(fp, 16) + (eocd64_offset,) = struct.unpack("tuple[bool, str | None]: + """ + Validates that a ZIP file would parse the same through + a ZIP implementation that checks the Central Directory + and an implementation that streams Local File headers + without checking the Central Directory (CD). + + This is done mostly by ensuring there are no duplicate + or mismatched files between Local Files and CD. + + Implemented using the ZIP standard (APPNOTE.TXT): + https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT + """ + + # Process the zipfile through Python's + # zipfile processor, the same used by + # pip and other Python installers. + try: + zfp = zipfile.ZipFile(zip_filepath, mode="r") + # Store compression sizes from the CD for use later. 
+ zipfile_files = {zfi.filename: zfi.compress_size for zfi in zfp.filelist} + except zipfile.BadZipfile as e: + return False, e.args[0] + + with open(zip_filepath, mode="rb") as fp: + # Track filenames that have been seen in + # Local File and Central Directory headers + # to avoid duplicates or missing entries. + local_filenames = set() + cd_filenames = set() + + # These variables enforce the requirements + # of EOCD for ZIP64. ZIP64 has its own EOCD + # record, but that record may be followed by + # a EOCD64 Locator and/or a '0xFF'-filled + # non-ZIP64 EOCD record. + expected_eocd64_offset = None + actual_eocd64_offset = None + + while True: + try: + signature = _read_check(fp, 4) + + # Only accept EOCD after an EOCD64 if we've + # seen the EOCD64 Locator first. + if ( + signature == RECORD_SIG_EOCD + and expected_eocd64_offset is not None + and actual_eocd64_offset is None + ): + return False, "Malformed zip file" + + # Only accept a single EOCD64 Locator after EOCD64. + if signature == RECORD_SIG_EOCD64_LOCATOR and ( + expected_eocd64_offset is None or actual_eocd64_offset is not None + ): + return False, "Malformed zip file" + + # If we've seen an EOCD64 record then we only + # accept an EOCD64 Locator or an EOCD. 
+ if ( + signature not in (RECORD_SIG_EOCD64_LOCATOR, RECORD_SIG_EOCD) + and expected_eocd64_offset is not None + ): + return False, "Malformed zip file" + + # Central Directory File Header + if signature == RECORD_SIG_CENTRAL_DIRECTORY: + filename = _handle_central_directory_header(fp) + if filename in cd_filenames: + raise InvalidZipFileError( + "Duplicate filename in central directory" + ) + if filename not in local_filenames: + raise InvalidZipFileError("Missing filename in local headers") + cd_filenames.add(filename) + + # Local File Header + elif signature == RECORD_SIG_LOCAL_FILE: + filename = _handle_local_file_header(fp, zipfile_files) + if filename in local_filenames: + raise InvalidZipFileError("Duplicate filename in local headers") + local_filenames.add(filename) + + # End of Central Directory + elif signature == RECORD_SIG_EOCD: + _handle_eocd(fp) + break # This always means the end of a ZIP. + + # End of Central Directory (ZIP64) + elif signature == RECORD_SIG_EOCD64: + # We cross-check this value if + # we see EOCD64 Locator later. + # -4 because we just read signature bytes. + expected_eocd64_offset = fp.tell() - 4 + _handle_eocd64(fp) + + # End of Central Directory (ZIP64) Locator + elif signature == RECORD_SIG_EOCD64_LOCATOR: + actual_eocd64_offset = _handle_eocd64_locator(fp) + + # Cross-check the offset specified in the EOCD64 Locator + # record with the one we ourselves recorded earlier. + if ( + expected_eocd64_offset is None + or expected_eocd64_offset != actual_eocd64_offset + ): + return False, "Mis-matched EOCD64 record and locator offset" + + # Note that there are other record types, + # but I didn't find any on PyPI, and they don't + # seem relevant to Python packaging use-case + # ie: encrypted ZIP files. So maybe we want + # to reject these anyway? + else: + return False, "Unknown record signature" + + except InvalidZipFileError as e: + return False, e.args[0] + + # Defensive, this shouldn't be possible in regular operation. 
+ if cd_filenames != local_filenames: # pragma: no cover + return False, "Mis-matched local headers and central directory" + + # Detect whether there is trailing data + # after the end of the zip file. + # This can indicate ZIP files that are + # concatenated together. + cur = fp.tell() + fp.seek(0, os.SEEK_END) + if cur != fp.tell(): + return False, "Trailing data" + + return True, None
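To illustrate the core idea behind `validate_zipfile` (a streaming parser and a Central Directory parser must see the same files), here is a much smaller standalone sketch. It is not the module above: the names `local_header_names` and `views_disagree` are hypothetical, and it assumes UTF-8 file names, no ZIP64 records, and no data descriptors. It does not check EOCD records or trailing data.

```python
import io
import struct
import zipfile

LOCAL_FILE_SIG = b"PK\x03\x04"


def local_header_names(data: bytes) -> list[str]:
    """Walk Local File headers from offset 0, as a streaming parser would."""
    names = []
    pos = 0
    while data[pos : pos + 4] == LOCAL_FILE_SIG:
        header = data[pos + 4 : pos + 30]
        if len(header) != 26:
            raise ValueError("truncated local file header")
        # Fixed-size part of the Local File header (APPNOTE.TXT 4.3.7).
        (_ver, flags, _method, _time, _date, _crc,
         csize, _usize, name_len, extra_len) = struct.unpack("<HHHHHIIIHH", header)
        if flags & 0x08:
            # Bit 3 means sizes live in a trailing data descriptor.
            raise ValueError("data descriptors are not handled in this sketch")
        name_start = pos + 30
        names.append(data[name_start : name_start + name_len].decode("utf-8"))
        # Skip the name, the extra field, and the compressed data.
        pos = name_start + name_len + extra_len + csize
    return names


def views_disagree(data: bytes) -> bool:
    """True if the streamed view differs from the Central Directory view."""
    with zipfile.ZipFile(io.BytesIO(data)) as zfp:
        central = zfp.namelist()
    local = local_header_names(data)
    return sorted(central) != sorted(local) or len(set(local)) != len(local)
```

For an archive written normally by `zipfile` to a seekable buffer, `views_disagree` returns `False`. Prepending an extra Local File record to such an archive makes the streamed view contain a file the Central Directory does not list, which is the kind of disagreement the real validator rejects.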