Deterministically normalize wheel ZIP metadata #2344

Open
tabbyrobin opened this issue Apr 1, 2025 · 11 comments

@tabbyrobin

Description

It would be nice for cibuildwheel to include by default a post-processing step
normalizing wheels for determinism/reproducibility.

This could be a significant step toward widespread verifiably-reproducible
builds of PyPI-hosted wheels.

Background

When using cibuildwheel to build a straightforward Cython package, I found that
by default, the resulting wheels were never bit-for-bit reproducible, because of
ZIP metadata (timestamps etc).

That is, the wheels always came out with a different checksum after each run. An
inspection of the wheels showed that the files contained in the archive were in
fact bit-for-bit reproducible, and the differences were purely due to ZIP
metadata. In particular, the problem was with timestamps (and potentially also
the ordering of entries).
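The kind of inspection described above can be sketched in Python with the standard-library `zipfile` module. This is a minimal illustration, not the procedure actually used in this report; the function names `entry_summary` and `compare_wheels` are hypothetical.

```python
import hashlib
import zipfile


def entry_summary(path):
    """Return (content digests, timestamps, entry order) for a ZIP/wheel."""
    contents, timestamps = {}, {}
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():
            # Hash the decompressed bytes so ZIP metadata does not affect the digest.
            contents[info.filename] = hashlib.sha256(zf.read(info)).hexdigest()
            timestamps[info.filename] = info.date_time
        order = zf.namelist()
    return contents, timestamps, order


def compare_wheels(path_a, path_b):
    """Report which aspects of two wheels match."""
    contents_a, times_a, order_a = entry_summary(path_a)
    contents_b, times_b, order_b = entry_summary(path_b)
    return {
        "same_contents": contents_a == contents_b,
        "same_timestamps": times_a == times_b,
        "same_order": order_a == order_b,
    }
```

If two builds of the same source report `same_contents` as true but `same_timestamps` as false, the checksum difference comes from ZIP metadata rather than from the built files.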

When I added a post-processing step using either python-stripzip or Debian's
strip-nondeterminism, wheels were bit-for-bit reproducible.

The cibuildwheel docs mention: "Because the builds are happening in manylinux
Docker containers, they're perfectly reproducible." This is generally true for
the build itself, but is not true for the final artifacts, because of the ZIP
timestamps.

Considerations

There are a number of tools for this.

To my knowledge, Debian's strip-nondeterminism is the most mature and featureful one.

Also of particular interest is python-stripzip.

Other tools include:

There are some issues which arise, notably:

  • Ordering of ZIP entries.
  • What timestamp(s) should be used in the ZIP metadata.
  • Whether to respect SOURCE_DATE_EPOCH for this usage.

Note that if desiring to implement reproducible builds for a specific project,
one can just pick a strategy, stick with it, and be done with it. But if aiming
to implement a blanket solution in a centralized tool, it's probably worth
investigating the details.

I might tentatively suggest python-stripzip for use in cibuildwheel, because
it makes fewer modifications than strip-nondeterminism. In particular, it
doesn't change the order of entries, so if .dist-info files are placed at the
end of the ZIP (as is best practice), it will leave them there.
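To illustrate what order-preserving timestamp normalization could look like, here is a minimal sketch using only the standard-library `zipfile` module. It rewrites the archive into a new file, which is an assumption made for simplicity and is not necessarily how python-stripzip or strip-nondeterminism work internally. The name `normalize_wheel` and the choice of 1980-01-01 as the fixed timestamp are illustrative.

```python
import zipfile

# Earliest timestamp a ZIP entry can carry; chosen here purely for illustration.
FIXED_DATE_TIME = (1980, 1, 1, 0, 0, 0)


def normalize_wheel(src, dst, date_time=FIXED_DATE_TIME):
    """Copy src to dst, resetting every entry's timestamp but keeping entry order."""
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        for info in zin.infolist():  # infolist() follows the archive's own order
            clean = zipfile.ZipInfo(info.filename, date_time=date_time)
            clean.compress_type = info.compress_type
            clean.external_attr = info.external_attr  # keep file mode bits
            zout.writestr(clean, zin.read(info))
```

Because entries are written in the order they are read, .dist-info files that were at the end of the input stay at the end of the output. Recompressed bytes can still vary between zlib versions, so a sketch like this only gives stable output within one environment.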

Here is a script using cibuildwheel and python-stripzip which demonstrates
successfully generating bit-for-bit reproducible wheels:
https://gist.github.com/tabbyrobin/d6c5cf5323fe54a50004c1291da39315#file-build-wheels-sh

Build log

No response

CI config

No response

@joerick
Contributor

joerick commented Apr 1, 2025

Hey there. I can see that this would be a nice property. Have you come across SOURCE_DATE_EPOCH? I think the goal is the same - to fix the timestamps in order to produce deterministic builds. I remember it being discussed a while back, but I don't know the extent to which the Python tooling supports it - it would be worth trying it though.

@henryiii
Contributor

henryiii commented Apr 1, 2025

Could you use it in a custom repair step already? Then it might just need some documentation.

@tabbyrobin
Author

tabbyrobin commented Apr 2, 2025

Hi, thanks for the responses!

About a custom repair step:

  • It is possible to use it in a custom repair step. In the example script I linked, it's done on this line: https://gist.github.com/tabbyrobin/d6c5cf5323fe54a50004c1291da39315#file-build-wheels-sh-L40
  • I'm suggesting it may be a good fit to be done automatically by cibuildwheel. Cibuildwheel already provides some guarantees around reproducibility, and "this tool generates your wheels in a specific way" is already in-scope for the project.
  • Or, if not automatically, for cibuildwheel to provide a variable to enable the behavior and/or a good spot for a hook to invoke the relevant command. (Currently cibuildwheel has the CIBW_REPAIR_WHEEL_COMMAND variable, which could be a handy entry point, but it isn't quite the right fit for this.)

About SOURCE_DATE_EPOCH:

  • There appears to be reasonably widespread support for it within the Python tooling. For example, setuptools supports it, and so does Hatchling.
  • However: When running cibuildwheel on the project I used as a test, whichever component is responsible for generating the whl ZIP apparently does not support it -- at least not as far as the ZIP metadata is concerned. [CORRECTION: Setting SOURCE_DATE_EPOCH does cause auditwheel to set the ZIP metadata timestamps to $SOURCE_DATE_EPOCH. Auditwheel already supports SOURCE_DATE_EPOCH and implements it for whl ZIP metadata.]
  • If I understand correctly, because of build backend pluggability, there is a potentially large/undefined number of components that will handle generating whl ZIPs. This means that tracking down each one of them to implement deterministic ZIP metadata and/or SOURCE_DATE_EPOCH is difficult. Many of them are also likely effectively legacy/unmaintained. A centralized solution seems more tenable.
  • It is not yet clear to me whether SOURCE_DATE_EPOCH is a natural fit for whl ZIP timestamps. Note that Yocto project respects SOURCE_DATE_EPOCH, and also has a variable REPRODUCIBLE_TIMESTAMP_ROOTFS: "When building packages, various timestamps can be controlled by SOURCE_DATE_EPOCH. This, however, does not work for building images. Images contain various scattered timestamps, ..."
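For readers unfamiliar with how a tool might map SOURCE_DATE_EPOCH onto ZIP timestamps, here is a minimal sketch. It is not taken from auditwheel, setuptools, or Hatchling; the function name `zip_date_time_from_env` is hypothetical, and clamping to 1980 is an assumption based on the ZIP format's inability to encode earlier dates.

```python
import os
import time

ZIP_MIN_DATE_TIME = (1980, 1, 1, 0, 0, 0)  # earliest value the ZIP format can encode


def zip_date_time_from_env(environ=os.environ):
    """Return a ZipInfo-style date_time from SOURCE_DATE_EPOCH, or None if unset."""
    raw = environ.get("SOURCE_DATE_EPOCH")
    if raw is None:
        return None
    # SOURCE_DATE_EPOCH is a count of seconds since the Unix epoch, in UTC.
    utc = time.gmtime(int(raw))[:6]  # (year, month, day, hour, minute, second)
    return max(utc, ZIP_MIN_DATE_TIME)  # clamp pre-1980 values
```

Returning `None` when the variable is unset leaves the decision about a fallback (current time, a hardcoded default, or something else) to the caller, which is exactly the open question in this thread.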

I think it's worth observing what the current behavior of cibuildwheel is:

  • Behavior seems to be to pass on (unmodified) whatever the invoked components spit out.
  • For example, if I run cibuildwheel on Apr 2 around 8am, it may embed the timestamp: 386D3 Last Mod Time 5A827925 'Wed Apr 2 08:09:10 2025' (as shown by zipdetails command).
  • Question: Is this really the intended behavior of cibuildwheel? It was surprising behavior to me, since with the project I tested, everything was perfectly reproducible up until the very last moment. The ZIP metadata was the only element missing.

I've considered filing a similar issue with auditwheel, but it seems like a more natural fit for cibuildwheel; auditwheel's scope is more narrow.

As a side note: I am working on some in-depth notes about deterministic Python wheels in general. (Not just as relevant to cibuildwheel/this issue.) I'll share a link when they're ready. I think this topic is probably worth starting a discussion with other projects about what ideal behavior, if any, "ought to" look like for whl ZIP metadata.

@Czaki
Contributor

Czaki commented Apr 2, 2025

However: When running cibuildwheel on the project I used as a test, whichever component is responsible for generating the whl ZIP apparently does not support it -- at least not as far as the ZIP metadata is concerned.

Maybe this should be fixed on auditwheel/delocate side?

@tabbyrobin
Author

My apologies, a correction to my previous comment.

Setting SOURCE_DATE_EPOCH does cause auditwheel to set the ZIP metadata timestamps to $SOURCE_DATE_EPOCH. Auditwheel already supports SOURCE_DATE_EPOCH and implements it for whl ZIP metadata.

I said this, which is incorrect:

About SOURCE_DATE_EPOCH: However: When running cibuildwheel on the project I used as a test, whichever component is responsible for generating the whl ZIP apparently does not support it -- at least not as far as the ZIP metadata is concerned.

That said, my main question still stands: should the default behavior of cibuildwheel be to output whl ZIPs with "current time" timestamps, or to perform some normalization? SOURCE_DATE_EPOCH affects more than just the ZIP timestamps, and it is a relatively heavy-handed way to achieve reproducibility.

@Czaki
Contributor

Czaki commented Apr 2, 2025

Hm. So you suggest setting SOURCE_DATE_EPOCH to the commit creation time? That should produce the same wheel when the job is restarted?

@tabbyrobin
Author

tabbyrobin commented Apr 2, 2025

I filed this issue about improving the cibuildwheel tool itself... if you're looking for recommendations about implementing reproducible wheel builds for your own project, we should probably discuss that elsewhere. (You're welcome to comment on the gist I posted. There's a lot of ways to go, git commit time is not the only way.) [EDIT: My apologies @Czaki, when I wrote this I was focused on the mention of commit creation time, and didn't realize you are involved with cibuildwheel project.]

As regards cibuildwheel in general, yes, current observable behavior is: If you set SOURCE_DATE_EPOCH (to whatever you want), then when/if auditwheel is run, it will normalize the ZIP metadata timestamps.

There are a few caveats to note:

  • I haven't tested delocate (for macOS) or delvewheel (for Windows).
  • This (probably) only works if auditwheel is run. Cibuildwheel runs it by default, but it won't run if it's been overridden, for example with CIBW_REPAIR_WHEEL_COMMAND.

@tabbyrobin
Author

For potential cibuildwheel default behavior, setting SOURCE_DATE_EPOCH to commit creation time is not something I would recommend. That's a fine choice for an individual project, but not for cibuildwheel. For at least two reasons:

  • cibuildwheel supports building from directories, not just from git. In an arbitrary directory (or extracted sdist), VCS commit metadata is not available.
  • Setting SOURCE_DATE_EPOCH is relatively heavy-handed: it affects more than just the ZIP timestamps. (Incidentally, cf. when Rubygems decided to set the SOURCE_DATE_EPOCH env var (if unset), then later realized that this created problems for other code, and modified their approach to be less invasive.)

@Czaki
Contributor

Czaki commented Apr 2, 2025

cibuildwheel supports building from directories, not just from git. In an arbitrary directory (or extracted sdist), VCS commit metadata is not available.

I mean only for builds from git.

Setting SOURCE_DATE_EPOCH is relatively heavy-handed: it affects more than just the ZIP timestamps. (Incidentally, cf when Rubygems decided to set (if unset) SOURCE_DATE_EPOCH env var, and then later realized that this created problems for other code, and modified their approach to be less invasive.)

So I do not fully understand your proposal. If we do not take the timestamp from the git commit, then restarting the cibuildwheel job will produce a different binary file.

I haven't tested delocate (for MacOS) or delvewheel (for Windows).

Based on my previous contribution to delocate, its maintainers would be happy to accept such a change if it is missing.

@tabbyrobin
Author

I don't have a fully-formed proposition yet, because I'm not deeply familiar with cibuildwheel, and because there are some open questions about wheel ZIP metadata that I think merit discussion with other projects/the wider Python community. For now, I only have tentative suggestions...

My original (very tentative) suggestion was for cibuildwheel to run python-stripzip or similar. Stripzip uses a hardcoded default -- it sets last_mod_time to 0 and last_mod_date to 0x21. (Note that this is not the same as unix epoch time zero. ZIPs use DOS timestamps...)
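To make the DOS timestamp point concrete, the sketch below decodes the two 16-bit fields using the standard MS-DOS bit layout that the ZIP format uses. The function name `decode_dos_datetime` is illustrative and is not part of stripzip.

```python
def decode_dos_datetime(dos_date, dos_time):
    """Decode the 16-bit MS-DOS date and time fields used in ZIP headers."""
    year = 1980 + (dos_date >> 9)      # 7 bits: years since 1980
    month = (dos_date >> 5) & 0x0F     # 4 bits
    day = dos_date & 0x1F              # 5 bits
    hour = dos_time >> 11              # 5 bits
    minute = (dos_time >> 5) & 0x3F    # 6 bits
    second = (dos_time & 0x1F) * 2     # 5 bits, stored with 2-second resolution
    return (year, month, day, hour, minute, second)


print(decode_dos_datetime(0x21, 0))  # (1980, 1, 1, 0, 0, 0)
```

So a date field of 0x21 with a time field of 0 corresponds to 1980-01-01 00:00:00, the earliest timestamp the format can represent, rather than 1970-01-01.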

To that, I would add that if SOURCE_DATE_EPOCH is set, cibuildwheel should avoid interfering with it. That could maybe (?) mean something like if it's set, then don't run the stripzip step.

Personally, I wouldn't suggest cibuildwheel implement different behavior for VCS vs non-VCS. Seems counterintuitive to me.

Happy to know that the delocate project is likely favorable toward determinism.

@tabbyrobin
Author

I have put together some notes about deterministic Python wheels:
https://github.com/tabbyrobin/expt-repro-python-wheels/blob/main/notes-on-wheel-determinism.md

And I started a thread about the subject here, to hopefully spark discussion with various projects/the wider Python community:
https://discuss.python.org/t/best-practices-for-deterministically-normalizing-wheel-zip-metadata/90662


@Czaki My apologies again for completely misinterpreting your message the other day.
