Deterministically normalize wheel ZIP metadata #2344
Comments
Hey there. I can see that this would be a nice property. Have you come across SOURCE_DATE_EPOCH? I think the goal is the same: to fix the timestamps in order to produce deterministic builds. I remember it being discussed a while back, but I don't know the extent to which the Python tooling supports it. It would be worth trying, though.
Could you use it in a custom repair step already? Then it might just need some documentation.
Hi, thanks for the responses! About a custom repair step:
About SOURCE_DATE_EPOCH:
I think it's worth observing what the current behavior of cibuildwheel is:
I've considered filing a similar issue with auditwheel, but it seems like a more natural fit for cibuildwheel; auditwheel's scope is narrower. As a side note: I am working on some in-depth notes about deterministic Python wheels in general (not just as relevant to cibuildwheel/this issue). I'll share a link when they're ready. I think this topic is probably worth a discussion with other projects about what ideal behavior, if any, "ought to" look like for whl ZIP metadata.
Maybe this should be fixed on the auditwheel/delocate side?
My apologies, a correction to my previous comment. Setting SOURCE_DATE_EPOCH does cause auditwheel to normalize the ZIP metadata timestamps. I said this, which is incorrect:
That said, my main question still stands: should the default behavior of cibuildwheel be to output whl ZIPs with "current time" timestamps, or to perform some normalization? SOURCE_DATE_EPOCH affects more than just the ZIP timestamps, and it is a relatively heavy-handed way to achieve reproducibility.
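For readers unfamiliar with the variable: SOURCE_DATE_EPOCH holds a Unix timestamp in seconds, whereas ZIP entries store a calendar date that cannot be earlier than 1980. Below is a minimal sketch of how a tool might map one to the other. It is not the code auditwheel or cibuildwheel actually uses; the function name `zip_date_time_from_env` and the clamping choice are assumptions made for illustration.

```python
import os
import time

# ZIP (DOS) timestamps cannot represent dates before 1980-01-01.
ZIP_MIN_DATE_TIME = (1980, 1, 1, 0, 0, 0)


def zip_date_time_from_env(environ=os.environ):
    """Hypothetical helper: return a ZipInfo-style date_time tuple from
    SOURCE_DATE_EPOCH, or None if the variable is not set."""
    raw = environ.get("SOURCE_DATE_EPOCH")
    if raw is None:
        return None
    # Interpret the value as seconds since the Unix epoch, in UTC.
    date_time = time.gmtime(int(raw))[:6]
    # Clamp to the earliest date a ZIP entry can hold (an assumption;
    # a real tool might prefer to raise an error instead).
    return max(date_time, ZIP_MIN_DATE_TIME)


print(zip_date_time_from_env({"SOURCE_DATE_EPOCH": "1700000000"}))
print(zip_date_time_from_env({"SOURCE_DATE_EPOCH": "0"}))
```

The returned tuple has the shape that `zipfile.ZipInfo` expects for its `date_time` argument.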
Hm. So you suggest setting SOURCE_DATE_EPOCH to the commit creation time? Would that produce the same wheel when the job is restarted?
I filed this issue about improving the cibuildwheel tool itself... if you're looking for recommendations about implementing reproducible wheel builds for your own project, we should probably discuss that elsewhere. (You're welcome to comment on the gist I posted. There are a lot of ways to go; git commit time is not the only way.)
[EDIT: My apologies @Czaki, when I wrote this I was focused on the mention of commit creation time, and didn't realize you are involved with the cibuildwheel project.]
As regards cibuildwheel in general, yes, the current observable behavior is: if you set SOURCE_DATE_EPOCH (to whatever you want), then when/if auditwheel is run, it will normalize the ZIP metadata timestamps. There are a few caveats to note:
For potential cibuildwheel default behavior, setting SOURCE_DATE_EPOCH to commit creation time is not something I would recommend. That's a fine choice for an individual project, but not for cibuildwheel. For at least two reasons:
I mean only for builds from git.
So I do not fully understand your proposal. If we do not take the timestamp from the git commit, then restarting the cibuildwheel job will give us a different binary file.
Based on my previous contribution to |
I don't have a fully-formed proposition yet, because I'm not deeply familiar with cibuildwheel, and because there are some open questions about wheel ZIP metadata that I think merit discussion with other projects/the wider Python community. For now, I only have tentative suggestions...
My original (very tentative) suggestion was for cibuildwheel to run python-stripzip or similar. Stripzip uses a hardcoded default -- it sets last_mod_time to 0 and last_mod_date to 0x21. (Note that this is not the same as unix epoch time zero. ZIPs use DOS timestamps...)
To that, I would add that if SOURCE_DATE_EPOCH is set, cibuildwheel should avoid interfering with it. That could maybe (?) mean something like: if it's set, then don't run the stripzip step.
Personally, I wouldn't suggest cibuildwheel implement different behavior for VCS vs non-VCS. Seems counterintuitive to me.
Happy to know that
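To make the DOS timestamp remark concrete, here is a small sketch that decodes the two 16-bit fields mentioned above (last_mod_date = 0x21, last_mod_time = 0) using the standard MS-DOS bit layout. The function name `decode_dos_datetime` is made up for this example and is not part of stripzip.

```python
def decode_dos_datetime(dos_date: int, dos_time: int):
    """Decode MS-DOS date and time fields as stored in ZIP headers."""
    # Date: bits 15-9 = years since 1980, bits 8-5 = month, bits 4-0 = day.
    year = 1980 + ((dos_date >> 9) & 0x7F)
    month = (dos_date >> 5) & 0x0F
    day = dos_date & 0x1F
    # Time: bits 15-11 = hour, bits 10-5 = minute, bits 4-0 = seconds / 2.
    hour = (dos_time >> 11) & 0x1F
    minute = (dos_time >> 5) & 0x3F
    second = (dos_time & 0x1F) * 2
    return (year, month, day, hour, minute, second)


print(decode_dos_datetime(0x21, 0))  # (1980, 1, 1, 0, 0, 0)
```

So the hardcoded values correspond to 1980-01-01 00:00:00, the earliest timestamp the format can express, rather than 1970-01-01.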
I have put together some notes about deterministic Python wheels: And I started a thread about the subject here, to hopefully spark discussion with various projects/the wider Python community: @Czaki My apologies again for completely misinterpreting your message the other day. |
Description
It would be nice for cibuildwheel to include by default a post-processing step
normalizing wheels for determinism/reproducibility.
This could be a significant step toward widespread verifiably-reproducible
builds of PyPI-hosted wheels.
Background
When using cibuildwheel to build a straightforward Cython package, I found that
by default, the resulting wheels were never bit-for-bit reproducible, because of
ZIP metadata (timestamps etc).
That is, the wheels always came out with a different checksum after each run. An
inspection of the wheels showed that the files contained in the archive were in
fact bit-for-bit reproducible, and the differences were purely due to ZIP
metadata. In particular, the problem was with timestamps (and potentially also
the ordering of entries).
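The kind of inspection described above can be sketched with Python's standard `zipfile` module: hash each member's contents and compare those separately from the per-entry metadata. This is only an illustration using two tiny in-memory archives, not the procedure actually used on the wheels in question; `describe_zip` and `make_zip` are hypothetical helper names.

```python
import hashlib
import io
import zipfile


def describe_zip(data: bytes):
    """Return (content digests, per-entry metadata) for a ZIP given as bytes."""
    contents, metadata = [], []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():  # entries in central-directory order
            digest = hashlib.sha256(zf.read(info)).hexdigest()
            contents.append((info.filename, digest))
            metadata.append((info.filename, info.date_time, info.external_attr))
    return contents, metadata


def make_zip(date_time):
    """Build a tiny archive whose single entry carries the given timestamp."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(zipfile.ZipInfo("pkg/__init__.py", date_time), b"x = 1\n")
    return buf.getvalue()


# Two "builds" with identical file contents, created two seconds apart.
first = make_zip((2024, 1, 1, 12, 0, 0))
second = make_zip((2024, 1, 1, 12, 0, 2))

contents_a, meta_a = describe_zip(first)
contents_b, meta_b = describe_zip(second)
print("same file contents:", sorted(contents_a) == sorted(contents_b))
print("same ZIP metadata:", meta_a == meta_b)
print("same archive bytes:", first == second)
```

With real wheels, one would read the two `.whl` files from disk and pass their bytes to `describe_zip` in the same way.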
When I added a post-processing step using either python-stripzip or Debian's
strip-nondeterminism, wheels were bit-for-bit reproducible.

The cibuildwheel docs mention: "Because the builds are happening in manylinux
Docker containers, they're perfectly reproducible." This is generally true for
the build itself, but is not true for the final artifacts, because of the ZIP
timestamps.
Considerations
There are a number of tools for this.
To my knowledge, Debian's strip-nondeterminism is the most mature and featureful
one.
Also of particular interest is python-stripzip.
Other tools include:
There are some issues which arise, notably:
SOURCE_DATE_EPOCH for this usage.

Note that if you want to implement reproducible builds for a specific project,
you can just pick a strategy, stick with it, and be done with it. But if aiming
to implement a blanket solution in a centralized tool, it's probably worth
investigating the details.
I might tentatively suggest python-stripzip for use in cibuildwheel, because it
makes fewer modifications than strip-nondeterminism. In particular, it doesn't
change the order of entries, so if .dist-info files are placed at the end of
the ZIP (as is best practice), it will leave them there.
Here is a script using cibuildwheel and python-stripzip which demonstrates
successfully generating bit-for-bit reproducible wheels:
https://gist.github.com/tabbyrobin/d6c5cf5323fe54a50004c1291da39315#file-build-wheels-sh
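For readers who want a feel for what such a normalization step does, here is a rough Python sketch using only the standard library. It is not how python-stripzip or strip-nondeterminism work internally (this sketch rewrites the whole archive and drops extra fields and comments); the name `normalize_zip` and the choice of fixed timestamp are assumptions for illustration. Like the approach suggested above, it keeps entries in their original order.

```python
import io
import zipfile

# Assumed fixed timestamp: the earliest date a ZIP entry can represent.
FIXED_DATE_TIME = (1980, 1, 1, 0, 0, 0)


def normalize_zip(data: bytes) -> bytes:
    """Rewrite a ZIP with a fixed timestamp on every entry, keeping entry order."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, zipfile.ZipFile(out, "w") as dst:
        for info in src.infolist():  # original order, so .dist-info stays last
            fixed = zipfile.ZipInfo(info.filename, FIXED_DATE_TIME)
            fixed.compress_type = info.compress_type
            fixed.external_attr = info.external_attr  # keep permission bits
            dst.writestr(fixed, src.read(info))
    return out.getvalue()


def demo_zip(date_time):
    """Build a small stand-in archive stamped with the given timestamp."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in ("pkg/mod.py", "pkg-1.0.dist-info/RECORD"):
            zf.writestr(zipfile.ZipInfo(name, date_time), b"data\n")
    return buf.getvalue()


run_a = demo_zip((2024, 5, 1, 10, 0, 0))
run_b = demo_zip((2025, 6, 2, 11, 30, 0))
print("raw archives identical:", run_a == run_b)
print("normalized archives identical:", normalize_zip(run_a) == normalize_zip(run_b))
```

Because the members are recompressed, byte-identical output also depends on the same Python and zlib versions being used for each run, which is one reason a tool that patches the headers in place may be preferable.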
Build log
No response
CI config
No response