[wip] Sorted PKG-INFO file fields to make it reproducible #1305
|
This is only a problem on Python 2 and Python <3.6, right? On Python 3.6+ anything that came from a dictionary already iterates in insertion order, and is thus entirely reproducible, right? If you need backwards compatibility, presumably you can pass an
Not that I think sorting the outputs is a serious problem, but it's worth noting the trade-off: anyone who is already using ordered containers for these things will have the order changed from whatever they specified to sorted order. |
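To make the trade-off described above concrete, here is a minimal sketch. The metadata values and the `render` helper are hypothetical and are not taken from setuptools; they only show that sorting discards an order the packager chose on purpose.

```python
from collections import OrderedDict

# Hypothetical metadata: the packager deliberately lists "Source" first.
project_urls = OrderedDict([
    ("Source", "https://example.org/src"),
    ("Documentation", "https://example.org/docs"),
    ("Bug Tracker", "https://example.org/bugs"),
])


def render(items):
    """Format (label, url) pairs as metadata-style lines (illustrative only)."""
    return ["Project-URL: %s, %s" % (label, url) for label, url in items]


# Order exactly as the packager specified it.
as_specified = render(project_urls.items())
# Order after unconditional sorting: reproducible, but no longer the packager's.
as_sorted = render(sorted(project_urls.items()))

print("\n".join(as_specified))
print("\n".join(as_sorted))
```

Both outputs are reproducible from run to run; only the first respects the order given in the setup script.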
@@ -74,7 +74,7 @@ def write_pkg_file(self, file):
         ('Maintainer-email', 'maintainer_email'),
     )

-    for field, attr in optional_fields:
+    for field, attr in sorted(optional_fields):
This already has a fixed order, there's no need to sort it.
             file.write('Platform: %s\n' % platform)
     else:
-        self._write_list(file, 'Platform', self.get_platforms())
+        self._write_list(file, 'Platform', sorted(self.get_platforms()))
I think all of these _write_list calls take a list as input, so its order will be reproducible.
|
@stevenengler Do you have an example of a |
|
@pganssle Thanks for all your comments. I may have gone overboard with the So one example I've seen this for is pyopenssl, which has: Then I think these I have been compiling with Python 3.6, which I would think should have these in a fixed order, but that doesn't seem to be happening. Plus it would be nice if it worked on older versions of Python as well. What do you think if we sort the order here (dist.py#L410) instead? Then we can remove all the other |
|
@stefanholek From my perspective it doesn't really matter much if it gets sorted early or late if it gets sorted somewhere along the pipeline. Sorting it early does allow users to screw with the order after it's been read from If there's some valid use case, then I think no sorting should happen at all, and if you want reproducible builds you either set Either way, if this PR is going to be accepted it should probably have some tests. @jaraco @benoit-pierre Any thoughts on whether there's a valid reason not to sort |
|
@pganssle Makes sense to me. The last thing I want to do is break a common workflow with setuptools. If you think it's worthwhile to sort anything, or maybe just sort dictionaries, I'll be happy to make the changes and write some tests. If not, that's okay and I think setting |
|
I'd rather not sort items unnecessarily. Wherever possible, let's retain the order indicated by the packager. Reproducible builds are worthwhile, so let's sort but only where necessary.
That's surprising and unexpected. I'd love to see an example where that's not the case. Please take another stab at it. Start with tests that illustrate the failure modes (even if intermittent). Once we have that, then we can be more confident that the changes are correcting the failure and not over-correcting. |
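One way to read "retain the order indicated by the packager" is to deduplicate without going through a `set`, so that nothing needs sorting afterwards. The sketch below is an illustration only; `unique_in_order` is a hypothetical helper, not a setuptools function, and it is not claimed to be how any merged fix works.

```python
from collections import OrderedDict


def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item.

    OrderedDict is used so the behaviour is the same on old and new Pythons.
    """
    return list(OrderedDict.fromkeys(items))


# Hypothetical input: names collected from several places, with repeats.
requested = ["tls", "test", "docs", "tls", "bundle", "test"]
extras = unique_in_order(requested)
print(extras)
```

The result is deterministic because it depends only on the input order, never on string hashes.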
|
@jaraco I'm away for the next two weeks, but I'll look at this again once I'm back. I had figured out the reason for the random sort order on Python 3.6, but I've since forgotten it (I think it had to do with the data being converted between types and the order being lost during the conversion). |
|
I'm actually still kinda on the side of sorting everything, unless we can come up with a reason not to do so. Unfortunately, it seems like the only way to figure out whether or not people are relying on the current ordering behavior is to just break it and see if it causes any problems. |
|
Not sure about other fields, but at least A failure log: https://tests.reproducible-builds.org/archlinux/community/buildbot/buildbot-1.7.0-1-any.pkg.tar.xz.html (buildbot package for Arch Linux)
I wrote a tiny script to mimic what setuptools does when building this package:

extras_require = {
    'test': [],
    'bundle': [],
    'tls': [],
    'docs': [],
}
s = set()
for k in extras_require.keys():
    s.add(k)
for k in s:
    print(k)

Two sample runs of this script: Actually I'm not sure why |
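The run-to-run variation can be observed in a controlled way by launching fresh interpreters with different `PYTHONHASHSEED` values. This is a rough sketch, assuming a CPython interpreter is available as `sys.executable`; `run_with_seed` is a hypothetical helper written for this illustration.

```python
import os
import subprocess
import sys

# Same keys as in the script above, iterated through a set in a child process.
SNIPPET = "print(list({'test', 'bundle', 'tls', 'docs'}))"


def run_with_seed(seed):
    """Run SNIPPET in a new interpreter with the given hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output([sys.executable, "-c", SNIPPET], env=env)
    return out.decode().strip()


# Different seeds may yield different iteration orders.
orders = {seed: run_with_seed(seed) for seed in range(5)}
for seed, order in sorted(orders.items()):
    print(seed, order)

# The same seed gives the same order every time.
same_seed_twice = (run_with_seed(1), run_with_seed(1))
print(same_seed_twice[0] == same_seed_twice[1])
```

Pinning the seed makes the iteration order repeatable, but it does not make the order meaningful; it only hides the fact that the output depends on string hashes.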
|
Can we get this issue sorted out? It has persisted for quite a while, and sorting isn't really an intrusive operation considering all the other regular code paths and computation done during execution. |
|
@anthraxx I think for the people who care about reproducible builds, this is not really an issue. Setting |
|
Well, I'm from Arch Linux and we are taking reproducible builds seriously. Quite frankly, the solution here is not to set language-specific settings that influence internal data structure behavior. We are talking about the trivial matter of making a non-deterministic iteration deterministic, by ensuring a defined order of iteration for something that is a direct part of an output artifact.
The only real, non-workaround fix is to do it where the iteration is performed incorrectly, and the fix is as trivial as sorting the list that ultimately becomes part of the output artifact.
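A minimal sketch of sorting at the point of output, assuming the unordered data is a set of extra names. `write_list` here is a hypothetical stand-in modelled on the `_write_list` calls in the diff, and the `Provides-Extra` header is an assumption chosen for illustration.

```python
import io


def write_list(file, header, values):
    """Write one 'Header: value' line per value, in the order given."""
    for value in values:
        file.write("%s: %s\n" % (header, value))


# A set has no defined iteration order across hash seeds...
extras = {"test", "bundle", "tls", "docs"}

buf = io.StringIO()
# ...so the order is fixed only here, where it reaches the output artifact.
write_list(buf, "Provides-Extra", sorted(extras))
output = buf.getvalue()
print(output, end="")
```

Sorting only at the write site leaves every list the packager supplied untouched and affects just the values that arrive without an order.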
|
|
@anthraxx I am not saying that we cannot change this, just explaining that this is less critical than it may seem because it's just a default setting that is simple to work around, and from my understanding is not even the most serious problem with reproducible builds in Python. The best thing to do if you want to get this moving would be to make a PR that ensures that Alternatively, you could provide some sort of compelling evidence that no one is relying on the ordering behavior of their ordered containers, in which case we can just sort everything when producing Obviously the second option would be preferable from a maintenance point of view, but it does unfortunately require you to prove a negative. If you have the ability to play around with the Arch package repository's python2.7 packages, there are probably some pretty strong signals you can get by testing their builds against a version of |
|
Thanks for reigniting the interest in this change. I'm fairly confident that the order of requirements is meaningful, as it affects the order that dependencies are resolved... and some packages rely on that order to avoid certain bugs or to enforce certain overrides. I'm unsure if the test suite captures that expectation or if the changes proposed herein would affect that expectation, but it's one example of how making a change like this should be selective and surgical and itself include tests that capture the expectation. If the code doesn't have tests capturing the expectation, the code can be subsequently changed back with a similar pull request. Correct me if I missed something, but the only use-case reported herein about non-deterministic ordering of PKG-INFO is the |
|
It's been a long time and I'm not working in this area anymore, but a year ago I built the ~3000 most popular conda-forge packages and I believe there were more than just |
|
@stevenengler Was that in Python 3.6? I think there are definitely other places in |
|
I went through ~100 non-reproducible Python packages in Arch and I think I only found |
|
See also #894215 in Debian. |
|
I've merged #1690 and I'm hopeful that will make builds reproducible on modern Pythons (and even on Python 2.7 if PYTHONHASHSEED is not set for the build). As a result, I expect not to need this change, which is more invasive. Will be happy to revisit if one can identify other areas where builds are not reproducible. Thanks for your patience on this issue. |
I have sorted the output to the PKG-INFO file in order to make the file reproducible. Without sorting, fields can be in different orders each time it is run. For example, each time I build a certain package, I get changes such as: vs
Another benefit is that the sorted output has a more consistent formatting. I'm not affiliated with them, but reproducible-builds.org explains why reproducible builds are important.